docs: push syncing tags doc (#1429)

Anton Evangelatov
2019-06-06 14:51:58 +02:00
committed by GitHub
parent a454aa6c8d
commit 5663cb2c76

docs/Push-syncing-tags.md Normal file

@ -0,0 +1,87 @@
Push syncing tags
=====
motivation
------
Push syncing tags provide a way to measure how long a file uploaded to Swarm will take to finish syncing. This in turn lets a node (presumably a light node) know when it can go offline, namely once all of its uploads are synced to the swarm.
definitions
---
* tag - an upload tag which is transactional and maps an entire upload transaction to a unique identifier. This ID is randomly generated on the fly (e.g. for `swarm up --recursive mydir`, the upload of `mydir` is what gets tagged).
* tag index - an index that holds a unique ID for each upload, so that querying all pending tags needs fewer full iterations over the other indexes. Defined as `UploadID|TARFilename->TotalChunkCount|SyncedChunkCount` or `UploadID->TotalChunkCount|SyncedChunkCount|TARFilename`
* push index - `localstore` push syncing index
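
The second layout of the tag index (`UploadID->TotalChunkCount|SyncedChunkCount|TARFilename`) could be serialised roughly as follows. This is only a sketch: the `tagEntry` type, the 4-byte ID and the fixed-width big-endian counters are assumptions made for illustration, not something the spec prescribes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tagEntry is a hypothetical in-memory form of one tag index record.
type tagEntry struct {
	UploadID uint32 // assumed width; the spec only says "unique ID"
	Total    uint64 // TotalChunkCount
	Synced   uint64 // SyncedChunkCount
	Filename string // TARFilename
}

// key encodes the index key: UploadID.
func (e tagEntry) key() []byte {
	k := make([]byte, 4)
	binary.BigEndian.PutUint32(k, e.UploadID)
	return k
}

// value encodes the index value: TotalChunkCount|SyncedChunkCount|TARFilename.
func (e tagEntry) value() []byte {
	v := make([]byte, 16, 16+len(e.Filename))
	binary.BigEndian.PutUint64(v[:8], e.Total)
	binary.BigEndian.PutUint64(v[8:16], e.Synced)
	return append(v, e.Filename...)
}

// decodeTagEntry reverses key and value.
func decodeTagEntry(k, v []byte) (tagEntry, error) {
	if len(k) != 4 || len(v) < 16 {
		return tagEntry{}, errors.New("malformed tag index record")
	}
	return tagEntry{
		UploadID: binary.BigEndian.Uint32(k),
		Total:    binary.BigEndian.Uint64(v[:8]),
		Synced:   binary.BigEndian.Uint64(v[8:16]),
		Filename: string(v[16:]),
	}, nil
}

func main() {
	e := tagEntry{UploadID: 7, Total: 100, Synced: 5, Filename: "file1.tar.gz"}
	d, err := decodeTagEntry(e.key(), e.value())
	fmt.Println(d, err)
}
```

Putting the counters in the value rather than the key means a progress update overwrites a single record instead of moving it within the index.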
tags spec
----
* what's an upload tag?
  * an ID that is transactional for a complete upload (e.g. `swarm up <dirname>`)
* tag index operations:
  * create - once an upload transaction is started (saves too)
  * persist - save a tag to disk
  * delete - once the upload transaction is synced to the swarm (maybe we don't want to delete immediately? let the user see upload history? including uploaded hashes?)
  * get one/all - get status of upload(s)
* which operations should tags facilitate?
  * get count of distinct chunks for a file
  * get count of chunks pending to sync for a file
  * get existing tags (files with pending synchronisation)
example sync status
```
swarm status sync
file1.tar.gz, 5% complete, ETA 05:52
[=========> ]
file2.tar.gz, 99% complete, ETA 00:02
[=============>]
```
## Issues
### other states
As part of this effort we want to support progress bars/metrics for:
* progress of chunking (splitting to chunks)
* progress of storage
* progress of sending out to push sync
For this we need to introduce counts for 5 states:
* SPLIT - count chunk instances
* STORED - count chunk instances
* SEEN - count of chunks previously stored (duplicates)
* SENT - count distinct chunks
* SYNCED - count distinct chunks
Progress on a state is characterised by 2 integers `c, n`, standing for "completed `c` chunks out of known `n`". This is the main interface that the progress bar UX can call, and it also enables ETA calculation.
If we want progress on localstore storage, the STORED count should increment every time localstore `Put` is called.
### known file sizes
If we know a file's size we can use it to calculate the total number of chunks (note that it depends on encryption), so that progress of chunking (SPLIT) and storage (STORED) can be meaningful.
If the size and the total number of chunks split are not yet known, progress of SPLIT is undefined. After the chunker has finished splitting, one can set `total` to the SPLIT count.
If we relied on SPLIT count only, we would lose the very common use case of uploading one file.
Note that if the upload also includes a manifest, the total count will serve only as an estimate until `total` is set to the SPLIT count. The relative error of this estimate shrinks as the size of the file grows.
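
As an illustration of how the total can be derived from a known file size, the sketch below counts the data chunks plus the intermediate chunks of the hash tree built on top of them. The constants are assumptions of this example: 4096-byte chunks, 128 references per intermediate chunk without encryption and 64 with encryption (because an encrypted reference is twice as long). The function name `chunkCount` is hypothetical, and manifest chunks are not included.

```go
package main

import "fmt"

const chunkSize = 4096 // assumed chunk payload size in bytes

// chunkCount estimates the number of chunks a file of the given size
// splits into, counting data chunks and all intermediate tree levels.
// Empty files are not handled in this sketch and yield 0.
func chunkCount(size int64, encrypted bool) int64 {
	if size <= 0 {
		return 0
	}
	branches := int64(128) // assumed references per intermediate chunk
	if encrypted {
		branches = 64 // assumed: encrypted references take twice the space
	}
	level := (size + chunkSize - 1) / chunkSize // data chunks
	total := level
	for level > 1 {
		level = (level + branches - 1) / branches // parents of this level
		total += level
	}
	return total
}

func main() {
	for _, size := range []int64{4096, 4097, 128 * 4096, 129 * 4096} {
		fmt.Println(size, chunkCount(size, false), chunkCount(size, true))
	}
}
```

The result can seed `total` before chunking starts and be replaced by the SPLIT count once the chunker is done.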
### duplicate chunks
Duplicate chunks are chunks that occur multiple times within an upload or across uploads. In order to have a locally verifiable definition, we define a chunk as a duplicate (or seen) if and only if it is already found in the localstore.
When chunks enter the localstore via upload they are push synced; seen chunks have therefore already been push synced and need not be synced again.
In other words only newly stored chunks need counting when assessing the synced ETA of an upload.
If we want progress on SENT/SYNCED counts, we need to give a status where `n` represents the total count of *distinct* chunks. Therefore SENT/SYNCED need to be compared against `STORED`.
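
One way to read this is sketched below, under the assumption that STORED counts every `Put` (duplicates included) and SEEN counts the puts that found the chunk already in the localstore, so that the number of distinct, newly stored chunks is `STORED - SEEN`. The `Counts` type and its methods are hypothetical.

```go
package main

import "fmt"

// Counts is a hypothetical snapshot of the per-upload counters.
type Counts struct {
	Stored, Seen, Sent, Synced int64
}

// Distinct is the number of newly stored chunks, i.e. the ones that
// actually have to be push synced (assumption: Stored includes Seen).
func (k Counts) Distinct() int64 { return k.Stored - k.Seen }

// SentStatus returns c, n for the SENT state.
func (k Counts) SentStatus() (c, n int64) { return k.Sent, k.Distinct() }

// SyncedStatus returns c, n for the SYNCED state.
func (k Counts) SyncedStatus() (c, n int64) { return k.Synced, k.Distinct() }

func main() {
	k := Counts{Stored: 100, Seen: 20, Sent: 60, Synced: 40}
	c, n := k.SyncedStatus()
	fmt.Printf("synced %d of %d distinct chunks\n", c, n)
}
```

Note that `n` computed this way is only final once storing has finished; while chunks are still being stored it grows along with STORED.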