updates to reflect new_from_parent() (#3076)

* design draft * update * section on updating root forks * updates to reflect new_from_parent() * fixup * Grammar check
2019-03-10 13:59:16 -07:00
parent 195a880576
commit cd0bc1dea5
2 changed files with 154 additions and 0 deletions
--- a/book/src/SUMMARY.md
+++ b/book/src/SUMMARY.md
@@ -40,6 +40,7 @@
  - [Fork Selection](fork-selection.md)
  - [Reliable Vote Transmission](reliable-vote-transmission.md)
  - [Bank Forks](bank-forks.md)
  - [Persistent Account Storage](persistent-account-storage.md)
  - [Leader to Leader Transition](leader-leader-transition.md)
  - [Cluster Economics](ed_overview.md)
    - [Validation-client Economics](ed_validation_client_economics.md)
--- a/book/src/persistent-account-storage.md
+++ b/book/src/persistent-account-storage.md
@@ -0,0 +1,153 @@
 # Persistent Account Storage
 The set of Accounts represent the current computed state of all the transactions
 that have been processed by a fullnode.  Each fullnode needs to maintain this
 entire set.  Each block that is proposed by the network represents a change to
 this set, and since each block is a potential rollback point the changes need to
 be reversible.
 Persistent storage like NVMEs are 20 to 40 times cheaper than DDR.  The problem
 with persistent storage is that write and read performance is much slower than
 DDR and care must be taken in how data is read or written to.  Both reads and
 writes can be split between multiple storage drives and accessed in parallel.
 This design proposes a data structure that allows for concurrent reads and
 concurrent writes of storage.   Writes are optimized by using an AppendVec data
 structure, which allows a single writer to append while allowing access to many
 concurrent readers.  The accounts index maintains a pointer to a spot where the
 account was appended to every fork, thus removing the need for explicit
 checkpointing of state.
 # AppendVec
 AppendVec is a data structure that allows for random reads concurrent with a
 single append-only writer.  Growing or resizing the capacity of the AppendVec
 requires exclusive access.  This is implemented with an atomic `offset`, which
 is updated at the end of a completed append.
 The underlying memory for an AppendVec is a memory-mapped file.  Memory-mapped
 files allow for fast random access and paging is handled by the OS.
 # Account Index
 The account index is designed to support a single index for all the currently
 forked Accounts.
 ```rust,ignore
 type AppendVecId = usize;
 type Fork = u64;
 struct AccountMap(Hashmap<Fork, (AppendVecId, u64)>);
 type AccountIndex = HashMap<Pubkey, AccountMap>;
 ```
 The index is a map of account Pubkeys to a map of Forks and the location of the
 Account data in an AppendVec.  To get the version of an account for a specific Fork:
 ```rust,ignore
 /// Load the account for the pubkey.
 /// This function will load the account from the specified fork, falling back to the fork's parents
 /// * fork - a virtual Accounts instance, keyed by Fork.  Accounts keep track of their parents with Forks,
 ///       the persistent store
 /// * pubkey - The Account's public key.
 pub fn load_slow(&self, id: Fork, pubkey: &Pubkey) -> Option<&Account>
 ```
 The read is satisfied by pointing to a memory-mapped location in the
 `AppendVecId` at the stored offset.  A reference can be returned without a copy.
 ## Root Forks
 The [fork selection algorithm](fork-selection.md) eventually selects a fork as a
 root fork and the fork is squashed.  A squashed/root fork cannot be rolled back.
 When a fork is squashed, all accounts in its parents not already present in the
 fork are pulled up into the fork by updating the indexes.  Accounts with zero
 balance in the squashed fork are removed from fork by updating the indexes.
 An account can be *garbage-collected* when squashing makes it unreachable.
 Three possible options exist:
 * Maintain a HashSet<u64> of root forks.  One is expected to be created every
 second.  The entire tree can be garbage-collected later. Alternatively, if
 every fork keeps a reference count of accounts, garbage collection could occur
 any time an index location is updated.
 * Remove any pruned forks from the index.  Any remaining forks lower in number
 than the root are can be considered root.
 * Scan the index, migrate any old roots into the new one.   Any remaining forks
 lower than the new root can be deleted later.
 # Append-only Writes
 All the updates to Accounts occur as append-only updates.  For every account
 update, a new version is stored in the AppendVec.
 It is possible to optimize updates within a single fork by returning a mutable
 reference to an already stored account in a fork.  The Bank already tracks
 concurrent access of accounts and guarantees that a write to a specific account
 fork will not be concurrent with a read to an account at that fork. To support
 this operation, AppendVec should implement this function:
 ```rust,ignore
 fn get_mut(&self, index: u64) -> &mut T;
 ```
 This API allows for concurrent mutable access to a memory region at `index`.  It
 relies on the Bank to guarantee exclusive access to that index.
 # Garbage collection
 As accounts get updated, they move to the end of the AppendVec.  Once capacity
 has run out, a new AppendVec can be created and updates can be stored there.
 Eventually references to an older AppendVec will disappear because all the
 accounts have been updated, and the old AppendVec can be deleted.
 To speed up this process, it's possible to move Accounts that have not been
 recently updated to the front of a new AppendVec.  This form of garbage
 collection can be done without requiring exclusive locks to any of the data
 structures except for the index update.
 The initial implementation for garbage collection is that once all the accounts in
 an AppendVec become stale versions, it gets reused. The accounts are not updated
 or moved around once appended.
 # Index Recovery
 Each bank thread has exclusive access to the accounts during append, since the
 accounts locks cannot be released until the data is committed. But there is no
 explicit order of writes between the separate AppendVec files.  To create an
 ordering, the index maintains an atomic write version counter.  Each append to
 the AppendVec records the index write version number for that append in the
 entry for the Account in the AppendVec.
 To recover the index, all the AppendVec files can be read in any order, and the
 latest write version for every fork should be stored in the index.
 # Snapshots
 To snapshot, the underlying memory-mapped files in the AppendVec need to be
 flushed to disk.  The index can be written out to disk as well.
 # Performance
 * Append-only writes are fast.  SSDs and NVMEs, as well as all the OS level
 kernel data structures, allow for appends to run as fast as PCI or NVMe bandwidth
 will allow (2,700 MB/s).
 * Each replay and banking thread writes concurrently to its own AppendVec.
 * Each AppendVec could potentially be hosted on a separate NVMe.
 * Each replay and banking thread has concurrent read access to all the
 AppendVecs without blocking writes.
 * Index requires an exclusive write lock for writes.  Single-thread performance
 for HashMap updates is on the order of 10m per second.
 * Banking and Replay stages should use 32 threads per NVMe.  NVMes have
 optimal performance with 32 concurrent readers or writers.