updates to reflect new_from_parent() (#3076)

* design draft

* update

* section on updating root forks

* updates to reflect new_from_parent()

* fixup

* Grammar check
This commit is contained in:
Rob Walker
2019-03-10 13:59:16 -07:00
committed by GitHub
parent 195a880576
commit cd0bc1dea5
2 changed files with 154 additions and 0 deletions

@@ -40,6 +40,7 @@
- [Fork Selection](fork-selection.md)
- [Reliable Vote Transmission](reliable-vote-transmission.md)
- [Bank Forks](bank-forks.md)
- [Persistent Account Storage](persistent-account-storage.md)
- [Leader to Leader Transition](leader-leader-transition.md)
- [Cluster Economics](ed_overview.md)
- [Validation-client Economics](ed_validation_client_economics.md)

@@ -0,0 +1,153 @@
# Persistent Account Storage
The set of Accounts represents the current computed state of all the
transactions that have been processed by a fullnode. Each fullnode needs to
maintain this entire set. Each block that is proposed by the network represents
a change to this set, and since each block is a potential rollback point, the
changes need to be reversible.

Persistent storage such as NVMe drives is 20 to 40 times cheaper than DDR. The
problem with persistent storage is that write and read performance is much
slower than DDR, so care must be taken in how data is read and written. Both
reads and writes can be split between multiple storage drives and accessed in
parallel.

This design proposes a data structure that allows for concurrent reads and
concurrent writes of storage. Writes are optimized by using an AppendVec data
structure, which allows a single writer to append while allowing access to many
concurrent readers. The accounts index maintains, for every fork, a pointer to
the spot where the account was appended, thus removing the need for explicit
checkpointing of state.
# AppendVec
AppendVec is a data structure that allows for random reads concurrent with a
single append-only writer. Growing or resizing the capacity of the AppendVec
requires exclusive access. This is implemented with an atomic `offset`, which
is updated at the end of a completed append.
The underlying memory for an AppendVec is a memory-mapped file. Memory-mapped
files allow for fast random access and paging is handled by the OS.
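
The single-writer, many-reader behaviour described above can be sketched in
safe Rust. This is only an illustration under stated assumptions, not the
actual implementation: `AppendVecSketch` is a hypothetical name, a
pre-allocated `Vec` of `OnceLock` slots stands in for the memory-mapped file,
and growing the capacity is left out because it requires exclusive access.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical fixed-capacity, append-only vector: one writer, many readers.
pub struct AppendVecSketch<T> {
    /// Stand-in for the memory-mapped file (assumption of this sketch).
    slots: Vec<OnceLock<T>>,
    /// Number of fully written entries; updated at the end of each append.
    offset: AtomicUsize,
}

impl<T> AppendVecSketch<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        Self {
            slots: (0..capacity).map(|_| OnceLock::new()).collect(),
            offset: AtomicUsize::new(0),
        }
    }

    /// Appends `value` and returns its index, or `None` when the vector is full.
    /// Assumes a single writer; a racing second writer would simply get `None`.
    pub fn append(&self, value: T) -> Option<usize> {
        let index = self.offset.load(Ordering::Relaxed);
        let slot = self.slots.get(index)?;
        slot.set(value).ok()?;
        // Publish the new entry only after it has been completely written.
        self.offset.store(index + 1, Ordering::Release);
        Some(index)
    }

    /// Random read that never blocks the writer. Entries at or beyond the
    /// published offset are not visible.
    pub fn get(&self, index: usize) -> Option<&T> {
        if index < self.offset.load(Ordering::Acquire) {
            self.slots[index].get()
        } else {
            None
        }
    }

    /// Number of entries visible to readers.
    pub fn len(&self) -> usize {
        self.offset.load(Ordering::Acquire)
    }
}
```

Readers only ever see indexes below the published `offset`, so a partially
written entry is never observed.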
# Account Index
The account index is designed to support a single index for all the currently
forked Accounts.
```rust,ignore
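use std::collections::HashMap;

type AppendVecId = usize;
type Fork = u64;
/// Per-account map from Fork to (AppendVec id, offset within that AppendVec).
struct AccountMap(HashMap<Fork, (AppendVecId, u64)>);
type AccountIndex = HashMap<Pubkey, AccountMap>;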
```
The index is a map of account Pubkeys to a map of Forks and the location of the
Account data in an AppendVec. To get the version of an account for a specific Fork:
```rust,ignore
/// Load the account for the pubkey.
/// This function loads the account from the specified fork, falling back to the fork's parents.
/// * id - The Fork to read from. A fork is a virtual Accounts instance, keyed by Fork. Accounts keep
/// track of their parents with Forks, the persistent store
/// * pubkey - The Account's public key.
pub fn load_slow(&self, id: Fork, pubkey: &Pubkey) -> Option<&Account>
```
The read is satisfied by pointing to a memory-mapped location, at the stored
offset, in the AppendVec identified by the `AppendVecId`. A reference can be
returned without a copy.
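
The fallback from a fork to its parents can be sketched as follows. This is a
minimal illustration, not the real `load_slow`: `AccountIndexSketch`, its
`parents` map, and the `[u8; 32]` stand-in for `Pubkey` are assumptions made
for the example, and the function returns the stored location rather than an
`&Account`.

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type (assumption)
type AppendVecId = usize;
type Fork = u64;
type Location = (AppendVecId, u64); // (which AppendVec, offset inside it)

/// Hypothetical index that also records each fork's parent.
#[derive(Default)]
struct AccountIndexSketch {
    accounts: HashMap<Pubkey, HashMap<Fork, Location>>,
    parents: HashMap<Fork, Fork>, // assumed to be acyclic
}

impl AccountIndexSketch {
    /// Registers `fork` as a child of `parent`.
    fn add_fork(&mut self, fork: Fork, parent: Fork) {
        self.parents.insert(fork, parent);
    }

    /// Records where the latest version of `pubkey` on `fork` was appended.
    fn insert(&mut self, fork: Fork, pubkey: Pubkey, location: Location) {
        self.accounts.entry(pubkey).or_default().insert(fork, location);
    }

    /// Finds the location for `pubkey` on `fork`, walking up through the
    /// fork's parents until a stored version is found.
    fn lookup(&self, fork: Fork, pubkey: &Pubkey) -> Option<Location> {
        let versions = self.accounts.get(pubkey)?;
        let mut current = Some(fork);
        while let Some(f) = current {
            if let Some(location) = versions.get(&f) {
                return Some(*location);
            }
            current = self.parents.get(&f).copied();
        }
        None
    }
}
```

A fork that never wrote the account resolves to the nearest ancestor's
version, which is why no per-fork checkpoint copy is needed.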
## Root Forks
The [fork selection algorithm](fork-selection.md) eventually selects a fork as a
root fork and the fork is squashed. A squashed/root fork cannot be rolled back.
When a fork is squashed, all accounts in its parents not already present in the
fork are pulled up into the fork by updating the indexes. Accounts with zero
balance in the squashed fork are removed from the fork by updating the indexes.
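
The squash step can be sketched as an index-only operation. This is an
illustration under assumptions rather than the actual code: the `squash`
function, the `ancestors` slice (nearest parent first), and the
`is_zero_balance` callback are hypothetical, and entries left behind on the
ancestors are assumed to be cleaned up later by garbage collection.

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type (assumption)
type Fork = u64;
type Location = (usize, u64); // (AppendVec id, offset)
type AccountIndex = HashMap<Pubkey, HashMap<Fork, Location>>;

/// Squashes `root`: pulls up ancestor versions that `root` lacks, then drops
/// zero-balance accounts from `root`. Only the index is modified.
fn squash(
    index: &mut AccountIndex,
    root: Fork,
    ancestors: &[Fork], // parents of `root`, nearest first
    is_zero_balance: impl Fn(&Location) -> bool,
) {
    for versions in index.values_mut() {
        if !versions.contains_key(&root) {
            // Pull up the nearest ancestor's version, if there is one.
            let inherited = ancestors.iter().find_map(|f| versions.get(f).copied());
            if let Some(location) = inherited {
                versions.insert(root, location);
            }
        }
        // Remove accounts whose version in the squashed fork has zero balance.
        if versions.get(&root).map_or(false, |loc| is_zero_balance(loc)) {
            versions.remove(&root);
        }
    }
}
```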
An account can be *garbage-collected* when squashing makes it unreachable.
Three possible options exist:
* Maintain a `HashSet<u64>` of root forks. One is expected to be created every
second. The entire tree can be garbage-collected later. Alternatively, if
every fork keeps a reference count of accounts, garbage collection could occur
any time an index location is updated.
* Remove any pruned forks from the index. Any remaining forks lower in number
than the root can be considered root.
* Scan the index, migrate any old roots into the new one. Any remaining forks
lower than the new root can be deleted later.
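
As an illustration of the second option, the sketch below removes pruned forks
from the index and then treats every surviving fork numbered at or below the
root as rooted. The function names and the `[u8; 32]` stand-in for `Pubkey`
are assumptions made for the example, not part of the design.

```rust
use std::collections::{HashMap, HashSet};

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type (assumption)
type Fork = u64;
type Location = (usize, u64); // (AppendVec id, offset)
type AccountIndex = HashMap<Pubkey, HashMap<Fork, Location>>;

/// Drops every index entry that belongs to a pruned fork, and drops accounts
/// that no longer have any version left.
fn remove_pruned_forks(index: &mut AccountIndex, pruned: &HashSet<Fork>) {
    for versions in index.values_mut() {
        versions.retain(|fork, _| !pruned.contains(fork));
    }
    index.retain(|_, versions| !versions.is_empty());
}

/// After pruning, the root itself and any surviving lower-numbered fork can be
/// considered root.
fn is_rooted(fork: Fork, root: Fork) -> bool {
    fork <= root
}
```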
# Append-only Writes
All the updates to Accounts occur as append-only updates. For every account
update, a new version is stored in the AppendVec.
It is possible to optimize updates within a single fork by returning a mutable
reference to an already stored account in a fork. The Bank already tracks
concurrent access of accounts and guarantees that a write to an account at a
specific fork will not be concurrent with a read of that account at that fork.
To support this operation, AppendVec should implement this function:
```rust,ignore
fn get_mut(&self, index: u64) -> &mut T;
```
This API allows for concurrent mutable access to a memory region at `index`. It
relies on the Bank to guarantee exclusive access to that index.
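
One way such a function could look is sketched below with `UnsafeCell`. It is
a hedged illustration, not the proposed implementation: `MutableSlots` is a
hypothetical in-memory stand-in for the AppendVec, and the function is marked
`unsafe` to make explicit that the exclusivity guarantee comes from the caller
(the Bank in this design) rather than from the type system.

```rust
use std::cell::UnsafeCell;

/// Hypothetical stand-in for an AppendVec's storage.
pub struct MutableSlots<T> {
    slots: Vec<UnsafeCell<T>>,
}

// SAFETY: sharing across threads is sound only if every caller upholds the
// contract documented on `get_mut`.
unsafe impl<T: Send> Sync for MutableSlots<T> {}

impl<T> MutableSlots<T> {
    pub fn new(items: Vec<T>) -> Self {
        Self {
            slots: items.into_iter().map(UnsafeCell::new).collect(),
        }
    }

    /// Returns a mutable reference to the entry at `index` from a shared
    /// reference to the container. Panics if `index` is out of bounds.
    ///
    /// # Safety
    /// The caller must guarantee that no other reference to the same entry is
    /// alive while the returned reference is in use.
    pub unsafe fn get_mut(&self, index: u64) -> &mut T {
        unsafe { &mut *self.slots[index as usize].get() }
    }
}
```

Different indexes can be mutated concurrently because each entry sits in its
own cell; only access to the same index needs external coordination.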
# Garbage Collection
As accounts get updated, they move to the end of the AppendVec. Once capacity
has run out, a new AppendVec can be created and updates can be stored there.
Eventually references to an older AppendVec will disappear because all the
accounts have been updated, and the old AppendVec can be deleted.
To speed up this process, it's possible to move Accounts that have not been
recently updated to the front of a new AppendVec. This form of garbage
collection can be done without requiring exclusive locks to any of the data
structures except for the index update.
The initial implementation of garbage collection reuses an AppendVec once all
the accounts in it have become stale versions. The accounts are not updated or
moved around once appended.
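
Detecting when an AppendVec holds only stale versions could be done by counting
how many index entries still point into it. The sketch below is one possible
approach under that assumption; `LiveCounts` and `record_update` are
hypothetical names and are not taken from the design.

```rust
use std::collections::HashMap;

type AppendVecId = usize;

/// Hypothetical tracker of how many index entries point into each AppendVec.
#[derive(Default)]
struct LiveCounts {
    live: HashMap<AppendVecId, usize>,
}

impl LiveCounts {
    /// Records that an index entry now points into `new`, replacing a location
    /// in `old` if there was one. Returns the id of an AppendVec that has just
    /// become fully stale and can therefore be reused.
    fn record_update(
        &mut self,
        old: Option<AppendVecId>,
        new: AppendVecId,
    ) -> Option<AppendVecId> {
        *self.live.entry(new).or_insert(0) += 1;
        let old = old?;
        let count = self.live.get_mut(&old)?;
        *count = count.saturating_sub(1);
        if *count > 0 {
            return None;
        }
        self.live.remove(&old);
        Some(old)
    }
}
```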
# Index Recovery
Each bank thread has exclusive access to the accounts during append, since the
account locks cannot be released until the data is committed. But there is no
explicit order of writes between the separate AppendVec files. To create an
ordering, the index maintains an atomic write version counter. Each append to
the AppendVec records the index write version number for that append in the
entry for the Account in the AppendVec.
To recover the index, all the AppendVec files can be read in any order, and the
latest write version for every fork should be stored in the index.
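
The recovery rule can be sketched as a scan that keeps, for each account and
fork, the entry with the highest write version. This is an illustration under
assumptions: `StoredEntry`, `next_write_version`, and `recover_index` are
hypothetical names, and each AppendVec file is modelled as an in-memory list
of entries.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type (assumption)
type Fork = u64;
type AppendVecId = usize;

/// Hypothetical per-entry metadata stored alongside each appended Account.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StoredEntry {
    pubkey: Pubkey,
    fork: Fork,
    write_version: u64,
    offset: u64,
}

/// Hands out the next write version from the index's atomic counter.
fn next_write_version(counter: &AtomicU64) -> u64 {
    counter.fetch_add(1, Ordering::Relaxed)
}

/// Rebuilds the index from AppendVec contents read in any order, keeping the
/// entry with the highest write version for every (pubkey, fork) pair.
fn recover_index(
    files: &[(AppendVecId, Vec<StoredEntry>)],
) -> HashMap<Pubkey, HashMap<Fork, (AppendVecId, u64)>> {
    let mut newest: HashMap<(Pubkey, Fork), (u64, AppendVecId, u64)> = HashMap::new();
    for (id, entries) in files {
        for e in entries {
            let candidate = (e.write_version, *id, e.offset);
            newest
                .entry((e.pubkey, e.fork))
                .and_modify(|cur| {
                    if candidate.0 > cur.0 {
                        *cur = candidate;
                    }
                })
                .or_insert(candidate);
        }
    }
    let mut index = HashMap::new();
    for ((pubkey, fork), (_, id, offset)) in newest {
        index
            .entry(pubkey)
            .or_insert_with(HashMap::new)
            .insert(fork, (id, offset));
    }
    index
}
```

Because only the write version decides which entry wins, the result does not
depend on the order in which the files are read.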
# Snapshots
To snapshot, the underlying memory-mapped files in the AppendVec need to be
flushed to disk. The index can be written out to disk as well.
# Performance
* Append-only writes are fast. SSDs and NVMe drives, as well as all the OS-level
kernel data structures, allow for appends to run as fast as PCI or NVMe bandwidth
will allow (2,700 MB/s).
* Each replay and banking thread writes concurrently to its own AppendVec.
* Each AppendVec could potentially be hosted on a separate NVMe.
* Each replay and banking thread has concurrent read access to all the
AppendVecs without blocking writes.
* The index requires an exclusive write lock for writes. Single-thread
performance for HashMap updates is on the order of 10 million per second.
* Banking and Replay stages should use 32 threads per NVMe drive. NVMe drives
have optimal performance with 32 concurrent readers or writers.