updates to reflect new_from_parent() (#3076)
* design draft
* update
* section on updating root forks
* updates to reflect new_from_parent()
* fixup
* Grammar check
@@ -40,6 +40,7 @@
- [Fork Selection](fork-selection.md)
- [Reliable Vote Transmission](reliable-vote-transmission.md)
- [Bank Forks](bank-forks.md)
- [Persistent Account Storage](persistent-account-storage.md)
- [Leader to Leader Transition](leader-leader-transition.md)
- [Cluster Economics](ed_overview.md)
- [Validation-client Economics](ed_validation_client_economics.md)

book/src/persistent-account-storage.md (new file, 153 lines)
@@ -0,0 +1,153 @@

# Persistent Account Storage

The set of Accounts represents the current computed state of all the
transactions that have been processed by a fullnode. Each fullnode needs to
maintain this entire set. Each block proposed by the network represents a
change to this set, and since each block is a potential rollback point, the
changes need to be reversible.

Persistent storage such as NVMe is 20 to 40 times cheaper than DDR. The
problem with persistent storage is that write and read performance is much
slower than DDR, so care must be taken in how data is read and written. Both
reads and writes can be split between multiple storage drives and accessed in
parallel. This design proposes a data structure that allows for concurrent
reads and concurrent writes of storage. Writes are optimized by using an
AppendVec data structure, which allows a single writer to append while
allowing access to many concurrent readers. The accounts index maintains, for
every fork, a pointer to the spot where the account was appended, thus
removing the need for explicit checkpointing of state.

# AppendVec

AppendVec is a data structure that allows for random reads concurrent with a
single append-only writer. Growing or resizing the capacity of the AppendVec
requires exclusive access. This is implemented with an atomic `offset`, which
is updated at the end of a completed append.

The underlying memory for an AppendVec is a memory-mapped file. Memory-mapped
files allow for fast random access, and paging is handled by the OS.
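
As a sketch of the idea, not the actual implementation (the `memmap2` crate
and all field names here are assumptions), an AppendVec can be built from a
memory map and an atomic offset:

```rust,ignore
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of an AppendVec: a memory-mapped file plus an atomic `offset`
/// that marks the end of the last completed append.
pub struct AppendVec {
    map: memmap2::MmapMut, // backing file; paging is handled by the OS
    offset: AtomicUsize,   // published only after an append completes
}

impl AppendVec {
    /// Single writer: copy `data` past the current offset, then publish the
    /// new offset so concurrent readers can observe the entry.
    pub fn append(&self, data: &[u8]) -> Option<usize> {
        let at = self.offset.load(Ordering::Acquire);
        if at + data.len() > self.map.len() {
            return None; // full: growing the capacity needs exclusive access
        }
        // Safety: the caller guarantees a single appender, and readers only
        // touch bytes below the published offset.
        unsafe {
            let dst = self.map.as_ptr().add(at) as *mut u8;
            std::ptr::copy_nonoverlapping(data.as_ptr(), dst, data.len());
        }
        self.offset.store(at + data.len(), Ordering::Release);
        Some(at)
    }

    /// Random read, valid for any region below the published offset.
    pub fn get(&self, at: usize, len: usize) -> Option<&[u8]> {
        if at + len <= self.offset.load(Ordering::Acquire) {
            Some(&self.map[at..at + len])
        } else {
            None
        }
    }
}
```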

# Account Index

The account index is designed to support a single index for all the currently
forked Accounts.

```rust,ignore
type AppendVecId = usize;

type Fork = u64;

struct AccountMap(HashMap<Fork, (AppendVecId, u64)>);

type AccountIndex = HashMap<Pubkey, AccountMap>;
```

The index is a map of account Pubkeys to a map of Forks and the location of
the Account data in an AppendVec. To get the version of an account for a
specific Fork:

```rust,ignore
/// Load the account for the pubkey.
/// This function will load the account from the specified fork, falling back
/// to the fork's parents.
/// * id - the fork of the virtual Accounts instance to load from. Accounts
///   keep track of their parents with Forks; the persistent store keeps the
///   data for every fork.
/// * pubkey - The Account's public key.
pub fn load_slow(&self, id: Fork, pubkey: &Pubkey) -> Option<&Account>
```

The read is satisfied by pointing to a memory-mapped location in the AppendVec
identified by `AppendVecId`, at the stored offset. A reference can be returned
without a copy.
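
A sketch of how that lookup could fall back through a fork's ancestry;
`account_index`, `append_vecs`, `fork_parents`, and `get_account` are
hypothetical names, not the actual implementation:

```rust,ignore
/// Sketch only: check the requested fork first, then each of its parents.
pub fn load_slow(&self, id: Fork, pubkey: &Pubkey) -> Option<&Account> {
    let account_map = self.account_index.get(pubkey)?;
    // The requested fork followed by its ancestors, nearest first.
    // `fork_parents` is a hypothetical helper over the fork graph.
    for fork in std::iter::once(id).chain(self.fork_parents(id)) {
        if let Some(&(vec_id, offset)) = account_map.0.get(&fork) {
            // `get_account` hands back a reference into the AppendVec's
            // memory map, so no copy of the account data is made.
            return self.append_vecs[&vec_id].get_account(offset);
        }
    }
    None
}
```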

## Root Forks

The [fork selection algorithm](fork-selection.md) eventually selects a fork as
a root fork and the fork is squashed. A squashed/root fork cannot be rolled
back.

When a fork is squashed, all accounts in its parents not already present in
the fork are pulled up into the fork by updating the indexes. Accounts with a
zero balance in the squashed fork are removed from the fork by updating the
indexes.
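
A sketch of that index update during a squash; `fork_parents` and
`balance_at` are hypothetical helpers:

```rust,ignore
/// Sketch only: make `root` self-contained, then drop empty accounts.
fn squash(&mut self, root: Fork) {
    let parents = self.fork_parents(root); // hypothetical: nearest first
    for account_map in self.account_index.values_mut() {
        // Pull up: if the root fork has no entry for this account, adopt
        // the location recorded by the nearest parent that has one.
        if !account_map.0.contains_key(&root) {
            if let Some(loc) = parents
                .iter()
                .find_map(|parent| account_map.0.get(parent).copied())
            {
                account_map.0.insert(root, loc);
            }
        }
        // Remove accounts whose balance is zero in the squashed fork.
        if let Some(&(vec_id, offset)) = account_map.0.get(&root) {
            if self.append_vecs[&vec_id].balance_at(offset) == 0 {
                account_map.0.remove(&root);
            }
        }
    }
}
```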

An account can be *garbage-collected* when squashing makes it unreachable.

Three possible options exist:

* Maintain a `HashSet<u64>` of root forks. One root fork is expected to be
  created every second. The entire tree can be garbage-collected later.
  Alternatively, if every fork keeps a reference count of accounts, garbage
  collection could occur any time an index location is updated.

* Remove any pruned forks from the index. Any remaining forks lower in number
  than the root can be considered root.

* Scan the index, migrating any old roots into the new one. Any remaining
  forks lower than the new root can be deleted later.

# Append-only Writes

All the updates to Accounts occur as append-only updates. For every account
update, a new version is stored in the AppendVec.

It is possible to optimize updates within a single fork by returning a mutable
reference to an already stored account in a fork. The Bank already tracks
concurrent access of accounts and guarantees that a write to a specific
account fork will not be concurrent with a read of an account at that fork. To
support this operation, AppendVec should implement this function:

```rust,ignore
fn get_mut(&self, index: u64) -> &mut T;
```

This API allows for concurrent mutable access to a memory region at `index`.
It relies on the Bank to guarantee exclusive access to that index.
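
Because `get_mut` takes `&self` yet returns `&mut T`, the implementation needs
interior mutability. A sketch, assuming the elements are stored as typed
cells rather than raw bytes:

```rust,ignore
use std::cell::UnsafeCell;

/// Sketch only: storage as a vector of cells so that a shared reference
/// to the AppendVec can still yield a mutable reference to one element.
pub struct AppendVec<T> {
    cells: Vec<UnsafeCell<T>>,
}

// Safety: the Bank's account locks guarantee that no two threads touch
// the same index concurrently.
unsafe impl<T: Send> Sync for AppendVec<T> {}

impl<T> AppendVec<T> {
    fn get_mut(&self, index: u64) -> &mut T {
        // Safety: exclusive access to `index` is enforced by the Bank,
        // not by the type system.
        unsafe { &mut *self.cells[index as usize].get() }
    }
}
```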

# Garbage collection

As accounts get updated, they move to the end of the AppendVec. Once capacity
has run out, a new AppendVec can be created and updates can be stored there.
Eventually references to an older AppendVec will disappear because all the
accounts have been updated, and the old AppendVec can be deleted.

To speed up this process, it's possible to move Accounts that have not been
recently updated to the front of a new AppendVec. This form of garbage
collection can be done without requiring exclusive locks to any of the data
structures except for the index update.

The initial implementation for garbage collection is that once all the
accounts in an AppendVec become stale versions, it gets reused. The accounts
are not updated or moved around once appended.
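
A sketch of that reuse rule, assuming each AppendVec tracks a count of live
entries, i.e. entries the index still points at:

```rust,ignore
use std::sync::atomic::Ordering;

/// Sketch only: `live` is decremented whenever an index entry pointing
/// into this AppendVec is overwritten by a newer version elsewhere.
fn try_recycle(append_vec: &AppendVec) -> bool {
    if append_vec.live.load(Ordering::Acquire) == 0 {
        // Every stored account is a stale version: reset the offset and
        // reuse the file instead of allocating a new one.
        append_vec.offset.store(0, Ordering::Release);
        true
    } else {
        false
    }
}
```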

# Index Recovery

Each bank thread has exclusive access to the accounts during append, since the
account locks cannot be released until the data is committed. But there is no
explicit order of writes between the separate AppendVec files. To create an
ordering, the index maintains an atomic write version counter. Each append to
an AppendVec records the index write version number for that append in the
entry for the Account in the AppendVec.

To recover the index, all the AppendVec files can be read in any order, and
the entry with the latest write version for every fork should be stored in the
index.
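
A sketch of the recovery scan; `entries()` and the per-entry fields are
assumptions about the stored layout:

```rust,ignore
use std::collections::HashMap;

/// Sketch only: scan every AppendVec in any order, keeping the entry with
/// the highest write version for each (pubkey, fork) pair.
fn recover_index(append_vecs: &HashMap<AppendVecId, AppendVec>) -> AccountIndex {
    let mut index: AccountIndex = HashMap::new();
    // Write version of the entry currently recorded in the index.
    let mut versions: HashMap<(Pubkey, Fork), u64> = HashMap::new();
    for (id, vec) in append_vecs {
        for entry in vec.entries() { // hypothetical iterator over appends
            let key = (entry.pubkey, entry.fork);
            if versions.get(&key).map_or(true, |&v| entry.write_version > v) {
                versions.insert(key, entry.write_version);
                index
                    .entry(entry.pubkey)
                    .or_insert_with(|| AccountMap(HashMap::new()))
                    .0
                    .insert(entry.fork, (*id, entry.offset));
            }
        }
    }
    index
}
```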

# Snapshots

To snapshot, the underlying memory-mapped files in the AppendVec need to be
flushed to disk. The index can be written out to disk as well.
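
A sketch, assuming a `memmap2::MmapMut` backing and a hypothetical
`write_index` serializer:

```rust,ignore
/// Sketch only: force each AppendVec's dirty pages to disk, then persist
/// the index so it does not have to be rebuilt on restart.
fn snapshot(&self) -> std::io::Result<()> {
    for append_vec in self.append_vecs.values() {
        append_vec.map.flush()?; // memmap2::MmapMut::flush syncs the file
    }
    self.write_index() // hypothetical index serialization
}
```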

# Performance

* Append-only writes are fast. SSDs and NVMes, as well as all the OS-level
  kernel data structures, allow for appends to run as fast as PCI or NVMe
  bandwidth will allow (2,700 MB/s).

* Each replay and banking thread writes concurrently to its own AppendVec.

* Each AppendVec could potentially be hosted on a separate NVMe.

* Each replay and banking thread has concurrent read access to all the
  AppendVecs without blocking writes.

* The index requires an exclusive write lock for writes. Single-thread
  performance for HashMap updates is on the order of 10 million per second.

* The Banking and Replay stages should use 32 threads per NVMe. NVMes have
  optimal performance with 32 concurrent readers or writers.