docs/src/validator/README.md (new file)

# Anatomy of a Validator



## Pipelining

The validators make extensive use of an optimization common in CPU design, called _pipelining_. Pipelining is the right tool for the job when there is a stream of input data that needs to be processed by a sequence of steps, and different hardware is responsible for each step. The quintessential example is using a washer and dryer to wash/dry/fold several loads of laundry. Washing must occur before drying and drying before folding, but each of the three operations is performed by a separate unit. To maximize efficiency, one creates a pipeline of _stages_. We'll call the washer one stage, the dryer another, and the folding process a third. To run the pipeline, one adds a second load of laundry to the washer just after the first load is added to the dryer. Likewise, the third load is added to the washer after the second is in the dryer and the first is being folded. In this way, one can make progress on three loads of laundry simultaneously. Given infinite loads, the pipeline will consistently complete a load at the rate of the slowest stage in the pipeline.

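As a rough illustration only (not validator code), here is a minimal sketch of a three-stage pipeline in Rust, where each stage runs on its own thread and hands its output to the next stage over a channel:

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    let (wash_tx, wash_rx) = channel();
    let (dry_tx, dry_rx) = channel();

    // Stage 1: the "washer" produces washed loads.
    let washer = thread::spawn(move || {
        for load in 0..3 {
            wash_tx.send(format!("load {} washed", load)).unwrap();
        }
    });

    // Stage 2: the "dryer" consumes washed loads as they arrive.
    let dryer = thread::spawn(move || {
        for item in wash_rx {
            dry_tx.send(format!("{}, dried", item)).unwrap();
        }
    });

    // Stage 3: the "folder" runs on the main thread.
    for item in dry_rx {
        println!("{}, folded", item);
    }

    washer.join().unwrap();
    dryer.join().unwrap();
}
```

While the dryer works on the first load, the washer is already on the second, which is exactly the overlap described above.
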
## Pipelining in the Validator

The validator contains two pipelined processes, one used in leader mode called the TPU and one used in validator mode called the TVU. In both cases, the hardware being pipelined is the same: the network input, the GPU cards, the CPU cores, writes to disk, and the network output. What it does with that hardware is different. The TPU exists to create ledger entries whereas the TVU exists to validate them.

docs/src/validator/blockstore.md (new file)

# Blockstore
After a block reaches finality, all blocks from that one on down to the genesis block form a linear chain with the familiar name blockchain. Until that point, however, the validator must maintain all potentially valid chains, called _forks_. The process by which forks naturally form as a result of leader rotation is described in [fork generation](../../cluster/fork-generation.md). The _blockstore_ data structure described here is how a validator copes with those forks until blocks are finalized.
The blockstore allows a validator to record every shred it observes on the network, in any order, as long as the shred is signed by the expected leader for a given slot.

Shreds are moved to a fork-able key space: the tuple of `leader slot` + `shred index` \(within the slot\). This permits the skip-list structure of the Solana protocol to be stored in its entirety, without choosing a priori which fork to follow, which Entries to persist, or when to persist them.

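A minimal sketch of that key space, assuming an ordered map keyed by the `(slot, shred index)` tuple (the map and payload types here are stand-ins, not the actual storage backend):

```rust
use std::collections::BTreeMap;

// Hypothetical key: (leader slot, shred index within the slot).
type ShredKey = (u64, u64);

fn main() {
    let mut store: BTreeMap<ShredKey, Vec<u8>> = BTreeMap::new();
    // Shreds may arrive in any order, for any fork.
    store.insert((3, 1), vec![]);
    store.insert((2, 0), vec![]);
    store.insert((3, 0), vec![]);
    // Iteration yields shreds grouped by slot, in index order,
    // regardless of the order in which they were received.
    for ((slot, index), _payload) in &store {
        println!("slot {}, shred {}", slot, index);
    }
}
```

Because keys order first by slot, shreds from competing forks simply sit side by side until one fork is finalized.
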
Repair requests for recent shreds are served out of RAM or recent files and out of deeper storage for less recent shreds, as implemented by the store backing Blockstore.
## Functionalities of Blockstore

1. Persistence: the Blockstore lives at the front of the node's verification pipeline, right behind network receive and signature verification. If the shred received is consistent with the leader schedule \(i.e. was signed by the leader for the indicated slot\), it is immediately stored.
2. Repair: repair is the same as window repair above, but able to serve any shred that's been received. Blockstore stores shreds with signatures, preserving the chain of origination.
3. Forks: Blockstore supports random access of shreds, so can support a validator's need to rollback and replay from a Bank checkpoint.
4. Restart: with proper pruning/culling, the Blockstore can be replayed by ordered enumeration of entries from slot 0. The logic of the replay stage \(i.e. dealing with forks\) will have to be used for the most recent entries in the Blockstore.

## Blockstore Design

1. Entries in the Blockstore are stored as key-value pairs, where the key is the concatenated slot index and shred index for an entry, and the value is the entry data. Note shred indexes are zero-based for each slot \(i.e. they're slot-relative\).
2. The Blockstore maintains metadata for each slot, in the `SlotMeta` struct containing \(sketched after this list\):
   * `slot_index` - The index of this slot
   * `num_blocks` - The number of blocks in the slot \(used for chaining to a previous slot\)
   * `consumed` - The highest shred index `n`, such that for all `m < n`, there exists a shred in this slot with shred index equal to `m` \(i.e. the highest consecutive shred index\).
   * `received` - The highest received shred index for the slot
   * `next_slots` - A list of future slots this slot could chain to. Used when rebuilding the ledger to find possible fork points.
   * `last_index` - The index of the shred that is flagged as the last shred for this slot. This flag on a shred will be set by the leader for a slot when they are transmitting the last shred for a slot.
   * `is_rooted` - True iff every block from 0...slot forms a full sequence without any holes. We can derive is\_rooted for each slot with the following rules. Let slot\(n\) be the slot with index `n`, and slot\(n\).is\_full\(\) is true if the slot with index `n` has all the ticks expected for that slot. Let is\_rooted\(n\) be the statement that "the slot\(n\).is\_rooted is true". Then:

     is\_rooted\(0\)
     is\_rooted\(n+1\) iff \(is\_rooted\(n\) and slot\(n\).is\_full\(\)\)
3. Chaining - When a shred for a new slot `x` arrives, we check the number of blocks \(`num_blocks`\) for that new slot \(this information is encoded in the shred\). We then know that this new slot chains to slot `x - num_blocks`.
4. Subscriptions - The Blockstore records a set of slots that have been "subscribed" to. This means entries that chain to these slots will be sent on the Blockstore channel for consumption by the ReplayStage. See the `Blockstore APIs` for details.
5. Update notifications - The Blockstore notifies listeners when slot\(n\).is\_rooted is flipped from false to true for any `n`.

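A minimal sketch of `SlotMeta` in Rust, using the field names above (types are illustrative; the real implementation may differ):

```rust
/// Illustrative per-slot metadata, following the field list above.
struct SlotMeta {
    slot_index: u64,         // the index of this slot
    num_blocks: u64,         // used for chaining to a previous slot
    consumed: u64,           // highest consecutive shred index
    received: u64,           // highest received shred index
    next_slots: Vec<u64>,    // possible fork points found while rebuilding
    last_index: Option<u64>, // index of the shred flagged as last, if seen
    is_rooted: bool,         // slots 0...slot form a full sequence with no holes
}

impl SlotMeta {
    /// is_rooted(n+1) iff is_rooted(n) and slot(n).is_full()
    fn derive_is_rooted(&mut self, prev_is_rooted: bool, prev_is_full: bool) {
        self.is_rooted = prev_is_rooted && prev_is_full;
    }
}
```
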
## Blockstore APIs

The Blockstore offers a subscription-based API that ReplayStage uses to ask for entries it's interested in. The entries will be sent on a channel exposed by the Blockstore. These subscription APIs are as follows:

1. `fn get_slots_since(slot_indexes: &[u64]) -> Vec<SlotMeta>`: Returns new slots connecting to any element of the list `slot_indexes`.
2. `fn get_slot_entries(slot_index: u64, entry_start_index: usize, max_entries: Option<u64>) -> Vec<Entry>`: Returns the entry vector for the slot starting with `entry_start_index`, capping the result at `max` if `max_entries == Some(max)`, otherwise, no upper limit on the length of the return vector is imposed.

Note: Cumulatively, this means that the replay stage will now have to know when a slot is finished, and subscribe to the next slot it's interested in to get the next set of entries. Previously, the burden of chaining slots fell on the Blockstore.
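
A rough sketch of how ReplayStage might drive these two calls; `Blockstore`, `SlotMeta`, and `Entry` are stubbed out here so the control flow stands on its own and is not the real implementation:

```rust
// Hedged sketch of replay polling the subscription API described above.
struct Entry;
struct SlotMeta { slot_index: u64 }
struct Blockstore;

impl Blockstore {
    fn get_slots_since(&self, _slot_indexes: &[u64]) -> Vec<SlotMeta> { vec![] }
    fn get_slot_entries(&self, _slot: u64, _start: usize, _max: Option<u64>) -> Vec<Entry> { vec![] }
}

fn replay_new_slots(blockstore: &Blockstore, interesting: &[u64]) {
    for meta in blockstore.get_slots_since(interesting) {
        let mut start = 0;
        loop {
            let entries = blockstore.get_slot_entries(meta.slot_index, start, Some(64));
            if entries.is_empty() {
                break; // caught up with this slot for now
            }
            start += entries.len();
            // ... verify `entries` and replay them against the bank ...
        }
    }
}

fn main() {
    replay_new_slots(&Blockstore, &[0]);
}
```

The outer loop is what now carries the burden of chaining: replay asks for slots that connect to the ones it already knows about, then drains each slot's entries in order.
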
## Interfacing with Bank
The bank exposes to replay stage:

1. `prev_hash`: which PoH chain it's working on as indicated by the hash of the last entry it processed
2. `tick_height`: the ticks in the PoH chain currently being verified by this bank
3. `votes`: a stack of records \(sketched after this list\) that contain:
   1. `prev_hashes`: what anything after this vote must chain to in PoH
   2. `tick_height`: the tick height at which this vote was cast
   3. `lockout period`: how long a chain must be observed to be in the ledger to be able to be chained below this vote

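A minimal sketch of one such vote record as a Rust struct (field names follow the list above; this is illustrative, not the bank's actual type):

```rust
/// Illustrative shape of one entry on the bank's vote stack.
struct VoteRecord {
    /// What anything after this vote must chain to in PoH.
    prev_hashes: Vec<[u8; 32]>,
    /// The tick height at which this vote was cast.
    tick_height: u64,
    /// How long a chain must be observed in the ledger before it can be
    /// chained below this vote.
    lockout_period: u64,
}
```
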
Replay stage uses Blockstore APIs to find the longest chain of entries it can hang off a previous vote. If that chain of entries does not hang off the latest vote, the replay stage rolls back the bank to that vote and replays the chain from there.
## Pruning Blockstore

Once Blockstore entries are old enough, representing all the possible forks becomes less useful, perhaps even problematic for replay upon restart. Once a validator's votes have reached max lockout, however, any Blockstore contents that are not on the PoH chain for that vote can be pruned, expunged.

Archiver nodes will be responsible for storing really old ledger contents, and validators need only persist their bank periodically.

docs/src/validator/gossip.md (new file)

# Gossip Service
The Gossip Service acts as a gateway to nodes in the control plane. Validators use the service to ensure information is available to all other nodes in a cluster. The service broadcasts information using a gossip protocol.
## Gossip Overview
Nodes continuously share signed data objects among themselves in order to manage a cluster. For example, they share their contact information, ledger height, and votes.
Every tenth of a second, each node sends a "push" message and/or a "pull" message. Push and pull messages may elicit responses, and push messages may be forwarded on to others in the cluster.
Gossip runs on a well-known UDP/IP port or a port in a well-known range. Once a cluster is bootstrapped, nodes advertise to each other where to find their gossip endpoint \(a socket address\).
## Gossip Records
Records shared over gossip are arbitrary, but signed and versioned \(with a timestamp\) as needed to make sense to the node receiving them. If a node receives two records from the same source, it updates its own copy with the record with the most recent timestamp.
## Gossip Service Interface
### Push Message

A node sends a push message to tell the cluster it has information to share. Nodes send push messages to `PUSH_FANOUT` push peers.

Upon receiving a push message, a node examines the message for:

1. Duplication: if the message has been seen before, the node drops the message and may respond with `PushMessagePrune` if forwarded from a low staked node
2. New data: if the message is new to the node \(a rough sketch of this handling follows the list\)
   * Stores the new information with an updated version in its cluster info and purges any previous older value
   * Stores the message in `pushed_once` \(used for detecting duplicates, purged after `PUSH_MSG_TIMEOUT * 5` ms\)
   * Retransmits the messages to its own push peers
3. Expiration: nodes drop push messages that are older than `PUSH_MSG_TIMEOUT`

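A hedged sketch of the three checks above; the constant value and surrounding types are stand-ins, only the control flow mirrors the list:

```rust
use std::collections::HashMap;

const PUSH_MSG_TIMEOUT_MS: u64 = 30_000; // illustrative value, not the real constant

struct PushMessage { key: String, wallclock_ms: u64, value: Vec<u8> }

enum PushOutcome { Dropped, Pruned, StoredAndRetransmitted }

struct Node {
    cluster_info: HashMap<String, (u64, Vec<u8>)>, // key -> (wallclock, value)
    pushed_once: HashMap<String, u64>,             // key -> time first seen
}

impl Node {
    fn handle_push(&mut self, msg: PushMessage, now_ms: u64, from_low_stake: bool) -> PushOutcome {
        // 3. Expiration: drop messages older than PUSH_MSG_TIMEOUT.
        if now_ms.saturating_sub(msg.wallclock_ms) > PUSH_MSG_TIMEOUT_MS {
            return PushOutcome::Dropped;
        }
        // 1. Duplication: seen before -> drop, and prune low-stake forwarders.
        if self.pushed_once.contains_key(&msg.key) {
            return if from_low_stake { PushOutcome::Pruned } else { PushOutcome::Dropped };
        }
        // 2. New data: keep the newer value, remember the message, retransmit.
        let is_newer = self
            .cluster_info
            .get(&msg.key)
            .map_or(true, |(wallclock, _)| msg.wallclock_ms > *wallclock);
        if is_newer {
            self.cluster_info.insert(msg.key.clone(), (msg.wallclock_ms, msg.value));
        }
        self.pushed_once.insert(msg.key, now_ms);
        // A real node would now forward the message to its own push peers.
        PushOutcome::StoredAndRetransmitted
    }
}
```
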
### Push Peers, Prune Message

A node selects its push peers at random from the active set of known peers. The node keeps this selection for a relatively long time. When a prune message is received, the node drops the push peer that sent the prune. Prune is an indication that there is another, higher stake-weighted path to that node than direct push.

The set of push peers is kept fresh by rotating a new node into the set every `PUSH_MSG_TIMEOUT/2` milliseconds.
### Pull Message
A node sends a pull message to ask the cluster if there is any new information. A pull message is sent to a single peer at random and comprises a Bloom filter that represents things it already has. A node receiving a pull message iterates over its values and constructs a pull response of things that miss the filter and would fit in a message.
A node constructs the pull Bloom filter by iterating over current values and recently purged values.
A node handles items in a pull response the same way it handles new data in a push message.
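
A toy sketch of that exchange: the requester summarizes what it already has in a tiny Bloom filter, and the responder returns only values that miss the filter. The filter here is deliberately simplistic and purely illustrative; the real protocol uses its own filter types and message formats:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tiny illustrative Bloom filter: two salted hashes over a fixed bit array.
struct Bloom { bits: Vec<bool> }

impl Bloom {
    fn new(num_bits: usize) -> Self { Bloom { bits: vec![false; num_bits] } }
    fn index(&self, item: &str, salt: u64) -> usize {
        let mut h = DefaultHasher::new();
        salt.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }
    fn add(&mut self, item: &str) {
        for salt in 0..2 {
            let i = self.index(item, salt);
            self.bits[i] = true;
        }
    }
    fn contains(&self, item: &str) -> bool {
        (0..2).all(|salt| self.bits[self.index(item, salt)])
    }
}

fn main() {
    // Requester: build the filter from current and recently purged values.
    let mut filter = Bloom::new(512);
    for have in ["contact-info:A", "vote:A:42"] {
        filter.add(have);
    }
    // Responder: reply with values that miss the filter (i.e. the requester lacks).
    let my_values = ["contact-info:A", "vote:A:42", "vote:B:43"];
    let mut response = Vec::new();
    for v in my_values {
        if !filter.contains(v) {
            response.push(v);
        }
    }
    println!("pull response: {:?}", response); // likely just the vote from B
}
```
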
## Purging
Nodes retain prior versions of values \(those updated by a pull or push\) and expired values \(those older than `GOSSIP_PULL_CRDS_TIMEOUT_MS`\) in `purged_values` \(things I recently had\). Nodes purge `purged_values` that are older than `5 * GOSSIP_PULL_CRDS_TIMEOUT_MS`.
## Eclipse Attacks
An eclipse attack is an attempt to take over the set of node connections with adversarial endpoints.
This is relevant to our implementation in the following ways.
* Pull messages select a random node from the network. An eclipse attack on _pull_ would require an attacker to influence the random selection in such a way that only adversarial nodes are selected for pull.
* Push messages maintain an active set of nodes and select a random fanout for every push message. An eclipse attack on _push_ would influence the active set selection, or the random fanout selection.

### Time and Stake-based Weights

Weights are calculated based on `time since last picked` and the `natural log` of the `stake weight`.

Taking the `ln` of the stake weight allows giving all nodes a fairer chance of network coverage in a reasonable amount of time. It helps normalize the large possible `stake weight` differences between nodes. This way a node with low `stake weight`, compared to a node with large `stake weight`, will only have to wait a few multiples of ln\(`stake`\) seconds before it gets picked.

There is no way for an adversary to influence these parameters.
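
A hedged sketch of how such a weight could be computed; the exact formula and units in the implementation may differ, this only shows the `ln(stake)` damping combined with the time-since-picked term described above:

```rust
/// Illustrative weight: grows with time since the node was last picked,
/// scaled by ln(stake) so very large stakes don't dominate selection.
fn gossip_weight(stake_lamports: u64, secs_since_last_picked: u64) -> f64 {
    let stake_term = (stake_lamports.max(1) as f64).ln().max(1.0);
    let wait_term = (secs_since_last_picked + 1) as f64;
    stake_term * wait_term
}

fn main() {
    // A low-stake node catches up after waiting a little longer.
    println!("{:.1}", gossip_weight(1_000, 10));        // small stake, waited 10s
    println!("{:.1}", gossip_weight(1_000_000_000, 1)); // big stake, just picked
}
```
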
### Pull Message
A node is selected as a pull target based on the weights described above.
### Push Message
A prune message can only remove an adversary from a potential connection.
Just like _pull message_, nodes are selected into the active set based on weights.
## Notable differences from PlumTree
The active push protocol described here is based on [Plum Tree](https://haslab.uminho.pt/jop/files/lpr07a.pdf). The main differences are:
* Push messages have a wallclock that is signed by the originator. Once the wallclock expires the message is dropped. A hop limit is difficult to implement in an adversarial setting.
* Lazy Push is not implemented because it's not obvious how to prevent an adversary from forging the message fingerprint. A naive approach would allow an adversary to be prioritized for pull based on their input.

docs/src/validator/runtime.md (new file)

# The Runtime
The runtime is a concurrent transaction processor. Transactions specify their data dependencies upfront and dynamic memory allocation is explicit. By separating program code from the state it operates on, the runtime is able to choreograph concurrent access. Transactions accessing only read-only accounts are executed in parallel whereas transactions accessing writable accounts are serialized. The runtime interacts with the program through an entrypoint with a well-defined interface. The data stored in an account is an opaque type, an array of bytes. The program has full control over its contents.

The transaction structure specifies a list of public keys and signatures for those keys and a sequential list of instructions that will operate over the states associated with the account keys. For the transaction to be committed, all the instructions must execute successfully; if any aborts, the whole transaction fails to commit.

### Account Structure

Accounts maintain a lamport balance and program-specific memory.
## Transaction Engine
The engine maps public keys to accounts and routes them to the program's entrypoint.
### Execution

Transactions are batched and processed in a pipeline. The TPU and TVU follow slightly different paths. The TPU runtime ensures that PoH recording occurs before memory is committed.

The TVU runtime ensures that PoH verification occurs before the runtime processes any transactions.

At the _execute_ stage, the loaded accounts have no data dependencies, so all the programs can be executed in parallel.
The runtime enforces the following rules:

1. Only the _owner_ program may modify the contents of an account. This means that upon assignment the data vector is guaranteed to be zeroed.
2. The total balance across all the accounts is equal before and after execution of a transaction.
3. After the transaction is executed, balances of read-only accounts must be equal to the balances before the transaction.
4. All instructions in the transaction are executed atomically. If one fails, all account modifications are discarded.

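As a toy illustration of rules 2 and 3 (not the runtime's actual accounting code), a check that balances are conserved and read-only accounts are untouched:

```rust
/// Toy balance-conservation check for rules 2 and 3 above.
/// `before`/`after` are lamport balances of the transaction's accounts,
/// `writable` marks which of them the transaction may modify.
fn check_balances(before: &[u64], after: &[u64], writable: &[bool]) -> bool {
    // Rule 2: the total balance must be unchanged.
    let total_ok = before.iter().sum::<u64>() == after.iter().sum::<u64>();
    // Rule 3: read-only accounts keep exactly the balance they started with.
    let readonly_ok = before
        .iter()
        .zip(after)
        .zip(writable)
        .all(|((b, a), w)| *w || b == a);
    total_ok && readonly_ok
}

fn main() {
    // Transfer 5 lamports between two writable accounts, one read-only bystander.
    let before = [10, 0, 7];
    let after = [5, 5, 7];
    assert!(check_balances(&before, &after, &[true, true, false]));
}
```
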
Execution of the program involves mapping the program's public key to an entrypoint which takes a pointer to the transaction and an array of loaded accounts.

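A sketch of what such an entrypoint could look like; the types and signature here are illustrative stand-ins, the exact interface the runtime presents to programs is defined by the SDK, not by this document:

```rust
/// Illustrative account view handed to a program by the runtime.
struct KeyedAccount {
    is_signer: bool,
    is_writable: bool,
    lamports: u64,
    data: Vec<u8>, // opaque bytes, fully controlled by the owner program
}

/// Illustrative entrypoint shape: the runtime resolves the program's public
/// key to this function and hands it the loaded accounts plus the
/// instruction's opaque data.
fn process_instruction(accounts: &mut [KeyedAccount], instruction_data: &[u8]) -> Result<(), String> {
    // A real program would deserialize `instruction_data` and update the
    // writable accounts it owns.
    let _ = (accounts, instruction_data);
    Ok(())
}

fn main() {
    let mut accounts = vec![KeyedAccount { is_signer: true, is_writable: true, lamports: 1, data: vec![] }];
    process_instruction(&mut accounts, &[]).unwrap();
}
```
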
### SystemProgram Interface
The interface is best described by the `Instruction::data` that the user encodes.

* `CreateAccount` - This allows the user to create an account with an allocated data array and assign it to a Program.
* `CreateAccountWithSeed` - Same as `CreateAccount`, but the new account's address is derived from
  - the funding account's pubkey,
  - a mnemonic string (seed), and
  - the pubkey of the Program
* `Assign` - Allows the user to assign an existing account to a program.
* `Transfer` - Transfers lamports between accounts.

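A hedged sketch of that instruction data as a Rust enum; variant payloads are illustrative guesses at what each operation needs, not the actual wire format:

```rust
/// Illustrative shape of the system program's instruction data.
/// A client serializes one of these variants into `Instruction::data`.
enum SystemInstruction {
    /// Create an account with `space` bytes of zeroed data, funded with
    /// `lamports`, and assign it to the `owner` program.
    CreateAccount { lamports: u64, space: u64, owner: [u8; 32] },
    /// Same as CreateAccount, but the new address is derived from the
    /// funding account's pubkey, a seed string, and the owner program's pubkey.
    CreateAccountWithSeed { seed: String, lamports: u64, space: u64, owner: [u8; 32] },
    /// Assign an existing account to a program.
    Assign { owner: [u8; 32] },
    /// Move lamports between two accounts named by the transaction.
    Transfer { lamports: u64 },
}
```
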
### Program State Security

For the blockchain to function correctly, the program code must be resilient to user inputs. That is why in this design the program-specific code is the only code that can change the state of the data byte array in the Accounts that are assigned to it. It is also the reason why `Assign` or `CreateAccount` must zero out the data. Otherwise there would be no possible way for the program to distinguish the recently assigned account data from a natively generated state transition without some additional metadata from the runtime to indicate that this memory is assigned instead of natively generated.

To pass messages between programs, the receiving program must accept the message and copy the state over. But in practice a copy isn't needed and is undesirable. The receiving program can read the state belonging to other Accounts without copying it, and during the read it has a guarantee of the sender program's state.
### Notes

* There is no dynamic memory allocation. Clients need to use `CreateAccount` instructions to create memory before passing it to another program. This instruction can be composed into a single transaction with the call to the program itself.
* `CreateAccount` and `Assign` guarantee that when an account is assigned to a program, the Account's data is zero initialized.
* Transactions that assign an account to a program or allocate space must be signed by the Account address' private key unless the Account is being created by `CreateAccountWithSeed`, in which case there is no corresponding private key for the account's address/pubkey.
* Once assigned to a program, an Account cannot be reassigned.
* Runtime guarantees that a program's code is the only code that can modify the data of Accounts assigned to it.
* Runtime guarantees that the program can only spend lamports that are in accounts that are assigned to it.
* Runtime guarantees that the total of account balances is equal before and after the transaction.
* Runtime guarantees that all instructions executed successfully when a transaction is committed.

## Future Work

* [Continuations and Signals for long running Transactions](https://github.com/solana-labs/solana/issues/1485)

docs/src/validator/tpu.md (new file)

# TPU


docs/src/validator/tvu.md (new file)

# TVU
