Move docs from book/ to docs/ (#8469)

automerge
This commit is contained in:
Justin Starry
2020-02-26 23:11:38 +08:00
committed by GitHub
parent 8839dbfe5b
commit 021d0a46f8
140 changed files with 56 additions and 58 deletions


@@ -0,0 +1,40 @@
# A Solana Cluster
A Solana cluster is a set of validators working together to serve client transactions and maintain the integrity of the ledger. Many clusters may coexist. When two clusters share a common genesis block, they attempt to converge. Otherwise, they simply ignore the existence of the other. Transactions sent to the wrong one are quietly rejected. In this section, we'll discuss how a cluster is created, how nodes join the cluster, how they share the ledger, how they ensure the ledger is replicated, and how they cope with buggy and malicious nodes.
## Creating a Cluster
Before starting any validators, one first needs to create a _genesis config_. The config references two public keys, a _mint_ and a _bootstrap validator_. The validator holding the bootstrap validator's private key is responsible for appending the first entries to the ledger. It initializes its internal state with the mint's account. That account will hold the number of native tokens defined by the genesis config. The second validator then contacts the bootstrap validator to register as a _validator_ or _archiver_. Additional validators then register with any registered member of the cluster.
A validator receives all entries from the leader and submits votes confirming those entries are valid. After voting, the validator is expected to store those entries until archiver nodes submit proofs that they have stored copies of them. Once the validator observes a sufficient number of copies exist, it deletes its copy.
## Joining a Cluster
Validators and archivers enter the cluster via registration messages sent to its _control plane_. The control plane is implemented using a _gossip_ protocol, meaning that a node may register with any existing node, and expect its registration to propagate to all nodes in the cluster. The time it takes for all nodes to synchronize is proportional to the square of the number of nodes participating in the cluster. Algorithmically, that's considered very slow, but in exchange for that time, a node is assured that it eventually has all the same information as every other node, and that that information cannot be censored by any one node.
## Sending Transactions to a Cluster
Clients send transactions to any validator's Transaction Processing Unit \(TPU\) port. If the node is in the validator role, it forwards the transaction to the designated leader. If in the leader role, the node bundles incoming transactions, timestamps them creating an _entry_, and pushes them onto the cluster's _data plane_. Once on the data plane, the transactions are validated by validator nodes and replicated by archiver nodes, effectively appending them to the ledger.
## Confirming Transactions
A Solana cluster is capable of subsecond _confirmation_ for up to 150 nodes with plans to scale up to hundreds of thousands of nodes. Once fully implemented, confirmation times are expected to increase only with the logarithm of the number of validators, where the logarithm's base is very high. If the base is one thousand, for example, it means that for the first thousand nodes, confirmation will be the duration of three network hops plus the time it takes the slowest validator of a supermajority to vote. For the next million nodes, confirmation increases by only one network hop.
Solana defines confirmation as the duration of time from when the leader timestamps a new entry to the moment when it recognizes a supermajority of ledger votes.
A gossip network is much too slow to achieve subsecond confirmation once the network grows beyond a certain size. The time it takes to send messages to all nodes is proportional to the square of the number of nodes. If a blockchain wants to achieve a low confirmation time and attempts to do it using a gossip network, it will be forced to centralize to just a handful of nodes.
Scalable confirmation can be achieved using the following combination of techniques:
1. Timestamp transactions with a VDF sample and sign the timestamp.
2. Split the transactions into batches, send each to separate nodes and have
each node share its batch with its peers.
3. Repeat the previous step recursively until all nodes have all batches.
Solana rotates leaders at fixed intervals, called _slots_. Each leader may only produce entries during its allotted slot. The leader therefore timestamps transactions so that validators may look up the public key of the designated leader. The leader then signs the timestamp so that a validator may verify the signature, proving the signer is the owner of the designated leader's public key.
Next, transactions are broken into batches so that a node can send transactions to multiple parties without making multiple copies. If, for example, the leader needed to send 60 transactions to 6 nodes, it would break that collection of 60 into batches of 10 transactions and send one to each node. This allows the leader to put 60 transactions on the wire, not 60 transactions for each node. Each node then shares its batch with its peers. Once the node has collected all 6 batches, it reconstructs the original set of 60 transactions.
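The arithmetic of that example can be sketched in a few lines of Rust \(with `u64` values standing in for transactions; the real implementation signs and shreds the data before transmission\):

```rust
// Split a leader's transactions into one batch per peer so that each
// transaction is put on the wire exactly once.
fn split_into_batches(transactions: Vec<u64>, num_peers: usize) -> Vec<Vec<u64>> {
    let batch_size = (transactions.len() + num_peers - 1) / num_peers;
    transactions.chunks(batch_size).map(|c| c.to_vec()).collect()
}

fn main() {
    let transactions: Vec<u64> = (0..60).collect(); // stand-ins for transactions
    let batches = split_into_batches(transactions, 6);
    // 60 transactions total on the wire, 10 per peer, instead of 60 per peer.
    // Each peer then shares its batch so all can reconstruct the set of 60.
    assert_eq!(batches.len(), 6);
    assert!(batches.iter().all(|b| b.len() == 10));
}
```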
A batch of transactions can only be split so many times before it is so small that header information becomes the primary consumer of network bandwidth. At the time of this writing, the approach is scaling well up to about 150 validators. To scale up to hundreds of thousands of validators, each node can apply the same technique as the leader node to another set of nodes of equal size. We call the technique [_Turbine Block Propagation_](turbine-block-propagation.md).


@@ -0,0 +1,80 @@
# Fork Generation
This section describes how forks naturally occur as a consequence of [leader rotation](leader-rotation.md).
## Overview
Nodes take turns being leader and generating the PoH that encodes state changes. The cluster can tolerate loss of connection to any leader by synthesizing what the leader _**would**_ have generated had it been connected but not ingesting any state changes. The possible number of forks is thereby limited to a "there/not-there" skip list of forks that may arise on leader rotation slot boundaries. At any given slot, only a single leader's transactions will be accepted.
## Message Flow
1. Transactions are ingested by the current leader.
2. Leader filters valid transactions.
3. Leader executes valid transactions updating its state.
4. Leader packages transactions into entries based off its current PoH slot.
5. Leader transmits the entries to validator nodes \(in signed shreds\)
   1. The PoH stream includes ticks; empty entries that indicate liveness of
      the leader and the passage of time on the cluster.
   2. A leader's stream begins with the tick entries necessary to complete the
      PoH back to the leader's most recently observed prior leader slot.
6. Validators retransmit entries to peers in their set and to further
downstream nodes.
7. Validators validate the transactions and execute them on their state.
8. Validators compute the hash of the state.
9. At specific times, i.e. specific PoH tick counts, validators transmit votes
   to the leader.
   1. Votes are signatures of the hash of the computed state at that PoH tick
      count.
   2. Votes are also propagated via gossip.
10. Leader executes the votes as any other transaction and broadcasts them to
the cluster.
11. Validators observe their votes and all the votes from the cluster.
## Partitions, Forks
Forks can arise at PoH tick counts that correspond to a vote. The next leader may not have observed the last vote slot and may start their slot with generated virtual PoH entries. These empty ticks are generated by all nodes in the cluster at a cluster-configured rate of `Z` hashes per tick.
There are only two possible versions of the PoH during a voting slot: PoH with `T` ticks and entries generated by the current leader, or PoH with just ticks. The "just ticks" version of the PoH can be thought of as a virtual ledger, one that all nodes in the cluster can derive from the last tick in the previous slot.
Validators can ignore forks at other points \(e.g. from the wrong leader\), or slash the leader responsible for the fork.
Validators vote based on a greedy choice to maximize their reward described in [Tower BFT](../implemented-proposals/tower-bft.md).
### Validator's View
#### Time Progression
The diagram below represents a validator's view of the PoH stream with possible forks over time. L1, L2, etc. are leader slots, and `E`s represent entries from that leader during that leader's slot. The `x`s represent ticks only, and time flows downwards in the diagram.
![Fork generation](../.gitbook/assets/fork-generation.svg)
Note that an `E` appearing on 2 forks at the same slot is a slashable condition, so a validator observing `E3` and `E3'` can slash L3 and safely choose `x` for that slot. Once a validator commits to a fork, other forks can be discarded below that tick count. For any slot, validators need only consider a single "has entries" chain or a "ticks only" chain to be proposed by a leader. But multiple virtual entries may overlap as they link back to a previous slot.
#### Time Division
It's useful to consider leader rotation over PoH tick count as time division of the job of encoding state for the cluster. The following table presents the above tree of forks as a time-divided ledger.
| leader slot | L1 | L2 | L3 | L4 | L5 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| data | E1 | E2 | E3 | E4 | E5 |
| ticks since prev | | | | x | xx |
Note that only data from leader L3 will be accepted during leader slot L3. Data from L3 may include "catchup" ticks back to a slot other than L2 if L3 did not observe L2's data. L4 and L5's transmissions include the "ticks to prev" PoH entries.
This arrangement of the network data streams permits nodes to save exactly this to the ledger for replay, restart, and checkpoints.
### Leader's View
When a new leader begins a slot, it must first transmit any PoH \(ticks\) required to link the new slot with the most recently observed and voted slot. The fork the leader proposes would link the current slot to a previous fork that the leader has voted on with virtual ticks.


@@ -0,0 +1,97 @@
# Leader Rotation
At any given moment, a cluster expects only one validator to produce ledger entries. By having only one leader at a time, all validators are able to replay identical copies of the ledger. The drawback of only one leader at a time, however, is that a malicious leader is capable of censoring votes and transactions. Since censoring cannot be distinguished from the network dropping packets, the cluster cannot simply elect a single node to hold the leader role indefinitely. Instead, the cluster minimizes the influence of a malicious leader by rotating which node takes the lead.
Each validator selects the expected leader using the same algorithm, described below. When the validator receives a new signed ledger entry, it can be certain that entry was produced by the expected leader. The sequence of slots assigned to each leader is called a _leader schedule_.
## Leader Schedule Rotation
A validator rejects blocks that are not signed by the _slot leader_. The list of identities of all slot leaders is called a _leader schedule_. The leader schedule is recomputed locally and periodically. It assigns slot leaders for a duration of time called an _epoch_. The schedule must be computed far in advance of the slots it assigns, such that the ledger state it uses to compute the schedule is finalized. That duration is called the _leader schedule offset_. Solana sets the offset to the duration of slots until the next epoch. That is, the leader schedule for an epoch is calculated from the ledger state at the start of the previous epoch. The offset of one epoch is fairly arbitrary and assumed to be sufficiently long such that all validators will have finalized their ledger state before the next schedule is generated. A cluster may choose to shorten the offset to reduce the time between stake changes and leader schedule updates.
While operating without partitions lasting longer than an epoch, the schedule only needs to be generated when the root fork crosses the epoch boundary. Since the schedule is for the next epoch, any new stakes committed to the root fork will not be active until the next epoch. The block used for generating the leader schedule is the first block to cross the epoch boundary.
Without a partition lasting longer than an epoch, the cluster will work as follows:
1. A validator continuously updates its own root fork as it votes.
2. The validator updates its leader schedule each time the slot height crosses an epoch boundary.
For example:
The epoch duration is 100 slots. The root fork is updated from a fork computed at slot height 99 to a fork computed at slot height 102. Forks with slots at heights 100 and 101 were skipped because of failures. The new leader schedule is computed using the fork at slot height 102. It is active from slot 200 until it is updated again.
No inconsistency can exist because every validator that is voting with the cluster has skipped 100 and 101 when its root passes 102. All validators, regardless of voting pattern, would be committing to a root that is either 102, or a descendant of 102.
### Leader Schedule Rotation with Epoch-Sized Partitions
The duration of the leader schedule offset has a direct relationship to the likelihood of a cluster having an inconsistent view of the correct leader schedule.
Consider the following scenario:
Two partitions are each generating half of the blocks. Neither is coming to a definitive supermajority fork. Both will cross epochs 100 and 200 without actually committing to a root, and therefore without a cluster-wide commitment to a new leader schedule.
In this unstable scenario, multiple valid leader schedules exist.
* A leader schedule is generated for every fork whose direct parent is in the previous epoch.
* The leader schedule is valid after the start of the next epoch for descendant forks until it is updated.
Each partition's schedule will diverge after the partition lasts more than an epoch. For this reason, the epoch duration should be selected to be much larger than the slot time and the expected length for a fork to be committed to root.
After observing the cluster for a sufficient amount of time, the leader schedule offset can be selected based on the median partition duration and its standard deviation. For example, an offset longer than the median partition duration plus six standard deviations would reduce the likelihood of an inconsistent leader schedule in the cluster to 1 in 1 million.
## Leader Schedule Generation at Genesis
The genesis config declares the first leader for the first epoch. This leader ends up scheduled for the first two epochs because the leader schedule is also generated at slot 0 for the next epoch. The length of the first two epochs can be specified in the genesis config as well. The minimum length of the first epochs must be greater than or equal to the maximum rollback depth as defined in [Tower BFT](../implemented-proposals/tower-bft.md).
## Leader Schedule Generation Algorithm
The leader schedule is generated using a predefined seed. The process is as follows:
1. Periodically use the PoH tick height \(a monotonically increasing counter\) to
seed a stable pseudo-random algorithm.
2. At that height, sample the bank for all the staked accounts with leader
identities that have voted within a cluster-configured number of ticks. The
sample is called the _active set_.
3. Sort the active set by stake weight.
4. Use the random seed to select nodes weighted by stake to create a
stake-weighted ordering.
5. This ordering becomes valid after a cluster-configured number of ticks.
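A minimal sketch of those steps, runnable as-is. The `XorShift64` generator and string identities are stand-ins: the real implementation seeds a stable PRNG from the PoH tick height, and the active set comes from the bank:

```rust
// Stand-in for the stable seeded PRNG used in step 1.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Steps 3-4: pick `slots` leaders from the active set, weighted by stake.
fn leader_schedule(active_set: &[(&str, u64)], seed: u64, slots: usize) -> Vec<String> {
    let total_stake: u64 = active_set.iter().map(|(_, stake)| stake).sum();
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    (0..slots)
        .map(|_| {
            // Sample a point in [0, total_stake) and walk the cumulative weights.
            let mut point = rng.next() % total_stake;
            for (id, stake) in active_set {
                if point < *stake {
                    return id.to_string();
                }
                point -= stake;
            }
            unreachable!("point is always less than total_stake")
        })
        .collect()
}

fn main() {
    // The active set, already sorted by stake weight (step 3).
    let active_set = [("validator-a", 500), ("validator-b", 300), ("validator-c", 200)];
    // Every node computes the same schedule from the same seed.
    println!("{:?}", leader_schedule(&active_set, 42, 8));
}
```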
## Schedule Attack Vectors
### Seed
The seed that is selected is predictable but unbiasable. There is no grinding attack to influence its outcome.
### Active Set
A leader can bias the active set by censoring validator votes. Two possible ways exist for leaders to censor the active set:
* Ignore votes from validators
* Refuse to vote for blocks with votes from validators
To reduce the likelihood of censorship, the active set is calculated at the leader schedule offset boundary over an _active set sampling duration_. The active set sampling duration is long enough such that votes will have been collected by multiple leaders.
### Staking
Leaders can censor new staking transactions or refuse to validate blocks with new stakes. This attack is similar to censorship of validator votes.
### Validator operational key loss
Leaders and validators are expected to use ephemeral keys for operation, and stake owners authorize the validators to do work with their stake via delegation.
The cluster should be able to recover from the loss of all the ephemeral keys used by leaders and validators, which could occur through a common software vulnerability shared by all the nodes. Stake owners should be able to vote directly by co-signing a validator vote even though the stake is currently delegated to a validator.
## Appending Entries
The lifetime of a leader schedule is called an _epoch_. The epoch is split into _slots_, where each slot has a duration of `T` PoH ticks.
A leader transmits entries during its slot. After `T` ticks, all the validators switch to the next scheduled leader. Validators must ignore entries sent outside a leader's assigned slot.
All `T` ticks must be observed by the next leader for it to build its own entries on. If entries are not observed \(leader is down\) or entries are invalid \(leader is buggy or malicious\), the next leader must produce ticks to fill the previous leader's slot. Note that the next leader should do repair requests in parallel, and postpone sending ticks until it is confident other validators also failed to observe the previous leader's entries. If a leader incorrectly builds on its own ticks, the leader following it must replace all its ticks.


@@ -0,0 +1,269 @@
# Ledger Replication
At full capacity on a 1 Gbps network, Solana will generate 4 petabytes of data per year. To prevent the network from centralizing around validators that have to store the full data set, this protocol proposes a way for mining nodes to provide storage capacity for pieces of the data.
The basic idea behind Proof of Replication is to encrypt a dataset with a public symmetric key using CBC encryption, then hash the encrypted dataset. The main problem with the naive approach is that a dishonest storage node can stream the encryption and delete the data as it's hashed. The simple solution is to periodically regenerate the hash based on a signed PoH value. This ensures that all the data is present during the generation of the proof, but it also requires validators to have the entirety of the encrypted data present for verification of every proof of every identity. So the space required to validate is `number_of_proofs * data_size`.
## Optimization with PoH
Our improvement on this approach is to randomly sample the encrypted segments faster than the time it takes to encrypt them, and record the hash of those samples into the PoH ledger. Thus the segments stay in the exact same order for every PoRep, and verification can stream the data and verify all the proofs in a single batch. This way we can verify multiple proofs concurrently, each one on its own CUDA core. The total space required for verification is `1_ledger_segment + 2_cbc_blocks * number_of_identities` with core count equal to `number_of_identities`. We use a 64-byte chacha CBC block size.
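A sketch of the sampling idea, assuming the segment has already been encrypted. The fixed offsets stand in for PoH-seeded samples, and the hash comes from the `sha2` crate:

```rust
use sha2::{Digest, Sha256};

// Hash 32-byte windows of the encrypted segment at seed-derived offsets.
// A verifier replays the same offsets against its copy of the encrypted
// segment and checks that the digests match, without rehashing everything.
fn sample_segment(encrypted_segment: &[u8], offsets: &[usize]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    for &offset in offsets {
        let end = (offset + 32).min(encrypted_segment.len());
        hasher.update(&encrypted_segment[offset..end]);
    }
    hasher.finalize().into()
}

fn main() {
    let encrypted_segment = vec![0xAB_u8; 1 << 16]; // stand-in for a chacha-CBC encrypted segment
    let offsets = [0, 4_096, 8_192, 20_480]; // stand-in for NUM_STORAGE_SAMPLES seeded offsets
    let proof = sample_segment(&encrypted_segment, &offsets);
    println!("proof digest starts with {:02x?}", &proof[..4]);
}
```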
## Network
Validators for PoRep are the same validators that are verifying transactions. If an archiver can prove that a validator verified a fake PoRep, then the validator will not receive a reward for that storage epoch.
Archivers are specialized _light clients_. They download a part of the ledger \(a.k.a. a segment\) and store it, and provide PoReps of storing the ledger. For each verified PoRep, archivers earn a reward of SOL from the mining pool.
## Constraints
We have the following constraints:
* Verification requires generating the CBC blocks. That requires space of 2
  blocks per identity, and 1 CUDA core per identity for the same dataset. So as
  many identities as possible should be batched at once, with all the proofs for
  those identities verified concurrently for the same dataset.
* Validators will randomly sample the set of storage proofs down to the number
  that they can handle, and only the creators of those chosen proofs will be
  rewarded. The validator can run a benchmark whenever its hardware configuration
  changes to determine what rate it can validate storage proofs.
## Validation and Replication Protocol
### Constants
1. SLOTS\_PER\_SEGMENT: Number of slots in a segment of ledger data. The
unit of storage for an archiver.
2. NUM\_KEY\_ROTATION\_SEGMENTS: Number of segments after which archivers
regenerate their encryption keys and select a new dataset to store.
3. NUM\_STORAGE\_PROOFS: Number of storage proofs required for a storage proof
claim to be successfully rewarded.
4. RATIO\_OF\_FAKE\_PROOFS: Ratio of fake proofs to real proofs that a storage
mining proof claim has to contain to be valid for a reward.
5. NUM\_STORAGE\_SAMPLES: Number of samples required for a storage mining
proof.
6. NUM\_CHACHA\_ROUNDS: Number of encryption rounds performed to generate
encrypted state.
7. NUM\_SLOTS\_PER\_TURN: Number of slots that define a single storage epoch or
a "turn" of the PoRep game.
### Validator behavior
1. Validators join the network and begin looking for archiver accounts at each
storage epoch/turn boundary.
2. Every turn, validators sign the PoH value at the boundary and use that signature
to randomly pick proofs to verify from each storage account found at the turn boundary.
This signed value is also submitted to the validator's storage account and will be used by
archivers at a later stage to cross-verify.
3. Every `NUM_SLOTS_PER_TURN` slots the validator advertises the PoH value. This value
is also served to archivers via RPC interfaces.
4. For a given turn N, all validations get locked out until turn N+3 \(a gap of 2 turns/epochs\).
At that point all validations during that turn are available for reward collection.
5. Any incorrect validations will be marked during the turn in between.
### Archiver behavior
1. Since an archiver is somewhat of a light client and does not download all the
ledger data, it has to rely on other validators and archivers for information.
Any given validator may or may not be malicious and give incorrect information, although
there are no obvious attack vectors that this could accomplish besides having the
archiver do extra wasted work. For many of the operations there are a number of options
depending on how paranoid an archiver is:
* \(a\) archiver can ask a validator
* \(b\) archiver can ask multiple validators
* \(c\) archiver can ask other archivers
* \(d\) archiver can subscribe to the full transaction stream and generate
the information itself \(assuming the slot is recent enough\)
* \(e\) archiver can subscribe to an abbreviated transaction stream to
generate the information itself \(assuming the slot is recent enough\)
2. An archiver obtains the PoH hash corresponding to the last turn, along with its slot.
3. The archiver signs the PoH hash with its keypair. That signature is the
seed used to pick the segment to replicate and also the encryption key. The
archiver mods the signature with the slot to get which segment to
replicate.
4. The archiver retrieves the ledger by asking peer validators and
archivers. See 6.5.
5. The archiver then encrypts that segment with the key using the chacha algorithm
in CBC mode with `NUM_CHACHA_ROUNDS` rounds of encryption.
6. The archiver initializes a chacha rng with a signed recent PoH value as
the seed.
7. The archiver generates `NUM_STORAGE_SAMPLES` samples in the range of the
entry size and samples the encrypted segment with sha256 for 32-bytes at each
offset value. Sampling the state should be faster than generating the encrypted
segment.
8. The archiver sends a PoRep proof transaction, which contains its sha state
at the end of the sampling operation, its seed, and the samples it used, to the
current leader, and it is put onto the ledger.
9. During a given turn, the archiver should submit many proofs for the same segment,
and based on the `RATIO_OF_FAKE_PROOFS`, some of those proofs must be fake.
10. As the PoRep game enters the next turn, the archiver must submit a
transaction with the mask of which proofs were fake during the last turn. This
transaction will define the rewards for both archivers and validators.
11. Finally for a turn N, as the PoRep game enters turn N + 3, archivers' proofs for
turn N will be counted towards their rewards.
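Steps 2-3 of the list above reduce to a small amount of arithmetic. In the sketch below, the `u64` "signature" is a stand-in for a real signature over the turn-boundary PoH hash:

```rust
// Pick the segment to replicate by modding the signature with the number
// of segments the ledger has produced so far (steps 2-3 above).
fn pick_segment(signature: u64, current_slot: u64, slots_per_segment: u64) -> u64 {
    let num_segments = (current_slot / slots_per_segment).max(1);
    signature % num_segments
}

fn main() {
    let signature = 0xDEAD_BEEF_u64; // stand-in for a signature over the PoH hash
    let segment = pick_segment(signature, 1_000_000, 16_384);
    println!("replicate segment {}", segment); // the same signature also seeds the encryption key
}
```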
### The PoRep Game
The Proof of Replication game has 4 primary stages. For each "turn", multiple PoRep games can be in progress, each in a different stage.
The 4 stages of the PoRep Game are as follows:
1. Proof submission stage
* Archivers: submit as many proofs as possible during this stage
* Validators: No-op
2. Proof verification stage
* Archivers: No-op
* Validators: Select archivers and verify their proofs from the previous turn
3. Proof challenge stage
* Archivers: Submit the proof mask with justifications \(for fake proofs submitted 2 turns ago\)
* Validators: No-op
4. Reward collection stage
* Archivers: Collect rewards for 3 turns ago
* Validators: Collect rewards for 3 turns ago
For each turn of the PoRep game, both Validators and Archivers evaluate each stage. The stages are run as separate transactions on the storage program.
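The turn arithmetic can be sketched as a lookup from a proof's submission turn to its current stage \(hypothetical names; the real storage program tracks this state on-chain\):

```rust
// Which stage of the PoRep game a proof submitted at `proof_turn` is in,
// as of `current_turn`. Overlapping games sit one turn apart.
#[derive(Debug)]
enum PoRepStage {
    ProofSubmission,   // archivers submit as many proofs as possible
    ProofVerification, // validators verify the previous turn's proofs
    ProofChallenge,    // archivers reveal the fake-proof mask
    RewardCollection,  // both sides collect rewards
}

fn stage_for(proof_turn: u64, current_turn: u64) -> Option<PoRepStage> {
    match current_turn.checked_sub(proof_turn)? {
        0 => Some(PoRepStage::ProofSubmission),
        1 => Some(PoRepStage::ProofVerification),
        2 => Some(PoRepStage::ProofChallenge),
        3 => Some(PoRepStage::RewardCollection),
        _ => None, // this game is complete
    }
}

fn main() {
    for turn in 10..=14 {
        println!("turn {}: {:?}", turn, stage_for(10, turn));
    }
}
```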
### Finding who has a given block of ledger
1. Validators monitor the turns in the PoRep game and look at the rooted bank
at turn boundaries for any proofs.
2. Validators maintain a map of ledger segments and corresponding archiver public keys.
The map is updated when a Validator processes an archiver's proofs for a segment.
The validator provides an RPC interface to access this map. Using this API, clients
can map a segment to an archiver's network address \(correlating it via cluster\_info table\).
The clients can then send repair requests to the archiver to retrieve segments.
3. Validators would need to invalidate this list every N turns.
## Sybil attacks
For any random seed, we force everyone to use a signature that is derived from a PoH hash at the turn boundary. Everyone uses the same count, so the same PoH hash is signed by every participant. The signatures are then each cryptographically tied to the keypair, which prevents a leader from grinding on the resulting value for more than 1 identity.
Since there are many more client identities than encryption identities, we need to split the reward for multiple clients, and prevent Sybil attacks from generating many clients to acquire the same block of data. To remain BFT, we want to prevent a single human entity from storing all the replications of a single chunk of the ledger.
Our solution to this is to force the clients to continue using the same identity. If the first round is used to acquire the same block for many client identities, the second round for the same client identities will force a redistribution of the signatures, and therefore of the PoRep identities and blocks. Thus, to get a reward, archivers need to store the first block for free, and the network can reward long-lived client identities more than new ones.
## Validator attacks
* If a validator approves fake proofs, an archiver can easily out them by
showing the initial state for the hash.
* If a validator marks real proofs as fake, no on-chain computation can be done
to distinguish who is correct. Rewards would have to rely on the results from
multiple validators to catch bad actors and to keep archivers from being denied rewards.
* A validator could steal mining proof results for itself. Since the proofs are derived
from a signature from an archiver, and the validator does not know the
private key used to generate the encryption key, it cannot be the generator of
the proof.
## Reward incentives
Fake proofs are easy to generate but difficult to verify. For this reason, PoRep proof transactions generated by archivers may require a higher fee than a normal transaction to represent the computational cost required by validators.
Some percentage of fake proofs are also necessary to receive a reward from storage mining.
## Notes
* We can reduce the costs of verification of PoRep by using PoH, and actually
make it feasible to verify a large number of proofs for a global dataset.
* We can eliminate grinding by forcing everyone to sign the same PoH hash and
use the signatures as the seed.
* The game between validators and archivers is over random blocks and random
encryption identities and random data samples. The goal of randomization is
to prevent colluding groups from having overlap on data or validation.
* Archiver clients fish for lazy validators by submitting fake proofs that
they can prove are fake.
* To defend against Sybil client identities that try to store the same block we
force the clients to store for multiple rounds before receiving a reward.
* Validators should also get rewarded for validating submitted storage proofs
as incentive for storing the ledger. They can only validate proofs if they
are storing that slice of the ledger.


@@ -0,0 +1,34 @@
# Managing Forks
The ledger is permitted to fork at slot boundaries. The resulting data structure forms a tree called a _blockstore_. When the validator interprets the blockstore, it must maintain state for each fork in the chain. We call each instance an _active fork_. It is the responsibility of a validator to weigh those forks, such that it may eventually select a fork.
A validator selects a fork by submitting a vote to a slot leader on that fork. The vote commits the validator for a duration of time called a _lockout period_. The validator is not permitted to vote on a different fork until that lockout period expires. Each subsequent vote on the same fork doubles the length of the lockout period. After some cluster-configured number of votes \(currently 32\), the length of the lockout period reaches what's called _max lockout_. Until the max lockout is reached, the validator has the option to wait until the lockout period is over and then vote on another fork. When it votes on another fork, it performs an operation called _rollback_, whereby the state rolls back in time to a shared checkpoint and then jumps forward to the tip of the fork that it just voted on. The maximum distance that a fork may roll back is called the _rollback depth_. Rollback depth is the number of votes required to achieve max lockout. Whenever a validator votes, any checkpoints beyond the rollback depth become unreachable. That is, there is no scenario in which the validator will need to roll back beyond rollback depth. It therefore may safely _prune_ unreachable forks and _squash_ all checkpoints beyond rollback depth into the root checkpoint.
## Active Forks
An active fork is a sequence of checkpoints that has a length at least one longer than the rollback depth. The shortest fork will have a length exactly one longer than the rollback depth. For example:
![Forks](../.gitbook/assets/forks.svg)
The following sequences are _active forks_:
* {4, 2, 1}
* {5, 2, 1}
* {6, 3, 1}
* {7, 3, 1}
## Pruning and Squashing
A validator may vote on any checkpoint in the tree. In the diagram above, that's every node except the leaves of the tree. After voting, the validator prunes nodes that fork from a distance farther than the rollback depth and then takes the opportunity to minimize its memory usage by squashing any nodes it can into the root.
Starting from the example above, with a rollback depth of 2, consider a vote on 5 versus a vote on 6. First, a vote on 5:
![Forks after pruning](../.gitbook/assets/forks-pruned.svg)
The new root is 2, and any active forks that are not descendants from 2 are pruned.
Alternatively, a vote on 6:
![Forks](../.gitbook/assets/forks-pruned2.svg)
The tree remains with a root of 1, since the active fork starting at 6 is only 2 checkpoints from the root.


@@ -0,0 +1,24 @@
# Performance Metrics
Solana cluster performance is measured as the average number of transactions per second that the network can sustain \(TPS\), and how long it takes for a transaction to be confirmed by a supermajority of the cluster \(Confirmation Time\).
Each cluster node maintains various counters that are incremented on certain events. These counters are periodically uploaded to a cloud-based database. Solana's metrics dashboard fetches these counters, computes the performance metrics, and displays them on the dashboard.
## TPS
Each node's bank runtime maintains a count of transactions that it has processed. The dashboard first calculates the median count of transactions across all metrics-enabled nodes in the cluster. The median cluster transaction count is then averaged over a 2 second period and displayed in the TPS time series graph. The dashboard also shows the Mean TPS, Max TPS and Total Transaction Count stats, which are all calculated from the median transaction count.
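A sketch of that derivation, assuming one transaction-count sample per node per second:

```rust
// Median across metrics-enabled nodes, then a mean over the 2-second window.
fn median(mut counts: Vec<u64>) -> u64 {
    counts.sort_unstable();
    counts[counts.len() / 2]
}

fn main() {
    // Two 1-second intervals of per-node transaction counts.
    let samples = [vec![900, 1_000, 1_100], vec![1_150, 1_250, 1_350]];
    let tps = samples.iter().map(|s| median(s.clone())).sum::<u64>() / samples.len() as u64;
    println!("TPS over the window: {}", tps); // (1000 + 1250) / 2 = 1125
}
```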
## Confirmation Time
Each validator node maintains a list of active ledger forks that are visible to the node. A fork is considered to be frozen when the node has received and processed all entries corresponding to the fork. A fork is considered to be confirmed when it receives a cumulative supermajority vote, and when one of its children forks is frozen.
The node assigns a timestamp to every new fork, and computes the time it took to confirm the fork. This time is reflected as validator confirmation time in performance metrics. The performance dashboard displays the average of each validator node's confirmation time as a time series graph.
## Hardware setup
The validator software is deployed to GCP n1-standard-16 instances with a 1TB pd-ssd disk and 2x Nvidia V100 GPUs. These are deployed in the us-west-1 region.
solana-bench-tps is started after the network converges, from a client machine on an n1-standard-16 CPU-only instance, with the following arguments: `--tx_count=50000 --thread-batch-sleep 1000`
TPS and confirmation metrics are captured from the dashboard numbers, averaged over 5 minutes beginning when the bench-tps transfer stage begins.


@@ -0,0 +1,197 @@
# Stake Delegation and Rewards
Stakers are rewarded for helping to validate the ledger. They do this by delegating their stake to validator nodes. Those validators do the legwork of replaying the ledger and send votes to a per-node vote account to which stakers can delegate their stakes. The rest of the cluster uses those stake-weighted votes to select a block when forks arise. Both the validator and staker need some economic incentive to play their part. The validator needs to be compensated for its hardware and the staker needs to be compensated for the risk of getting its stake slashed. The economics are covered in [staking rewards](../implemented-proposals/staking-rewards.md). This section, on the other hand, describes the underlying mechanics of its implementation.
## Basic Design
The general idea is that the validator owns a Vote account. The Vote account tracks validator votes, counts validator generated credits, and provides any additional validator specific state. The Vote account is not aware of any stakes delegated to it and has no staking weight.
A separate Stake account \(created by a staker\) names a Vote account to which the stake is delegated. Rewards generated are proportional to the amount of lamports staked. The Stake account is owned by the staker only. Some portion of the lamports stored in this account are the stake.
## Passive Delegation
Any number of Stake accounts can delegate to a single Vote account without an interactive action from the identity controlling the Vote account or submitting votes to the account.
The total stake allocated to a Vote account can be calculated by the sum of all the Stake accounts that have the Vote account pubkey as the `StakeState::Stake::voter_pubkey`.
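A sketch of that sum \(string pubkeys stand in for real addresses\):

```rust
use std::collections::HashMap;

// A vote account's total stake is the sum over every stake account that
// names its pubkey as `voter_pubkey`; the vote account itself does nothing.
struct StakeAccount {
    voter_pubkey: String, // stand-in for a real pubkey
    stake: u64,           // delegated lamports
}

fn total_delegated(stakes: &[StakeAccount]) -> HashMap<&str, u64> {
    let mut totals = HashMap::new();
    for account in stakes {
        *totals.entry(account.voter_pubkey.as_str()).or_insert(0) += account.stake;
    }
    totals
}

fn main() {
    let stakes = [
        StakeAccount { voter_pubkey: "vote-a".into(), stake: 1_000 },
        StakeAccount { voter_pubkey: "vote-a".into(), stake: 500 },
        StakeAccount { voter_pubkey: "vote-b".into(), stake: 2_000 },
    ];
    println!("{:?}", total_delegated(&stakes)); // {"vote-a": 1500, "vote-b": 2000}
}
```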
## Vote and Stake accounts
The rewards process is split into two on-chain programs. The Vote program solves the problem of making stakes slashable. The Stake program acts as custodian of the rewards pool and provides for passive delegation. The Stake program is responsible for paying rewards to staker and voter when shown that a staker's delegate has participated in validating the ledger.
### VoteState
VoteState is the current state of all the votes the validator has submitted to the network. VoteState contains the following state information:
* `votes` - The submitted votes data structure.
* `credits` - The total number of rewards this vote program has generated over its lifetime.
* `root_slot` - The last slot to reach the full lockout commitment necessary for rewards.
* `commission` - The commission taken by this VoteState for any rewards claimed by staker's Stake accounts. This is the percentage ceiling of the reward.
* Account::lamports - The accumulated lamports from the commission. These do not count as stakes.
* `authorized_voter` - Only this identity is authorized to submit votes. This field can only be modified by this identity.
* `node_pubkey` - The Solana node that votes in this account.
* `authorized_withdrawer` - the identity of the entity in charge of the lamports of this account, separate from the account's address and the authorized vote signer
### VoteInstruction::Initialize\(VoteInit\)
* `account[0]` - RW - The VoteState
`VoteInit` carries the new vote account's `node_pubkey`, `authorized_voter`, `authorized_withdrawer`, and `commission`; other VoteState members are defaulted.
### VoteInstruction::Authorize\(Pubkey, VoteAuthorize\)
Updates the account with a new authorized voter or withdrawer, according to the VoteAuthorize parameter \(`Voter` or `Withdrawer`\). The transaction must be signed by the Vote account's current `authorized_voter` or `authorized_withdrawer`.
* `account[0]` - RW - The VoteState
`VoteState::authorized_voter` or `authorized_withdrawer` is set to `Pubkey`.
### VoteInstruction::Vote\(Vote\)
* `account[0]` - RW - The VoteState
`VoteState::lockouts` and `VoteState::credits` are updated according to the voting lockout rules; see [Tower BFT](../implemented-proposals/tower-bft.md).
* `account[1]` - RO - `sysvar::slot_hashes` A list of some N most recent slots and their hashes for the vote to be verified against.
* `account[2]` - RO - `sysvar::clock` The current network time, expressed in slots, epochs.
### StakeState
A StakeState takes one of four forms: StakeState::Uninitialized, StakeState::Initialized, StakeState::Stake, and StakeState::RewardsPool. Only the first three forms are used in staking, but only StakeState::Stake is interesting. All RewardsPools are created at genesis.
### StakeState::Stake
StakeState::Stake is the current delegation preference of the **staker** and contains the following state information:
* Account::lamports - The lamports available for staking.
* `stake` - the staked amount \(subject to warm up and cool down\) for generating rewards, always less than or equal to Account::lamports
* `voter_pubkey` - The pubkey of the VoteState instance the lamports are delegated to.
* `credits_observed` - The total credits claimed over the lifetime of the program.
* `activated` - the epoch at which this stake was activated/delegated. The full stake will be counted after warm up.
* `deactivated` - the epoch at which this stake was de-activated, some cool down epochs are required before the account is fully deactivated, and the stake available for withdrawal
* `authorized_staker` - the pubkey of the entity that must sign delegation, activation, and deactivation transactions
* `authorized_withdrawer` - the identity of the entity in charge of the lamports of this account, separate from the account's address, and the authorized staker
### StakeState::RewardsPool
To avoid a single network-wide lock or contention in redemption, 256 RewardsPools are part of genesis under pre-determined keys, each with std::u64::MAX credits to be able to satisfy redemptions according to point value.
The Stakes and the RewardsPool are accounts that are owned by the same `Stake` program.
### StakeInstruction::DelegateStake
The Stake account is moved from Initialized to StakeState::Stake form, or from a deactivated (i.e. fully cooled-down) StakeState::Stake to activated StakeState::Stake. This is how stakers choose the vote account and validator node to which their stake account lamports are delegated. The transaction must be signed by the stake's `authorized_staker`.
* `account[0]` - RW - The StakeState::Stake instance. `StakeState::Stake::credits_observed` is initialized to `VoteState::credits`, `StakeState::Stake::voter_pubkey` is initialized to `account[1]`. If this is the initial delegation of stake, `StakeState::Stake::stake` is initialized to the account's balance in lamports, `StakeState::Stake::activated` is initialized to the current Bank epoch, and `StakeState::Stake::deactivated` is initialized to std::u64::MAX
* `account[1]` - R - The VoteState instance.
* `account[2]` - R - sysvar::clock account, carries information about current Bank epoch
* `account[3]` - R - sysvar::stakehistory account, carries information about stake history
* `account[4]` - R - stake::Config account, carries warmup, cooldown, and slashing configuration
### StakeInstruction::Authorize\(Pubkey, StakeAuthorize\)
Updates the account with a new authorized staker or withdrawer, according to the StakeAuthorize parameter \(`Staker` or `Withdrawer`\). The transaction must be signed by the Stake account's current `authorized_staker` or `authorized_withdrawer`. Any stake lock-up must have expired, or the lock-up custodian must also sign the transaction.
* `account[0]` - RW - The StakeState
`StakeState::authorized_staker` or `authorized_withdrawer` is set to `Pubkey`.
### StakeInstruction::Deactivate
A staker may wish to withdraw from the network. To do so, the staker must first deactivate the stake and wait for cool down.
The transaction must be signed by the stake's `authorized_staker`.
* `account[0]` - RW - The StakeState::Stake instance that is deactivating.
* `account[1]` - R - sysvar::clock account from the Bank that carries current epoch
StakeState::Stake::deactivated is set to the current epoch + cool down. The account's stake will ramp down to zero by that epoch, and Account::lamports will be available for withdrawal.
### StakeInstruction::Withdraw\(u64\)
Lamports build up over time in a Stake account and any excess over activated stake can be withdrawn. The transaction must be signed by the stake's `authorized_withdrawer`.
* `account[0]` - RW - The StakeState::Stake from which to withdraw.
* `account[1]` - RW - Account that should be credited with the withdrawn lamports.
* `account[2]` - R - sysvar::clock account from the Bank that carries current epoch, to calculate stake.
* `account[3]` - R - sysvar::stake\_history account from the Bank that carries stake warmup/cooldown history
## Benefits of the design
* Single vote for all the stakers.
* Clearing of the credit variable is not necessary for claiming rewards.
* Each delegated stake can claim its rewards independently.
* Commission for the work is deposited when a reward is claimed by the delegated stake.
## Example Callflow
![Passive Staking Callflow](../.gitbook/assets/passive-staking-callflow.svg)
## Staking Rewards
The specific mechanics and rules of the validator rewards regime are outlined here. Rewards are earned by delegating stake to a validator that is voting correctly. Voting incorrectly exposes that validator's stakes to [slashing](../proposals/slashing.md).
### Basics
The network pays rewards from a portion of network [inflation](../terminology.md#inflation). The number of lamports available to pay rewards for an epoch is fixed and must be evenly divided among all staked nodes according to their relative stake weight and participation. The weighting unit is called a [point](../terminology.md#point).
Rewards for an epoch are not available until the end of that epoch.
At the end of each epoch, the total number of points earned during the epoch is summed and used to divide the rewards portion of epoch inflation to arrive at a point value. This value is recorded in the bank in a [sysvar](../terminology.md#sysvar) that maps epochs to point values.
During redemption, the stake program counts the points earned by the stake for each epoch, multiplies that by the epoch's point value, and transfers lamports in that amount from a rewards account into the stake and vote accounts according to the vote account's commission setting.
### Economics
Point value for an epoch depends on aggregate network participation. If participation in an epoch drops off, point values are higher for those that do participate.
### Earning credits
Validators earn one vote credit for every correct vote that exceeds maximum lockout, i.e. every time the validator's vote account retires a slot from its lockout list, making that vote a root for the node.
Stakers who have delegated to that validator earn points in proportion to their stake. Points earned is the product of vote credits and stake.
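A sketch of the redemption arithmetic described above; the point value and commission here are assumed figures, not protocol constants:

```rust
// Points are vote credits times stake; the lamports owed are points times
// the epoch's point value, split by the vote account's commission.
fn redeem(credits: u64, stake: u64, point_value: f64, commission_pct: u64) -> (u64, u64) {
    let points = credits * stake;
    let reward = (points as f64 * point_value) as u64;
    let to_vote_account = reward * commission_pct / 100;
    let to_stake_account = reward - to_vote_account;
    (to_vote_account, to_stake_account)
}

fn main() {
    // 10 newly observed credits on a 5,000-lamport stake, at an assumed
    // point value of 0.001 lamports per point and a 10% commission:
    // 50,000 points * 0.001 = 50 lamports; 5 to the voter, 45 to the staker.
    let (voter, staker) = redeem(10, 5_000, 0.001, 10);
    println!("voter: {} lamports, staker: {} lamports", voter, staker);
}
```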
### Stake warmup, cooldown, withdrawal
Stakes, once delegated, do not become effective immediately. They must first pass through a warm up period. During this period some portion of the stake is considered "effective", the rest is considered "activating". Changes occur on epoch boundaries.
The stake program limits the rate of change to total network stake, reflected in the stake program's `config::warmup_rate` \(set to 25% per epoch in the current implementation\).
The amount of stake that can be warmed up each epoch is a function of the previous epoch's total effective stake, total activating stake, and the stake program's configured warmup rate.
Cooldown works the same way. Once a stake is deactivated, some part of it is considered "effective", and also "deactivating". As the stake cools down, it continues to earn rewards and be exposed to slashing, but it also becomes available for withdrawal.
Bootstrap stakes are not subject to warmup.
Rewards are paid against the "effective" portion of the stake for that epoch.
#### Warmup example
Consider the situation of a single stake of 1,000 activated at epoch N, with network warmup rate of 20%, and a quiescent total network stake at epoch N of 2,000.
At epoch N+1, the amount available to be activated for the network is 400 \(20% of 2,000\), and at epoch N, this example stake is the only stake activating, and so is entitled to all of the warmup room available.
| epoch | effective | activating | total effective | total activating |
| :--- | ---: | ---: | ---: | ---: |
| N-1 | | | 2,000 | 0 |
| N | 0 | 1,000 | 2,000 | 1,000 |
| N+1 | 400 | 600 | 2,400 | 600 |
| N+2 | 880 | 120 | 2,880 | 120 |
| N+3 | 1000 | 0 | 3,000 | 0 |
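The table can be reproduced with the following sketch, which applies the warmup-room rule one epoch at a time:

```rust
// Each epoch, the stake may activate at most `warmup_rate` of the previous
// epoch's total effective stake; the remainder stays "activating".
fn main() {
    let warmup_rate = 0.20; // this example's assumed network warmup rate
    let quiescent = 2_000.0; // total effective network stake before epoch N
    let mut effective = 0.0;
    let mut activating = 1_000.0; // our stake, delegated at epoch N

    for epoch in 0..4 {
        println!(
            "N+{}: effective {:>4.0}, activating {:>4.0}, total effective {:>4.0}",
            epoch, effective, activating, quiescent + effective
        );
        let room = warmup_rate * (quiescent + effective);
        let newly_effective = activating.min(room);
        effective += newly_effective;
        activating -= newly_effective;
    }
}
```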
Were 2 stakes \(X and Y\) to activate at epoch N, they would be awarded a portion of the 20% in proportion to their stakes. At each epoch, the effective and activating amounts for each stake are a function of the previous epoch's state.
| epoch | X eff | X act | Y eff | Y act | total effective | total activating |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| N-1 | | | | | 2,000 | 0 |
| N | 0 | 1,000 | 0 | 200 | 2,000 | 1,200 |
| N+1 | 333 | 667 | 67 | 133 | 2,400 | 800 |
| N+2 | 733 | 267 | 146 | 54 | 2,880 | 321 |
| N+3 | 1000 | 0 | 200 | 0 | 3,200 | 0 |
### Withdrawal
Only lamports in excess of effective+activating stake may be withdrawn at any time. This means that during warmup, effectively no stake can be withdrawn. During cooldown, any tokens in excess of effective stake may be withdrawn \(activating == 0\). Because earned rewards are automatically added to stake, withdrawal is generally only possible after deactivation.
### Lock-up
Stake accounts support the notion of lock-up, wherein the stake account balance is unavailable for withdrawal until a specified time. Lock-up is specified as an epoch height, i.e. the minimum epoch height that must be reached by the network before the stake account balance is available for withdrawal, unless the transaction is also signed by a specified custodian. This information is gathered when the stake account is created, and stored in the Lockup field of the stake account's state. Changing the authorized staker or withdrawer is also subject to lock-up, as such an operation is effectively a transfer.


@@ -0,0 +1,27 @@
# Synchronization
Fast, reliable synchronization is the biggest reason Solana is able to achieve such high throughput. Traditional blockchains synchronize on large chunks of transactions called blocks. By synchronizing on blocks, a transaction cannot be processed until a duration called "block time" has passed. In Proof of Work consensus, these block times need to be very large \(~10 minutes\) to minimize the odds of multiple validators producing a new valid block at the same time. There's no such constraint in Proof of Stake consensus, but without reliable timestamps, a validator cannot determine the order of incoming blocks. The popular workaround is to tag each block with a [wallclock timestamp](https://en.bitcoin.it/wiki/Block_timestamp). Because of clock drift and variance in network latencies, the timestamp is only accurate within an hour or two. To work around the workaround, these systems lengthen block times to provide reasonable certainty that the median timestamp on each block is always increasing.
Solana takes a very different approach, which it calls _Proof of History_ or _PoH_. Leader nodes "timestamp" blocks with cryptographic proofs that some duration of time has passed since the last proof. All data hashed into the proof most certainly occurred before the proof was generated. The node then shares the new block with validator nodes, which are able to verify those proofs. The blocks can arrive at validators in any order, or even be replayed years later. With such reliable synchronization guarantees, Solana is able to break blocks into smaller batches of transactions called _entries_. Entries are streamed to validators in realtime, before any notion of block consensus.
Solana technically never sends a _block_, but uses the term to describe the sequence of entries that validators vote on to achieve _confirmation_. In that way, Solana's confirmation times can be compared apples to apples to block-based systems. The current implementation sets block time to 800ms.
What's happening under the hood is that entries are streamed to validators as quickly as a leader node can batch a set of valid transactions into an entry. Validators process those entries long before it is time to vote on their validity. By processing the transactions optimistically, there is effectively no delay between the time the last entry is received and the time when the node can vote. In the event consensus is **not** achieved, a node simply rolls back its state. This optimistic processing technique was introduced in 1981 and called [Optimistic Concurrency Control](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.65.4735). It can be applied to blockchain architecture where a cluster votes on a hash that represents the full ledger up to some _block height_. In Solana, it is implemented trivially using the last entry's PoH hash.
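A minimal sketch of the PoH hash chain itself, using the `sha2` crate; `tick` and `record` are illustrative names:

```rust
use sha2::{Digest, Sha256};

// A tick extends the chain with a hash-only step, proving time passed.
fn tick(hash: [u8; 32]) -> [u8; 32] {
    Sha256::digest(&hash).into()
}

// A record mixes observed data into the chain, proving the data existed
// before every subsequent hash.
fn record(hash: [u8; 32], data: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(&hash);
    hasher.update(data);
    hasher.finalize().into()
}

fn main() {
    let mut hash = [0u8; 32]; // stand-in for the genesis hash
    for _ in 0..3 {
        hash = tick(hash); // passage of time, no state changes
    }
    hash = record(hash, b"transaction entry"); // a timestamped entry
    println!("poh hash starts with {:02x?}", &hash[..4]);
}
```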
## Relationship to VDFs
The Proof of History technique was first described for use in blockchain by Solana in November of 2017. In June of the following year, a similar technique was described at Stanford and called a [verifiable delay function](https://eprint.iacr.org/2018/601.pdf) or _VDF_.
A desirable property of a VDF is that verification time is very fast. Solana's verification of its delay function takes time proportional to the time it took to create it. Split over a 4000 core GPU, it is sufficiently fast for Solana's needs, but if you asked the authors of the paper cited above, they might tell you \([and have](https://github.com/solana-labs/solana/issues/388)\) that Solana's approach is algorithmically slow and it shouldn't be called a VDF. We argue the term VDF should represent the category of verifiable delay functions and not just the subset with certain performance characteristics. Until that's resolved, Solana will likely continue using the term PoH for its application-specific VDF.
Another difference between PoH and VDFs is that a VDF is used only for tracking duration. PoH's hash chain, on the other hand, includes hashes of any data the application observed. That data is a double-edged sword. On one side, the data "proves history" - the data most certainly existed before the hashes that follow it. On the other side, it means the application can manipulate the hash chain by changing _when_ the data is hashed. The PoH chain therefore does not serve as a good source of randomness, whereas a VDF without that data could. Solana's [leader rotation algorithm](synchronization.md#leader-rotation), for example, is derived only from the VDF _height_ and not its hash at that height.
## Relationship to Consensus Mechanisms
Proof of History is not a consensus mechanism, but it is used to improve the performance of Solana's Proof of Stake consensus. It is also used to improve the performance of the data plane and replication protocols.
## More on Proof of History
* [water clock analogy](https://medium.com/solana-labs/proof-of-history-explained-by-a-water-clock-e682183417b8)
* [Proof of History overview](https://medium.com/solana-labs/proof-of-history-a-clock-for-blockchain-cf47a61a9274)


@@ -0,0 +1,96 @@
# Turbine Block Propagation
A Solana cluster uses a multi-layer block propagation mechanism called _Turbine_ to broadcast transaction shreds to all nodes with a minimal amount of duplicate messages. The cluster divides itself into small collections of nodes, called _neighborhoods_. Each node is responsible for sharing any data it receives with the other nodes in its neighborhood, as well as propagating the data on to a small set of nodes in other neighborhoods. This way each node only has to communicate with a small number of nodes.
During its slot, the leader node distributes shreds between the validator nodes in the first neighborhood \(layer 0\). Each validator shares its data within its neighborhood, but also retransmits the shreds to one node in some neighborhoods in the next layer \(layer 1\). The layer-1 nodes each share their data with their neighborhood peers, and retransmit to nodes in the next layer, and so on, until all nodes in the cluster have received all the shreds.
## Neighborhood Assignment - Weighted Selection
In order for data plane fanout to work, the entire cluster must agree on how the cluster is divided into neighborhoods. To achieve this, all the recognized validator nodes \(the TVU peers\) are sorted by stake and stored in a list. This list is then indexed in different ways to figure out neighborhood boundaries and retransmit peers. For example, the leader will simply select the first nodes to make up layer 0. These will automatically be the highest stake holders, allowing the heaviest votes to come back to the leader first. Layer-0 and lower-layer nodes use the same logic to find their neighbors and next layer peers.
To reduce the possibility of attack vectors, each shred is transmitted over a random tree of neighborhoods. Each node uses the same set of nodes representing the cluster. A random tree is generated from the set for each shred using a seed derived from the leader id, slot and shred index.
## Layer and Neighborhood Structure
The current leader makes its initial broadcasts to at most `DATA_PLANE_FANOUT` nodes. If this layer 0 is smaller than the number of nodes in the cluster, then the data plane fanout mechanism adds layers below. Subsequent layers follow these constraints to determine layer-capacity: Each neighborhood contains `DATA_PLANE_FANOUT` nodes. Layer-0 starts with 1 neighborhood with fanout nodes. The number of nodes in each additional layer grows by a factor of fanout.
As mentioned above, each node in a layer only has to broadcast its shreds to its neighbors and to exactly 1 node in some next-layer neighborhoods, instead of to every TVU peer in the cluster. A good way to think about this: layer-0 starts with 1 neighborhood of fanout nodes, layer-1 adds `fanout` neighborhoods, each with fanout nodes, and layer-2 will have `fanout * number of nodes in layer-1`, and so on.
This way each node only has to communicate with a maximum of `2 * DATA_PLANE_FANOUT - 1` nodes.
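A sketch of that arithmetic; `layers_needed` counts how many layers cover a cluster of a given size under the constraints above:

```rust
// Layer 0 is one neighborhood of `fanout` nodes; each later layer grows by
// a factor of `fanout`. A node talks to at most `fanout - 1` neighbors plus
// `fanout` next-layer nodes: `2 * fanout - 1` peers in total.
fn layers_needed(cluster_size: usize, fanout: usize) -> usize {
    let mut layers = 1;
    let mut layer_size = fanout; // layer 0
    let mut covered = fanout;
    while covered < cluster_size {
        layer_size *= fanout;
        covered += layer_size;
        layers += 1;
    }
    layers
}

fn main() {
    let fanout = 200; // stand-in value for DATA_PLANE_FANOUT
    for cluster_size in [150, 10_000, 200_000] {
        println!(
            "{} nodes -> {} layers, at most {} peer connections per node",
            cluster_size,
            layers_needed(cluster_size, fanout),
            2 * fanout - 1
        );
    }
}
```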
The following diagram shows how the Leader sends shreds with a Fanout of 2 to Neighborhood 0 in Layer 0 and how the nodes in Neighborhood 0 share their data with each other.
![Leader sends shreds to Neighborhood 0 in Layer 0](../.gitbook/assets/data-plane-seeding.svg)
The following diagram shows how Neighborhood 0 fans out to Neighborhoods 1 and 2.
![Neighborhood 0 Fanout to Neighborhood 1 and 2](../.gitbook/assets/data-plane-fanout.svg)
Finally, the following diagram shows a two-layer cluster with a Fanout of 2.
![Two layer cluster with a Fanout of 2](../.gitbook/assets/data-plane.svg)
### Configuration Values
`DATA_PLANE_FANOUT` - Determines the size of layer 0. Subsequent layers grow by a factor of `DATA_PLANE_FANOUT`. The number of nodes in a neighborhood is equal to the fanout value. Neighborhoods will fill to capacity before new ones are added, i.e., if a neighborhood isn't full, it _must_ be the last one.
Currently, configuration is set when the cluster is launched. In the future, these parameters may be hosted on-chain, allowing modification on the fly as the cluster sizes change.
## Calculating the required FEC rate
Turbine relies on retransmission of packets between validators. Due to retransmission, any network-wide packet loss is compounded, and the probability of a packet failing to reach its destination increases with each hop. The FEC rate needs to take into account the network-wide packet loss and the propagation depth.
A shred group is the set of data and coding packets that can be used to reconstruct each other. Each shred group has a chance of failure, based on the likelihood that the number of lost packets exceeds the FEC rate's redundancy. If a validator fails to reconstruct the shred group, then the block cannot be reconstructed, and the validator has to rely on repair to fix up the block.
The probability of the shred group failing can be computed using the binomial distribution. If the FEC rate is `16:4`, then the group size is 20, and more than 4 of the shreds must fail for the group to fail. This is equal to the sum of the probabilities of 5 or more of the 20 trials failing.
Probability of a block succeeding in turbine:
* Probability of packet failure: `P = 1 - (1 - network_packet_loss_rate)^2` \(loss compounded over two hops\)
* FEC rate: `K:M`
* Number of trials: `N = K + M`
* Shred group failure rate \(the probability of more than `M` failures\): `S = 1 - SUM of i=0 -> M for binomial(prob_failure = P, trials = N, failures = i)`
* Shreds per block: `G`
* Block success rate: `B = (1 - S) ^ (G / N) `
* Binomial distribution for exactly `i` results with probability of P in N trials is defined as `(N choose i) * P^i * (1 - P)^(N-i)`
For example:
* Network packet loss rate is 15%.
* A 50k TPS network generates 6,400 shreds per second.
* The FEC rate increases the total shreds per block by the FEC ratio `(K + M) / K`.
With an FEC rate of `16:4`:
* `G = 8000`
* `P = 1 - 0.85 * 0.85 = 1 - 0.7225 = 0.2775`
* `S = 1 - SUM of i=0 -> 4 for binomial(prob_failure = 0.2775, trials = 20, failures = i) = 0.689414`
* `B = (1 - 0.689) ^ (8000 / 20) = 10^-203`
With an FEC rate of `16:16`:
* `G = 12800`
* `S = 1 - SUM of i=0 -> 16 for binomial(prob_failure = 0.2775, trials = 32, failures = i) = 0.002132`
* `B = (1 - 0.002132) ^ (12800 / 32) = 0.42583`
With an FEC rate of `32:32`:
* `G = 12800`
* `S = 1 - SUM of i=0 -> 32 for binomial(prob_failure = 0.2775, trials = 64, failures = i) = 0.000048`
* `B = (1 - 0.000048) ^ (12800 / 64) = 0.99045`
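These figures can be reproduced with a few lines of code. The sketch below evaluates the binomial tail with a numerically stable iterative pmf; the function names are ours, and it assumes, as above, that loss compounds over two hops.

```rust
/// Probability that more than `max_failures` of `trials` packets are lost,
/// with per-packet loss probability `p` (binomial tail).
fn shred_group_failure(p: f64, trials: u32, max_failures: u32) -> f64 {
    // pmf(0) = (1-p)^n; pmf(i+1) = pmf(i) * (n-i)/(i+1) * p/(1-p)
    let mut pmf = (1.0 - p).powi(trials as i32);
    let mut cdf = pmf;
    for i in 0..max_failures {
        pmf *= (trials - i) as f64 / (i + 1) as f64 * p / (1.0 - p);
        cdf += pmf;
    }
    1.0 - cdf
}

/// Block success rate B = (1 - S)^(G / N) for an FEC rate of K:M.
fn block_success(p: f64, k: u32, m: u32, shreds_per_block: u32) -> f64 {
    let n = k + m;
    let s = shred_group_failure(p, n, m);
    (1.0 - s).powf(shreds_per_block as f64 / n as f64)
}

fn main() {
    // 15% network loss compounded over two hops: P = 1 - 0.85^2 = 0.2775
    let p = 1.0 - 0.85f64 * 0.85;
    // FEC 16:4: S ~= 0.689414, so an 8000-shred block almost surely fails.
    println!("S(16:4)  = {:.6}", shred_group_failure(p, 20, 4));
    // FEC 32:32: S ~= 0.000048, so B ~= 0.99 for a 12800-shred block.
    println!("B(32:32) = {:.5}", block_success(p, 32, 32, 12_800));
}
```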
## Neighborhoods
The following diagram shows how two neighborhoods in different layers interact. To cripple a neighborhood, enough nodes \(one more than the erasure coding can tolerate\) in the neighborhood above need to fail. Since each neighborhood receives shreds from multiple nodes in a neighborhood in the upper layer, we'd need a big network failure in the upper layers to end up with incomplete data.
![Inner workings of a neighborhood](../.gitbook/assets/data-plane-neighborhood.svg)
# Secure Vote Signing
A validator receives entries from the current leader and submits votes confirming those entries are valid. This vote submission presents a security challenge, because forged votes that violate consensus rules could be used to slash the validator's stake.
The validator votes on its chosen fork by submitting a transaction that uses an asymmetric key to sign the result of its validation work. Other entities can verify this signature using the validator's public key. If the validator's key is used to sign incorrect data \(e.g. votes on multiple forks of the ledger\), the node's stake or its resources could be compromised.
Solana addresses this risk by splitting off a separate _vote signer_ service that evaluates each vote to ensure it does not violate a slashing condition.
## Validators, Vote Signers, and Stakeholders
When a validator receives multiple blocks for the same slot, it tracks all possible forks until it can determine a "best" one. A validator selects the best fork by submitting a vote for it, using a vote signer to minimize the possibility of its vote inadvertently violating a consensus rule and getting its stake slashed.
A vote signer evaluates the vote proposed by the validator and signs the vote only if it does not violate a slashing condition. A vote signer only needs to maintain minimal state regarding the votes it signed and the votes signed by the rest of the cluster. It doesn't need to process a full set of transactions.
A stakeholder is an identity that has control of the staked capital. The stakeholder can delegate its stake to the vote signer. Once a stake is delegated, the vote signer's votes represent the voting weight of all the delegated stakes, and produce rewards for all the delegated stakes.
Currently, there is a 1:1 relationship between validators and vote signers, and stakeholders delegate their entire stake to a single vote signer.
## Signing service
The vote signing service consists of a JSON RPC server and a request processor. At startup, the service starts the RPC server at a configured port and waits for validator requests. It expects the following types of requests:
1. Register a new validator node
   * The request must contain the validator's identity \(public key\)
   * The request must be signed with the validator's private key
   * The service drops the request if the request's signature cannot be verified
   * The service creates a new voting asymmetric key for the validator, and returns the public key as a response
   * If a validator tries to register again, the service returns the public key from the pre-existing keypair
2. Sign a vote
   * The request must contain a voting transaction and all verification data
   * The request must be signed with the validator's private key
   * The service drops the request if the request's signature cannot be verified
   * The service verifies the voting data
   * The service returns a signature for the transaction
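For illustration, the two request types might be modeled by wire structures like the following. Every name here \(`RegisterRequest`, `SignVoteRequest`, the field names\) is hypothetical, mirroring the points above rather than the service's actual JSON RPC schema.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical payload for "register a new validator node". The request
/// is signed with the validator's identity key; the service drops it if
/// the signature does not verify.
#[derive(Serialize, Deserialize)]
struct RegisterRequest {
    validator_pubkey: [u8; 32],
    signature: Vec<u8>, // signature over the request by the identity key
}

/// The service responds with the voting public key it created, or the
/// pre-existing one if the validator registered before.
#[derive(Serialize, Deserialize)]
struct RegisterResponse {
    vote_pubkey: [u8; 32],
}

/// Hypothetical payload for "sign a vote".
#[derive(Serialize, Deserialize)]
struct SignVoteRequest {
    validator_pubkey: [u8; 32],
    vote_transaction: Vec<u8>,  // serialized voting transaction
    verification_data: Vec<u8>, // data the signer needs to check slashing rules
    signature: Vec<u8>,         // request signature by the identity key
}

/// On success the service returns a signature for the transaction.
#[derive(Serialize, Deserialize)]
struct SignVoteResponse {
    vote_signature: Vec<u8>,
}
```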
## Validator voting
A validator node, at startup, creates a new vote account and registers it with the cluster by submitting a new "vote register" transaction. The other nodes on the cluster process this transaction and include the new validator in the active set. Subsequently, the validator submits a "new vote" transaction signed with the validator's voting private key on each voting event.
### Configuration
The validator node is configured with the signing service's network endpoint \(IP/Port\).
### Registration
At startup, the validator registers itself with its signing service using JSON RPC. The RPC call returns the voting public key for the validator node. The validator creates a new "vote register" transaction including this public key, and submits it to the cluster.
### Vote Collection
The validator looks up the votes submitted by all the nodes in the cluster for the last voting period. This information is submitted to the signing service with a new vote signing request.
### New Vote Signing
The validator creates a "new vote" transaction and sends it to the signing service using JSON RPC. The RPC request also includes the vote verification data. On success, the RPC call returns the signature for the vote. On failure, the RPC call returns a failure code.
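Putting registration, vote collection, and signing together, the validator side of the protocol could be sketched as follows. The `VoteSigner` trait and its methods are invented for illustration; only the call sequence follows the text above.

```rust
/// Invented client-side view of the signing service; the real service
/// is reached over JSON RPC at the configured IP/port.
trait VoteSigner {
    /// Register this validator; the service returns the voting public key.
    fn register(&self, identity_pubkey: [u8; 32]) -> Result<[u8; 32], String>;
    /// Request a signature for a "new vote" transaction, along with the
    /// verification data the signer needs to check slashing conditions.
    fn sign_vote(&self, vote_tx: &[u8], verification_data: &[u8]) -> Result<Vec<u8>, String>;
}

/// One pass through the voting flow described above.
fn submit_vote(
    signer: &dyn VoteSigner,
    identity_pubkey: [u8; 32],
    collected_votes: &[u8], // votes observed in the cluster last period
) -> Result<(), String> {
    // At startup (done once): learn the voting public key and announce it
    // to the cluster with a "vote register" transaction (elided here).
    let _vote_pubkey = signer.register(identity_pubkey)?;

    // Per voting event: build the "new vote" transaction, have the signer
    // vet and sign it, then submit the signed transaction to the cluster.
    let vote_tx: Vec<u8> = b"serialized new-vote transaction".to_vec(); // placeholder
    let _signature = signer.sign_vote(&vote_tx, collected_votes)?;
    Ok(())
}
```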