From f4b9e93b11c91408617130381c774011f0397d80 Mon Sep 17 00:00:00 2001 From: Greg Fitzgerald Date: Thu, 8 Nov 2018 10:46:13 -0700 Subject: [PATCH] Migrate storage RFC to book --- src/SUMMARY.md | 2 +- src/porep.md | 1 - src/storage.md | 114 +++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 115 insertions(+), 2 deletions(-) delete mode 100644 src/porep.md create mode 100644 src/storage.md diff --git a/src/SUMMARY.md b/src/SUMMARY.md index bd4f35ab0b..0254dabe03 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -15,7 +15,7 @@ - [JsonRpcService](jsonrpc-service.md) - [Avalanche replication](avalanche.md) - - [Proof of replication](porep.md) + - [Storage](storage.md) - [On-chain programs](programs.md) - [The Runtime](runtime.md) diff --git a/src/porep.md b/src/porep.md deleted file mode 100644 index 176b740ebc..0000000000 --- a/src/porep.md +++ /dev/null @@ -1 +0,0 @@ -# Proof of replication diff --git a/src/storage.md b/src/storage.md new file mode 100644 index 0000000000..fa969bf20d --- /dev/null +++ b/src/storage.md @@ -0,0 +1,114 @@ +# Storage + +## Background + +At full capacity on a 1gbps network Solana would generate 4 petabytes of data +per year. If each fullnode was required to store the full ledger, the cost of +storage would discourage fullnode participation, thus centralizing the network +around those that could afford it. Solana aims to keep the cost of a fullnode +below $5,000 USD to maximize participation. To achieve that, the network needs +to minimize redundant storage while at the same time ensuring the validity and +availability of each copy. + +To trust storage of ledger segments, Solana has *replicators* periodically +submit proofs to the network that the data was replicated. Each proof is called +a Proof of Replication. The basic idea of it is to encrypt a dataset with a +public symmetric key and then hash the encrypted dataset. Solana uses [CBC +encryption](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Cipher_Block_Chaining_(CBC)). +To prevent a malicious replicator from deleting the data as soon as it's +hashed, a replicator is required hash random segments of the dataset. +Alternatively, Solana could require hashing the reverse of the encrypted data, +but random sampling is sufficient and much faster. Either solution ensures +that all the data is present during the generation of the proof and also +requires the validator to have the entirety of the encrypted data present for +verification of every proof of every identity. The space required to validate +is: + +``` number_of_proofs * data_size ``` + +# Optimization with PoH + +Solana is not the only distribute systems project using Proof of Replication, +but it might be the most efficient implementation because of its ability to +synchronize nodes with its Proof of History. With PoH, Solana is able to record +a hash of the PoRep samples in the ledger. Thus the blocks stay in the exact +same order for every PoRep and verification can stream the data and verify all +the proofs in a single batch. This way Solana can verify multiple proofs +concurrently, each one on its own GPU core. With the current generation of +graphics cards our network can support up to 14,000 replication identities or +symmetric keys. The total space required for verification is: + +``` 2 CBC_blocks * number_of_identities ``` + +with core count of equal to (Number of Identities). A CBC block is expected to +be 1MB in size. + +# Network + +Validators for PoRep are the same validators that are verifying transactions. +They have some stake that they have put up as collateral that ensures that +their work is honest. If you can prove that a validator verified a fake PoRep, +then the validator's stake is slashed. + +Replicators are specialized light clients. They download a part of the ledger +and store it and provide proofs of storing the ledger. For each verified proof, +replicators are rewarded tokens from the mining pool. + +# Constraints + +Solana's PoRep protocol instroduces the following constraints: + +* At most 14,000 replication identities can be used, because that is how many GPU + cores are currently available to a computer costing under $5,000 USD. +* Verification requires generating the CBC blocks. That requires space of 2 + blocks per identity, and 1 GPU core per identity for the same dataset. As +many identities at once are batched with as many proofs for those identities +verified concurrently for the same dataset. + +# Validation and Replication Protocol + +1. The network sets a replication target number, let's say 1k. 1k PoRep + identities are created from signatures of a PoH hash. They are tied to a +specific PoH hash. It doesn't matter who creates them, or it could simply be +the last 1k validation signatures we saw for the ledger at that count. This is +maybe just the initial batch of identities, because we want to stagger identity +rotation. +2. Any client can use any of these identities to create PoRep proofs. + Replicator identities are the CBC encryption keys. +3. Periodically at a specific PoH count, a replicator that wants to create + PoRep proofs signs the PoH hash at that count. That signature is the seed +used to pick the block and identity to replicate. A block is 1TB of ledger. +4. Periodically at a specific PoH count, a replicator submits PoRep proofs for + their selected block. A signature of the PoH hash at that count is the seed +used to sample the 1TB encrypted block, and hash it. This is done faster than +it takes to encrypt the 1TB block with the original identity. +5. Replicators must submit some number of fake proofs, which they can prove to + be fake by providing the seed for the hash result. +6. Periodically at a specific PoH count, validators sign the hash and use the + signature to select the 1TB block that they need to validate. They batch all +the identities and proofs and submit approval for all the verified ones. +7. After #6, replicator client submit the proofs of fake proofs. + +For any random seed, Solana requires everyone to use a signature that is +derived from a PoH hash. Every node uses the same count so that the same PoH +hash is signed by every participant. The signatures are then each +cryptographically tied to the keypair, which prevents a leader from grinding on +the resulting value for more than 1 identity. + +Key rotation is *staggered*. Once going, the next identity is generated by +hashing itself with a PoH hash. + +Since there are many more client identities then encryption identities, the +reward is split amont multiple clients to prevent Sybil attacks from generating +many clients to acquire the same block of data. To remain BFT, the network +needs to avoid a single human entity from storing all the replications of a +single chunk of the ledger. + +Solana's solution to this is to require clients to continue using the same +identity. If the first round is used to acquire the same block for many client +identities, the second round for the same client identities will require a +redistribution of the signatures, and therefore PoRep identities and blocks. +Thus to get a reward for storage, clients are not rewarded for storage of the +first block. The network rewards long-lived client identities more than new +ones. +