Commit Graph

2273 Commits

Author SHA1 Message Date
b6dff12923 update ledger tool to restore cost table from blockstore (#18489)
* update ledger tool to restore cost model from blockstore when compute-slot-cost

* Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool

* refactor and simplify a test
2021-07-07 23:44:51 -05:00
1e0942e900 Rename ClusterInfo::send_vote to ClusterInfo::send_transaction 2021-07-07 15:51:14 -07:00
a86ced0bac generate deterministic seeds for shreds (#17950)
* generate shred seed from leader pubkey

* clippy

* clippy

* review

* review 2

* fmt

* review

* check

* review

* cleanup

* fmt
2021-07-07 08:21:12 -07:00
a0551b4054 persists repair-peers cache across repair service loops (#18400)
The repair-peers cache is reset each time repair service loop runs,
and so computed repeatedly for the same slots:
https://github.com/solana-labs/solana/blob/d2b07dca9/core/src/repair_service.rs#L275

This commit uses an LRU cache to persists repair-peers for each slot.
In addition to LRU eviction rules, in order to avoid re-using outdated
data, each entry also has 10 seconds TTL.
2021-07-07 14:12:09 +00:00
04787be8b1 encapsulates turbine peers computations of broadcast & retransmit stages (#18238)
Broadcast stage and retransmit stage should arrange nodes on turbine
broadcast tree in exactly same order. Additionally any changes to this
ordering (e.g. updating how unstaked nodes are handled) requires feature
gating to keep the cluster in sync.

Current implementation is scattered out over several public methods and
exposes too much of implementation details (e.g. usize indices into
peers vector) which makes code changes and checking for feature
activations more difficult.

This commit encapsulates turbine peer computations into a new struct,
and only exposes two public methods, get_broadcast_peer and
get_retransmit_peers, for call-sites.
2021-07-07 00:35:25 +00:00
100fabf469 Remove feature switch for demoting sysvar write locks (#18373) 2021-07-06 21:22:22 +00:00
0e039b4094 Aggregate cost_model into cost_tracker (#18374)
* * aggregate cost_model into cost_tracker, decouple it from banking_stage to prevent accidental deadlock. * Simplified code, removed unused functions

* review fixes
2021-07-06 15:41:25 +00:00
d5c2c72360 Rename Tower::lockouts to Tower::vote_state 2021-07-02 18:35:49 -07:00
7cd6224caf log warning when channel send fails (#18391) 2021-07-02 19:04:09 +00:00
0eca92de18 Make set roots an iterator (#18357) 2021-07-01 20:02:40 -07:00
b6792a3328 Add ability to change the validator identity at runtime 2021-07-01 17:50:04 -07:00
45d54b1fc6 Add SnapshotArchiveInfo and refactor functions in snapshot_utils (#18232) 2021-07-01 12:20:56 -05:00
5e424826ba Persist cost table to blockstore (#18123)
* Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
* Add ProgramCosts to compaction excluding list alone side with TransactionStatusIndex in one place: `excludes_from_compaction()`

* Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
* Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
* Only try to persist to blockstore when cost_table is changed.
* Restore cost table during validator startup

* Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
* Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.
2021-07-01 11:32:41 -05:00
89a3e4f91e Move SnapshotConfig into its own module (#18331)
Also move ArchiveFormat to snapshot_utils, and do not
reexport SnapshotVersion.
2021-07-01 08:55:26 -05:00
8d9a6deda4 Add repair number per slot (#18082) 2021-06-30 18:20:07 +02:00
02b14caa5f test-validator: hold rent constant with --slots-per-epoch 2021-06-30 00:46:12 -06:00
68c87469c3 Cleanup ReplayStage tests (#18241) 2021-06-28 20:19:42 -07:00
9d6f1ebef4 investigate system performance test degradation (#17919)
* Add stats and counter around cost model ops, mainly:
- calculate transaction cost
- check transaction can fit in a block
- update block cost tracker after transactions are added to block
- replay_stage to update/insert execution cost to table

* Change mutex on cost_tracker to RwLock

* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.

* acquire and hold locks for block of TXs, instead of acquire and release per transaction;

* remove redundant would_fit check from cost_tracker update execution path

* refactor cost checking with less frequent lock acquiring

* avoid many Transaction_cost heap allocation when calculate cost, which
is in the hot path - executed per transaction.

* create hashmap with new_capacity to reduce runtime heap realloc.

* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default

* address potential deadlock by acquiring locks one at time
2021-06-28 21:34:04 -05:00
5d08bf9aa3 More detailed voting timings in replay stage (#18229) 2021-06-26 17:32:08 +02:00
d269975784 Revert "Clean up build warning"
This reverts commit 17a173ebb5.
2021-06-24 19:57:52 -06:00
314102cb54 Remove redundant JsonRpcConfig::identity_pubkey field 2021-06-22 17:20:11 -07:00
e808f34b0b Add batch stats (#18096) 2021-06-22 15:23:26 +02:00
3b1517237c Clean up argument names 2021-06-21 21:29:52 -07:00
84b9de8c18 Shredder no longer holds a keypair 2021-06-21 21:29:52 -07:00
2435ea3ad8 Remove redundant ReplayStageConfig::my_pubkey field 2021-06-21 21:29:52 -07:00
51a0007001 serve_repair: Remove internal ContactInfo field duplication 2021-06-21 17:23:49 -07:00
598093b5db adds shred-version to ip-echo-server response
When starting a validator, the node initially joins gossip with
shred_verison = 0, until it adopts the entrypoint's shred-version:
https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417

Depending on the load on the entrypoint, this adopting entrypoint
shred-version through gossip sometimes becomes very slow, and causes
several problems in gossip because we have to partially support
shred_version == 0 which is a source of leaking crds values from one
cluster to another. e.g. see
https://github.com/solana-labs/solana/pull/17899
and the other linked issues there.

In order to remove shred_version == 0 from gossip, this commit adds
shred-version to ip-echo-server response. Once the entrypoints are
updated, on validator start-up, if --expected_shred_version is not
specified we will obtain shred-version from the entrypoint using
ip-echo-server.
2021-06-21 19:37:16 +00:00
ec2f930475 user process.accounts_db_test_hash_calculation for debug_verify hash (#18053) 2021-06-21 10:20:27 -05:00
4a12c715a3 Drop Error suffix from enum values to avoid the enum_variant_names clippy lint 2021-06-18 23:02:13 +00:00
789f33e8db chore: cargo fmt 2021-06-18 10:42:46 -07:00
6514096a67 chore: cargo +nightly clippy --fix -Z unstable-options 2021-06-18 10:42:46 -07:00
d0511de9a6 chore: bump trees from 0.2.1 to 0.4.2 (#18052)
* chore: bump trees from 0.2.1 to 0.4.2 (#18041)

Bumps [trees](https://github.com/oooutlk/trees) from 0.2.1 to 0.4.2.
- [Release notes](https://github.com/oooutlk/trees/releases)
- [Commits](https://github.com/oooutlk/trees/commits)

---
updated-dependencies:
- dependency-name: trees
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Accommodate field & type changes

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-06-17 22:45:09 +00:00
071b1ee3e5 Removed pub from some functions which are actually private to improve encapsulation (#18030)
Remove the pub marker to improve encapsulation. Readability improvement only, no functional impact.
2021-06-17 10:14:21 -07:00
fa04531c7a Extricate RpcCompletedSlotsService from RetransmitStage 2021-06-16 16:20:35 -07:00
5bc6c89adc validator: run poh speed test earlier in start up 2021-06-16 21:27:08 +00:00
161838655c removes port-based forwarding logic from turbine retransmit (#17716)
Turbine retransmit logic is based on which socket it received the packet
from (i.e `packet.meta.forward`):
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L467-L470

This can leave the cluster vulnerable to spoofing and selective
propagation of packets; see
https://github.com/solana-labs/solana/issues/6672
https://github.com/solana-labs/solana/pull/7774

This commit identifies if the node is on the "critical path" based on
its index in the shuffled cluster. If so, it forwards the packet to both
neighbors and children; otherwise, the packet is only forwarded to the
children.

The metrics added in
https://github.com/solana-labs/solana/pull/17351
shows that the number of times the index does not match the port is very
rare, and therefore this change should be safe.
2021-06-15 13:19:41 +00:00
ccc013e134 Handle removing slots during account scans (#17471) 2021-06-14 21:04:01 -07:00
eeee75c5be Don't use pinned memory when unnecessary (#17832)
Reports of excessive GPU memory usage and errors
from cudaHostRegister. There are some cases where pinning is
not required.
2021-06-14 16:10:04 +02:00
0feac57cb0 Don't store votes unless we are leader soon (#17803) 2021-06-11 18:29:05 +02:00
c8535be0e1 Port unconfirmed duplicate tracking logic from ProgressMap to ForkChoice (#17779) 2021-06-11 03:09:57 -07:00
afafa624a3 Account for duplicate before a bank is frozen or replayed (#17866) 2021-06-10 22:28:23 -07:00
269d995832 Make account shrink configurable #17544 (#17778)
1. Added both options for measuring space usage using total accounts usage and for individual store shrink ratio using an enum. Validator CLI options: --accounts-shrink-optimize-total-space and --accounts-shrink-ratio
2. Added code for selecting candidates based on total usage in a separate function select_candidates_by_total_usage
3. Added unit tests for the new functions added
4. The default implementations is kept at 0.8 shrink ratio with --accounts-shrink-optimize-total-space set to true

Fixes #17544
2021-06-09 21:21:32 -07:00
ae27fcbcda replay stage feed back program cost (#17731)
* replay stage feeds back realtime per-program execution cost to cost model;

* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;

* changed cost unit to microsecond, using value collected from mainnet;

* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
2021-06-09 17:10:59 -05:00
050bb5446d Add local cluster tests that broadcast duplicate slots (#13995)
* Add duplicate node local cluster test

* fix clippy

* remove dupe test
2021-06-09 15:01:48 -07:00
e5e7390d44 Wrap long lines 2021-06-08 12:05:29 -07:00
544b3c0d17 Create solana-poh and move remaining rpc modules to solana-rpc (#17698)
* Create solana-poh crate

* Move BigTableUploadService to solana-ledger

* Add solana-rpc to workspace

* Move dependencies to solana-rpc

* Move remaining rpc modules to solana-rpc

* Single use statement solana-poh

* Single use statement solana-rpc
2021-06-04 09:23:06 -06:00
f97ce2cd7e Per-program id timings (#17554) 2021-06-04 16:04:31 +02:00
be957f25c9 adds fallback logic if retransmit multicast fails (#17714)
In retransmit-stage, based on the packet.meta.seed and resulting
children/neighbors, each packet is sent to a different set of peers:
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L421-L457

However, current code errors out as soon as a multicast call fails,
which will skip all the remaining packets:
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L467-L470

This can exacerbate packets loss in turbine.

This commit:
  * keeps iterating over retransmit packets for loop even if some
    intermediate sends fail.
  * adds a fallback to UdpSocket::send_to if multicast fails.

Recent discord chat:
https://discord.com/channels/428295358100013066/689412830075551748/849530845052403733
2021-06-04 12:16:37 +00:00
3a647c4bea Rename ValidatorExit and move to sdk (#17728) 2021-06-04 03:06:13 +00:00
96ba2edfeb Switch EpochSlots to be frozen slots, not completed slots (#17168) 2021-06-03 00:20:00 +00:00