2273 Commits

Author SHA1 Message Date
Tao Zhu
b6dff12923
update ledger tool to restore cost table from blockstore (#18489)
* update ledger tool to restore cost model from blockstore when compute-slot-cost

* Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool

* refactor and simplify a test
2021-07-07 23:44:51 -05:00
Michael Vines
1e0942e900 Rename ClusterInfo::send_vote to ClusterInfo::send_transaction 2021-07-07 15:51:14 -07:00
jbiseda
a86ced0bac
generate deterministic seeds for shreds (#17950)
* generate shred seed from leader pubkey

* clippy

* clippy

* review

* review 2

* fmt

* review

* check

* review

* cleanup

* fmt
2021-07-07 08:21:12 -07:00
behzad nouri
a0551b4054
persists repair-peers cache across repair service loops (#18400)
The repair-peers cache is reset each time repair service loop runs,
and so computed repeatedly for the same slots:
https://github.com/solana-labs/solana/blob/d2b07dca9/core/src/repair_service.rs#L275

This commit uses an LRU cache to persists repair-peers for each slot.
In addition to LRU eviction rules, in order to avoid re-using outdated
data, each entry also has 10 seconds TTL.
2021-07-07 14:12:09 +00:00
behzad nouri
04787be8b1
encapsulates turbine peers computations of broadcast & retransmit stages (#18238)
Broadcast stage and retransmit stage should arrange nodes on turbine
broadcast tree in exactly same order. Additionally any changes to this
ordering (e.g. updating how unstaked nodes are handled) requires feature
gating to keep the cluster in sync.

Current implementation is scattered out over several public methods and
exposes too much of implementation details (e.g. usize indices into
peers vector) which makes code changes and checking for feature
activations more difficult.

This commit encapsulates turbine peer computations into a new struct,
and only exposes two public methods, get_broadcast_peer and
get_retransmit_peers, for call-sites.
2021-07-07 00:35:25 +00:00
Justin Starry
100fabf469
Remove feature switch for demoting sysvar write locks (#18373) 2021-07-06 21:22:22 +00:00
Tao Zhu
0e039b4094
Aggregate cost_model into cost_tracker (#18374)
* * aggregate cost_model into cost_tracker, decouple it from banking_stage to prevent accidental deadlock. * Simplified code, removed unused functions

* review fixes
2021-07-06 15:41:25 +00:00
Michael Vines
d5c2c72360 Rename Tower::lockouts to Tower::vote_state 2021-07-02 18:35:49 -07:00
Tao Zhu
7cd6224caf
log warning when channel send fails (#18391) 2021-07-02 19:04:09 +00:00
carllin
0eca92de18
Make set roots an iterator (#18357) 2021-07-01 20:02:40 -07:00
Michael Vines
b6792a3328 Add ability to change the validator identity at runtime 2021-07-01 17:50:04 -07:00
Brooks Prumo
45d54b1fc6
Add SnapshotArchiveInfo and refactor functions in snapshot_utils (#18232) 2021-07-01 12:20:56 -05:00
Tao Zhu
5e424826ba
Persist cost table to blockstore (#18123)
* Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
* Add ProgramCosts to compaction excluding list alone side with TransactionStatusIndex in one place: `excludes_from_compaction()`

* Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
* Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
* Only try to persist to blockstore when cost_table is changed.
* Restore cost table during validator startup

* Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
* Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.
2021-07-01 11:32:41 -05:00
Brooks Prumo
89a3e4f91e
Move SnapshotConfig into its own module (#18331)
Also move ArchiveFormat to snapshot_utils, and do not
reexport SnapshotVersion.
2021-07-01 08:55:26 -05:00
sakridge
8d9a6deda4
Add repair number per slot (#18082) 2021-06-30 18:20:07 +02:00
Trent Nelson
02b14caa5f test-validator: hold rent constant with --slots-per-epoch 2021-06-30 00:46:12 -06:00
carllin
68c87469c3
Cleanup ReplayStage tests (#18241) 2021-06-28 20:19:42 -07:00
Tao Zhu
9d6f1ebef4
investigate system performance test degradation (#17919)
* Add stats and counter around cost model ops, mainly:
- calculate transaction cost
- check transaction can fit in a block
- update block cost tracker after transactions are added to block
- replay_stage to update/insert execution cost to table

* Change mutex on cost_tracker to RwLock

* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.

* acquire and hold locks for block of TXs, instead of acquire and release per transaction;

* remove redundant would_fit check from cost_tracker update execution path

* refactor cost checking with less frequent lock acquiring

* avoid many Transaction_cost heap allocation when calculate cost, which
is in the hot path - executed per transaction.

* create hashmap with new_capacity to reduce runtime heap realloc.

* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default

* address potential deadlock by acquiring locks one at time
2021-06-28 21:34:04 -05:00
sakridge
5d08bf9aa3
More detailed voting timings in replay stage (#18229) 2021-06-26 17:32:08 +02:00
Trent Nelson
d269975784 Revert "Clean up build warning"
This reverts commit 17a173ebb5cb87271069abdaa0fb9c01d4c8d9d4.
2021-06-24 19:57:52 -06:00
Michael Vines
314102cb54 Remove redundant JsonRpcConfig::identity_pubkey field 2021-06-22 17:20:11 -07:00
sakridge
e808f34b0b
Add batch stats (#18096) 2021-06-22 15:23:26 +02:00
Michael Vines
3b1517237c Clean up argument names 2021-06-21 21:29:52 -07:00
Michael Vines
84b9de8c18 Shredder no longer holds a keypair 2021-06-21 21:29:52 -07:00
Michael Vines
2435ea3ad8 Remove redundant ReplayStageConfig::my_pubkey field 2021-06-21 21:29:52 -07:00
Michael Vines
51a0007001 serve_repair: Remove internal ContactInfo field duplication 2021-06-21 17:23:49 -07:00
behzad nouri
598093b5db adds shred-version to ip-echo-server response
When starting a validator, the node initially joins gossip with
shred_verison = 0, until it adopts the entrypoint's shred-version:
https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417

Depending on the load on the entrypoint, this adopting entrypoint
shred-version through gossip sometimes becomes very slow, and causes
several problems in gossip because we have to partially support
shred_version == 0 which is a source of leaking crds values from one
cluster to another. e.g. see
https://github.com/solana-labs/solana/pull/17899
and the other linked issues there.

In order to remove shred_version == 0 from gossip, this commit adds
shred-version to ip-echo-server response. Once the entrypoints are
updated, on validator start-up, if --expected_shred_version is not
specified we will obtain shred-version from the entrypoint using
ip-echo-server.
2021-06-21 19:37:16 +00:00
Jeff Washington (jwash)
ec2f930475
user process.accounts_db_test_hash_calculation for debug_verify hash (#18053) 2021-06-21 10:20:27 -05:00
Michael Vines
4a12c715a3 Drop Error suffix from enum values to avoid the enum_variant_names clippy lint 2021-06-18 23:02:13 +00:00
Alexander Meißner
789f33e8db chore: cargo fmt 2021-06-18 10:42:46 -07:00
Alexander Meißner
6514096a67 chore: cargo +nightly clippy --fix -Z unstable-options 2021-06-18 10:42:46 -07:00
Tyera Eulberg
d0511de9a6
chore: bump trees from 0.2.1 to 0.4.2 (#18052)
* chore: bump trees from 0.2.1 to 0.4.2 (#18041)

Bumps [trees](https://github.com/oooutlk/trees) from 0.2.1 to 0.4.2.
- [Release notes](https://github.com/oooutlk/trees/releases)
- [Commits](https://github.com/oooutlk/trees/commits)

---
updated-dependencies:
- dependency-name: trees
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Accommodate field & type changes

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-06-17 22:45:09 +00:00
Lijun Wang
071b1ee3e5
Removed pub from some functions which are actually private to improve encapsulation (#18030)
Remove the pub marker to improve encapsulation. Readability improvement only, no functional impact.
2021-06-17 10:14:21 -07:00
Michael Vines
fa04531c7a Extricate RpcCompletedSlotsService from RetransmitStage 2021-06-16 16:20:35 -07:00
Trent Nelson
5bc6c89adc validator: run poh speed test earlier in start up 2021-06-16 21:27:08 +00:00
behzad nouri
161838655c
removes port-based forwarding logic from turbine retransmit (#17716)
Turbine retransmit logic is based on which socket it received the packet
from (i.e `packet.meta.forward`):
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L467-L470

This can leave the cluster vulnerable to spoofing and selective
propagation of packets; see
https://github.com/solana-labs/solana/issues/6672
https://github.com/solana-labs/solana/pull/7774

This commit identifies if the node is on the "critical path" based on
its index in the shuffled cluster. If so, it forwards the packet to both
neighbors and children; otherwise, the packet is only forwarded to the
children.

The metrics added in
https://github.com/solana-labs/solana/pull/17351
shows that the number of times the index does not match the port is very
rare, and therefore this change should be safe.
2021-06-15 13:19:41 +00:00
carllin
ccc013e134
Handle removing slots during account scans (#17471) 2021-06-14 21:04:01 -07:00
sakridge
eeee75c5be
Don't use pinned memory when unnecessary (#17832)
Reports of excessive GPU memory usage and errors
from cudaHostRegister. There are some cases where pinning is
not required.
2021-06-14 16:10:04 +02:00
sakridge
0feac57cb0
Don't store votes unless we are leader soon (#17803) 2021-06-11 18:29:05 +02:00
carllin
c8535be0e1
Port unconfirmed duplicate tracking logic from ProgressMap to ForkChoice (#17779) 2021-06-11 03:09:57 -07:00
carllin
afafa624a3
Account for duplicate before a bank is frozen or replayed (#17866) 2021-06-10 22:28:23 -07:00
Lijun Wang
269d995832
Make account shrink configurable #17544 (#17778)
1. Added both options for measuring space usage using total accounts usage and for individual store shrink ratio using an enum. Validator CLI options: --accounts-shrink-optimize-total-space and --accounts-shrink-ratio
2. Added code for selecting candidates based on total usage in a separate function select_candidates_by_total_usage
3. Added unit tests for the new functions added
4. The default implementations is kept at 0.8 shrink ratio with --accounts-shrink-optimize-total-space set to true

Fixes #17544
2021-06-09 21:21:32 -07:00
Tao Zhu
ae27fcbcda
replay stage feed back program cost (#17731)
* replay stage feeds back realtime per-program execution cost to cost model;

* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;

* changed cost unit to microsecond, using value collected from mainnet;

* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
2021-06-09 17:10:59 -05:00
Justin Starry
050bb5446d
Add local cluster tests that broadcast duplicate slots (#13995)
* Add duplicate node local cluster test

* fix clippy

* remove dupe test
2021-06-09 15:01:48 -07:00
Michael Vines
e5e7390d44 Wrap long lines 2021-06-08 12:05:29 -07:00
Tyera Eulberg
544b3c0d17
Create solana-poh and move remaining rpc modules to solana-rpc (#17698)
* Create solana-poh crate

* Move BigTableUploadService to solana-ledger

* Add solana-rpc to workspace

* Move dependencies to solana-rpc

* Move remaining rpc modules to solana-rpc

* Single use statement solana-poh

* Single use statement solana-rpc
2021-06-04 09:23:06 -06:00
sakridge
f97ce2cd7e
Per-program id timings (#17554) 2021-06-04 16:04:31 +02:00
behzad nouri
be957f25c9
adds fallback logic if retransmit multicast fails (#17714)
In retransmit-stage, based on the packet.meta.seed and resulting
children/neighbors, each packet is sent to a different set of peers:
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L421-L457

However, current code errors out as soon as a multicast call fails,
which will skip all the remaining packets:
https://github.com/solana-labs/solana/blob/708bbcb00/core/src/retransmit_stage.rs#L467-L470

This can exacerbate packets loss in turbine.

This commit:
  * keeps iterating over retransmit packets for loop even if some
    intermediate sends fail.
  * adds a fallback to UdpSocket::send_to if multicast fails.

Recent discord chat:
https://discord.com/channels/428295358100013066/689412830075551748/849530845052403733
2021-06-04 12:16:37 +00:00
Tyera Eulberg
3a647c4bea
Rename ValidatorExit and move to sdk (#17728) 2021-06-04 03:06:13 +00:00
carllin
96ba2edfeb
Switch EpochSlots to be frozen slots, not completed slots (#17168) 2021-06-03 00:20:00 +00:00