Cost model 1.7 (#20188)

* Cost Model to limit transactions which are not parallelizable (#16694)
* Add following to banking_stage:
  1. CostModel as immutable ref shared between threads, to provide estimated cost for transactions.
  2. CostTracker which is shared between threads, tracks transaction costs for each block.
* replace hard coded program ID with id() calls
* Add Account Access Cost as part of TransactionCost. Account access costs are weighted differently between read and write, signed and non-signed.
* Establish instruction_execution_cost_table, add function to update or insert instruction cost, unit tested. It is read-only for now; it allows Replay to insert realtime instruction execution costs to the table.
* add test for cost_tracker atomically try_add operation, serves as safety guard for future changes
* check cost against local copy of cost_tracker, return transactions that would exceed limit as unprocessed transaction to be buffered; only apply bank processed transactions cost to tracker;
* bencher to new banking_stage with max cost limit to allow cost model being hit consistently during bench iterations
* replay stage feed back program cost (#17731)
* replay stage feeds back realtime per-program execution cost to cost model;
* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;
* changed cost unit to microsecond, using value collected from mainnet;
* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
* investigate system performance test degradation (#17919)
* Add stats and counter around cost model ops, mainly:
  - calculate transaction cost
  - check transaction can fit in a block
  - update block cost tracker after transactions are added to block
  - replay_stage to update/insert execution cost to table
* Change mutex on cost_tracker to RwLock
* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.
* acquire and hold locks for block of TXs, instead of acquire and release per transaction;
* remove redundant would_fit check from cost_tracker update execution path
* refactor cost checking with less frequent lock acquiring
* avoid many Transaction_cost heap allocation when calculate cost, which is in the hot path - executed per transaction.
* create hashmap with new_capacity to reduce runtime heap realloc.
* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default
* address potential deadlock by acquiring locks one at a time
* Persist cost table to blockstore (#18123)
* Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
* Add ProgramCosts to compaction excluding list alongside TransactionStatusIndex in one place: `excludes_from_compaction()`
* Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
* Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
* Only try to persist to blockstore when cost_table is changed.
* Restore cost table during validator startup
* Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
* Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.
* log warning when channel send fails (#18391)
* Aggregate cost_model into cost_tracker (#18374)
* aggregate cost_model into cost_tracker, decouple it from banking_stage to prevent accidental deadlock.
* Simplified code, removed unused functions
* review fixes
* update ledger tool to restore cost table from blockstore (#18489)
* update ledger tool to restore cost model from blockstore when compute-slot-cost
* Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool
* refactor and simplify a test
* manually fix merge conflicts
* Per-program id timings (#17554)
* more manual fixing
* solve a merge conflict
* featurize cost model
* more merge fix
* cost model uses compute_unit to replace microsecond as cost unit (#18934)
* Reject blocks for costs above the max block cost (#18994)
* Update block max cost limit to fix performance regression (#19276)
* replace function with const var for better readability (#19285)
* Add a few more metrics data points (#19624)
* periodically report sigverify_stage stats (#19674)
* manual merge
* cost model nits (#18528)
* Accumulate consumed units (#18714)
* tx wide compute budget (#18631)
* more manual merge
* ignore zeroize drop security
* Update cost constants:
  - update const cost values with data collected by #19627
  - update cost calculation to closely follow the proposed fee schedule #16984
* add transaction cost histogram metrics (#20350)
* rebase to 1.7.15
* add tx count and thread id to stats (#20451); each stat reports and resets when slot changes
* remove cost_model feature_set
* ignore vote transactions from cost model

Co-authored-by: sakridge <sakridge@gmail.com>
Co-authored-by: Jeff Biseda <jbiseda@gmail.com>
Co-authored-by: Jack May <jack@solana.com>
2021-10-06 15:11:41 -05:00
parent a4df784e82
commit db85d659b9
40 changed files with 3208 additions and 266 deletions
--- a/core/src/cost_update_service.rs
+++ b/core/src/cost_update_service.rs
@@ -0,0 +1,292 @@
+//! This service receives `ExecuteTimings` from replay_stage and uses them to
+//! update the cost_model, which is shared with banking_stage to optimize
+//! packing transactions into blocks; it also triggers persisting the cost
+//! table to blockstore.
+
+use crate::cost_model::CostModel;
+use solana_ledger::blockstore::Blockstore;
+use solana_measure::measure::Measure;
+use solana_runtime::bank::ExecuteTimings;
+use solana_sdk::timing::timestamp;
+use std::{
+    sync::{
+        atomic::{AtomicBool, Ordering},
+        mpsc::Receiver,
+        Arc, RwLock,
+    },
+    thread::{self, Builder, JoinHandle},
+    time::Duration,
+};
+
+#[derive(Default)]
+pub struct CostUpdateServiceTiming {
+    last_print: u64,
+    update_cost_model_count: u64,
+    update_cost_model_elapsed: u64,
+    persist_cost_table_elapsed: u64,
+}
+
+impl CostUpdateServiceTiming {
+    fn update(
+        &mut self,
+        update_cost_model_count: u64,
+        update_cost_model_elapsed: u64,
+        persist_cost_table_elapsed: u64,
+    ) {
+        self.update_cost_model_count += update_cost_model_count;
+        self.update_cost_model_elapsed += update_cost_model_elapsed;
+        self.persist_cost_table_elapsed += persist_cost_table_elapsed;
+
+        let now = timestamp();
+        let elapsed_ms = now.saturating_sub(self.last_print);
+        if elapsed_ms > 1000 {
+            datapoint_info!(
+                "cost-update-service-stats",
+                ("total_elapsed_us", elapsed_ms * 1000, i64),
+                (
+                    "update_cost_model_count",
+                    self.update_cost_model_count as i64,
+                    i64
+                ),
+                (
+                    "update_cost_model_elapsed",
+                    self.update_cost_model_elapsed as i64,
+                    i64
+                ),
+                (
+                    "persist_cost_table_elapsed",
+                    self.persist_cost_table_elapsed as i64,
+                    i64
+                ),
+            );
+
+            *self = CostUpdateServiceTiming::default();
+            self.last_print = now;
+        }
+    }
+}
+
+pub type CostUpdateReceiver = Receiver<ExecuteTimings>;
+
+pub struct CostUpdateService {
+    thread_hdl: JoinHandle<()>,
+}
+
+impl CostUpdateService {
+    #[allow(clippy::new_ret_no_self)]
+    pub fn new(
+        exit: Arc<AtomicBool>,
+        blockstore: Arc<Blockstore>,
+        cost_model: Arc<RwLock<CostModel>>,
+        cost_update_receiver: CostUpdateReceiver,
+    ) -> Self {
+        let thread_hdl = Builder::new()
+            .name("solana-cost-update-service".to_string())
+            .spawn(move || {
+                Self::service_loop(exit, blockstore, cost_model, cost_update_receiver);
+            })
+            .unwrap();
+
+        Self { thread_hdl }
+    }
+
+    pub fn join(self) -> thread::Result<()> {
+        self.thread_hdl.join()
+    }
+
+    fn service_loop(
+        exit: Arc<AtomicBool>,
+        blockstore: Arc<Blockstore>,
+        cost_model: Arc<RwLock<CostModel>>,
+        cost_update_receiver: CostUpdateReceiver,
+    ) {
+        let mut cost_update_service_timing = CostUpdateServiceTiming::default();
+        let mut dirty: bool;
+        let mut update_count: u64;
+        let wait_timer = Duration::from_millis(100);
+
+        loop {
+            if exit.load(Ordering::Relaxed) {
+                break;
+            }
+
+            dirty = false;
+            update_count = 0_u64;
+            let mut update_cost_model_time = Measure::start("update_cost_model_time");
+            for cost_update in cost_update_receiver.try_iter() {
+                dirty |= Self::update_cost_model(&cost_model, &cost_update);
+                update_count += 1;
+            }
+            update_cost_model_time.stop();
+
+            let mut persist_cost_table_time = Measure::start("persist_cost_table_time");
+            if dirty {
+                Self::persist_cost_table(&blockstore, &cost_model);
+            }
+            persist_cost_table_time.stop();
+
+            cost_update_service_timing.update(
+                update_count,
+                update_cost_model_time.as_us(),
+                persist_cost_table_time.as_us(),
+            );
+
+            thread::sleep(wait_timer);
+        }
+    }
+
+    fn update_cost_model(cost_model: &RwLock<CostModel>, execute_timings: &ExecuteTimings) -> bool {
+        let mut dirty = false;
+        {
+            let mut cost_model_mutable = cost_model.write().unwrap();
+            for (program_id, timing) in &execute_timings.details.per_program_timings {
+                if timing.count < 1 {
+                    continue;
+                }
+                let units = timing.accumulated_units / timing.count as u64;
+                match cost_model_mutable.upsert_instruction_cost(program_id, units) {
+                    Ok(c) => {
+                        debug!(
+                            "after replayed into bank, instruction {:?} has averaged cost {}",
+                            program_id, c
+                        );
+                        dirty = true;
+                    }
+                    Err(err) => {
+                        debug!(
+                            "after replayed into bank, instruction {:?} failed to update cost, err: {}",
+                            program_id, err
+                        );
+                    }
+                }
+            }
+        }
+        debug!(
+            "after replayed into bank, updated cost model instruction cost table, current values: {:?}",
+            cost_model.read().unwrap().get_instruction_cost_table()
+        );
+        dirty
+    }
+
+    fn persist_cost_table(blockstore: &Blockstore, cost_model: &RwLock<CostModel>) {
+        let cost_model_read = cost_model.read().unwrap();
+        let cost_table = cost_model_read.get_instruction_cost_table();
+        let db_records = blockstore.read_program_costs().expect("read programs");
+
+        // delete records from blockstore if they are no longer in cost_table
+        db_records.iter().for_each(|(pubkey, _)| {
+            if cost_table.get(pubkey).is_none() {
+                blockstore
+                    .delete_program_cost(pubkey)
+                    .expect("delete old program");
+            }
+        });
+
+        for (key, cost) in cost_table.iter() {
+            blockstore
+                .write_program_cost(key, cost)
+                .expect("persist program costs to blockstore");
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use solana_runtime::message_processor::ProgramTiming;
+    use solana_sdk::pubkey::Pubkey;
+
+    #[test]
+    fn test_update_cost_model_with_empty_execute_timings() {
+        let cost_model = Arc::new(RwLock::new(CostModel::default()));
+        let empty_execute_timings = ExecuteTimings::default();
+        CostUpdateService::update_cost_model(&cost_model, &empty_execute_timings);
+
+        assert_eq!(
+            0,
+            cost_model
+                .read()
+                .unwrap()
+                .get_instruction_cost_table()
+                .len()
+        );
+    }
+
+    #[test]
+    fn test_update_cost_model_with_execute_timings() {
+        let cost_model = Arc::new(RwLock::new(CostModel::default()));
+        let mut execute_timings = ExecuteTimings::default();
+
+        let program_key_1 = Pubkey::new_unique();
+        let mut expected_cost: u64;
+
+        // add new program
+        {
+            let accumulated_us: u64 = 1000;
+            let accumulated_units: u64 = 100;
+            let count: u32 = 10;
+            expected_cost = accumulated_units / count as u64;
+
+            execute_timings.details.per_program_timings.insert(
+                program_key_1,
+                ProgramTiming {
+                    accumulated_us,
+                    accumulated_units,
+                    count,
+                },
+            );
+            CostUpdateService::update_cost_model(&cost_model, &execute_timings);
+            assert_eq!(
+                1,
+                cost_model
+                    .read()
+                    .unwrap()
+                    .get_instruction_cost_table()
+                    .len()
+            );
+            assert_eq!(
+                Some(&expected_cost),
+                cost_model
+                    .read()
+                    .unwrap()
+                    .get_instruction_cost_table()
+                    .get(&program_key_1)
+            );
+        }
+
+        // update program
+        {
+            let accumulated_us: u64 = 2000;
+            let accumulated_units: u64 = 200;
+            let count: u32 = 10;
+            // the expected new cost is the average of the new value and the existing value
+            expected_cost = ((accumulated_units / count as u64) + expected_cost) / 2;
+
+            execute_timings.details.per_program_timings.insert(
+                program_key_1,
+                ProgramTiming {
+                    accumulated_us,
+                    accumulated_units,
+                    count,
+                },
+            );
+            CostUpdateService::update_cost_model(&cost_model, &execute_timings);
+            assert_eq!(
+                1,
+                cost_model
+                    .read()
+                    .unwrap()
+                    .get_instruction_cost_table()
+                    .len()
+            );
+            assert_eq!(
+                Some(&expected_cost),
+                cost_model
+                    .read()
+                    .unwrap()
+                    .get_instruction_cost_table()
+                    .get(&program_key_1)
+            );
+        }
+    }
+}
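
The update path in `cost_update_service.rs` can be read as two small steps: drain whatever timings are queued on the channel without blocking, then fold each per-program average into a shared table and report whether anything changed. The following is a minimal standalone sketch of that pattern using only the standard library. `Timing`, `CostTable`, `upsert_average`, and `drain_updates` are hypothetical names introduced for illustration; they are not the types or functions of this commit. The averaging rule is an assumption inferred from the unit test above, which expects the new cost to be the average of the new and existing values; the real `upsert_instruction_cost` is defined elsewhere in the commit and may differ.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;
use std::sync::RwLock;

/// Hypothetical stand-in for one per-program timing record.
pub struct Timing {
    pub program: &'static str,
    pub accumulated_units: u64,
    pub count: u32,
}

/// Hypothetical stand-in for the instruction cost table (program -> cost).
pub type CostTable = HashMap<&'static str, u64>;

/// Insert the cost for a new program, or replace an existing cost with the
/// average of the old and new values (assumed rule, see lead-in).
pub fn upsert_average(table: &mut CostTable, program: &'static str, units: u64) -> u64 {
    *table
        .entry(program)
        .and_modify(|cost| *cost = (*cost + units) / 2)
        .or_insert(units)
}

/// Drain every batch currently queued on the channel without blocking and
/// apply it to the shared table. Returns true if the table was modified, so
/// the caller can decide whether persisting it is worthwhile.
pub fn drain_updates(table: &RwLock<CostTable>, receiver: &Receiver<Vec<Timing>>) -> bool {
    let mut dirty = false;
    for batch in receiver.try_iter() {
        // Take the write lock once per batch rather than once per program.
        let mut guard = table.write().unwrap();
        for timing in &batch {
            if timing.count == 0 {
                // Nothing was executed, so there is no average to compute.
                continue;
            }
            let units = timing.accumulated_units / u64::from(timing.count);
            upsert_average(&mut guard, timing.program, units);
            dirty = true;
        }
    }
    dirty
}
```

A caller would run `drain_updates` in a loop, persist the table only when it returns `true`, and sleep briefly between iterations, which mirrors the shape of `service_loop` in the diff.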