Rework cluster metrics dashboard to support the modern clusters

(cherry picked from commit 5f5824d78d)

# Conflicts:
#	system-test/automation_utils.sh
Michael Vines
2020-03-11 10:21:53 -07:00
parent cb6e7426a4
commit f536d805ed
9 changed files with 242 additions and 54 deletions

@ -24,8 +24,8 @@ solana transaction-count
Inspect the network explorer at Inspect the network explorer at
[https://explorer.solana.com/](https://explorer.solana.com/) for activity. [https://explorer.solana.com/](https://explorer.solana.com/) for activity.
View the [metrics dashboard](https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta?var-testnet=testnet) View the [metrics dashboard](https://metrics.solana.com:3000/d/monitor/cluster-telemetry) for more
for more detail on cluster activity. detail on cluster activity.
## Confirm your Installation ## Confirm your Installation

@ -5,7 +5,7 @@ testnet participants, [https://discord.gg/pquxPsq](https://discord.gg/pquxPsq).
## Useful Links & Discussion ## Useful Links & Discussion
* [Network Explorer](http://explorer.solana.com/) * [Network Explorer](http://explorer.solana.com/)
* [Testnet Metrics Dashboard](https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?refresh=60s&orgId=2) * [Testnet Metrics Dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2)
* Validator chat channels * Validator chat channels
* [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries. * [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries.
* [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants ([What is Tour de SOL?](https://solana.com/tds/)). * [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants ([What is Tour de SOL?](https://solana.com/tds/)).
@ -14,6 +14,6 @@ testnet participants, [https://discord.gg/pquxPsq](https://discord.gg/pquxPsq).
* [Core software repo](https://github.com/solana-labs/solana) * [Core software repo](https://github.com/solana-labs/solana)
* [Tour de SOL Docs](https://docs.solana.com/tour-de-sol) * [Tour de SOL Docs](https://docs.solana.com/tour-de-sol)
* [TdS repo](https://github.com/solana-labs/tour-de-sol) * [TdS repo](https://github.com/solana-labs/tour-de-sol)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds&orgId=2&var-datasource=TdS%20Metrics%20%28read-only%29) * [TdS metrics dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds)
Can't find what you're looking for? Send an email to ryan@solana.com or reach out to @rshea\#2622 on Discord. Can't find what you're looking for? Send an email to ryan@solana.com or reach out to @rshea\#2622 on Discord.

@ -6,7 +6,7 @@ description: Where to go after you've read this guide
* [Solana Docs](https://docs.solana.com/) * [Solana Docs](https://docs.solana.com/)
* [Network Explorer](http://explorer.solana.com/) * [Network Explorer](http://explorer.solana.com/)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/testnet/testnet-monitor?refresh=1m&from=now-15m&to=now&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=All9) * [TdS metrics dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds)
* Validator chat channels * Validator chat channels
* [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries that don't fall under Tour de SOL. * [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries that don't fall under Tour de SOL.
* [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants. * [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants.

@ -4,13 +4,14 @@
There are three versions of the testnet dashboard, corresponding to the three There are three versions of the testnet dashboard, corresponding to the three
release channels: release channels:
* https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge * https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge
* https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta * https://metrics.solana.com:3000/d/monitor-beta/cluster-telemetry-beta
* https://metrics.solana.com:3000/d/testnet/testnet-monitor * https://metrics.solana.com:3000/d/monitor/cluster-telemetry
The dashboard for each channel is defined from the The dashboard for each channel is defined from the
`metrics/testnet-monitor.json` source file in the git branch associated with `metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json` source
that channel, and deployed by automation running `ci/publish-metrics-dashboard.sh`. file in the git branch associated with that channel, and deployed by automation
running `ci/publish-metrics-dashboard.sh`.
A deploy can be triggered at any time via the `New Build` button of A deploy can be triggered at any time via the `New Build` button of
https://buildkite.com/solana-labs/publish-metrics-dashboard. https://buildkite.com/solana-labs/publish-metrics-dashboard.
@ -18,7 +19,7 @@ https://buildkite.com/solana-labs/publish-metrics-dashboard.
### Modifying a Dashboard ### Modifying a Dashboard
Dashboard updates are accomplished by modifying Dashboard updates are accomplished by modifying
`metrics/scripts/grafana-provisioning/dashboards/testnet-monitor.json`, `metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json`,
**manual edits made directly in Grafana will be overwritten**. **manual edits made directly in Grafana will be overwritten**.
* Check out metrics to add at https://metrics.solana.com:8888/ in the data explorer. * Check out metrics to add at https://metrics.solana.com:8888/ in the data explorer.
@ -32,13 +33,13 @@ Dashboard updates are accomplished by modifying
`Settings` menu for the dashboard `Settings` menu for the dashboard
3. Edit dashboard as desired 3. Edit dashboard as desired
4. Extract the JSON Model by selecting `JSON Model` in the `Settings` menu. Copy the JSON to the clipboard 4. Extract the JSON Model by selecting `JSON Model` in the `Settings` menu. Copy the JSON to the clipboard
and paste into `metrics/scripts/grafana-provisioning/dashboards/testnet-monitor.json`, and paste into `metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json`,
5. Delete your development dashboard: `Settings` => `Delete` 5. Delete your development dashboard: `Settings` => `Delete`
### Deploying a Dashboard Manually ### Deploying a Dashboard Manually
If you need to immediately deploy a dashboard using the contents of If you need to immediately deploy a dashboard using the contents of
`testnet-monitor.json` in your local workspace, `cluster-monitor.json` in your local workspace,
``` ```
$ export GRAFANA_API_TOKEN="an API key from https://metrics.solana.com:3000/org/apikeys" $ export GRAFANA_API_TOKEN="an API key from https://metrics.solana.com:3000/org/apikeys"
$ metrics/publish-metrics-dashboard.sh (edge|beta|stable) $ metrics/publish-metrics-dashboard.sh (edge|beta|stable)

@ -11,13 +11,13 @@ fi
case $CHANNEL in case $CHANNEL in
edge) edge)
DASHBOARD=testnet-monitor-edge DASHBOARD=cluster-telemetry-edge
;; ;;
beta) beta)
DASHBOARD=testnet-monitor-beta DASHBOARD=cluster-telemetry-beta
;; ;;
stable) stable)
DASHBOARD=testnet-monitor DASHBOARD=cluster-telemetry
;; ;;
*) *)
echo "Error: Invalid CHANNEL=$CHANNEL" echo "Error: Invalid CHANNEL=$CHANNEL"
@ -31,7 +31,7 @@ if [[ -z $GRAFANA_API_TOKEN ]]; then
exit 1 exit 1
fi fi
DASHBOARD_JSON=scripts/grafana-provisioning/dashboards/testnet-monitor.json DASHBOARD_JSON=scripts/grafana-provisioning/dashboards/cluster-monitor.json
if [[ ! -r $DASHBOARD_JSON ]]; then if [[ ! -r $DASHBOARD_JSON ]]; then
echo Error: $DASHBOARD_JSON not found echo Error: $DASHBOARD_JSON not found
fi fi

@ -21,7 +21,7 @@ with open(dashboard_json, 'r') as read_file:
data = json.load(read_file) data = json.load(read_file)
if channel == 'local': if channel == 'local':
data['title'] = 'Local Testnet Monitor' data['title'] = 'Local Cluster Monitor'
data['uid'] = 'local' data['uid'] = 'local'
data['links'] = [] data['links'] = []
data['templating']['list'] = [{'current': {'text': '$datasource', data['templating']['list'] = [{'current': {'text': '$datasource',
@ -66,10 +66,9 @@ if channel == 'local':
'useTags': False}] 'useTags': False}]
elif channel == 'stable': elif channel == 'stable':
# Stable dashboard only allows the user to select between the stable # Stable dashboard only allows the user to select between public clusters
# testnet databases data['title'] = 'Cluster Telemetry'
data['title'] = 'Testnet Monitor' data['uid'] = 'monitor'
data['uid'] = 'testnet'
data['templating']['list'] = [{'current': {'text': '$datasource', data['templating']['list'] = [{'current': {'text': '$datasource',
'value': '$datasource'}, 'value': '$datasource'},
'hide': 1, 'hide': 1,
@ -81,20 +80,26 @@ elif channel == 'stable':
'regex': '', 'regex': '',
'type': 'datasource'}, 'type': 'datasource'},
{'allValue': None, {'allValue': None,
'current': {'text': 'testnet', 'current': {'text': 'Developer Testnet',
'value': 'testnet'}, 'value': 'devnet'},
'hide': 1, 'hide': 1,
'includeAll': False, 'includeAll': False,
'label': 'Testnet', 'label': 'Testnet',
'multi': False, 'multi': False,
'name': 'testnet', 'name': 'testnet',
'options': [{'selected': False, 'options': [{'selected': True,
'text': 'testnet', 'text': 'Developer Testnet',
'value': 'testnet'}, 'value': 'devnet'},
{'selected': True, {'selected': False,
'text': 'testnet-perf', 'text': 'Mainnet Beta',
'value': 'testnet-perf'}], 'value': 'mainnet-beta'},
'query': 'testnet,testnet-perf', {'selected': False,
'text': 'Tour de SOL Testnet',
'value': 'tds'},
{'selected': False,
'text': 'Soft Launch Testnet',
'value': 'cluster'}],
'query': 'devnet,mainnet-beta,tds,cluster',
'type': 'custom'}, 'type': 'custom'},
{'allValue': ".*", {'allValue': ".*",
'datasource': '$datasource', 'datasource': '$datasource',
@ -114,10 +119,9 @@ elif channel == 'stable':
'type': 'query', 'type': 'query',
'useTags': False}] 'useTags': False}]
else: else:
# Non-stable dashboard only allows the user to select between all testnet # Non-stable dashboard includes all the dev clusters
# databases data['title'] = 'Cluster Telemetry ({})'.format(channel)
data['title'] = 'Testnet Monitor ({})'.format(channel) data['uid'] = 'monitor-' + channel
data['uid'] = 'testnet-' + channel
data['templating']['list'] = [{'current': {'text': '$datasource', data['templating']['list'] = [{'current': {'text': '$datasource',
'value': '$datasource'}, 'value': '$datasource'},
'hide': 1, 'hide': 1,
@ -129,8 +133,8 @@ else:
'regex': '', 'regex': '',
'type': 'datasource'}, 'type': 'datasource'},
{'allValue': ".*", {'allValue': ".*",
'current': {'text': 'testnet', 'current': {'text': 'Developer Testnet',
'value': 'testnet'}, 'value': 'devnet'},
'datasource': '$datasource', 'datasource': '$datasource',
'hide': 1, 'hide': 1,
'includeAll': False, 'includeAll': False,
@ -140,7 +144,7 @@ else:
'options': [], 'options': [],
'query': 'show databases', 'query': 'show databases',
'refresh': 1, 'refresh': 1,
'regex': 'testnet.*', 'regex': '(devnet|cluster|tds|mainnet-beta|testnet.*)',
'sort': 1, 'sort': 1,
'tagValuesQuery': '', 'tagValuesQuery': '',
'tags': [], 'tags': [],
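
The title and uid renames in the hunks above can be summarized in a small sketch. The helper name `dashboard_identity` is hypothetical and does not exist in the commit; it only restates the new `data['title']` and `data['uid']` assignments for each channel.

```python
def dashboard_identity(channel):
    """Return (title, uid) for a channel, mirroring the new values in this diff.

    Hypothetical helper for illustration; the real script assigns these
    values directly onto the loaded dashboard JSON.
    """
    if channel == 'local':
        return 'Local Cluster Monitor', 'local'
    if channel == 'stable':
        # The stable dashboard carries no channel suffix.
        return 'Cluster Telemetry', 'monitor'
    # Any other channel (edge, beta) is suffixed with the channel name.
    return 'Cluster Telemetry ({})'.format(channel), 'monitor-' + channel


for channel in ('local', 'stable', 'beta', 'edge'):
    title, uid = dashboard_identity(channel)
    print('{:<8} uid={:<14} title={}'.format(channel, uid, title))
```

The uid is the first path segment after `/d/` in the dashboard URLs updated elsewhere in this commit, which is why `testnet-edge` becomes `monitor-edge` in every link.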

@ -27,21 +27,21 @@
"title": "Stable", "title": "Stable",
"tooltip": "", "tooltip": "",
"type": "link", "type": "link",
"url": "https://metrics.solana.com:3000/d/testnet/testnet-monitor" "url": "https://metrics.solana.com:3000/d/monitor/cluster-telemetry"
}, },
{ {
"icon": "dashboard", "icon": "dashboard",
"tags": [], "tags": [],
"title": "Beta", "title": "Beta",
"type": "link", "type": "link",
"url": "https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta" "url": "https://metrics.solana.com:3000/d/monitor-beta/cluster-telemetry-beta"
}, },
{ {
"icon": "dashboard", "icon": "dashboard",
"tags": [], "tags": [],
"title": "Edge", "title": "Edge",
"type": "link", "type": "link",
"url": "https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge" "url": "https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge"
} }
], ],
"panels": [ "panels": [
@ -4618,7 +4618,7 @@
}, },
"yaxes": [ "yaxes": [
{ {
"format": "µs", "format": "\u00b5s",
"label": null, "label": null,
"logBase": 1, "logBase": 1,
"max": null, "max": null,
@ -5385,7 +5385,7 @@
}, },
"yaxes": [ "yaxes": [
{ {
"format": "µs", "format": "\u00b5s",
"label": null, "label": null,
"logBase": 1, "logBase": 1,
"max": null, "max": null,
@ -5752,7 +5752,7 @@
}, },
"yaxes": [ "yaxes": [
{ {
"format": "µs", "format": "\u00b5s",
"label": null, "label": null,
"logBase": 1, "logBase": 1,
"max": null, "max": null,
@ -6727,7 +6727,7 @@
}, },
"yaxes": [ "yaxes": [
{ {
"format": "µs", "format": "\u00b5s",
"label": null, "label": null,
"logBase": 1, "logBase": 1,
"max": null, "max": null,
@ -10181,7 +10181,6 @@
"list": [ "list": [
{ {
"current": { "current": {
"selected": true,
"text": "$datasource", "text": "$datasource",
"value": "$datasource" "value": "$datasource"
}, },
@ -10197,9 +10196,8 @@
{ {
"allValue": ".*", "allValue": ".*",
"current": { "current": {
"selected": false, "text": "Developer Testnet",
"text": "testnet", "value": "devnet"
"value": "testnet"
}, },
"datasource": "$datasource", "datasource": "$datasource",
"hide": 1, "hide": 1,
@ -10210,7 +10208,7 @@
"options": [], "options": [],
"query": "show databases", "query": "show databases",
"refresh": 1, "refresh": 1,
"regex": "testnet.*", "regex": "(devnet|cluster|tds|mainnet-beta|testnet.*)",
"sort": 1, "sort": 1,
"tagValuesQuery": "", "tagValuesQuery": "",
"tags": [], "tags": [],
@ -10269,7 +10267,7 @@
] ]
}, },
"timezone": "", "timezone": "",
"title": "Testnet Monitor (edge)", "title": "Cluster Telemetry (edge)",
"uid": "testnet-edge", "uid": "monitor-edge",
"version": 2 "version": 2
} }
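
To see what the widened `regex` filter lets through, here is a rough check in Python. It assumes the filter is applied as an unanchored search against each name returned by `show databases`; that is an assumption about Grafana's behavior, not something this diff states. The names `_internal` and `telegraf` are made-up non-matching samples.

```python
import re

# New filter from this commit; the old one was just 'testnet.*'.
DATABASE_FILTER = re.compile(r'(devnet|cluster|tds|mainnet-beta|testnet.*)')


def selectable(database_names):
    """Keep the names the filter would accept (assumed unanchored search)."""
    return [name for name in database_names if DATABASE_FILTER.search(name)]


sample = ['devnet', 'mainnet-beta', 'tds', 'cluster',
          'testnet-automation', '_internal', 'telegraf']
print(selectable(sample))
```

Under the same assumption, any database whose name merely contains one of the alternatives (for example a hypothetical `my-cluster-db`) would also be selectable, since nothing in the pattern anchors it to the start or end of the name.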

@ -34,7 +34,7 @@ source lib/config.sh
if [[ ! -f lib/grafana-provisioning ]]; then if [[ ! -f lib/grafana-provisioning ]]; then
cp -va grafana-provisioning lib cp -va grafana-provisioning lib
./adjust-dashboard-for-channel.py \ ./adjust-dashboard-for-channel.py \
lib/grafana-provisioning/dashboards/testnet-monitor.json local lib/grafana-provisioning/dashboards/cluster-monitor.json local
mkdir -p lib/grafana-provisioning/datasources mkdir -p lib/grafana-provisioning/datasources
cat > lib/grafana-provisioning/datasources/datasource.yml <<EOF cat > lib/grafana-provisioning/datasources/datasource.yml <<EOF

system-test/automation_utils.sh Executable file
@ -0,0 +1,185 @@
#!/usr/bin/env bash
# | source | this file
# shellcheck disable=SC1090
# shellcheck disable=SC1091
# shellcheck disable=SC2034
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
REPO_ROOT=${DIR}/..
source "${REPO_ROOT}"/ci/upload-ci-artifact.sh
function execution_step {
# shellcheck disable=SC2124
STEP="$@"
echo --- "${STEP[@]}"
}
function collect_logs {
execution_step "Collect logs from remote nodes"
rm -rf "${REPO_ROOT}"/net/log
"${REPO_ROOT}"/net/net.sh logs
for logfile in "${REPO_ROOT}"/net/log/*; do
(
upload-ci-artifact "$logfile"
)
done
}
function analyze_packet_loss {
(
set -x
# shellcheck disable=SC1091
source "${REPO_ROOT}"/net/config/config
mkdir -p iftop-logs
execution_step "Map private -> public IP addresses in iftop logs"
# shellcheck disable=SC2154
for i in "${!validatorIpList[@]}"; do
# shellcheck disable=SC2154
# shellcheck disable=SC2086
# shellcheck disable=SC2027
echo "{\"private\": \""${validatorIpListPrivate[$i]}""\", \"public\": \""${validatorIpList[$i]}""\"},"
done > ip_address_map.txt
for ip in "${validatorIpList[@]}"; do
"${REPO_ROOT}"/net/scp.sh ip_address_map.txt solana@"$ip":~/solana/
done
execution_step "Remotely post-process iftop logs"
# shellcheck disable=SC2154
for ip in "${validatorIpList[@]}"; do
iftop_log=iftop-logs/$ip-iftop.log
# shellcheck disable=SC2016
"${REPO_ROOT}"/net/ssh.sh solana@"$ip" 'PATH=$PATH:~/.cargo/bin/ ~/solana/scripts/iftop-postprocess.sh ~/solana/iftop.log temp.log ~solana/solana/ip_address_map.txt' > "$iftop_log"
upload-ci-artifact "$iftop_log"
done
execution_step "Analyzing Packet Loss"
"${REPO_ROOT}"/solana-release/bin/solana-log-analyzer analyze -f ./iftop-logs/ | sort -k 2 -g
)
}
function wait_for_bootstrap_validator_stake_drop {
max_stake="$1"
source "${REPO_ROOT}"/net/common.sh
loadConfigFile
while true; do
# shellcheck disable=SC2154
bootstrap_validator_validator_info="$(ssh "${sshOptions[@]}" "${validatorIpList[0]}" '$HOME/.cargo/bin/solana validators | grep "$($HOME/.cargo/bin/solana-keygen pubkey ~/solana/config/bootstrap-validator/identity-keypair.json)"')"
bootstrap_validator_stake_percentage="$(echo "$bootstrap_validator_validator_info" | awk '{gsub(/[\(,\),\%]/,""); print $9}')"
if [[ $(echo "$bootstrap_validator_stake_percentage < $max_stake" | bc) -ne 0 ]]; then
echo "Bootstrap validator stake has fallen below $max_stake to $bootstrap_validator_stake_percentage"
break
fi
echo "Max bootstrap validator stake: $max_stake. Current stake: $bootstrap_validator_stake_percentage. Sleeping 30s for stake to distribute."
sleep 30
done
}
function get_slot {
source "${REPO_ROOT}"/net/common.sh
loadConfigFile
ssh "${sshOptions[@]}" "${validatorIpList[0]}" '$HOME/.cargo/bin/solana slot'
}
function upload_results_to_slack() {
echo --- Uploading results to Slack Performance Results App
if [[ -z $SLACK_WEBHOOK_URL ]] ; then
echo "SLACK_WEBHOOK_URL undefined"
exit 1
fi
[[ -n $BUILDKITE_MESSAGE ]] || BUILDKITE_MESSAGE="Message not defined"
COMMIT=$(git rev-parse HEAD)
COMMIT_BUTTON_TEXT="$(echo "$COMMIT" | head -c 8)"
COMMIT_URL="https://github.com/solana-labs/solana/commit/${COMMIT}"
if [[ -n $BUILDKITE_BUILD_URL ]] ; then
BUILD_BUTTON_TEXT="Buildkite Job"
else
BUILD_BUTTON_TEXT="Build URL not defined"
BUILDKITE_BUILD_URL="https://buildkite.com/solana-labs/"
fi
GRAFANA_URL="https://metrics.solana.com:3000/d/monitor-${CHANNEL:-edge}/cluster-telemetry-${CHANNEL:-edge}?var-testnet=${TESTNET_TAG:-testnet-automation}&from=${TESTNET_START_UNIX_MSECS:-0}&to=${TESTNET_FINISH_UNIX_MSECS:-0}"
[[ -n $RESULT_DETAILS ]] || RESULT_DETAILS="Undefined"
[[ -n $TEST_CONFIGURATION ]] || TEST_CONFIGURATION="Undefined"
payLoad="$(cat <<EOF
{
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*$BUILDKITE_MESSAGE*"
}
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {
"type": "plain_text",
"text": "$COMMIT_BUTTON_TEXT",
"emoji": true
},
"url": "$COMMIT_URL"
},
{
"type": "button",
"text": {
"type": "plain_text",
"text": "$BUILD_BUTTON_TEXT",
"emoji": true
},
"url": "$BUILDKITE_BUILD_URL"
},
{
"type": "button",
"text": {
"type": "plain_text",
"text": "Grafana",
"emoji": true
},
"url": "$GRAFANA_URL"
}
]
},
{
"type": "divider"
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Test Configuration: \n\`\`\`$TEST_CONFIGURATION\`\`\`"
}
},
{
"type": "divider"
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Result Details: \n\`\`\`$RESULT_DETAILS\`\`\`"
}
}
]
}
EOF
)"
curl -X POST \
-H 'Content-type: application/json' \
--data "$payLoad" \
"$SLACK_WEBHOOK_URL"
}
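
The `GRAFANA_URL` built in `upload_results_to_slack` relies on bash `${VAR:-default}` fallbacks. A minimal Python sketch of the same construction follows; the function name `grafana_url` and the dict-based `env` argument are illustrative assumptions and are not part of the script.

```python
def grafana_url(env):
    """Build the dashboard link the way the GRAFANA_URL line does.

    `env` is any mapping of environment variable names to strings.
    `value or default` mimics bash ${VAR:-default}, which falls back
    when the variable is unset or empty.
    """
    channel = env.get('CHANNEL') or 'edge'
    testnet = env.get('TESTNET_TAG') or 'testnet-automation'
    start = env.get('TESTNET_START_UNIX_MSECS') or '0'
    finish = env.get('TESTNET_FINISH_UNIX_MSECS') or '0'
    return ('https://metrics.solana.com:3000/d/monitor-{0}/cluster-telemetry-{0}'
            '?var-testnet={1}&from={2}&to={3}').format(channel, testnet, start, finish)


print(grafana_url({}))
print(grafana_url({'CHANNEL': 'beta', 'TESTNET_TAG': 'tds'}))
```

One thing the sketch makes visible: with `CHANNEL=stable` this pattern yields `monitor-stable/cluster-telemetry-stable`, whereas the stable dashboard in this commit uses the unsuffixed `monitor/cluster-telemetry` path, so the link is only expected to resolve for the edge and beta channels.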