Rework cluster metrics dashboard to support the modern clusters

(cherry picked from commit 5f5824d78d)

# Conflicts:
#	system-test/automation_utils.sh
Michael Vines
2020-03-11 10:21:53 -07:00
parent cb6e7426a4
commit f536d805ed
9 changed files with 242 additions and 54 deletions

View File

@ -24,8 +24,8 @@ solana transaction-count
Inspect the network explorer at
[https://explorer.solana.com/](https://explorer.solana.com/) for activity.
View the [metrics dashboard](https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta?var-testnet=testnet)
for more detail on cluster activity.
View the [metrics dashboard](https://metrics.solana.com:3000/d/monitor/cluster-telemetry) for more
detail on cluster activity.
## Confirm your Installation

View File

@ -5,7 +5,7 @@ testnet participants, [https://discord.gg/pquxPsq](https://discord.gg/pquxPsq).
## Useful Links & Discussion
* [Network Explorer](http://explorer.solana.com/)
* [Testnet Metrics Dashboard](https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?refresh=60s&orgId=2)
* [Testnet Metrics Dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2)
* Validator chat channels
* [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries.
* [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants ([What is Tour de SOL?](https://solana.com/tds/)).
@ -14,6 +14,6 @@ testnet participants, [https://discord.gg/pquxPsq](https://discord.gg/pquxPsq).
* [Core software repo](https://github.com/solana-labs/solana)
* [Tour de SOL Docs](https://docs.solana.com/tour-de-sol)
* [TdS repo](https://github.com/solana-labs/tour-de-sol)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds&orgId=2&var-datasource=TdS%20Metrics%20%28read-only%29)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds)
Can't find what you're looking for? Send an email to ryan@solana.com or reach out to @rshea\#2622 on Discord.

View File

@ -6,7 +6,7 @@ description: Where to go after you've read this guide
* [Solana Docs](https://docs.solana.com/)
* [Network Explorer](http://explorer.solana.com/)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/testnet/testnet-monitor?refresh=1m&from=now-15m&to=now&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=All9)
* [TdS metrics dashboard](https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=1m&from=now-15m&to=now&var-testnet=tds)
* Validator chat channels
* [\#validator-support](https://discord.gg/rZsenD) General support channel for any Validator related queries that don't fall under Tour de SOL.
* [\#tourdesol](https://discord.gg/BdujK2) Discussion and support channel for Tour de SOL participants.

View File

@ -4,13 +4,14 @@
There are three versions of the testnet dashboard, corresponding to the three
release channels:
* https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge
* https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta
* https://metrics.solana.com:3000/d/testnet/testnet-monitor
* https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge
* https://metrics.solana.com:3000/d/monitor-beta/cluster-telemetry-beta
* https://metrics.solana.com:3000/d/monitor/cluster-telemetry
The dashboard for each channel is defined from the
`metrics/testnet-monitor.json` source file in the git branch associated with
that channel, and deployed by automation running `ci/publish-metrics-dashboard.sh`.
`metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json` source
file in the git branch associated with that channel, and deployed by automation
running `ci/publish-metrics-dashboard.sh`.
A deploy can be triggered at any time via the `New Build` button of
https://buildkite.com/solana-labs/publish-metrics-dashboard.
@ -18,7 +19,7 @@ https://buildkite.com/solana-labs/publish-metrics-dashboard.
### Modifying a Dashboard
Dashboard updates are accomplished by modifying
`metrics/scripts/grafana-provisioning/dashboards/testnet-monitor.json`,
`metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json`,
**manual edits made directly in Grafana will be overwritten**.
* Check out metrics to add at https://metrics.solana.com:8888/ in the data explorer.
@ -32,13 +33,13 @@ Dashboard updates are accomplished by modifying
`Settings` menu for the dashboard
3. Edit dashboard as desired
4. Extract the JSON Model by selecting `JSON Model` in the `Settings` menu. Copy the JSON to the clipboard
and paste into `metrics/scripts/grafana-provisioning/dashboards/testnet-monitor.json`,
and paste into `metrics/scripts/grafana-provisioning/dashboards/cluster-monitor.json`,
5. Delete your development dashboard: `Settings` => `Delete`
### Deploying a Dashboard Manually
If you need to immediately deploy a dashboard using the contents of
`testnet-monitor.json` in your local workspace,
`cluster-monitor.json` in your local workspace,
```
$ export GRAFANA_API_TOKEN="an API key from https://metrics.solana.com:3000/org/apikeys"
$ metrics/publish-metrics-dashboard.sh (edge|beta|stable)

View File

@ -11,13 +11,13 @@ fi
case $CHANNEL in
edge)
DASHBOARD=testnet-monitor-edge
DASHBOARD=cluster-telemetry-edge
;;
beta)
DASHBOARD=testnet-monitor-beta
DASHBOARD=cluster-telemetry-beta
;;
stable)
DASHBOARD=testnet-monitor
DASHBOARD=cluster-telemetry
;;
*)
echo "Error: Invalid CHANNEL=$CHANNEL"
@ -31,7 +31,7 @@ if [[ -z $GRAFANA_API_TOKEN ]]; then
exit 1
fi
DASHBOARD_JSON=scripts/grafana-provisioning/dashboards/testnet-monitor.json
DASHBOARD_JSON=scripts/grafana-provisioning/dashboards/cluster-monitor.json
if [[ ! -r $DASHBOARD_JSON ]]; then
echo Error: $DASHBOARD_JSON not found
fi
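
The renamed dashboards follow a simple pattern: the stable channel has no suffix, while edge and beta append the channel name to both the uid and the slug. A minimal Python sketch of that mapping, based only on the URLs and `DASHBOARD` values shown in this commit (the `dashboard_url` helper and `BASE_URL` constant are hypothetical names, not part of the repository):

```python
# Hypothetical helper illustrating the channel -> dashboard URL naming
# used by this commit; it is not part of the repository.
BASE_URL = "https://metrics.solana.com:3000/d"


def dashboard_url(channel):
    """Return the dashboard URL for a release channel (edge, beta or stable)."""
    if channel == "stable":
        # The stable dashboard carries no channel suffix.
        uid, slug = "monitor", "cluster-telemetry"
    elif channel in ("edge", "beta"):
        uid, slug = f"monitor-{channel}", f"cluster-telemetry-{channel}"
    else:
        # Mirrors the "Invalid CHANNEL" branch of the shell case statement.
        raise ValueError(f"Invalid CHANNEL={channel}")
    return f"{BASE_URL}/{uid}/{slug}"


if __name__ == "__main__":
    for channel in ("edge", "beta", "stable"):
        print(channel, dashboard_url(channel))
```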

View File

@ -21,7 +21,7 @@ with open(dashboard_json, 'r') as read_file:
data = json.load(read_file)
if channel == 'local':
data['title'] = 'Local Testnet Monitor'
data['title'] = 'Local Cluster Monitor'
data['uid'] = 'local'
data['links'] = []
data['templating']['list'] = [{'current': {'text': '$datasource',
@ -66,10 +66,9 @@ if channel == 'local':
'useTags': False}]
elif channel == 'stable':
# Stable dashboard only allows the user to select between the stable
# testnet databases
data['title'] = 'Testnet Monitor'
data['uid'] = 'testnet'
# Stable dashboard only allows the user to select between public clusters
data['title'] = 'Cluster Telemetry'
data['uid'] = 'monitor'
data['templating']['list'] = [{'current': {'text': '$datasource',
'value': '$datasource'},
'hide': 1,
@ -81,20 +80,26 @@ elif channel == 'stable':
'regex': '',
'type': 'datasource'},
{'allValue': None,
'current': {'text': 'testnet',
'value': 'testnet'},
'current': {'text': 'Developer Testnet',
'value': 'devnet'},
'hide': 1,
'includeAll': False,
'label': 'Testnet',
'multi': False,
'name': 'testnet',
'options': [{'selected': False,
'text': 'testnet',
'value': 'testnet'},
{'selected': True,
'text': 'testnet-perf',
'value': 'testnet-perf'}],
'query': 'testnet,testnet-perf',
'options': [{'selected': True,
'text': 'Developer Testnet',
'value': 'devnet'},
{'selected': False,
'text': 'Mainnet Beta',
'value': 'mainnet-beta'},
{'selected': False,
'text': 'Tour de SOL Testnet',
'value': 'tds'},
{'selected': False,
'text': 'Soft Launch Testnet',
'value': 'cluster'}],
'query': 'devnet,mainnet-beta,tds,cluster',
'type': 'custom'},
{'allValue': ".*",
'datasource': '$datasource',
@ -114,10 +119,9 @@ elif channel == 'stable':
'type': 'query',
'useTags': False}]
else:
# Non-stable dashboard only allows the user to select between all testnet
# databases
data['title'] = 'Testnet Monitor ({})'.format(channel)
data['uid'] = 'testnet-' + channel
# Non-stable dashboard includes all the dev clusters
data['title'] = 'Cluster Telemetry ({})'.format(channel)
data['uid'] = 'monitor-' + channel
data['templating']['list'] = [{'current': {'text': '$datasource',
'value': '$datasource'},
'hide': 1,
@ -129,8 +133,8 @@ else:
'regex': '',
'type': 'datasource'},
{'allValue': ".*",
'current': {'text': 'testnet',
'value': 'testnet'},
'current': {'text': 'Developer Testnet',
'value': 'devnet'},
'datasource': '$datasource',
'hide': 1,
'includeAll': False,
@ -140,7 +144,7 @@ else:
'options': [],
'query': 'show databases',
'refresh': 1,
'regex': 'testnet.*',
'regex': '(devnet|cluster|tds|mainnet-beta|testnet.*)',
'sort': 1,
'tagValuesQuery': '',
'tags': [],
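
To see which databases the widened `regex` would let through, the pattern can be tried against a list of names. The sketch below is only an approximation: it assumes Grafana applies the pattern as an unanchored search and offers the first capture group as the option value, and the database names in `sample` are made up for illustration.

```python
import re

# Pattern copied from the updated 'regex' field of the non-stable dashboard.
CLUSTER_DB_REGEX = re.compile(r"(devnet|cluster|tds|mainnet-beta|testnet.*)")


def selectable_databases(databases):
    """Return the option values the variable would offer (assumed semantics)."""
    selected = []
    for name in databases:
        match = CLUSTER_DB_REGEX.search(name)  # unanchored search
        if match:
            selected.append(match.group(1))  # first capture group becomes the value
    return selected


if __name__ == "__main__":
    # Hypothetical result of `show databases`.
    sample = ["_internal", "devnet", "tds", "mainnet-beta",
              "cluster", "testnet-automation", "scratch"]
    print(selectable_databases(sample))
```

Because the pattern is not anchored, a name that merely contains one of the alternatives (for example a hypothetical `my-cluster-2`) would also match under these assumptions.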

View File

@ -27,21 +27,21 @@
"title": "Stable",
"tooltip": "",
"type": "link",
"url": "https://metrics.solana.com:3000/d/testnet/testnet-monitor"
"url": "https://metrics.solana.com:3000/d/monitor/cluster-telemetry"
},
{
"icon": "dashboard",
"tags": [],
"title": "Beta",
"type": "link",
"url": "https://metrics.solana.com:3000/d/testnet-beta/testnet-monitor-beta"
"url": "https://metrics.solana.com:3000/d/monitor-beta/cluster-telemetry-beta"
},
{
"icon": "dashboard",
"tags": [],
"title": "Edge",
"type": "link",
"url": "https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge"
"url": "https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge"
}
],
"panels": [
@ -4618,7 +4618,7 @@
},
"yaxes": [
{
"format": "µs",
"format": "\u00b5s",
"label": null,
"logBase": 1,
"max": null,
@ -5385,7 +5385,7 @@
},
"yaxes": [
{
"format": "µs",
"format": "\u00b5s",
"label": null,
"logBase": 1,
"max": null,
@ -5752,7 +5752,7 @@
},
"yaxes": [
{
"format": "µs",
"format": "\u00b5s",
"label": null,
"logBase": 1,
"max": null,
@ -6727,7 +6727,7 @@
},
"yaxes": [
{
"format": "µs",
"format": "\u00b5s",
"label": null,
"logBase": 1,
"max": null,
@ -10181,7 +10181,6 @@
"list": [
{
"current": {
"selected": true,
"text": "$datasource",
"value": "$datasource"
},
@ -10197,9 +10196,8 @@
{
"allValue": ".*",
"current": {
"selected": false,
"text": "testnet",
"value": "testnet"
"text": "Developer Testnet",
"value": "devnet"
},
"datasource": "$datasource",
"hide": 1,
@ -10210,7 +10208,7 @@
"options": [],
"query": "show databases",
"refresh": 1,
"regex": "testnet.*",
"regex": "(devnet|cluster|tds|mainnet-beta|testnet.*)",
"sort": 1,
"tagValuesQuery": "",
"tags": [],
@ -10269,7 +10267,7 @@
]
},
"timezone": "",
"title": "Testnet Monitor (edge)",
"uid": "testnet-edge",
"title": "Cluster Telemetry (edge)",
"uid": "monitor-edge",
"version": 2
}

View File

@ -34,7 +34,7 @@ source lib/config.sh
if [[ ! -f lib/grafana-provisioning ]]; then
cp -va grafana-provisioning lib
./adjust-dashboard-for-channel.py \
lib/grafana-provisioning/dashboards/testnet-monitor.json local
lib/grafana-provisioning/dashboards/cluster-monitor.json local
mkdir -p lib/grafana-provisioning/datasources
cat > lib/grafana-provisioning/datasources/datasource.yml <<EOF

185
system-test/automation_utils.sh Executable file
View File

@ -0,0 +1,185 @@
#!/usr/bin/env bash
# | source | this file
# shellcheck disable=SC1090
# shellcheck disable=SC1091
# shellcheck disable=SC2034
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
REPO_ROOT=${DIR}/..
source "${REPO_ROOT}"/ci/upload-ci-artifact.sh
function execution_step {
# shellcheck disable=SC2124
STEP="$@"
echo --- "${STEP[@]}"
}
function collect_logs {
execution_step "Collect logs from remote nodes"
rm -rf "${REPO_ROOT}"/net/log
"${REPO_ROOT}"/net/net.sh logs
for logfile in "${REPO_ROOT}"/net/log/*; do
(
upload-ci-artifact "$logfile"
)
done
}
function analyze_packet_loss {
(
set -x
# shellcheck disable=SC1091
source "${REPO_ROOT}"/net/config/config
mkdir -p iftop-logs
execution_step "Map private -> public IP addresses in iftop logs"
# shellcheck disable=SC2154
for i in "${!validatorIpList[@]}"; do
# shellcheck disable=SC2154
# shellcheck disable=SC2086
# shellcheck disable=SC2027
echo "{\"private\": \""${validatorIpListPrivate[$i]}""\", \"public\": \""${validatorIpList[$i]}""\"},"
done > ip_address_map.txt
for ip in "${validatorIpList[@]}"; do
"${REPO_ROOT}"/net/scp.sh ip_address_map.txt solana@"$ip":~/solana/
done
execution_step "Remotely post-process iftop logs"
# shellcheck disable=SC2154
for ip in "${validatorIpList[@]}"; do
iftop_log=iftop-logs/$ip-iftop.log
# shellcheck disable=SC2016
"${REPO_ROOT}"/net/ssh.sh solana@"$ip" 'PATH=$PATH:~/.cargo/bin/ ~/solana/scripts/iftop-postprocess.sh ~/solana/iftop.log temp.log ~solana/solana/ip_address_map.txt' > "$iftop_log"
upload-ci-artifact "$iftop_log"
done
execution_step "Analyzing Packet Loss"
"${REPO_ROOT}"/solana-release/bin/solana-log-analyzer analyze -f ./iftop-logs/ | sort -k 2 -g
)
}
function wait_for_bootstrap_validator_stake_drop {
max_stake="$1"
source "${REPO_ROOT}"/net/common.sh
loadConfigFile
while true; do
# shellcheck disable=SC2154
bootstrap_validator_validator_info="$(ssh "${sshOptions[@]}" "${validatorIpList[0]}" '$HOME/.cargo/bin/solana validators | grep "$($HOME/.cargo/bin/solana-keygen pubkey ~/solana/config/bootstrap-validator/identity-keypair.json)"')"
bootstrap_validator_stake_percentage="$(echo "$bootstrap_validator_validator_info" | awk '{gsub(/[\(,\),\%]/,""); print $9}')"
if [[ $(echo "$bootstrap_validator_stake_percentage < $max_stake" | bc) -ne 0 ]]; then
echo "Bootstrap validator stake has fallen below $max_stake to $bootstrap_validator_stake_percentage"
break
fi
echo "Max bootstrap validator stake: $max_stake. Current stake: $bootstrap_validator_stake_percentage. Sleeping 30s for stake to distribute."
sleep 30
done
}
function get_slot {
source "${REPO_ROOT}"/net/common.sh
loadConfigFile
ssh "${sshOptions[@]}" "${validatorIpList[0]}" '$HOME/.cargo/bin/solana slot'
}
function upload_results_to_slack() {
echo --- Uploading results to Slack Performance Results App
if [[ -z $SLACK_WEBHOOK_URL ]] ; then
echo "SLACK_WEBHOOK_URL undefined"
exit 1
fi
[[ -n $BUILDKITE_MESSAGE ]] || BUILDKITE_MESSAGE="Message not defined"
COMMIT=$(git rev-parse HEAD)
COMMIT_BUTTON_TEXT="$(echo "$COMMIT" | head -c 8)"
COMMIT_URL="https://github.com/solana-labs/solana/commit/${COMMIT}"
if [[ -n $BUILDKITE_BUILD_URL ]] ; then
BUILD_BUTTON_TEXT="Build Kite Job"
else
BUILD_BUTTON_TEXT="Build URL not defined"
BUILDKITE_BUILD_URL="https://buildkite.com/solana-labs/"
fi
GRAFANA_URL="https://metrics.solana.com:3000/d/monitor-${CHANNEL:-edge}/cluster-telemetry-${CHANNEL:-edge}?var-testnet=${TESTNET_TAG:-testnet-automation}&from=${TESTNET_START_UNIX_MSECS:-0}&to=${TESTNET_FINISH_UNIX_MSECS:-0}"
[[ -n $RESULT_DETAILS ]] || RESULT_DETAILS="Undefined"
[[ -n $TEST_CONFIGURATION ]] || TEST_CONFIGURATION="Undefined"
payLoad="$(cat <<EOF
{
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*$BUILDKITE_MESSAGE*"
}
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {
"type": "plain_text",
"text": "$COMMIT_BUTTON_TEXT",
"emoji": true
},
"url": "$COMMIT_URL"
},
{
"type": "button",
"text": {
"type": "plain_text",
"text": "$BUILD_BUTTON_TEXT",
"emoji": true
},
"url": "$BUILDKITE_BUILD_URL"
},
{
"type": "button",
"text": {
"type": "plain_text",
"text": "Grafana",
"emoji": true
},
"url": "$GRAFANA_URL"
}
]
},
{
"type": "divider"
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Test Configuration: \n\`\`\`$TEST_CONFIGURATION\`\`\`"
}
},
{
"type": "divider"
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Result Details: \n\`\`\`$RESULT_DETAILS\`\`\`"
}
}
]
}
EOF
)"
curl -X POST \
-H 'Content-type: application/json' \
--data "$payLoad" \
"$SLACK_WEBHOOK_URL"
}
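
The `GRAFANA_URL` line in `upload_results_to_slack` relies on `${VAR:-default}` expansion, which falls back to the default when the variable is unset or empty. A rough Python equivalent, offered as a sketch rather than as code from the repository (the `grafana_url` function is a hypothetical name), makes those defaults explicit:

```python
import os


def grafana_url(env):
    """Build the Grafana link the way the shell expansion does.

    `env` is any mapping of environment variables; unset or empty values
    fall back to the same defaults as `${VAR:-default}` in the script.
    """
    channel = env.get("CHANNEL") or "edge"
    testnet = env.get("TESTNET_TAG") or "testnet-automation"
    start = env.get("TESTNET_START_UNIX_MSECS") or "0"
    finish = env.get("TESTNET_FINISH_UNIX_MSECS") or "0"
    return (
        f"https://metrics.solana.com:3000/d/monitor-{channel}"
        f"/cluster-telemetry-{channel}"
        f"?var-testnet={testnet}&from={start}&to={finish}"
    )


if __name__ == "__main__":
    print(grafana_url(os.environ))
```

As in the shell original, `CHANNEL=stable` would yield `monitor-stable`, which differs from the `monitor` uid this commit assigns to the stable dashboard.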