Fixup scripts to set up a new CI node (#9348)
* Clean up node setup scripts for new CI boxes
* Move files under ci directory
* Set CUDA env var to setup cuda drivers
* Fixup and add README
* shellcheck
* Apply review feedback, rename dir and setup files

Co-authored-by: publish-docs.sh <maintainers@solana.com>
ci/README.md
@@ -2,7 +2,7 @@
Our CI infrastructure is built around [BuildKite](https://buildkite.com) with some
additional GitHub integration provided by https://github.com/mvines/ci-gate

## Agent Queues
# Agent Queues

We define two [Agent Queues](https://buildkite.com/docs/agent/v3/queues):
`queue=default` and `queue=cuda`. The `default` queue should be favored and
@@ -12,9 +12,52 @@ be run on the `default` queue, and the [buildkite artifact
system](https://buildkite.com/docs/builds/artifacts) used to transfer build
products over to a GPU instance for testing.

## Buildkite Agent Management
# Buildkite Agent Management

### Buildkite Azure Setup
## Manual Node Setup for Colocated Hardware

This section describes how to set up a new machine that does not have a
pre-configured image with all the requirements installed. It is intended for
custom-built hardware at a colocation or office facility, and also works for
vanilla Ubuntu cloud instances.

### Pre-Requisites

- Install Ubuntu 18.04 LTS Server
- Log in as a local or remote user with `sudo` privileges

### Install Core Requirements

##### Non-GPU enabled machines
```bash
sudo ./setup-new-buildkite-agent/setup-new-machine.sh
```

##### GPU-enabled machines
- 1 or more NVIDIA GPUs should be installed in the machine (tested with 2080Ti)
```bash
sudo CUDA=1 ./setup-new-buildkite-agent/setup-new-machine.sh
```
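
The two invocations above differ only in the `CUDA=1` variable. As a rough sketch, a small helper could choose between them based on `lspci` output. The `setup_command` function and the idea of detecting GPUs through `lspci` are assumptions made for illustration; nothing here indicates that the setup scripts do this themselves.

```bash
# Hypothetical helper (not part of the repository): reads `lspci`-style output
# on stdin and prints the setup invocation that matches the hardware.
setup_command() {
  if grep -qi 'nvidia'; then
    # At least one NVIDIA device was listed, so request the CUDA driver setup
    echo 'sudo CUDA=1 ./setup-new-buildkite-agent/setup-new-machine.sh'
  else
    echo 'sudo ./setup-new-buildkite-agent/setup-new-machine.sh'
  fi
}
```

Running `lspci | setup_command` only prints the suggested command; it does not execute anything, so the operator still decides whether to run it.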

### Configure Node for Buildkite-agent based CI

- Install `buildkite-agent` and set up its user environment with:
```bash
sudo ./setup-new-buildkite-agent/setup-buildkite.sh
```
- Copy the pubkey contents from `~buildkite-agent/.ssh/id_ecdsa.pub` and
add the pubkey as an authorized SSH key on GitHub.
- Edit `/etc/buildkite-agent/buildkite-agent.cfg` and/or `/etc/systemd/system/buildkite-agent@*` to the desired configuration of the agent(s)
- Copy `ejson` keys from another CI node at `/opt/ejson/keys/`
to the same location on the new node.
- Start the new agent(s) with `sudo systemctl enable --now buildkite-agent`
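
Before starting the agent(s), it may help to confirm that the files named in the steps above are in place. The following is a minimal sketch of such a check. The `check_ci_node` function, its arguments, and the optional root prefix (used only so the check can be tried against a scratch directory) are hypothetical and not part of the repository's setup scripts.

```bash
# Hypothetical pre-flight check for the files referenced in the list above.
# Usage: check_ci_node <agent-home> [root-prefix]
#   agent-home   home directory of the buildkite-agent user
#   root-prefix  optional directory prepended to /etc and /opt paths (assumption,
#                for dry runs); leave empty on a real node
check_ci_node() {
  local agent_home="$1" root="${2:-}" missing=0

  # The SSH public key that gets registered on GitHub must exist and be non-empty
  if [ ! -s "$agent_home/.ssh/id_ecdsa.pub" ]; then
    echo "missing: $agent_home/.ssh/id_ecdsa.pub"
    missing=1
  fi

  # The agent configuration file must exist
  if [ ! -f "$root/etc/buildkite-agent/buildkite-agent.cfg" ]; then
    echo "missing: $root/etc/buildkite-agent/buildkite-agent.cfg"
    missing=1
  fi

  # The ejson keys directory must exist and contain at least one entry
  if [ -z "$(ls -A "$root/opt/ejson/keys" 2>/dev/null)" ]; then
    echo "missing: ejson keys under $root/opt/ejson/keys"
    missing=1
  fi

  return "$missing"
}
```

On a real node this could be called as `check_ci_node ~buildkite-agent`, likely with elevated privileges so the key directories are readable. It returns non-zero and lists what is absent if any step was skipped.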

# Reference

This section contains details regarding previous CI setups that have been used,
and that we may return to one day.

## Buildkite Azure Setup

Create a new Azure-based "queue=default" agent by running the following command:
```
@@ -35,7 +78,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
2. Edit the tags field in /etc/buildkite-agent/buildkite-agent.cfg to `tags="queue=cuda,queue=default"`
and decrease the value of the priority field by one
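
The edit described in step 2 could be scripted roughly as follows. This is a sketch under assumptions: the `make_cuda_agent` function is hypothetical, and it assumes the config file already contains uncommented `tags=` and `priority=` lines, with `priority` set to an unquoted integer. If either line is missing or commented out, the file is left unchanged for that field.

```bash
# Hypothetical helper: rewrite the tags field and lower the priority by one.
# Usage: make_cuda_agent <path-to-buildkite-agent.cfg>
make_cuda_agent() {
  local cfg="$1" tmp
  tmp=$(mktemp)
  awk '
    # Replace the whole tags line with the cuda + default queue tags
    /^tags=/     { print "tags=\"queue=cuda,queue=default\""; next }
    # Decrease an unquoted integer priority by one
    /^priority=/ { split($0, kv, "="); print "priority=" (kv[2] - 1); next }
    # Pass every other line through unchanged
    { print }
  ' "$cfg" > "$tmp" && cat "$tmp" > "$cfg"   # write back in place, keeping file ownership and mode
  rm -f "$tmp"
}
```

On an agent this would be run as `make_cuda_agent /etc/buildkite-agent/buildkite-agent.cfg` with sufficient privileges; reviewing the resulting file before restarting the agent is still advisable.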

#### Updating the CI Disk Image
### Updating the CI Disk Image

1. Create a new VM Instance as described above
1. Modify it as required
@@ -48,12 +91,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
1. Go to the `ci` resource group in the Azure portal and remove all resources
with the XYZ name in them

## Reference

This section contains details regarding previous CI setups that have been used,
and that we may return to one day.

### Buildkite AWS CloudFormation Setup
## Buildkite AWS CloudFormation Setup

**AWS CloudFormation is currently inactive, although it may be restored in the
future**
@@ -62,7 +100,7 @@ AWS CloudFormation can be used to scale machines up and down based on the
current CI load. If no machine is currently running, it can take up to 60
seconds to spin up a new instance; please remain calm during this time.

#### AMI
### AMI
We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.

Use the following process to update this AMI as dependencies change:
@@ -84,13 +122,13 @@ The new AMI should also now be visible in your EC2 Dashboard. Go to the desired
AWS CloudFormation stack, update the **ImageId** field to the new AMI id, and
*apply* the stack changes.

### Buildkite GCP Setup
## Buildkite GCP Setup

CI runs on Google Cloud Platform via two Compute Engine Instance groups:
`ci-default` and `ci-cuda`. Autoscaling is currently disabled and the number of
VM Instances in each group is manually adjusted.

#### Updating a CI Disk Image
### Updating a CI Disk Image

Each Instance group has its own disk image, `ci-default-vX` and
`ci-cuda-vY`, where *X* and *Y* are incremented each time the image is changed.
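
To illustrate the naming scheme, the next image name can be derived from the current one with plain shell parameter expansion. The `next_image_name` helper and the version numbers used with it are hypothetical examples, not names taken from the actual GCP project.

```bash
# Hypothetical helper: given an image name ending in -v<number>,
# print the name with the version number incremented by one.
next_image_name() {
  local current="$1"
  local prefix="${current%-v*}"    # everything before the final "-v<number>"
  local version="${current##*-v}"  # the number after the final "-v"
  echo "${prefix}-v$((version + 1))"
}
```

For example, `next_image_name ci-default-v7` prints `ci-default-v8`.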