Fixup scripts to set up a new CI node (#9348)

* Clean up node setup scripts for new CI boxes
* Move files under ci directory
* Set CUDA env var to setup cuda drivers
* Fixup and add README
* shellcheck
* Apply review feedback, rename dir and setup files

Co-authored-by: publish-docs.sh <maintainers@solana.com>
2020-04-20 17:43:13 -06:00
parent 41fec5bd5b
commit 3fbe7f0bb3
19 changed files with 266 additions and 147 deletions
--- a/ci/README.md
+++ b/ci/README.md
@@ -2,7 +2,7 @@
 Our CI infrastructure is built around [BuildKite](https://buildkite.com) with some
 additional GitHub integration provided by https://github.com/mvines/ci-gate

-## Agent Queues
+# Agent Queues

 We define two [Agent Queues](https://buildkite.com/docs/agent/v3/queues):
 `queue=default` and `queue=cuda`.  The `default` queue should be favored and
@@ -12,9 +12,52 @@ be run on the `default` queue, and the [buildkite artifact
 system](https://buildkite.com/docs/builds/artifacts) used to transfer build
 products over to a GPU instance for testing.

-## Buildkite Agent Management
+# Buildkite Agent Management

-### Buildkite Azure Setup
+## Manual Node Setup for Colocated Hardware
+
+This section describes how to set up a new machine that does not have a
+pre-configured image with all the requirements installed.  It is intended for
+custom-built hardware at a colocation or office facility, and also works for
+vanilla Ubuntu cloud instances.
+
+### Pre-Requisites
+
+ - Install Ubuntu 18.04 LTS Server
+ - Log in as a local or remote user with `sudo` privileges
+
+### Install Core Requirements
+
+##### Non-GPU-enabled machines
+```bash
+sudo ./setup-new-buildkite-agent/setup-new-machine.sh
+```
+
+##### GPU-enabled machines
+ - 1 or more NVIDIA GPUs should be installed in the machine (tested with 2080Ti)
+```bash
+sudo CUDA=1 ./setup-new-buildkite-agent/setup-new-machine.sh
+```
+
+### Configure Node for Buildkite-agent based CI
+
+- Install `buildkite-agent` and set up its user environment with:
+```bash
+sudo ./setup-new-buildkite-agent/setup-buildkite.sh
+```
+- Copy the pubkey contents from `~buildkite-agent/.ssh/id_ecdsa.pub` and
+add the pubkey as an authorized SSH key on GitHub.
+- Edit `/etc/buildkite-agent/buildkite-agent.cfg` and/or `/etc/systemd/system/buildkite-agent@*` to match the desired configuration of the agent(s)
+- Copy `ejson` keys from another CI node at `/opt/ejson/keys/`
+to the same location on the new node.
+- Start the new agent(s) with `sudo systemctl enable --now buildkite-agent`
+
+# Reference
+
+This section contains details regarding previous CI setups that have been used,
+and that we may return to one day.
+
+## Buildkite Azure Setup

 Create a new Azure-based "queue=default" agent by running the following command:
 ```
@@ -35,7 +78,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
 2. Edit the tags field in /etc/buildkite-agent/buildkite-agent.cfg to `tags="queue=cuda,queue=default"`
   and decrease the value of the priority field by one

-#### Updating the CI Disk Image
+### Updating the CI Disk Image

 1. Create a new VM Instance as described above
 1. Modify it as required
@@ -48,12 +91,7 @@ Creating a "queue=cuda" agent follows the same process but additionally:
 1. Go to the `ci` resource group in the Azure portal and remove all resources
   with the XYZ name in them

-## Reference
-
-This section contains details regarding previous CI setups that have been used,
-and that we may return to one day.
-
-### Buildkite AWS CloudFormation Setup
+## Buildkite AWS CloudFormation Setup

 **AWS CloudFormation is currently inactive, although it may be restored in the
 future**
@@ -62,7 +100,7 @@ AWS CloudFormation can be used to scale machines up and down based on the
 current CI load.  If no machine is currently running it can take up to 60
 seconds to spin up a new instance, please remain calm during this time.

-#### AMI
+### AMI
 We use a custom AWS AMI built via https://github.com/solana-labs/elastic-ci-stack-for-aws/tree/solana/cuda.

 Use the following process to update this AMI as dependencies change:
@@ -84,13 +122,13 @@ The new AMI should also now be visible in your EC2 Dashboard.  Go to the desired
 AWS CloudFormation stack, update the **ImageId** field to the new AMI id, and
 *apply* the stack changes.

-### Buildkite GCP Setup
+## Buildkite GCP Setup

 CI runs on Google Cloud Platform via two Compute Engine Instance groups:
 `ci-default` and `ci-cuda`.  Autoscaling is currently disabled and the number of
 VM Instances in each group is manually adjusted.

-#### Updating a CI Disk Image
+### Updating a CI Disk Image

 Each Instance group has its own disk image, `ci-default-vX` and
 `ci-cuda-vY`, where *X* and *Y* are incremented each time the image is changed.