[DOCS] Re-structured documentation hierarchy

Philippe Tillet
2021-03-06 17:26:49 -05:00
parent ca04da3575
commit 92242ace2c
8 changed files with 68 additions and 222 deletions

View File

@@ -35,11 +35,13 @@ autosectionlabel_prefix_document = True
# Sphinx gallery
extensions += ['sphinx_gallery.gen_gallery']
from sphinx_gallery.sorting import FileNameSortKey
sphinx_gallery_conf = {
'examples_dirs': '../python/tutorials/',
'gallery_dirs': 'tutorials',
'gallery_dirs': 'getting-started/tutorials',
'filename_pattern': '',
'ignore_pattern': r'__init__\.py',
'within_subsection_order': FileNameSortKey,
}
# Add any paths that contain templates here, relative to this directory.

View File

@@ -1,6 +1,21 @@
==============
Installation
==============
--------------
With Pip
--------------
Triton can be installed directly with pip using the following command:
.. code-block:: bash
   pip install triton
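To quickly verify the result, the following minimal check should run without errors (a sanity check only; it assumes a working PyTorch installation is already present):
.. code-block:: python
   import torch
   import triton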
--------------
From Source
==============
--------------
+++++++++++++++
Python Package

View File

@@ -1,20 +1,18 @@
.. Triton documentation master file, created by
sphinx-quickstart on Mon Feb 10 01:01:37 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Triton's documentation!
==================================
.. toctree::
:maxdepth: 1
:caption: Installation Instructions
Triton is an imperative language and compiler for parallel programming. It aims to provide a programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware.
installation/packaged-binaries
installation/from-source
Getting Started
---------------
- Follow the :doc:`installation instructions <getting-started/installation>` for your platform of choice.
- Take a look at the :doc:`tutorials <getting-started/tutorials/index>` to learn how to write your first Triton program.
.. toctree::
:maxdepth: 1
:caption: Installation Instructions
:caption: Getting Started
:hidden:
tutorials/index
getting-started/installation
getting-started/tutorials/index

View File

@@ -1,8 +0,0 @@
Packaged Binaries
=================
Triton can be installed directly from pip with the following command
.. code-block:: python
pip install triton

View File

@@ -1,158 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Vector Addition\nIn this tutorial, we will see how to construct a simple, high-performance vector addition using Triton. You will learn:\n* The basic syntax of the Triton programming language\n* The best practices for creating PyTorch custom operators using the `triton.kernel` Python API\n* The best practices for validating and benchmarking custom ops against native reference implementations\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing the Compute Kernel\n\nEach compute kernel is declared using the :code:`__global__` attribute, and executed many times in parallel\non different chunks of data (See the `Single Program, Multiple Data <(https://en.wikipedia.org/wiki/SPMD>`_)\nprogramming model for more details).\n\n .. code-block:: C\n\n __global__ void add(float* z, float* x, float* y, int N){\n // The `get_program_id(i)` returns the i-th coordinate\n // of the program in the overaching SPMD context\n // (a.k.a launch grid). This is what allows us to process\n // different chunks of data in parallel.\n // For those similar with CUDA, `get_program_id({0,1,2})`\n // is similar to blockIdx.{x,y,z}\n int pid = get_program_id(0);\n // In Triton, arrays are first-class citizen. In other words,\n // they are primitives data-types and are -- contrary to C and\n // CUDA -- not implemented as pointers to contiguous chunks of\n // memory.\n // In the few lines below, we create an array of `BLOCK` pointers\n // whose memory values are, e.g.:\n // [z + pid*BLOCK + 0, z + pid*BLOCK + 1, ..., z + pid*BLOCK + BLOCK - 1]\n // Note: here BLOCK is expected to be a pre-processor macro defined at compile-time\n int offset[BLOCK] = pid * BLOCK + 0 ... BLOCK;\n float* pz [BLOCK] = z + offset;\n float* px [BLOCK] = x + offset;\n float* py [BLOCK] = y + offset;\n // Simple element-wise control-flow for load/store operations can\n // be achieved using the the ternary operator `cond ? val_true : val_false`\n // or the conditional dereferencing operator `*?(cond)ptr\n // Here, we make sure that we do not access memory out-of-bounds when we\n // write-back `z`\n bool check[BLOCK] = offset < N;\n *?(check)pz = *?(check)px + *?(check)py;\n }\n\nThe existence of arrays as a primitive data-type for Triton comes with a number of advantages that are highlighted in the `MAPL'2019 Triton paper <http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf>`_.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing the Torch bindings\nThe only thing that matters when it comes to Triton and Torch is the `triton.kernel` class. This allows you to transform the above C-like function into a callable python object that can be used to modify `torch.tensor` objects.\n\nTo create a `triton.kernel`, you only need three things:\n- `source: string`: the source-code of the kernel you want to create\n- `device: torch.device`: the device you want to compile this code for\n- `defines: dict`: the set of macros that you want the pre-processor to `#define` for you\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport triton"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"source-code for Triton compute kernel\nhere we just copy-paste the above code without the extensive comments.\nyou may prefer to store it in a .c file and load it from there instead.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"_src = \"\"\"\n__global__ void add(float* z, float* x, float* y, int N){\n // program id\n int pid = get_program_id(0);\n // create arrays of pointers\n int offset[BLOCK] = pid * BLOCK + 0 ... BLOCK;\n float* pz[BLOCK] = z + offset;\n float* px[BLOCK] = x + offset;\n float* py[BLOCK] = y + offset;\n // bounds checking\n bool check[BLOCK] = offset < N;\n // write-back\n *?(check)pz = *?(check)px + *?(check)py;\n}\n \"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This function returns a callable `triton.kernel` object\ncreated from the above source code.\nFor portability, we maintain a cache of kernels for different `torch.device`\nWe compile the kernel with -DBLOCK=1024\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def make_add_kernel(device):\n cache = make_add_kernel.cache\n if device not in cache:\n defines = {'BLOCK': 1024}\n cache[device] = triton.kernel(_src, device=device, defines=defines)\n return cache[device]\n\n\nmake_add_kernel.cache = dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a standard torch custom autograd Function\nThe only difference is that we can now use the above kernel\nin the `forward` and `backward` functions.`\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class _add(torch.autograd.Function):\n @staticmethod\n def forward(ctx, x, y):\n # constraints of the op\n assert x.dtype == torch.float32\n # *allocate output*\n z = torch.empty_like(x)\n # *create launch grid*:\n # this is a function which takes compilation parameters `opt`\n # as input and returns a tuple of int (i.e., launch grid) for the kernel.\n # triton.cdiv is a shortcut for ceil division:\n # triton.cdiv(a, b) = (a + b - 1) // b\n N = z.shape[0]\n grid = lambda opt: (triton.cdiv(N, opt.BLOCK), )\n # *launch kernel*:\n # pointer to the data of torch tensors can be retrieved with\n # the `.data_ptr()` method\n kernel = make_add_kernel(z.device)\n kernel(z.data_ptr(), x.data_ptr(), y.data_ptr(), N, grid=grid)\n return z\n\n\n# Just like we standard PyTorch ops We use the `.apply` method to create a callable object for our function\nadd = _add.apply"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing a Unit Test\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"torch.manual_seed(0)\nx = torch.rand(98432, device='cuda')\ny = torch.rand(98432, device='cuda')\nza = x + y\nzb = add(x, y)\nprint(za)\nprint(zb)\nprint(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing a Benchmark\nWe can now benchmark our custom op for vectors of increasing sizes to get a sense of how it does\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"warmup = 10\nrep = 200\nfor N in [2**i for i in range(17, 26, 1)]:\n x = torch.rand(N, device='cuda')\n y = torch.rand(N, device='cuda')\n triton_ms = triton.testing.do_bench(lambda: add(x, y), warmup=warmup, rep=rep)\n torch_ms = triton.testing.do_bench(lambda: x + y, warmup=warmup, rep=rep)\n # print the performance of triton and torch as well as the achieved bandwidth\n print(f'{N} {triton_ms:.3f} {torch_ms:.3f}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -1,14 +1,15 @@
"""
Vector Addition
=================
In this tutorial, we will see how to construct a simple, high-performance vector addition using Triton. You will learn:
* The basic syntax of the Triton programming language
* The best practices for creating PyTorch custom operators using the `triton.kernel` Python API
* The best practices for validating and benchmarking custom ops against native reference implementations
In this tutorial, you will write a simple, high-performance vector addition using Triton and learn about:
- The basic syntax of the Triton programming language
- The best practices for creating PyTorch custom operators using the :code:`triton.kernel` Python API
- The best practices for validating and benchmarking custom ops against native reference implementations
"""
# %%
# Writing the Compute Kernel
# Compute Kernel
# --------------------------
#
# Each compute kernel is declared using the :code:`__global__` attribute, and executed many times in parallel
@@ -49,23 +50,20 @@ In this tutorial, we will see how to construct a simple, high-performance vector
# The existence of arrays as a primitive data-type for Triton comes with a number of advantages that are highlighted in the `MAPL'2019 Triton paper <http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf>`_.
# %%
# Writing the Torch bindings
# Torch bindings
# --------------------------
# The only thing that matters when it comes to Triton and Torch is the `triton.kernel` class. This allows you to transform the above C-like function into a callable python object that can be used to modify `torch.tensor` objects.
# The only thing that matters when it comes to Triton and Torch is the :code:`triton.kernel` class. This allows you to transform the above C-like function into a callable Python object that can be used to modify :code:`torch.tensor` objects. To create a :code:`triton.kernel`, you only need three things (see the sketch after this list):
#
# To create a `triton.kernel`, you only need three things:
# - `source: string`: the source-code of the kernel you want to create
# - `device: torch.device`: the device you want to compile this code for
# - `defines: dict`: the set of macros that you want the pre-processor to `#define` for you
# - :code:`source: string`: the source-code of the kernel you want to create
# - :code:`device: torch.device`: the device you want to compile this code for
# - :code:`defines: dict`: the set of macros that you want the pre-processor to `#define` for you
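For illustration, a minimal sketch of such a call (hypothetical one-off usage; it relies on the imports and the :code:`_src` string defined below, and assumes a CUDA device and the BLOCK value used later in this tutorial):
kernel = triton.kernel(_src, device=torch.device('cuda'), defines={'BLOCK': 1024})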
import torch
import triton
# %%
# source-code for Triton compute kernel
# here we just copy-paste the above code without the extensive comments.
# you may prefer to store it in a .c file and load it from there instead.
_src = """
__global__ void add(float* z, float* x, float* y, int N){
// program id
@@ -82,13 +80,10 @@ __global__ void add(float* z, float* x, float* y, int N){
}
"""
# %%
# This function returns a callable `triton.kernel` object
# created from the above source code.
# This function returns a callable `triton.kernel` object created from the above source code.
# For portability, we maintain a cache of kernels for different `torch.device`
# We compile the kernel with -DBLOCK=1024
def make_add_kernel(device):
cache = make_add_kernel.cache
if device not in cache:
@@ -99,12 +94,9 @@ def make_add_kernel(device):
make_add_kernel.cache = dict()
# %%
# This is a standard torch custom autograd Function
# The only difference is that we can now use the above kernel
# in the `forward` and `backward` functions.`
# This is a standard torch custom autograd Function;
# the only difference is that we can now use the above kernel in the `forward` and `backward` functions.
class _add(torch.autograd.Function):
@staticmethod
def forward(ctx, x, y):
@@ -127,11 +119,11 @@ class _add(torch.autograd.Function):
return z
# Just like we standard PyTorch ops We use the `.apply` method to create a callable object for our function
# Just like with standard PyTorch ops, we use the :code:`.apply` method to create a callable object for our function.
add = _add.apply
# %%
# Writing a Unit Test
# Unit Test
# --------------------------
torch.manual_seed(0)
x = torch.rand(98432, device='cuda')
@@ -143,7 +135,7 @@ print(zb)
print(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')
# %%
# Writing a Benchmark
# Benchmarking
# --------------------------
# We can now benchmark our custom op for vectors of increasing sizes to get a sense of how it does.
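A sketch of such a benchmark loop, adapted from the deleted notebook above (the effective-bandwidth computation is an added illustration, not part of the original file):
warmup = 10
rep = 200
for N in [2**i for i in range(17, 26)]:
    x = torch.rand(N, device='cuda')
    y = torch.rand(N, device='cuda')
    triton_ms = triton.testing.do_bench(lambda: add(x, y), warmup=warmup, rep=rep)
    torch_ms = triton.testing.do_bench(lambda: x + y, warmup=warmup, rep=rep)
    # each add reads two float32 vectors and writes one: 3 * N * 4 bytes of traffic
    gbps = lambda ms: 3 * N * 4 * 1e-9 / (ms * 1e-3)
    print(f'{N:>9}  triton: {gbps(triton_ms):6.2f} GB/s  torch: {gbps(torch_ms):6.2f} GB/s')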

View File

@@ -1,16 +1,23 @@
"""
Fused Softmax
=================
In this tutorial, you will write a fused softmax layer that outperforms PyTorch's native implementation and learn about:
- The benefits of kernel fusion for bandwidth-bound operations.
- The syntax and usage of reduction operators in Triton.
- The automatic vectorization capabilities of the Triton compiler.
"""
# %%
# Motivations
# ------------
# Custom GPU kernels for elementwise additions are educationally valuable but won't get you very far in practice.
# Let us consider instead the case of a simple (numerically stabilized) softmax operation:
import torch
# Compute the row-wise softmax of x \in R^{M \times N}
# Compute the row-wise softmax of x
def naive_softmax(x):
# read MN elements ; write M elements
x_max = torch.max(x, axis=1)[0]
@@ -27,11 +34,13 @@ def naive_softmax(x):
# %%
# When implemented naively in pytorch, computing :math:`y` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
# Instead, we want to write a custom "fused" pytorch operators that only reads X once and does all the necessary computations on-chip. This would require reading and writing back only :math:`MN` bytes, so we could expect a theoretical speed-up of 5x. In practice, though, we expect less because our kernel will spend some time computing exponentials and moving data around in shared memory.
# When implemented naively in PyTorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
# Instead, we want to write a custom "fused" PyTorch operator that only reads X once and does all the necessary computations on-chip.
# This would require reading and writing back only :math:`MN` elements each way, so we could expect a theoretical speed-up of about 5x (:math:`(7MN + 3MN + 2M) / 2MN`).
# In practice, though, we expect less because our kernel will spend some time computing exponentials and moving data around in shared memory.
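For reference, a complete naive implementation with the per-line memory traffic annotated (a sketch consistent with the figures above; broadcast operands are counted as full M x N reads, and the name is chosen so as not to clash with the function defined earlier):
def naive_softmax_annotated(x):
    # read MN elements ; write M elements
    x_max = torch.max(x, axis=1)[0]
    # read MN + MN (broadcast) elements ; write MN elements
    z = x - x_max[:, None]
    # read MN elements ; write MN elements
    numerator = torch.exp(z)
    # read MN elements ; write M elements
    denominator = torch.sum(numerator, axis=1)
    # read MN + MN (broadcast) elements ; write MN elements
    ret = numerator / denominator[:, None]
    # in total: 7MN elements read, 3MN + 2M elements written
    return ret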
# %%
# Writing the Compute Kernel
# Compute Kernel
# ----------------------------
# Our softmax kernel works as follows: each program loads a row of X and writes back a normalized row of Y. Note that one important limitation of Triton is that each block must have a power-of-two number of elements, which means that we need to guard the memory operations properly if we want to handle any possible input shapes:
#
@@ -69,14 +78,16 @@ def naive_softmax(x):
# }
# %%
# Writing the Compute Kernel
# Torch Bindings
# ----------------------------
# We need to make sure that BLOCK is the smallest power of two
# greater than the number of columns N of the input matrix.
# Different values of BLOCK will result in different kernels
import torch
import triton
# %%
# source-code for Triton compute kernel
# Source code for the Triton kernel
_src = """
__global__ void softmax(float* Y, float* X, int stride_ym, int stride_xm, int M, int N){
int m = get_program_id(0);
@@ -94,12 +105,6 @@ __global__ void softmax(float* Y, float* X, int stride_ym, int stride_xm, int M,
"""
# %%
# Writing the Torch bindings
# ----------------------------
# We need to make sure that BLOCK is the smallest power of two
# greater than the number of rows N of the input matrix.
# Different values of BLOCK will result in different kernels
def next_power_of_2(n):
n -= 1
n |= n >> 1
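The helper is cut short by the hunk boundary; a complete version, assuming it continues with the standard bit-twiddling idiom:
def next_power_of_2(n):
    # smallest power of two greater than or equal to n
    n -= 1
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    n += 1
    return n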
@@ -156,7 +161,7 @@ print(torch.allclose(y_tri, y_ref))
# Seems to work!
# %%
# Benchmark
# Benchmarking
# ------------
import matplotlib.pyplot as plt
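A sketch of the kind of benchmark this section builds, comparing against torch.softmax with triton.testing.do_bench (the softmax wrapper name, the shapes, and the plotting details are illustrative assumptions, not taken from the original file):
M = 4096
Ns = [256 * i for i in range(2, 50)]
triton_gbps, torch_gbps = [], []
for N in Ns:
    x = torch.randn(M, N, device='cuda', dtype=torch.float32)
    # effective bandwidth: the fused kernel reads and writes M*N float32 values once each
    gbps = lambda ms: 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)
    triton_gbps.append(gbps(triton.testing.do_bench(lambda: softmax(x), warmup=10, rep=100)))
    torch_gbps.append(gbps(triton.testing.do_bench(lambda: torch.softmax(x, dim=1), warmup=10, rep=100)))
plt.plot(Ns, triton_gbps, label='Triton')
plt.plot(Ns, torch_gbps, label='Torch')
plt.xlabel('N')
plt.ylabel('GB/s')
plt.legend()
plt.show()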

View File

@@ -1,4 +1,4 @@
Triton Tutorials
Tutorials
==================
Below is a gallery of tutorials to help you get started with Triton.
Below is a gallery of tutorials for writing various basic operations with Triton. It is recommended that you read through the tutorials in order, starting with the simplest one.