[GH-PAGES] Updated website

Philippe Tillet
2021-03-06 22:06:32 -05:00
parent 6f789b29ab
commit 32aaf8b469
17 changed files with 400 additions and 275 deletions


@@ -20,7 +20,7 @@
Vector Addition
=================
In this tutorial, you will write a simple, high-performance vector addition using Triton and learn about:
In this tutorial, you will write a simple vector addition using Triton and learn about:
- The basic syntax of the Triton programming language
- The best practices for creating PyTorch custom operators using the :code:`triton.kernel` Python API
@@ -154,15 +154,22 @@ The only thing that matters when it comes to Triton and Torch is the :code:`trit
.. GENERATED FROM PYTHON SOURCE LINES 126-128
.. GENERATED FROM PYTHON SOURCE LINES 126-127
We can now use the above function to compute the sum of two `torch.tensor` objects:
.. GENERATED FROM PYTHON SOURCE LINES 129-133
Unit Test
--------------------------
.. GENERATED FROM PYTHON SOURCE LINES 128-137
Of course, the first thing that we should check is whether the kernel is correct. This is pretty easy to test, as shown below:
.. GENERATED FROM PYTHON SOURCE LINES 133-143
.. code-block:: default
torch.manual_seed(0)
x = torch.rand(98432, device='cuda')
y = torch.rand(98432, device='cuda')
@@ -189,52 +196,67 @@ Unit Test
.. GENERATED FROM PYTHON SOURCE LINES 138-141
.. GENERATED FROM PYTHON SOURCE LINES 144-145
Seems like we're good to go!
.. GENERATED FROM PYTHON SOURCE LINES 147-150
Benchmarking
--------------------------
We can now benchmark our custom op for vectors of increasing sizes to get a sense of how it does
We can now benchmark our custom op for vectors of increasing sizes to get a sense of how it does relative to PyTorch.
.. GENERATED FROM PYTHON SOURCE LINES 141-150
.. GENERATED FROM PYTHON SOURCE LINES 150-178
.. code-block:: default
warmup = 10
rep = 200
for N in [2**i for i in range(17, 26, 1)]:
x = torch.rand(N, device='cuda')
y = torch.rand(N, device='cuda')
triton_ms = triton.testing.do_bench(lambda: add(x, y), warmup=warmup, rep=rep)
torch_ms = triton.testing.do_bench(lambda: x + y, warmup=warmup, rep=rep)
# print the performance of triton and torch as well as the achieved bandwidth
print(f'{N} {triton_ms:.3f} {torch_ms:.3f}')
import matplotlib.pyplot as plt
# There are three tensors of 4N bytes each. So the bandwidth of a given kernel
# is 12N / time_ms * 1e-6 GB/s
gbps = lambda N, ms: 12 * N / ms * 1e-6
# We want to benchmark small and large vectors alike
sizes = [2**i for i in range(12, 25, 1)]
triton_bw = []
torch_bw = []
for N in sizes:
x = torch.rand(N, device='cuda', dtype=torch.float32)
y = torch.rand(N, device='cuda', dtype=torch.float32)
# Triton provides a do_bench utility function that can be used to benchmark
# arbitrary workloads. It supports a `warmup` parameter that is used to stabilize
# GPU clock speeds as well as a `rep` parameter that controls the number of times
# the benchmark is repeated. Importantly, we set `clear_l2 = True` to make sure
# that the L2 cache does not contain any element of x before each kernel call when
# N is small.
do_bench = lambda fn: gbps(N, triton.testing.do_bench(fn, warmup=10, rep=100, clear_l2=True))
triton_bw += [do_bench(lambda: add(x, y))]
torch_bw += [do_bench(lambda: x + y)]
# We plot the results as a semi-log
plt.semilogx(sizes, triton_bw, label='Triton')
plt.semilogx(sizes, torch_bw, label='Torch')
plt.legend()
plt.show()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
131072 0.022 0.006
262144 0.021 0.005
524288 0.022 0.017
1048576 0.037 0.037
2097152 0.074 0.073
4194304 0.144 0.143
8388608 0.289 0.285
16777216 0.566 0.562
33554432 1.131 1.121
.. image:: /getting-started/tutorials/images/sphx_glr_01-vector-add_001.png
:alt: 01 vector add
:class: sphx-glr-single-img
.. GENERATED FROM PYTHON SOURCE LINES 179-179
Seems like our simple element-wise operation operates at peak bandwidth. While this is a fairly low bar for a custom GPU programming language, this is a good start before we move to more advanced operations.
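As a rough sanity check on the bandwidth formula used in the benchmark, the sketch below applies it to one row of the timings printed by the earlier version of the script (N = 33554432, 1.131 ms for Triton and 1.121 ms for Torch). The helper name :code:`gbps` mirrors the one in the benchmark; the other variable names are illustrative only.

```python
def gbps(N, ms):
    # x and y are read and the result is written back:
    # three float32 tensors of 4N bytes each, i.e. 12N bytes in total.
    # bytes / (ms * 1e-3 s) * 1e-9 GB/byte == bytes / ms * 1e-6
    return 12 * N / ms * 1e-6


N = 2**25  # 33554432 elements, the largest size in the printed table
triton_gbps = gbps(N, 1.131)  # Triton timing taken from the printed table
torch_gbps = gbps(N, 1.121)   # Torch timing taken from the printed table

print(f"Triton: {triton_gbps:.0f} GB/s, Torch: {torch_gbps:.0f} GB/s")
```

Both implementations land within about one percent of each other at this size, which is why the two curves overlap for large vectors.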
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 3.225 seconds)
**Total running time of the script:** ( 0 minutes 4.784 seconds)
.. _sphx_glr_download_getting-started_tutorials_01-vector-add.py:


@@ -20,7 +20,7 @@
Fused Softmax
=================
In this tutorial, you will write a fused softmax layer that outperform's PyTorch implementation and learn about:
In this tutorial, you will write a fused softmax operation (that outperforms PyTorch) and learn about:
- The benefits of kernel fusion for bandwidth-bound operations.
- The syntax and usage of reduction operators in Triton.
@@ -67,15 +67,17 @@ Let us consider instead the case of a simple (numerically stabilized) softmax op
.. GENERATED FROM PYTHON SOURCE LINES 37-41
When implemented naively in PyTorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
Instead, we want to write a custom "fused" pytorch operators that only reads X once and does all the necessary computations on-chip.
This would require reading and writing back only :math:`MN` bytes, so we could expect a theoretical speed-up of 5x.
In practice, though, we expect less because our kernel will spend some time computing exponentials and moving data around in shared memory.
This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads X once and does all the necessary computations on-chip.
In this case, we would read only :math:`MN` elements and write back only :math:`MN` elements, so we could expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / 2MN`).
In practice, though, we would be getting a bit less as our kernel computes exponentials and internally moves data around in shared memory.
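The ~5x figure can be checked with a few lines of arithmetic. The sketch below only counts elements moved to and from DRAM, using the totals quoted above; the function names and the matrix shape are illustrative assumptions, not part of the tutorial.

```python
def naive_traffic(M, N):
    # Elements read (7MN) plus elements written (3MN + 2M) by the naive version.
    return 7 * M * N + (3 * M * N + 2 * M)


def fused_traffic(M, N):
    # A fused kernel reads X once (MN) and writes Y once (MN).
    return 2 * M * N


M, N = 4096, 1024  # illustrative shape
speedup = naive_traffic(M, N) / fused_traffic(M, N)
print(f"theoretical speed-up: {speedup:.3f}x")
```

Algebraically the ratio simplifies to :math:`5 + 1/N`, so the :math:`2M` term is negligible for any realistic number of columns.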
.. GENERATED FROM PYTHON SOURCE LINES 43-79
.. GENERATED FROM PYTHON SOURCE LINES 43-82
Compute Kernel
----------------------------
Our softmax kernel works as follows: each program loads a row of X and writes back a normalized row of Y. Note that one important limitation of Triton is that each block must have a power-of-two number of elements, which means that we need to guard the memory operations properly if we want to handle any possible input shapes:
----------------
Our softmax kernel works as follows: each program loads a row of the input X, normalizes it and writes back the result to the output Y.
Note that one important limitation of Triton is that each block must have a power-of-two number of elements,
so we need to internally "pad" tiles and guard the memory operations properly if we want to handle any possible input shapes:
.. code-block:: C
@@ -94,13 +96,14 @@ Our softmax kernel works as follows: each program loads a row of X and writes ba
bool check[BLOCK] = n < N;
float x [BLOCK] = check ? *px : -F32_INFINITY;
// syntax for reduction in Triton is:
// x[..., OPERATOR, ...]
// x[:, :, OPERATOR, :, :]
// ^
// index
// The operators currently supported are {min, max, +}
// where operator is in {min, max, +}
// for 1D vectors, this is just x[OPERATOR].
float z [BLOCK] = x - x[max];
// The exponential in Triton is fast but approximate
// (i.e., like __expf in CUDA)
// Note that exponentials in Triton are fast
// but approximate (i.e., think __expf in CUDA)
float num [BLOCK] = exp(z);
float denom = num[+];
// The result of the reduction is now stored in y
@@ -110,15 +113,15 @@ Our softmax kernel works as follows: each program loads a row of X and writes ba
*?(check)py = y;
}
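To see why padding with negative infinity is safe, here is a minimal pure-Python emulation of what one program instance does to one row. It is a sketch under stated assumptions rather than the Triton kernel itself: :code:`padded_softmax` and :code:`reference_softmax` are hypothetical names, and plain lists stand in for Triton blocks.

```python
import math


def padded_softmax(row):
    n = len(row)
    # Round the row length up to a power of two, as a Triton block requires.
    block = 1 << (n - 1).bit_length()
    # Guarded load: lanes past the end of the row see -inf instead of real data.
    x = [row[i] if i < n else -math.inf for i in range(block)]
    # Max reduction: -inf can never be the maximum of a non-empty row.
    z = [v - max(x) for v in x]
    # exp(-inf) is exactly 0.0, so padded lanes add nothing to the sum.
    num = [math.exp(v) for v in z]
    denom = sum(num)
    y = [v / denom for v in num]
    # Guarded store: only the first n lanes are written back.
    return y[:n]


def reference_softmax(row):
    m = max(row)
    num = [math.exp(v - m) for v in row]
    denom = sum(num)
    return [v / denom for v in num]


row = [0.5, -1.25, 3.0, 2.0, 0.0]  # 5 elements, padded to a block of 8
result = padded_softmax(row)
expected = reference_softmax(row)
print(all(math.isclose(a, b) for a, b in zip(result, expected)))
```

The padded lanes take part in both reductions but change neither of them, which is what the :code:`check` mask in the kernel relies on.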
.. GENERATED FROM PYTHON SOURCE LINES 81-86
.. GENERATED FROM PYTHON SOURCE LINES 84-89
Torch Bindings
----------------------------
We need to make sure that BLOCK is the smallest power of two
greater than the number of rows N of the input matrix.
Different values of BLOCK will result in different kernels
---------------
Here our torch bindings are quite similar to those of the vector addition from the previous tutorial.
We just need to make sure that BLOCK is the smallest power of two greater than or equal to the number of columns N of the input matrix.
This means that different values of BLOCK will result in different kernels.
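The effect of rounding on kernel reuse can be illustrated without a GPU. In the sketch below, :code:`fake_compile` is a hypothetical stand-in for the real compilation step, and :code:`next_power_of_2` uses :code:`int.bit_length` instead of the bit-twiddling version in the tutorial; both are assumptions made for illustration.

```python
def next_power_of_2(n):
    # Smallest power of two greater than or equal to n, for n >= 1.
    return 1 << (n - 1).bit_length()


compile_count = 0


def fake_compile(block, device):
    # Hypothetical stand-in for the real kernel compilation; it only counts calls.
    global compile_count
    compile_count += 1
    return f"kernel<BLOCK={block}, device={device}>"


_cache = {}


def make_kernel(N, device):
    # Kernels are keyed by the rounded column count and the device.
    key = (next_power_of_2(N), device)
    if key not in _cache:
        _cache[key] = fake_compile(*key)
    return _cache[key]


k1 = make_kernel(781, "cuda:0")   # BLOCK = 1024, compiled
k2 = make_kernel(1000, "cuda:0")  # BLOCK = 1024, served from the cache
k3 = make_kernel(1025, "cuda:0")  # BLOCK = 2048, compiled
print(compile_count)
```

Inputs whose column counts round up to the same power of two share one compiled kernel, so the number of compilations grows with the number of distinct BLOCK values rather than with the number of distinct shapes.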
.. GENERATED FROM PYTHON SOURCE LINES 86-149
.. GENERATED FROM PYTHON SOURCE LINES 89-156
.. code-block:: default
@@ -144,6 +147,7 @@ Different values of BLOCK will result in different kernels
"""
# helper function to get the smallest power of two greater than or equal to a given number
def next_power_of_2(n):
n -= 1
n |= n >> 1
@@ -155,16 +159,20 @@ Different values of BLOCK will result in different kernels
return n
_kernels = dict()
# kernel caching mechanism
def make_kernel(N, device):
cache = make_kernel.cache
# Now our kernels are indexed not only by the provided device but also
# by the rounded number of columns in the input matrix
BLOCK = next_power_of_2(N)
key = (BLOCK, device)
if key not in _kernels:
if key not in cache:
defines = {'BLOCK': BLOCK}
_kernels[key] = triton.kernel(_src, device=device, defines=defines)
return _kernels[key]
cache[key] = triton.kernel(_src, device=device, defines=defines)
return cache[key]
make_kernel.cache = dict()
class _softmax(torch.autograd.Function):
@@ -173,11 +181,10 @@ Different values of BLOCK will result in different kernels
# constraints of the op
assert x.dtype == torch.float32
y = torch.empty_like(x)
# *create launch grid*:
# here we just launch a grid of M programs
# The launch grid is simple: we have one kernel instance per row of the input matrix
M, N = y.shape
grid = lambda opt: (M, )
# *launch kernel*:
# Launch kernel
kernel = make_kernel(N, y.device)
kernel(y.data_ptr(), x.data_ptr(), y.stride(0), x.stride(0), M, N, grid=grid)
return y
@@ -192,21 +199,29 @@ Different values of BLOCK will result in different kernels
.. GENERATED FROM PYTHON SOURCE LINES 150-152
.. GENERATED FROM PYTHON SOURCE LINES 157-158
We can use the above softmax function to compute the row-wise softmax of a given matrix.
.. GENERATED FROM PYTHON SOURCE LINES 160-162
Unit Test
----------
.. GENERATED FROM PYTHON SOURCE LINES 152-160
.. GENERATED FROM PYTHON SOURCE LINES 164-166
We make sure that we test our kernel on a matrix with an irregular number of rows and columns.
This will allow us to verify that our padding mechanism works.
.. GENERATED FROM PYTHON SOURCE LINES 166-173
.. code-block:: default
torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
y_tri = softmax(x)
y_ref = torch.softmax(x, axis=1)
print(y_tri)
print(y_ref)
print(torch.allclose(y_tri, y_ref))
@@ -219,47 +234,23 @@ Unit Test
.. code-block:: none
tensor([[2.0935e-03, 6.4551e-04, 9.8605e-05, ..., 3.3981e-04, 2.7386e-03,
9.1986e-05],
[7.0923e-04, 6.7521e-04, 5.1366e-04, ..., 9.8392e-04, 2.6547e-04,
6.9062e-04],
[1.4032e-04, 5.8826e-04, 1.1694e-03, ..., 6.6423e-04, 1.8178e-04,
6.7049e-04],
...,
[1.1767e-03, 4.2703e-03, 6.0596e-04, ..., 9.5274e-04, 1.1681e-03,
6.4924e-04],
[1.0772e-04, 7.4854e-04, 3.1912e-03, ..., 2.4980e-04, 1.9012e-03,
5.2567e-04],
[2.8518e-03, 8.1899e-04, 7.7046e-04, ..., 1.3403e-03, 5.3167e-04,
4.3268e-04]], device='cuda:0')
tensor([[2.0935e-03, 6.4551e-04, 9.8605e-05, ..., 3.3981e-04, 2.7386e-03,
9.1986e-05],
[7.0923e-04, 6.7521e-04, 5.1366e-04, ..., 9.8392e-04, 2.6547e-04,
6.9062e-04],
[1.4032e-04, 5.8826e-04, 1.1694e-03, ..., 6.6423e-04, 1.8178e-04,
6.7049e-04],
...,
[1.1767e-03, 4.2703e-03, 6.0596e-04, ..., 9.5274e-04, 1.1681e-03,
6.4924e-04],
[1.0772e-04, 7.4854e-04, 3.1912e-03, ..., 2.4980e-04, 1.9012e-03,
5.2567e-04],
[2.8518e-03, 8.1899e-04, 7.7046e-04, ..., 1.3403e-03, 5.3167e-04,
4.3268e-04]], device='cuda:0')
True
.. GENERATED FROM PYTHON SOURCE LINES 161-162
.. GENERATED FROM PYTHON SOURCE LINES 174-175
Seems to work!
As expected, the results are identical.
.. GENERATED FROM PYTHON SOURCE LINES 164-166
.. GENERATED FROM PYTHON SOURCE LINES 177-181
Benchmarking
----------
-------------
Here we will benchmark our operation as a function of the number of columns in the input matrix -- assuming 4096 rows.
We will then compare its performance against (1) :code:`torch.softmax` and (2) the :code:`naive_softmax` defined above.
.. GENERATED FROM PYTHON SOURCE LINES 166-186
.. GENERATED FROM PYTHON SOURCE LINES 181-204
.. code-block:: default
@@ -267,25 +258,28 @@ Benchmarking
import matplotlib.pyplot as plt
M = 4096
Ns = [128 * i for i in range(2, 50)]
tri_ms = []
ref_ms = []
def_ms = []
Ns = [256 * i for i in range(2, 50)]
tri_bw = []
ref_bw = []
def_bw = []
for N in Ns:
x = torch.randn(M, N, device='cuda', dtype=torch.float32)
gbps = lambda ms: x.nelement() * x.element_size() * 1e-9 / (ms * 1e-3)
tri_ms += [gbps(triton.testing.do_bench(lambda: softmax(x)))]
ref_ms += [gbps(triton.testing.do_bench(lambda: torch.softmax(x, axis=1)))]
def_ms += [gbps(triton.testing.do_bench(lambda: naive_softmax(x)))]
do_bench = lambda fn: gbps(triton.testing.do_bench(fn, warmup=10, rep=100, clear_l2=True))
tri_bw += [do_bench(lambda: softmax(x))]
ref_bw += [do_bench(lambda: torch.softmax(x, axis=1))]
def_bw += [do_bench(lambda: naive_softmax(x))]
plt.xlabel('N')
plt.ylabel('Bandwidth (GB/s)')
plt.plot(Ns, tri_ms, label='Triton')
plt.plot(Ns, ref_ms, label='Torch')
plt.plot(Ns, def_ms, label='Naive')
plt.plot(Ns, tri_bw, label='Triton')
plt.plot(Ns, ref_bw, label='Torch')
plt.plot(Ns, def_bw, label='Naive')
plt.legend()
plt.show()
.. image:: /getting-started/tutorials/images/sphx_glr_02-fused-softmax_001.png
:alt: 02 fused softmax
:class: sphx-glr-single-img
@@ -294,10 +288,19 @@ Benchmarking
.. GENERATED FROM PYTHON SOURCE LINES 205-210
In the above plot, we can see that:
- Triton is 4-5x faster than the naive implementation, which is consistent with our theoretical predictions.
- Triton is significantly faster than :code:`torch.softmax` for very large input matrices. My guess from looking at the source code of the `PyTorch kernel <https://github.com/pytorch/pytorch/blob/9409a3a39b7149bb2d833a89e0c944109bef7c27/caffe2/operators/softmax_ops.cu#L240>`_ is that PyTorch only partially fuses the computation of the softmax.
This means that -- when temporary data is too large to fit entirely in the GPU's cache -- it transfers almost twice the amount of data necessary.
Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 5.758 seconds)
**Total running time of the script:** ( 0 minutes 33.773 seconds)
.. _sphx_glr_download_getting-started_tutorials_02-fused-softmax.py:


@@ -5,10 +5,10 @@
Computation times
=================
**00:08.983** total execution time for **getting-started_tutorials** files:
**00:33.773** total execution time for **getting-started_tutorials** files:
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``) | 00:05.758 | 0.0 MB |
| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``) | 00:33.773 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``) | 00:03.225 | 0.0 MB |
| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+