[GH-PAGES] Updated website

Philippe Tillet
2021-03-06 22:06:32 -05:00
parent 6f789b29ab
commit 32aaf8b469
17 changed files with 400 additions and 275 deletions


@@ -20,7 +20,7 @@
Fused Softmax
=================
In this tutorial, you will write a fused softmax layer that outperforms PyTorch's implementation and learn about:
In this tutorial, you will write a fused softmax operation (that outperforms PyTorch) and learn about:
- The benefits of kernel fusion for bandwidth-bound operations.
- The syntax and usage of reduction operators in Triton.
@@ -67,15 +67,17 @@ Let us consider instead the case of a simple (numerically stabilized) softmax op
.. GENERATED FROM PYTHON SOURCE LINES 37-41
When implemented naively in pytorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
Instead, we want to write a custom "fused" pytorch operators that only reads X once and does all the necessary computations on-chip.
This would require reading and writing back only :math:`MN` bytes, so we could expect a theoretical speed-up of 5x.
In practice, though, we expect less because our kernel will spend some time computing exponentials and moving data around in shared memory.
This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads X once and does all the necessary computations on-chip.
In this case, we would be reading and writing back only :math:`MN` elements each, so we could expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / (2MN)`).
In practice, though, we would be getting a bit less as our kernel computes exponentials and internally moves data around in shared memory.
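As a rough check of this estimate, the traffic ratio can be evaluated for a concrete shape. This is only a sketch: the function name :code:`traffic_ratio` and the shape used are illustrative assumptions, while the element counts are the ones quoted in the text.

```python
def traffic_ratio(M, N):
    """Ratio of DRAM traffic (in elements) between the naive and the fused softmax."""
    naive = 7 * M * N + (3 * M * N + 2 * M)  # reads + writes quoted in the text
    fused = M * N + M * N                    # read X once, write Y once
    return naive / fused


# Illustrative shape only; the ratio approaches 5 as N grows.
print(traffic_ratio(4096, 1024))
```

The :math:`2M` term is negligible for any realistic number of columns, which is why the estimate is usually quoted as a flat 5x.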
.. GENERATED FROM PYTHON SOURCE LINES 43-79
.. GENERATED FROM PYTHON SOURCE LINES 43-82
Compute Kernel
----------------------------
Our softmax kernel works as follows: each program loads a row of X and writes back a normalized row of Y. Note that one important limitation of Triton is that each block must have a power-of-two number of elements, which means that we need to guard the memory operations properly if we want to handle any possible input shapes:
----------------
Our softmax kernel works as follows: each program loads a row of the input X, normalizes it and writes back the result to the output Y.
Note that one important limitation of Triton is that each block must have a power-of-two number of elements,
so we need to internally "pad" tiles and guard the memory operations properly if we want to handle any possible input shapes:
.. code-block:: C
@@ -94,13 +96,14 @@ Our softmax kernel works as follows: each program loads a row of X and writes ba
bool check[BLOCK] = n < N;
float x [BLOCK] = check ? *px : -F32_INFINITY;
// syntax for reduction in Triton is:
// x[..., OPERATOR, ...]
// x[:, :, OPERATOR, :, :]
// ^
// index
// The operators currently supported are {min, max, +}
// where operator is in {min, max, +}
// for 1D vectors, this is just x[OPERATOR].
float z [BLOCK] = x - x[max];
// The exponential in Triton is fast but approximate
// (i.e., like __expf in CUDA)
// Note that exponentials in Triton are fast
// but approximate (i.e., think __expf in CUDA)
float num [BLOCK] = exp(z);
float denom = num[+];
// The result of the reduction is now stored in y
@@ -110,15 +113,15 @@ Our softmax kernel works as follows: each program loads a row of X and writes ba
*?(check)py = y;
}
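The padding-and-masking logic of the kernel can be emulated in plain Python to see why filling out-of-bounds lanes with negative infinity is harmless. This is a minimal sketch, not the Triton kernel itself; :code:`padded_row_softmax` is a hypothetical name and the row is assumed to be non-empty.

```python
import math


def padded_row_softmax(row, block):
    """Softmax of one row, computed on a tile padded to `block` elements."""
    n = len(row)
    assert block >= n and block & (block - 1) == 0, "block must be a power of two >= len(row)"
    # Out-of-bounds lanes are filled with -inf, mirroring `check ? *px : -F32_INFINITY`.
    x = [row[i] if i < n else -math.inf for i in range(block)]
    row_max = max(x)                          # plays the role of x[max]
    num = [math.exp(v - row_max) for v in x]  # padded lanes contribute exp(-inf) == 0.0
    denom = sum(num)                          # plays the role of num[+]
    # Only the first n lanes are written back, like the masked store.
    return [num[i] / denom for i in range(n)]


print(padded_row_softmax([1.0, 2.0, 3.0], block=4))
```

Because the padded lanes never win the maximum and add exactly zero to the denominator, the result matches a softmax over the unpadded row.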
.. GENERATED FROM PYTHON SOURCE LINES 81-86
.. GENERATED FROM PYTHON SOURCE LINES 84-89
Torch Bindings
----------------------------
We need to make sure that BLOCK is the smallest power of two
greater than the number of rows N of the input matrix.
Different values of BLOCK will result in different kernels
---------------
Here our torch bindings are quite similar to those of the vector addition mentioned in the previous tutorial.
We just need to make sure that BLOCK is the smallest power of two greater than or equal to the number of columns N of the input matrix.
This means that different values of BLOCK will result in different kernels.
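One compact way to obtain such a BLOCK value is sketched here. It is an alternative written for illustration (using :code:`int.bit_length`), not the bit-twiddling helper from the tutorial source, and the name :code:`next_pow2_sketch` is hypothetical.

```python
def next_pow2_sketch(n):
    """Smallest power of two greater than or equal to n, for a positive integer n."""
    assert n >= 1
    return 1 << (n - 1).bit_length()


for n in (1, 781, 1024, 1025):
    print(n, next_pow2_sketch(n))
```

Every N that rounds up to the same power of two shares one BLOCK value, and therefore one compiled kernel.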
.. GENERATED FROM PYTHON SOURCE LINES 86-149
.. GENERATED FROM PYTHON SOURCE LINES 89-156
.. code-block:: default
@@ -144,6 +147,7 @@ Different values of BLOCK will result in different kernels
"""
# helper function to get the smallest power of two greater than or equal to a given number
def next_power_of_2(n):
n -= 1
n |= n >> 1
@@ -155,16 +159,20 @@ Different values of BLOCK will result in different kernels
return n
_kernels = dict()
# kernel caching mechanism
def make_kernel(N, device):
cache = make_kernel.cache
# Now our kernels are indexed not only by the provided device but also
# by the rounded number of columns in the input matrix
BLOCK = next_power_of_2(N)
key = (BLOCK, device)
if key not in _kernels:
if key not in cache:
defines = {'BLOCK': BLOCK}
_kernels[key] = triton.kernel(_src, device=device, defines=defines)
return _kernels[key]
cache[key] = triton.kernel(_src, device=device, defines=defines)
return cache[key]
make_kernel.cache = dict()
class _softmax(torch.autograd.Function):
@@ -173,11 +181,10 @@ Different values of BLOCK will result in different kernels
# constraints of the op
assert x.dtype == torch.float32
y = torch.empty_like(x)
# *create launch grid*:
# here we just launch a grid of M programs
# The launch grid is simple: we have one kernel instance per row of the input matrix
M, N = y.shape
grid = lambda opt: (M, )
# *launch kernel*:
# Launch kernel
kernel = make_kernel(N, y.device)
kernel(y.data_ptr(), x.data_ptr(), y.stride(0), x.stride(0), M, N, grid=grid)
return y
@@ -192,21 +199,29 @@ Different values of BLOCK will result in different kernels
.. GENERATED FROM PYTHON SOURCE LINES 150-152
.. GENERATED FROM PYTHON SOURCE LINES 157-158
We can use the above softmax function to compute the row-wise softmax of a given matrix.
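The caching scheme used by :code:`make_kernel` can be tried in isolation. In this sketch :code:`fake_compile` is a hypothetical stand-in for the real compilation step, and :code:`get_kernel` and :code:`round_up_pow2` are illustrative names; only the keying by (BLOCK, device) follows the listing.

```python
def round_up_pow2(n):
    # Assumed contract: smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()


def fake_compile(block, device):
    # Hypothetical stand-in for compiling a kernel specialized on BLOCK.
    return object()


def get_kernel(N, device):
    cache = get_kernel.cache
    key = (round_up_pow2(N), device)
    if key not in cache:
        cache[key] = fake_compile(*key)
    return cache[key]


# The cache lives on the function object, so no module-level global is needed.
get_kernel.cache = dict()

same = get_kernel(781, "cuda:0") is get_kernel(1000, "cuda:0")
print(same, len(get_kernel.cache))
```

Column counts 781 and 1000 both round up to 1024, so they reuse a single cache entry.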
.. GENERATED FROM PYTHON SOURCE LINES 160-162
Unit Test
----------
.. GENERATED FROM PYTHON SOURCE LINES 152-160
.. GENERATED FROM PYTHON SOURCE LINES 164-166
We make sure that we test our kernel on a matrix with an irregular number of rows and columns.
This will allow us to verify that our padding mechanism works.
.. GENERATED FROM PYTHON SOURCE LINES 166-173
.. code-block:: default
torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
y_tri = softmax(x)
y_ref = torch.softmax(x, axis=1)
print(y_tri)
print(y_ref)
print(torch.allclose(y_tri, y_ref))
@@ -219,47 +234,23 @@ Unit Test
.. code-block:: none
tensor([[2.0935e-03, 6.4551e-04, 9.8605e-05, ..., 3.3981e-04, 2.7386e-03,
9.1986e-05],
[7.0923e-04, 6.7521e-04, 5.1366e-04, ..., 9.8392e-04, 2.6547e-04,
6.9062e-04],
[1.4032e-04, 5.8826e-04, 1.1694e-03, ..., 6.6423e-04, 1.8178e-04,
6.7049e-04],
...,
[1.1767e-03, 4.2703e-03, 6.0596e-04, ..., 9.5274e-04, 1.1681e-03,
6.4924e-04],
[1.0772e-04, 7.4854e-04, 3.1912e-03, ..., 2.4980e-04, 1.9012e-03,
5.2567e-04],
[2.8518e-03, 8.1899e-04, 7.7046e-04, ..., 1.3403e-03, 5.3167e-04,
4.3268e-04]], device='cuda:0')
tensor([[2.0935e-03, 6.4551e-04, 9.8605e-05, ..., 3.3981e-04, 2.7386e-03,
9.1986e-05],
[7.0923e-04, 6.7521e-04, 5.1366e-04, ..., 9.8392e-04, 2.6547e-04,
6.9062e-04],
[1.4032e-04, 5.8826e-04, 1.1694e-03, ..., 6.6423e-04, 1.8178e-04,
6.7049e-04],
...,
[1.1767e-03, 4.2703e-03, 6.0596e-04, ..., 9.5274e-04, 1.1681e-03,
6.4924e-04],
[1.0772e-04, 7.4854e-04, 3.1912e-03, ..., 2.4980e-04, 1.9012e-03,
5.2567e-04],
[2.8518e-03, 8.1899e-04, 7.7046e-04, ..., 1.3403e-03, 5.3167e-04,
4.3268e-04]], device='cuda:0')
True
.. GENERATED FROM PYTHON SOURCE LINES 161-162
.. GENERATED FROM PYTHON SOURCE LINES 174-175
Seems to work!
As expected, the results are identical.
.. GENERATED FROM PYTHON SOURCE LINES 164-166
.. GENERATED FROM PYTHON SOURCE LINES 177-181
Benchmarking
----------
-------------
Here we will benchmark our operation as a function of the number of columns in the input matrix -- assuming 4096 rows.
We will then compare its performance against (1) :code:`torch.softmax` and (2) the :code:`naive_softmax` defined above.
.. GENERATED FROM PYTHON SOURCE LINES 166-186
.. GENERATED FROM PYTHON SOURCE LINES 181-204
.. code-block:: default
@@ -267,25 +258,28 @@ Benchmarking
import matplotlib.pyplot as plt
M = 4096
Ns = [128 * i for i in range(2, 50)]
tri_ms = []
ref_ms = []
def_ms = []
Ns = [256 * i for i in range(2, 50)]
tri_bw = []
ref_bw = []
def_bw = []
for N in Ns:
x = torch.randn(M, N, device='cuda', dtype=torch.float32)
gbps = lambda ms: x.nelement() * x.element_size() * 1e-9 / (ms * 1e-3)
tri_ms += [gbps(triton.testing.do_bench(lambda: softmax(x)))]
ref_ms += [gbps(triton.testing.do_bench(lambda: torch.softmax(x, axis=1)))]
def_ms += [gbps(triton.testing.do_bench(lambda: naive_softmax(x)))]
do_bench = lambda fn: gbps(triton.testing.do_bench(fn, warmup=10, rep=100, clear_l2=True))
tri_bw += [do_bench(lambda: softmax(x))]
ref_bw += [do_bench(lambda: torch.softmax(x, axis=1))]
def_bw += [do_bench(lambda: naive_softmax(x))]
plt.xlabel('N')
plt.ylabel('Bandwidth (GB/s)')
plt.plot(Ns, tri_ms, label='Triton')
plt.plot(Ns, ref_ms, label='Torch')
plt.plot(Ns, def_ms, label='Naive')
plt.plot(Ns, tri_bw, label='Triton')
plt.plot(Ns, ref_bw, label='Torch')
plt.plot(Ns, def_bw, label='Naive')
plt.legend()
plt.show()
.. image:: /getting-started/tutorials/images/sphx_glr_02-fused-softmax_001.png
:alt: 02 fused softmax
:class: sphx-glr-single-img
@@ -294,10 +288,19 @@ Benchmarking
.. GENERATED FROM PYTHON SOURCE LINES 205-210
In the above plot, we can see that:
- Triton is 4-5x faster than the naive implementation, which is consistent with our theoretical predictions.
- Triton is significantly faster than :code:`torch.softmax` for very large input matrices. My guess from looking at the source-code of the `PyTorch kernel <https://github.com/pytorch/pytorch/blob/9409a3a39b7149bb2d833a89e0c944109bef7c27/caffe2/operators/softmax_ops.cu#L240>`_ is that PyTorch only partially fuses the computation of the softmax.
This means that -- when temporary data is too large to fit entirely in the GPU's cache -- it transfers almost twice the amount of data necessary.
Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.
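For reading the y-axis of the plot, the bandwidth figure used in the benchmark can be reproduced by hand. This sketch mirrors the :code:`gbps` lambda from the benchmarking code; the function name :code:`effective_gbps` and the 0.1 ms timing are made up for illustration and are not measured results.

```python
def effective_gbps(num_elements, element_size, ms):
    """GB/s for a tensor of `num_elements` items processed in `ms` milliseconds."""
    num_bytes = num_elements * element_size
    return num_bytes * 1e-9 / (ms * 1e-3)


# A 4096 x 1024 float32 matrix and a hypothetical 0.1 ms run time.
print(effective_gbps(4096 * 1024, 4, 0.1))
```

Note that this metric counts the bytes of the input tensor once, so it is a throughput figure relative to the problem size rather than a measurement of total DRAM traffic.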
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 5.758 seconds)
**Total running time of the script:** ( 0 minutes 33.773 seconds)
.. _sphx_glr_download_getting-started_tutorials_02-fused-softmax.py: