diff --git a/_downloads/034d953b6214fedce6ea03803c712b89/02-fused-softmax.ipynb b/_downloads/034d953b6214fedce6ea03803c712b89/02-fused-softmax.ipynb
index a5a16d18a..d87c0748c 100644
--- a/_downloads/034d953b6214fedce6ea03803c712b89/02-fused-softmax.ipynb
+++ b/_downloads/034d953b6214fedce6ea03803c712b89/02-fused-softmax.ipynb
@@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "\n# Fused Softmax\nIn this tutorial, you will write a fused softmax operation that is significantly faster than PyTorch's native op for a particular class of matrices: those whose rows can fit in the GPU's SRAM.\nYou will learn about:\n\n- The benefits of kernel fusion for bandwidth-bound operations.\n- Reduction operators in Triton.\n"
+ "\n# Fused Softmax\nIn this tutorial, you will write a fused softmax operation that is significantly faster\nthan PyTorch's native op for a particular class of matrices: those whose rows can fit in\nthe GPU's SRAM.\nYou will learn about:\n\n- The benefits of kernel fusion for bandwidth-bound operations.\n- Reduction operators in Triton.\n"
]
},
{
@@ -33,21 +33,21 @@
},
"outputs": [],
"source": [
- "import torch\n\n\n# Compute the row-wise softmax of x\n@torch.jit.script\ndef naive_softmax(x):\n # read MN elements ; write M elements\n x_max = x.max(dim=1)[0]\n # read 2MN elements ; write MN elements\n z = x - x_max[:, None]\n # read MN elements ; write MN elements\n numerator = torch.exp(z)\n # read MN elements ; write M elements\n denominator = numerator.sum(dim=1)\n # read 2MN elements ; write MN elements\n ret = numerator / denominator[:, None]\n # in total: read 7MN elements ; wrote 3MN + 2M elements\n return ret"
+ "import torch\n\n\n@torch.jit.script\ndef naive_softmax(x):\n \"\"\"Compute row-wise softmax of X using native pytorch\n\n We subtract the maximum element in order to avoid overflows. Softmax is invariant to\n this shift.\n \"\"\"\n # read MN elements ; write M elements\n x_max = x.max(dim=1)[0]\n # read 2MN elements ; write MN elements\n z = x - x_max[:, None]\n # read MN elements ; write MN elements\n numerator = torch.exp(z)\n # read MN elements ; write M elements\n denominator = numerator.sum(dim=1)\n # read 2MN elements ; write MN elements\n ret = numerator / denominator[:, None]\n # in total: read 7MN elements ; wrote 3MN + 2M elements\n return ret"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "When implemented naively in pytorch, computing :code:`y = naive_softmax(x)` for $x \\in R^{M \\times N}$ requires reading $7MN$ elements from DRAM and writing back $3MN + 2M$ elements.\nThis is obviously wasteful; we'd prefer to have a custom \"fused\" kernel that only reads X once and does all the necessary computations on-chip.\nDoing so would require reading and writing back only $MN$ bytes, so we could expect a theoretical speed-up of ~5x (i.e., $(10MN + 2M) / 2MN$).\nThe `torch.jit.script` flags aims to perform this kind of \"kernel fusion\" automatically but, as we will see later, it is still far from ideal.\n\n"
+ "When implemented naively in PyTorch, computing :code:`y = naive_softmax(x)` for $x \\in R^{M \\times N}$\nrequires reading $7MN$ elements from DRAM and writing back $3MN + 2M$ elements.\nThis is obviously wasteful; we'd prefer to have a custom \"fused\" kernel that only reads\nX once and does all the necessary computations on-chip.\nDoing so would require reading and writing back only $MN$ bytes, so we could\nexpect a theoretical speed-up of ~5x (i.e., $(10MN + 2M) / 2MN$).\nThe `torch.jit.script` flags aims to perform this kind of \"kernel fusion\" automatically\nbut, as we will see later, it is still far from ideal.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Compute Kernel\nOur softmax kernel works as follows: each program loads a row of the input matrix X, normalizes it and writes back the result to the output Y.\nNote that one important limitation of Triton is that each block must have a power-of-two number of elements,\nso we need to internally \"pad\" each row and guard the memory operations properly if we want to handle any possible input shapes:\n\n"
+ "## Compute Kernel\nOur softmax kernel works as follows: each program loads a row of the input matrix X,\nnormalizes it and writes back the result to the output Y.\nNote that one important limitation of Triton is that each block must have a\npower-of-two number of elements, so we need to internally \"pad\" each row and guard the\nmemory operations properly if we want to handle any possible input shapes:\n\n"
]
},
{
@@ -58,7 +58,7 @@
},
"outputs": [],
"source": [
- "import triton\nimport triton.language as tl\n\n\n@triton.jit\ndef _softmax(Y, X, stride_xm, stride_ym, M, N, **meta):\n # row index\n m = tl.program_id(0)\n # col indices\n # here BLOCK is the smallest power of two greater than `N`\n n = tl.arange(0, meta['BLOCK'])\n # the memory address of all the elements\n # that we want to load can be computed as follows\n X = X + m * stride_xm + n\n x = tl.load(X, mask=n < N, other=-float('inf'))\n # Substract maximum for numerical stability\n z = x - tl.max(x, axis=0)\n # Note that exponentials in Triton are fast\n # but approximate (i.e., think __expf in CUDA)\n num = tl.exp(z)\n denom = tl.sum(num, axis=0)\n y = num / denom\n # Write back to Y\n Y = Y + m * stride_ym + n\n tl.store(Y, y, mask=n < N)"
+ "import triton\nimport triton.language as tl\n\n\n@triton.jit\ndef softmax_kernel(\n output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, **meta\n):\n # The rows of the softmax are independent, so we parallelize across those\n row_idx = tl.program_id(0)\n BLOCK_SIZE = meta['BLOCK_SIZE']\n # The stride represents how much we need to increase the pointer to advance 1 row\n row_start_ptr = input_ptr + row_idx * input_row_stride\n\n # The block size is the next power of two greater than n_cols, so we can fit each\n # row in a single block\n col_offsets = tl.arange(0, BLOCK_SIZE)\n input_ptrs = row_start_ptr + col_offsets\n # Load the row into SRAM, using a mask since BLOCK_SIZE may be > than n_cols\n row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))\n # Substract maximum for numerical stability\n row_minus_max = row - tl.max(row, axis=0)\n # Note that exponentials in Triton are fast but approximate (i.e., think __expf in CUDA)\n numerator = tl.exp(row_minus_max)\n denominator = tl.sum(numerator, axis=0)\n softmax_output = numerator / denominator\n # Write back output to DRAM\n output_row_start_ptr = output_ptr + row_idx * output_row_stride\n output_ptrs = output_row_start_ptr + col_offsets\n tl.store(output_ptrs, softmax_output, mask=col_offsets < n_cols)"
]
},
{
@@ -76,7 +76,7 @@
},
"outputs": [],
"source": [
- "def next_power_of_2(n):\n n -= 1\n n |= n >> 1\n n |= n >> 2\n n |= n >> 4\n n |= n >> 8\n n |= n >> 16\n n += 1\n return n\n\n\ndef softmax(x):\n M, N = x.shape\n # The block size is the smallest power of two greater than the number of columns in `x`\n BLOCK = next_power_of_2(N)\n # Another trick we can use is to ask the compiler to use more threads per row by\n # increasing the number of warps (`num_warps`) over which each row is distributed.\n # You will see in the next tutorial how to auto-tune this value in a more natural\n # way so you don't have to come up with manual heuristics yourself.\n num_warps = 4\n if BLOCK >= 2048: num_warps = 8\n if BLOCK >= 4096: num_warps = 16\n # Allocate output\n y = torch.empty_like(x)\n # Enqueue kernel. The launch grid is simple: we have one kernel instance per row of the input matrix\n _softmax[(M, )](y, x, x.stride(0), y.stride(0), M, N, num_warps=num_warps, BLOCK=BLOCK)\n return y"
+ "def next_power_of_2(n):\n \"\"\"Return the smallest power of 2 greater than or equal to n\"\"\"\n n -= 1\n n |= n >> 1\n n |= n >> 2\n n |= n >> 4\n n |= n >> 8\n n |= n >> 16\n n += 1\n return n\n\n\ndef softmax(x):\n n_rows, n_cols = x.shape\n # The block size is the smallest power of two greater than the number of columns in `x`\n BLOCK_SIZE = next_power_of_2(n_cols)\n # Another trick we can use is to ask the compiler to use more threads per row by\n # increasing the number of warps (`num_warps`) over which each row is distributed.\n # You will see in the next tutorial how to auto-tune this value in a more natural\n # way so you don't have to come up with manual heuristics yourself.\n num_warps = 4\n if BLOCK_SIZE >= 2048:\n num_warps = 8\n if BLOCK_SIZE >= 4096:\n num_warps = 16\n # Allocate output\n y = torch.empty_like(x)\n # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row o\n # f the input matrix\n softmax_kernel[(n_rows,)](\n y,\n x,\n x.stride(0),\n y.stride(0),\n n_cols,\n num_warps=num_warps,\n BLOCK_SIZE=BLOCK_SIZE,\n )\n return y"
]
},
{
@@ -101,7 +101,7 @@
},
"outputs": [],
"source": [
- "torch.manual_seed(0)\nx = torch.randn(1823, 781, device='cuda')\ny_tri = softmax(x)\ny_ref = torch.softmax(x, axis=1)\nprint(torch.allclose(y_tri, y_ref))"
+ "torch.manual_seed(0)\nx = torch.randn(1823, 781, device='cuda')\ny_triton = softmax(x)\ny_torch = torch.softmax(x, axis=1)\nprint(torch.allclose(y_triton, y_torch))"
]
},
{
@@ -126,14 +126,14 @@
},
"outputs": [],
"source": [
- "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['N'], # argument names to use as an x-axis for the plot\n x_vals=[128 * i for i in range(2, 100)], # different possible values for `x_name`\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n line_vals=['triton', 'torch-native', 'torch-jit'], # possible values for `line_arg``\n line_names=[\"Triton\", \"Torch (native)\", \"Torch (jit)\"], # label name for the lines\n styles=[('blue', '-'), ('green', '-'), ('green', '--')], # line styles\n ylabel=\"GB/s\", # label name for the y-axis\n plot_name=\"softmax-performance\", # name for the plot. Used also as a file name for saving the plot.\n args={'M': 4096} # values for function arguments not in `x_names` and `y_name`\n )\n)\ndef benchmark(M, N, provider):\n x = torch.randn(M, N, device='cuda', dtype=torch.float32)\n if provider == 'torch-native':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: softmax(x))\n if provider == 'torch-jit':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: naive_softmax(x))\n gbps = lambda ms: 2 * x.nelement() * x.element_size() * 1e-9 / (ms * 1e-3)\n return gbps(ms), gbps(max_ms), gbps(min_ms)\n\n\nbenchmark.run(show_plots=True, print_data=True)"
+ "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['N'], # argument names to use as an x-axis for the plot\n x_vals=[\n 128 * i for i in range(2, 100)\n ], # different possible values for `x_name`\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n line_vals=[\n 'triton',\n 'torch-native',\n 'torch-jit',\n ], # possible values for `line_arg``\n line_names=[\n \"Triton\",\n \"Torch (native)\",\n \"Torch (jit)\",\n ], # label name for the lines\n styles=[('blue', '-'), ('green', '-'), ('green', '--')], # line styles\n ylabel=\"GB/s\", # label name for the y-axis\n plot_name=\"softmax-performance\", # name for the plot. Used also as a file name for saving the plot.\n args={'M': 4096}, # values for function arguments not in `x_names` and `y_name`\n )\n)\ndef benchmark(M, N, provider):\n x = torch.randn(M, N, device='cuda', dtype=torch.float32)\n if provider == 'torch-native':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: softmax(x))\n if provider == 'torch-jit':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: naive_softmax(x))\n gbps = lambda ms: 2 * x.nelement() * x.element_size() * 1e-9 / (ms * 1e-3)\n return gbps(ms), gbps(max_ms), gbps(min_ms)\n\n\nbenchmark.run(show_plots=True, print_data=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the above plot, we can see that:\n\n - Triton is 2-3x faster than the Torch JIT.\n - Triton is even faster than :code:`torch.softmax`. My guess from looking at the source-code of the `PyTorch kernel `_ is that PyTorch only partially fuses the computation of the softmax.\n This means that -- when temporary data is too large to fit entirely in the GPU's cache -- it transfers almost twice the amount of memory necessary.\n Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.\n"
+ "In the above plot, we can see that:\n\n - Triton is 2-3x faster than the Torch JIT.\n - Triton is even faster than :code:`torch.softmax`. My guess from looking at the source-code of the `PyTorch kernel `_ is that PyTorch only partially fuses the computation of the softmax.\n This means that -- when temporary data is too large to fit entirely in the GPU's cache -- it transfers almost twice the amount of memory necessary.\n Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.\n\n"
]
}
],
diff --git a/_downloads/62d97d49a32414049819dd8bb8378080/01-vector-add.py b/_downloads/62d97d49a32414049819dd8bb8378080/01-vector-add.py
index ad6303858..5ac5a1225 100644
--- a/_downloads/62d97d49a32414049819dd8bb8378080/01-vector-add.py
+++ b/_downloads/62d97d49a32414049819dd8bb8378080/01-vector-add.py
@@ -13,31 +13,37 @@ In this tutorial, you will write a simple vector addition using Triton and learn
# --------------------------
import torch
-import triton.language as tl
import triton
+import triton.language as tl
@triton.jit
-def _add(
- X, # *Pointer* to first input vector
- Y, # *Pointer* to second input vector
- Z, # *Pointer* to output vector
- N, # Size of the vector
- **meta # Optional meta-parameters for the kernel
+def add_kernel(
+ x_ptr, # *Pointer* to first input vector
+ y_ptr, # *Pointer* to second input vector
+ output_ptr, # *Pointer* to output vector
+ n_elements, # Size of the vector
+ **meta, # Optional meta-parameters for the kernel
):
- pid = tl.program_id(0)
- # Create an offset for the blocks of pointers to be
- # processed by this program instance
- offsets = pid * meta['BLOCK'] + tl.arange(0, meta['BLOCK'])
- # Create a mask to guard memory operations against
- # out-of-bounds accesses
- mask = offsets < N
- # Load x
- x = tl.load(X + offsets, mask=mask)
- y = tl.load(Y + offsets, mask=mask)
- # Write back x + y
- z = x + y
- tl.store(Z + offsets, z)
+ BLOCK_SIZE = meta['BLOCK_SIZE'] # How many inputs each program should process
+    # There are multiple 'programs' processing different data. We identify which program
+ # we are here
+ pid = tl.program_id(axis=0) # We use a 1D launch grid so axis is 0
+ # This program will process inputs that are offset from the initial data.
+    # For instance, if you had a vector of length 256 and block_size of 64, the programs
+ # would each access the elements [0:64, 64:128, 128:192, 192:256].
+    # Note that offsets is a list of indices; adding it to a pointer below yields a list of pointers
+ block_start = pid * BLOCK_SIZE
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
+ # Create a mask to guard memory operations against out-of-bounds accesses
+ mask = offsets < n_elements
+    # Load x and y from DRAM, masking out any extra elements in case the input is not a
+ # multiple of the block size
+ x = tl.load(x_ptr + offsets, mask=mask)
+ y = tl.load(y_ptr + offsets, mask=mask)
+ output = x + y
+ # Write x + y back to DRAM
+ tl.store(output_ptr + offsets, output)
# %%
@@ -45,20 +51,23 @@ def _add(
# and (2) enqueue the above kernel with appropriate grid/block sizes.
-def add(x, y):
- z = torch.empty_like(x)
- N = z.shape[0]
+def add(x: torch.Tensor, y: torch.Tensor):
+ # We need to preallocate the output
+ output = torch.empty_like(x)
+ assert x.is_cuda and y.is_cuda and output.is_cuda
+ n_elements = output.shape[0]
# The SPMD launch grid denotes the number of kernel instances that run in parallel.
# It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int]
- grid = lambda meta: (triton.cdiv(N, meta['BLOCK']), )
+ # In this case, we use a 1D grid where the size is the number of blocks
+ grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
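+    # As a quick sanity check of the arithmetic (an illustration, not required by the API):
+    # with the size of 98432 used below and BLOCK_SIZE=1024, this grid evaluates to
+    # (triton.cdiv(98432, 1024),) == (97,), i.e. 97 program instances, the last of which
+    # is only partially full and relies on the mask inside the kernel.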
# NOTE:
# - each torch.tensor object is implicitly converted into a pointer to its first element.
     # - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel
     # - don't forget to pass meta-parameters as keyword arguments
- _add[grid](x, y, z, N, BLOCK=1024)
+ add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
     # We return a handle to output but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still
# running asynchronously at this point.
- return z
+ return output
# %%
@@ -68,11 +77,14 @@ torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
-za = x + y
-zb = add(x, y)
-print(za)
-print(zb)
-print(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')
+output_torch = x + y
+output_triton = add(x, y)
+print(output_torch)
+print(output_triton)
+print(
+ f'The maximum difference between torch and triton is '
+ f'{torch.max(torch.abs(output_torch - output_triton))}'
+)
# %%
# Seems like we're good to go!
@@ -88,15 +100,17 @@ print(f'The maximum difference between torch and triton is ' f'{torch.max(torch.
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['size'], # argument names to use as an x-axis for the plot
- x_vals=[2**i for i in range(12, 28, 1)], # different possible values for `x_name`
+ x_vals=[
+ 2 ** i for i in range(12, 28, 1)
+ ], # different possible values for `x_name`
x_log=True, # x axis is logarithmic
line_arg='provider', # argument name whose value corresponds to a different line in the plot
line_vals=['triton', 'torch'], # possible values for `line_arg`
- line_names=["Triton", "Torch"], # label name for the lines
+ line_names=['Triton', 'Torch'], # label name for the lines
styles=[('blue', '-'), ('green', '-')], # line styles
- ylabel="GB/s", # label name for the y-axis
- plot_name="vector-add-performance", # name for the plot. Used also as a file name for saving the plot.
- args={} # values for function arguments not in `x_names` and `y_name`
+ ylabel='GB/s', # label name for the y-axis
+ plot_name='vector-add-performance', # name for the plot. Used also as a file name for saving the plot.
+ args={}, # values for function arguments not in `x_names` and `y_name`
)
)
def benchmark(size, provider):
@@ -113,4 +127,4 @@ def benchmark(size, provider):
# %%
 # We can now run the decorated function above. Pass `print_data=True` to see the performance numbers, `show_plots=True` to plot them, and/or
 # `save_path='/path/to/results/'` to save them to disk along with raw CSV data
-benchmark.run(print_data=True, show_plots=True)
\ No newline at end of file
+benchmark.run(print_data=True, show_plots=True)
diff --git a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip
index 7b0eed30b..80024f8c5 100644
Binary files a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip and b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip differ
diff --git a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip
index aa9a0fd84..5ef30d545 100644
Binary files a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip and b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip differ
diff --git a/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb b/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
index 53c36efe9..9fc82332e 100644
--- a/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
+++ b/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
@@ -15,21 +15,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "\n# Matrix Multiplication\nIn this tutorial, you will write a 25-lines high-performance FP16 matrix multiplication kernel that achieves performance on par with cuBLAS.\nYou will specifically learn about:\n\n- Block-level matrix multiplications\n- Multi-dimensional pointer arithmetic\n- Program re-ordering for improved L2 cache hit rate \n- Automatic performance tuning\n"
+ "\n# Matrix Multiplication\nIn this tutorial, you will write a 25-lines high-performance FP16 matrix multiplication\nkernel that achieves performance on par with cuBLAS.\nYou will specifically learn about:\n\n- Block-level matrix multiplications\n- Multi-dimensional pointer arithmetic\n- Program re-ordering for improved L2 cache hit rate\n- Automatic performance tuning\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Motivations\nMatrix multiplications are a key building block of most modern high-performance computing systems.\nThey are notoriously hard to optimize, hence their implementation is generally done by hardware vendors themselves as part of so-called \"kernel libraries\" (e.g., cuBLAS).\nUnfortunately, these libraries are often proprietary and cannot be easily customized to accomodate the needs of modern deep learning workloads (e.g., fused activation functions).\nIn this tutorial, you will learn how to implement efficient matrix multiplications by yourself with Triton, in a way that is easy to customize and extend.\n\nRoughly speaking, the kernel that we will write will implement the following blocked algorithm:\n\n .. code-block:: python\n\n # do in parallel\n for m in range(0, M, BLOCK_M):\n # do in parallel\n for n in range(0, N, BLOCK_N):\n acc = zeros((BLOCK_M, BLOCK_N), dtype=float32)\n for k in range(0, K, BLOCK_K):\n a = A[m : m+BLOCK_M, k : k+BLOCK_K]\n b = B[k : k+BLOCK_K, n : n+BLOCK_N]\n acc += dot(a, b)\n C[m : m+BLOCK_M, n : n+BLOCK_N] = acc;\n\nwhere each iteration of the doubly-nested for-loop corresponds to a Triton program instance.\n\n"
+ "## Motivations\nMatrix multiplications are a key building block of most modern high-performance computing systems.\nThey are notoriously hard to optimize, hence their implementation is generally done by\nhardware vendors themselves as part of so-called \"kernel libraries\" (e.g., cuBLAS).\nUnfortunately, these libraries are often proprietary and cannot be easily customized\nto accomodate the needs of modern deep learning workloads (e.g., fused activation functions).\nIn this tutorial, you will learn how to implement efficient matrix multiplications by\nyourself with Triton, in a way that is easy to customize and extend.\n\nRoughly speaking, the kernel that we will write will implement the following blocked\nalgorithm to multiply a (MxK) by a (KxN) matrix:\n\n .. code-block:: python\n\n # do in parallel\n for m in range(0, M, BLOCK_SIZE_M):\n # do in parallel\n for n in range(0, N, BLOCK_SIZE_N):\n acc = zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=float32)\n for k in range(0, K, BLOCK_SIZE_K):\n a = A[m : m+BLOCK_SIZE_M, k : k+BLOCK_SIZE_K]\n b = B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]\n acc += dot(a, b)\n C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;\n\nwhere each iteration of the doubly-nested for-loop corresponds to a Triton program instance.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Compute Kernel\n\nThe above algorithm is, actually, fairly straightforward to implement in Triton.\nThe main difficulty comes from the computation of the memory locations at which blocks of :code:`A` and :code:`B` must be read in the inner loop. For that, we need multi-dimensional pointer arithmetics.\n\n### Pointer Arithmetics\n\nFor a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given by :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.\nTherefore, blocks of pointers for :code:`A[m : m+BLOCK_M, k:k+BLOCK_K]` and :code:`B[k : k+BLOCK_K, n : n+BLOCK_N]` can be defined in pseudo-code as:\n\n .. code-block:: python\n\n &A[m : m+BLOCK_M, k:k+BLOCK_K] = A + (m : m+BLOCK_M)[:, None]*A.stride(0) + (k : k+BLOCK_K)[None, :]*A.stride(1);\n &B[k : k+BLOCK_K, n:n+BLOCK_N] = B + (k : k+BLOCK_K)[:, None]*B.stride(0) + (n : n+BLOCK_N)[None, :]*B.stride(1);\n\nWhich means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:\n\n .. code-block:: python\n\n pid_m = triton.program_id(0)\n pid_n = triton.program_id(1)\n rm = pid_m * BLOCK_M + triton.arange(0, BLOCK_M)\n rn = pid_n * BLOCK_N + triton.arange(0, BLOCK_N)\n rk = triton.arange(0, BLOCK_K)\n // pointer for A operand\n pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);\n // pointer for B operand\n pb = B + (rk[:, None] * stride_b_0 + rn[None, :] * stride_b_1);\n\nAnd then updated in the inner loop as follows:\n\n .. code-block:: python\n\n pa += BLOCK_K * stride_a_1;\n pb += BLOCK_K * stride_b_0;\n\n\n### L2 Cache Optimizations\n\nAs mentioned above, each program instance computes an :code:`[BLOCK_M, BLOCK_N]` block of :code:`C`.\nIt is important to remember that the order in which these blocks are computed does matter, since it affects the L2 cache hit rate of our program.\nAnd unfortunately, a simple row-major ordering\n\n .. code-block:: Python\n\n pid = triton.program_id(0);\n grid_m = (M + BLOCK_M - 1) // BLOCK_M;\n grid_n = (N + BLOCK_N - 1) // BLOCK_N;\n pid_m = pid / grid_n;\n pid_n = pid % grid_n;\n\nis just not going to cut it.\n\nOne possible solution is to launch blocks in an order that promotes data reuse.\nThis can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before switching to the next column:\n\n .. code-block:: python\n\n pid = triton.program_id(0);\n width = GROUP_M * grid_n;\n group_id = pid // width;\n # we need to handle the case where M % (GROUP_M*BLOCK_M) != 0\n group_size = min(grid_m - group_id * GROUP_M, GROUP_M);\n pid_m = group_id * GROUP_M + (pid % group_size);\n pid_n = (pid % width) // (group_size);\n\nIn practice, this can improve the performance of our matrix multiplication kernel by >10\\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).\n\n\n"
+ "## Compute Kernel\n\nThe above algorithm is, actually, fairly straightforward to implement in Triton.\nThe main difficulty comes from the computation of the memory locations at which blocks\n of :code:`A` and :code:`B` must be read in the inner loop. For that, we need\nmulti-dimensional pointer arithmetics.\n\n### Pointer Arithmetics\n\nFor a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given b\ny :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.\nTherefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and\n:code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:\n\n .. code-block:: python\n\n &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] = A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);\n &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] = B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);\n\nWhich means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:\n\n .. code-block:: python\n\n pid_m = triton.program_id(0)\n pid_n = triton.program_id(1)\n rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)\n rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)\n rk = triton.arange(0, BLOCK_SIZE_K)\n // pointer for A operand\n pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);\n // pointer for B operand\n pb = B + (rk[:, None] * stride_b_0 + rn[None, :] * stride_b_1);\n\nAnd then updated in the inner loop as follows:\n\n .. code-block:: python\n\n pa += BLOCK_SIZE_K * stride_a_1;\n pb += BLOCK_SIZE_K * stride_b_0;\n\n\n### L2 Cache Optimizations\n\nAs mentioned above, each program instance computes a :code:`[BLOCK_SIZE_M, BLOCK_SIZE_N]`\n block of :code:`C`.\nIt is important to remember that the order in which these blocks are computed does\nmatter, since it affects the L2 cache hit rate of our program. and unfortunately, a\na simple row-major ordering\n\n .. code-block:: Python\n\n pid = triton.program_id(0);\n grid_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M;\n grid_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N;\n pid_m = pid / grid_n;\n pid_n = pid % grid_n;\n\nis just not going to cut it.\n\nOne possible solution is to launch blocks in an order that promotes data reuse.\nThis can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before\nswitching to the next column:\n\n .. code-block:: python\n\n pid = triton.program_id(0);\n width = GROUP_M * grid_n;\n group_id = pid // width;\n # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0\n group_size = min(grid_m - group_id * GROUP_M, GROUP_M);\n pid_m = group_id * GROUP_M + (pid % group_size);\n pid_n = (pid % width) // (group_size);\n\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# For example, in the following matmul where each matrix is 9 blocks by 9 blocks,\n# we can see that if we compute the output in row-major ordering, we need to load 90\n# blocks into SRAM to compute the first 9 output blocks, but if we do it in grouped\n# ordering, we only need to load 54 blocks.\n# .. image:: grouped_vs_row_major_ordering.png\n#\n# In practice, this can improve the performance of our matrix multiplication kernel by\n# more than 10\\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).\n#"
]
},
{
@@ -47,7 +58,7 @@
},
"outputs": [],
"source": [
- "import torch\nimport triton\nimport triton.language as tl\n\n# %\n# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune` decorator, which consumes:\n# - A list of :code:`triton.Config` objects that define different configurations of meta-parameters (e.g., BLOCK_M) and compilation options (e.g., num_warps) to try\n# - A autotuning *key* whose change in values will trigger evaluation of all the provided configs\n\n@triton.autotune(\n configs=[\n triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),\n triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),\n triton.Config({'BLOCK_M': 256, 'BLOCK_N': 64, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\\\n triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\\\n triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\\\n triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_M': 128, 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\\\n triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),\\\n triton.Config({'BLOCK_M': 32 , 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),\n #triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_warps=4),\n ],\n key=['M', 'N', 'K'],\n)\n# %\n# We can now define our kernel as normal, using all the techniques presented above\n@triton.jit\ndef _matmul(A, B, C, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, **META):\n # extract meta-parameters\n BLOCK_M = META['BLOCK_M']\n BLOCK_N = META['BLOCK_N']\n BLOCK_K = META['BLOCK_K']\n GROUP_M = 8\n # matrix multiplication\n pid = tl.program_id(0)\n grid_m = (M + BLOCK_M - 1) // BLOCK_M\n grid_n = (N + BLOCK_N - 1) // BLOCK_N\n # re-order program ID for better L2 performance\n width = GROUP_M * grid_n\n group_id = pid // width\n group_size = min(grid_m - group_id * GROUP_M, GROUP_M)\n pid_m = group_id * GROUP_M + (pid % group_size)\n pid_n = (pid % width) // (group_size)\n # do matrix multiplication\n rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)\n rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)\n rk = tl.arange(0, BLOCK_K)\n A = A + (rm[:, None] * stride_am + rk[None, :] * stride_ak)\n B = B + (rk[:, None] * stride_bk + rn[None, :] * stride_bn)\n acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)\n for k in range(K, 0, -BLOCK_K):\n a = tl.load(A)\n b = tl.load(B)\n acc += tl.dot(a, b)\n A += BLOCK_K * stride_ak\n B += BLOCK_K * stride_bk\n # triton can accept arbitrary activation function\n # via metaparameters!\n if META['ACTIVATION']:\n acc = META['ACTIVATION'](acc)\n # rematerialize rm and rn to save registers\n rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)\n rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)\n C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)\n mask = (rm[:, None] < M) & (rn[None, :] < N)\n tl.store(C, acc, mask=mask)\n\n\n# we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`\n@triton.jit\ndef leaky_relu(x):\n return tl.where(x >= 0, x, 0.01*x)"
+ "import torch\nimport triton\nimport triton.language as tl\n\n# %\n# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune`\n# decorator, which consumes:\n# - A list of :code:`triton.Config` objects that define different configurations of\n# meta-parameters (e.g., BLOCK_SIZE_M) and compilation options (e.g., num_warps) to try\n# - An autotuning *key* whose change in values will trigger evaluation of all the\n# provided configs\n\n@triton.autotune(\n configs=[\n triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n triton.Config({'BLOCK_SIZE_M': 32 , 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n ],\n key=['M', 'N', 'K'],\n)\n# %\n# We can now define our kernel as normal, using all the techniques presented above\n@triton.jit\ndef matmul_kernel(\n # Pointers to matrices\n a_ptr,\n b_ptr,\n c_ptr,\n # Matrix dimensions\n M,\n N,\n K,\n # The stride variables represent how much to increase the ptr by when moving by 1\n # element in a particular dimension. E.g. stride_am is how much to increase a_ptr\n # by to get the element one row down (A has M rows)\n stride_am,\n stride_ak,\n stride_bk,\n stride_bn,\n stride_cm,\n stride_cn,\n **meta,\n):\n \"\"\"Kernel for computing the matmul AB = C\n\n A has shape (M, K), B has shape (K, N) and C has shape (M, N)\n \"\"\"\n # extract meta-parameters\n BLOCK_SIZE_M = meta['BLOCK_SIZE_M']\n BLOCK_SIZE_N = meta['BLOCK_SIZE_N']\n BLOCK_SIZE_K = meta['BLOCK_SIZE_K']\n GROUP_SIZE_M = 8\n pid = tl.program_id(axis=0)\n\n # the number of blocks is the ceil(M / BLOCK_SIZE_M) since we need an extra block\n # Note that this will lead to some quantization in performance where time-taken jumps\n # when you need to add a new block\n n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M\n n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N\n\n # Map PIDs to the block they should compute. 
This is done in a grouped ordering\n # to promote L2 cache reuse.\n n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n\n group_id = pid // n_output_blocks_in_group\n first_m_block_in_group = group_id * GROUP_SIZE_M\n\n # If the number of blocks is not divisible by the group size, the last group is smaller\n group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)\n\n # Within a group, we compute in col-major ordering, block_m and block_n are the\n # output row and col that this program is computing in terms of blocks\n block_m = first_m_block_in_group + (pid % group_size_m)\n block_n = (pid % n_output_blocks_in_group) // group_size_m\n\n # Convert from block indices back to element indices\n m_start = block_m * BLOCK_SIZE_M\n n_start = block_n * BLOCK_SIZE_N\n\n # Expand out to all the offsets for each of the elements in this block.\n m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]\n n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]\n k_offsets = tl.arange(0, BLOCK_SIZE_K)\n\n # Get the pointers for the first block of each. We will advance this pointer\n # as we move in the K direction and accumulate.\n # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers\n a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])\n # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers\n b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)\n # We accumulate internally in fp32, but the output is written out in the dtype\n # of the tensor when it is stored\n accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)\n for k in range(0, K, BLOCK_SIZE_K):\n # Note that for simplicity, we don't apply a mask here. This means that if K is\n # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and\n # accumulate it incorrectly.\n a = tl.load(a_ptrs)\n b = tl.load(b_ptrs)\n # We accumulate along the K dimension\n accumulator += tl.dot(a, b)\n\n # Advance the ptrs to the next K block\n a_ptrs += BLOCK_SIZE_K * stride_ak\n b_ptrs += BLOCK_SIZE_K * stride_bk\n # triton can accept arbitrary activation function via metaparameters!\n if meta['ACTIVATION']:\n accumulator = meta['ACTIVATION'](accumulator)\n\n m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]\n n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]\n c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c\n mask = (m_offsets_c < M) & (n_offsets_c < N)\n tl.store(c_ptrs, accumulator, mask=mask)\n\n\n# we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`\n@triton.jit\ndef leaky_relu(x):\n return tl.where(x >= 0, x, 0.01 * x)"
]
},
{
@@ -65,7 +76,7 @@
},
"outputs": [],
"source": [
- "def matmul(a, b, activation=None):\n # checks constraints\n assert a.shape[1] == b.shape[0], \"incompatible dimensions\"\n assert a.is_contiguous(), \"matrix A must be contiguous\"\n assert b.is_contiguous(), \"matrix B must be contiguous\"\n M, K = a.shape\n _, N = b.shape\n # allocates output\n c = torch.empty((M, N), device=a.device, dtype=a.dtype)\n # launch kernel\n grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']), )\n pgm = _matmul[grid](\n a, b, c, M, N, K, \\\n a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),\\\n ACTIVATION = activation\n )\n # done; return the output tensor\n return c"
+ "def matmul(a, b, activation=None):\n # checks constraints\n assert a.shape[1] == b.shape[0], \"incompatible dimensions\"\n assert a.is_contiguous(), \"matrix A must be contiguous\"\n assert b.is_contiguous(), \"matrix B must be contiguous\"\n M, K = a.shape\n K, N = b.shape\n assert (\n K % 32 == 0\n ), \"We don't check memory-out-of-bounds with K so K must be divisible by BLOCK_SIZE_K\"\n # allocates output\n c = torch.empty((M, N), device=a.device, dtype=a.dtype)\n # 1D launch kernel where each block gets its own program.\n grid = lambda META: (\n triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),\n )\n matmul_kernel[grid](\n a,\n b,\n c,\n M,\n N,\n K,\n a.stride(0),\n a.stride(1),\n b.stride(0),\n b.stride(1),\n c.stride(0),\n c.stride(1),\n ACTIVATION=activation,\n )\n return c"
]
},
{
@@ -83,7 +94,7 @@
},
"outputs": [],
"source": [
- "torch.manual_seed(0)\na = torch.randn((512, 512), device='cuda', dtype=torch.float16)\nb = torch.randn((512, 512), device='cuda', dtype=torch.float16)\nc_0 = matmul(a, b, activation=None)\nc_1 = torch.matmul(a, b)\nprint(c_0)\nprint(c_1)\nprint(triton.testing.allclose(c_0, c_1))"
+ "torch.manual_seed(0)\na = torch.randn((512, 512), device='cuda', dtype=torch.float16)\nb = torch.randn((512, 512), device='cuda', dtype=torch.float16)\ntriton_output = matmul(a, b, activation=None)\ntorch_output = torch.matmul(a, b)\nprint(f\"{triton_output=}\")\nprint(f\"{torch_output=}\")\nif triton.testing.allclose(triton_output, torch_output):\n print(\"\u2705 Triton and Torch match\")\nelse:\n print(\"\u274c Triton and Torch differ\")"
]
},
{
@@ -101,7 +112,7 @@
},
"outputs": [],
"source": [
- "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['M', 'N', 'K'], # argument names to use as an x-axis for the plot\n x_vals=[128 * i for i in range(1, 33)], # different possible values for `x_name`\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'], # possible values for `line_arg``\n line_names=[\"cuBLAS\", \"cuBLAS (+ torch.nn.LeakyReLU)\", \"Triton\", \"Triton (+ LeakyReLU)\"], # label name for the lines\n styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')], # line styles\n ylabel=\"TFLOPS\", # label name for the y-axis\n plot_name=\"matmul-performance\", # name for the plot. Used also as a file name for saving the plot.\n args={}\n )\n)\ndef benchmark(M, N, K, provider):\n a = torch.randn((M, K), device='cuda', dtype=torch.float16)\n b = torch.randn((K, N), device='cuda', dtype=torch.float16)\n if provider == 'cublas':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.matmul(a, b))\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b))\n if provider == 'cublas + relu':\n torch_relu = torch.nn.ReLU(inplace=True)\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch_relu(torch.matmul(a, b)))\n if provider == 'triton + relu':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b, activation=leaky_relu))\n perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)\n return perf(ms), perf(max_ms), perf(min_ms)\n\n\nbenchmark.run(show_plots=True, print_data=True)"
+ "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['M', 'N', 'K'], # argument names to use as an x-axis for the plot\n x_vals=[\n 128 * i for i in range(1, 33)\n ], # different possible values for `x_name`\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n # possible values for `line_arg``\n line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'],\n # label name for the lines\n line_names=[\"cuBLAS\", \"cuBLAS (+ torch.nn.LeakyReLU)\", \"Triton\", \"Triton (+ LeakyReLU)\"],\n # line styles\n styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')],\n ylabel=\"TFLOPS\", # label name for the y-axis\n plot_name=\"matmul-performance\", # name for the plot. Used also as a file name for saving the plot.\n args={},\n )\n)\ndef benchmark(M, N, K, provider):\n a = torch.randn((M, K), device='cuda', dtype=torch.float16)\n b = torch.randn((K, N), device='cuda', dtype=torch.float16)\n if provider == 'cublas':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.matmul(a, b))\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b))\n if provider == 'cublas + relu':\n torch_relu = torch.nn.ReLU(inplace=True)\n ms, min_ms, max_ms = triton.testing.do_bench(\n lambda: torch_relu(torch.matmul(a, b))\n )\n if provider == 'triton + relu':\n ms, min_ms, max_ms = triton.testing.do_bench(\n lambda: matmul(a, b, activation=leaky_relu)\n )\n perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)\n return perf(ms), perf(max_ms), perf(min_ms)\n\n\nbenchmark.run(show_plots=True, print_data=True)"
]
}
],
diff --git a/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py b/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
index 5ca56ef91..838ddc0b0 100644
--- a/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
+++ b/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
@@ -1,12 +1,13 @@
"""
Matrix Multiplication
======================
-In this tutorial, you will write a 25-lines high-performance FP16 matrix multiplication kernel that achieves performance on par with cuBLAS.
+In this tutorial, you will write a 25-line high-performance FP16 matrix multiplication
+kernel that achieves performance on par with cuBLAS.
You will specifically learn about:
- Block-level matrix multiplications
- Multi-dimensional pointer arithmetic
-- Program re-ordering for improved L2 cache hit rate
+- Program re-ordering for improved L2 cache hit rate
- Automatic performance tuning
"""
@@ -14,24 +15,28 @@ You will specifically learn about:
# Motivations
# -------------
# Matrix multiplications are a key building block of most modern high-performance computing systems.
-# They are notoriously hard to optimize, hence their implementation is generally done by hardware vendors themselves as part of so-called "kernel libraries" (e.g., cuBLAS).
-# Unfortunately, these libraries are often proprietary and cannot be easily customized to accomodate the needs of modern deep learning workloads (e.g., fused activation functions).
-# In this tutorial, you will learn how to implement efficient matrix multiplications by yourself with Triton, in a way that is easy to customize and extend.
+# They are notoriously hard to optimize, hence their implementation is generally done by
+# hardware vendors themselves as part of so-called "kernel libraries" (e.g., cuBLAS).
+# Unfortunately, these libraries are often proprietary and cannot be easily customized
+# to accommodate the needs of modern deep learning workloads (e.g., fused activation functions).
+# In this tutorial, you will learn how to implement efficient matrix multiplications by
+# yourself with Triton, in a way that is easy to customize and extend.
#
-# Roughly speaking, the kernel that we will write will implement the following blocked algorithm:
+# Roughly speaking, the kernel that we will write will implement the following blocked
+# algorithm to multiply a (MxK) by a (KxN) matrix:
#
# .. code-block:: python
#
# # do in parallel
-# for m in range(0, M, BLOCK_M):
+# for m in range(0, M, BLOCK_SIZE_M):
# # do in parallel
-# for n in range(0, N, BLOCK_N):
-# acc = zeros((BLOCK_M, BLOCK_N), dtype=float32)
-# for k in range(0, K, BLOCK_K):
-# a = A[m : m+BLOCK_M, k : k+BLOCK_K]
-# b = B[k : k+BLOCK_K, n : n+BLOCK_N]
+# for n in range(0, N, BLOCK_SIZE_N):
+# acc = zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=float32)
+# for k in range(0, K, BLOCK_SIZE_K):
+# a = A[m : m+BLOCK_SIZE_M, k : k+BLOCK_SIZE_K]
+# b = B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]
# acc += dot(a, b)
-# C[m : m+BLOCK_M, n : n+BLOCK_N] = acc;
+# C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;
#
# where each iteration of the doubly-nested for-loop corresponds to a Triton program instance.
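+#
+# As a plain-Python sanity check (our sketch, not part of the Triton kernel below), the same
+# blocked algorithm can be written with NumPy; `block_matmul` is a hypothetical helper name
+# and we assume the block sizes divide M, N and K evenly:
+#
+# .. code-block:: python
+#
+#    import numpy as np
+#
+#    def block_matmul(A, B, BLOCK_SIZE_M=16, BLOCK_SIZE_N=16, BLOCK_SIZE_K=16):
+#        M, K = A.shape
+#        K, N = B.shape
+#        C = np.zeros((M, N), dtype=np.float32)
+#        for m in range(0, M, BLOCK_SIZE_M):        # "do in parallel" on the GPU
+#            for n in range(0, N, BLOCK_SIZE_N):    # "do in parallel" on the GPU
+#                acc = np.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=np.float32)
+#                for k in range(0, K, BLOCK_SIZE_K):
+#                    a = A[m:m + BLOCK_SIZE_M, k:k + BLOCK_SIZE_K]
+#                    b = B[k:k + BLOCK_SIZE_K, n:n + BLOCK_SIZE_N]
+#                    acc += a @ b
+#                C[m:m + BLOCK_SIZE_M, n:n + BLOCK_SIZE_N] = acc
+#        return C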
@@ -40,18 +45,22 @@ You will specifically learn about:
# ----------------
#
# The above algorithm is, actually, fairly straightforward to implement in Triton.
-# The main difficulty comes from the computation of the memory locations at which blocks of :code:`A` and :code:`B` must be read in the inner loop. For that, we need multi-dimensional pointer arithmetics.
+# The main difficulty comes from the computation of the memory locations at which blocks
+# of :code:`A` and :code:`B` must be read in the inner loop. For that, we need
+# multi-dimensional pointer arithmetics.
#
# Pointer Arithmetics
# ~~~~~~~~~~~~~~~~~~~~
#
-# For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given by :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
-# Therefore, blocks of pointers for :code:`A[m : m+BLOCK_M, k:k+BLOCK_K]` and :code:`B[k : k+BLOCK_K, n : n+BLOCK_N]` can be defined in pseudo-code as:
+# For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given
+# by :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
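+# (For instance, a contiguous row-major tensor of shape (M, N) has
+# :code:`stride_x_0 == N` and :code:`stride_x_1 == 1`.)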
+# Therefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and
+# :code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:
#
# .. code-block:: python
#
-# &A[m : m+BLOCK_M, k:k+BLOCK_K] = A + (m : m+BLOCK_M)[:, None]*A.stride(0) + (k : k+BLOCK_K)[None, :]*A.stride(1);
-# &B[k : k+BLOCK_K, n:n+BLOCK_N] = B + (k : k+BLOCK_K)[:, None]*B.stride(0) + (n : n+BLOCK_N)[None, :]*B.stride(1);
+# &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] = A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
+# &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] = B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
#
# Which means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:
#
@@ -59,9 +68,9 @@ You will specifically learn about:
#
# pid_m = triton.program_id(0)
# pid_n = triton.program_id(1)
-# rm = pid_m * BLOCK_M + triton.arange(0, BLOCK_M)
-# rn = pid_n * BLOCK_N + triton.arange(0, BLOCK_N)
-# rk = triton.arange(0, BLOCK_K)
+# rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)
+# rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)
+# rk = triton.arange(0, BLOCK_SIZE_K)
# // pointer for A operand
# pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);
# // pointer for B operand
@@ -71,41 +80,51 @@ You will specifically learn about:
#
# .. code-block:: python
#
-# pa += BLOCK_K * stride_a_1;
-# pb += BLOCK_K * stride_b_0;
+# pa += BLOCK_SIZE_K * stride_a_1;
+# pb += BLOCK_SIZE_K * stride_b_0;
#
#
# L2 Cache Optimizations
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
-# As mentioned above, each program instance computes an :code:`[BLOCK_M, BLOCK_N]` block of :code:`C`.
-# It is important to remember that the order in which these blocks are computed does matter, since it affects the L2 cache hit rate of our program.
-# And unfortunately, a simple row-major ordering
+# As mentioned above, each program instance computes a :code:`[BLOCK_SIZE_M, BLOCK_SIZE_N]`
+# block of :code:`C`.
+# It is important to remember that the order in which these blocks are computed does
+# matter, since it affects the L2 cache hit rate of our program. Unfortunately, a
+# simple row-major ordering
#
# .. code-block:: Python
#
# pid = triton.program_id(0);
-# grid_m = (M + BLOCK_M - 1) // BLOCK_M;
-# grid_n = (N + BLOCK_N - 1) // BLOCK_N;
+# grid_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M;
+# grid_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N;
# pid_m = pid / grid_n;
# pid_n = pid % grid_n;
#
# is just not going to cut it.
#
# One possible solution is to launch blocks in an order that promotes data reuse.
-# This can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before switching to the next column:
+# This can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before
+# switching to the next column:
#
# .. code-block:: python
#
# pid = triton.program_id(0);
# width = GROUP_M * grid_n;
# group_id = pid // width;
-# # we need to handle the case where M % (GROUP_M*BLOCK_M) != 0
+# # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0
# group_size = min(grid_m - group_id * GROUP_M, GROUP_M);
# pid_m = group_id * GROUP_M + (pid % group_size);
# pid_n = (pid % width) // (group_size);
+#
+# For example, in the following matmul where each matrix is 9 blocks by 9 blocks,
+# we can see that if we compute the output in row-major ordering, we need to load 90
+# blocks into SRAM to compute the first 9 output blocks, but if we do it in grouped
+# ordering, we only need to load 54 blocks.
+# .. image:: grouped_vs_row_major_ordering.png
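+# (Counting the loads for the claim above, assuming a group size of 3 for this 9x9 example:
+# the first 9 outputs in row-major order form one block-row of C, which touches 1 block-row
+# of A (9 blocks) and all 9 block-columns of B (81 blocks), i.e. 90 loads; in grouped
+# ordering they form a 3x3 tile, which touches 3 block-rows of A and 3 block-columns of B,
+# i.e. 27 + 27 = 54 loads.)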
#
-# In practice, this can improve the performance of our matrix multiplication kernel by >10\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).
+# In practice, this can improve the performance of our matrix multiplication kernel by
+# more than 10\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).
#
# %%
@@ -118,96 +137,165 @@ import triton
import triton.language as tl
# %
-# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune` decorator, which consumes:
-# - A list of :code:`triton.Config` objects that define different configurations of meta-parameters (e.g., BLOCK_M) and compilation options (e.g., num_warps) to try
-# - A autotuning *key* whose change in values will trigger evaluation of all the provided configs
+# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune`
+# decorator, which consumes:
+# - A list of :code:`triton.Config` objects that define different configurations of
+# meta-parameters (e.g., BLOCK_SIZE_M) and compilation options (e.g., num_warps) to try
+# - An autotuning *key* whose change in values will trigger evaluation of all the
+# provided configs
@triton.autotune(
configs=[
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),
- triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),
- triton.Config({'BLOCK_M': 256, 'BLOCK_N': 64, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),\
- triton.Config({'BLOCK_M': 32 , 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),
- #triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
+ triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
+ triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
+ triton.Config({'BLOCK_SIZE_M': 32 , 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
],
key=['M', 'N', 'K'],
)
# %
# We can now define our kernel as normal, using all the techniques presented above
@triton.jit
-def _matmul(A, B, C, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, **META):
+def matmul_kernel(
+ # Pointers to matrices
+ a_ptr,
+ b_ptr,
+ c_ptr,
+ # Matrix dimensions
+ M,
+ N,
+ K,
+ # The stride variables represent how much to increase the ptr by when moving by 1
+ # element in a particular dimension. E.g. stride_am is how much to increase a_ptr
+ # by to get the element one row down (A has M rows)
+ stride_am,
+ stride_ak,
+ stride_bk,
+ stride_bn,
+ stride_cm,
+ stride_cn,
+ **meta,
+):
+ """Kernel for computing the matmul AB = C
+
+ A has shape (M, K), B has shape (K, N) and C has shape (M, N)
+ """
# extract meta-parameters
- BLOCK_M = META['BLOCK_M']
- BLOCK_N = META['BLOCK_N']
- BLOCK_K = META['BLOCK_K']
- GROUP_M = 8
- # matrix multiplication
- pid = tl.program_id(0)
- grid_m = (M + BLOCK_M - 1) // BLOCK_M
- grid_n = (N + BLOCK_N - 1) // BLOCK_N
- # re-order program ID for better L2 performance
- width = GROUP_M * grid_n
- group_id = pid // width
- group_size = min(grid_m - group_id * GROUP_M, GROUP_M)
- pid_m = group_id * GROUP_M + (pid % group_size)
- pid_n = (pid % width) // (group_size)
- # do matrix multiplication
- rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
- rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
- rk = tl.arange(0, BLOCK_K)
- A = A + (rm[:, None] * stride_am + rk[None, :] * stride_ak)
- B = B + (rk[:, None] * stride_bk + rn[None, :] * stride_bn)
- acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
- for k in range(K, 0, -BLOCK_K):
- a = tl.load(A)
- b = tl.load(B)
- acc += tl.dot(a, b)
- A += BLOCK_K * stride_ak
- B += BLOCK_K * stride_bk
- # triton can accept arbitrary activation function
- # via metaparameters!
- if META['ACTIVATION']:
- acc = META['ACTIVATION'](acc)
- # rematerialize rm and rn to save registers
- rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
- rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
- C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)
- mask = (rm[:, None] < M) & (rn[None, :] < N)
- tl.store(C, acc, mask=mask)
+ BLOCK_SIZE_M = meta['BLOCK_SIZE_M']
+ BLOCK_SIZE_N = meta['BLOCK_SIZE_N']
+ BLOCK_SIZE_K = meta['BLOCK_SIZE_K']
+ GROUP_SIZE_M = 8
+ pid = tl.program_id(axis=0)
+
+    # The number of blocks is ceil(M / BLOCK_SIZE_M): we need one extra block whenever M
+    # is not an exact multiple of BLOCK_SIZE_M. Note that this leads to some quantization
+    # in performance: the time taken jumps whenever an extra block becomes necessary.
+ n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
+ n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N
+
+ # Map PIDs to the block they should compute. This is done in a grouped ordering
+ # to promote L2 cache reuse.
+ n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n
+ group_id = pid // n_output_blocks_in_group
+ first_m_block_in_group = group_id * GROUP_SIZE_M
+
+ # If the number of blocks is not divisible by the group size, the last group is smaller
+ group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)
+
+    # Within a group, we compute in col-major ordering; block_m and block_n are the
+    # output row and column (in units of blocks) that this program is computing
+ block_m = first_m_block_in_group + (pid % group_size_m)
+ block_n = (pid % n_output_blocks_in_group) // group_size_m
+
+ # Convert from block indices back to element indices
+ m_start = block_m * BLOCK_SIZE_M
+ n_start = block_n * BLOCK_SIZE_N
+
+ # Expand out to all the offsets for each of the elements in this block.
+ m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
+ n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
+ k_offsets = tl.arange(0, BLOCK_SIZE_K)
+
+ # Get the pointers for the first block of each. We will advance this pointer
+ # as we move in the K direction and accumulate.
+ # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers
+ a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])
+ # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers
+ b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)
+ # We accumulate internally in fp32, but the output is written out in the dtype
+ # of the tensor when it is stored
+ accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
+ for k in range(0, K, BLOCK_SIZE_K):
+ # Note that for simplicity, we don't apply a mask here. This means that if K is
+ # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and
+ # accumulate it incorrectly.
+ a = tl.load(a_ptrs)
+ b = tl.load(b_ptrs)
+ # We accumulate along the K dimension
+ accumulator += tl.dot(a, b)
+
+ # Advance the ptrs to the next K block
+ a_ptrs += BLOCK_SIZE_K * stride_ak
+ b_ptrs += BLOCK_SIZE_K * stride_bk
+    # Triton can accept arbitrary activation functions via meta-parameters!
+ if meta['ACTIVATION']:
+ accumulator = meta['ACTIVATION'](accumulator)
+
+ m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
+ n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
+ c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c
+ mask = (m_offsets_c < M) & (n_offsets_c < N)
+ tl.store(c_ptrs, accumulator, mask=mask)
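# %%
# As a sanity check of the grouped ordering above, the same pid -> (block_m, block_n)
# mapping can be reproduced in plain Python. This is only an illustrative sketch with
# made-up block counts and group size; `grouped_block_order` is a hypothetical helper,
# not part of the kernel or the tutorial.


def grouped_block_order(n_blocks_m, n_blocks_n, group_size_m):
    order = []
    for pid in range(n_blocks_m * n_blocks_n):
        blocks_in_group = group_size_m * n_blocks_n
        group_id = pid // blocks_in_group
        first_m = group_id * group_size_m
        size_m = min(n_blocks_m - first_m, group_size_m)
        order.append((first_m + pid % size_m, (pid % blocks_in_group) // size_m))
    return order


# With a 4x4 grid of blocks and GROUP_SIZE_M = 2, the first 8 programs sweep a 2-row
# band column by column, so the A blocks they load stay hot in L2:
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
print(grouped_block_order(4, 4, 2)[:8])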
# we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `matmul_kernel`
@triton.jit
def leaky_relu(x):
- return tl.where(x >= 0, x, 0.01*x)
+ return tl.where(x >= 0, x, 0.01 * x)
+
# %%
# We can now create a convenience wrapper function that only takes two input tensors
# and (1) checks any shape constraint; (2) allocates the output; (3) launches the above kernel
+
def matmul(a, b, activation=None):
# checks constraints
assert a.shape[1] == b.shape[0], "incompatible dimensions"
assert a.is_contiguous(), "matrix A must be contiguous"
assert b.is_contiguous(), "matrix B must be contiguous"
M, K = a.shape
- _, N = b.shape
+ K, N = b.shape
+ assert (
+ K % 32 == 0
+ ), "We don't check memory-out-of-bounds with K so K must be divisible by BLOCK_SIZE_K"
# allocates output
c = torch.empty((M, N), device=a.device, dtype=a.dtype)
- # launch kernel
- grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']), )
- pgm = _matmul[grid](
- a, b, c, M, N, K, \
- a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),\
- ACTIVATION = activation
+ # 1D launch kernel where each block gets its own program.
+ grid = lambda META: (
+ triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),
+ )
+ matmul_kernel[grid](
+ a,
+ b,
+ c,
+ M,
+ N,
+ K,
+ a.stride(0),
+ a.stride(1),
+ b.stride(0),
+ b.stride(1),
+ c.stride(0),
+ c.stride(1),
+ ACTIVATION=activation,
)
- # done; return the output tensor
return c
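# %%
# A quick illustration of the stride arguments forwarded above (illustrative only; the
# tensor name `a_example` is made up): for a contiguous row-major tensor, moving down one
# row skips a whole row of elements, so stride(0) equals the number of columns and
# stride(1) equals 1.

a_example = torch.empty((512, 256), dtype=torch.float16)
print(a_example.stride())  # (256, 1): stride_am = 256 elements, stride_ak = 1 element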
@@ -220,11 +308,14 @@ def matmul(a, b, activation=None):
torch.manual_seed(0)
a = torch.randn((512, 512), device='cuda', dtype=torch.float16)
b = torch.randn((512, 512), device='cuda', dtype=torch.float16)
-c_0 = matmul(a, b, activation=None)
-c_1 = torch.matmul(a, b)
-print(c_0)
-print(c_1)
-print(triton.testing.allclose(c_0, c_1))
+triton_output = matmul(a, b, activation=None)
+torch_output = torch.matmul(a, b)
+print(f"{triton_output=}")
+print(f"{torch_output=}")
+if triton.testing.allclose(triton_output, torch_output):
+ print("β
Triton and Torch match")
+else:
+ print("β Triton and Torch differ")
# %%
# Benchmark
@@ -238,14 +329,19 @@ print(triton.testing.allclose(c_0, c_1))
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['M', 'N', 'K'], # argument names to use as an x-axis for the plot
- x_vals=[128 * i for i in range(1, 33)], # different possible values for `x_name`
+ x_vals=[
+ 128 * i for i in range(1, 33)
+ ], # different possible values for `x_name`
line_arg='provider', # argument name whose value corresponds to a different line in the plot
- line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'], # possible values for `line_arg``
- line_names=["cuBLAS", "cuBLAS (+ torch.nn.LeakyReLU)", "Triton", "Triton (+ LeakyReLU)"], # label name for the lines
- styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')], # line styles
+ # possible values for `line_arg``
+ line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'],
+ # label name for the lines
+ line_names=["cuBLAS", "cuBLAS (+ torch.nn.LeakyReLU)", "Triton", "Triton (+ LeakyReLU)"],
+ # line styles
+ styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')],
ylabel="TFLOPS", # label name for the y-axis
plot_name="matmul-performance", # name for the plot. Used also as a file name for saving the plot.
- args={}
+ args={},
)
)
def benchmark(M, N, K, provider):
@@ -257,9 +353,13 @@ def benchmark(M, N, K, provider):
ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b))
if provider == 'cublas + relu':
torch_relu = torch.nn.ReLU(inplace=True)
- ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch_relu(torch.matmul(a, b)))
+ ms, min_ms, max_ms = triton.testing.do_bench(
+ lambda: torch_relu(torch.matmul(a, b))
+ )
if provider == 'triton + relu':
- ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b, activation=leaky_relu))
+ ms, min_ms, max_ms = triton.testing.do_bench(
+ lambda: matmul(a, b, activation=leaky_relu)
+ )
perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)
return perf(ms), perf(max_ms), perf(min_ms)
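# %%
# The `perf` formula above counts 2 * M * N * K floating-point operations (one multiply
# and one add per term of each inner product) and converts milliseconds to seconds. As a
# worked example with made-up numbers: a 4096 x 4096 x 4096 matmul that takes 1 ms
# corresponds to roughly 137 TFLOPS.

print(2 * 4096 ** 3 * 1e-12 / (1.0 * 1e-3))  # ~137.4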
diff --git a/_downloads/d91442ac2982c4e0cc3ab0f43534afbc/02-fused-softmax.py b/_downloads/d91442ac2982c4e0cc3ab0f43534afbc/02-fused-softmax.py
index 0f96152e2..e908cca4e 100644
--- a/_downloads/d91442ac2982c4e0cc3ab0f43534afbc/02-fused-softmax.py
+++ b/_downloads/d91442ac2982c4e0cc3ab0f43534afbc/02-fused-softmax.py
@@ -1,7 +1,9 @@
"""
Fused Softmax
=================
-In this tutorial, you will write a fused softmax operation that is significantly faster than PyTorch's native op for a particular class of matrices: those whose rows can fit in the GPU's SRAM.
+In this tutorial, you will write a fused softmax operation that is significantly faster
+than PyTorch's native op for a particular class of matrices: those whose rows can fit in
+the GPU's SRAM.
You will learn about:
- The benefits of kernel fusion for bandwidth-bound operations.
@@ -17,9 +19,13 @@ You will learn about:
import torch
-# Compute the row-wise softmax of x
@torch.jit.script
def naive_softmax(x):
+ """Compute row-wise softmax of X using native pytorch
+
+ We subtract the maximum element in order to avoid overflows. Softmax is invariant to
+ this shift.
+ """
# read MN elements ; write M elements
x_max = x.max(dim=1)[0]
# read 2MN elements ; write MN elements
@@ -35,43 +41,54 @@ def naive_softmax(x):
# %%
-# When implemented naively in pytorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
-# This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads X once and does all the necessary computations on-chip.
-# Doing so would require reading and writing back only :math:`MN` bytes, so we could expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / 2MN`).
-# The `torch.jit.script` flags aims to perform this kind of "kernel fusion" automatically but, as we will see later, it is still far from ideal.
+# When implemented naively in PyTorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}`
+# requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
+# This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads
+# X once and does all the necessary computations on-chip.
+# Doing so would require reading and writing back only :math:`MN` bytes, so we could
+# expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / 2MN`).
+# The `torch.jit.script` flag aims to perform this kind of "kernel fusion" automatically
+# but, as we will see later, it is still far from ideal.
# %%
# Compute Kernel
# ----------------
-# Our softmax kernel works as follows: each program loads a row of the input matrix X, normalizes it and writes back the result to the output Y.
-# Note that one important limitation of Triton is that each block must have a power-of-two number of elements,
-# so we need to internally "pad" each row and guard the memory operations properly if we want to handle any possible input shapes:
+# Our softmax kernel works as follows: each program loads a row of the input matrix X,
+# normalizes it and writes back the result to the output Y.
+# Note that one important limitation of Triton is that each block must have a
+# power-of-two number of elements, so we need to internally "pad" each row and guard the
+# memory operations properly if we want to handle any possible input shapes:
import triton
import triton.language as tl
@triton.jit
-def _softmax(Y, X, stride_xm, stride_ym, M, N, **meta):
- # row index
- m = tl.program_id(0)
- # col indices
- # here BLOCK is the smallest power of two greater than `N`
- n = tl.arange(0, meta['BLOCK'])
- # the memory address of all the elements
- # that we want to load can be computed as follows
- X = X + m * stride_xm + n
- x = tl.load(X, mask=n < N, other=-float('inf'))
+def softmax_kernel(
+ output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, **meta
+):
+ # The rows of the softmax are independent, so we parallelize across those
+ row_idx = tl.program_id(0)
+ BLOCK_SIZE = meta['BLOCK_SIZE']
+ # The stride represents how much we need to increase the pointer to advance 1 row
+ row_start_ptr = input_ptr + row_idx * input_row_stride
+
+ # The block size is the next power of two greater than n_cols, so we can fit each
+ # row in a single block
+ col_offsets = tl.arange(0, BLOCK_SIZE)
+ input_ptrs = row_start_ptr + col_offsets
+    # Load the row into SRAM, using a mask since BLOCK_SIZE may be greater than n_cols
+ row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))
    # Subtract maximum for numerical stability
- z = x - tl.max(x, axis=0)
- # Note that exponentials in Triton are fast
- # but approximate (i.e., think __expf in CUDA)
- num = tl.exp(z)
- denom = tl.sum(num, axis=0)
- y = num / denom
- # Write back to Y
- Y = Y + m * stride_ym + n
- tl.store(Y, y, mask=n < N)
+ row_minus_max = row - tl.max(row, axis=0)
+ # Note that exponentials in Triton are fast but approximate (i.e., think __expf in CUDA)
+ numerator = tl.exp(row_minus_max)
+ denominator = tl.sum(numerator, axis=0)
+ softmax_output = numerator / denominator
+ # Write back output to DRAM
+ output_row_start_ptr = output_ptr + row_idx * output_row_stride
+ output_ptrs = output_row_start_ptr + col_offsets
+ tl.store(output_ptrs, softmax_output, mask=col_offsets < n_cols)
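# %%
# A quick way to convince yourself that padding with `-inf` is harmless: padded lanes
# never change the row maximum, and they contribute exp(-inf) = 0 to the denominator.
# This is just an illustrative check on the CPU with arbitrary sizes (`_row` and
# `_padded` are made-up names), not part of the kernel.

_row = torch.randn(10)
_padded = torch.cat([_row, torch.full((6,), -float('inf'))])  # pad 10 -> 16 lanes
print(torch.allclose(torch.softmax(_row, dim=0), torch.softmax(_padded, dim=0)[:10]))  # True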
# %%
@@ -79,6 +96,7 @@ def _softmax(Y, X, stride_xm, stride_ym, M, N, **meta):
def next_power_of_2(n):
+ """Return the smallest power of 2 greater than or equal to n"""
n -= 1
n |= n >> 1
n |= n >> 2
@@ -90,20 +108,31 @@ def next_power_of_2(n):
def softmax(x):
- M, N = x.shape
+ n_rows, n_cols = x.shape
# The block size is the smallest power of two greater than the number of columns in `x`
- BLOCK = next_power_of_2(N)
+ BLOCK_SIZE = next_power_of_2(n_cols)
# Another trick we can use is to ask the compiler to use more threads per row by
# increasing the number of warps (`num_warps`) over which each row is distributed.
# You will see in the next tutorial how to auto-tune this value in a more natural
# way so you don't have to come up with manual heuristics yourself.
num_warps = 4
- if BLOCK >= 2048: num_warps = 8
- if BLOCK >= 4096: num_warps = 16
+ if BLOCK_SIZE >= 2048:
+ num_warps = 8
+ if BLOCK_SIZE >= 4096:
+ num_warps = 16
# Allocate output
y = torch.empty_like(x)
- # Enqueue kernel. The launch grid is simple: we have one kernel instance per row of the input matrix
- _softmax[(M, )](y, x, x.stride(0), y.stride(0), M, N, num_warps=num_warps, BLOCK=BLOCK)
+    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per
+    # row of the input matrix
+ softmax_kernel[(n_rows,)](
+ y,
+ x,
+ x.stride(0),
+ y.stride(0),
+ n_cols,
+ num_warps=num_warps,
+ BLOCK_SIZE=BLOCK_SIZE,
+ )
return y
@@ -117,9 +146,9 @@ def softmax(x):
torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
-y_tri = softmax(x)
-y_ref = torch.softmax(x, axis=1)
-print(torch.allclose(y_tri, y_ref))
+y_triton = softmax(x)
+y_torch = torch.softmax(x, axis=1)
+print(torch.allclose(y_triton, y_torch))
#%%
# As expected, the results are identical.
@@ -134,14 +163,24 @@ print(torch.allclose(y_tri, y_ref))
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['N'], # argument names to use as an x-axis for the plot
- x_vals=[128 * i for i in range(2, 100)], # different possible values for `x_name`
+ x_vals=[
+ 128 * i for i in range(2, 100)
+ ], # different possible values for `x_name`
line_arg='provider', # argument name whose value corresponds to a different line in the plot
- line_vals=['triton', 'torch-native', 'torch-jit'], # possible values for `line_arg``
- line_names=["Triton", "Torch (native)", "Torch (jit)"], # label name for the lines
+ line_vals=[
+ 'triton',
+ 'torch-native',
+ 'torch-jit',
+ ], # possible values for `line_arg``
+ line_names=[
+ "Triton",
+ "Torch (native)",
+ "Torch (jit)",
+ ], # label name for the lines
styles=[('blue', '-'), ('green', '-'), ('green', '--')], # line styles
ylabel="GB/s", # label name for the y-axis
plot_name="softmax-performance", # name for the plot. Used also as a file name for saving the plot.
- args={'M': 4096} # values for function arguments not in `x_names` and `y_name`
+ args={'M': 4096}, # values for function arguments not in `x_names` and `y_name`
)
)
def benchmark(M, N, provider):
@@ -164,4 +203,4 @@ benchmark.run(show_plots=True, print_data=True)
# - Triton is 2-3x faster than the Torch JIT.
# - Triton is even faster than :code:`torch.softmax`. My guess from looking at the source-code of the `PyTorch kernel `_ is that PyTorch only partially fuses the computation of the softmax.
# This means that -- when temporary data is too large to fit entirely in the GPU's cache -- it transfers almost twice the amount of memory necessary.
-# Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.
\ No newline at end of file
+# Note that our Triton kernel is not only faster than PyTorch's CUDA kernel, it is also **easier to read, understand and maintain**.
diff --git a/_downloads/f191ee1e78dc52eb5f7cba88f71cef2f/01-vector-add.ipynb b/_downloads/f191ee1e78dc52eb5f7cba88f71cef2f/01-vector-add.ipynb
index 19763a6af..d8a2e0a7b 100644
--- a/_downloads/f191ee1e78dc52eb5f7cba88f71cef2f/01-vector-add.ipynb
+++ b/_downloads/f191ee1e78dc52eb5f7cba88f71cef2f/01-vector-add.ipynb
@@ -33,7 +33,7 @@
},
"outputs": [],
"source": [
- "import torch\nimport triton.language as tl\nimport triton\n\n\n@triton.jit\ndef _add(\n X, # *Pointer* to first input vector\n Y, # *Pointer* to second input vector\n Z, # *Pointer* to output vector\n N, # Size of the vector\n **meta # Optional meta-parameters for the kernel\n):\n pid = tl.program_id(0)\n # Create an offset for the blocks of pointers to be\n # processed by this program instance\n offsets = pid * meta['BLOCK'] + tl.arange(0, meta['BLOCK'])\n # Create a mask to guard memory operations against\n # out-of-bounds accesses\n mask = offsets < N\n # Load x\n x = tl.load(X + offsets, mask=mask)\n y = tl.load(Y + offsets, mask=mask)\n # Write back x + y\n z = x + y\n tl.store(Z + offsets, z)"
+ "import torch\nimport triton\nimport triton.language as tl\n\n\n@triton.jit\ndef add_kernel(\n x_ptr, # *Pointer* to first input vector\n y_ptr, # *Pointer* to second input vector\n output_ptr, # *Pointer* to output vector\n n_elements, # Size of the vector\n **meta, # Optional meta-parameters for the kernel\n):\n BLOCK_SIZE = meta['BLOCK_SIZE'] # How many inputs each program should process\n # There are multiple 'program's processing different data. We identify which program\n # we are here\n pid = tl.program_id(axis=0) # We use a 1D launch grid so axis is 0\n # This program will process inputs that are offset from the initial data.\n # for instance, if you had a vector of length 256 and block_size of 64, the programs\n # would each access the elements [0:64, 64:128, 128:192, 192:256].\n # Note that offsets is a list of pointers\n block_start = pid * BLOCK_SIZE\n offsets = block_start + tl.arange(0, BLOCK_SIZE)\n # Create a mask to guard memory operations against out-of-bounds accesses\n mask = offsets < n_elements\n # Load x and y from DRAM, masking out any extar elements in case the input is not a\n # multiple of the block size\n x = tl.load(x_ptr + offsets, mask=mask)\n y = tl.load(y_ptr + offsets, mask=mask)\n output = x + y\n # Write x + y back to DRAM\n tl.store(output_ptr + offsets, output)"
]
},
{
@@ -51,7 +51,7 @@
},
"outputs": [],
"source": [
- "def add(x, y):\n z = torch.empty_like(x)\n N = z.shape[0]\n # The SPMD launch grid denotes the number of kernel instances that run in parallel.\n # It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int]\n grid = lambda meta: (triton.cdiv(N, meta['BLOCK']), )\n # NOTE:\n # - each torch.tensor object is implicitly converted into a pointer to its first element.\n # - `triton.jit`'ed functions can be index with a launch grid to obtain a callable GPU kernel\n # - don't forget to pass meta-parameters as keywords arguments\n _add[grid](x, y, z, N, BLOCK=1024)\n # We return a handle to z but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still\n # running asynchronously at this point.\n return z"
+ "def add(x: torch.Tensor, y: torch.Tensor):\n # We need to preallocate the output\n output = torch.empty_like(x)\n assert x.is_cuda and y.is_cuda and output.is_cuda\n n_elements = output.shape[0]\n # The SPMD launch grid denotes the number of kernel instances that run in parallel.\n # It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int]\n # In this case, we use a 1D grid where the size is the number of blocks\n grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)\n # NOTE:\n # - each torch.tensor object is implicitly converted into a pointer to its first element.\n # - `triton.jit`'ed functions can be index with a launch grid to obtain a callable GPU kernel\n # - don't forget to pass meta-parameters as keywords arguments\n add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)\n # We return a handle to z but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still\n # running asynchronously at this point.\n return output"
]
},
{
@@ -69,7 +69,7 @@
},
"outputs": [],
"source": [
- "torch.manual_seed(0)\nsize = 98432\nx = torch.rand(size, device='cuda')\ny = torch.rand(size, device='cuda')\nza = x + y\nzb = add(x, y)\nprint(za)\nprint(zb)\nprint(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')"
+ "torch.manual_seed(0)\nsize = 98432\nx = torch.rand(size, device='cuda')\ny = torch.rand(size, device='cuda')\noutput_torch = x + y\noutput_triton = add(x, y)\nprint(output_torch)\nprint(output_triton)\nprint(\n f'The maximum difference between torch and triton is '\n f'{torch.max(torch.abs(output_torch - output_triton))}'\n)"
]
},
{
@@ -94,7 +94,7 @@
},
"outputs": [],
"source": [
- "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['size'], # argument names to use as an x-axis for the plot\n x_vals=[2**i for i in range(12, 28, 1)], # different possible values for `x_name`\n x_log=True, # x axis is logarithmic\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n line_vals=['triton', 'torch'], # possible values for `line_arg`\n line_names=[\"Triton\", \"Torch\"], # label name for the lines\n styles=[('blue', '-'), ('green', '-')], # line styles\n ylabel=\"GB/s\", # label name for the y-axis\n plot_name=\"vector-add-performance\", # name for the plot. Used also as a file name for saving the plot.\n args={} # values for function arguments not in `x_names` and `y_name`\n )\n)\ndef benchmark(size, provider):\n x = torch.rand(size, device='cuda', dtype=torch.float32)\n y = torch.rand(size, device='cuda', dtype=torch.float32)\n if provider == 'torch':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y)\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y))\n gbps = lambda ms: 12 * size / ms * 1e-6\n return gbps(ms), gbps(max_ms), gbps(min_ms)"
+ "@triton.testing.perf_report(\n triton.testing.Benchmark(\n x_names=['size'], # argument names to use as an x-axis for the plot\n x_vals=[\n 2 ** i for i in range(12, 28, 1)\n ], # different possible values for `x_name`\n x_log=True, # x axis is logarithmic\n line_arg='provider', # argument name whose value corresponds to a different line in the plot\n line_vals=['triton', 'torch'], # possible values for `line_arg`\n line_names=['Triton', 'Torch'], # label name for the lines\n styles=[('blue', '-'), ('green', '-')], # line styles\n ylabel='GB/s', # label name for the y-axis\n plot_name='vector-add-performance', # name for the plot. Used also as a file name for saving the plot.\n args={}, # values for function arguments not in `x_names` and `y_name`\n )\n)\ndef benchmark(size, provider):\n x = torch.rand(size, device='cuda', dtype=torch.float32)\n y = torch.rand(size, device='cuda', dtype=torch.float32)\n if provider == 'torch':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y)\n if provider == 'triton':\n ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y))\n gbps = lambda ms: 12 * size / ms * 1e-6\n return gbps(ms), gbps(max_ms), gbps(min_ms)"
]
},
{
diff --git a/_images/sphx_glr_01-vector-add_001.png b/_images/sphx_glr_01-vector-add_001.png
index e391d1558..0fc2e7c74 100644
Binary files a/_images/sphx_glr_01-vector-add_001.png and b/_images/sphx_glr_01-vector-add_001.png differ
diff --git a/_images/sphx_glr_01-vector-add_thumb.png b/_images/sphx_glr_01-vector-add_thumb.png
index 3fcc53479..246a7e3a1 100644
Binary files a/_images/sphx_glr_01-vector-add_thumb.png and b/_images/sphx_glr_01-vector-add_thumb.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_001.png b/_images/sphx_glr_02-fused-softmax_001.png
index 9d2e9ad45..05781365e 100644
Binary files a/_images/sphx_glr_02-fused-softmax_001.png and b/_images/sphx_glr_02-fused-softmax_001.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_thumb.png b/_images/sphx_glr_02-fused-softmax_thumb.png
index 124590dbd..fa527d708 100644
Binary files a/_images/sphx_glr_02-fused-softmax_thumb.png and b/_images/sphx_glr_02-fused-softmax_thumb.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_001.png b/_images/sphx_glr_03-matrix-multiplication_001.png
index d75b8896a..f40365756 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_001.png and b/_images/sphx_glr_03-matrix-multiplication_001.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_thumb.png b/_images/sphx_glr_03-matrix-multiplication_thumb.png
index cb6771688..048f18080 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_thumb.png and b/_images/sphx_glr_03-matrix-multiplication_thumb.png differ
diff --git a/_sources/getting-started/tutorials/01-vector-add.rst.txt b/_sources/getting-started/tutorials/01-vector-add.rst.txt
index 41094e550..98044312f 100644
--- a/_sources/getting-started/tutorials/01-vector-add.rst.txt
+++ b/_sources/getting-started/tutorials/01-vector-add.rst.txt
@@ -31,37 +31,43 @@ In this tutorial, you will write a simple vector addition using Triton and learn
Compute Kernel
--------------------------
-.. GENERATED FROM PYTHON SOURCE LINES 14-43
+.. GENERATED FROM PYTHON SOURCE LINES 14-49
.. code-block:: default
import torch
- import triton.language as tl
import triton
+ import triton.language as tl
@triton.jit
- def _add(
- X, # *Pointer* to first input vector
- Y, # *Pointer* to second input vector
- Z, # *Pointer* to output vector
- N, # Size of the vector
- **meta # Optional meta-parameters for the kernel
+ def add_kernel(
+ x_ptr, # *Pointer* to first input vector
+ y_ptr, # *Pointer* to second input vector
+ output_ptr, # *Pointer* to output vector
+ n_elements, # Size of the vector
+ **meta, # Optional meta-parameters for the kernel
):
- pid = tl.program_id(0)
- # Create an offset for the blocks of pointers to be
- # processed by this program instance
- offsets = pid * meta['BLOCK'] + tl.arange(0, meta['BLOCK'])
- # Create a mask to guard memory operations against
- # out-of-bounds accesses
- mask = offsets < N
- # Load x
- x = tl.load(X + offsets, mask=mask)
- y = tl.load(Y + offsets, mask=mask)
- # Write back x + y
- z = x + y
- tl.store(Z + offsets, z)
+ BLOCK_SIZE = meta['BLOCK_SIZE'] # How many inputs each program should process
+ # There are multiple 'program's processing different data. We identify which program
+ # we are here
+ pid = tl.program_id(axis=0) # We use a 1D launch grid so axis is 0
+ # This program will process inputs that are offset from the initial data.
+ # for instance, if you had a vector of length 256 and block_size of 64, the programs
+ # would each access the elements [0:64, 64:128, 128:192, 192:256].
+ # Note that offsets is a list of pointers
+ block_start = pid * BLOCK_SIZE
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
+ # Create a mask to guard memory operations against out-of-bounds accesses
+ mask = offsets < n_elements
+       # Load x and y from DRAM, masking out any extra elements in case the input is not a
+ # multiple of the block size
+ x = tl.load(x_ptr + offsets, mask=mask)
+ y = tl.load(y_ptr + offsets, mask=mask)
+ output = x + y
+ # Write x + y back to DRAM
+ tl.store(output_ptr + offsets, output)
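For example (an illustrative sketch with made-up sizes, not part of the generated
tutorial): with ``n_elements = 100`` and ``BLOCK_SIZE = 64``, program 1 gets
``block_start = 64``, offsets ``64..127``, and a mask that disables the last 28 lanes,
so the loads and stores never touch memory past the end of the vectors.

.. code-block:: python

    BLOCK_SIZE, n_elements, pid = 64, 100, 1
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    mask = [off < n_elements for off in offsets]
    print(sum(mask))  # 36 active lanes, 28 masked-out lanes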
@@ -71,31 +77,34 @@ Compute Kernel
-.. GENERATED FROM PYTHON SOURCE LINES 44-46
+.. GENERATED FROM PYTHON SOURCE LINES 50-52
Let's also declare a helper function to (1) allocate the `z` tensor
and (2) enqueue the above kernel with appropriate grid/block sizes.
-.. GENERATED FROM PYTHON SOURCE LINES 46-64
+.. GENERATED FROM PYTHON SOURCE LINES 52-73
.. code-block:: default
- def add(x, y):
- z = torch.empty_like(x)
- N = z.shape[0]
+ def add(x: torch.Tensor, y: torch.Tensor):
+ # We need to preallocate the output
+ output = torch.empty_like(x)
+ assert x.is_cuda and y.is_cuda and output.is_cuda
+ n_elements = output.shape[0]
# The SPMD launch grid denotes the number of kernel instances that run in parallel.
# It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int]
- grid = lambda meta: (triton.cdiv(N, meta['BLOCK']), )
+ # In this case, we use a 1D grid where the size is the number of blocks
+ grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
# NOTE:
# - each torch.tensor object is implicitly converted into a pointer to its first element.
    # - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel
    # - don't forget to pass meta-parameters as keyword arguments
- _add[grid](x, y, z, N, BLOCK=1024)
+ add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
# We return a handle to z but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still
# running asynchronously at this point.
- return z
+ return output
@@ -105,11 +114,11 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.
-.. GENERATED FROM PYTHON SOURCE LINES 65-66
+.. GENERATED FROM PYTHON SOURCE LINES 74-75
We can now use the above function to compute the element-wise sum of two `torch.tensor` objects and test its correctness:
-.. GENERATED FROM PYTHON SOURCE LINES 66-77
+.. GENERATED FROM PYTHON SOURCE LINES 75-89
.. code-block:: default
@@ -118,11 +127,14 @@ We can now use the above function to compute the element-wise sum of two `torch.
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
- za = x + y
- zb = add(x, y)
- print(za)
- print(zb)
- print(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')
+ output_torch = x + y
+ output_triton = add(x, y)
+ print(output_torch)
+ print(output_triton)
+ print(
+ f'The maximum difference between torch and triton is '
+ f'{torch.max(torch.abs(output_torch - output_triton))}'
+ )
@@ -141,11 +153,11 @@ We can now use the above function to compute the element-wise sum of two `torch.
-.. GENERATED FROM PYTHON SOURCE LINES 78-79
+.. GENERATED FROM PYTHON SOURCE LINES 90-91
Seems like we're good to go!
-.. GENERATED FROM PYTHON SOURCE LINES 81-86
+.. GENERATED FROM PYTHON SOURCE LINES 93-98
Benchmark
-----------
@@ -153,7 +165,7 @@ We can now benchmark our custom op on vectors of increasing sizes to get a sense
To make things easier, Triton has a set of built-in utilities that allow us to concisely plot the performance of your custom ops
for different problem sizes.
-.. GENERATED FROM PYTHON SOURCE LINES 86-113
+.. GENERATED FROM PYTHON SOURCE LINES 98-127
.. code-block:: default
@@ -162,15 +174,17 @@ for different problem sizes.
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['size'], # argument names to use as an x-axis for the plot
- x_vals=[2**i for i in range(12, 28, 1)], # different possible values for `x_name`
+ x_vals=[
+ 2 ** i for i in range(12, 28, 1)
+ ], # different possible values for `x_name`
x_log=True, # x axis is logarithmic
line_arg='provider', # argument name whose value corresponds to a different line in the plot
line_vals=['triton', 'torch'], # possible values for `line_arg`
- line_names=["Triton", "Torch"], # label name for the lines
+ line_names=['Triton', 'Torch'], # label name for the lines
styles=[('blue', '-'), ('green', '-')], # line styles
- ylabel="GB/s", # label name for the y-axis
- plot_name="vector-add-performance", # name for the plot. Used also as a file name for saving the plot.
- args={} # values for function arguments not in `x_names` and `y_name`
+ ylabel='GB/s', # label name for the y-axis
+ plot_name='vector-add-performance', # name for the plot. Used also as a file name for saving the plot.
+ args={}, # values for function arguments not in `x_names` and `y_name`
)
)
def benchmark(size, provider):
@@ -191,18 +205,19 @@ for different problem sizes.
-.. GENERATED FROM PYTHON SOURCE LINES 114-116
+.. GENERATED FROM PYTHON SOURCE LINES 128-130
We can now run the decorated function above. Pass `print_data=True` to see the performance number, `show_plots=True` to plot them, and/or
`save_path='/path/to/results/' to save them to disk along with raw CSV data
-.. GENERATED FROM PYTHON SOURCE LINES 116-116
+.. GENERATED FROM PYTHON SOURCE LINES 130-131
.. code-block:: default
benchmark.run(print_data=True, show_plots=True)
+
.. image:: /getting-started/tutorials/images/sphx_glr_01-vector-add_001.png
:alt: 01 vector add
:class: sphx-glr-single-img
@@ -218,16 +233,16 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
size Triton Torch
0 4096.0 9.600000 9.600000
1 8192.0 19.200000 19.200000
- 2 16384.0 38.400001 38.400001
- 3 32768.0 76.800002 76.800002
+ 2 16384.0 31.999999 31.999999
+ 3 32768.0 63.999998 76.800002
4 65536.0 127.999995 127.999995
5 131072.0 219.428568 219.428568
- 6 262144.0 384.000001 384.000001
+ 6 262144.0 341.333321 384.000001
7 524288.0 472.615390 472.615390
8 1048576.0 614.400016 614.400016
9 2097152.0 722.823517 722.823517
10 4194304.0 780.190482 780.190482
- 11 8388608.0 819.200021 812.429770
+ 11 8388608.0 812.429770 812.429770
12 16777216.0 833.084721 833.084721
13 33554432.0 843.811163 843.811163
14 67108864.0 849.278610 848.362445
@@ -239,7 +254,7 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 0 minutes 10.996 seconds)
+ **Total running time of the script:** ( 0 minutes 11.055 seconds)
.. _sphx_glr_download_getting-started_tutorials_01-vector-add.py:
diff --git a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
index f6f19d709..0d5ca288a 100644
--- a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
+++ b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
@@ -20,20 +20,22 @@
Fused Softmax
=================
-In this tutorial, you will write a fused softmax operation that is significantly faster than PyTorch's native op for a particular class of matrices: those whose rows can fit in the GPU's SRAM.
+In this tutorial, you will write a fused softmax operation that is significantly faster
+than PyTorch's native op for a particular class of matrices: those whose rows can fit in
+the GPU's SRAM.
You will learn about:
- The benefits of kernel fusion for bandwidth-bound operations.
- Reduction operators in Triton.
-.. GENERATED FROM PYTHON SOURCE LINES 12-16
+.. GENERATED FROM PYTHON SOURCE LINES 14-18
Motivations
------------
Custom GPU kernels for elementwise additions are educationally valuable but won't get you very far in practice.
Let us consider instead the case of a simple (numerically stabilized) softmax operation:
-.. GENERATED FROM PYTHON SOURCE LINES 16-37
+.. GENERATED FROM PYTHON SOURCE LINES 18-43
.. code-block:: default
@@ -41,9 +43,13 @@ Let us consider instead the case of a simple (numerically stabilized) softmax op
import torch
- # Compute the row-wise softmax of x
@torch.jit.script
def naive_softmax(x):
+ """Compute row-wise softmax of X using native pytorch
+
+ We subtract the maximum element in order to avoid overflows. Softmax is invariant to
+ this shift.
+ """
# read MN elements ; write M elements
x_max = x.max(dim=1)[0]
# read 2MN elements ; write MN elements
@@ -65,22 +71,28 @@ Let us consider instead the case of a simple (numerically stabilized) softmax op
-.. GENERATED FROM PYTHON SOURCE LINES 38-42
+.. GENERATED FROM PYTHON SOURCE LINES 44-52
-When implemented naively in pytorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}` requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
-This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads X once and does all the necessary computations on-chip.
-Doing so would require reading and writing back only :math:`MN` bytes, so we could expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / 2MN`).
-The `torch.jit.script` flags aims to perform this kind of "kernel fusion" automatically but, as we will see later, it is still far from ideal.
+When implemented naively in PyTorch, computing :code:`y = naive_softmax(x)` for :math:`x \in R^{M \times N}`
+requires reading :math:`7MN` elements from DRAM and writing back :math:`3MN + 2M` elements.
+This is obviously wasteful; we'd prefer to have a custom "fused" kernel that only reads
+X once and does all the necessary computations on-chip.
+Doing so would require reading and writing back only :math:`MN` bytes, so we could
+expect a theoretical speed-up of ~5x (i.e., :math:`(10MN + 2M) / 2MN`).
+The `torch.jit.script` flag aims to perform this kind of "kernel fusion" automatically
+but, as we will see later, it is still far from ideal.
-.. GENERATED FROM PYTHON SOURCE LINES 44-49
+.. GENERATED FROM PYTHON SOURCE LINES 54-61
Compute Kernel
----------------
-Our softmax kernel works as follows: each program loads a row of the input matrix X, normalizes it and writes back the result to the output Y.
-Note that one important limitation of Triton is that each block must have a power-of-two number of elements,
-so we need to internally "pad" each row and guard the memory operations properly if we want to handle any possible input shapes:
+Our softmax kernel works as follows: each program loads a row of the input matrix X,
+normalizes it and writes back the result to the output Y.
+Note that one important limitation of Triton is that each block must have a
+power-of-two number of elements, so we need to internally "pad" each row and guard the
+memory operations properly if we want to handle any possible input shapes:
-.. GENERATED FROM PYTHON SOURCE LINES 49-77
+.. GENERATED FROM PYTHON SOURCE LINES 61-94
.. code-block:: default
@@ -90,26 +102,31 @@ so we need to internally "pad" each row and guard the memory operations properly
@triton.jit
- def _softmax(Y, X, stride_xm, stride_ym, M, N, **meta):
- # row index
- m = tl.program_id(0)
- # col indices
- # here BLOCK is the smallest power of two greater than `N`
- n = tl.arange(0, meta['BLOCK'])
- # the memory address of all the elements
- # that we want to load can be computed as follows
- X = X + m * stride_xm + n
- x = tl.load(X, mask=n < N, other=-float('inf'))
+ def softmax_kernel(
+ output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, **meta
+ ):
+ # The rows of the softmax are independent, so we parallelize across those
+ row_idx = tl.program_id(0)
+ BLOCK_SIZE = meta['BLOCK_SIZE']
+ # The stride represents how much we need to increase the pointer to advance 1 row
+ row_start_ptr = input_ptr + row_idx * input_row_stride
+
+ # The block size is the next power of two greater than n_cols, so we can fit each
+ # row in a single block
+ col_offsets = tl.arange(0, BLOCK_SIZE)
+ input_ptrs = row_start_ptr + col_offsets
+       # Load the row into SRAM, using a mask since BLOCK_SIZE may be greater than n_cols
+ row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))
    # Subtract maximum for numerical stability
- z = x - tl.max(x, axis=0)
- # Note that exponentials in Triton are fast
- # but approximate (i.e., think __expf in CUDA)
- num = tl.exp(z)
- denom = tl.sum(num, axis=0)
- y = num / denom
- # Write back to Y
- Y = Y + m * stride_ym + n
- tl.store(Y, y, mask=n < N)
+ row_minus_max = row - tl.max(row, axis=0)
+ # Note that exponentials in Triton are fast but approximate (i.e., think __expf in CUDA)
+ numerator = tl.exp(row_minus_max)
+ denominator = tl.sum(numerator, axis=0)
+ softmax_output = numerator / denominator
+ # Write back output to DRAM
+ output_row_start_ptr = output_ptr + row_idx * output_row_stride
+ output_ptrs = output_row_start_ptr + col_offsets
+ tl.store(output_ptrs, softmax_output, mask=col_offsets < n_cols)
@@ -119,17 +136,18 @@ so we need to internally "pad" each row and guard the memory operations properly
-.. GENERATED FROM PYTHON SOURCE LINES 78-79
+.. GENERATED FROM PYTHON SOURCE LINES 95-96
We can create a helper function that enqueues the kernel and its (meta-)arguments for any given input tensor.
-.. GENERATED FROM PYTHON SOURCE LINES 79-110
+.. GENERATED FROM PYTHON SOURCE LINES 96-139
.. code-block:: default
def next_power_of_2(n):
+ """Return the smallest power of 2 greater than or equal to n"""
n -= 1
n |= n >> 1
n |= n >> 2
@@ -141,20 +159,31 @@ We can create a helper function that enqueues the kernel and its (meta-)argument
def softmax(x):
- M, N = x.shape
+ n_rows, n_cols = x.shape
# The block size is the smallest power of two greater than the number of columns in `x`
- BLOCK = next_power_of_2(N)
+ BLOCK_SIZE = next_power_of_2(n_cols)
# Another trick we can use is to ask the compiler to use more threads per row by
# increasing the number of warps (`num_warps`) over which each row is distributed.
# You will see in the next tutorial how to auto-tune this value in a more natural
# way so you don't have to come up with manual heuristics yourself.
num_warps = 4
- if BLOCK >= 2048: num_warps = 8
- if BLOCK >= 4096: num_warps = 16
+ if BLOCK_SIZE >= 2048:
+ num_warps = 8
+ if BLOCK_SIZE >= 4096:
+ num_warps = 16
# Allocate output
y = torch.empty_like(x)
- # Enqueue kernel. The launch grid is simple: we have one kernel instance per row of the input matrix
- _softmax[(M, )](y, x, x.stride(0), y.stride(0), M, N, num_warps=num_warps, BLOCK=BLOCK)
+    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per
+    # row of the input matrix
+ softmax_kernel[(n_rows,)](
+ y,
+ x,
+ x.stride(0),
+ y.stride(0),
+ n_cols,
+ num_warps=num_warps,
+ BLOCK_SIZE=BLOCK_SIZE,
+ )
return y
@@ -165,26 +194,26 @@ We can create a helper function that enqueues the kernel and its (meta-)argument
-.. GENERATED FROM PYTHON SOURCE LINES 111-113
+.. GENERATED FROM PYTHON SOURCE LINES 140-142
Unit Test
----------
-.. GENERATED FROM PYTHON SOURCE LINES 115-117
+.. GENERATED FROM PYTHON SOURCE LINES 144-146
We make sure that we test our kernel on a matrix with an irregular number of rows and columns.
This will allow us to verify that our padding mechanism works.
-.. GENERATED FROM PYTHON SOURCE LINES 117-124
+.. GENERATED FROM PYTHON SOURCE LINES 146-153
.. code-block:: default
torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
- y_tri = softmax(x)
- y_ref = torch.softmax(x, axis=1)
- print(torch.allclose(y_tri, y_ref))
+ y_triton = softmax(x)
+ y_torch = torch.softmax(x, axis=1)
+ print(torch.allclose(y_triton, y_torch))
@@ -201,18 +230,18 @@ This will allow us to verify that our padding mechanism works.
-.. GENERATED FROM PYTHON SOURCE LINES 125-126
+.. GENERATED FROM PYTHON SOURCE LINES 154-155
As expected, the results are identical.
-.. GENERATED FROM PYTHON SOURCE LINES 128-132
+.. GENERATED FROM PYTHON SOURCE LINES 157-161
Benchmark
-------------
Here we will benchmark our operation as a function of the number of columns in the input matrix -- assuming 4096 rows.
We will then compare its performance against (1) :code:`torch.softmax` and (2) the :code:`naive_softmax` defined above.
-.. GENERATED FROM PYTHON SOURCE LINES 132-161
+.. GENERATED FROM PYTHON SOURCE LINES 161-200
.. code-block:: default
@@ -221,14 +250,24 @@ We will then compare its performance against (1) :code:`torch.softmax` and (2) t
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['N'], # argument names to use as an x-axis for the plot
- x_vals=[128 * i for i in range(2, 100)], # different possible values for `x_name`
+ x_vals=[
+ 128 * i for i in range(2, 100)
+ ], # different possible values for `x_name`
line_arg='provider', # argument name whose value corresponds to a different line in the plot
- line_vals=['triton', 'torch-native', 'torch-jit'], # possible values for `line_arg``
- line_names=["Triton", "Torch (native)", "Torch (jit)"], # label name for the lines
+ line_vals=[
+ 'triton',
+ 'torch-native',
+ 'torch-jit',
+ ], # possible values for `line_arg``
+ line_names=[
+ "Triton",
+ "Torch (native)",
+ "Torch (jit)",
+ ], # label name for the lines
styles=[('blue', '-'), ('green', '-'), ('green', '--')], # line styles
ylabel="GB/s", # label name for the y-axis
plot_name="softmax-performance", # name for the plot. Used also as a file name for saving the plot.
- args={'M': 4096} # values for function arguments not in `x_names` and `y_name`
+ args={'M': 4096}, # values for function arguments not in `x_names` and `y_name`
)
)
def benchmark(M, N, provider):
@@ -263,22 +302,22 @@ We will then compare its performance against (1) :code:`torch.softmax` and (2) t
N Triton Torch (native) Torch (jit)
0 256.0 512.000001 546.133347 186.181817
1 384.0 585.142862 585.142862 153.600004
- 2 512.0 630.153853 585.142849 154.566038
- 3 640.0 682.666684 640.000002 160.000000
+ 2 512.0 630.153853 606.814814 154.566038
+ 3 640.0 660.645170 640.000002 160.000000
4 768.0 702.171410 664.216187 163.839992
.. ... ... ... ...
- 93 12160.0 812.359066 406.179533 199.140227
- 94 12288.0 812.429770 415.661740 199.399583
- 95 12416.0 810.840807 412.149375 199.054102
- 96 12544.0 810.925276 412.971190 199.308841
- 97 12672.0 811.007961 412.097543 199.264875
+ 93 12160.0 812.359066 406.179533 199.038365
+ 94 12288.0 812.429770 415.222812 199.298541
+ 95 12416.0 810.840807 412.149375 198.854847
+ 96 12544.0 810.925276 412.971190 199.209928
+ 97 12672.0 809.389265 412.097543 199.167004
[98 rows x 4 columns]
-.. GENERATED FROM PYTHON SOURCE LINES 162-167
+.. GENERATED FROM PYTHON SOURCE LINES 201-207
In the above plot, we can see that:
@@ -290,7 +329,7 @@ In the above plot, we can see that:
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 12.626 seconds)
+ **Total running time of the script:** ( 1 minutes 13.186 seconds)
.. _sphx_glr_download_getting-started_tutorials_02-fused-softmax.py:
diff --git a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
index 7e25e3a4c..bdfd22d37 100644
--- a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
+++ b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
@@ -20,58 +20,67 @@
Matrix Multiplication
======================
-In this tutorial, you will write a 25-lines high-performance FP16 matrix multiplication kernel that achieves performance on par with cuBLAS.
+In this tutorial, you will write a 25-line high-performance FP16 matrix multiplication
+kernel that achieves performance on par with cuBLAS.
You will specifically learn about:
- Block-level matrix multiplications
- Multi-dimensional pointer arithmetic
-- Program re-ordering for improved L2 cache hit rate
+- Program re-ordering for improved L2 cache hit rate
- Automatic performance tuning
-.. GENERATED FROM PYTHON SOURCE LINES 14-37
+.. GENERATED FROM PYTHON SOURCE LINES 15-42
Motivations
-------------
Matrix multiplications are a key building block of most modern high-performance computing systems.
-They are notoriously hard to optimize, hence their implementation is generally done by hardware vendors themselves as part of so-called "kernel libraries" (e.g., cuBLAS).
-Unfortunately, these libraries are often proprietary and cannot be easily customized to accomodate the needs of modern deep learning workloads (e.g., fused activation functions).
-In this tutorial, you will learn how to implement efficient matrix multiplications by yourself with Triton, in a way that is easy to customize and extend.
+They are notoriously hard to optimize, hence their implementation is generally done by
+hardware vendors themselves as part of so-called "kernel libraries" (e.g., cuBLAS).
+Unfortunately, these libraries are often proprietary and cannot be easily customized
+to accommodate the needs of modern deep learning workloads (e.g., fused activation functions).
+In this tutorial, you will learn how to implement efficient matrix multiplications by
+yourself with Triton, in a way that is easy to customize and extend.
-Roughly speaking, the kernel that we will write will implement the following blocked algorithm:
+Roughly speaking, the kernel that we will write will implement the following blocked
+algorithm to multiply a (MxK) by a (KxN) matrix:
.. code-block:: python
# do in parallel
- for m in range(0, M, BLOCK_M):
+ for m in range(0, M, BLOCK_SIZE_M):
# do in parallel
- for n in range(0, N, BLOCK_N):
- acc = zeros((BLOCK_M, BLOCK_N), dtype=float32)
- for k in range(0, K, BLOCK_K):
- a = A[m : m+BLOCK_M, k : k+BLOCK_K]
- b = B[k : k+BLOCK_K, n : n+BLOCK_N]
+ for n in range(0, N, BLOCK_SIZE_N):
+ acc = zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=float32)
+ for k in range(0, K, BLOCK_SIZE_K):
+ a = A[m : m+BLOCK_SIZE_M, k : k+BLOCK_SIZE_K]
+ b = B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]
acc += dot(a, b)
- C[m : m+BLOCK_M, n : n+BLOCK_N] = acc;
+ C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;
where each iteration of the doubly-nested for-loop corresponds to a Triton program instance.
-.. GENERATED FROM PYTHON SOURCE LINES 39-110
+.. GENERATED FROM PYTHON SOURCE LINES 44-119
Compute Kernel
----------------
The above algorithm is, actually, fairly straightforward to implement in Triton.
-The main difficulty comes from the computation of the memory locations at which blocks of :code:`A` and :code:`B` must be read in the inner loop. For that, we need multi-dimensional pointer arithmetics.
+The main difficulty comes from the computation of the memory locations at which blocks
+of :code:`A` and :code:`B` must be read in the inner loop. For that, we need
+multi-dimensional pointer arithmetics.
Pointer Arithmetics
~~~~~~~~~~~~~~~~~~~~
-For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given by :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
-Therefore, blocks of pointers for :code:`A[m : m+BLOCK_M, k:k+BLOCK_K]` and :code:`B[k : k+BLOCK_K, n : n+BLOCK_N]` can be defined in pseudo-code as:
+For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given
+by :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
+Therefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and
+:code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:
.. code-block:: python
- &A[m : m+BLOCK_M, k:k+BLOCK_K] = A + (m : m+BLOCK_M)[:, None]*A.stride(0) + (k : k+BLOCK_K)[None, :]*A.stride(1);
- &B[k : k+BLOCK_K, n:n+BLOCK_N] = B + (k : k+BLOCK_K)[:, None]*B.stride(0) + (n : n+BLOCK_N)[None, :]*B.stride(1);
+ &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] = A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
+ &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] = B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
Which means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:
@@ -79,9 +88,9 @@ Which means that pointers for blocks of A and B can be initialized (i.e., :code:
pid_m = triton.program_id(0)
pid_n = triton.program_id(1)
- rm = pid_m * BLOCK_M + triton.arange(0, BLOCK_M)
- rn = pid_n * BLOCK_N + triton.arange(0, BLOCK_N)
- rk = triton.arange(0, BLOCK_K)
+ rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)
+ rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)
+ rk = triton.arange(0, BLOCK_SIZE_K)
// pointer for A operand
pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);
// pointer for B operand
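To make the broadcasting in the pointer arithmetic above concrete, here is a small
NumPy sketch (purely illustrative, with made-up block sizes and strides): adding a
column of row offsets to a row of column offsets yields the full
``(BLOCK_SIZE_M, BLOCK_SIZE_K)`` grid of element offsets for one block of ``A``.

.. code-block:: python

    import numpy as np

    BLOCK_SIZE_M, BLOCK_SIZE_K = 4, 3
    stride_a_0, stride_a_1 = 8, 1  # hypothetical row-major strides for an (M, 8) matrix

    rm = np.arange(0, BLOCK_SIZE_M)  # rows of the block (here m = 0)
    rk = np.arange(0, BLOCK_SIZE_K)  # cols of the block (here k = 0)
    offsets = rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1
    print(offsets.shape)  # (4, 3)
    print(offsets)        # [[0 1 2], [8 9 10], [16 17 18], [24 25 26]]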
@@ -91,50 +100,72 @@ And then updated in the inner loop as follows:
.. code-block:: python
- pa += BLOCK_K * stride_a_1;
- pb += BLOCK_K * stride_b_0;
+ pa += BLOCK_SIZE_K * stride_a_1;
+ pb += BLOCK_SIZE_K * stride_b_0;
L2 Cache Optimizations
~~~~~~~~~~~~~~~~~~~~~~~~
-As mentioned above, each program instance computes an :code:`[BLOCK_M, BLOCK_N]` block of :code:`C`.
-It is important to remember that the order in which these blocks are computed does matter, since it affects the L2 cache hit rate of our program.
-And unfortunately, a simple row-major ordering
+As mentioned above, each program instance computes a :code:`[BLOCK_SIZE_M, BLOCK_SIZE_N]`
+block of :code:`C`.
+It is important to remember that the order in which these blocks are computed does
+matter, since it affects the L2 cache hit rate of our program. Unfortunately, a
+simple row-major ordering
.. code-block:: Python
pid = triton.program_id(0);
- grid_m = (M + BLOCK_M - 1) // BLOCK_M;
- grid_n = (N + BLOCK_N - 1) // BLOCK_N;
+ grid_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M;
+ grid_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N;
pid_m = pid / grid_n;
pid_n = pid % grid_n;
is just not going to cut it.
One possible solution is to launch blocks in an order that promotes data reuse.
-This can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before switching to the next column:
+This can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before
+switching to the next column:
.. code-block:: python
pid = triton.program_id(0);
width = GROUP_M * grid_n;
group_id = pid // width;
- # we need to handle the case where M % (GROUP_M*BLOCK_M) != 0
+ # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0
group_size = min(grid_m - group_id * GROUP_M, GROUP_M);
pid_m = group_id * GROUP_M + (pid % group_size);
pid_n = (pid % width) // (group_size);
-In practice, this can improve the performance of our matrix multiplication kernel by >10\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).
+.. GENERATED FROM PYTHON SOURCE LINES 119-130
+
+.. code-block:: default
-.. GENERATED FROM PYTHON SOURCE LINES 112-115
+ # For example, in the following matmul where each matrix is 9 blocks by 9 blocks,
+ # we can see that if we compute the output in row-major ordering, we need to load 90
+ # blocks into SRAM to compute the first 9 output blocks, but if we do it in grouped
+ # ordering, we only need to load 54 blocks.
+ # .. image:: grouped_vs_row_major_ordering.png
+ #
+ # In practice, this can improve the performance of our matrix multiplication kernel by
+ # more than 10\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).
+ #
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 131-134
Final Result
-------------
-.. GENERATED FROM PYTHON SOURCE LINES 115-190
+.. GENERATED FROM PYTHON SOURCE LINES 134-263
.. code-block:: default
@@ -144,74 +175,127 @@ Final Result
import triton.language as tl
# %
- # :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune` decorator, which consumes:
- # - A list of :code:`triton.Config` objects that define different configurations of meta-parameters (e.g., BLOCK_M) and compilation options (e.g., num_warps) to try
- # - A autotuning *key* whose change in values will trigger evaluation of all the provided configs
+ # :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune`
+ # decorator, which consumes:
+ # - A list of :code:`triton.Config` objects that define different configurations of
+ # meta-parameters (e.g., BLOCK_SIZE_M) and compilation options (e.g., num_warps) to try
+ # - An autotuning *key* whose change in values will trigger evaluation of all the
+ # provided configs
@triton.autotune(
configs=[
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),
- triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=3, num_warps=8),
- triton.Config({'BLOCK_M': 256, 'BLOCK_N': 64, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 256, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),
- triton.Config({'BLOCK_M': 128, 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=4, num_warps=4),\
- triton.Config({'BLOCK_M': 64 , 'BLOCK_N': 32 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),\
- triton.Config({'BLOCK_M': 32 , 'BLOCK_N': 64 , 'BLOCK_K': 32, 'GROUP_M': 8}, num_stages=5, num_warps=2),
- #triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 32, 'GROUP_M': 8}, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
+ triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
+ triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
+ triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
+ triton.Config({'BLOCK_SIZE_M': 32 , 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
],
key=['M', 'N', 'K'],
)
# %
# We can now define our kernel as normal, using all the techniques presented above
@triton.jit
- def _matmul(A, B, C, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, **META):
+ def matmul_kernel(
+ # Pointers to matrices
+ a_ptr,
+ b_ptr,
+ c_ptr,
+ # Matrix dimensions
+ M,
+ N,
+ K,
+ # The stride variables represent how much to increase the ptr by when moving by 1
+ # element in a particular dimension. E.g. stride_am is how much to increase a_ptr
+ # by to get the element one row down (A has M rows)
+ stride_am,
+ stride_ak,
+ stride_bk,
+ stride_bn,
+ stride_cm,
+ stride_cn,
+ **meta,
+ ):
+ """Kernel for computing the matmul AB = C
+
+ A has shape (M, K), B has shape (K, N) and C has shape (M, N)
+ """
# extract meta-parameters
- BLOCK_M = META['BLOCK_M']
- BLOCK_N = META['BLOCK_N']
- BLOCK_K = META['BLOCK_K']
- GROUP_M = 8
- # matrix multiplication
- pid = tl.program_id(0)
- grid_m = (M + BLOCK_M - 1) // BLOCK_M
- grid_n = (N + BLOCK_N - 1) // BLOCK_N
- # re-order program ID for better L2 performance
- width = GROUP_M * grid_n
- group_id = pid // width
- group_size = min(grid_m - group_id * GROUP_M, GROUP_M)
- pid_m = group_id * GROUP_M + (pid % group_size)
- pid_n = (pid % width) // (group_size)
- # do matrix multiplication
- rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
- rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
- rk = tl.arange(0, BLOCK_K)
- A = A + (rm[:, None] * stride_am + rk[None, :] * stride_ak)
- B = B + (rk[:, None] * stride_bk + rn[None, :] * stride_bn)
- acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
- for k in range(K, 0, -BLOCK_K):
- a = tl.load(A)
- b = tl.load(B)
- acc += tl.dot(a, b)
- A += BLOCK_K * stride_ak
- B += BLOCK_K * stride_bk
- # triton can accept arbitrary activation function
- # via metaparameters!
- if META['ACTIVATION']:
- acc = META['ACTIVATION'](acc)
- # rematerialize rm and rn to save registers
- rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
- rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
- C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)
- mask = (rm[:, None] < M) & (rn[None, :] < N)
- tl.store(C, acc, mask=mask)
+ BLOCK_SIZE_M = meta['BLOCK_SIZE_M']
+ BLOCK_SIZE_N = meta['BLOCK_SIZE_N']
+ BLOCK_SIZE_K = meta['BLOCK_SIZE_K']
+ GROUP_SIZE_M = 8
+ pid = tl.program_id(axis=0)
+
+        # the number of blocks is the ceil(M / BLOCK_SIZE_M), i.e. we need an extra block
+        # whenever M is not a multiple of BLOCK_SIZE_M. Note that this leads to some
+        # quantization in performance: the time taken jumps whenever a new block is needed.
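+        # (e.g. M = 1000 with BLOCK_SIZE_M = 128 gives ceil(1000 / 128) = 8 blocks along M)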
+ n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
+ n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N
+
+ # Map PIDs to the block they should compute. This is done in a grouped ordering
+ # to promote L2 cache reuse.
+ n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n
+ group_id = pid // n_output_blocks_in_group
+ first_m_block_in_group = group_id * GROUP_SIZE_M
+
+ # If the number of blocks is not divisible by the group size, the last group is smaller
+ group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)
+
+        # Within a group, programs are laid out in column-major order; block_m and block_n
+        # are the row and column of the output block that this program computes.
+ block_m = first_m_block_in_group + (pid % group_size_m)
+ block_n = (pid % n_output_blocks_in_group) // group_size_m
+
+ # Convert from block indices back to element indices
+ m_start = block_m * BLOCK_SIZE_M
+ n_start = block_n * BLOCK_SIZE_N
+
+ # Expand out to all the offsets for each of the elements in this block.
+ m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
+ n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
+ k_offsets = tl.arange(0, BLOCK_SIZE_K)
+
+ # Get the pointers for the first block of each. We will advance this pointer
+ # as we move in the K direction and accumulate.
+ # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers
+ a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])
+ # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers
+ b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)
+ # We accumulate internally in fp32, but the output is written out in the dtype
+ # of the tensor when it is stored
+ accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
+ for k in range(0, K, BLOCK_SIZE_K):
+ # Note that for simplicity, we don't apply a mask here. This means that if K is
+ # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and
+ # accumulate it incorrectly.
+ a = tl.load(a_ptrs)
+ b = tl.load(b_ptrs)
+ # We accumulate along the K dimension
+ accumulator += tl.dot(a, b)
+
+ # Advance the ptrs to the next K block
+ a_ptrs += BLOCK_SIZE_K * stride_ak
+ b_ptrs += BLOCK_SIZE_K * stride_bk
+        # Triton can accept an arbitrary activation function via meta-parameters!
+ if meta['ACTIVATION']:
+ accumulator = meta['ACTIVATION'](accumulator)
+
+ m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
+ n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
+ c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c
+ mask = (m_offsets_c < M) & (n_offsets_c < N)
+ tl.store(c_ptrs, accumulator, mask=mask)
    # we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `matmul_kernel`
@triton.jit
def leaky_relu(x):
- return tl.where(x >= 0, x, 0.01*x)
+ return tl.where(x >= 0, x, 0.01 * x)
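+    # leaky_relu can then be fused into the kernel by passing it as the activation,
+    # e.g. `matmul(a, b, activation=leaky_relu)`, as done in the benchmark below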
@@ -220,33 +304,49 @@ Final Result
-.. GENERATED FROM PYTHON SOURCE LINES 191-193
+
+.. GENERATED FROM PYTHON SOURCE LINES 264-266
We can now create a convenience wrapper function that only takes two input tensors
and (1) checks any shape constraint; (2) allocates the output; (3) launches the above kernel
-.. GENERATED FROM PYTHON SOURCE LINES 193-214
+.. GENERATED FROM PYTHON SOURCE LINES 266-302
.. code-block:: default
+
def matmul(a, b, activation=None):
# checks constraints
assert a.shape[1] == b.shape[0], "incompatible dimensions"
assert a.is_contiguous(), "matrix A must be contiguous"
assert b.is_contiguous(), "matrix B must be contiguous"
M, K = a.shape
- _, N = b.shape
+ K, N = b.shape
+ assert (
+ K % 32 == 0
+ ), "We don't check memory-out-of-bounds with K so K must be divisible by BLOCK_SIZE_K"
# allocates output
c = torch.empty((M, N), device=a.device, dtype=a.dtype)
- # launch kernel
- grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']), )
- pgm = _matmul[grid](
- a, b, c, M, N, K, \
- a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),\
- ACTIVATION = activation
+ # 1D launch kernel where each block gets its own program.
+ grid = lambda META: (
+ triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),
+ )
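+        # (e.g. for the 512x512 unit test below, a config with BLOCK_SIZE_M = BLOCK_SIZE_N = 128
+        # launches (512 / 128) * (512 / 128) = 16 programs)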
+ matmul_kernel[grid](
+ a,
+ b,
+ c,
+ M,
+ N,
+ K,
+ a.stride(0),
+ a.stride(1),
+ b.stride(0),
+ b.stride(1),
+ c.stride(0),
+ c.stride(1),
+ ACTIVATION=activation,
)
- # done; return the output tensor
return c
@@ -257,14 +357,14 @@ and (1) checks any shape constraint; (2) allocates the output; (3) launches the
-.. GENERATED FROM PYTHON SOURCE LINES 215-219
+.. GENERATED FROM PYTHON SOURCE LINES 303-307
Unit Test
-----------
We can test our custom matrix multiplication operation against a native torch implementation (i.e., cuBLAS)
-.. GENERATED FROM PYTHON SOURCE LINES 219-229
+.. GENERATED FROM PYTHON SOURCE LINES 307-320
.. code-block:: default
@@ -272,11 +372,14 @@ We can test our custom matrix multiplication operation against a native torch im
torch.manual_seed(0)
a = torch.randn((512, 512), device='cuda', dtype=torch.float16)
b = torch.randn((512, 512), device='cuda', dtype=torch.float16)
- c_0 = matmul(a, b, activation=None)
- c_1 = torch.matmul(a, b)
- print(c_0)
- print(c_1)
- print(triton.testing.allclose(c_0, c_1))
+ triton_output = matmul(a, b, activation=None)
+ torch_output = torch.matmul(a, b)
+ print(f"{triton_output=}")
+ print(f"{torch_output=}")
+    if triton.testing.allclose(triton_output, torch_output):
+        print("✅ Triton and Torch match")
+    else:
+        print("❌ Triton and Torch differ")
@@ -288,7 +391,7 @@ We can test our custom matrix multiplication operation against a native torch im
.. code-block:: none
- tensor([[ 1.1045, -36.9688, 31.4688, ..., -11.3984, 24.4531, -32.3438],
+ triton_output=tensor([[ 1.1045, -36.9688, 31.4688, ..., -11.3984, 24.4531, -32.3438],
[ 6.3555, -19.6094, 34.0938, ..., -5.8945, 5.2891, 6.8867],
[-32.0625, 5.9492, 15.3984, ..., -21.3906, -23.9844, -10.1328],
...,
@@ -296,7 +399,7 @@ We can test our custom matrix multiplication operation against a native torch im
[ 25.5000, 24.3281, -8.4688, ..., -18.9375, 32.5312, -29.9219],
[ -5.3477, 4.9844, 11.8906, ..., 5.5898, 6.4023, -17.3125]],
device='cuda:0', dtype=torch.float16)
- tensor([[ 1.1045, -36.9688, 31.4688, ..., -11.3906, 24.4531, -32.3438],
+ torch_output=tensor([[ 1.1045, -36.9688, 31.4688, ..., -11.3906, 24.4531, -32.3438],
[ 6.3516, -19.6094, 34.0938, ..., -5.8906, 5.2812, 6.8828],
[-32.0625, 5.9531, 15.3984, ..., -21.4062, -23.9844, -10.1328],
...,
@@ -304,12 +407,12 @@ We can test our custom matrix multiplication operation against a native torch im
[ 25.5000, 24.3438, -8.4609, ..., -18.9375, 32.5312, -29.9219],
[ -5.3477, 4.9805, 11.8828, ..., 5.5859, 6.4023, -17.3125]],
device='cuda:0', dtype=torch.float16)
- tensor(True, device='cuda:0')
+    ✅ Triton and Torch match
-.. GENERATED FROM PYTHON SOURCE LINES 230-236
+.. GENERATED FROM PYTHON SOURCE LINES 321-327
Benchmark
--------------
@@ -318,7 +421,7 @@ Square Matrix Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~
We can now compare the performance of our kernel against that of cuBLAS. Here we focus on square matrices, but feel free to arrange this script as you wish to benchmark any other matrix shape.
-.. GENERATED FROM PYTHON SOURCE LINES 236-268
+.. GENERATED FROM PYTHON SOURCE LINES 327-368
.. code-block:: default
@@ -327,14 +430,19 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['M', 'N', 'K'], # argument names to use as an x-axis for the plot
- x_vals=[128 * i for i in range(1, 33)], # different possible values for `x_name`
+ x_vals=[
+ 128 * i for i in range(1, 33)
+ ], # different possible values for `x_name`
line_arg='provider', # argument name whose value corresponds to a different line in the plot
- line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'], # possible values for `line_arg``
- line_names=["cuBLAS", "cuBLAS (+ torch.nn.LeakyReLU)", "Triton", "Triton (+ LeakyReLU)"], # label name for the lines
- styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')], # line styles
+ # possible values for `line_arg``
+ line_vals=['cublas', 'cublas + relu', 'triton', 'triton + relu'],
+ # label name for the lines
+ line_names=["cuBLAS", "cuBLAS (+ torch.nn.LeakyReLU)", "Triton", "Triton (+ LeakyReLU)"],
+ # line styles
+ styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')],
ylabel="TFLOPS", # label name for the y-axis
plot_name="matmul-performance", # name for the plot. Used also as a file name for saving the plot.
- args={}
+ args={},
)
)
def benchmark(M, N, K, provider):
@@ -346,9 +454,13 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b))
if provider == 'cublas + relu':
torch_relu = torch.nn.ReLU(inplace=True)
- ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch_relu(torch.matmul(a, b)))
+ ms, min_ms, max_ms = triton.testing.do_bench(
+ lambda: torch_relu(torch.matmul(a, b))
+ )
if provider == 'triton + relu':
- ms, min_ms, max_ms = triton.testing.do_bench(lambda: matmul(a, b, activation=leaky_relu))
+ ms, min_ms, max_ms = triton.testing.do_bench(
+ lambda: matmul(a, b, activation=leaky_relu)
+ )
perf = lambda ms: 2 * M * N * K * 1e-12 / (ms * 1e-3)
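        # a matmul performs 2 * M * N * K floating-point operations (one multiply and one
        # add per accumulated term), so the lambda above converts a runtime in ms into TFLOP/s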
return perf(ms), perf(max_ms), perf(min_ms)
@@ -371,37 +483,37 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
matmul-performance:
M cuBLAS ... Triton Triton (+ LeakyReLU)
0 128.0 0.455111 ... 0.512000 0.512000
- 1 256.0 2.730667 ... 2.978909 2.978909
- 2 384.0 7.372800 ... 7.899428 8.507077
- 3 512.0 14.563555 ... 16.384000 16.384000
+ 1 256.0 2.978909 ... 2.978909 2.978909
+ 2 384.0 7.372800 ... 8.507077 7.899428
+ 3 512.0 14.563555 ... 16.384000 15.420235
4 640.0 22.260869 ... 24.380953 24.380953
5 768.0 32.768000 ... 34.028308 34.028308
- 6 896.0 39.025776 ... 39.025776 39.025776
+ 6 896.0 39.025776 ... 39.025776 35.123201
7 1024.0 49.932191 ... 52.428801 52.428801
- 8 1152.0 45.242181 ... 46.656000 45.938215
- 9 1280.0 51.200001 ... 56.109587 56.109587
- 10 1408.0 64.138541 ... 65.684049 58.640951
- 11 1536.0 79.526831 ... 75.296679 75.296679
- 12 1664.0 63.372618 ... 61.636381 62.061463
- 13 1792.0 72.983276 ... 69.379162 68.953520
- 14 1920.0 69.467336 ... 67.434145 70.172588
- 15 2048.0 73.584279 ... 75.573044 74.898285
- 16 2176.0 83.155572 ... 80.817862 77.398646
- 17 2304.0 68.251065 ... 72.828879 73.051599
- 18 2432.0 71.305746 ... 80.963875 80.963875
- 19 2560.0 77.649287 ... 75.676673 74.983980
- 20 2688.0 83.186525 ... 84.671999 82.823267
- 21 2816.0 82.916747 ... 76.115547 79.733474
- 22 2944.0 82.237674 ... 80.771529 78.358539
- 23 3072.0 82.062468 ... 84.892208 82.782312
- 24 3200.0 84.544253 ... 88.397792 89.385477
- 25 3328.0 79.812967 ... 80.617354 81.071278
- 26 3456.0 81.518272 ... 86.970406 81.600781
- 27 3584.0 87.042978 ... 96.372338 90.640517
- 28 3712.0 84.230479 ... 82.764991 82.423549
- 29 3840.0 80.255442 ... 81.377484 80.783056
- 30 3968.0 89.329379 ... 85.932350 87.347124
- 31 4096.0 93.531519 ... 85.816960 91.056800
+ 8 1152.0 44.566925 ... 46.656000 46.656000
+ 9 1280.0 51.200001 ... 56.888887 56.109587
+ 10 1408.0 64.138541 ... 64.902096 64.902096
+ 11 1536.0 78.643199 ... 76.106321 76.106321
+ 12 1664.0 62.929456 ... 62.061463 62.061463
+ 13 1792.0 72.983276 ... 69.810085 69.379162
+ 14 1920.0 67.764707 ... 70.530615 70.530615
+ 15 2048.0 73.908442 ... 75.234154 74.898285
+ 16 2176.0 83.500614 ... 81.143743 81.143743
+ 17 2304.0 68.446623 ... 73.501144 73.501144
+ 18 2432.0 71.305746 ... 82.147552 82.147552
+ 19 2560.0 77.833728 ... 77.283019 77.101175
+ 20 2688.0 81.053536 ... 81.928846 83.922689
+ 21 2816.0 81.981598 ... 79.443003 80.320825
+ 22 2944.0 82.373605 ... 77.385141 78.112900
+ 23 3072.0 81.472093 ... 83.761985 79.638683
+ 24 3200.0 84.768213 ... 88.888888 85.561498
+ 25 3328.0 83.905938 ... 87.794262 87.156532
+ 26 3456.0 80.220468 ... 85.676480 84.068369
+ 27 3584.0 86.707226 ... 95.553020 94.847460
+ 28 3712.0 83.247783 ... 84.303780 85.309435
+ 29 3840.0 80.255442 ... 83.339866 85.005380
+ 30 3968.0 88.938731 ... 87.409694 87.159957
+ 31 4096.0 91.616198 ... 89.597949 89.538177
[32 rows x 5 columns]
@@ -411,7 +523,7 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 2 minutes 14.738 seconds)
+ **Total running time of the script:** ( 2 minutes 30.425 seconds)
.. _sphx_glr_download_getting-started_tutorials_03-matrix-multiplication.py:
diff --git a/_sources/getting-started/tutorials/sg_execution_times.rst.txt b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
index 062e2467e..b82eb7866 100644
--- a/_sources/getting-started/tutorials/sg_execution_times.rst.txt
+++ b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
Computation times
=================
-**03:38.360** total execution time for **getting-started_tutorials** files:
+**03:54.665** total execution time for **getting-started_tutorials** files:
+---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 02:14.738 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 02:30.425 | 0.0 MB |
+---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``) | 01:12.626 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``) | 01:13.186 | 0.0 MB |
+---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``) | 00:10.996 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``) | 00:11.055 | 0.0 MB |
+---------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/getting-started/tutorials/01-vector-add.html b/getting-started/tutorials/01-vector-add.html
index 3b6d7bd79..43508e399 100644
--- a/getting-started/tutorials/01-vector-add.html
+++ b/getting-started/tutorials/01-vector-add.html
@@ -201,49 +201,58 @@ to download the full example code
Compute Kernel
import torch
-import triton.language as tl
import triton
+import triton.language as tl
@triton.jit
-def _add(
- X, # *Pointer* to first input vector
- Y, # *Pointer* to second input vector
- Z, # *Pointer* to output vector
- N, # Size of the vector
- **meta # Optional meta-parameters for the kernel
+def add_kernel(
+ x_ptr, # *Pointer* to first input vector
+ y_ptr, # *Pointer* to second input vector
+ output_ptr, # *Pointer* to output vector
+ n_elements, # Size of the vector
+ **meta, # Optional meta-parameters for the kernel
):
- pid = tl.program_id(0)
- # Create an offset for the blocks of pointers to be
- # processed by this program instance
- offsets = pid * meta['BLOCK'] + tl.arange(0, meta['BLOCK'])
- # Create a mask to guard memory operations against
- # out-of-bounds accesses
- mask = offsets < N
- # Load x
- x = tl.load(X + offsets, mask=mask)
- y = tl.load(Y + offsets, mask=mask)
- # Write back x + y
- z = x + y
- tl.store(Z + offsets, z)
+ BLOCK_SIZE = meta['BLOCK_SIZE'] # How many inputs each program should process
+ # There are multiple 'program's processing different data. We identify which program
+ # we are here
+ pid = tl.program_id(axis=0) # We use a 1D launch grid so axis is 0
+ # This program will process inputs that are offset from the initial data.
+    # For instance, if you had a vector of length 256 and a block_size of 64, the programs
+ # would each access the elements [0:64, 64:128, 128:192, 192:256].
+ # Note that offsets is a list of pointers
+ block_start = pid * BLOCK_SIZE
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
+ # Create a mask to guard memory operations against out-of-bounds accesses
+ mask = offsets < n_elements
+    # Load x and y from DRAM, masking out any extra elements in case the input is not a
+ # multiple of the block size
+ x = tl.load(x_ptr + offsets, mask=mask)
+ y = tl.load(y_ptr + offsets, mask=mask)
+ output = x + y
+ # Write x + y back to DRAM
+ tl.store(output_ptr + offsets, output)
Let's also declare a helper function to (1) allocate the z tensor
and (2) enqueue the above kernel with appropriate grid/block sizes.
-
-def add(x, y):
- z = torch.empty_like(x)
- N = z.shape[0]
+def add(x: torch.Tensor, y: torch.Tensor):
+ # We need to preallocate the output
+ output = torch.empty_like(x)
+ assert x.is_cuda and y.is_cuda and output.is_cuda
+ n_elements = output.shape[0]
# The SPMD launch grid denotes the number of kernel instances that run in parallel.
# It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int]
- grid = lambda meta: (triton.cdiv(N, meta['BLOCK']), )
+ # In this case, we use a 1D grid where the size is the number of blocks
+ grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
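+    # (equivalently, a static grid for this call would be `grid = (triton.cdiv(n_elements, 1024),)`,
+    #  since BLOCK_SIZE=1024 is passed below)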
# NOTE:
# - each torch.tensor object is implicitly converted into a pointer to its first element.
    # - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel
    # - don't forget to pass meta-parameters as keyword arguments
- _add[grid](x, y, z, N, BLOCK=1024)
+ add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
# We return a handle to z but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still
# running asynchronously at this point.
- return z
+ return output
We can now use the above function to compute the element-wise sum of two torch.tensor objects and test its correctness:
@@ -251,11 +260,14 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
-za = x + y
-zb = add(x, y)
-print(za)
-print(zb)
-print(f'The maximum difference between torch and triton is ' f'{torch.max(torch.abs(za - zb))}')
+output_torch = x + y
+output_triton = add(x, y)
+print(output_torch)
+print(output_triton)
+print(
+ f'The maximum difference between torch and triton is '
+ f'{torch.max(torch.abs(output_torch - output_triton))}'
+)
Out:
@@ -274,15 +286,17 @@ for different problem sizes.
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['size'], # argument names to use as an x-axis for the plot
- x_vals=[2**i for i in range(12, 28, 1)], # different possible values for `x_name`
+ x_vals=[
+ 2 ** i for i in range(12, 28, 1)
+ ], # different possible values for `x_name`
x_log=True, # x axis is logarithmic
line_arg='provider', # argument name whose value corresponds to a different line in the plot
line_vals=['triton', 'torch'], # possible values for `line_arg`
- line_names=["Triton", "Torch"], # label name for the lines
+ line_names=['Triton', 'Torch'], # label name for the lines
styles=[('blue', '-'), ('green', '-')], # line styles
- ylabel="GB/s", # label name for the y-axis
- plot_name="vector-add-performance", # name for the plot. Used also as a file name for saving the plot.
- args={} # values for function arguments not in `x_names` and `y_name`
+ ylabel='GB/s', # label name for the y-axis
+ plot_name='vector-add-performance', # name for the plot. Used also as a file name for saving the plot.
+ args={}, # values for function arguments not in `x_names` and `y_name`
)
)
def benchmark(size, provider):
@@ -307,23 +321,23 @@ for different problem sizes.
size Triton Torch
0 4096.0 9.600000 9.600000
1 8192.0 19.200000 19.200000
-2 16384.0 38.400001 38.400001
-3 32768.0 76.800002 76.800002
+2 16384.0 31.999999 31.999999
+3 32768.0 63.999998 76.800002
4 65536.0 127.999995 127.999995
5 131072.0 219.428568 219.428568
-6 262144.0 384.000001 384.000001
+6 262144.0 341.333321 384.000001
7 524288.0 472.615390 472.615390
8 1048576.0 614.400016 614.400016
9 2097152.0 722.823517 722.823517
10 4194304.0 780.190482 780.190482
-11 8388608.0 819.200021 812.429770
+11 8388608.0 812.429770 812.429770
12 16777216.0 833.084721 833.084721
13 33554432.0 843.811163 843.811163
14 67108864.0 849.278610 848.362445
15 134217728.0 851.577704 850.656574
-
Total running time of the script: ( 0 minutes 10.996 seconds)
+
Total running time of the script: ( 0 minutes 11.055 seconds)