diff --git a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip
index ec4554643..184dfcc42 100644
Binary files a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip and b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip differ
diff --git a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip
index a5b6a97d1..eea2bf9a4 100644
Binary files a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip and b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip differ
diff --git a/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb b/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
index a63d64e3a..0fabbe9d0 100644
--- a/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
+++ b/_downloads/b51b68bc1c6b1a5e509f67800b6235af/03-matrix-multiplication.ipynb
@@ -22,14 +22,14 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Motivations\nMatrix multiplications are a key building block of most modern high-performance computing systems.\nThey are notoriously hard to optimize, hence their implementation is generally done by\nhardware vendors themselves as part of so-called \"kernel libraries\" (e.g., cuBLAS).\nUnfortunately, these libraries are often proprietary and cannot be easily customized\nto accomodate the needs of modern deep learning workloads (e.g., fused activation functions).\nIn this tutorial, you will learn how to implement efficient matrix multiplications by\nyourself with Triton, in a way that is easy to customize and extend.\n\nRoughly speaking, the kernel that we will write will implement the following blocked\nalgorithm to multiply a (MxK) by a (KxN) matrix:\n\n .. code-block:: python\n\n   # do in parallel\n   for m in range(0, M, BLOCK_SIZE_M):\n     # do in parallel\n     for n in range(0, N, BLOCK_SIZE_N):\n       acc = zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=float32)\n       for k in range(0, K, BLOCK_SIZE_K):\n         a = A[m : m+BLOCK_SIZE_M, k : k+BLOCK_SIZE_K]\n         b = B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]\n         acc += dot(a, b)\n       C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;\n\nwhere each iteration of the doubly-nested for-loop corresponds to a Triton program instance.\n\n"
+        "## Motivations\nMatrix multiplications are a key building block of most modern high-performance computing systems.\nThey are notoriously hard to optimize, hence their implementation is generally done by\nhardware vendors themselves as part of so-called \"kernel libraries\" (e.g., cuBLAS).\nUnfortunately, these libraries are often proprietary and cannot be easily customized\nto accommodate the needs of modern deep learning workloads (e.g., fused activation functions).\nIn this tutorial, you will learn how to implement efficient matrix multiplications by\nyourself with Triton, in a way that is easy to customize and extend.\n\nRoughly speaking, the kernel that we will write will implement the following blocked\nalgorithm to multiply a (M, K) by a (K, N) matrix:\n\n .. code-block:: python\n\n   # do in parallel\n   for m in range(0, M, BLOCK_SIZE_M):\n     # do in parallel\n     for n in range(0, N, BLOCK_SIZE_N):\n       acc = zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=float32)\n       for k in range(0, K, BLOCK_SIZE_K):\n         a = A[m : m+BLOCK_SIZE_M, k : k+BLOCK_SIZE_K]\n         b = B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]\n         acc += dot(a, b)\n       C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;\n\nwhere each iteration of the doubly-nested for-loop is performed by a dedicated Triton program instance.\n\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Compute Kernel\n\nThe above algorithm is, actually, fairly straightforward to implement in Triton.\nThe main difficulty comes from the computation of the memory locations at which blocks\nof :code:`A` and :code:`B` must be read in the inner loop. For that, we need\nmulti-dimensional pointer arithmetics.\n\n### Pointer Arithmetics\n\nFor a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given b\ny :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.\nTherefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and\n:code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:\n\n .. code-block:: python\n\n   &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);\n   &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);\n\nWhich means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:\n\n .. code-block:: python\n\n   pid_m = triton.program_id(0)\n   pid_n = triton.program_id(1)\n   rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)\n   rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)\n   rk = triton.arange(0, BLOCK_SIZE_K)\n   // pointer for A operand\n   pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);\n   // pointer for B operand\n   pb = B + (rk[:, None] * stride_b_0 + rn[None, :] * stride_b_1);\n\nAnd then updated in the inner loop as follows:\n\n .. code-block:: python\n\n   pa += BLOCK_SIZE_K * stride_a_1;\n   pb += BLOCK_SIZE_K * stride_b_0;\n\n\n### L2 Cache Optimizations\n\nAs mentioned above, each program instance computes a :code:`[BLOCK_SIZE_M, BLOCK_SIZE_N]`\nblock of :code:`C`.\nIt is important to remember that the order in which these blocks are computed does\nmatter, since it affects the L2 cache hit rate of our program. 
and unfortunately, a\na simple row-major ordering\n\n .. code-block:: Python\n\n   pid = triton.program_id(0);\n   grid_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M;\n   grid_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N;\n   pid_m = pid / grid_n;\n   pid_n = pid % grid_n;\n\nis just not going to cut it.\n\nOne possible solution is to launch blocks in an order that promotes data reuse.\nThis can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before\nswitching to the next column:\n\n .. code-block:: python\n\n   pid = triton.program_id(0);\n   width = GROUP_M * grid_n;\n   group_id = pid // width;\n   # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0\n   group_size = min(grid_m - group_id * GROUP_M, GROUP_M);\n   pid_m = group_id * GROUP_M + (pid % group_size);\n   pid_n = (pid % width) // (group_size);\n\nFor example, in the following matmul where each matrix is 9 blocks by 9 blocks,\nwe can see that if we compute the output in row-major ordering, we need to load 90\nblocks into SRAM to compute the first 9 output blocks, but if we do it in grouped\nordering, we only need to load 54 blocks.\n  .. image:: grouped_vs_row_major_ordering.png\n\nIn practice, this can improve the performance of our matrix multiplication kernel by\nmore than 10\\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).\n\n\n"
+        "## Compute Kernel\n\nThe above algorithm is, actually, fairly straightforward to implement in Triton.\nThe main difficulty comes from the computation of the memory locations at which blocks\nof :code:`A` and :code:`B` must be read in the inner loop. For that, we need\nmulti-dimensional pointer arithmetics.\n\n### Pointer Arithmetics\n\nFor a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given b\ny :code:`&X[i, j] = X + i*stride_xi + j*stride_xj`.\nTherefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and\n:code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:\n\n .. code-block:: python\n\n   &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  a_ptr + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);\n   &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  b_ptr + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);\n\nWhich means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:\n\n .. code-block:: python\n\n   offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)\n   offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)\n   offs_k = tl.arange(0, BLOCK_SIZE_K)\n   a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)\n   b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)\n\nAnd then updated in the inner loop as follows:\n\n .. code-block:: python\n\n   pa += BLOCK_SIZE_K * stride_ak;\n   pb += BLOCK_SIZE_K * stride_bk;\n\n\n### L2 Cache Optimizations\n\nAs mentioned above, each program instance computes a :code:`[BLOCK_SIZE_M, BLOCK_SIZE_N]`\nblock of :code:`C`.\nIt is important to remember that the order in which these blocks are computed does\nmatter, since it affects the L2 cache hit rate of our program. and unfortunately, a\na simple row-major ordering\n\n .. 
code-block:: Python\n\n   pid = triton.program_id(0);\n   grid_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M;\n   grid_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N;\n   pid_m = pid / grid_n;\n   pid_n = pid % grid_n;\n\nis just not going to cut it.\n\nOne possible solution is to launch blocks in an order that promotes data reuse.\nThis can be done by 'super-grouping' blocks in groups of :code:`GROUP_M` rows before\nswitching to the next column:\n\n .. code-block:: python\n\n   # program ID\n   pid = tl.program_id(axis=0)\n   # number of program ids along the M axis\n   num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)\n   # number of programs ids along the N axis\n   num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)\n   # number of programs in group\n   num_pid_in_group = GROUP_SIZE_M * num_pid_n \n   # id of the group this program is in\n   group_id = pid // num_pid_in_group \n   # row-id of the first program in the group\n   first_pid_m = group_id * GROUP_SIZE_M \n   # if `num_pid_m` isn't divisible by `GROUP_SIZE_M`, the last group is smaller\n   group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M) \n   # *within groups*, programs are ordered in a column-major order\n   # row-id of the program in the *launch grid*\n   pid_m = first_pid_m + (pid % group_size_m)\n   # col-id of the program in the *launch grid*\n   pid_n = (pid % num_pid_in_group) // group_size_m\n\nFor example, in the following matmul where each matrix is 9 blocks by 9 blocks,\nwe can see that if we compute the output in row-major ordering, we need to load 90\nblocks into SRAM to compute the first 9 output blocks, but if we do it in grouped\nordering, we only need to load 54 blocks.\n  .. image:: grouped_vs_row_major_ordering.png\n\nIn practice, this can improve the performance of our matrix multiplication kernel by\nmore than 10\\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).\n\n\n"
       ]
     },
     {
@@ -47,7 +47,7 @@
       },
       "outputs": [],
       "source": [
-        "import torch\nimport triton\nimport triton.language as tl\n\n# %\n# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune`\n# decorator, which consumes:\n#   - A list of :code:`triton.Config` objects that define different configurations of\n#       meta-parameters (e.g., BLOCK_SIZE_M) and compilation options (e.g., num_warps) to try\n#   - An autotuning *key* whose change in values will trigger evaluation of all the\n#       provided configs\n\n@triton.autotune(\n    configs=[\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n        triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n        triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 64,  'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n        triton.Config({'BLOCK_SIZE_M': 32 , 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n    ],\n    key=['M', 'N', 'K'],\n)\n# %\n# We can now define our kernel as normal, using all the techniques presented 
above\n@triton.jit\ndef matmul_kernel(\n    # Pointers to matrices\n    a_ptr,\n    b_ptr,\n    c_ptr,\n    # Matrix dimensions\n    M,\n    N,\n    K,\n    # The stride variables represent how much to increase the ptr by when moving by 1\n    # element in a particular dimension. E.g. stride_am is how much to increase a_ptr\n    # by to get the element one row down (A has M rows)\n    stride_am,\n    stride_ak,\n    stride_bk,\n    stride_bn,\n    stride_cm,\n    stride_cn,\n    **meta,\n):\n    \"\"\"Kernel for computing the matmul AB = C\n\n    A has shape (M, K), B has shape (K, N) and C has shape (M, N)\n    \"\"\"\n    # extract meta-parameters\n    BLOCK_SIZE_M = meta['BLOCK_SIZE_M']\n    BLOCK_SIZE_N = meta['BLOCK_SIZE_N']\n    BLOCK_SIZE_K = meta['BLOCK_SIZE_K']\n    GROUP_SIZE_M = 8\n    pid = tl.program_id(axis=0)\n\n    # the number of blocks is the ceil(M / BLOCK_SIZE_M) since we need an extra block\n    # Note that this will lead to some quantization in performance where time-taken jumps\n    # when you need to add a new block\n    n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M\n    n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N\n\n    # Map PIDs to the block they should compute. 
This is done in a grouped ordering\n    # to promote L2 cache reuse.\n    n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n\n    group_id = pid // n_output_blocks_in_group\n    first_m_block_in_group = group_id * GROUP_SIZE_M\n\n    # If the number of blocks is not divisible by the group size, the last group is smaller\n    group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)\n\n    # Within a group, we compute in col-major ordering, block_m and block_n are the\n    # output row and col that this program is computing in terms of blocks\n    block_m = first_m_block_in_group + (pid % group_size_m)\n    block_n = (pid % n_output_blocks_in_group) // group_size_m\n\n    # Convert from block indices back to element indices\n    m_start = block_m * BLOCK_SIZE_M\n    n_start = block_n * BLOCK_SIZE_N\n\n    # Expand out to all the offsets for each of the elements in this block.\n    m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]\n    n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]\n    k_offsets = tl.arange(0, BLOCK_SIZE_K)\n\n    # Get the pointers for the first block of each. We will advance this pointer\n    # as we move in the K direction and accumulate.\n    # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers\n    a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])\n    # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers\n    b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)\n    # We accumulate internally in fp32, but the output is written out in the dtype\n    # of the tensor when it is stored\n    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)\n    for k in range(0, K, BLOCK_SIZE_K):\n        # Note that for simplicity, we don't apply a mask here. 
This means that if K is\n        # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and\n        # accumulate it incorrectly.\n        a = tl.load(a_ptrs)\n        b = tl.load(b_ptrs)\n        # We accumulate along the K dimension\n        accumulator += tl.dot(a, b)\n\n        # Advance the ptrs to the next K block\n        a_ptrs += BLOCK_SIZE_K * stride_ak\n        b_ptrs += BLOCK_SIZE_K * stride_bk\n    # triton can accept arbitrary activation function via metaparameters!\n    if meta['ACTIVATION']:\n        accumulator = meta['ACTIVATION'](accumulator)\n\n    m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]\n    n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]\n    c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c\n    mask = (m_offsets_c < M) & (n_offsets_c < N)\n    tl.store(c_ptrs, accumulator, mask=mask)\n\n\n# we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`\n@triton.jit\ndef leaky_relu(x):\n    return tl.where(x >= 0, x, 0.01 * x)"
+        "import torch\nimport triton\nimport triton.language as tl\n\n# %\n# :code:`triton.jit`'ed functions can be auto-tuned by using the `triton.autotune`\n# decorator, which consumes:\n#   - A list of :code:`triton.Config` objects that define different configurations of\n#       meta-parameters (e.g., BLOCK_SIZE_M) and compilation options (e.g., num_warps) to try\n#   - An autotuning *key* whose change in values will trigger evaluation of all the\n#       provided configs\n\n@triton.autotune(\n    configs=[\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n        triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n        triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 64,  'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),\n        triton.Config({'BLOCK_SIZE_M': 64 , 'BLOCK_SIZE_N': 32 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n        triton.Config({'BLOCK_SIZE_M': 32 , 'BLOCK_SIZE_N': 64 , 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),\n    ],\n    key=['M', 'N', 'K'],\n)\n# %\n# We can now define our kernel as normal, using all the techniques presented 
above\n@triton.jit\ndef matmul_kernel(\n    # Pointers to matrices\n    a_ptr, b_ptr, c_ptr,\n    # Matrix dimensions\n    M, N, K,\n    # The stride variables represent how much to increase the ptr by when moving by 1\n    # element in a particular dimension. E.g. stride_am is how much to increase a_ptr\n    # by to get the element one row down (A has M rows)\n    stride_am, stride_ak,\n    stride_bk, stride_bn,\n    stride_cm, stride_cn,\n    # Meta-parameters\n    **meta,\n):\n    \"\"\"Kernel for computing the matmul C = A x B.\n    A has shape (M, K), B has shape (K, N) and C has shape (M, N)\n    \"\"\"\n    # extract meta-parameters\n    BLOCK_SIZE_M = meta['BLOCK_SIZE_M']\n    BLOCK_SIZE_N = meta['BLOCK_SIZE_N']\n    BLOCK_SIZE_K = meta['BLOCK_SIZE_K']\n    GROUP_SIZE_M = 8\n\n    # -----------------------------------------------------------\n    # Map program ids `pid` to the block of C it should compute.\n    # This is done in a grouped ordering to promote L2 data reuse\n    # See above `L2 Cache Optimizations` section for details\n    pid = tl.program_id(axis=0)\n    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)\n    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)\n    num_pid_in_group = GROUP_SIZE_M * num_pid_n \n    group_id = pid // num_pid_in_group \n    first_pid_m = group_id * GROUP_SIZE_M \n    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M) \n    pid_m = first_pid_m + (pid % group_size_m)\n    pid_n = (pid % num_pid_in_group) // group_size_m\n\n    # ----------------------------------------------------------\n    # Create pointers for the first blocks of A and B.\n    # We will advance this pointer as we move in the K direction \n    # and accumulate\n    # a_ptrs is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers\n    # b_ptrs is a block of [BLOCK_SIZE_K, BLOCK_SIZE_n] pointers\n    # see above `Pointer Arithmetics` section for details\n    offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)\n    offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, 
BLOCK_SIZE_N)\n    offs_k = tl.arange(0, BLOCK_SIZE_K)\n    a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)\n    b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)\n\n    # -----------------------------------------------------------\n    # Iterate to compute a block of the C matrix\n    # We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block\n    # of fp32 values for higher accuracy.\n    # `accumulator` will be converted back to fp16 after the loop\n    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)\n    for k in range(0, K, BLOCK_SIZE_K):\n        # Note that for simplicity, we don't apply a mask here. \n        # This means that if K is not a multiple of BLOCK_SIZE_K, \n        # this will access out-of-bounds memory and produce an\n        # error or (worse!) incorrect results.\n        a = tl.load(a_ptrs)\n        b = tl.load(b_ptrs)\n        # We accumulate along the K dimension\n        accumulator += tl.dot(a, b)\n        # Advance the ptrs to the next K block\n        a_ptrs += BLOCK_SIZE_K * stride_ak\n        b_ptrs += BLOCK_SIZE_K * stride_bk\n    # you can fuse arbitrary activation functions here\n    # while the accumulator is still in FP32 !\n    if meta['ACTIVATION']: \n        accumulator = meta['ACTIVATION'](accumulator)\n    c = accumulator.to(tl.float16)\n\n    # -----------------------------------------------------------\n    # Write back the block of the output matrix C\n    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)\n    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)\n    c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]\n    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)\n    tl.store(c_ptrs, c, mask=c_mask)\n\n\n# we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`\n@triton.jit\ndef leaky_relu(x):\n    return tl.where(x >= 0, x, 0.01 * x)"
       ]
     },
     {
@@ -65,7 +65,7 @@
       },
       "outputs": [],
       "source": [
-        "def matmul(a, b, activation=None):\n    # checks constraints\n    assert a.shape[1] == b.shape[0], \"incompatible dimensions\"\n    assert a.is_contiguous(), \"matrix A must be contiguous\"\n    assert b.is_contiguous(), \"matrix B must be contiguous\"\n    M, K = a.shape\n    K, N = b.shape\n    assert (\n        K % 32 == 0\n    ), \"We don't check memory-out-of-bounds with K so K must be divisible by BLOCK_SIZE_K\"\n    # allocates output\n    c = torch.empty((M, N), device=a.device, dtype=a.dtype)\n    # 1D launch kernel where each block gets its own program.\n    grid = lambda META: (\n        triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),\n    )\n    matmul_kernel[grid](\n        a,\n        b,\n        c,\n        M,\n        N,\n        K,\n        a.stride(0),\n        a.stride(1),\n        b.stride(0),\n        b.stride(1),\n        c.stride(0),\n        c.stride(1),\n        ACTIVATION=activation,\n    )\n    return c"
+        "def matmul(a, b, activation=None):\n    # checks constraints\n    assert a.shape[1] == b.shape[0], \"incompatible dimensions\"\n    assert a.is_contiguous(), \"matrix A must be contiguous\"\n    assert b.is_contiguous(), \"matrix B must be contiguous\"\n    M, K = a.shape\n    K, N = b.shape\n    assert (\n        K % 32 == 0\n    ), \"We don't check memory-out-of-bounds with K so K must be divisible by BLOCK_SIZE_K\"\n    # allocates output\n    c = torch.empty((M, N), device=a.device, dtype=a.dtype)\n    # 1D launch kernel where each block gets its own program.\n    grid = lambda META: (\n        triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),\n    )\n    matmul_kernel[grid](\n        a, b, c,\n        M, N, K,\n        a.stride(0), a.stride(1),\n        b.stride(0), b.stride(1),\n        c.stride(0), c.stride(1),\n        ACTIVATION=activation,\n    )\n    return c"
       ]
     },
     {
diff --git a/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py b/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
index e71fae2d6..80207d8cf 100644
--- a/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
+++ b/_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py
@@ -23,7 +23,7 @@ You will specifically learn about:
 # yourself with Triton, in a way that is easy to customize and extend.
 #
 # Roughly speaking, the kernel that we will write will implement the following blocked
-# algorithm to multiply a (MxK) by a (KxN) matrix:
+# algorithm to multiply a (M, K) by a (K, N) matrix:
 #
 #  .. code-block:: python
 #
@@ -38,7 +38,7 @@ You will specifically learn about:
 #          acc += dot(a, b)
 #        C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;
 #
-# where each iteration of the doubly-nested for-loop corresponds to a Triton program instance.
+# where each iteration of the doubly-nested for-loop is performed by a dedicated Triton program instance.
 
 # %%
 # Compute Kernel
@@ -53,35 +53,31 @@ You will specifically learn about:
 # ~~~~~~~~~~~~~~~~~~~~
 #
-# For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given b
-# y :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
+# For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given by
+# :code:`&X[i, j] = X + i*stride_xi + j*stride_xj`.
 # Therefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and
 # :code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:
 #
 #  .. code-block:: python
 #
-#    &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
-#    &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
+#    &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  a_ptr + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
+#    &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  b_ptr + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
 #
 # Which means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:
 #
 #  .. code-block:: python
 #
-#    pid_m = triton.program_id(0)
-#    pid_n = triton.program_id(1)
-#    rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)
-#    rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)
-#    rk = triton.arange(0, BLOCK_SIZE_K)
-#    // pointer for A operand
-#    pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);
-#    // pointer for B operand
-#    pb = B + (rk[:, None] * stride_b_0 + rn[None, :] * stride_b_1);
+#    offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+#    offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+#    offs_k = tl.arange(0, BLOCK_SIZE_K)
+#    a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)
+#    b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)
 #
 # And then updated in the inner loop as follows:
 #
 #  .. code-block:: python
 #
-#    pa += BLOCK_SIZE_K * stride_a_1;
-#    pb += BLOCK_SIZE_K * stride_b_0;
+#    a_ptrs += BLOCK_SIZE_K * stride_ak
+#    b_ptrs += BLOCK_SIZE_K * stride_bk
 #
 #
 # L2 Cache Optimizations
@@ -109,13 +105,25 @@ You will specifically learn about:
 #
 #  .. code-block:: python
 #
-#    pid = triton.program_id(0);
-#    width = GROUP_M * grid_n;
-#    group_id = pid // width;
-#    # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0
-#    group_size = min(grid_m - group_id * GROUP_M, GROUP_M);
-#    pid_m = group_id * GROUP_M + (pid % group_size);
-#    pid_n = (pid % width) // (group_size);
+#    # program ID
+#    pid = tl.program_id(axis=0)
+#    # number of program ids along the M axis
+#    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
+#    # number of program ids along the N axis
+#    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
+#    # number of programs in group
+#    num_pid_in_group = GROUP_SIZE_M * num_pid_n
+#    # id of the group this program is in
+#    group_id = pid // num_pid_in_group
+#    # row-id of the first program in the group
+#    first_pid_m = group_id * GROUP_SIZE_M
+#    # if `num_pid_m` isn't divisible by `GROUP_SIZE_M`, the last group is smaller
+#    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
+#    # *within groups*, programs are ordered in a column-major order
+#    # row-id of the program in the *launch grid*
+#    pid_m = first_pid_m + (pid % group_size_m)
+#    # col-id of the program in the *launch grid*
+#    pid_n = (pid % num_pid_in_group) // group_size_m
 #
 # For example, in the following matmul where each matrix is 9 blocks by 9 blocks,
 # we can see that if we compute the output in row-major ordering, we need to load 90
@@ -164,26 +172,19 @@ import triton.language as tl
 @triton.jit
 def matmul_kernel(
     # Pointers to matrices
-    a_ptr,
-    b_ptr,
-    c_ptr,
+    a_ptr, b_ptr, c_ptr,
     # Matrix dimensions
-    M,
-    N,
-    K,
+    M, N, K,
     # The stride variables represent how much to increase the ptr by when moving by 1
     # element in a particular dimension. E.g. stride_am is how much to increase a_ptr
     # by to get the element one row down (A has M rows)
-    stride_am,
-    stride_ak,
-    stride_bk,
-    stride_bn,
-    stride_cm,
-    stride_cn,
+    stride_am, stride_ak,
+    stride_bk, stride_bn,
+    stride_cm, stride_cn,
+    # Meta-parameters
     **meta,
 ):
-    """Kernel for computing the matmul AB = C
-
+    """Kernel for computing the matmul C = A x B.
     A has shape (M, K), B has shape (K, N) and C has shape (M, N)
     """
     # extract meta-parameters
@@ -191,67 +192,65 @@ def matmul_kernel(
     BLOCK_SIZE_N = meta['BLOCK_SIZE_N']
     BLOCK_SIZE_K = meta['BLOCK_SIZE_K']
     GROUP_SIZE_M = 8
+
+    # -----------------------------------------------------------
+    # Map program ids `pid` to the block of C it should compute.
+    # This is done in a grouped ordering to promote L2 data reuse
+    # See above `L2 Cache Optimizations` section for details
     pid = tl.program_id(axis=0)
+    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
+    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
+    num_pid_in_group = GROUP_SIZE_M * num_pid_n
+    group_id = pid // num_pid_in_group
+    first_pid_m = group_id * GROUP_SIZE_M
+    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
+    pid_m = first_pid_m + (pid % group_size_m)
+    pid_n = (pid % num_pid_in_group) // group_size_m
 
-    # the number of blocks is the ceil(M / BLOCK_SIZE_M) since we need an extra block
-    # Note that this will lead to some quantization in performance where time-taken jumps
-    # when you need to add a new block
-    n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
-    n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N
+    # ----------------------------------------------------------
+    # Create pointers for the first blocks of A and B.
+    # We will advance these pointers as we move in the K direction
+    # and accumulate
+    # a_ptrs is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers
+    # b_ptrs is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers
+    # see above `Pointer Arithmetics` section for details
+    offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+    offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+    offs_k = tl.arange(0, BLOCK_SIZE_K)
+    a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)
+    b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)
 
-    # Map PIDs to the block they should compute. This is done in a grouped ordering
-    # to promote L2 cache reuse.
-    n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n
-    group_id = pid // n_output_blocks_in_group
-    first_m_block_in_group = group_id * GROUP_SIZE_M
-
-    # If the number of blocks is not divisible by the group size, the last group is smaller
-    group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)
-
-    # Within a group, we compute in col-major ordering, block_m and block_n are the
-    # output row and col that this program is computing in terms of blocks
-    block_m = first_m_block_in_group + (pid % group_size_m)
-    block_n = (pid % n_output_blocks_in_group) // group_size_m
-
-    # Convert from block indices back to element indices
-    m_start = block_m * BLOCK_SIZE_M
-    n_start = block_n * BLOCK_SIZE_N
-
-    # Expand out to all the offsets for each of the elements in this block.
-    m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
-    n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
-    k_offsets = tl.arange(0, BLOCK_SIZE_K)
-
-    # Get the pointers for the first block of each. We will advance this pointer
-    # as we move in the K direction and accumulate.
-    # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers
-    a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])
-    # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers
-    b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)
-    # We accumulate internally in fp32, but the output is written out in the dtype
-    # of the tensor when it is stored
+    # -----------------------------------------------------------
+    # Iterate to compute a block of the C matrix
+    # We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block
+    # of fp32 values for higher accuracy.
+    # `accumulator` will be converted back to fp16 after the loop
     accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
     for k in range(0, K, BLOCK_SIZE_K):
-        # Note that for simplicity, we don't apply a mask here. This means that if K is
-        # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and
-        # accumulate it incorrectly.
+        # Note that for simplicity, we don't apply a mask here. 
+        # This means that if K is not a multiple of BLOCK_SIZE_K, 
+        # this will access out-of-bounds memory and produce an
+        # error or (worse!) incorrect results.
         a = tl.load(a_ptrs)
         b = tl.load(b_ptrs)
         # We accumulate along the K dimension
         accumulator += tl.dot(a, b)
-
         # Advance the ptrs to the next K block
         a_ptrs += BLOCK_SIZE_K * stride_ak
         b_ptrs += BLOCK_SIZE_K * stride_bk
-    # triton can accept arbitrary activation function via metaparameters!
-    if meta['ACTIVATION']:
+    # you can fuse arbitrary activation functions here
+    # while the accumulator is still in FP32!
+    if meta['ACTIVATION']: 
         accumulator = meta['ACTIVATION'](accumulator)
+    c = accumulator.to(tl.float16)
 
-    m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
-    n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
-    c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c
-    mask = (m_offsets_c < M) & (n_offsets_c < N)
-    tl.store(c_ptrs, accumulator, mask=mask)
+    # -----------------------------------------------------------
+    # Write back the block of the output matrix C
+    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+    c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
+    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
+    tl.store(c_ptrs, c, mask=c_mask)
 
 
 # we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`
@@ -282,18 +281,11 @@ def matmul(a, b, activation=None):
         triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),
     )
     matmul_kernel[grid](
-        a,
-        b,
-        c,
-        M,
-        N,
-        K,
-        a.stride(0),
-        a.stride(1),
-        b.stride(0),
-        b.stride(1),
-        c.stride(0),
-        c.stride(1),
+        a, b, c,
+        M, N, K,
+        a.stride(0), a.stride(1),
+        b.stride(0), b.stride(1),
+        c.stride(0), c.stride(1),
         ACTIVATION=activation,
     )
     return c
diff --git a/_images/sphx_glr_01-vector-add_001.png b/_images/sphx_glr_01-vector-add_001.png
index 3d7049a91..f4f36e595 100644
Binary files a/_images/sphx_glr_01-vector-add_001.png and b/_images/sphx_glr_01-vector-add_001.png differ
diff --git a/_images/sphx_glr_01-vector-add_thumb.png b/_images/sphx_glr_01-vector-add_thumb.png
index acc77a248..a2efc1f1d 100644
Binary files a/_images/sphx_glr_01-vector-add_thumb.png and b/_images/sphx_glr_01-vector-add_thumb.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_001.png b/_images/sphx_glr_02-fused-softmax_001.png
index 0ee5b44ef..9424aae06 100644
Binary files a/_images/sphx_glr_02-fused-softmax_001.png and b/_images/sphx_glr_02-fused-softmax_001.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_thumb.png b/_images/sphx_glr_02-fused-softmax_thumb.png
index 41d99308e..de8be8737 100644
Binary files a/_images/sphx_glr_02-fused-softmax_thumb.png and b/_images/sphx_glr_02-fused-softmax_thumb.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_001.png b/_images/sphx_glr_03-matrix-multiplication_001.png
index 2140ebaa8..3566f05bc 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_001.png and b/_images/sphx_glr_03-matrix-multiplication_001.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_thumb.png b/_images/sphx_glr_03-matrix-multiplication_thumb.png
index 0bfd1c39a..dcfd633a2 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_thumb.png and b/_images/sphx_glr_03-matrix-multiplication_thumb.png differ
diff --git a/_sources/getting-started/tutorials/01-vector-add.rst.txt b/_sources/getting-started/tutorials/01-vector-add.rst.txt
index 369f14ead..237282f10 100644
--- a/_sources/getting-started/tutorials/01-vector-add.rst.txt
+++ b/_sources/getting-started/tutorials/01-vector-add.rst.txt
@@ -234,10 +234,10 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
     0        4096.0    9.600000    9.600000
     1        8192.0   19.200000   19.200000
     2       16384.0   38.400001   38.400001
-    3       32768.0   76.800002   76.800002
+    3       32768.0   63.999998   76.800002
     4       65536.0  127.999995  127.999995
     5      131072.0  219.428568  219.428568
-    6      262144.0  341.333321  384.000001
+    6      262144.0  384.000001  384.000001
     7      524288.0  472.615390  472.615390
     8     1048576.0  614.400016  614.400016
     9     2097152.0  722.823517  722.823517
@@ -254,7 +254,7 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 0 minutes  10.994 seconds)
+   **Total running time of the script:** ( 0 minutes  10.971 seconds)
 
 
 .. _sphx_glr_download_getting-started_tutorials_01-vector-add.py:
diff --git a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
index 902505c37..06b931adc 100644
--- a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
+++ b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
@@ -306,10 +306,10 @@ We will then compare its performance against (1) :code:`torch.softmax` and (2) t
     3     640.0  682.666684      640.000002   160.000000
     4     768.0  702.171410      664.216187   163.839992
     ..      ...         ...             ...          ...
-    93  12160.0  812.359066      406.179533   198.936606
-    94  12288.0  812.429770      416.101597   199.298541
-    95  12416.0  810.840807      412.149375   198.854847
-    96  12544.0  810.925276      412.971190   199.209928
+    93  12160.0  812.359066      405.755985   198.936606
+    94  12288.0  812.429770      415.222812   199.096718
+    95  12416.0  810.840807      411.296057   198.755369
+    96  12544.0  810.925276      412.971190   199.012395
     97  12672.0  811.007961      412.097543   199.167004
 
     [98 rows x 4 columns]
@@ -328,7 +328,7 @@ In the above plot, we can see that:
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  12.617 seconds)
+   **Total running time of the script:** ( 1 minutes  12.739 seconds)
 
 
 .. _sphx_glr_download_getting-started_tutorials_02-fused-softmax.py:
diff --git a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
index 08614bc72..d0c673c8e 100644
--- a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
+++ b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
@@ -42,7 +42,7 @@ In this tutorial, you will learn how to implement efficient matrix multiplicatio
 yourself with Triton, in a way that is easy to customize and extend.
 
 Roughly speaking, the kernel that we will write will implement the following blocked
-algorithm to multiply a (MxK) by a (KxN) matrix:
+algorithm to multiply an (M, K) by a (K, N) matrix:
 
  .. code-block:: python
 
@@ -57,9 +57,9 @@ algorithm to multiply a (MxK) by a (KxN) matrix:
          acc += dot(a, b)
        C[m : m+BLOCK_SIZE_M, n : n+BLOCK_SIZE_N] = acc;
 
-where each iteration of the doubly-nested for-loop corresponds to a Triton program instance.
+where each iteration of the doubly-nested for-loop is performed by a dedicated Triton program instance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 44-129
+.. GENERATED FROM PYTHON SOURCE LINES 44-137
 
 Compute Kernel
 ----------------
@@ -73,35 +73,31 @@ Pointer Arithmetics
 ~~~~~~~~~~~~~~~~~~~~
 
 For a row-major 2D tensor :code:`X`, the memory location of :code:`X[i, j]` is given b
-y :code:`&X[i, j] = X + i*stride_x_0 + j*stride_x_1`.
+y :code:`&X[i, j] = X + i*stride_xi + j*stride_xj`.
 Therefore, blocks of pointers for :code:`A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K]` and
 :code:`B[k : k+BLOCK_SIZE_K, n : n+BLOCK_SIZE_N]` can be defined in pseudo-code as:
 
  .. code-block:: python
 
-   &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  A + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
-   &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  B + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
+   &A[m : m+BLOCK_SIZE_M, k:k+BLOCK_SIZE_K] =  a_ptr + (m : m+BLOCK_SIZE_M)[:, None]*A.stride(0) + (k : k+BLOCK_SIZE_K)[None, :]*A.stride(1);
+   &B[k : k+BLOCK_SIZE_K, n:n+BLOCK_SIZE_N] =  b_ptr + (k : k+BLOCK_SIZE_K)[:, None]*B.stride(0) + (n : n+BLOCK_SIZE_N)[None, :]*B.stride(1);
 
 Which means that pointers for blocks of A and B can be initialized (i.e., :code:`k=0`) in Triton as:
 
  .. code-block:: python
 
-   pid_m = triton.program_id(0)
-   pid_n = triton.program_id(1)
-   rm = pid_m * BLOCK_SIZE_M + triton.arange(0, BLOCK_SIZE_M)
-   rn = pid_n * BLOCK_SIZE_N + triton.arange(0, BLOCK_SIZE_N)
-   rk = triton.arange(0, BLOCK_SIZE_K)
-   // pointer for A operand
-   pa = A + (rm[:, None] * stride_a_0 + rk[None, :] * stride_a_1);
-   // pointer for B operand
-   pb = B + (rk[:, None] * stride_b_0 + rn[None, :] * stride_b_1);
+   offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+   offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+   offs_k = tl.arange(0, BLOCK_SIZE_K)
+   a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)
+   b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)
 
 And then updated in the inner loop as follows:
 
  .. code-block:: python
 
-   pa += BLOCK_SIZE_K * stride_a_1;
-   pb += BLOCK_SIZE_K * stride_b_0;
+   a_ptrs += BLOCK_SIZE_K * stride_ak
+   b_ptrs += BLOCK_SIZE_K * stride_bk
 
 
 L2 Cache Optimizations
@@ -129,13 +125,25 @@ switching to the next column:
 
  .. code-block:: python
 
-   pid = triton.program_id(0);
-   width = GROUP_M * grid_n;
-   group_id = pid // width;
-   # we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0
-   group_size = min(grid_m - group_id * GROUP_M, GROUP_M);
-   pid_m = group_id * GROUP_M + (pid % group_size);
-   pid_n = (pid % width) // (group_size);
+   # program ID
+   pid = tl.program_id(axis=0)
+   # number of program ids along the M axis
+   num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
+   # number of program ids along the N axis
+   num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
+   # number of programs in group
+   num_pid_in_group = GROUP_SIZE_M * num_pid_n 
+   # id of the group this program is in
+   group_id = pid // num_pid_in_group 
+   # row-id of the first program in the group
+   first_pid_m = group_id * GROUP_SIZE_M 
+   # if `num_pid_m` isn't divisible by `GROUP_SIZE_M`, the last group is smaller
+   group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M) 
+   # *within groups*, programs are ordered in a column-major order
+   # row-id of the program in the *launch grid*
+   pid_m = first_pid_m + (pid % group_size_m)
+   # col-id of the program in the *launch grid*
+   pid_n = (pid % num_pid_in_group) // group_size_m
 
 For example, in the following matmul where each matrix is 9 blocks by 9 blocks,
 we can see that if we compute the output in row-major ordering, we need to load 90
@@ -147,13 +155,13 @@ In practice, this can improve the performance of our matrix multiplication kerne
 more than 10\% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-134
+.. GENERATED FROM PYTHON SOURCE LINES 139-142
 
 Final Result
 -------------
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-263
+.. GENERATED FROM PYTHON SOURCE LINES 142-262
 
 .. code-block:: default
 
@@ -190,26 +198,19 @@ Final Result
     @triton.jit
     def matmul_kernel(
         # Pointers to matrices
-        a_ptr,
-        b_ptr,
-        c_ptr,
+        a_ptr, b_ptr, c_ptr,
         # Matrix dimensions
-        M,
-        N,
-        K,
+        M, N, K,
         # The stride variables represent how much to increase the ptr by when moving by 1
         # element in a particular dimension. E.g. stride_am is how much to increase a_ptr
         # by to get the element one row down (A has M rows)
-        stride_am,
-        stride_ak,
-        stride_bk,
-        stride_bn,
-        stride_cm,
-        stride_cn,
+        stride_am, stride_ak,
+        stride_bk, stride_bn,
+        stride_cm, stride_cn,
+        # Meta-parameters
         **meta,
     ):
-        """Kernel for computing the matmul AB = C
-
+        """Kernel for computing the matmul C = A x B.
         A has shape (M, K), B has shape (K, N) and C has shape (M, N)
         """
         # extract meta-parameters
@@ -217,67 +218,65 @@ Final Result
         BLOCK_SIZE_N = meta['BLOCK_SIZE_N']
         BLOCK_SIZE_K = meta['BLOCK_SIZE_K']
         GROUP_SIZE_M = 8
+
+        # -----------------------------------------------------------
+        # Map program ids `pid` to the block of C it should compute.
+        # This is done in a grouped ordering to promote L2 data reuse
+        # See above `L2 Cache Optimizations` section for details
         pid = tl.program_id(axis=0)
+        num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
+        num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
+        num_pid_in_group = GROUP_SIZE_M * num_pid_n 
+        group_id = pid // num_pid_in_group 
+        first_pid_m = group_id * GROUP_SIZE_M 
+        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M) 
+        pid_m = first_pid_m + (pid % group_size_m)
+        pid_n = (pid % num_pid_in_group) // group_size_m
 
-        # the number of blocks is the ceil(M / BLOCK_SIZE_M) since we need an extra block
-        # Note that this will lead to some quantization in performance where time-taken jumps
-        # when you need to add a new block
-        n_blocks_m = (M + BLOCK_SIZE_M - 1) // BLOCK_SIZE_M
-        n_blocks_n = (N + BLOCK_SIZE_N - 1) // BLOCK_SIZE_N
+        # ----------------------------------------------------------
+        # Create pointers for the first blocks of A and B.
+        # We will advance these pointers as we move in the K direction
+        # and accumulate
+        # a_ptrs is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers
+        # b_ptrs is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers
+        # see above `Pointer Arithmetics` section for details
+        offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+        offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+        offs_k = tl.arange(0, BLOCK_SIZE_K)
+        a_ptrs = a_ptr + (offs_am[:, None]*stride_am + offs_k [None, :]*stride_ak)
+        b_ptrs = b_ptr + (offs_k [:, None]*stride_bk + offs_bn[None, :]*stride_bn)
 
-        # Map PIDs to the block they should compute. This is done in a grouped ordering
-        # to promote L2 cache reuse.
-        n_output_blocks_in_group = GROUP_SIZE_M * n_blocks_n
-        group_id = pid // n_output_blocks_in_group
-        first_m_block_in_group = group_id * GROUP_SIZE_M
-
-        # If the number of blocks is not divisible by the group size, the last group is smaller
-        group_size_m = min(n_blocks_m - first_m_block_in_group, GROUP_SIZE_M)
-
-        # Within a group, we compute in col-major ordering, block_m and block_n are the
-        # output row and col that this program is computing in terms of blocks
-        block_m = first_m_block_in_group + (pid % group_size_m)
-        block_n = (pid % n_output_blocks_in_group) // group_size_m
-
-        # Convert from block indices back to element indices
-        m_start = block_m * BLOCK_SIZE_M
-        n_start = block_n * BLOCK_SIZE_N
-
-        # Expand out to all the offsets for each of the elements in this block.
-        m_offsets_a = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
-        n_offsets_b = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
-        k_offsets = tl.arange(0, BLOCK_SIZE_K)
-
-        # Get the pointers for the first block of each. We will advance this pointer
-        # as we move in the K direction and accumulate.
-        # a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers
-        a_ptrs = a_ptr + (stride_am * m_offsets_a + stride_ak * k_offsets[None, :])
-        # b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers
-        b_ptrs = b_ptr + (stride_bk * k_offsets[:, None] + stride_bn * n_offsets_b)
-        # We accumulate internally in fp32, but the output is written out in the dtype
-        # of the tensor when it is stored
+        # -----------------------------------------------------------
+        # Iterate to compute a block of the C matrix
+        # We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block
+        # of fp32 values for higher accuracy.
+        # `accumulator` will be converted back to fp16 after the loop
         accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
         for k in range(0, K, BLOCK_SIZE_K):
-            # Note that for simplicity, we don't apply a mask here. This means that if K is
-            # not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and
-            # accumulate it incorrectly.
+            # Note that for simplicity, we don't apply a mask here. 
+            # This means that if K is not a multiple of BLOCK_SIZE_K, 
+            # this will access out-of-bounds memory and produce an
+            # error or (worse!) incorrect results.
             a = tl.load(a_ptrs)
             b = tl.load(b_ptrs)
             # We accumulate along the K dimension
             accumulator += tl.dot(a, b)
-
             # Advance the ptrs to the next K block
             a_ptrs += BLOCK_SIZE_K * stride_ak
             b_ptrs += BLOCK_SIZE_K * stride_bk
-        # triton can accept arbitrary activation function via metaparameters!
-        if meta['ACTIVATION']:
+        # you can fuse arbitrary activation functions here
+        # while the accumulator is still in FP32!
+        if meta['ACTIVATION']: 
             accumulator = meta['ACTIVATION'](accumulator)
+        c = accumulator.to(tl.float16)
 
-        m_offsets_c = (m_start + tl.arange(0, BLOCK_SIZE_M))[:, None]
-        n_offsets_c = (n_start + tl.arange(0, BLOCK_SIZE_N))[None, :]
-        c_ptrs = c_ptr + stride_cm * m_offsets_c + stride_cn * n_offsets_c
-        mask = (m_offsets_c < M) & (n_offsets_c < N)
-        tl.store(c_ptrs, accumulator, mask=mask)
+        # -----------------------------------------------------------
+        # Write back the block of the output matrix C
+        offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
+        offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
+        c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
+        c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
+        tl.store(c_ptrs, c, mask=c_mask)
 
 
     # we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`
@@ -293,12 +292,12 @@ Final Result
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 264-266
+.. GENERATED FROM PYTHON SOURCE LINES 263-265
 
 We can now create a convenience wrapper function that only takes two input tensors
 and (1) checks any shape constraint; (2) allocates the output; (3) launches the above kernel
 
-.. GENERATED FROM PYTHON SOURCE LINES 266-302
+.. GENERATED FROM PYTHON SOURCE LINES 265-294
 
 .. code-block:: default
 
@@ -321,18 +320,11 @@ and (1) checks any shape constraint; (2) allocates the output; (3) launches the
             triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),
         )
         matmul_kernel[grid](
-            a,
-            b,
-            c,
-            M,
-            N,
-            K,
-            a.stride(0),
-            a.stride(1),
-            b.stride(0),
-            b.stride(1),
-            c.stride(0),
-            c.stride(1),
+            a, b, c,
+            M, N, K,
+            a.stride(0), a.stride(1),
+            b.stride(0), b.stride(1),
+            c.stride(0), c.stride(1),
             ACTIVATION=activation,
         )
         return c
@@ -345,14 +337,14 @@ and (1) checks any shape constraint; (2) allocates the output; (3) launches the
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 303-307
+.. GENERATED FROM PYTHON SOURCE LINES 295-299
 
 Unit Test
 -----------
 
 We can test our custom matrix multiplication operation against a native torch implementation (i.e., cuBLAS)
 
-.. GENERATED FROM PYTHON SOURCE LINES 307-320
+.. GENERATED FROM PYTHON SOURCE LINES 299-312
 
 .. code-block:: default
 
@@ -400,7 +392,7 @@ We can test our custom matrix multiplication operation against a native torch im
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 321-327
+.. GENERATED FROM PYTHON SOURCE LINES 313-319
 
 Benchmark
 --------------
@@ -409,7 +401,7 @@ Square Matrix Performance
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 We can now compare the performance of our kernel against that of cuBLAS. Here we focus on square matrices, but feel free to arrange this script as you wish to benchmark any other matrix shape.
 
-.. GENERATED FROM PYTHON SOURCE LINES 327-368
+.. GENERATED FROM PYTHON SOURCE LINES 319-360
 
 .. code-block:: default
 
@@ -471,37 +463,37 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
     matmul-performance:
              M     cuBLAS  ...     Triton  Triton (+ LeakyReLU)
     0    128.0   0.455111  ...   0.512000              0.512000
-    1    256.0   2.730667  ...   2.978909              2.978909
+    1    256.0   2.978909  ...   2.978909              2.978909
     2    384.0   7.372800  ...   8.507077              8.507077
-    3    512.0  14.563555  ...  15.420235             16.384000
-    4    640.0  22.260869  ...  24.380953             23.272727
+    3    512.0  14.563555  ...  16.384000             16.384000
+    4    640.0  22.260869  ...  24.380953             24.380953
     5    768.0  32.768000  ...  34.028308             34.028308
-    6    896.0  39.025776  ...  40.140799             39.025776
-    7   1024.0  49.932191  ...  53.773130             52.428801
-    8   1152.0  45.242181  ...  46.656000             46.656000
-    9   1280.0  51.200001  ...  56.888887             56.888887
+    6    896.0  39.025776  ...  40.140799             36.023796
+    7   1024.0  49.932191  ...  52.428801             52.428801
+    8   1152.0  44.566925  ...  46.656000             46.656000
+    9   1280.0  51.200001  ...  56.888887             56.109587
     10  1408.0  64.138541  ...  64.902096             64.902096
     11  1536.0  78.643199  ...  76.106321             75.296679
-    12  1664.0  62.929456  ...  62.061463             62.061463
+    12  1664.0  63.372618  ...  62.492442             61.636381
     13  1792.0  72.983276  ...  69.810085             69.379162
     14  1920.0  67.434145  ...  70.892307             70.530615
-    15  2048.0  73.908442  ...  74.898285             74.565406
-    16  2176.0  83.500614  ...  78.916269             79.855747
-    17  2304.0  68.251065  ...  73.275679             72.828879
-    18  2432.0  71.125224  ...  80.731218             80.731218
-    19  2560.0  77.649287  ...  76.560748             76.382283
-    20  2688.0  81.928846  ...  80.366642             82.823267
-    21  2816.0  77.743683  ...  78.868366             78.301990
-    22  2944.0  81.832567  ...  79.610276             78.605729
-    23  3072.0  81.005868  ...  81.005868             82.420822
-    24  3200.0  84.321474  ...  89.635851             85.106381
-    25  3328.0  83.226931  ...  87.156532             86.113988
-    26  3456.0  81.932484  ...  83.632331             85.313831
-    27  3584.0  87.211821  ...  87.211821             91.563533
-    28  3712.0  85.896254  ...  82.491612             84.874549
-    29  3840.0  85.070769  ...  87.493673             87.701820
-    30  3968.0  92.935215  ...  83.865247             83.578035
-    31  4096.0  93.662059  ...  85.926841             84.840533
+    15  2048.0  73.908442  ...  75.234154             74.898285
+    16  2176.0  81.472263  ...  80.817862             80.173899
+    17  2304.0  68.446623  ...  73.501144             73.275679
+    18  2432.0  71.305746  ...  81.197876             79.362895
+    19  2560.0  77.649287  ...  77.649287             76.560748
+    20  2688.0  82.642823  ...  80.708630             82.823267
+    21  2816.0  79.587973  ...  79.733474             77.605356
+    22  2944.0  81.967162  ...  78.112900             79.230573
+    23  3072.0  81.707223  ...  84.135370             79.863336
+    24  3200.0  84.099871  ...  87.074829             89.136491
+    25  3328.0  83.905938  ...  84.003845             86.424125
+    26  3456.0  81.518272  ...  85.494768             81.353753
+    27  3584.0  86.540320  ...  94.448944             94.847460
+    28  3712.0  83.947349  ...  88.955779             89.114488
+    29  3840.0  84.809814  ...  88.191387             87.217666
+    30  3968.0  93.148045  ...  83.179234             87.409694
+    31  4096.0  93.531519  ...  89.777746             87.552332
 
     [32 rows x 5 columns]
 
@@ -511,7 +503,7 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  9.226 seconds)
+   **Total running time of the script:** ( 2 minutes  30.498 seconds)
 
 
 .. _sphx_glr_download_getting-started_tutorials_03-matrix-multiplication.py:
diff --git a/_sources/getting-started/tutorials/sg_execution_times.rst.txt b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
index 41acb9ef9..d16236a4b 100644
--- a/_sources/getting-started/tutorials/sg_execution_times.rst.txt
+++ b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**03:32.837** total execution time for **getting-started_tutorials** files:
+**03:54.208** total execution time for **getting-started_tutorials** files:
 
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 02:09.226 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 02:30.498 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``)                 | 01:12.617 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``)                 | 01:12.739 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``)                       | 00:10.994 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``)                       | 00:10.971 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/getting-started/tutorials/01-vector-add.html b/getting-started/tutorials/01-vector-add.html
index 14cfc89a5..4a2f32168 100644
--- a/getting-started/tutorials/01-vector-add.html
+++ b/getting-started/tutorials/01-vector-add.html
@@ -322,10 +322,10 @@ for different problem sizes.</p>
 0        4096.0    9.600000    9.600000
 1        8192.0   19.200000   19.200000
 2       16384.0   38.400001   38.400001
-3       32768.0   76.800002   76.800002
+3       32768.0   63.999998   76.800002
 4       65536.0  127.999995  127.999995
 5      131072.0  219.428568  219.428568
-6      262144.0  341.333321  384.000001
+6      262144.0  384.000001  384.000001
 7      524288.0  472.615390  472.615390
 8     1048576.0  614.400016  614.400016
 9     2097152.0  722.823517  722.823517
@@ -337,7 +337,7 @@ for different problem sizes.</p>
 15  134217728.0  851.577704  850.656574
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 0 minutes  10.994 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 0 minutes  10.971 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-getting-started-tutorials-01-vector-add-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/62d97d49a32414049819dd8bb8378080/01-vector-add.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">01-vector-add.py</span></code></a></p>
diff --git a/getting-started/tutorials/02-fused-softmax.html b/getting-started/tutorials/02-fused-softmax.html
index 17e53bf59..880fb9c09 100644
--- a/getting-started/tutorials/02-fused-softmax.html
+++ b/getting-started/tutorials/02-fused-softmax.html
@@ -391,10 +391,10 @@ We will then compare its performance against (1) <code class="code docutils lite
 3     640.0  682.666684      640.000002   160.000000
 4     768.0  702.171410      664.216187   163.839992
 ..      ...         ...             ...          ...
-93  12160.0  812.359066      406.179533   198.936606
-94  12288.0  812.429770      416.101597   199.298541
-95  12416.0  810.840807      412.149375   198.854847
-96  12544.0  810.925276      412.971190   199.209928
+93  12160.0  812.359066      405.755985   198.936606
+94  12288.0  812.429770      415.222812   199.096718
+95  12416.0  810.840807      411.296057   198.755369
+96  12544.0  810.925276      412.971190   199.012395
 97  12672.0  811.007961      412.097543   199.167004
 
 [98 rows x 4 columns]
@@ -408,7 +408,7 @@ We will then compare its performance against (1) <code class="code docutils lite
 Note however that the PyTorch <cite>softmax</cite> operation is more general and will work on tensors of any shape.</p></li>
 </ul>
 </div></blockquote>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  12.617 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  12.739 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-getting-started-tutorials-02-fused-softmax-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/d91442ac2982c4e0cc3ab0f43534afbc/02-fused-softmax.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">02-fused-softmax.py</span></code></a></p>
diff --git a/getting-started/tutorials/03-matrix-multiplication.html b/getting-started/tutorials/03-matrix-multiplication.html
index 92df426d2..b21c34111 100644
--- a/getting-started/tutorials/03-matrix-multiplication.html
+++ b/getting-started/tutorials/03-matrix-multiplication.html
@@ -221,7 +221,7 @@ to accomodate the needs of modern deep learning workloads (e.g., fused activatio
 In this tutorial, you will learn how to implement efficient matrix multiplications by
 yourself with Triton, in a way that is easy to customize and extend.</p>
 <p>Roughly speaking, the kernel that we will write will implement the following blocked
-algorithm to multiply a (MxK) by a (KxN) matrix:</p>
+algorithm to multiply a (M, K) by a (K, N) matrix:</p>
 <blockquote>
 <div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># do in parallel</span>
 <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">M</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">):</span>
@@ -236,7 +236,7 @@ algorithm to multiply a (MxK) by a (KxN) matrix:</p>
 </pre></div>
 </div>
 </div></blockquote>
-<p>where each iteration of the doubly-nested for-loop corresponds to a Triton program instance.</p>
+<p>where each iteration of the doubly-nested for-loop is performed by a dedicated Triton program instance.</p>
 </div>
 <div class="section" id="compute-kernel">
 <h2>Compute Kernel<a class="headerlink" href="#compute-kernel" title="Permalink to this headline">¶</a></h2>
@@ -247,33 +247,29 @@ multi-dimensional pointer arithmetics.</p>
 <div class="section" id="pointer-arithmetics">
 <h3>Pointer Arithmetics<a class="headerlink" href="#pointer-arithmetics" title="Permalink to this headline">¶</a></h3>
 <p>For a row-major 2D tensor <code class="code docutils literal notranslate"><span class="pre">X</span></code>, the memory location of <code class="code docutils literal notranslate"><span class="pre">X[i,</span> <span class="pre">j]</span></code> is given by
-<code class="code docutils literal notranslate"><span class="pre">&amp;X[i,</span> <span class="pre">j]</span> <span class="pre">=</span> <span class="pre">X</span> <span class="pre">+</span> <span class="pre">i*stride_x_0</span> <span class="pre">+</span> <span class="pre">j*stride_x_1</span></code>.
+<code class="code docutils literal notranslate"><span class="pre">&amp;X[i,</span> <span class="pre">j]</span> <span class="pre">=</span> <span class="pre">X</span> <span class="pre">+</span> <span class="pre">i*stride_xi</span> <span class="pre">+</span> <span class="pre">j*stride_xj</span></code>.
 Therefore, blocks of pointers for <code class="code docutils literal notranslate"><span class="pre">A[m</span> <span class="pre">:</span> <span class="pre">m+BLOCK_SIZE_M,</span> <span class="pre">k:k+BLOCK_SIZE_K]</span></code> and
 <code class="code docutils literal notranslate"><span class="pre">B[k</span> <span class="pre">:</span> <span class="pre">k+BLOCK_SIZE_K,</span> <span class="pre">n</span> <span class="pre">:</span> <span class="pre">n+BLOCK_SIZE_N]</span></code> can be defined in pseudo-code as:</p>
 <blockquote>
-<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="o">&amp;</span><span class="n">A</span><span class="p">[</span><span class="n">m</span> <span class="p">:</span> <span class="n">m</span><span class="o">+</span><span class="n">BLOCK_SIZE_M</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">]</span> <span class="o">=</span>  <span class="n">A</span> <span class="o">+</span> <span class="p">(</span><span class="n">m</span> <span class="p">:</span> <span class="n">m</span><span class="o">+</span><span class="n">BLOCK_SIZE_M</span><span class="p">)[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">)[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
-<span class="o">&amp;</span><span class="n">B</span><span class="p">[</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span><span class="n">n</span><span class="o">+</span><span class="n">BLOCK_SIZE_N</span><span class="p">]</span> <span class="o">=</span>  <span class="n">B</span> <span class="o">+</span> <span class="p">(</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">)[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">n</span> <span class="p">:</span> <span class="n">n</span><span class="o">+</span><span class="n">BLOCK_SIZE_N</span><span class="p">)[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
+<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="o">&amp;</span><span class="n">A</span><span class="p">[</span><span class="n">m</span> <span class="p">:</span> <span class="n">m</span><span class="o">+</span><span class="n">BLOCK_SIZE_M</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">]</span> <span class="o">=</span>  <span class="n">a_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">m</span> <span class="p">:</span> <span class="n">m</span><span class="o">+</span><span class="n">BLOCK_SIZE_M</span><span class="p">)[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">)[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">A</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
+<span class="o">&amp;</span><span class="n">B</span><span class="p">[</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span><span class="n">n</span><span class="o">+</span><span class="n">BLOCK_SIZE_N</span><span class="p">]</span> <span class="o">=</span>  <span class="n">b_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">k</span> <span class="p">:</span> <span class="n">k</span><span class="o">+</span><span class="n">BLOCK_SIZE_K</span><span class="p">)[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">n</span> <span class="p">:</span> <span class="n">n</span><span class="o">+</span><span class="n">BLOCK_SIZE_N</span><span class="p">)[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">B</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
 </pre></div>
 </div>
 </div></blockquote>
 <p>Which means that pointers for blocks of A and B can be initialized (i.e., <code class="code docutils literal notranslate"><span class="pre">k=0</span></code>) in Triton as:</p>
 <blockquote>
-<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pid_m</span> <span class="o">=</span> <span class="n">triton</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
-<span class="n">pid_n</span> <span class="o">=</span> <span class="n">triton</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
-<span class="n">rm</span> <span class="o">=</span> <span class="n">pid_m</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_M</span> <span class="o">+</span> <span class="n">triton</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
-<span class="n">rn</span> <span class="o">=</span> <span class="n">pid_n</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_N</span> <span class="o">+</span> <span class="n">triton</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
-<span class="n">rk</span> <span class="o">=</span> <span class="n">triton</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_K</span><span class="p">)</span>
-<span class="o">//</span> <span class="n">pointer</span> <span class="k">for</span> <span class="n">A</span> <span class="n">operand</span>
-<span class="n">pa</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span><span class="n">rm</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">stride_a_0</span> <span class="o">+</span> <span class="n">rk</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">stride_a_1</span><span class="p">);</span>
-<span class="o">//</span> <span class="n">pointer</span> <span class="k">for</span> <span class="n">B</span> <span class="n">operand</span>
-<span class="n">pb</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span><span class="n">rk</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">stride_b_0</span> <span class="o">+</span> <span class="n">rn</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="n">stride_b_1</span><span class="p">);</span>
+<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">offs_am</span> <span class="o">=</span> <span class="n">pid_m</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_M</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
+<span class="n">offs_bn</span> <span class="o">=</span> <span class="n">pid_n</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_N</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
+<span class="n">offs_k</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_K</span><span class="p">)</span>
+<span class="n">a_ptrs</span> <span class="o">=</span> <span class="n">a_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">offs_am</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">stride_am</span> <span class="o">+</span> <span class="n">offs_k</span> <span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">stride_ak</span><span class="p">)</span>
+<span class="n">b_ptrs</span> <span class="o">=</span> <span class="n">b_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">offs_k</span> <span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">stride_bk</span> <span class="o">+</span> <span class="n">offs_bn</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">stride_bn</span><span class="p">)</span>
 </pre></div>
 </div>
 </div></blockquote>
 <p>And then updated in the inner loop as follows:</p>
 <blockquote>
-<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pa</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_a_1</span><span class="p">;</span>
-<span class="n">pb</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_b_0</span><span class="p">;</span>
+<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pa</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_ak</span><span class="p">;</span>
+<span class="n">pb</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_bk</span><span class="p">;</span>
 </pre></div>
 </div>
 </div></blockquote>
@@ -299,13 +295,25 @@ a simple row-major ordering</p>
 This can be done by ‘super-grouping’ blocks in groups of <code class="code docutils literal notranslate"><span class="pre">GROUP_M</span></code> rows before
 switching to the next column:</p>
 <blockquote>
-<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">pid</span> <span class="o">=</span> <span class="n">triton</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
-<span class="n">width</span> <span class="o">=</span> <span class="n">GROUP_M</span> <span class="o">*</span> <span class="n">grid_n</span><span class="p">;</span>
-<span class="n">group_id</span> <span class="o">=</span> <span class="n">pid</span> <span class="o">//</span> <span class="n">width</span><span class="p">;</span>
-<span class="c1"># we need to handle the case where M % (GROUP_M*BLOCK_SIZE_M) != 0</span>
-<span class="n">group_size</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">grid_m</span> <span class="o">-</span> <span class="n">group_id</span> <span class="o">*</span> <span class="n">GROUP_M</span><span class="p">,</span> <span class="n">GROUP_M</span><span class="p">);</span>
-<span class="n">pid_m</span> <span class="o">=</span> <span class="n">group_id</span> <span class="o">*</span> <span class="n">GROUP_M</span> <span class="o">+</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">group_size</span><span class="p">);</span>
-<span class="n">pid_n</span> <span class="o">=</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">width</span><span class="p">)</span> <span class="o">//</span> <span class="p">(</span><span class="n">group_size</span><span class="p">);</span>
+<div><div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># program ID</span>
+<span class="n">pid</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
+<span class="c1"># number of program ids along the M axis</span>
+<span class="n">num_pid_m</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
+<span class="c1"># number of program ids along the N axis</span>
+<span class="n">num_pid_n</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
+<span class="c1"># number of programs in group</span>
+<span class="n">num_pid_in_group</span> <span class="o">=</span> <span class="n">GROUP_SIZE_M</span> <span class="o">*</span> <span class="n">num_pid_n</span>
+<span class="c1"># id of the group this program is in</span>
+<span class="n">group_id</span> <span class="o">=</span> <span class="n">pid</span> <span class="o">//</span> <span class="n">num_pid_in_group</span>
+<span class="c1"># row-id of the first program in the group</span>
+<span class="n">first_pid_m</span> <span class="o">=</span> <span class="n">group_id</span> <span class="o">*</span> <span class="n">GROUP_SIZE_M</span>
+<span class="c1"># if `num_pid_m` isn&#39;t divisible by `GROUP_SIZE_M`, the last group is smaller</span>
+<span class="n">group_size_m</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">num_pid_m</span> <span class="o">-</span> <span class="n">first_pid_m</span><span class="p">,</span> <span class="n">GROUP_SIZE_M</span><span class="p">)</span>
+<span class="c1"># *within groups*, programs are ordered in a column-major order</span>
+<span class="c1"># row-id of the program in the *launch grid*</span>
+<span class="n">pid_m</span> <span class="o">=</span> <span class="n">first_pid_m</span> <span class="o">+</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">group_size_m</span><span class="p">)</span>
+<span class="c1"># col-id of the program in the *launch grid*</span>
+<span class="n">pid_n</span> <span class="o">=</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">num_pid_in_group</span><span class="p">)</span> <span class="o">//</span> <span class="n">group_size_m</span>
 </pre></div>
 </div>
 </div></blockquote>
@@ -354,26 +362,19 @@ more than 10% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).</
 <span class="nd">@triton</span><span class="o">.</span><span class="n">jit</span>
 <span class="k">def</span> <span class="nf">matmul_kernel</span><span class="p">(</span>
     <span class="c1"># Pointers to matrices</span>
-    <span class="n">a_ptr</span><span class="p">,</span>
-    <span class="n">b_ptr</span><span class="p">,</span>
-    <span class="n">c_ptr</span><span class="p">,</span>
+    <span class="n">a_ptr</span><span class="p">,</span> <span class="n">b_ptr</span><span class="p">,</span> <span class="n">c_ptr</span><span class="p">,</span>
     <span class="c1"># Matrix dimensions</span>
-    <span class="n">M</span><span class="p">,</span>
-    <span class="n">N</span><span class="p">,</span>
-    <span class="n">K</span><span class="p">,</span>
+    <span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span>
     <span class="c1"># The stride variables represent how much to increase the ptr by when moving by 1</span>
     <span class="c1"># element in a particular dimension. E.g. stride_am is how much to increase a_ptr</span>
     <span class="c1"># by to get the element one row down (A has M rows)</span>
-    <span class="n">stride_am</span><span class="p">,</span>
-    <span class="n">stride_ak</span><span class="p">,</span>
-    <span class="n">stride_bk</span><span class="p">,</span>
-    <span class="n">stride_bn</span><span class="p">,</span>
-    <span class="n">stride_cm</span><span class="p">,</span>
-    <span class="n">stride_cn</span><span class="p">,</span>
+    <span class="n">stride_am</span><span class="p">,</span> <span class="n">stride_ak</span><span class="p">,</span>
+    <span class="n">stride_bk</span><span class="p">,</span> <span class="n">stride_bn</span><span class="p">,</span>
+    <span class="n">stride_cm</span><span class="p">,</span> <span class="n">stride_cn</span><span class="p">,</span>
+    <span class="c1"># Meta-parameters</span>
     <span class="o">**</span><span class="n">meta</span><span class="p">,</span>
 <span class="p">):</span>
-    <span class="sd">&quot;&quot;&quot;Kernel for computing the matmul AB = C</span>
-
+    <span class="sd">&quot;&quot;&quot;Kernel for computing the matmul C = A x B.</span>
 <span class="sd">    A has shape (M, K), B has shape (K, N) and C has shape (M, N)</span>
 <span class="sd">    &quot;&quot;&quot;</span>
     <span class="c1"># extract meta-parameters</span>
@@ -381,67 +382,65 @@ more than 10% on some hardware architecture (e.g., 220 to 245 TFLOPS on A100).</
     <span class="n">BLOCK_SIZE_N</span> <span class="o">=</span> <span class="n">meta</span><span class="p">[</span><span class="s1">&#39;BLOCK_SIZE_N&#39;</span><span class="p">]</span>
     <span class="n">BLOCK_SIZE_K</span> <span class="o">=</span> <span class="n">meta</span><span class="p">[</span><span class="s1">&#39;BLOCK_SIZE_K&#39;</span><span class="p">]</span>
     <span class="n">GROUP_SIZE_M</span> <span class="o">=</span> <span class="mi">8</span>
+
+    <span class="c1"># -----------------------------------------------------------</span>
+    <span class="c1"># Map program ids `pid` to the block of C it should compute.</span>
+    <span class="c1"># This is done in a grouped ordering to promote L2 data reuse</span>
+    <span class="c1"># See above `L2 Cache Optimizations` section for details</span>
     <span class="n">pid</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
+    <span class="n">num_pid_m</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
+    <span class="n">num_pid_n</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
+    <span class="n">num_pid_in_group</span> <span class="o">=</span> <span class="n">GROUP_SIZE_M</span> <span class="o">*</span> <span class="n">num_pid_n</span>
+    <span class="n">group_id</span> <span class="o">=</span> <span class="n">pid</span> <span class="o">//</span> <span class="n">num_pid_in_group</span>
+    <span class="n">first_pid_m</span> <span class="o">=</span> <span class="n">group_id</span> <span class="o">*</span> <span class="n">GROUP_SIZE_M</span>
+    <span class="n">group_size_m</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">num_pid_m</span> <span class="o">-</span> <span class="n">first_pid_m</span><span class="p">,</span> <span class="n">GROUP_SIZE_M</span><span class="p">)</span>
+    <span class="n">pid_m</span> <span class="o">=</span> <span class="n">first_pid_m</span> <span class="o">+</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">group_size_m</span><span class="p">)</span>
+    <span class="n">pid_n</span> <span class="o">=</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">num_pid_in_group</span><span class="p">)</span> <span class="o">//</span> <span class="n">group_size_m</span>
 
-    <span class="c1"># the number of blocks is the ceil(M / BLOCK_SIZE_M) since we need an extra block</span>
-    <span class="c1"># Note that this will lead to some quantization in performance where time-taken jumps</span>
-    <span class="c1"># when you need to add a new block</span>
-    <span class="n">n_blocks_m</span> <span class="o">=</span> <span class="p">(</span><span class="n">M</span> <span class="o">+</span> <span class="n">BLOCK_SIZE_M</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="n">BLOCK_SIZE_M</span>
-    <span class="n">n_blocks_n</span> <span class="o">=</span> <span class="p">(</span><span class="n">N</span> <span class="o">+</span> <span class="n">BLOCK_SIZE_N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="n">BLOCK_SIZE_N</span>
+    <span class="c1"># ----------------------------------------------------------</span>
+    <span class="c1"># Create pointers for the first blocks of A and B.</span>
+    <span class="c1"># We will advance these pointers as we move in the K direction</span>
+    <span class="c1"># and accumulate</span>
+    <span class="c1"># a_ptrs is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers</span>
+    <span class="c1"># b_ptrs is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers</span>
+    <span class="c1"># see above `Pointer Arithmetics` section for details</span>
+    <span class="n">offs_am</span> <span class="o">=</span> <span class="n">pid_m</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_M</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
+    <span class="n">offs_bn</span> <span class="o">=</span> <span class="n">pid_n</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_N</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
+    <span class="n">offs_k</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_K</span><span class="p">)</span>
+    <span class="n">a_ptrs</span> <span class="o">=</span> <span class="n">a_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">offs_am</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">stride_am</span> <span class="o">+</span> <span class="n">offs_k</span> <span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">stride_ak</span><span class="p">)</span>
+    <span class="n">b_ptrs</span> <span class="o">=</span> <span class="n">b_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">offs_k</span> <span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span><span class="o">*</span><span class="n">stride_bk</span> <span class="o">+</span> <span class="n">offs_bn</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span><span class="o">*</span><span class="n">stride_bn</span><span class="p">)</span>
 
-    <span class="c1"># Map PIDs to the block they should compute. This is done in a grouped ordering</span>
-    <span class="c1"># to promote L2 cache reuse.</span>
-    <span class="n">n_output_blocks_in_group</span> <span class="o">=</span> <span class="n">GROUP_SIZE_M</span> <span class="o">*</span> <span class="n">n_blocks_n</span>
-    <span class="n">group_id</span> <span class="o">=</span> <span class="n">pid</span> <span class="o">//</span> <span class="n">n_output_blocks_in_group</span>
-    <span class="n">first_m_block_in_group</span> <span class="o">=</span> <span class="n">group_id</span> <span class="o">*</span> <span class="n">GROUP_SIZE_M</span>
-
-    <span class="c1"># If the number of blocks is not divisible by the group size, the last group is smaller</span>
-    <span class="n">group_size_m</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">n_blocks_m</span> <span class="o">-</span> <span class="n">first_m_block_in_group</span><span class="p">,</span> <span class="n">GROUP_SIZE_M</span><span class="p">)</span>
-
-    <span class="c1"># Within a group, we compute in col-major ordering, block_m and block_n are the</span>
-    <span class="c1"># output row and col that this program is computing in terms of blocks</span>
-    <span class="n">block_m</span> <span class="o">=</span> <span class="n">first_m_block_in_group</span> <span class="o">+</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">group_size_m</span><span class="p">)</span>
-    <span class="n">block_n</span> <span class="o">=</span> <span class="p">(</span><span class="n">pid</span> <span class="o">%</span> <span class="n">n_output_blocks_in_group</span><span class="p">)</span> <span class="o">//</span> <span class="n">group_size_m</span>
-
-    <span class="c1"># Convert from block indices back to element indices</span>
-    <span class="n">m_start</span> <span class="o">=</span> <span class="n">block_m</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_M</span>
-    <span class="n">n_start</span> <span class="o">=</span> <span class="n">block_n</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_N</span>
-
-    <span class="c1"># Expand out to all the offsets for each of the elements in this block.</span>
-    <span class="n">m_offsets_a</span> <span class="o">=</span> <span class="p">(</span><span class="n">m_start</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">))[:,</span> <span class="kc">None</span><span class="p">]</span>
-    <span class="n">n_offsets_b</span> <span class="o">=</span> <span class="p">(</span><span class="n">n_start</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">))[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span>
-    <span class="n">k_offsets</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_K</span><span class="p">)</span>
-
-    <span class="c1"># Get the pointers for the first block of each. We will advance this pointer</span>
-    <span class="c1"># as we move in the K direction and accumulate.</span>
-    <span class="c1"># a_ptrs should contain BLOCK_SIZE_M * BLOCK_SIZE_K pointers</span>
-    <span class="n">a_ptrs</span> <span class="o">=</span> <span class="n">a_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">stride_am</span> <span class="o">*</span> <span class="n">m_offsets_a</span> <span class="o">+</span> <span class="n">stride_ak</span> <span class="o">*</span> <span class="n">k_offsets</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:])</span>
-    <span class="c1"># b_ptrs should contain BLOCK_SIZE_K * BLOCK_SIZE_N pointers</span>
-    <span class="n">b_ptrs</span> <span class="o">=</span> <span class="n">b_ptr</span> <span class="o">+</span> <span class="p">(</span><span class="n">stride_bk</span> <span class="o">*</span> <span class="n">k_offsets</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">stride_bn</span> <span class="o">*</span> <span class="n">n_offsets_b</span><span class="p">)</span>
-    <span class="c1"># We accumulate internally in fp32, but the output is written out in the dtype</span>
-    <span class="c1"># of the tensor when it is stored</span>
+    <span class="c1"># -----------------------------------------------------------</span>
+    <span class="c1"># Iterate to compute a block of the C matrix</span>
+    <span class="c1"># We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block</span>
+    <span class="c1"># of fp32 values for higher accuracy.</span>
+    <span class="c1"># `accumulator` will be converted back to fp16 after the loop</span>
     <span class="n">accumulator</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">BLOCK_SIZE_M</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tl</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
     <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">BLOCK_SIZE_K</span><span class="p">):</span>
-        <span class="c1"># Note that for simplicity, we don&#39;t apply a mask here. This means that if K is</span>
-        <span class="c1"># not a multiple of BLOCK_SIZE_K, this will access out-of-bounds memory and</span>
-        <span class="c1"># accumulate it incorrectly.</span>
+        <span class="c1"># Note that for simplicity, we don&#39;t apply a mask here.</span>
+        <span class="c1"># This means that if K is not a multiple of BLOCK_SIZE_K,</span>
+        <span class="c1"># this will access out-of-bounds memory and produce an</span>
+        <span class="c1"># error or (worse!) incorrect results.</span>
         <span class="n">a</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">a_ptrs</span><span class="p">)</span>
         <span class="n">b</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">b_ptrs</span><span class="p">)</span>
         <span class="c1"># We accumulate along the K dimension</span>
         <span class="n">accumulator</span> <span class="o">+=</span> <span class="n">tl</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
-
         <span class="c1"># Advance the ptrs to the next K block</span>
         <span class="n">a_ptrs</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_ak</span>
         <span class="n">b_ptrs</span> <span class="o">+=</span> <span class="n">BLOCK_SIZE_K</span> <span class="o">*</span> <span class="n">stride_bk</span>
-    <span class="c1"># triton can accept arbitrary activation function via metaparameters!</span>
+    <span class="c1"># You can fuse arbitrary activation functions here</span>
+    <span class="c1"># while the accumulator is still in fp32!</span>
     <span class="k">if</span> <span class="n">meta</span><span class="p">[</span><span class="s1">&#39;ACTIVATION&#39;</span><span class="p">]:</span>
         <span class="n">accumulator</span> <span class="o">=</span> <span class="n">meta</span><span class="p">[</span><span class="s1">&#39;ACTIVATION&#39;</span><span class="p">](</span><span class="n">accumulator</span><span class="p">)</span>
+    <span class="n">c</span> <span class="o">=</span> <span class="n">accumulator</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">tl</span><span class="o">.</span><span class="n">float16</span><span class="p">)</span>
 
-    <span class="n">m_offsets_c</span> <span class="o">=</span> <span class="p">(</span><span class="n">m_start</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">))[:,</span> <span class="kc">None</span><span class="p">]</span>
-    <span class="n">n_offsets_c</span> <span class="o">=</span> <span class="p">(</span><span class="n">n_start</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">))[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span>
-    <span class="n">c_ptrs</span> <span class="o">=</span> <span class="n">c_ptr</span> <span class="o">+</span> <span class="n">stride_cm</span> <span class="o">*</span> <span class="n">m_offsets_c</span> <span class="o">+</span> <span class="n">stride_cn</span> <span class="o">*</span> <span class="n">n_offsets_c</span>
-    <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">m_offsets_c</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n_offsets_c</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span>
-    <span class="n">tl</span><span class="o">.</span><span class="n">store</span><span class="p">(</span><span class="n">c_ptrs</span><span class="p">,</span> <span class="n">accumulator</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">)</span>
+    <span class="c1"># -----------------------------------------------------------</span>
+    <span class="c1"># Write back the block of the output matrix C</span>
+    <span class="n">offs_cm</span> <span class="o">=</span> <span class="n">pid_m</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_M</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_M</span><span class="p">)</span>
+    <span class="n">offs_cn</span> <span class="o">=</span> <span class="n">pid_n</span> <span class="o">*</span> <span class="n">BLOCK_SIZE_N</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_SIZE_N</span><span class="p">)</span>
+    <span class="n">c_ptrs</span> <span class="o">=</span> <span class="n">c_ptr</span> <span class="o">+</span> <span class="n">stride_cm</span> <span class="o">*</span> <span class="n">offs_cm</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">stride_cn</span> <span class="o">*</span> <span class="n">offs_cn</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span>
+    <span class="n">c_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">offs_cm</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">offs_cn</span><span class="p">[</span><span class="kc">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span>
+    <span class="n">tl</span><span class="o">.</span><span class="n">store</span><span class="p">(</span><span class="n">c_ptrs</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">c_mask</span><span class="p">)</span>
 
 
 <span class="c1"># we can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `_matmul`</span>
@@ -469,18 +468,11 @@ and (1) checks any shape constraint; (2) allocates the output; (3) launches the
         <span class="n">triton</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">META</span><span class="p">[</span><span class="s1">&#39;BLOCK_SIZE_M&#39;</span><span class="p">])</span> <span class="o">*</span> <span class="n">triton</span><span class="o">.</span><span class="n">cdiv</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">META</span><span class="p">[</span><span class="s1">&#39;BLOCK_SIZE_N&#39;</span><span class="p">]),</span>
     <span class="p">)</span>
     <span class="n">matmul_kernel</span><span class="p">[</span><span class="n">grid</span><span class="p">](</span>
-        <span class="n">a</span><span class="p">,</span>
-        <span class="n">b</span><span class="p">,</span>
-        <span class="n">c</span><span class="p">,</span>
-        <span class="n">M</span><span class="p">,</span>
-        <span class="n">N</span><span class="p">,</span>
-        <span class="n">K</span><span class="p">,</span>
-        <span class="n">a</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
-        <span class="n">a</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
-        <span class="n">b</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
-        <span class="n">b</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
-        <span class="n">c</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
-        <span class="n">c</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
+        <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span>
+        <span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span>
+        <span class="n">a</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">a</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
+        <span class="n">b</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">b</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
+        <span class="n">c</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">c</span><span class="o">.</span><span class="n">stride</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
         <span class="n">ACTIVATION</span><span class="o">=</span><span class="n">activation</span><span class="p">,</span>
     <span class="p">)</span>
     <span class="k">return</span> <span class="n">c</span>
@@ -575,42 +567,42 @@ torch_output=tensor([[  1.1045, -36.9688,  31.4688,  ..., -11.3906,  24.4531, -3
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>matmul-performance:
          M     cuBLAS  ...     Triton  Triton (+ LeakyReLU)
 0    128.0   0.455111  ...   0.512000              0.512000
-1    256.0   2.730667  ...   2.978909              2.978909
+1    256.0   2.978909  ...   2.978909              2.978909
 2    384.0   7.372800  ...   8.507077              8.507077
-3    512.0  14.563555  ...  15.420235             16.384000
-4    640.0  22.260869  ...  24.380953             23.272727
+3    512.0  14.563555  ...  16.384000             16.384000
+4    640.0  22.260869  ...  24.380953             24.380953
 5    768.0  32.768000  ...  34.028308             34.028308
-6    896.0  39.025776  ...  40.140799             39.025776
-7   1024.0  49.932191  ...  53.773130             52.428801
-8   1152.0  45.242181  ...  46.656000             46.656000
-9   1280.0  51.200001  ...  56.888887             56.888887
+6    896.0  39.025776  ...  40.140799             36.023796
+7   1024.0  49.932191  ...  52.428801             52.428801
+8   1152.0  44.566925  ...  46.656000             46.656000
+9   1280.0  51.200001  ...  56.888887             56.109587
 10  1408.0  64.138541  ...  64.902096             64.902096
 11  1536.0  78.643199  ...  76.106321             75.296679
-12  1664.0  62.929456  ...  62.061463             62.061463
+12  1664.0  63.372618  ...  62.492442             61.636381
 13  1792.0  72.983276  ...  69.810085             69.379162
 14  1920.0  67.434145  ...  70.892307             70.530615
-15  2048.0  73.908442  ...  74.898285             74.565406
-16  2176.0  83.500614  ...  78.916269             79.855747
-17  2304.0  68.251065  ...  73.275679             72.828879
-18  2432.0  71.125224  ...  80.731218             80.731218
-19  2560.0  77.649287  ...  76.560748             76.382283
-20  2688.0  81.928846  ...  80.366642             82.823267
-21  2816.0  77.743683  ...  78.868366             78.301990
-22  2944.0  81.832567  ...  79.610276             78.605729
-23  3072.0  81.005868  ...  81.005868             82.420822
-24  3200.0  84.321474  ...  89.635851             85.106381
-25  3328.0  83.226931  ...  87.156532             86.113988
-26  3456.0  81.932484  ...  83.632331             85.313831
-27  3584.0  87.211821  ...  87.211821             91.563533
-28  3712.0  85.896254  ...  82.491612             84.874549
-29  3840.0  85.070769  ...  87.493673             87.701820
-30  3968.0  92.935215  ...  83.865247             83.578035
-31  4096.0  93.662059  ...  85.926841             84.840533
+15  2048.0  73.908442  ...  75.234154             74.898285
+16  2176.0  81.472263  ...  80.817862             80.173899
+17  2304.0  68.446623  ...  73.501144             73.275679
+18  2432.0  71.305746  ...  81.197876             79.362895
+19  2560.0  77.649287  ...  77.649287             76.560748
+20  2688.0  82.642823  ...  80.708630             82.823267
+21  2816.0  79.587973  ...  79.733474             77.605356
+22  2944.0  81.967162  ...  78.112900             79.230573
+23  3072.0  81.707223  ...  84.135370             79.863336
+24  3200.0  84.099871  ...  87.074829             89.136491
+25  3328.0  83.905938  ...  84.003845             86.424125
+26  3456.0  81.518272  ...  85.494768             81.353753
+27  3584.0  86.540320  ...  94.448944             94.847460
+28  3712.0  83.947349  ...  88.955779             89.114488
+29  3840.0  84.809814  ...  88.191387             87.217666
+30  3968.0  93.148045  ...  83.179234             87.409694
+31  4096.0  93.531519  ...  89.777746             87.552332
 
 [32 rows x 5 columns]
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  9.226 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  30.498 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-getting-started-tutorials-03-matrix-multiplication-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/d5fee5b55a64e47f1b5724ec39adf171/03-matrix-multiplication.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">03-matrix-multiplication.py</span></code></a></p>
diff --git a/getting-started/tutorials/sg_execution_times.html b/getting-started/tutorials/sg_execution_times.html
index 2cdac5f7d..57a4d7430 100644
--- a/getting-started/tutorials/sg_execution_times.html
+++ b/getting-started/tutorials/sg_execution_times.html
@@ -174,7 +174,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-getting-started-tutorials-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>03:32.837</strong> total execution time for <strong>getting-started_tutorials</strong> files:</p>
+<p><strong>03:54.208</strong> total execution time for <strong>getting-started_tutorials</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 85%" />
@@ -183,15 +183,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py"><span class="std std-ref">Matrix Multiplication</span></a> (<code class="docutils literal notranslate"><span class="pre">03-matrix-multiplication.py</span></code>)</p></td>
-<td><p>02:09.226</p></td>
+<td><p>02:30.498</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="02-fused-softmax.html#sphx-glr-getting-started-tutorials-02-fused-softmax-py"><span class="std std-ref">Fused Softmax</span></a> (<code class="docutils literal notranslate"><span class="pre">02-fused-softmax.py</span></code>)</p></td>
-<td><p>01:12.617</p></td>
+<td><p>01:12.739</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py"><span class="std std-ref">Vector Addition</span></a> (<code class="docutils literal notranslate"><span class="pre">01-vector-add.py</span></code>)</p></td>
-<td><p>00:10.994</p></td>
+<td><p>00:10.971</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/searchindex.js b/searchindex.js
index 391df4c26..fd0213591 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["getting-started/installation","getting-started/tutorials/01-vector-add","getting-started/tutorials/02-fused-softmax","getting-started/tutorials/03-matrix-multiplication","getting-started/tutorials/index","getting-started/tutorials/sg_execution_times","index","programming-guide/chapter-1/introduction","programming-guide/chapter-2/related-work","python-api/generated/triton.jit","python-api/generated/triton.language.arange","python-api/generated/triton.language.atomic_cas","python-api/generated/triton.language.atomic_xchg","python-api/generated/triton.language.broadcast_to","python-api/generated/triton.language.dot","python-api/generated/triton.language.exp","python-api/generated/triton.language.load","python-api/generated/triton.language.log","python-api/generated/triton.language.max","python-api/generated/triton.language.maximum","python-api/generated/triton.language.min","python-api/generated/triton.language.minimum","python-api/generated/triton.language.multiple_of","python-api/generated/triton.language.num_programs","python-api/generated/triton.language.program_id","python-api/generated/triton.language.ravel","python-api/generated/triton.language.reshape","python-api/generated/triton.language.sigmoid","python-api/generated/triton.language.softmax","python-api/generated/triton.language.store","python-api/generated/triton.language.sum","python-api/generated/triton.language.where","python-api/generated/triton.language.zeros","python-api/generated/triton.testing.Benchmark","python-api/generated/triton.testing.do_bench","python-api/generated/triton.testing.perf_report","python-api/triton","python-api/triton.language","python-api/triton.testing"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["getting-start
ed/installation.rst","getting-started/tutorials/01-vector-add.rst","getting-started/tutorials/02-fused-softmax.rst","getting-started/tutorials/03-matrix-multiplication.rst","getting-started/tutorials/index.rst","getting-started/tutorials/sg_execution_times.rst","index.rst","programming-guide/chapter-1/introduction.rst","programming-guide/chapter-2/related-work.rst","python-api/generated/triton.jit.rst","python-api/generated/triton.language.arange.rst","python-api/generated/triton.language.atomic_cas.rst","python-api/generated/triton.language.atomic_xchg.rst","python-api/generated/triton.language.broadcast_to.rst","python-api/generated/triton.language.dot.rst","python-api/generated/triton.language.exp.rst","python-api/generated/triton.language.load.rst","python-api/generated/triton.language.log.rst","python-api/generated/triton.language.max.rst","python-api/generated/triton.language.maximum.rst","python-api/generated/triton.language.min.rst","python-api/generated/triton.language.minimum.rst","python-api/generated/triton.language.multiple_of.rst","python-api/generated/triton.language.num_programs.rst","python-api/generated/triton.language.program_id.rst","python-api/generated/triton.language.ravel.rst","python-api/generated/triton.language.reshape.rst","python-api/generated/triton.language.sigmoid.rst","python-api/generated/triton.language.softmax.rst","python-api/generated/triton.language.store.rst","python-api/generated/triton.language.sum.rst","python-api/generated/triton.language.where.rst","python-api/generated/triton.language.zeros.rst","python-api/generated/triton.testing.Benchmark.rst","python-api/generated/triton.testing.do_bench.rst","python-api/generated/triton.testing.perf_report.rst","python-api/triton.rst","python-api/triton.language.rst","python-api/triton.testing.rst"],objects:{"triton.language":{arange:[10,0,1,""],atomic_cas:[11,0,1,""],atomic_xchg:[12,0,1,""],broadcast_to:[13,0,1,""],dot:[14,0,1,""],exp:[15,0,1,""],load:[16,0,1,""],log:[17,0,1,""],ma
x:[18,0,1,""],maximum:[19,0,1,""],min:[20,0,1,""],minimum:[21,0,1,""],multiple_of:[22,0,1,""],num_programs:[23,0,1,""],program_id:[24,0,1,""],ravel:[25,0,1,""],reshape:[26,0,1,""],sigmoid:[27,0,1,""],softmax:[28,0,1,""],store:[29,0,1,""],sum:[30,0,1,""],where:[31,0,1,""],zeros:[32,0,1,""]},"triton.testing":{Benchmark:[33,1,1,""],do_bench:[34,0,1,""],perf_report:[35,0,1,""]},"triton.testing.Benchmark":{__init__:[33,2,1,""]},triton:{jit:[9,0,1,""]}},objnames:{"0":["py","function","Python function"],"1":["py","class","Python class"],"2":["py","method","Python method"]},objtypes:{"0":"py:function","1":"py:class","2":"py:method"},terms:{"0":[1,2,3,5,7,8,23,24,32,34],"00":5,"0000":3,"000000":2,"000001":[1,2],"000002":2,"005868":3,"007961":2,"01":[1,3,5],"02":[2,5],"025776":3,"028308":3,"03":[3,5],"061463":3,"0625":3,"070769":3,"084721":1,"09":5,"0938":3,"097543":2,"0f":8,"1":[1,2,3,6,8,23,24],"10":[1,3,5],"100":[2,34],"101597":2,"1024":[1,3],"1045":3,"1048576":1,"106321":3,"106381":3,"11":[0,1,3],"113988":3,"1152":3,"12":[1,2,3,5],"12160":2,"12288":2,"12416":2,"125224":3,"12544":2,"12672":2,"127":1,"128":[1,2,3],"1280":3,"13":[1,3],"131072":1,"1328":3,"133347":2,"134217728":1,"138541":3,"14":[1,3],"140799":3,"1408":3,"142849":2,"142862":2,"149375":2,"15":[1,3],"153":2,"1536":3,"153853":2,"154":2,"156532":3,"16":[2,3,8,32],"160":2,"163":2,"16384":1,"1664":3,"167004":2,"16777216":1,"17":3,"171410":2,"1792":3,"179533":2,"18":3,"181817":2,"1823":2,"186":2,"19":[1,3],"190482":1,"192":1,"1920":3,"198":2,"1982":8,"1983":7,"1984":8,"1989":8,"199":2,"1991":[7,8],"1999":8,"1d":[1,2,3],"1e":[1,2,3],"2":[1,2,3,6,8,23,24,34],"20":[3,34],"200000":1,"200001":3,"2004":8,"2006":8,"2012":8,"2013":7,"2014":7,"2016":[7,8],"2017":7,"2018":[7,8],"2019":8,"2021":[7,8],"2048":[2,3],"2097152":1,"209928":2,"21":3,"211821":3,"2141":1,"216187":2,"2176":3,"219":1,"22":3,"220":3,"226":[3,5],"226931":3,"23":3,"2304":3,"24":3,"242181":3,"2432":3,"245":3,"25":[3,34],"251065":3,"256":[1,2,3],"2560":3,"26"
:3,"260869":3,"262144":1,"2656":3,"2688":3,"27":3,"272727":3,"275679":3,"278610":1,"28":[1,3],"2812":3,"2816":3,"2891":3,"29":3,"2944":3,"296679":3,"298541":2,"2d":[3,14],"2m":2,"2mn":2,"3":[0,1,2,3,8],"30":3,"301990":3,"3072":3,"3076":1,"31":3,"3125":3,"313831":3,"32":[3,5],"3200":3,"321474":3,"32768":1,"3281":3,"33":3,"3328":3,"333321":1,"33554432":1,"34":3,"341":1,"3438":3,"3456":3,"3477":3,"3516":3,"3555":3,"3584":3,"359066":2,"36":3,"362445":1,"366642":3,"3712":3,"3713":1,"372800":3,"379162":3,"38":1,"380953":3,"382283":3,"384":[1,2,3],"3840":3,"384000":3,"39":3,"3906":3,"3968":3,"3984":3,"3d":[23,24],"3mn":2,"4":[1,2,3,8],"40":3,"400001":1,"400016":1,"4023":3,"406":2,"4062":3,"4096":[1,2,3],"412":2,"416":2,"4194304":1,"420235":3,"420822":3,"428568":1,"428801":3,"429770":[1,2],"434145":3,"4492":3,"45":3,"4531":3,"455111":3,"46":3,"4609":3,"4688":3,"472":1,"49":3,"491612":3,"493673":3,"4940":1,"4m":2,"4x":2,"5":[1,3,8],"5000":3,"500614":3,"507077":3,"51":3,"512":[2,3],"512000":3,"52":3,"524288":1,"53":3,"530615":3,"5312":3,"54":3,"546":2,"56":3,"560748":3,"563533":3,"563555":3,"565406":3,"566038":2,"577704":1,"578035":3,"585":2,"5859":3,"5898":3,"5mn":2,"6":[0,1,3],"600000":1,"600004":2,"605729":3,"6094":3,"610276":3,"614":1,"615390":1,"617":[2,5],"62":3,"630":2,"632331":3,"635851":3,"64":[1,3],"640":[2,3],"643199":3,"649287":3,"65536":1,"656000":3,"656574":1,"662059":3,"664":2,"666684":2,"67":3,"67108864":1,"6724":1,"68":3,"682":2,"69":3,"6953":3,"7":[0,1,3,8],"70":3,"701820":3,"702":2,"7031":3,"7070":3,"71":3,"72":3,"722":1,"73":3,"730667":3,"731218":3,"74":3,"743683":3,"75":3,"7500":3,"76":[1,3],"768":[2,3],"768000":3,"77":3,"773130":3,"78":3,"780":1,"781":2,"79":3,"8":[1,2,3,8,32,34],"80":[3,34],"800002":1,"81":3,"810":2,"810085":3,"811":2,"811163":1,"812":[1,2],"8192":1,"82":3,"823267":3,"823517":1,"828879":3,"83":3,"832567":3,"833":1,"837":5,"8388608":1,"839992":2,"84":3,"840533":3,"840807":2,"843":1,"848":1,"849":1,"85":3,"850":1,"851":1,"854847":2,"85574
7":3,"86":3,"865247":3,"868366":3,"87":3,"874549":3,"8828":3,"8867":3,"888887":3,"89":3,"8906":3,"892307":3,"8945":3,"896":3,"896254":3,"898285":3,"8mn":2,"9":[0,1,2,3],"90":3,"902096":3,"908442":3,"91":3,"916269":3,"92":3,"9219":3,"925276":2,"926841":3,"928846":3,"929456":3,"93":[2,3],"932191":3,"932484":3,"935215":3,"936606":2,"9375":3,"94":2,"9492":3,"95":2,"9531":3,"96":2,"9688":3,"97":2,"971190":2,"9733":1,"978909":3,"98":2,"9805":3,"983276":3,"98432":1,"9844":3,"994":[1,5],"999995":1,"abstract":[7,8],"break":8,"byte":2,"case":[1,2,3,7,8,11],"class":[2,7,8,33],"default":34,"do":[2,3,7,8,16,29],"float":[2,7,8,34],"function":[1,2,3,8,9,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34,35],"import":[1,2,3,7,8],"int":[1,7,8,10,13,23,24,26,32,34],"new":[3,12,13,26,32],"return":[1,2,3,10,12,14,16,18,20,23,24,25,30,31,32,34,35],"static":[0,7,8],"super":3,"switch":3,"true":[1,2,3,31],"try":3,"var":8,"while":7,A:[3,7,8],And:[0,3],As:[2,3,7,8],At:8,By:34,For:[3,7,8],If:[3,8,29,31,33],In:[1,2,3,8],It:[1,3,4,6,8,9],Of:7,On:8,One:3,The:[1,2,3,7,8,11,12,13,14,23,24,26,29,31,35],There:1,These:8,To:[1,7,8],__expf:2,__init__:33,_matmul:3,a100:[3,8],a_ptr:3,ab:[1,3],abl:8,about:[1,2,3,6],abov:[1,2,3,8],academ:7,acc:[3,7,8],acceler:7,accept:3,access:[1,3,7,8,9],accomod:3,accordingli:8,account:8,accumul:[3,8],accuraci:7,achiev:[3,7,8],across:[2,7,8],activ:3,actual:[3,7,8],add:[1,3,5],add_kernel:1,addit:[2,4,5,7,34],addition:8,address:[7,16],adopt:8,advanc:[2,3,7],advoc:8,affect:3,affin:8,against:[0,1,2,3,6],aggress:[7,8],agnost:[7,8],ahead:8,aim:[2,6],al:[7,8],algebra:8,algorithm:[3,7,8],alia:8,all:[2,3,4,7,8,18,20,22,30,33],allclos:[2,3],allen1984:8,allen:8,alloc:[1,2,3,7],allow:[1,2,7,8],along:[1,3,18,20,23,24,30,34],also:[1,2,3,7,8],alwai:[8,31],amd:7,amen:8,amount:7,ampl:8,an:[1,2,3,7,8,11],analog:1,analysi:[7,8],analyz:8,ancourt1991:8,ancourt:8,ani:[1,2,3,8,33],anoth:[2,8],apart:8,api:33,appear:33,appli:[3,7,8],applic:8,approach:[7,8],appropri:1,approxim:2,ar:[0,1,2
,3,7,8,9,16,22,29,31,33],arang:[1,2,3],arbitrari:3,architectur:[3,7],area:8,arg:[1,2,3,33],argument:[1,2,3,9,31,33],arrai:[8,32],arrang:3,art:[7,8],arxiv:[7,8],ask:2,aspect:8,asplo:7,assert:[1,3],assum:[2,33],asynchron:[1,7],atom:11,auguin1983:7,auguin:7,auto:[2,3,8],autom:7,automat:[2,3,7,8],autotun:[3,8],avail:[0,7,8],avoid:[2,31],awar:7,axi:[1,2,3,18,20,23,24,30,33],b:[3,7,8],b_ptr:3,back:[1,2,3],baghdadi2021:[7,8],baghdadi:[7,8],balanc:8,bandwidth:2,base:[6,7,8],basic:[1,4,8],becom:7,been:[1,7,8],befor:3,begin:8,behavior:8,being:2,believ:8,below:[4,8],bench:0,benchmark:[0,34,35],benefit:[2,7,8],best:[1,7],between:[1,7],block:[1,2,3,7,8,11,12,13,14,15,16,17,18,19,20,21,25,26,27,28,29,30,31,32],block_m:3,block_n:3,block_siz:[1,2,8],block_size_k:3,block_size_m:3,block_size_n:3,block_start:1,blue:[1,2,3],boil:8,bool:[31,33],both:[8,31],bound:[1,2,3,8],branch:8,broad:7,broadcast:[13,16,29,31],build:[0,3],builder:[10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],built:[1,8],c:[3,7,8],c_ptr:3,cach:[7,8],call:[1,3,8,9],callabl:[1,9,34],can:[0,1,2,3,7,8,35],cannot:[3,7,8],capabl:[6,7],cd:0,cdiv:[1,3],ceil:3,cgo:[7,8],chang:3,chapter:6,characterist:8,cheap:7,check:[3,6],chen2018:7,chen:7,chip:2,choic:6,click:[1,2,3],clone:0,close:8,cmake:0,cmp:11,coalesc:7,code:[1,2,3,4,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],col:[3,8],col_offset:2,color:33,column:[2,3],com:0,combin:7,come:[2,3,8],command:0,common:8,commonli:8,compar:[2,3,6,8,11],compat:14,compil:[2,3,6,7,9,22],complet:8,complex:8,compos:7,composit:8,comprehens:[7,8],comput:[6,7,8,15,17,19,21,27,28],computation:[7,8],concern:8,concis:[1,33],condit:[8,31],config:3,configur:[3,35],confirm:2,connectom:7,consecut:8,consequ:7,consid:2,constraint:[3,8],construct:7,constructor:33,consum:3,contain:[3,8,11,12,33],contextu:8,contigu:[3,10,25],control:[7,8],conveni:3,convert:[1,3,9],convolut:7,copi:[7,11],core:[7,8],correct:1,correspond:[1,2,3,33],cost:8,could:[2,8],cours:7,cpython:0,creat:[1,2,3,7],csv:1,cu
bla:[3,7],cuda:[1,2,3,7],cudnn:7,current:24,custom:[1,2,3,6],cut:3,cvpr:7,d:[2,9],dart:8,darte1999:8,data:[1,3,7,8,16,31,32],data_ptr:9,dataflow:8,decad:7,declar:1,decompos:8,decor:[1,3,9],deep:[3,7,8],def:[1,2,3],defin:[1,2,3,8,16],definit:8,denomin:2,denot:1,dens:8,depend:[0,8,31],deploi:7,describ:8,design:8,desir:[13,26],detail:8,detect:7,develop:[7,8],devic:[1,2,3],dialect:8,diesel:8,differ:[1,2,3,7,8,33],difficult:8,difficulti:[3,7],dijkstra82:8,dijkstra:8,dim:[2,8],dimens:[3,14,18,20,30],dimension:[3,8,14],dir:0,direct:3,disjoint:8,disk:1,dissert:8,distribut:[2,8],divis:3,dnn:[6,7,8],do_bench:[1,2,3],doe:[1,2,3,8],doesn:8,domain:[7,8],don:[1,2,3],done:[3,7,18,20,30],dot:3,doubli:3,doubt:8,down:[3,8],download:[0,1,2,3,4],dram:[1,2],dsl:[6,7,8],dtype:[1,2,3,11,12,16,29,32],e:[0,2,3,7,8,32],each:[1,2,3,7,8],eas:8,easi:3,easier:[1,2,7],easili:3,ed:[1,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],education:2,effect:8,effici:[3,7],effort:8,either:[1,23,24,31],elango2018:8,elango:8,element:[1,2,3,15,17,18,19,20,21,27,28,29,30,31,33],element_s:2,element_ti:[11,12,16,29],elementwis:[2,16],els:3,emerg:7,empti:3,empty_lik:[1,2],enabl:8,encod:8,end:[7,8,10],enforc:8,engin:8,enqueu:[1,2],ensur:8,entir:8,environ:6,equal:[2,8],especi:7,et:[7,8],euromicro:7,evalu:[3,31],even:8,evidenc:7,evolv:7,exampl:[1,2,3,4,7,8],execut:[5,7,8,35],exist:[7,8],exp:2,expand:3,expect:[2,11],expens:[7,8],explor:7,exponenti:[2,15],express:[7,8],extar:1,extend:3,extra:3,extract:3,extrem:8,f:[1,2,3,8],facilit:[7,8],fact:8,fairli:3,fals:[16,29,31,33],far:2,fast:[2,7,8],faster:2,fastest:8,feel:3,fetch:7,few:8,field:7,figur:8,file:[1,2,3,5],fill:32,first:[1,3,6,8,14,19,21],first_m_block_in_group:3,fit:2,fix:33,flag:2,flatten:25,flexibl:7,float16:[3,14,32],float32:[1,2,3,14],flow:[7,8],fn:[9,34],focu:[3,8],follow:[0,2,3,6,7,8],forget:1,formal:8,format:8,found:11,foundat:8,fp16:3,fp32:3,framework:[7,8],free:3,from:[1,2,3,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],full:[1,2,3]
,fulli:8,func:8,fundament:8,further:8,fuse:[3,4,5],fusion:[2,8],g:[3,7,8,32],galleri:[1,2,3,4],gb:[1,2],gbp:[1,2],gener:[1,2,3,4,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33],geq:8,get:[1,2,3,5],girbal2006:8,girbal:8,git:0,github:0,give:7,given:[2,3,12,13,23,24,26,32],global:8,go:[1,3,8],good:[1,8],gpgpu:7,gpu:[1,2,6,7,8,9],grad_to_non:34,gradient:34,grammat:8,graphic:7,greater:2,green:[1,2,3],grid:[1,2,3,23,24],grid_m:3,grid_n:3,grosser2012:8,grosser:8,group:3,group_id:3,group_m:3,group_siz:3,group_size_m:3,grow:8,guard:[1,2],guid:7,ha:[1,3,7,8,23,24],had:1,halid:[7,8],hand:8,handl:[1,2,3,8],handwritten:7,hard:3,harder:8,hardwar:[3,6,8],hasn:1,have:[2,7,8,9,14,31,33],heavi:7,helper:[1,2],henc:3,here:[1,2,3],heurist:2,hierarch:7,hierarchi:8,high:[3,7,8],highli:7,highlight:8,hint:8,hit:3,how:[1,2,3,6,7],howev:[2,8],http:0,i:[1,2,3,7,8],id:24,idea:7,ideal:2,ident:2,identifi:1,idx:[16,29],imag:[7,8],implement:[1,2,3,7,8],implicitli:[1,9,16,29],importantli:8,impos:8,improv:3,incompat:[3,8],incorrectli:3,increas:[1,2,3],incred:7,increment:8,inde:8,independ:[2,8],index:1,indic:[3,8,31],induc:8,industri:7,inequ:8,inf:2,inform:8,infrastructur:8,initi:[1,3],inner:[3,14],inplac:3,input:[1,2,3,8,13,14,15,17,18,19,20,21,22,25,26,27,28,30],input_ptr:2,input_row_strid:2,instal:6,instanc:[1,2,3,7,23,24],instead:[2,31],instruct:[6,7],int1:[16,29],integ:8,interchang:8,interest:[7,8],intermedi:8,intern:[2,3,8],interv:10,intrins:8,introduct:6,invari:[2,8],ipynb:[1,2,3],ir:[8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],irregular:[2,8],is_contigu:3,is_cuda:1,issu:[7,8],iter:[3,7,8],its:[1,2,3,8],j:[3,7,8],jit:[1,2,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],journal:8,jrk2013:7,jump:3,jupyt:[1,2,3,4],just:[3,8],k:[3,7,8],k_offset:3,kb:7,kei:[3,7],kellei:7,kernel:[6,7],keyword:1,ki:8,kind:2,know:22,known:8,kxn:3,label:[1,2,3,33],lam1991:7,lam:7,lambda:[1,2,3],languag:[1,2,3,6,7,9],larg:[7,8],last:3,later:[2,8],latest:0,lattner2004:8,lattner2019:8,la
ttner:8,launch:[1,2,3,23,24],law:8,layer:[7,8],lead:[3,7,8],leaky_relu:3,leakyrelu:3,learn:[1,2,3,6,7,8],least:8,lee2017:7,lee:7,left:8,legal:8,length:1,less:[7,8],let:[1,2,22],letter:8,level:[3,7,8],li:7,librari:[0,3,7,8],lifelong:8,like:[1,7,8],limit:2,line:[1,2,3,8,33],line_arg:[1,2,3,33],line_nam:[1,2,3,33],line_v:[1,2,3,33],linear:[7,8],link:0,list:[1,3,33,34,35],litteratur:8,llvm11:0,llvm:[0,8],load:[1,2,3,8,31],local:[7,8],locat:[3,11,12,16,29],log:33,logarithm:[1,17],look:[6,7],loop:[3,8],low:8,m:[0,2,3,7],m_offsets_a:3,m_offsets_c:3,m_start:3,machin:[7,8],machineri:[7,8],made:7,mai:[2,8],main:[3,7,8],maintain:[2,8],major:[3,8],make:[1,2,7,8],manag:7,mani:[1,7,8],manual:[2,8],manual_se:[1,2,3],map:3,mapl:8,mark:35,markedli:7,mask:[1,2,3,16,29,31],match:[3,11],mathbb:8,mathbf:8,mathcal:8,mathemat:8,matmul:[3,8],matmul_kernel:3,matric:[2,3],matrix:[2,4,5,7,8,14],matrix_s:8,matter:[3,7,8],max:[1,2],max_m:[1,2,3],maxim:[6,8],maximum:[1,2,18],mb:[5,7],mean:[3,8],mechan:[2,8],median:34,memori:[1,2,3,7,8,11,12,16,29,31],mention:3,meta:[1,2,3],metaparamet:[1,3],method:[8,9,33,35],methodolog:8,micro:7,min:3,min_m:[1,2,3],minimum:20,minut:[1,2,3],miss:8,mitig:8,ml:7,mlir:8,mn:2,model:[1,7,8],modern:[3,6,7,8],modular:8,moor:8,more:[2,3,6,7,8,33],most:[3,8],move:3,ms:[1,2,3,34],much:[2,3],mullapudi2016:8,mullapudi:8,multi:[3,7,8],multipl:[1,4,5,7,8,22],multipli:[3,8,14],must:[2,3,10,14,31],mxk:3,n:[2,3,7],n_blocks_m:3,n_blocks_n:3,n_col:2,n_element:1,n_offsets_b:3,n_offsets_c:3,n_output_blocks_in_group:3,n_row:2,n_start:3,naiv:2,naive_softmax:2,name:[1,2,3,33],nativ:[1,2,3],natur:[2,7,17],nb:7,necessari:2,need:[1,2,3],nelement:2,nest:[3,8],net:8,network:[7,8],neural:[7,8],neurosci:7,next:[2,3],next_power_of_2:2,nightli:0,nip:7,nn:3,non:7,none:[2,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34],nonzero:31,normal:[2,3],note:[0,1,2,3,8,9,31],notebook:[1,2,3,4],notic:[2,8],notori:[3,7],novel:7,now:[1,3],num_stag:3,num_warp:[2,3],number:[1,2,3,8,23],numer:[2,7]
,nvidia:7,o:2,object:[1,3,7,9,11],obtain:1,obvious:2,occur:8,offer:7,offici:0,offset:[1,3],often:3,old:12,omega:8,onc:[2,7,8],one:[2,3,4,7,8,33],onli:[2,3,7,8,9],op:[1,2],open:10,openai:0,opencl:7,oper:[1,2,3,4,7,31],operand:3,opportun:7,opsila:7,optim:[7,8],option:[1,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34],order:[2,3,4,8],origin:8,osdi:7,other:[2,3,6,8,9,14,16,19,21],otherwis:31,our:[1,2,3,7],out:[1,2,3,6,8],outlin:8,output:[1,2,3],output_ptr:[1,2],output_row_start_ptr:2,output_row_strid:2,output_torch:1,output_triton:1,over:[2,7,8],overflow:2,own:3,p:8,pa:3,packag:9,pact:8,pad:2,par:3,paradigm:[7,8],parallel:[1,2,3,6,7,8],paralleliz:7,param:22,paramet:[1,3,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26,27,28,29,30,31,32,33,34,35],parametr:7,part:[3,8],particular:[2,3],particularli:[7,8],partit:7,pass:[1,8],past:[7,8],path:1,pattern:7,pb:3,peak:8,per:2,percentil:34,perf:3,perf_report:[1,2,3,33],perform:[1,2,7,8,11,34],person:8,perspect:8,phase:8,philosophi:8,pid:[1,3],pid_m:3,pid_n:3,pip:0,pipelin:[7,8],platform:[6,8],pldi:7,plot:[0,1,2,3,33],plot_nam:[1,2,3,33],pmatrix:8,point:[1,8],pointer:[1,2,9,11,12,16,29],pointerdtyp:[11,12,16,29],polli:8,polyhedr:7,polyhedra:8,popular:8,portabl:[7,8],pose:7,possibl:[1,2,3,8],power:[2,8,10],ppopp:8,practic:[1,2,3,7],pragma:7,pre:[0,7],prealloc:1,predict:8,prefer:2,premis:7,present:[0,3],preserv:8,prevent:8,primer:8,primit:[7,9],principl:8,print:[1,2,3],print_data:[1,2,3],probabl:8,problem:1,problemat:8,procedur:8,process:[1,7,8],processor:7,product:[6,8,14],program:[1,2,3,6,7,23,24],program_id:[1,2,3],programm:[7,8],project:7,promot:[3,8],properli:2,properti:8,propos:7,proprietari:3,provid:[1,2,3,6,8,18,20,30,34],pseudo:3,ptr:3,purpos:[7,8],push:8,py:[0,1,2,3,5],pypi:0,pytest:0,python:[1,2,3,4,9],pytorch:[1,2],qquad:8,quantiz:3,r:2,ragan:7,rand:1,randn:[2,3],rang:[1,2,3,7,8],rapidli:[7,8],rate:3,rather:7,raw:1,rdom:8,re:[1,3],read:[2,3,4],reader:8,real:7,reason:8,recent:7,recommend:4,recomput:7,rec
tifi:7,redmon2016:7,redmon:7,reduct:[2,18,20,30],refer:1,regardless:31,regrett:7,regular:8,rel:[1,8],relat:6,releas:[0,7],reli:8,relu:3,remain:[7,33],rememb:3,reorder:8,rep:34,repetit:34,repres:[2,3,8],requir:[2,8],research:[7,8],reset:34,resolut:8,resourc:7,resp:8,respect:8,restrict:8,result:[0,1,2,7,8],ret:2,retriev:8,reus:3,revisit:7,right:8,rise:8,rk:3,rm:3,rn:3,role:8,roughli:3,row:[2,3],row_idx:2,row_minus_max:2,row_start_ptr:2,run:[0,1,2,3,6,8,9,35],runtim:[8,34],rvar:8,s:[1,2,8],said:8,same:[7,33],sato2019:8,sato:8,save:[1,2,3],save_path:1,sc:8,scalabl:8,scalar:[7,14,32],scale:33,scan:8,schedul:7,scienc:8,scientif:8,scop:8,scope:8,script:[0,1,2,3],second:[1,2,3,8,14,19,21],section:8,see:[1,2,3,8],seem:[1,8],select:[7,8,31],self:33,semant:8,semi:8,sens:[1,7,8],separ:8,sequenc:7,set:[1,8],setup:0,sever:[7,8],shall:8,shape:[1,2,3,8,13,16,26,29,31,32],share:7,shift:2,should:[1,3,7,8,18,20,30,33],show_plot:[1,2,3],shown:8,side:8,sight:8,signal:7,significantli:2,sigplan:8,simd:7,simpl:[1,2,3],simplest:4,simpli:8,simplic:3,sinc:[1,2,3],singl:[2,7],size:[1,2,3,8],slower:[7,8],slowest:8,sm:8,smaller:3,smallest:2,snemi3d:7,so:[1,2,3,8],softmax:[4,5],softmax_kernel:2,softmax_output:2,solid:8,solut:3,solv:8,some:3,sometim:8,sourc:[1,2,3,4,8],space:[7,8],spars:[7,8],spatial:8,speak:3,special:7,specif:[3,7],specifi:[8,11,29],speed:2,sphinx:[1,2,3,4],split:8,spmd:[1,7,8],sram:[2,3],stabil:2,stabl:0,standard:8,start:[4,10],started_tutori:5,state:[7,8],statement:8,step:8,still:[1,2,8],stop:10,store:[1,2,3,12,31],str:33,straightforward:3,strategi:8,strength:7,stride:[2,3],stride_a_0:3,stride_a_1:3,stride_ak:3,stride_am:3,stride_b_0:3,stride_b_1:3,stride_bk:3,stride_bn:3,stride_cm:3,stride_cn:3,stride_x_0:3,stride_x_1:3,structur:[7,8],style:[1,2,3,33],subscript:8,substanti:7,substract:2,subtract:2,successfulli:8,suffer:8,suit:7,sum:[1,2],superhuman:7,support:8,sure:2,surprisingli:7,surround:8,suspicion:2,sutskev:7,sutskever2014:7,swap:[11,12],swizzl:7,synchron:[1,7],system:[0,
3,7,8],t:[1,2,3,8],t_:8,taco:8,take:[3,6],taken:[3,8],target:7,techniqu:[3,7,8],tempor:8,tend:8,tension:7,tensor:[1,2,3,7,8,9,34],tensorrt:7,term:3,test:[0,1,6],text:8,tflop:3,th:34,than:[2,3,7,8,33],thei:[3,7,8],them:1,themselv:3,theoret:2,therebi:8,therefor:3,theta:8,theta_:8,thi:[1,2,3,7,8,9,33],thing:1,think:2,those:2,though:[7,8],thought:8,thread:[2,7],through:[4,8],throughout:[8,33],throughput:6,tile:8,time:[0,1,2,3,7,8,34],tiramisu:[7,8],tl:[1,2,3],tmp:0,tog:8,topic:8,torch:[1,2,3,9,34],torch_output:3,torch_relu:3,total:[1,2,3,5],tradit:[7,8],transform:8,travers:8,trend:7,tri:[13,26],trick:2,trigger:3,triton:[0,1,2,3,4,7,8],triton_output:3,trivial:7,tune:[2,3,8],tupl:[1,13,26,32],tutori:[1,2,3,6],tutorials_jupyt:4,tutorials_python:4,tvm:[7,8],two:[1,2,3,8,10,14],type:[14,22,31,32],typecast:[16,29],typic:8,u:0,un:8,uncommon:8,underneath:8,understand:2,unfortun:[3,8],unifi:7,unint:31,unit:[0,7],univers:8,unrol:8,up:2,updat:[3,8],us:[1,2,3,7,8,9,31,33,35],util:1,v100:8,val:[11,12],valid:1,valu:[1,2,3,10,11,12,15,16,17,18,20,22,29,30,31,32,33,35],valuabl:2,variabl:3,variant:7,variou:4,vasilach:[7,8],vasilache2018:[7,8],vast:8,vec:8,vector:[4,5,7,8],vendor:3,veri:[2,8],verif:8,verifi:[2,8],via:[3,8],view:25,visibl:8,vision:7,vs:0,w:8,wai:[2,3],want:[2,31],warmup:34,warp:2,wast:2,we:[1,2,3,7,8],well:[7,8],wheel:0,when:[2,3,7,8,9,31],where:[1,3,8,29],whether:[7,33],which:[1,2,3,7,8,12,18,20,30,33],whose:[1,2,3,8,16],wide:8,width:3,wise:[1,2,15,17,19,21,27,28,29],wish:[3,8],within:[3,9,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],without:8,wolf:8,wolfe1989:8,won:2,word:8,work:[2,6,7],workload:3,wors:[7,8],would:[1,2],wouldn:8,wrapper:3,write:[1,2,3,4,6,8],written:3,wrote:2,x:[1,2,3,8,15,17,19,21,25,27,28,31,33],x_log:[1,33],x_max:2,x_name:[1,2,3,33],x_ptr:1,x_val:[1,2,3,33],xi:8,xii:8,xlabel:33,xo:8,y:[1,2,3,8,19,21,31,33],y_log:33,y_name:[1,2],y_ptr:1,y_torch:2,y_triton:2,year:8,yet:[7,8],yi:8,yield:31,yii:8,ylabel:[1,2,3,33],yo:8,you:[0,1,2,3,4,7,31],your
:[0,1,6],yourself:[2,3],z:[1,2,8],zero:3,zip:4},titles:["Installation","Vector Addition","Fused Softmax","Matrix Multiplication","Tutorials","Computation times","Welcome to Triton\u2019s documentation!","Introduction","Related Work","triton.jit","triton.language.arange","triton.language.atomic_cas","triton.language.atomic_xchg","triton.language.broadcast_to","triton.language.dot","triton.language.exp","triton.language.load","triton.language.log","triton.language.max","triton.language.maximum","triton.language.min","triton.language.minimum","triton.language.multiple_of","triton.language.num_programs","triton.language.program_id","triton.language.ravel","triton.language.reshape","triton.language.sigmoid","triton.language.softmax","triton.language.store","triton.language.sum","triton.language.where","triton.language.zeros","triton.testing.Benchmark","triton.testing.do_bench","triton.testing.perf_report","triton","triton.language","triton.testing"],titleterms:{"final":3,addit:1,advantag:8,algebra:37,api:6,arang:10,arithmet:3,atomic_ca:11,atomic_xchg:12,benchmark:[1,2,3,33],binari:0,broadcast_to:13,cach:3,challeng:7,comparison:37,compil:[8,37],comput:[1,2,3,5],creation:37,distribut:0,do_bench:34,document:6,dot:14,exp:15,from:0,further:6,fuse:2,get:6,go:6,hint:37,index:37,instal:0,introduct:7,jit:9,kernel:[1,2,3],l2:3,languag:[8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,37],limit:8,linear:37,load:16,log:17,manipul:37,math:37,matrix:3,max:18,maximum:19,memori:37,min:20,minimum:21,model:37,motiv:[2,3,7],multipl:3,multiple_of:22,num_program:23,op:37,optim:3,packag:0,perf_report:35,perform:3,pointer:3,polyhedr:8,program:[8,37],program_id:24,python:[0,6],ravel:25,reduct:37,refer:[7,8],relat:8,represent:8,reshap:26,result:3,s:6,schedul:8,shape:37,sigmoid:27,softmax:[2,28],sourc:0,squar:3,start:6,store:29,sum:30,test:[2,3,33,34,35,38],time:5,triton:[6,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38],tutori:4,uni
t:[2,3],vector:1,welcom:6,where:31,work:8,zero:32}})
\ No newline at end of file
+Search.setIndex({docnames:["getting-started/installation","getting-started/tutorials/01-vector-add","getting-started/tutorials/02-fused-softmax","getting-started/tutorials/03-matrix-multiplication","getting-started/tutorials/index","getting-started/tutorials/sg_execution_times","index","programming-guide/chapter-1/introduction","programming-guide/chapter-2/related-work","python-api/generated/triton.jit","python-api/generated/triton.language.arange","python-api/generated/triton.language.atomic_cas","python-api/generated/triton.language.atomic_xchg","python-api/generated/triton.language.broadcast_to","python-api/generated/triton.language.dot","python-api/generated/triton.language.exp","python-api/generated/triton.language.load","python-api/generated/triton.language.log","python-api/generated/triton.language.max","python-api/generated/triton.language.maximum","python-api/generated/triton.language.min","python-api/generated/triton.language.minimum","python-api/generated/triton.language.multiple_of","python-api/generated/triton.language.num_programs","python-api/generated/triton.language.program_id","python-api/generated/triton.language.ravel","python-api/generated/triton.language.reshape","python-api/generated/triton.language.sigmoid","python-api/generated/triton.language.softmax","python-api/generated/triton.language.store","python-api/generated/triton.language.sum","python-api/generated/triton.language.where","python-api/generated/triton.language.zeros","python-api/generated/triton.testing.Benchmark","python-api/generated/triton.testing.do_bench","python-api/generated/triton.testing.perf_report","python-api/triton","python-api/triton.language","python-api/triton.testing"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["getting-start
ed/installation.rst","getting-started/tutorials/01-vector-add.rst","getting-started/tutorials/02-fused-softmax.rst","getting-started/tutorials/03-matrix-multiplication.rst","getting-started/tutorials/index.rst","getting-started/tutorials/sg_execution_times.rst","index.rst","programming-guide/chapter-1/introduction.rst","programming-guide/chapter-2/related-work.rst","python-api/generated/triton.jit.rst","python-api/generated/triton.language.arange.rst","python-api/generated/triton.language.atomic_cas.rst","python-api/generated/triton.language.atomic_xchg.rst","python-api/generated/triton.language.broadcast_to.rst","python-api/generated/triton.language.dot.rst","python-api/generated/triton.language.exp.rst","python-api/generated/triton.language.load.rst","python-api/generated/triton.language.log.rst","python-api/generated/triton.language.max.rst","python-api/generated/triton.language.maximum.rst","python-api/generated/triton.language.min.rst","python-api/generated/triton.language.minimum.rst","python-api/generated/triton.language.multiple_of.rst","python-api/generated/triton.language.num_programs.rst","python-api/generated/triton.language.program_id.rst","python-api/generated/triton.language.ravel.rst","python-api/generated/triton.language.reshape.rst","python-api/generated/triton.language.sigmoid.rst","python-api/generated/triton.language.softmax.rst","python-api/generated/triton.language.store.rst","python-api/generated/triton.language.sum.rst","python-api/generated/triton.language.where.rst","python-api/generated/triton.language.zeros.rst","python-api/generated/triton.testing.Benchmark.rst","python-api/generated/triton.testing.do_bench.rst","python-api/generated/triton.testing.perf_report.rst","python-api/triton.rst","python-api/triton.language.rst","python-api/triton.testing.rst"],objects:{"triton.language":{arange:[10,0,1,""],atomic_cas:[11,0,1,""],atomic_xchg:[12,0,1,""],broadcast_to:[13,0,1,""],dot:[14,0,1,""],exp:[15,0,1,""],load:[16,0,1,""],log:[17,0,1,""],ma
x:[18,0,1,""],maximum:[19,0,1,""],min:[20,0,1,""],minimum:[21,0,1,""],multiple_of:[22,0,1,""],num_programs:[23,0,1,""],program_id:[24,0,1,""],ravel:[25,0,1,""],reshape:[26,0,1,""],sigmoid:[27,0,1,""],softmax:[28,0,1,""],store:[29,0,1,""],sum:[30,0,1,""],where:[31,0,1,""],zeros:[32,0,1,""]},"triton.testing":{Benchmark:[33,1,1,""],do_bench:[34,0,1,""],perf_report:[35,0,1,""]},"triton.testing.Benchmark":{__init__:[33,2,1,""]},triton:{jit:[9,0,1,""]}},objnames:{"0":["py","function","Python function"],"1":["py","class","Python class"],"2":["py","method","Python method"]},objtypes:{"0":"py:function","1":"py:class","2":"py:method"},terms:{"0":[1,2,3,5,7,8,23,24,32,34],"00":5,"0000":3,"000000":2,"000001":[1,2],"000002":2,"003845":3,"007961":2,"01":[1,3,5],"012395":2,"02":[2,5],"023796":3,"025776":3,"028308":3,"03":[3,5],"0625":3,"074829":3,"084721":1,"0938":3,"096718":2,"097543":2,"099871":3,"0f":8,"1":[1,2,3,6,8,23,24],"10":[1,3,5],"100":[2,34],"1024":[1,3],"1045":3,"1048576":1,"106321":3,"109587":3,"11":[0,1,3],"112900":3,"114488":3,"1152":3,"12":[1,2,3,5],"12160":2,"12288":2,"12416":2,"12544":2,"12672":2,"127":1,"128":[1,2,3],"1280":3,"13":[1,3],"131072":1,"1328":3,"133347":2,"134217728":1,"135370":3,"136491":3,"138541":3,"14":[1,3],"140799":3,"1408":3,"142849":2,"142862":2,"148045":3,"15":[1,3],"153":2,"1536":3,"153853":2,"154":2,"16":[2,3,8,32],"160":2,"163":2,"16384":1,"1664":3,"167004":2,"16777216":1,"17":3,"171410":2,"173899":3,"1792":3,"179234":3,"18":3,"181817":2,"1823":2,"186":2,"19":[1,3],"190482":1,"191387":3,"192":1,"1920":3,"197876":3,"198":2,"1982":8,"1983":7,"1984":8,"1989":8,"199":2,"1991":[7,8],"1999":8,"1d":[1,2,3],"1e":[1,2,3],"2":[1,2,3,6,8,23,24,34],"20":[3,34],"200000":1,"200001":3,"2004":8,"2006":8,"2012":8,"2013":7,"2014":7,"2016":[7,8],"2017":7,"2018":[7,8],"2019":8,"2021":[7,8],"2048":[2,3],"208":5,"2097152":1,"21":3,"2141":1,"216187":2,"2176":3,"217666":3,"219":1,"22":3,"220":3,"222812":2,"23":3,"2304":3,"230573":3,"234154":3,"24":3,"2432":3,"24
5":3,"25":[3,34],"256":[1,2,3],"2560":3,"26":3,"260869":3,"262144":1,"2656":3,"2688":3,"27":3,"275679":3,"278610":1,"28":[1,3],"2812":3,"2816":3,"2891":3,"29":3,"2944":3,"296057":2,"296679":3,"2d":[3,14],"2m":2,"2mn":2,"3":[0,1,2,3,8],"30":[3,5],"305746":3,"3072":3,"3076":1,"31":3,"3125":3,"32":3,"3200":3,"32768":1,"3281":3,"33":3,"3328":3,"33554432":1,"34":3,"3438":3,"3456":3,"3477":3,"3516":3,"353753":3,"3555":3,"3584":3,"359066":2,"36":3,"362445":1,"362895":3,"3712":3,"3713":1,"372618":3,"372800":3,"379162":3,"38":1,"380953":3,"384":[1,2,3],"3840":3,"384000":3,"39":3,"3906":3,"3968":3,"3984":3,"3d":[23,24],"3mn":2,"4":[1,2,3,8],"40":3,"400001":1,"400016":1,"4023":3,"405":2,"4062":3,"4096":[1,2,3],"409694":3,"411":2,"412":2,"415":2,"4194304":1,"424125":3,"428568":1,"428801":3,"429770":[1,2],"434145":3,"44":3,"446623":3,"448944":3,"4492":3,"4531":3,"455111":3,"46":3,"4609":3,"4688":3,"472":1,"472263":3,"49":3,"492442":3,"4940":1,"494768":3,"498":[3,5],"4m":2,"4x":2,"5":[1,3,8],"5000":3,"501144":3,"507077":3,"51":3,"512":[2,3],"512000":3,"518272":3,"52":3,"524288":1,"530615":3,"5312":3,"531519":3,"54":[3,5],"540320":3,"546":2,"552332":3,"56":3,"560748":3,"563555":3,"566038":2,"566925":3,"577704":1,"585":2,"5859":3,"587973":3,"5898":3,"5mn":2,"6":[0,1,3],"600000":1,"600004":2,"605356":3,"6094":3,"61":3,"614":1,"615390":1,"62":3,"63":[1,3],"630":2,"636381":3,"64":[1,3],"640":[2,3],"642823":3,"643199":3,"649287":3,"65536":1,"656000":3,"656574":1,"664":2,"666684":2,"67":3,"67108864":1,"6724":1,"68":3,"682":2,"69":3,"6953":3,"7":[0,1,3,8],"70":3,"702":2,"7031":3,"7070":3,"707223":3,"708630":3,"71":3,"72":3,"722":1,"73":3,"733474":3,"739":[2,5],"74":3,"75":3,"7500":3,"755369":2,"755985":2,"76":[1,3],"768":[2,3],"768000":3,"77":3,"777746":3,"78":3,"780":1,"781":2,"79":3,"8":[1,2,3,8,32,34],"80":[3,34],"800002":1,"809814":3,"81":3,"810":2,"810085":3,"811":2,"811163":1,"812":[1,2],"817862":3,"8192":1,"82":3,"823267":3,"823517":1,"83":3,"833":1,"8388608":1,"839992":2,"84":3,"
840807":2,"843":1,"847460":3,"848":1,"849":1,"85":3,"850":1,"851":1,"86":3,"863336":3,"87":3,"88":3,"8828":3,"8867":3,"888887":3,"89":3,"8906":3,"892307":3,"8945":3,"896":3,"898285":3,"8mn":2,"9":[0,1,2,3],"90":3,"902096":3,"905938":3,"908442":3,"9219":3,"925276":2,"93":[2,3],"932191":3,"936606":2,"9375":3,"94":[2,3],"947349":3,"9492":3,"95":2,"9531":3,"955779":3,"96":2,"967162":3,"9688":3,"97":2,"971":[1,5],"971190":2,"9733":1,"978909":3,"98":2,"9805":3,"983276":3,"98432":1,"9844":3,"999995":1,"999998":1,"abstract":[7,8],"break":8,"byte":2,"case":[1,2,7,8,11],"class":[2,7,8,33],"default":34,"do":[2,3,7,8,16,29],"float":[2,7,8,34],"function":[1,2,3,8,9,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34,35],"import":[1,2,3,7,8],"int":[1,7,8,10,13,23,24,26,32,34],"new":[12,13,26,32],"return":[1,2,3,10,12,14,16,18,20,23,24,25,30,31,32,34,35],"static":[0,7,8],"super":3,"switch":3,"true":[1,2,3,31],"try":3,"var":8,"while":[3,7],A:[3,7,8],And:[0,3],As:[2,3,7,8],At:8,By:34,For:[3,7,8],If:[8,29,31,33],In:[1,2,3,8],It:[1,3,4,6,8,9],Of:7,On:8,One:3,The:[1,2,3,7,8,11,12,13,14,23,24,26,29,31,35],There:1,These:8,To:[1,7,8],__expf:2,__init__:33,_matmul:3,a100:[3,8],a_ptr:3,ab:1,abl:8,about:[1,2,3,6],abov:[1,2,3,8],academ:7,acc:[3,7,8],acceler:7,access:[1,3,7,8,9],accomod:3,accordingli:8,account:8,accumul:[3,8],accuraci:[3,7],achiev:[3,7,8],across:[2,7,8],activ:3,actual:[3,7,8],add:[1,5],add_kernel:1,addit:[2,4,5,7,34],addition:8,address:[7,16],adopt:8,advanc:[2,3,7],advoc:8,affect:3,affin:8,after:3,against:[0,1,2,3,6],aggress:[7,8],agnost:[7,8],ahead:8,aim:[2,6],al:[7,8],algebra:8,algorithm:[3,7,8],alia:8,all:[2,3,4,7,8,18,20,22,30,33],allclos:[2,3],allen1984:8,allen:8,alloc:[1,2,3,7],allow:[1,2,7,8],along:[1,3,18,20,23,24,30,34],also:[1,2,3,7,8],alwai:[8,31],amd:7,amen:8,amount:7,ampl:8,an:[1,2,3,7,8,11],analog:1,analysi:[7,8],analyz:8,ancourt1991:8,ancourt:8,ani:[1,2,3,8,33],anoth:[2,8],apart:8,api:33,appear:33,appli:[3,7,8],applic:8,approach:[7,8],appropri:1,approxim:
2,ar:[0,1,2,3,7,8,9,16,22,29,31,33],arang:[1,2,3],arbitrari:3,architectur:[3,7],area:8,arg:[1,2,3,33],argument:[1,2,3,9,31,33],arrai:[8,32],arrang:3,art:[7,8],arxiv:[7,8],ask:2,aspect:8,asplo:7,assert:[1,3],assum:[2,33],asynchron:[1,7],atom:11,auguin1983:7,auguin:7,auto:[2,3,8],autom:7,automat:[2,3,7,8],autotun:[3,8],avail:[0,7,8],avoid:[2,31],awar:7,axi:[1,2,3,18,20,23,24,30,33],b:[3,7,8],b_ptr:3,back:[1,2,3],baghdadi2021:[7,8],baghdadi:[7,8],balanc:8,bandwidth:2,base:[6,7,8],basic:[1,4,8],becom:7,been:[1,7,8],befor:3,begin:8,behavior:8,being:2,believ:8,below:[4,8],bench:0,benchmark:[0,34,35],benefit:[2,7,8],best:[1,7],between:[1,7],block:[1,2,3,7,8,11,12,13,14,15,16,17,18,19,20,21,25,26,27,28,29,30,31,32],block_siz:[1,2,8],block_size_k:3,block_size_m:3,block_size_n:3,block_start:1,blue:[1,2,3],boil:8,bool:[31,33],both:[8,31],bound:[1,2,3,8],branch:8,broad:7,broadcast:[13,16,29,31],build:[0,3],builder:[10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],built:[1,8],c:[3,7,8],c_mask:3,c_ptr:3,cach:[7,8],call:[1,3,8,9],callabl:[1,9,34],can:[0,1,2,3,7,8,35],cannot:[3,7,8],capabl:[6,7],cd:0,cdiv:[1,3],cgo:[7,8],chang:3,chapter:6,characterist:8,cheap:7,check:[3,6],chen2018:7,chen:7,chip:2,choic:6,click:[1,2,3],clone:0,close:8,cmake:0,cmp:11,coalesc:7,code:[1,2,3,4,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],col:[3,8],col_offset:2,color:33,column:[2,3],com:0,combin:7,come:[2,3,8],command:0,common:8,commonli:8,compar:[2,3,6,8,11],compat:14,compil:[2,3,6,7,9,22],complet:8,complex:8,compos:7,composit:8,comprehens:[7,8],comput:[6,7,8,15,17,19,21,27,28],computation:[7,8],concern:8,concis:[1,33],condit:[8,31],config:3,configur:[3,35],confirm:2,connectom:7,consecut:8,consequ:7,consid:2,constraint:[3,8],construct:7,constructor:33,consum:3,contain:[8,11,12,33],contextu:8,contigu:[3,10,25],control:[7,8],conveni:3,convert:[1,3,9],convolut:7,copi:[7,11],core:[7,8],correct:1,correspond:[1,2,3,33],cost:8,could:[2,8],cours:7,cpython:0,creat:[1,2,3,7],csv:1,cubla:[3,7]
,cuda:[1,2,3,7],cudnn:7,current:24,custom:[1,2,3,6],cut:3,cvpr:7,d:[2,9],dart:8,darte1999:8,data:[1,3,7,8,16,31,32],data_ptr:9,dataflow:8,decad:7,declar:1,decompos:8,decor:[1,3,9],dedic:3,deep:[3,7,8],def:[1,2,3],defin:[1,2,3,8,16],definit:8,denomin:2,denot:1,dens:8,depend:[0,8,31],deploi:7,describ:8,design:8,desir:[13,26],detail:[3,8],detect:7,develop:[7,8],devic:[1,2,3],dialect:8,diesel:8,differ:[1,2,3,7,8,33],difficult:8,difficulti:[3,7],dijkstra82:8,dijkstra:8,dim:[2,8],dimens:[3,14,18,20,30],dimension:[3,8,14],dir:0,direct:3,disjoint:8,disk:1,dissert:8,distribut:[2,8],divis:3,dnn:[6,7,8],do_bench:[1,2,3],doe:[1,2,3,8],doesn:8,domain:[7,8],don:[1,2,3],done:[3,7,18,20,30],dot:3,doubli:3,doubt:8,down:[3,8],download:[0,1,2,3,4],dram:[1,2],dsl:[6,7,8],dtype:[1,2,3,11,12,16,29,32],e:[0,2,3,7,8,32],each:[1,2,3,7,8],eas:8,easi:3,easier:[1,2,7],easili:3,ed:[1,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],education:2,effect:8,effici:[3,7],effort:8,either:[1,23,24,31],elango2018:8,elango:8,element:[1,2,3,15,17,18,19,20,21,27,28,29,30,31,33],element_s:2,element_ti:[11,12,16,29],elementwis:[2,16],els:3,emerg:7,empti:3,empty_lik:[1,2],enabl:8,encod:8,end:[7,8,10],enforc:8,engin:8,enqueu:[1,2],ensur:8,entir:8,environ:6,equal:[2,8],error:3,especi:7,et:[7,8],euromicro:7,evalu:[3,31],even:8,evidenc:7,evolv:7,exampl:[1,2,3,4,7,8],execut:[5,7,8,35],exist:[7,8],exp:2,expect:[2,11],expens:[7,8],explor:7,exponenti:[2,15],express:[7,8],extar:1,extend:3,extract:3,extrem:8,f:[1,2,3,8],facilit:[7,8],fact:8,fairli:3,fals:[16,29,31,33],far:2,fast:[2,7,8],faster:2,fastest:8,feel:3,fetch:7,few:8,field:7,figur:8,file:[1,2,3,5],fill:32,first:[1,3,6,8,14,19,21],first_pid_m:3,fit:2,fix:33,flag:2,flatten:25,flexibl:7,float16:[3,14,32],float32:[1,2,3,14],flow:[7,8],fn:[9,34],focu:[3,8],follow:[0,2,3,6,7,8],forget:1,formal:8,format:8,found:11,foundat:8,fp16:3,fp32:3,framework:[7,8],free:3,from:[1,2,3,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],full:[1,2,3],fulli:8,func:8,f
undament:8,further:8,fuse:[3,4,5],fusion:[2,8],g:[3,7,8,32],galleri:[1,2,3,4],gb:[1,2],gbp:[1,2],gener:[1,2,3,4,7,8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33],geq:8,get:[1,2,3,5],girbal2006:8,girbal:8,git:0,github:0,give:7,given:[2,3,12,13,23,24,26,32],global:8,go:[1,3,8],good:[1,8],gpgpu:7,gpu:[1,2,6,7,8,9],grad_to_non:34,gradient:34,grammat:8,graphic:7,greater:2,green:[1,2,3],grid:[1,2,3,23,24],grid_m:3,grid_n:3,grosser2012:8,grosser:8,group:3,group_id:3,group_m:3,group_size_m:3,grow:8,guard:[1,2],guid:7,ha:[1,3,7,8,23,24],had:1,halid:[7,8],hand:8,handl:[1,2,8],handwritten:7,hard:3,harder:8,hardwar:[3,6,8],hasn:1,have:[2,7,8,9,14,31,33],heavi:7,helper:[1,2],henc:3,here:[1,2,3],heurist:2,hierarch:7,hierarchi:8,high:[3,7,8],higher:3,highli:7,highlight:8,hint:8,hit:3,how:[1,2,3,6,7],howev:[2,8],http:0,i:[1,2,3,7,8],id:[3,24],idea:7,ideal:2,ident:2,identifi:1,idx:[16,29],imag:[7,8],implement:[1,2,3,7,8],implicitli:[1,9,16,29],importantli:8,impos:8,improv:3,incompat:[3,8],incorrect:3,increas:[1,2,3],incred:7,increment:8,inde:8,independ:[2,8],index:1,indic:[8,31],induc:8,industri:7,inequ:8,inf:2,inform:8,infrastructur:8,initi:[1,3],inner:[3,14],inplac:3,input:[1,2,3,8,13,14,15,17,18,19,20,21,22,25,26,27,28,30],input_ptr:2,input_row_strid:2,instal:6,instanc:[1,2,3,7,23,24],instead:[2,31],instruct:[6,7],int1:[16,29],integ:8,interchang:8,interest:[7,8],intermedi:8,intern:[2,8],interv:10,intrins:8,introduct:6,invari:[2,8],ipynb:[1,2,3],ir:[8,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],irregular:[2,8],is_contigu:3,is_cuda:1,isn:3,issu:[7,8],iter:[3,7,8],its:[1,2,3,8],j:[3,7,8],jit:[1,2,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],journal:8,jrk2013:7,jupyt:[1,2,3,4],just:[3,8],k:[3,7,8],kb:7,kei:[3,7],kellei:7,kernel:[6,7],keyword:1,ki:8,kind:2,know:22,known:8,label:[1,2,3,33],lam1991:7,lam:7,lambda:[1,2,3],languag:[1,2,3,6,7,9],larg:[7,8],last:3,later:[2,8],latest:0,lattner2004:8,lattner2019:8,lattner:8,launch:[1,2,3,23,24],law:8,layer:[
7,8],lead:[7,8],leaky_relu:3,leakyrelu:3,learn:[1,2,3,6,7,8],least:8,lee2017:7,lee:7,left:8,legal:8,length:1,less:[7,8],let:[1,2,22],letter:8,level:[3,7,8],li:7,librari:[0,3,7,8],lifelong:8,like:[1,7,8],limit:2,line:[1,2,3,8,33],line_arg:[1,2,3,33],line_nam:[1,2,3,33],line_v:[1,2,3,33],linear:[7,8],link:0,list:[1,3,33,34,35],litteratur:8,llvm11:0,llvm:[0,8],load:[1,2,3,8,31],local:[7,8],locat:[3,11,12,16,29],log:33,logarithm:[1,17],look:[6,7],loop:[3,8],low:8,m:[0,2,3,7],machin:[7,8],machineri:[7,8],made:7,mai:[2,8],main:[3,7,8],maintain:[2,8],major:[3,8],make:[1,2,7,8],manag:7,mani:[1,7,8],manual:[2,8],manual_se:[1,2,3],map:3,mapl:8,mark:35,markedli:7,mask:[1,2,3,16,29,31],match:[3,11],mathbb:8,mathbf:8,mathcal:8,mathemat:8,matmul:[3,8],matmul_kernel:3,matric:[2,3],matrix:[2,4,5,7,8,14],matrix_s:8,matter:[3,7,8],max:[1,2],max_m:[1,2,3],maxim:[6,8],maximum:[1,2,18],mb:[5,7],mean:[3,8],mechan:[2,8],median:34,memori:[1,2,3,7,8,11,12,16,29,31],mention:3,meta:[1,2,3],metaparamet:1,method:[8,9,33,35],methodolog:8,micro:7,min:3,min_m:[1,2,3],minimum:20,minut:[1,2,3],miss:8,mitig:8,ml:7,mlir:8,mn:2,model:[1,7,8],modern:[3,6,7,8],modular:8,moor:8,more:[2,3,6,7,8,33],most:[3,8],move:3,ms:[1,2,3,34],much:[2,3],mullapudi2016:8,mullapudi:8,multi:[3,7,8],multipl:[1,4,5,7,8,22],multipli:[3,8,14],must:[2,3,10,14,31],n:[2,3,7],n_col:2,n_element:1,n_row:2,naiv:2,naive_softmax:2,name:[1,2,3,33],nativ:[1,2,3],natur:[2,7,17],nb:7,necessari:2,need:[1,2,3],nelement:2,nest:[3,8],net:8,network:[7,8],neural:[7,8],neurosci:7,next:[2,3],next_power_of_2:2,nightli:0,nip:7,nn:3,non:7,none:[2,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34],nonzero:31,normal:[2,3],note:[0,1,2,3,8,9,31],notebook:[1,2,3,4],notic:[2,8],notori:[3,7],novel:7,now:[1,3],num_pid_in_group:3,num_pid_m:3,num_pid_n:3,num_stag:3,num_warp:[2,3],number:[1,2,3,8,23],numer:[2,7],nvidia:7,o:2,object:[1,3,7,9,11],obtain:1,obvious:2,occur:8,offer:7,offici:0,offs_am:3,offs_bn:3,offs_cm:3,offs_cn:3,offs_k:3,offset:1,ofte
n:3,old:12,omega:8,onc:[2,7,8],one:[2,3,4,7,8,33],onli:[2,3,7,8,9],op:[1,2],open:10,openai:0,opencl:7,oper:[1,2,3,4,7,31],opportun:7,opsila:7,optim:[7,8],option:[1,3,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32,33,34],order:[2,3,4,8],origin:8,osdi:7,other:[2,3,6,8,9,14,16,19,21],otherwis:31,our:[1,2,3,7],out:[1,2,3,6,8],outlin:8,output:[1,2,3],output_ptr:[1,2],output_row_start_ptr:2,output_row_strid:2,output_torch:1,output_triton:1,over:[2,7,8],overflow:2,own:3,p:8,pa:3,packag:9,pact:8,pad:2,par:3,paradigm:[7,8],parallel:[1,2,3,6,7,8],paralleliz:7,param:22,paramet:[1,3,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26,27,28,29,30,31,32,33,34,35],parametr:7,part:[3,8],particular:[2,3],particularli:[7,8],partit:7,pass:[1,8],past:[7,8],path:1,pattern:7,pb:3,peak:8,per:2,percentil:34,perf:3,perf_report:[1,2,3,33],perform:[1,2,7,8,11,34],person:8,perspect:8,phase:8,philosophi:8,pid:[1,3],pid_m:3,pid_n:3,pip:0,pipelin:[7,8],platform:[6,8],pldi:7,plot:[0,1,2,3,33],plot_nam:[1,2,3,33],pmatrix:8,point:[1,8],pointer:[1,2,9,11,12,16,29],pointerdtyp:[11,12,16,29],polli:8,polyhedr:7,polyhedra:8,popular:8,portabl:[7,8],pose:7,possibl:[1,2,3,8],power:[2,8,10],ppopp:8,practic:[1,2,3,7],pragma:7,pre:[0,7],prealloc:1,predict:8,prefer:2,premis:7,present:[0,3],preserv:8,prevent:8,primer:8,primit:[7,9],principl:8,print:[1,2,3],print_data:[1,2,3],probabl:8,problem:1,problemat:8,procedur:8,process:[1,7,8],processor:7,produc:3,product:[6,8,14],program:[1,2,3,6,7,23,24],program_id:[1,2,3],programm:[7,8],project:7,promot:[3,8],properli:2,properti:8,propos:7,proprietari:3,provid:[1,2,3,6,8,18,20,30,34],pseudo:3,ptr:3,purpos:[7,8],push:8,py:[0,1,2,3,5],pypi:0,pytest:0,python:[1,2,3,4,9],pytorch:[1,2],qquad:8,r:2,ragan:7,rand:1,randn:[2,3],rang:[1,2,3,7,8],rapidli:[7,8],rate:3,rather:7,raw:1,rdom:8,re:[1,3],read:[2,3,4],reader:8,real:7,reason:8,recent:7,recommend:4,recomput:7,rectifi:7,redmon2016:7,redmon:7,reduct:[2,18,20,30],refer:1,regardless:31,regrett:7,regular:8,rel:[1,8],relat
:6,releas:[0,7],reli:8,relu:3,remain:[7,33],rememb:3,reorder:8,rep:34,repetit:34,repres:[2,3,8],requir:[2,8],research:[7,8],reset:34,resolut:8,resourc:7,resp:8,respect:8,restrict:8,result:[0,1,2,7,8],ret:2,retriev:8,reus:3,revisit:7,right:8,rise:8,role:8,roughli:3,row:[2,3],row_idx:2,row_minus_max:2,row_start_ptr:2,run:[0,1,2,3,6,8,9,35],runtim:[8,34],rvar:8,s:[1,2,8],said:8,same:[7,33],sato2019:8,sato:8,save:[1,2,3],save_path:1,sc:8,scalabl:8,scalar:[7,14,32],scale:33,scan:8,schedul:7,scienc:8,scientif:8,scop:8,scope:8,script:[0,1,2,3],second:[1,2,3,8,14,19,21],section:[3,8],see:[1,2,3,8],seem:[1,8],select:[7,8,31],self:33,semant:8,semi:8,sens:[1,7,8],separ:8,sequenc:7,set:[1,8],setup:0,sever:[7,8],shall:8,shape:[1,2,3,8,13,16,26,29,31,32],share:7,shift:2,should:[1,3,7,8,18,20,30,33],show_plot:[1,2,3],shown:8,side:8,sight:8,signal:7,significantli:2,sigplan:8,simd:7,simpl:[1,2,3],simplest:4,simpli:8,simplic:3,sinc:[1,2,3],singl:[2,7],size:[1,2,8],slower:[7,8],slowest:8,sm:8,smaller:3,smallest:2,snemi3d:7,so:[1,2,3,8],softmax:[4,5],softmax_kernel:2,softmax_output:2,solid:8,solut:3,solv:8,some:3,sometim:8,sourc:[1,2,3,4,8],space:[7,8],spars:[7,8],spatial:8,speak:3,special:7,specif:[3,7],specifi:[8,11,29],speed:2,sphinx:[1,2,3,4],split:8,spmd:[1,7,8],sram:[2,3],stabil:2,stabl:0,standard:8,start:[4,10],started_tutori:5,state:[7,8],statement:8,step:8,still:[1,2,3,8],stop:10,store:[1,2,3,12,31],str:33,straightforward:3,strategi:8,strength:7,stride:[2,3],stride_ak:3,stride_am:3,stride_bk:3,stride_bn:3,stride_cm:3,stride_cn:3,stride_xi:3,stride_xj:3,structur:[7,8],style:[1,2,3,33],subscript:8,substanti:7,substract:2,subtract:2,successfulli:8,suffer:8,suit:7,sum:[1,2],superhuman:7,support:8,sure:2,surprisingli:7,surround:8,suspicion:2,sutskev:7,sutskever2014:7,swap:[11,12],swizzl:7,synchron:[1,7],system:[0,3,7,8],t:[1,2,3,8],t_:8,taco:8,take:[3,6],taken:8,target:7,techniqu:[3,7,8],tempor:8,tend:8,tension:7,tensor:[1,2,3,7,8,9,34],tensorrt:7,test:[0,1,6],text:8,tflop:3,th:34,
than:[2,3,7,8,33],thei:[3,7,8],them:1,themselv:3,theoret:2,therebi:8,therefor:3,theta:8,theta_:8,thi:[1,2,3,7,8,9,33],thing:1,think:2,those:2,though:[7,8],thought:8,thread:[2,7],through:[4,8],throughout:[8,33],throughput:6,tile:8,time:[0,1,2,3,7,8,34],tiramisu:[7,8],tl:[1,2,3],tmp:0,tog:8,topic:8,torch:[1,2,3,9,34],torch_output:3,torch_relu:3,total:[1,2,3,5],tradit:[7,8],transform:8,travers:8,trend:7,tri:[13,26],trick:2,trigger:3,triton:[0,1,2,3,4,7,8],triton_output:3,trivial:7,tune:[2,3,8],tupl:[1,13,26,32],tutori:[1,2,3,6],tutorials_jupyt:4,tutorials_python:4,tvm:[7,8],two:[1,2,3,8,10,14],type:[14,22,31,32],typecast:[16,29],typic:8,u:0,un:8,uncommon:8,underneath:8,understand:2,unfortun:[3,8],unifi:7,unint:31,unit:[0,7],univers:8,unrol:8,up:2,updat:[3,8],us:[1,2,3,7,8,9,31,33,35],util:1,v100:8,val:[11,12],valid:1,valu:[1,2,3,10,11,12,15,16,17,18,20,22,29,30,31,32,33,35],valuabl:2,variabl:3,variant:7,variou:4,vasilach:[7,8],vasilache2018:[7,8],vast:8,vec:8,vector:[4,5,7,8],vendor:3,veri:[2,8],verif:8,verifi:[2,8],via:8,view:25,visibl:8,vision:7,vs:0,w:8,wai:[2,3],want:[2,31],warmup:34,warp:2,wast:2,we:[1,2,3,7,8],well:[7,8],wheel:0,when:[2,3,7,8,9,31],where:[1,3,8,29],whether:[7,33],which:[1,2,3,7,8,12,18,20,30,33],whose:[1,2,3,8,16],wide:8,wise:[1,2,15,17,19,21,27,28,29],wish:[3,8],within:[3,9,10,11,12,13,14,15,16,17,18,20,22,23,24,26,29,30,31,32],without:8,wolf:8,wolfe1989:8,won:2,word:8,work:[2,6,7],workload:3,wors:[3,7,8],would:[1,2],wouldn:8,wrapper:3,write:[1,2,3,4,6,8],wrote:2,x:[1,2,3,8,15,17,19,21,25,27,28,31,33],x_log:[1,33],x_max:2,x_name:[1,2,3,33],x_ptr:1,x_val:[1,2,3,33],xi:8,xii:8,xlabel:33,xo:8,y:[1,2,3,8,19,21,31,33],y_log:33,y_name:[1,2],y_ptr:1,y_torch:2,y_triton:2,year:8,yet:[7,8],yi:8,yield:31,yii:8,ylabel:[1,2,3,33],yo:8,you:[0,1,2,3,4,7,31],your:[0,1,6],yourself:[2,3],z:[1,2,8],zero:3,zip:4},titles:["Installation","Vector Addition","Fused Softmax","Matrix Multiplication","Tutorials","Computation times","Welcome to Triton\u2019s 
documentation!","Introduction","Related Work","triton.jit","triton.language.arange","triton.language.atomic_cas","triton.language.atomic_xchg","triton.language.broadcast_to","triton.language.dot","triton.language.exp","triton.language.load","triton.language.log","triton.language.max","triton.language.maximum","triton.language.min","triton.language.minimum","triton.language.multiple_of","triton.language.num_programs","triton.language.program_id","triton.language.ravel","triton.language.reshape","triton.language.sigmoid","triton.language.softmax","triton.language.store","triton.language.sum","triton.language.where","triton.language.zeros","triton.testing.Benchmark","triton.testing.do_bench","triton.testing.perf_report","triton","triton.language","triton.testing"],titleterms:{"final":3,addit:1,advantag:8,algebra:37,api:6,arang:10,arithmet:3,atomic_ca:11,atomic_xchg:12,benchmark:[1,2,3,33],binari:0,broadcast_to:13,cach:3,challeng:7,comparison:37,compil:[8,37],comput:[1,2,3,5],creation:37,distribut:0,do_bench:34,document:6,dot:14,exp:15,from:0,further:6,fuse:2,get:6,go:6,hint:37,index:37,instal:0,introduct:7,jit:9,kernel:[1,2,3],l2:3,languag:[8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,37],limit:8,linear:37,load:16,log:17,manipul:37,math:37,matrix:3,max:18,maximum:19,memori:37,min:20,minimum:21,model:37,motiv:[2,3,7],multipl:3,multiple_of:22,num_program:23,op:37,optim:3,packag:0,perf_report:35,perform:3,pointer:3,polyhedr:8,program:[8,37],program_id:24,python:[0,6],ravel:25,reduct:37,refer:[7,8],relat:8,represent:8,reshap:26,result:3,s:6,schedul:8,shape:37,sigmoid:27,softmax:[2,28],sourc:0,squar:3,start:6,store:29,sum:30,test:[2,3,33,34,35,38],time:5,triton:[6,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38],tutori:4,unit:[2,3],vector:1,welcom:6,where:31,work:8,zero:32}})
\ No newline at end of file