diff --git a/docs/tutorials/custom-operation.rst b/docs/tutorials/custom-operation.rst
index 0412a99f4..30619bef9 100644
--- a/docs/tutorials/custom-operation.rst
+++ b/docs/tutorials/custom-operation.rst
@@ -57,7 +57,8 @@ As you will see, a wrapper for the above Triton function can be created in just
     }
     """
     # create callable kernel for the source-code
-    kernel = triton.kernel(src)
+    # options: 4 warps and a -DTILE=1024
+    kernel = triton.kernel(src, defines = {'TILE': 1024}, num_warps = [4])
 
     # Forward pass
     @staticmethod
@@ -72,11 +73,7 @@ As you will see, a wrapper for the above Triton function can be created in just
         N = x.numel()
         grid = lambda opt: (triton.cdiv(N, opt.d('TILE')), )
         # launch kernel
-        # options: 4 warps and a -DTILE=1024
-        _add.kernel(z, x, y, N,
-                    grid = grid,
-                    num_warps = 4,
-                    defines = {'TILE': 1024})
+        _add.kernel(z, x, y, N, grid = grid)
         # return output
         return z
 
diff --git a/docs/tutorials/index.rst b/docs/tutorials/index.rst
index b49e262c1..1cd7548ce 100644
--- a/docs/tutorials/index.rst
+++ b/docs/tutorials/index.rst
@@ -8,4 +8,3 @@ Tutorials
    triton-vs-cuda
    matrix-transposition
    matrix-multiplication
-   putting-it-all-together
diff --git a/docs/tutorials/triton-vs-cuda.rst b/docs/tutorials/triton-vs-cuda.rst
index 4d563a583..c90190313 100644
--- a/docs/tutorials/triton-vs-cuda.rst
+++ b/docs/tutorials/triton-vs-cuda.rst
@@ -97,12 +97,10 @@ Auto-Tuning
 Now assume that you want to tune the above code for different data types, tile sizes and thread block sizes. This is doable in CUDA but would require you to write cumbersome machinery to handle different vector sizes and loop unrolling factors. In Triton, this can be trivially done by adjusting some compilation parameters. For example:
 
 .. code-block:: python
+
+  kernel = triton.kernel(src, defines = {'TILE': [256, 512, 1024]}, num_warps = [2, 4, 8])
 
-  _vector_add.kernel(y, x, N, grid=grid,
-                     defines={'TILE': [256, 512, 1024]},
-                     num_warps = [2, 4, 8])
-
-would benchmark our above triton-code for tile sizes of 256, 512 and 1024 executed with 2, 4 or 8 warps -- and cache the fastest kernel.
+would benchmark our above Triton source-code for tile sizes of 256, 512 and 1024 executed with 2, 4 or 8 warps -- and cache the fastest kernel.
 
 =============================
 Going Further
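
Taken together, these hunks move the compilation options (``defines`` and ``num_warps``) from every launch call into the ``triton.kernel`` constructor. Below is a minimal before/after sketch of the change, reusing the ``src``, ``_add`` and ``grid`` names from the tutorial code touched by this diff; it is illustrative only and assumes the surrounding wrapper class from ``custom-operation.rst`` is in scope.

.. code-block:: python

    # old API: compile-time options repeated at every launch
    kernel = triton.kernel(src)
    _add.kernel(z, x, y, N, grid = grid, num_warps = 4, defines = {'TILE': 1024})

    # new API: options bound once, when the kernel object is created
    kernel = triton.kernel(src, defines = {'TILE': 1024}, num_warps = [4])
    _add.kernel(z, x, y, N, grid = grid)

When list-valued options are given, as in the ``triton-vs-cuda.rst`` hunk, the kernel is benchmarked over every combination of values and the fastest configuration is cached.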