diff --git a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip
index ae1f46ffc..6979c5e54 100644
Binary files a/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip and b/_downloads/662999063954282841dc90b8945f85ce/tutorials_jupyter.zip differ
diff --git a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip
index 44e1c13b5..f0e2a2cdc 100644
Binary files a/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip and b/_downloads/763344228ae6bc253ed1a6cf586aa30d/tutorials_python.zip differ
diff --git a/_images/sphx_glr_01-vector-add_001.png b/_images/sphx_glr_01-vector-add_001.png
index 24ff5da90..00fd8add6 100644
Binary files a/_images/sphx_glr_01-vector-add_001.png and b/_images/sphx_glr_01-vector-add_001.png differ
diff --git a/_images/sphx_glr_01-vector-add_thumb.png b/_images/sphx_glr_01-vector-add_thumb.png
index 6bb9c1750..7a7b34226 100644
Binary files a/_images/sphx_glr_01-vector-add_thumb.png and b/_images/sphx_glr_01-vector-add_thumb.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_001.png b/_images/sphx_glr_02-fused-softmax_001.png
index f1fed55d9..29eef2838 100644
Binary files a/_images/sphx_glr_02-fused-softmax_001.png and b/_images/sphx_glr_02-fused-softmax_001.png differ
diff --git a/_images/sphx_glr_02-fused-softmax_thumb.png b/_images/sphx_glr_02-fused-softmax_thumb.png
index 6a0270a24..29f186443 100644
Binary files a/_images/sphx_glr_02-fused-softmax_thumb.png and b/_images/sphx_glr_02-fused-softmax_thumb.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_001.png b/_images/sphx_glr_03-matrix-multiplication_001.png
index 9eec2b7bc..2e37edb43 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_001.png and b/_images/sphx_glr_03-matrix-multiplication_001.png differ
diff --git a/_images/sphx_glr_03-matrix-multiplication_thumb.png b/_images/sphx_glr_03-matrix-multiplication_thumb.png
index eb7217875..787302154 100644
Binary files a/_images/sphx_glr_03-matrix-multiplication_thumb.png and b/_images/sphx_glr_03-matrix-multiplication_thumb.png differ
diff --git a/_sources/getting-started/tutorials/01-vector-add.rst.txt b/_sources/getting-started/tutorials/01-vector-add.rst.txt
index 251ea2b47..a4774c0c2 100644
--- a/_sources/getting-started/tutorials/01-vector-add.rst.txt
+++ b/_sources/getting-started/tutorials/01-vector-add.rst.txt
@@ -216,13 +216,13 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
 vector-add-performance:
         size      Triton       Torch
-0        4096.0    9.540372    9.600000
+0        4096.0    9.600000    9.600000
 1        8192.0   19.200000   19.200000
 2       16384.0   38.400001   38.400001
 3       32768.0   76.800002   76.800002
 4       65536.0  127.999995  127.999995
 5      131072.0  219.428568  219.428568
-6      262144.0  341.333321  341.333321
+6      262144.0  341.333321  384.000001
 7      524288.0  472.615390  472.615390
 8     1048576.0  614.400016  614.400016
 9     2097152.0  722.823517  722.823517
@@ -239,7 +239,7 @@ We can now run the decorated function above. Pass `print_data=True` to see the p
 .. rst-class:: sphx-glr-timing
-   **Total running time of the script:** ( 0 minutes 11.067 seconds)
+   **Total running time of the script:** ( 0 minutes 11.032 seconds)
 .. _sphx_glr_download_getting-started_tutorials_01-vector-add.py:
diff --git a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
index 5690c5b5a..f43cc9d0f 100644
--- a/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
+++ b/_sources/getting-started/tutorials/02-fused-softmax.rst.txt
@@ -262,15 +262,15 @@ We will then compare its performance against (1) :code:`torch.softmax` and (2) t
 softmax-performance:
          N      Triton  Torch (native)  Torch (jit)
 0    256.0  512.000001      546.133347   273.066674
-1    384.0  585.142862      585.142862   261.446801
+1    384.0  585.142862      585.142862   267.130429
 2    512.0  630.153853      606.814814   264.258068
-3    640.0  682.666684      640.000002   265.974036
+3    640.0  682.666684      640.000002   269.473696
 4    768.0  702.171410      664.216187   273.066663
 ..     ...         ...             ...          ...
-93  12160.0  812.359066      405.755985   329.483481
-94  12288.0  812.429770      415.222812   329.602681
+93  12160.0  812.359066      406.179533   329.483481
+94  12288.0  812.429770      415.661740   329.602681
 95  12416.0  810.840807      412.149375   329.173158
-96  12544.0  810.925276      412.971190   329.022957
+96  12544.0  810.925276      412.546756   329.292871
 97  12672.0  811.007961      412.097543   329.410251
 [98 rows x 4 columns]
@@ -290,7 +290,7 @@ In the above plot, we can see that:
 .. rst-class:: sphx-glr-timing
-   **Total running time of the script:** ( 1 minutes 8.169 seconds)
+   **Total running time of the script:** ( 1 minutes 8.174 seconds)
 .. _sphx_glr_download_getting-started_tutorials_02-fused-softmax.py:
diff --git a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
index 0dc62d2ec..9ebb1774a 100644
--- a/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
+++ b/_sources/getting-started/tutorials/03-matrix-multiplication.rst.txt
@@ -371,37 +371,37 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
 matmul-performance:
          M      cuBLAS  ...      Triton  Triton (+ LeakyReLU)
 0    128.0    0.455111  ...    0.512000              0.512000
-1    256.0    2.730667  ...    2.978909              2.978909
-2    384.0    7.372800  ...    8.507077              8.507077
-3    512.0   14.563555  ...   15.420235             15.420235
-4    640.0   22.260869  ...   23.272727             23.272727
+1    256.0    2.978909  ...    2.978909              2.978909
+2    384.0    7.372800  ...    7.899428              7.899428
+3    512.0   14.563555  ...   16.384000             15.420235
+4    640.0   22.260869  ...   24.380953             24.380953
 5    768.0   32.768000  ...   34.028308             34.028308
 6    896.0   39.025776  ...   39.025776             39.025776
-7   1024.0   49.932191  ...   52.428801             52.428801
+7   1024.0   51.150050  ...   52.428801             52.428801
 8   1152.0   44.566925  ...   46.656000             46.656000
 9   1280.0   51.200001  ...   56.109587             56.109587
 10  1408.0   64.138541  ...   65.684049             65.684049
-11  1536.0   80.430545  ...   76.106321             76.106321
-12  1664.0   63.372618  ...   61.636381             61.636381
-13  1792.0   72.983276  ...   69.379162             68.953520
-14  1920.0   68.098521  ...   69.818184             69.818184
-15  2048.0   73.584279  ...   75.234154             75.573044
-16  2176.0   83.500614  ...   80.817862             80.494588
-17  2304.0   68.251065  ...   73.051599             73.051599
-18  2432.0   71.305746  ...   81.197876             81.197876
-19  2560.0   77.833728  ...   76.740048             76.382283
-20  2688.0   83.004501  ...   83.369354             85.051697
-21  2816.0   81.067298  ...   75.982940             78.868366
-22  2944.0   79.483304  ...   79.865439             79.230573
-23  3072.0   81.589488  ...   83.146995             83.025078
-24  3200.0   84.432717  ...   86.956520             89.385477
-25  3328.0   84.003845  ...   82.843841             85.806075
-26  3456.0   82.350937  ...   84.686523             84.508982
-27  3584.0   84.111686  ...   92.032132             96.475743
-28  3712.0   86.416391  ...   84.874549             88.170647
-29  3840.0   85.005380  ...   87.771425             87.493673
-30  3968.0   92.512459  ...   80.703662             84.797731
-31  4096.0   93.727466  ...   90.995066             91.741443
+11  1536.0   80.430545  ...   76.106321             75.296679
+12  1664.0   63.372618  ...   62.061463             61.636381
+13  1792.0   72.983276  ...   68.953520             68.533074
+14  1920.0   69.120002  ...   68.435645             68.435645
+15  2048.0   73.908442  ...   75.573044             75.234154
+16  2176.0   83.500614  ...   80.173899             79.855747
+17  2304.0   68.446623  ...   73.051599             72.607513
+18  2432.0   71.125224  ...   81.197876             80.963875
+19  2560.0   77.649287  ...   76.027843             76.740048
+20  2688.0   83.552988  ...   83.186525             82.823267
+21  2816.0   84.035084  ...   76.921000             79.733474
+22  2944.0   82.102191  ...   80.122235             78.729910
+23  3072.0   82.540970  ...   82.661468             82.661468
+24  3200.0   84.432717  ...   89.385477             84.432717
+25  3328.0   83.905938  ...   86.113988             86.528001
+26  3456.0   82.015834  ...   83.545665             84.156124
+27  3584.0   87.466332  ...   92.600816             84.988707
+28  3712.0   85.163978  ...   82.902362             83.666116
+29  3840.0   84.292684  ...   84.550462             85.070769
+30  3968.0   89.921841  ...   87.472354             87.409694
+31  4096.0   93.792965  ...   89.478485             90.260743
 [32 rows x 5 columns]
@@ -411,7 +411,7 @@ We can now compare the performance of our kernel against that of cuBLAS. Here we
 .. rst-class:: sphx-glr-timing
-   **Total running time of the script:** ( 1 minutes 58.805 seconds)
+   **Total running time of the script:** ( 2 minutes 2.376 seconds)
 .. _sphx_glr_download_getting-started_tutorials_03-matrix-multiplication.py:
diff --git a/_sources/getting-started/tutorials/sg_execution_times.rst.txt b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
index b36af7bad..8329990b7 100644
--- a/_sources/getting-started/tutorials/sg_execution_times.rst.txt
+++ b/_sources/getting-started/tutorials/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 Computation times
 =================
-**03:18.042** total execution time for **getting-started_tutorials** files:
+**03:21.583** total execution time for **getting-started_tutorials** files:
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 01:58.805 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_03-matrix-multiplication.py` (``03-matrix-multiplication.py``) | 02:02.376 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``)                 | 01:08.169 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_02-fused-softmax.py` (``02-fused-softmax.py``)                 | 01:08.174 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``)                       | 00:11.067 | 0.0 MB |
+| :ref:`sphx_glr_getting-started_tutorials_01-vector-add.py` (``01-vector-add.py``)                       | 00:11.032 | 0.0 MB |
 +---------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/_sources/programming-guide/chapter-2/related-work.rst.txt b/_sources/programming-guide/chapter-2/related-work.rst.txt
index 08486f8ae..bb83d4851 100644
--- a/_sources/programming-guide/chapter-2/related-work.rst.txt
+++ b/_sources/programming-guide/chapter-2/related-work.rst.txt
@@ -2,7 +2,7 @@
 Related Work
 ==============
-At first sight, Triton may seem like just yet another DSL for DNNs. The purpose of this section is to contextualize Triton and highlights its differences with the two leading approaches in this domain: polyhedral compilation and scheduling languages.
+At first sight, Triton may seem like yet another DSL for DNNs. The purpose of this section is to contextualize Triton and highlight its differences with the two leading approaches in this domain: polyhedral compilation and scheduling languages.
 -----------------------
 Polyhedral Compilation
 -----------------------
@@ -121,7 +121,7 @@ Limitations
 ++++++++++++
 Unfortunately, polyhedral compilers suffer from two major limitations that have prevented its adoption as a universal method for code generation in neural networks.
-First, the set of possible program transformations $\Omega = \{ \Theta_S ~|~ S \in \text{program} \}$ is large, and grows with the number of statements in the program as well as with the size of their iteration domain. Verifying the legality of each transformation can also require the resolution of complex integer linear programs, making polyhedral compilation very computationally expensive. To make matters worse, hardware properties (e.g., cache size, number of SMs) and contextual characteristics (e.g., input tensor shapes) also have to be taken into account by this framework, leading to expensive auto-tuning procedures [SATO2019]_.
+First, the set of possible program transformations :math:`\Omega = \{ \Theta_S ~|~ S \in \text{program} \}` is large, and grows with the number of statements in the program as well as with the size of their iteration domain. Verifying the legality of each transformation can also require the resolution of complex integer linear programs, making polyhedral compilation very computationally expensive. To make matters worse, hardware properties (e.g., cache size, number of SMs) and contextual characteristics (e.g., input tensor shapes) also have to be taken into account by this framework, leading to expensive auto-tuning procedures [SATO2019]_.
 Second, the polyhedral framework is not very generally applicable; SCoPs are relatively common [GIRBAL2006]_ but require loop bounds and array subscripts to be affine functions of loop indices, which typically only occurs in regular, dense computations. For this reason, this framework still has to be successfully applied to sparse -- or even structured-sparse -- neural networks, whose importance has been rapidly rising over the past few years.
@@ -131,7 +131,7 @@ On the other hand, blocked program representations advocated by this dissertatio
 Scheduling Languages
 -----------------------
-Separation of concerns \cite{dijkstra82} is a well-known design principle in computer science: programs should be decomposed into modular layers of abstraction that separate the semantics of their algorithms from the details of their implementation. Systems like Halide and TVM push this philosophy one step further, and enforce this separation at the grammatical level through the use of a **scheduling language**. The benefits of this methodology are particularly visible in the case of matrix multiplication, where, as one can see below, the definition of the algorithm (Line 1-7) is completely disjoint from its implementation (Line 8-16), meaning that both can be maintained, optimized and distributed independently.
+Separation of concerns [DIJKSTRA82]_ is a well-known design principle in computer science: programs should be decomposed into modular layers of abstraction that separate the semantics of their algorithms from the details of their implementation. Systems like Halide and TVM push this philosophy one step further, and enforce this separation at the grammatical level through the use of a **scheduling language**. The benefits of this methodology are particularly visible in the case of matrix multiplication, where, as one can see below, the definition of the algorithm (Line 1-7) is completely disjoint from its implementation (Line 8-16), meaning that both can be maintained, optimized and distributed independently.
 .. code-block:: python
    :linenos:
@@ -168,7 +168,7 @@ Scheduling languages are, without a doubt, one of the most popular approaches fo
 Limitations
 ++++++++++++
-This ease-of-development comes at a cost. First of all, existing systems that follow this paradigm tend to be noticeably slower than Triton on modern hardware when applicable (e.g., V100/A100 tensor cores w/ equal tile sizes). I do believe that this is not a fundamental issue of scheduling languages -- in the sense that it could probably be solved with more efforts -- but it could mean that these systems are harder to engineer. More importantly, existing scheduling languages generate loops whose bounds and increments cannot depend on surrounding loop indice without at least imposing severe constraints on possible schedules -- if not breaking the system entirely. This is problematic for sparse com-putations, whose iteration spaces may be irregular.
+This ease-of-development comes at a cost. First of all, existing systems that follow this paradigm tend to be noticeably slower than Triton on modern hardware when applicable (e.g., V100/A100 tensor cores w/ equal tile sizes). I do believe that this is not a fundamental issue of scheduling languages -- in the sense that it could probably be solved with more effort -- but it could mean that these systems are harder to engineer. More importantly, existing scheduling languages generate loops whose bounds and increments cannot depend on surrounding loop indices without at least imposing severe constraints on possible schedules -- if not breaking the system entirely. This is problematic for sparse computations, whose iteration spaces may be irregular.
 .. table::
    :widths: 50 50
@@ -206,4 +206,5 @@ References
 .. [GROSSER2012] T. Grosser et al., "Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation", Parallel Processing Letters 2012
 .. [SATO2019] Y. Sato et al., "An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation", TACO 2019
 .. [GIRBAL2006] S. Girbal et al., "Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies", International Journal of Parallel Programming 2006
-.. [MULLAPUDI2016] R. Mullapudi et al., "Automatically scheduling halide image processing pipelines", TOG 2016
\ No newline at end of file
+.. [DIJKSTRA82] E. W. Dijkstra et al., "On the role of scientific thought", Selected writings on computing: a personal perspective 1982
+.. [MULLAPUDI2016] R. Mullapudi et al., "Automatically scheduling halide image processing pipelines", TOG 2016
diff --git a/getting-started/tutorials/01-vector-add.html b/getting-started/tutorials/01-vector-add.html
index d670a665e..9c00ed806 100644
--- a/getting-started/tutorials/01-vector-add.html
+++ b/getting-started/tutorials/01-vector-add.html
@@ -305,13 +305,13 @@ for different problem sizes.

Out:

vector-add-performance:
            size      Triton       Torch
-0        4096.0    9.540372    9.600000
+0        4096.0    9.600000    9.600000
 1        8192.0   19.200000   19.200000
 2       16384.0   38.400001   38.400001
 3       32768.0   76.800002   76.800002
 4       65536.0  127.999995  127.999995
 5      131072.0  219.428568  219.428568
-6      262144.0  341.333321  341.333321
+6      262144.0  341.333321  384.000001
 7      524288.0  472.615390  472.615390
 8     1048576.0  614.400016  614.400016
 9     2097152.0  722.823517  722.823517
@@ -323,7 +323,7 @@ for different problem sizes.

15 134217728.0 851.577704 850.656574
-Total running time of the script: ( 0 minutes 11.067 seconds)
+Total running time of the script: ( 0 minutes 11.032 seconds)
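A note for readers skimming the regenerated tables above: the Triton and Torch columns of the vector-add benchmark report effective memory bandwidth. Below is a minimal sketch of that conversion, assuming the upstream tutorial's convention of counting two reads and one write per element with float32 data; the helper name and the 5.12 µs timing are illustrative, not values taken from this diff.

.. code-block:: python

    def vector_add_gbps(n_elements: int, ms: float, bytes_per_elem: int = 4) -> float:
        """Effective bandwidth of z = x + y: two reads plus one write per element."""
        total_bytes = 3 * n_elements * bytes_per_elem  # x read, y read, z written
        return total_bytes / (ms * 1e-3) * 1e-9       # bytes / seconds -> GB/s

    # First row of the table: 4096 float32 elements at 9.6 GB/s implies a
    # kernel time of roughly 5.12 microseconds (an assumed figure).
    print(vector_add_gbps(4096, ms=5.12e-3))  # -> ~9.6

Under this convention the numbers saturate for large inputs, which is why both columns converge near 850 GB/s in the final row (size 134217728.0) shown above.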