[GH-PAGES] Updated website

2022-04-08 00:44:05 +00:00
parent 80b92a0d2d
commit 0c570c178d
173 changed files with 401 additions and 386 deletions
--- a/master/_sources/getting-started/tutorials/01-vector-add.rst.txt
+++ b/master/_sources/getting-started/tutorials/01-vector-add.rst.txt
@@ -31,7 +31,7 @@ In this tutorial, you will write a simple vector addition using Triton and learn
 Compute Kernel
 --------------------------

-.. GENERATED FROM PYTHON SOURCE LINES 14-50
+.. GENERATED FROM PYTHON SOURCE LINES 14-53

 .. code-block:: default

@@ -48,9 +48,11 @@ Compute Kernel
        y_ptr,  # *Pointer* to second input vector
        output_ptr,  # *Pointer* to output vector
        n_elements,  # Size of the vector
+        time_start_ptr, time_end_ptr,
        BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process
                     # NOTE: `constexpr` so it can be used as a shape value
    ):
+        tl.atomic_min(time_start_ptr, tl.clock())
        # There are multiple 'program's processing different data. We identify which program
        # we are here
        pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0
@@ -69,6 +71,7 @@ Compute Kernel
        output = x + y
        # Write x + y back to DRAM
        tl.store(output_ptr + offsets, output, mask=mask)
+        tl.atomic_max(time_end_ptr, tl.clock())



@@ -78,18 +81,20 @@ Compute Kernel



-.. GENERATED FROM PYTHON SOURCE LINES 51-53
+.. GENERATED FROM PYTHON SOURCE LINES 54-56

 Let's also declare a helper function to (1) allocate the output tensor
 and (2) enqueue the above kernel with appropriate grid/block sizes.

-.. GENERATED FROM PYTHON SOURCE LINES 53-74
+.. GENERATED FROM PYTHON SOURCE LINES 56-79

 .. code-block:: default



    def add(x: torch.Tensor, y: torch.Tensor):
+        time_start = torch.zeros(1, dtype=torch.int64, device='cuda')
+        time_end = torch.zeros(1, dtype=torch.int64, device='cuda')
        # We need to preallocate the output
        output = torch.empty_like(x)
        assert x.is_cuda and y.is_cuda and output.is_cuda
@@ -102,7 +107,7 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.
        #  - each torch.tensor object is implicitly converted into a pointer to its first element.
        #  - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel
        #  - don't forget to pass meta-parameters as keyword arguments
-        add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
+        add_kernel[grid](x, y, output, n_elements, time_start, time_end, BLOCK_SIZE=1024)
        # We return a handle to `output` but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still
        # running asynchronously at this point.
        return output
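
Review note on the timing buffers added in this hunk: the kernel folds each program's start time into `time_start_ptr` with `tl.atomic_min`, but `time_start` is allocated with `torch.zeros`. Assuming `tl.clock()` returns non-negative counter values (an assumption, not something this diff states), a minimum taken against an initial 0 never moves away from 0. The pure-Python sketch below models only the min/max folding, with made-up clock readings and a hypothetical `simulate` helper. It does not call Triton or touch a GPU.

```python
# Pure-Python model of the atomic_min / atomic_max timing pattern in the diff.
# The clock readings and the `simulate` helper are hypothetical illustrations.
INT64_MAX = 2**63 - 1


def simulate(start_init, clocks):
    """Fold per-program (begin, end) readings the way atomic_min/atomic_max would."""
    time_start, time_end = start_init, 0
    for begin, end in clocks:
        time_start = min(time_start, begin)  # models tl.atomic_min(time_start_ptr, ...)
        time_end = max(time_end, end)        # models tl.atomic_max(time_end_ptr, ...)
    return time_start, time_end


# Made-up clock values for three programs.
clocks = [(1200, 1450), (1180, 1500), (1230, 1475)]

zero_init = simulate(0, clocks)         # start buffer initialised as in the diff
max_init = simulate(INT64_MAX, clocks)  # start buffer initialised to the largest int64

print(zero_init)  # (0, 1500): the start time is stuck at the initial 0
print(max_init)   # (1180, 1500): the earliest start time is recorded
```

If the start time is meant to be read back, one possible adjustment is to allocate it with `torch.full((1,), torch.iinfo(torch.int64).max, dtype=torch.int64, device='cuda')`. The zero initialisation of `time_end` is compatible with `tl.atomic_max` under the same assumption.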
@@ -115,11 +120,11 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.



-.. GENERATED FROM PYTHON SOURCE LINES 75-76
+.. GENERATED FROM PYTHON SOURCE LINES 80-81

 We can now use the above function to compute the element-wise sum of two `torch.tensor` objects and test its correctness:

-.. GENERATED FROM PYTHON SOURCE LINES 76-90
+.. GENERATED FROM PYTHON SOURCE LINES 81-95

 .. code-block:: default

@@ -154,11 +159,11 @@ We can now use the above function to compute the element-wise sum of two `torch.



-.. GENERATED FROM PYTHON SOURCE LINES 91-92
+.. GENERATED FROM PYTHON SOURCE LINES 96-97

 Seems like we're good to go!

-.. GENERATED FROM PYTHON SOURCE LINES 94-99
+.. GENERATED FROM PYTHON SOURCE LINES 99-104

 Benchmark
 -----------
@@ -166,7 +171,7 @@ We can now benchmark our custom op on vectors of increasing sizes to get a sense
 To make things easier, Triton has a set of built-in utilities that allow us to concisely plot the performance of our custom ops
 for different problem sizes.

-.. GENERATED FROM PYTHON SOURCE LINES 99-128
+.. GENERATED FROM PYTHON SOURCE LINES 104-133

 .. code-block:: default

@@ -206,12 +211,12 @@ for different problem sizes.



-.. GENERATED FROM PYTHON SOURCE LINES 129-131
+.. GENERATED FROM PYTHON SOURCE LINES 134-136

 We can now run the decorated function above. Pass `print_data=True` to see the performance numbers, `show_plots=True` to plot them, and/or
 `save_path='/path/to/results/'` to save them to disk along with raw CSV data

-.. GENERATED FROM PYTHON SOURCE LINES 131-132
+.. GENERATED FROM PYTHON SOURCE LINES 136-137

 .. code-block:: default

@@ -232,22 +237,22 @@ We can now run the decorated function above. Pass `print_data=True` to see the p

    vector-add-performance:
               size      Triton       Torch
-    0        4096.0    9.600000    9.600000
-    1        8192.0   19.200000   19.200000
-    2       16384.0   38.400001   38.400001
-    3       32768.0   76.800002   76.800002
-    4       65536.0  127.999995  127.999995
-    5      131072.0  219.428568  219.428568
-    6      262144.0  341.333321  341.333321
-    7      524288.0  472.615390  472.615390
-    8     1048576.0  614.400016  614.400016
-    9     2097152.0  722.823517  722.823517
-    10    4194304.0  780.190482  780.190482
-    11    8388608.0  812.429770  812.429770
-    12   16777216.0  833.084721  833.084721
-    13   33554432.0  842.004273  843.811163
-    14   67108864.0  847.448255  848.362445
-    15  134217728.0  849.737435  850.656574
+    0        4096.0    4.800000    9.600000
+    1        8192.0    8.727273   19.200000
+    2       16384.0   17.454545   38.400001
+    3       32768.0   38.400001   76.800002
+    4       65536.0   69.818181  127.999995
+    5      131072.0  139.636363  219.428568
+    6      262144.0  219.428568  341.333321
+    7      524288.0  341.333321  472.615390
+    8     1048576.0  472.615390  614.400016
+    9     2097152.0  614.400016  702.171410
+    10    4194304.0  712.347810  780.190482
+    11    8388608.0  774.047204  812.429770
+    12   16777216.0  809.086412  833.084721
+    13   33554432.0  829.569620  842.004273
+    14   67108864.0  840.205105  848.362445
+    15  134217728.0  845.625825  850.656574
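
Review note on the table above: the Triton column roughly halves at small sizes after this change, while the Torch column is almost unchanged. The benchmark code is not part of this diff, so the sketch below assumes the figures are GB/s computed as three float32 tensors (two reads, one write) divided by the measured time in milliseconds. Under that assumption, the first row can be converted back into an implied time per call. The helper names `gbps` and `implied_ms` are hypothetical.

```python
# Assumption: throughput = 3 tensors * 4 bytes * size / time, reported in GB/s.
# This formula is not shown in the diff; it is an illustrative reconstruction.
BYTES_PER_ELEMENT = 4  # float32 (assumed)
TENSORS_TOUCHED = 3    # x and y are read, output is written


def gbps(size, ms):
    """Throughput in GB/s for `size` elements processed in `ms` milliseconds."""
    return TENSORS_TOUCHED * BYTES_PER_ELEMENT * size / ms * 1e-6


def implied_ms(size, gb_per_s):
    """Invert `gbps`: milliseconds implied by a throughput figure."""
    return TENSORS_TOUCHED * BYTES_PER_ELEMENT * size / gb_per_s * 1e-6


# First table row (size 4096): 9.6 before the change, 4.8 after.
before = implied_ms(4096, 9.6)
after = implied_ms(4096, 4.8)
print(f"{before * 1e3:.2f} us -> {after * 1e3:.2f} us")
```

Read this way, the smallest problem goes from about 5 microseconds to about 10 microseconds per call. That would be consistent with a fixed per-launch cost from the two extra atomics and `tl.clock()` calls, which matters less as the vectors grow. The diff itself does not state the cause.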



@@ -255,7 +260,7 @@ We can now run the decorated function above. Pass `print_data=True` to see the p

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 1 minutes  42.600 seconds)
+   **Total running time of the script:** ( 1 minutes  42.917 seconds)


 .. _sphx_glr_download_getting-started_tutorials_01-vector-add.py: