[GH-PAGES] Updated website

2022-04-16 00:44:44 +00:00
parent 824d060dfb
commit 9b0ee317d9
160 changed files with 365 additions and 380 deletions
--- a/master/_sources/getting-started/tutorials/01-vector-add.rst.txt
+++ b/master/_sources/getting-started/tutorials/01-vector-add.rst.txt
@@ -31,7 +31,7 @@ In this tutorial, you will write a simple vector addition using Triton and learn
 Compute Kernel
 --------------------------

-.. GENERATED FROM PYTHON SOURCE LINES 14-53
+.. GENERATED FROM PYTHON SOURCE LINES 14-50

 .. code-block:: default

@@ -48,11 +48,9 @@ Compute Kernel
        y_ptr,  # *Pointer* to second input vector
        output_ptr,  # *Pointer* to output vector
        n_elements,  # Size of the vector
-        time_start_ptr, time_end_ptr,
        BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process
                     # NOTE: `constexpr` so it can be used as a shape value
    ):
-        tl.atomic_min(time_start_ptr, tl.clock())
        # There are multiple 'program's processing different data. We identify which program
        # we are here
        pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0
@@ -71,7 +69,6 @@ Compute Kernel
        output = x + y
        # Write x + y back to DRAM
        tl.store(output_ptr + offsets, output, mask=mask)
-        tl.atomic_max(time_end_ptr, tl.clock())



@@ -81,20 +78,18 @@ Compute Kernel



-.. GENERATED FROM PYTHON SOURCE LINES 54-56
+.. GENERATED FROM PYTHON SOURCE LINES 51-53

 Let's also declare a helper function to (1) allocate the `output` tensor
 and (2) enqueue the above kernel with appropriate grid/block sizes.

-.. GENERATED FROM PYTHON SOURCE LINES 56-79
+.. GENERATED FROM PYTHON SOURCE LINES 53-74

 .. code-block:: default



    def add(x: torch.Tensor, y: torch.Tensor):
-        time_start = torch.zeros(1, dtype=torch.int64, device='cuda')
-        time_end = torch.zeros(1, dtype=torch.int64, device='cuda')
        # We need to preallocate the output
        output = torch.empty_like(x)
        assert x.is_cuda and y.is_cuda and output.is_cuda
@@ -107,7 +102,7 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.
        #  - each torch.tensor object is implicitly converted into a pointer to its first element.
        #  - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel
        #  - don't forget to pass meta-parameters as keyword arguments
-        add_kernel[grid](x, y, output, n_elements, time_start, time_end, BLOCK_SIZE=1024)
+        add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
        # We return a handle to `output` but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still
        # running asynchronously at this point.
        return output
@@ -120,11 +115,11 @@ and (2) enqueue the above kernel with appropriate grid/block sizes.



-.. GENERATED FROM PYTHON SOURCE LINES 80-81
+.. GENERATED FROM PYTHON SOURCE LINES 75-76

 We can now use the above function to compute the element-wise sum of two `torch.tensor` objects and test its correctness:

-.. GENERATED FROM PYTHON SOURCE LINES 81-95
+.. GENERATED FROM PYTHON SOURCE LINES 76-90

 .. code-block:: default

@@ -159,11 +154,11 @@ We can now use the above function to compute the element-wise sum of two `torch.



-.. GENERATED FROM PYTHON SOURCE LINES 96-97
+.. GENERATED FROM PYTHON SOURCE LINES 91-92

 Seems like we're good to go!

-.. GENERATED FROM PYTHON SOURCE LINES 99-104
+.. GENERATED FROM PYTHON SOURCE LINES 94-99

 Benchmark
 -----------
@@ -171,7 +166,7 @@ We can now benchmark our custom op on vectors of increasing sizes to get a sense
 To make things easier, Triton has a set of built-in utilities that allow us to concisely plot the performance of our custom ops
 for different problem sizes.

-.. GENERATED FROM PYTHON SOURCE LINES 104-133
+.. GENERATED FROM PYTHON SOURCE LINES 99-128

 .. code-block:: default

@@ -211,12 +206,12 @@ for different problem sizes.



-.. GENERATED FROM PYTHON SOURCE LINES 134-136
+.. GENERATED FROM PYTHON SOURCE LINES 129-131

 We can now run the decorated function above. Pass `print_data=True` to see the performance numbers, `show_plots=True` to plot them, and/or
 `save_path='/path/to/results/'` to save them to disk along with raw CSV data.

-.. GENERATED FROM PYTHON SOURCE LINES 136-137
+.. GENERATED FROM PYTHON SOURCE LINES 131-132

 .. code-block:: default

@@ -237,22 +232,22 @@ We can now run the decorated function above. Pass `print_data=True` to see the p

    vector-add-performance:
               size      Triton       Torch
-    0        4096.0    4.800000    9.600000
-    1        8192.0    9.600000   19.200000
-    2       16384.0   19.200000   38.400001
-    3       32768.0   34.909091   63.999998
-    4       65536.0   69.818181  127.999995
-    5      131072.0  139.636363  219.428568
-    6      262144.0  219.428568  384.000001
-    7      524288.0  361.411758  472.615390
-    8     1048576.0  491.520012  614.400016
-    9     2097152.0  599.414644  702.171410
-    10    4194304.0  702.171410  780.190482
-    11    8388608.0  774.047204  812.429770
-    12   16777216.0  809.086412  833.084721
-    13   33554432.0  829.569620  842.004273
-    14   67108864.0  840.205105  848.362445
-    15  134217728.0  846.080710  850.656574
+    0        4096.0    9.600000    9.600000
+    1        8192.0   19.200000   19.200000
+    2       16384.0   38.400001   38.400001
+    3       32768.0   63.999998   63.999998
+    4       65536.0  127.999995  127.999995
+    5      131072.0  219.428568  219.428568
+    6      262144.0  341.333321  384.000001
+    7      524288.0  472.615390  472.615390
+    8     1048576.0  614.400016  614.400016
+    9     2097152.0  722.823517  702.171410
+    10    4194304.0  780.190482  780.190482
+    11    8388608.0  812.429770  812.429770
+    12   16777216.0  833.084721  833.084721
+    13   33554432.0  842.004273  842.004273
+    14   67108864.0  847.448255  848.362445
+    15  134217728.0  849.737435  850.656574



@@ -260,7 +255,7 @@ We can now run the decorated function above. Pass `print_data=True` to see the p

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 1 minutes  42.289 seconds)
+   **Total running time of the script:** ( 1 minutes  49.775 seconds)


 .. _sphx_glr_download_getting-started_tutorials_01-vector-add.py: