Deprecation of Triton-C and Replacement by decorated Python functions (#86)

This PR implements a major overhaul of the Triton frontend and replaces Triton-C with a pure Python API in which kernels are defined as functions decorated with @triton.jit. The documentation and tutorials have also been updated to reflect these changes. See the documentation for more information on the new API.
2021-04-20 22:29:40 -04:00
parent 1fdb465b71
commit 39f4730305
91 changed files with 4500 additions and 13008 deletions
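
To give a rough feel for the programming model behind the new Python API described in the commit message above, the sketch below emulates it in plain NumPy. It does not use Triton itself: ``add_program``, ``launch`` and ``BLOCK`` are hypothetical names chosen for this illustration, and the masking only mimics the roles that ``program_id``, ``arange``, ``load`` and ``store`` play in the API index added by this commit. A real kernel would be a ``@triton.jit`` decorated function as shown in the updated tutorials.

```python
import numpy as np

BLOCK = 4  # number of elements handled by one program instance (illustrative)


def add_program(pid, x, y, z, n):
    """Emulate one single-threaded program instance working on one block."""
    # Offsets of the elements owned by this instance (role of program_id + arange).
    offsets = pid * BLOCK + np.arange(BLOCK)
    # Mask guarding against out-of-bounds accesses in the last block.
    mask = offsets < n
    # Masked "load": masked-off lanes read index 0 and are replaced by 0.
    safe = np.where(mask, offsets, 0)
    xs = np.where(mask, x[safe], 0)
    ys = np.where(mask, y[safe], 0)
    # Masked "store": only in-bounds lanes are written back.
    z[offsets[mask]] = (xs + ys)[mask]


def launch(x, y):
    """Run one program instance per block, sequentially (hypothetical helper)."""
    n = x.shape[0]
    z = np.empty_like(x)
    num_programs = (n + BLOCK - 1) // BLOCK  # ceiling division
    for pid in range(num_programs):
        add_program(pid, x, y, z, n)
    return z


x = np.arange(10, dtype=np.float32)
y = np.ones(10, dtype=np.float32)
z = launch(x, y)
```

Here the ten-element input is split across three program instances, and the mask keeps the last instance from touching the two offsets that fall past the end of the arrays.
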
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -35,6 +35,13 @@ extensions = []
 # Math Jax
 extensions += ['sphinx.ext.mathjax']

+# Auto Doc
+import sys
+import os
+sys.path.insert(0, os.path.abspath('../python/'))
+extensions += ['sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.coverage', 'sphinx.ext.napoleon']
+autosummary_generate = True
+
 # Sphinx gallery
 extensions += ['sphinx_gallery.gen_gallery']
 from sphinx_gallery.sorting import FileNameSortKey
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -17,15 +17,27 @@ Getting Started
   getting-started/installation
   getting-started/tutorials/index

-Programming Guide
+Language Reference
+-------------------
+
+- Check out the :doc:`Python API Documentation <language-reference/python-api/index>`
+
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Language Reference
+   :hidden:
+
+   language-reference/python-api/index
+
+
+Going Further
 ------------------

 Check out the following documents to learn more about Triton and how it compares against other DSLs for DNNs:

 - Chapter 1: :doc:`Introduction <programming-guide/chapter-1/introduction>`
 - Chapter 2: :doc:`Related Work <programming-guide/chapter-2/related-work>`
- Chapter 3: :doc:`The Triton-C Language <programming-guide/chapter-3/triton-c>`
- Chapter 4: :doc:`The Triton-IR Intermediate Representation <programming-guide/chapter-4/triton-ir>`

 .. toctree::
   :maxdepth: 1
@@ -33,6 +45,4 @@ Check out the following documents to learn more about Triton and how it compares
   :hidden:

   programming-guide/chapter-1/introduction
-   programming-guide/chapter-2/related-work
-   programming-guide/chapter-3/triton-c
-   programming-guide/chapter-4/triton-ir
+   programming-guide/chapter-2/related-work
--- a/docs/language-reference/python-api/index.rst
+++ b/docs/language-reference/python-api/index.rst
@@ -0,0 +1,117 @@
+Python API
+===========
+
+.. currentmodule:: triton
+
+
+Programming Model
+-------------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    program_id
+    num_programs
+
+
+Creation Ops
+-------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    arange
+    zeros
+
+
+Shape Manipulation Ops
+-----------------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    broadcast_to
+    reshape
+    ravel
+
+
+
+Linear Algebra Ops
+-------------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    dot
+
+Memory Ops
+--------------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    load
+    store
+    atomic_cas
+    atomic_xchg
+
+
+Indexing Ops
+--------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    where
+
+
+Math Ops
+----------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    exp
+    log
+    sigmoid
+    softmax
+
+
+Reduction Ops
+---------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    max
+    min
+    sum
+
+
+Comparison Ops
+---------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    minimum
+    maximum
+
+
+Compiler Hint Ops
+-------------------
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+
+    multiple_of
--- a/docs/programming-guide/chapter-3/triton-c.rst
+++ b/docs/programming-guide/chapter-3/triton-c.rst
@@ -1,84 +0,0 @@
-=======================
-The Triton-C Language
-=======================
-
-In the introduction, we stressed the importance of blocked algorithms and described their core principles in pseudo-code. To facilitate their implementation on modern GPU hardware, we present Triton-C, a single-threaded imperative kernel language in which block variables are first-class citizens. This language may be used either directly by developers familiar with C, or as an intermediate language for existing (and future) transcompilers. In this chapter, we describe its differences from C, its Numpy-like semantics and its "Single-Program, Multiple-Data" (SPMD) programming model.
-
-------------------
-Differences with C
-------------------
-
-The syntax of Triton-C is based on that of ANSI C, but was modified and extended to accommodate the semantics and programming model described in the next two subsections. These changes fall into the following categories:
-
-+++++++++++
-Extensions
-+++++++++++
-
-**Variable declarations**: Triton adds special-purpose syntax for multi-dimensional array declarations (e.g., :code:`int block[16, 16]`), which purposely differs from that of nested arrays (i.e., arrays of pointers) found in ANSI C (e.g., :code:`int block[16][16]`). Block dimensions must be constant but can also be made parametric with the use of pre-processor macros. One-dimensional blocks of integers may be initialized using ellipses (e.g., :code:`int range[16] = 0 ... 16`).
-
-**Primitive types**: Triton-C supports the following primitive data-types: :code:`bool`, :code:`uint8`, :code:`uint16`, :code:`uint32`, :code:`uint64`, :code:`int8`, :code:`int16`, :code:`int32`, :code:`int64`, :code:`half`, :code:`float`, :code:`double`.
-
-**Operators and built-in functions**: The usual C operators were extended to support element-wise array operations (:code:`+`, :code:`-`, :code:`&&`, :code:`*`, etc.) and complex array operations (:code:`@` for matrix multiplication). Additionally, some built-in functions were added for concurrency (:code:`get_program_id`, :code:`atomic_add`).
-
-**Slicing and broadcasting**: Multi-dimensional blocks can be broadcast along any particular dimension using numpy-like slicing syntax (e.g., :code:`int array[8, 8] = range[:, newaxis]` for stacking columns). Note that, as of now, slicing blocks to retrieve sub-blocks (or scalars) is forbidden as it is incompatible with the automatic parallelization methods used by our JIT. Reductions can be achieved using a syntax similar to slicing (e.g., :code:`array[+]` for summing an array, or :code:`array[:, max]` for row-wise maximum). Currently supported reduction operators are :code:`+`, :code:`min`, :code:`max`.
-
-**Masked pointer dereferencement**: Block-level operations in Triton-C are "atomic", in the sense that they execute either completely or not at all. Basic element-wise control-flow for block-level operations can nonetheless be achieved using ternary operators and the *masked pointer dereferencement* operator exemplified below:
-
-.. code-block:: C
-  :force:
-
-  // create mask
-  bool mask[16, 16] = ...;
-  // conditional addition
-  float x[16, 16] = mask ? a + b : 0;
-  // conditional load
-  float y[16, 16] = mask ? *ptr : 0;
-  // conditional store
-  *?(mask)ptr = y;
-
-
-
-+++++++++++++
-Restrictions
-+++++++++++++
-
-The Triton project is still in its infancy. As such, there are quite a few features of ANSI C that are not supported:
-
-**Non-kernel functions**: Right now, all function definitions must be kernels, i.e., be preceded by the :code:`__global__` attribute. We are aware that this is a severe limitation; it exists because our automatic parallelization engine would not be capable of handling array parameter arguments.
-
-**Non-primitive types**: Non-primitive types defined with :code:`struct` and :code:`union` are currently not supported, again because it is unclear at this point how these constructs would hook into our block-level data-flow analysis passes.
-
-**While loops**: We just haven't had time to implement those yet.
-
----------------
-Semantics
----------------
-
-The existence of built-in **blocked** types, variables and operations in Triton-C offers two main benefits. First, it simplifies the structure of blocked programs by hiding important details pertaining to concurrent programming such as memory coalescing, cache management and specialized tensor intrinsics. Second, it opens the door for compilers to perform these optimizations automatically. However, it also means that programs have some kind of *block-level semantics* that does not exist in C. Though some aspects of it (e.g., the :code:`@` operator) are pretty intuitive, one in particular might be puzzling to some GPU programmers: broadcasting semantics.
-
-+++++++++++++++++++++++
-Broadcasting Semantics
-+++++++++++++++++++++++
-
-
-Block variables in Triton are strongly typed, meaning that certain instructions statically require their operands to satisfy strict shape constraints. For example, a scalar may not be added to an array unless it is first appropriately broadcast. *Broadcasting semantics* (first introduced in `Numpy <https://numpy.org/doc/stable/user/basics.broadcasting.html>`_) provides two formal rules for performing these conversions automatically in the case of binary operators: (1) the shape of the lowest-dimension operand is left-padded with ones until both operands have the same dimensionality; and (2) the content of both operands is replicated as many times as needed until their shape is identical. An error is emitted if this cannot be done.
-
-.. code-block:: C
-
-  int a[16], b[32, 16], c[16, 1];
-  // a is first reshaped to [1, 16]
-  // and then broadcast to [32, 16]
-  int x_1[32, 16] = a[newaxis, :] + b;
-  // Same as above but implicitly
-  int x_2[32, 16] = a + b;
-  // a is first reshaped to [1, 16]
-  // a is broadcast to [16, 16]
-  // c is broadcast to [16, 16]
-  int y[16, 16] = a + c;
-
------------------
-Programming Model
------------------
-
-As discussed in the `CUDA documentation <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_, the execution of CUDA code on GPUs is supported by an `SPMD <https://en.wikipedia.org/wiki/SPMD>`_ programming model in which each kernel instance is associated with an identifiable *thread-block*, itself decomposed into *warps* of 32 *threads*. The Triton programming model is similar, but each kernel is *single-threaded* -- though automatically parallelized -- and associated with a global :code:`program id` which varies from instance to instance. This approach leads to simpler kernels in which CUDA-like concurrency primitives (shared memory synchronization, inter-thread communication, etc.) do not exist. The global program ids associated with each kernel instance can be queried using the :code:`get_program_id(axis)` built-in function where :code:`0 <= axis <= 2`. This is useful, for example, to create blocks of pointers as shown in the tutorials.
-
--- a/docs/programming-guide/chapter-4/broadcast-1.png
+++ b/docs/programming-guide/chapter-4/broadcast-1.png
--- a/docs/programming-guide/chapter-4/broadcast-2.png
+++ b/docs/programming-guide/chapter-4/broadcast-2.png
--- a/docs/programming-guide/chapter-4/triton-ir.rst
+++ b/docs/programming-guide/chapter-4/triton-ir.rst
@@ -1,82 +0,0 @@
-==========================================
-The Triton-IR Intermediate Representation
-==========================================
-
-Triton-IR is an LLVM-based Intermediate Representation (IR) whose purpose is to provide an environment suitable for block-level program analysis, transformation and optimization.
-In our implementation, Triton-IR programs are constructed directly from Triton-C after parsing, but they could also be formed directly by higher-level DSLs in the future.
-Triton-IR and LLVM-IR programs share the same high-level structure, but the former also includes a number of extensions necessary for block-level data-flow analysis.
-These extensions are crucial for carrying out the optimizations outlined in the next chapter of this document.
-
---------------------------------
-Structure of a Triton-IR Program
---------------------------------
-
-++++++++
-Modules
-++++++++
-
-At the highest level, Triton-IR programs consist of one or multiple basic units of compilation known as *modules*. These modules are compiled independently from one another, and eventually aggregated by a linker whose role is to resolve forward declarations and adequately merge global definitions. Each module itself is composed of functions, global variables, constants and other miscellaneous symbols such as metadata and attributes.
-
-++++++++++
-Functions
-++++++++++
-
-Triton-IR function definitions consist of a return type, a name and a potentially empty arguments list. Additional visibility, alignment and linkage specifiers can be added if desired. Function attributes (such as inlining hints) and parameter attributes (such as "readonly", aliasing hints) can also be specified, allowing compiler backends to perform more aggressive optimizations by, for instance, making better use of non-coherent caches found on NVIDIA GPUs. This header is followed by a body composed of a list of basic blocks whose interdependencies form the Control Flow Graph (CFG) of the function.
-
-+++++++++++++
-Basic Blocks
-+++++++++++++
-
-Basic blocks are straight-line code sequences that may only contain so-called *terminator* instructions (i.e., branching, return) at their end. To simplify program analysis, Triton-IR uses the Static Single Assignment (SSA) form, meaning that each variable in each basic block must be (1) assigned to only once and (2) defined before being used. In so doing, each basic block implicitly defines a Data-Flow Graph (DFG). In our case, the SSA form is created directly from Triton-C's Abstract Syntax Trees (ASTs) using an algorithm from the literature [BRAUN13]_.
-
---------------------------------
-Block-Level Dataflow Analysis
---------------------------------
-
-+++++++
-Types
-+++++++
-
-Multi-dimensional blocks are at the center of data-flow analysis in Triton-JIT. They can be declared using syntax similar to vector declarations in LLVM-IR. For example, :code:`i32<8, 8>` is the type corresponding to :math:`8 \times 8` blocks of 32-bit integers. Note that there is no preprocessor in Triton-IR, hence parametric shape  values must be resolved before programs are generated. In our case, this is done by Triton-JIT's auto-tuner.
-
-+++++++++++++
-Instructions
-+++++++++++++
-
-Triton-IR introduces a set of *reblocking* instructions whose purpose is to support broadcasting semantics as described in the previous chapter.  The :code:`reshape` instruction creates a block of the specified shape using the raw data from its input argument. This is particularly useful to re-interpret variables as higher-dimensional arrays by padding their input shapes with ones in preparation for broadcasting. The :code:`broadcast` instruction creates a block of the specified shapes by replicating its input argument as many times as necessary along dimensions of size 1 -- as shown below for the :code:`broadcast<3,3>` instruction.
-
-|pic1| and |pic2|
-
-.. |pic1| image:: broadcast-1.png
-   :width: 40%
-
-.. |pic2| image:: broadcast-2.png
-   :width: 40%
-
-Usual scalar instructions (:code:`cmp`, :code:`getelementptr`, :code:`add`, :code:`load`...) were preserved and extended to signify element-wise operations when applicable. Finally, Triton-IR also exposes specialized arithmetic instructions for reductions (:code:`reduce`) and matrix multiplications (:code:`dot`).
-
----------------------------------
-Block-Level Control Flow Analysis
----------------------------------
-
-In Triton-IR, operations on block variables are atomic: they execute either in full or not at all. As a result, traditional control flow structures (e.g., conditional, loops) are not applicable to individual block elements. This is problematic, since a program may need to e.g., partially guard blocked loads against memory access violations.
-
-This could be potentially solved through the use of the Predicated SSA (PSSA) [CARTER99]_ [STOUTCHININ01]_ form for Triton-IR. However, this would create a lot of unnecessary complexity for GPUs, where the benefits of PSSA are close to none as divergent program paths  within warps are  serialized anyway. Therefore, recent versions of Triton handle intra-block control flow in a much simpler way, using conditional instructions such as  :code:`select`, :code:`masked_load` and :code:`masked_store`:
-
-.. code-block:: C
-
-  // For all indices [idx], return cond[idx] ? true_value[idx] : false_value[idx];
-  select       TYPE<TS1, ..., TSN> cond, true_value, false_value;
-  // For all indices [idx], return cond[idx] ? *true_addr[idx] : false_value[idx];
-  masked_load  TYPE<TS1, ..., TSN> cond, true_addr, false_value;
-  // For all indices [idx], execute *true_addr[idx] = true_value[idx] if cond[idx]
-  masked_store TYPE<TS1, ..., TSN> cond, true_addr, true_value;
-
-
------------
-References
------------
-
-.. [BRAUN13] M. Braun et al., "Simple and Efficient Construction of Static Single Assignment Form", CC 2013
-.. [CARTER99] L. Carter et al., "Predicated Static Single Assignment", PACT 1999
-.. [STOUTCHININ01] A. Stoutchinin et al., "Efficient Static Single Assignment Form for Predication", MICRO 2001