<h1>The Triton-C Language<a class="headerlink" href="#the-triton-c-language" title="Permalink to this headline">¶</a></h1>
<p>In the introduction, we stressed the importance of blocked algorithms and described their core principles in pseudo-code. To facilitate their implementation on modern GPU hardware, we present Triton-C, a single-threaded imperative kernel language in which block variables are first-class citizens. This language may be used either directly by developers familiar with C, or as an intermediate language for existing (and future) transcompilers. In this chapter, we describe its differences with C, its Numpy-like semantics and its “Single-Program, Multiple-Data” (SPMD) programming model.</p>
<divclass="section"id="differences-with-c">
<h2>Differences with C<aclass="headerlink"href="#differences-with-c"title="Permalink to this headline">¶</a></h2>
<p>The syntax of Triton-C is based on that of ANSI C, but was modified and extended to accommodate the semantics and programming model described in the next two subsections. These changes fall into the following categories:</p>
<divclass="section"id="extensions">
<h3>Extensions<aclass="headerlink"href="#extensions"title="Permalink to this headline">¶</a></h3>
<p><strong>Variable declarations</strong>: Triton adds special-purpose syntax for multi-dimensional array declarations (e.g., <code class="code docutils literal notranslate"><span class="pre">int block[16, 16]</span></code>), which purposely differs from that of nested arrays (i.e., arrays of pointers) found in ANSI C (e.g., <code class="code docutils literal notranslate"><span class="pre">int block[16][16]</span></code>). Block dimensions must be constant but can also be made parametric with the use of pre-processor macros. One-dimensional blocks of integers may be initialized using ellipses (e.g., <code class="code docutils literal notranslate"><span class="pre">int range[16] = 0 ... 16</span></code>).</p>
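<p>The following sketch illustrates these declarations (the macro name <code class="code docutils literal notranslate"><span class="pre">TILE</span></code> is arbitrary):</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// two-dimensional 16x16 block of integers
int block[16, 16];
// block dimensions made parametric with a pre-processor macro
#define TILE 16
float tile[TILE, TILE];
// one-dimensional block initialized with an ellipsis: 0, 1, ..., 15
int range[16] = 0 ... 16;
</pre></div></div>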
<p><strong>Operators and built-in functions</strong>: The usual C operators were extended to support element-wise array operations (<code class="code docutils literal notranslate"><span class="pre">+</span></code>, <code class="code docutils literal notranslate"><span class="pre">-</span></code>, <code class="code docutils literal notranslate"><span class="pre">&amp;&amp;</span></code>, <code class="code docutils literal notranslate"><span class="pre">*</span></code>, etc.) and complex array operations (<code class="code docutils literal notranslate"><span class="pre">@</span></code> for matrix multiplication). Additionally, some built-in functions were added for concurrency (<code class="code docutils literal notranslate"><span class="pre">get_program_id</span></code>, <code class="code docutils literal notranslate"><span class="pre">atomic_add</span></code>).</p>
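<p>As a brief sketch, assuming two <code class="code docutils literal notranslate"><span class="pre">float</span></code> blocks <code class="code docutils literal notranslate"><span class="pre">a</span></code> and <code class="code docutils literal notranslate"><span class="pre">b</span></code> of shape [16, 16] are already in scope:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// element-wise operations
float sum[16, 16]  = a + b;
float prod[16, 16] = a * b;   // element-wise product, not matrix product
// matrix multiplication uses the dedicated @ operator
float mm[16, 16]   = a @ b;
// built-in function: id of the current kernel instance along axis 0
int pid = get_program_id(0);
</pre></div></div>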
<p><strong>Slicing and broadcasting</strong>: Multi-dimensional blocks can be broadcast along any particular dimension using Numpy-like slicing syntax (e.g., <code class="code docutils literal notranslate"><span class="pre">int array[8, 8] = range[:, newaxis]</span></code> for stacking columns). Note that, as of now, slicing blocks to retrieve sub-blocks (or scalars) is forbidden as it is incompatible with the automatic parallelization methods used by our JIT. Reductions can be achieved using a syntax similar to slicing (e.g., <code class="code docutils literal notranslate"><span class="pre">array[+]</span></code> for summing an array, or <code class="code docutils literal notranslate"><span class="pre">array[:, max]</span></code> for row-wise maximum). Currently supported reduction operators are <code class="code docutils literal notranslate"><span class="pre">+</span></code>, <code class="code docutils literal notranslate"><span class="pre">min</span></code> and <code class="code docutils literal notranslate"><span class="pre">max</span></code>.</p>
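<p>The snippet below sketches these constructs; the shapes and variable names are arbitrary:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
int range[8] = 0 ... 8;
// broadcasting: stack 8 copies of range as columns
int cols[8, 8] = range[:, newaxis];
// broadcasting: stack 8 copies of range as rows
int rows[8, 8] = range[newaxis, :];
// full reduction: sum of all elements
int total = range[+];
// partial reduction: row-wise maximum
int row_max[8] = cols[:, max];
</pre></div></div>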
<p><strong>Masked pointer dereferencing</strong>: Block-level operations in Triton-C are “atomic”, in the sense that they execute either completely or not at all. Basic element-wise control-flow for block-level operations can nonetheless be achieved using ternary operators and the <em>masked pointer dereferencing</em> operator exemplified below:</p>
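<p>A minimal sketch, assuming <code class="code docutils literal notranslate"><span class="pre">float</span></code> blocks <code class="code docutils literal notranslate"><span class="pre">a</span></code>, <code class="code docutils literal notranslate"><span class="pre">b</span></code> and a block of pointers <code class="code docutils literal notranslate"><span class="pre">ptr</span></code> of matching shape are in scope:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// create a block of predicates
bool mask[16, 16] = a &gt; 0;
// element-wise selection with the ternary operator
float x[16, 16] = mask ? a + b : a - b;
// masked load: elements where mask is false are not read
float y[16, 16] = *?(mask)ptr;
// masked store: elements where mask is false are not written
*?(mask)ptr = x + y;
</pre></div></div>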
</div>
<div class="section" id="restrictions">
<h3>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h3>
<p>The Triton project is still in its infancy. As such, there are quite a few features of ANSI C that are not supported:</p>
<p><strong>Non-kernel functions</strong>: Right now, all function definitions must be kernels, i.e., be preceded with the <code class="code docutils literal notranslate"><span class="pre">__global__</span></code> attribute. We are aware that this is a severe limitation; it exists because our automatic parallelization engine would not be capable of handling array parameters passed between functions.</p>
<p><strong>Non-primitive types</strong>: Non-primitive types defined with <code class="code docutils literal notranslate"><span class="pre">struct</span></code> and <code class="code docutils literal notranslate"><span class="pre">union</span></code> are currently not supported, again because it is unclear at this point how these constructs would hook into our block-level data-flow analysis passes.</p>
<p><strong>While loops</strong>: We just haven’t had time to implement those yet.</p>
</div>
</div>
<divclass="section"id="semantics">
<h2>Semantics<aclass="headerlink"href="#semantics"title="Permalink to this headline">¶</a></h2>
<p>The existence of built-in <strong>blocked</strong> types, variables and operations in Triton-C offers two main benefits. First, it simplifies the structure of blocked programs by hiding important details pertaining to concurrent programming such as memory coalescing, cache management and specialized tensor intrinsics. Second, it opens the door for compilers to perform these optimizations automatically. However, it also means that programs have some kind of <em>block-level semantics</em> that does not exist in C. Though some aspects of it (e.g., the <code class="code docutils literal notranslate"><span class="pre">@</span></code> operator) are fairly intuitive, one in particular might be puzzling to some GPU programmers: broadcasting semantics.</p>
<divclass="section"id="broadcasting-semantics">
<h3>Broadcasting Semantics<aclass="headerlink"href="#broadcasting-semantics"title="Permalink to this headline">¶</a></h3>
<p>Block variables in Triton are strongly typed, meaning that certain instructions statically require their operands to satisfy strict shape constraints. For example, a scalar may not be added to an array unless it is first appropriately broadcast. <em>Broadcasting semantics</em> (first introduced in <a class="reference external" href="https://numpy.org/doc/stable/user/basics.broadcasting.html">Numpy</a>) provides two formal rules for performing these conversions automatically in the case of binary operators: (1) the shape of the lowest-dimensional operand is left-padded with ones until both operands have the same dimensionality; and (2) the content of both operands is replicated as many times as needed until their shapes are identical. An error is emitted if this cannot be done.</p>
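<p>For instance, the two rules apply as follows in this sketch (blocks assumed initialized):</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
int a[16];       // shape [16]
int b[32, 16];   // shape [32, 16]
// rule (1): a is left-padded with ones, giving shape [1, 16]
// rule (2): a is replicated 32 times along its first axis, giving shape [32, 16]
int c[32, 16] = a + b;
</pre></div></div>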
</div>
</div>
<div class="section" id="programming-model">
<h2>Programming Model<a class="headerlink" href="#programming-model" title="Permalink to this headline">¶</a></h2>
<p>As discussed in the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA documentation</a>, the execution of CUDA code on GPUs is supported by an <a class="reference external" href="https://en.wikipedia.org/wiki/SPMD">SPMD</a> programming model in which each kernel instance is associated with an identifiable <em>thread-block</em>, itself decomposed into <em>warps</em> of 32 <em>threads</em>. The Triton programming model is similar, but each kernel is <em>single-threaded</em> (though automatically parallelized) and associated with a global <code class="code docutils literal notranslate"><span class="pre">program id</span></code> which varies from instance to instance. This approach leads to simpler kernels in which CUDA-like concurrency primitives (shared memory synchronization, inter-thread communication, etc.) do not exist. The global program ids associated with each kernel instance can be queried using the <code class="code docutils literal notranslate"><span class="pre">get_program_id(axis)</span></code> built-in function, where <code class="code docutils literal notranslate"><span class="pre">0 &lt;= axis &lt;= 2</span></code>. This is useful, for example, to create blocks of pointers as shown in the tutorials.</p>
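<p>For illustration, the following is a sketch of a vector-addition kernel written in this model; the macro <code class="code docutils literal notranslate"><span class="pre">TILE</span></code> and all parameter names are arbitrary:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
#define TILE 1024

__global__ void add(int N, float* a, float* b, float* c) {
  // unique id of this kernel instance along axis 0
  int pid = get_program_id(0);
  // each instance processes one tile of TILE contiguous elements
  int offset[TILE] = pid * TILE + (0 ... TILE);
  // predicate guarding out-of-bounds accesses
  bool mask[TILE] = offset &lt; N;
  // blocks of pointers into each array
  float* pa[TILE] = a + offset;
  float* pb[TILE] = b + offset;
  float* pc[TILE] = c + offset;
  // masked loads, element-wise addition, masked store
  *?(mask)pc = *?(mask)pa + *?(mask)pb;
}
</pre></div></div>
</div>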