<h1>The Triton-C Language<a class="headerlink" href="#the-triton-c-language" title="Permalink to this headline">¶</a></h1>
<p>In the introduction, we stressed the importance of blocked algorithms and described their core principles in pseudo-code. To facilitate their implementation on modern GPU hardware, we present Triton-C, a single-threaded imperative kernel language in which block variables are first-class citizens. This language may be used either directly by developers familiar with C, or as an intermediate language for existing (and future) transcompilers. In this chapter, we describe its differences with C, its Numpy-like semantics and its “Single-Program, Multiple-Data” (SPMD) programming model.</p>
<divclass="section"id="differences-with-c">
<h2>Differences with C<aclass="headerlink"href="#differences-with-c"title="Permalink to this headline">¶</a></h2>
<p>The syntax of Triton-C is based on that of ANSI C, but was modified and extended to accommodate the semantics and programming model described in the next two subsections. These changes fall into the following categories:</p>
<divclass="section"id="extensions">
<h3>Extensions<aclass="headerlink"href="#extensions"title="Permalink to this headline">¶</a></h3>
<p><strong>Variable declarations</strong>: Triton adds special-purpose syntax for multi-dimensional array declarations (e.g., <code class="code docutils literal notranslate"><span class="pre">int block[16, 16]</span></code>), which purposely differs from that of nested arrays (i.e., arrays of pointers) found in ANSI C (e.g., <code class="code docutils literal notranslate"><span class="pre">int block[16][16]</span></code>). Block dimensions must be constant but can also be made parametric with the use of pre-processor macros. One-dimensional blocks of integers may be initialized using ellipses (e.g., <code class="code docutils literal notranslate"><span class="pre">int range[16] = 0 ... 16</span></code>).</p>
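<p>The following sketch illustrates these declarations (the macro name <code class="code docutils literal notranslate"><span class="pre">TILE</span></code> is arbitrary):</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// two-dimensional 16x16 block of integers
int block[16, 16];
// block dimensions made parametric with a pre-processor macro
#define TILE 16
float tile[TILE, TILE];
// one-dimensional block initialized with an ellipsis: 0, 1, ..., 15
int range[16] = 0 ... 16;
</pre></div></div>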
<p><strong>Operators and built-in functions</strong>: The usual C operators were extended to support element-wise array operations (<code class="code docutils literal notranslate"><span class="pre">+</span></code>, <code class="code docutils literal notranslate"><span class="pre">-</span></code>, <code class="code docutils literal notranslate"><span class="pre">&amp;&amp;</span></code>, <code class="code docutils literal notranslate"><span class="pre">*</span></code>, etc.) and complex array operations (<code class="code docutils literal notranslate"><span class="pre">@</span></code> for matrix multiplication). Additionally, some built-in functions were added for concurrency (<code class="code docutils literal notranslate"><span class="pre">get_program_id</span></code>, <code class="code docutils literal notranslate"><span class="pre">atomic_add</span></code>).</p>
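<p>As a brief sketch, assuming two <code class="code docutils literal notranslate"><span class="pre">float</span></code> blocks <code class="code docutils literal notranslate"><span class="pre">a</span></code> and <code class="code docutils literal notranslate"><span class="pre">b</span></code> of shape [16, 16] are already in scope:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// element-wise operations
float sum[16, 16]  = a + b;
float prod[16, 16] = a * b;   // element-wise product, not matrix product
// matrix multiplication uses the dedicated @ operator
float mm[16, 16]   = a @ b;
// built-in function: id of the current kernel instance along axis 0
int pid = get_program_id(0);
</pre></div></div>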
<p><strong>Slicing and broadcasting</strong>: Multi-dimensional blocks can be broadcast along any particular dimension using Numpy-like slicing syntax (e.g., <code class="code docutils literal notranslate"><span class="pre">int array[8, 8] = range[:, newaxis]</span></code> for stacking columns). Note that, as of now, slicing blocks to retrieve sub-blocks (or scalars) is forbidden as it is incompatible with the automatic parallelization methods used by our JIT. Reductions can be achieved using a syntax similar to slicing (e.g., <code class="code docutils literal notranslate"><span class="pre">array[+]</span></code> for summing an array, or <code class="code docutils literal notranslate"><span class="pre">array[:, max]</span></code> for row-wise maximum). Currently supported reduction operators are <code class="code docutils literal notranslate"><span class="pre">+</span></code>, <code class="code docutils literal notranslate"><span class="pre">min</span></code> and <code class="code docutils literal notranslate"><span class="pre">max</span></code>.</p>
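<p>The snippet below sketches these constructs; the shapes and variable names are arbitrary:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
int range[8] = 0 ... 8;
// broadcasting: stack 8 copies of range as columns
int cols[8, 8] = range[:, newaxis];
// broadcasting: stack 8 copies of range as rows
int rows[8, 8] = range[newaxis, :];
// full reduction: sum of all elements
int total = range[+];
// partial reduction: row-wise maximum
int row_max[8] = cols[:, max];
</pre></div></div>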
<p><strong>Masked pointer dereferencing</strong>: Block-level operations in Triton-C are “atomic”, in the sense that they execute either completely or not at all. Basic element-wise control-flow for block-level operations can nonetheless be achieved using ternary operators and the <em>masked pointer dereferencing</em> operator exemplified below:</p>
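<p>A minimal sketch, assuming <code class="code docutils literal notranslate"><span class="pre">float</span></code> blocks <code class="code docutils literal notranslate"><span class="pre">a</span></code>, <code class="code docutils literal notranslate"><span class="pre">b</span></code> and a block of pointers <code class="code docutils literal notranslate"><span class="pre">ptr</span></code> of matching shape are in scope:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
// create a block of predicates
bool mask[16, 16] = a &gt; 0;
// element-wise selection with the ternary operator
float x[16, 16] = mask ? a + b : a - b;
// masked load: elements where mask is false are not read
float y[16, 16] = *?(mask)ptr;
// masked store: elements where mask is false are not written
*?(mask)ptr = x + y;
</pre></div></div>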
</div>
<div class="section" id="restrictions">
<h3>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h3>
<p>The Triton project is still in its infancy. As such, there are quite a few features of ANSI C that are not supported:</p>
<p><strong>Non-kernel functions</strong>: Right now, all function definitions must be kernels, i.e., be preceded with the <code class="code docutils literal notranslate"><span class="pre">__global__</span></code> attribute. We are aware that this is a severe limitation; it exists because our automatic parallelization engine would not be capable of handling array parameters passed between functions.</p>
<p><strong>Non-primitive types</strong>: Non-primitive types defined with <code class="code docutils literal notranslate"><span class="pre">struct</span></code> and <code class="code docutils literal notranslate"><span class="pre">union</span></code> are currently not supported, again because it is unclear at this point how these constructs would hook into our block-level data-flow analysis passes.</p>
<p><strong>While loops</strong>: We just haven’t had time to implement those yet.</p>
</div>
</div>
<divclass="section"id="semantics">
<h2>Semantics<aclass="headerlink"href="#semantics"title="Permalink to this headline">¶</a></h2>
<p>The existence of built-in <strong>blocked</strong> types, variables and operations in Triton-C offers two main benefits. First, it simplifies the structure of blocked programs by hiding important details pertaining to concurrent programming such as memory coalescing, cache management and specialized tensor intrinsics. Second, it opens the door for compilers to perform these optimizations automatically. However, it also means that programs have some kind of <em>block-level semantics</em> that does not exist in C. Though some aspects of it (e.g., the <code class="code docutils literal notranslate"><span class="pre">@</span></code> operator) are fairly intuitive, one in particular might be puzzling to some GPU programmers: broadcasting semantics.</p>
<divclass="section"id="broadcasting-semantics">
<h3>Broadcasting Semantics<aclass="headerlink"href="#broadcasting-semantics"title="Permalink to this headline">¶</a></h3>
<p>Block variables in Triton are strongly typed, meaning that certain instructions statically require their operands to satisfy strict shape constraints. For example, a scalar may not be added to an array unless it is first appropriately broadcast. <em>Broadcasting semantics</em> (first introduced in <a class="reference external" href="https://numpy.org/doc/stable/user/basics.broadcasting.html">Numpy</a>) provides two formal rules for performing these conversions automatically in the case of binary operators: (1) the shape of the lowest-dimensional operand is left-padded with ones until both operands have the same dimensionality; and (2) the content of both operands is replicated as many times as needed until their shapes are identical. An error is emitted if this cannot be done.</p>
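<p>For instance, the two rules apply as follows in this sketch (blocks assumed initialized):</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
int a[16];       // shape [16]
int b[32, 16];   // shape [32, 16]
// rule (1): a is left-padded with ones, giving shape [1, 16]
// rule (2): a is replicated 32 times along its first axis, giving shape [32, 16]
int c[32, 16] = a + b;
</pre></div></div>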
</div>
</div>
<div class="section" id="programming-model">
<h2>Programming Model<a class="headerlink" href="#programming-model" title="Permalink to this headline">¶</a></h2>
<p>As discussed in the <a class="reference external" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA documentation</a>, the execution of CUDA code on GPUs is supported by an <a class="reference external" href="https://en.wikipedia.org/wiki/SPMD">SPMD</a> programming model in which each kernel instance is associated with an identifiable <em>thread-block</em>, itself decomposed into <em>warps</em> of 32 <em>threads</em>. The Triton programming model is similar, but each kernel is <em>single-threaded</em> (though automatically parallelized) and associated with a global <code class="code docutils literal notranslate"><span class="pre">program id</span></code> which varies from instance to instance. This approach leads to simpler kernels in which CUDA-like concurrency primitives (shared memory synchronization, inter-thread communication, etc.) do not exist. The global program ids associated with each kernel instance can be queried using the <code class="code docutils literal notranslate"><span class="pre">get_program_id(axis)</span></code> built-in function, where <code class="code docutils literal notranslate"><span class="pre">0 &lt;= axis &lt;= 2</span></code>. This is useful, for example, to create blocks of pointers as shown in the tutorials.</p>
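<p>For illustration, the following is a sketch of a vector-addition kernel written in this model; the macro <code class="code docutils literal notranslate"><span class="pre">TILE</span></code> and all parameter names are arbitrary:</p>
<div class="highlight-c notranslate"><div class="highlight"><pre>
#define TILE 1024

__global__ void add(int N, float* a, float* b, float* c) {
  // unique id of this kernel instance along axis 0
  int pid = get_program_id(0);
  // each instance processes one tile of TILE contiguous elements
  int offset[TILE] = pid * TILE + (0 ... TILE);
  // predicate guarding out-of-bounds accesses
  bool mask[TILE] = offset &lt; N;
  // blocks of pointers into each array
  float* pa[TILE] = a + offset;
  float* pb[TILE] = b + offset;
  float* pc[TILE] = c + offset;
  // masked loads, element-wise addition, masked store
  *?(mask)pc = *?(mask)pa + *?(mask)pb;
}
</pre></div></div>
</div>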