[documentation] added description of the __multipleof attribute

Philippe Tillet
2019-09-10 14:16:52 -04:00
parent df2455f4b8
commit 41acac6ba1


@@ -10,7 +10,8 @@
3. [Auto-Tuning](#auto-tuning) 3. [Auto-Tuning](#auto-tuning)
3. [Matrix Transposition](#matrix-transposition) 3. [Matrix Transposition](#matrix-transposition)
1. [Compute Kernel](#trans-compute-kernel) 1. [Compute Kernel](#trans-compute-kernel)
2. [Conditional Dereferencing](#conditional-dereferencing) 2. [The __multipleof Attribute](#trans-multipleof)
3. [Conditional Dereferencing](#conditional-dereferencing)
4. [Matrix Multiplication](#matrix-multiplication) 4. [Matrix Multiplication](#matrix-multiplication)
1. [Compute Kernel](#matmul-compute-kernel) 1. [Compute Kernel](#matmul-compute-kernel)
2. [Optimizations](#optimizations) 2. [Optimizations](#optimizations)
@@ -243,6 +244,18 @@ which will be used in statements (5) and (6) to construct tiles of pointers
``` ```
- Statement (7) element-wise dereferences the above array of pointers `*px`, transposes it using the unary transposition operator `^`, and writes it back at the location specified by `py`. - Statement (7) element-wise dereferences the above array of pointers `*px`, transposes it using the unary transposition operator `^`, and writes it back at the location specified by `py`.
### <span style="color:darkblue"> The __multipleof Attribute </span> <a name="trans-multipleof"></a>
The memory loads and stores in our transposition kernel are not vectorizable by default, since `X + ldx` (and `Y + ldy`) may be misaligned when `ldx` (and `ldy`) are not multiples of, e.g., 4. This is unfortunate because tensor dimensions in Deep Learning can easily be made powers of two, since batch sizes and layer widths are flexible.
For this reason, Triton provides a `__multipleof(N)` attribute for variables that are guaranteed to always be a multiple of `N`. In the case of matrix transposition, vector loads can be enabled by modifying the function's signature as follows:
```c
__global__ void transpose(TYPE * X, TYPE * Y, int M, int N, int ldx __multipleof(8), int ldy __multipleof(8)) {
// ...
}
```
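To illustrate why such a guarantee matters, here is a minimal sketch in plain C (not Triton-C, and not part of Triton itself). It assumes the base pointer is already aligned to the vector width and checks whether every row start, at element offset `i * ld`, stays a multiple of that width. The function name `rows_are_vectorizable` and its parameters are hypothetical and used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every row start, located i * ld
 * elements past an aligned base pointer, falls on a multiple of
 * vec_width elements, and 0 otherwise. */
int rows_are_vectorizable(size_t rows, size_t ld, size_t vec_width) {
    assert(vec_width > 0);
    for (size_t i = 0; i < rows; ++i) {
        /* A misaligned row start rules out vector loads for that row. */
        if ((i * ld) % vec_width != 0)
            return 0;
    }
    return 1;
}
```

With a leading dimension of 8 and a vector width of 4, every row start is aligned, whereas a leading dimension of 6 already misaligns the second row. A compiler cannot know this about an arbitrary `int ldx`, which is the information an attribute such as `__multipleof(8)` supplies.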
### <span style="color:darkblue"> Conditional Dereferencing </span> <a name="conditional-dereferencing"></a> ### <span style="color:darkblue"> Conditional Dereferencing </span> <a name="conditional-dereferencing"></a>
You might have noticed that the above code will fail when `M` and `N` are not multiples of `TM` and `TN` respectively. Fortunately, the above kernel can be slightly modified to handle this situation, as shown below: You might have noticed that the above code will fail when `M` and `N` are not multiples of `TM` and `TN` respectively. Fortunately, the above kernel can be slightly modified to handle this situation, as shown below: