Commit Graph

467 Commits

Author SHA1 Message Date
Phil Tillet
58d2867fe6 testing things... 2022-12-09 19:31:34 -08:00
Yan Chunwei
13644e7ac4 adapt isMMAv1Row in backend (#969) 2022-12-08 20:52:43 -08:00
Keren Zhou
3ed36dcb4d [BACKEND] MMA->DotOperand conversion for chain dot of float32 tensors (#962)
Co-authored-by: Philippe Tillet <phil@openai.com>
2022-12-08 20:11:51 +00:00
Keren Zhou
83f3b9165b [FRONTEND][BACKEND] Fix bool and int8 load when the other operand is given (#968) 2022-12-08 11:52:18 -08:00
Keren Zhou
71c35bcf9c [Triton-MLIR][BACKEND] Mark float to integer in Arithmetic Dialect as legal (#963) 2022-12-08 09:07:01 -08:00
Yan Chunwei
4eab9dcedf [Triton-MLIR][BACKEND] make MMAv1 splitk work (#960) 2022-12-07 08:58:38 +00:00
Philippe Tillet
b2b793dfb5 [FRONTEND][BACKEND] Fixes for cat / reshape / addptr (#959)
Most notably, this PR:
- changes the traits (and assembly format) of addptr so it can handle offsets that have arbitrary integer width.
- adds support for `cat`
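As a quick illustration of the new `cat` support, a hypothetical kernel sketch; the exact `tl.cat` signature and the kernel names here are assumptions for illustration, not taken from this PR:

```python
import triton
import triton.language as tl

@triton.jit
def cat_kernel(x_ptr, y_ptr, out_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)    # first block of N elements
    y = tl.load(y_ptr + offs)    # second block of N elements
    z = tl.cat(x, y)             # assumed: concatenate into a 2*N block
    tl.store(out_ptr + tl.arange(0, 2 * N), z)
```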
2022-12-06 23:29:50 -08:00
Philippe Tillet
981aee7f1e [FRONTEND] Frontend fixes for uint / for loops / random (#958) 2022-12-06 20:25:47 -08:00
Philippe Tillet
115cd3ac47 [FRONTEND] Added reshape as an alias for view (for now) (#956) 2022-12-06 09:57:05 -08:00
Philippe Tillet
532e10cf87 [FRONTEND][BACKEND] Clean-up transpositions (#953) 2022-12-06 09:32:13 -08:00
Keren Zhou
16e973edf2 [BACKEND] Fix dependency analysis in pipeline (#946) 2022-12-06 09:08:55 -08:00
Crutcher Dunnavant
9490252261 [FRONTEND] Support alternative install locations of system libdevice.10.bc (#951) 2022-12-06 03:41:44 +00:00
Yan Chunwei
e419781978 [Triton-MLIR][BACKEND] Make mmav1 work on basic cases (#944)
TODO:

- Add more cases
- Currently, we just set vec to 4 to make the basic cases pass

Issues:

- The vec in the shared layout differs from the master branch
- When vec=1, it encounters a CUDA misalignment error; this case doesn't
work on the master branch either
- When vec is set to the same value as the master branch, the MMA works
2022-12-06 10:57:08 +08:00
Crutcher Dunnavant
189491727a [FRONTEND] Extract and unify @builtin/@extern (#913)
This change attaches builtin-ness as an explicit attribute, rather than
a module-prefix expectation. This permits us to source those builtins
from multiple sub-modules (useful when some builtins are part of the
true cyclic implementation core, and some are just useful library
additions), and also prevents accidental inclusion of non-builtins that
happen to be in the right library.

Once the flag exists and the compiler uses `is_builtin()` for decision
making, the current `@extern` interface becomes isomorphic to
`@builtin`, and the two interfaces can be unified.

Leaving `@extern` as a thin wrapper, and encouraging its continued use,
future-proofs the design for adding extern tracing, metric hooks, or
scanning later.

* Add a `triton.impl` package to hold the core, order-dependent
implementation details.
* Extract `@builtin` and unify `@extern`; add `is_builtin()`.
* Add a sense bit so that `@builtin` detection is less fragile.
* Modify the compiler to use `is_builtin()`.
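A minimal sketch of what attribute-based builtin-ness can look like, assuming hypothetical attribute and decorator internals (not the actual Triton source):

```python
import functools

def builtin(fn):
    """Mark fn as a builtin via an explicit attribute, not a module prefix."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper.__triton_builtin__ = True   # the "sense bit"
    return wrapper

def is_builtin(fn) -> bool:
    """Compiler-side check that relies only on the explicit flag."""
    return getattr(fn, "__triton_builtin__", False)

def extern(fn):
    # `@extern` stays a thin wrapper, leaving room for tracing/metric hooks.
    return builtin(fn)

@builtin
def max(input, axis, _builder=None):
    ...

assert is_builtin(max)
```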
2022-12-05 22:59:41 +00:00
Crutcher Dunnavant
e0072d210a [FRONTEND] Propagate mypy types through @jit, @builtin, etc (#915)
Changes to make decorated API methods no longer type-opaque.

```
$ echo 'import triton; reveal_type(triton.language.max)' | mypy /dev/stdin
/dev/stdin:1: note: Revealed type is "def (input: Any, axis: Any, _builder: Any =) -> Any"
Success: no issues found in 1 source file
```
2022-12-05 22:41:02 +00:00
Crutcher Dunnavant
2fa17588f7 [FRONTEND] Expand __init__ * imports, add __all__ (#912)
Expand `from .foo import *` to full listings, and `__all__` sections.

This reifies the module export listings, which is useful for code
importing this module; without this, clients will need special `mypy`
control pragmas for this library.

This removes a number of `# flake8` control pragmas.

Verified with `flake8`.
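A small sketch of the resulting pattern in a package `__init__.py` (the names are illustrative, not Triton's actual export list):

```python
# Before: from .core import *   # opaque to mypy/flake8
# After: explicit imports plus a reified export list.
from .core import arange, load, store, where

__all__ = ["arange", "load", "store", "where"]
```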
2022-12-05 14:22:55 -08:00
Philippe Tillet
99c7e0e008 [BUILD] Change default build type (#945) 2022-12-03 17:47:33 -08:00
Keren Zhou
f2fcaeabf3 [BACKEND] Support dot op when the output is mma encoding and allowtf32 is true (#937) 2022-12-03 19:14:12 +00:00
Philippe Tillet
8edfe813a5 [FRONTEND][BACKEND] Added trans instruction; made flash attention bwd pass work (#943) 2022-12-03 09:58:24 -08:00
donproc
521ff9ad74 [TRITON-MLIR][FRONTEND] Fix scf.if to run through the layernorm tutorial (#938)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-02 17:45:29 +08:00
Keren Zhou
c280ebda1b [Triton-MLIR][BACKEND] Fix the membar pass to add missing barriers caused by scf.for (#933)
1. Add missing barriers and revert the previous temporary solution
2. Extract the `run` method from the membar analysis, since the
analysis should have two phases: construction, which doesn't modify any
IR, and modification, which inserts barrier ops. Hopefully this makes
the use of membar clearer.
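A schematic Python sketch of the two-phase structure described above; the real pass is C++ in the Triton backend, and all names here are hypothetical:

```python
class ToyMembarAnalysis:
    def __init__(self, ops):
        self.ops = ops               # toy "IR": a list of op-name strings
        self.barrier_points = []     # indices that need a preceding barrier

    def construct(self):
        # Phase 1: analyze only; record insertion points, no IR changes.
        for i, op in enumerate(self.ops):
            # Toy rule: a shared-memory read right after a write needs a barrier.
            if op == "shared.read" and i > 0 and self.ops[i - 1] == "shared.write":
                self.barrier_points.append(i)

    def modify(self):
        # Phase 2: mutate the IR by inserting barriers at the recorded points.
        for i in reversed(self.barrier_points):
            self.ops.insert(i, "gpu.barrier")

ops = ["shared.write", "shared.read"]
analysis = ToyMembarAnalysis(ops)
analysis.construct()
analysis.modify()
print(ops)  # ['shared.write', 'gpu.barrier', 'shared.read']
```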
2022-12-01 11:54:18 -08:00
donproc
9def1bcebf [TRITON-MLIR][FRONTEND] Minor fix to run through the atomic_cas test (#925)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-01 13:43:26 +00:00
Keren Zhou
7d90a07d0b [Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929)
1. Improve the pipeline pass's comments
2. Decompose `insert_slice_async` when the load vector size is not supported
3. Add a test that could fail our gemm code

Copy my comments here:

There's a knob that may cause a performance regression when the
decomposition has been performed. We should remove this knob once we
have a thorough analysis of async wait. Currently, we decompose
`insert_slice_async` into `load` and `insert_slice` without knowing
which `async_wait` is responsible for the `insert_slice_async`. To
guarantee correctness, we blindly set the `async_wait` to wait for all
async ops if any `insert_slice_async` has been decomposed.

There are two options to improve this:
1. We can perform a dataflow analysis in the backend to find the
`async_wait` that is responsible for each `insert_slice_async`.
2. We can modify the pipeline to perform the decomposition before the
`async_wait` is inserted. However, this is also risky because we don't
know the correct vectorized shape yet in the pipeline pass. Making the
pipeline pass aware of the vectorization could introduce additional
dependencies on AxisInfoAnalysis and the Coalesce analysis.
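A toy sketch of the conservative rule described above (the op spellings and the `num` semantics are schematic; the real logic lives in the C++ decomposition pass):

```python
def lower_async_waits(ops, decomposed_any):
    # If any insert_slice_async was decomposed into load + insert_slice, we
    # no longer know which async_wait guards which copy, so every async_wait
    # must conservatively wait for all outstanding async ops (num = 0 pending).
    lowered = []
    for op in ops:
        if op.startswith("async_wait") and decomposed_any:
            lowered.append("async_wait {num = 0}")
        else:
            lowered.append(op)
    return lowered

print(lower_async_waits(["cp.async", "async_wait {num = 2}"], True))
# ['cp.async', 'async_wait {num = 0}']
```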
2022-11-30 10:07:34 -08:00
Philippe Tillet
6461254fb5 [BACKEND] Make flash attention forward pass work (#928)
This also simplifies BroadcastOp codegen
2022-11-30 10:13:24 +00:00
Philippe Tillet
9bb54402b3 [FRONTEND][BACKEND] Small fixes to multiple_of, num_programs, axisinfo; enable block-sparse tests (#927) 2022-11-29 20:00:34 +01:00
Qingyi Liu
9d31998a9d [Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918) 2022-11-27 22:59:27 -08:00
goostavz
630dc315ee [Triton-MLIR] uncomment the UT in test_gemm that has already been fixed (#920) 2022-11-28 11:23:20 +08:00
Keren Zhou
35c9ec1103 [Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917) 2022-11-26 12:30:38 -08:00
donproc
f63be0e9b5 [TRITON-MLIR][BACKEND] Support atomic_cas (#914)
1. Support atomic_cas
2. Add xchg support in atomic_rmw

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-25 12:02:08 +08:00
Keren Zhou
153aecb339 [Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

Not sure whether V100 performance can match the master branch this way,
though. Maybe the performance can be improved if instructions are
arranged in the following form:

```
%0 = load
%1 = load 
%2 = load 
...
insert_slice %0
insert_slice %1
insert_slice %2
```

Tested on A100 when manually enabling this decomposition.
Tests on V100 haven't been integrated yet; we can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs in `test_backend.py`.
2. End-to-end gemm tests on V100.
2022-11-24 14:05:54 -08:00
Crutcher Dunnavant
f98aed1258 [Triton-MLIR][RUNTIME] Add /usr/bin/ptxas as a search path (#909)
Make the `ptxas` search a bit broader to include `/usr/bin/ptxas`, as
installed by the Lambda Stack repo versions:
https://lambdalabs.com/lambda-stack-deep-learning-software
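A hedged sketch of what a broader lookup can look like (the candidate list is illustrative, not the actual Triton code):

```python
import os
import shutil

def find_ptxas():
    # Prefer whatever is on PATH, then fall back to known install locations
    # such as /usr/bin/ptxas (e.g. from Lambda Stack) or the CUDA toolkit.
    candidates = [
        shutil.which("ptxas"),
        "/usr/bin/ptxas",
        "/usr/local/cuda/bin/ptxas",
    ]
    for path in candidates:
        if path and os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    raise RuntimeError("ptxas not found")
```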
2022-11-24 18:49:16 +00:00
Crutcher Dunnavant
ace7d28736 [Triton-MLIR][RUNTIME] Fix ir metadata lookup bug (#910) 2022-11-24 09:27:23 +01:00
donproc
8925c2cd11 [TRITON-MLIR][BACKEND] AtomicRMWOp supports scalar (#903)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-23 07:59:09 +00:00
Keren Zhou
2e33352419 [Triton-MLIR] Fix side effects (#906)
Try to add proper side effects for triton operations.

If side effects are not defined properly, the CSE pass can fail, hang,
or output incorrect IR for seemingly unknown reasons.

For instance, suppose we have two shared memory tensors:

```
%a = triton_gpu.alloc_tensor shape0, share_encoding0
%b = triton_gpu.alloc_tensor shape0, share_encoding0
```

The CSE pass will consider `%a` and `%b` to be the same value and
eliminate one of them, resulting in mysterious outcomes.
2022-11-22 23:29:18 -08:00
Yan Chunwei
037f9efa95 [Triton-MLIR][BACKEND] Fix wpt overflow issue in mma v2 (#904)
This PR:

1. Fixes the wpt overflow issue in MMA v2
2. Refines the transpose logic
2022-11-23 11:27:15 +08:00
ben-zhang-609
07786dc932 [Triton-MLIR] Add compute capability (#902)
Pass the compute capability from the Python frontend to the backend.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2022-11-22 11:08:23 -08:00
Keren Zhou
85cccfb81f [BUILD] Fix compilation problems in the release build (#897) 2022-11-21 05:40:36 +00:00
Philippe Tillet
23f71daa27 [OPTIMIZER] Fixed up order of shared layouts (#881) 2022-11-21 06:25:02 +01:00
Philippe Tillet
4d64ffb5fe [FRONTEND] Handle for loops with negative constant steps (#896) 2022-11-20 11:37:38 +01:00
Keren Zhou
6c5f646f4e [WIP][Triton-MLIR] Prefetch pass fixup (#873)
A (potential) problem with directly adopting `tensor.extract_slice`.

Long story short, `tensor.extract_slice` is not aware of swizzling.
Consider the following shared memory tensor and its first three slices,
where each slice includes two tiles (the loading unit of LDGSTS) of
elements. Currently, the tiles haven't been swizzled yet, so slicing
seems to work.

<img width="1219" alt="image"
src="https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png">

However, now consider the following figure, which is the layout after
applying swizzling on the first figure.

<img width="1244" alt="image"
src="https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png">

Note that in phase 2, all tiles have been swizzled out of their
original slices. This implies that if we use the tile index after
slicing, we can no longer locate the correct tiles. For example, T3 was
in slice 1 but got swapped to slice 0 after swizzling.

Here's a more detailed explanation. In the current `triton-mlir` branch,
we only compute the relative offset of each tile. So T3's index in Slice
1 is *1*, and it will be swizzled using *1* and *phase id*. Whereas the
correct index of T3 should be *3*, which is the relative offset to the
beginning of the shared memory tensor being swizzled, and T3 should be
swizzled using *3* and *phase id*.

This PR proposes a hacky solution for this problem. We restore the
"correct" offset of each tile by **assuming that slicing on a specific
dim only happens at most once on the output of insert_slice_async**. I
admit it's risky and fragile.

The other possible solution is to adopt cutlass's swizzling logic,
which limits the indices being swizzled to a "bounding box" that
matches what the mma instruction executes. For example, in the
following tensor layout, each 4x4 submatrix is a minimum swizzling
unit, and the entire tensor represents the layout of operand A in
`mma.16816`.

<img width="565" alt="image"
src="https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png">
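A toy illustration of the indexing mismatch (the xor-with-phase swizzle below is schematic; the real logic lives in the shared-encoding swizzling):

```python
def swizzle(tile_idx, phase):
    return tile_idx ^ phase   # map a tile index to its swizzled position

phase = 2          # phase id of the row being swizzled (example value)
relative_idx = 1   # T3's offset within slice 1 (what the current code uses)
absolute_idx = 3   # T3's offset from the start of the whole tensor (correct)
print(swizzle(relative_idx, phase))  # 3 -> lands on the wrong tile
print(swizzle(absolute_idx, phase))  # 1 -> the intended location
```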

Co-authored-by: Phil Tillet <phil@openai.com>
2022-11-19 19:57:16 -08:00
Jun Yang
8a5647782d [Triton-MLIR][Testing] Fix test warnings, with small code clean-up (#894)
1. Code clean-up to remove superfluous #includes.
2. Fix two Python test warnings: one relates to ["#"
formats](https://jira.mongodb.org/browse/PYTHON-2343), the other to
regular-expression string usage.
2022-11-19 14:33:59 +00:00
donproc
afaf59b0c9 [TRITON-MLIR][BACKEND] Atomic support mask (#889)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-19 19:57:19 +08:00
Philippe Tillet
dab4855bdf [TESTING] Added infrastructure for executing TTGIR program and test for layout conversions (#885) 2022-11-18 07:46:45 +01:00
goostavz
9ea6135eb5 [Triton-MLIR][Backend] Some cleanup in getMultiDimIndex/getLinearIndex (#880) 2022-11-18 01:19:21 +00:00
donproc
5eee738df7 [Triton-MLIR][FRONTEND][BACKEND] Fix atomics (#879)
Minor fixes to the backend and frontend of atomics; we can now pass one
test without a mask, with the shape aligned to the CTA size

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-16 12:25:15 +08:00
Qingyi Liu
4c4159c6fa [Triton-MLIR] Add ex2.approx implementation for ExpOp and fix smem allocation for ReduceOpConversion (#875) 2022-11-15 01:27:32 +00:00
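A quick note on the ex2-based lowering mentioned above: exp(x) is computed as 2^(x * log2(e)) so the hardware `ex2.approx` PTX instruction can handle the power-of-two step. A schematic Python equivalent (not the backend code):

```python
import math

def exp_via_ex2(x):
    LOG2E = 1.4426950408889634      # log2(e)
    return 2.0 ** (x * LOG2E)       # the GPU uses ex2.approx for the 2^y step

assert abs(exp_via_ex2(1.0) - math.e) < 1e-12
```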
goostavz
c28cfd821b [Triton-MLIR][Backend] Fix convert_layout blocked->shared in non-default order (#876)
This PR fixes TN/NT GEMM correctness when no SCF is involved.
I'll continue to clean up getLinearIndex/getMultiDimIndex in a uniform
way, which should help avoid various kinds of order issues.
This is not fully done yet; merging just to sync the code.
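For context on the order sensitivity, a schematic sketch of what `getLinearIndex`/`getMultiDimIndex` compute given an `order` (fastest-varying dimension first); the code is illustrative, not the actual C++ helpers:

```python
def get_linear_index(multi_dim, shape, order):
    linear, stride = 0, 1
    for d in order:                     # walk dims from fastest to slowest
        linear += multi_dim[d] * stride
        stride *= shape[d]
    return linear

def get_multi_dim_index(linear, shape, order):
    multi_dim = [0] * len(shape)
    for d in order:
        multi_dim[d] = linear % shape[d]
        linear //= shape[d]
    return multi_dim

assert get_linear_index([1, 2], [4, 8], order=[1, 0]) == 10
assert get_multi_dim_index(10, [4, 8], order=[1, 0]) == [1, 2]
```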
2022-11-15 09:02:46 +08:00
Yan Chunwei
1eedaf7bec [Triton-MLIR][BACKEND] adapt DotOp layout for FMADot (#872) 2022-11-14 16:56:30 +08:00
Chenggang Zhao
516a241234 [Triton-MLIR] Fix some typos (#874)
2022-11-13 18:15:53 -08:00
Philippe Tillet
f40c63fb03 [Triton-MLIR][OPTIMIZER] Cleaned up swizzling (#869)
Swizzling is no longer implemented as a separate pass. It is instead
done in a specialized constructor of SharedEncodingAttr, and tested via
google tests instead of triton-opt + filecheck.

In the future we may want to implement it as a pass again once we have
an additional dialect between TritonGPU and LLVM.
2022-11-10 12:05:46 -08:00