Commit Graph

722 Commits

Author SHA1 Message Date
goostavz
630dc315ee [Triton-MLIR] uncomment the UT in test_gemm that has already been fixed (#920) 2022-11-28 11:23:20 +08:00
Keren Zhou
35c9ec1103 [Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917) 2022-11-26 12:30:38 -08:00
donproc
f63be0e9b5 [TRITON-MLIR][BACKEND] Support atomic_cas (#914)
1. Support atomic_cas (see the sketch below).
2. Add xchg support in atomic_rmw.

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-25 12:02:08 +08:00
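A minimal Triton-language sketch of the two primitives mentioned above, assuming the standard `tl.atomic_cas(ptr, cmp, val)` and `tl.atomic_xchg(ptr, val)` signatures; the kernel and pointer names are made up for illustration.

```
import triton
import triton.language as tl


@triton.jit
def cas_xchg_kernel(lock_ptr, old_ptr):
    # atomic_cas: store 1 into *lock_ptr only if it currently holds 0,
    # returning whatever value was there before.
    old = tl.atomic_cas(lock_ptr, 0, 1)
    # atomic_xchg: unconditionally swap 0 back in (the xchg flavor of
    # atomic_rmw), again returning the previous contents.
    tl.atomic_xchg(lock_ptr, 0)
    tl.store(old_ptr, old)
```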
Keren Zhou
153aecb339 [Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

Not sure whether V100 performance can match the master branch this way, though.
Maybe the performance can be improved if instructions are arranged in
the following form:

```
%0 = load
%1 = load 
%2 = load 
...
insert_slice %0
insert_slice %1
insert_slice %2
```

Tested on A100 when manually enabling this decomposition.
Tests on V100 haven't been integrated yet; we can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs in `test_backend.py`.
2. End-to-end GEMM tests on V100.
2022-11-24 14:05:54 -08:00
Crutcher Dunnavant
f98aed1258 [Triton-MLIR][RUNTIME] Add /usr/bin/ptxas as a search path (#909)
Make the `ptxas` search a bit broader to include `/usr/bin/ptxas`, as installed
by the Lambda Stack repo packages:
https://lambdalabs.com/lambda-stack-deep-learning-software
2022-11-24 18:49:16 +00:00
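Not the actual runtime code, just a rough Python sketch of the kind of broadened lookup described above: check `PATH` first, then fall back to well-known locations such as `/usr/bin/ptxas`; the helper name and candidate list are assumptions.

```
import os
import shutil


def find_ptxas():
    # 1) anything already on PATH (conda envs, CUDA toolkits added to PATH, ...)
    found = shutil.which("ptxas")
    if found:
        return found
    # 2) well-known install locations, including /usr/bin/ptxas as installed
    #    by the Lambda Stack packages mentioned above
    candidates = [
        os.path.join(os.environ.get("CUDA_HOME", "/usr/local/cuda"), "bin", "ptxas"),
        "/usr/bin/ptxas",
    ]
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    raise RuntimeError("ptxas not found; please install the CUDA toolkit")
```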
Crutcher Dunnavant
ace7d28736 [Triton-MLIR][RUNTIME] Fix ir metadata lookup bug (#910) 2022-11-24 09:27:23 +01:00
ben-zhang-609
b688f7b7b8 [Triton-MLIR] add_volta_warpsPerTile (#907) 2022-11-24 01:44:29 +00:00
donproc
8925c2cd11 [TRITON-MLIR][BACKEND] AtomicRMWOp supports scalar (#903)
AtomicRMWOp now supports scalar operands.

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-23 07:59:09 +00:00
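A tiny illustration of what "scalar" means in the commit above, assuming the usual `tl.atomic_add` signature: the pointer is a single scalar pointer rather than a tensor of pointers (the kernel and argument names are hypothetical).

```
import triton
import triton.language as tl


@triton.jit
def count_kernel(counter_ptr):
    # counter_ptr is one scalar pointer, not a block of pointers:
    # every program instance bumps the same global counter by one.
    tl.atomic_add(counter_ptr, 1)
```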
Keren Zhou
2e33352419 [Triton-MLIR] Fix side effects (#906)
Try to add proper side effects for Triton operations.

If side effects are not defined properly, the CSE pass can fail, hang, or
output incorrect IR for seemingly unknown reasons.

For instance, suppose we have two shared memory tensors:

```
%a = triton_gpu.alloc_tensor shape0, share_encoding0
%b = triton_gpu.alloc_tensor shape0, share_encoding0
```

The CSE pass will consider `%a` and `%b` to be the same value and
eliminate one of them, resulting in mysterious outcomes.
2022-11-22 23:29:18 -08:00
Yan Chunwei
037f9efa95 [Triton-MLIR][BACKEND] Fix wpt overflow issue in mma v2 (#904)
This PR:

1. Fixes the wpt overflow issue in mma v2.
2. Refines the transpose logic.
2022-11-23 11:27:15 +08:00
ben-zhang-609
07786dc932 [Triton-MLIR] Add compute capability (#902)
Pass the compute capability from the Python frontend to the backend.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2022-11-22 11:08:23 -08:00
Keren Zhou
2afebcd79b [Triton-MLIR][Backend] Remove unnecessary barriers (#901)
Cross-operation barriers are taken care of by the Membar pass.

Explicit barriers are only required if synchronization is necessary
within a single operation.
2022-11-22 10:03:29 -08:00
Yan Chunwei
136668bac3 [Triton-MLIR][BACKEND] tiny code cleanup (#899)
- Remove the unnecessary `static` in the anonymous namespace
- Remove several unnecessary functions
- Several simple rewrites to make the code clearer
2022-11-21 16:00:46 +08:00
Keren Zhou
04b852e031 [Triton-MLIR] Fix warnings and variable names (#898)
We have been seeing the following error message for a while:

> NO target: Unable to find target for this triple (no targets are
registered)

It seems unnecessary to set up the target triple at that point, so we can
just take it out to get rid of the error message.

Variable names have been changed to camel case.
2022-11-21 06:25:27 +00:00
Keren Zhou
85cccfb81f [BUILD] Fix compilation problems in the release build (#897) 2022-11-21 05:40:36 +00:00
Philippe Tillet
23f71daa27 [OPTIMIZER] Fixed up order of shared layouts (#881) 2022-11-21 06:25:02 +01:00
Philippe Tillet
4d64ffb5fe [FRONTEND] Handle for loops with negative constant steps (#896) 2022-11-20 11:37:38 +01:00
Keren Zhou
6c5f646f4e [WIP][Triton-MLIR] Prefetch pass fixup (#873)
A (potential) problem with directly adopting `tensor.extract_slice`.

Long story short, `tensor.extract_slice` is not aware of swizzling.
Consider the following shared memory tensor and its first three slices,
where each slice includes two tiles (the loading unit of LDGSTS) of
elements. Currently, the tiles haven't been swizzled yet, so slicing
seems to work.

<img width="1219" alt="image"
src="https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png">

However, now consider the following figure, which is the layout after
applying swizzling on the first figure.

<img width="1244" alt="image"
src="https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png">

Note that in phase 2, all tiles have been swizzled out of their
original slices. This implies that if we use the tile index after
slicing, we can no longer locate the correct tiles. For example, T3 was
in slice 1 but got swapped to slice 0 after swizzling.

Here's a more detailed explanation. In the current `triton-mlir` branch,
we only compute the relative offset of each tile. So T3's index in Slice
1 is *1*, and it will be swizzled using *1* and the *phase id*. The
correct index of T3, however, is *3*, the offset relative to the
beginning of the shared memory tensor being swizzled, so T3 should be
swizzled using *3* and the *phase id*.

This PR proposes a hacky solution for this problem. We restore the
"correct" offset of each tile by **assuming that slicing on a specific
dim only happens at most once on the output of insert_slice_async**. I
admit it's risky and fragile.

The other possible solution is to adopt cutlass' swizzling logic, which
limits the indices being swizzled to a "bounding box" matching what the
mma instruction operates on. For example, in the following tensor layout,
each 4x4 submatrix is a minimum swizzling unit, and the entire tensor
represents the layout of operand A in `mma.16816`.

<img width="565" alt="image"
src="https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png">

Co-authored-by: Phil Tillet <phil@openai.com>
2022-11-19 19:57:16 -08:00
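A plain-Python sketch of the indexing issue described in the commit above, assuming the common XOR-based swizzle (tile index XOR phase id); the exact backend formula and the phase value used here are illustrative only.

```
def swizzle(tile_index, phase):
    # XOR-style swizzling of a tile index with the phase id of its row
    return tile_index ^ phase


# T3 sits at offset 3 from the start of the shared memory tensor,
# but only at offset 1 inside Slice 1 produced by slicing.
phase = 2                      # phase id of T3's row (illustrative)
correct = swizzle(3, phase)    # tensor-relative offset -> the right tile
wrong = swizzle(1, phase)      # slice-relative offset -> a different tile
assert correct != wrong
```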
Yan Chunwei
e8994209f4 [Triton-MLIR][Backend]fix mma-v2 transpose error (#888) 2022-11-20 11:29:09 +08:00
Jun Yang
8a5647782d [Triton-MLIR][Testing]Fix tests warning, with small code clean-up (#894)
1. Code clean-up to remove superfluous #includes.
2. Fix two Python test warnings: one relates to ["#"
formats](https://jira.mongodb.org/browse/PYTHON-2343), the other to
regular expression string usage.
2022-11-19 14:33:59 +00:00
donproc
afaf59b0c9 [TRITON-MLIR][BACKEND] Atomic support mask (#889)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-19 19:57:19 +08:00
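A short Triton-language sketch of a masked atomic update, assuming the standard `tl.atomic_add(ptr, val, mask=...)` signature (the kernel and argument names are hypothetical): lanes whose mask is false skip the atomic entirely.

```
import triton
import triton.language as tl


@triton.jit
def masked_atomic_add_kernel(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n            # out-of-bounds lanes must not touch memory
    tl.atomic_add(x_ptr + offs, 1.0, mask=mask)
```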
Philippe Tillet
dab4855bdf [TESTING] Added infrastructure for executing TTGIR program and test for layout conversions (#885) 2022-11-18 07:46:45 +01:00
goostavz
9ea6135eb5 [Triton-MLIR][Backend] Some cleanup in getMultiDimIndex/getLinearIndex (#880) 2022-11-18 01:19:21 +00:00
donproc
5eee738df7 [Triton-MLIR][FRONTEND] [BACKEND] fix atomics (#879)
Minor fixes to the backend and frontend of atomics; we can now pass one
test without a mask and with the shape aligned to the CTA size.

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-16 12:25:15 +08:00
goostavz
37f5846280 [Triton-MLIR][Backend] Minor fix for allocation and backend in handling tt.ptr tensors (#878) 2022-11-15 10:08:07 +00:00
Yan Chunwei
a22ff39017 [Triton-MLIR][BACKEND] Refine/add codegen for get_program_id and get_num_programs Op (#877) 2022-11-15 15:45:24 +08:00
Qingyi Liu
4c4159c6fa [Triton-MLIR] Add ex2.approx implementation for ExpOp and fix smem allocation for ReduceOpConversion (#875) 2022-11-15 01:27:32 +00:00
goostavz
c28cfd821b [Triton-MLIR][Backend] Fix convert_layout blocked->shared in non-default order (#876)
This PR fixes the TN/NT GEMM correctness problem when no SCF is involved.
I'll continue to clean up getLinearIndex/getMultiDimIndex in a unified
way, which should be beneficial for avoiding different kinds of order
issues. This is not fully done yet; merging now just to sync the code.
2022-11-15 09:02:46 +08:00
Yan Chunwei
1eedaf7bec [Triton-MLIR][BACKEND] adapt DotOp layout for FMADot (#872) 2022-11-14 16:56:30 +08:00
Chenggang Zhao
516a241234 [Triton-MLIR] Fix some typos (#874)
Fix some typos
2022-11-13 18:15:53 -08:00
Philippe Tillet
f40c63fb03 [Triton-MLIR][OPTIMIZER] Cleaned up swizzling (#869)
Swizzling is no longer implemented as a separate pass. It is instead
done in a specialized constructor of SharedEncodingAttr, and tested via
google tests instead of triton-opt + filecheck.

In the future we may want to implement it as a pass again once we have
an additional dialect between TritonGPU and LLVM.
2022-11-10 12:05:46 -08:00
Philippe Tillet
2aa538ec2e [BACKEND] Added support for mma layouts in reductions (#863)
Validated hackily by manually modifying the reduction .ttgir in my local
cache. There will be a follow-up PR adding some better testing
infrastructure to test out conversions and reductions on arbitrary
layouts.
2022-11-10 09:58:07 -08:00
Chenggang Zhao
57fd1864a7 [Triton-MLIR] Support FP8 (#864)
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 15:53:06 +08:00
Da Yan
4946167241 [Triton-MLIR] tt.dot operands now must have DotOperand layout; also added prefetch pass prototype (#712)
Co-authored-by: Jokeren <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
Co-authored-by: Superjomn <yanchunwei@outlook.com>
2022-11-10 05:57:27 +00:00
Yan Chunwei
8832e32683 [Triton-MLIR][BACKEND] Refine ptxbuilder (#867)
This PR:

1. Adds an `onlyBindMLIRArgs` argument to the `PTXInstrCommon::call` method to
support passing in a whole PTX code snippet.
2. Refines the APIs and simplifies their usage.
2022-11-10 13:41:52 +08:00
donproc
4640023d9b [Triton-MLIR][Backend]add atomic rmw without mask (#842)
Add atomic RMW without a mask.

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-10 08:15:58 +08:00
Yan Chunwei
0c87360657 [Triton-MLIR][Backend] Port FMADot conversion for DotOp (#844)
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
2022-11-09 12:57:50 +08:00
Yan Chunwei
de5b84c476 [Triton-MLIR][Backend] Fix mma<v2> int8 precision error (#850)
Fix mma.16816 s8 precision error

Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
2022-11-09 12:23:43 +08:00
Qingyi Liu
e517b58d59 [Triton-MLIR] Minor fixes to enable fused-softmax and layer-norm tutorials (#835) 2022-11-09 02:18:56 +00:00
Keren Zhou
2da71b2aaa [Triton-MLIR] Increase block size K to completely eliminate shared memory bank conflicts (#862) 2022-11-08 17:39:23 -08:00
goostavz
080b4addf8 [Triton-MLIR][Backend] Fix the order in linear/delinear and a few bugs in reduce conversion (#851)
1. Fix the order in linearize/delinearize, which fixes the order error in
emitIndices (see the sketch after this entry).
2. Fix the selection of the fast implementation in reduce codegen.
3. Fix a redundant barrier in reduce codegen.
4. Fix the index mapping of the second round of warp_shuffle in the shuffle
version of reduce codegen.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2022-11-08 10:10:09 -08:00
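A plain-Python sketch of the order-sensitive linearize/delinearize pair referenced in item 1 above; the function names mirror the commit message, but the code is only illustrative: `order[0]` is the fastest-varying dimension, so getting the order wrong permutes the emitted indices.

```
def linearize(multi_dim, shape, order):
    # Fold a multi-dimensional index into a linear one; order[0] varies fastest.
    linear, stride = 0, 1
    for d in order:
        linear += multi_dim[d] * stride
        stride *= shape[d]
    return linear


def delinearize(linear, shape, order):
    # Inverse of linearize for the same dimension order.
    multi_dim = [0] * len(shape)
    for d in order:
        multi_dim[d] = linear % shape[d]
        linear //= shape[d]
    return multi_dim


# Round trip under a non-default order, e.g. order=[1, 0] (dim 1 fastest):
idx = [2, 3]
lin = linearize(idx, shape=[4, 8], order=[1, 0])
assert delinearize(lin, shape=[4, 8], order=[1, 0]) == idx
```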
Ian Bearman
303790da88 [BUILD] use Python Var In Tests (#859) 2022-11-08 17:44:19 +00:00
Da Yan
137344946f [OPTIMIZER] Fix the load-mask issue with the pipeline pass (#857) 2022-11-08 09:29:53 -08:00
Philippe Tillet
976cf12af1 [OPTIMIZER] Fixed memory coalescing (#847) 2022-11-07 06:22:18 -08:00
Philippe Tillet
b6f15e214b [FRONTEND] Fixed up type cast in atomics codegen (#853) 2022-11-07 05:46:24 -08:00
ben-zhang-609
84ad215268 [Triton-MLIR] Enable libdevice for ptx backend when has external functions. (#848)
At the ptx-to-cubin phase, check whether the llvm::Module has external
functions; if it does, link with libdevice at:
https://github.com/openai/triton/blob/triton-mlir/python/triton/language/libdevice.10.bc
2022-11-07 08:01:50 +00:00
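For context, a hedged sketch of the kind of kernel that takes this path: calling a libdevice math function from the Python frontend leaves an external `__nv_*` declaration in the LLVM module, which is what the ptx-to-cubin step now detects before linking `libdevice.10.bc`. This assumes the `tl.libdevice` wrappers on this branch (e.g. `asin`); kernel and argument names are made up.

```
import triton
import triton.language as tl


@triton.jit
def asin_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # lowers to an external __nv_* call, so the resulting LLVM module
    # needs libdevice linked in
    y = tl.libdevice.asin(x)
    tl.store(y_ptr + offs, y, mask=mask)
```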
Keren Zhou
fdd59900f7 [Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837)
## Features

- Allow taking a block slice of a tensor, as long as each dimension is
contiguous (unit stride).
- Fix some problems in `insert_slice_async`'s semantic.
- More general verification for ops that return shared layout encoding.

## Known Limitations

- `insert_slice_async` still uses the old semantic. May submit another
PR later to support similar semantic like `tensor.extract_slice`.
- No encoding verification for `tensor.extract_slice`.
- 3d tensor ops are broken.
- Strided accesses are not allowed.
- May cause a slight performance slowdown since we are passing strides
as values rather than constants (e.g., int).
It would be difficult to pass strides as attributes when we have control
flow. A block argument makes it possible to accept tensors with different
strides.
2022-11-06 22:59:03 -08:00
Philippe Tillet
a4ff0c362c [FRONTEND] Fix issues with atomics (#849) 2022-11-06 20:52:11 -08:00
Philippe Tillet
b6dbe959f0 [RUNTIME] Re-vamped cache so users can manually patch IR / ptx / cubin files (#845)
Also deprecates a couple of tests
2022-11-04 10:57:29 -07:00
Keren Zhou
4218e68d74 [Triton-MLIR] [Frontend] Return a scalar if all input args are scalar (#839) 2022-11-03 20:27:47 -07:00