1. Add missing barriers and revert the previous temporary solution
2. Extract the `run` method from the membar analysis, because the
membar analysis should have two phases: construction, which doesn't
modify any IR, and modification, which inserts barrier ops. Hopefully
this makes the use of membar clearer.
1. Improve the pipeline pass's comments
2. Decompose `insert_slice_async` when the load vector size is not
supported
3. Add a test that could make our GEMM code fail
Copying my comments here:
There's a knob that may cause a performance regression when the
decomposition has been performed. We should remove this knob once we
have a thorough analysis of async waits. Currently, we decompose
`insert_slice_async` into `load` and `insert_slice` without knowing
which `async_wait` is responsible for the `insert_slice_async`. To
guarantee correctness, we conservatively make the `async_wait` wait for
all async ops whenever any `insert_slice_async` has been decomposed
(see the IR sketch after the list below).
There are two options to improve this:
1. We can perform a dataflow analysis to find the `async_wait` that is
responsible for the `insert_slice_async` in the backend.
2. We can modify the pipeline pass to perform the decomposition before
the `async_wait` is inserted. However, this is also risky because we
don't yet know the correct vectorized shape in the pipeline pass.
Making the pipeline pass aware of vectorization could introduce
additional dependencies on AxisInfoAnalysis and the coalescing
analysis.
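A schematic before/after of the conservative decomposition (op names,
operands like `%ptr`/`%buf`, and the `num` attribute are simplified for
illustration, not exact TritonGPU syntax):
```
// Before: the async copy is covered by a matching async_wait.
%1 = triton_gpu.insert_slice_async %ptr, %buf, %idx
triton_gpu.async_wait {num = 2 : i32}

// After: a synchronous load + insert_slice. Since we no longer know
// which wait covered the copy, the async_wait conservatively waits
// for all outstanding async ops (num = 0).
%2 = tt.load %ptr
%3 = tensor.insert_slice %2 into %buf[%idx]
triton_gpu.async_wait {num = 0 : i32}
```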
This PR
- applies a minimal modification to decouple the Dot-helper-related
code from TritonGPUToLLVM.cpp into a separate local header file, making
it easier to share data structures for Dot
- adds patches necessary for transA and transB
- adds patches necessary for MMA v1 execution in the backend
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend. I'm not sure V100 performance can match the master branch this
way, though. Performance might be improved if the instructions were
arranged in the following form:
```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
Tested on A100 by manually enabling this decomposition.
Tests on V100 haven't been integrated yet; we can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs, in `test_backend.py`.
2. End-to-end GEMM tests on V100.
Try to add proper side effects for Triton operations.
If side effects are not defined properly, the CSE pass can fail, hang,
or output incorrect IR for seemingly unknown reasons.
For instance, suppose we have two shared memory tensors:
```
%a = triton_gpu.alloc_tensor shape0, share_encoding0
%b = triton_gpu.alloc_tensor shape0, share_encoding0
```
The CSE pass will consider `%a` and `%b` to be the same value and
eliminate one of them, resulting in mysterious outcomes.
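Concretely, since the two ops are structurally identical and declare no
side effects, CSE folds them into one buffer (a schematic sketch of the
resulting IR):
```
// After CSE: one allocation remains, and every former use of %b now
// aliases %a, so writes intended for distinct tensors clobber each other.
%a = triton_gpu.alloc_tensor shape0, share_encoding0
```
Declaring an allocation side effect (e.g. a MemAlloc memory effect) on
`alloc_tensor` keeps CSE from merging the two ops.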
Cross-operation barriers are taken care of by the Membar pass. Explicit
barriers are only required when synchronization is necessary within a
single operation.
We have been seeing the following error message for a while:
> NO target: Unable to find target for this triple (no targets are
registered)
It seems that it's not necessary to set up the target triple at that
point, so we can just remove it to get rid of the error message.
Variable names have been changed to camelCase.
A (potential) problem with directly adopting `tensor.extract_slice`:
long story short, `tensor.extract_slice` is not aware of swizzling.
Consider the following shared memory tensor and its first three slices,
where each slice includes two tiles (the loading unit of LDGSTS) of
elements. Currently, the tiles haven't been swizzled yet, so slicing
appears to work.
<img width="1219" alt="image"
src="https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png">
However, now consider the following figure, which is the layout after
applying swizzling on the first figure.
<img width="1244" alt="image"
src="https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png">
Note that in phase 2, all tiles have been swizzled out of their
original slices. This implies that if we use the tile index computed
after slicing, we can no longer locate the correct tiles. For example,
T3 was in slice 1 but got swapped into slice 0 after swizzling.
Here's a more detailed explanation. In the current `triton-mlir`
branch, we only compute the relative offset of each tile. So T3's index
in slice 1 is *1*, and it will be swizzled using *1* and the *phase
id*. However, the correct index of T3 should be *3*, its relative
offset from the beginning of the shared memory tensor being swizzled,
and T3 should be swizzled using *3* and the *phase id*.
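To make the mismatch concrete, here is a worked example using the usual
XOR-based swizzle (the `perPhase`/`maxPhase` parameters of the shared
encoding are assumed for illustration, not taken from this PR):

$$
\mathrm{phase} = \lfloor \mathrm{row} / \mathrm{perPhase} \rfloor \bmod \mathrm{maxPhase}, \qquad
\mathrm{col}_{\mathrm{swizzled}} = \mathrm{col} \oplus \mathrm{phase}
$$

With phase id 2, swizzling T3's slice-relative index gives 1 ⊕ 2 = 3,
while swizzling its tensor-relative index gives 3 ⊕ 2 = 1; the two
conventions place T3 in different slices, which is exactly the bug
described above.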
This PR proposes a hacky solution to this problem. We restore the
"correct" offset of each tile by **assuming that slicing on a given dim
happens at most once on the output of insert_slice_async**. I admit
it's risky and fragile.
The other possible solution is to adopt cutlass's swizzling logic,
which limits the indices being swizzled to a "bounding box" matching
the mma instruction being executed. For example, in the following
tensor layout, each 4x4 submatrix is a minimal swizzling unit, and the
entire tensor represents the layout of operand A in `mma.16816`.
<img width="565" alt="image"
src="https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png">
Co-authored-by: Phil Tillet <phil@openai.com>
1. Code clean-up to remove superfluous #includes.
2. Fix two Python test warnings: one relates to ["#"
formats](https://jira.mongodb.org/browse/PYTHON-2343), the other to
regular expression string usage.
Minor fixes to the backend and frontend of atomics; we can now pass one
test without a mask and with the shape aligned to the CTA size.
Co-authored-by: dongdongl <dongdongl@nvidia.com>
This PR fixes the correctness problem of TN/NT GEMM when no SCF is
involved. I'll continue to clean up getLinearIndex/getMultiDimIndex in
a uniform way, which should help avoid various kinds of order issues
(see the worked example below).
This is not fully done yet; merging just to sync the code.
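As a reminder of why the order matters (assuming the convention that
`order[0]` is the fastest-varying dimension, as in TritonGPU layouts;
the shape names are illustrative): for a tensor of shape [M, N] and
multi-dim index (i, j),

$$
\mathrm{order} = [1, 0] \Rightarrow \mathrm{linear} = i \cdot N + j, \qquad
\mathrm{order} = [0, 1] \Rightarrow \mathrm{linear} = i + j \cdot M
$$

Mixing these two conventions between getLinearIndex and
getMultiDimIndex silently transposes accesses, which is the kind of
order issue mentioned above.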
Swizzling is no longer implemented as a separate pass. It is instead
done in a specialized constructor of SharedEncodingAttr, and tested via
Google Test instead of triton-opt + FileCheck.
In the future we may want to implement it as a pass again, once we have
an additional dialect between TritonGPU and LLVM.
Validated hackily by manually modifying the reduction .ttgir in my
local cache. There will be a follow-up PR adding better testing
infrastructure to exercise conversions and reductions on arbitrary
layouts.