triton

Author	SHA1	Message	Date
Jokeren	65896aef9d	Debugging	2022-12-13 11:17:40 -08:00
Jokeren	3a1c140385	Add script	2022-12-12 12:10:40 -08:00
Philippe Tillet	e5cfa0f633	[FRONTEND] Added a few assertions in `semantic.dot` (#977 )	2022-12-12 00:07:14 -08:00
Philippe Tillet	52accd4c2b	[BACKEND] Add isRow attribute for DotOp tensors whose parent is mmav1 (#970 ) Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>	2022-12-11 19:01:57 -08:00
Yan Chunwei	4fb048873a	[Triton-MLIR][CI] Fix v100 tests to avoid skipping tests mistakenly (#975 )	2022-12-11 04:57:51 +00:00
Keren Zhou	be2f70699c	[BACKEND][FRONTEND] Fix problems with test_matmul (#973 ) 1. Handle induction variable when step is negative 2. Restore async_wait that was accidentally deleted 3. Add missing induction variable in prefetch 4. Add device property functions Co-authored-by: Philippe Tillet <Phil.Tillet@gmail.com>	2022-12-10 20:34:58 -08:00
Yan Chunwei	24fd953f9a	[BACKEND] Refine v100 tests and fix mmav1 numwarps>1 hang issue (#971 ) This PR - Fix numWarps>1 hang issue - add existing test cases in test_gemm.py to CI, and add a common flag `valid_on_Volta` to determine whether the test case should be activated on Volta or just skip. - Currently, the column-major cases are disabled. - Add test_core.py and other tests to Volta CI - the `test_printf.py` failed.	2022-12-09 07:41:22 -08:00
goostavz	793012b4c4	[Triton-MLIR][Backend] Fix mmav1 in case of numWarps > 1 (#972 )	2022-12-09 18:36:05 +08:00
Keren Zhou	3ed36dcb4d	[BACKEND] MMA->DotOperand conversion for chain dot of float32 tensors (#962 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2022-12-08 20:11:51 +00:00
Keren Zhou	83f3b9165b	[FRONTEND][BACKEND] Fix bool and int8 load when the other operand is given (#968 )	2022-12-08 11:52:18 -08:00
Keren Zhou	71c35bcf9c	[Triton-MLIR][BACKEND] Mark float to integer in Arithmetic Dialect as legal (#963 )	2022-12-08 09:07:01 -08:00
Yan Chunwei	4eab9dcedf	[Triton-MLIR][BACKEND] make MMAv1 splitk works (#960 )	2022-12-07 08:58:38 +00:00
Philippe Tillet	b2b793dfb5	[FRONTEND][BACKEND] Fixes for cat / reshape / addptr (#959 ) Most notably, this PR: - changes the traits (and assembly format) of addptr so it can handle offsets that have arbitrary integer width. - adds support for `cat`	2022-12-06 23:29:50 -08:00
Philippe Tillet	981aee7f1e	[FRONTEND] Frontend fixes for uint / for loops / random (#958 )	2022-12-06 20:25:47 -08:00
Philippe Tillet	532e10cf87	[FRONTEND][BACKEND] Clean-up transpositions (#953 )	2022-12-06 09:32:13 -08:00
Crutcher Dunnavant	9490252261	[FRONTEND] Support alternative install locations of system libdevice.10.bc (#951 )	2022-12-06 03:41:44 +00:00
Yan Chunwei	e419781978	[Triton-MLIR][BACKEND] Make mmav1 works on basic cases (#944 ) TODO: - Add more cases - Currently, we just set vec to 4 to make the basic cases pass Issue: - the vec in shared layout is different compared to master branch - when vec=1, it encounters CUDA misalignment error, it doesn't work in master branch as well - when setting vec to the value identical to master branch, the MMA works	2022-12-06 10:57:08 +08:00
Keren Zhou	f2fcaeabf3	[BACKEND] Support dot op when the output is mma encoding and allowtf32 is true (#937 )	2022-12-03 19:14:12 +00:00
Philippe Tillet	8edfe813a5	[FRONTEND][BACKEND] Added `trans` instruction; made flash attention bwd pass work (#943 )	2022-12-03 09:58:24 -08:00
donproc	9def1bcebf	[TRITON-MLIR][FRONTEND]minor fix to run through atomic_cas test (#925 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-01 13:43:26 +00:00
Keren Zhou	7d90a07d0b	[Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929 ) 1. Improve pipeline's comment 2. Decompose insert_slice_async when load vector size is not supported 3. Add a test that could fail our gemm code Copy my comments here: There's a knob that may cause performance regression when decomposition has been performed. We should remove this knob once we have thorough analysis on async wait. Currently, we decompose `insert_slice_async` into `load` and `insert_slice` without knowing which `async_wait` is responsible for the `insert_slice_async`. To guarantee correctness, we blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed. There are two options to improve this: 1. We can perform a dataflow analysis to find the `async_wait` that is responsible for the `insert_slice_async` in the backend. 2. We can modify the pipeline to perform the decomposition before the `async_wait` is inserted. However, it is also risky because we don't know the correct vectorized shape yet in the pipeline pass. Making the pipeline pass aware of the vectorization could introduce additional dependencies on the AxisInfoAnalysis and the Coalesce analysis.	2022-11-30 10:07:34 -08:00
Philippe Tillet	9bb54402b3	[FRONTEND][BACKEND] Small fixes to multiple_of, num_programs, axisinfo; enable block-sparse tests (#927 )	2022-11-29 20:00:34 +01:00
Qingyi Liu	9d31998a9d	[Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918 )	2022-11-27 22:59:27 -08:00
goostavz	630dc315ee	[Triton-MLIR] uncomment the UT in test_gemm that has already been fixed (#920 )	2022-11-28 11:23:20 +08:00
Keren Zhou	35c9ec1103	[Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917 )	2022-11-26 12:30:38 -08:00
donproc	f63be0e9b5	[TRITON-MLIR][BACKEND]support atomic_cas (#914 ) 1. support atomics-cas 2. add xchg support in atomic_rmw Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-25 12:02:08 +08:00
donproc	8925c2cd11	[TRITON-MLIR][BACKEND]AtomicRMWOp supports scalar (#903 ) AtomicRMWOp supports scalar Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-23 07:59:09 +00:00
Keren Zhou	2e33352419	[Triton-MLIR] Fix side effects (#906 ) Try to add proper side effects for triton operations. The CSE pass could fail, hang, or output incorrect IRs for unknown reasons, if side effects are not defined properly. For instance, suppose we have two shared memory tensors: ``` %a = triton_gpu.alloc_tensor shape0, share_encoding0 %b = triton_gpu.alloc_tensor shape0, share_encoding0 ``` The CSE pass will consider `%a` and `%b` are the same thing and eliminate one of them, resulting in mysterious outcomes.	2022-11-22 23:29:18 -08:00
Yan Chunwei	037f9efa95	[Triton-MLIR][BACKEND] Fix wpt overflow issue in mma v2 (#904 ) This PR 1. Fix wpt overflow issue in mma v2 2. Refine transpose logic	2022-11-23 11:27:15 +08:00
Philippe Tillet	23f71daa27	[OPTIMIZER] Fixed up order of shared layouts (#881 )	2022-11-21 06:25:02 +01:00
donproc	afaf59b0c9	[TRITON-MLIR][BACKEND] Atomic support mask (#889 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-19 19:57:19 +08:00
Philippe Tillet	dab4855bdf	[TESTING] Added infrastructure for executing TTGIR program and test for layout conversions (#885 )	2022-11-18 07:46:45 +01:00
goostavz	9ea6135eb5	[Triton-MLIR][Backend] Some cleanup in getMultiDimIndex/getLinearIndex (#880 )	2022-11-18 01:19:21 +00:00
donproc	5eee738df7	[Triton-MLIR][FRONTEND] [BACKEND] fix atomics (#879 ) minor fix to backend and frontend of atomics, we can pass 1 test without mask and the shape aligned with CTA size now Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-16 12:25:15 +08:00
Qingyi Liu	4c4159c6fa	[Triton-MLIR] Add ex2.approx implementation for ExpOp and fix smem allocation for ReduceOpConversion (#875 )	2022-11-15 01:27:32 +00:00
goostavz	c28cfd821b	[Triton-MLIR][Backend] Fix convert_layout blocked->shared in non-default order (#876 ) This PR fixes the problem of TN/NT GEMM correctness when no SCF is involved. I'll continue to clean up getLinearIndex/getMultiDimIndex in a uniform way, which should be beneficial to avoid different kinds of order issues. This is not fully done yet, just merge to sync the code.	2022-11-15 09:02:46 +08:00
Yan Chunwei	1eedaf7bec	[Triton-MLIR][BACKEND] adapt DotOp layout for FMADot (#872 )	2022-11-14 16:56:30 +08:00
Chenggang Zhao	516a241234	[Triton-MLIR] Fix some typos (#874 ) Fix some typos	2022-11-13 18:15:53 -08:00
Philippe Tillet	2aa538ec2e	[BACKEND] Added support for mma layouts in reductions (#863 ) Validated hackily by manually modifying the reduction .ttgir in my local cache. There will be a follow-up PR adding some better testing infrastructure to test out conversions and reductions on arbitrary layouts.	2022-11-10 09:58:07 -08:00
Chenggang Zhao	57fd1864a7	[Triton-MLIR] Support FP8 (#864 ) Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 15:53:06 +08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712 ) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00
Yan Chunwei	0c87360657	[Triton-MLIR][Backend] Port FMADot conversion for DotOp (#844 ) Co-authored-by: ben-zhang-609 <benzh609@gmail.com>	2022-11-09 12:57:50 +08:00
Yan Chunwei	de5b84c476	[Triton-MLIR][Backend] Fix mma<v2> int8 precision error (#850 ) Fix mma.16816 s8 precision error Co-authored-by: ben-zhang-609 <benzh609@gmail.com>	2022-11-09 12:23:43 +08:00
goostavz	080b4addf8	[Triton-MLIR][Backend] Fix the order in linear/delinear and a few bugs in reduce conversion (#851 ) 1, fix the order in linearize/delinearize, which fix the error of order in emitIndices; 2, fix the selecting of fast implementation in reduce codegen; 3, fix the redundant barrier in reduce codegen; 4, fix the index mapping of the second round of warp_shuffle in shuffle version of reduce codegen. Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2022-11-08 10:10:09 -08:00
Philippe Tillet	976cf12af1	[OPTIMIZER] Fixed memory coalescing (#847 )	2022-11-07 06:22:18 -08:00
ben-zhang-609	84ad215268	[Triton-MLIR] Enable libdevice for ptx backend when has external functions. (#848 ) At the phase from ptx to cubin, check whether llvm::Module has external functions. if has, link with libdevice at: https://github.com/openai/triton/blob/triton-mlir/python/triton/language/libdevice.10.bc	2022-11-07 08:01:50 +00:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837 ) ## Features - Allow taking a block of tensor slice, as long as each dimension is contiguous (unit stride). - Fix some problems in `insert_slice_async`'s semantic. - More general verification for ops that return shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantic. May submit another PR later to support similar semantic like `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3d tensor ops are broken. - Strided accesses are not allowed. - May cause a little performance slowdown since we are passing strides as values but not constants (e.g., int). It would be difficult to pass strides as attributes when we have control flows. A block argument is possible to accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Philippe Tillet	b6dbe959f0	[RUNTIME] Re-vamped cache so users can manually patch IR / ptx / cubin files (#845 ) Also deprecates a couple of tests	2022-11-04 10:57:29 -07:00
Keren Zhou	4218e68d74	[Triton-MLIR] [Frontend] Return a scalar if all input args are scalar (#839 )	2022-11-03 20:27:47 -07:00
ben-zhang-609	5feb6e24f9	[Triton-MLIR]Add ptx vprintf support (#825 ) Not know how to write unit test for this feature. Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>	2022-11-02 16:39:09 +08:00

91 Commits