Commit Graph

88 Commits

Author SHA1 Message Date
Da Yan
210a296699 [BACKEND] bf16 flash-attention (#636) 2022-08-26 20:40:55 -07:00
Daniil Fukalov
cc79376222 Fix deprectaion warning on CreateGEP(Value *, ArrayRef<Value *>, const Twine &) (#608)
This variant of CreateGEP() is already removed in LLVM 14.
2022-08-07 17:10:18 -07:00
daadaada
9b2bc88d11 [BACKEND] Better bf16 support (#588) 2022-07-19 21:22:37 -07:00
Keren Zhou
4912916c11 [FRONTEND] Added support for element-wise function defined in external LLVM bitcode (e.g., libdevice) (#562) 2022-07-13 15:52:21 -07:00
Philippe Tillet
4a399a7e40 [BACKEND] Fix some bugs (atomics, a segfault...) (#577)
This should fix #558 , #573 and #574
2022-07-06 20:03:04 -07:00
Philippe Tillet
f733327ba4 [BACKEND][CODEGEN] Disabling L2 residency control by default (#570) 2022-06-29 17:05:13 -07:00
Philippe Tillet
5b4c8f221e [BACKEND] Compiler improvements (#557)
This PR adds several optimization capabilities in the compiler backend:
- Now using inline PTX for `tl.store`, making it possible to use things like evict_last
- For A100, mma layout can be directly converted to shared memory
- For A100, an additional "transpose" argument in `dot` allows tensors to be loaded once and used both row- and col- major.
- Fixed liveness analysis; this was broken.
- Now can load/store directly mma layout without converting. Useful for when tl.dot accumulator is initialized with DRAM data inside of an inner loop.
- `tl.dot` can now take LHS inputs in registers when it comes from a previous `tl.dot` instruction. Useful for e.g. fused attention.
2022-06-27 11:49:19 -07:00
Keren Zhou
87413bc925 [BACKEND] Fix layout convert for non-contiguous input (#564) 2022-06-25 23:12:03 -07:00
Keren Zhou
b5e728cb14 Add argmin argmax (#552) 2022-06-15 13:55:20 -07:00
Keren Zhou
93209c07e0 [BACKEND][CODEGEN] Fix reduce uint (#547) 2022-06-13 16:43:57 -07:00
Philippe Tillet
58c8889235 [FRONTEND] Fix scanline layout (#548) 2022-06-13 16:21:10 -07:00
Mengchi Zhang
2cdc6d35c4 [FRONTEND] Give col_per_thread an initial value to make the compiler happy (#535)
Signed-off-by: Mengchi Zhang <mengchi@fb.com>
2022-06-06 12:48:23 -07:00
Philippe Tillet
8876e53206 [BACKEND] Restored reduction bugfixes 2022-06-03 11:38:52 -07:00
Philippe Tillet
a60374a597 Revert "[BACKEND] Various bug fixes; making reductions faster (#533)".
This is a more stable commit that produce bitwise identical code to earlier
versions. Using commits after this one may lead to slightly different numerics
2022-06-03 11:36:06 -07:00
Philippe Tillet
3e7500dfe6 [BACKEND] Various bug fixes; making reductions faster (#533) 2022-05-31 17:14:44 -07:00
Philippe Tillet
d35617bea1 [BACKEND][CODEGEN] Faster reduction for scanline layout (#516) 2022-05-14 15:26:13 -07:00
Sriram Murali
7c9bc5a47b [CODEGEN] Change return type of generator::packed_type to appease build warnings (#507) 2022-05-04 20:03:37 -07:00
Philippe Tillet
ae2a1ab225 [BACKEND] Alignment pass improvements (#503) 2022-04-25 21:16:00 -07:00
Philippe Tillet
7d544799a0 [BACKEND] Now disabling L2 eviction policy for sm < 80 2022-04-25 09:35:36 -07:00
Philippe Tillet
bda209002e [BACKEND][CODEGEN] vectorization bugfix (#502) 2022-04-23 13:18:33 -07:00
Philippe Tillet
0cc3b1129b [BACKEND][CODE_GEN] eviction policies now also apply to L2 (#501) 2022-04-21 23:56:01 -07:00
Philippe Tillet
2bed6fc850 [LANG] Added support for device functions (#484) 2022-04-03 20:58:16 -07:00
Philippe Tillet
e0cc488055 [FRONTEND] Added tl.clock and tl.globaltimer (#485) 2022-03-28 16:15:43 -07:00
Philippe Tillet
bb5765df5c [CODEGEN] Now padding shared memory for layout conversion (#468) 2022-03-03 22:19:05 -08:00
Philippe Tillet
98ed7db8c1 [CODEGEN] Improvements and bugfixes (#463) 2022-02-24 14:56:24 -08:00
TC
137bb67fad [LANG] Add fp16 to fp8 conversion (#444) 2022-02-02 20:42:09 -08:00
Philippe Tillet
807d8a1945 [ALL] Merge master (#447) 2022-01-30 20:21:20 -08:00
Philippe Tillet
bef76b142a [BACKEND] float division is now approximate by default (#446) 2022-01-29 18:29:29 -08:00
daadaada
59d371c6eb [BACKEND] Added Int8 mma (#440) 2022-01-27 09:12:44 -08:00
daadaada
94a2e10fe5 [BACKEND] Add bf16 & tf32 mma supports (on A100) (#426) 2022-01-11 10:20:31 -08:00
Philippe Tillet
03f1256f60 [FRONTEND] Added volatile flag for load (#407) 2021-12-30 22:33:24 -08:00
Philippe Tillet
558555630f [FRONTEND] Added xor_sum 2021-12-16 17:55:35 -08:00
Madeleine Thompson
e575ae3443 [FRONTEND] Minor accumulated style and warning fixes (#388)
- Fix some whitespace.
- Make an undeclared dependency on `pytest` explicit.
- Fix deprecated `description-file` use.
- `#ifdef` out a deprecated `PyEval_InitThreads` call.
- Use a slightly different numpy invocation in `test_random.py` to quiet down overflow warnings in tests.
- Fix a deprecated cast in `test_core.py`.
- Suppress a warning about `visit_Constant` in Python 3.9+; we can't migrate yet because it'd break Python 3.6 and 3.7.
- Use chained exceptions for `CompilationError` rather than rolling our own; it makes the error messages nicer.
- Add a `__str__` for `tl.dtype` to make debugging kernels easier; it lets you `print` a dtype to see what type was inferred.
- Fix a few bad escapes.
2021-12-10 15:19:20 -08:00
Philippe Tillet
8ec9f037bb [BACKEND/CODE_GEN] Fixed float32 matmul problem (#380) 2021-11-30 22:00:56 -08:00
Philippe Tillet
e66bf76354 [RUNTIME] Bunch of bugfixes (#372) 2021-11-12 00:55:00 -08:00
Philippe Tillet
5ce1b726dc [CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356) 2021-10-24 02:30:46 -07:00
daadaada
858dec8372 [CODEGEN] Add cache modifier to tl.load (#351)
* Add cache modifier to tl.load
* Add comment to cache_modifier
* Remove force_nc_cache
* Update test
2021-10-17 22:14:04 -07:00
Philippe Tillet
5123db0b7d [LANG] Various (relatively minor) improvements (#320) 2021-10-04 18:39:40 -07:00
Philippe Tillet
2c287544cb [OPS] Faster and cleaner block-sparse implementation (#311) 2021-09-27 18:25:16 -07:00
Philippe Tillet
8a882b215f [CODEGEN] Fixed performance regression on vectorized loads (#259) 2021-09-02 01:07:31 -07:00
Philippe Tillet
4ff3714d61 [CODEGEN] Various bugfixes and stability improvements in compiler backend (#240) 2021-08-30 11:50:35 -07:00
daadaada
85426dbaf7 [DOCS] Add comments in layout.h (#249) 2021-08-28 18:07:32 -07:00
Philippe Tillet
226fde6ea1 [CODEGEN] Now using atomic_rmw code path for atomic_xchg (#222) 2021-08-17 16:33:23 -07:00
Philippe Tillet
bb1eebb4b4 [CODEGEN] Fixed bug for visit_reduce1d with 64-bit data-types (#207) 2021-08-14 21:07:01 -07:00
Philippe Tillet
01276b5153 [FRONTEND] Added compilation flag to force use of .nc cache modifier (#134)
in DRAM loads. /!\ USE CAREFULLY - THIS CAN BREAK CORRECTNESS IF MISUSED
/!\
2021-07-27 12:38:49 -07:00
Philippe Tillet
2824345065 [LANGUAGE] Added cos/sin (#132) 2021-07-27 12:38:49 -07:00
Philippe Tillet
8cea583109 [IR] Preliminary support for BF16 (#129)
This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.
2021-07-27 12:38:49 -07:00
daadaada
d8d6b715c8 [CODEGEN] Performance improvement on A100 (#125)
Improved codegen for the Ampere GPUs.

    * Make the layout pass recognize the multistage pipelined pattern.
    * Now the pipeline pass can automate the multistage pipelining transformation.
    * Remove extra barriers (from the prefetch pass & WAR) on Ampere.
    * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
2021-07-27 12:38:49 -07:00
Philippe Tillet
80c86ecf4a [LANG] Minor semantic changes (#121)
* Now using unordered instead of ordered float (fixes NaN issues)
* Bool -> int32 now converts to 1 rather than -1
* Reduce extend arguments to 32-bits if possible
2021-07-27 12:38:49 -07:00
Philippe Tillet
0274429429 [IR] Added IR and Codegen support for atomic_rmw (#120) 2021-07-27 12:38:49 -07:00