Commit Graph

386 Commits

Author SHA1 Message Date
Philippe Tillet
e66bf76354 [RUNTIME] Bunch of bugfixes (#372) 2021-11-12 00:55:00 -08:00
Philippe Tillet
2acaa4d0dd [LANG] Added support for constexpr (#361) 2021-10-30 00:32:58 -07:00
Philippe Tillet
b7f0e87dc2 [DRIVER] Removed std::cout log message 2021-10-29 10:42:10 -07:00
Philippe Tillet
d3e584d4ba Revert "[DRIVER] Fixed CUDA 10.1 bug (#357)" (#358)
This reverts commit d35014ba47.
2021-10-26 15:04:49 -07:00
Philippe Tillet
d35014ba47 [DRIVER] Fixed CUDA 10.1 bug (#357) 2021-10-26 11:17:06 -07:00
Philippe Tillet
5ce1b726dc [CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356) 2021-10-24 02:30:46 -07:00
daadaada
858dec8372 [CODEGEN] Add cache modifier to tl.load (#351)
* Add cache modifier to tl.load
* Add comment to cache_modifier
* Remove force_nc_cache
* Update test
2021-10-17 22:14:04 -07:00
Philippe Tillet
9b32075062 [CODEGEN] Some compiler improvements (#349) 2021-10-13 17:49:39 -07:00
Stephen McGroarty
c2e6b90ff1 [CODEGEN] Fixes masked load exception (#342) 2021-10-13 13:31:52 -07:00
daadaada
9e9d781912 [CODEGEN] Pipeline fixup (#336) 2021-10-10 01:47:11 -07:00
daadaada
d5f20dbce0 [IR] Fix error when building in debug mode (#331) 2021-10-08 21:40:20 -07:00
Philippe Tillet
5123db0b7d [LANG] Various (relatively minor) improvements (#320) 2021-10-04 18:39:40 -07:00
Philippe Tillet
2c287544cb [OPS] Faster and cleaner block-sparse implementation (#311) 2021-09-27 18:25:16 -07:00
Philippe Tillet
e22d92c63c [RUNTIME] removed obsolete putenv call (#305) 2021-09-23 17:51:58 -07:00
Philippe Tillet
ec2e7b8f48 [CODEGEN] Fixed nasty bug in coalesce pass (#303) 2021-09-23 17:05:11 -07:00
Philippe Tillet
2849e7a773 [CODEGEN] now re-coalescing before atomics (#298) 2021-09-22 13:35:53 -07:00
Philippe Tillet
6e5b0b4301 [FRONTEND] Added on-disk cache for compiled kernels (#287) 2021-09-18 22:48:26 -07:00
Philippe Tillet
313d6488f6 [CODEGEN] Fixed over-aggressive division handling in alignment pass (#280) 2021-09-15 00:40:17 -07:00
Philippe Tillet
8fdd7e7ed6 [LANG] Fixed semantics of boolean load/store (#282) 2021-09-13 17:39:06 -07:00
Philippe Tillet
3e395bc84e [LANG] Fixed semantics of NaN in float comparisons (#281) 2021-09-13 15:06:29 -07:00
Philippe Tillet
43723ccb95 [FRONTEND] Removed circular import that broke Python 3.6 support (#272) 2021-09-09 13:46:55 -07:00
Philippe Tillet
94c83d30ce [GENERAL] Removed deprecated driver files and added basic compatibility with rocm (#268)
- Removed driver module -- accelerator runtime is handled by pytorch
- Added basic support for ROCM based on @micmelesse 's PR -- now can execute empty kernel on AMD devices without any compile-time changes
- Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors
2021-09-09 00:04:28 -07:00
Philippe Tillet
8a882b215f [CODEGEN] Fixed performance regression on vectorized loads (#259) 2021-09-02 01:07:31 -07:00
Philippe Tillet
768e0ded28 [CODEGEN] Fixed bug in pipelining pass and casting semantics analysis (#257) 2021-09-01 20:58:47 -07:00
daadaada
274d613488 [IR] Better printer (#256) 2021-09-01 09:55:12 -07:00
Philippe Tillet
4ff3714d61 [CODEGEN] Various bugfixes and stability improvements in compiler backend (#240) 2021-08-30 11:50:35 -07:00
daadaada
85426dbaf7 [DOCS] Add comments in layout.h (#249) 2021-08-28 18:07:32 -07:00
milesial
5b29da719d [DRIVER] Add CUDA P2P support (#209) 2021-08-20 21:00:54 -07:00
Philippe Tillet
226fde6ea1 [CODEGEN] Now using atomic_rmw code path for atomic_xchg (#222) 2021-08-17 16:33:23 -07:00
Philippe Tillet
bb1eebb4b4 [CODEGEN] Fixed bug for visit_reduce1d with 64-bit data-types (#207) 2021-08-14 21:07:01 -07:00
Philippe Tillet
298da78058 [CODEGEN/DRIVER] Tweaks for performance optimization (#193) 2021-08-07 16:41:44 -07:00
Philippe Tillet
e8031fe61f [DRIVER] More robust support of unsupported CUDA version (#179) 2021-08-02 09:06:55 -07:00
daadaada
c7060eadb2 [CODEGEN] Fix bug in auto-pipeline pass when a value depends on multiple phis (#164) 2021-07-31 23:40:36 -07:00
Philippe Tillet
2f0f51be50 [DRIVER] No longer crashing when encountering CUDA version >11.4 2021-07-29 11:27:55 -07:00
Philippe Tillet
76c6f24fb6 [CI] Made build-wheels compatible with system LLVM setup (#138)
This speeds up wheelhouse build time by ~10x
2021-07-27 12:38:49 -07:00
Philippe Tillet
8eb63bcb01 [CI] Various improvements to CI (#137)
Add clean-up before CI runs. Now using static LLVM-11 libraries from system rather than recompilation. Still no run-time LLVM dependencies
2021-07-27 12:38:49 -07:00
Philippe Tillet
94ce6aa80f [DRIVER] Added support for CUDA 11.4 (#135) 2021-07-27 12:38:49 -07:00
Philippe Tillet
01276b5153 [FRONTEND] Added compilation flag to force use of .nc cache modifier (#134)
in DRAM loads. /!\ USE CAREFULLY - THIS CAN BREAK CORRECTNESS IF MISUSED
/!\
2021-07-27 12:38:49 -07:00
Philippe Tillet
2824345065 [LANGUAGE] Added cos/sin (#132) 2021-07-27 12:38:49 -07:00
Philippe Tillet
8cea583109 [IR] Preliminary support for BF16 (#129)
This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.
2021-07-27 12:38:49 -07:00
daadaada
0b05e06c0d cu_device::max_shared_memory() now returns max dynamic shared memory size (#127) 2021-07-27 12:38:49 -07:00
daadaada
d8d6b715c8 [CODEGEN] Performance improvement on A100 (#125)
Improved codegen for the Ampere GPUs.

    * Make the layout pass recognize the multistage pipelined pattern.
    * Now the pipeline pass can automate the multistage pipelining transformation.
    * Remove extra barriers (from the prefetch pass & WAR) on Ampere.
    * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
2021-07-27 12:38:49 -07:00
Philippe Tillet
5a51f3e529 [CODEGEN] Bugfix in membar pass (#124)
Membar pass on top of master is buggy with asynchronous copy. For example, it doesn't wait for asynchronous copies to complete before recoalescing accumulator in GEMM, which leads to undefined behavior when the program doesn't enter the loop. This PR proposes
2021-07-27 12:38:49 -07:00
Philippe Tillet
b7b05a560e [DRIVER] Now giving the option to use system ptxas through environment variable (#123) 2021-07-27 12:38:49 -07:00
Philippe Tillet
80c86ecf4a [LANG] Minor semantic changes (#121)
* Now using unordered instead of ordered float (fixes NaN issues)
* Bool -> int32 now converts to 1 rather than -1
* Reduce extend arguments to 32-bits if possible
2021-07-27 12:38:49 -07:00
Philippe Tillet
0274429429 [IR] Added IR and Codegen support for atomic_rmw (#120) 2021-07-27 12:38:49 -07:00
Philippe Tillet
59b0ac672a [LANGUAGE] Added support for bitcast (#119) 2021-07-27 12:38:49 -07:00
Philippe Tillet
38ab4e955a [CODEGEN] Bugfix in prefetch pass (#118) 2021-07-27 12:38:49 -07:00
Philippe Tillet
b5dcac484d [CODEGEN] Small bugfix in atomic-add (#114) 2021-07-27 12:38:49 -07:00
Philippe Tillet
f81012a8cf [CODEGEN] Fixed atomic_add issue (#112)
* [CODEGEN] Fixed atomic_add issue

* [CODEGEN] Fixed liveness analysis bug for instructions that are not
DCE'd but have no users (e.g., atomic_cas)
2021-07-27 12:38:49 -07:00