Commit Graph

14 Commits

Author SHA1 Message Date
Philippe Tillet
061ef3920e [CODEGEN] Fixed bug that caused conditional operator to not always
properly mask load operations

Also includes minor improvement to benchmarking infrastructure
2021-03-08 20:04:26 -05:00
Philippe Tillet
15f8e8c3b7 [CODEGEN] Major performance improvements on A100 (#70)
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-02-21 18:19:39 -05:00
Philippe Tillet
af080740f2 [GENERAL] Merged v1.0alpha into master. Added features are:
- A100 support via mma.16816
- Thread swizzling for conflict-free shared memory accesses without
padding
- Complete overhaul of the LLVM code generation in
codegen/selection/generator.cc to remove overengineering
- Added debugging capabilities in the Python binding
- Compilation error for kernels that spill
2021-01-11 19:23:24 -05:00
Philippe Tillet
cd21151b98 [GENERAL] Fixed some undefined behavior with GCC-9 2020-05-11 11:07:21 -04:00
Philippe Tillet
9da8fe11ed [CODEGEN] Fixed bug that caused missing recoalescing for some transpose
operations
2020-05-11 00:26:03 -04:00
Philippe Tillet
7c09ff80eb [CORE] Fixed several issues that arose in the development of the
torch-blocksparse package:

* Now using warp shuffle in reductions when possible
* Various bugfixes in layout inference
* Added INFINITY, exponential and select
* Better error messages for unimplemented constructs
2020-03-31 18:57:28 -04:00
Philippe Tillet
7621aeda3f [CODEGEN][TRANSFORM][PEEPHOLE] Fixed bug in *1 multiplication 2020-02-19 00:18:55 -05:00
Philippe Tillet
de6fdd5625 [general] removed useless files and includes 2019-10-20 19:29:48 -04:00
Philippe Tillet
ed1b2bc563 more work on padding 2019-09-27 22:15:30 -04:00
Philippe Tillet
575dd06be3 [codegen] more progress towards unified dot implementation 2019-09-26 14:01:28 -04:00
Philippe Tillet
f0013f8bf1 [codegen] [allocation] fixed issues in HMMA 2019-09-23 17:54:42 -04:00
Philippe Tillet
7e0af2118c [codegen] worked around bug seemingly from nvptx/ptxas by simplifying multiplications by 1:
- Generated LLVM-IR looked correct
- Illegal addressing disappeared when running cuda-memcheck
- Illegal addressing disappeared when using nvptx-short-pointer
2019-08-30 16:45:14 -07:00
Philippe Tillet
96b4d5e411 [examples] multiple transposition schemes now supported 2019-08-24 13:08:38 -07:00
Philippe Tillet
732156b942 [general] rename *.cpp -> *.cc 2019-08-23 19:06:39 -07:00