110 Commits

Author SHA1 Message Date
Philippe Tillet
2f8f0042a9 [DOCS] Added matrix multiplication tutorial 2021-03-15 13:57:41 -04:00
Philippe Tillet
061ef3920e [CODEGEN] Fixed bug that caused conditional operator to not always properly mask load operations

Also includes minor improvement to benchmarking infrastructure
2021-03-08 20:04:26 -05:00
Philippe Tillet
a7437e14c5 [RUNTIME] Added auto-alignment mechanism (#71)
This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration.

This PR also includes a substantial cleanup of the Python API. The cleanup adds 2-3 µs of overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.
2021-03-04 01:51:11 -05:00
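The specialization idea in a7437e14c5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Triton runtime code: the names `pow2_divisor` and `specialization_key` are hypothetical, and capping pointer alignment at 16 bytes is an assumption made here so that pointers and integers can share one helper (the commit message only states the 1 to 16 range for integer arguments).

```python
def pow2_divisor(n: int, max_divisor: int = 16) -> int:
    """Return the largest power of two in [1, max_divisor] that divides n."""
    d = max_divisor
    while d > 1:
        if n % d == 0:
            return d
        d //= 2
    return 1


def specialization_key(pointer_addrs, int_args):
    """Build a hashable key from argument alignments (hypothetical helper).

    pointer_addrs: raw addresses as integers; int_args: integer kernel arguments.
    Assumption: pointer alignment is capped at 16 bytes, like integer divisors.
    """
    ptr_align = tuple(pow2_divisor(addr) for addr in pointer_addrs)
    int_mult = tuple(pow2_divisor(val) for val in int_args)
    return (ptr_align, int_mult)


# A cache keyed on the alignment signature: one compiled kernel per configuration.
kernel_cache = {}
key = specialization_key([0x1000, 0x1004], [128, 6])
if key not in kernel_cache:
    kernel_cache[key] = f"compiled variant for {key}"  # stand-in for a real binary
```

Calls whose arguments share the same alignment signature would then reuse the cached entry instead of triggering a new compilation.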
Philippe Tillet
15f8e8c3b7 [CODEGEN] Major performance improvements on A100 (#70)
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-02-21 18:19:39 -05:00
Philippe Tillet
1726197bb4 Improvements w/ Auto-Tuning and standard benchmarks (#57)
[PYTHON] Bug-fixes in the auto-tuning module and improvement of the existing API for it
2021-02-03 16:37:21 -05:00
Philippe Tillet
4a61e65fc9 [LANG] Added __debug_barrier() call to force insertion of a CUDA __syncthreads
2021-01-31 20:09:36 -05:00
Philippe Tillet
6e77538087 [RUNTIME] Auto-tuning now works as expected when the values of autotune_key change
2021-01-31 19:23:51 -05:00
Philippe Tillet
0b23f95b20 [RUNTIME] Added option to print LLVM-IR
Also includes appropriate driver code change for that
2021-01-31 01:01:32 -05:00
Philippe Tillet
79d098450f [PYTHON][TESTS][DOC] Various improvements to the API and code quality:
* Simplified `triton.kernel` API to achieve lower latency:
  > .data_ptr() must now be passed as kernel argument. No more implicit conversion from torch.tensor
  > compilation options are now constant attributes, i.e., opt.d('VAR') becomes opt.VAR
  > torch.device must now be passed explicitly to triton.kernel (no longer inferred from torch.tensor arguments)
* C++ tests moved to `python/tests/`
* C++ tutorial created in `tutorials/`
* Python tutorial created in `python/tutorials/`
* Version changed to 1.0alpha
* No longer copying C++ headers into the Python package
* Added `python/triton/ops/` package for pre-written Triton ops
2021-01-29 17:27:16 -05:00
Philippe Tillet
aef1b2b3c9 [CODEGEN] Fixed bug in recoalesce_inst LLVM codegen 2021-01-19 19:19:51 -05:00
Philippe Tillet
e11077eab9 [RUNTIME] Disable error on spills 2021-01-14 20:33:34 -05:00
Philippe Tillet
af080740f2 [GENERAL] Merged v1.0alpha into master. Added features are:
- A100 support via mma.16816
- Thread swizzling for conflict-free shared memory accesses without padding
- Complete overhaul of the LLVM code generation in codegen/selection/generator.cc to remove overengineering
- Added debugging capabilities in the Python binding
- Compilation error for kernels that spill
2021-01-11 19:23:24 -05:00
Philippe Tillet
836173434e [LANG] Added hacky min/max 2020-12-23 18:17:52 -05:00
Philippe Tillet
75131b4622 [LANG] Added some more atomic_add support 2020-12-01 22:31:32 -05:00
Philippe Tillet
1d2b1b72fc [DRIVER] Removed deprecated files and functions 2020-11-26 23:21:28 -05:00
Philippe Tillet
7710e048f4 [DRIVER] Simplified Driver API by substantially removing reliance on driver::context 2020-11-26 00:38:25 -05:00
Philippe Tillet
bcc5745ea0 [CODEGEN] Fixed bug in atomic_add 2020-11-19 18:19:55 -05:00
Philippe Tillet
e69ed1bdb2 [LANG] Added sqrtf support 2020-11-19 15:41:47 -05:00
Philippe Tillet
51025ca2ad [DRIVER] Improved performance of Host driver code 2020-11-12 02:11:45 -05:00
Philippe Tillet
6c5284ed3b [GENERAL] Various bugfixes 2020-11-11 14:44:56 -05:00
Philippe Tillet
a2d54b5ad3 [GENERAL] LLVM-9 -> LLVM-10 2020-11-07 22:46:18 -05:00
Philippe Tillet
81000db9e9 [PYTHON] Added option to show PTX source code in Python 2020-11-07 02:55:48 -05:00
Philippe Tillet
e2c1ac8d24 [LANG] Added log intrinsic 2020-11-03 15:50:11 -05:00
Philippe Tillet
37ee888d88 [PYTHON] Cleaning C++ bindings 2020-11-02 15:06:08 -05:00
Philippe Tillet
9be1d5afc2 [GENERAL] Various improvements:
* Sparse einsum in triton.ops.einsum
* Hacky support for fixed-tile-size atomic-add
* Various bugfixes in parser
2020-10-25 12:16:40 -07:00
Philippe Tillet
db7a72bfe3 [DRIVER] Removed OpenCL support
There is no plan to support OpenCL anytime soon (Vulkan would be preferred). This removes the corresponding portion of the driver code.
2020-10-13 20:57:32 -07:00
Philippe Tillet
0cbee3ec56 [CODEGEN] More work on the CPU backend 2020-09-14 10:59:40 -04:00
Philippe Tillet
30ac1359b9 [RUNTIME] Lower-level interface for executing functions 2020-08-12 18:33:35 -04:00
Philippe Tillet
18a4cefec7 [CORE] Auto-tuning now copies scalar buffers. Still needs to copy all buffers that are both read from and written to. 2020-05-15 23:21:42 -04:00
Philippe Tillet
05214d22e3 [CODEGEN] Bugfix in Disassociate pass; Added fp32 atomic_add support 2020-05-13 23:21:21 -04:00
Philippe Tillet
9da8fe11ed [CODEGEN] Fixed bug that caused missing recoalescing for some transpose operations
2020-05-11 00:26:03 -04:00
Philippe Tillet
fa5e4af93e [CORE][RUNTIME] Better error message on internal compilation error 2020-04-07 14:01:21 -04:00
Philippe Tillet
e04efc1c85 [GENERAL] Error messages no longer turn the terminal color green 2020-04-03 23:25:29 -04:00
Philippe Tillet
7c09ff80eb [CORE] Fixed several issues that arose in the development of the torch-blocksparse package:

* Now using warp shuffle in reductions when possible
* Various bugfixes in layout inference
* Added INFINITY, exponential and select
* Better error messages for unimplemented constructs
2020-03-31 18:57:28 -04:00
Philippe Tillet
1f1e4ee9ec [PYTHON] Merged blocksparse branch:
* Example for blocksparse matrix multiplication
* Simplified Triton kernel API
* Revived auto-tuning in einsum
2020-03-05 13:08:07 -05:00
Philippe Tillet
f2daff85d2 [GENERAL] Improved caching mechanism:
* Now computing hash in libtriton
* Now only compiling a single pytorch hook per function signature
2020-02-24 16:36:50 -05:00
Philippe Tillet
7621aeda3f [CODEGEN][TRANSFORM][PEEPHOLE] Fixed bug in *1 multiplication 2020-02-19 00:18:55 -05:00
Philippe Tillet
d11d2db6ee [PYTHON][EINSUM] Now handling reduction sizes that are not a multiple of TK
2020-02-17 13:52:58 -05:00
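For readers unfamiliar with the issue behind d11d2db6ee, the sketch below shows one common way to handle a reduction length that is not a multiple of the tile size: the final, partial tile is masked so that out-of-bounds lanes contribute zero. It is a plain-Python illustration only. The function name `tiled_sum` is hypothetical, `tk` stands in for TK, and nothing here is taken from the einsum implementation itself.

```python
def tiled_sum(values, tk):
    """Sum `values` in tiles of width `tk`, masking lanes past the end."""
    total = 0.0
    k = len(values)
    for start in range(0, k, tk):       # one iteration per tile
        for offset in range(tk):        # one iteration per lane in the tile
            idx = start + offset
            in_bounds = idx < k         # mask for the partial last tile
            total += values[idx] if in_bounds else 0.0
    return total
```

With ten elements and `tk = 4`, the third tile covers indices 8 to 11, and the mask drops indices 10 and 11 rather than reading past the end.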
Philippe Tillet
a099c6f7f3 [TRITON][LANG] Added support for bitcast 2020-02-09 20:11:13 -05:00
Philippe Tillet
ce7a00674a [PYTHON][EXAMPLES] Added self-attention example using triton.ops.einsum 2020-01-21 16:45:04 -05:00
Philippe Tillet
78b98fb7cf [GENERAL] Cleaned polymorphic structure of layouts analysis pass 2020-01-21 11:38:39 -05:00
Philippe Tillet
f278d9741a [GENERAL] Merged einsum feature branch. Various features, performance improvements, and bugfixes:

* Added preliminary support for extended Einstein summation in PyTriton
* Significant performance improvement on FP32 kernels containing matrix multiplication
* Added re-coalescing pass for FP16 kernels containing matrix multiplication
* Various bugfixes
2020-01-20 12:42:48 -05:00
Philippe Tillet
f4bbbbe5e4 [PYTHON][OPS] Bugfix in conv fprop 2019-11-01 00:43:02 -04:00
Philippe Tillet
739a8d9061 some work on conv 2019-10-31 18:08:27 -04:00
Philippe Tillet
d65a94c768 [PYTHON][OPS] Added batch normalization op 2019-10-29 17:29:11 -04:00
Philippe Tillet
76651a065f [PYTHON][EXAMPLES] Better einsum example 2019-10-29 12:56:58 -04:00
Philippe Tillet
e11557855f [PYTHON] [OPS] Added einsum implementation 2019-10-26 22:14:50 -04:00
Philippe Tillet
655f43fb5b more work 2019-10-26 15:10:19 -04:00
Philippe Tillet
0770ccf537 [codegen] [selection] disassociation prototype 2019-10-25 09:39:46 -04:00
Philippe Tillet
943bf41b5c [python] [op] added Triton NT einsum 2019-10-21 23:37:39 -04:00