Philippe Tillet
5ba5a77561
[BUILD] Remove compilation warnings
2021-07-27 12:38:49 -07:00
Philippe Tillet
183878dce5
[DOCS] Added matrix multiplication tutorial
2021-07-27 12:38:49 -07:00
Philippe Tillet
5b9afaa688
[CODEGEN] Fixed bug that caused conditional operator to not always
properly mask load operations
Also includes minor improvement to benchmarking infrastructure
2021-07-27 12:38:49 -07:00
Philippe Tillet
62835a0979
[RUNTIME] Added auto-alignment mechanism ( #71 )
This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration.
This PR also includes substantial cleaning of the Python API. This adds 2-3us overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.
2021-07-27 12:38:49 -07:00
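The argument classification described in the commit above can be sketched roughly as follows. This is only an illustration of the idea, not the Triton runtime code: `pow2_divisor` and `specialization_key` are hypothetical names, pointers are modeled as plain integer addresses, and the cap of 16 is taken from the commit message.

```python
def pow2_divisor(n: int, max_divisor: int = 16) -> int:
    """Largest power of two in [1, max_divisor] that divides n (hypothetical helper)."""
    if n == 0:
        # Zero is divisible by every power of two, so report the cap.
        return max_divisor
    # n & -n isolates the lowest set bit, i.e. the largest power of two dividing n.
    return min(n & -n, max_divisor)


def specialization_key(args, max_divisor: int = 16) -> tuple:
    """Per-argument divisors; a tuple like this could index a cache of compiled kernels."""
    return tuple(pow2_divisor(a, max_divisor) for a in args)


# Hypothetical cache: one compiled kernel per distinct alignment configuration.
kernel_cache = {}
key = specialization_key((0x7F0000000010, 12, 7))
kernel_cache.setdefault(key, "compiled-kernel-placeholder")
```

Two calls whose arguments share the same divisors map to the same key, so under this assumption they would reuse one compiled kernel.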
Philippe Tillet
567a1a3d17
[CODEGEN] Bugfixes with FP32 async copy
2021-07-27 12:38:49 -07:00
Philippe Tillet
11215f0f03
[CODEGEN] Now initializing cp.async to zero when predicate is false
WARNING: case for non-zero initialization is still not handled. Will
require manual copy to shared
2021-07-27 12:38:49 -07:00
Philippe Tillet
5b83259592
[CODEGEN] Major performance improvements on A100 ( #70 )
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-07-27 12:38:49 -07:00
Philippe Tillet
3ca40b05cf
[DRIVER] Added options for developers to cache PTX file so that it can
be manually modified
2021-07-27 12:38:49 -07:00
Philippe Tillet
b8a52c70c9
[LANG] Now requiring tiles have power of 2 number of elements
2021-07-27 12:38:48 -07:00
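A minimal sketch of the kind of check this requirement implies, assuming a tile is described by a tuple of dimension sizes. `check_tile_shape` is a hypothetical name for illustration and is not taken from the Triton sources.

```python
from math import prod


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def check_tile_shape(shape: tuple) -> int:
    """Return the element count of a tile, rejecting counts that are not a power of two."""
    numel = prod(shape)
    if not is_power_of_two(numel):
        raise ValueError(f"tile {shape} has {numel} elements, which is not a power of 2")
    return numel
```

For example, a 16x32 tile passes the check, while a 3x4 tile is rejected.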
Philippe Tillet
6fb4800f57
Improvements w/ Auto-Tuning and standard benchmarks ( #57 )
[PYTHON] Bug-fixes in the auto-tuning module and improvement of the existing API for it
2021-07-27 12:38:48 -07:00
Philippe Tillet
ad5a30bae1
[LANG] Added __debug_barrier() call to force insertion of a CUDA
__syncthreads
2021-07-27 12:38:48 -07:00
Philippe Tillet
3fde4b8f5b
[RUNTIME] Auto-tuning now works as expected when the values of
autotune_key change
2021-07-27 12:38:48 -07:00
Philippe Tillet
0b025db2ee
[RUNTIME] Added option to print LLVM-IR
Also includes appropriate driver code change for that
2021-07-27 12:38:48 -07:00
Philippe Tillet
9f9d7b8840
[LANG] Fixed parsing error for built-in functions exp/log/sqrtf
2021-07-27 12:38:48 -07:00
Philippe Tillet
269ebc12e5
[PYTHON][TESTS][DOC] Various improvements to the API and code quality:
* Simplified `triton.kernel` API to achieve lower latency:
> .data_ptr() must now be passed as kernel argument. No more implicit
conversion from torch.tensor
> compilation options are now constant attributes, i.e., opt.d('VAR')
becomes opt.VAR
> torch.device must now be passed explicitly to triton.kernel (no
longer inferred from torch.tensor arguments)
* C++ tests moved to `python/tests/`
* C++ tutorial created in `tutorials/`
* Python tutorial created in python/tutorials/
* Version changed to 1.0alpha
* No longer copying C++ headers into the Python package
* added python/triton/ops/ package for pre-written Triton ops
2021-07-27 12:38:48 -07:00
Philippe Tillet
a5a477c36b
[CODEGEN] Fixed bug in recoalesce_inst LLVM codegen
2021-07-27 12:38:48 -07:00
Philippe Tillet
376c876eb8
[RUNTIME] Disable error on spills
2021-07-27 12:38:48 -07:00
Philippe Tillet
3b36a1e60c
[CODEGEN] Fixed issue in traversal order for atomic_add and store_inst
2021-07-27 12:38:48 -07:00
Philippe Tillet
083bbd1e8d
[GENERAL] Merged v1.0alpha into master. Added features are:
- A100 support via mma.16816
- Thread swizzling for conflict-free shared memory accesses without
padding
- Complete overhaul of the LLVM code generation in
codegen/selection/generator.cc to remove overengineering
- Added debugging capabilities in the Python binding
- Compilation error for kernels that spill
2021-07-27 12:38:48 -07:00
Philippe Tillet
c4fceeea49
[LANG] Added hacky min/max
2021-07-27 12:38:48 -07:00
Yan Da
27dc780871
[IR] Check constant_int type
2021-07-27 12:38:48 -07:00
Yan Da
01ef691b84
[LANG] Fix gep bug in INC
2021-07-27 12:38:48 -07:00
Yan Da
e9b2335224
[LANG] Add support for POSTFIX_INC and POSTFIX_DEC, and pointer type
2021-07-27 12:38:48 -07:00
Yan Da
05b95b7fa6
[LANG] Add support for PREFIX_INC and PREFIX_DEC.
2021-07-27 12:38:48 -07:00
Philippe Tillet
fd5c72d6a0
[LANG] Added some more atomic_add support
2021-07-27 12:38:48 -07:00
Philippe Tillet
5e8f4c934c
[DRIVER] Better exception handling of invalid ptx
2021-07-27 12:38:48 -07:00
Philippe Tillet
44ca2c0cb8
[DRIVER] Removed deprecated files and functions
2021-07-27 12:38:48 -07:00
Philippe Tillet
7ab2c2a356
[DRIVER] Removed obsolete SetArg
2021-07-27 12:38:48 -07:00
Philippe Tillet
4f08d87fed
[DRIVER] Simplified Driver API by substantially removing reliance on driver::context
2021-07-27 12:38:48 -07:00
Philippe Tillet
f42b04d925
[DRIVER] Added (slow) support for CUDA11 and Ampere
2021-07-27 12:38:48 -07:00
Philippe Tillet
baa858aa74
[CODEGEN] Fixed bug in atomic_add
2021-07-27 12:38:48 -07:00
Philippe Tillet
7d095ec686
[LANG] Added sqrtf support
2021-07-27 12:38:48 -07:00
Philippe Tillet
073fddffc1
[PYTHON] Compiling Triton in Release mode now...
2021-07-27 12:38:48 -07:00
Philippe Tillet
5d84fde733
tmp
2021-07-27 12:38:48 -07:00
Philippe Tillet
da287bb710
[CODEGEN] Progress on atom.add.f16x2
2021-07-27 12:38:48 -07:00
Philippe Tillet
a77c925dfd
[DRIVER] Improved performance of Host driver code
2021-07-27 12:38:48 -07:00
Philippe Tillet
8f8d36c7a4
[GENERAL] Various bugfixes
2021-07-27 12:38:48 -07:00
Philippe Tillet
50587bbf4b
[General] LLVM-9 -> LLVM-10
2021-07-27 12:38:48 -07:00
Philippe Tillet
8f3ee53f24
[PYTHON] Added option to show PTX source code in Python
2021-07-27 12:38:48 -07:00
Philippe Tillet
f152150e7d
[LANG] Added log intrinsic
2021-07-27 12:38:48 -07:00
Philippe Tillet
02a6e81b88
[PYTHON] Cleaning C++ bindings
2021-07-27 12:38:48 -07:00
Philippe Tillet
34f1d5e565
[CODEGEN] Fixed bug in 2D reductions
2021-07-27 12:38:48 -07:00
Philippe Tillet
049ab989b5
[GENERAL] Various improvements:
* Sparse einsum in triton.ops.einsum
* Hacky support for fixed-tile-size atomic-add
* Various bugfixes in parser
2021-07-27 12:38:48 -07:00
Philippe Tillet
664d3cae89
[DRIVER] Removed OpenCL support
There is no plan to support OpenCL anytime soon (Vulkan would be preferred). Removing the corresponding portion of the driver code
2021-07-27 12:38:48 -07:00
Philippe Tillet
840308ab5d
[CODEGEN] More work on the CPU backend
2021-07-27 12:38:48 -07:00
Philippe Tillet
acff1b5e05
[RUNTIME] Lower-level interface for executing functions
2021-07-27 12:38:48 -07:00
Philippe Tillet
ba9955ae39
[CODEGEN][ANALYSIS] Fixed issue in layout inference
2021-07-27 12:38:48 -07:00
Philippe Tillet
29a0ad6c4d
[DRIVER] Now always using PTXv6.4
2021-07-27 12:38:48 -07:00
Philippe Tillet
8bdfbe2514
[ANALYSIS] Replaced min by gcd in layout inference
2021-07-27 12:38:48 -07:00
Philippe Tillet
e18f169a39
[CODEGEN] Fixed various issues in alignment inference pass
2021-07-27 12:38:48 -07:00