Commit Graph

36 Commits

Author SHA1 Message Date
Victor
0ff1a26b70 fixed p2p tests failing when there are no supported p2p devices (#386) 2021-12-06 18:14:03 -08:00
daadaada
858dec8372 [CODEGEN] Add cache modifier to tl.load (#351)
* Add cache modifier to tl.load
* Add comment to cache_modifier
* Remove force_nc_cache
* Update test
2021-10-17 22:14:04 -07:00
Stephen McGroarty
c2e6b90ff1 [CODEGEN] Fixes masked load exception (#342) 2021-10-13 13:31:52 -07:00
Philippe Tillet
c3c0ff0552 [LANGUAGE] Fixed issue with duplicates in large arrays of random uniform numbers (#338) 2021-10-10 15:22:34 -07:00
daadaada
9e9d781912 [CODEGEN] Pipeline fixup (#336) 2021-10-10 01:47:11 -07:00
Philippe Tillet
5123db0b7d [LANG] Various (relatively minor) improvements (#320) 2021-10-04 18:39:40 -07:00
Philippe Tillet
2c287544cb [OPS] Faster and cleaner block-sparse implementation (#311) 2021-09-27 18:25:16 -07:00
Benjamin Lefaudeux
b53f5f3803 [OPS][BLOCKSPARSE] safeguarding a couple more configurations (#292) 2021-09-20 17:15:31 -07:00
Philippe Tillet
a12827848d [FRONTEND] Now using exist_ok=True when creating cache directories (#288) 2021-09-18 23:44:21 -07:00
Philippe Tillet
313d6488f6 [CODEGEN] Fixed over-aggressive division handling in alignment pass (#280) 2021-09-15 00:40:17 -07:00
Philippe Tillet
da5063d898 [TEST] Added performance regression tests (#283) 2021-09-14 01:46:32 -07:00
Philippe Tillet
3e395bc84e [LANG] Fixed semantics of NaN in float comparisons (#281) 2021-09-13 15:06:29 -07:00
Philippe Tillet
585e5cd0ec [TEST] Added test for empty kernel (#271) 2021-09-09 10:20:37 -07:00
Philippe Tillet
94c83d30ce [GENERAL] Removed deprecated driver files and added basic compatibility with rocm (#268)
- Removed driver module -- accelerator runtime is handled by pytorch
- Added basic support for ROCM based on @micmelesse 's PR -- now can execute empty kernel on AMD devices without any compile-time changes
- Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors
2021-09-09 00:04:28 -07:00
Szymon Sidor
8bedcce9be [LANG] Added seeded random number generation - philox (#261) 2021-09-02 22:02:40 -07:00
Philippe Tillet
4ff3714d61 [CODEGEN] Various bugfixes and stability improvements in compiler backend (#240) 2021-08-30 11:50:35 -07:00
milesial
5b29da719d [DRIVER] Add CUDA P2P support (#209) 2021-08-20 21:00:54 -07:00
Philippe Tillet
a714b6b856 [PYTHON] re-activated auto-tuner configurations for triton.ops.matmul (#212) 2021-08-16 22:56:21 -07:00
Philippe Tillet
bb1eebb4b4 [CODEGEN] Fixed bug for visit_reduce1d with 64-bit data-types (#207) 2021-08-14 21:07:01 -07:00
Philippe Tillet
b120d70a0a [CI] Moved from assert_allclose to assert_almost_equal (#200) 2021-08-12 12:00:30 -07:00
Philippe Tillet
2824345065 [LANGUAGE] Added cos/sin (#132) 2021-07-27 12:38:49 -07:00
Philippe Tillet
8cea583109 [IR] Preliminary support for BF16 (#129)
This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.
2021-07-27 12:38:49 -07:00
daadaada
d8d6b715c8 [CODEGEN] Performance improvement on A100 (#125)
Improved codegen for the Ampere GPUs.

    * Make the layout pass recognize the multistage pipelined pattern.
    * Now the pipeline pass can automate the multistage pipelining transformation.
    * Remove extra barriers (from the prefetch pass & WAR) on Ampere.
    * Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
2021-07-27 12:38:49 -07:00
Philippe Tillet
0274429429 [IR] Added IR and Codegen support for atomic_rmw (#120) 2021-07-27 12:38:49 -07:00
Philippe Tillet
59b0ac672a [LANGUAGE] Added support for bitcast (#119) 2021-07-27 12:38:49 -07:00
Philippe Tillet
3ab121dbdb [PYTHON] Added support for tuples (#116) 2021-07-27 12:38:49 -07:00
Philippe Tillet
f81012a8cf [CODEGEN] Fixed atomic_add issue (#112)
* [CODEGEN] Fixed atomic_add issue

* [CODEGEN] Fixed liveness analysis bug for instructions that are not
DCE'd but have no users (e.g., atomic_cas)
2021-07-27 12:38:49 -07:00
Philippe Tillet
bfc0a7587d [PYTHON] Renamed triton.core -> triton.language (#92) 2021-07-27 12:38:49 -07:00
Philippe Tillet
39f4730305 Deprecation of Triton-C and Replacement by decorated Python functions (#86)
This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C by a pure Python API in which kernels are defined as @triton.jit decorated functions. The documentation and tutorials have also been updated to accommodate these changes.

See documentations for more information on the new API
2021-07-27 12:38:49 -07:00
Philippe Tillet
2f80a98776 [BUILD] Added automatic nightly build releases to pip in CI; removed build-time dependence on LLVM and PyTorch (#77)
Recently there has been more and more report about installation issues:

    - Installing Triton before upgrading pytorch can create some issues because Triton uses some torch headers

    - llvm-10-dev not available on some platform; llvm-11-dev not available on e.g. Ubuntu.
    absence of nightly builds

This PR should fix all these issues. Some CMake tricks are used to download and install llvm at build time. Triton Python bindings were modified to remove dependence on pytorch ops. Midnight CI job added to generate binary wheels for all Triton version and update them on pypi's new triton-nightly project.

This PR will also make it very easy to use LLVM forks in the future for whatever needs we have.
2021-07-27 12:38:49 -07:00
Philippe Tillet
62835a0979 [RUNTIME] Added auto-alignment mechanism (#71)
This PR adds an automatic memory alignment mechanism in the Triton runtime. Specifically, the JIT compiler detects the alignment (in bytes) of each pointer argument as well as the largest power of two divisor (between 1 and 16) of each integer argument. Proper .aligned and .multipleof attributes are then added to the Triton-IR on-the-fly for all auto-tunable kernels. There is a cache that remembers all the kernels compiled for each possible configuration.

This PR also includes substantial cleaning of the Python API. This adds 2-3us overhead, mostly due to accessing integer #defines from the auto-tuned compilation options. The previous solution was slightly faster but hacky and potentially unsafe, so this is preferred for now.
2021-07-27 12:38:49 -07:00
Philippe Tillet
567a1a3d17 [CODEGEN] Bugfixes with FP32 async copy 2021-07-27 12:38:49 -07:00
Philippe Tillet
5b83259592 [CODEGEN] Major performance improvements on A100 (#70)
Improved handling of asynchronous copy, scheduling and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks
2021-07-27 12:38:49 -07:00
Jared Kaplan
045ab5d62a [PYTHON] Add Blocksparse Attention Fwd/Bwd Test (#69)
Also includes small bugfix for block-sparse softmax
2021-07-27 12:38:49 -07:00
Philippe Tillet
ce8aa2a41a [CI] Added benchmarking to CI script (#65) 2021-07-27 12:38:49 -07:00
Philippe Tillet
5e3c7f5a60 [PYTHON] Added automated benchmark script (#63)
This adds a bench functionality to the setup.py that can be used to run the benchmark suite and generates a bunch of csv files (and optionally plots)

python setup.py bench
python setup.py bench --with-plots
python setup.py bench --filter=cross_entropy
2021-07-27 12:38:48 -07:00