Da Yan
3e2953f357
Allow multiple_of and max_contiguous to accept n-d values ( #617 )
2022-08-10 09:59:32 -07:00
Daniil Fukalov
7b91c7befd
Fix "warning: control reaches end of non-void function". ( #607 )
2022-08-02 16:12:48 -07:00
Sharad Vikram
968f59027e
Expose `module.print` in pybind ( #604 )
2022-07-29 21:36:08 -07:00
Da Yan
f28caddbf8
[FRONTEND] Allow tl.where to select pointers ( #595 )
2022-07-21 09:54:27 -07:00
Keren Zhou
af85f5fa46
[FRONTEND] Refresh cache when the source code of outlined functions are changed ( #590 )
2022-07-20 17:34:07 -07:00
Keren Zhou
4912916c11
[FRONTEND] Added support for element-wise function defined in external LLVM bitcode (e.g., libdevice) ( #562 )
2022-07-13 15:52:21 -07:00
Keren Zhou
b5e728cb14
Add argmin argmax ( #552 )
2022-06-15 13:55:20 -07:00
Keren Zhou
93209c07e0
[BACKEND][CODEGEN] Fix reduce uint ( #547 )
2022-06-13 16:43:57 -07:00
Keren Zhou
38573d1261
[FRONTEND] Return allocated registers and spilled registers for users ( #541 )
2022-06-07 18:37:12 -07:00
Bert Maher
43fec2adca
[FRONTEND] Add binding for create_int_to_ptr ( #526 )
2022-05-25 15:26:18 -07:00
Philippe Tillet
9f08ecd684
[FRONTEND] Semantic analysis refactor ( #491 )
...
Moved dispatch.cc to semantic.py (@ptillet)
Integer signedness analysis was moved from C++ to python (@daadaada)
Cleaner frontend types (@daadaada)
Moved SSA construction to a separate object (@ptillet)
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
2022-04-06 16:13:53 -07:00
Philippe Tillet
2bed6fc850
[LANG] Added support for device functions ( #484 )
2022-04-03 20:58:16 -07:00
Philippe Tillet
e0cc488055
[FRONTEND] Added `tl.clock` and `tl.globaltimer` ( #485 )
2022-03-28 16:15:43 -07:00
Philippe Tillet
76a9ee50a8
Revert "[FRONTEND] Semantic analysis refactor ( #473 )" ( #483 )
...
This reverts commit 539961072c.
2022-03-24 17:16:50 -07:00
daadaada
539961072c
[FRONTEND] Semantic analysis refactor ( #473 )
...
Moved dispatch.cc to semantic.py
Integer signedness now moved from C++ to python
Cleaner frontend type
Co-authored-by: Phil Tillet <phil@openai.com>
2022-03-16 21:25:30 -07:00
Philippe Tillet
d4d8eaf6c0
[FRONTEND] improved caching mechanism ( #474 )
...
Co-authored-by: Greg Brockman <gdb@gregbrockman.com>
Co-authored-by: Christopher Hesse <christopherhesse@users.noreply.github.com>
2022-03-15 12:20:51 -07:00
Philippe Tillet
98ed7db8c1
[CODEGEN] Improvements and bugfixes ( #463 )
2022-02-24 14:56:24 -08:00
Philippe Tillet
9b100302d3
[FRONTEND] Now using pybind11 to release GIL ( #458 )
2022-02-10 01:57:39 -08:00
Philippe Tillet
7b48340ffd
[CI] Some fixes for the build ( #451 )
2022-02-06 19:11:33 -08:00
Philippe Tillet
807d8a1945
[ALL] Merge master ( #447 )
2022-01-30 20:21:20 -08:00
Philippe Tillet
bef76b142a
[BACKEND] float division is now approximate by default ( #446 )
2022-01-29 18:29:29 -08:00
Philippe Tillet
4c97d1ecd7
[FRONTEND] Bunch of fixes here and there ( #436 )
2022-01-20 10:55:59 -08:00
Philippe Tillet
4c94359199
[FRONTEND] Alignment fix-up ( #428 )
2022-01-11 23:11:58 -08:00
Madeleine Thompson
0ab9d67bad
uint8, uint16, uint32, and uint64 in kernels ( #413 )
...
A forthcoming PR will update the RNG to use these types.
Also:
- Add tests for the `//`, `<<`, and `>>` operators.
- Change `TensorWrapper` to unwrap objects when the resulting object would be simpler.
- Clean up `throw_unreachable`, since it was triggering compiler warnings.
2022-01-05 15:27:17 -08:00
Philippe Tillet
03f1256f60
[FRONTEND] Added `volatile` flag for load ( #407 )
2021-12-30 22:33:24 -08:00
Madeleine Thompson
985798f101
add missing bfloat16 repr and improve assertions ( #403 )
...
- `BF16TyID` was missing a repr implementation.
- Throw a better exception on impossible casts.
- Add a few assertions. Tested with a debug build.
- Add `pointer_dtype.__str__` to aid kernel debugging.
2021-12-23 17:01:17 -08:00
Philippe Tillet
a425f24d54
[FRONTEND] Better cache hook ( #400 )
...
Added an additional `repr` argument to the cache hook, which represents a human-readable string representation of the signature and argument attributes associated with the compiled binary.
2021-12-21 21:29:47 -08:00
daadaada
39d4bfed83
[OPS] Add performance model for gemm/gemv ( #397 )
...
Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.
2021-12-21 09:56:10 -08:00
daadaada
4a8953efa3
[FRONTEND] Replace the legacy print call in triton.cc with the SlotTracker-based one. ( #396 )
...
The legacy print call will assign names (e.g., %10) to values, which can be undesirable in some cases.
2021-12-18 18:03:22 -08:00
Philippe Tillet
558555630f
[FRONTEND] Added xor_sum
2021-12-16 17:55:35 -08:00
Philippe Tillet
e31b9b4e66
[RUNTIME] Better support for `None` ( #387 )
...
* regression test fails but it doesn't make sense to me.
2021-12-09 13:21:22 -08:00
Philippe Tillet
f23bf55f15
[RUNTIME] release the gil on launch ( #383 )
2021-12-03 13:01:01 -08:00
Philippe Tillet
c86ad9c9ab
[FRONTEND] Added default arguments to non-kernel @triton.jit'd function ( #379 )
2021-11-29 19:11:26 -08:00
Philippe Tillet
5693b582ea
[RUNTIME] Now using pybind11 to avoid memory leaks ( #377 )
2021-11-21 02:30:22 -08:00
Philippe Tillet
01cc3d4503
[RUNTIME] Restored `do_not_specialize` ( #374 )
2021-11-12 15:06:55 -08:00
Philippe Tillet
5d54352164
[FRONTEND] Significantly reduce kernel launch time ( #367 )
2021-11-04 13:25:24 -07:00
Philippe Tillet
5ce1b726dc
[CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue ( #356 )
2021-10-24 02:30:46 -07:00
daadaada
858dec8372
[CODEGEN] Add cache modifier to tl.load ( #351 )
...
* Add cache modifier to tl.load
* Add comment to cache_modifier
* Remove force_nc_cache
* Update test
2021-10-17 22:14:04 -07:00
Philippe Tillet
5123db0b7d
[LANG] Various (relatively minor) improvements ( #320 )
2021-10-04 18:39:40 -07:00
Philippe Tillet
6e5b0b4301
[FRONTEND] Added on-disk cache for compiled kernels ( #287 )
2021-09-18 22:48:26 -07:00
Philippe Tillet
94c83d30ce
[GENERAL] Removed deprecated driver files and added basic compatibility with rocm ( #268 )
...
- Removed driver module -- accelerator runtime is handled by pytorch
- Added basic support for ROCM based on @micmelesse's PR -- now can execute empty kernel on AMD devices without any compile-time changes
- Now only using PREFER_SHARED for kernels when the size of shared memory is greater than 49k. Otherwise there can be poor L1 performance for broadcast tensors
2021-09-09 00:04:28 -07:00
daadaada
85426dbaf7
[DOCS] Add comments in layout.h ( #249 )
2021-08-28 18:07:32 -07:00
milesial
5b29da719d
[DRIVER] Add CUDA P2P support ( #209 )
2021-08-20 21:00:54 -07:00
Philippe Tillet
298da78058
[CODEGEN/DRIVER] Tweaks for performance optimization ( #193 )
2021-08-07 16:41:44 -07:00
Philippe Tillet
01276b5153
[FRONTEND] Added compilation flag to force use of `.nc` cache modifier ( #134 )
...
in DRAM loads. /!\ USE CAREFULLY - THIS CAN BREAK CORRECTNESS IF MISUSED /!\
2021-07-27 12:38:49 -07:00
Philippe Tillet
2824345065
[LANGUAGE] Added cos/sin ( #132 )
2021-07-27 12:38:49 -07:00
Philippe Tillet
8cea583109
[IR] Preliminary support for BF16 ( #129 )
...
This PR adds a BF16 data-type, along with FP32 <-> BF16 conversion instructions in the LLVM codegen. Other kinds of ops on bfloat16 are not yet supported.
2021-07-27 12:38:49 -07:00
daadaada
d8d6b715c8
[CODEGEN] Performance improvement on A100 ( #125 )
...
Improved codegen for the Ampere GPUs.
* Make the layout pass recognize the multistage pipelined pattern.
* Now the pipeline pass can automate the multistage pipelining transformation.
* Remove extra barriers (from the prefetch pass & WAR) on Ampere.
* Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
2021-07-27 12:38:49 -07:00
Philippe Tillet
0274429429
[IR] Added IR and Codegen support for atomic_rmw ( #120 )
2021-07-27 12:38:49 -07:00
Philippe Tillet
59b0ac672a
[LANGUAGE] Added support for bitcast ( #119 )
2021-07-27 12:38:49 -07:00