291 Commits
v1.1.1 ... rocm

Author SHA1 Message Date
rsanthanam-amd
46fd581b0a Merge pull request #29 from ROCmSoftwarePlatform/parse_amdgcn_from_rocminfo
Changes to eliminate the need for the MI_GPU_ARCH environment variable.
2022-11-18 12:53:25 -06:00
Rohit Santhanam
8cc448d92e Changes to eliminate the need for the MI_GPU_ARCH environment variable.
The AMDGPU arch is now parsed out of the rocminfo dump.
2022-11-18 18:51:57 +00:00
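A minimal sketch of that parsing idea, assuming `rocminfo` prints a gfx architecture identifier (e.g. gfx90a) somewhere in its dump; the regex and names are illustrative, not the commit's actual code:

```python
import re
import subprocess

# Run rocminfo and grab the first gfx identifier from its output.
# Assumption: the agent list in the dump contains a token like "gfx90a".
out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
match = re.search(r"\bgfx\w+\b", out)
amdgpu_arch = match.group(0) if match else None
```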
Michael Melesse
9a9fabbba9 Merge pull request #22 from ROCmSoftwarePlatform/IFU_11_1_2022
IFU 11/1/2022
2022-11-01 14:27:33 -04:00
Michael Melesse
15886b5ffc skip segfault 2022-11-01 17:52:18 +00:00
Michael Melesse
d5830b4b6a Merge branch 'master' into IFU_11_1_2022 2022-11-01 17:29:10 +00:00
Michael Melesse
bba1579485 remove scripts 2022-11-01 17:24:35 +00:00
rsanthanam-amd
cc6b5180c7 Merge pull request #19 from ROCmSoftwarePlatform/unskip_test_reduce
reduce the skips for test_reduce functions
2022-11-01 11:05:18 -05:00
Michael Melesse
dfad6bdf36 reduce the skips for test_reduce functions 2022-11-01 15:00:12 +00:00
rsanthanam-amd
f3bcbcfde6 Merge pull request #18 from ROCmSoftwarePlatform/fix_test_dot
Fix 6/7 test dot
2022-11-01 09:34:37 -05:00
Michael Melesse
7ec29a7453 revert scripts 2022-11-01 14:22:33 +00:00
Michael Melesse
4fb9d4904e fix 6/7 dot tests 2022-11-01 14:18:06 +00:00
Michael Melesse
4f3e2d6ed7 Merge branch 'rocm52_fixes_IFU' into fix_test_dot 2022-10-31 19:24:45 +00:00
rsanthanam-amd
fecc7ce248 Fix for test_bitwise subtests for ROCm. (#16)
The issue was that the kernel names were colliding with each other in
the cache.  Since the kernel names were based on the date and time, the
kernels were getting compiled so fast that a subsequent kernel would end
up with the same name as the previous one and would therefore overwrite
it in the cache.

The test suite seems to run the same test multiple times, but subsequent runs
would end up using the wrong kernel because of the collisions.

It is fixed by appending a randomly generated alphanumeric string to
keep the kernel names unique.
2022-10-31 15:24:08 -04:00
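The fix described above boils down to salting the generated name; a hedged sketch with illustrative names, not Triton's actual cache code:

```python
import random
import string

def unique_kernel_name(base_name: str, suffix_len: int = 8) -> str:
    # Append a random alphanumeric suffix so two kernels compiled within
    # the same timestamp cannot collide in the cache.
    suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=suffix_len))
    return f"{base_name}_{suffix}"
```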
Michael Melesse
277b712284 save changes 2022-10-31 19:11:58 +00:00
Michael Melesse
d024f0cfb8 update test_dot to use float 32 2022-10-31 18:58:10 +00:00
Michael Melesse
1811791665 add failures in report 2022-10-31 18:39:58 +00:00
Michael Melesse
9b3f2487b5 fix minor bug 2022-10-31 18:33:47 +00:00
rsanthanam-amd
14730a2352 Merge pull request #15 from ROCmSoftwarePlatform/bfloat_enable
unskip most bfloat tests
2022-10-31 13:10:30 -05:00
Mark Saroufim
578ada7740 [DOCS] Add install from source instructions to README (#821) 2022-10-31 11:08:18 -07:00
Michael Melesse
15683986cd unskip most bfloat tests 2022-10-31 18:04:54 +00:00
Phil Tillet
6311d70406 Revert "[BUILD] Now using cibuildwheel default"
This reverts commit 584086f08c.
2022-10-29 17:15:47 -07:00
Phil Tillet
584086f08c [BUILD] Now using cibuildwheel default 2022-10-29 16:59:06 -07:00
Keren Zhou
3ca667dfa8 [Frontend] Return a scalar if all input args are scalar (#816) 2022-10-28 23:27:06 -07:00
rsanthanam-amd
48fcd8c987 Merge pull request #14 from ROCmSoftwarePlatform/fix_vectorization
fix test_vectorization and test_load_cache_modifier
2022-10-28 16:12:57 -05:00
Michael Melesse
8d9572bc63 add similar fixes to two additional tests 2022-10-28 20:34:58 +00:00
Michael Melesse
ffb30cdc52 skip ptx assert 2022-10-28 20:23:11 +00:00
Michael Melesse
7fce2bc5f1 add print_llvm_module 2022-10-28 20:07:35 +00:00
rsanthanam-amd
531ef18cb6 Fix for binop % (mod) unit test failures. (#13)
If either data type is fp, then fmod should be used for the
reference computation.
2022-10-28 15:06:17 -04:00
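A small sketch of that reference-computation rule, with plain Python scalars standing in for the test operands:

```python
import math

def reference_mod(x, y):
    # If either operand is floating point, use fmod (sign follows the
    # dividend, matching C-style %) instead of Python's modulo.
    if isinstance(x, float) or isinstance(y, float):
        return math.fmod(x, y)
    return x % y
```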
Michael Melesse
5f0d90db7e tab prints 2022-10-28 19:05:42 +00:00
Michael Melesse
03ae41b310 add print helper 2022-10-28 17:55:28 +00:00
Michael Melesse
bd61338b31 update scripts 2022-10-28 17:48:26 +00:00
Michael Melesse
6e50f8b2c0 print irs 2022-10-28 17:46:52 +00:00
Michael Melesse
aa556d4f1b update script 2022-10-26 21:51:15 +00:00
Michael Melesse
61e88efb23 ignore logs 2022-10-26 21:42:41 +00:00
Michael Melesse
ed9638801a fix for test_cast 2022-10-26 21:34:58 +00:00
Michael Melesse
8ecab462f6 skip segfaults on ROCM 2022-10-26 20:46:47 +00:00
Michael Melesse
648e4cfe89 skip test_atomic_rmw on rocm 2022-10-26 18:22:23 +00:00
Michael Melesse
abe0d3e1b1 cast to amd device when as_nvidia shows up 2022-10-26 18:12:18 +00:00
Michael Melesse
4464dfcc18 save scripts 2022-10-26 17:42:58 +00:00
Michael Melesse
0cae0168ec fix bfloat failure 2022-10-26 17:40:28 +00:00
Michael Melesse
88d57ef9c9 add cache print 2022-10-26 17:19:30 +00:00
Michael Melesse
39381d99f8 send amdgcn to cache 2022-10-26 17:18:33 +00:00
Michael Melesse
df925f7187 add cache print script 2022-10-25 20:48:36 +00:00
Michael Melesse
e84297ca79 print cache 2022-10-25 20:44:42 +00:00
Michael Melesse
61c85c18b2 try to load binary 2022-10-25 20:29:43 +00:00
Michael Melesse
da5c24ffcb just clean cache 2022-10-25 20:27:13 +00:00
Michael Melesse
09302f0106 fix linking bug 2022-10-25 18:31:10 +00:00
Yanbo Liang
5ca1ed0101 Add bf16/fp16/fp64 support for ty_to_cpp (#800)
In ```torch._inductor```, we [convert 0d CPU tensor to scalar during
triton codegen](https://github.com/pytorch/pytorch/pull/87329), so we need to
add the missing triton support for bf16/fp16/fp64.
2022-10-24 19:41:25 -07:00
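An illustrative sketch of the kind of dtype-name to C++ type table the commit extends; the exact entries in triton's `ty_to_cpp` may differ, and the CUDA type names below are assumptions:

```python
# Hypothetical mapping from triton dtype names to C++ argument types.
ty_to_cpp_sketch = {
    "fp16": "__half",         # assumption: CUDA half type
    "bf16": "__nv_bfloat16",  # assumption: CUDA bfloat16 type
    "fp32": "float",
    "fp64": "double",
    "i32": "int32_t",
}
```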
Michael Melesse
9184b5cf65 add prints 2022-10-24 18:28:28 +00:00
Michael Melesse
8da4323514 write hipmodule bytes 2022-10-24 17:58:25 +00:00
Michael Melesse
eb89e9bdd9 fix generator.cc: generator::visit_function: segfault 2022-10-24 17:41:20 +00:00
Michael Melesse
56a06f7a06 add debug steps 2022-10-21 20:17:30 +00:00
Michael Melesse
6a31c43774 update backtrace 2022-10-21 19:56:19 +00:00
Michael Melesse
8785793445 fix typo 2022-10-21 17:58:38 +00:00
Michael Melesse
d022f5cf2c add compiling back to gcn 2022-10-21 17:54:31 +00:00
Michael Melesse
4624fd4e1d save compiler 2022-10-19 20:39:32 +00:00
Michael Melesse
41144f927f fix hip launch 2022-10-17 20:41:28 +00:00
Michael Melesse
4d6d4c9431 hip src 2022-10-17 20:18:44 +00:00
Michael Melesse
32dbc08c05 fix llvm build errors 2022-10-17 18:29:15 +00:00
Michael Melesse
4f21501def add fixes 2022-10-17 18:21:14 +00:00
Michael Melesse
5c548fb57e Merge branch 'master' into rcom52_fixes 2022-10-17 17:53:48 +00:00
Michael Melesse
fa4d0fd1ef add scripts 2022-10-17 17:28:48 +00:00
Daniil Fukalov
406d03bfaf Improve ROCm support. (#780)
- updates to support ROCm 5.2
- workarounds in tests where NV tools were used unconditionally
- implemented `get_num_blocks()` and `add_memfence()` for AMD GPU
- backported from history some atomics
- added bf16 support
- minor warnings cleanup
- added dockerfile to run on a ROCm enabled machine

Co-authored-by: B1tway <andrew.shukshov@gmail.com>
Co-authored-by: Andrey Shukshov <36711069+B1tway@users.noreply.github.com>
2022-10-14 11:33:42 -07:00
Keren Zhou
db3aa1d1fb [FRONTEND] Fix libdevice (#776)
Fix two problems in libdevice and external dispatch:

1. Use static triton types (e.g., tl.int32) instead of creating new
types. Otherwise, `tl.int32` and `tl.dtype('int32')` are not the same
thing.

2. The name of an extern inst should be empty, not the symbol name of
the inst. The TTIR generator will assign names automatically. Otherwise, we
get the same variable name when there are multiple identical extern insts.

Before the PR:

```bash
  __nv_exp = extern_elementwise f64<1024> %11;
  __nv_exp = extern_elementwise f64<1024> %11;
```

After the PR:

```bash
  %12 = extern_elementwise f64<1024> %11;
  %13 = extern_elementwise f64<1024> %11;
```
2022-10-13 17:18:16 -07:00
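A plain-Python analogy for problem 1 above (not Triton's actual classes): a freshly constructed dtype object is not the same object as the module-level singleton, so identity-based lookups miss.

```python
class dtype:
    def __init__(self, name):
        self.name = name

int32 = dtype("int32")   # canonical module-level instance
fresh = dtype("int32")   # new object constructed on the fly
print(int32 is fresh)    # False: identity differs even though names match
```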
Twizzes
ddae106c0e [DOCS] Update installation.rst to fix windows build error (#747) 2022-10-13 13:27:15 -07:00
Keren Zhou
bc98aead33 [Backend] Fix for mov.u8 (#766)
Init a potential fix for mov.u8 which is not supported by ptx for now.
Use mov.u16 instead and cast it to u8.
2022-10-12 14:32:27 -07:00
Yu Guo
71b46acc42 [IR] Added special-purpose dequantize instruction (#759)
It is currently necessary for optimal performance in quantized workloads to add a special-purpose instruction in the IR. Backward compatibility with this instruction is *NOT* guaranteed.
2022-10-12 14:14:45 -07:00
Philippe Tillet
33e6f0df7f [DRIVER] Bumped CUDA requirement to 11.4+. This is to avoid bad performance surprises as older ptxas are much slower. (#769)
This also makes codegen simpler by avoiding special handling of eviction policies
2022-10-12 12:02:30 -07:00
Philippe Tillet
af76c989eb [RUNTIME] Make entry point cache key depend on triton version hash (#765) 2022-10-11 13:24:30 -07:00
Bin Bao
09cc2d454b [FRONTEND] Fix a bool tensor storing problem (#746) 2022-10-10 12:11:50 -07:00
Felipe Petroski Such
5d4b26d380 [RUNTIME] support multiple devices in the same process (#757) 2022-10-09 20:30:04 -07:00
Chris
9a11a567ce [DOCS] Fixed typos in 01-vector-add.py (#751) 2022-10-09 18:12:46 -07:00
Keren Zhou
11345e9b74 [RUNTIME] Add callback functions for external tools (#738) 2022-10-05 14:46:55 -07:00
Philippe Tillet
bdfdb9a1d2 [RUNTIME] Fixed JIT bug that led some constexpr values to be overridden by specialization parameters (#742) 2022-10-05 11:00:32 -07:00
shenggan
77c752dc78 [RUNTIME] remove fixed cu_include_dir (#739)
Use environment variable `CUDA_HOME` with default value `/usr/local/cuda` for `cu_include_dir` #731
2022-10-04 19:49:57 -07:00
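A sketch of the lookup the commit introduces, assuming only the standard library:

```python
import os

# Honor CUDA_HOME, falling back to /usr/local/cuda when it is unset.
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
cu_include_dir = os.path.join(cuda_home, "include")
```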
Natalia Gimelshein
d3c925db8a [FRONTEND] properly broadcast scalar where condition (#736) 2022-10-04 12:44:03 -07:00
fdrocha
2b0f877fad [RUNTIME] Support environments with multiple cudalibs (#733) 2022-10-03 18:36:24 +00:00
Keren Zhou
4a2d3b7d79 [RUNTIME] Dump llvm, ttir, and sass to help debugging (#732) 2022-10-03 00:39:52 +00:00
Natalia Gimelshein
f55960e773 [FRONTEND] fix broadcasting for where (#729)
Fixes #532, all 3 inputs to where have to be broadcast together.
2022-10-01 13:18:47 -07:00
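The rule the fix enforces, shown here with numpy's `where` as a stand-in for `tl.where`:

```python
import numpy as np

cond = np.array([[True], [False]])  # shape (2, 1)
x = np.arange(3.0)                  # shape (3,)
y = np.zeros(3)                     # shape (3,)
# All three inputs broadcast together to the common shape (2, 3).
out = np.where(cond, x, y)
```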
Phil Tillet
b244db06da [TUTORIALS] Attention tutorial fixup 2022-09-30 19:31:43 -07:00
Shintaro Iwasaki
7b61303ea1 [CODEGEN] Fix extract_N_bufferable in layout analysis (#728) 2022-09-30 12:21:22 -07:00
Shintaro Iwasaki
ae59f51c2d [CODEGEN] Fix an inliner to call a function with a phi-node (#727) 2022-09-29 21:36:40 -07:00
albanD
f45e31ba7c [FRONTEND] Make sure to hold the gil when creating python objects (#726)
Without this patch, a debug version of python complains that:
```
Fatal Python error: Python memory allocator called without holding the GIL
Python runtime state: initialized
```
2022-09-29 18:06:22 -07:00
Philippe Tillet
dad97528b2 [TESTING] allclose fixup (#724) 2022-09-28 22:49:05 +00:00
Jason Ansel
998fd5f9af [FRONTEND] Make triton.compile work without a cuda context (#708)
This allows compiling in a subprocess. I'm not seeing a ton of speedup from this, but figure it is a good change anyway.
2022-09-24 13:41:47 -07:00
Shintaro Iwasaki
3ac929b48b [BUILD] Download pybind11 in setup.py (#703)
Based on the discussion in #700, this PR enables downloading pybind11 in
`setup.py` without `git submodule` instead of copy-pasting pybind11
code. The downloaded pybind11 will be in `~/.triton/pybind` (like
`llvm`).
2022-09-23 15:54:07 -07:00
Jason Ansel
579c03615d [FRONTEND] Reduce number of compiles in JITFunction (#704)
I suspect this was the cause of the "new compiles even on a warm cache"
behavior I was seeing, though haven't 100% confirmed it.

Python `set()` iteration order is nondeterministic when you create a new
process. So the same args could produce different `instance_descriptor`s
and have false cache misses.
2022-09-23 21:44:52 +00:00
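A short illustration of why nondeterministic `set()` order produces false cache misses (variable names are hypothetical):

```python
specialized = {"x_ptr", "y_ptr", "n_elements"}
bad_key = tuple(specialized)           # order can differ across processes
good_key = tuple(sorted(specialized))  # deterministic, cache-friendly
```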
Philippe Tillet
25e1b36785 Revert "[pybind11] Use git-submodule for pybind11" (#701)
Reverts openai/triton#699
2022-09-23 12:25:38 -07:00
Shintaro Iwasaki
61d104ab3a [FRONTEND] Use git-submodule for pybind11 (#699)
This PR changes the `pybind11` source code management from copy-paste to
a package controlled by git-submodule.

See the discussion in #694 for details.
2022-09-23 09:55:03 -07:00
Philippe Tillet
8c3d4d5749 [RUNTIME] now decoupling entry point from cubin (#696) 2022-09-22 16:44:22 -07:00
Shintaro Iwasaki
df67068bb0 [pybind11] Update pybind11 to 2.10.0 (#691)
This PR updates the version of pybind11 to 2.10.0 (the latest stable).
2022-09-21 20:18:02 -07:00
Philippe Tillet
677ddae618 [FRONTEND] Add warmup for triton.jit() (#684)
This revives #671, removing the static functions that may unnecessarily hold a reference to the grid and the JITFunction object

Co-authored-by: Jason Ansel <jansel@jansel.net>
2022-09-21 19:13:20 +00:00
Jason Ansel
6abe813d1c Fix issue breaking cudagraphs (#685)
@ngimel figured this one out. 

The errors we were seeing from cudagraphs capture were coming from
`cuStreamGetCtx` which is not allowed while a stream is capturing.

It appears the result of `cuStreamGetCtx()` isn't even used, so I
believe it can just be removed.
2022-09-21 10:20:48 -07:00
Philippe Tillet
e318185eb4 [DOCS] Improved README.md wording (#683)
Initial wording dates from a time where nobody knew Triton, and
comparing it to CUDA helped differentiate it from other existing DSLs.
But nowadays this comparison doesn't make much sense; Triton is its own
thing, and some people may even still be more productive in CUDA than
Triton -- language preferences are subjective after all.
2022-09-20 18:09:43 -07:00
Philippe Tillet
7dc2a70edb Revert "Add .warmup() for triton.jit()" (#682)
Reverts openai/triton#671

It seems like for some reason this caused out-of-memory errors on some
of our internal workloads. I'm reverting this so that HEAD can be used
in production at OpenAI, and I will work on digging into this issue
asynchronously.
2022-09-20 16:05:14 -07:00
Philippe Tillet
48f30550f1 [FRONTEND] Now using raw compiler syscalls when possible (#678) 2022-09-19 21:01:36 -07:00
Jason Ansel
93b1adc53b [FRONTEND] Add .warmup() for triton.jit() (#671) 2022-09-18 23:09:34 -07:00
Phil Tillet
82956e5d6b [PACKAGING] Added missing package 2022-09-18 17:34:05 -07:00
Philippe Tillet
2baf333d44 [DOCS] Fixed typos (#670) 2022-09-18 17:13:12 -07:00
Jason Ansel
49f6bc3f2b [FRONTEND] Fix filename too long error in new runtime (#669) 2022-09-18 21:26:29 +00:00
Phil Tillet
00f4ef6958 [CI] wheel/docs workflows now only run on V100 machine 2022-09-18 13:28:35 -07:00
Jason Ansel
e647402fd3 Fix warning in generated C code (#667) 2022-09-18 12:57:32 -07:00
Philippe Tillet
4a77dfb042 [FRONTEND] Complete rewrite of the runtime (#644)
This PR completely rewrites the runtime of Triton to be more lean and
clearly separate the compilation step from the just-in-time caching logic.
This should substantially reduce launch overhead.
2022-09-18 08:51:48 -07:00
Ian Bearman
889d9e34a1 [REPO] update gitignore (#666)
Update `.gitignore` to include `.vs` and `.vscode`
2022-09-17 14:25:28 -07:00
Shintaro Iwasaki
c668d6596e [DOCS] Fix spelling (#664)
This PR applies minor spelling fix in comments and string literals to
`master`. It shouldn't hurt anything.
2022-09-16 12:26:40 -07:00
Sophia Wisdom
4580a04710 [FRONTEND] Improve error message for CPU tensors (#654)
Redo of #651 against master. Fixes #525 by catching CUDA error when we
check pytorch tensor size and rethrowing a more informative error that
says why we failed.
2022-09-14 14:26:42 -07:00
Philippe Tillet
cfbbc7b43a [CI] Added V100 tag to disambiguate self-hosted runners (#653) 2022-09-14 13:47:50 -07:00
Yunxing Dai
59a8e25f43 [DOCS] Fix typo (#650) 2022-09-14 12:17:05 -07:00
Da Yan
437ced38c2 fp8 <> bf16 conversion (#637)
Co-authored-by: Philippe Tillet <phil@openai.com>
2022-08-30 14:20:12 -07:00
Da Yan
210a296699 [BACKEND] bf16 flash-attention (#636) 2022-08-26 20:40:55 -07:00
Daniil Fukalov
fe0c29b9ec Fix inconsistent struct declaration instead of class. (#632)
Looks like a typo.
2022-08-26 16:20:21 -07:00
Phil Wang
7394d732ad [DOCS] support for variable head dimensions in flash attention triton tutorial (#623) 2022-08-15 19:16:49 -07:00
Da Yan
3e2953f357 Allow multiple_of and max_contiguous to accept n-d values (#617) 2022-08-10 09:59:32 -07:00
Daniil Fukalov
cc79376222 Fix deprecation warning on CreateGEP(Value *, ArrayRef<Value *>, const Twine &) (#608)
This variant of CreateGEP() is already removed in LLVM 14.
2022-08-07 17:10:18 -07:00
Daniil Fukalov
7b91c7befd Fix "warning: control reaches end of non-void function". (#607) 2022-08-02 16:12:48 -07:00
Sharad Vikram
968f59027e Expose module.print in pybind (#604) 2022-07-29 21:36:08 -07:00
Anton Kostin
923d468187 Update LICENSE (#602) 2022-07-25 09:30:03 -07:00
Jason Ansel
027321cdcf [FRONTEND] Make tl.rand() 1-exclusive (#601) 2022-07-24 17:47:23 -07:00
Jason Ansel
e02e56dc63 [FRONTEND] Add missing rfloordiv (#598)
* [FRONTEND] Add missing rfloordiv

* fix tests
2022-07-23 21:54:12 -07:00
Philippe Tillet
ab56d310dd [BACKEND][IR] Fixed up internal dtype size for booleans (1bit -> 8bit) (#600) 2022-07-23 20:08:03 -07:00
Da Yan
f28caddbf8 [FRONTEND] Allow tl.where to select pointers (#595) 2022-07-21 09:54:27 -07:00
Keren Zhou
af85f5fa46 [FRONTEND] Refresh cache when the source code of outlined functions are changed (#590) 2022-07-20 17:34:07 -07:00
daadaada
9b2bc88d11 [BACKEND] Better bf16 support (#588) 2022-07-19 21:22:37 -07:00
Philippe Tillet
86cab58d89 [CI] Changed dev wheel date to UTC time to match CRON schedule (#587) 2022-07-18 14:54:13 -07:00
Phil Tillet
5b04331dd2 [TUTORIALS] Added more credits in fused attention tutorial 2022-07-13 23:48:58 -07:00
Jason Ansel
0a3f3d5f25 [PACKAGING] Include triton/language/libdevice.10.bc in package data (#582) 2022-07-13 23:45:27 -07:00
Keren Zhou
4912916c11 [FRONTEND] Added support for element-wise function defined in external LLVM bitcode (e.g., libdevice) (#562) 2022-07-13 15:52:21 -07:00
Phil Tillet
971f5782b4 [tutorials] Added flash attention credits in tutorial 2022-07-11 18:56:48 -07:00
Philippe Tillet
d5eb9bc230 [tutorial] Added bwd in fused attention example (#579)
Doesn't work on V100
2022-07-11 15:43:46 -07:00
Jason Ansel
c9a2b9c7d4 [FRONTEND] Add missing args to get_simd_tflops() (#578) 2022-07-11 14:37:59 -07:00
Philippe Tillet
4a399a7e40 [BACKEND] Fix some bugs (atomics, a segfault...) (#577)
This should fix #558, #573 and #574
2022-07-06 20:03:04 -07:00
vesuppi
22105bc33b [FRONTEND] Added type check in semantic arange (#572) 2022-07-03 15:25:37 -07:00
Keren Zhou
4bf509889b [BUILD] Change the default build type to Release (#571) 2022-07-01 12:17:22 -07:00
Keren Zhou
a74cce375f [FRONTEND] Raise broadcast error (#555) 2022-06-30 17:32:07 -07:00
Philippe Tillet
f733327ba4 [BACKEND][CODEGEN] Disabling L2 residency control by default (#570) 2022-06-29 17:05:13 -07:00
Natalia Gimelshein
1bbb2430d9 [TUTORIALS] adjust heuristics for dwdb kernel (#565) 2022-06-29 17:00:22 -07:00
Kashif Rasul
1895ceaa2d [TUTORIAL] Fix f-string for older python (#569)
fixes issue #568
2022-06-29 09:39:10 -07:00
Philippe Tillet
feb7a2a0dc [FRONTEND] Hotfix for store argument order (#567) 2022-06-28 00:24:02 -07:00
Philippe Tillet
5b4c8f221e [BACKEND] Compiler improvements (#557)
This PR adds several optimization capabilities in the compiler backend:
- Now using inline PTX for `tl.store`, making it possible to use things like evict_last
- For A100, mma layout can be directly converted to shared memory
- For A100, an additional "transpose" argument in `dot` allows tensors to be loaded once and used both row- and col-major.
- Fixed liveness analysis; this was broken.
- Can now load/store mma layout directly without converting. Useful for when the tl.dot accumulator is initialized with DRAM data inside of an inner loop.
- `tl.dot` can now take LHS inputs in registers when it comes from a previous `tl.dot` instruction. Useful for e.g. fused attention.
2022-06-27 11:49:19 -07:00
Keren Zhou
87413bc925 [BACKEND] Fix layout convert for non-contiguous input (#564) 2022-06-25 23:12:03 -07:00
Keren Zhou
d345ddf837 [DOCS] Separate atomic cas from other atomic operations since operands are very different (#559) 2022-06-22 17:51:17 -07:00
Keren Zhou
b02bac41ba [CI] Change cache dir (#561) 2022-06-22 11:44:35 -07:00
Keren Zhou
a428cf0bb2 [FRONTEND] Fix pytorch warning. (#560)
UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc').
2022-06-20 20:12:09 -07:00
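The replacement the warning recommends, spelled out:

```python
import torch

a = torch.tensor([-7])
b = torch.tensor([2])
trunc = torch.div(a, b, rounding_mode='trunc')  # tensor([-3]): old // behavior
floor = torch.div(a, b, rounding_mode='floor')  # tensor([-4]): true floor
```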
Keren Zhou
b5e728cb14 Add argmin argmax (#552) 2022-06-15 13:55:20 -07:00
Jason Ansel
6b9756532f [BACKEND] Remove print in coalesce.cc (#551) 2022-06-15 13:13:20 -07:00
Madeleine Thompson
8ce2c12e33 [PYTHON] move ephemeral files to homedir (#549)
This prevents potential conflicts with other users on shared machines.
2022-06-13 19:37:52 -07:00
Keren Zhou
93209c07e0 [BACKEND][CODEGEN] Fix reduce uint (#547) 2022-06-13 16:43:57 -07:00
Philippe Tillet
58c8889235 [FRONTEND] Fix scanline layout (#548) 2022-06-13 16:21:10 -07:00
Natalia Gimelshein
7094657aa9 [FRONTEND] fix bool conversion of floating types (#545) 2022-06-13 15:52:37 -07:00
Keren Zhou
38573d1261 [FRONTEND] Return allocated registers and spilled registers for users (#541) 2022-06-07 18:37:12 -07:00
Mengchi Zhang
2cdc6d35c4 [FRONTEND] Give col_per_thread an initial value to make the compiler happy (#535)
Signed-off-by: Mengchi Zhang <mengchi@fb.com>
2022-06-06 12:48:23 -07:00
TC
f13cbaab9f [FRONTEND] assert that num_warps is a power of 2 (#539) 2022-06-06 11:37:08 -07:00
Philippe Tillet
751e325d2e [TUTORIALS] Fixed typo 2022-06-05 13:33:21 -07:00
Philippe Tillet
801c8a4c92 [TUTORIALS] Fixed typo 2022-06-05 12:32:07 -07:00
Philippe Tillet
8876e53206 [BACKEND] Restored reduction bugfixes 2022-06-03 11:38:52 -07:00
Philippe Tillet
a60374a597 Revert "[BACKEND] Various bug fixes; making reductions faster (#533)".
This is a more stable commit that produces bitwise identical code to earlier
versions. Using commits after this one may lead to slightly different numerics.
2022-06-03 11:36:06 -07:00
Philippe Tillet
efa04cac1f [FRONTEND] A couple of bugfixes (#534) 2022-06-02 16:57:37 -07:00
Philippe Tillet
3e7500dfe6 [BACKEND] Various bug fixes; making reductions faster (#533) 2022-05-31 17:14:44 -07:00
Bert Maher
37037bb3be [FRONTEND] Default cache dir to /tmp/triton_$USER (#527) 2022-05-27 13:51:05 -07:00
Philippe Tillet
c82a206684 [FRONTEND] Better dot error message (#531) 2022-05-26 17:41:09 -07:00
Philippe Tillet
0e2883020a [BACKEND] Fixed typo in alignment analysis (#528) 2022-05-25 20:01:19 -07:00
Bert Maher
43fec2adca [FRONTEND] Add binding for create_int_to_ptr (#526) 2022-05-25 15:26:18 -07:00
Philippe Tillet
011bc83c1b [FRONTEND] For loops now promote initial value (#524) 2022-05-24 13:20:10 -07:00
Natalia Gimelshein
96bff90471 [FRONTEND] faster jit function launch (#523)
With the fast (200 ns) get_stream function soon to be available from pytorch, this shaves approx 25-30 us off function launch; even without that function, caching device properties saves ~15-20 us.
2022-05-24 12:08:49 -07:00
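A sketch of the device-property caching that accounts for the ~15-20 us saving; `functools.lru_cache` stands in for whatever mechanism the commit actually uses:

```python
import functools

@functools.lru_cache(maxsize=None)
def cached_device_properties(device_index: int):
    # Pay the CUDA property query once per device, not on every launch.
    import torch
    return torch.cuda.get_device_properties(device_index)
```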
daadaada
d5eaa8dfa0 Making the generated Triton IR deterministic & a script to compare cached assembly (#522) 2022-05-24 08:56:36 -07:00
Shantanu
80f6a2698b [FRONTEND] Ensure version_key is called at most once (#519)
Co-authored-by: hauntsaninja <>
2022-05-23 13:40:08 -07:00
daadaada
205a493b10 [FRONTEND] Fix a bug in atomic_cas (correct cmp to val) & more tests on atomic_cas (#520)
Fix a bug in atomic_cas (correct cmp to val) & more tests on atomic_cas
2022-05-21 09:45:54 -07:00
Jiabao Lei
abea3dc2c6 [FRONTEND] provide device kwargs && fix fstring error for py<3.8 (#515)
Co-authored-by: Philippe Tillet <phil@openai.com>
2022-05-14 16:21:46 -07:00
Philippe Tillet
d35617bea1 [BACKEND][CODEGEN] Faster reduction for scanline layout (#516) 2022-05-14 15:26:13 -07:00
Mengchi Zhang
d1a22a94e6 [FRONTEND] Add empty return value and remove protect to open the access to contained_tys_vec_t (#514)
Signed-off-by: Mengchi Zhang <mengchi@fb.com>
2022-05-13 11:46:12 -07:00
Jason Ansel
d954a05989 [FRONTEND] Handle torch.uint8 args (#513)
Co-authored-by: Philippe Tillet <Phil.Tillet@gmail.com>
2022-05-12 13:07:39 -07:00
Philippe Tillet
0835a4fb05 [TUTORIALS] Removed #noformat in layer norm tutorial 2022-05-12 12:41:25 -07:00
Philippe Tillet
c736ba7c3e [TUTORIALS] Fixed formatting 2022-05-12 12:31:23 -07:00
Philippe Tillet
cd30a99aa2 [TUTORIALS] fixed formatting 2022-05-12 12:28:22 -07:00
Philippe Tillet
d87435e536 [TUTORIALS] Layer norm tutorial now uses residency control (#510) 2022-05-05 19:53:54 -07:00
Sriram Murali
7c9bc5a47b [CODEGEN] Change return type of generator::packed_type to appease build warnings (#507) 2022-05-04 20:03:37 -07:00
Philippe Tillet
95feb10ec9 [FRONTEND] fixup (#505) 2022-04-30 14:25:06 -07:00
Philippe Tillet
11a908655d [FRONTEND] Fixup 2022-04-29 14:35:09 -07:00
Phil Tillet
cd78ce4888 [FRONTEND] Improved error message when assigning None to non-constexpr 2022-04-29 09:17:54 -07:00
Philippe Tillet
ae2a1ab225 [BACKEND] Alignment pass improvements (#503) 2022-04-25 21:16:00 -07:00
Philippe Tillet
7d544799a0 [BACKEND] Now disabling L2 eviction policy for sm < 80 2022-04-25 09:35:36 -07:00
Philippe Tillet
3ca792043f [TEST] Added test for vectorization 2022-04-24 13:50:48 -07:00
Philippe Tillet
bda209002e [BACKEND][CODEGEN] vectorization bugfix (#502) 2022-04-23 13:18:33 -07:00
Philippe Tillet
0cc3b1129b [BACKEND][CODE_GEN] eviction policies now also apply to L2 (#501) 2022-04-21 23:56:01 -07:00
Philippe Tillet
7d6c504e8d [TESTING] Added testing utilities for fixing clock and using cuda-memcheck (#500) 2022-04-21 22:40:10 -07:00
Philippe Tillet
073be1d2ee [FRONTEND] check that tensors have power-of-two number of elements (#499) 2022-04-14 19:30:02 -07:00
Philippe Tillet
5c7122004c [TUTORIALS] Tutorial shouldn't expose clock. Just removed it. 2022-04-14 17:33:44 -07:00
Philippe Tillet
dc4d40faec [FRONTEND] now mangle constexpr float containing "e-" 2022-04-14 10:26:48 -07:00
Philippe Tillet
25f6689508 [FRONTEND] rename current stream monkey patch (#495) 2022-04-13 11:45:55 -07:00
Philippe Tillet
76bfac9f15 [FRONTEND] Improved constexpr handling (#493) 2022-04-12 00:02:54 -07:00
Philippe Tillet
14b0fd4cfb [FRONTEND] Added possibility for users to customize current stream query (#492) 2022-04-07 12:11:32 -07:00
Philippe Tillet
6424771f55 [CI] Documentation fixup 2022-04-07 09:42:35 -07:00
Philippe Tillet
9f08ecd684 [FRONTEND] Semantic analysis refactor (#491)
Moved dispatch.cc to semantic.py (@ptillet)
Integer signedness analysis was moved from C++ to python (@daadaada)
Cleaner frontend types (@daadaada)
Moved SSA construction to a separate object (@ptillet)


Co-authored-by: Yan Da <dyanab@connect.ust.hk>
2022-04-06 16:13:53 -07:00
Philippe Tillet
2bed6fc850 [LANG] Added support for device functions (#484) 2022-04-03 20:58:16 -07:00
apd10
e85c7a7fc7 Bugfix in ptxas path. (#487)
Bug: "ret" value is destroyed when a failing "ptxas --version" is run
overwriting the previous valid "ret" value.

Fix: keep rets only for those runs which are successful. Pick the first
one
2022-03-30 20:45:41 -07:00
Philippe Tillet
bace26143d [TUTORIALS] Removed leftover print 2022-03-28 16:53:23 -07:00
Philippe Tillet
e0cc488055 [FRONTEND] Added tl.clock and tl.globaltimer (#485) 2022-03-28 16:15:43 -07:00
Philippe Tillet
76a9ee50a8 Revert "[FRONTEND] Semantic analysis refactor (#473)" (#483)
This reverts commit 539961072c.
2022-03-24 17:16:50 -07:00
Philippe Tillet
ea6d1f1b85 [DRIVER] LLVM driver fixup (#482)
Current way of doing things is probably not super thread safe. init is shared between threads and some threads may not call the LLVMInitialize* function.
2022-03-23 00:24:45 -07:00
Keren Zhou
a4f68165cd [FRONTEND] Hot fix for lineno (#481)
Override __reduce__ to make CompilationError picklable and print out error messages
2022-03-22 22:09:49 -07:00
daadaada
539961072c [FRONTEND] Semantic analysis refactor (#473)
Moved dispatch.cc to semantic.py
Integer signedness now moved from C++ to python
Cleaner frontend type

Co-authored-by: Phil Tillet <phil@openai.com>
2022-03-16 21:25:30 -07:00
Yongjik Kim
0dd2ec2e3a [FRONTEND] Add an assert in case we get a CPU tensor. (#478) 2022-03-16 14:38:56 -07:00
Philippe Tillet
d4d8eaf6c0 [FRONTEND] improved caching mechanism (#474)
Co-authored-by: Greg Brockman <gdb@gregbrockman.com>
Co-authored-by: Christopher Hesse <christopherhesse@users.noreply.github.com>
2022-03-15 12:20:51 -07:00
Doğukan Tuna
21f8a0646d [DOCS] Minor README.md (#470)
Added binary distribution for quick installation
2022-03-05 00:50:37 -08:00
Philippe Tillet
a50a47a85b [CODEGEN] Reverted some changes from previous PR; fixed vectorization characteristics of mma layout (#469) 2022-03-04 01:53:31 -08:00
Philippe Tillet
bb5765df5c [CODEGEN] Now padding shared memory for layout conversion (#468) 2022-03-03 22:19:05 -08:00
daadaada
d9dd97492f Use unique_ptr in ir::context_impl (#462)
Co-authored-by: Philippe Tillet <Phil.Tillet@gmail.com>
2022-02-24 16:07:10 -08:00
Philippe Tillet
98ed7db8c1 [CODEGEN] Improvements and bugfixes (#463) 2022-02-24 14:56:24 -08:00
daadaada
a9dfdcaaa9 [FRONTEND] Make the performance model work for int8, tf32, and fp32 (#456) 2022-02-11 22:34:42 -08:00
Philippe Tillet
9b100302d3 [FRONTEND] Now using pybind11 to release GIL (#458) 2022-02-10 01:57:39 -08:00
Philippe Tillet
40093a9878 [DOCS] Multiple versions are now supported (#457) 2022-02-09 01:32:41 -08:00
Philippe Tillet
4941bc7001 [DOCS] Some more fixes (#455) 2022-02-08 16:53:56 -08:00
Philippe Tillet
2fdf0a4fe8 [DOCS] changed build command 2022-02-08 11:45:21 -08:00
Philippe Tillet
077d6c8ff0 [DOCS] re-activated tutorials 2022-02-08 11:42:39 -08:00
Philippe Tillet
822ddcd14b [DOCS] Added versioning (#453) 2022-02-08 11:28:18 -08:00
Philippe Tillet
7b48340ffd [CI] Some fixes for the build (#451) 2022-02-06 19:11:33 -08:00
Philippe Tillet
5a8a544d10 [OPS][BLOCKSPARSE] Improved robustness, clarity and performance (#450)
* dds layout now internally re-uses dsd code path for increased code reuse
* at_mask and kp_mask related things are now dropped from the softmax API. I couldn't think of any case where it was needed beyond is_causal. And if there is any, we should probably find a way to get it implemented statically so that users don't have to materialize masks.
* fixed bug in blocksparse matmul that caused trouble when layout had a full row/col of zeros
* blocksparse softmax now no longer modifies any data in-place
* blocksparse softmax now takes an is_dense argument that provides better performance. Passing is_dense=True, is_causal=True is the best way to achieve triangular attention.
* unit tests now test backward pass
2022-02-06 18:00:45 -08:00
Philippe Tillet
69ff52ea1f [CODEGEN] removed buggy (and mostly useless) optimization in peephole pass (#449) 2022-02-05 21:37:23 -08:00
TC
137bb67fad [LANG] Add fp16 to fp8 conversion (#444) 2022-02-02 20:42:09 -08:00
Philippe Tillet
3b20170fa3 Merge pull request #448 from openai/v2.0
`v2.0` is now merged into `master`
2022-01-30 20:49:08 -08:00
Philippe Tillet
b0d6e2f322 [STYLE] run autopep 2022-01-30 20:27:44 -08:00
Philippe Tillet
2922dc141c Merge branch 'master' into v2.0 2022-01-30 20:25:01 -08:00
Philippe Tillet
807d8a1945 [ALL] Merge master (#447) 2022-01-30 20:21:20 -08:00
Philippe Tillet
bef76b142a [BACKEND] float division is now approximate by default (#446) 2022-01-29 18:29:29 -08:00
Philippe Tillet
bd52e530a0 [OPS][BLOCKSPARSE] Fix padding issue in DSD LUT (#445) 2022-01-28 21:40:30 -08:00
daadaada
e68d6a7776 [BACKEND] Making the warp-level tile "more square" to increase data-reuse for tl.dot. (#442)
* Increase smem data-reuse for some layouts

* tweak

* Keep the original tiling logic for sm < 80

Co-authored-by: Philippe Tillet <phil@openai.com>
2022-01-27 09:59:54 -08:00
daadaada
59d371c6eb [BACKEND] Added Int8 mma (#440) 2022-01-27 09:12:44 -08:00
Benjamin Lefaudeux
3a23c1dd33 [BACKEND] minor, hotfix for gcc compilation (#439) 2022-01-23 14:24:02 -08:00
Philippe Tillet
ccf9abe0ba [FRONTEND][RANDOM] Improved backward compatibility of RNG (#438)
The unsigned int PR definitely improved our RNG. However, it requires
different floating point arithmetics which, means the results are not
bit-wise identical to how they were before. This commit revives backward
compatibility, but we should change it back to the "right" way later.
2022-01-21 18:05:55 -08:00
Philippe Tillet
4c97d1ecd7 [FRONTEND] Bunch of fixes here and there (#436) 2022-01-20 10:55:59 -08:00
Philippe Tillet
e0c5709cc8 [FRONTEND] Fixed semantics bug on ptr to bool conversions (#432) 2022-01-17 18:00:03 -08:00
daadaada
2a944ded53 [TESTS] Added bfloat16 tests (#430) 2022-01-13 23:38:32 -08:00
Philippe Tillet
4c94359199 [FRONTEND] Alignment fix-up (#428) 2022-01-11 23:11:58 -08:00
Philippe Tillet
bbc78f6516 [FRONTEND][RANDOM] Make sure offset dtype is always uint32 before calling uint32_to_uniform_float (#427) 2022-01-11 11:08:49 -08:00
Botao Yu
bf32205edc [OPS][BLOCKSPARSE] Remove unnecessary loop and add cuda bool layout support (#425) 2022-01-11 11:07:16 -08:00
daadaada
94a2e10fe5 [BACKEND] Add bf16 & tf32 mma supports (on A100) (#426) 2022-01-11 10:20:31 -08:00
Madeleine Thompson
efdabe6073 [STYLE] check python with flake8 (#424)
I've been using this locally to find errors without running tests, and now that we're using autopep8, it passes with minimal suppressions. This is also what turned up the issues with the tutorials, which were fixed in #422.
2022-01-07 15:28:36 -08:00
Madeleine Thompson
a70acfec77 [STYLE] add isort and autopep8 config files and check on CI (#423)
Also fix a few more style issues from the "aggressive" mode of autopep8.
2022-01-07 13:11:34 -08:00
Madeleine Thompson
9801aa7b56 [DOCS] fix tutorials for v2.0 (#422)
- Fix meta-parameter usage on tutorials.
- Install tutorial dependencies on CI.
- Switch from `requirements-test.txt` to `extras_require` for test dependencies, and also use it for tutorial dependencies.
- Make some performance tests deterministic.
2022-01-07 12:34:38 -08:00
Madeleine Thompson
8bf551ae7a [STYLE] run autopep8 and isort (#421)
Run:
```
isort ./python
autopep8 -i --ignore E501,E701,E731 $(find ./python/ -name '*.py')
```
with an `.isort.cfg` and then clean up a few warts. This PR should be a no-op; the idea is that this is all boring whitespace changes, and any config file changes will be in a different change to make it easier to review.
2022-01-06 14:34:17 -08:00
Shantanu
6f7acad48f [CODEGEN] Avoid use of deprecated AST nodes (#418)
Co-authored-by: hauntsaninja <>
2022-01-06 12:04:33 -08:00
Madeleine Thompson
120cda015e [FRONTEND] use unsigned integers to simplify RNG (#417) 2022-01-06 10:49:09 -08:00
Philippe Tillet
001fb757fe [OPS][BLOCKSPARSE] Added .contiguous() in blocksparse inputs when necessary (#420) 2022-01-06 09:56:22 -08:00
Madeleine Thompson
0ab9d67bad uint8, uint16, uint32, and uint64 in kernels (#413)
A forthcoming PR will update the RNG to use these types.

Also:
- Add tests for the `//`, `<<`, and `>>` operators.
- Change `TensorWrapper` to unwrap objects when the resulting object would be simpler.
- Clean up `throw_unreachable`, since it was triggering compiler warnings.
2022-01-05 15:27:17 -08:00
Madeleine Thompson
d8db0308cb [TEST] use numpy for reference results in test_core.py (#409)
Since numpy supports unsigned integers, and pytorch doesn't, this will make it easier to test unsigned integer support.

This adds an explicit requirement for numpy in tests, but we already required scipy, so it was already an implicit dependency.
2022-01-04 13:07:29 -08:00
Philippe Tillet
03f1256f60 [FRONTEND] Added volatile flag for load (#407) 2021-12-30 22:33:24 -08:00
Noah Ziems
3edc2633e9 [TUTORIALS] Fix 01-vector-add.py typo (#406) 2021-12-29 15:09:34 -08:00
Madeleine Thompson
985798f101 add missing bfloat16 repr and improve assertions (#403)
- `BF16TyID` was missing a repr implementation.
- Throw a better exception on impossible casts.
- Add a few assertions. Tested with a debug build.
- Add `pointer_dtype.__str__` to aid kernel debugging.
2021-12-23 17:01:17 -08:00
Philippe Tillet
d8fce83e7a [FRONTEND] Remade exception picklable 2021-12-21 22:14:06 -08:00
Philippe Tillet
a425f24d54 [FRONTEND] Better cache hook (#400)
Added an additional `repr` argument to the cache hook, which represents a human-readable string representation of the signature and argument attributes associated with the compiled binary.
2021-12-21 21:29:47 -08:00
Philippe Tillet
2509124dd0 [DRIVER] Fixed some issue with how ptxas is used (#399)
Now using tmpnam and properly deleting temporaries when an exception is raised
2021-12-21 14:31:51 -08:00
daadaada
39d4bfed83 [OPS] Add performance model for gemm/gemv (#397)
Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.
2021-12-21 09:56:10 -08:00
Madeleine Thompson
5cdb948c05 [FRONTEND] signed-integer math fixes and testing (#395)
- Promote 16-bit floating-point `/` and `%` to 32-bit; we have to anyway.
- Do not force result of integer binary operations to be the LHS type. There used to be a bug in pytorch that did this, which Triton matched, but that bug is fixed now.
- When testing signed integer operations, use random numbers from the full range of the type.
- Add an optional `seed` argument to `triton.testing.random` so binary operations are not tested with both sides equal when the LHS and RHS have the same type.
- Fix a bad `CompilationError` invocation.
- Fix a warning suppression that causes tests to fail if you run them with `-W error` on python 3.8.
2021-12-21 09:46:05 -08:00
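A sketch of the full-range sampling change, using numpy (which the test suite already depends on):

```python
import numpy as np

info = np.iinfo(np.int32)
rng = np.random.default_rng(seed=0)
# Draw operands from the entire representable range of the signed type,
# not just a narrow band around zero.
lhs = rng.integers(info.min, info.max, size=128, dtype=np.int32, endpoint=True)
rhs = rng.integers(info.min, info.max, size=128, dtype=np.int32, endpoint=True)
```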
daadaada
4a8953efa3 [FRONTEND] Replace the legacy print call in triton.cc with the SlotTracker-based one. (#396)
The legacy print call will assign names (e.g., %10) to values, which can be undesirable in some cases.
2021-12-18 18:03:22 -08:00
Madeleine Thompson
fa62b4a8f6 [FRONTEND] better stringification (#394)
- Don't override `self.args` in `CompilationError`, and show the line number and column in error messages. This causes it to generate an easier-to-read backtrace.
- Better `__str__` on `TensorWrapper`, `dtype`, and `block`.
2021-12-17 20:11:45 -08:00
Philippe Tillet
4e93b41c52 [GENERAL] Some minor fixups (#393)
* [RUNTIME] Now displaying error message when generated PTX is invalid

* [CODEGEN] Now converting `if` condition to bool implicitly
2021-12-17 18:06:21 -08:00
Philippe Tillet
e062812969 [CODEGEN] Disabled peephole for masked load + select -- masked_load
doesn't work as expected when vectorized
2021-12-17 12:44:47 -08:00
Victor
eb077fc993 [RUNTIME] fixed NVidia DLL names on Windows (#392) 2021-12-16 22:09:52 -08:00
Philippe Tillet
e0b92c1380 [FRONTEND] Reverted `from .random import *`. There are still some
namespace errors in the Triton frontend apparently
2021-12-16 18:37:51 -08:00
Philippe Tillet
558555630f [FRONTEND] Added xor_sum 2021-12-16 17:55:35 -08:00
Michael Melesse
94d5c2e8b5 [ROCM] enable matmul(dot) and others (#391) 2021-12-13 12:28:15 -08:00
Madeleine Thompson
e575ae3443 [FRONTEND] Minor accumulated style and warning fixes (#388)
- Fix some whitespace.
- Make an undeclared dependency on `pytest` explicit.
- Fix deprecated `description-file` use.
- `#ifdef` out a deprecated `PyEval_InitThreads` call.
- Use a slightly different numpy invocation in `test_random.py` to quiet down overflow warnings in tests.
- Fix a deprecated cast in `test_core.py`.
- Suppress a warning about `visit_Constant` in Python 3.9+; we can't migrate yet because it'd break Python 3.6 and 3.7.
- Use chained exceptions for `CompilationError` rather than rolling our own; it makes the error messages nicer.
- Add a `__str__` for `tl.dtype` to make debugging kernels easier; it lets you `print` a dtype to see what type was inferred.
- Fix a few bad escapes.
2021-12-10 15:19:20 -08:00
Philippe Tillet
9def2424ab [RUNTIME] Fix typo in IfExp 2021-12-09 15:14:41 -08:00
Philippe Tillet
e31b9b4e66 [RUNTIME] Better support for None (#387)
* regression test fails but it doesn't make sense to me.
2021-12-09 13:21:22 -08:00
Victor
73b04d71b2 Fixes for building on Windows (#382)
* make C++ code compatible with Windows + MSVC

* added dlfcn-win32 for cross-platform dlopen

* fixed building and pip install on Windows

* fixed shared library file name under Windows
2021-12-07 14:10:58 -08:00
Victor
0ff1a26b70 fixed p2p tests failing when there are no supported p2p devices (#386) 2021-12-06 18:14:03 -08:00
Philippe Tillet
f23bf55f15 [RUNTIME] release the gil on launch (#383) 2021-12-03 13:01:01 -08:00
Philippe Tillet
8ec9f037bb [BACKEND/CODE_GEN] Fixed float32 matmul problem (#380) 2021-11-30 22:00:56 -08:00
Philippe Tillet
c86ad9c9ab [FRONTEND] Added default arguments to non-kernel @triton.jit'd function (#379) 2021-11-29 19:11:26 -08:00
daadaada
1296eb877b [RUNTIME] Config hook v2.0 (#373)
* Add pre_hook to triton.Config
* Use argument names in triton.heuristics
* Update base perf
* Remove meta from heuristics
2021-11-21 11:20:59 -08:00
Philippe Tillet
5693b582ea [RUNTIME] Now using pybind11 to avoid memory leaks (#377) 2021-11-21 02:30:22 -08:00
Philippe Tillet
edd4b0c8b7 [CODEGEN] Fixed issue with jit function passed as constexpr 2021-11-16 09:53:34 -08:00
Philippe Tillet
5b7ba3eb96 [CODEGEN] Reverted to old launch method (memory leak?) 2021-11-16 01:21:03 -08:00
Philippe Tillet
791b953b21 [CODEGEN] Reverted to old way to query current stream 2021-11-16 00:17:27 -08:00
Philippe Tillet
b908095872 [VERSION] Bumped triton.__version__ to 2.0.0 2021-11-12 15:10:36 -08:00
Philippe Tillet
01cc3d4503 [RUNTIME] Restored do_not_specialize (#374) 2021-11-12 15:06:55 -08:00
Philippe Tillet
e66bf76354 [RUNTIME] Bunch of bugfixes (#372) 2021-11-12 00:55:00 -08:00
Philippe Tillet
f7ab96cfd7 [FRONTEND] Fixed some issues with constexpr 2021-11-09 13:03:09 -08:00
daadaada
9a02dddf29 Fix sdd_lut (#368) 2021-11-08 08:25:05 -08:00
Philippe Tillet
5d54352164 [FRONTEND] Significantly reduce kernel launch time (#367) 2021-11-04 13:25:24 -07:00
Philippe Tillet
2acaa4d0dd [LANG] Added support for constexpr (#361) 2021-10-30 00:32:58 -07:00
Philippe Tillet
b7f0e87dc2 [DRIVER] Removed std::cout log message 2021-10-29 10:42:10 -07:00
Philippe Tillet
770ea96cca [PACKAGING] Bumped dev version to 2.0.0 2021-10-29 01:28:17 -07:00
Philippe Tillet
969d6de8a2 [PACKAGING] Bumped dev version to 1.1.2 2021-10-29 01:26:21 -07:00
Philippe Tillet
2d6df9b518 [PACKAGING] Bumped dev version to 1.1.2 2021-10-29 01:24:19 -07:00
Philippe Tillet
1b842f8e5e [CI] Now running integration tests on pull requests on branch v2.0 2021-10-29 01:11:12 -07:00
Philippe Tillet
d3e584d4ba Revert "[DRIVER] Fixed CUDA 10.1 bug (#357)" (#358)
This reverts commit d35014ba47.
2021-10-26 15:04:49 -07:00
Philippe Tillet
d35014ba47 [DRIVER] Fixed CUDA 10.1 bug (#357) 2021-10-26 11:17:06 -07:00
Philippe Tillet
5ce1b726dc [CODEGEN] Various bugfixes that make it possible to fuse RNG in a matmul epilogue (#356) 2021-10-24 02:30:46 -07:00
daadaada
858dec8372 [CODEGEN] Add cache modifier to tl.load (#351)
* Add cache modifier to tl.load
* Add comment to cache_modifier
* Remove force_nc_cache
* Update test
2021-10-17 22:14:04 -07:00
Philippe Tillet
90ded16c32 [DOCS] Added placeholder docstring for layernorm tutorial 2021-10-15 19:04:01 -07:00
161 changed files with 21532 additions and 19293 deletions


@@ -8,35 +8,48 @@ jobs:
Build-Documentation:
runs-on: self-hosted
runs-on: [self-hosted, V100]
steps:
- name: Checkout gh-pages
uses: actions/checkout@v1
with:
ref: 'gh-pages'
- name: Clear docs
run: |
rm -r /tmp/triton-docs
continue-on-error: true
- name: Checkout branch
uses: actions/checkout@v1
- name: Install Triton
run: |
alias python='python3'
cd python
pip3 install -e .
- name: Build docs
run: |
git fetch origin master:master
cd docs
make html
sphinx-multiversion . _build/html/
- name: Publish docs
run: |
git branch
# update docs
mkdir /tmp/triton-docs;
mv docs/_build/html/* /tmp/triton-docs/
git checkout gh-pages
sh ./update-website.sh
cp -r CNAME /tmp/triton-docs/
cp -r index.html /tmp/triton-docs/
cp -r .nojekyll /tmp/triton-docs/
rm -r *
cp -r /tmp/triton-docs/* .
# ln -s master/index.html .
# mv master docs
git add .
git commit -am "[GH-PAGES] Updated website"
# publish docs
eval `ssh-agent -s`
DISPLAY=:0 SSH_ASKPASS=~/.ssh/give_pass.sh ssh-add ${{ secrets.SSH_KEY }} <<< ${{ secrets.SSH_PASS }}
git remote set-url origin git@github.com:openai/triton.git
git push
git push


@@ -11,7 +11,7 @@ jobs:
Integration-Tests:
runs-on: self-hosted
runs-on: [self-hosted, V100]
steps:
@@ -20,14 +20,23 @@ jobs:
- name: Clear cache
run: |
rm -r /tmp/triton/
rm -r ~/.triton/
continue-on-error: true
- name: Install Triton
run: |
alias python='python3'
cd python
pip3 install -e .
pip3 install -e '.[tests]'
- name: Check imports
run: "isort -c ./python || ( echo '::error title=Imports not sorted::Please run \"isort ./python\"' ; exit 1 )"
- name: Check style
run: "autopep8 -a -r -d --exit-code ./python || ( echo '::error title=Style issues::Please run \"autopep8 -a -r -i ./python\"' ; exit 1 )"
- name: Flake8
run: "flake8 --config ./python/setup.cfg ./python || ( echo '::error::Flake8 failed; see logs for errors.' ; exit 1 )"
- name: Unit tests
run: |
@@ -43,4 +52,3 @@ jobs:
pytest -vs .
sudo nvidia-smi -i 0 -rgc
sudo nvidia-smi -i 0 -rmc


@@ -8,7 +8,7 @@ jobs:
Build-Wheels:
runs-on: self-hosted
runs-on: [self-hosted, V100]
steps:
@@ -18,7 +18,7 @@ jobs:
- name: Patch setup.py
run: |
#sed -i 's/name\=\"triton\"/name="triton-nightly"/g' python/setup.py
export LATEST_DATE=$(git show -s --format=%ci `git rev-parse HEAD` | cut -d ' ' -f 1 | sed 's/-//g')
export LATEST_DATE=$(TZ=UTC0 git show --quiet --date='format-local:%Y%m%d' --format="%cd")
sed -i -r "s/version\=\"(.*)\"/version=\"\1-dev"$LATEST_DATE"\"/g" python/setup.py
echo "" >> python/setup.cfg
echo "[build_ext]" >> python/setup.cfg

.gitignore

@@ -1,6 +1,13 @@
build/
__pycache__
.pytest_cache
python/build/
python/triton.egg-info/
python/triton/_C/libtriton.pyd
python/triton/_C/libtriton.so
.vscode
.vs
log_*

.gitmodules (new file)

@@ -0,0 +1,3 @@
[submodule "deps/dlfcn-win32"]
path = deps/dlfcn-win32
url = https://github.com/dlfcn-win32/dlfcn-win32.git

.isort.cfg (new file)

@@ -0,0 +1,4 @@
[settings]
known_local_folder=triton
line_length=88
py_version=36


@@ -1,14 +1,20 @@
cmake_minimum_required(VERSION 3.6)
include(ExternalProject)
set(CMAKE_CXX_STANDARD 17)
if(NOT TRITON_LLVM_BUILD_DIR)
set(TRITON_LLVM_BUILD_DIR ${CMAKE_BINARY_DIR})
endif()
set(TRITON_USE_ROCM "$ENV{TRITON_USE_ROCM}")
set(TRITON_ROCM_DEBUG "$ENV{TRITON_ROCM_DEBUG}")
project(triton)
include(CTest)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
if(NOT WIN32)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
endif()
# Options
option(BUILD_TUTORIALS "Build C++ Triton tutorials" ON)
@@ -20,18 +26,47 @@ if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE "Release")
endif()
find_library(TERMINFO_LIBRARY tinfo)
if(NOT WIN32)
find_library(TERMINFO_LIBRARY tinfo)
endif()
# Compiler flags
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/include)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__STDC_FORMAT_MACROS -std=gnu++17")
# Third-party
include_directories(${PYBIND11_INCLUDE_DIR})
if(WIN32)
SET(BUILD_SHARED_LIBS OFF)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/deps/dlfcn-win32/src)
add_subdirectory(deps/dlfcn-win32/src ${CMAKE_BINARY_DIR}/dlfcn-win32)
endif()
if (TRITON_USE_ROCM)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__STDC_FORMAT_MACROS -std=gnu++17 -Wno-unused-result -Wno-attributes")
else()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__STDC_FORMAT_MACROS -std=gnu++17")
endif()
##########
# LLVM
##########
if("${LLVM_LIBRARY_DIR}" STREQUAL "")
find_package(LLVM 11 REQUIRED COMPONENTS "nvptx;amdgpu")
if(WIN32)
find_package(LLVM 13 REQUIRED COMPONENTS nvptx amdgpu)
include_directories(${LLVM_INCLUDE_DIRS})
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND ${LLVM_DEFINITIONS})
add_definitions(${LLVM_DEFINITIONS_LIST})
llvm_map_components_to_libnames(LLVM_LIBRARIES support core
NVPTXInfo nvptxcodegen
AMDGPUInfo AMDGPUcodegen
)
else()
find_package(LLVM 11 REQUIRED COMPONENTS "nvptx;amdgpu")
endif()
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
if(APPLE)
set(CMAKE_OSX_DEPLOYMENT_TARGET "10.14")
@@ -39,7 +74,7 @@ if("${LLVM_LIBRARY_DIR}" STREQUAL "")
# sometimes we don't want to use llvm-config, since it may have been downloaded for some specific linux distros
else()
set(LLVM_LDFLAGS "-L${LLVM_LIBRARY_DIR}")
set(LLVM_LIBRARIES
set(LLVM_LIBRARIES
libLLVMNVPTXCodeGen.a
libLLVMNVPTXDesc.a
libLLVMNVPTXInfo.a
@@ -84,6 +119,15 @@ libLLVMBinaryFormat.a
libLLVMAMDGPUInfo.a
libLLVMSupport.a
libLLVMDemangle.a
libLLVMPasses.a
libLLVMAnalysis.a
libLLVMTransformUtils.a
libLLVMScalarOpts.a
libLLVMTransformUtils.a
libLLVMipo.a
libLLVMObjCARCOpts.a
libLLVMCoroutines.a
libLLVMAnalysis.a
)
endif()
include_directories("${LLVM_INCLUDE_DIRS}")
@@ -102,18 +146,38 @@ if(BUILD_PYTHON_MODULE)
endif()
include_directories("." ${PYTHON_SRC_PATH} ${PYTHON_INCLUDE_DIRS} ${CUTLASS_INCLUDE_DIR})
link_directories(${PYTHON_LINK_DIRS} ${CUTLASS_LIBRARY_DIR})
if (TRITON_USE_ROCM)
add_definitions(-DUSE_ROCM)
endif()
if (TRITON_ROCM_DEBUG)
add_definitions(-DDEBUG_ROCM)
endif()
set(PYTHON_SRC ${PYTHON_SRC_PATH}/main.cc ${PYTHON_SRC_PATH}/triton.cc ${PYTHON_SRC_PATH}/superblock.cc ${CUTLASS_SRC})
endif()
# Triton
file(GLOB_RECURSE LIBTRITON_SRC lib/*.cc)
add_library(triton SHARED ${LIBTRITON_SRC} ${PYTHON_SRC})
if (WIN32 AND BUILD_PYTHON_MODULE)
find_package(Python3 REQUIRED COMPONENTS Development)
Python3_add_library(triton SHARED ${LIBTRITON_SRC} ${PYTHON_SRC})
set_target_properties(triton PROPERTIES SUFFIX ".pyd")
set_target_properties(triton PROPERTIES PREFIX "lib")
else()
add_library(triton SHARED ${LIBTRITON_SRC} ${PYTHON_SRC})
endif()
target_link_options(triton PRIVATE ${LLVM_LDFLAGS})
target_link_libraries(triton ${LLVM_LIBRARIES} z ${TERMINFO_LIBRARY})
if(WIN32)
target_link_libraries(triton PRIVATE ${LLVM_LIBRARIES} dl) # dl is from dlfcn-win32
else()
target_link_libraries(triton ${LLVM_LIBRARIES} z)
endif()
if(BUILD_PYTHON_MODULE)
if(BUILD_PYTHON_MODULE AND NOT WIN32)
set(CMAKE_SHARED_LIBRARY_SUFFIX ".so")
# Check if the platform is MacOS
if(APPLE)


@@ -1,6 +1,6 @@
/*
* Copyright 2018-2020 Philippe Tillet
* Copyright 2020-2021 OpenAI
* Copyright 2020-2022 OpenAI
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files
@@ -20,4 +20,4 @@
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
*/


@@ -12,12 +12,36 @@
# Triton
This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.
This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment for expressing tensor math workloads that offers high flexibility, developer productivity and end to end performance.
The foundations of this project are described in the following MAPL2019 publication: [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf). Please consider citing this work if you use Triton!
The [official documentation](https://triton-lang.org) contains installation instructions and tutorials.
# Quick Installation
You can install the latest stable release of Triton from pip:
```bash
pip install triton
```
Binary wheels are available for CPython 3.6-3.9 and PyPy 3.6-3.7.
And the latest nightly release:
```bash
pip install -U --pre triton
```
# Install from source
```
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
```
# Changelog
Version 1.1 is out! New features include:


@@ -25,7 +25,7 @@
# LLVM_VERSION_STRING - Full LLVM version string (e.g. 6.0.0svn).
# LLVM_VERSION_BASE_STRING - Base LLVM version string without git/svn suffix (e.g. 6.0.0).
#
# Note: The variable names were chosen in conformance with the offical CMake
# Note: The variable names were chosen in conformance with the official CMake
# guidelines, see ${CMAKE_ROOT}/Modules/readme.txt.
# Try suffixed versions to pick up the newest LLVM install available on Debian
@@ -196,4 +196,4 @@ include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(LLVM
REQUIRED_VARS LLVM_ROOT_DIR
VERSION_VAR LLVM_VERSION_STRING)
VERSION_VAR LLVM_VERSION_STRING)

Submodule deps/dlfcn-win32 added at 522c301ec3

docs/_templates/versions.html (new file)

@@ -0,0 +1,27 @@
{%- if current_version %}
<div class="rst-versions" data-toggle="rst-versions" role="note" aria-label="versions">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> Other Versions</span>
v: {{ current_version.name }}
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
{%- if versions.tags %}
<dl>
<dt>Tags</dt>
{%- for item in versions.tags %}
<dd><a href="{{ item.url }}">{{ item.name }}</a></dd>
{%- endfor %}
</dl>
{%- endif %}
{%- if versions.branches %}
<dl>
<dt>Branches</dt>
{%- for item in versions.branches %}
<dd><a href="{{ item.url }}">{{ item.name }}</a></dd>
{%- endfor %}
</dl>
{%- endif %}
</div>
</div>
{%- endif %}


@@ -34,15 +34,18 @@ def process_sig(app, what, name, obj, options, signature, return_annotation):
def setup(app):
"""Customize function args retrieving to get args under decorator."""
import sphinx
import triton
import os
app.connect("autodoc-process-signature", process_sig)
os.system("pip install -e ../python")
def forward_jit_fn(func):
old = func
def wrapped(obj, **kwargs):
if isinstance(obj, triton.code_gen.JITFunction):
import triton
if isinstance(obj, triton.runtime.JITFunction):
obj = obj.fn
return old(obj)
@@ -52,7 +55,8 @@ def setup(app):
old_documenter = sphinx.ext.autosummary.get_documenter
def documenter(app, obj, parent):
if isinstance(obj, triton.code_gen.JITFunction):
import triton
if isinstance(obj, triton.runtime.JITFunction):
obj = obj.fn
return old_documenter(app, obj, parent)
@@ -66,9 +70,17 @@ def setup(app):
import sys
import os
sys.path.insert(0, os.path.abspath('../python/'))
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.intersphinx', 'sphinx.ext.autosummary', 'sphinx.ext.coverage', 'sphinx.ext.napoleon']
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.intersphinx', 'sphinx.ext.autosummary', 'sphinx.ext.coverage', 'sphinx.ext.napoleon', 'sphinx_multiversion']
autosummary_generate = True
# versioning config
smv_tag_whitelist = r'^(v1.1.2)$'
smv_branch_whitelist = r'^master$'
smv_remote_whitelist = None
smv_released_pattern = r'^tags/.*$'
smv_outputdir_format = '{ref.name}'
smv_prefer_remote_refs = False
# Sphinx gallery
extensions += ['sphinx_gallery.gen_gallery']
from sphinx_gallery.sorting import FileNameSortKey
@@ -85,6 +97,11 @@ sphinx_gallery_conf = {
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
html_sidebars = {
'**': [
'_templates/versions.html',
],
}
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:


@@ -34,17 +34,19 @@ You can install the Python package from source by running the following commands
.. code-block:: bash
git clone https://github.com/openai/triton.git;
cd triton/python;
cd triton;
git submodule update --init --recursive;
cd python;
pip install cmake; # build time dependency
pip install -e .
Note that, if llvm-11 is not present on your system, the setup.py script will download the official LLVM11 static libraries link against that.
Note that, if llvm-11 is not present on your system and you are on linux, the setup.py script will download the official LLVM11 static libraries link against that. For windows users, LLVM must be installed and configured in PATH.
You can then test your installation by running the unit tests:
.. code-block:: bash
pip install -r requirements-test.txt
pip install -e '.[tests]'
pytest -vs test/unit/
and the benchmarks


@@ -14,7 +14,7 @@ Traditional compilers typically rely on intermediate representations, such as LL
Program Representation
+++++++++++++++++++++++
Polyhedral compilation is a vast area of research. In this section we only outline the most basic aspects of this topic, but readers interested in the solid mathematical foundations underneath may refer to the ample litterature on linear and integer programming.
Polyhedral compilation is a vast area of research. In this section we only outline the most basic aspects of this topic, but readers interested in the solid mathematical foundations underneath may refer to the ample literature on linear and integer programming.
.. table::
:widths: 50 50

View File

@@ -106,9 +106,13 @@ Atomic Ops
:nosignatures:
atomic_cas
atomic_xchg
atomic_add
atomic_max
atomic_min
atomic_and
atomic_or
atomic_xor
Comparison ops

163 include/print_helper.h Executable file
View File

@@ -0,0 +1,163 @@
#pragma once
#ifndef _PRINT_IR_H_
#define _PRINT_IR_H_
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Verifier.h"
#include "llvm/IR/IRPrintingPasses.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/IR/Verifier.h"
#include "llvm/IR/Module.h"
#include <fstream>
#include <iostream>
#include <sstream>
#include "triton/ir/module.h"
#include "triton/ir/print.h"
#include <iomanip>
#include <chrono> // std::chrono is used by return_current_time_and_date()
#define PRINT_CURRENT_FUNCTION() std::cout << __FILE__ << ":" << __LINE__ << ":" << __FUNCTION__ << std::endl;
static int print_count = 0;
inline std::string return_current_time_and_date()
{
auto now = std::chrono::system_clock::now();
auto in_time_t = std::chrono::system_clock::to_time_t(now);
std::stringstream ss;
ss << std::put_time(std::localtime(&in_time_t), "%Y-%m-%d--%I-%M-%S");
return ss.str();
}
template <typename T>
inline void print_vector(std::vector<T> &vec, std::string name = "")
{
std::cout << name << ": ";
for (size_t i = 0; i < vec.size(); ++i)
{
// emit the separator before each element after the first, instead of
// printing a trailing ", " and trying to erase it with a backspace
if (i > 0)
std::cout << ", ";
std::cout << vec[i];
}
std::cout << std::endl;
}
// get llvm ir as a string (optionally echoing it to stdout)
inline std::string print_llvm_module(llvm::Module *llvm_module, bool print_to_cout = true)
{
std::cout << "\t" << "print_llvm_module" << std::endl;
// get module as a string
std::error_code ec;
std::string mod_string;
std::unique_ptr<llvm::raw_string_ostream> ir_ss(
new llvm::raw_string_ostream(mod_string));
llvm_module->print(*ir_ss, nullptr);
// print module
if (print_to_cout)
{
if (!mod_string.empty())
std::cout << "\t" << mod_string << std::endl;
else
std::cout << "\t" << llvm_module->getModuleIdentifier() << ": "
<< "is empty" << std::endl;
}
return mod_string;
}
// dump llvm ir to tmp file
inline void write_llvm_ir(llvm::Module *llvm_module, std::string filename = "", bool tracked = false)
{
// get module string
std::string module_string = print_llvm_module(llvm_module, false);
// get file name and path
if (filename.empty())
filename = llvm_module->getModuleIdentifier();
std::string count_str = "";
if (tracked)
{
count_str = "_" + std::to_string(print_count);
}
std::string ir_path = std::string("/tmp/") + filename + count_str + std::string(".ll");
// write file
std::ofstream output_file(ir_path);
output_file << module_string;
output_file.close();
// increment counter
if (tracked)
{
print_count += 1;
}
}
inline void print_triton_ir(triton::ir::module ir_ref, std::string name)
{
std::ofstream ir_out(std::string("/tmp/") + name + std::string("_") + return_current_time_and_date() + std::string(".ttir"));
ir_out.flush();
triton::ir::print(ir_ref, ir_out);
ir_out.close();
}
inline void print_triton_ir(std::string ir_ref, std::string name)
{
std::ofstream ir_out(std::string("/tmp/") + name + std::string("_") + return_current_time_and_date() + std::string(".ttir"));
ir_out.flush();
ir_out << ir_ref << std::endl;
ir_out.close();
}
inline std::string get_llvm_value_as_str(llvm::Value *llvm_value)
{
std::string value_str;
llvm::raw_string_ostream rso(value_str);
llvm_value->print(rso);
return rso.str();
}
inline void print_llvm_value(llvm::Value *llvm_value, std::string name = "")
{
if (llvm_value)
std::cout << "\t" << name << ": " << get_llvm_value_as_str(llvm_value) << std::endl;
else
std::cout << "\t" << name << ": "
<< "is nullptr" << std::endl;
}
inline void print_llvm_type(llvm::Type *llvm_type, std::string name = "")
{
std::string type_str;
llvm::raw_string_ostream rso(type_str);
llvm_type->print(rso);
std::cout << name << " type: " << rso.str() << std::endl;
}
inline void print_llvm_value_type(llvm::Value *llvm_value, std::string name = "")
{
print_llvm_type(llvm_value->getType(), name);
}
inline void write_ptx(std::string ptx_str)
{
std::ofstream file("/tmp/kernel.ptx");
file << ptx_str;
}
#endif
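A brief usage sketch for the helpers above (an editor's illustration, not code from the commit; `mod` and `v` are assumed to be live LLVM objects obtained elsewhere):

#include "print_helper.h"

void debug_dump(llvm::Module *mod, llvm::Value *v) {
  write_llvm_ir(mod);                           // writes /tmp/<module-id>.ll
  write_llvm_ir(mod, "step", /*tracked=*/true); // /tmp/step_0.ll, /tmp/step_1.ll, ...
  print_llvm_value(v, "acc");                   // prints "\tacc: <ir>" or a nullptr notice
  print_llvm_value_type(v, "acc");              // prints "acc type: <llvm type>"
}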

View File

@@ -12,7 +12,9 @@ namespace ir {
class phi_node;
class splat_inst;
class cast_inst;
class cmp_inst;
class reshape_inst;
class dequantize_inst;
class broadcast_inst;
class binary_operator;
class getelementptr_inst;
@@ -33,8 +35,10 @@ private:
std::vector<cst_info> populate_is_constant_phi(ir::phi_node* x);
std::vector<cst_info> populate_is_constant_splat(ir::splat_inst* x);
std::vector<cst_info> populate_is_constant_reshape(ir::reshape_inst* x);
std::vector<cst_info> populate_is_constant_dequantize(ir::dequantize_inst* x);
std::vector<cst_info> populate_is_constant_broadcast(ir::broadcast_inst* x);
std::vector<cst_info> populate_is_constant_binop(ir::binary_operator* x);
std::vector<cst_info> populate_is_constant_cmp(ir::cmp_inst* x);
std::vector<cst_info> populate_is_constant_gep(ir::getelementptr_inst* x);
std::vector<cst_info> populate_is_constant_default(ir::value* v);
std::vector<cst_info> populate_is_constant(ir::value *v);
@@ -42,6 +46,7 @@ private:
std::vector<unsigned> populate_max_contiguous_phi(ir::phi_node* x);
std::vector<unsigned> populate_max_contiguous_splat(ir::splat_inst* x);
std::vector<unsigned> populate_max_contiguous_reshape(ir::reshape_inst* x);
std::vector<unsigned> populate_max_contiguous_dequantize(ir::dequantize_inst* x);
std::vector<unsigned> populate_max_contiguous_broadcast(ir::broadcast_inst* x);
std::vector<unsigned> populate_max_contiguous_binop(ir::binary_operator* x);
std::vector<unsigned> populate_max_contiguous_gep(ir::getelementptr_inst* x);
@@ -52,6 +57,7 @@ private:
std::vector<unsigned> populate_starting_multiple_phi(ir::phi_node* x);
std::vector<unsigned> populate_starting_multiple_splat(ir::splat_inst* x);
std::vector<unsigned> populate_starting_multiple_reshape(ir::reshape_inst* x);
std::vector<unsigned> populate_starting_multiple_dequantize(ir::dequantize_inst* x);
std::vector<unsigned> populate_starting_multiple_broadcast(ir::broadcast_inst* x);
std::vector<unsigned> populate_starting_multiple_binop(ir::binary_operator* x);
std::vector<unsigned> populate_starting_multiple_gep(ir::getelementptr_inst* x);
@@ -65,6 +71,7 @@ public:
void run(ir::module &mod);
unsigned get(ir::value* v, unsigned ax) const;
std::vector<unsigned> contiguous(ir::value* v) const;
std::vector<cst_info> get_cst_info(ir::value* v) const;
private:
std::map<ir::value*, std::vector<cst_info>> is_constant_;

View File

@@ -25,6 +25,7 @@ private:
void update_graph_reduce(ir::instruction *i);
void update_graph_reshape(ir::instruction *i);
void update_graph_trans(ir::instruction *i);
void update_graph_dequantize(ir::instruction *i);
void update_graph_broadcast(ir::instruction *i);
void update_graph_dot(ir::instruction *i);
void update_graph_elementwise(ir::instruction *i,

View File

@@ -103,12 +103,70 @@ public:
int shape_per_cta(size_t k) { return shape_per_cta_.at(k); }
int rep_per_cta(size_t k) { return shape_[k] / shape_per_cta_[k]; }
virtual int contig_per_thread(size_t k) = 0;
protected:
std::vector<int> shape_per_cta_;
};
class mma_layout: public distributed_layout {
public:
enum TensorCoreType : uint8_t {
// floating-point tensor core instr
FP32_FP16_FP16_FP32 = 0, // default
FP32_BF16_BF16_FP32,
FP32_TF32_TF32_FP32,
// integer tensor core instr
INT32_INT1_INT1_INT32, // Not implemented
INT32_INT4_INT4_INT32, // Not implemented
INT32_INT8_INT8_INT32, // Not implemented
//
NOT_APPLICABLE,
};
// Used on nvidia GPUs with sm >= 80
inline static const std::map<TensorCoreType, std::vector<int>> mma_instr_shape_ = {
{FP32_FP16_FP16_FP32, {16, 8, 16}},
{FP32_BF16_BF16_FP32, {16, 8, 16}},
{FP32_TF32_TF32_FP32, {16, 8, 8}},
{INT32_INT1_INT1_INT32, {16, 8, 256}},
{INT32_INT4_INT4_INT32, {16, 8, 64}},
{INT32_INT8_INT8_INT32, {16, 8, 32}},
};
// shape of matrices loaded by ldmatrix (m-n-k, for mxk & kxn matrices)
inline static const std::map<TensorCoreType, std::vector<int>> mma_mat_shape_ = {
{FP32_FP16_FP16_FP32, {8, 8, 8}},
{FP32_BF16_BF16_FP32, {8, 8, 8}},
{FP32_TF32_TF32_FP32, {8, 8, 4}},
{INT32_INT1_INT1_INT32, {8, 8, 64}},
{INT32_INT4_INT4_INT32, {8, 8, 32}},
{INT32_INT8_INT8_INT32, {8, 8, 16}},
};
inline static const std::map<TensorCoreType, std::string> mma_instr_ptx_ = {
{FP32_FP16_FP16_FP32, "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"},
{FP32_BF16_BF16_FP32, "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32"},
{FP32_TF32_TF32_FP32, "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32"},
{INT32_INT1_INT1_INT32, "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc"},
{INT32_INT4_INT4_INT32, "mma.sync.aligned.m16n8k64.row.col.satfinite.s32.s4.s4.s32"},
{INT32_INT8_INT8_INT32, "mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32"},
};
// vector length per ldmatrix (16*8/element_size_in_bits)
inline static const std::map<TensorCoreType, int> mma_instr_vec_ = {
{FP32_FP16_FP16_FP32, 8},
{FP32_BF16_BF16_FP32, 8},
{FP32_TF32_TF32_FP32, 4},
{INT32_INT1_INT1_INT32, 128},
{INT32_INT4_INT4_INT32, 32},
{INT32_INT8_INT8_INT32, 16},
};
public:
mma_layout(size_t num_warps,
const std::vector<int>& axes,
@@ -116,13 +174,25 @@ public:
const std::vector<ir::value *> &values,
analysis::align* align, target *tgt,
shared_layout* layout_a,
shared_layout* layout_b);
shared_layout* layout_b,
ir::value *dot);
void accept(layout_visitor* vst) { vst->visit_layout_mma(this); }
// accessor
int fpw(size_t k) { return fpw_.at(k); }
int wpt(size_t k) { return wpt_.at(k); }
int spw(size_t k) { return spw_.at(k); }
int rep(size_t k) { return rep_.at(k); }
int contig_per_thread(size_t k) { return contig_per_thread_.at(k); }
// helpers for generator.cc
std::string get_ptx_instr() const { return mma_instr_ptx_.at(tensor_core_type_); }
std::vector<int> get_mma_instr_shape() const { return mma_instr_shape_.at(tensor_core_type_); }
std::vector<int> get_mma_mat_shape() const { return mma_mat_shape_.at(tensor_core_type_); }
int get_vec_a() const { return mma_instr_vec_.at(tensor_core_type_); }
int get_vec_b() const { return mma_instr_vec_.at(tensor_core_type_); }
// setter
void set_tensor_core_type(TensorCoreType type) { tensor_core_type_ = type; }
private:
// fragment per warp
@@ -135,9 +205,14 @@ private:
std::vector<int> spt_;
// repetitions
std::vector<int> rep_;
// contiguous per thread
std::vector<int> contig_per_thread_;
TensorCoreType tensor_core_type_ = FP32_FP16_FP16_FP32;
};
struct scanline_layout: public distributed_layout {
class scanline_layout: public distributed_layout {
public:
scanline_layout(size_t num_warps,
const std::vector<int>& axes,
const std::vector<unsigned>& shape,
@@ -148,8 +223,10 @@ struct scanline_layout: public distributed_layout {
// accessor
int mts(size_t k) { return mts_.at(k); }
int nts(size_t k) { return nts_.at(k); }
int contig_per_thread(size_t k) { return nts_.at(k); }
public:
int per_thread(size_t k) { return contig_per_thread(k) * shape_[k] / shape_per_cta(k);}
private:
// micro tile size. The size of a tile held by a thread block.
std::vector<int> mts_;
// nano tile size. The size of a tile held by a thread.
@@ -169,7 +246,7 @@ struct N_buffer_info_t {
std::map<ir::value*, int> firsts_idx;
};
// abstract for dot and coresponding smem values
// abstract for dot and corresponding smem values
class shared_layout: public data_layout {
private:
static bool is_loop_latch(ir::phi_node *phi, ir::instruction *terminator);
@@ -182,7 +259,8 @@ public:
const std::vector<unsigned>& shapes,
const std::vector<ir::value *> &values_,
ir::type *ty,
analysis::align* align);
analysis::align* align, target *tgt,
bool is_tmp = false);
void accept(layout_visitor* vst) { vst->visit_layout_shared(this); }
// accessors
size_t get_size() { return size_; }
@@ -197,7 +275,10 @@ public:
ir::value* hmma_dot_b() { return hmma_dot_b_; }
void set_mma_vec(int mma_vec) { mma_vec_ = mma_vec; }
int get_mma_vec() { return mma_vec_;}
int get_mma_strided() { return mma_strided_; }
bool allow_swizzle() const { return allow_swizzle_; }
data_layout* get_arg_layout() { return arg_layout_; }
bool is_tmp() const { return is_tmp_; }
private:
size_t size_;
@@ -209,6 +290,10 @@ private:
ir::value* hmma_dot_b_;
data_layout* arg_layout_;
int mma_vec_;
int mma_strided_;
bool allow_swizzle_ = true;
target *tgt_;
bool is_tmp_;
};
@@ -227,13 +312,20 @@ private:
void create(size_t id, const std::vector<ir::value*>& values);
public:
void create_tmp_layout(size_t id, data_layout* arg,
const std::vector<int>& axes,
const std::vector<unsigned>& shape,
ir::instruction* i,
bool is_index = false);
public:
// constructor
layouts(analysis::axes *axes, analysis::align *align, size_t num_warps, target* tgt);
// accessors
unsigned layout_of(ir::value *value) const { return groups_.at(value); }
bool has(ir::value* value) const { return groups_.find(value) != groups_.end(); }
bool has(size_t id) { return layouts_.find(id) != layouts_.end(); }
const std::vector<ir::value*>& values_of(unsigned id) const { return values_.at(id); }
size_t num_layouts() const { return values_.size();}
data_layout* get(size_t id) { return layouts_.at(id); }
@@ -241,7 +333,19 @@ public:
std::map<size_t, data_layout*> &get_all() { return layouts_; }
bool has_tmp(ir::value* i) { return tmp_.find(i) != tmp_.end(); }
int tmp(ir::value* i) { return tmp_.at(i);}
int has_tmp_index(ir::value* i) { return tmp_index_.find(i) != tmp_index_.end(); }
int tmp_index(ir::value* i) { return tmp_index_.at(i);}
void copy(ir::value* dst, ir::value* src) { groups_[dst] = groups_[src]; }
// layout checkers
bool is_scanline(ir::instruction* i);
bool is_coalesced_scanline(ir::instruction* i);
bool is_mma(ir::instruction* i);
bool is_a100_mma(ir::instruction* i);
// execution
void run(ir::module &mod);
@@ -255,6 +359,7 @@ private:
std::map<size_t, std::vector<ir::value*>> values_;
std::map<size_t, data_layout*> layouts_;
std::map<ir::value*, size_t> tmp_;
std::map<ir::value*, size_t> tmp_index_;
};
}
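The static tables added to mma_layout above tie each TensorCoreType to its PTX mma instruction, its instruction and ldmatrix shapes, and its vector width. A hedged sketch of a lookup (editor's illustration; the layout pointer is assumed to come from the layouts analysis):

#include "triton/codegen/analysis/layout.h"
#include <iostream>
#include <vector>

void describe_mma(triton::codegen::analysis::mma_layout *mma) {
  using L = triton::codegen::analysis::mma_layout;
  mma->set_tensor_core_type(L::FP32_BF16_BF16_FP32);
  std::vector<int> instr = mma->get_mma_instr_shape(); // {16, 8, 16}
  std::cout << mma->get_ptx_instr()                    // mma.sync.aligned.m16n8k16...
            << " vec_a=" << mma->get_vec_a()           // 16*8 bits / 16-bit bf16 = 8
            << std::endl;
  (void)instr;
}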

View File

@@ -1,12 +1,14 @@
#ifndef TDL_INCLUDE_IR_CODEGEN_LIVENESS_H
#define TDL_INCLUDE_IR_CODEGEN_LIVENESS_H
#include <map>
#include <set>
#include <vector>
#include "triton/codegen/analysis/layout.h"
#include "triton/tools/graph.h"
#include "llvm/ADT/MapVector.h"
#include <set>
#include <vector>
namespace triton{
namespace ir{
@@ -42,14 +44,14 @@ struct segment {
class liveness {
private:
typedef std::map<shared_layout*, segment> intervals_map_t;
typedef llvm::MapVector<shared_layout*, segment> intervals_map_t;
public:
// constructor
liveness(layouts *l): layouts_(l){ }
// accessors
const intervals_map_t& get() const { return intervals_; }
segment get(shared_layout* v) const { return intervals_.at(v); }
segment get(shared_layout* v) const { return intervals_.lookup(v); }
// run
void run(ir::module &mod);
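A plausible rationale for the std::map → llvm::MapVector change above (inferred by the editor, not stated in the diff): MapVector iterates in insertion order, so walking the liveness intervals becomes deterministic instead of depending on shared_layout pointer values used as map keys. A standalone illustration:

#include "llvm/ADT/MapVector.h"
#include <cstdio>

int main() {
  llvm::MapVector<int, const char *> intervals;
  intervals.insert({42, "first inserted"});
  intervals.insert({7, "second inserted"});
  // Iteration follows insertion order (42 then 7); a std::map would
  // iterate in key order (7 then 42) instead.
  for (auto &kv : intervals)
    std::printf("%d -> %s\n", kv.first, kv.second);
  return 0;
}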

View File

@@ -0,0 +1,90 @@
#ifndef _TRITON_CODE_GEN_EXTERN_LIB_H_
#define _TRITON_CODE_GEN_EXTERN_LIB_H_
#include <memory>
#include <string>
#include <map>
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
namespace triton {
namespace codegen {
///
/// \brief ExternLib is a class that represents a library of external functions.
///
class ExternLib {
public:
ExternLib(const std::string &name, const std::string &path)
: name_(name), path_(path) {}
virtual ~ExternLib() = default;
virtual const std::string &name() const { return name_; }
virtual const std::string &path() const { return path_; }
///
/// \brief Load the library and return the module.
///
std::unique_ptr<llvm::Module> load(llvm::LLVMContext &ctx);
///
/// \brief Link the module into the given module.
///
void link(std::unique_ptr<llvm::Module> &llvm,
std::unique_ptr<llvm::Module> &mod);
///
/// \brief Run load, link, and opt on the module.
///
virtual void install(llvm::LLVMContext &ctx,
std::unique_ptr<llvm::Module> &llvm) {
auto mod = load(ctx);
link(llvm, mod);
opt(ctx, llvm);
}
///
/// \brief Run opt on the module.
///
virtual void opt(llvm::LLVMContext &ctx,
std::unique_ptr<llvm::Module> &llvm) = 0;
private:
std::string name_;
std::string path_;
};
///
/// \brief ExternLibMap is a map from library names to their ExternLib instances.
///
typedef std::map<std::string, std::unique_ptr<ExternLib>> ExternLibMap;
///
/// \brief Concrete class for NVIDIA's libdevice library.
///
class LibDevice final : public ExternLib {
public:
LibDevice(const std::string &name, const std::string &path)
: ExternLib(name, path) {}
virtual ~LibDevice() = default;
virtual void opt(llvm::LLVMContext &ctx,
std::unique_ptr<llvm::Module> &llvm) override;
};
///
/// \brief Create an ExternLib instance based on the name and path.
///
std::unique_ptr<ExternLib> create_extern_lib(const std::string &lib_name,
const std::string &lib_path);
} // namespace codegen
} // namespace triton
#endif
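A short sketch of the intended flow (editor's illustration; the library name and bitcode path are placeholders): create_extern_lib() picks the concrete subclass by name, and install() chains load → link → opt on the module being compiled.

#include "triton/codegen/extern_lib.h"

void link_external(llvm::LLVMContext &ctx,
                   std::unique_ptr<llvm::Module> &llvm_mod) {
  auto lib = triton::codegen::create_extern_lib(
      "libdevice", "/path/to/libdevice.10.bc"); // hypothetical path
  lib->install(ctx, llvm_mod); // load the bitcode, link it in, then run opt
}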

View File

@@ -3,6 +3,7 @@
#include <memory>
#include "extern_lib.h"
namespace llvm{
class Module;
@@ -30,12 +31,10 @@ namespace codegen{
// TODO:
// There should be a proper pass manager there!
std::unique_ptr<llvm::Module> add_passes_to_emit_bin(ir::module &ir, llvm::LLVMContext& ctx,
codegen::target* target,
int sm, int num_warps,
int num_stages, bool force_nc_cache, int &shared_static);
std::unique_ptr<llvm::Module> add_passes_to_emit_bin(
ir::module &ir, llvm::LLVMContext &ctx, codegen::target *target,
int num_warps, int num_stages, int &shared_static,
const ExternLibMap &extern_libs);
}
}
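Putting the two previous headers together, a hedged sketch of calling the updated entry point (editor's illustration; `ir`, `ctx`, and `tgt` are assumed to be constructed elsewhere in the driver, and the warp/stage counts are arbitrary):

int shared_static = 0;
triton::codegen::ExternLibMap libs;
libs.emplace("libdevice", triton::codegen::create_extern_lib(
                              "libdevice", "/path/to/libdevice.10.bc"));
std::unique_ptr<llvm::Module> mod = triton::codegen::add_passes_to_emit_bin(
    ir, ctx, tgt, /*num_warps=*/4, /*num_stages=*/3, shared_static, libs);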

View File

@@ -4,7 +4,9 @@
#define _TRITON_SELECTION_GENERATOR_H_
#include "triton/ir/visitor.h"
#include "triton/ir/instructions.h"
#include "triton/codegen/analysis/layout.h"
#include "triton/codegen/extern_lib.h"
#include <functional>
// forward
@@ -24,6 +26,7 @@ namespace llvm{
class IRBuilder;
class ArrayType;
class Function;
class StructType;
}
namespace triton{
@@ -114,18 +117,28 @@ private:
private:
Type *cvt(ir::type *ty);
llvm::Attribute cvt(ir::attribute attr);
void packed_type(ir::value* i);
void forward_declare(ir::function* fn);
Value *cast_shared_layout_ptr(analysis::data_layout *layout, Type *ty);
public:
private:
typedef std::function<void(
std::pair<Value *, Value *> &acc, std::function<Value *()> load_value_fn,
std::function<Value *()> load_index_fn, bool is_first)>
acc_fn_t;
public:
generator(analysis::axes *a_axes,
analysis::layouts *layouts,
analysis::align *alignment,
analysis::allocation *alloc,
analysis::swizzle *swizzle,
target *tgt,
unsigned num_warps,
bool force_nc_cache = false);
unsigned num_warps);
void visit_value(ir::value* v);
void visit_call_inst(ir::call_inst*);
void visit_launch_inst(ir::launch_inst *);
void visit_phi_node(ir::phi_node*);
void visit_binary_operator(ir::binary_operator*);
void visit_getelementptr_inst(ir::getelementptr_inst*);
@@ -135,9 +148,19 @@ public:
std::tuple<Value*, Value*, Value*, Value*> fp32x4_to_fp8x4(Value *in0, Value *in1, Value *in2, Value *in3);
std::tuple<Value*, Value*, Value*, Value*> fp8x4_to_fp16x4(Value *in0, Value *in1, Value *in2, Value *in3);
std::tuple<Value*, Value*, Value*, Value*> fp16x4_to_fp8x4(Value *in0, Value *in1, Value *in2, Value *in3);
std::tuple<Value*, Value*, Value*, Value*> fp8x4_to_bf16x4(Value *in0, Value *in1, Value *in2, Value *in3);
std::tuple<Value*, Value*, Value*, Value*> bf16x4_to_fp8x4(Value *in0, Value *in1, Value *in2, Value *in3);
Value* bf16_to_fp32(Value *in0);
Value* fp32_to_bf16(Value *in0);
std::tuple<Value*, Value*, Value*, Value*, Value*, Value*, Value*, Value*> int16_to_float16x8(
Value *in0, Value *scale_x512, Value *shift
);
std::tuple<Value*, Value*, Value*, Value*, Value*, Value*, Value*, Value*> int32_to_float16x8(
Value *in0, Value *scale_x512, Value *shift
);
std::tuple<Value*, Value*, Value*, Value*> int32_to_float16x4(Value *in0, Value *scale_x512, Value *shift);
std::tuple<Value*, Value*> prepare_scale_shift(Value *scale, Value *shift);
void visit_dequantize_inst(ir::dequantize_inst*);
void visit_cast_inst(ir::cast_inst*);
void visit_return_inst(ir::return_inst*);
void visit_cond_branch_inst(ir::cond_branch_inst*);
@@ -148,12 +171,16 @@ public:
void visit_store_inst(ir::store_inst*);
void visit_unmasked_store_inst(ir::unmasked_store_inst*);
void visit_masked_store_inst(ir::masked_store_inst*);
void visit_cat_inst(ir::cat_inst*);
void visit_extract_value_inst(ir::extract_value_inst *);
void visit_insert_value_inst(ir::insert_value_inst *);
void visit_reshape_inst(ir::reshape_inst*);
void visit_splat_inst(ir::splat_inst*);
void visit_broadcast_inst(ir::broadcast_inst*);
void visit_downcast_inst(ir::downcast_inst*);
void visit_exp_inst(ir::exp_inst*);
void visit_cos_inst(ir::cos_inst*);
void visit_umulhi_inst(ir::umulhi_inst* x);
void visit_sin_inst(ir::sin_inst*);
void visit_log_inst(ir::log_inst*);
void visit_get_program_id_inst(ir::get_program_id_inst*);
@@ -167,8 +194,8 @@ public:
void visit_trans_inst(ir::trans_inst*);
void visit_sqrt_inst(ir::sqrt_inst*);
Value* shfl_sync(Value* acc, int32_t i);
void visit_reduce1d_inst(ir::reduce_inst*, std::function<Value*(Value*,Value*)>, Value*);
void visit_reducend_inst(ir::reduce_inst*, std::function<Value*(Value*,Value*)>, Value*);
void visit_reducend_inst_fast(ir::reduce_inst* x, acc_fn_t do_acc, Value *neutral);
void visit_reducend_inst(ir::reduce_inst* x, acc_fn_t do_acc, Value *neutral);
void visit_reduce_inst(ir::reduce_inst*);
void visit_select_inst(ir::select_inst*);
void visit_layout_convert(ir::value *out, ir::value *in);
@@ -181,6 +208,9 @@ public:
void visit_async_wait_inst(ir::async_wait_inst*);
// void visit_make_range_dyn(ir::make_range_dyn*);
void visit_make_range(ir::make_range*);
void visit_clock_inst(ir::clock_inst*);
void visit_globaltimer_inst(ir::globaltimer_inst*);
void visit_extern_elementwise_inst(ir::extern_elementwise_inst*);
// void visit_make_range_sta(ir::make_range_sta*);
void visit_undef_value(ir::undef_value*);
void visit_constant_int(ir::constant_int*);
@@ -196,12 +226,21 @@ public:
void visit_layout_scanline(analysis::scanline_layout*);
void visit_layout_shared(analysis::shared_layout*);
// Add a new external library based on given name and path if it doesn't exist
void add_extern_lib(const std::string &lib_name, const std::string &lib_path);
private:
// Get all external libraries
const ExternLibMap &get_extern_lib_map() {
return extern_lib_map_;
}
private:
LLVMContext *ctx_;
Builder* builder_;
Module *mod_;
std::map<std::string, std::unique_ptr<ExternLib>> extern_lib_map_;
analysis::axes *a_axes_;
analysis::swizzle *swizzle_;
std::map<unsigned, distributed_axis> axes_;
@@ -213,7 +252,6 @@ private:
std::set<ir::value*> seen_;
unsigned num_warps_;
bool force_nc_cache_;
std::map<analysis::data_layout*, Value*> offset_a_m_;
std::map<analysis::data_layout*, Value*> offset_a_k_;
@@ -235,10 +273,11 @@ private:
/// idx for multi-stage pipeline
std::map<analysis::data_layout*, Value*> read_smem_idx_;
std::map<analysis::data_layout*, Value*> write_smem_idx_;
/// triton bb -> llvm bb
std::map<ir::value*, BasicBlock *> bbs_;
std::map<ir::value*, std::vector<int>> ords_;
std::map<ir::value*, Function*> fns_;
// helper for creating llvm values
adder add;
@@ -250,6 +289,9 @@ private:
/// Record prefetch instrs that needs to be moved
std::map<ir::value*, std::vector<Value*>> prefetch_latch_to_bb_;
// Eviction policies
std::map<ir::load_inst::EVICTION_POLICY, Value*> policies_;
};
}

View File

@@ -36,6 +36,7 @@ namespace triton{
namespace codegen{
class nvidia_cu_target;
class amd_cl_target;
class target {
public:
@@ -49,7 +50,12 @@ public:
virtual Value* get_block_id(Module *module, Builder& builder, unsigned ax) = 0;
virtual Value* get_num_blocks(Module *module, Builder& builder, unsigned ax) = 0;
virtual unsigned guaranteed_alignment() = 0;
#ifdef USE_ROCM
amd_cl_target* as_nvidia();
amd_cl_target* as_amd();
#else
nvidia_cu_target* as_nvidia();
#endif
bool is_gpu() const;
private:
@@ -67,6 +73,7 @@ public:
Value* get_block_id(Module *module, Builder& builder, unsigned ax);
Value* get_num_blocks(Module *module, Builder& builder, unsigned ax);
unsigned guaranteed_alignment() { return 16; }
int sm() { return 0; } // treat as if old CUDA device
};
class nvidia_cu_target: public target {

View File

@@ -32,11 +32,12 @@ private:
ir::value* rematerialize(ir::value *v, ir::builder& builder, std::map<ir::value*, ir::value*>& seen);
public:
coalesce(analysis::align* align, triton::codegen::analysis::layouts *layouts);
coalesce(analysis::align* align, triton::codegen::analysis::layouts *layouts, bool has_sm80);
triton::ir::value *simplify(ir::instruction* i, triton::ir::builder &builder);
void run(ir::module &mod);
private:
bool has_sm80_;
analysis::align* align_;
analysis::layouts* layout_;
};

View File

@@ -15,22 +15,30 @@ namespace ir {
}
namespace codegen{
namespace analysis{
class layouts;
}
namespace transform{
class cts {
private:
void add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder, bool to_shared);
bool is_shmem_op(ir::instruction* i, int op);
bool is_shmem_res(ir::value* i);
void add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder, bool to_shared, std::map<ir::value*,ir::value*>& copies);
public:
cts(bool use_async = false): use_async_(use_async) {}
cts(analysis::layouts* layouts, bool has_sm80 = false): layouts_(layouts), has_sm80_(has_sm80) {}
void run(ir::module &mod);
private:
bool use_async_;
bool has_sm80_;
analysis::layouts* layouts_;
};
}
}
}
#endif
#endif

View File

@@ -0,0 +1,31 @@
#pragma once
#include <list>
namespace triton {
namespace ir {
class module;
class function;
class call_inst;
class builder;
}
namespace codegen{
namespace transform{
struct fncmp {
bool operator()(ir::function* x, ir::function* y) const;
};
class inliner {
public:
inliner() {}
void do_inline(ir::function* fn, ir::call_inst* callsite, ir::builder& builder, std::list<ir::call_inst*>& callsites);
void run(ir::module &mod);
};
}
}
}

View File

@@ -30,12 +30,15 @@ private:
bool rewrite_dot_hmma(ir::dot_inst *dot, ir::builder& builder, bool trans_a, bool trans_b, ir::value *A, ir::value *B, ir::value *D);
bool rewrite_dot(ir::instruction *value, ir::builder& builder);
bool rewrite_mult(ir::instruction *value, ir::builder& builder);
bool rewrite_insert_extract(ir::instruction *value, ir::builder& builder);
bool rewrite_unit_red(ir::instruction *value, ir::builder& builder);
bool rewrite_gep_ptr_min_off_plus_off(ir::instruction *value, ir::builder& builder);
bool rewrite_select_masked_load(ir::instruction *value, ir::builder& builder);
bool rewrite_load_to_shared(ir::instruction *value, ir::builder& builder);
bool rewrite_cvt_layout(ir::instruction *value, ir::builder& builder);
public:
peephole(target* tgt, analysis::layouts* layouts): tgt_(tgt), layouts_(layouts) {}
void run(ir::module &mod);

View File

@@ -11,7 +11,6 @@
#include "triton/external/CUDA/nvml.h"
//// HIP backend
//#define __HIP_PLATFORM_AMD__
#include "triton/external/hip.h"
//Exceptions
@@ -89,6 +88,7 @@ public:
static CUresult cuDeviceGetAttribute(int *pi, CUdevice_attribute attrib, CUdevice dev);
static CUresult cuDeviceGetCount(int *count);
// link management
static CUresult cuLinkAddFile_v2(CUlinkState state, CUjitInputType type, const char *path, unsigned int numOptions, CUjit_option *options, void **optionValues);
static CUresult cuLinkAddData_v2(CUlinkState state, CUjitInputType type, void* data, size_t size, const char* name, unsigned int numOptions, CUjit_option* options, void** optionValues);
static CUresult cuLinkCreate_v2(unsigned int numOptions, CUjit_option* options, void** optionValues, CUlinkState* stateOut);
static CUresult cuLinkComplete(CUlinkState state, void** cubinOut, size_t* sizeOut);
@@ -182,7 +182,8 @@ public:
static hipError_t hipEventElapsedTime(float *pMilliseconds, hipEvent_t hStart, hipEvent_t hEnd);
static hipError_t hipEventRecord(hipEvent_t hEvent, hipStream_t hStream);
static hipError_t hipEventDestroy(hipEvent_t hEvent);
// error handling
static hipError_t hipGetLastError(void);
private:
@@ -214,6 +215,7 @@ private:
static void* cuDeviceGetAttribute_;
static void* cuDeviceGetCount_;
// link management
static void* cuLinkAddFile_v2_;
static void* cuLinkAddData_v2_;
static void* cuLinkCreate_v2_;
static void* cuLinkDestroy_;
@@ -307,6 +309,8 @@ private:
static void* hipEventElapsedTime_;
static void* hipEventRecord_;
static void* hipEventDestroy_;
// error handling
static void* hipGetLastError_;
};
}

View File

@@ -9,11 +9,16 @@ namespace triton{
namespace driver{
void init_llvm();
std::string path_to_ptxas(int& version);
std::string llir_to_ptx(llvm::Module* module, int cc, int version);
std::string ptx_to_cubin(const std::string& ptx, int cc);
std::string ptx_to_cubin(const std::string& ptx, const std::string& ptxas_path, int cc);
CUmodule ptx_to_cumodule(const std::string& ptx, int cc);
std::string llir_to_amdgpu(llvm::Module* module, const std::string& proc);
std::tuple<std::string, std::string> llir_to_amdgcn(llvm::Module* module, const std::string& proc);
hipModule_t amdgpu_to_hipmodule(const std::string& path);
}
}
#define STRINGIFY_HELPER(X) #X
#define STRINGIFY(X) STRINGIFY_HELPER(X)

5999 include/triton/external/CUDA/cuda.h vendored Executable file → Normal file

File diff suppressed because it is too large.

View File

@@ -1,13 +1,35 @@
/*
* @brief hipError_t
* @enum
* @ingroup Enumerations
*/
// Developer note - when updating these, update the hipErrorName and hipErrorString functions in
// NVCC and HCC paths Also update the hipCUDAErrorTohipError function in NVCC path.
Copyright (c) 2015 - 2022 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#ifndef HIP_H
#define HIP_H
// Ignoring error-code return values from hip APIs is discouraged. On C++17,
// we can make that yield a warning
#if __cplusplus >= 201703L
#define __HIP_NODISCARD [[nodiscard]]
#else
#define __HIP_NODISCARD
#endif
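// Editor's note (not part of the header): with __HIP_NODISCARD expanding to
// [[nodiscard]] under C++17, silently dropping a hipError_t now warns:
//   hipGetLastError();                    // warning: ignoring return value
//   hipError_t err = hipGetLastError();   // OK: the result is consumed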
/*
* @brief hipError_t
@@ -17,9 +39,7 @@
// Developer note - when updating these, update the hipErrorName and hipErrorString functions in
// NVCC and HCC paths Also update the hipCUDAErrorTohipError function in NVCC path.
#include <cstddef>
typedef enum hipError_t {
typedef enum __HIP_NODISCARD hipError_t {
hipSuccess = 0, ///< Successful completion.
hipErrorInvalidValue = 1, ///< One or more of the parameters passed to the API call is NULL
///< or not in an acceptable range.
@@ -73,6 +93,7 @@ typedef enum hipError_t {
hipErrorInvalidHandle = 400,
// Deprecated
hipErrorInvalidResourceHandle = 400, ///< Resource handle (hipEvent_t or hipStream_t) invalid.
hipErrorIllegalState = 401, ///< Resource required is not in a valid state to perform operation.
hipErrorNotFound = 500,
hipErrorNotReady = 600, ///< Indicates that asynchronous operations enqueued earlier are not
///< ready. This is not actually an error, but is used to distinguish
@@ -86,6 +107,7 @@ typedef enum hipError_t {
hipErrorPeerAccessNotEnabled =
705, ///< Peer access was never enabled from the current device.
hipErrorSetOnActiveProcess = 708,
hipErrorContextIsDestroyed = 709,
hipErrorAssert = 710, ///< Produced when the kernel calls assert.
hipErrorHostMemoryAlreadyRegistered =
712, ///< Produced when trying to lock a page-locked memory.
@@ -98,6 +120,32 @@ typedef enum hipError_t {
///< that was launched via cooperative launch APIs exceeds the maximum number of
///< allowed blocks for the current device
hipErrorNotSupported = 801, ///< Produced when the hip API is not supported/implemented
hipErrorStreamCaptureUnsupported = 900, ///< The operation is not permitted when the stream
///< is capturing.
hipErrorStreamCaptureInvalidated = 901, ///< The current capture sequence on the stream
///< has been invalidated due to a previous error.
hipErrorStreamCaptureMerge = 902, ///< The operation would have resulted in a merge of
///< two independent capture sequences.
hipErrorStreamCaptureUnmatched = 903, ///< The capture was not initiated in this stream.
hipErrorStreamCaptureUnjoined = 904, ///< The capture sequence contains a fork that was not
///< joined to the primary stream.
hipErrorStreamCaptureIsolation = 905, ///< A dependency would have been created which crosses
///< the capture sequence boundary. Only implicit
///< in-stream ordering dependencies are allowed
///< to cross the boundary
hipErrorStreamCaptureImplicit = 906, ///< The operation would have resulted in a disallowed
///< implicit dependency on a current capture sequence
///< from hipStreamLegacy.
hipErrorCapturedEvent = 907, ///< The operation is not permitted on an event which was last
///< recorded in a capturing stream.
hipErrorStreamCaptureWrongThread = 908, ///< A stream capture sequence not initiated with
///< the hipStreamCaptureModeRelaxed argument to
///< hipStreamBeginCapture was passed to
///< hipStreamEndCapture in a different thread.
hipErrorGraphExecUpdateFailure = 910, ///< This error indicates that the graph update was
///< not performed because it included changes which
///< violated constraints specific to instantiated graph
///< update.
hipErrorUnknown = 999, ///< Unknown error.
// HSA Runtime Error Codes start here.
hipErrorRuntimeMemory = 1052, ///< HSA runtime memory call returned error. Typically not seen
@@ -107,35 +155,154 @@ typedef enum hipError_t {
hipErrorTbd ///< Marker that more error codes are needed.
} hipError_t;
#undef __HIP_NODISCARD
/*
* @brief hipDeviceAttribute_t
* @enum
* @ingroup Enumerations
*/
typedef enum hipDeviceAttribute_t {
hipDeviceAttributeCudaCompatibleBegin = 0,
hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled.
hipDeviceAttributeAccessPolicyMaxWindowSize, ///< Cuda only. The maximum size of the window policy in bytes.
hipDeviceAttributeAsyncEngineCount, ///< Cuda only. Asynchronous engines number.
hipDeviceAttributeCanMapHostMemory, ///< Whether host memory can be mapped into device address space
hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory
///< at the same virtual address as the CPU
hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
hipDeviceAttributeComputePreemptionSupported, ///< Cuda only. Device supports Compute Preemption.
hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels concurrently.
hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory concurrently with the CPU
hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
hipDeviceAttributeDeviceOverlap, ///< Cuda only. Device can concurrently copy memory and execute a kernel.
///< Deprecated. Use instead asyncEngineCount.
hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
///< the device without migration
hipDeviceAttributeGlobalL1CacheSupported, ///< Cuda only. Device supports caching globals in L1
hipDeviceAttributeHostNativeAtomicSupported, ///< Cuda only. Link between the device and the host supports native atomic operations
hipDeviceAttributeIntegrated, ///< Device is integrated GPU
hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
hipDeviceAttributeKernelExecTimeout, ///< Run time limit for kernels executed on the device
hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache.
hipDeviceAttributeLocalL1CacheSupported, ///< caching locals in L1 is supported
hipDeviceAttributeLuid, ///< Cuda only. 8-byte locally unique identifier. Undefined on TCC and non-Windows platforms
hipDeviceAttributeLuidDeviceNodeMask, ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms
hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
hipDeviceAttributeMaxBlocksPerMultiProcessor, ///< Cuda only. Max block size per multiprocessor
hipDeviceAttributeMaxBlockDimX, ///< Max block size in width.
hipDeviceAttributeMaxBlockDimY, ///< Max block size in height.
hipDeviceAttributeMaxBlockDimZ, ///< Max block size in depth.
hipDeviceAttributeMaxGridDimX, ///< Max grid size in width.
hipDeviceAttributeMaxGridDimY, ///< Max grid size in height.
hipDeviceAttributeMaxGridDimZ, ///< Max grid size in depth.
hipDeviceAttributeMaxSurface1D, ///< Maximum size of 1D surface.
hipDeviceAttributeMaxSurface1DLayered, ///< Cuda only. Maximum dimensions of 1D layered surface.
hipDeviceAttributeMaxSurface2D, ///< Maximum dimension (width, height) of 2D surface.
hipDeviceAttributeMaxSurface2DLayered, ///< Cuda only. Maximum dimensions of 2D layered surface.
hipDeviceAttributeMaxSurface3D, ///< Maximum dimension (width, height, depth) of 3D surface.
hipDeviceAttributeMaxSurfaceCubemap, ///< Cuda only. Maximum dimensions of Cubemap surface.
hipDeviceAttributeMaxSurfaceCubemapLayered, ///< Cuda only. Maximum dimension of Cubemap layered surface.
hipDeviceAttributeMaxTexture1DWidth, ///< Maximum size of 1D texture.
hipDeviceAttributeMaxTexture1DLayered, ///< Cuda only. Maximum dimensions of 1D layered texture.
hipDeviceAttributeMaxTexture1DLinear, ///< Maximum number of elements allocatable in a 1D linear texture.
///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda.
hipDeviceAttributeMaxTexture1DMipmap, ///< Cuda only. Maximum size of 1D mipmapped texture.
hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D texture.
hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension height of 2D texture.
hipDeviceAttributeMaxTexture2DGather, ///< Cuda only. Maximum dimensions of 2D texture if gather operations performed.
hipDeviceAttributeMaxTexture2DLayered, ///< Cuda only. Maximum dimensions of 2D layered texture.
hipDeviceAttributeMaxTexture2DLinear, ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory.
hipDeviceAttributeMaxTexture2DMipmap, ///< Cuda only. Maximum dimensions of 2D mipmapped texture.
hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D texture.
hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimension height of 3D texture.
hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimension depth of 3D texture.
hipDeviceAttributeMaxTexture3DAlt, ///< Cuda only. Maximum dimensions of alternate 3D texture.
hipDeviceAttributeMaxTextureCubemap, ///< Cuda only. Maximum dimensions of Cubemap texture
hipDeviceAttributeMaxTextureCubemapLayered, ///< Cuda only. Maximum dimensions of Cubemap layered texture.
hipDeviceAttributeMaxThreadsDim, ///< Maximum dimension of a block
hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per multiprocessor.
hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
hipDeviceAttributeMultiGpuBoardGroupID, ///< Cuda only. Unique ID of device group on the same multi-GPU board
hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
hipDeviceAttributeName, ///< Device name.
hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
///< without calling hipHostRegister on it
hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables
hipDeviceAttributePciBusId, ///< PCI Bus ID.
hipDeviceAttributePciDeviceId, ///< PCI Device ID.
hipDeviceAttributePciDomainID, ///< PCI Domain ID.
hipDeviceAttributePersistingL2CacheMaxSize, ///< Cuda11 only. Maximum l2 persisting lines capacity in bytes
hipDeviceAttributeMaxRegistersPerBlock, ///< 32-bit registers available to a thread block. This number is shared
///< by all thread blocks simultaneously resident on a multiprocessor.
hipDeviceAttributeMaxRegistersPerMultiprocessor, ///< 32-bit registers available per multiprocessor.
hipDeviceAttributeReservedSharedMemPerBlock, ///< Cuda11 only. Shared memory reserved by CUDA driver per block.
hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in bytes.
hipDeviceAttributeSharedMemPerBlockOptin, ///< Cuda only. Maximum shared memory per block usable by special opt in.
hipDeviceAttributeSharedMemPerMultiprocessor, ///< Cuda only. Shared memory available per multiprocessor.
hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision.
hipDeviceAttributeStreamPrioritiesSupported, ///< Cuda only. Whether to support stream priorities.
hipDeviceAttributeSurfaceAlignment, ///< Cuda only. Alignment requirement for surfaces
hipDeviceAttributeTccDriver, ///< Cuda only. Whether device is a Tesla device using TCC driver
hipDeviceAttributeTextureAlignment, ///< Alignment requirement for textures
hipDeviceAttributeTexturePitchAlignment, ///< Pitch alignment requirement for 2D texture references bound to pitched memory;
hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
hipDeviceAttributeTotalGlobalMem, ///< Global memory available on device.
hipDeviceAttributeUnifiedAddressing, ///< Cuda only. An unified address space shared with the host.
hipDeviceAttributeUuid, ///< Cuda only. Unique ID in 16 byte.
hipDeviceAttributeWarpSize, ///< Warp size in threads.
hipDeviceAttributeMemoryPoolsSupported, ///< Device supports HIP Stream Ordered Memory Allocator
hipDeviceAttributeCudaCompatibleEnd = 9999,
hipDeviceAttributeAmdSpecificBegin = 10000,
hipDeviceAttributeClockInstructionRate = hipDeviceAttributeAmdSpecificBegin, ///< Frequency in khz of the timer used by the device-side "clock*"
hipDeviceAttributeArch, ///< Device architecture
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory PerMultiprocessor.
hipDeviceAttributeGcnArch, ///< Device gcn architecture
hipDeviceAttributeGcnArchName, ///< Device gcnArch name in 256 bytes
hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
///< devices with unmatched functions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
///< devices with unmatched grid dimensions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
///< devices with unmatched block dimensions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
///< devices with unmatched shared memories
hipDeviceAttributeIsLargeBar, ///< Whether it is LargeBar
hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
hipDeviceAttributeCanUseStreamWaitValue, ///< '1' if Device supports hipStreamWaitValue32() and
///< hipStreamWaitValue64(), '0' otherwise.
hipDeviceAttributeImageSupport, ///< '1' if Device supports image, '0' otherwise.
hipDeviceAttributePhysicalMultiProcessorCount, ///< All available physical compute
///< units for the device
hipDeviceAttributeFineGrainSupport, ///< '1' if Device supports fine grain, '0' otherwise
hipDeviceAttributeAmdSpecificEnd = 19999,
hipDeviceAttributeVendorSpecificBegin = 20000,
// Extended attributes for vendors
} hipDeviceAttribute_t;
#define HIP_LAUNCH_PARAM_BUFFER_POINTER ((void*)0x01)
#define HIP_LAUNCH_PARAM_BUFFER_SIZE ((void*)0x02)
#define HIP_LAUNCH_PARAM_END ((void*)0x03)
// API-visible structures
typedef struct ihipCtx_t* hipCtx_t;
// Note many APIs also use integer deviceIds as an alternative to the device pointer:
typedef int hipDevice_t;
typedef enum hipDeviceP2PAttr {
hipDevP2PAttrPerformanceRank = 0,
hipDevP2PAttrAccessSupported,
hipDevP2PAttrNativeAtomicSupported,
hipDevP2PAttrHipArrayAccessSupported
} hipDeviceP2PAttr;
typedef struct ihipStream_t* hipStream_t;
#define hipIpcMemLazyEnablePeerAccess 0
#define HIP_IPC_HANDLE_SIZE 64
typedef struct hipIpcMemHandle_st {
char reserved[HIP_IPC_HANDLE_SIZE];
} hipIpcMemHandle_t;
typedef struct hipIpcEventHandle_st {
char reserved[HIP_IPC_HANDLE_SIZE];
} hipIpcEventHandle_t;
typedef struct ihipModule_t* hipModule_t;
typedef struct ihipModuleSymbol_t* hipFunction_t;
typedef struct hipFuncAttributes {
@@ -150,91 +317,8 @@ typedef struct hipFuncAttributes {
int ptxVersion;
size_t sharedSizeBytes;
} hipFuncAttributes;
typedef struct ihipEvent_t* hipEvent_t;
/*
* @brief hipDeviceAttribute_t
* @enum
* @ingroup Enumerations
*/
typedef enum hipDeviceAttribute_t {
hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
hipDeviceAttributeMaxBlockDimX, ///< Maximum x-dimension of a block.
hipDeviceAttributeMaxBlockDimY, ///< Maximum y-dimension of a block.
hipDeviceAttributeMaxBlockDimZ, ///< Maximum z-dimension of a block.
hipDeviceAttributeMaxGridDimX, ///< Maximum x-dimension of a grid.
hipDeviceAttributeMaxGridDimY, ///< Maximum y-dimension of a grid.
hipDeviceAttributeMaxGridDimZ, ///< Maximum z-dimension of a grid.
hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in
///< bytes.
hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
hipDeviceAttributeWarpSize, ///< Warp size in threads.
hipDeviceAttributeMaxRegistersPerBlock, ///< Maximum number of 32-bit registers available to a
///< thread block. This number is shared by all thread
///< blocks simultaneously resident on a
///< multiprocessor.
hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2
///< cache.
hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per
///< multiprocessor.
hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels
///< concurrently.
hipDeviceAttributePciBusId, ///< PCI Bus ID.
hipDeviceAttributePciDeviceId, ///< PCI Device ID.
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory Per
///< Multiprocessor.
hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
hipDeviceAttributeIntegrated, ///< iGPU
hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
hipDeviceAttributeMaxTexture1DWidth, ///< Maximum number of elements in 1D images
hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D images in image elements
hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension height of 2D images in image elements
hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D images in image elements
hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimensions height of 3D images in image elements
hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimensions depth of 3D images in image elements
hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
hipDeviceAttributeTextureAlignment, ///<Alignment requirement for textures
hipDeviceAttributeTexturePitchAlignment, ///<Pitch alignment requirement for 2D texture references bound to pitched memory;
hipDeviceAttributeKernelExecTimeout, ///<Run time limit for kernels executed on the device
hipDeviceAttributeCanMapHostMemory, ///<Device can map host memory into device address space
hipDeviceAttributeEccEnabled, ///<Device has ECC support enabled
hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
///devices with unmatched functions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
///devices with unmatched grid dimensions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
///devices with unmatched block dimensions
hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
///devices with unmatched shared memories
hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
/// the device without migration
hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory
/// concurrently with the CPU
hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
/// without calling hipHostRegister on it
hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via
/// the host's page tables
hipDeviceAttributeCanUseStreamWaitValue ///< '1' if Device supports hipStreamWaitValue32() and
///< hipStreamWaitValue64() , '0' otherwise.
} hipDeviceAttribute_t;
typedef void* hipDeviceptr_t;
/*
@@ -262,7 +346,6 @@ typedef enum hipJitOption {
hipJitOptionFastCompile,
hipJitOptionNumOptions
} hipJitOption;
/**
* @warning On AMD devices and some Nvidia devices, these hints and controls are ignored.
*/
@@ -271,7 +354,6 @@ typedef enum hipFuncAttribute {
hipFuncAttributePreferredSharedMemoryCarveout = 9,
hipFuncAttributeMax
} hipFuncAttribute;
/**
* @warning On AMD devices and some Nvidia devices, these hints and controls are ignored.
*/
@@ -282,7 +364,4 @@ typedef enum hipFuncCache_t {
hipFuncCachePreferEqual, ///< prefer equal size L1 cache and shared memory
} hipFuncCache_t;
#define HIP_LAUNCH_PARAM_BUFFER_POINTER ((void*)0x01)
#define HIP_LAUNCH_PARAM_BUFFER_SIZE ((void*)0x02)
#define HIP_LAUNCH_PARAM_END ((void*)0x03)
#endif
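The HIP_LAUNCH_PARAM_* sentinels above implement HIP's packed-argument launch protocol; a hedged sketch of their use with hipModuleLaunchKernel, a standard HIP module-API call (editor's illustration; the kernel handle, stream, grid/block sizes, and argument buffer are placeholders):

void launch_packed(hipFunction_t fn, hipStream_t stream,
                   void *args_buf, size_t args_size) {
  // The extra-params array alternates sentinel/value pairs and ends with
  // HIP_LAUNCH_PARAM_END; kernelParams is left null when this form is used.
  void *config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, args_buf,
                    HIP_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
                    HIP_LAUNCH_PARAM_END};
  hipModuleLaunchKernel(fn, /*grid*/ 1, 1, 1, /*block*/ 64, 1, 1,
                        /*sharedMemBytes=*/0, stream,
                        /*kernelParams=*/nullptr, config);
}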

View File

@@ -1,4 +1,4 @@
#pragma once
#pragma once
#ifndef _TRITON_IR_BASIC_BLOCK_H_
#define _TRITON_IR_BASIC_BLOCK_H_
@@ -27,7 +27,7 @@ public:
private:
// constructors
basic_block(context &ctx, const std::string &name, function *parent);
basic_block(context &ctx, const std::string &name, function *parent, basic_block *next);
public:
// accessors
@@ -35,6 +35,7 @@ public:
context& get_context() { return ctx_; }
// get iterator to first instruction that is not a phi
void replace_phi_uses_with(basic_block* before, basic_block* after);
iterator get_first_non_phi();
// get instruction list
@@ -60,13 +61,16 @@ public:
inline const instruction &back() const { return *inst_list_.back(); }
inline instruction &back() { return *inst_list_.back(); }
void append_instruction(ir::instruction* i);
// split
basic_block* split_before(ir::instruction* loc, const std::string& name);
// predecessors
const std::vector<basic_block*>& get_predecessors() const { return preds_; }
const std::vector<basic_block*>& get_successors() const { return succs_; }
void add_predecessor(basic_block* pred);
std::vector<basic_block*> get_predecessors() const;
std::vector<basic_block*> get_successors() const;
// factory functions
static basic_block* create(context &ctx, const std::string &name, function *parent);
static basic_block* create(context &ctx, const std::string &name, function *parent, basic_block *next = nullptr);
void print(std::ostream &os);

View File

@@ -22,13 +22,16 @@ class phi_node;
/* Builder */
class builder{
public:
typedef basic_block::iterator iterator;
public:
// Constructor
builder(context &ctx);
// Getters
const context& get_context() { return ctx_; }
// const context& get_context() const { return ctx_; }
context& get_context() { return ctx_; }
// Setters
void set_insert_point(iterator instr);
void set_insert_point(instruction* i);
@@ -38,10 +41,11 @@ public:
iterator get_insert_point() { return insert_point_;}
// Constants
value *get_int1(bool val);
value *get_int32(int32_t val);
value *get_int64(int64_t val);
value *get_int32(uint32_t val);
value *get_int64(uint64_t val);
value *get_float16(float val);
value *get_float32(float val);
value *get_float64(float val);
value *get_range(int32_t lo, int32_t hi);
// Types
type *get_void_ty();
@@ -50,7 +54,9 @@ public:
type *get_int16_ty();
type *get_int32_ty();
type *get_int64_ty();
type *get_fp8_ty();
type *get_half_ty();
type *get_bf16_ty();
type *get_float_ty();
type *get_double_ty();
// Insert
@@ -67,8 +73,13 @@ public:
value* create_br(basic_block *dest);
value* create_cond_br(value *cond, basic_block* if_dest, basic_block* else_dest);
value* create_ret_void();
value* create_ret(value *ret);
// Dequantize instructions
value* create_dequantize(value *src, value *scale, value *shift, type *dest_ty);
// Cast instructions
value* create_bitcast(value *src, type *dest_ty);
value *create_cast(cast_op_t op, value *v, type *dst_ty);
value* create_int_to_ptr(value *src, type *dst_ty);
value* create_ptr_to_int(value *src, type *dst_ty);
value* create_si_to_fp(value *src, type *dst_ty);
value* create_ui_to_fp(value *src, type *dst_ty);
@@ -78,6 +89,9 @@ public:
value* create_fp_trunc(value *src, type *dst_ty);
value* create_int_cast(value *src, type *dst_ty, bool is_signed);
value *create_downcast(value *arg);
// Call instruction
value* create_call(function* fn, const std::vector<value*>& args);
value* create_launch(function* fn, const std::vector<value*>& args, const std::vector<value*>& grid, value* num_warps);
// Phi instruction
phi_node* create_phi(type *ty, unsigned num_reserved);
// Binary instructions
@@ -87,11 +101,11 @@ public:
value *create_frem(value *lhs, value *rhs);
value *create_fadd(value *lhs, value *rhs);
value *create_fsub(value *lhs, value *rhs);
value *create_mul(value *lhs, value *rhs, bool has_nuw = false, bool has_nsw = false);
value *create_sdiv(value *lhs, value *rhs);
value *create_udiv(value *lhs, value *rhs);
value *create_srem(value *lhs, value *rhs);
value *create_urem(value *lhs, value *rhs);
value *create_mul(value *lhs, value *rhs, bool has_nuw = false, bool has_nsw = false);
value *create_add(value *lhs, value *rhs, bool has_nuw = false, bool has_nsw = false);
value *create_sub(value *lhs, value *rhs, bool has_nuw = false, bool has_nsw = false);
value *create_shl(value *lhs, value *rhs, bool has_nuw = false, bool has_nsw = false);
@@ -130,31 +144,57 @@ public:
value *create_xor(value *lhs, value *rhs);
value *create_or(value *lhs, value *rhs);
// Input/Output
value *create_load(value *arg);
value *create_store(value *ptr, value *val);
value *create_masked_load(value *arg, value *mask, value *false_value);
value *create_masked_store(value *ptr, value *val, value *mask);
value *create_load(value *arg, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile);
value *create_store(value *ptr, value *val, store_inst::EVICTION_POLICY eviction);
value *create_masked_load(value *arg, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile);
value *create_masked_store(value *ptr, value *val, value *mask, store_inst::EVICTION_POLICY eviction);
// Struct instructions
value *create_insert_value(value* val, value *elt, size_t idx);
value *create_extract_value(value* val, size_t idx);
// Block instruction
value *create_splat(value *arg, const type::block_shapes_t &shapes);
value *create_reshape(value *arg, const type::block_shapes_t &shapes);
value *create_cat(value *lhs, value *rhs);
value *create_broadcast(value *arg, const type::block_shapes_t &shapes);
// Atomic instruction
value *create_atomic_cas(value *ptr, value *cmp, value *val);
value *create_atomic_rmw(atomic_rmw_op_t op, value *ptr, value *val, value *msk);
value *create_atomic_max(value *ptr, value *val, value *msk);
value *create_atomic_umax(value *ptr, value *val, value *msk);
value *create_atomic_min(value *ptr, value *val, value *msk);
value *create_atomic_umin(value *ptr, value *val, value *msk);
value *create_atomic_fadd(value *ptr, value *val, value *msk);
value *create_atomic_add(value *ptr, value *val, value *msk);
value *create_atomic_and(value *ptr, value *val, value *msk);
value *create_atomic_or(value *ptr, value *val, value *msk);
value *create_atomic_xor(value *ptr, value *val, value *msk);
value *create_atomic_xchg(value *ptr, value *val, value *msk);
// Utilities
value *create_clock();
value *create_globaltimer();
// Extern instruction
value *create_extern_elementwise(const std::string &lib_name,
const std::string &lib_path,
const std::string &symbol_name,
const std::vector<value *> &args,
type *ret_ty);
// Built-in instruction
value *create_get_program_id(unsigned axis);
value *create_get_num_programs(unsigned axis);
value *create_atomic_cas(value *ptr, value *cmp, value *val);
value *create_atomic_rmw(ir::atomic_rmw_op_t op, value *ptr, value *val, value *msk);
value *create_exp(value* arg);
value *create_cos(value* arg);
value *create_sin(value* arg);
value *create_log(value* arg);
value *create_dot(value *A, value *B, value *C);
value *create_dot(value *A, value *B, value *C, bool trans_a, bool trans_b, bool allow_tf32);
value *create_trans(value *A, const std::vector<int> &perm = {});
value *create_sqrt(value *A);
value *create_reduce(value *A, reduce_inst::op_t op, unsigned axis);
value *create_select(value *pred, value *if_value, value *else_value);
// Intrinsics
// These have no place in the IR, and hopefully they can be removed at some point
value *create_umulhi(value* lhs, value* rhs);
value *create_copy_to_shared(value *arg);
value *create_masked_load_async(value *arg, value *mask, value *false_value);
value *create_masked_load_async(value *arg, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY);
value *create_copy_from_shared(value *arg);
value *create_barrier(const std::string &name = "");
value *create_async_wait(int N);
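
Since the load/store factories above gained cache-modifier and eviction-policy parameters, a minimal usage sketch may help. The signatures are taken from this header as shown; emit_masked_load and its value arguments are hypothetical scaffolding, and bld is assumed to already have an insert point.

#include "triton/ir/builder.h"

// Sketch: emit a masked load with the new cache/eviction arguments.
// ptr, mask, and other are assumed to be values already built by the caller.
triton::ir::value *emit_masked_load(triton::ir::builder &bld,
                                    triton::ir::value *ptr,
                                    triton::ir::value *mask,
                                    triton::ir::value *other) {
  using triton::ir::load_inst;
  return bld.create_masked_load(ptr, mask, other,
                                load_inst::CA,          // cache at all levels (".ca")
                                load_inst::EVICT_FIRST, // ".L1::evict_first"
                                /*is_volatile=*/false);
}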

View File

@@ -9,7 +9,6 @@
namespace triton{
namespace ir{
class builder;
class type;
class context_impl;
@@ -21,7 +20,6 @@ public:
context& operator=(const context&) = delete;
public:
ir::builder* builder = nullptr;
std::shared_ptr<context_impl> p_impl;
};

View File

@@ -3,17 +3,15 @@
#ifndef _TRITON_IR_CONTEXT_IMPL_H_
#define _TRITON_IR_CONTEXT_IMPL_H_
#include <map>
#include "triton/ir/type.h"
#include "triton/ir/constant.h"
#include <map>
#include <memory>
namespace triton{
namespace ir{
class context;
class constant;
class constant_int;
class constant_fp;
class undef_value;
/* Context impl */
class context_impl {
@@ -29,16 +27,17 @@ public:
// integer types
integer_type int1_ty, int8_ty, int16_ty, int32_ty, int64_ty, int128_ty;
// Pointer types
std::map<std::pair<type*, unsigned>, pointer_type*> ptr_tys;
std::map<std::pair<type*, unsigned>, std::unique_ptr<pointer_type>> ptr_tys;
// Block types
std::map<std::pair<type*, type::block_shapes_t>, block_type*> block_tys;
std::map<std::pair<type*, type::block_shapes_t>, std::unique_ptr<block_type>> block_tys;
// Struct types
std::map<type::contained_tys_vec_t, struct_type*> struct_tys;
// Int constants
std::map<std::pair<type*, uint64_t>, constant_int*> int_constants_;
std::map<std::pair<type*, uint64_t>, std::unique_ptr<constant_int>> int_constants_;
// Float constants
std::map<std::pair<type*, double>, constant_fp*> fp_constants_;
std::map<std::pair<type*, double>, std::unique_ptr<constant_fp>> fp_constants_;
// undef values
std::map<type*, undef_value*> uv_constants_;
std::map<type*, std::unique_ptr<undef_value>> uv_constants_;
};

View File

@@ -1,108 +0,0 @@
#pragma once
#ifndef _TRITON_IR_DISPATCH_H_
#define _TRITON_IR_DISPATCH_H_
#include "triton/ir/builder.h"
#include <stdexcept>
namespace triton{
namespace ir{
/*----------------------------------------------
higher level functions that follow the likely
semantics of most expected frontends
----------------------------------------------*/
struct semantic_error: public std::runtime_error {
semantic_error(const std::string& msg):
std::runtime_error(msg) { }
};
struct dispatch{
typedef ir::type::block_shapes_t shape_t;
// programming model
static ir::value *program_id(int axis, ir::builder *builder);
static ir::value *num_programs(int axis, ir::builder *builder);
// binary operators
static ir::value *add(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *sub(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *mul(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *truediv(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *floordiv(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *mod(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *and_(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *or_(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *xor_(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *lshr(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *shl(ir::value *input, ir::value *other, ir::builder *builder);
// unary operators
static ir::value *plus(ir::value *input, ir::builder *builder);
static ir::value *minus(ir::value *input, ir::builder *builder);
static ir::value *invert(ir::value *input, ir::builder *builder);
// comparison operators
static ir::value *greater_than(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *greater_equal(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *less_than(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *less_equal(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *equal(ir::value *input, ir::value *other, ir::builder *builder);
static ir::value *not_equal(ir::value *input, ir::value *other, ir::builder *builder);
// block creation
static ir::value* arange(int start, int end, ir::builder *builder);
static ir::value* zeros(shape_t shape, ir::type *dtype, ir::builder *builder);
// casting ops
static ir::value *reshape(ir::value *input, shape_t shape, ir::builder *builder);
static ir::value *broadcast(ir::value *input, shape_t shape, ir::builder *builder);
static std::tuple<ir::value*, ir::value*> broadcast(ir::value *lhs, ir::value* rhs, ir::builder *builder);
static ir::value *bitcast(ir::value *input, ir::type *type, ir::builder *builder);
static ir::value *cast(ir::value *input, ir::type *type, ir::builder *builder);
// memory operators
static ir::value *load(ir::value* ptr, ir::value* mask, ir::value* other, ir::builder *builder);
static ir::value *store(ir::value* ptr, ir::value *value, ir::value *mask, ir::builder *builder);
static ir::value *atomic_cas(ir::value* ptr, ir::value *cmp, ir::value *val, ir::builder *builder);
static ir::value *atomic_add(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_max(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_min(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_and(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_or(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_xor(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
static ir::value *atomic_xchg(ir::value* ptr, ir::value *val, ir::value *msk, ir::builder *builder);
// linear algebra
static ir::value *dot(ir::value *lhs, ir::value *rhs, ir::builder *builder);
// indexing
static ir::value *where(ir::value* condition, ir::value *x, ir::value *y, ir::builder *builder);
// reduction
static ir::value *min(ir::value *input, unsigned int axis, ir::builder *builder);
static ir::value *max(ir::value *input, unsigned int axis, ir::builder *builder);
static ir::value *sum(ir::value *input, unsigned int axis, ir::builder *builder);
// math
static ir::value *exp(ir::value *x, ir::builder *builder);
static ir::value *log(ir::value *x, ir::builder *builder);
static ir::value *cos(ir::value *x, ir::builder *builder);
static ir::value *sin(ir::value *x, ir::builder *builder);
static ir::value *sqrt(ir::value *x, ir::builder *builder);
// internal (debug/optimization)
static ir::value *multiple_of(ir::value *x, int value, ir::builder *builder);
static ir::value *max_contiguous(ir::value *x, int value, ir::builder *builder);
static ir::value *debug_barrier(ir::builder *builder);
};
}
}
#endif

View File

@@ -95,6 +95,9 @@ enum value_id_t: unsigned {
INSTRUCTIONS
* ------------ */
INST_BEGIN,
// call
INST_CALL,
INST_LAUNCH,
// phi
INST_PHI,
// arithmetic
@@ -105,6 +108,8 @@ enum value_id_t: unsigned {
// cmp
INST_ICMP,
INST_FCMP,
// dequantize
INST_DEQUANTIZE,
// cast
INST_CAST_TRUNC,
INST_CAST_ZEXT,
@@ -129,9 +134,13 @@ enum value_id_t: unsigned {
INST_MASKED_LOAD_ASYNC,
INST_UNMASKED_STORE,
INST_MASKED_STORE,
// struct
INST_EXTRACT_VALUE,
INST_INSERT_VALUE,
// retile
INST_RESHAPE,
INST_SPLAT,
INST_CAT,
INST_BROADCAST,
INST_DOWNCAST,
// builtin
@@ -142,10 +151,13 @@ enum value_id_t: unsigned {
INST_ATOMIC_EXCH,
INST_ATOMIC_RMW,
// math
INST_UMULHI,
INST_EXP,
INST_COS,
INST_SIN,
INST_LOG,
// extern
INST_EXTERN_ELEMENTWISE,
// array arithmetic
INST_TRANS,
INST_REDUCE,
@@ -163,6 +175,8 @@ enum value_id_t: unsigned {
INST_MAKE_RANGE_STA,
INST_MAKE_RANGE,
INST_PREFETCH_S,
INST_GLOBALTIMER,
INST_CLOCK,
};

View File

@@ -24,7 +24,7 @@ public:
static argument* create(type *ty, const std::string &name,
function *parent = nullptr, unsigned arg_no = 0);
function* get_parent() const;
unsigned get_arg_no() const;
void accept(visitor *v);
@@ -112,7 +112,7 @@ public:
static function *create(function_type *ty, linkage_types_t linkage,
const std::string &name, module *mod);
// blocks
const blocks_t &blocks() { return blocks_; }
blocks_t &blocks() { return blocks_; }
const blocks_t &blocks() const { return blocks_; }
void insert_block(basic_block* block, basic_block *next = nullptr);
@@ -121,6 +121,8 @@ public:
const attr_map_t &attrs() { return attrs_; }
bool has_attr(unsigned arg_id) const { return attrs_.find(arg_id) != attrs_.end(); }
std::set<attribute> get_attributes(const argument* arg) { return attrs_[arg->get_arg_no() + 1]; }
void set_is_kernel(bool new_val) { is_kernel_ = new_val; }
bool get_is_kernel() { return is_kernel_; }
void print(std::ostream &os);
@@ -134,6 +136,7 @@ private:
args_t args_;
blocks_t blocks_;
attr_map_t attrs_;
bool is_kernel_;
};
}

View File

@@ -59,8 +59,8 @@ public:
std::string repr() const { return repr_impl(); }
// metadata
void set_metadata(ir::metadata::kind_t kind,
unsigned value) { metadatas_[kind] = value;}
unsigned get_metadata(ir::metadata::kind_t kind) { return metadatas_[kind];}
std::vector<unsigned> value) { metadatas_[kind] = value;}
std::vector<unsigned> get_metadata(ir::metadata::kind_t kind) { return metadatas_[kind];}
// cloning
ir::instruction* clone() {
ir::instruction* res = clone_impl();
@@ -77,10 +77,55 @@ public:
private:
basic_block *parent_;
std::map<ir::metadata::kind_t, unsigned> metadatas_;
std::map<ir::metadata::kind_t, std::vector<unsigned>> metadatas_;
value_id_t id_;
};
//===----------------------------------------------------------------------===//
// call_inst classes
//===----------------------------------------------------------------------===//
class call_inst: public instruction {
private:
std::string repr_impl() const;
call_inst(ir::function* fn, const std::vector<ir::value*>& values, const std::string& name, instruction* next);
public:
static call_inst* create(ir::function* fn, const std::vector<ir::value*>& values, const std::string &name = "", instruction *next = nullptr);
ir::function* get_fn() { return fn_; }
_TRITON_DEFINE_CLONE(call_inst)
_TRITON_DEFINE_ACCEPT(call_inst)
private:
ir::function* fn_;
};
class launch_inst: public instruction {
private:
std::string repr_impl() const { return "launch"; }
launch_inst(ir::function* fn, const std::vector<ir::value*>& values, const std::vector<ir::value*>& grid, ir::value* num_warps,
const std::string &name = "", instruction *next = nullptr);
public:
static launch_inst* create(ir::function* fn, const std::vector<ir::value*>& values, const std::vector<ir::value*>& grid, ir::value* num_warps,
const std::string& name = "", instruction* next = nullptr);
ir::function* get_fn();
std::vector<ir::value*> get_values();
std::vector<ir::value*> get_grid();
ir::value* get_num_warps();
_TRITON_DEFINE_CLONE(launch_inst)
_TRITON_DEFINE_ACCEPT(launch_inst)
private:
unsigned val_begin;
unsigned val_end;
unsigned grid_begin;
unsigned grid_end;
};
//===----------------------------------------------------------------------===//
// phi_node classes
@@ -117,6 +162,7 @@ private:
//===----------------------------------------------------------------------===//
// binary_operator classes
//===----------------------------------------------------------------------===//
class binary_operator: public instruction {
public:
typedef binary_op_t op_t;
@@ -145,6 +191,10 @@ public:
bool is_shl() const;
bool is_shr() const;
// Approx
void set_fdiv_ieee_rounding(bool rnd) { fdiv_ieee_rnd_ = rnd; }
bool get_fdiv_ieee_rounding() { return fdiv_ieee_rnd_; }
// Wraps
void set_has_no_unsigned_wrap(bool b = true) { has_no_unsigned_wrap_ = b; }
void set_has_no_signed_wrap(bool b = true) { has_no_signed_wrap_ = b; }
@@ -163,6 +213,8 @@ public:
binary_op_t op_;
bool has_no_unsigned_wrap_;
bool has_no_signed_wrap_;
bool fdiv_ieee_rnd_;
};
@@ -222,6 +274,24 @@ protected:
unary_inst(type *ty, value_id_t id, value *v, const std::string &name, instruction *next);
};
//===----------------------------------------------------------------------===//
// dequantize_inst classes
//===----------------------------------------------------------------------===//
class dequantize_inst: public instruction{
private:
std::string repr_impl() const override { return "dequantize"; }
protected:
dequantize_inst(type *ty, value *v, value *scale, value *shift, const std::string &name, instruction *next);
public:
static dequantize_inst *create(value *arg, value *scale, value *shift, type *ty,
const std::string &name = "", instruction *next = nullptr);
_TRITON_DEFINE_CLONE(dequantize_inst)
_TRITON_DEFINE_ACCEPT(dequantize_inst)
};
//===----------------------------------------------------------------------===//
// cast_inst classes
@@ -383,20 +453,61 @@ private:
//===----------------------------------------------------------------------===//
class io_inst: public instruction {
public:
enum EVICTION_POLICY : uint32_t {
NORMAL=0,
EVICT_FIRST,
EVICT_LAST,
};
protected:
io_inst(type *ty, value_id_t id, unsigned num_ops,
io_inst(type *ty, value_id_t id, unsigned num_ops, EVICTION_POLICY eviction,
const std::string &name = "", instruction *next = nullptr);
std::string get_eviction_policy_repr() const {
if (eviction_ == EVICT_FIRST) return ".L1::evict_first";
if (eviction_ == EVICT_LAST) return ".L2::evict_last";
return "";
}
public:
// accessors
value *get_pointer_operand() { return get_operand(0); }
EVICTION_POLICY get_eviction_policy() const { return eviction_; }
protected:
EVICTION_POLICY eviction_;
};
// load
class load_inst: public io_inst {
public:
enum CACHE_MODIFIER : uint32_t {
NONE=0,
CA,
CG,
};
CACHE_MODIFIER get_cache_modifier() const { return cache_; }
bool get_is_volatile() const { return is_volatile_; }
protected:
load_inst(value *ptr, value_id_t id, unsigned num_ops,
load_inst(value *ptr, value_id_t id, unsigned num_ops, CACHE_MODIFIER cache, EVICTION_POLICY eviction,
bool is_volatile,
const std::string &name = "", instruction *next = nullptr);
std::string get_cache_modifier_repr() const {
if (cache_ == CA) return ".ca";
if (cache_ == CG) return ".cg";
return "";
}
CACHE_MODIFIER cache_;
std::string get_volatile_repr() {
return is_volatile_ ? ".volatile" : "";
}
bool is_volatile_;
private:
static type *get_pointee_type(type *ty);
@@ -405,11 +516,13 @@ private:
// unmasked load
class unmasked_load_inst: public load_inst {
private:
std::string repr_impl() const { return "unmasked_load"; }
unmasked_load_inst(value *ptr, const std::string &name, instruction *next);
std::string repr_impl() const { return "unmasked_load" + get_cache_modifier_repr(); }
unmasked_load_inst(value *ptr, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile, const std::string &name, instruction *next);
public:
static unmasked_load_inst* create(value *ptr,
CACHE_MODIFIER cache, EVICTION_POLICY eviction,
bool is_volatile,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(unmasked_load_inst)
@@ -419,8 +532,8 @@ public:
// masked load
class masked_load_inst: public load_inst {
private:
std::string repr_impl() const { return "masked_load"; }
masked_load_inst(value *ptr, value *mask, value *false_value,
std::string repr_impl() const { return "masked_load" + get_cache_modifier_repr(); }
masked_load_inst(value *ptr, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile,
const std::string &name, instruction *next);
public:
@@ -429,6 +542,8 @@ public:
value *get_false_value_operand() { return get_operand(2); }
// factory method
static masked_load_inst* create(value *ptr, value *mask, value *false_value,
CACHE_MODIFIER cache, EVICTION_POLICY eviction,
bool is_volatile,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(masked_load_inst)
@@ -438,9 +553,10 @@ public:
// masked load async
class masked_load_async_inst: public load_inst {
private:
std::string repr_impl() const { return "masked_load_async_async"; }
std::string repr_impl() const { return "masked_load_async" + get_cache_modifier_repr(); }
masked_load_async_inst(value *ptr, value *mask, value *false_value,
const std::string &name, instruction *next);
CACHE_MODIFIER cache, EVICTION_POLICY eviction,
const std::string &name, instruction *next);
public:
// accessors
@@ -448,6 +564,8 @@ public:
value *get_false_value_operand() { return get_operand(2); }
// factory method
static masked_load_async_inst* create(value *ptr, value *mask, value *false_value,
load_inst::CACHE_MODIFIER cache,
EVICTION_POLICY eviction,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(masked_load_async_inst)
@@ -459,7 +577,7 @@ public:
// store
class store_inst: public io_inst {
protected:
store_inst(value *ptr, value_id_t id, unsigned num_ops,
store_inst(value *ptr, value_id_t id, unsigned num_ops, EVICTION_POLICY eviction,
const std::string &name = "", instruction *next = nullptr);
public:
@@ -470,11 +588,11 @@ public:
class unmasked_store_inst: public store_inst{
private:
std::string repr_impl() const { return "unmasked_store"; }
unmasked_store_inst(value *ptr, value *v, const std::string &name, instruction *next);
unmasked_store_inst(value *ptr, value *v, EVICTION_POLICY eviction, const std::string &name, instruction *next);
public:
// factory method
static unmasked_store_inst* create(value* ptr, value *v,
static unmasked_store_inst* create(value* ptr, value *v, EVICTION_POLICY eviction,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(unmasked_store_inst)
@@ -484,24 +602,77 @@ public:
class masked_store_inst: public store_inst{
private:
std::string repr_impl() const { return "masked_store"; }
masked_store_inst(value *ptr, value *v, value *mask,
masked_store_inst(value *ptr, value *v, value *mask, EVICTION_POLICY eviction,
const std::string &name, instruction *next);
public:
// accessors
value *get_mask_operand() { return get_operand(2); }
// factory method
static masked_store_inst* create(value *ptr, value *v, value *mask,
static masked_store_inst* create(value *ptr, value *v, value *mask, EVICTION_POLICY eviction,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(masked_store_inst)
_TRITON_DEFINE_ACCEPT(masked_store_inst)
};
//===----------------------------------------------------------------------===//
// struct classes
//===----------------------------------------------------------------------===//
// insert_value
class insert_value_inst: public instruction {
private:
std::string repr_impl() const { return "insertvalue"; }
insert_value_inst(value *val, value *elt, size_t idx, const std::string &name, instruction *next);
public:
static insert_value_inst* create(value *val, value* elt, size_t idx, const std::string &name = "", instruction *next = nullptr);
size_t get_idx() { return idx_; }
_TRITON_DEFINE_CLONE(insert_value_inst)
_TRITON_DEFINE_ACCEPT(insert_value_inst)
private:
size_t idx_;
};
// extract_value
class extract_value_inst: public instruction {
private:
std::string repr_impl() const { return "extractvalue"; }
extract_value_inst(value *val, size_t idx, const std::string &name, instruction *next);
public:
static extract_value_inst* create(value *val, size_t idx, const std::string &name = "", instruction *next = nullptr);
size_t get_idx() { return idx_; }
_TRITON_DEFINE_CLONE(extract_value_inst)
_TRITON_DEFINE_ACCEPT(extract_value_inst)
private:
size_t idx_;
};
//===----------------------------------------------------------------------===//
// retile_inst classes
//===----------------------------------------------------------------------===//
// cat
class cat_inst: public instruction {
private:
std::string repr_impl() const { return "cat"; }
cat_inst(value *x, value *y, const std::string &name, instruction *next);
public:
static instruction* create(value *lhs, value *rhs,
const std::string &name = "",
instruction *next = nullptr);
_TRITON_DEFINE_CLONE(cat_inst)
_TRITON_DEFINE_ACCEPT(cat_inst)
};
// retile
class retile_inst: public unary_inst {
@@ -608,6 +779,8 @@ private:
class atomic_inst: public io_inst {
public:
using io_inst::io_inst;
atomic_inst(type *ty, value_id_t id, unsigned num_ops, const std::string &name, instruction *next):
io_inst(ty, id, num_ops, NORMAL, name, next) {}
};
class atomic_rmw_inst: public atomic_inst {
@@ -636,6 +809,17 @@ public:
static instruction* create(value *ptr, value *cmp, value *val, const std::string &name = "", instruction *next = nullptr);
};
class umulhi_inst: public builtin_inst {
private:
umulhi_inst(value *lhs, value *rhs, const std::string &name = "", instruction *next = nullptr);
std::string repr_impl() const { return "umulhi"; }
_TRITON_DEFINE_CLONE(umulhi_inst)
_TRITON_DEFINE_ACCEPT(umulhi_inst)
public:
static instruction* create(value *lhs, value *rhs, const std::string &name = "", instruction *next = nullptr);
};
class exp_inst: public builtin_inst {
private:
exp_inst(value *val, const std::string &name = "", instruction *next = nullptr);
@@ -684,24 +868,40 @@ public:
class dot_inst: public builtin_inst {
public:
enum TransT { NoTrans, Trans };
enum DataType {
FP8, FP16, BF16, TF32, FP32,
INT1, INT4, INT8, INT32,
UNKNOWN,
};
private:
dot_inst(value *A, value *B, value *C, TransT AT, TransT BT, const std::string &name, instruction *next);
dot_inst(value *A, value *B, value *C, TransT AT, TransT BT, bool allow_tf32, const std::string &name, instruction *next);
std::string repr_impl() const { return "dot"; }
bool is_prefetched_ = false;
public:
bool is_prefetched() const { return is_prefetched_; }
void set_prefetched(bool is_prefetched) { is_prefetched_ = is_prefetched; }
bool allow_tf32() const { return allow_tf32_; }
bool is_trans_a() const { return AT_ == Trans; }
bool is_trans_b() const { return BT_ == Trans; }
public:
static instruction *create(value *A, value *B, value *C, bool AT, bool BT, const std::string &name = "", instruction *next = nullptr);
static instruction* create_nn(value *A, value *B, value *C, const std::string &name = "", instruction *next = nullptr);
static instruction* create_nt(value *A, value *B, value *C, const std::string &name = "", instruction *next = nullptr);
static instruction* create_tn(value *A, value *B, value *C, const std::string &name = "", instruction *next = nullptr);
static instruction* create_tt(value *A, value *B, value *C, const std::string &name = "", instruction *next = nullptr);
static instruction *create(value *A, value *B, value *C, bool AT, bool BT, bool allow_tf32, const std::string &name = "", instruction *next = nullptr);
static instruction* create_nn(value *A, value *B, value *C, bool allow_tf32, const std::string &name = "", instruction *next = nullptr);
static instruction* create_nt(value *A, value *B, value *C, bool allow_tf32, const std::string &name = "", instruction *next = nullptr);
static instruction* create_tn(value *A, value *B, value *C, bool allow_tf32, const std::string &name = "", instruction *next = nullptr);
static instruction* create_tt(value *A, value *B, value *C, bool allow_tf32, const std::string &name = "", instruction *next = nullptr);
_TRITON_DEFINE_CLONE(dot_inst)
_TRITON_DEFINE_ACCEPT(dot_inst)
private:
bool is_prefetched_ = false;
bool allow_tf32_ = false;
DataType C_type_ = DataType::FP32;
DataType A_type_ = DataType::FP16;
DataType B_type_ = DataType::FP16;
TransT AT_;
TransT BT_;
};
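
A one-call sketch of the corresponding builder entry point (signature per create_dot in builder.h earlier in this diff; emit_dot and the A/B/C arguments are hypothetical):

triton::ir::value *emit_dot(triton::ir::builder &bld, triton::ir::value *A,
                            triton::ir::value *B, triton::ir::value *C) {
  // allow_tf32=true opts fp32 operands into the TF32 tensor-core path
  return bld.create_dot(A, B, C, /*trans_a=*/false, /*trans_b=*/false,
                        /*allow_tf32=*/true);
}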
//class outer_inst: public builtin_inst {
@@ -743,8 +943,11 @@ public:
class reduce_inst: public builtin_inst {
public:
enum op_t{
ADD, SUB, MAX, MIN,
FADD, FSUB, FMAX, FMIN
ADD, SUB, MAX, MIN, UMAX, UMIN,
ARGMAX, ARGMIN, ARGUMAX, ARGUMIN,
FADD, FSUB, FMAX, FMIN,
ARGFMAX, ARGFMIN,
XOR
};
private:
@@ -761,12 +964,19 @@ public:
static instruction* create(value *arg, op_t op, unsigned axis, const std::string &name = "", instruction *next = nullptr);
unsigned get_axis() const { return axis_; }
op_t get_op() const { return op_; }
bool with_index() const {
return with_index_ops_.find(op_) != with_index_ops_.end();
}
private:
unsigned axis_;
op_t op_;
const static inline std::set<op_t> with_index_ops_ = {
op_t::ARGMAX, op_t::ARGMIN, op_t::ARGUMAX,
op_t::ARGUMIN, op_t::ARGFMAX, op_t::ARGFMIN};
unsigned axis_;
op_t op_;
};
class select_inst: public builtin_inst {
private:
select_inst(value *pred, value *if_value, value *else_value, const std::string& name, instruction* next);
@@ -785,6 +995,7 @@ public:
// intrinsics classes
//===----------------------------------------------------------------------===//
class copy_to_shared_inst: public unary_inst{
private:
using unary_inst::unary_inst;
@@ -853,11 +1064,11 @@ class prefetch_s_inst : public instruction {
std::string repr_impl() const { return "prefetch_s"; }
_TRITON_DEFINE_CLONE(prefetch_s_inst)
_TRITON_DEFINE_ACCEPT(prefetch_s_inst)
/// inc_: 0->first, 1->latch
int inc_ = 0;
public:
prefetch_s_inst(context &ctx, value *arg, int inc, const std::string &name, instruction *next)
: instruction(type::get_void_ty(ctx), INST_PREFETCH_S, 1, name, next), inc_(inc) {
set_operand(0, arg);
}
@@ -866,35 +1077,6 @@ public:
instruction *next=nullptr);
};
//// On NVIDIA, implementation is such that
//// constant_range = nv_dynamic_program_idx + nv_static_program_idx
//// so as to enable re-association on nv_static_program_idx which is constant
//class make_range_dyn: public instruction {
//private:
// make_range_dyn(type *ty, const std::string &name, instruction *next);
// std::string repr_impl() const { return "nv_dynamic_program_idx"; }
// _TRITON_DEFINE_CLONE(make_range_dyn)
// _TRITON_DEFINE_ACCEPT(make_range_dyn)
//public:
// static make_range_dyn* create(type *ty, const std::string &name = "", instruction *next = nullptr);
//};
//class make_range_sta: public constant {
//private:
// make_range_sta(make_range *range);
//public:
// static make_range_sta *get(make_range* range);
// make_range* get_range() const;
// std::string repr() const { return "nv_static_program_idx"; }
// _TRITON_DEFINE_ACCEPT(make_range_sta)
//private:
// make_range *range_;
//};
/* constant range */
class make_range: public instruction{
make_range(type *ty, constant_int* first, constant_int* last);
@@ -912,7 +1094,53 @@ private:
constant_int* last_;
};
/* timing utilities */
class clock_inst: public instruction{
clock_inst(context &ctx, const std::string &name, instruction *next);
std::string repr_impl() const { return "clock"; }
_TRITON_DEFINE_CLONE(clock_inst)
_TRITON_DEFINE_ACCEPT(clock_inst)
public:
static clock_inst* create(context &ctx, const std::string &name = "", instruction *next = nullptr);
};
class globaltimer_inst: public instruction{
globaltimer_inst(context &ctx, const std::string &name, instruction *next);
std::string repr_impl() const { return "globaltimer"; }
_TRITON_DEFINE_CLONE(globaltimer_inst)
_TRITON_DEFINE_ACCEPT(globaltimer_inst)
public:
static globaltimer_inst* create(context &ctx, const std::string &name = "", instruction *next = nullptr);
};
class extern_elementwise_inst : public instruction {
extern_elementwise_inst(context &ctx, const std::vector<value *> &args,
type *dst_ty, const std::string &lib_name,
const std::string &extern_lib_path,
const std::string &symbol_name,
const std::string &name, instruction *next);
std::string repr_impl() const { return "extern_elementwise"; }
_TRITON_DEFINE_CLONE(extern_elementwise_inst)
_TRITON_DEFINE_ACCEPT(extern_elementwise_inst)
public:
static extern_elementwise_inst *create(
context &ctx, const std::vector<value *> &args, type *dst_ty,
const std::string &lib_name = "", const std::string &lib_path = "",
const std::string &symbol_name = "", const std::string &name = "",
instruction *next = nullptr);
const std::string &get_lib_name() const { return lib_name_; }
const std::string &get_lib_path() const { return lib_path_; }
const std::string &get_symbol_name() const { return symbol_name_; }
private:
std::string lib_name_;
std::string lib_path_;
std::string symbol_name_;
};
}
}
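
A sketch of driving the new extern_elementwise_inst through the builder method shown earlier in this diff. The library name, path, and symbol strings below are illustrative placeholders, not values from the source.

#include "triton/ir/builder.h"

// Sketch: map a block elementwise onto an external device function.
triton::ir::value *emit_ext_sin(triton::ir::builder &bld, triton::ir::value *x) {
  return bld.create_extern_elementwise(
      "libdevice",                // lib_name (placeholder)
      "/path/to/libdevice.10.bc", // lib_path (placeholder)
      "__nv_sinf",                // symbol_name (placeholder)
      {x},                        // args
      bld.get_float_ty());        // ret_ty
}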

View File

@@ -3,6 +3,8 @@
#ifndef _TRITON_IR_METADATA_H_
#define _TRITON_IR_METADATA_H_
#include <vector>
namespace triton{
namespace ir{
@@ -16,14 +18,14 @@ public:
};
private:
metadata(kind_t kind, unsigned value);
metadata(kind_t kind, std::vector<unsigned> value);
public:
static metadata* get(kind_t kind, unsigned value);
static metadata* get(kind_t kind, std::vector<unsigned> value);
private:
kind_t kind_;
unsigned value_;
std::vector<unsigned> value_;
};
}
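
With metadata values widened from unsigned to std::vector<unsigned>, hints such as multiple_of can now carry one entry per axis. A hedged sketch of a call site (the annotate helper and the {16, 16} values are hypothetical; set_metadata and the kind enum are as shown in these headers):

#include "triton/ir/instructions.h"
#include "triton/ir/metadata.h"

// Sketch: attach a per-axis alignment hint to an instruction.
void annotate(triton::ir::instruction *i) {
  // one entry per block axis, e.g. a 2-D tile aligned to 16 on both axes
  i->set_metadata(triton::ir::metadata::multiple_of, {16, 16});
}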

View File

@@ -34,50 +34,74 @@ class constant;
class global_value;
class alloc_const;
/* Module */
class module {
class value_constructor {
typedef std::pair<std::string, basic_block*> val_key_t;
friend class function;
typedef std::pair<ir::metadata::kind_t, unsigned> md_pair_t;
public:
typedef std::map<std::string, global_value*> symbols_map_t;
typedef std::vector<function*> functions_list_t;
struct current_iteration_info_t{
lang::iteration_statement *statement;
basic_block *block;
};
private:
phi_node *make_phi(type *ty, unsigned num_values, basic_block *block);
value *try_remove_trivial_phis(ir::phi_node *&phi);
value *add_phi_operands(const std::string& name, phi_node *&phi);
value *get_value_recursive(const std::string& name, basic_block *block);
void push_function(function *fn) { functions_.push_back(fn); }
public:
module(const std::string &name, builder& builder);
builder& get_builder();
// Setters
value_constructor(builder &builder);
void set_value(const std::string& name, basic_block* block, value *x);
void set_value(const std::string& name, value* x);
void set_const(const std::string& name);
void set_continue_fn(std::function<ir::value*()> fn);
// Getters
const std::map<val_key_t, value*>& get_values() { return values_; }
void set_values(const std::map<val_key_t, value*>& values) { values_ = values; }
value *get_value(const std::string& name, basic_block* block);
value *get_value(const std::string& name);
void set_type(const std::string& name, ir::type* ty) { types_[name] = ty; }
const std::string& get_name();
std::function<ir::value*()> get_continue_fn();
// Seal block -- no more predecessors will be added
void seal_block(basic_block *block);
// Metadata
private:
ir::builder& builder_;
std::map<val_key_t, value*> values_;
std::map<std::string, type*> types_;
std::set<basic_block*> sealed_blocks_;
std::map<basic_block*, std::map<std::string, phi_node*>> incomplete_phis_;
std::map<value*, value**> current_phi_;
};
/* Module */
class module {
typedef std::pair<std::string, basic_block*> val_key_t;
typedef std::pair<ir::metadata::kind_t, std::vector<unsigned>> md_pair_t;
friend class function;
public:
typedef std::map<std::string, global_value*> symbols_map_t;
typedef std::vector<function*> functions_list_t;
private:
void push_function(function *fn) { functions_.push_back(fn); }
public:
module(const std::string &name, builder &builder): name_(name), builder_(builder) {}
builder &get_builder() { return builder_; };
const std::string& get_name() { return name_; };
// Functions
const functions_list_t &get_function_list() const { return functions_; }
functions_list_t &get_function_list() { return functions_; }
function *get_function(const std::string& name) {
if(symbols_.find(name) == symbols_.end())
throw std::runtime_error("function " + name + " is not declared");
return (function*)symbols_.at(name);
}
function *get_or_insert_function(const std::string &name, function_type *ty);
bool has_function(const std::string& name){
return symbols_.find(name) != symbols_.end();
}
void remove_function(ir::function* fn){
functions_.erase(std::remove(functions_.begin(), functions_.end(), fn), functions_.end());
}
void reset_ret_ty(const std::string& name, type* ty);
// Const allocation
void add_alloc(ir::alloc_const* x) { allocs_.push_back(x); }
const std::vector<ir::alloc_const*>& allocs() { return allocs_; }
@@ -85,22 +109,15 @@ public:
void register_global(const std::string& name, ir::value *x) { globals_[name] = x; }
const std::map<std::string, ir::value*>& globals() const { return globals_; }
// Metadata
void add_metadata(const std::string &name, md_pair_t x) { metadatas_[name] = x; }
void print(std::ostream &os);
void add_metadata(const std::string &name, md_pair_t x) { metadatas_[name] = x; }
const std::map<std::string, md_pair_t> &get_metadatas() const { return metadatas_; }
private:
std::string name_;
builder& builder_;
std::map<val_key_t, value*> values_;
std::map<std::string, type*> types_;
std::set<std::string> const_;
std::set<basic_block*> sealed_blocks_;
std::map<basic_block*, std::map<std::string, phi_node*>> incomplete_phis_;
builder &builder_;
functions_list_t functions_;
symbols_map_t symbols_;
std::function<ir::value*()> continue_fn_;
std::map<value*, value**> current_phi_;
std::vector<ir::alloc_const*> allocs_;
std::map<std::string, ir::value*> globals_;
std::map<std::string, md_pair_t> metadatas_;

View File

@@ -1,4 +1,4 @@
#pragma once
#ifndef _TRITON_IR_TYPE_H_
#define _TRITON_IR_TYPE_H_
@@ -6,6 +6,7 @@
#include <cassert>
#include <vector>
#include <string>
#include <stdexcept>
namespace triton{
namespace ir{
@@ -20,7 +21,6 @@ class type {
public:
typedef std::vector<unsigned> block_shapes_t;
protected:
typedef std::vector<type*> contained_tys_vec_t;
typedef contained_tys_vec_t::iterator ty_iterator;
typedef contained_tys_vec_t::const_iterator const_ty_iterator;
@@ -68,23 +68,24 @@ public:
type *get_tile_element_ty() const;
unsigned get_pointer_address_space() const;
type *get_pointer_element_ty() const;
unsigned get_struct_numel() const { return contained_tys_.size(); }
type *get_struct_type(unsigned int i) const { return contained_tys_[i]; }
// primitive predicates
bool is_void_ty() const { return id_ == VoidTyID; }
bool is_fp8_ty() const { return id_ == FP8TyID; }
bool is_fp16_ty() const { return id_ == FP16TyID; }
bool is_bf16_ty() const { return id_ == BF16TyID; }
bool is_fp32_ty() const { return id_ == FP32TyID; }
bool is_fp64_ty() const { return id_ == FP64TyID; }
bool is_label_ty() const { return id_ == LabelTyID;}
bool is_metadata_ty() const { return id_ == MetadataTyID; }
bool is_token_ty() const { return id_ == TokenTyID; }
bool is_integer_ty() const { return id_ == IntegerTyID; }
bool is_integer_ty(unsigned bitwidth) { return is_integer_ty() &&
get_integer_bitwidth() == bitwidth;}
bool is_bool_ty() const { return is_integer_ty(1); }
bool is_pointer_ty() const { return id_ == PointerTyID; }
bool is_block_ty() const { return id_ == BlockTyID; }
bool is_struct_ty() const { return id_ == StructTyID; }
// Composite predicates
bool is_int_or_tileint_ty();
@@ -128,21 +129,21 @@ public:
switch(id_) {
case VoidTyID: return "void";
case FP8TyID: return "fp8";
case BF16TyID: return "bf16";
case FP16TyID: return "f16";
case FP32TyID: return "f32";
case FP64TyID: return "f64";
case LabelTyID: return "label";
case MetadataTyID: return "md";
case TokenTyID: return "tok";
case IntegerTyID: return "i" + std::to_string(get_integer_bitwidth());
case IntegerTyID: return ("i") + std::to_string(get_integer_bitwidth());
case FunctionTyID: return "fn";
case PointerTyID: return get_pointer_element_ty()->repr() + "*";
case StructTyID: return "struct";
case BlockTyID: return tile_repr();
default: break;
}
assert(false);
return "";
throw std::logic_error("unknown type id '" + std::to_string(id_) + "'");
};
private:
@@ -159,7 +160,7 @@ class integer_type: public type {
private:
// constructors
integer_type(context &ctx, unsigned bitwidth)
: type(ctx, IntegerTyID), bitwidth_(bitwidth){ }
: type(ctx, IntegerTyID), bitwidth_(bitwidth) {}
public:
// accessors
@@ -181,6 +182,16 @@ public:
type* get_type_at_index(value *idx) const;
};
class struct_type: public composite_type {
public:
struct_type(const contained_tys_vec_t& tys, bool is_packed);
unsigned get_num_types() const { return contained_tys_.size(); }
static struct_type* get(const contained_tys_vec_t& tys, bool is_packed);
private:
bool is_packed_;
};
class block_type: public composite_type {
private:
block_type(type *ty, const block_shapes_t &shapes);
@@ -229,6 +240,7 @@ public:
ty_iterator params_end() { return contained_tys_.end(); }
type* get_param_ty(unsigned i) const { return contained_tys_.at(1 + i); }
type* get_return_ty() const { return contained_tys_.at(0); }
void reset_ret_ty(type* ty) { contained_tys_[0] = ty;}
// factory methods
static function_type* get(type *ret_ty, const std::vector<type*>& param_tys);
};

View File

@@ -22,6 +22,7 @@ public:
};
void for_each_instruction(ir::module& mod, const std::function<void(triton::ir::instruction*)> &fn);
void for_each_instruction_backward(module &mod, const std::function<void (instruction *)> &do_work);
void for_each_value(ir::module& mod, const std::function<void(triton::ir::value *)> &fn);
}

View File

@@ -21,7 +21,7 @@ class visitor;
class value {
public:
typedef std::set<user*> users_t;
typedef std::vector<user*> users_t;
public:
// constructor
@@ -30,7 +30,7 @@ public:
// uses
void add_use(user* arg);
users_t::iterator erase_use(user* arg);
const std::set<user*> &get_users() { return users_; }
const std::vector<user*> &get_users() { return users_; }
void replace_all_uses_with(value *target);
// name
void set_name(const std::string &name);

View File

@@ -11,12 +11,16 @@ class value;
class instruction;
class call_inst;
class launch_inst;
class phi_node;
class binary_operator;
class getelementptr_inst;
class icmp_inst;
class fcmp_inst;
class dequantize_inst;
class cast_inst;
class trunc_inst;
class z_ext_inst;
@@ -42,12 +46,17 @@ class masked_load_inst;
class unmasked_store_inst;
class masked_store_inst;
class extract_value_inst;
class insert_value_inst;
class retile_inst;
class reshape_inst;
class splat_inst;
class cat_inst;
class broadcast_inst;
class downcast_inst;
class umulhi_inst;
class exp_inst;
class cos_inst;
class sin_inst;
@@ -73,6 +82,10 @@ class async_wait_inst;
class make_range_dyn;
class make_range;
class prefetch_s_inst;
class clock_inst;
class globaltimer_inst;
class extern_elementwise_inst;
class make_range_sta;
class undef_value;
@@ -101,6 +114,8 @@ public:
virtual ~visitor() {}
virtual void visit_value(ir::value*);
virtual void visit_call_inst(ir::call_inst*) = 0;
virtual void visit_launch_inst(ir::launch_inst*) = 0;
virtual void visit_basic_block(basic_block*) = 0;
virtual void visit_argument(argument*) = 0;
@@ -110,6 +125,7 @@ public:
virtual void visit_icmp_inst(icmp_inst*) = 0;
virtual void visit_fcmp_inst(fcmp_inst*) = 0;
virtual void visit_dequantize_inst(dequantize_inst*) = 0;
virtual void visit_cast_inst(cast_inst*) = 0;
virtual void visit_return_inst(return_inst*) = 0;
@@ -122,13 +138,18 @@ public:
virtual void visit_unmasked_store_inst(unmasked_store_inst*) = 0;
virtual void visit_masked_store_inst(masked_store_inst*) = 0;
virtual void visit_umulhi_inst(umulhi_inst*) = 0;
virtual void visit_exp_inst(exp_inst*) = 0;
virtual void visit_cos_inst(cos_inst*) = 0;
virtual void visit_sin_inst(sin_inst*) = 0;
virtual void visit_log_inst(log_inst*) = 0;
virtual void visit_extract_value_inst(extract_value_inst*) = 0;
virtual void visit_insert_value_inst(insert_value_inst*) = 0;
virtual void visit_reshape_inst(reshape_inst*) = 0;
virtual void visit_splat_inst(splat_inst*) = 0;
virtual void visit_cat_inst(cat_inst*) = 0;
virtual void visit_broadcast_inst(broadcast_inst*) = 0;
virtual void visit_downcast_inst(downcast_inst*) = 0;
@@ -150,17 +171,18 @@ public:
virtual void visit_masked_load_async_inst(masked_load_async_inst*)= 0;
virtual void visit_barrier_inst(barrier_inst*) = 0;
virtual void visit_async_wait_inst(async_wait_inst*) = 0;
// virtual void visit_make_range_dyn(make_range_dyn*) = 0;
virtual void visit_make_range(make_range*) = 0;
virtual void visit_prefetch_s_inst(prefetch_s_inst*) = 0;
virtual void visit_function(function*) = 0;
virtual void visit_clock_inst(clock_inst*) = 0;
virtual void visit_globaltimer_inst(globaltimer_inst*) = 0;
// virtual void visit_make_range_sta(make_range_sta*) = 0;
virtual void visit_undef_value(undef_value*) = 0;
virtual void visit_constant_int(constant_int*) = 0;
virtual void visit_constant_fp(constant_fp*) = 0;
virtual void visit_alloc_const(alloc_const*) = 0;
virtual void visit_extern_elementwise_inst(extern_elementwise_inst*) = 0;
};
}

View File

@@ -3,8 +3,9 @@
#ifndef _TRITON_TOOLS_THREAD_GRAPH_H_
#define _TRITON_TOOLS_THREAD_GRAPH_H_
#include "llvm/ADT/SetVector.h"
#include <map>
#include <set>
#include <vector>
#include <iostream>
@@ -13,21 +14,21 @@ namespace tools{
template<class node_t>
class graph {
typedef std::map<node_t, std::set<node_t>> edges_t;
typedef std::map<node_t, llvm::SetVector<node_t>> edges_t;
public:
typedef std::map<size_t, std::vector<node_t>> cmap_t;
typedef std::map<node_t, size_t> nmap_t;
private:
void connected_components_impl(node_t x, std::set<node_t> &nodes,
void connected_components_impl(node_t x, llvm::SetVector<node_t> &nodes,
nmap_t* nmap, cmap_t* cmap, int id) const {
if(nmap)
(*nmap)[x] = id;
if(cmap)
(*cmap)[id].push_back(x);
if(nodes.find(x) != nodes.end()) {
nodes.erase(x);
if (nodes.count(x)) {
nodes.remove(x);
for(const node_t &y: edges_.at(x))
connected_components_impl(y, nodes, nmap, cmap, id);
}
@@ -39,7 +40,7 @@ public:
cmap->clear();
if(nmap)
nmap->clear();
std::set<node_t> nodes = nodes_;
llvm::SetVector<node_t> nodes = nodes_;
unsigned id = 0;
while(!nodes.empty()){
connected_components_impl(*nodes.begin(), nodes, nmap, cmap, id++);
@@ -59,7 +60,7 @@ public:
}
private:
std::set<node_t> nodes_;
llvm::SetVector<node_t> nodes_;
edges_t edges_;
};
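
The switch from std::set to llvm::SetVector above is presumably about deterministic iteration: SetVector walks elements in insertion order rather than by pointer comparison, so connected-component numbering no longer depends on where nodes happen to be allocated. A standalone sketch, assuming only LLVM's ADT headers:

#include "llvm/ADT/SetVector.h"
#include <cstdio>

int main() {
  llvm::SetVector<int> nodes;
  nodes.insert(3);
  nodes.insert(1);
  nodes.insert(2);
  nodes.remove(1);        // removal is supported, as used in the graph above
  for (int n : nodes)     // visits in insertion order
    std::printf("%d ", n); // prints "3 2", not the sorted "2 3"
  return 0;
}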

View File

@@ -13,6 +13,14 @@ namespace tools
{
#ifdef _WIN32
#define popen _popen
#define pclose _pclose
#endif
#ifndef WEXITSTATUS
#define WEXITSTATUS(stat_val) ((unsigned)(stat_val) & 255)
#endif
int exec(const std::string& cmd, std::string& result) {
char buffer[128];

View File

@@ -33,19 +33,10 @@ namespace tools
inline std::string getenv(const char * name)
{
#ifdef _MSC_VER
char* cache_path = 0;
std::size_t sz = 0;
_dupenv_s(&cache_path, &sz, name);
#else
const char * cstr = std::getenv(name);
#endif
const char * cstr = std::getenv(name);
if(!cstr)
return "";
std::string result(cstr);
#ifdef _MSC_VER
free(cache_path);
#endif
return result;
}

View File

@@ -115,6 +115,18 @@ std::vector<align::cst_info> align::populate_is_constant_reshape(ir::reshape_ins
return add_to_cache(x, result, is_constant_);
}
std::vector<align::cst_info> align::populate_is_constant_dequantize(ir::dequantize_inst* x) {
auto x_shapes = get_shapes(x);
std::vector<cst_info> result;
ir::value *op = x->get_operand(0);
auto op_shapes = op->get_type()->get_block_shapes();
auto op_cst = populate_is_constant(op);
for(size_t d = 0; d < x_shapes.size(); d++) {
result.push_back(op_cst[d]);
}
return add_to_cache(x, result, is_constant_);
}
std::vector<align::cst_info> align::populate_is_constant_broadcast(ir::broadcast_inst* x) {
auto x_shapes = get_shapes(x);
std::vector<cst_info> result;
@@ -129,6 +141,36 @@ std::vector<align::cst_info> align::populate_is_constant_broadcast(ir::broadcast
return add_to_cache(x, result, is_constant_);
}
std::vector<align::cst_info> align::populate_is_constant_cmp(ir::cmp_inst* x) {
auto x_shapes = get_shapes(x);
std::vector<cst_info> result;
ir::value* lhs_op = x->get_operand(0);
ir::value* rhs_op = x->get_operand(1);
auto lhs = populate_is_constant(lhs_op);
auto rhs = populate_is_constant(rhs_op);
auto lhs_max_contiguous = populate_max_contiguous(lhs_op);
auto rhs_max_contiguous = populate_max_contiguous(rhs_op);
auto lhs_multiple_of = populate_starting_multiple(lhs_op);
auto rhs_multiple_of = populate_starting_multiple(rhs_op);
for(size_t d = 0; d < x_shapes.size(); d++) {
cst_info ax = {1, 0};
// Examples:
// 16 17 18 ... 32 < 24 24 24 ... 24 => equal in groups of 8
// 16 17 18 ... 32 < 20 20 20 ... 20 => equal in groups of 4
// 16 17 18 ... 32 < 16 16 16 ... 16 => equal in groups of 16
//
// if LHS is a range of N contiguous (or equal) elements that starts at M,
// and RHS is a set of N constants that start at K
// then the result is constant in groups of gcd(M, K)
if(rhs[d].num_cst % lhs_max_contiguous[d] == 0 ||
rhs[d].num_cst % lhs[d].num_cst == 0)
ax.num_cst = gcd(lhs_multiple_of[d], rhs_multiple_of[d]);
result.push_back(ax);
}
return add_to_cache(x, result, is_constant_);
}
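
Working the comment's first example through the code above (arithmetic only, inferred from the diff): lhs is the range 16, 17, ..., so its starting multiple is 16; rhs is the constant 24, whose largest power-of-two divisor is 8; gcd(16, 8) = 8, hence "equal in groups of 8". For rhs = 20 the divisor is 4 and gcd(16, 4) = 4; for rhs = 16 it is 16 and gcd(16, 16) = 16, matching all three worked examples.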
std::vector<align::cst_info> align::populate_is_constant_binop(ir::binary_operator* x) {
auto x_shapes = get_shapes(x);
std::vector<cst_info> result;
@@ -136,12 +178,14 @@ std::vector<align::cst_info> align::populate_is_constant_binop(ir::binary_operat
ir::value* rhs_op = x->get_operand(1);
auto lhs = populate_is_constant(lhs_op);
auto rhs = populate_is_constant(rhs_op);
auto max_contiguous = populate_max_contiguous(lhs_op);
auto lhs_max_contiguous = populate_max_contiguous(lhs_op);
auto rhs_max_contiguous = populate_max_contiguous(rhs_op);
auto lhs_multiple_of = populate_starting_multiple(lhs_op);
auto rhs_multiple_of = populate_starting_multiple(rhs_op);
for(size_t d = 0; d < x_shapes.size(); d++) {
cst_info ax;
if(lhs[d].num_cst==0 && rhs[d].value && x->is_int_div()){
// todo might not be entirely true
unsigned num_constants = gcd(max_contiguous[d], rhs[d].value);
unsigned num_constants = gcd(lhs_max_contiguous[d], rhs[d].value);
ax = {num_constants, 0};
}
else
@@ -180,10 +224,14 @@ std::vector<align::cst_info> align::populate_is_constant(ir::value *v) {
return populate_is_constant_splat(x);
if(auto *x = dynamic_cast<ir::reshape_inst*>(v))
return populate_is_constant_reshape(x);
if(auto *x = dynamic_cast<ir::dequantize_inst*>(v))
return populate_is_constant_dequantize(x);
if(auto *x = dynamic_cast<ir::broadcast_inst*>(v))
return populate_is_constant_broadcast(x);
if(auto *x = dynamic_cast<ir::binary_operator*>(v))
return populate_is_constant_binop(x);
if(auto *x = dynamic_cast<ir::cmp_inst*>(v))
return populate_is_constant_cmp(x);
if(auto *x = dynamic_cast<ir::getelementptr_inst*>(v))
return populate_is_constant_gep(x);
return populate_is_constant_default(v);
@@ -245,6 +293,23 @@ std::vector<unsigned> align::populate_max_contiguous_reshape(ir::reshape_inst* x
return add_to_cache(x, result, max_contiguous_);
}
std::vector<unsigned> align::populate_max_contiguous_dequantize(ir::dequantize_inst* x) {
auto shapes = get_shapes(x);
std::vector<unsigned> result;
ir::value *op = x->get_operand(0);
auto ret_last_dim = (x->get_type()->get_block_shapes()).back();
auto op_last_dim = (op->get_type()->get_block_shapes()).back();
auto op_mc = populate_max_contiguous(op);
for(size_t d = 0; d < shapes.size(); d++) {
unsigned factor = 1;
if (d == shapes.size() - 1) {
factor = ret_last_dim / op_last_dim;
}
result.push_back(factor * op_mc[d]);
}
return add_to_cache(x, result, max_contiguous_);
}
std::vector<unsigned> align::populate_max_contiguous_broadcast(ir::broadcast_inst* x) {
auto shapes = get_shapes(x);
std::vector<unsigned> result;
@@ -285,8 +350,8 @@ std::vector<unsigned> align::populate_max_contiguous_binop(ir::binary_operator*
}
if(x->is_int_add_sub()){
unsigned lvalue = 1, rvalue = 1;
lvalue = gcd(rhs_max_contiguous[d], lhs_starting_multiple[d]);
rvalue = gcd(lhs_max_contiguous[d], rhs_starting_multiple[d]);
lvalue = gcd(rhs_max_contiguous[d], lhs_cst_info[d].num_cst);
rvalue = gcd(lhs_max_contiguous[d], rhs_cst_info[d].num_cst);
value = std::max(lvalue, rvalue);
}
result.push_back(value);
@@ -332,9 +397,9 @@ std::vector<unsigned> align::populate_max_contiguous(ir::value *v){
if(max_contiguous_.find(v) != max_contiguous_.end())
return max_contiguous_.at(v);
if(auto *x = dynamic_cast<ir::instruction*>(v)){
unsigned max_contiguous = x->get_metadata(ir::metadata::max_contiguous);
if(max_contiguous > 0)
return add_to_cache(x, {max_contiguous}, max_contiguous_);
std::vector<unsigned> max_contiguous = x->get_metadata(ir::metadata::max_contiguous);
if(!max_contiguous.empty())
return add_to_cache(x, max_contiguous, max_contiguous_);
}
if(auto *x = dynamic_cast<ir::cast_inst*>(v))
return populate_max_contiguous_cast(x);
@@ -342,6 +407,8 @@ std::vector<unsigned> align::populate_max_contiguous(ir::value *v){
return populate_max_contiguous_splat(x);
if(auto *x = dynamic_cast<ir::reshape_inst*>(v))
return populate_max_contiguous_reshape(x);
if(auto *x = dynamic_cast<ir::dequantize_inst*>(v))
return populate_max_contiguous_dequantize(x);
if(auto *x = dynamic_cast<ir::broadcast_inst*>(v))
return populate_max_contiguous_broadcast(x);
if(auto *x = dynamic_cast<ir::binary_operator*>(v))
@@ -386,6 +453,23 @@ std::vector<unsigned> align::populate_starting_multiple_reshape(ir::reshape_inst
return add_to_cache(x, result, starting_multiple_);
}
std::vector<unsigned> align::populate_starting_multiple_dequantize(ir::dequantize_inst* x){
auto shapes = get_shapes(x);
std::vector<unsigned> result;
ir::value *op = x->get_operand(0);
auto ret_last_dim = (x->get_type()->get_block_shapes()).back();
auto op_last_dim = (op->get_type()->get_block_shapes()).back();
auto op_multiple = populate_starting_multiple(op);
for(size_t d = 0; d < shapes.size(); d++) {
unsigned factor = 1;
if (d == shapes.size() - 1) {
factor = ret_last_dim / op_last_dim;
}
result.push_back(factor * op_multiple[d]);
}
return add_to_cache(x, result, starting_multiple_);
}
std::vector<unsigned> align::populate_starting_multiple_broadcast(ir::broadcast_inst* x){
auto result = populate_starting_multiple(x->get_operand(0));
return add_to_cache(x, result, starting_multiple_);
@@ -401,7 +485,7 @@ std::vector<unsigned> align::populate_starting_multiple_binop(ir::binary_operato
if(x->is_int_add_sub())
result[d] = gcd(lhs[d], rhs[d]);
if(x->is_int_div())
result[d] = 1;
result[d] = (lhs[d] == (1 << 31)) ? 1 << 31 : 1;
if(x->is_int_rem() && rhs[d] > 1){
result[d] = gcd(lhs[d], rhs[d]);
}
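
A reading note on the int-div rule above (an inference from these hunks, not stated in the diff): 1 << 31 is what get_max_multiple returns for 0, so it acts as a "this operand is known to be zero" sentinel. The ternary preserves that sentinel through integer division, since zero divided by anything is still zero, and otherwise falls back to the conservative multiple of 1.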
@@ -471,28 +555,42 @@ std::vector<unsigned> align::populate_starting_multiple_default(ir::value* v) {
return add_to_cache(v, {1}, starting_multiple_);
}
unsigned get_max_multiple(int val){
if(val == 0) return 1 << 31;
if(val % 128 == 0) return 128;
if(val % 64 == 0) return 64;
if(val % 32 == 0) return 32;
if(val % 16 == 0) return 16;
if(val % 8 == 0) return 8;
if(val % 4 == 0) return 4;
if(val % 2 == 0) return 2;
return 1;
}
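
Illustrative values for the helper above: get_max_multiple(96) returns 32 (the largest power-of-two divisor checked, up to 128), get_max_multiple(7) falls through to 1, and get_max_multiple(0) returns 1 << 31, the "divisible by anything" sentinel used by the starting-multiple analysis below.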
std::vector<unsigned> align::populate_starting_multiple(ir::value *v){
if(starting_multiple_.find(v) != starting_multiple_.end())
return starting_multiple_.at(v);
if(auto *x = dynamic_cast<ir::instruction*>(v)){
unsigned multiple_of = x->get_metadata(ir::metadata::multiple_of);
if(multiple_of > 0)
return add_to_cache(x, {multiple_of}, starting_multiple_);
std::vector<unsigned> multiple_of = x->get_metadata(ir::metadata::multiple_of);
if(!multiple_of.empty())
return add_to_cache(x, multiple_of, starting_multiple_);
}
if(auto *x = dynamic_cast<ir::cast_inst*>(v))
return populate_starting_multiple_cast(x);
if(auto *x = dynamic_cast<ir::binary_operator*>(v))
return populate_starting_multiple_binop(x);
if(auto *x = dynamic_cast<ir::constant_int*>(v))
return add_to_cache(x, {std::min<unsigned>(x->get_value(), 128)}, starting_multiple_);
return add_to_cache(x, {get_max_multiple(x->get_value())}, starting_multiple_);
if(auto *x = dynamic_cast<ir::make_range*>(v))
return add_to_cache(x, {(unsigned)x->get_first()->get_value()}, starting_multiple_);
return add_to_cache(x, {get_max_multiple(x->get_first()->get_value())}, starting_multiple_);
if(auto *x = dynamic_cast<ir::getelementptr_inst*>(v))
return populate_starting_multiple_gep(x);
if(auto *x = dynamic_cast<ir::splat_inst*>(v))
return populate_starting_multiple_splat(x);
if(auto *x = dynamic_cast<ir::reshape_inst*>(v))
return populate_starting_multiple_reshape(x);
if(auto *x = dynamic_cast<ir::dequantize_inst*>(v))
return populate_starting_multiple_dequantize(x);
if(auto *x = dynamic_cast<ir::broadcast_inst*>(v))
return populate_starting_multiple_broadcast(x);
if(auto *x = dynamic_cast<ir::phi_node*>(v))
@@ -511,12 +609,15 @@ std::vector<unsigned> align::contiguous(ir::value* v) const {
return max_contiguous_.at(v);
}
std::vector<align::cst_info> align::get_cst_info(ir::value* v) const {
return is_constant_.at(v);
}
void align::populate(ir::value *v) {
populate_is_constant(v);
populate_starting_multiple(v);
populate_max_contiguous(v);
}
void align::run(ir::module &mod) {

View File

@@ -92,8 +92,10 @@ void allocation::run(ir::module &mod) {
}
// Save maximum size of induced memory space
allocated_size_ = 0;
for(shared_layout* x: V)
for(shared_layout* x: V){
allocated_size_ = std::max<size_t>(allocated_size_, starts[x] + x->get_size());
// std::cout << "start: " << starts[x] << " | end: " << starts[x] + x->get_size() << std::endl;
}
}
}

View File

@@ -56,6 +56,17 @@ void axes::update_graph_trans(ir::instruction *i) {
graph_.add_edge({i, perm[d]}, {op, d});
}
void axes::update_graph_dequantize(ir::instruction *i) {
auto *dequantize = static_cast<ir::dequantize_inst*>(i);
auto shapes = dequantize->get_type()->get_block_shapes();
ir::value *op = dequantize->get_operand(0);
// add an edge for every axis except the last
for(unsigned d = 0; d < shapes.size() - 1; d ++){
graph_.add_edge({i, d}, {op, d});
}
}
void axes::update_graph_broadcast(ir::instruction *i) {
auto *broadcast = static_cast<ir::broadcast_inst*>(i);
auto shapes = broadcast->get_type()->get_block_shapes();
@@ -79,7 +90,7 @@ void axes::update_graph_dot(ir::instruction *i) {
graph_.add_edge({dot, d}, {D, d});
}
void axes::update_graph_elementwise(ir::instruction *i,
bool is_masked_load_async) {
if(i->get_num_operands() == 0)
return;
@@ -116,8 +127,10 @@ void axes::update_graph(ir::instruction *i) {
switch (i->get_id()) {
case ir::INST_REDUCE: return update_graph_reduce(i);
case ir::INST_RESHAPE: return update_graph_reshape(i);
- case ir::INST_SPLAT: return update_graph_no_edge(i);;
+ case ir::INST_SPLAT: return update_graph_no_edge(i);
case ir::INST_CAT: return update_graph_elementwise(i, true);
case ir::INST_TRANS: return update_graph_trans(i);
case ir::INST_DEQUANTIZE: return update_graph_dequantize(i);
case ir::INST_BROADCAST: return update_graph_broadcast(i);
case ir::INST_DOT: return update_graph_dot(i);
case ir::INST_COPY_TO_SHARED: return update_graph_no_edge(i);


@@ -23,19 +23,67 @@ inline unsigned clamp(unsigned x, unsigned a, unsigned b) {
unsigned lo = std::min(a, b);
unsigned hi = std::max(a, b);
return std::min(std::max(x, lo), hi);
}
- inline bool is_hmma_c(ir::value *v){
+ inline bool is_hmma_c(ir::value *v, int sm){
bool result = false;
if(auto *x = dynamic_cast<ir::dot_inst*>(v)){
ir::value *a = x->get_operand(0);
ir::type *a_ty = a->get_type();
ir::value *b = x->get_operand(1);
ir::type *b_ty = b->get_type();
- result = a_ty->get_scalar_ty()->is_fp16_ty() &&
-          b_ty->get_scalar_ty()->is_fp16_ty();
+ result = (a_ty->get_scalar_ty()->is_fp16_ty() && b_ty->get_scalar_ty()->is_fp16_ty()) ||
+          (a_ty->get_scalar_ty()->is_bf16_ty() && b_ty->get_scalar_ty()->is_bf16_ty()) ||
+          (a_ty->get_scalar_ty()->is_fp32_ty() && b_ty->get_scalar_ty()->is_fp32_ty() &&
+           x->allow_tf32() && sm >= 80) ||
+          (a_ty->get_scalar_ty()->is_integer_ty(8) && b_ty->get_scalar_ty()->is_integer_ty(8) &&
+           sm >= 80);
}
return result;
}
static mma_layout::TensorCoreType get_mma_type(ir::value *v) {
mma_layout::TensorCoreType mma_type;
if (auto* dot = dynamic_cast<ir::dot_inst*>(v)) {
ir::value* a = dot->get_operand(0);
ir::value* b = dot->get_operand(1);
ir::type* a_ty = a->get_type();
ir::type* b_ty = b->get_type();
ir::type* c_ty = v->get_type();
if (c_ty->get_scalar_ty()->is_fp32_ty()) {
// floating point tensor cores
if (a_ty->get_scalar_ty()->is_fp16_ty() && b_ty->get_scalar_ty()->is_fp16_ty()) {
mma_type = mma_layout::FP32_FP16_FP16_FP32;
return mma_type;
}
if (a_ty->get_scalar_ty()->is_bf16_ty() && b_ty->get_scalar_ty()->is_bf16_ty()) {
mma_type = mma_layout::FP32_BF16_BF16_FP32;
return mma_type;
}
if (a_ty->get_scalar_ty()->is_fp32_ty() && b_ty->get_scalar_ty()->is_fp32_ty()
&& dot->allow_tf32()) {
mma_type = mma_layout::FP32_TF32_TF32_FP32;
return mma_type;
}
} else if (c_ty->get_scalar_ty()->is_integer_ty(32)) {
// throw std::runtime_error("integer tensor cores are not yet supported");
// // integer tensor cores
// if (a_ty->get_scalar_ty()->is_integer_ty(1) && b_ty->get_scalar_ty()->is_integer_ty(1)) {
// mma_type = mma_layout::INT32_INT1_INT1_INT32;
// return mma_type;
// }
// if (a_ty->get_scalar_ty()->is_integer_ty(4) && b_ty->get_scalar_ty()->is_integer_ty(4)) {
// mma_type = mma_layout::INT32_INT4_INT4_INT32;
// return mma_type;
// }
if (a_ty->get_scalar_ty()->is_integer_ty(8) && b_ty->get_scalar_ty()->is_integer_ty(8)) {
mma_type = mma_layout::INT32_INT8_INT8_INT32;
return mma_type;
}
}
}
return mma_layout::NOT_APPLICABLE;
}
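Read as a table, get_mma_type keys the tensor-core configuration on the accumulator type first and the operand types second; everything else falls through to NOT_APPLICABLE. Summarized (the shape note is taken from the mma_instr_shape_ comment further down):

// f32 += f16  * f16   -> FP32_FP16_FP16_FP32   (e.g. {16, 8, 16})
// f32 += bf16 * bf16  -> FP32_BF16_BF16_FP32
// f32 += f32  * f32   -> FP32_TF32_TF32_FP32   (only when allow_tf32())
// i32 += i8   * i8    -> INT32_INT8_INT8_INT32
// i32 += i1/i4 paths exist but are commented out above.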
inline void extract_io_use(ir::value *v, std::set<ir::value*>& result) {
for(ir::user* u: v->get_users()){
auto i = dynamic_cast<ir::io_inst*>(u);
@@ -52,11 +100,12 @@ inline void extract_dot_use(ir::value *v, ir::value*& result, size_t n) {
}
}
- inline void extract_hmma_dot_use(ir::value *v, ir::value*& result, size_t n) {
+ inline void extract_hmma_dot_use(ir::value *v, ir::value*& result, size_t n, int sm) {
for(ir::user* u: v->get_users()){
auto i = dynamic_cast<ir::dot_inst*>(u);
- if(i && is_hmma_c(i) && i->get_operand(n) == v)
+ if(i && is_hmma_c(i, sm) && i->get_operand(n) == v) {
result = i;
}
}
}
@@ -142,10 +191,12 @@ mma_layout::mma_layout(size_t num_warps,
const std::vector<unsigned>& shape,
const std::vector<ir::value *> &values,
analysis::align* align, target* tgt,
- shared_layout *layout_a, shared_layout *layout_b): distributed_layout(MMA, axes, shape, values, align) {
+ shared_layout *layout_a, shared_layout *layout_b,
+ ir::value *dot): distributed_layout(MMA, axes, shape, values, align) {
tensor_core_type_ = get_mma_type(dot);
/* fragments per warp */
// try to make things as square as possible to maximize data re-use
- if(tgt->as_nvidia()->sm() < 80){
+ if(tgt->as_nvidia() && tgt->as_nvidia()->sm() < 80){
fpw_ = {2, 2, 1};
auto ord_a = layout_a->get_order();
auto ord_b = layout_b->get_order();
@@ -157,25 +208,67 @@ mma_layout::mma_layout(size_t num_warps,
int pack_size_1 = (is_b_row && !is_b_vec4) ? 2 : 1;
rep_ = {2*pack_size_0, 2*pack_size_1, 1};
spw_ = {fpw_[0]*4*rep_[0], fpw_[1]*4*rep_[1], 1};
+ contig_per_thread_ = {1, 1};
+ order_ = {0, 1};
}
else{
fpw_ = {1, 1, 1};
- spw_ = {16, 8, 1};
- rep_ = {2, 2, 1};
+ spw_ = mma_instr_shape_.at(tensor_core_type_); // e.g., {16, 8, 16} for f32.f16.f16.f32
+ contig_per_thread_ = {1, 2};
+ order_ = {1, 0};
}
- order_ = {0, 1};
/* warps per tile */
- // try to make things as square as possible to maximize data re-use
wpt_ = {1, 1, 1};
- std::vector<int> wpt_nm1;
- do{
- wpt_nm1 = wpt_;
- if(wpt_[0] * wpt_[1] * wpt_[2] < num_warps)
- wpt_[0] = clamp(wpt_[0]*2, 1, shape_[0] / spw_[0]);
- if(wpt_[0] * wpt_[1] * wpt_[2] < num_warps)
- wpt_[1] = clamp(wpt_[1]*2, 1, shape_[1] / spw_[1]);
- }while(wpt_nm1 != wpt_);
+ // try to make warp-level tiles as square as possible to maximize data re-use
+ if (tgt->as_nvidia()->sm() < 80) {
+ std::vector<int> wpt_nm1;
+ do{
+ wpt_nm1 = wpt_;
+ if(wpt_[0] * wpt_[1] * wpt_[2] < num_warps)
+ wpt_[0] = clamp(wpt_[0]*2, 1, shape_[0] / spw_[0]);
+ if(wpt_[0] * wpt_[1] * wpt_[2] < num_warps)
+ wpt_[1] = clamp(wpt_[1]*2, 1, shape_[1] / spw_[1]);
+ }while(wpt_nm1 != wpt_);
} else {
bool changed = false;
// try to have a warp own entire rows of the output
// this makes it easier to fuse multiple mmas by fusing
// registers
bool one_warp_per_row = false;
for(ir::value* v: values)
for(ir::user* u: v->get_users()){
auto* dot = dynamic_cast<ir::dot_inst*>(u);
auto* cts = dynamic_cast<ir::copy_to_shared_inst*>(u);
if((dot && dot->get_operand(2)!=v) || !layout_a->to_shared() || cts)
one_warp_per_row = shape[0] / spw_[0] >= num_warps;
}
// std::cout << one_warp_per_row << std::endl;
if(one_warp_per_row){
wpt_[1] = 1;
wpt_[0] = num_warps;
}
else{
do {
changed = false;
if (wpt_[0] * wpt_[1] * wpt_[2] >= num_warps)
break;
if (shape_[0] / spw_[0] / wpt_[0] >= shape_[1] / (spw_[1]*2) / wpt_[1]) {
if (wpt_[0] < shape_[0] / spw_[0]) {
wpt_[0] *= 2;
changed = true;
}
} else {
if (wpt_[1] < shape_[1] / (spw_[1]*2)) {
wpt_[1] *= 2;
changed = true;
}
}
} while(changed);
}
}
// std::cout << wpt_[0] << " " << wpt_[1] << std::endl;
/* shape per block */
shape_per_cta_ = {spw_[0]*wpt_[0], spw_[1]*wpt_[1], 1};
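The pre-Ampere branch above distributes num_warps by alternately doubling wpt_[0] and wpt_[1], each clamped by how many warp-level tiles fit along that axis. A standalone simulation under assumed inputs (8 warps, a 128x128 tile, spw = {16, 8}):

#include <algorithm>
#include <cstdio>
#include <vector>

// clamp() as defined at the top of layout.cc: clamp x into [min(a,b), max(a,b)].
unsigned clamp_u(unsigned x, unsigned a, unsigned b) {
  unsigned lo = std::min(a, b), hi = std::max(a, b);
  return std::min(std::max(x, lo), hi);
}

int main() {
  unsigned num_warps = 8;                       // assumed example input
  std::vector<unsigned> shape = {128, 128};     // output tile
  std::vector<unsigned> spw   = {16, 8};        // shape per warp
  std::vector<unsigned> wpt   = {1, 1, 1};
  std::vector<unsigned> prev;
  do {
    prev = wpt;
    if (wpt[0] * wpt[1] * wpt[2] < num_warps)
      wpt[0] = clamp_u(wpt[0] * 2, 1, shape[0] / spw[0]);
    if (wpt[0] * wpt[1] * wpt[2] < num_warps)
      wpt[1] = clamp_u(wpt[1] * 2, 1, shape[1] / spw[1]);
  } while (prev != wpt);
  std::printf("wpt = {%u, %u, %u}\n", wpt[0], wpt[1], wpt[2]); // {4, 2, 1}: 4*2 = 8 warps
  return 0;
}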
@@ -198,8 +291,6 @@ scanline_layout::scanline_layout(size_t num_warps,
bool is_dot = std::any_of(values.begin(), values.end(),
[&](ir::value* v) { return dynamic_cast<ir::dot_inst*>(v); });
std::vector<ir::value*> ptrs;
for(ir::value *v: values)
for(ir::user *usr: v->get_users())
@@ -215,7 +306,6 @@ scanline_layout::scanline_layout(size_t num_warps,
contiguous = std::max<int>(contiguous, std::min<int>(align->get(ptr, i), 128 / nbits));
}
nts_[i] = clamp(size / num_threads, 1, std::min<int>(contiguous, shape_[i]));
mts_[i] = clamp(num_threads, 1, shape_[i] / nts_[i]);
size /= shape_[i];
@@ -277,12 +367,16 @@ void shared_layout::extract_double_bufferable(ir::value *v, std::shared_ptr<doub
res.reset(new double_buffer_info_t{value_1, value_0, phi});
}
- static bool is_smem(ir::value* v) {
-   if (dynamic_cast<ir::copy_to_shared_inst*>(v) ||
-       dynamic_cast<ir::masked_load_async_inst*>(v))
-     return true;
-   else
-     return false;
- }
+ static bool is_smem_in(ir::value* v, const ir::basic_block* bb) {
+   if (ir::instruction *instr = dynamic_cast<ir::instruction*>(v)) {
+     if (instr->get_parent() != bb)
+       return false;
+     if (dynamic_cast<ir::copy_to_shared_inst*>(v) ||
+         dynamic_cast<ir::masked_load_async_inst*>(v)) {
+       return true;
+     }
+   }
+   return false;
+ }
/// param:
@@ -297,14 +391,14 @@ static bool is_multistage_pipe_phi(ir::phi_node* phi, ir::basic_block* bb0, ir::
ir::basic_block *cbb0 = cphi->get_incoming_block(0);
ir::basic_block *cbb1 = cphi->get_incoming_block(1);
- if (is_smem(c0)) {
+ if (is_smem_in(c0, cbb0)) {
assert(cbb0 == bb0);
values_0.push_back(c0);
if (auto phi1 = dynamic_cast<ir::phi_node*>(c1)) {
next = phi1;
continue;
} else {
- if (is_smem(c1)) {
+ if (is_smem_in(c1, cbb1)) {
value_1 = c1;
assert(cbb1 == bb1);
return true;
@@ -359,7 +453,8 @@ shared_layout::shared_layout(data_layout *arg,
const std::vector<unsigned>& shape,
const std::vector<ir::value *> &values,
ir::type *ty,
- analysis::align* align): data_layout(SHARED, axes, shape, values, align), ty_(ty) {
+ analysis::align* align, target *tgt, bool is_tmp)
+ : data_layout(SHARED, axes, shape, values, align), ty_(ty), tgt_(tgt), is_tmp_(is_tmp){
size_ = 0;
arg_layout_ = arg;
@@ -385,12 +480,35 @@ shared_layout::shared_layout(data_layout *arg,
for(ir::value* v: values){
extract_dot_use(v, dot_a, 0);
extract_dot_use(v, dot_b, 1);
- extract_hmma_dot_use(v, hmma_dot_a, 0);
- extract_hmma_dot_use(v, hmma_dot_b, 1);
+ extract_hmma_dot_use(v, hmma_dot_a, /*op*/0, tgt_->as_nvidia()->sm());
+ extract_hmma_dot_use(v, hmma_dot_b, /*op*/1, tgt_->as_nvidia()->sm());
}
hmma_dot_a_ = hmma_dot_a;
hmma_dot_b_ = hmma_dot_b;
// Update mma_vec
if (hmma_dot_a_) {
assert(order_.size() == 2);
std::vector<int> mat_shape = mma_layout::mma_mat_shape_.at(get_mma_type(hmma_dot_a_));
mma_vec_ = order_[0] == 1 ? mat_shape[2] : mat_shape[0]; // k : m
mma_strided_ = order_[0] == 1 ? mat_shape[0] : mat_shape[2];
// for now, disable swizzle when using lds.8
if (get_mma_type(hmma_dot_a_) == mma_layout::INT32_INT8_INT8_INT32)
if (order_[0] == 0) // need transpose
allow_swizzle_ = false;
} else if (hmma_dot_b_) {
assert(order_.size() == 2);
std::vector<int> mat_shape = mma_layout::mma_mat_shape_.at(get_mma_type(hmma_dot_b_));
mma_vec_ = order_[0] == 1 ? mat_shape[1] : mat_shape[2]; // n : k
mma_strided_ = order_[0] == 1 ? mat_shape[2] : mat_shape[1];
// for now, disable swizzle when using lds.8
if (get_mma_type(hmma_dot_b_) == mma_layout::INT32_INT8_INT8_INT32)
if (order_[0] == 1) // need transpose
allow_swizzle_ = false;
}
// size
size_ = ty_->get_primitive_size_in_bits() / 8;
for(auto s: shape_)
@@ -454,7 +572,8 @@ void layouts::make_graph(ir::instruction *i) {
void layouts::create(size_t id, const std::vector<ir::value*>& values) {
// if(layouts_.find(id) != layouts_.end())
// return;
- auto it_hmma_c = std::find_if(values.begin(), values.end(), &is_hmma_c);
+ auto it_hmma_c = std::find_if(values.begin(), values.end(),
+   [&](ir::value* v){ return is_hmma_c(v, tgt_->as_nvidia()->sm()); });
auto cmp = [](ir::value* x, ir::value *y) {
std::pair<int, int> xx = {x->get_type()->get_tile_rank(), x->get_type()->get_tile_num_elements()};
std::pair<int, int> yy = {y->get_type()->get_tile_rank(), y->get_type()->get_tile_num_elements()};
@@ -476,19 +595,61 @@ void layouts::create(size_t id, const std::vector<ir::value*>& values) {
ir::value *b = dot->get_operand(1);
create(groups_.at(a), values_.at(groups_.at(a)));
create(groups_.at(b), values_.at(groups_.at(b)));
- layouts_[id] = new mma_layout(num_warps_, axes, shapes, values, align_, tgt_, (shared_layout*)layouts_.at(groups_.at(a)), (shared_layout*)layouts_.at(groups_.at(b)));
+ layouts_[id] = new mma_layout(num_warps_, axes, shapes, values, align_, tgt_,
+                               (shared_layout*)layouts_.at(groups_.at(a)),
+                               (shared_layout*)layouts_.at(groups_.at(b)),
+                               dot);
}
else if(it_cts != values.end()){
ir::instruction *cts = (ir::instruction*)*it_cts;
ir::value *arg = cts->get_operand(0);
create(groups_.at(arg), values_.at(groups_.at(arg)));
- layouts_[id] = new shared_layout(get(arg), axes, shapes, values, largest->get_type()->get_scalar_ty(), align_);
+ layouts_[id] = new shared_layout(get(arg), axes, shapes, values, largest->get_type()->get_scalar_ty(), align_, tgt_);
}
else{
layouts_[id] = new scanline_layout(num_warps_, axes, shapes, values, align_, tgt_);
}
}
// layout checkers
bool layouts::is_scanline(ir::instruction *i) {
return this->get(i->get_operand(0))->to_scanline() != nullptr;
}
bool layouts::is_coalesced_scanline(ir::instruction *i) {
if (auto *red = dynamic_cast<ir::reduce_inst *>(i)) {
auto *scanline = this->get(i->get_operand(0))->to_scanline();
return scanline && scanline->get_order()[0] == red->get_axis();
}
return false;
}
bool layouts::is_mma(ir::instruction *i) {
return this->get(i->get_operand(0))->to_mma() != nullptr;
}
bool layouts::is_a100_mma(ir::instruction *i) {
if (auto *red = dynamic_cast<ir::reduce_inst *>(i)) {
return is_mma(red) && (tgt_->as_nvidia()->sm() >= 80) &&
(red->get_axis() == 1);
}
return false;
}
void layouts::create_tmp_layout(size_t id, data_layout *arg,
const std::vector<int> &axes,
const std::vector<unsigned> &shape,
ir::instruction *i, bool is_index) {
ir::type *ty = is_index ? ir::type::get_int32_ty(i->get_type()->get_context())
: i->get_type()->get_scalar_ty();
layouts_[id] = new shared_layout(arg, axes, shape, {i}, ty, align_, tgt_, true);
if (is_index) {
tmp_index_[i] = id;
} else {
tmp_[i] = id;
}
}
void layouts::run(ir::module &mod) {
// make graph
graph_.clear();
@@ -499,6 +660,7 @@ void layouts::run(ir::module &mod) {
make_graph(i);
});
// connected components
graph_.connected_components(&values_, &groups_);
@@ -509,35 +671,47 @@ void layouts::run(ir::module &mod) {
// create temporaries
size_t id = values_.size();
ir::for_each_instruction(mod, [this, &id](ir::instruction* i) {
// std::cout << "layout: " << std::endl;
// i->print(std::cout);
if(auto *red = dynamic_cast<ir::reduce_inst*>(i)) {
- id++;
ir::value *arg = red->get_operand(0);
- unsigned axis = red->get_axis();
+ distributed_layout *layout =
+   dynamic_cast<analysis::distributed_layout *>(get(arg));
// shape
auto shapes = arg->get_type()->get_block_shapes();
- scanline_layout *layout = get(arg)->to_scanline();
- shapes[axis] = layout->mts(axis);
+ unsigned axis = red->get_axis();
+ shapes[axis] =
+   layout->shape_per_cta(axis) / layout->contig_per_thread(axis);
// create layout
- layouts_[id] = new shared_layout(layout, axes_->get(arg), shapes, {red}, red->get_type()->get_scalar_ty(), align_);
- tmp_[red] = id;
+ id++;
+ create_tmp_layout(id, layout, axes_->get(arg), shapes, red);
+ if (red->with_index()) {
+   id++;
+   create_tmp_layout(id, layout, axes_->get(arg), shapes, red, true);
+ }
}
if(auto *val = dynamic_cast<ir::cvt_layout_inst*>(i)){
distributed_layout* out_layout = dynamic_cast<distributed_layout*>(get(val));
distributed_layout* in_layout = dynamic_cast<distributed_layout*>(get(i->get_operand(0)));
- id++;
size_t dim = val->get_type()->get_tile_rank();
ir::type::block_shapes_t shape(dim);
for(size_t k = 0; k < dim; k++){
shape[k] = std::max(in_layout->shape_per_cta(k),
out_layout->shape_per_cta(k));
}
- layouts_[id] = new shared_layout(out_layout, axes_->get(val), shape, {val}, val->get_type()->get_scalar_ty(), align_);
- tmp_[val] = id;
+ auto in_ord = in_layout->get_order();
+ auto out_ord = out_layout->get_order();
+ int in_vec = in_layout->contig_per_thread(in_ord[0]);
+ int out_vec = out_layout->contig_per_thread(out_ord[0]);
+ int pad = std::max(in_vec, out_vec);
+ shape[out_ord[0]] += pad;
+ id++;
+ create_tmp_layout(id, out_layout, axes_->get(val), shape, val);
}
if(auto *atom = dynamic_cast<ir::atomic_inst*>(i)){
id++;
- layouts_[id] = new shared_layout(nullptr, {}, {1}, {atom}, atom->get_type()->get_scalar_ty(), align_);
- tmp_[atom] = id;
+ create_tmp_layout(id, nullptr, {}, {1}, atom);
}
});


@@ -14,43 +14,108 @@ namespace analysis{
void liveness::run(ir::module &mod) {
intervals_.clear();
- // Assigns index to each instruction
- std::map<ir::value*, slot_index> indices;
- for(ir::function *fn: mod.get_function_list()){
-   slot_index index = 0;
-   for(ir::basic_block *block: fn->blocks())
-     for(ir::instruction *instr: block->get_inst_list()){
-       index += 1;
-       indices.insert({instr, index});
-     }
- }
+ std::map<ir::value*, std::set<shared_layout*>> layouts_map;
+ for(auto &x: layouts_->get_all()){
+   shared_layout* layout = x.second->to_shared();
+   if(!layout || layout->is_tmp())
+     continue;
+   for(ir::value* v:layout->get_values()){
+     layouts_map[v].insert(layout);
+   }
+ }
// create live intervals
std::map<ir::user*, std::set<shared_layout*>> live_in;
while(true){
bool changed = false;
ir::instruction* last_inst = nullptr;
ir::for_each_instruction_backward(mod, [&](ir::instruction* i){
// gen
std::set<shared_layout*> gen;
for(ir::value* v: i->ops())
for(shared_layout* layout: layouts_map[v])
gen.insert(layout);
// kill
std::set<shared_layout*> kill;
for(shared_layout* layout: layouts_map[i])
kill.insert(layout);
// temporaries are handled separately
if(layouts_->has_tmp(i)){
gen.insert(layouts_->get(layouts_->tmp(i))->to_shared());
kill.insert(layouts_->get(layouts_->tmp(i))->to_shared());
}
if(layouts_->has_tmp_index(i)){
gen.insert(layouts_->get(layouts_->tmp_index(i))->to_shared());
kill.insert(layouts_->get(layouts_->tmp_index(i))->to_shared());
}
// live-out
std::set<shared_layout*> live_out;
std::vector<ir::instruction*> succs = {last_inst};
if(i == i->get_parent()->get_inst_list().back())
for(ir::basic_block* succ: i->get_parent()->get_successors())
succs.push_back(succ->get_inst_list().front());
for(ir::instruction* succ: succs)
for(shared_layout* layout: live_in[succ])
if(!layout->is_tmp())
live_out.insert(layout);
// new sets
std::set<shared_layout*> live_out_minus_kill;
std::set_difference(live_out.begin(), live_out.end(), kill.begin(), kill.end(),
std::inserter(live_out_minus_kill, live_out_minus_kill.end()));
std::set<shared_layout*> new_live_in;
std::set_union(gen.begin(), gen.end(), live_out_minus_kill.begin(), live_out_minus_kill.end(),
std::inserter(new_live_in, new_live_in.end()));
changed = changed || (new_live_in != live_in[i]);
live_in[i] = new_live_in;
last_inst = i;
});
if(!changed)
break;
}
// ir::for_each_instruction(mod, [&](ir::instruction* i){
// i->print(std::cout);
// std::cout << " live_in: " << live_in[i].size() << std::endl;
// });
// Assigns index to each instruction
std::map<ir::value*, slot_index> indices;
slot_index index = 0;
ir::for_each_instruction(mod, [&](ir::instruction* instr){
index += 1;
indices.insert({instr, index});
});
for(auto &x: layouts_->get_all()){
shared_layout* layout = x.second->to_shared();
if(layout)
intervals_[layout] = segment{INT32_MAX, 0};
}
for(auto& x: live_in)
for(shared_layout* layout: x.second)
intervals_[layout].start = std::min<int>(intervals_[layout].start, indices[x.first]);
for(auto& x: live_in)
for(shared_layout* layout: x.second){
intervals_[layout].end = std::max<int>(intervals_[layout].end, indices[x.first] + 1);
}
- for(auto &x: layouts_->get_all()) {
-   shared_layout* layout = x.second->to_shared();
-   if(!layout)
-     continue;
-   // users
-   std::set<ir::user*> users;
-   for(ir::value *v: layout->get_values()){
-     for(ir::user *u: v->get_users())
-       users.insert(u);
-   }
-   // compute intervals
-   unsigned start = INT32_MAX;
-   for(ir::value *v: layout->get_values())
-     if(indices.find(v) != indices.end())
-       start = std::min(start, indices.at(v));
-   unsigned end = 0;
-   for(ir::user *u: users)
-     if(indices.find(u) != indices.end())
-       end = std::max(end, indices.at(u));
-   if(end == 0)
-     end = start + 1;
-   intervals_[layout] = segment{start, end};
-   // std::cout << intervals_[layout].start << " " << intervals_[layout].end << std::endl;
- }
}
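The rewritten pass is a standard backward dataflow analysis: live_in(i) = gen(i) U (live_out(i) \ kill(i)), iterated to a fixed point. A generic sketch of that fixed point, with integers standing in for layouts and instruction slots standing in for instructions:

#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

using Set = std::set<int>;  // ints stand in for shared_layout pointers

Set minus(const Set& a, const Set& b) {
  Set r;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                      std::inserter(r, r.end()));
  return r;
}

// live_in[i] = gen[i] U (live_out[i] \ kill[i]); iterate until stable.
void solve(const std::vector<Set>& gen, const std::vector<Set>& kill,
           const std::vector<std::vector<int>>& succ,
           std::vector<Set>& live_in) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (int i = (int)gen.size() - 1; i >= 0; --i) {  // backward, like the pass
      Set live_out;
      for (int s : succ[i])
        live_out.insert(live_in[s].begin(), live_in[s].end());
      Set next = minus(live_out, kill[i]);
      next.insert(gen[i].begin(), gen[i].end());
      if (next != live_in[i]) { live_in[i] = next; changed = true; }
    }
  }
}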


@@ -19,6 +19,7 @@ void swizzle::run(ir::module &) {
continue;
ir::value* mma_dot_a = layout->hmma_dot_a();
ir::value* mma_dot_b = layout->hmma_dot_b();
if(!mma_dot_a && !mma_dot_b){
per_phase_[layout] = 1;
max_phase_[layout] = 1;
@@ -27,22 +28,31 @@ void swizzle::run(ir::module &) {
}
auto ord = layout->get_order();
scanline_layout* in_layout = dynamic_cast<scanline_layout*>(layout->get_arg_layout());
- if(!in_layout)
-   continue;
+ int per_phase = 1;
int dtsize = layout->get_type()->get_scalar_ty()->get_primitive_size_in_bits() / 8;
+ if(in_layout)
+   per_phase = std::max<int>(128 / (in_layout->mts(ord[0])*in_layout->nts(ord[0])*dtsize), 1);
+ else
+   per_phase = 1;
if(tgt_->as_nvidia() && tgt_->as_nvidia()->sm() < 80){
int inner = mma_dot_a ? 0 : 1;
- per_phase_[layout] = std::max<int>(128 / (in_layout->mts(ord[0])*in_layout->nts(ord[0])*dtsize), 1);
+ per_phase_[layout] = per_phase;
max_phase_[layout] = (ord[inner] == 1 ? 8 : 4) / per_phase_[layout];
if(mma_dot_a)
vec_[layout] = 2*layouts_->get(mma_dot_a)->to_mma()->rep(0);
else
vec_[layout] = 2*layouts_->get(mma_dot_b)->to_mma()->rep(1);
}
- else{
-   per_phase_[layout] = std::max<int>(128 / (in_layout->mts(ord[0])*in_layout->nts(ord[0])*dtsize), 1);
-   max_phase_[layout] = 8 / per_phase_[layout];
-   vec_[layout] = 8;
- }
+ else {
if (!layout->allow_swizzle()) {
per_phase_[layout] = 1;
max_phase_[layout] = 1;
vec_[layout] = 1;
} else {
per_phase_[layout] = per_phase;
max_phase_[layout] = layout->get_mma_strided() / per_phase_[layout];
vec_[layout] = layout->get_mma_vec();
}
}
}
}
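Worked numbers for the per_phase formula above, under assumed layout parameters:

// per_phase = max(128 / (mts * nts * dtsize), 1) counts how many contiguous
// rows fit inside one 128-byte shared-memory phase. Assumed examples:
//   fp16, mts = 32 threads, nts = 4 elts: 32*4*2 = 256 B -> per_phase = 1
//   fp16, mts = 16 threads, nts = 2 elts: 16*2*2 =  64 B -> per_phase = 2
// On the sm80 path, max_phase = mma_strided / per_phase and the vector
// width comes straight from the MMA matrix shape (get_mma_vec()).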

lib/codegen/extern_lib.cc (new file, 63 lines)

@@ -0,0 +1,63 @@
#include "triton/codegen/extern_lib.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include "triton/codegen/pass.h"
namespace triton {
namespace codegen {
std::unique_ptr<llvm::Module> ExternLib::load(llvm::LLVMContext& ctx) {
llvm::SMDiagnostic err;
auto mod = llvm::parseIRFile(this->path_, err, ctx);
if (!mod) {
throw std::runtime_error("Failed to load extern lib " + this->name_ +
" at " + this->path_);
}
return mod;
}
void ExternLib::link(std::unique_ptr<llvm::Module>& llvm,
std::unique_ptr<llvm::Module>& mod) {
// Set triple and data layout to match the target module
mod->setTargetTriple(llvm->getTargetTriple());
mod->setDataLayout(llvm->getDataLayout());
if (llvm::Linker::linkModules(*llvm, std::move(mod))) {
throw std::runtime_error("Failed to link extern lib " + this->name_ +
" at " + this->path_);
}
}
void LibDevice::opt(llvm::LLVMContext& ctx, std::unique_ptr<llvm::Module>& llvm) {
// Add nvvm reflect flags to llvm module
// https://llvm.org/docs/LangRef.html#module-flags-metadata
// i32 4: Override the other module.
// i32 1: Emit an error
// If both modules specify Override, but the values differ, an error
// will be emitted.
llvm::Type* I32 = llvm::Type::getInt32Ty(ctx);
llvm::Metadata* md_four =
llvm::ConstantAsMetadata::get(llvm::ConstantInt::getSigned(I32, 4));
llvm::Metadata* md_name = llvm::MDString::get(ctx, "nvvm-reflect-ftz");
llvm::Metadata* md_one =
llvm::ConstantAsMetadata::get(llvm::ConstantInt::getSigned(I32, 1));
llvm::MDNode* reflect = llvm::MDNode::get(ctx, {md_four, md_name, md_one});
llvm->addModuleFlag(reflect);
}
std::unique_ptr<ExternLib> create_extern_lib(const std::string& lib_name,
const std::string& lib_path) {
if (lib_name == "libdevice") {
return std::make_unique<LibDevice>(lib_name, lib_path);
} else {
throw std::runtime_error("Unknown external library: " + lib_name);
}
}
} // namespace codegen
} // namespace triton
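For reference, this is how the API is consumed (compare link_extern_libs in pass.cc below): each named library is installed into the freshly generated kernel module. A hedged usage sketch; the libdevice path is an assumption for illustration, and install() is the entry point pass.cc calls:

#include <memory>
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "triton/codegen/extern_lib.h"

// Link libdevice into a kernel module. The .bc path below is illustrative
// only; in pass.cc it comes from the user-supplied ExternLibMap or a default.
void add_libdevice(llvm::LLVMContext& ctx,
                   std::unique_ptr<llvm::Module>& kernel) {
  auto lib = triton::codegen::create_extern_lib(
      "libdevice", "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc");
  lib->install(ctx, kernel);  // parse IR, set nvvm-reflect flags, link
}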


@@ -1,4 +1,14 @@
#include "triton/codegen/pass.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Transforms/IPO.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include "triton/codegen/analysis/align.h"
#include "triton/codegen/analysis/allocation.h"
#include "triton/codegen/analysis/axes.h"
@@ -9,6 +19,7 @@
#include "triton/codegen/transform/cts.h"
#include "triton/codegen/transform/dce.h"
#include "triton/codegen/transform/disassociate.h"
#include "triton/codegen/transform/inline.h"
#include "triton/codegen/transform/membar.h"
#include "triton/codegen/transform/peephole.h"
#include "triton/codegen/transform/pipeline.h"
@@ -16,43 +27,92 @@
#include "triton/ir/function.h"
#include "triton/ir/module.h"
#include "triton/ir/print.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Verifier.h"
namespace triton {
namespace codegen {
static void link_extern_libs(const ExternLibMap& user_extern_lib_map,
const ExternLibMap& target_extern_lib_map,
ir::module& ir, llvm::LLVMContext& ctx,
std::unique_ptr<llvm::Module>& llvm) {
for (const auto& iter : target_extern_lib_map) {
auto &lib_name = iter.first;
if (user_extern_lib_map.count(lib_name) != 0 &&
user_extern_lib_map.at(lib_name)->path() != "") {
// If the user specified a path for this library, use it.
user_extern_lib_map.at(lib_name)->install(ctx, llvm);
} else {
// Otherwise, use the default path.
iter.second->install(ctx, llvm);
}
}
std::set<llvm::StringRef> function_names;
for (auto& func : ir.get_function_list()) {
function_names.insert(func->get_name());
}
llvm::legacy::PassManager pass;
pass.add(llvm::createInternalizePass([&](const llvm::GlobalValue& v) -> bool {
if (function_names.count(v.getName()) != 0) {
// Preserve global functions
return true;
}
// Internalize all device functions
return false;
}));
llvm::legacy::PassManager pm;
pm.add(llvm::createVerifierPass());
pm.run(*llvm);
llvm::PassManagerBuilder builder;
builder.OptLevel = 3;
builder.SizeLevel = 0;
builder.populateModulePassManager(pass);
pass.run(*llvm);
}
// TODO:
// There should be a proper pass manager there!
- std::unique_ptr<llvm::Module> add_passes_to_emit_bin(ir::module &ir, llvm::LLVMContext& ctx, codegen::target* target,
-     int cc, int num_warps, int num_stages, bool force_nc_cache, int& shared_static) {
+ std::unique_ptr<llvm::Module> add_passes_to_emit_bin(
+     ir::module& ir, llvm::LLVMContext& ctx, codegen::target* target,
+     int num_warps, int num_stages, int& shared_static,
+     const ExternLibMap& extern_lib_map) {
std::cout << "pass.cc: add_passes_to_emit_bin" << std::endl;
// generate llvm code
std::string name = ir.get_function_list()[0]->get_name();
std::unique_ptr<llvm::Module> llvm(new llvm::Module(name, ctx));
// optimizations
- bool cts_use_async = target->as_nvidia() && target->as_nvidia()->sm() >= 80;
+ // bool has_sm80 = target->as_nvidia() && target->as_nvidia()->sm() >= 80;
+ bool has_sm80 = false;
// create passes
codegen::analysis::align align;
codegen::transform::inliner inliner;
codegen::analysis::axes axes;
- codegen::transform::cts cts(cts_use_async);
- codegen::transform::pipeline pipeline(cts_use_async, num_stages);
+ codegen::transform::pipeline pipeline(has_sm80, num_stages);
codegen::transform::disassociate disassociate;
codegen::analysis::layouts layouts(&axes, &align, num_warps, target);
+ codegen::transform::cts cts(&layouts, has_sm80);
codegen::analysis::liveness liveness(&layouts);
codegen::analysis::swizzle swizzle(&layouts, target);
codegen::analysis::allocation allocation(&liveness);
codegen::transform::dce dce;
codegen::transform::peephole peephole(target, &layouts);
- codegen::transform::coalesce coalesce(&align, &layouts);
+ codegen::transform::coalesce coalesce(&align, &layouts, has_sm80);
codegen::transform::prefetch prefetch_s(target);
- codegen::transform::membar barriers(&liveness, &layouts, &allocation, &prefetch_s, target);
- codegen::generator isel(&axes, &layouts, &align, &allocation, &swizzle, target, num_warps, force_nc_cache);
+ codegen::transform::membar barriers(&liveness, &layouts, &allocation,
+                                     &prefetch_s, target);
+ codegen::generator isel(&axes, &layouts, &align, &allocation, &swizzle,
+                         target, num_warps);
// run passes
inliner.run(ir);
dce.run(ir);
peephole.run(ir);
dce.run(ir);
pipeline.run(ir);
dce.run(ir);
// ir.print(std::cout);
disassociate.run(ir);
dce.run(ir);
align.run(ir);
@@ -60,8 +120,7 @@ std::unique_ptr<llvm::Module> add_passes_to_emit_bin(ir::module &ir, llvm::LLVMC
layouts.run(ir);
peephole.run(ir);
dce.run(ir);
- if (target->is_gpu())
-   cts.run(ir);
+ if (target->is_gpu()) cts.run(ir);
align.run(ir);
axes.run(ir);
layouts.run(ir);
@@ -69,8 +128,7 @@ std::unique_ptr<llvm::Module> add_passes_to_emit_bin(ir::module &ir, llvm::LLVMC
dce.run(ir);
align.run(ir);
dce.run(ir);
- if (target->is_gpu())
-   cts.run(ir);
+ if (target->is_gpu()) cts.run(ir);
dce.run(ir);
align.run(ir);
axes.run(ir);
@@ -81,14 +139,27 @@ std::unique_ptr<llvm::Module> add_passes_to_emit_bin(ir::module &ir, llvm::LLVMC
axes.run(ir);
layouts.run(ir);
swizzle.run(ir);
// std::cout << "---" << std::endl;
// ir.print(std::cout);
// std::cout << "---" << std::endl;
// ir.print(std::cout);
liveness.run(ir);
allocation.run(ir);
prefetch_s.run(ir);
barriers.run(ir);
// exit(1);
// ir.print(std::cout);
isel.visit(ir, *llvm);
shared_static = allocation.allocated_size();
if (isel.get_extern_lib_map().size() > 0) {
// If there's any extern lib calls,
// we need to link them in.
link_extern_libs(extern_lib_map, isel.get_extern_lib_map(), ir, ctx, llvm);
}
return llvm;
}
} // namespace codegen
} // namespace triton

(file diff suppressed: too large to display)


@@ -15,10 +15,22 @@ namespace codegen{
// base
- nvidia_cu_target* target::as_nvidia() {
-   return dynamic_cast<nvidia_cu_target*>(this);
- }
+ #ifdef USE_ROCM
+ amd_cl_target *target::as_amd()
+ {
+   return dynamic_cast<amd_cl_target *>(this);
+ }
+ amd_cl_target *target::as_nvidia()
+ {
+   return this->as_amd();
+ }
+ #else
+ // causes segfault on ROCM
+ nvidia_cu_target *target::as_nvidia()
+ {
+   return dynamic_cast<nvidia_cu_target *>(this);
+ }
+ #endif
bool target::is_gpu() const {
return is_gpu_;
@@ -41,7 +53,8 @@ Value* amd_cl_target::get_global_offset(Module *module, IRBuilder<>& builder, un
}
Instruction* amd_cl_target::add_memfence(Module *module, IRBuilder<>& builder) {
throw std::runtime_error("not implemented");
Function *barrier = Intrinsic::getDeclaration(module, Intrinsic::amdgcn_s_waitcnt);
return builder.CreateIntrinsic(Intrinsic::amdgcn_s_waitcnt, {}, {builder.getInt32(0)});
}
@@ -56,7 +69,50 @@ Value* amd_cl_target::get_block_id(Module *module, IRBuilder<>& builder, unsigne
}
Value* amd_cl_target::get_num_blocks(Module *module, IRBuilder<>& builder, unsigned ax) {
throw std::runtime_error("not implemented on AMD");
Function &F = *builder.GetInsertBlock()->getParent();
Module *Mod = F.getParent();
// We are indexing into this struct, and want to extract the grid_size_*
// fields.
//
// typedef struct hsa_kernel_dispatch_packet_s {
// uint16_t header;
// uint16_t setup;
// uint16_t workgroup_size_x ;
// uint16_t workgroup_size_y;
// uint16_t workgroup_size_z;
// uint16_t reserved0;
// uint32_t grid_size_x ;
// uint32_t grid_size_y ;
// uint32_t grid_size_z;
// .....
// } hsa_kernel_dispatch_packet_t
//
Function *DispatchPtrFn =
Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_dispatch_ptr);
CallInst *DispatchPtr = builder.CreateCall(DispatchPtrFn, {});
DispatchPtr->addAttribute(AttributeList::ReturnIndex, Attribute::NoAlias);
DispatchPtr->addAttribute(AttributeList::ReturnIndex, Attribute::NonNull);
F.removeFnAttr("amdgpu-no-dispatch-ptr");
// Size of the dispatch packet struct.
DispatchPtr->addDereferenceableAttr(AttributeList::ReturnIndex, 64);
Type *I32Ty = Type::getInt32Ty(Mod->getContext());
// TODO: include AMDGPUAS:: declarations.
Value *CastDispatchPtr = builder.CreateBitCast(
DispatchPtr, PointerType::get(I32Ty, 4 /*AMDGPUAS::CONSTANT_ADDRESS*/));
// grid_size_x offset is 3*32bit
assert(ax < 3);
Value *GEP =
builder.CreateConstInBoundsGEP1_64(I32Ty, CastDispatchPtr, ax + 3);
LoadInst *Load = builder.CreateAlignedLoad(I32Ty, GEP, Align(4));
MDNode *MD = MDNode::get(Mod->getContext(), None);
Load->setMetadata(LLVMContext::MD_invariant_load, MD);
return Load; // throw std::runtime_error("not implemented on AMD");
}
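The GEP index follows directly from the packet layout in the comment above:

// Six uint16_t fields (header, setup, workgroup_size_x/y/z, reserved0)
// occupy bytes 0..11, so grid_size_x starts at byte 12 = i32 index 3.
// CreateConstInBoundsGEP1_64(I32Ty, ptr, ax + 3) therefore yields:
//   ax = 0 -> grid_size_x (byte 12)
//   ax = 1 -> grid_size_y (byte 16)
//   ax = 2 -> grid_size_z (byte 20)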
Value* amd_cl_target::get_local_id(Module *module, IRBuilder<>& builder, unsigned ax) {
@@ -156,7 +212,7 @@ Value* cpu_target::get_block_id(Module *module, llvm::IRBuilder<> &builder, unsi
}
Value* cpu_target::get_num_blocks(Module *module, IRBuilder<>& builder, unsigned ax) {
throw std::runtime_error("not implemented");
throw std::runtime_error("not implemented on CPU");
}


@@ -12,46 +12,11 @@ namespace triton {
namespace codegen{
namespace transform{
- coalesce::coalesce(analysis::align* align, analysis::layouts *layouts)
-   : align_(align), layout_(layouts) { }
- // simplify layout conversions using the following simple rules:
- //   - cvt_1(cvt_2(x)) if convert1 is the inverse of convert2
- //   - cvt_1(elementwise(x, y)) = elementwise(convert(x), convert(y))
- //ir::value* coalesce::simplify(ir::instruction *inst, ir::builder& builder){
- //  ir::value* _op = inst->get_operand(0);
- //  ir::instruction* op = dynamic_cast<ir::instruction*>(_op);
- //  analysis::mma_layout* mma_in = layout_->get(op) ->to_mma();
- //  analysis::mma_layout* mma_out = layout_->get(inst)->to_mma();
- //  std::cout << 1 << std::endl;
- //  // i must be layout conversion instruction
- //  if(!mma_in && !mma_out)
- //    return inst;
- //  // - cvt_1(cvt_2(x)) if convert1 is the inverse of convert2
- //  bool is_op_cvt = op->get_id() == ir::INST_CVT_LAYOUT;
- //  if((mma_in || mma_out) && is_op_cvt &&
- //    (layout_->get(inst) == layout_->get(op->get_operand(0))))
- //    return op->get_operand(0);
- //  // - cvt_1(elementwise(x, y)) = elementwise(cvt_1(x), cvt_2(y))
- //  if(op->get_id() != ir::INST_BINOP && op->get_id() != ir::INST_GETELEMENTPTR)
- //    return inst;
- //  std::cout << 1 << std::endl;
- //  for(size_t i = 0; i < op->get_num_operands(); i++){
- //    ir::value* arg_i = op->get_operand(i);
- //    builder.set_insert_point(op);
- //    // create new layout transform
- //    ir::instruction* new_arg_i = inst->clone();
- //    builder.insert(new_arg_i);
- //    // set the right args
- //    new_arg_i->replace_uses_of_with(new_arg_i->get_operand(0), arg_i);
- //    op->replace_uses_of_with(arg_i, simplify(new_arg_i, builder));
- //  }
- //  std::cout << 2 << std::endl;
- //  return op;
- //}
+ coalesce::coalesce(analysis::align* align, analysis::layouts *layouts, bool has_sm80)
+   : align_(align), layout_(layouts), has_sm80_(has_sm80) { }
void coalesce::run(ir::module &mod) {
std::set<analysis::data_layout*> invalidated;
ir::builder& builder = mod.get_builder();
// add layout conversion instructions
for(ir::function *fn: mod.get_function_list())
@@ -61,23 +26,43 @@ void coalesce::run(ir::module &mod) {
if(dynamic_cast<ir::store_inst*>(i) || dynamic_cast<ir::atomic_rmw_inst*>(i))
if(ir::value* op = i->get_operand(1))
if(op->get_type()->is_block_ty())
- if(layout_->get(op)->to_mma()){
+ if(op->get_type()->get_tile_ranks1() == 2)
+ if(invalidated.find(layout_->get(op)) == invalidated.end())
+ if(layout_->get(op)->to_mma())
+ if(dynamic_cast<ir::io_inst*>(i)->get_eviction_policy()==ir::io_inst::NORMAL){
ir::instruction* new_op = ir::cvt_layout_inst::create(op);
builder.set_insert_point(i);
builder.insert(new_op);
i->replace_uses_of_with(op, new_op);
}
// coalesce before copy_to_shared
// only necessary for sm < 80 as Ampere+ can handle reduction
// on MMA layout
if(!has_sm80_)
if(dynamic_cast<ir::copy_to_shared_inst*>(i) || dynamic_cast<ir::reduce_inst*>(i))
if(ir::value* op = i->get_operand(0))
if(op->get_type()->is_block_ty())
if(op->get_type()->get_tile_ranks1() == 2)
if(invalidated.find(layout_->get(op)) == invalidated.end())
if(layout_->get(op)->to_mma()){
ir::instruction* new_op = ir::cvt_layout_inst::create(op);
builder.set_insert_point(i);
builder.insert(new_op);
op->replace_all_uses_with(new_op);
new_op->replace_uses_of_with(new_op, op);
invalidated.insert(layout_->get(op));
}
// uncoalesce after load
if(auto x = dynamic_cast<ir::load_inst*>(i))
if(x->get_type()->is_block_ty())
- if(x->get_type()->get_tile_rank()==2)
- if(layout_->get(x)->to_mma()){
+ if(x->get_type()->get_tile_ranks1()==2)
+ if(layout_->get(x)->to_mma())
+ if(!has_sm80_ || dynamic_cast<ir::io_inst*>(i)->get_eviction_policy()==ir::io_inst::NORMAL){
builder.set_insert_point_after(x);
ir::instruction* new_x = ir::cvt_layout_inst::create(x);
builder.insert(new_x);
x->replace_all_uses_with(new_x);
new_x->replace_uses_of_with(new_x, x);
// new_x->replace_uses_of_with(new_x, new_x);
}
}
for(ir::function *fn: mod.get_function_list())
@@ -90,9 +75,11 @@ void coalesce::run(ir::module &mod) {
auto out_contig = align_->contiguous(ptr);
auto val_inst = dynamic_cast<ir::instruction*>(val);
if(!val_inst)
- break;
+ continue;
if(dynamic_cast<ir::cvt_layout_inst*>(val))
- break;
+ continue;
if(!val->get_type()->is_block_ty() || val->get_type()->get_tile_ranks1()==1)
continue;
std::vector<unsigned> in_contig;
std::vector<ir::instruction*> queue = {val_inst};
std::set<ir::instruction*> seen;
@@ -101,6 +88,8 @@ void coalesce::run(ir::module &mod) {
ir::instruction* curr = queue.back();
seen.insert(curr);
queue.pop_back();
if(auto dot_inst = dynamic_cast<ir::dot_inst*>(curr))
break;
if(auto io_inst = dynamic_cast<ir::io_inst*>(curr)){
in_contig = align_->contiguous(io_inst->get_pointer_operand());
break;


@@ -1,8 +1,10 @@
#include "triton/codegen/analysis/layout.h"
#include "triton/codegen/transform/cts.h"
#include "triton/ir/module.h"
#include "triton/ir/function.h"
#include "triton/ir/basic_block.h"
#include "triton/ir/instructions.h"
#include "triton/ir/utils.h"
#include <iostream>
namespace triton {
@@ -10,9 +12,9 @@ namespace codegen{
namespace transform{
- inline bool is_shmem_op(ir::instruction* i, int op) {
+ bool cts::is_shmem_op(ir::instruction* i, int op) {
if(i->get_id() == ir::INST_DOT)
- return op==0 || op==1;
+ return op == 0 || op == 1;
if(i->get_id() == ir::INST_COPY_FROM_SHARED)
return op==0;
if(i->get_id() == ir::INST_TRANS)
@@ -20,7 +22,7 @@ inline bool is_shmem_op(ir::instruction* i, int op) {
return false;
}
- inline bool is_shmem_res(ir::value* v){
+ bool cts::is_shmem_res(ir::value* v){
ir::instruction* i = dynamic_cast<ir::instruction*>(v);
if(!i)
return false;
@@ -35,7 +37,7 @@ inline bool is_shmem_res(ir::value* v){
// run pass on module
- void cts::add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder, bool to_shared) {
+ void cts::add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder, bool to_shared, std::map<ir::value*, ir::value*>& copies) {
auto *i = dynamic_cast<ir::instruction*>(x);
// not an instruction
if(!i) {
@@ -51,7 +53,7 @@ void cts::add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder,
// phi node
if(auto* phi = dynamic_cast<ir::phi_node*>(x)) {
for(unsigned i = 0; i < phi->get_num_incoming(); ++i)
- add_copy(phi, phi->get_incoming_value(i), builder, to_shared);
+ add_copy(phi, phi->get_incoming_value(i), builder, to_shared, copies);
return;
}
// already in shared memory
@@ -65,33 +67,52 @@ void cts::add_copy(ir::instruction *parent, ir::value *x, ir::builder &builder,
}
else
copy = builder.create_copy_from_shared(x);
- parent->replace_uses_of_with(x, copy);
+ copies.insert({x, copy});
+ parent->replace_uses_of_with(x, copies.at(x));
}
void cts::run(ir::module &mod) {
- // Add shared copies
- ir::builder &builder = mod.get_builder();
- for(ir::function* fn: mod.get_function_list()){
-   for(ir::basic_block* block: fn->blocks())
-   for(ir::instruction* i: block->get_inst_list()){
-     size_t num_op = i->get_num_operands();
-     // copy to shared operands
-     for(size_t k = 0; k < num_op; k++)
-       if(is_shmem_op(i, k)){
-         add_copy(i, i->get_operand(k), builder, true);
-       }
-     // copy from shared operands
-     for(size_t k = 0; k < num_op; k++)
-       if(!dynamic_cast<ir::phi_node*>(i) &&
-          !is_shmem_op(i,k) &&
-          is_shmem_res(i->get_operand(k))){
-         add_copy(i, i->get_operand(k), builder, false);
-       }
-   }
- }
+ // Precompute where copies should be added
+ std::set<ir::value*> shmem_ops;
+ std::set<ir::value*> shmem_res;
+ ir::for_each_instruction(mod, [&](ir::instruction* i) {
+   if(i->get_id() == ir::INST_DOT){
+     ir::dot_inst* dot = dynamic_cast<ir::dot_inst*>(i);
+     ir::value* lhs = i->get_operand(0);
+     ir::type* ty = lhs->get_type()->get_scalar_ty();
+     analysis::mma_layout* mma_lhs = layouts_->get(lhs)->to_mma();
+     // TODO: V100
+     bool is_lhs_shmem = !(mma_lhs && has_sm80_ && ty->get_primitive_size_in_bits() == 16 && !dot->is_trans_a());
+     if(is_lhs_shmem)
+       shmem_ops.insert(lhs);
+     shmem_ops.insert(i->get_operand(1));
+   }
+   if(i->get_id() == ir::INST_COPY_FROM_SHARED)
+     shmem_ops.insert(i->get_operand(0));
+   if(i->get_id() == ir::INST_TRANS)
+     shmem_ops.insert(i->get_operand(0));
+   if(i->get_id() == ir::INST_TRANS ||
+      i->get_id() == ir::INST_COPY_TO_SHARED ||
+      i->get_id() == ir::INST_MASKED_LOAD_ASYNC)
+     shmem_res.insert(i);
+ });
+ // Add shared copies
+ std::map<ir::value*, ir::value*> copies;
+ ir::builder &builder = mod.get_builder();
+ ir::for_each_instruction(mod, [&](ir::instruction* i) {
+   size_t num_op = i->get_num_operands();
+   for(size_t k = 0; k < num_op; k++){
+     ir::value* op = i->get_operand(k);
+     // copy to shared operands
+     bool is_shmem_op = shmem_ops.find(op) != shmem_ops.end();
+     if(is_shmem_op)
+       add_copy(i, op, builder, true, copies);
+   }
+ });
}
}
}
}
}


@@ -3,6 +3,7 @@
#include "triton/ir/basic_block.h"
#include "triton/ir/module.h"
#include "triton/ir/utils.h"
#include <iostream>
namespace triton {
namespace codegen{
@@ -28,6 +29,8 @@ void dce::run(ir::module &mod) {
case ir::INST_ATOMIC_CAS:
case ir::INST_ATOMIC_RMW:
case ir::INST_ATOMIC_EXCH:
case ir::INST_CALL:
case ir::INST_LAUNCH:
case ir::INST_BARRIER: {
work_list.push_back(i);
marked.insert(i);
@@ -65,6 +68,7 @@ void dce::run(ir::module &mod) {
}
}
// delete
for(ir::instruction* i: to_delete)
i->erase_from_parent();


@@ -11,6 +11,8 @@ namespace transform{
ir::instruction* rematerialize(ir::builder& bld, ir::instruction *root,
std::set<ir::value*>& seen) {
if (dynamic_cast<ir::phi_node*>(root))
return root;
if(!seen.insert(root).second)
return root;
if(!root->get_type()->is_block_ty())


@@ -0,0 +1,147 @@
#include <iostream>
#include "triton/codegen/transform/inline.h"
#include "triton/ir/module.h"
#include "triton/ir/function.h"
#include "triton/ir/utils.h"
namespace triton{
namespace codegen{
namespace transform{
bool fncmp::operator()(ir::function* x, ir::function* y) const {
auto fn_list = x->get_parent()->get_function_list();
return std::find(fn_list.begin(), fn_list.end(), x) < std::find(fn_list.begin(), fn_list.end(), y);
};
void inliner::do_inline(ir::function* fn, ir::call_inst* callsite, ir::builder& builder,
std::list<ir::call_inst*>& callsites){
ir::basic_block* parent_block = callsite->get_parent();
ir::function* parent_fn = parent_block->get_parent();
// the parent block is split into block A and block B:
// - block A (`new_blocks[0]`) is the entry block of the inlined function
// - block B (`exit`) resumes execution of the parent function
ir::basic_block* entry = parent_block->split_before(callsite, fn->get_name());
ir::basic_block* exit = entry->get_successors()[0];
std::vector<ir::basic_block*> new_blocks = {entry};
for(size_t i = 1; i < fn->blocks().size(); i++){
ir::basic_block* block = fn->blocks()[i];
ir::context& ctx = block->get_context();
const std::string& name = block->get_parent()->get_name() + "_" + block->get_name();
new_blocks.push_back(ir::basic_block::create(ctx, name, parent_fn));
}
// a phi node holds the return values of the inlined function
if(exit->get_inst_list().empty())
builder.set_insert_point(exit);
else
builder.set_insert_point(exit->get_first_non_phi());
ir::phi_node* exit_val = builder.create_phi(fn->get_fn_type()->get_return_ty(), 0);
callsite->replace_all_uses_with(exit_val);
callsite->erase_from_parent();
// get arguments `fn` is called with
std::vector<ir::value*> tgt_args(callsite->op_begin(), callsite->op_end());
std::vector<ir::argument*> src_args(fn->args().begin(), fn->args().end());
// Actually generate the instructions:
// - Remove the branch created by basic_block::split_before
// - Clone all instructions
// - Replace `ret` with incoming nodes to `exit_val` and branches to `exit`
ir::instruction* terminator = new_blocks[0]->get_inst_list().back();
// new_blocks[0]->get_inst_list().back()->erase_from_parent();
terminator->erase_from_parent();
std::map<ir::instruction*, ir::instruction*> inst_map;
std::map<ir::argument*, ir::value*> arg_map;
for(size_t k = 0; k < fn->args().size(); k++)
arg_map[fn->args()[k]] = callsite->ops()[k];
std::vector<ir::basic_block*> rpo = ir::cfg::reverse_post_order(fn);
// clone instructions
for(size_t i = 0; i < new_blocks.size(); i++){
ir::basic_block* old_block = fn->blocks()[i];
ir::basic_block* new_block = new_blocks[i];
builder.set_insert_point(new_block);
for(ir::instruction* old_inst: old_block->get_inst_list()){
ir::instruction* new_inst = old_inst->clone();
inst_map[old_inst] = new_inst;
builder.insert(new_inst);
}
}
// update basic blocks
for(size_t i = 0; i < new_blocks.size(); i++) {
for (ir::instruction* new_inst: new_blocks[i]->get_inst_list()) {
// replace basic use cases
for(size_t k = 0; k < new_blocks.size(); k++)
new_inst->replace_uses_of_with(fn->blocks()[k], new_blocks[k]);
if(ir::phi_node* phi = dynamic_cast<ir::phi_node*>(new_inst)) {
// additionally replace basic blocks of phi-nodes since
// replace_uses_of_with() does not replace them.
for(unsigned in = 0; in < phi->get_num_incoming(); in++)
for(size_t k = 0; k < new_blocks.size(); k++)
if (phi->get_incoming_block(in) == fn->blocks()[k])
phi->set_incoming_block(in, new_blocks[k]);
}
}
}
// replace operands of instructions after constructing inst_map
for (auto& it: inst_map) {
ir::instruction* new_inst = it.second;
for(size_t k = 0; k < new_inst->get_num_operands(); k++) {
ir::value* op = new_inst->get_operand(k);
if(auto arg_op = dynamic_cast<ir::argument*>(op))
new_inst->set_operand(k, arg_map.at(arg_op));
if(auto inst_op = dynamic_cast<ir::instruction*>(op))
if(inst_map.find(inst_op) != inst_map.end())
new_inst->set_operand(k, inst_map.at(inst_op));
}
// handles a ret instruction.
// instead of returning we need to branch to after the function call
if(ir::return_inst* ret = dynamic_cast<ir::return_inst*>(new_inst)) {
if(ir::value* ret_val = ret->get_return_value())
exit_val->add_incoming(ret_val, new_inst->get_parent());
// replace ret with branch
ir::instruction* new_br_inst = ir::branch_inst::create(exit);
builder.set_insert_point(new_inst->get_parent());
builder.insert(new_br_inst);
new_inst->erase_from_parent();
}
}
if(exit_val->get_num_incoming() == 1)
exit_val->replace_all_uses_with(exit_val->get_incoming_value(0));
// done -- make sure insert point is properly set to exit block
builder.set_insert_point(exit);
}
void inliner::run(ir::module &mod) {
// gather all call sites
while(true){
std::map<ir::function*, size_t> counts;
for(ir::function* fn: mod.get_function_list())
counts[fn] = 0;
std::list<ir::call_inst*> callsites;
for(ir::function* fn: mod.get_function_list()){
for(ir::basic_block* block: fn->blocks())
for(ir::instruction* instr: block->get_inst_list())
if(ir::call_inst* call = dynamic_cast<ir::call_inst*>(instr)){
callsites.push_back(call);
counts[call->get_fn()] += 1;
}
}
for(auto& count: counts){
if(!count.first->get_is_kernel() && count.second == 0)
count.first->get_parent()->remove_function(count.first);
}
if(callsites.empty())
break;
for(ir::call_inst* call: callsites)
do_inline(call->get_fn(), call, mod.get_builder(), callsites);
}
}
}
}
}
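The net effect of do_inline, in pseudo-IR (illustrative only):

// before:  %r = call @f(%x)            ; inside block %B
// after:   %B is split in two; @f's blocks are cloned between the halves,
//          each `ret %v` becomes `br %exit` plus an incoming (%v, pred)
//          on a phi node in %exit, and %r's uses are rewired to that phi.
// A single-ret callee folds the phi away
// (the exit_val->get_num_incoming() == 1 case above).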


@@ -36,6 +36,9 @@ int membar::group_of(ir::value* v, std::vector<ir::value*> &async_write) {
else{
if(layouts_->has_tmp(v))
return async_write.size() - 1;
// // Ignore copy_to_shared. It won't modify async behavior.
// if(dynamic_cast<ir::copy_to_shared_inst*>(v))
// return 0;
auto it = std::find(async_write.begin(), async_write.end(), v);
return std::distance(async_write.begin(), it);
}
@@ -60,15 +63,22 @@ membar::val_set_t membar::intersect_with(const val_set_t& as, const val_set_t& b
continue;
analysis::shared_layout* a_layout = layouts_->get(a)->to_shared();
analysis::shared_layout* a_tmp = layouts_->has_tmp(a) ? layouts_->get(layouts_->tmp(a))->to_shared() : nullptr;
analysis::shared_layout* a_tmp_index = layouts_->has_tmp_index(a) ? layouts_->get(layouts_->tmp_index(a))->to_shared() : nullptr;
for(ir::value* b: bs){
if(!b->get_type()->is_block_ty())
continue;
analysis::shared_layout* b_layout = layouts_->get(b)->to_shared();
analysis::shared_layout* b_tmp = layouts_->has_tmp(b) ? layouts_->get(layouts_->tmp(b))->to_shared() : nullptr;
analysis::shared_layout* b_tmp_index = layouts_->has_tmp_index(b) ? layouts_->get(layouts_->tmp_index(b))->to_shared() : nullptr;
if(intersect_with(a_layout, b_layout) ||
intersect_with(a_layout, b_tmp) ||
intersect_with(a_layout, b_tmp_index) ||
intersect_with(a_tmp, b_layout) ||
- intersect_with(a_tmp, b_tmp))
+ intersect_with(a_tmp, b_tmp) ||
+ intersect_with(a_tmp, b_tmp_index) ||
+ intersect_with(a_tmp_index, b_layout) ||
+ intersect_with(a_tmp_index, b_tmp) ||
+ intersect_with(a_tmp_index, b_tmp_index))
ret.insert(b);
}
}


@@ -61,7 +61,8 @@ bool peephole::rewrite_dot(ir::instruction *value, ir::builder& builder){
// dot(a, b, c) + d -> dot(a, b, c + d)
// d + dot(a, b, c) -> dot(a, b, c + d)
auto add = dynamic_cast<ir::binary_operator*>(value);
- if(add && add->get_op() == ir::binary_op_t::FAdd) {
+ if(add && (add->get_op() == ir::binary_op_t::FAdd || add->get_op() == ir::binary_op_t::Add)) {
+   bool is_int_dot = add->get_op() == ir::binary_op_t::Add;
ir::value *lhs = add->get_operand(0);
ir::value *rhs = add->get_operand(1);
ir::dot_inst *lhs_dot = dynamic_cast<ir::dot_inst*>(lhs);
@@ -72,15 +73,21 @@ bool peephole::rewrite_dot(ir::instruction *value, ir::builder& builder){
ir::value *other = (dot == lhs) ? rhs : lhs;
ir::value *acc = dot->get_operand(2);
ir::splat_inst *splat = dynamic_cast<ir::splat_inst*>(acc);
- ir::constant_fp *_0 = nullptr;
+ ir::constant *_0 = nullptr;
if(splat)
-   _0 = dynamic_cast<ir::constant_fp*>(splat->get_operand(0));
- if(!(_0 && _0->get_value() == 0.0))
+   _0 = dynamic_cast<ir::constant*>(splat->get_operand(0));
+ if(!_0)
return false;
+ if (auto *fp_0 = dynamic_cast<ir::constant_fp*>(_0))
+   if (fp_0->get_value() != 0.0)
+     return false;
+ if (auto *int_0 = dynamic_cast<ir::constant_int*>(_0))
+   if (int_0->get_value() != 0)
+     return false;
ir::value *a = dot->get_operand(0);
ir::value *b = dot->get_operand(1);
builder.set_insert_point(add);
- ir::value * new_dot = builder.insert(ir::dot_inst::create_nn(a, b, other, dot->get_name()));
+ ir::value * new_dot = builder.insert(ir::dot_inst::create(a, b, other, dot->is_trans_a(), dot->is_trans_b(), dot->allow_tf32(), dot->get_name()));
add->replace_all_uses_with(new_dot);
return true;
}
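With the accumulator-zero check generalized to both float and integer constants, the rewrite now also fires for integer dots. In pseudo-IR (illustrative only):

//   %zero = splat 0
//   %d    = dot %a, %b, %zero    ; accumulator is all zeros
//   %out  = add %d, %bias        ; fadd for float, add for int8->int32 dots
// becomes
//   %out  = dot %a, %b, %bias    ; bias folded into the accumulator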
@@ -116,7 +123,7 @@ bool peephole::rewrite_load_to_shared(ir::instruction *value, ir::builder& build
int nts = layout->nts(layout->get_order()[0]);
int dtsize = value->get_type()->get_scalar_ty()->get_primitive_size_in_bits() / 8;
if(nts*dtsize >= 4){
- ir::value* new_load = builder.create_masked_load_async(ptr, msk, val);
+ ir::value* new_load = builder.create_masked_load_async(ptr, msk, val, ld->get_cache_modifier(), ld->get_eviction_policy());
copy_to_shared->replace_all_uses_with(new_load);
return true;
}
@@ -143,32 +150,53 @@ bool peephole::rewrite_unit_red(ir::instruction *value, ir::builder& builder){
}
bool peephole::rewrite_mult(ir::instruction *value, ir::builder& builder) {
auto binop = dynamic_cast<ir::binary_operator*>(value);
if(binop && binop->get_op() == ir::binary_op_t::Mul) {
  ir::value *lhs = binop->get_operand(0);
  ir::value *rhs = binop->get_operand(1);
  ir::constant_int *_1_lhs = nullptr;
  if(ir::splat_inst *splat = dynamic_cast<ir::splat_inst*>(lhs)){
    auto *cst = dynamic_cast<ir::constant_int*>(splat->get_operand(0));
    if(cst && cst->get_value() == 1)
      _1_lhs = cst;
  }
  ir::constant_int *_1_rhs = nullptr;
  if(ir::splat_inst *splat = dynamic_cast<ir::splat_inst*>(rhs)){
    auto *cst = dynamic_cast<ir::constant_int*>(splat->get_operand(0));
    if(cst && cst->get_value() == 1)
      _1_rhs = cst;
  }
  if(_1_lhs){
    binop->replace_all_uses_with(rhs);
    return true;
  }
  else if(_1_rhs){
    binop->replace_all_uses_with(lhs);
    return true;
  }
}
return false;
}
bool peephole::rewrite_insert_extract(ir::instruction *value, ir::builder& builder){
auto extracted = dynamic_cast<ir::extract_value_inst*>(value);
if(!extracted)
return false;
size_t extract_idx = extracted->get_idx();
ir::value* agg = extracted->get_operand(0);
auto insert = dynamic_cast<ir::insert_value_inst*>(agg);
while(insert){
agg = insert->get_operand(0);
ir::value* inserted = insert->get_operand(1);
size_t insert_idx = insert->get_idx();
insert = dynamic_cast<ir::insert_value_inst*>(agg);
if(extract_idx == insert_idx){
extracted->replace_all_uses_with(inserted);
return true;
}
}
return false;
}
@@ -206,7 +234,10 @@ bool peephole::rewrite_select_masked_load(ir::instruction *value, ir::builder& b
builder.set_insert_point(select);
ir::value* new_load = builder.create_masked_load(if_value->get_pointer_operand(),
if_value->get_mask_operand(),
select->get_else_value_op());
select->get_else_value_op(),
if_value->get_cache_modifier(),
if_value->get_eviction_policy(),
if_value->get_is_volatile());
select->replace_all_uses_with(new_load);
return true;
}
@@ -218,22 +249,22 @@ bool peephole::rewrite_cvt_layout(ir::instruction *value, ir::builder& builder){
ir::instruction* op = dynamic_cast<ir::instruction*>(cvt->get_operand(0));
if(!op)
return false;
- // convert(elementwise(x, y)) = elementwise(convert(x), convert(y))
- if(op->get_id() == ir::INST_BINOP){
-   for(size_t i = 0; i < op->get_num_operands(); i++){
-     ir::value* arg_i = op->get_operand(i);
-     builder.set_insert_point(op);
-     // create new layout transform
-     ir::instruction* new_arg_i = cvt->clone();
-     layouts_->copy(new_arg_i, op);
-     builder.insert(new_arg_i);
-     // set the right args
-     new_arg_i->replace_uses_of_with(new_arg_i->get_operand(0), arg_i);
-     op->replace_uses_of_with(arg_i, new_arg_i);
-   }
-   cvt->replace_all_uses_with(op);
-   return true;
- }
+ // // convert(elementwise(x, y)) = elementwise(convert(x), convert(y))
+ // if(op->get_id() == ir::INST_BINOP){
+ //   for(size_t i = 0; i < op->get_num_operands(); i++){
+ //     ir::value* arg_i = op->get_operand(i);
+ //     builder.set_insert_point(op);
+ //     // create new layout transform
+ //     ir::instruction* new_arg_i = cvt->clone();
+ //     layouts_->copy(new_arg_i, op);
+ //     builder.insert(new_arg_i);
+ //     // set the right args
+ //     new_arg_i->replace_uses_of_with(new_arg_i->get_operand(0), arg_i);
+ //     op->replace_uses_of_with(arg_i, new_arg_i);
+ //   }
+ //   cvt->replace_all_uses_with(op);
+ //   return true;
+ // }
auto cvt_op = dynamic_cast<ir::cvt_layout_inst*>(op);
if(!cvt_op)
return false;
@@ -281,9 +312,11 @@ void peephole::run(ir::module &mod) {
was_modified = was_modified || rewrite_mult(i, builder);
// was_modified = was_modified || rewrite_cts_cfs(i, builder);
// was_modified = was_modified || rewrite_trans_phi(i, builder);
was_modified = was_modified || rewrite_insert_extract(i, builder);
was_modified = was_modified || rewrite_unit_red(i, builder);
was_modified = was_modified || rewrite_gep_ptr_min_off_plus_off(i, builder);
- was_modified = was_modified || rewrite_select_masked_load(i, builder);
+ // TODO: DOESN'T WORK FOR VECTORIZED MASKED LOAD
+ // was_modified = was_modified || rewrite_select_masked_load(i, builder);
was_modified = was_modified || rewrite_cvt_layout(i, builder);
if(tgt_->as_nvidia() && tgt_->as_nvidia()->sm() >= 80)
was_modified = was_modified || rewrite_load_to_shared(i, builder);


@@ -111,6 +111,8 @@ struct pipeline_info_t {
};
void pipeline::run(ir::module &mod) {
if (num_stages_ <= 1)
return;
// *Very* conservative heuristics for pre-fetching.
// A load instruction can be pipelined if:
// - the pointer is a phi node that references a value
@@ -132,6 +134,7 @@ void pipeline::run(ir::module &mod) {
ir::builder &builder = mod.get_builder();
const int num_stages = num_stages_;
std::vector<std::pair<ir::phi_node*, std::vector<ir::value*>>> preheader_loads; // Used to reorder loads
for(auto info: to_pipeline){
ir::load_inst* load = info.load;
ir::phi_node* ptr = info.ptr;
@@ -176,7 +179,7 @@ void pipeline::run(ir::module &mod) {
false_value = remat_false_value;
} else
false_value = builder.create_splat(ir::undef_value::get(ty->get_scalar_ty()), ty->get_block_shapes());
- first_loads[0] = builder.create_masked_load(first_ptrs[0], first_masks[0], false_value);
+ first_loads[0] = builder.create_masked_load(first_ptrs[0], first_masks[0], false_value, load->get_cache_modifier(), load->get_eviction_policy(), load->get_is_volatile());
for (int stage = 1; stage < num_stages-1; ++stage) {
// mask is the loop condition of the previous iteration
@@ -191,7 +194,7 @@ void pipeline::run(ir::module &mod) {
first_masks[stage] = builder.create_and(first_masks[stage], remat_mask);
false_value = remat_false_value;
}
- first_loads[stage] = builder.create_masked_load(first_ptrs[stage], first_masks[stage], false_value);
+ first_loads[stage] = builder.create_masked_load(first_ptrs[stage], first_masks[stage], false_value, load->get_cache_modifier(), load->get_eviction_policy(), load->get_is_volatile());
}
// create new phis for induction variables
@@ -220,7 +223,7 @@ void pipeline::run(ir::module &mod) {
next_mask = builder.create_and(next_mask, remat_mask);
false_value = remat_false_value;
}
ir::value* next_load = builder.create_masked_load(next_ptr, next_mask, false_value);
ir::value* next_load = builder.create_masked_load(next_ptr, next_mask, false_value, load->get_cache_modifier(), load->get_eviction_policy(), load->get_is_volatile());
// phi node
@@ -255,7 +258,7 @@ void pipeline::run(ir::module &mod) {
}
else
false_value = builder.create_splat(ir::undef_value::get(ty->get_scalar_ty()), ty->get_block_shapes());
ir::value* first_load = builder.create_masked_load(first_ptr, first_mask, false_value);
ir::value* first_load = builder.create_masked_load(first_ptr, first_mask, false_value, load->get_cache_modifier(), load->get_eviction_policy(), load->get_is_volatile());
// pre-fetch next iteration
builder.set_insert_point(block->get_inst_list().back());
ir::value* next_ptr = ptr->get_value_for_block(block);
@@ -266,7 +269,7 @@ void pipeline::run(ir::module &mod) {
next_mask = builder.create_and(next_mask, remat_mask);
false_value = remat_false_value;
}
ir::value* next_load = builder.create_masked_load(next_ptr, next_mask, false_value);
ir::value* next_load = builder.create_masked_load(next_ptr, next_mask, false_value, load->get_cache_modifier(), load->get_eviction_policy(), load->get_is_volatile());
// phi node
builder.set_insert_point(block->get_first_non_phi());
ir::phi_node* new_load = builder.create_phi(ty, 2);
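Every hunk in this pass makes the same change: the re-materialized loads now inherit the source load's cache modifier, eviction policy, and volatility instead of being created with defaults. A hypothetical helper (not in the source; it uses only names that appear in the diff above) that captures the repeated pattern:
static ir::value* clone_masked_load(ir::builder &builder, ir::load_inst *load,
                                    ir::value *ptr, ir::value *mask, ir::value *false_value) {
  // carry the original load's semantics into the pipelined copy
  return builder.create_masked_load(ptr, mask, false_value,
                                    load->get_cache_modifier(),
                                    load->get_eviction_policy(),
                                    load->get_is_volatile());
}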

View File

@@ -29,8 +29,16 @@ void prefetch::run(ir::module &mod) {
std::vector<ir::dot_inst*> to_prefetch;
ir::for_each_instruction(mod, [&](ir::instruction *i) {
if (auto *dot = dynamic_cast<ir::dot_inst*>(i)) {
// Now only do prefetching when dot is fp16
if (dot->get_operand(0)->get_type()->get_scalar_ty()->get_type_id() != ir::type::FP16TyID)
// Now only do prefetching when dot is using tensor cores
if (!(dot->get_operand(0)->get_type()->get_scalar_ty()->is_fp16_ty() ||
dot->get_operand(0)->get_type()->get_scalar_ty()->is_bf16_ty() ||
(dot->get_operand(0)->get_type()->get_scalar_ty()->is_fp32_ty() && dot->allow_tf32()
&& tgt_->as_nvidia() && tgt_->as_nvidia()->sm() >= 80) ||
(dot->get_operand(0)->get_type()->get_scalar_ty()->is_integer_ty(8)
&& dot->get_operand(1)->get_type()->get_scalar_ty()->is_integer_ty(8)
&& tgt_->as_nvidia() && tgt_->as_nvidia()->sm() >= 80)
)
)
return;
auto *a = dynamic_cast<ir::phi_node*>(dot->get_operand(0));
auto *b = dynamic_cast<ir::phi_node*>(dot->get_operand(1));
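The new predicate admits a dot for prefetching only when it can map to tensor cores: fp16 or bf16 operands on any target, or tf32 and int8 on sm_80+. A hedged extraction of that condition into a helper (names taken from the diff; the helper itself is not in the source):
static bool dot_uses_tensor_cores(ir::dot_inst *dot, target *tgt) {
  ir::type *a_ty = dot->get_operand(0)->get_type()->get_scalar_ty();
  ir::type *b_ty = dot->get_operand(1)->get_type()->get_scalar_ty();
  bool sm80 = tgt->as_nvidia() && tgt->as_nvidia()->sm() >= 80;
  return a_ty->is_fp16_ty() || a_ty->is_bf16_ty()
      || (a_ty->is_fp32_ty() && dot->allow_tf32() && sm80)
      || (a_ty->is_integer_ty(8) && b_ty->is_integer_ty(8) && sm80);
}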

View File

@@ -91,9 +91,13 @@ void* dispatch::fname ## _;
bool dispatch::cuinit(){
if(cuda_==nullptr){
#ifdef _WIN32
cuda_ = dlopen("cudart64_110.dll", RTLD_LAZY);
#else
cuda_ = dlopen("libcuda.so", RTLD_LAZY);
if(!cuda_)
cuda_ = dlopen("libcuda.so.1", RTLD_LAZY);
#endif
if(!cuda_)
throw std::runtime_error("Could not find `libcuda.so`. Make sure it is in your LD_LIBRARY_PATH.");
}
@@ -134,6 +138,7 @@ CUDA_DEFINE3(CUresult, cuDeviceGetAttribute, int *, CUdevice_attribute, CUdevice
CUDA_DEFINE1(CUresult, cuDeviceGetCount, int*)
// link management
CUDA_DEFINE6(CUresult, cuLinkAddFile_v2, CUlinkState, CUjitInputType, const char *, unsigned int , CUjit_option *, void **);
CUDA_DEFINE8(CUresult, cuLinkAddData_v2, CUlinkState, CUjitInputType, void*, size_t, const char*, unsigned int, CUjit_option*, void**);
CUDA_DEFINE4(CUresult, cuLinkCreate_v2, unsigned int, CUjit_option*, void**, CUlinkState*);
CUDA_DEFINE1(CUresult, cuLinkDestroy, CUlinkState);
@@ -176,8 +181,13 @@ CUDA_DEFINE1(CUresult, cuEventDestroy_v2, CUevent)
* NVML
* ------------------- */
bool dispatch::nvmlinit(){
#ifdef _WIN32
if(nvml_==nullptr)
nvml_ = dlopen("nvml.dll", RTLD_LAZY);
#else
if(nvml_==nullptr)
nvml_ = dlopen("libnvidia-ml.so", RTLD_LAZY);
#endif
nvmlReturn_t (*fptr)();
nvmlInit_v2_ = dlsym(nvml_, "nvmlInit_v2");
*reinterpret_cast<void **>(&fptr) = nvmlInit_v2_;
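nvmlinit follows the same lazy-binding pattern as cuinit: dlopen the library, dlsym the entry point, and call through a casted function pointer. A standalone sketch of that pattern (assuming an NVML installation; the .so.1 fallback mirrors the CUDA path above and is an assumption here):
#include <dlfcn.h>
#include <cstdio>
typedef int (*init_fn)(void); // nvmlReturn_t is an int-sized enum
int main() {
  void *nvml = dlopen("libnvidia-ml.so", RTLD_LAZY);
  if (!nvml) nvml = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (!nvml) { std::fprintf(stderr, "libnvidia-ml not found\n"); return 1; }
  init_fn init = reinterpret_cast<init_fn>(dlsym(nvml, "nvmlInit_v2"));
  return init ? init() : 1; // 0 == NVML_SUCCESS
}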
@@ -212,6 +222,7 @@ bool dispatch::hipinit(){
return res;
}
#define HIP_DEFINE0(ret, fname) DEFINE0(hipinit, hip_, ret, fname)
#define HIP_DEFINE1(ret, fname, t1) DEFINE1(hipinit, hip_, ret, fname, t1)
#define HIP_DEFINE2(ret, fname, t1, t2) DEFINE2(hipinit, hip_, ret, fname, t1, t2)
#define HIP_DEFINE3(ret, fname, t1, t2, t3) DEFINE3(hipinit, hip_, ret, fname, t1, t2, t3)
@@ -268,7 +279,8 @@ HIP_DEFINE2(hipError_t, hipEventCreate, hipEvent_t *, unsigned int)
HIP_DEFINE3(hipError_t, hipEventElapsedTime, float *, hipEvent_t, hipEvent_t)
HIP_DEFINE2(hipError_t, hipEventRecord, hipEvent_t, hipStream_t)
HIP_DEFINE1(hipError_t, hipEventDestroy, hipEvent_t)
// error handling
HIP_DEFINE0(hipError_t, hipGetLastError)
/* ------------------- *
* COMMON

View File

@@ -90,7 +90,7 @@ void check(CUresult err)
case CUDA_ERROR_NOT_PERMITTED : throw not_permitted();
case CUDA_ERROR_NOT_SUPPORTED : throw not_supported();
case CUDA_ERROR_UNKNOWN : throw unknown();
default : throw unknown();
default : throw std::runtime_error("unimplemented code: " + std::to_string(err));
}
}
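With the new default branch, an unrecognized driver code surfaces its raw numeric value instead of a generic unknown(). A hedged usage sketch (cuModuleLoadData appears elsewhere in this changeset; the cubin string is assumed):
std::string cubin = /* ...compiled binary... */;
CUmodule mod;
check(dispatch::cuModuleLoadData(&mod, cubin.c_str())); // throws a typed exception, or runtime_error carrying the raw code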

View File

@@ -1,28 +1,32 @@
/* Copyright 2015-2017 Philippe Tillet
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files
* (the "Software"), to deal in the Software without restriction,
* including without limitation the rights to use, copy, modify, merge,
* publish, distribute, sublicense, and/or sell copies of the Software,
* and to permit persons to whom the Software is furnished to do so,
* subject to the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#include <fstream>
#if __has_include(<unistd.h>)
#include <unistd.h>
#endif
#include <memory>
#include <regex>
#include <iomanip>
#include <string>
#include "triton/driver/llvm.h"
#include "triton/driver/dispatch.h"
#include "triton/driver/error.h"
@@ -46,6 +50,7 @@
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Scalar.h"
// begin AMD stuff
#include "llvm/Support/FileSystem.h"
@@ -54,310 +59,346 @@
#include "llvm/Support/ToolOutputFile.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/Intrinsics.h"
// end AMD stuff
namespace triton{
namespace driver{
void init_llvm() {
static bool init = false;
if(!init){
LLVMInitializeNVPTXTargetInfo();
LLVMInitializeNVPTXTarget();
LLVMInitializeNVPTXTargetMC();
LLVMInitializeNVPTXAsmPrinter();
LLVMInitializeAMDGPUTargetInfo();
LLVMInitializeAMDGPUTarget();
LLVMInitializeAMDGPUTargetMC();
LLVMInitializeAMDGPUAsmPrinter();
init = true;
}
extern "C"
{
int set_curterm(char *nterm) { return 0; }
int del_curterm(char *nterm) { return 0; }
int tigetnum(char *capname) { return 0; }
int setupterm(char *term, int fildes, int *errret) { return 0; }
}
/* ------------------------ */
// CUDA //
/* ------------------------ */
static bool find_and_replace(std::string& str, const std::string& begin, const std::string& end, const std::string& target){
size_t start_replace = str.find(begin);
size_t end_replace = str.find(end, start_replace);
if(start_replace == std::string::npos)
return false;
str.replace(start_replace, end_replace + 1 - start_replace, target);
return true;
namespace {
std::string gen_random(const int len) {
static const char alphanum[] =
"0123456789"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz";
std::string tmp_s;
tmp_s.reserve(len);
for (int i = 0; i < len; ++i) {
tmp_s += alphanum[rand() % (sizeof(alphanum) - 1)];
}
return tmp_s;
}
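gen_random draws from rand() without an explicit seed, so uniqueness across runs leans on appending the suffix to an already time-stamped kernel name (see llir_to_amdgcn below). A standalone demo; the srand call is added for illustration only and is not in the source:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
std::string gen_random(const int len) { // copied from above
  static const char alphanum[] =
      "0123456789"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "abcdefghijklmnopqrstuvwxyz";
  std::string tmp_s;
  tmp_s.reserve(len);
  for (int i = 0; i < len; ++i)
    tmp_s += alphanum[rand() % (sizeof(alphanum) - 1)]; // -1 skips the trailing NUL
  return tmp_s;
}
int main() {
  std::srand(std::time(nullptr)); // illustration only: the source never seeds rand()
  std::cout << "kernel_" << gen_random(12) << "\n"; // e.g. kernel_x7QaPb031LsZ
}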
int vptx(int version){
if(version >= 11030) return 73;
if(version >= 11020) return 72;
if(version >= 11010) return 71;
if(version >= 11000) return 70;
if(version >= 10020) return 65;
if(version >= 10010) return 64;
if(version >= 10000) return 63;
throw std::runtime_error("Triton requires CUDA 10+");
}
std::string llir_to_ptx(llvm::Module* module, int cc, int version){
// LLVM version in use may not officially support target hardware
int max_nvvm_cc = 75;
int max_nvvm_ptx = 64;
// options
auto options = llvm::cl::getRegisteredOptions();
auto* short_ptr = static_cast<llvm::cl::opt<bool>*>(options["nvptx-short-ptr"]);
assert(short_ptr);
short_ptr->setValue(true);
// compute capability
std::string sm = "sm_" + std::to_string(cc);
// max PTX version
int ptx = vptx(version);
int ptx_major = ptx / 10;
int ptx_minor = ptx % 10;
// create
llvm::SmallVector<char, 0> buffer;
std::string triple = "nvptx64-nvidia-cuda";
std::string proc = "sm_" + std::to_string(std::min(cc, max_nvvm_cc));
std::string layout = "";
std::string features = "+ptx" + std::to_string(std::min(ptx, max_nvvm_ptx));
init_llvm();
// verify and store llvm
llvm::legacy::PassManager pm;
pm.add(llvm::createVerifierPass());
pm.run(*module);
// create machine
module->setTargetTriple(triple);
std::string error;
auto target = llvm::TargetRegistry::lookupTarget(module->getTargetTriple(), error);
llvm::TargetOptions opt;
opt.AllowFPOpFusion = llvm::FPOpFusion::Fast;
opt.UnsafeFPMath = false;
opt.NoInfsFPMath = false;
opt.NoNaNsFPMath = true;
llvm::TargetMachine *machine = target->createTargetMachine(module->getTargetTriple(), proc, features, opt,
llvm::Reloc::PIC_, llvm::None, llvm::CodeGenOpt::Aggressive);
// set data layout
if(layout.empty())
module->setDataLayout(machine->createDataLayout());
else
module->setDataLayout(layout);
// emit machine code
for (llvm::Function &f : module->functions())
f.addFnAttr(llvm::Attribute::AlwaysInline);
llvm::legacy::PassManager pass;
llvm::raw_svector_ostream stream(buffer);
// emit
machine->addPassesToEmitFile(pass, stream, nullptr, llvm::CodeGenFileType::CGFT_AssemblyFile);
pass.run(*module);
namespace triton
{
namespace driver
{
// post-process
std::string result(buffer.begin(), buffer.end());
find_and_replace(result, ".version", "\n", ".version " + std::to_string(ptx_major) + "." + std::to_string(ptx_minor) + "\n");
find_and_replace(result, ".target", "\n", ".target " + sm + "\n");
while(find_and_replace(result, "\t// begin inline asm", "\n", ""));
while(find_and_replace(result, "\t// end inline asm", "\n", ""));
return result;
}
void init_llvm()
{
LLVMInitializeNVPTXTargetInfo();
LLVMInitializeNVPTXTarget();
LLVMInitializeNVPTXTargetMC();
LLVMInitializeNVPTXAsmPrinter();
LLVMInitializeAMDGPUTargetInfo();
LLVMInitializeAMDGPUTarget();
LLVMInitializeAMDGPUTargetMC();
LLVMInitializeAMDGPUAsmPrinter();
}
std::string ptx_to_cubin(const std::string& ptx, int cc) {
std::string ptxas = "ptxas";
std::string version;
int use_system_ptxas = tools::exec(ptxas + " --version 2>&1", version) == 0;
if(!use_system_ptxas)
return "";
/* ------------------------ */
// CUDA //
/* ------------------------ */
static bool find_and_replace(std::string &str, const std::string &begin, const std::string &end, const std::string &target)
{
size_t start_replace = str.find(begin);
size_t end_replace = str.find(end, start_replace);
if (start_replace == std::string::npos)
return false;
str.replace(start_replace, end_replace + 1 - start_replace, target);
return true;
}
// compile ptx with ptxas
char _fsrc[] = "/tmp/triton_k_XXXXXX";
char _flog[] = "/tmp/triton_l_XXXXXX";
mkstemp(_fsrc);
mkstemp(_flog);
std::string fsrc = _fsrc;
std::string flog = _flog;
std::string fbin = fsrc + ".o";
const char* _fbin = fbin.c_str();
std::ofstream ofs(fsrc);
ofs << ptx;
ofs.close();
std::string cmd;
int err;
cmd = ptxas + " -v --gpu-name=sm_" + std::to_string(cc) + " " + fsrc + " -o " + fsrc + ".o 2> " + flog;
err = system(cmd.c_str());
CUmodule ret;
std::ifstream _cubin(_fbin, std::ios::binary );
std::string cubin(std::istreambuf_iterator<char>(_cubin), {});
_cubin.close();
dispatch::cuModuleLoadData(&ret, cubin.c_str());
unlink(_fsrc);
unlink(_flog);
unlink(_fbin);
return cubin;
}
std::string path_to_ptxas(int &version)
{
std::vector<std::string> rets;
std::string ret;
// search paths for ptxas
std::vector<std::string> ptxas_prefixes = {"", "/usr/local/cuda/bin/"};
std::string triton_ptxas = tools::getenv("TRITON_PTXAS_PATH");
if (!triton_ptxas.empty())
ptxas_prefixes.insert(ptxas_prefixes.begin(), triton_ptxas);
// see what path for ptxas are valid
std::vector<std::string> working_ptxas;
for (std::string prefix : ptxas_prefixes)
{
std::string ptxas = prefix + "ptxas";
bool works = tools::exec(ptxas + " --version 2>&1", ret) == 0;
if (works)
{
working_ptxas.push_back(ptxas);
rets.push_back(ret);
}
}
// error if no working ptxas was found
if (working_ptxas.empty())
throw std::runtime_error("`ptxas` was searched in TRITON_PTXAS_PATH, /usr/local/cuda/bin/ or PATH"
" but a working version could not be found.");
std::string ptxas = working_ptxas.front();
// parse version
std::regex version_regex("release (\\d+)\\.(\\d+)");
std::smatch match;
bool found = false;
// currently choosing the first ptxas. Other logics can be implemented in future
for (std::string ret : rets)
{
if (std::regex_search(ret, match, version_regex))
{
int major = std::stoi(match[1]);
int minor = std::stoi(match[2]);
version = major * 1000 + minor * 10;
found = true;
break;
}
}
if (!found)
{
throw std::runtime_error("Could not parse `ptxas` version from `ptxas --version` output");
}
return ptxas;
}
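The version is recovered from the ptxas banner with the regex release (\d+)\.(\d+) and packed into the major*1000 + minor*10 format that vptx expects. A standalone sketch with a representative banner (the text is an assumption; real output comes from running ptxas --version):
#include <iostream>
#include <regex>
#include <string>
int main() {
  std::string ret = "Cuda compilation tools, release 11.4, V11.4.120"; // assumed banner
  std::regex version_regex("release (\\d+)\\.(\\d+)");
  std::smatch match;
  if (std::regex_search(ret, match, version_regex)) {
    int version = std::stoi(match[1]) * 1000 + std::stoi(match[2]) * 10;
    std::cout << version << "\n"; // 11040, the packed format vptx() consumes
  }
}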
CUmodule ptx_to_cumodule(const std::string& ptx, int cc) {
// JIT compile source-code
try{
// use ptxas if present in PATH. Otherwise, use JIT from the driver
std::string ptxas = "ptxas";
std::string version;
int use_system_ptxas = tools::exec(ptxas + " --version 2>&1", version) == 0;
int vptx(int version)
{
if (version >= 11040)
return 74;
// if(version >= 11030) return 73;
// if(version >= 11020) return 72;
// if(version >= 11010) return 71;
// if(version >= 11000) return 70;
// if(version >= 10020) return 65;
// if(version >= 10010) return 64;
// if(version >= 10000) return 63;
throw std::runtime_error("Triton requires CUDA 11.4+");
}
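The rewritten vptx collapses the old table: the packed driver version now maps to PTX ISA 7.4 at minimum, and anything older throws. A standalone check:
#include <iostream>
#include <stdexcept>
int vptx(int version) { // as rewritten above
  if (version >= 11040) return 74;
  throw std::runtime_error("Triton requires CUDA 11.4+");
}
int main() {
  std::cout << vptx(11040) << "\n"; // CUDA 11.4 -> PTX ISA 7.4 -> ".version 7.4"
  std::cout << vptx(12000) << "\n"; // newer drivers are still capped at 7.4 by this table
}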
// Use PTXAS via system call
if(use_system_ptxas){
std::string llir_to_ptx(llvm::Module *module, int cc, int version)
{
// LLVM version in use may not officially support target hardware
int max_nvvm_cc = 75;
int max_nvvm_ptx = 74;
// options
auto options = llvm::cl::getRegisteredOptions();
auto *short_ptr = static_cast<llvm::cl::opt<bool> *>(options["nvptx-short-ptr"]);
assert(short_ptr);
short_ptr->setValue(true);
// compute capability
std::string sm = "sm_" + std::to_string(cc);
// max PTX version
int ptx = vptx(version);
int ptx_major = ptx / 10;
int ptx_minor = ptx % 10;
// create
llvm::SmallVector<char, 0> buffer;
std::string triple = "nvptx64-nvidia-cuda";
std::string proc = "sm_" + std::to_string(std::min(cc, max_nvvm_cc));
std::string layout = "";
std::string features = "";
// std::string features = "+ptx" + std::to_string(std::min(ptx, max_nvvm_ptx));
init_llvm();
// verify and store llvm
llvm::legacy::PassManager pm;
// pm.add(llvm::createPrintModulePass(llvm::outs()));
pm.add(llvm::createVerifierPass());
pm.run(*module);
// module->print(llvm::outs(), nullptr);
// create machine
module->setTargetTriple(triple);
std::string error;
llvm::TargetMachine *machine;
auto target = llvm::TargetRegistry::lookupTarget(module->getTargetTriple(), error);
llvm::TargetOptions opt;
opt.AllowFPOpFusion = llvm::FPOpFusion::Fast;
opt.UnsafeFPMath = false;
opt.NoInfsFPMath = false;
opt.NoNaNsFPMath = true;
machine = target->createTargetMachine(module->getTargetTriple(), proc, features, opt,
llvm::Reloc::PIC_, llvm::None, llvm::CodeGenOpt::Aggressive);
// set data layout
if (layout.empty())
module->setDataLayout(machine->createDataLayout());
else
module->setDataLayout(layout);
// emit machine code
for (llvm::Function &f : module->functions())
f.addFnAttr(llvm::Attribute::AlwaysInline);
llvm::legacy::PassManager pass;
llvm::raw_svector_ostream stream(buffer);
// emit
machine->addPassesToEmitFile(pass, stream, nullptr, llvm::CodeGenFileType::CGFT_AssemblyFile);
pass.run(*module);
// post-process
std::string result(buffer.begin(), buffer.end());
find_and_replace(result, ".version", "\n", ".version " + std::to_string(ptx_major) + "." + std::to_string(ptx_minor) + "\n");
find_and_replace(result, ".target", "\n", ".target " + sm + "\n");
while (find_and_replace(result, "\t// begin inline asm", "\n", ""))
;
while (find_and_replace(result, "\t// end inline asm", "\n", ""))
;
return result;
}
std::string ptx_to_cubin(const std::string &ptx, const std::string &ptxas, int cc)
{
// compile ptx with ptxas
char _fsrc[] = "/tmp/triton_k_XXXXXX";
char _flog[] = "/tmp/triton_l_XXXXXX";
mkstemp(_fsrc);
mkstemp(_flog);
char _fsrc[L_tmpnam];
char _flog[L_tmpnam];
std::tmpnam(_fsrc);
std::tmpnam(_flog);
std::string fsrc = _fsrc;
std::string flog = _flog;
std::string fbin = fsrc + ".o";
const char* _fbin = fbin.c_str();
const char *_fbin = fbin.c_str();
std::ofstream ofs(fsrc);
ofs << ptx;
ofs << ptx << std::endl;
ofs.close();
std::string cmd;
int err;
cmd = ptxas + " -v --gpu-name=sm_" + std::to_string(cc) + " " + fsrc + " -o " + fsrc + ".o 2> " + flog;
err = system(cmd.c_str());
CUmodule ret;
std::ifstream _cubin(_fbin, std::ios::binary );
if (err != 0)
{
std::ifstream _log(_flog);
std::string log(std::istreambuf_iterator<char>(_log), {});
unlink(_fsrc);
unlink(_flog);
throw std::runtime_error("Internal Triton PTX codegen error: \n" + log);
}
std::ifstream _cubin(_fbin, std::ios::binary);
std::string cubin(std::istreambuf_iterator<char>(_cubin), {});
_cubin.close();
dispatch::cuModuleLoadData(&ret, cubin.c_str());
unlink(_fsrc);
unlink(_flog);
unlink(_fbin);
return cubin;
}
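Taken together, the helpers in this file form the CUDA compilation chain. A hedged usage sketch (all function names are from this file; the llvm::Module pointer and the sm_80 target are assumptions):
int version;
std::string ptxas = path_to_ptxas(version);                  // find a working ptxas and its version
std::string ptx   = llir_to_ptx(module, /*cc=*/80, version); // lower LLVM IR to PTX text
std::string cubin = ptx_to_cubin(ptx, ptxas, /*cc=*/80);     // assemble with the located ptxas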
/* ------------------------ */
// HIP //
/* ------------------------ */
std::tuple<std::string, std::string> llir_to_amdgcn(llvm::Module *module, const std::string &_proc)
{
std::cout << "llvm.cc: llir_to_amdgcn:" << std::endl;
init_llvm();
// proc = std::get<0>(GetFeatureStrFromGCNArchName(rocminfo));
// features = std::get<1>(GetFeatureStrFromGCNArchName(rocminfo));
// create
llvm::SmallVector<char, 0> buffer;
std::string triple = "amdgcn-amd-amdhsa";
std::string layout = "";
std::string features = "+sramecc,-xnack";
std::string proc = _proc;
// name kernel
auto in_time_t = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
std::stringstream cur_time;
cur_time << std::put_time(std::localtime(&in_time_t), "%Y-%m-%d--%I-%M-%S");
std::string kernel_name = module->getModuleIdentifier() + "_" + cur_time.str() + "_" + gen_random(12);
// verify and store llvm
llvm::legacy::PassManager pm;
pm.add(llvm::createVerifierPass());
pm.run(*module);
// create machine
module->setTargetTriple(triple);
std::string error;
auto target = llvm::TargetRegistry::lookupTarget(module->getTargetTriple(), error);
llvm::TargetOptions opt;
opt.AllowFPOpFusion = llvm::FPOpFusion::Fast;
opt.UnsafeFPMath = false;
opt.NoInfsFPMath = false;
opt.NoNaNsFPMath = true;
llvm::TargetMachine *machine = target->createTargetMachine(module->getTargetTriple(), proc, features, opt,
llvm::Reloc::PIC_, llvm::None,
llvm::CodeGenOpt::None);
// set data layout
if (layout.empty())
module->setDataLayout(machine->createDataLayout());
else
module->setDataLayout(layout);
// emit machine code
for (llvm::Function &f : module->functions())
f.addFnAttr(llvm::Attribute::AlwaysInline);
llvm::legacy::PassManager pass;
llvm::raw_svector_ostream stream(buffer);
// create dump files
std::error_code ec;
// Save GCN ISA binary.
std::string isabin_path = std::string("/tmp/") + kernel_name + std::string(".o");
std::unique_ptr<llvm::raw_fd_ostream> isabin_fs(
new llvm::raw_fd_ostream(isabin_path, ec, llvm::sys::fs::OF_Text));
if (ec)
{
std::cout << isabin_path << " was not created. error code: " << ec << std::endl;
}
// emit
machine->addPassesToEmitFile(pass, *isabin_fs, nullptr, llvm::CGFT_ObjectFile);
pass.run(*module);
// Save GCN ISA.
llvm::SmallVector<char, 0> debugBuffer;
llvm::legacy::PassManager debugPass;
llvm::raw_svector_ostream debugStream(debugBuffer);
machine->addPassesToEmitFile(debugPass, debugStream, nullptr, llvm::CodeGenFileType::CGFT_AssemblyFile); // TODO:cause segfault on REM ops also cause @llvm.amdgcn.if bug
debugPass.run(*module);
std::string amdgcn(debugBuffer.begin(), debugBuffer.end());
// generate HSACO file
std::string hsaco_path = std::string("/tmp/") + kernel_name + std::string(".hsaco");
std::string error_message;
int lld_result =
llvm::sys::ExecuteAndWait("/opt/rocm/llvm/bin/ld.lld",
{"/opt/rocm/llvm/bin/ld.lld", "-flavor", "gnu", "-shared", "-o", hsaco_path, isabin_path},
llvm::None, {}, 0, 0, &error_message);
if (lld_result)
{
std::cout << "ld.lld execute fail: " << std::endl;
std::cout << error_message << std::endl;
std::cout << lld_result << std::endl;
}
return std::make_tuple(amdgcn, hsaco_path);
}
hipModule_t amdgpu_to_hipmodule(const std::string &hsaco_path)
{
std::cout << "llvm.cc: amdgpu_to_hipmodule:" << std::endl;
// Read HSACO.
std::ifstream hsaco_file(hsaco_path, std::ios::binary | std::ios::ate);
std::ifstream::pos_type hsaco_file_size = hsaco_file.tellg();
std::vector<unsigned char> hsaco(hsaco_file_size);
hsaco_file.seekg(0, std::ios::beg);
hsaco_file.read(reinterpret_cast<char *>(&hsaco[0]), hsaco_file_size);
hsaco_file.close();
hipJitOption opt[] = {hipJitOptionErrorLogBufferSizeBytes, hipJitOptionErrorLogBuffer,
hipJitOptionInfoLogBufferSizeBytes, hipJitOptionInfoLogBuffer,
hipJitOptionLogVerbose};
const unsigned int errbufsize = 8192;
const unsigned int logbufsize = 8192;
char _err[errbufsize];
char _log[logbufsize];
void *optval[] = {(void *)(uintptr_t)errbufsize,
(void *)_err, (void *)(uintptr_t)logbufsize,
(void *)_log, (void *)1};
hipModule_t ret;
dispatch::hipModuleLoadDataEx(&ret, hsaco.data(), 5, opt, optval);
return ret;
}
// Use PTXAS included in driver
CUjit_option opt[] = {CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, CU_JIT_ERROR_LOG_BUFFER,
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES, CU_JIT_INFO_LOG_BUFFER,
CU_JIT_LOG_VERBOSE};
unsigned int errbufsize = 8192;
unsigned int logbufsize = 8192;
char _err[errbufsize];
char _log[logbufsize];
void* optval[] = {(void*)(uintptr_t)errbufsize, (void*)_err, (void*)(uintptr_t)logbufsize, (void*)_log, (void*)1};
CUmodule ret;
dispatch::cuModuleLoadDataEx(&ret, ptx.data(), 5, opt, optval);
return ret;
}
catch(exception::cuda::invalid_ptx const &){
std::cout << ptx << std::endl;
std::cerr << "It appears that Triton produced invalid PTX code:" << std::endl;
throw;
}
}
/* ------------------------ */
// HIP //
/* ------------------------ */
std::string llir_to_amdgpu(llvm::Module* module, const std::string& _proc) {
init_llvm();
// proc = std::get<0>(GetFeatureStrFromGCNArchName(rocminfo));
// features = std::get<1>(GetFeatureStrFromGCNArchName(rocminfo));
// create
llvm::SmallVector<char, 0> buffer;
std::string triple = "amdgcn-amd-amdhsa";
std::string layout = "";
std::string features;
std::string proc = "gfx908";
// verify and store llvm
llvm::legacy::PassManager pm;
pm.add(llvm::createVerifierPass());
pm.run(*module);
// create machine
module->setTargetTriple(triple);
std::string error;
auto target = llvm::TargetRegistry::lookupTarget(module->getTargetTriple(), error);
llvm::TargetOptions opt;
opt.AllowFPOpFusion = llvm::FPOpFusion::Fast;
opt.UnsafeFPMath = false;
opt.NoInfsFPMath = false;
opt.NoNaNsFPMath = true;
llvm::TargetMachine *machine = target->createTargetMachine(module->getTargetTriple(), proc, features, opt,
llvm::Reloc::PIC_, llvm::None,
llvm::CodeGenOpt::Aggressive);
// set data layout
if(layout.empty())
module->setDataLayout(machine->createDataLayout());
else
module->setDataLayout(layout);
// emit machine code
for (llvm::Function &f : module->functions())
f.addFnAttr(llvm::Attribute::AlwaysInline);
llvm::legacy::PassManager pass;
llvm::raw_svector_ostream stream(buffer);
// create dump files
std::string module_name = module->getModuleIdentifier();
std::error_code ec;
// Save GCN ISA binary.
std::string isabin_path = std::string("/tmp/") + module_name + std::string(".o");
std::unique_ptr<llvm::raw_fd_ostream> isabin_fs(
new llvm::raw_fd_ostream(isabin_path, ec, llvm::sys::fs::OF_Text));
if (ec)
{
std::cout << isabin_path << " was not created. error code: " << ec << std::endl;
}
// emit
machine->addPassesToEmitFile(pass, *isabin_fs, nullptr, llvm::CGFT_ObjectFile);
pass.run(*module);
// Save GCN ISA.
std::string amdgcn_path = std::string("/tmp/") + module_name + std::string(".gcn");
std::string result(buffer.begin(), buffer.end());
std::ofstream amdgcn(amdgcn_path);
amdgcn << result;
amdgcn.close();
// generate HSACO file
std::string hsaco_path = std::string("/tmp/") + module_name + std::string(".hsaco");
std::string error_message;
int lld_result =
llvm::sys::ExecuteAndWait("/opt/rocm/llvm/bin/ld.lld",
{"/opt/rocm/llvm/bin/ld.lld", "-flavor", "gnu", "-shared", "-o", hsaco_path, isabin_path},
llvm::None, {}, 0, 0, &error_message);
if (lld_result)
{
std::cout << "ld.lld execute fail: " << std::endl;
std::cout << error_message << std::endl;
std::cout << lld_result << std::endl;
}
return hsaco_path;
}
hipModule_t amdgpu_to_hipmodule(const std::string& path) {
// Read HSACO.
std::ifstream hsaco_file(path, std::ios::binary | std::ios::ate);
std::ifstream::pos_type hsaco_file_size = hsaco_file.tellg();
std::vector<unsigned char> hsaco(hsaco_file_size);
hsaco_file.seekg(0, std::ios::beg);
hsaco_file.read(reinterpret_cast<char*>(&hsaco[0]), hsaco_file_size);
hsaco_file.close();
hipJitOption opt[] = {hipJitOptionErrorLogBufferSizeBytes, hipJitOptionErrorLogBuffer,
hipJitOptionInfoLogBufferSizeBytes, hipJitOptionInfoLogBuffer,
hipJitOptionLogVerbose};
unsigned int errbufsize = 8192;
unsigned int logbufsize = 8192;
char _err[errbufsize];
char _log[logbufsize];
void* optval[] = {(void*)(uintptr_t)errbufsize,
(void*)_err, (void*)(uintptr_t)logbufsize,
(void*)_log, (void*)1};
hipModule_t ret;
dispatch::hipModuleLoadDataEx(&ret, hsaco.data(), 5, opt, optval);
return ret;
}
}
}
} // namespace driver
} // namespace triton
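The AMD path mirrors the CUDA chain: LLVM IR is lowered to GCN ISA, linked into an HSACO with ld.lld, then loaded as a hipModule_t. A hedged sketch (function names from this file; the llvm::Module pointer and the gfx90a target string are assumptions):
auto [amdgcn, hsaco_path] = llir_to_amdgcn(module, "gfx90a"); // new tuple-returning variant above
hipModule_t hip_mod = amdgpu_to_hipmodule(hsaco_path);        // loads the linked HSACO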

View File

@@ -1,3 +1,5 @@
#include <iostream>
#include <algorithm>
#include "triton/ir/basic_block.h"
#include "triton/ir/instructions.h"
#include "triton/ir/type.h"
@@ -9,23 +11,71 @@ namespace ir {
class phi_node;
basic_block::basic_block(context &ctx, const std::string &name, function *parent):
basic_block::basic_block(context &ctx, const std::string &name, function *parent, basic_block* next):
value(type::get_label_ty(ctx), name), ctx_(ctx), parent_(parent) {
if(parent_)
parent_->insert_block(this);
parent_->insert_block(this, next);
}
basic_block* basic_block::create(context &ctx, const std::string &name, function *parent){
return new basic_block(ctx, name, parent);
basic_block* basic_block::create(context &ctx, const std::string &name, function *parent, basic_block* next){
return new basic_block(ctx, name, parent, next);
}
void basic_block::add_predecessor(basic_block *pred) {
preds_.push_back(pred);
if(pred)
pred->succs_.push_back(this);
void basic_block::replace_phi_uses_with(basic_block* before, basic_block* after) {
for(ir::instruction* i: inst_list_){
auto* curr_phi = dynamic_cast<ir::phi_node*>(i);
if(!curr_phi)
break;
// curr_phi->replace_uses_of_with(before, after);
for (size_t idx = 0; idx < curr_phi->get_num_incoming(); ++idx)
if (curr_phi->get_incoming_block(idx) == before)
curr_phi->set_incoming_block(idx, after);
}
}
void basic_block::append_instruction(ir::instruction* i){
i->set_parent(this);
inst_list_.push_back(i);
}
basic_block* basic_block::split_before(ir::instruction* loc, const std::string& name) {
basic_block* ret = basic_block::create(ctx_, name, parent_, this);
ret->set_name(get_name());
set_name("after_" + name);
// splice instruction list
auto loc_it = std::find(inst_list_.begin(), inst_list_.end(), loc);
ret->get_inst_list().splice(ret->get_inst_list().begin(), inst_list_, inst_list_.begin(), loc_it);
for(ir::instruction* i: ret->get_inst_list())
i->set_parent(ret);
// the predecessors of `this` becomes the predecessors of `ret`
for(ir::basic_block* pred: get_predecessors()){
auto* term = dynamic_cast<ir::terminator_inst*>(pred->get_inst_list().back());
assert(term);
term->replace_uses_of_with(this, ret);
replace_phi_uses_with(pred, ret);
}
ir::branch_inst* br = branch_inst::create(this);
ret->append_instruction(br);
return ret;
}
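split_before moves everything before loc into a fresh predecessor block and rewires control flow around it. A worked illustration (i0, i1, i2 are hypothetical instructions):
// given B ("entry") = [i0, i1, loc, i2], calling B->split_before(loc, "x") yields:
//   ret ("entry")   = [i0, i1, br -> B]   <- created before B; takes B's old name
//   B   ("after_x") = [loc, i2]
// every predecessor terminator and phi incoming block that pointed at B is
// redirected to ret, so ret transparently takes B's place in the CFG.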
std::vector<basic_block*> basic_block::get_predecessors() const {
std::vector<basic_block*> ret;
for(ir::user* u: users_)
if(auto term = dynamic_cast<ir::terminator_inst*>(u))
ret.push_back(term->get_parent());
return ret;
}
std::vector<basic_block*> basic_block::get_successors() const {
std::vector<basic_block*> ret;
for(ir::instruction* i: inst_list_)
for(ir::value* v: i->ops())
if(auto block = dynamic_cast<ir::basic_block*>(v))
ret.push_back(block);
return ret;
}
basic_block::iterator basic_block::get_first_non_phi(){
auto it = begin();

View File

@@ -48,10 +48,10 @@ void builder::set_insert_point(basic_block *block){
value *builder::get_int1(bool val)
{ return constant_int::get(type::get_int1_ty(ctx_), val); }
value *builder::get_int32(int32_t val)
value *builder::get_int32(uint32_t val)
{ return constant_int::get(type::get_int32_ty(ctx_), val);}
value *builder::get_int64(int64_t val)
value *builder::get_int64(uint64_t val)
{ return constant_int::get(type::get_int64_ty(ctx_), val);}
value *builder::get_float16(float val)
@@ -60,6 +60,9 @@ value *builder::get_float16(float val)
value *builder::get_float32(float val)
{ return constant_fp::get(type::get_fp32_ty(ctx_), val); }
value *builder::get_float64(float val)
{ return constant_fp::get(type::get_fp64_ty(ctx_), val); }
value *builder::get_range(int32_t _lo, int32_t _hi) {
constant_int* lo = static_cast<constant_int*>(get_int32(_lo));
constant_int* hi = static_cast<constant_int*>(get_int32(_hi));
@@ -84,9 +87,15 @@ type *builder::get_int32_ty()
type *builder::get_int64_ty()
{ return type::get_int64_ty(ctx_); }
type *builder::get_fp8_ty()
{ return type::get_fp8_ty(ctx_); }
type *builder::get_half_ty()
{ return type::get_fp16_ty(ctx_); }
type *builder::get_bf16_ty()
{ return type::get_bf16_ty(ctx_); }
type *builder::get_float_ty()
{ return type::get_fp32_ty(ctx_); }
@@ -99,13 +108,10 @@ type *builder::get_double_ty()
//===----------------------------------------------------------------------===//
value* builder::create_br(basic_block *dest){
dest->add_predecessor(block_);
return insert(branch_inst::create(dest));
}
value* builder::create_cond_br(value *cond, basic_block *if_dest, basic_block *else_dest){
if_dest->add_predecessor(block_);
else_dest->add_predecessor(block_);
return insert(branch_inst::create(cond, if_dest, else_dest));
}
@@ -113,6 +119,18 @@ value *builder::create_ret_void() {
return insert(return_inst::create(ctx_));
}
value *builder::create_ret(value* val) {
return insert(return_inst::create(ctx_, val));
}
//===----------------------------------------------------------------------===//
// dequantize instructions
//===----------------------------------------------------------------------===//
value* builder::create_dequantize(value *src, value *scale, value *shift, type *dst_ty){
return insert(dequantize_inst::create(src, scale, shift, dst_ty));
}
//===----------------------------------------------------------------------===//
// cast instructions
//===----------------------------------------------------------------------===//
@@ -121,6 +139,8 @@ value *builder::create_ret_void() {
return create_cast(OPCODE, src, dst_ty);\
}
DEFINE_CAST_INSTR(bitcast, cast_op_t::BitCast)
DEFINE_CAST_INSTR(int_to_ptr, cast_op_t::IntToPtr)
DEFINE_CAST_INSTR(ptr_to_int, cast_op_t::PtrToInt)
DEFINE_CAST_INSTR(si_to_fp, cast_op_t::SIToFP)
DEFINE_CAST_INSTR(ui_to_fp, cast_op_t::UIToFP)
@@ -145,6 +165,19 @@ phi_node* builder::create_phi(type *ty, unsigned num_reserved){
return insert(phi_node::create(ty, num_reserved));
}
//===----------------------------------------------------------------------===//
// call instructions
//===----------------------------------------------------------------------===//
value *builder::create_call(function* fn, const std::vector<value*>& args){
return insert(call_inst::create(fn, args));
}
value* builder::create_launch(function* fn, const std::vector<value*>& args, const std::vector<value*>& grid, value* num_warps){
return insert(launch_inst::create(fn, args, grid, num_warps));
}
//===----------------------------------------------------------------------===//
// binary float instructions
//===----------------------------------------------------------------------===//
@@ -273,22 +306,35 @@ DEFINE_FCMP_INSTR(UNE, cmp_pred_t::FCMP_UNE)
// load/store instructions
//===----------------------------------------------------------------------===//
value *builder::create_load(value *ptr){
return insert(unmasked_load_inst::create(ptr));
value *builder::create_load(value *ptr, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile){
return insert(unmasked_load_inst::create(ptr, cache, eviction, is_volatile));
}
value *builder::create_store(value *ptr, value *val){
return insert(unmasked_store_inst::create(ptr, val));
value *builder::create_store(value *ptr, value *val, store_inst::EVICTION_POLICY eviction){
return insert(unmasked_store_inst::create(ptr, val, eviction));
}
value *builder::create_masked_load(value *ptr, value *mask, value *false_value){
return insert(masked_load_inst::create(ptr, mask, false_value));
value *builder::create_masked_load(value *ptr, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile){
return insert(masked_load_inst::create(ptr, mask, false_value, cache, eviction, is_volatile));
}
value *builder::create_masked_store(value *ptr, value *val, value *mask){
return insert(masked_store_inst::create(ptr, val, mask));
value *builder::create_masked_store(value *ptr, value *val, value *mask, store_inst::EVICTION_POLICY eviction){
return insert(masked_store_inst::create(ptr, val, mask, eviction));
}
//===----------------------------------------------------------------------===//
// struct instructions
//===----------------------------------------------------------------------===//
// Struct instructions
value *builder::create_insert_value(value* val, value *elt, size_t idx){
return insert(insert_value_inst::create(val, elt, idx));
}
value *builder::create_extract_value(value* val, size_t idx) {
return insert(extract_value_inst::create(val, idx));
}
//===----------------------------------------------------------------------===//
// block instructions
//===----------------------------------------------------------------------===//
@@ -297,6 +343,10 @@ value *builder::create_reshape(value *arg, const type::block_shapes_t &shapes) {
return insert(reshape_inst::create(arg, shapes));
}
value *builder::create_cat(value *lhs, value *rhs) {
return insert(cat_inst::create(lhs, rhs));
}
value *builder::create_splat(value *arg, const type::block_shapes_t &shapes) {
return insert(splat_inst::create(arg, shapes));
}
@@ -309,6 +359,50 @@ value *builder::create_downcast(value *arg) {
return insert(downcast_inst::create(arg));
}
//
value *builder::create_atomic_rmw(ir::atomic_rmw_op_t op, value *ptr, value *val, value *msk){
return insert(atomic_rmw_inst::create(op, ptr, val, msk));
}
#define DEFINE_ATOMIC_RMW_INSTR(SUFFIX, OPCODE)\
value *builder::create_ ## SUFFIX(value *ptr, value *val, value *mask){\
return create_atomic_rmw(OPCODE, ptr, val, mask);\
}
DEFINE_ATOMIC_RMW_INSTR(atomic_max, ir::atomic_rmw_op_t::Max)
DEFINE_ATOMIC_RMW_INSTR(atomic_umax, ir::atomic_rmw_op_t::UMax)
DEFINE_ATOMIC_RMW_INSTR(atomic_min, ir::atomic_rmw_op_t::Min)
DEFINE_ATOMIC_RMW_INSTR(atomic_umin, ir::atomic_rmw_op_t::UMin)
DEFINE_ATOMIC_RMW_INSTR(atomic_fadd, ir::atomic_rmw_op_t::FAdd)
DEFINE_ATOMIC_RMW_INSTR(atomic_add, ir::atomic_rmw_op_t::Add)
DEFINE_ATOMIC_RMW_INSTR(atomic_and, ir::atomic_rmw_op_t::And)
DEFINE_ATOMIC_RMW_INSTR(atomic_or, ir::atomic_rmw_op_t::Or)
DEFINE_ATOMIC_RMW_INSTR(atomic_xor, ir::atomic_rmw_op_t::Xor)
DEFINE_ATOMIC_RMW_INSTR(atomic_xchg, ir::atomic_rmw_op_t::Xchg)
// Utilities
value *builder::create_clock() {
return insert(clock_inst::create(ctx_));
}
value *builder::create_globaltimer() {
return insert(globaltimer_inst::create(ctx_));
}
//===----------------------------------------------------------------------===//
// externs
//===----------------------------------------------------------------------===//
value *builder::create_extern_elementwise(const std::string &lib_name,
const std::string &lib_path,
const std::string &symbol_name,
const std::vector<value *> &args,
type *ret_ty) {
return insert(extern_elementwise_inst::create(ctx_, args, ret_ty, lib_name,
lib_path, symbol_name));
}
//===----------------------------------------------------------------------===//
// built-in instructions
//===----------------------------------------------------------------------===//
@@ -325,9 +419,6 @@ value *builder::create_atomic_cas(value *ptr, value *cmp, value *val){
return insert(atomic_cas_inst::create(ptr, cmp, val));
}
value *builder::create_atomic_rmw(ir::atomic_rmw_op_t op, value *ptr, value *val, value *msk){
return insert(atomic_rmw_inst::create(op, ptr, val, msk));
}
value *builder::create_exp(value *arg){
return insert(exp_inst::create(arg));
@@ -345,8 +436,8 @@ value *builder::create_log(value *arg){
return insert(log_inst::create(arg));
}
value *builder::create_dot(value *A, value *B, value *C) {
return insert(dot_inst::create_nn(A, B, C));
value *builder::create_dot(value *A, value *B, value *C, bool trans_a, bool trans_b, bool allow_tf32) {
return insert(dot_inst::create(A, B, C, trans_a, trans_b, allow_tf32));
}
value *builder::create_trans(value *A, const std::vector<int>& perm) {
@@ -369,6 +460,9 @@ value *builder::create_select(value *pred, value *if_value, value *else_value){
// intrinsic instructions
//===----------------------------------------------------------------------===//
value *builder::create_umulhi(value *lhs, value *rhs) {
return insert(umulhi_inst::create(lhs, rhs));
}
value *builder::create_copy_to_shared(value *arg) {
return insert(copy_to_shared_inst::create(arg));
@@ -379,8 +473,8 @@ value *builder::create_copy_from_shared(value *arg) {
return insert(copy_from_shared_inst::create(arg));
}
value *builder::create_masked_load_async(value *ptr, value *mask, value *false_value) {
return insert(masked_load_async_inst::create(ptr, mask, false_value));
value *builder::create_masked_load_async(value *ptr, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction) {
return insert(masked_load_async_inst::create(ptr, mask, false_value, cache, eviction));
}
value *builder::create_barrier(const std::string &name) {

View File

@@ -18,6 +18,8 @@ constant *constant::get_null_value(type *ty) {
return constant_int::get(ty, 0);
case type::FP16TyID:
return constant_fp::get(type::get_fp16_ty(ctx), 0);
case type::BF16TyID:
return constant_fp::get(type::get_bf16_ty(ctx), 0);
case type::FP32TyID:
return constant_fp::get(type::get_fp32_ty(ctx), 0);
case type::FP64TyID:
@@ -47,10 +49,10 @@ constant_int *constant_int::get(type *ty, uint64_t value) {
if (!ty->is_integer_ty())
throw std::runtime_error("Cannot create constant_int with non integer ty");
context_impl *impl = ty->get_context().p_impl.get();
constant_int *& cst = impl->int_constants_[std::make_pair(ty, value)];
if(cst == nullptr)
cst = new constant_int(ty, value);
return cst;
std::unique_ptr<constant_int> &cst = impl->int_constants_[std::make_pair(ty, value)];
if(!cst)
cst.reset(new constant_int(ty, value));
return cst.get();
}
@@ -73,10 +75,10 @@ constant *constant_fp::get_zero_value_for_negation(type *ty) {
constant *constant_fp::get(type *ty, double v){
context_impl *impl = ty->get_context().p_impl.get();
constant_fp *&result = impl->fp_constants_[std::make_pair(ty, v)];
std::unique_ptr<constant_fp> &result = impl->fp_constants_[std::make_pair(ty, v)];
if(!result)
result = new constant_fp(ty, v);
return result;
result.reset(new constant_fp(ty, v));
return result.get();
}
@@ -86,10 +88,10 @@ undef_value::undef_value(type *ty)
undef_value *undef_value::get(type *ty) {
context_impl *impl = ty->get_context().p_impl.get();
undef_value *&result = impl->uv_constants_[ty];
std::unique_ptr<undef_value> &result = impl->uv_constants_[ty];
if(!result)
result = new undef_value(ty);
return result;
result.reset(new undef_value(ty));
return result.get();
}
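These hunks move the interning caches from raw owning pointers to std::unique_ptr, so context_impl now releases the constants it caches. A standalone sketch of the interning pattern (types simplified for illustration):
#include <map>
#include <memory>
struct constant_int { int value; explicit constant_int(int v) : value(v) {} };
std::map<int, std::unique_ptr<constant_int>> cache;
constant_int *get(int value) {
  std::unique_ptr<constant_int> &cst = cache[value]; // default-constructs an empty slot
  if (!cst)
    cst.reset(new constant_int(value)); // created once, owned by the cache
  return cst.get();                     // callers borrow; the cache frees on teardown
}
int main() { return get(42) == get(42) ? 0 : 1; } // interned: same pointer both times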
/* global value */

View File

@@ -24,9 +24,7 @@ context_impl::context_impl(context &ctx)
int16_ty(ctx, 16),
int32_ty(ctx, 32),
int64_ty(ctx, 64),
int128_ty(ctx, 128){
}
int128_ty(ctx, 128) {}
//===----------------------------------------------------------------------===//
// context

View File

@@ -1,754 +0,0 @@
#include "triton/ir/dispatch.h"
#include <iostream>
namespace triton{
namespace ir{
ir::value* throw_unreachable(std::string key) {
throw std::runtime_error("Encountered unimplemented code path in `" + key + "`. "
"This is likely a bug on our side.");
return 0;
}
//===----------------------------------------------------------------------===//
// Programming Model
//===----------------------------------------------------------------------===//
ir::value *dispatch::program_id(int axis, ir::builder *builder) {
return builder->create_get_program_id(axis);
}
ir::value *dispatch::num_programs(int axis, ir::builder *builder) {
return builder->create_get_num_programs(axis);
}
//===----------------------------------------------------------------------===//
// Implicit Casting Utilities
//===----------------------------------------------------------------------===//
ir::type *integer_promote(ir::type* a_ty, ir::type* b_ty){
int a_rank = a_ty->get_integer_bitwidth();
int b_rank = b_ty->get_integer_bitwidth();
return a_rank > b_rank ? a_ty : b_ty;
}
ir::type *computation_type(ir::type* a_ty, ir::type* b_ty){
context &ctx = a_ty->get_context();
// 1) if one operand is double, the other is implicitly
// converted to double
if(a_ty->is_fp64_ty() || b_ty->is_fp64_ty())
return type::get_fp64_ty(ctx);
// 2) if one operand is float, the other is implicitly
// converted to float
if(a_ty->is_fp32_ty() || b_ty->is_fp32_ty())
return type::get_fp32_ty(ctx);
// 3) if one operand is half, the other is implicitly
// converted to half
if(a_ty->is_fp16_ty() || b_ty->is_fp16_ty())
return type::get_fp16_ty(ctx);
if(!a_ty->is_integer_ty() || !b_ty->is_integer_ty())
throw_unreachable("augment_types");
// 4) both operands are integer and undergo
// integer promotion
return integer_promote(a_ty, b_ty);
}
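Concretely, the four rules give (examples added for illustration):
// fp64  op fp32  -> fp64    rule 1
// fp32  op int32 -> fp32    rule 2
// fp16  op int64 -> fp16    rule 3 (note: the integer is cast down to fp16)
// int32 op int64 -> int64   rule 4: integer promotion keeps the wider type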
//===----------------------------------------------------------------------===//
// Binary Operators
//===----------------------------------------------------------------------===//
void throw_incompatible_types(ir::type* type_a, ir::type* type_b) {
throw semantic_error("invalid operands of type " + type_a->repr() + " and " + type_b->repr());
}
void check_ptr_type(ir::type* type_a, ir::type* type_b, bool allow_ptr_a){
if(type_a->is_pointer_ty()){
if(!allow_ptr_a)
throw_incompatible_types(type_a, type_b);
// T* + U* with T != U
if(type_b->is_pointer_ty() && (type_a != type_b))
throw_incompatible_types(type_a, type_b);
// T* + float
if(type_b->is_floating_point_ty())
throw_incompatible_types(type_a, type_b);
}
}
void binary_op_type_checking(ir::value*& lhs, ir::value*& rhs, ir::builder* builder,
bool allow_lhs_ptr = false, bool allow_rhs_ptr = false,
bool arithmetic_check = true){
// implicit broadcasting
std::tie(lhs, rhs) = dispatch::broadcast(lhs, rhs, builder);
// implicit typecasting
ir::type *lhs_sca_ty = lhs->get_type()->get_scalar_ty();
ir::type *rhs_sca_ty = rhs->get_type()->get_scalar_ty();
check_ptr_type(lhs_sca_ty, rhs_sca_ty, allow_lhs_ptr);
check_ptr_type(rhs_sca_ty, lhs_sca_ty, allow_rhs_ptr);
if(arithmetic_check && !lhs_sca_ty->is_pointer_ty() && !rhs_sca_ty->is_pointer_ty()){
ir::type *ret_sca_ty = computation_type(lhs_sca_ty, rhs_sca_ty);
lhs = dispatch::cast(lhs, ret_sca_ty, builder);
rhs = dispatch::cast(rhs, ret_sca_ty, builder);
}
}
ir::value *dispatch::add(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder, true, true);
ir::type *input_scalar_ty = input->get_type()->get_scalar_ty();
ir::type *other_scalar_ty = other->get_type()->get_scalar_ty();
// offset + ptr
// ptr + offset
if(other_scalar_ty->is_pointer_ty() && !input_scalar_ty->is_pointer_ty())
std::swap(input, other);
if (input_scalar_ty->is_pointer_ty())
return builder->create_gep(input, {other});
// float + float
else if (input_scalar_ty->is_floating_point_ty())
return builder->create_fadd(input, other);
// int + int
else if (input_scalar_ty->is_integer_ty())
return builder->create_add(input, other);
return throw_unreachable("add");
}
ir::value *dispatch::sub(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder, true, false);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// ptr - offset
if (scalar_ty->is_pointer_ty())
return builder->create_gep(input, {dispatch::minus(other, builder)});
// float - float
if (scalar_ty->is_floating_point_ty())
return builder->create_fsub(input, other);
// int - int
else if (scalar_ty->is_integer_ty())
return builder->create_sub(input, other);
return throw_unreachable("sub");
}
ir::value *dispatch::mul(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float * float
if (scalar_ty->is_floating_point_ty())
return builder->create_fmul(input, other);
// int * int
else if (scalar_ty->is_integer_ty())
return builder->create_mul(input, other);
return throw_unreachable("mul");
}
ir::value *dispatch::truediv(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder, false, false, false);
ir::type *input_scalar_ty = input->get_type()->get_scalar_ty();
ir::type *other_scalar_ty = other->get_type()->get_scalar_ty();
// float / int
if(input_scalar_ty->is_floating_point_ty() && other_scalar_ty->is_integer_ty())
other = cast(other, input_scalar_ty, builder);
// int / float
else if(input_scalar_ty->is_integer_ty() && other_scalar_ty->is_floating_point_ty())
input = cast(input, other_scalar_ty, builder);
// int / int (cast to float32)
else if(input_scalar_ty->is_integer_ty() && other_scalar_ty->is_integer_ty()){
input = cast(input, builder->get_float_ty(), builder);
other = cast(other, builder->get_float_ty(), builder);
}
// float / float (cast to highest exponent type)
else if(input_scalar_ty->is_floating_point_ty() && other_scalar_ty->is_floating_point_ty()){
if(input_scalar_ty->get_fp_mantissa_width() > other_scalar_ty->get_fp_mantissa_width())
other = cast(other, input_scalar_ty, builder);
else
input = cast(input, other_scalar_ty, builder);
}
// unreachable
else
return throw_unreachable("div");
return builder->create_fdiv(input, other);
}
ir::value *dispatch::floordiv(ir::value *input, ir::value *other, ir::builder *builder){
binary_op_type_checking(input, other, builder, false, false, false);
ir::type *input_scalar_ty = input->get_type()->get_scalar_ty();
ir::type *other_scalar_ty = other->get_type()->get_scalar_ty();
if(input_scalar_ty->is_integer_ty() && other_scalar_ty->is_integer_ty()){
ir::type *ret_ty = integer_promote(input_scalar_ty, other_scalar_ty);
input = dispatch::cast(input, ret_ty, builder);
other = dispatch::cast(other, ret_ty, builder);
return builder->create_sdiv(input, other);
}
return throw_unreachable("floordiv");
}
ir::value *dispatch::mod(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float % int
if (scalar_ty->is_floating_point_ty())
return builder->create_frem(input, other);
// int % int
else if (scalar_ty->is_integer_ty())
return builder->create_srem(input, other);
return throw_unreachable("mod");
}
void bitwise_op_type_checking(ir::value *&input, ir::value *&other, ir::builder *builder, bool force_lhs_type = false){
binary_op_type_checking(input, other, builder, false, false, false);
ir::type *input_sca_ty = input->get_type()->get_scalar_ty();
ir::type *other_sca_ty = other->get_type()->get_scalar_ty();
if(!input_sca_ty->is_integer_ty() || !other_sca_ty->is_integer_ty())
throw_incompatible_types(input_sca_ty, other_sca_ty);
// for some reason PyTorch gives the result of a binary op the type of the lhs...
if(force_lhs_type){
if(input_sca_ty->get_integer_bitwidth() != other_sca_ty->get_integer_bitwidth())
other = dispatch::cast(other, input_sca_ty, builder);
}
else{
if(input_sca_ty->get_integer_bitwidth() < other_sca_ty->get_integer_bitwidth())
input = dispatch::cast(input, other_sca_ty, builder);
else if(other_sca_ty->get_integer_bitwidth() < input_sca_ty->get_integer_bitwidth())
other = dispatch::cast(other, input_sca_ty, builder);
}
}
ir::value *dispatch::and_(ir::value *input, ir::value *other, ir::builder *builder) {
bitwise_op_type_checking(input, other, builder, true);
return builder->create_and(input, other);
}
ir::value *dispatch::or_(ir::value *input, ir::value *other, ir::builder *builder) {
bitwise_op_type_checking(input, other, builder, true);
return builder->create_or(input, other);
}
ir::value *dispatch::xor_(ir::value *input, ir::value *other, ir::builder *builder) {
bitwise_op_type_checking(input, other, builder, true);
return builder->create_xor(input, other);
}
ir::value *dispatch::lshr(ir::value *input, ir::value *other, ir::builder *builder) {
bitwise_op_type_checking(input, other, builder, false);
return builder->create_lshr(input, other);
}
ir::value *dispatch::shl(ir::value *input, ir::value *other, ir::builder *builder) {
bitwise_op_type_checking(input, other, builder, false);
return builder->create_shl(input, other);
}
//===----------------------------------------------------------------------===//
// Unary Operators
//===----------------------------------------------------------------------===//
ir::value *dispatch::plus(ir::value *input, ir::builder *) {
return input;
}
ir::value *dispatch::minus(ir::value *input, ir::builder *builder) {
ir::type* input_sca_ty = input->get_type()->get_scalar_ty();
if(input_sca_ty->is_pointer_ty())
throw semantic_error("wrong type argument to unary minus (" + input_sca_ty->repr() + ")");
ir::value *_0 = ir::constant::get_null_value(input_sca_ty);
return dispatch::sub(_0, input, builder);
}
ir::value *dispatch::invert(ir::value *input, ir::builder *builder) {
ir::type* input_sca_ty = input->get_type()->get_scalar_ty();
if(input_sca_ty->is_pointer_ty() || input_sca_ty->is_floating_point_ty())
throw semantic_error("wrong type argument to unary invert (" + input_sca_ty->repr() + ")");
ir::value *_1 = ir::constant::get_all_ones_value(input_sca_ty);
return dispatch::xor_(input, _1, builder);
}
//===----------------------------------------------------------------------===//
// Comparison Operators
//===----------------------------------------------------------------------===//
ir::value *dispatch::greater_than(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float > float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpOGT(input, other);
// int > int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpSGT(input, other);
return throw_unreachable("greater_than");
}
ir::value *dispatch::greater_equal(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float >= float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpOGE(input, other);
// int >= int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpSGE(input, other);
return throw_unreachable("greater_equal");
}
ir::value *dispatch::less_than(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float < float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpOLT(input, other);
// int < int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpSLT(input, other);
return throw_unreachable("less_than");
}
ir::value *dispatch::less_equal(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float <= float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpOLE(input, other);
// int <= int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpSLE(input, other);
return throw_unreachable("less_equal");
}
ir::value *dispatch::equal(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float == float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpOEQ(input, other);
// int == int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpEQ(input, other);
return throw_unreachable("equal");
}
ir::value *dispatch::not_equal(ir::value *input, ir::value *other, ir::builder *builder) {
binary_op_type_checking(input, other, builder);
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// float != float
if (scalar_ty->is_floating_point_ty())
return builder->create_fcmpUNE(input, other);
// int != int
else if (scalar_ty->is_integer_ty())
return builder->create_icmpNE(input, other);
return throw_unreachable("equal");
}
//===----------------------------------------------------------------------===//
// Block Creation
//===----------------------------------------------------------------------===//
ir::value* dispatch::arange(int start, int end, ir::builder *builder) {
return builder->get_range(start, end);
}
ir::value* dispatch::zeros(shape_t shape, ir::type *dtype, ir::builder *builder) {
ir::value *_0 = ir::constant::get_null_value(dtype);
return builder->create_splat(_0, shape);
}
//===----------------------------------------------------------------------===//
// Shape Manipulation
//===----------------------------------------------------------------------===//
ir::value *dispatch::reshape(ir::value *input, shape_t dst_shape, ir::builder *builder) {
unsigned numel = 1;
for(unsigned s: dst_shape) numel *= s;
if(input->get_type()->get_tile_num_elements() != numel)
throw semantic_error("cannot reshape block of different shape");
return builder->create_reshape(input, dst_shape);
}
ir::value *dispatch::broadcast(ir::value *input, shape_t shape, ir::builder *builder) {
if (!input->get_type()->is_block_ty())
return builder->create_splat(input, shape);
auto src_shape = input->get_type()->get_block_shapes();
if (src_shape.size() != shape.size())
throw std::runtime_error("Cannot broadcast");
if(shape == src_shape)
return input;
return builder->create_broadcast(input, shape);
}
std::tuple<ir::value*, ir::value*> dispatch::broadcast(ir::value *lhs, ir::value* rhs, ir::builder *builder) {
ir::type *lhs_ty = lhs->get_type();
ir::type *rhs_ty = rhs->get_type();
// make_shape_compatible(block, scalar)
if (lhs_ty->is_block_ty() && !rhs_ty->is_block_ty())
rhs = builder->create_splat(rhs, lhs_ty->get_block_shapes());
// make_shape_compatible(scalar, block)
else if (!lhs_ty->is_block_ty() && rhs_ty->is_block_ty())
lhs = builder->create_splat(lhs, rhs_ty->get_block_shapes());
// make_shape_compatible(block, block)
else if (lhs_ty->is_block_ty() && rhs_ty->is_block_ty()) {
auto lhs_shape = lhs_ty->get_block_shapes();
auto rhs_shape = rhs_ty->get_block_shapes();
if (lhs_shape.size() != rhs_shape.size())
throw std::runtime_error("Cannot make_shape_compatible: blocks must have the same rank");
ir::type::block_shapes_t ret_shape;
for (size_t i = 0; i < lhs_shape.size(); ++i) {
unsigned left = lhs_shape[i];
unsigned right = rhs_shape[i];
if (left == 1)
ret_shape.push_back(right);
else if (right == 1)
ret_shape.push_back(left);
else if (left == right)
ret_shape.push_back(left);
else
throw std::runtime_error("Cannot make_shape_compatible: incompatible dimensions at index " + std::to_string(i) +
": " + std::to_string(left) + " and " + std::to_string(right));
}
if (lhs_shape != ret_shape)
lhs = builder->create_broadcast(lhs, ret_shape);
if (rhs_shape != ret_shape)
rhs = builder->create_broadcast(rhs, ret_shape);
}
return std::make_tuple(lhs, rhs);
}
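// Worked example of the rule above (illustrative, not library code): shapes are
// compared index by index, and a dimension of 1 stretches to match the other
// operand.
//   lhs {1, 8} with rhs {4, 1} -> both are broadcast to {4, 8}
//   lhs {4, 8} with rhs {3, 8} -> "incompatible dimensions at index 0: 4 and 3"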
ir::value *dispatch::bitcast(ir::value *input, ir::type *dst_ty, ir::builder *builder){
ir::type *src_ty = input->get_type();
if (src_ty->is_block_ty())
dst_ty = ir::block_type::get(dst_ty, input->get_type()->get_block_shapes());
if(src_ty == dst_ty)
return input;
ir::type *src_sca_ty = src_ty->get_scalar_ty();
ir::type *dst_sca_ty = dst_ty->get_scalar_ty();
if(src_sca_ty->is_pointer_ty() || dst_sca_ty->is_pointer_ty())
return cast(input, dst_ty, builder);
// Bitcast
int src_bits = src_sca_ty->get_primitive_size_in_bits();
int dst_bits = dst_sca_ty->get_primitive_size_in_bits();
if(src_bits != dst_bits)
throw std::runtime_error("Cannot bitcast data-type of size " + std::to_string(src_bits) +
" to data-type of size " + std::to_string(dst_bits));
return builder->create_cast(ir::BitCast, input, dst_ty);
}
ir::value *dispatch::cast(ir::value *input, ir::type *dst_ty, ir::builder *builder) {
ir::type *src_ty = input->get_type();
if (src_ty->is_block_ty())
dst_ty = ir::block_type::get(dst_ty, input->get_type()->get_block_shapes());
if(src_ty == dst_ty)
return input;
ir::type *src_sca_ty = src_ty->get_scalar_ty();
ir::type *dst_sca_ty = dst_ty->get_scalar_ty();
// FP Truncation
bool truncate_fp = src_sca_ty->is_floating_point_ty() &&
dst_sca_ty->is_floating_point_ty() &&
src_sca_ty->get_fp_mantissa_width() > dst_sca_ty->get_fp_mantissa_width();
if (truncate_fp)
return builder->create_fp_trunc(input, dst_ty);
// FP Extension
bool ext_fp = src_sca_ty->is_floating_point_ty() &&
dst_sca_ty->is_floating_point_ty() &&
src_sca_ty->get_fp_mantissa_width() < dst_sca_ty->get_fp_mantissa_width();
if (ext_fp)
return builder->create_fp_ext(input, dst_ty);
// Int cast
if (src_sca_ty->is_integer_ty() && dst_sca_ty->is_integer_ty() &&
src_sca_ty->get_integer_bitwidth() != dst_sca_ty->get_integer_bitwidth())
return builder->create_int_cast(input, dst_ty, src_sca_ty != builder->get_int1_ty());
// Float -> Int
if (src_sca_ty->is_floating_point_ty() && dst_sca_ty->is_integer_ty()){
if(dst_sca_ty->is_bool_ty())
return builder->create_fp_to_ui(input, dst_ty);
else
return builder->create_fp_to_si(input, dst_ty);
}
// int -> Float
if (src_sca_ty->is_integer_ty() && dst_sca_ty->is_floating_point_ty()){
if(src_sca_ty->is_bool_ty())
return builder->create_ui_to_fp(input, dst_ty);
else
return builder->create_si_to_fp(input, dst_ty);
}
if (src_sca_ty->is_pointer_ty() && !dst_sca_ty->is_pointer_ty())
return builder->create_cast(ir::PtrToInt, input, dst_ty);
if (!src_sca_ty->is_pointer_ty() && dst_sca_ty->is_pointer_ty())
return builder->create_cast(ir::IntToPtr, input, dst_ty);
// Ptr -> Ptr
if (src_sca_ty->is_pointer_ty() && dst_sca_ty->is_pointer_ty())
return builder->create_cast(ir::BitCast, input, dst_ty);
// * -> Bool
if (dst_sca_ty->is_bool_ty()) {
if (src_sca_ty->is_pointer_ty())
input = cast(input, builder->get_int64_ty(), builder);
ir::value *other = builder->get_int64(0);
// splat the scalar zero so the comparison below type-checks for block inputs
if (src_ty->is_block_ty())
other = builder->create_splat(other, src_ty->get_block_shapes());
return builder->create_icmpNE(input, other);
}
return throw_unreachable("cast");
}
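// Worked examples of the cast lattice above (illustrative):
//   fp32 -> fp16 : mantissa width 23 > 10, so fp_trunc
//   bf16 -> fp32 : mantissa width 7 < 23, so fp_ext
//   i1   -> i32  : int cast with is_signed == false (zero-extension)
//   i8   -> i32  : int cast with is_signed == true  (sign-extension)
//   fp32 -> bool : handled by the float -> int branch, so fp_to_ui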
//===----------------------------------------------------------------------===//
// Memory Operators
//===----------------------------------------------------------------------===//
ir::value *dispatch::load(ir::value* ptr, ir::value* mask, ir::value* other, ir::builder* builder) {
if(!ptr->get_type()->get_scalar_ty()->is_pointer_ty())
throw semantic_error("Pointer argument of load instruction is " + ptr->get_type()->repr());
if(ptr->get_type()->is_block_ty()){
if(mask){
mask = dispatch::broadcast(mask, ptr->get_type()->get_block_shapes(), builder);
}
if(other){
other = dispatch::broadcast(other, ptr->get_type()->get_block_shapes(), builder);
other = dispatch::cast(other, ptr->get_type()->get_scalar_ty()->get_pointer_element_ty(), builder);
}
}
ir::type *ptr_ty = ptr->get_type()->get_scalar_ty();
ir::type *elt_ty = ptr_ty->get_pointer_element_ty();
// treat bool* as int8*
if(elt_ty == builder->get_int1_ty()){
elt_ty = builder->get_int8_ty();
ptr_ty = pointer_type::get(elt_ty, ptr_ty->get_pointer_address_space());
ptr = dispatch::cast(ptr, ptr_ty, builder);
}
if (!mask && !other)
return builder->create_load(ptr);
if (!mask)
throw std::runtime_error("`other` cannot be provided without `mask`");
if(!other){
other = ir::undef_value::get(elt_ty);
if(ptr->get_type()->is_block_ty())
other = builder->create_splat(other, ptr->get_type()->get_block_shapes());
}
return builder->create_masked_load(ptr, mask, other);
}
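// Sketch of the dispatch above (illustrative, not a behavioral guarantee):
//   load(ptr)               -> plain load
//   load(ptr, mask)         -> masked load; masked-off lanes read an undef splat
//   load(ptr, mask, other)  -> masked load; masked-off lanes read `other`
// bool loads are emitted as i8 loads because memory is byte-addressable.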
ir::value *dispatch::store(ir::value* ptr, ir::value *val, ir::value* mask, ir::builder *builder) {
if(!ptr->get_type()->get_scalar_ty()->is_pointer_ty())
throw semantic_error("Pointer argument of store instruction is " + ptr->get_type()->repr());
if(ptr->get_type()->is_block_ty())
val = dispatch::broadcast(val, ptr->get_type()->get_block_shapes(), builder);
if(mask)
mask = dispatch::broadcast(mask, ptr->get_type()->get_block_shapes(), builder);
ir::type *ptr_ty = ptr->get_type()->get_scalar_ty();
ir::type *elt_ty = ptr_ty->get_pointer_element_ty();
// treat bool* as int8*
if(elt_ty == builder->get_int1_ty()){
elt_ty = builder->get_int8_ty();
ptr_ty = pointer_type::get(elt_ty, ptr_ty->get_pointer_address_space());
ptr = dispatch::cast(ptr, ptr_ty, builder);
}
// cast to target data-type
val = dispatch::cast(val, elt_ty, builder);
if (!mask)
return builder->create_store(ptr, val);
if(!mask->get_type()->get_scalar_ty()->is_bool_ty())
throw semantic_error("Mask must have boolean scalar type");
return builder->create_masked_store(ptr, val, mask);
}
ir::value *dispatch::atomic_cas(ir::value* ptr, ir::value *cmp, ir::value *val, ir::builder *builder){
return builder->create_atomic_cas(ptr, cmp, val);
}
void atom_red_typechecking(ir::value*& ptr, ir::value *&val, ir::value *&mask, ir::builder *builder){
if(!ptr->get_type()->get_scalar_ty()->is_pointer_ty())
throw semantic_error("Pointer argument of store instruction is " + ptr->get_type()->repr());
if(ptr->get_type()->is_block_ty()){
if(mask){
mask = dispatch::broadcast(mask, ptr->get_type()->get_block_shapes(), builder);
}
if(val){
val = dispatch::broadcast(val, ptr->get_type()->get_block_shapes(), builder);
}
}
val = dispatch::cast(val, ptr->get_type()->get_scalar_ty()->get_pointer_element_ty(), builder);
if(!mask){
mask = builder->get_int1(true);
if(ptr->get_type()->is_block_ty())
mask = builder->create_splat(mask, ptr->get_type()->get_block_shapes());
}
}
ir::value *dispatch::atomic_max(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
ir::type* sca_ty = val->get_type()->get_scalar_ty();
// direct call to atomic_max for integers
if(sca_ty->is_integer_ty())
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::Max, ptr, val, mask);
// for float
// return atomic_smax(i_ptr, i_val) if val >= 0
// return atomic_umin(i_ptr, i_val) if val < 0
ir::value* i_val = bitcast(val, builder->get_int32_ty(), builder);
ir::value* i_ptr = bitcast(ptr, pointer_type::get(builder->get_int32_ty(), 1), builder);
ir::value* pos = greater_equal(val, constant_fp::get(sca_ty, 0), builder);
ir::value* neg = less_than(val, constant_fp::get(sca_ty, 0), builder);
ir::value* pos_ret = builder->create_atomic_rmw(ir::atomic_rmw_op_t::Max, i_ptr, i_val, and_(mask, pos, builder));
ir::value* neg_ret = builder->create_atomic_rmw(ir::atomic_rmw_op_t::UMin, i_ptr, i_val, and_(mask, neg, builder));
return where(pos, pos_ret, neg_ret, builder);
}
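// Why the integer trick above is sound: for IEEE-754 floats, non-negative
// values order like their signed-integer bit patterns, while negative values
// order in reverse, like their unsigned bit patterns. A host-side float32
// sketch (illustrative only; assumes <cstring>, not used by the dispatcher):
//   static int32_t float_bits(float f) {
//     int32_t i;
//     std::memcpy(&i, &f, sizeof(i)); // reinterpret the bits without UB
//     return i;
//   }
// float_bits(2.0f) > float_bits(1.0f), so a signed Max is correct for val >= 0;
// (uint32_t)float_bits(-1.0f) < (uint32_t)float_bits(-2.0f), so an unsigned
// UMin is correct for val < 0. atomic_min below mirrors these identities.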
ir::value *dispatch::atomic_min(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
ir::type* sca_ty = val->get_type()->get_scalar_ty();
// direct call to atomic_min for integers
if(sca_ty->is_integer_ty())
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::Min, ptr, val, mask);
// for float
// return atomic_smin(i_ptr, i_val) if val >= 0
// return atomic_umax(i_ptr, i_val) if val < 0
ir::value* i_val = bitcast(val, builder->get_int32_ty(), builder);
ir::value* i_ptr = bitcast(ptr, pointer_type::get(builder->get_int32_ty(), 1), builder);
ir::value* pos = greater_equal(val, constant_fp::get(sca_ty, 0), builder);
ir::value* neg = less_than(val, constant_fp::get(sca_ty, 0), builder);
ir::value* pos_ret = builder->create_atomic_rmw(ir::atomic_rmw_op_t::Min, i_ptr, i_val, and_(mask, pos, builder));
ir::value* neg_ret = builder->create_atomic_rmw(ir::atomic_rmw_op_t::UMax, i_ptr, i_val, and_(mask, neg, builder));
return where(pos, pos_ret, neg_ret, builder);
}
ir::value *dispatch::atomic_add(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
ir::type* sca_ty = val->get_type()->get_scalar_ty();
auto op = sca_ty->is_floating_point_ty() ? ir::atomic_rmw_op_t::FAdd : ir::atomic_rmw_op_t::Add;
return builder->create_atomic_rmw(op, ptr, val, mask);
}
ir::value *dispatch::atomic_and(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::And, ptr, val, mask);
}
ir::value *dispatch::atomic_or(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::Or, ptr, val, mask);
}
ir::value *dispatch::atomic_xor(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::Xor, ptr, val, mask);
}
ir::value *dispatch::atomic_xchg(ir::value* ptr, ir::value *val, ir::value *mask, ir::builder *builder){
atom_red_typechecking(ptr, val, mask, builder);
return builder->create_atomic_rmw(ir::atomic_rmw_op_t::Xchg, ptr, val, mask);
}
//===----------------------------------------------------------------------===//
// Linear Algebra
//===----------------------------------------------------------------------===//
ir::value *dispatch::dot(ir::value *lhs, ir::value *rhs, ir::builder *builder) {
ir::value *_0 = builder->get_float32(0);
unsigned M = lhs->get_type()->get_block_shapes()[0];
unsigned N = rhs->get_type()->get_block_shapes()[1];
_0 = builder->create_splat(_0, {M, N});
return builder->create_dot(lhs, rhs, _0);
}
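// Example (illustrative): for lhs of shape {M, K} and rhs of shape {K, N}, the
// accumulator is an {M, N} splat of fp32 zero, so dot accumulates in fp32
// regardless of the input element type.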
//===----------------------------------------------------------------------===//
// Indexing
//===----------------------------------------------------------------------===//
ir::value *dispatch::where(ir::value* condition, ir::value *x, ir::value *y, ir::builder *builder){
condition = dispatch::cast(condition, builder->get_int1_ty(), builder);
if(condition->get_type()->is_block_ty()){
x = dispatch::broadcast(x, condition->get_type()->get_block_shapes(), builder);
y = dispatch::broadcast(y, condition->get_type()->get_block_shapes(), builder);
}
if(x->get_type()->get_scalar_ty() != y->get_type()->get_scalar_ty())
throw_incompatible_types(x->get_type()->get_scalar_ty(), y->get_type()->get_scalar_ty());
return builder->create_select(condition, x, y);
}
//===----------------------------------------------------------------------===//
// Reductions
//===----------------------------------------------------------------------===//
ir::value *reduce_impl(ir::value *input, unsigned int axis, ir::builder *builder, const std::string &name,
ir::reduce_inst::op_t FLOAT_OP, ir::reduce_inst::op_t INT_OP) {
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
// the input is extended to 32 bits if necessary; this increases numerical
// accuracy and is essentially free on GPUs
if(scalar_ty->is_integer_ty() && scalar_ty->get_integer_bitwidth() <= 32)
input = dispatch::cast(input, type::get_int32_ty(scalar_ty->get_context()), builder);
if (scalar_ty->is_floating_point_ty())
return builder->create_reduce(input, FLOAT_OP, axis);
else if (scalar_ty->is_integer_ty())
return builder->create_reduce(input, INT_OP, axis);
return throw_unreachable(name);
}
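// Example (illustrative): summing an i8 block first casts it to i32, so the ADD
// reduction accumulates at 32 bits instead of overflowing the narrow type;
// floating-point inputs go straight to the FADD/FMIN/FMAX reduction ops.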
ir::value *dispatch::min(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "min", ir::reduce_inst::FMIN, ir::reduce_inst::MIN);
}
ir::value *dispatch::max(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "max", ir::reduce_inst::FMAX, ir::reduce_inst::MAX);
}
ir::value *dispatch::sum(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "sum", ir::reduce_inst::FADD, ir::reduce_inst::ADD);
}
//===----------------------------------------------------------------------===//
// Math
//===----------------------------------------------------------------------===//
ir::value *dispatch::exp(ir::value *x, ir::builder *builder) {
return builder->create_exp(x);
}
ir::value *dispatch::log(ir::value *x, ir::builder *builder) {
return builder->create_log(x);
}
ir::value *dispatch::cos(ir::value *x, ir::builder *builder) {
return builder->create_cos(x);
}
ir::value *dispatch::sin(ir::value *x, ir::builder *builder) {
return builder->create_sin(x);
}
ir::value *dispatch::sqrt(ir::value *x, ir::builder *builder) {
return builder->create_sqrt(x);
}
//===----------------------------------------------------------------------===//
// Utilities
//===----------------------------------------------------------------------===//
ir::value *dispatch::multiple_of(ir::value *x, int value, ir::builder *){
ir::instruction* i = dynamic_cast<ir::instruction*>(x);
if(!i)
throw_unreachable("multiple_of");
i->set_metadata(ir::metadata::multiple_of, value);
return i;
}
ir::value *dispatch::max_contiguous(ir::value *x, int value, ir::builder *){
ir::instruction* i = dynamic_cast<ir::instruction*>(x);
if(!i)
throw_unreachable("max_contiguous");
i->set_metadata(ir::metadata::max_contiguous, value);
return i;
}
ir::value *dispatch::debug_barrier(ir::builder *builder) {
return builder->create_barrier();
}
}
}


@@ -33,8 +33,10 @@ void argument::accept(visitor *v) {
/* function */
function::function(function_type *ty, linkage_types_t linkage,
const std::string &name, module *parent)
: global_object(ty, 0, linkage, name), parent_(parent), fn_ty_(ty) {
: global_object(ty, 0, linkage, name), parent_(parent), fn_ty_(ty), is_kernel_(false) {
unsigned num_params = fn_ty_->get_num_params();
if(parent)
parent->push_function(this);
// skip if no parameter
if(num_params == 0)
return;
@@ -44,8 +46,6 @@ function::function(function_type *ty, linkage_types_t linkage,
type *param_ty = fn_ty_->get_param_ty(i);
args_[i] = argument::create(param_ty, "", this, i);
}
if(parent)
parent->push_function(this);
}
/* basic block */


@@ -5,6 +5,7 @@
#include "triton/ir/instructions.h"
#include "triton/ir/constant.h"
#include "triton/ir/type.h"
#include "triton/ir/function.h"
namespace triton{
namespace ir{
@@ -68,6 +69,7 @@ void phi_node::set_incoming_block(unsigned i, basic_block *block){
// Add incoming
void phi_node::add_incoming(value *v, basic_block *block){
assert(v && "PHI node got a null value!!");
resize_ops(get_num_operands() + 1);
blocks_.resize(get_num_operands() + 1);
set_incoming_value(get_num_operands() - 1, v);
@@ -79,6 +81,70 @@ phi_node* phi_node::create(type *ty, unsigned num_reserved, const std::string &n
return new phi_node(ty, num_reserved, name, next);
}
//===----------------------------------------------------------------------===//
// call_inst classes
//===----------------------------------------------------------------------===//
std::string call_inst::repr_impl() const { return "call " + fn_->get_name(); }
call_inst::call_inst(ir::function* fn, const std::vector<ir::value*>& values, const std::string& name, instruction* next)
: instruction(fn->get_fn_type()->get_return_ty(), INST_CALL, values.size(), name, next), fn_(fn){
for(size_t i = 0; i < values.size(); i++)
set_operand(i, values.at(i));
}
call_inst* call_inst::create(ir::function* fn, const std::vector<ir::value*>& values, const std::string &name, instruction *next) {
return new call_inst(fn, values, name, next);
}
// launch
launch_inst::launch_inst(ir::function* fn, const std::vector<ir::value*>& values, const std::vector<ir::value*>& grid, ir::value* num_warps, const std::string& name, instruction* next)
: instruction(fn->get_fn_type()->get_return_ty(), INST_LAUNCH, 1 + values.size() + grid.size() + 1, name, next){
int k = 0;
if(grid.size() != 3)
throw std::runtime_error("grid must have 3 elements");
set_operand(k++, fn);
val_begin = k;
for(ir::value* v: values)
set_operand(k++, v);
val_end = k;
grid_begin = k;
for(ir::value* g: grid)
set_operand(k++, g);
grid_end = k;
set_operand(k++, num_warps);
}
ir::function* launch_inst::get_fn() {
return (ir::function*)get_operand(0);
}
std::vector<ir::value*> launch_inst::get_values() {
std::vector<ir::value*> ret;
for(int i = val_begin; i < val_end; i++)
ret.push_back(get_operand(i));
return ret;
}
std::vector<ir::value*> launch_inst::get_grid() {
std::vector<ir::value*> ret;
for(int i = grid_begin; i < grid_end; i++)
ret.push_back(get_operand(i));
return ret;
}
ir::value* launch_inst::get_num_warps() {
return get_operand(grid_end);
}
launch_inst* launch_inst::create(ir::function *fn, const std::vector<ir::value *> &values, const std::vector<ir::value *> &grid, ir::value *num_warps, const std::string &name, instruction *next) {
return new launch_inst(fn, values, grid, num_warps, name, next);
}
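// Operand layout example (illustrative): with two captured values and a 3-D
// grid, the operands are [fn, v0, v1, g0, g1, g2, num_warps], giving
// val_begin = 1, val_end = 3, grid_begin = 3, grid_end = 6, and num_warps at
// index grid_end.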
//===----------------------------------------------------------------------===//
// binary_operator classes
@@ -134,7 +200,7 @@ bool binary_operator::is_int_add_sub() const {
binary_operator::binary_operator(binary_op_t op, value *lhs, value *rhs, type *ty, const std::string &name, instruction *next)
: instruction(ty, INST_BINOP, 2, name, next), op_(op){
: instruction(ty, INST_BINOP, 2, name, next), op_(op), fdiv_ieee_rnd_(false){
set_operand(0, lhs);
set_operand(1, rhs);
}
@@ -232,6 +298,7 @@ icmp_inst::icmp_inst(type *ty, cmp_pred_t pred,
icmp_inst* icmp_inst::create(cmp_pred_t pred, value *lhs, value *rhs, const std::string &name, instruction *next){
assert(is_int_predicate(pred));
assert(lhs->get_type() == rhs->get_type());
type *res_ty = make_cmp_result_type(lhs->get_type());
return new icmp_inst(res_ty, pred, lhs, rhs, name, next);
}
@@ -256,6 +323,21 @@ unary_inst::unary_inst(type *ty, value_id_t id, value *v, const std::string &nam
set_operand(0, v);
}
//===----------------------------------------------------------------------===//
// dequantize_inst classes
//===----------------------------------------------------------------------===//
dequantize_inst::dequantize_inst(type *ty, value *v, value *scale, value *shift, const std::string &name, instruction *next)
: instruction(ty, INST_DEQUANTIZE, 3, name, next) {
set_operand(0, v);
set_operand(1, scale);
set_operand(2, shift);
}
dequantize_inst *dequantize_inst::create(value *arg, value *scale, value *shift, type *ty, const std::string &name, instruction *next){
return new dequantize_inst(ty, arg, scale, shift, name, next);
}
//===----------------------------------------------------------------------===//
// cast_inst classes
//===----------------------------------------------------------------------===//
@@ -311,8 +393,8 @@ cast_inst *cast_inst::create_integer_cast(value *arg, type *ty, bool is_signed,
unsigned arg_bits = arg_ty->get_scalar_ty()->get_integer_bitwidth();
unsigned dst_bits = ty->get_scalar_ty()->get_integer_bitwidth();
cast_op_t op = (arg_bits == dst_bits ? cast_op_t::BitCast :
(arg_bits > dst_bits ? cast_op_t::Trunc :
(is_signed ? cast_op_t::SExt : cast_op_t::ZExt)));
(arg_bits > dst_bits ? cast_op_t::Trunc :
(is_signed ? cast_op_t::SExt : cast_op_t::ZExt)));
return create(op, arg, ty, name, next);
}
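// Examples (illustrative): i32 -> i16 selects Trunc; i16 -> i32 selects SExt
// when is_signed and ZExt otherwise; equal bit widths degenerate to BitCast.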
@@ -323,7 +405,7 @@ cast_inst *cast_inst::create_integer_cast(value *arg, type *ty, bool is_signed,
// return_inst
return_inst::return_inst(context &ctx, value *ret_val, instruction *next)
: terminator_inst(type::get_void_ty(ctx), INST_RETURN, ret_val!=nullptr, "", next){
: terminator_inst(ret_val?ret_val->get_type():type::get_void_ty(ctx), INST_RETURN, ret_val!=nullptr, "", next){
if(ret_val)
set_operand(0, ret_val);
}
@@ -428,13 +510,13 @@ getelementptr_inst *getelementptr_inst::create(value *ptr, const std::vector<val
//===----------------------------------------------------------------------===//
// io_inst
io_inst::io_inst(type *ty, value_id_t id, unsigned num_ops, const std::string &name, instruction *next)
: instruction(ty, id, num_ops, name, next)
io_inst::io_inst(type *ty, value_id_t id, unsigned num_ops, EVICTION_POLICY eviction, const std::string &name, instruction *next)
: instruction(ty, id, num_ops, name, next), eviction_(eviction)
{ }
// load_inst
load_inst::load_inst(value *ptr, value_id_t id, unsigned num_ops, const std::string &name, instruction *next)
: io_inst(get_pointee_type(ptr->get_type()), id, num_ops, name, next)
load_inst::load_inst(value *ptr, value_id_t id, unsigned num_ops, load_inst::CACHE_MODIFIER cache, EVICTION_POLICY eviction, bool is_volatile, const std::string &name, instruction *next)
: io_inst(get_pointee_type(ptr->get_type()), id, num_ops, eviction, name, next), cache_(cache), is_volatile_(is_volatile)
{ }
// load
@@ -447,83 +529,136 @@ type *load_inst::get_pointee_type(type *ty) {
}
// unmasked_load
unmasked_load_inst::unmasked_load_inst(value *ptr, const std::string &name, instruction *next)
: load_inst(ptr, INST_UNMASKED_LOAD, 1, name, next) {
unmasked_load_inst::unmasked_load_inst(value *ptr, load_inst::CACHE_MODIFIER cache,load_inst::EVICTION_POLICY eviction, bool is_volatile, const std::string &name, instruction *next)
: load_inst(ptr, INST_UNMASKED_LOAD, 1, cache, eviction, is_volatile, name, next) {
set_operand(0, ptr);
}
unmasked_load_inst* unmasked_load_inst::create(value *ptr, const std::string &name, instruction *next) {
return new unmasked_load_inst(ptr, name, next);
unmasked_load_inst* unmasked_load_inst::create(value *ptr, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction, bool is_volatile, const std::string &name, instruction *next) {
return new unmasked_load_inst(ptr, cache, eviction, is_volatile, name, next);
}
// masked load
masked_load_inst::masked_load_inst(value *ptr, value *mask, value *false_value,
masked_load_inst::masked_load_inst(value *ptr, value *mask, value *false_value, load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction,
bool is_volatile,
const std::string &name, instruction *next)
: load_inst(ptr, INST_MASKED_LOAD, 3, name, next) {
: load_inst(ptr, INST_MASKED_LOAD, 3, cache, eviction, is_volatile, name, next) {
set_operand(0, ptr);
set_operand(1, mask);
set_operand(2, false_value);
}
masked_load_inst* masked_load_inst::create(value *ptr, value *mask, value *false_value,
load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction,
bool is_volatile,
const std::string &name, instruction *next) {
return new masked_load_inst(ptr, mask, false_value, name, next);
return new masked_load_inst(ptr, mask, false_value, cache, eviction, is_volatile, name, next);
}
// masked load async
masked_load_async_inst::masked_load_async_inst(value *ptr, value *mask, value *false_value,
load_inst::CACHE_MODIFIER cache, load_inst::EVICTION_POLICY eviction,
const std::string &name, instruction *next)
: load_inst(ptr, INST_MASKED_LOAD_ASYNC, 3, name, next) {
: load_inst(ptr, INST_MASKED_LOAD_ASYNC, 3, cache, eviction, false, name, next) {
set_operand(0, ptr);
set_operand(1, mask);
set_operand(2, false_value);
}
masked_load_async_inst* masked_load_async_inst::create(value *ptr, value *mask, value *false_value,
load_inst::CACHE_MODIFIER cache, EVICTION_POLICY eviction,
const std::string &name, instruction *next) {
return new masked_load_async_inst(ptr, mask, false_value, name, next);
return new masked_load_async_inst(ptr, mask, false_value, cache, eviction, name, next);
}
// store
store_inst::store_inst(value *ptr, value_id_t id, unsigned num_ops, const std::string &name, instruction *next)
: io_inst(type::get_void_ty(ptr->get_type()->get_context()), id, num_ops, name, next)
store_inst::store_inst(value *ptr, value_id_t id, unsigned num_ops, EVICTION_POLICY eviction, const std::string &name, instruction *next)
: io_inst(type::get_void_ty(ptr->get_type()->get_context()), id, num_ops, eviction, name, next)
{ }
// unmasked_store
unmasked_store_inst::unmasked_store_inst(value *ptr, value *val,
unmasked_store_inst::unmasked_store_inst(value *ptr, value *val, EVICTION_POLICY eviction,
const std::string &name, instruction *next)
: store_inst(ptr, INST_UNMASKED_STORE, 2, name, next) {
: store_inst(ptr, INST_UNMASKED_STORE, 2, eviction, name, next) {
set_operand(0, ptr);
set_operand(1, val);
}
unmasked_store_inst* unmasked_store_inst::create(value *ptr, value *val,
unmasked_store_inst* unmasked_store_inst::create(value *ptr, value *val, EVICTION_POLICY eviction,
const std::string &name, instruction *next) {
return new unmasked_store_inst(ptr, val, name, next);
return new unmasked_store_inst(ptr, val, eviction, name, next);
}
// masked store
masked_store_inst::masked_store_inst(value *ptr, value *val, value *mask,
masked_store_inst::masked_store_inst(value *ptr, value *val, value *mask, EVICTION_POLICY eviction,
const std::string &name, instruction *next)
: store_inst(ptr, INST_MASKED_STORE, 3, name, next) {
: store_inst(ptr, INST_MASKED_STORE, 3, eviction, name, next) {
set_operand(0, ptr);
set_operand(1, val);
set_operand(2, mask);
}
masked_store_inst* masked_store_inst::create(value *ptr, value *val, value *mask, const std::string &name, instruction *next) {
return new masked_store_inst(ptr, val, mask, name, next);
masked_store_inst* masked_store_inst::create(value *ptr, value *val, value *mask, EVICTION_POLICY eviction,
const std::string &name, instruction *next) {
return new masked_store_inst(ptr, val, mask, eviction, name, next);
}
//===----------------------------------------------------------------------===//
// struct classes
//===----------------------------------------------------------------------===//
// insert value
insert_value_inst::insert_value_inst(value *val, value *elt, size_t idx, const std::string& name, instruction *next)
: instruction(val->get_type(), INST_INSERT_VALUE, 2, name, next), idx_(idx) {
set_operand(0, val);
set_operand(1, elt);
}
insert_value_inst* insert_value_inst::create(value *val, value *elt, size_t idx, const std::string& name, instruction *next){
return new insert_value_inst(val, elt, idx, name, next);
}
// extract value
extract_value_inst::extract_value_inst(value *val, size_t idx, const std::string& name, instruction *next)
: instruction(val->get_type()->get_struct_type(idx), INST_EXTRACT_VALUE, 1, name, next), idx_(idx) {
set_operand(0, val);
}
extract_value_inst* extract_value_inst::create(value *val, size_t idx, const std::string& name, instruction *next){
return new extract_value_inst(val, idx, name, next);
}
//===----------------------------------------------------------------------===//
// retile_inst classes
//===----------------------------------------------------------------------===//
// cat
cat_inst::cat_inst(value *x, value *y, const std::string &name, instruction *next)
: instruction(block_type::get(x->get_type()->get_scalar_ty(),
{x->get_type()->get_block_shapes()[0] +
y->get_type()->get_block_shapes()[0] }), INST_CAT, 2, name, next) {
set_operand(0, x);
set_operand(1, y);
}
instruction* cat_inst::create(value *lhs, value *rhs, const std::string &name, instruction *next) {
return new cat_inst(lhs, rhs, name, next);
}
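// Example (illustrative): concatenating blocks of shapes {4} and {8} yields a
// block of shape {12}; only the leading dimension is combined.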
// retile
retile_inst::retile_inst(value *arg, value_id_t id, const type::block_shapes_t &shapes,
const std::string &name, instruction *next)
: unary_inst(block_type::get(arg->get_type()->get_scalar_ty(), shapes), id, arg, name, next) { }
// reshape
instruction* reshape_inst::create(value *arg, const type::block_shapes_t &shapes,
@@ -552,44 +687,48 @@ instruction* downcast_inst::create(value *arg, const std::string &name, instruct
return new downcast_inst(arg->get_type()->get_scalar_ty(), INST_DOWNCAST, arg, name, next);
}
//===----------------------------------------------------------------------===//
// matmul_inst classes
//===----------------------------------------------------------------------===//
dot_inst::dot_inst(value *A, value *B, value *C, TransT AT, TransT BT,
dot_inst::dot_inst(value *A, value *B, value *C, TransT AT, TransT BT, bool allow_tf32,
const std::string &name, instruction *next)
: builtin_inst(C->get_type(), INST_DOT, 3, name, next) {
: builtin_inst(C->get_type(), INST_DOT, 3, name, next), AT_(AT), BT_(BT){
set_operand(0, A);
set_operand(1, B);
set_operand(2, C);
allow_tf32_ = allow_tf32;
}
instruction *dot_inst::create(value *A, value *B, value *C,
bool AT, bool BT,
bool AT, bool BT, bool allow_tf32,
const std::string &name, instruction *next) {
TransT OPA = AT ? Trans : NoTrans;
TransT OPB = BT ? Trans : NoTrans;
return new dot_inst(A, B, C, OPA, OPB, name, next);
return new dot_inst(A, B, C, OPA, OPB, allow_tf32, name, next);
}
instruction *dot_inst::create_nn(value *A, value *B, value *C,
instruction *dot_inst::create_nn(value *A, value *B, value *C, bool allow_tf32,
const std::string &name, instruction *next) {
return new dot_inst(A, B, C, NoTrans, NoTrans, name, next);
return new dot_inst(A, B, C, NoTrans, NoTrans, allow_tf32, name, next);
}
instruction *dot_inst::create_nt(value *A, value *B, value *C,
instruction *dot_inst::create_nt(value *A, value *B, value *C, bool allow_tf32,
const std::string &name, instruction *next) {
return new dot_inst(A, B, C, NoTrans, Trans, name, next);
return new dot_inst(A, B, C, NoTrans, Trans, allow_tf32, name, next);
}
instruction *dot_inst::create_tn(value *A, value *B, value *C,
instruction *dot_inst::create_tn(value *A, value *B, value *C, bool allow_tf32,
const std::string &name, instruction *next) {
return new dot_inst(A, B, C, Trans, NoTrans, name, next);
return new dot_inst(A, B, C, Trans, NoTrans, allow_tf32, name, next);
}
instruction *dot_inst::create_tt(value *A, value *B, value *C,
instruction *dot_inst::create_tt(value *A, value *B, value *C, bool allow_tf32,
const std::string &name, instruction *next) {
return new dot_inst(A, B, C, Trans, Trans, name, next);
return new dot_inst(A, B, C, Trans, Trans, allow_tf32, name, next);
}
//===----------------------------------------------------------------------===//
@@ -758,6 +897,19 @@ instruction* atomic_cas_inst::create(value *ptr, value *cmp, value *val, const s
}
// umulhi
umulhi_inst::umulhi_inst(value *lhs, value *rhs, const std::string &name, instruction *next)
: builtin_inst(lhs->get_type(), INST_UMULHI, 2, name, next) {
set_operand(0, lhs);
set_operand(1, rhs);
}
instruction* umulhi_inst::create(value *lhs, value *rhs, const std::string &name, instruction *next) {
return new umulhi_inst(lhs, rhs, name, next);
}
// exp
exp_inst::exp_inst(value *val, const std::string &name, instruction *next)
@@ -824,8 +976,7 @@ copy_from_shared_inst* copy_from_shared_inst::create(value *arg, const std::stri
}
// barrier
barrier_inst::barrier_inst(context &ctx, const std::string &name,
instruction *next)
barrier_inst::barrier_inst(context &ctx, const std::string &name, instruction *next)
: instruction(type::get_void_ty(ctx), INST_BARRIER, 0, name, next) { }
barrier_inst* barrier_inst::create(context &ctx, const std::string &name, instruction *next) {
@@ -844,27 +995,44 @@ prefetch_s_inst *prefetch_s_inst::create(context &ctx, value *arg, int inc, cons
return new prefetch_s_inst(ctx, arg, inc, name, next);
}
//// nv_dynamic_program_idx
//make_range_dyn::make_range_dyn(type *ty, const std::string &name, instruction *next)
// : instruction(ty, INST_MAKE_RANGE_DYN, 0, name, next) { }
// global timer
globaltimer_inst::globaltimer_inst(context &ctx, const std::string &name, instruction *next)
: instruction(type::get_int64_ty(ctx), INST_GLOBALTIMER, 0, name, next) { }
//make_range_dyn* make_range_dyn::create(type *ty, const std::string &name, instruction *next) {
// return new make_range_dyn(ty, name, next);
//}
globaltimer_inst* globaltimer_inst::create(context &ctx, const std::string &name, instruction *next) {
return new globaltimer_inst(ctx, name, next);
}
//// nv_static_program_idx
//make_range_sta::make_range_sta(make_range *range)
// : constant(range->get_type(), 0), range_(range) { }
// extern elementwise
extern_elementwise_inst::extern_elementwise_inst(
context &ctx, const std::vector<value *> &args, type *ret_ty,
const std::string &lib_name, const std::string &lib_path,
const std::string &symbol_name, const std::string &name, instruction *next)
: instruction(ret_ty, INST_EXTERN_ELEMENTWISE, args.size(), name, next),
lib_name_(lib_name),
lib_path_(lib_path),
symbol_name_(symbol_name) {
for (size_t i = 0; i < args.size(); i++) {
set_operand(i, args[i]);
}
}
//make_range* make_range_sta::get_range() const
//{ return range_; }
extern_elementwise_inst *extern_elementwise_inst::create(
context &ctx, const std::vector<value *> &args, type *ret_ty,
const std::string &lib_name, const std::string &lib_path,
const std::string &symbol_name, const std::string &name,
instruction *next) {
return new extern_elementwise_inst(ctx, args, ret_ty, lib_name, lib_path,
symbol_name, name, next);
}
//make_range_sta* make_range_sta::get(make_range* range) {
// static std::map<make_range*, make_range_sta*> cache;
// if(cache.find(range) == cache.end())
// cache.insert({range, new make_range_sta(range)});
// return cache.at(range);
//}
// clock
clock_inst::clock_inst(context &ctx, const std::string &name, instruction *next)
: instruction(type::get_int64_ty(ctx), INST_CLOCK, 0, name, next) { }
clock_inst* clock_inst::create(context &ctx, const std::string &name, instruction *next) {
return new clock_inst(ctx, name, next);
}
// make_range
@@ -874,7 +1042,7 @@ make_range::make_range(type *ty, constant_int *first, constant_int *last)
make_range *make_range::create(constant_int *first, constant_int *last) {
assert(first->get_type()->is_integer_ty());
assert(first->get_type() == last->get_type());
assert(((constant_int*)first)->get_value() == 0);
// assert(((constant_int*)first)->get_value() == 0);
type *ty = block_type::get(first->get_type(), {(unsigned)last->get_value() - (unsigned)first->get_value()});
return new make_range(ty, first, last);
}
@@ -887,7 +1055,5 @@ const constant_int* make_range::get_last() const {
return last_;
}
}
}


@@ -3,10 +3,10 @@
namespace triton{
namespace ir{
metadata::metadata(kind_t kind, unsigned value)
metadata::metadata(kind_t kind, std::vector<unsigned> value)
: kind_(kind), value_(value) { }
metadata* metadata::get(kind_t kind, unsigned value) {
metadata* metadata::get(kind_t kind, std::vector<unsigned> value) {
return new metadata(kind, value);
}

View File

@@ -9,151 +9,16 @@
namespace triton{
namespace ir{
/* Module */
module::module(const std::string &name, builder &builder)
: name_(name), builder_(builder) {
sealed_blocks_.insert(nullptr);
}
ir::builder& module::get_builder() {
return builder_;
}
void module::set_value(const std::string& name, ir::basic_block *block, ir::value *value){
values_[val_key_t{name, block}] = value;
auto it = metadatas_.find(name);
if(auto *x = dynamic_cast<ir::instruction*>(value))
if(it != metadatas_.end()){
x->set_metadata(it->second.first, it->second.second);
}
// value->set_name(name);
}
void module::set_value(const std::string& name, ir::value *value){
return set_value(name, builder_.get_insert_block(), value);
}
void module::set_const(const std::string& name){
const_.insert(name);
}
void module::set_continue_fn(std::function<ir::value*()> fn) {
continue_fn_ = fn;
}
std::function<ir::value*()> module::get_continue_fn() {
return continue_fn_;
}
ir::phi_node* module::make_phi(ir::type *ty, unsigned num_values, ir::basic_block *block){
basic_block::iterator insert = block->get_first_non_phi();
if(insert != block->end()){
builder_.set_insert_point(insert);
}
ir::phi_node *res = builder_.create_phi(ty, num_values);
if(insert != block->end())
builder_.set_insert_point(block);
return res;
}
ir::value *module::try_remove_trivial_phis(ir::phi_node *&phi){
// find non-self references
std::set<ir::value*> non_self_ref;
std::copy_if(phi->ops().begin(), phi->ops().end(), std::inserter(non_self_ref, non_self_ref.begin()),
[phi](ir::value* op){ return op != phi && op; });
// non-trivial
if(non_self_ref.size() != 1)
return phi;
// unique value or self-reference
ir::value *same = *non_self_ref.begin();
assert(same != nullptr);
phi->replace_all_uses_with(same);
phi->erase_from_parent();
std::set<ir::user*> users = phi->get_users();
for(ir::user* u: users)
if(auto *uphi = dynamic_cast<ir::phi_node*>(u))
if(uphi != phi)
try_remove_trivial_phis(uphi);
return same;
}
ir::value *module::add_phi_operands(const std::string& name, ir::phi_node *&phi){
// already initialized
if(phi->get_num_operands())
return phi;
ir::basic_block *block = phi->get_parent();
for(ir::basic_block *pred: block->get_predecessors()){
ir::value *value = get_value(name, pred);
phi->add_incoming(value, pred);
}
return phi;
}
ir::value *module::get_value_recursive(const std::string& name, ir::basic_block *block) {
ir::value *result;
bool is_const = const_.find(name) != const_.end();
auto &preds = block->get_predecessors();
ir::type *ty = types_.at(name);
if(block && !is_const && sealed_blocks_.find(block) == sealed_blocks_.end()){
incomplete_phis_[block][name] = make_phi(ty, 1, block);
result = (ir::value*)incomplete_phis_[block][name];
}
else if(preds.size() <= 1){
bool has_pred = preds.size();
result = get_value(name, has_pred?preds.front():nullptr);
}
else{
ir::phi_node* phi = make_phi(ty, 1, block);
set_value(name, block, phi);
result = add_phi_operands(name, phi);
if(auto *phi = dynamic_cast<ir::phi_node*>(result))
result = try_remove_trivial_phis(phi);
}
if(auto *phi = dynamic_cast<ir::phi_node*>(result)){
result = try_remove_trivial_phis(phi);
}
set_value(name, block, result);
return result;
}
ir::value *module::get_value(const std::string& name, ir::basic_block *block) {
ir::basic_block* save_block = builder_.get_insert_block();
ir::basic_block::iterator save_pt = builder_.get_insert_point();
val_key_t key(name, block);
if(values_.find(key) != values_.end()){
return values_.at(key);
}
ir::value *result = get_value_recursive(name, block);
builder_.set_insert_point(save_block);
if(save_pt != save_block->end())
builder_.set_insert_point(save_pt);
return result;
}
ir::value *module::get_value(const std::string& name) {
return get_value(name, builder_.get_insert_block());
}
const std::string& module::get_name() {
return name_;
}
void module::seal_block(ir::basic_block *block){
for(auto &x: incomplete_phis_[block]){
add_phi_operands(x.first, x.second);
if(get_value(x.first) == x.second)
set_value(x.first, try_remove_trivial_phis(x.second));
}
sealed_blocks_.insert(block);
incomplete_phis_[block].clear();
void module::reset_ret_ty(const std::string& name, type* ty) {
get_function(name)->get_fn_type()->reset_ret_ty(ty);
}
/* functions */
function *module::get_or_insert_function(const std::string &name, function_type *ty) {
function *&fn = (function*&)symbols_[name];
if(fn == nullptr)
return fn = function::create(ty, global_value::external, name, this);
if(fn == nullptr){
fn = function::create(ty, global_value::external, name, this);
}
return fn;
}


@@ -92,7 +92,7 @@ public:
//-------------------------
void SlotTracker::process_module() {
// Nothing to do at the moment.
// Create slots for global variable & unamed functions & ...
// Create slots for global variable & unnamed functions & ...
module_processed = true;
}


@@ -27,7 +27,7 @@ unsigned type::get_primitive_size_in_bits() const {
case BF16TyID: return 16;
case FP32TyID: return 32;
case FP64TyID: return 64;
case IntegerTyID: return ((integer_type*)(this))->get_bitwidth();
case IntegerTyID: return std::max<int>(8, ((integer_type*)(this))->get_bitwidth());
case BlockTyID: return ((block_type*)(this))->get_bitwidth();
default: return 0;
}
@@ -153,10 +153,10 @@ pointer_type* pointer_type::get(type *elt_ty, unsigned address_space){
assert(is_valid_elt_ty(elt_ty) && "Invalid type for pointer element!");
// look-up
context_impl *impl = elt_ty->get_context().p_impl.get();
pointer_type *&entry = impl->ptr_tys[std::make_pair(elt_ty, address_space)];
std::unique_ptr<pointer_type> &entry = impl->ptr_tys[std::make_pair(elt_ty, address_space)];
if(!entry)
entry = new pointer_type(elt_ty, address_space);
return entry;
entry.reset(new pointer_type(elt_ty, address_space));
return entry.get();
}
//===----------------------------------------------------------------------===//
@@ -174,7 +174,26 @@ bool composite_type::index_valid(value *idx) const{
}
//===----------------------------------------------------------------------===//
// tile_type class
// struct_type class
//===----------------------------------------------------------------------===//
struct_type::struct_type(const contained_tys_vec_t& tys, bool is_packed)
: composite_type(tys[0]->get_context(), StructTyID), is_packed_(is_packed) {
contained_tys_ = tys;
}
struct_type* struct_type::get(const contained_tys_vec_t& tys, bool is_packed) {
assert(tys.size());
context_impl* impl = tys[0]->get_context().p_impl.get();
struct_type *& entry = impl->struct_tys[tys];
if(!entry)
entry = new struct_type(tys, is_packed);
return entry;
}
//===----------------------------------------------------------------------===//
// block_type class
//===----------------------------------------------------------------------===//
block_type::block_type(type *ty, const block_shapes_t &shapes)
@@ -203,10 +222,10 @@ block_type* block_type::get(type *elt_ty, const block_shapes_t &shapes) {
assert(is_valid_elt_ty(elt_ty) && "Invalid type for tile element!");
// look-up
context_impl *impl = elt_ty->get_context().p_impl.get();
block_type *&entry = impl->block_tys[std::make_pair(elt_ty, shapes)];
std::unique_ptr<block_type> &entry = impl->block_tys[std::make_pair(elt_ty, shapes)];
if(!entry)
entry = new block_type(elt_ty, shapes);
return entry;
entry.reset(new block_type(elt_ty, shapes));
return entry.get();
}
block_type* block_type::get_same_shapes(type *ty, type *ref){


@@ -43,6 +43,15 @@ std::vector<basic_block*> cfg::reverse_post_order(function* fn) {
return result;
}
void for_each_instruction_backward(module &mod, const std::function<void (instruction *)> &do_work) {
for(ir::function *fn: mod.get_function_list())
for(ir::basic_block *block: cfg::post_order(fn)){
auto inst_list = block->get_inst_list();
for(auto it = inst_list.rbegin(); it != inst_list.rend() ; it++)
do_work(*it);
}
}
void for_each_instruction(module &mod, const std::function<void (instruction *)> &do_work) {
for(ir::function *fn: mod.get_function_list())
for(ir::basic_block *block: cfg::reverse_post_order(fn))


@@ -1,5 +1,6 @@
#include <cassert>
#include <iostream>
#include <algorithm>
#include "triton/ir/value.h"
#include "triton/ir/instructions.h"
@@ -17,11 +18,11 @@ value::value(type *ty, const std::string &name): ty_(ty){
}
void value::add_use(user *arg) {
users_.insert(arg);
users_.push_back(arg);
}
value::users_t::iterator value::erase_use(user *arg){
auto it = users_.find(arg);
auto it = std::find(users_.begin(), users_.end(), arg);
if(it == users_.end())
return it;
return users_.erase(it);


@@ -1,4 +1,5 @@
import torch
import triton
# -------------------------------
@@ -8,18 +9,18 @@ import triton
nt = {False: 'n', True: 't'}
square_confs = [
triton.testing.Benchmark(
x_names = ['M', 'N', 'K'],
x_vals = [128, 256, 512, 1024, 2048, 3072, 4096, 6144],
line_arg = 'block',
line_vals = [16, 32, 64, 128],
line_names = ['Block16', 'Block32', 'Block64', 'Block128'],
ylabel = 'TFLOPS',
plot_name = f'{op_mode}-{layout_mode}-square-{nt[AT]}{nt[BT]}',
args = {'layout_mode': layout_mode, 'op_mode': op_mode,
'AT': AT, 'BT': BT, 'dtype': torch.float16, 'provider': 'triton'}
)\
for AT in [False] for BT in [False] \
for op_mode in ['dsd'] for layout_mode in ['dense']
x_names=['M', 'N', 'K'],
x_vals=[128, 256, 512, 1024, 2048, 3072, 4096, 6144],
line_arg='block',
line_vals=[16, 32, 64, 128],
line_names=['Block16', 'Block32', 'Block64', 'Block128'],
ylabel='TFLOPS',
plot_name=f'{op_mode}-{layout_mode}-square-{nt[AT]}{nt[BT]}',
args={'layout_mode': layout_mode, 'op_mode': op_mode,
'AT': AT, 'BT': BT, 'dtype': torch.float16, 'provider': 'triton'}
)
for AT in [False] for BT in [False]
for op_mode in ['dsd'] for layout_mode in ['dense']
]
@@ -27,7 +28,7 @@ square_confs = [
def bench_matmul(M, N, K, block, layout_mode, op_mode, AT, BT, dtype, provider, warmup=100, rep=1000):
Z, H = 1, 1
make_layout = {
'tril': lambda H, M, N: torch.tril(torch.ones((H, M, N), dtype=torch.int64)),\
'tril': lambda H, M, N: torch.tril(torch.ones((H, M, N), dtype=torch.int64)),
'dense': lambda H, M, N: torch.ones(H, M, N, dtype=torch.int64),
}[layout_mode]
# create layout
@@ -39,16 +40,16 @@ def bench_matmul(M, N, K, block, layout_mode, op_mode, AT, BT, dtype, provider,
# create op
tflops = lambda ms: num_flops / ms * 1e3
if provider == 'triton':
op = triton.ops.blocksparse.matmul(layout, block, op_mode, trans_a=AT, trans_b=BT)
op = triton.ops.blocksparse.matmul(layout, block, op_mode, device="cuda", trans_a=AT, trans_b=BT)
# inputs
a = triton.testing.sparsify_tensor(a, layout, block) if op_mode == 'dsd' else a
b = triton.testing.sparsify_tensor(b, layout, block) if op_mode == 'dds' else b
mean_ms, min_ms, max_ms = triton.testing.do_bench(lambda: op(a, b), warmup=warmup, rep=rep)
num_flops = {
'sdd': 2 * Z * K * float(layout.sum()) * block * block,\
'dsd': 2 * Z * N * float(layout.sum()) * block * block,\
'sdd': 2 * Z * K * float(layout.sum()) * block * block,
'dsd': 2 * Z * N * float(layout.sum()) * block * block,
'dds': 2 * Z * M * float(layout.sum()) * block * block
}[op_mode]*1e-12
}[op_mode] * 1e-12
return tflops(mean_ms), tflops(min_ms), tflops(max_ms)
@@ -58,15 +59,15 @@ def bench_matmul(M, N, K, block, layout_mode, op_mode, AT, BT, dtype, provider,
square_confs = [
triton.testing.Benchmark(
x_names = ['M', 'N'],
x_vals = [128, 256, 512, 1024, 2048, 3072, 4096, 6144],
line_arg = 'block',
line_vals = [16, 32, 64],
line_names = ['Block16', 'Block32', 'Block64'],
ylabel = 'GBPS',
plot_name = f'{layout_mode}-square',
args = {'layout_mode': layout_mode, 'dtype': torch.float16, 'provider': 'triton'}
)\
x_names=['M', 'N'],
x_vals=[128, 256, 512, 1024, 2048, 3072, 4096, 6144],
line_arg='block',
line_vals=[16, 32, 64],
line_names=['Block16', 'Block32', 'Block64'],
ylabel='GBPS',
plot_name=f'{layout_mode}-square',
args={'layout_mode': layout_mode, 'dtype': torch.float16, 'provider': 'triton'}
)
for layout_mode in ['dense', 'tril']
]
@@ -82,10 +83,10 @@ def bench_softmax(M, N, block, layout_mode, dtype, provider, warmup=10, rep=50):
a = torch.randn((Z, H, M, N), dtype=dtype, device='cuda')
if provider == 'triton':
a = triton.testing.sparsify_tensor(a, layout, block)
op = triton.ops.blocksparse.softmax(layout, block)
op = triton.ops.blocksparse.softmax(layout, block, device="cuda")
gbps = lambda ms: (2 * a.numel() * a.element_size() * 1e-9) / (ms * 1e-3)
mean_ms, min_ms, max_ms = triton.testing.do_bench(lambda: op(a), warmup=warmup, rep=rep)
return gbps(mean_ms), gbps(min_ms), gbps(max_ms)
bench_matmul.run(print_data=True, show_plots=True)
bench_matmul.run(print_data=True, show_plots=True)


@@ -1,17 +1,18 @@
import torch
import triton
confs = [
triton.testing.Benchmark(
x_names = ['N'],
x_vals = [128, 256, 512, 1024, 2048, 3072, 4096, 6144, 8192],
line_arg = 'provider',
line_vals = ['triton', 'torch'],
line_names = ['Triton', 'Torch'],
ylabel = 'GBPS',
plot_name = f'{mode}-2048',
args = {'M': 2048, 'dtype': torch.float16, 'mode': mode}
)\
x_names=['N'],
x_vals=[128, 256, 512, 1024, 2048, 3072, 4096, 6144, 8192],
line_arg='provider',
line_vals=['triton', 'torch'],
line_names=['Triton', 'Torch'],
ylabel='GBPS',
plot_name=f'{mode}-2048',
args={'M': 2048, 'dtype': torch.float16, 'mode': mode}
)
for mode in ['forward', 'backward']
]
@@ -24,8 +25,8 @@ def bench_op(M, N, dtype, mode, provider):
num_gb = (2 * x.numel() * x.element_size() * 1e-9)
gbps = lambda ms: num_gb / ms * 1e3
# forward pass
op = {'torch': torch.nn.CrossEntropyLoss(reduction='none'), \
'triton': triton.ops.cross_entropy}[provider]
op = {'torch': torch.nn.CrossEntropyLoss(reduction='none'),
'triton': triton.ops.cross_entropy}[provider]
if mode == 'forward':
mean_ms, min_ms, max_ms = triton.testing.do_bench(lambda: op(x, idx))
if mode == 'backward':
@@ -37,4 +38,4 @@ def bench_op(M, N, dtype, mode, provider):
if __name__ == '__main__':
bench_op.run(print_data=True)
bench_op.run(print_data=True)


@@ -1,11 +1,11 @@
import triton
import torch
import os
import triton
def rounded_linspace(low, high, steps, div):
ret = torch.linspace(low, high, steps)
ret = (ret.int() + div - 1) // div * div
ret = torch.div(ret.int() + div - 1, div, rounding_mode='trunc') * div
ret = torch.unique(ret)
return list(map(int, ret))
@@ -29,16 +29,16 @@ square_confs = [
transformer_confs = [
triton.testing.Benchmark(
x_names=[x],
x_vals = rounded_linspace(NK//16, NK, 32, 128),
x_vals=rounded_linspace(NK // 16, NK, 32, 128),
line_arg="provider",
line_vals=["cublas", "triton", "cutlass"],
line_names=["cuBLAS", "Triton", "CUTLASS"],
ylabel="TFLOPS",
plot_name=f"matmul-M{M}-{'NK'.replace(x, '')}{NK}",
args= {"M": M, 'NK'.replace(x,''): NK, "AT": False, "BT": False, "dtype": torch.float16}
) for NK in [12288]\
for i, x in enumerate(["N", "K"])\
for M in [2048]
args={"M": M, 'NK'.replace(x, ''): NK, "AT": False, "BT": False, "dtype": torch.float16}
) for NK in [12288]
for i, x in enumerate(["N", "K"])
for M in [2048]
]
@@ -46,9 +46,10 @@ transformer_confs = [
def bench_op(M, N, K, AT, BT, dtype, provider, warmup=25, rep=75):
a = torch.rand((K, M) if AT else (M, K), device="cuda", dtype=dtype)
b = torch.rand((N, K) if BT else (K, N), device="cuda", dtype=dtype)
if AT: a = a.t()
if BT: b = b.t()
num_flops = 2 * M * N * K
if AT:
a = a.t()
if BT:
b = b.t()
tflops = lambda ms: 2. * M * N * K / ms * 1e-9
if provider == "cublas":
ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.matmul(a, b), warmup=warmup, rep=rep)
@@ -61,6 +62,6 @@ def bench_op(M, N, K, AT, BT, dtype, provider, warmup=25, rep=75):
try:
ms, min_ms, max_ms = triton.testing.do_bench(lambda: cutlass_matmul(a, b), warmup=warmup, rep=rep)
return tflops(ms), tflops(max_ms), tflops(min_ms)
except:
except Exception:
return None
return None


@@ -1,7 +1,8 @@
import argparse
import sys
import os
import inspect
import os
import sys
import triton


@@ -1 +0,0 @@
scipy >= 1.7.1


@@ -1,2 +1,8 @@
[metadata]
description-file = README.md
description_file = README.md
[pycodestyle]
ignore = E501,E701,E731
[flake8]
ignore = E501,E701,E731


@@ -1,45 +1,83 @@
import os
import re
import sys
import sysconfig
import platform
import subprocess
import distutils
import glob
import tempfile
import shutil
from distutils.version import LooseVersion
from setuptools import setup, Extension, find_packages
from setuptools.command.build_ext import build_ext
from setuptools.command.test import test as TestCommand
import distutils.spawn
import urllib.request
import os
import platform
import re
import shutil
import subprocess
import sys
import tarfile
import urllib.request
from distutils.version import LooseVersion
from typing import NamedTuple
def get_llvm():
# tries to find system LLVM
versions = ['-11.0', '-11', '-11-64']
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
import torch
# Taken from https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/env.py
def check_env_flag(name: str, default: str = "") -> bool:
return os.getenv(name, default).upper() in ["ON", "1", "YES", "TRUE", "Y"]
def get_build_type():
if check_env_flag("DEBUG"):
return "Debug"
elif check_env_flag("REL_WITH_DEB_INFO"):
return "RelWithDebInfo"
else:
return "Release"
def use_system_llvm():
if platform.system() == "Windows":
return True
# versions = ['-11.0', '-11', '-11-64']
versions = ['-13.0', '-13', '-13-64']
supported = ['llvm-config{v}'.format(v=v) for v in versions]
paths = [distutils.spawn.find_executable(cfg) for cfg in supported]
paths = [p for p in paths if p is not None]
if paths:
return '', ''
# download if nothing is installed
name = 'clang+llvm-11.0.1-x86_64-linux-gnu-ubuntu-16.04'
dir = '/tmp'
llvm_include_dir = '{dir}/{name}/include'.format(dir=dir, name=name)
llvm_library_dir = '{dir}/{name}/lib'.format(dir=dir, name=name)
if not os.path.exists(llvm_library_dir):
try:
shutil.rmtree(os.path.join(dir, name))
except:
pass
url = "https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.1/{name}.tar.xz".format(name=name)
print('downloading and extracting ' + url + '...')
ftpstream = urllib.request.urlopen(url)
file = tarfile.open(fileobj=ftpstream, mode="r|xz")
file.extractall(path=dir)
return llvm_include_dir, llvm_library_dir
return any(p is not None for p in paths)
def get_thirdparty_packages(triton_cache_path):
class Package(NamedTuple):
package: str
name: str
url: str
test_file: str
include_flag: str
lib_flag: str
packages = [
Package("pybind11", "pybind11-2.10.0", "https://github.com/pybind/pybind11/archive/refs/tags/v2.10.0.tar.gz", "include/pybind11/pybind11.h", "PYBIND11_INCLUDE_DIR", "")
]
if not use_system_llvm():
# download LLVM if no suitable system LLVM is installed
packages.append(
Package("llvm", "clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04", "https://github.com/llvm/llvm-project/releases/download/llvmorg-13.0.0/clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-16.04.tar.xz", "lib", "LLVM_INCLUDE_DIRS", "LLVM_LIBRARY_DIR")
)
thirdparty_cmake_args = []
for p in packages:
package_root_dir = os.path.join(triton_cache_path, p.package)
package_dir = os.path.join(package_root_dir, p.name)
test_file_path = os.path.join(package_dir, p.test_file)
if not os.path.exists(test_file_path):
try:
shutil.rmtree(package_root_dir)
except Exception:
pass
os.makedirs(package_root_dir, exist_ok=True)
print('downloading and extracting {} ...'.format(p.url))
ftpstream = urllib.request.urlopen(p.url)
file = tarfile.open(fileobj=ftpstream, mode="r|*")
file.extractall(path=package_root_dir)
if p.include_flag:
thirdparty_cmake_args.append("-D{}={}/include".format(p.include_flag, package_dir))
if p.lib_flag:
thirdparty_cmake_args.append("-D{}={}/lib".format(p.lib_flag, package_dir))
return thirdparty_cmake_args
class CMakeExtension(Extension):
@@ -77,35 +115,28 @@ class CMakeBuild(build_ext):
self.build_extension(ext)
def build_extension(self, ext):
llvm_include_dir, llvm_library_dir = get_llvm()
# self.debug = True
triton_cache_path = os.path.join(os.environ["HOME"], ".triton")
thirdparty_cmake_args = get_thirdparty_packages(triton_cache_path)
extdir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.path)))
# create build directories
build_suffix = 'debug' if self.debug else 'release'
llvm_build_dir = os.path.join(tempfile.gettempdir(), "llvm-" + build_suffix)
if not os.path.exists(self.build_temp):
os.makedirs(self.build_temp)
if not os.path.exists(llvm_build_dir):
os.makedirs(llvm_build_dir)
# python directories
python_include_dirs = [distutils.sysconfig.get_python_inc()] + ['/usr/local/cuda/include']
python_include_dirs = [distutils.sysconfig.get_python_inc()]
cmake_args = [
"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=" + extdir,
"-DBUILD_TUTORIALS=OFF",
"-DBUILD_PYTHON_MODULE=ON",
"-DLLVM_INCLUDE_DIRS=" + llvm_include_dir,
"-DLLVM_LIBRARY_DIR=" + llvm_library_dir,
#'-DPYTHON_EXECUTABLE=' + sys.executable,
#'-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON',
"-DTRITON_LLVM_BUILD_DIR=" + llvm_build_dir,
# '-DPYTHON_EXECUTABLE=' + sys.executable,
# '-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON',
"-DPYTHON_INCLUDE_DIRS=" + ";".join(python_include_dirs)
]
] + thirdparty_cmake_args
# configuration
cfg = "Debug" if self.debug else "Release"
cfg = get_build_type()
build_args = ["--config", cfg]
if platform.system() == "Windows":
cmake_args += ["-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_{}={}".format(cfg.upper(), extdir)]
cmake_args += ["-DCMAKE_RUNTIME_OUTPUT_DIRECTORY_{}={}".format(cfg.upper(), extdir)]
if sys.maxsize > 2**32:
cmake_args += ["-A", "x64"]
build_args += ["--", "/m"]
@@ -115,20 +146,31 @@ class CMakeBuild(build_ext):
build_args += ["--", '-j' + str(2 * multiprocessing.cpu_count())]
env = os.environ.copy()
if torch.version.hip is not None:
env["TRITON_USE_ROCM"] = "ON"
subprocess.check_call(["cmake", self.base_dir] + cmake_args, cwd=self.build_temp, env=env)
subprocess.check_call(["cmake", "--build", "."] + build_args, cwd=self.build_temp)
setup(
name="triton",
version="1.1.1",
version="2.0.0",
author="Philippe Tillet",
author_email="phil@openai.com",
description="A language and compiler for custom Deep Learning operations",
long_description="",
packages=["triton", "triton/_C", "triton/language", "triton/tools", "triton/ops", "triton/ops/blocksparse"],
install_requires=["torch", "filelock"],
package_data={"triton/ops": ["*.c"], "triton/ops/blocksparse": ["*.c"]},
packages=["triton", "triton/_C", "triton/language", "triton/runtime", "triton/tools", "triton/ops", "triton/ops/blocksparse"],
install_requires=[
"cmake",
"filelock",
"torch",
],
package_data={
"triton/ops": ["*.c"],
"triton/ops/blocksparse": ["*.c"],
"triton/language": ["*.bc"],
},
include_package_data=True,
ext_modules=[CMakeExtension("triton", "triton/_C/")],
cmdclass={"build_ext": CMakeBuild},
@@ -143,4 +185,19 @@ setup(
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3.6",
],
extras_require={
"tests": [
"autopep8",
"flake8",
"isort",
"numpy",
"pytest",
"scipy>=1.7.1",
],
"tutorials": [
"matplotlib",
"pandas",
"tabulate",
],
},
)


@@ -105,7 +105,7 @@ ir::value *cast(ir::value *input, type_code _dtype, ir::builder *builder) {
other = builder->create_splat(other, src_ty->get_block_shapes());
return builder->create_icmpNE(input, other);
}
throw_not_implemented("cast");
throw_not_implemented("cast from " + src_sca_ty->repr() + " to " + dst_sca_ty->repr());
}
/*----------------------------------------------
@@ -253,7 +253,7 @@ ir::value *dot(ir::value *lhs, ir::value *rhs, ir::builder *builder) {
std::string where_docstr = R"pbdoc(
Returns a block of elements from either `x` or `y`, depending on `condition`.
Note that `x` and `y` are always evaluated regardless of the value of `condition`.
If you want to avoid unintented memory operations, use the `mask` arguments in `triton.load` and `triton.store` instead.
If you want to avoid unintended memory operations, use the `mask` arguments in `triton.load` and `triton.store` instead.
:param condition: When True (nonzero), yield x, otherwise yield y.
:type condition: Block of triton.bool
@@ -353,9 +353,6 @@ ir::value *sqrt(ir::value *input, ir::builder *builder) {
return builder->create_sqrt(input);
};
/*----------------------------------------------
definition of triton.min
----------------------------------------------*/
ir::value *reduce_impl(ir::value *input, unsigned int axis, ir::builder *builder, const std::string &name,
ir::reduce_inst::op_t FLOAT_OP, ir::reduce_inst::op_t INT_OP) {
ir::type *scalar_ty = input->get_type()->get_scalar_ty();
@@ -367,6 +364,9 @@ ir::value *reduce_impl(ir::value *input, unsigned int axis, ir::builder *builder
throw_not_int_or_float(name);
}
/*----------------------------------------------
definition of triton.min
----------------------------------------------*/
std::string min_docstr = R"pbdoc(
Returns the minimum value of `input`.
)pbdoc";
@@ -374,6 +374,16 @@ ir::value *min(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "min", ir::reduce_inst::FMIN, ir::reduce_inst::MIN);
};
/*----------------------------------------------
definition of triton.arg_min
----------------------------------------------*/
std::string argmin_docstr = R"pbdoc(
Returns the index of the minimum value of `input`.
)pbdoc";
ir::value *argmin(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "argmin", ir::reduce_inst::ARGFMIN, ir::reduce_inst::ARGMIN);
};
/*----------------------------------------------
definition of triton.max
----------------------------------------------*/
@@ -384,6 +394,16 @@ ir::value *max(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "max", ir::reduce_inst::FMAX, ir::reduce_inst::MAX);
};
/*----------------------------------------------
definition of triton.arg_max
----------------------------------------------*/
std::string argmax_docstr = R"pbdoc(
Returns the index of the maximum value of `input`.
)pbdoc";
ir::value *argmax(ir::value *input, unsigned int axis, ir::builder *builder) {
return reduce_impl(input, axis, builder, "argmax", ir::reduce_inst::ARGFMAX, ir::reduce_inst::ARGMAX);
};
/*----------------------------------------------
definition of triton.sum
----------------------------------------------*/


@@ -1,493 +0,0 @@
/*
pybind11/attr.h: Infrastructure for processing custom
type and function attributes
Copyright (c) 2016 Wenzel Jakob <wenzel.jakob@epfl.ch>
All rights reserved. Use of this source code is governed by a
BSD-style license that can be found in the LICENSE file.
*/
#pragma once
#include "cast.h"
NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
/// \addtogroup annotations
/// @{
/// Annotation for methods
struct is_method { handle class_; is_method(const handle &c) : class_(c) { } };
/// Annotation for operators
struct is_operator { };
/// Annotation for parent scope
struct scope { handle value; scope(const handle &s) : value(s) { } };
/// Annotation for documentation
struct doc { const char *value; doc(const char *value) : value(value) { } };
/// Annotation for function names
struct name { const char *value; name(const char *value) : value(value) { } };
/// Annotation indicating that a function is an overload associated with a given "sibling"
struct sibling { handle value; sibling(const handle &value) : value(value.ptr()) { } };
/// Annotation indicating that a class derives from another given type
template <typename T> struct base {
PYBIND11_DEPRECATED("base<T>() was deprecated in favor of specifying 'T' as a template argument to class_")
base() { }
};
/// Keep patient alive while nurse lives
template <size_t Nurse, size_t Patient> struct keep_alive { };
/// Annotation indicating that a class is involved in a multiple inheritance relationship
struct multiple_inheritance { };
/// Annotation which enables dynamic attributes, i.e. adds `__dict__` to a class
struct dynamic_attr { };
/// Annotation which enables the buffer protocol for a type
struct buffer_protocol { };
/// Annotation which requests that a special metaclass is created for a type
struct metaclass {
handle value;
PYBIND11_DEPRECATED("py::metaclass() is no longer required. It's turned on by default now.")
metaclass() {}
/// Override pybind11's default metaclass
explicit metaclass(handle value) : value(value) { }
};
/// Annotation that marks a class as local to the module:
struct module_local { const bool value; constexpr module_local(bool v = true) : value(v) { } };
/// Annotation to mark enums as an arithmetic type
struct arithmetic { };
/** \rst
A call policy which places one or more guard variables (``Ts...``) around the function call.
For example, this definition:
.. code-block:: cpp
m.def("foo", foo, py::call_guard<T>());
is equivalent to the following pseudocode:
.. code-block:: cpp
m.def("foo", [](args...) {
T scope_guard;
return foo(args...); // forwarded arguments
});
\endrst */
template <typename... Ts> struct call_guard;
template <> struct call_guard<> { using type = detail::void_type; };
template <typename T>
struct call_guard<T> {
static_assert(std::is_default_constructible<T>::value,
"The guard type must be default constructible");
using type = T;
};
template <typename T, typename... Ts>
struct call_guard<T, Ts...> {
struct type {
T guard{}; // Compose multiple guard types with left-to-right default-constructor order
typename call_guard<Ts...>::type next{};
};
};
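For context, this is the machinery (removed here from the vendored copy, since the diff switches to a downloaded pybind11 release) behind the common pattern of releasing the GIL around a blocking call. A minimal sketch, assuming a hypothetical function `heavy_compute` and module name `example`:

#include <pybind11/pybind11.h>

namespace py = pybind11;

double heavy_compute(double x) { return x * x; }  // stand-in workload

PYBIND11_MODULE(example, m) {
    // call_guard<T> default-constructs a py::gil_scoped_release before the
    // call and destroys it afterwards, exactly like the pseudocode above.
    m.def("heavy_compute", &heavy_compute,
          py::call_guard<py::gil_scoped_release>());
}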
/// @} annotations
NAMESPACE_BEGIN(detail)
/* Forward declarations */
enum op_id : int;
enum op_type : int;
struct undefined_t;
template <op_id id, op_type ot, typename L = undefined_t, typename R = undefined_t> struct op_;
inline void keep_alive_impl(size_t Nurse, size_t Patient, function_call &call, handle ret);
/// Internal data structure which holds metadata about a keyword argument
struct argument_record {
const char *name; ///< Argument name
const char *descr; ///< Human-readable version of the argument value
handle value; ///< Associated Python object
bool convert : 1; ///< True if the argument is allowed to convert when loading
bool none : 1; ///< True if None is allowed when loading
argument_record(const char *name, const char *descr, handle value, bool convert, bool none)
: name(name), descr(descr), value(value), convert(convert), none(none) { }
};
/// Internal data structure which holds metadata about a bound function (signature, overloads, etc.)
struct function_record {
function_record()
: is_constructor(false), is_new_style_constructor(false), is_stateless(false),
is_operator(false), has_args(false), has_kwargs(false), is_method(false) { }
/// Function name
char *name = nullptr; /* why no C++ strings? They generate heavier code.. */
// User-specified documentation string
char *doc = nullptr;
/// Human-readable version of the function signature
char *signature = nullptr;
/// List of registered keyword arguments
std::vector<argument_record> args;
/// Pointer to lambda function which converts arguments and performs the actual call
handle (*impl) (function_call &) = nullptr;
/// Storage for the wrapped function pointer and captured data, if any
void *data[3] = { };
/// Pointer to custom destructor for 'data' (if needed)
void (*free_data) (function_record *ptr) = nullptr;
/// Return value policy associated with this function
return_value_policy policy = return_value_policy::automatic;
/// True if name == '__init__'
bool is_constructor : 1;
/// True if this is a new-style `__init__` defined in `detail/init.h`
bool is_new_style_constructor : 1;
/// True if this is a stateless function pointer
bool is_stateless : 1;
/// True if this is an operator (__add__), etc.
bool is_operator : 1;
/// True if the function has a '*args' argument
bool has_args : 1;
/// True if the function has a '**kwargs' argument
bool has_kwargs : 1;
/// True if this is a method
bool is_method : 1;
/// Number of arguments (including py::args and/or py::kwargs, if present)
std::uint16_t nargs;
/// Python method object
PyMethodDef *def = nullptr;
/// Python handle to the parent scope (a class or a module)
handle scope;
/// Python handle to the sibling function representing an overload chain
handle sibling;
/// Pointer to next overload
function_record *next = nullptr;
};
/// Special data structure which (temporarily) holds metadata about a bound class
struct type_record {
PYBIND11_NOINLINE type_record()
: multiple_inheritance(false), dynamic_attr(false), buffer_protocol(false),
default_holder(true), module_local(false) { }
/// Handle to the parent scope
handle scope;
/// Name of the class
const char *name = nullptr;
// Pointer to RTTI type_info data structure
const std::type_info *type = nullptr;
/// How large is the underlying C++ type?
size_t type_size = 0;
/// What is the alignment of the underlying C++ type?
size_t type_align = 0;
/// How large is the type's holder?
size_t holder_size = 0;
/// The global operator new can be overridden with a class-specific variant
void *(*operator_new)(size_t) = nullptr;
/// Function pointer to class_<..>::init_instance
void (*init_instance)(instance *, const void *) = nullptr;
/// Function pointer to class_<..>::dealloc
void (*dealloc)(detail::value_and_holder &) = nullptr;
/// List of base classes of the newly created type
list bases;
/// Optional docstring
const char *doc = nullptr;
/// Custom metaclass (optional)
handle metaclass;
/// Multiple inheritance marker
bool multiple_inheritance : 1;
/// Does the class manage a __dict__?
bool dynamic_attr : 1;
/// Does the class implement the buffer protocol?
bool buffer_protocol : 1;
/// Is the default (unique_ptr) holder type used?
bool default_holder : 1;
/// Is the class definition local to the module shared object?
bool module_local : 1;
PYBIND11_NOINLINE void add_base(const std::type_info &base, void *(*caster)(void *)) {
auto base_info = detail::get_type_info(base, false);
if (!base_info) {
std::string tname(base.name());
detail::clean_type_id(tname);
pybind11_fail("generic_type: type \"" + std::string(name) +
"\" referenced unknown base type \"" + tname + "\"");
}
if (default_holder != base_info->default_holder) {
std::string tname(base.name());
detail::clean_type_id(tname);
pybind11_fail("generic_type: type \"" + std::string(name) + "\" " +
(default_holder ? "does not have" : "has") +
" a non-default holder type while its base \"" + tname + "\" " +
(base_info->default_holder ? "does not" : "does"));
}
bases.append((PyObject *) base_info->type);
if (base_info->type->tp_dictoffset != 0)
dynamic_attr = true;
if (caster)
base_info->implicit_casts.emplace_back(type, caster);
}
};
inline function_call::function_call(const function_record &f, handle p) :
func(f), parent(p) {
args.reserve(f.nargs);
args_convert.reserve(f.nargs);
}
/// Tag for a new-style `__init__` defined in `detail/init.h`
struct is_new_style_constructor { };
/**
* Partial template specializations to process custom attributes provided to
* cpp_function_ and class_. These are either used to initialize the respective
* fields in the type_record and function_record data structures or executed at
* runtime to deal with custom call policies (e.g. keep_alive).
*/
template <typename T, typename SFINAE = void> struct process_attribute;
template <typename T> struct process_attribute_default {
/// Default implementation: do nothing
static void init(const T &, function_record *) { }
static void init(const T &, type_record *) { }
static void precall(function_call &) { }
static void postcall(function_call &, handle) { }
};
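As a hedged illustration of this extension point (the `tagged` annotation below is hypothetical, not part of pybind11): a custom attribute is just a small struct plus a `process_attribute` specialization placed in `pybind11::detail`, alongside the primary template above:

struct tagged { const char *value; explicit tagged(const char *v) : value(v) {} };

// In real user code the annotation struct would live outside detail;
// the specialization itself must be in pybind11::detail.
template <> struct process_attribute<tagged> : process_attribute_default<tagged> {
    // init() runs once at registration time; here we stash the tag in the
    // docstring slot of the function_record.
    static void init(const tagged &t, function_record *r) {
        r->doc = const_cast<char *>(t.value);
    }
};

// Usage sketch: m.def("f", &f, tagged("experimental"));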
/// Process an attribute specifying the function's name
template <> struct process_attribute<name> : process_attribute_default<name> {
static void init(const name &n, function_record *r) { r->name = const_cast<char *>(n.value); }
};
/// Process an attribute specifying the function's docstring
template <> struct process_attribute<doc> : process_attribute_default<doc> {
static void init(const doc &n, function_record *r) { r->doc = const_cast<char *>(n.value); }
};
/// Process an attribute specifying the function's docstring (provided as a C-style string)
template <> struct process_attribute<const char *> : process_attribute_default<const char *> {
static void init(const char *d, function_record *r) { r->doc = const_cast<char *>(d); }
static void init(const char *d, type_record *r) { r->doc = const_cast<char *>(d); }
};
template <> struct process_attribute<char *> : process_attribute<const char *> { };
/// Process an attribute indicating the function's return value policy
template <> struct process_attribute<return_value_policy> : process_attribute_default<return_value_policy> {
static void init(const return_value_policy &p, function_record *r) { r->policy = p; }
};
/// Process an attribute which indicates that this is an overloaded function associated with a given sibling
template <> struct process_attribute<sibling> : process_attribute_default<sibling> {
static void init(const sibling &s, function_record *r) { r->sibling = s.value; }
};
/// Process an attribute which indicates that this function is a method
template <> struct process_attribute<is_method> : process_attribute_default<is_method> {
static void init(const is_method &s, function_record *r) { r->is_method = true; r->scope = s.class_; }
};
/// Process an attribute which indicates the parent scope of a method
template <> struct process_attribute<scope> : process_attribute_default<scope> {
static void init(const scope &s, function_record *r) { r->scope = s.value; }
};
/// Process an attribute which indicates that this function is an operator
template <> struct process_attribute<is_operator> : process_attribute_default<is_operator> {
static void init(const is_operator &, function_record *r) { r->is_operator = true; }
};
template <> struct process_attribute<is_new_style_constructor> : process_attribute_default<is_new_style_constructor> {
static void init(const is_new_style_constructor &, function_record *r) { r->is_new_style_constructor = true; }
};
/// Process a keyword argument attribute (*without* a default value)
template <> struct process_attribute<arg> : process_attribute_default<arg> {
static void init(const arg &a, function_record *r) {
if (r->is_method && r->args.empty())
r->args.emplace_back("self", nullptr, handle(), true /*convert*/, false /*none not allowed*/);
r->args.emplace_back(a.name, nullptr, handle(), !a.flag_noconvert, a.flag_none);
}
};
/// Process a keyword argument attribute (*with* a default value)
template <> struct process_attribute<arg_v> : process_attribute_default<arg_v> {
static void init(const arg_v &a, function_record *r) {
if (r->is_method && r->args.empty())
r->args.emplace_back("self", nullptr /*descr*/, handle() /*parent*/, true /*convert*/, false /*none not allowed*/);
if (!a.value) {
#if !defined(NDEBUG)
std::string descr("'");
if (a.name) descr += std::string(a.name) + ": ";
descr += a.type + "'";
if (r->is_method) {
if (r->name)
descr += " in method '" + (std::string) str(r->scope) + "." + (std::string) r->name + "'";
else
descr += " in method of '" + (std::string) str(r->scope) + "'";
} else if (r->name) {
descr += " in function '" + (std::string) r->name + "'";
}
pybind11_fail("arg(): could not convert default argument "
+ descr + " into a Python object (type not registered yet?)");
#else
pybind11_fail("arg(): could not convert default argument "
"into a Python object (type not registered yet?). "
"Compile in debug mode for more information.");
#endif
}
r->args.emplace_back(a.name, a.descr, a.value.inc_ref(), !a.flag_noconvert, a.flag_none);
}
};
/// Process a parent class attribute. Single inheritance only (class_ itself already guarantees that)
template <typename T>
struct process_attribute<T, enable_if_t<is_pyobject<T>::value>> : process_attribute_default<handle> {
static void init(const handle &h, type_record *r) { r->bases.append(h); }
};
/// Process a parent class attribute (deprecated, does not support multiple inheritance)
template <typename T>
struct process_attribute<base<T>> : process_attribute_default<base<T>> {
static void init(const base<T> &, type_record *r) { r->add_base(typeid(T), nullptr); }
};
/// Process a multiple inheritance attribute
template <>
struct process_attribute<multiple_inheritance> : process_attribute_default<multiple_inheritance> {
static void init(const multiple_inheritance &, type_record *r) { r->multiple_inheritance = true; }
};
template <>
struct process_attribute<dynamic_attr> : process_attribute_default<dynamic_attr> {
static void init(const dynamic_attr &, type_record *r) { r->dynamic_attr = true; }
};
template <>
struct process_attribute<buffer_protocol> : process_attribute_default<buffer_protocol> {
static void init(const buffer_protocol &, type_record *r) { r->buffer_protocol = true; }
};
template <>
struct process_attribute<metaclass> : process_attribute_default<metaclass> {
static void init(const metaclass &m, type_record *r) { r->metaclass = m.value; }
};
template <>
struct process_attribute<module_local> : process_attribute_default<module_local> {
static void init(const module_local &l, type_record *r) { r->module_local = l.value; }
};
/// Process an 'arithmetic' attribute for enums (does nothing here)
template <>
struct process_attribute<arithmetic> : process_attribute_default<arithmetic> {};
template <typename... Ts>
struct process_attribute<call_guard<Ts...>> : process_attribute_default<call_guard<Ts...>> { };
/**
* Process a keep_alive call policy -- invokes keep_alive_impl during the
* pre-call handler if both Nurse, Patient != 0 and use the post-call handler
* otherwise
*/
template <size_t Nurse, size_t Patient> struct process_attribute<keep_alive<Nurse, Patient>> : public process_attribute_default<keep_alive<Nurse, Patient>> {
template <size_t N = Nurse, size_t P = Patient, enable_if_t<N != 0 && P != 0, int> = 0>
static void precall(function_call &call) { keep_alive_impl(Nurse, Patient, call, handle()); }
template <size_t N = Nurse, size_t P = Patient, enable_if_t<N != 0 && P != 0, int> = 0>
static void postcall(function_call &, handle) { }
template <size_t N = Nurse, size_t P = Patient, enable_if_t<N == 0 || P == 0, int> = 0>
static void precall(function_call &) { }
template <size_t N = Nurse, size_t P = Patient, enable_if_t<N == 0 || P == 0, int> = 0>
static void postcall(function_call &call, handle ret) { keep_alive_impl(Nurse, Patient, call, ret); }
};
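A usage sketch for the policy above (the `List`/`Item` types are hypothetical): `keep_alive<1, 2>` keeps argument 2 (the item) alive at least as long as argument 1 (the implicit `self`):

#include <pybind11/pybind11.h>
#include <vector>

namespace py = pybind11;

struct Item {};
struct List {
    void append(Item *item) { items.push_back(item); }
    std::vector<Item *> items;
};

PYBIND11_MODULE(example, m) {
    py::class_<Item>(m, "Item").def(py::init<>());
    py::class_<List>(m, "List")
        .def(py::init<>())
        // Nurse = 1 (self), Patient = 2 (item): both indices are nonzero,
        // so keep_alive_impl runs in the pre-call handler above.
        .def("append", &List::append, py::keep_alive<1, 2>());
}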
/// Recursively iterate over variadic template arguments
template <typename... Args> struct process_attributes {
static void init(const Args&... args, function_record *r) {
int unused[] = { 0, (process_attribute<typename std::decay<Args>::type>::init(args, r), 0) ... };
ignore_unused(unused);
}
static void init(const Args&... args, type_record *r) {
int unused[] = { 0, (process_attribute<typename std::decay<Args>::type>::init(args, r), 0) ... };
ignore_unused(unused);
}
static void precall(function_call &call) {
int unused[] = { 0, (process_attribute<typename std::decay<Args>::type>::precall(call), 0) ... };
ignore_unused(unused);
}
static void postcall(function_call &call, handle fn_ret) {
int unused[] = { 0, (process_attribute<typename std::decay<Args>::type>::postcall(call, fn_ret), 0) ... };
ignore_unused(unused);
}
};
template <typename T>
using is_call_guard = is_instantiation<call_guard, T>;
/// Extract the ``type`` from the first `call_guard` in `Extras...` (or `void_type` if none found)
template <typename... Extra>
using extract_guard_t = typename exactly_one_t<is_call_guard, call_guard<>, Extra...>::type;
/// Check the number of named arguments at compile time
template <typename... Extra,
size_t named = constexpr_sum(std::is_base_of<arg, Extra>::value...),
size_t self = constexpr_sum(std::is_same<is_method, Extra>::value...)>
constexpr bool expected_num_args(size_t nargs, bool has_args, bool has_kwargs) {
return named == 0 || (self + named + has_args + has_kwargs) == nargs;
}
NAMESPACE_END(detail)
NAMESPACE_END(PYBIND11_NAMESPACE)


@@ -1,108 +0,0 @@
/*
pybind11/buffer_info.h: Python buffer object interface
Copyright (c) 2016 Wenzel Jakob <wenzel.jakob@epfl.ch>
All rights reserved. Use of this source code is governed by a
BSD-style license that can be found in the LICENSE file.
*/
#pragma once
#include "detail/common.h"
NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
/// Information record describing a Python buffer object
struct buffer_info {
void *ptr = nullptr; // Pointer to the underlying storage
ssize_t itemsize = 0; // Size of individual items in bytes
ssize_t size = 0; // Total number of entries
std::string format; // For homogeneous buffers, this should be set to format_descriptor<T>::format()
ssize_t ndim = 0; // Number of dimensions
std::vector<ssize_t> shape; // Shape of the tensor (1 entry per dimension)
std::vector<ssize_t> strides; // Number of bytes between adjacent entries (per dimension)
buffer_info() { }
buffer_info(void *ptr, ssize_t itemsize, const std::string &format, ssize_t ndim,
detail::any_container<ssize_t> shape_in, detail::any_container<ssize_t> strides_in)
: ptr(ptr), itemsize(itemsize), size(1), format(format), ndim(ndim),
shape(std::move(shape_in)), strides(std::move(strides_in)) {
if (ndim != (ssize_t) shape.size() || ndim != (ssize_t) strides.size())
pybind11_fail("buffer_info: ndim doesn't match shape and/or strides length");
for (size_t i = 0; i < (size_t) ndim; ++i)
size *= shape[i];
}
template <typename T>
buffer_info(T *ptr, detail::any_container<ssize_t> shape_in, detail::any_container<ssize_t> strides_in)
: buffer_info(private_ctr_tag(), ptr, sizeof(T), format_descriptor<T>::format(), static_cast<ssize_t>(shape_in->size()), std::move(shape_in), std::move(strides_in)) { }
buffer_info(void *ptr, ssize_t itemsize, const std::string &format, ssize_t size)
: buffer_info(ptr, itemsize, format, 1, {size}, {itemsize}) { }
template <typename T>
buffer_info(T *ptr, ssize_t size)
: buffer_info(ptr, sizeof(T), format_descriptor<T>::format(), size) { }
explicit buffer_info(Py_buffer *view, bool ownview = true)
: buffer_info(view->buf, view->itemsize, view->format, view->ndim,
{view->shape, view->shape + view->ndim}, {view->strides, view->strides + view->ndim}) {
this->view = view;
this->ownview = ownview;
}
buffer_info(const buffer_info &) = delete;
buffer_info& operator=(const buffer_info &) = delete;
buffer_info(buffer_info &&other) {
(*this) = std::move(other);
}
buffer_info& operator=(buffer_info &&rhs) {
ptr = rhs.ptr;
itemsize = rhs.itemsize;
size = rhs.size;
format = std::move(rhs.format);
ndim = rhs.ndim;
shape = std::move(rhs.shape);
strides = std::move(rhs.strides);
std::swap(view, rhs.view);
std::swap(ownview, rhs.ownview);
return *this;
}
~buffer_info() {
if (view && ownview) { PyBuffer_Release(view); delete view; }
}
private:
struct private_ctr_tag { };
buffer_info(private_ctr_tag, void *ptr, ssize_t itemsize, const std::string &format, ssize_t ndim,
detail::any_container<ssize_t> &&shape_in, detail::any_container<ssize_t> &&strides_in)
: buffer_info(ptr, itemsize, format, ndim, std::move(shape_in), std::move(strides_in)) { }
Py_buffer *view = nullptr;
bool ownview = false;
};
NAMESPACE_BEGIN(detail)
template <typename T, typename SFINAE = void> struct compare_buffer_info {
static bool compare(const buffer_info& b) {
return b.format == format_descriptor<T>::format() && b.itemsize == (ssize_t) sizeof(T);
}
};
template <typename T> struct compare_buffer_info<T, detail::enable_if_t<std::is_integral<T>::value>> {
static bool compare(const buffer_info& b) {
return (size_t) b.itemsize == sizeof(T) && (b.format == format_descriptor<T>::value ||
((sizeof(T) == sizeof(long)) && b.format == (std::is_unsigned<T>::value ? "L" : "l")) ||
((sizeof(T) == sizeof(size_t)) && b.format == (std::is_unsigned<T>::value ? "N" : "n")));
}
};
NAMESPACE_END(detail)
NAMESPACE_END(PYBIND11_NAMESPACE)
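For reference, this struct is what a `def_buffer` lambda returns when exposing a C++ type to NumPy and friends; a minimal sketch under the usual pybind11 pattern (the `Matrix` type is hypothetical):

#include <pybind11/pybind11.h>
#include <vector>

namespace py = pybind11;

struct Matrix {
    Matrix(py::ssize_t rows, py::ssize_t cols)
        : rows(rows), cols(cols), data(static_cast<size_t>(rows * cols)) {}
    py::ssize_t rows, cols;
    std::vector<double> data;
};

PYBIND11_MODULE(example, m) {
    py::class_<Matrix>(m, "Matrix", py::buffer_protocol())
        .def(py::init<py::ssize_t, py::ssize_t>())
        .def_buffer([](Matrix &mat) -> py::buffer_info {
            return py::buffer_info(
                mat.data.data(),                         // pointer to storage
                sizeof(double),                          // itemsize
                py::format_descriptor<double>::format(), // format string ("d")
                2,                                       // number of dimensions
                {mat.rows, mat.cols},                    // shape
                {static_cast<py::ssize_t>(sizeof(double)) * mat.cols,
                 static_cast<py::ssize_t>(sizeof(double))}); // strides in bytes
        });
}

On the Python side, `memoryview(example.Matrix(2, 3))` would then expose a 2x3 float64 view over the C++ storage.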

(File diff suppressed because it is too large.)


@@ -1,162 +0,0 @@
/*
pybind11/chrono.h: Transparent conversion between std::chrono and python's datetime
Copyright (c) 2016 Trent Houliston <trent@houliston.me> and
Wenzel Jakob <wenzel.jakob@epfl.ch>
All rights reserved. Use of this source code is governed by a
BSD-style license that can be found in the LICENSE file.
*/
#pragma once
#include "pybind11.h"
#include <cmath>
#include <ctime>
#include <chrono>
#include <datetime.h>
// Backport the PyDateTime_DELTA functions from Python3.3 if required
#ifndef PyDateTime_DELTA_GET_DAYS
#define PyDateTime_DELTA_GET_DAYS(o) (((PyDateTime_Delta*)o)->days)
#endif
#ifndef PyDateTime_DELTA_GET_SECONDS
#define PyDateTime_DELTA_GET_SECONDS(o) (((PyDateTime_Delta*)o)->seconds)
#endif
#ifndef PyDateTime_DELTA_GET_MICROSECONDS
#define PyDateTime_DELTA_GET_MICROSECONDS(o) (((PyDateTime_Delta*)o)->microseconds)
#endif
NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
NAMESPACE_BEGIN(detail)
template <typename type> class duration_caster {
public:
typedef typename type::rep rep;
typedef typename type::period period;
typedef std::chrono::duration<uint_fast32_t, std::ratio<86400>> days;
bool load(handle src, bool) {
using namespace std::chrono;
// Lazy initialise the PyDateTime import
if (!PyDateTimeAPI) { PyDateTime_IMPORT; }
if (!src) return false;
// If invoked with datetime.delta object
if (PyDelta_Check(src.ptr())) {
value = type(duration_cast<duration<rep, period>>(
days(PyDateTime_DELTA_GET_DAYS(src.ptr()))
+ seconds(PyDateTime_DELTA_GET_SECONDS(src.ptr()))
+ microseconds(PyDateTime_DELTA_GET_MICROSECONDS(src.ptr()))));
return true;
}
// If invoked with a float we assume it is seconds and convert
else if (PyFloat_Check(src.ptr())) {
value = type(duration_cast<duration<rep, period>>(duration<double>(PyFloat_AsDouble(src.ptr()))));
return true;
}
else return false;
}
// If this is a duration just return it back
static const std::chrono::duration<rep, period>& get_duration(const std::chrono::duration<rep, period> &src) {
return src;
}
// If this is a time_point get the time_since_epoch
template <typename Clock> static std::chrono::duration<rep, period> get_duration(const std::chrono::time_point<Clock, std::chrono::duration<rep, period>> &src) {
return src.time_since_epoch();
}
static handle cast(const type &src, return_value_policy /* policy */, handle /* parent */) {
using namespace std::chrono;
// Use overloaded function to get our duration from our source
// Works out if it is a duration or time_point and get the duration
auto d = get_duration(src);
// Lazy initialise the PyDateTime import
if (!PyDateTimeAPI) { PyDateTime_IMPORT; }
// Declare these special duration types so the conversions happen with the correct primitive types (int)
using dd_t = duration<int, std::ratio<86400>>;
using ss_t = duration<int, std::ratio<1>>;
using us_t = duration<int, std::micro>;
auto dd = duration_cast<dd_t>(d);
auto subd = d - dd;
auto ss = duration_cast<ss_t>(subd);
auto us = duration_cast<us_t>(subd - ss);
return PyDelta_FromDSU(dd.count(), ss.count(), us.count());
}
PYBIND11_TYPE_CASTER(type, _("datetime.timedelta"));
};
// This is for casting times on the system clock into datetime.datetime instances
template <typename Duration> class type_caster<std::chrono::time_point<std::chrono::system_clock, Duration>> {
public:
typedef std::chrono::time_point<std::chrono::system_clock, Duration> type;
bool load(handle src, bool) {
using namespace std::chrono;
// Lazy initialise the PyDateTime import
if (!PyDateTimeAPI) { PyDateTime_IMPORT; }
if (!src) return false;
if (PyDateTime_Check(src.ptr())) {
std::tm cal;
cal.tm_sec = PyDateTime_DATE_GET_SECOND(src.ptr());
cal.tm_min = PyDateTime_DATE_GET_MINUTE(src.ptr());
cal.tm_hour = PyDateTime_DATE_GET_HOUR(src.ptr());
cal.tm_mday = PyDateTime_GET_DAY(src.ptr());
cal.tm_mon = PyDateTime_GET_MONTH(src.ptr()) - 1;
cal.tm_year = PyDateTime_GET_YEAR(src.ptr()) - 1900;
cal.tm_isdst = -1;
value = system_clock::from_time_t(std::mktime(&cal)) + microseconds(PyDateTime_DATE_GET_MICROSECOND(src.ptr()));
return true;
}
else return false;
}
static handle cast(const std::chrono::time_point<std::chrono::system_clock, Duration> &src, return_value_policy /* policy */, handle /* parent */) {
using namespace std::chrono;
// Lazy initialise the PyDateTime import
if (!PyDateTimeAPI) { PyDateTime_IMPORT; }
std::time_t tt = system_clock::to_time_t(src);
// this function uses static memory so it's best to copy it out asap just in case
// otherwise other code that is using localtime may break this (not just python code)
std::tm localtime = *std::localtime(&tt);
// Declare these special duration types so the conversions happen with the correct primitive types (int)
using us_t = duration<int, std::micro>;
return PyDateTime_FromDateAndTime(localtime.tm_year + 1900,
localtime.tm_mon + 1,
localtime.tm_mday,
localtime.tm_hour,
localtime.tm_min,
localtime.tm_sec,
(duration_cast<us_t>(src.time_since_epoch() % seconds(1))).count());
}
PYBIND11_TYPE_CASTER(type, _("datetime.datetime"));
};
// Other clocks that are not the system clock are not measured as datetime.datetime objects
// since they do not run on calendar time; we make them timedeltas instead,
// or, if a time was passed to us as a float, we convert that.
template <typename Clock, typename Duration> class type_caster<std::chrono::time_point<Clock, Duration>>
: public duration_caster<std::chrono::time_point<Clock, Duration>> {
};
template <typename Rep, typename Period> class type_caster<std::chrono::duration<Rep, Period>>
: public duration_caster<std::chrono::duration<Rep, Period>> {
};
NAMESPACE_END(detail)
NAMESPACE_END(PYBIND11_NAMESPACE)
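A usage sketch of these casters (the module name `example` is hypothetical): with `pybind11/chrono.h` included, `datetime.timedelta` converts to any `std::chrono::duration` at the call boundary and back on return:

#include <pybind11/pybind11.h>
#include <pybind11/chrono.h>
#include <chrono>

namespace py = pybind11;

PYBIND11_MODULE(example, m) {
    // datetime.timedelta -> std::chrono::milliseconds on the way in,
    // std::chrono::milliseconds -> datetime.timedelta on the way out.
    m.def("double_it", [](std::chrono::milliseconds d) { return d * 2; });
    // Python: example.double_it(datetime.timedelta(milliseconds=5))
    //         == datetime.timedelta(milliseconds=10)
}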


@@ -1,2 +0,0 @@
#include "detail/common.h"
#warning "Including 'common.h' is deprecated. It will be removed in v3.0. Use 'pybind11.h'."


@@ -1,65 +0,0 @@
/*
pybind11/complex.h: Complex number support
Copyright (c) 2016 Wenzel Jakob <wenzel.jakob@epfl.ch>
All rights reserved. Use of this source code is governed by a
BSD-style license that can be found in the LICENSE file.
*/
#pragma once
#include "pybind11.h"
#include <complex>
/// glibc defines I as a macro which breaks things, e.g., boost template names
#ifdef I
# undef I
#endif
NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
template <typename T> struct format_descriptor<std::complex<T>, detail::enable_if_t<std::is_floating_point<T>::value>> {
static constexpr const char c = format_descriptor<T>::c;
static constexpr const char value[3] = { 'Z', c, '\0' };
static std::string format() { return std::string(value); }
};
#ifndef PYBIND11_CPP17
template <typename T> constexpr const char format_descriptor<
std::complex<T>, detail::enable_if_t<std::is_floating_point<T>::value>>::value[3];
#endif
NAMESPACE_BEGIN(detail)
template <typename T> struct is_fmt_numeric<std::complex<T>, detail::enable_if_t<std::is_floating_point<T>::value>> {
static constexpr bool value = true;
static constexpr int index = is_fmt_numeric<T>::index + 3;
};
template <typename T> class type_caster<std::complex<T>> {
public:
bool load(handle src, bool convert) {
if (!src)
return false;
if (!convert && !PyComplex_Check(src.ptr()))
return false;
Py_complex result = PyComplex_AsCComplex(src.ptr());
if (result.real == -1.0 && PyErr_Occurred()) {
PyErr_Clear();
return false;
}
value = std::complex<T>((T) result.real, (T) result.imag);
return true;
}
static handle cast(const std::complex<T> &src, return_value_policy /* policy */, handle /* parent */) {
return PyComplex_FromDoubles((double) src.real(), (double) src.imag());
}
PYBIND11_TYPE_CASTER(std::complex<T>, _("complex"));
};
NAMESPACE_END(detail)
NAMESPACE_END(PYBIND11_NAMESPACE)
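Correspondingly, a small sketch (hypothetical module `example`): including `pybind11/complex.h` is all that is needed for Python `complex` values to cross the boundary:

#include <pybind11/pybind11.h>
#include <pybind11/complex.h>
#include <complex>

namespace py = pybind11;

PYBIND11_MODULE(example, m) {
    // Python: example.magnitude(3 + 4j) -> 5.0
    m.def("magnitude", [](std::complex<double> z) { return std::abs(z); });
}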


@@ -1,623 +0,0 @@
/*
pybind11/detail/class.h: Python C API implementation details for py::class_
Copyright (c) 2017 Wenzel Jakob <wenzel.jakob@epfl.ch>
All rights reserved. Use of this source code is governed by a
BSD-style license that can be found in the LICENSE file.
*/
#pragma once
#include "../attr.h"
#include "../options.h"
NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
NAMESPACE_BEGIN(detail)
#if PY_VERSION_HEX >= 0x03030000
# define PYBIND11_BUILTIN_QUALNAME
# define PYBIND11_SET_OLDPY_QUALNAME(obj, nameobj)
#else
// In pre-3.3 Python, we still set __qualname__ so that we can produce reliable function type
// signatures; in 3.3+ this macro expands to nothing:
# define PYBIND11_SET_OLDPY_QUALNAME(obj, nameobj) setattr((PyObject *) obj, "__qualname__", nameobj)
#endif
inline PyTypeObject *type_incref(PyTypeObject *type) {
Py_INCREF(type);
return type;
}
#if !defined(PYPY_VERSION)
/// `pybind11_static_property.__get__()`: Always pass the class instead of the instance.
extern "C" inline PyObject *pybind11_static_get(PyObject *self, PyObject * /*ob*/, PyObject *cls) {
return PyProperty_Type.tp_descr_get(self, cls, cls);
}
/// `pybind11_static_property.__set__()`: Just like the above `__get__()`.
extern "C" inline int pybind11_static_set(PyObject *self, PyObject *obj, PyObject *value) {
PyObject *cls = PyType_Check(obj) ? obj : (PyObject *) Py_TYPE(obj);
return PyProperty_Type.tp_descr_set(self, cls, value);
}
/** A `static_property` is the same as a `property` but the `__get__()` and `__set__()`
methods are modified to always use the object type instead of a concrete instance.
Return value: New reference. */
inline PyTypeObject *make_static_property_type() {
constexpr auto *name = "pybind11_static_property";
auto name_obj = reinterpret_steal<object>(PYBIND11_FROM_STRING(name));
/* Danger zone: from now (and until PyType_Ready), make sure to
issue no Python C API calls which could potentially invoke the
garbage collector (the GC will call type_traverse(), which will in
turn find the newly constructed type in an invalid state) */
auto heap_type = (PyHeapTypeObject *) PyType_Type.tp_alloc(&PyType_Type, 0);
if (!heap_type)
pybind11_fail("make_static_property_type(): error allocating type!");
heap_type->ht_name = name_obj.inc_ref().ptr();
#ifdef PYBIND11_BUILTIN_QUALNAME
heap_type->ht_qualname = name_obj.inc_ref().ptr();
#endif
auto type = &heap_type->ht_type;
type->tp_name = name;
type->tp_base = type_incref(&PyProperty_Type);
type->tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE;
type->tp_descr_get = pybind11_static_get;
type->tp_descr_set = pybind11_static_set;
if (PyType_Ready(type) < 0)
pybind11_fail("make_static_property_type(): failure in PyType_Ready()!");
setattr((PyObject *) type, "__module__", str("pybind11_builtins"));
PYBIND11_SET_OLDPY_QUALNAME(type, name_obj);
return type;
}
#else // PYPY
/** PyPy has some issues with the above C API, so we evaluate Python code instead.
This function will only be called once so performance isn't really a concern.
Return value: New reference. */
inline PyTypeObject *make_static_property_type() {
auto d = dict();
PyObject *result = PyRun_String(R"(\
class pybind11_static_property(property):
def __get__(self, obj, cls):
return property.__get__(self, cls, cls)
def __set__(self, obj, value):
cls = obj if isinstance(obj, type) else type(obj)
property.__set__(self, cls, value)
)", Py_file_input, d.ptr(), d.ptr()
);
if (result == nullptr)
throw error_already_set();
Py_DECREF(result);
return (PyTypeObject *) d["pybind11_static_property"].cast<object>().release().ptr();
}
#endif // PYPY
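What this buys in user code, as a hedged sketch (the `Config` type is hypothetical): static members exposed through `def_readwrite_static` are wrapped in the `pybind11_static_property` type built above, so assignment on the class updates the C++ side:

#include <pybind11/pybind11.h>

namespace py = pybind11;

struct Config { static int verbosity; };
int Config::verbosity = 0;

PYBIND11_MODULE(example, m) {
    py::class_<Config>(m, "Config")
        .def_readwrite_static("verbosity", &Config::verbosity);
    // Python: example.Config.verbosity = 3  # routed through __set__ above
}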
/** Types with static properties need to handle `Type.static_prop = x` in a specific way.
By default, Python replaces the `static_property` itself, but for wrapped C++ types
we need to call `static_property.__set__()` in order to propagate the new value to
the underlying C++ data structure. */
extern "C" inline int pybind11_meta_setattro(PyObject* obj, PyObject* name, PyObject* value) {
// Use `_PyType_Lookup()` instead of `PyObject_GetAttr()` in order to get the raw
// descriptor (`property`) instead of calling `tp_descr_get` (`property.__get__()`).
PyObject *descr = _PyType_Lookup((PyTypeObject *) obj, name);
// The following assignment combinations are possible:
// 1. `Type.static_prop = value` --> descr_set: `Type.static_prop.__set__(value)`
// 2. `Type.static_prop = other_static_prop` --> setattro: replace existing `static_prop`
// 3. `Type.regular_attribute = value` --> setattro: regular attribute assignment
const auto static_prop = (PyObject *) get_internals().static_property_type;
const auto call_descr_set = descr && PyObject_IsInstance(descr, static_prop)
&& !PyObject_IsInstance(value, static_prop);
if (call_descr_set) {
// Call `static_property.__set__()` instead of replacing the `static_property`.
#if !defined(PYPY_VERSION)
return Py_TYPE(descr)->tp_descr_set(descr, obj, value);
#else
if (PyObject *result = PyObject_CallMethod(descr, "__set__", "OO", obj, value)) {
Py_DECREF(result);
return 0;
} else {
return -1;
}
#endif
} else {
// Replace existing attribute.
return PyType_Type.tp_setattro(obj, name, value);
}
}
#if PY_MAJOR_VERSION >= 3
/**
* Python 3's PyInstanceMethod_Type hides itself via its tp_descr_get, which prevents aliasing
* methods via cls.attr("m2") = cls.attr("m1"): instead the tp_descr_get returns a plain function,
* when called on a class, or a PyMethod, when called on an instance. Override that behaviour here
* to do a special case bypass for PyInstanceMethod_Types.
*/
extern "C" inline PyObject *pybind11_meta_getattro(PyObject *obj, PyObject *name) {
PyObject *descr = _PyType_Lookup((PyTypeObject *) obj, name);
if (descr && PyInstanceMethod_Check(descr)) {
Py_INCREF(descr);
return descr;
}
else {
return PyType_Type.tp_getattro(obj, name);
}
}
#endif
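The bypass above is what makes method aliasing work; a minimal sketch (hypothetical `Widget` type):

#include <pybind11/pybind11.h>

namespace py = pybind11;

struct Widget {
    void resize(int n) { size = n; }
    int size = 0;
};

PYBIND11_MODULE(example, m) {
    auto cls = py::class_<Widget>(m, "Widget")
        .def(py::init<>())
        .def("resize", &Widget::resize);
    // Thanks to pybind11_meta_getattro, this yields a proper unbound method
    // rather than a plain function stripped of its descriptor behaviour.
    cls.attr("scale") = cls.attr("resize");
}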
/** This metaclass is assigned by default to all pybind11 types and is required in order
for static properties to function correctly. Users may override this using `py::metaclass`.
Return value: New reference. */
inline PyTypeObject* make_default_metaclass() {
constexpr auto *name = "pybind11_type";
auto name_obj = reinterpret_steal<object>(PYBIND11_FROM_STRING(name));
/* Danger zone: from now (and until PyType_Ready), make sure to
issue no Python C API calls which could potentially invoke the
garbage collector (the GC will call type_traverse(), which will in
turn find the newly constructed type in an invalid state) */
auto heap_type = (PyHeapTypeObject *) PyType_Type.tp_alloc(&PyType_Type, 0);
if (!heap_type)
pybind11_fail("make_default_metaclass(): error allocating metaclass!");
heap_type->ht_name = name_obj.inc_ref().ptr();
#ifdef PYBIND11_BUILTIN_QUALNAME
heap_type->ht_qualname = name_obj.inc_ref().ptr();
#endif
auto type = &heap_type->ht_type;
type->tp_name = name;
type->tp_base = type_incref(&PyType_Type);
type->tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE;
type->tp_setattro = pybind11_meta_setattro;
#if PY_MAJOR_VERSION >= 3
type->tp_getattro = pybind11_meta_getattro;
#endif
if (PyType_Ready(type) < 0)
pybind11_fail("make_default_metaclass(): failure in PyType_Ready()!");
setattr((PyObject *) type, "__module__", str("pybind11_builtins"));
PYBIND11_SET_OLDPY_QUALNAME(type, name_obj);
return type;
}
/// For multiple inheritance types we need to recursively register/deregister base pointers for any
/// base classes with pointers that are different from the instance value pointer so that we can
/// correctly recognize an offset base class pointer. This calls a function with any offset base ptrs.
inline void traverse_offset_bases(void *valueptr, const detail::type_info *tinfo, instance *self,
bool (*f)(void * /*parentptr*/, instance * /*self*/)) {
for (handle h : reinterpret_borrow<tuple>(tinfo->type->tp_bases)) {
if (auto parent_tinfo = get_type_info((PyTypeObject *) h.ptr())) {
for (auto &c : parent_tinfo->implicit_casts) {
if (c.first == tinfo->cpptype) {
auto *parentptr = c.second(valueptr);
if (parentptr != valueptr)
f(parentptr, self);
traverse_offset_bases(parentptr, parent_tinfo, self, f);
break;
}
}
}
}
}
inline bool register_instance_impl(void *ptr, instance *self) {
get_internals().registered_instances.emplace(ptr, self);
return true; // unused, but gives the same signature as the deregister func
}
inline bool deregister_instance_impl(void *ptr, instance *self) {
auto &registered_instances = get_internals().registered_instances;
auto range = registered_instances.equal_range(ptr);
for (auto it = range.first; it != range.second; ++it) {
if (Py_TYPE(self) == Py_TYPE(it->second)) {
registered_instances.erase(it);
return true;
}
}
return false;
}
inline void register_instance(instance *self, void *valptr, const type_info *tinfo) {
register_instance_impl(valptr, self);
if (!tinfo->simple_ancestors)
traverse_offset_bases(valptr, tinfo, self, register_instance_impl);
}
inline bool deregister_instance(instance *self, void *valptr, const type_info *tinfo) {
bool ret = deregister_instance_impl(valptr, self);
if (!tinfo->simple_ancestors)
traverse_offset_bases(valptr, tinfo, self, deregister_instance_impl);
return ret;
}
/// Instance creation function for all pybind11 types. It allocates the internal instance layout for
/// holding C++ objects and holders. Allocation is done lazily (the first time the instance is cast
/// to a reference or pointer), and initialization is done by an `__init__` function.
inline PyObject *make_new_instance(PyTypeObject *type) {
#if defined(PYPY_VERSION)
// PyPy gets tp_basicsize wrong (issue 2482) under multiple inheritance when the first inherited
// object is a plain Python type (i.e. not derived from an extension type). Fix it.
ssize_t instance_size = static_cast<ssize_t>(sizeof(instance));
if (type->tp_basicsize < instance_size) {
type->tp_basicsize = instance_size;
}
#endif
PyObject *self = type->tp_alloc(type, 0);
auto inst = reinterpret_cast<instance *>(self);
// Allocate the value/holder internals:
inst->allocate_layout();
inst->owned = true;
return self;
}
/// Instance creation function for all pybind11 types. It only allocates space for the
/// C++ object, but doesn't call the constructor -- an `__init__` function must do that.
extern "C" inline PyObject *pybind11_object_new(PyTypeObject *type, PyObject *, PyObject *) {
return make_new_instance(type);
}
/// An `__init__` function constructs the C++ object. Users should provide at least one
/// of these using `py::init` or directly with `.def(__init__, ...)`. Otherwise, the
/// following default function will be used which simply throws an exception.
extern "C" inline int pybind11_object_init(PyObject *self, PyObject *, PyObject *) {
PyTypeObject *type = Py_TYPE(self);
std::string msg;
#if defined(PYPY_VERSION)
msg += handle((PyObject *) type).attr("__module__").cast<std::string>() + ".";
#endif
msg += type->tp_name;
msg += ": No constructor defined!";
PyErr_SetString(PyExc_TypeError, msg.c_str());
return -1;
}
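For contrast with the failing default above, a minimal sketch of providing a real constructor (the `Pet` type is hypothetical):

#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

struct Pet {
    explicit Pet(std::string name) : name(std::move(name)) {}
    std::string name;
};

PYBIND11_MODULE(example, m) {
    // py::init<...> installs an __init__ that constructs the C++ object
    // in-place; without it, pybind11_object_init raises the TypeError above.
    py::class_<Pet>(m, "Pet")
        .def(py::init<std::string>())
        .def_readonly("name", &Pet::name);
}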
inline void add_patient(PyObject *nurse, PyObject *patient) {
auto &internals = get_internals();
auto instance = reinterpret_cast<detail::instance *>(nurse);
instance->has_patients = true;
Py_INCREF(patient);
internals.patients[nurse].push_back(patient);
}
inline void clear_patients(PyObject *self) {
auto instance = reinterpret_cast<detail::instance *>(self);
auto &internals = get_internals();
auto pos = internals.patients.find(self);
assert(pos != internals.patients.end());
// Clearing the patients can cause more Python code to run, which
// can invalidate the iterator. Extract the vector of patients
// from the unordered_map first.
auto patients = std::move(pos->second);
internals.patients.erase(pos);
instance->has_patients = false;
for (PyObject *&patient : patients)
Py_CLEAR(patient);
}
/// Clears all internal data from the instance and removes it from registered instances in
/// preparation for deallocation.
inline void clear_instance(PyObject *self) {
auto instance = reinterpret_cast<detail::instance *>(self);
// Deallocate any values/holders, if present:
for (auto &v_h : values_and_holders(instance)) {
if (v_h) {
// We have to deregister before we call dealloc because, for virtual MI types, we still
// need to be able to get the parent pointers.
if (v_h.instance_registered() && !deregister_instance(instance, v_h.value_ptr(), v_h.type))
pybind11_fail("pybind11_object_dealloc(): Tried to deallocate unregistered instance!");
if (instance->owned || v_h.holder_constructed())
v_h.type->dealloc(v_h);
}
}
// Deallocate the value/holder layout internals:
instance->deallocate_layout();
if (instance->weakrefs)
PyObject_ClearWeakRefs(self);
PyObject **dict_ptr = _PyObject_GetDictPtr(self);
if (dict_ptr)
Py_CLEAR(*dict_ptr);
if (instance->has_patients)
clear_patients(self);
}
/// Instance destructor function for all pybind11 types. It calls `type_info.dealloc`
/// to destroy the C++ object itself, while the rest is Python bookkeeping.
extern "C" inline void pybind11_object_dealloc(PyObject *self) {
clear_instance(self);
auto type = Py_TYPE(self);
type->tp_free(self);
// `type->tp_dealloc != pybind11_object_dealloc` means that we're being called
// as part of a derived type's dealloc, in which case we're not allowed to decref
// the type here. For cross-module compatibility, we shouldn't compare directly
// with `pybind11_object_dealloc`, but with the common one stashed in internals.
auto pybind11_object_type = (PyTypeObject *) get_internals().instance_base;
if (type->tp_dealloc == pybind11_object_type->tp_dealloc)
Py_DECREF(type);
}
/** Create the type which can be used as a common base for all classes. This is
needed in order to satisfy Python's requirements for multiple inheritance.
Return value: New reference. */
inline PyObject *make_object_base_type(PyTypeObject *metaclass) {
constexpr auto *name = "pybind11_object";
auto name_obj = reinterpret_steal<object>(PYBIND11_FROM_STRING(name));
/* Danger zone: from now (and until PyType_Ready), make sure to
issue no Python C API calls which could potentially invoke the
garbage collector (the GC will call type_traverse(), which will in
turn find the newly constructed type in an invalid state) */
auto heap_type = (PyHeapTypeObject *) metaclass->tp_alloc(metaclass, 0);
if (!heap_type)
pybind11_fail("make_object_base_type(): error allocating type!");
heap_type->ht_name = name_obj.inc_ref().ptr();
#ifdef PYBIND11_BUILTIN_QUALNAME
heap_type->ht_qualname = name_obj.inc_ref().ptr();
#endif
auto type = &heap_type->ht_type;
type->tp_name = name;
type->tp_base = type_incref(&PyBaseObject_Type);
type->tp_basicsize = static_cast<ssize_t>(sizeof(instance));
type->tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE;
type->tp_new = pybind11_object_new;
type->tp_init = pybind11_object_init;
type->tp_dealloc = pybind11_object_dealloc;
/* Support weak references (needed for the keep_alive feature) */
type->tp_weaklistoffset = offsetof(instance, weakrefs);
if (PyType_Ready(type) < 0)
pybind11_fail("PyType_Ready failed in make_object_base_type():" + error_string());
setattr((PyObject *) type, "__module__", str("pybind11_builtins"));
PYBIND11_SET_OLDPY_QUALNAME(type, name_obj);
assert(!PyType_HasFeature(type, Py_TPFLAGS_HAVE_GC));
return (PyObject *) heap_type;
}
/// dynamic_attr: Support for `d = instance.__dict__`.
extern "C" inline PyObject *pybind11_get_dict(PyObject *self, void *) {
PyObject *&dict = *_PyObject_GetDictPtr(self);
if (!dict)
dict = PyDict_New();
Py_XINCREF(dict);
return dict;
}
/// dynamic_attr: Support for `instance.__dict__ = dict()`.
extern "C" inline int pybind11_set_dict(PyObject *self, PyObject *new_dict, void *) {
if (!PyDict_Check(new_dict)) {
PyErr_Format(PyExc_TypeError, "__dict__ must be set to a dictionary, not a '%.200s'",
Py_TYPE(new_dict)->tp_name);
return -1;
}
PyObject *&dict = *_PyObject_GetDictPtr(self);
Py_INCREF(new_dict);
Py_CLEAR(dict);
dict = new_dict;
return 0;
}
/// dynamic_attr: Allow the garbage collector to traverse the internal instance `__dict__`.
extern "C" inline int pybind11_traverse(PyObject *self, visitproc visit, void *arg) {
PyObject *&dict = *_PyObject_GetDictPtr(self);
Py_VISIT(dict);
return 0;
}
/// dynamic_attr: Allow the GC to clear the dictionary.
extern "C" inline int pybind11_clear(PyObject *self) {
PyObject *&dict = *_PyObject_GetDictPtr(self);
Py_CLEAR(dict);
return 0;
}
/// Give instances of this type a `__dict__` and opt into garbage collection.
inline void enable_dynamic_attributes(PyHeapTypeObject *heap_type) {
auto type = &heap_type->ht_type;
#if defined(PYPY_VERSION)
pybind11_fail(std::string(type->tp_name) + ": dynamic attributes are "
"currently not supported in "
"conjunction with PyPy!");
#endif
type->tp_flags |= Py_TPFLAGS_HAVE_GC;
type->tp_dictoffset = type->tp_basicsize; // place dict at the end
type->tp_basicsize += (ssize_t)sizeof(PyObject *); // and allocate enough space for it
type->tp_traverse = pybind11_traverse;
type->tp_clear = pybind11_clear;
static PyGetSetDef getset[] = {
{const_cast<char*>("__dict__"), pybind11_get_dict, pybind11_set_dict, nullptr, nullptr},
{nullptr, nullptr, nullptr, nullptr, nullptr}
};
type->tp_getset = getset;
}
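Enabling this from user code, as a sketch (hypothetical `Pet` type again):

#include <pybind11/pybind11.h>

namespace py = pybind11;

struct Pet {};

PYBIND11_MODULE(example, m) {
    // py::dynamic_attr() sets type_record::dynamic_attr, which routes
    // through enable_dynamic_attributes above and adds __dict__ support.
    py::class_<Pet>(m, "Pet", py::dynamic_attr())
        .def(py::init<>());
    // Python: p = example.Pet(); p.nickname = "Rex"  # now allowed
}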
/// buffer_protocol: Fill in the view as specified by flags.
extern "C" inline int pybind11_getbuffer(PyObject *obj, Py_buffer *view, int flags) {
// Look for a `get_buffer` implementation in this type's info or any bases (following MRO).
type_info *tinfo = nullptr;
for (auto type : reinterpret_borrow<tuple>(Py_TYPE(obj)->tp_mro)) {
tinfo = get_type_info((PyTypeObject *) type.ptr());
if (tinfo && tinfo->get_buffer)
break;
}
if (view == nullptr || !tinfo || !tinfo->get_buffer) {
if (view)
view->obj = nullptr;
PyErr_SetString(PyExc_BufferError, "pybind11_getbuffer(): Internal error");
return -1;
}
std::memset(view, 0, sizeof(Py_buffer));
buffer_info *info = tinfo->get_buffer(obj, tinfo->get_buffer_data);
view->obj = obj;
view->ndim = 1;
view->internal = info;
view->buf = info->ptr;
view->itemsize = info->itemsize;
view->len = view->itemsize;
for (auto s : info->shape)
view->len *= s;
if ((flags & PyBUF_FORMAT) == PyBUF_FORMAT)
view->format = const_cast<char *>(info->format.c_str());
if ((flags & PyBUF_STRIDES) == PyBUF_STRIDES) {
view->ndim = (int) info->ndim;
view->strides = &info->strides[0];
view->shape = &info->shape[0];
}
Py_INCREF(view->obj);
return 0;
}
/// buffer_protocol: Release the resources of the buffer.
extern "C" inline void pybind11_releasebuffer(PyObject *, Py_buffer *view) {
delete (buffer_info *) view->internal;
}
/// Give this type a buffer interface.
inline void enable_buffer_protocol(PyHeapTypeObject *heap_type) {
heap_type->ht_type.tp_as_buffer = &heap_type->as_buffer;
#if PY_MAJOR_VERSION < 3
heap_type->ht_type.tp_flags |= Py_TPFLAGS_HAVE_NEWBUFFER;
#endif
heap_type->as_buffer.bf_getbuffer = pybind11_getbuffer;
heap_type->as_buffer.bf_releasebuffer = pybind11_releasebuffer;
}
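The flip side of the protocol wired up here is consuming buffers from C++; a hedged sketch (the function name `sum_doubles` is made up) that accepts any 1-D float64 buffer, e.g. a NumPy array:

#include <pybind11/pybind11.h>
#include <stdexcept>

namespace py = pybind11;

PYBIND11_MODULE(example, m) {
    m.def("sum_doubles", [](py::buffer b) {
        py::buffer_info info = b.request();  // fills a buffer_info via PEP 3118
        if (info.ndim != 1 || info.format != py::format_descriptor<double>::format())
            throw std::runtime_error("expected a 1-D float64 buffer");
        auto *base = static_cast<const char *>(info.ptr);
        double total = 0.0;
        // strides are in bytes, so advance the raw pointer per element
        for (py::ssize_t i = 0; i < info.shape[0]; ++i)
            total += *reinterpret_cast<const double *>(base + i * info.strides[0]);
        return total;
    });
}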
/** Create a brand new Python type according to the `type_record` specification.
Return value: New reference. */
inline PyObject* make_new_python_type(const type_record &rec) {
auto name = reinterpret_steal<object>(PYBIND11_FROM_STRING(rec.name));
auto qualname = name;
if (rec.scope && !PyModule_Check(rec.scope.ptr()) && hasattr(rec.scope, "__qualname__")) {
#if PY_MAJOR_VERSION >= 3
qualname = reinterpret_steal<object>(
PyUnicode_FromFormat("%U.%U", rec.scope.attr("__qualname__").ptr(), name.ptr()));
#else
qualname = str(rec.scope.attr("__qualname__").cast<std::string>() + "." + rec.name);
#endif
}
object module;
if (rec.scope) {
if (hasattr(rec.scope, "__module__"))
module = rec.scope.attr("__module__");
else if (hasattr(rec.scope, "__name__"))
module = rec.scope.attr("__name__");
}
auto full_name = c_str(
#if !defined(PYPY_VERSION)
module ? str(module).cast<std::string>() + "." + rec.name :
#endif
rec.name);
char *tp_doc = nullptr;
if (rec.doc && options::show_user_defined_docstrings()) {
/* Allocate memory for docstring (using PyObject_MALLOC, since
Python will free this later on) */
size_t size = strlen(rec.doc) + 1;
tp_doc = (char *) PyObject_MALLOC(size);
memcpy((void *) tp_doc, rec.doc, size);
}
auto &internals = get_internals();
auto bases = tuple(rec.bases);
auto base = (bases.size() == 0) ? internals.instance_base
: bases[0].ptr();
/* Danger zone: from now (and until PyType_Ready), make sure to
issue no Python C API calls which could potentially invoke the
garbage collector (the GC will call type_traverse(), which will in
turn find the newly constructed type in an invalid state) */
auto metaclass = rec.metaclass.ptr() ? (PyTypeObject *) rec.metaclass.ptr()
: internals.default_metaclass;
auto heap_type = (PyHeapTypeObject *) metaclass->tp_alloc(metaclass, 0);
if (!heap_type)
pybind11_fail(std::string(rec.name) + ": Unable to create type object!");
heap_type->ht_name = name.release().ptr();
#ifdef PYBIND11_BUILTIN_QUALNAME
heap_type->ht_qualname = qualname.inc_ref().ptr();
#endif
auto type = &heap_type->ht_type;
type->tp_name = full_name;
type->tp_doc = tp_doc;
type->tp_base = type_incref((PyTypeObject *)base);
type->tp_basicsize = static_cast<ssize_t>(sizeof(instance));
if (bases.size() > 0)
type->tp_bases = bases.release().ptr();
/* Don't inherit base __init__ */
type->tp_init = pybind11_object_init;
/* Supported protocols */
type->tp_as_number = &heap_type->as_number;
type->tp_as_sequence = &heap_type->as_sequence;
type->tp_as_mapping = &heap_type->as_mapping;
/* Flags */
type->tp_flags |= Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE;
#if PY_MAJOR_VERSION < 3
type->tp_flags |= Py_TPFLAGS_CHECKTYPES;
#endif
if (rec.dynamic_attr)
enable_dynamic_attributes(heap_type);
if (rec.buffer_protocol)
enable_buffer_protocol(heap_type);
if (PyType_Ready(type) < 0)
pybind11_fail(std::string(rec.name) + ": PyType_Ready failed (" + error_string() + ")!");
assert(rec.dynamic_attr ? PyType_HasFeature(type, Py_TPFLAGS_HAVE_GC)
: !PyType_HasFeature(type, Py_TPFLAGS_HAVE_GC));
/* Register type with the parent scope */
if (rec.scope)
setattr(rec.scope, rec.name, (PyObject *) type);
else
Py_INCREF(type); // Keep it alive forever (reference leak)
if (module) // Needed by pydoc
setattr((PyObject *) type, "__module__", module);
PYBIND11_SET_OLDPY_QUALNAME(type, qualname);
return (PyObject *) type;
}
NAMESPACE_END(detail)
NAMESPACE_END(PYBIND11_NAMESPACE)

(Some files were not shown because too many files have changed in this diff.)