[PYTHON][TESTS][DOC] Various improvements to the API and code quality:
* Simplified `triton.kernel` API to achieve lower latency:
  > `.data_ptr()` must now be passed as a kernel argument; no more implicit conversion from `torch.tensor`
  > compilation options are now constant attributes, i.e., `opt.d('VAR')` becomes `opt.VAR`
  > `torch.device` must now be passed explicitly to `triton.kernel` (no longer inferred from `torch.tensor` arguments)
* C++ tests moved to `python/tests/`
* C++ tutorial created in `tutorials/`
* Python tutorial created in `python/tutorials/`
* Version changed to 1.0alpha
* No longer copying C++ headers into the Python package
* Added `python/triton/ops/` package for pre-written Triton ops
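The new calling convention described above can be sketched as follows. This is a hedged illustration, not code from this commit: the kernel source `src` and the option name `TILE` are placeholders, and the actual `triton.kernel` call is commented out since it requires a Triton-C source string and a CUDA device. The runnable part only demonstrates that `torch.Tensor.data_ptr()` yields the integer address a caller would now pass explicitly.

```python
import torch

# A CPU tensor is used purely for illustration; real kernels take CUDA tensors.
x = torch.empty(1024, dtype=torch.float16)
ptr = x.data_ptr()          # raw integer address; this is what the kernel now receives
device = torch.device('cuda')  # must now be handed to triton.kernel explicitly

# Hypothetical usage under the new API (src is an elided Triton-C source string):
# kernel = triton.kernel(src, device=device)
# kernel(x.data_ptr(), ...)    # no implicit torch.tensor -> pointer conversion
# TILE = opt.TILE              # was: opt.d('TILE')
```

The change trades convenience for launch latency: skipping per-call tensor inspection and device inference keeps the Python-side dispatch path short.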
python/tests/test_conv.py (new file, 17 lines)
@@ -0,0 +1,17 @@
import torch
import triton


def test_op():
    torch.manual_seed(0)
    DTYPE = torch.float16
    # Problem shape: batch, spatial dims, channels, filter size
    N, H, W, CI, CO, R, S = 1, 56, 56, 1024, 1024, 3, 3
    pad, stride = (1, 1), (1, 1)
    dilation = (1, 1)
    a = torch.rand((N, CI, H, W), dtype=DTYPE, device='cuda') / CI**.5
    b = torch.rand((CI, R, S, CO), dtype=DTYPE, device='cuda') / CI**.5
    # Reference result from PyTorch; permute b to conv2d's (CO, CI, R, S) layout
    th_c = torch.nn.functional.conv2d(a, b.permute(3, 0, 1, 2), None, stride, pad, dilation)
    tt_c = triton.ops.conv(a, b, pad, stride)
    rtol, atol = {torch.float32: (1e-4, 1e-5),
                  torch.float16: (1e-2, 1e-3)}[DTYPE]
    assert torch.allclose(tt_c, th_c, atol=atol, rtol=rtol)