[TESTING] use torch.int for autotuning cache (#840)
For stupid reasons, ops on `int8` are 3 times slower than on `int`, and for another set of stupid reasons we are not using `cudaMemset` for `zero_`, so using an `int8` buffer in `do_bench` makes it slow.

Co-authored-by: Philippe Tillet <phil@openai.com>
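A minimal sketch of the idea behind the change, assuming the buffer in question is the cache-flushing scratch tensor that `do_bench` zeroes between timed runs; the helper name `_flush_buffer` below is hypothetical, not Triton's API:

```python
import torch

def _flush_buffer(size_bytes: int = 256 * 1024 * 1024) -> torch.Tensor:
    # Hypothetical helper: allocate the cache-flushing scratch buffer as
    # torch.int (int32) rather than torch.int8. Because zero_() does not
    # go through cudaMemset, zeroing the int8 variant is markedly slower,
    # which inflates every do_bench measurement.
    n = size_bytes // 4  # int32 elements, 4 bytes each
    return torch.empty(n, dtype=torch.int, device="cuda")

buf = _flush_buffer()
buf.zero_()  # cleared before each timed run
```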
parent 77bc5187b5
commit 0d7e753227
@@ -128,7 +128,7 @@ elementwise_data = {
     1024 * 16: 0.0219,
     1024 * 64: 0.0791,
     1024 * 256: 0.243,
-    1024 * 1024: 0.534,
+    1024 * 1024: 0.530,
     1024 * 4096: 0.796,
     1024 * 16384: 0.905,
     1024 * 65536: 0.939,