Improved handling of asynchronous copy, scheduling, and synchronization for A100. Now achieving CUTLASS-like performance on large square dense matrix multiplication tasks.
There is no plan to support OpenCL anytime soon (Vulkan would be preferred). Removing the corresponding portion of the driver code.