* Example for blocksparse matrix multiplication
* Simplified Triton kernel API
* Revived auto-tuning in einsum