.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "getting-started/tutorials/05-layer-norm.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_getting-started_tutorials_05-layer-norm.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_getting-started_tutorials_05-layer-norm.py:

Layer Normalization
===================

.. GENERATED FROM PYTHON SOURCE LINES 5-316
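For reference, given a row :math:`x \in \mathbb{R}^N` of the input, layer
normalization computes

.. math::

    \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
    \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
    y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\, w_i + b_i .

This is what the fused forward kernel below evaluates, one row per Triton
program, caching :math:`\mu` and :math:`1/\sqrt{\sigma^2 + \epsilon}` for the
backward pass.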
.. image:: /getting-started/tutorials/images/sphx_glr_05-layer-norm_001.png
    :alt: 05 layer norm
    :class: sphx-glr-single-img

Out:

.. code-block:: none

    layer-norm:
              N      Triton       Torch        Apex
    0    1024.0  585.142849  277.694907  468.114273
    1    1536.0  630.153868  323.368435  511.999982
    2    2048.0  682.666643  334.367358  520.126988
    3    2560.0  694.237267  365.714281  518.481028
    4    3072.0  712.347810  378.092307  496.484863
    5    3584.0  725.873439  384.859062  455.111115
    6    4096.0  728.177767  381.023256  455.111095
    7    4608.0  670.254540  394.267384  421.302872
    8    5120.0  688.403381  397.669909  424.455959
    9    5632.0  704.000002  395.228063  411.470331
    10   6144.0  697.191505  402.885254  411.313806
    11   6656.0  700.631610  400.360920  398.861429
    12   7168.0  690.891575  396.844306  387.459443
    13   7680.0  678.895043  393.846167  386.415087
    14   8192.0  639.375598  393.609605  372.363633
    15   8704.0  624.502255  389.005597  380.502740
    16   9216.0  604.327881  407.337026  383.002605
    17   9728.0  585.142883  409.599987  383.369452
    18  10240.0  564.965524  408.578556  382.803739
    19  10752.0  546.133312  411.559798  381.445676
    20  11264.0  533.207081  406.826188  373.134567
    21  11776.0  520.486200  409.599991  377.587162
    22  12288.0  516.031509  413.911572  383.251457
    23  12800.0  504.433489  410.420828  376.470582
    24  13312.0  494.180982  405.699062  376.976995
    25  13824.0  482.934503  411.888257  379.389355
    26  14336.0  471.967074  406.695045  374.185964
    27  14848.0  461.297068  408.192434  375.304904
    28  15360.0  454.269882  406.214870  378.092307
    29  15872.0  447.098578  406.974373  376.225175
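The numbers above are effective memory bandwidths, not latencies: the
benchmark converts milliseconds into GB/s under the model "the forward pass
reads the input once and writes the output once". A quick sanity check of one
entry (a sketch; it simply inverts the ``gbps`` formula used in
``bench_layer_norm`` below, applied to the ``N = 4096`` Triton cell):

.. code-block:: default

    M, N, elem = 4096, 4096, 2                # benchmark shape, fp16 element size
    bytes_moved = 2 * M * N * elem            # read x once + write y once
    ms = bytes_moved * 1e-6 / 728.177767      # invert gbps = bytes / ms * 1e-6
    print(ms)                                 # ~0.0922 ms per forward call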
.. code-block:: default

    import torch

    import triton
    import triton.language as tl

    try:
        # This is https://github.com/NVIDIA/apex, NOT the apex on PyPi, so it
        # should not be added to extras_require in setup.py.
        import apex
        HAS_APEX = True
    except ModuleNotFoundError:
        HAS_APEX = False


    # Forward pass: one program normalizes one row of the input.
    @triton.jit
    def _layer_norm_fwd_fused(
        Out,
        A,
        Weight,
        Bias,
        Mean, Rstd,
        stride, N, eps,
        BLOCK_SIZE: tl.constexpr,
    ):
        # position of elements processed by this program
        row = tl.program_id(0)
        Out += row * stride
        A += row * stride
        # compute mean
        _mean = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
        for off in range(0, N, BLOCK_SIZE):
            cols = off + tl.arange(0, BLOCK_SIZE)
            a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
            _mean += a
        mean = tl.sum(_mean, axis=0) / N
        # compute variance
        _var = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
        for off in range(0, N, BLOCK_SIZE):
            cols = off + tl.arange(0, BLOCK_SIZE)
            a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
            a = tl.where(cols < N, a - mean, 0.)
            _var += a * a
        var = tl.sum(_var, axis=0) / N
        rstd = 1 / tl.sqrt(var + eps)
        # write-back mean/rstd
        tl.store(Mean + row, mean)
        tl.store(Rstd + row, rstd)
        # multiply by weight and add bias
        for off in range(0, N, BLOCK_SIZE):
            cols = off + tl.arange(0, BLOCK_SIZE)
            mask = cols < N
            weight = tl.load(Weight + cols, mask=mask)
            bias = tl.load(Bias + cols, mask=mask)
            a = tl.load(A + cols, mask=mask, other=0., eviction_policy="evict_first").to(tl.float32)
            a_hat = (a - mean) * rstd
            out = a_hat * weight + bias
            # write-back
            tl.store(Out + cols, out, mask=mask)
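For intuition, each program above computes, for its row, the same quantity as
this eager PyTorch reference (a sketch, not part of the tutorial; the helper
name is ours, and like the kernel it accumulates in fp32):

.. code-block:: default

    def _fwd_row_reference(a, weight, bias, eps):
        # mirrors _layer_norm_fwd_fused for a single 1D row `a`
        a = a.float()
        mean = a.mean()
        var = ((a - mean) ** 2).mean()        # biased variance, as in the kernel
        rstd = 1.0 / torch.sqrt(var + eps)
        return (a - mean) * rstd * weight.float() + bias.float()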
.. code-block:: default

    # Backward pass (DA + partial DW + partial DB)
    @triton.jit
    def _layer_norm_bwd_dx_fused(
        _DA,
        _DOut,
        _A,
        Weight,
        Mean, Rstd,
        stride, NumRows, NumCols, eps,
        BLOCK_SIZE_N: tl.constexpr,
    ):
        # position of elements processed by this program
        pid = tl.program_id(0)
        row = pid
        A = _A + row * stride
        DOut = _DOut + row * stride
        DA = _DA + row * stride
        mean = tl.load(Mean + row)
        rstd = tl.load(Rstd + row)
        # first pass: accumulate the two row statistics needed by the gradient
        _mean1 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
        _mean2 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
        for off in range(0, NumCols, BLOCK_SIZE_N):
            cols = off + tl.arange(0, BLOCK_SIZE_N)
            mask = cols < NumCols
            a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
            dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
            weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
            a_hat = (a - mean) * rstd
            wdout = weight * dout
            _mean1 += a_hat * wdout
            _mean2 += wdout
        mean1 = tl.sum(_mean1, axis=0) / NumCols
        mean2 = tl.sum(_mean2, axis=0) / NumCols
        # second pass: apply the gradient formula elementwise
        for off in range(0, NumCols, BLOCK_SIZE_N):
            cols = off + tl.arange(0, BLOCK_SIZE_N)
            mask = cols < NumCols
            a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
            dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
            weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
            a_hat = (a - mean) * rstd
            wdout = weight * dout
            da = (wdout - (a_hat * mean1 + mean2)) * rstd
            # write-back dx
            tl.store(DA + cols, da, mask=mask)
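The two running sums are the row statistics that appear in the analytic
gradient of layer normalization. Writing
:math:`g = w \odot \partial L/\partial y` and
:math:`\hat{a} = (a - \mu)/\hat{\sigma}` with
:math:`\hat{\sigma} = \sqrt{\sigma^2 + \epsilon}`, the kernel's final line
implements

.. math::

    \frac{\partial L}{\partial a}
    = \frac{1}{\hat{\sigma}}
      \Big( g - \hat{a}\,\overline{\hat{a} \odot g} - \overline{g} \Big),

where :math:`\overline{\,\cdot\,}` denotes the average over the :math:`N`
columns: ``mean1`` is :math:`\overline{\hat{a} \odot g}` and ``mean2`` is
:math:`\overline{g}`.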
.. code-block:: default

    # Backward pass (total DW + total DB)
    @triton.jit
    def _layer_norm_bwd_dwdb(
        A, DOut,
        Mean, Var,
        DW,
        DB,
        M, N,
        BLOCK_SIZE_M: tl.constexpr,
        BLOCK_SIZE_N: tl.constexpr,
    ):
        pid = tl.program_id(0)
        cols = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
        dw = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        db = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        UNROLL: tl.constexpr = 4
        for i in range(0, M, BLOCK_SIZE_M * UNROLL):
            for j in range(UNROLL):
                rows = i + j * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
                mask = (rows[:, None] < M) & (cols[None, :] < N)
                offs = rows[:, None] * N + cols[None, :]
                a = tl.load(A + offs, mask=mask, other=0.).to(tl.float32)
                dout = tl.load(DOut + offs, mask=mask, other=0.).to(tl.float32)
                mean = tl.load(Mean + rows, mask=rows < M, other=0.)
                # despite its name, Var holds the rstd values saved by the forward pass
                rstd = tl.load(Var + rows, mask=rows < M, other=0.)
                a_hat = (a - mean[:, None]) * rstd[:, None]
                dw += dout * a_hat
                db += dout
        sum_dw = tl.sum(dw, axis=0)
        sum_db = tl.sum(db, axis=0)
        tl.store(DW + cols, sum_dw, mask=cols < N)
        tl.store(DB + cols, sum_db, mask=cols < N)
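The weight and bias gradients are plain reductions over the rows, which is why
they get their own two-dimensional kernel:

.. math::

    \frac{\partial L}{\partial w}
    = \sum_{r=1}^{M} \frac{\partial L}{\partial y_r} \odot \hat{a}_r,
    \qquad
    \frac{\partial L}{\partial b}
    = \sum_{r=1}^{M} \frac{\partial L}{\partial y_r}.

Each program owns ``BLOCK_SIZE_N`` columns, accumulates partial sums over all
:math:`M` rows (with a 4x unrolled loop), and reduces its tile once at the end.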
.. code-block:: default

    class LayerNorm(torch.autograd.Function):

        @staticmethod
        def forward(ctx, a, normalized_shape, weight, bias, eps):
            # allocate output
            out = torch.empty_like(a)
            # reshape input data into 2D tensor
            a_arg = a.reshape(-1, a.shape[-1])
            M, N = a_arg.shape
            mean = torch.empty((M,), dtype=torch.float32, device="cuda")
            rstd = torch.empty((M,), dtype=torch.float32, device="cuda")
            # Less than 64KB per feature: enqueue fused kernel
            MAX_FUSED_SIZE = 65536 // a.element_size()
            BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
            BLOCK_SIZE = max(BLOCK_SIZE, 128)
            BLOCK_SIZE = min(BLOCK_SIZE, 4096)
            # heuristics for number of warps
            num_warps = min(max(BLOCK_SIZE // 256, 1), 8)
            _layer_norm_fwd_fused[(M,)](
                out,
                a_arg,
                weight,
                bias,
                mean, rstd,
                a_arg.stride(0), N, eps,
                BLOCK_SIZE=BLOCK_SIZE,
                num_warps=num_warps,
            )
            ctx.save_for_backward(
                a, weight, bias, mean, rstd,
            )
            ctx.BLOCK_SIZE = BLOCK_SIZE
            ctx.num_warps = num_warps
            ctx.eps = eps
            if hasattr(bias, "config"):
                assert bias.config.grad_scale_name == weight.config.grad_scale_name
                grad_scale_name = bias.config.grad_scale_name
            else:
                grad_scale_name = None
            ctx.grad_scale_gain_bias_name = grad_scale_name
            return out

        @staticmethod
        def backward(ctx, dout):
            assert dout.is_contiguous()
            a, weight, bias, mean, var = ctx.saved_tensors
            # heuristics for amount of parallel reduction stream for DG/DB
            N = weight.shape[0]
            # allocate output
            da = torch.empty_like(dout)
            # enqueue kernel using forward pass heuristics
            # also compute partial sums for DW and DB
            x_arg = a.reshape(-1, a.shape[-1])
            M, N = x_arg.shape
            dweight = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
            dbias = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
            _layer_norm_bwd_dx_fused[(M,)](
                da,
                dout,
                a,
                weight,
                mean, var,
                x_arg.stride(0), M, N,
                ctx.eps,
                BLOCK_SIZE_N=ctx.BLOCK_SIZE,
                num_warps=ctx.num_warps,
            )
            if N > 10240:
                BLOCK_SIZE_N = 128
                BLOCK_SIZE_M = 32
                num_warps = 4
            else:
                # maximize occupancy for small N
                BLOCK_SIZE_N = 16
                BLOCK_SIZE_M = 16
                num_warps = 8
            grid = lambda meta: [triton.cdiv(N, meta["BLOCK_SIZE_N"])]
            _layer_norm_bwd_dwdb[grid](
                a, dout,
                mean, var,
                dweight,
                dbias,
                M,
                N,
                BLOCK_SIZE_M=BLOCK_SIZE_M,
                BLOCK_SIZE_N=BLOCK_SIZE_N,
                num_warps=num_warps,
            )
            return (da, None, dweight, dbias, None)


    def layer_norm(a, normalized_shape, weight, bias, eps):
        return LayerNorm.apply(a, normalized_shape, weight, bias, eps)


    def test_layer_norm(M, N, dtype, eps=1e-5, device='cuda'):
        torch.manual_seed(0)
        # create data
        x_shape = (M, N)
        w_shape = (x_shape[-1], )
        weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
        bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
        x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
        dy = .1 * torch.randn_like(x)
        x.requires_grad_(True)
        # forward pass
        y_tri = layer_norm(x, w_shape, weight, bias, eps)
        y_ref = torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps).to(dtype)
        # backward pass (triton)
        y_tri.backward(dy, retain_graph=True)
        dx_tri, dw_tri, db_tri = [_.grad.clone() for _ in [x, weight, bias]]
        x.grad, weight.grad, bias.grad = None, None, None
        # backward pass (torch)
        y_ref.backward(dy, retain_graph=True)
        dx_ref, dw_ref, db_ref = [_.grad.clone() for _ in [x, weight, bias]]
        # compare
        triton.testing.assert_almost_equal(y_tri, y_ref)
        triton.testing.assert_almost_equal(dx_tri, dx_ref)
        triton.testing.assert_almost_equal(db_tri, db_ref, decimal=1)
        triton.testing.assert_almost_equal(dw_tri, dw_ref, decimal=1)


    @triton.testing.perf_report(
        triton.testing.Benchmark(
            x_names=['N'],
            x_vals=[512 * i for i in range(2, 32)],
            line_arg='provider',
            line_vals=['triton', 'torch'] + (['apex'] if HAS_APEX else []),
            line_names=['Triton', 'Torch'] + (['Apex'] if HAS_APEX else []),
            styles=[('blue', '-'), ('green', '-'), ('orange', '-')],
            ylabel='GB/s',
            plot_name='layer-norm',
            args={'M': 4096, 'dtype': torch.float16, 'mode': 'forward'}
        )
    )
    def bench_layer_norm(M, N, dtype, provider, mode, eps=1e-5, device='cuda'):
        # create data
        x_shape = (M, N)
        w_shape = (x_shape[-1], )
        weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
        bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
        x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
        dy = .1 * torch.randn_like(x)
        x.requires_grad_(True)
        # utility functions
        if provider == 'triton':
            y_fwd = lambda: layer_norm(x, w_shape, weight, bias, eps)
        if provider == 'torch':
            y_fwd = lambda: torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps)
        if provider == 'apex':
            apex_layer_norm = apex.normalization.FusedLayerNorm(w_shape).to(x.device).to(x.dtype)
            y_fwd = lambda: apex_layer_norm(x)
        # forward pass
        if mode == 'forward':
            gbps = lambda ms: 2 * x.numel() * x.element_size() / ms * 1e-6
            ms, min_ms, max_ms = triton.testing.do_bench(y_fwd, rep=500)
        # backward pass
        if mode == 'backward':
            gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
            y = y_fwd()
            ms, min_ms, max_ms = triton.testing.do_bench(lambda: y.backward(dy, retain_graph=True),
                                                         grad_to_none=[x], rep=500)
        return gbps(ms), gbps(max_ms), gbps(min_ms)


    # test_layer_norm(1151, 8192, torch.float16)
    bench_layer_norm.run(save_path='.', print_data=True)
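A minimal smoke test of the ``layer_norm`` wrapper defined above (a sketch;
the shape and values are illustrative, and a CUDA device is assumed):

.. code-block:: default

    x = torch.randn(16, 1024, dtype=torch.float16, device='cuda', requires_grad=True)
    w = torch.ones(1024, dtype=torch.float16, device='cuda', requires_grad=True)
    b = torch.zeros(1024, dtype=torch.float16, device='cuda', requires_grad=True)
    y = layer_norm(x, (1024,), w, b, 1e-5)   # fused forward kernel
    y.backward(torch.ones_like(y))           # fused backward kernels
    print(y.shape, x.grad.shape, w.grad.shape)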
.. rst-class:: sphx-glr-timing

    **Total running time of the script:** ( 5 minutes 37.477 seconds)

.. _sphx_glr_download_getting-started_tutorials_05-layer-norm.py:

.. only:: html

    .. container:: sphx-glr-footer class sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 05-layer-norm.py <05-layer-norm.py>`

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 05-layer-norm.ipynb <05-layer-norm.ipynb>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_