Files
triton/master/.doctrees/getting-started/tutorials/05-layer-norm.doctree

703 lines
34 KiB
Plaintext
Raw Normal View History

2022-06-05 21:05:02 +00:00
<EFBFBD><05>u<EFBFBD><00>sphinx.addnodes<65><73>document<6E><74><EFBFBD>)<29><>}<7D>(<28> rawsource<63><65><00><>children<65>]<5D>(<28>docutils.nodes<65><73>comment<6E><74><EFBFBD>)<29><>}<7D>(h<05> DO NOT EDIT.<2E>h]<5D>h <09>Text<78><74><EFBFBD><EFBFBD> DO NOT EDIT.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hh<06>parent<6E>h uba<62>
attributes<EFBFBD>}<7D>(<28>ids<64>]<5D><>classes<65>]<5D><>names<65>]<5D><>dupnames<65>]<5D><>backrefs<66>]<5D><> xml:space<63><65>preserve<76>u<EFBFBD>tagname<6D>h
2022-06-27 00:48:22 +00:00
hhhh<03>source<63><65>j/tmp/tmp8ybf2ejw/87413bc92522f14da4860adb506a8bc96c5e3a89/docs/getting-started/tutorials/05-layer-norm.rst<73><74>line<6E>Kubh )<29><>}<7D>(h<05>8THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.<2E>h]<5D>h<11>8THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhh)ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$uh%h
2022-06-05 21:05:02 +00:00
hhhhh&h'h(Kubh )<29><>}<7D>(h<05>-TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:<3A>h]<5D>h<11>-TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhh7ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$uh%h
hhhhh&h'h(Kubh )<29><>}<7D>(h<05>,"getting-started/tutorials/05-layer-norm.py"<22>h]<5D>h<11>,"getting-started/tutorials/05-layer-norm.py"<22><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhhEubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$uh%h
hhhhh&h'h(Kubh )<29><>}<7D>(h<05>LINE NUMBERS ARE GIVEN BELOW.<2E>h]<5D>h<11>LINE NUMBERS ARE GIVEN BELOW.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhhSubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$uh%h
hhhhh&h'h(Kubh<00>only<6C><79><EFBFBD>)<29><>}<7D>(hhh]<5D>h <09>note<74><65><EFBFBD>)<29><>}<7D>(h<05>rClick :ref:`here <sphx_glr_download_getting-started_tutorials_05-layer-norm.py>`
to download the full example code<64>h]<5D>h <09> paragraph<70><68><EFBFBD>)<29><>}<7D>(h<05>rClick :ref:`here <sphx_glr_download_getting-started_tutorials_05-layer-norm.py>`
to download the full example code<64>h]<5D>(h<11>Click <20><><EFBFBD><EFBFBD><EFBFBD>}<7D>(h<05>Click <20>hhnubh<00> pending_xref<65><66><EFBFBD>)<29><>}<7D>(h<05>J:ref:`here <sphx_glr_download_getting-started_tutorials_05-layer-norm.py>`<60>h]<5D>h <09>inline<6E><65><EFBFBD>)<29><>}<7D>(hh{h]<5D>h<11>here<72><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhhubah}<7D>(h]<5D>h]<5D>(<28>xref<65><66>std<74><64>std-ref<65>eh]<5D>h]<5D>h!]<5D>uh%h}hhyubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>refdoc<6F><63>'getting-started/tutorials/05-layer-norm<72><6D> refdomain<69>h<EFBFBD><68>reftype<70><65>ref<65><66> refexplicit<69><74><EFBFBD>refwarn<72><6E><EFBFBD> reftarget<65><74><sphx_glr_download_getting-started_tutorials_05-layer-norm.py<70>uh%hwh&h'h(K hhnubh<11>"
to download the full example code<64><65><EFBFBD><EFBFBD><EFBFBD>}<7D>(h<05>"
to download the full example code<64>hhnubeh}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%hlh&h'h(K hhhubah}<7D>(h]<5D>h]<5D><>sphx-glr-download-link-note<74>ah]<5D>h]<5D>h!]<5D>uh%hfhhchhh&h'h(Nubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>expr<70><72>html<6D>uh%hahhh&h'h(Khhubh <09>target<65><74><EFBFBD>)<29><>}<7D>(h<05>8.. _sphx_glr_getting-started_tutorials_05-layer-norm.py:<3A>h]<5D>h}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>refid<69><64>3sphx-glr-getting-started-tutorials-05-layer-norm-py<70>uh%h<>h(Khhhhh&h'ubh <09>section<6F><6E><EFBFBD>)<29><>}<7D>(hhh]<5D>(h <09>title<6C><65><EFBFBD>)<29><>}<7D>(h<05>Layer Normalization<6F>h]<5D>h<11>Layer Normalization<6F><6E><EFBFBD><EFBFBD><EFBFBD>}<7D>(hh<>hh<>hhh&Nh(Nubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%h<>hh<>hhh&h'h(Kubh )<29><>}<7D>(h<05>(GENERATED FROM PYTHON SOURCE LINES 5-312<31>h]<5D>h<11>(GENERATED FROM PYTHON SOURCE LINES 5-312<31><32><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhh<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$uh%h
hh<>hhh&h'h(Kubh <09>image<67><65><EFBFBD>)<29><>}<7D>(h<05><>.. image:: /getting-started/tutorials/images/sphx_glr_05-layer-norm_001.png
:alt: 05 layer norm
:class: sphx-glr-single-img
<EFBFBD>h]<5D>h}<7D>(h]<5D>h]<5D><>sphx-glr-single-img<6D>ah]<5D>h]<5D>h!]<5D><>alt<6C><74> 05 layer norm<72><6D>uri<72><69>?getting-started/tutorials/images/sphx_glr_05-layer-norm_001.png<6E><67>
candidates<EFBFBD>}<7D><>*<2A>h<EFBFBD>suh%h<>hh<>hhh&h'h(Nubhm)<29><>}<7D>(h<05>Out:<3A>h]<5D>h<11>Out:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hh<>hh<>hhh&Nh(Nubah}<7D>(h]<5D>h]<5D><>sphx-glr-script-out<75>ah]<5D>h]<5D>h!]<5D>uh%hlh&h'h(K#hh<>hhubh <09> literal_block<63><6B><EFBFBD>)<29><>}<7D>(hX<>layer-norm:
N Triton Torch Apex
2022-06-26 00:50:12 +00:00
0 1024.0 585.142849 277.694907 468.114273
2022-06-24 00:46:49 +00:00
1 1536.0 630.153868 323.368435 511.999982
2 2048.0 682.666643 334.367358 520.126988
2022-06-27 00:48:22 +00:00
3 2560.0 694.237267 365.714281 518.481028
2022-06-26 00:50:12 +00:00
4 3072.0 712.347810 378.092307 501.551037
2022-06-27 00:48:22 +00:00
5 3584.0 725.873439 384.859062 458.751978
2022-06-26 00:50:12 +00:00
6 4096.0 728.177767 381.023256 458.293714
2022-06-27 00:48:22 +00:00
7 4608.0 670.254540 396.387087 426.173427
8 5120.0 694.237267 397.669909 426.666652
9 5632.0 704.000002 396.969169 413.357796
2022-06-24 00:46:49 +00:00
10 6144.0 702.171410 402.885254 411.313806
2022-06-27 00:48:22 +00:00
11 6656.0 700.631610 400.360920 400.360920
12 7168.0 695.078767 396.844306 388.772874
2022-06-26 00:50:12 +00:00
13 7680.0 682.666656 393.846167 387.634072
2022-06-27 00:48:22 +00:00
14 8192.0 642.509816 393.609605 372.363633
15 8704.0 627.315309 389.005597 380.502740
16 9216.0 606.814809 407.337026 383.999986
2022-06-25 00:46:57 +00:00
17 9728.0 589.575753 409.599987 383.369452
2022-06-24 00:46:49 +00:00
18 10240.0 566.920437 408.578556 382.803739
2022-06-25 00:46:57 +00:00
19 10752.0 549.623009 411.559798 381.445676
2022-06-24 00:46:49 +00:00
20 11264.0 536.380957 406.826188 373.134567
2022-06-27 00:48:22 +00:00
21 11776.0 523.377770 410.492372 377.587162
22 12288.0 517.389457 414.784810 383.251457
23 12800.0 505.679014 410.420828 376.470582
2022-06-26 00:50:12 +00:00
24 13312.0 494.180982 405.699062 376.976995
2022-06-24 00:46:49 +00:00
25 13824.0 482.934503 411.888257 379.389355
26 14336.0 471.967074 406.695045 374.185964
27 14848.0 461.297068 408.192434 375.304904
2022-06-23 00:46:26 +00:00
28 15360.0 454.269882 406.214870 378.092307
2022-06-26 00:50:12 +00:00
29 15872.0 447.887117 407.627589 376.225175<EFBFBD>h]<5D>hX<>layer-norm:
2022-06-05 21:05:02 +00:00
N Triton Torch Apex
2022-06-26 00:50:12 +00:00
0 1024.0 585.142849 277.694907 468.114273
2022-06-24 00:46:49 +00:00
1 1536.0 630.153868 323.368435 511.999982
2 2048.0 682.666643 334.367358 520.126988
2022-06-27 00:48:22 +00:00
3 2560.0 694.237267 365.714281 518.481028
2022-06-26 00:50:12 +00:00
4 3072.0 712.347810 378.092307 501.551037
2022-06-27 00:48:22 +00:00
5 3584.0 725.873439 384.859062 458.751978
2022-06-26 00:50:12 +00:00
6 4096.0 728.177767 381.023256 458.293714
2022-06-27 00:48:22 +00:00
7 4608.0 670.254540 396.387087 426.173427
8 5120.0 694.237267 397.669909 426.666652
9 5632.0 704.000002 396.969169 413.357796
2022-06-24 00:46:49 +00:00
10 6144.0 702.171410 402.885254 411.313806
2022-06-27 00:48:22 +00:00
11 6656.0 700.631610 400.360920 400.360920
12 7168.0 695.078767 396.844306 388.772874
2022-06-26 00:50:12 +00:00
13 7680.0 682.666656 393.846167 387.634072
2022-06-27 00:48:22 +00:00
14 8192.0 642.509816 393.609605 372.363633
15 8704.0 627.315309 389.005597 380.502740
16 9216.0 606.814809 407.337026 383.999986
2022-06-25 00:46:57 +00:00
17 9728.0 589.575753 409.599987 383.369452
2022-06-24 00:46:49 +00:00
18 10240.0 566.920437 408.578556 382.803739
2022-06-25 00:46:57 +00:00
19 10752.0 549.623009 411.559798 381.445676
2022-06-24 00:46:49 +00:00
20 11264.0 536.380957 406.826188 373.134567
2022-06-27 00:48:22 +00:00
21 11776.0 523.377770 410.492372 377.587162
22 12288.0 517.389457 414.784810 383.251457
23 12800.0 505.679014 410.420828 376.470582
2022-06-26 00:50:12 +00:00
24 13312.0 494.180982 405.699062 376.976995
2022-06-24 00:46:49 +00:00
25 13824.0 482.934503 411.888257 379.389355
26 14336.0 471.967074 406.695045 374.185964
27 14848.0 461.297068 408.192434 375.304904
2022-06-23 00:46:26 +00:00
28 15360.0 454.269882 406.214870 378.092307
2022-06-26 00:50:12 +00:00
29 15872.0 447.887117 407.627589 376.225175<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhjubah}<7D>(h]<5D>h]<5D>jah]<5D>h]<5D>h!]<5D>h#h$<24>force<63><65><EFBFBD>language<67><65>none<6E><65>highlight_args<67>}<7D>uh%j h&h'h(K%hh<>hhubh <09>
2022-06-05 21:05:02 +00:00
line_block<EFBFBD><EFBFBD><EFBFBD>)<29><>}<7D>(hhh]<5D>h h(<28><>)<29><>}<7D>(hhh]<5D>h}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%h(hj#hhh&h'h(K<00>indent<6E>Kubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%j!hh<>hhh&h'h(KMubj )<29><>}<7D>(hX +import torch
import triton
import triton.language as tl
try:
# This is https://github.com/NVIDIA/apex, NOT the apex on PyPi, so it
# should not be added to extras_require in setup.py.
import apex
HAS_APEX = True
except ModuleNotFoundError:
HAS_APEX = False
@triton.jit
def _layer_norm_fwd_fused(
Out,
A,
Weight,
Bias,
Mean, Rstd,
stride, N, eps,
BLOCK_SIZE: tl.constexpr,
):
# position of elements processed by this program
row = tl.program_id(0)
Out += row * stride
A += row * stride
# compute mean
mean = 0
_mean = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
_mean += a
mean = tl.sum(_mean, axis=0) / N
# compute variance
_var = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
a = tl.where(cols < N, a - mean, 0.)
_var += a * a
var = tl.sum(_var, axis=0) / N
rstd = 1 / tl.sqrt(var + eps)
# write-back mean/rstd
tl.store(Mean + row, mean)
tl.store(Rstd + row, rstd)
# multiply by weight and add bias
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
mask = cols < N
weight = tl.load(Weight + cols, mask=mask)
bias = tl.load(Bias + cols, mask=mask)
a = tl.load(A + cols, mask=mask, other=0., eviction_policy="evict_first").to(tl.float32)
a_hat = (a - mean) * rstd
out = a_hat * weight + bias
# # write-back
tl.store(Out + cols, out, mask=mask)
# Backward pass (DA + partial DW + partial DB)
@triton.jit
def _layer_norm_bwd_dx_fused(
_DA,
_DOut,
_A,
Weight,
Mean, Rstd,
stride, NumRows, NumCols, eps,
BLOCK_SIZE_N: tl.constexpr,
):
# position of elements processed by this program
pid = tl.program_id(0)
row = pid
A = _A + row * stride
DOut = _DOut + row * stride
DA = _DA + row * stride
mean = tl.load(Mean + row)
rstd = tl.load(Rstd + row)
# load data to SRAM
_mean1 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
_mean2 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
for off in range(0, NumCols, BLOCK_SIZE_N):
cols = off + tl.arange(0, BLOCK_SIZE_N)
mask = cols < NumCols
a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
a_hat = (a - mean) * rstd
wdout = weight * dout
_mean1 += a_hat * wdout
_mean2 += wdout
mean1 = tl.sum(_mean1, axis=0) / NumCols
mean2 = 0.
mean2 = tl.sum(_mean2, axis=0) / NumCols
for off in range(0, NumCols, BLOCK_SIZE_N):
cols = off + tl.arange(0, BLOCK_SIZE_N)
mask = cols < NumCols
a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
a_hat = (a - mean) * rstd
wdout = weight * dout
da = (wdout - (a_hat * mean1 + mean2)) * rstd
# write-back dx
tl.store(DA + cols, da, mask=mask)
# Backward pass (total DW + total DB)
@triton.jit
def _layer_norm_bwd_dwdb(
A, DOut,
Mean, Var,
DW,
DB,
M, N,
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr,
):
pid = tl.program_id(0)
cols = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
dw = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
db = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for i in range(0, M, BLOCK_SIZE_M):
rows = i + tl.arange(0, BLOCK_SIZE_M)
mask = (rows[:, None] < M) & (cols[None, :] < N)
offs = rows[:, None] * N + cols[None, :]
a = tl.load(A + offs, mask=mask, other=0.).to(tl.float32)
dout = tl.load(DOut + offs, mask=mask, other=0.).to(tl.float32)
mean = tl.load(Mean + rows, mask=rows < M, other=0.)
rstd = tl.load(Var + rows, mask=rows < M, other=0.)
a_hat = (a - mean[:, None]) * rstd[:, None]
dw += dout * a_hat
db += dout
sum_dw = tl.sum(dw, axis=0)
sum_db = tl.sum(db, axis=0)
tl.store(DW + cols, sum_dw, mask=cols < N)
tl.store(DB + cols, sum_db, mask=cols < N)
class LayerNorm(torch.autograd.Function):
@staticmethod
def forward(ctx, a, normalized_shape, weight, bias, eps):
# allocate output
out = torch.empty_like(a)
# reshape input data into 2D tensor
a_arg = a.reshape(-1, a.shape[-1])
M, N = a_arg.shape
mean = torch.empty((M,), dtype=torch.float32, device="cuda")
rstd = torch.empty((M,), dtype=torch.float32, device="cuda")
# Less than 64KB per feature: enqueue fused kernel
MAX_FUSED_SIZE = 65536 // a.element_size()
BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
BLOCK_SIZE = max(BLOCK_SIZE, 128)
BLOCK_SIZE = min(BLOCK_SIZE, 4096)
# heuristics for number of warps
num_warps = min(max(BLOCK_SIZE // 256, 1), 8)
_layer_norm_fwd_fused[(M,)](
out,
a_arg,
weight,
bias,
mean, rstd,
a_arg.stride(0), N, eps,
BLOCK_SIZE=BLOCK_SIZE,
num_warps=num_warps,
)
ctx.save_for_backward(
a, weight, bias, mean, rstd,
)
ctx.BLOCK_SIZE = BLOCK_SIZE
ctx.num_warps = num_warps
ctx.eps = eps
if hasattr(bias, "config"):
assert bias.config.grad_scale_name == weight.config.grad_scale_name
grad_scale_name = bias.config.grad_scale_name
else:
grad_scale_name = None
ctx.grad_scale_gain_bias_name = grad_scale_name
return out
@staticmethod
def backward(ctx, dout):
assert dout.is_contiguous()
a, weight, bias, mean, var = ctx.saved_tensors
# heuristics for amount of parallel reduction stream for DG/DB
N = weight.shape[0]
# allocate output
da = torch.empty_like(dout)
# enqueue kernel using forward pass heuristics
# also compute partial sums for DW and DB
x_arg = a.reshape(-1, a.shape[-1])
M, N = x_arg.shape
dweight = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
dbias = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
_layer_norm_bwd_dx_fused[(M,)](
da,
dout,
a,
weight,
mean, var,
x_arg.stride(0), M, N,
ctx.eps,
BLOCK_SIZE_N=ctx.BLOCK_SIZE,
num_warps=ctx.num_warps,
)
# accumulate partial sums in separate kernel
grid = lambda meta: [triton.cdiv(N, meta["BLOCK_SIZE_N"])]
_layer_norm_bwd_dwdb[grid](
a, dout,
mean, var,
dweight,
dbias,
M,
N,
BLOCK_SIZE_M=32,
BLOCK_SIZE_N=128,
)
return (da, None, dweight, dbias, None, None,
None, None, None, None,
None,
None, None, None,
None,
None, None, None,
None, None, None,
None, None, None)
def layer_norm(a, normalized_shape, weight, bias, eps):
return LayerNorm.apply(a, normalized_shape, weight, bias, eps)
def test_layer_norm(M, N, dtype, eps=1e-5, device='cuda'):
torch.manual_seed(0)
# create data
x_shape = (M, N)
w_shape = (x_shape[-1], )
weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
dy = .1 * torch.randn_like(x)
x.requires_grad_(True)
# forward pass
y_tri = layer_norm(x, w_shape, weight, bias, eps)
y_ref = torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps).to(dtype)
# backward pass (triton)
y_tri.backward(dy, retain_graph=True)
dx_tri, dw_tri, db_tri = [_.grad.clone() for _ in [x, weight, bias]]
x.grad, weight.grad, bias.grad = None, None, None
# backward pass (torch)
y_ref.backward(dy, retain_graph=True)
dx_ref, dw_ref, db_ref = [_.grad.clone() for _ in [x, weight, bias]]
# compare
triton.testing.assert_almost_equal(y_tri, y_ref)
triton.testing.assert_almost_equal(dx_tri, dx_ref)
triton.testing.assert_almost_equal(db_tri, db_ref, decimal=1)
triton.testing.assert_almost_equal(dw_tri, dw_ref, decimal=1)
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['N'],
x_vals=[512 * i for i in range(2, 32)],
line_arg='provider',
line_vals=['triton', 'torch'] + (['apex'] if HAS_APEX else []),
line_names=['Triton', 'Torch'] + (['Apex'] if HAS_APEX else []),
styles=[('blue', '-'), ('green', '-'), ('orange', '-')],
ylabel='GB/s',
plot_name='layer-norm',
args={'M': 4096, 'dtype': torch.float16, 'mode': 'forward'}
)
)
def bench_layer_norm(M, N, dtype, provider, mode, eps=1e-5, device='cuda'):
# create data
x_shape = (M, N)
w_shape = (x_shape[-1], )
weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
dy = .1 * torch.randn_like(x)
x.requires_grad_(True)
# utility functions
if provider == 'triton':
y_fwd = lambda: layer_norm(x, w_shape, weight, bias, eps)
if provider == 'torch':
y_fwd = lambda: torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps)
if provider == 'apex':
apex_layer_norm = apex.normalization.FusedLayerNorm(w_shape).to(x.device).to(x.dtype)
y_fwd = lambda: apex_layer_norm(x)
# forward pass
if mode == 'forward':
gbps = lambda ms: 2 * x.numel() * x.element_size() / ms * 1e-6
ms, min_ms, max_ms = triton.testing.do_bench(y_fwd, rep=500)
# backward pass
if mode == 'backward':
gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
y = y_fwd()
ms, min_ms, max_ms = triton.testing.do_bench(lambda: y.backward(dy, retain_graph=True),
grad_to_none=[x], rep=500)
return gbps(ms), gbps(max_ms), gbps(min_ms)
# test_layer_norm(1151, 8192, torch.float16)
bench_layer_norm.run(save_path='.', print_data=True)<29>h]<5D>hX +import torch
import triton
import triton.language as tl
try:
# This is https://github.com/NVIDIA/apex, NOT the apex on PyPi, so it
# should not be added to extras_require in setup.py.
import apex
HAS_APEX = True
except ModuleNotFoundError:
HAS_APEX = False
@triton.jit
def _layer_norm_fwd_fused(
Out,
A,
Weight,
Bias,
Mean, Rstd,
stride, N, eps,
BLOCK_SIZE: tl.constexpr,
):
# position of elements processed by this program
row = tl.program_id(0)
Out += row * stride
A += row * stride
# compute mean
mean = 0
_mean = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
_mean += a
mean = tl.sum(_mean, axis=0) / N
# compute variance
_var = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
a = tl.load(A + cols, mask=cols < N, other=0., eviction_policy="evict_last").to(tl.float32)
a = tl.where(cols < N, a - mean, 0.)
_var += a * a
var = tl.sum(_var, axis=0) / N
rstd = 1 / tl.sqrt(var + eps)
# write-back mean/rstd
tl.store(Mean + row, mean)
tl.store(Rstd + row, rstd)
# multiply by weight and add bias
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
mask = cols < N
weight = tl.load(Weight + cols, mask=mask)
bias = tl.load(Bias + cols, mask=mask)
a = tl.load(A + cols, mask=mask, other=0., eviction_policy="evict_first").to(tl.float32)
a_hat = (a - mean) * rstd
out = a_hat * weight + bias
# # write-back
tl.store(Out + cols, out, mask=mask)
# Backward pass (DA + partial DW + partial DB)
@triton.jit
def _layer_norm_bwd_dx_fused(
_DA,
_DOut,
_A,
Weight,
Mean, Rstd,
stride, NumRows, NumCols, eps,
BLOCK_SIZE_N: tl.constexpr,
):
# position of elements processed by this program
pid = tl.program_id(0)
row = pid
A = _A + row * stride
DOut = _DOut + row * stride
DA = _DA + row * stride
mean = tl.load(Mean + row)
rstd = tl.load(Rstd + row)
# load data to SRAM
_mean1 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
_mean2 = tl.zeros([BLOCK_SIZE_N], dtype=tl.float32)
for off in range(0, NumCols, BLOCK_SIZE_N):
cols = off + tl.arange(0, BLOCK_SIZE_N)
mask = cols < NumCols
a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
a_hat = (a - mean) * rstd
wdout = weight * dout
_mean1 += a_hat * wdout
_mean2 += wdout
mean1 = tl.sum(_mean1, axis=0) / NumCols
mean2 = 0.
mean2 = tl.sum(_mean2, axis=0) / NumCols
for off in range(0, NumCols, BLOCK_SIZE_N):
cols = off + tl.arange(0, BLOCK_SIZE_N)
mask = cols < NumCols
a = tl.load(A + cols, mask=mask, other=0).to(tl.float32)
dout = tl.load(DOut + cols, mask=mask, other=0).to(tl.float32)
weight = tl.load(Weight + cols, mask=mask, other=0).to(tl.float32)
a_hat = (a - mean) * rstd
wdout = weight * dout
da = (wdout - (a_hat * mean1 + mean2)) * rstd
# write-back dx
tl.store(DA + cols, da, mask=mask)
# Backward pass (total DW + total DB)
@triton.jit
def _layer_norm_bwd_dwdb(
A, DOut,
Mean, Var,
DW,
DB,
M, N,
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr,
):
pid = tl.program_id(0)
cols = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
dw = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
db = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for i in range(0, M, BLOCK_SIZE_M):
rows = i + tl.arange(0, BLOCK_SIZE_M)
mask = (rows[:, None] < M) & (cols[None, :] < N)
offs = rows[:, None] * N + cols[None, :]
a = tl.load(A + offs, mask=mask, other=0.).to(tl.float32)
dout = tl.load(DOut + offs, mask=mask, other=0.).to(tl.float32)
mean = tl.load(Mean + rows, mask=rows < M, other=0.)
rstd = tl.load(Var + rows, mask=rows < M, other=0.)
a_hat = (a - mean[:, None]) * rstd[:, None]
dw += dout * a_hat
db += dout
sum_dw = tl.sum(dw, axis=0)
sum_db = tl.sum(db, axis=0)
tl.store(DW + cols, sum_dw, mask=cols < N)
tl.store(DB + cols, sum_db, mask=cols < N)
class LayerNorm(torch.autograd.Function):
@staticmethod
def forward(ctx, a, normalized_shape, weight, bias, eps):
# allocate output
out = torch.empty_like(a)
# reshape input data into 2D tensor
a_arg = a.reshape(-1, a.shape[-1])
M, N = a_arg.shape
mean = torch.empty((M,), dtype=torch.float32, device="cuda")
rstd = torch.empty((M,), dtype=torch.float32, device="cuda")
# Less than 64KB per feature: enqueue fused kernel
MAX_FUSED_SIZE = 65536 // a.element_size()
BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
BLOCK_SIZE = max(BLOCK_SIZE, 128)
BLOCK_SIZE = min(BLOCK_SIZE, 4096)
# heuristics for number of warps
num_warps = min(max(BLOCK_SIZE // 256, 1), 8)
_layer_norm_fwd_fused[(M,)](
out,
a_arg,
weight,
bias,
mean, rstd,
a_arg.stride(0), N, eps,
BLOCK_SIZE=BLOCK_SIZE,
num_warps=num_warps,
)
ctx.save_for_backward(
a, weight, bias, mean, rstd,
)
ctx.BLOCK_SIZE = BLOCK_SIZE
ctx.num_warps = num_warps
ctx.eps = eps
if hasattr(bias, "config"):
assert bias.config.grad_scale_name == weight.config.grad_scale_name
grad_scale_name = bias.config.grad_scale_name
else:
grad_scale_name = None
ctx.grad_scale_gain_bias_name = grad_scale_name
return out
@staticmethod
def backward(ctx, dout):
assert dout.is_contiguous()
a, weight, bias, mean, var = ctx.saved_tensors
# heuristics for amount of parallel reduction stream for DG/DB
N = weight.shape[0]
# allocate output
da = torch.empty_like(dout)
# enqueue kernel using forward pass heuristics
# also compute partial sums for DW and DB
x_arg = a.reshape(-1, a.shape[-1])
M, N = x_arg.shape
dweight = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
dbias = torch.empty((weight.shape[0],), dtype=weight.dtype, device=weight.device)
_layer_norm_bwd_dx_fused[(M,)](
da,
dout,
a,
weight,
mean, var,
x_arg.stride(0), M, N,
ctx.eps,
BLOCK_SIZE_N=ctx.BLOCK_SIZE,
num_warps=ctx.num_warps,
)
# accumulate partial sums in separate kernel
grid = lambda meta: [triton.cdiv(N, meta["BLOCK_SIZE_N"])]
_layer_norm_bwd_dwdb[grid](
a, dout,
mean, var,
dweight,
dbias,
M,
N,
BLOCK_SIZE_M=32,
BLOCK_SIZE_N=128,
)
return (da, None, dweight, dbias, None, None,
None, None, None, None,
None,
None, None, None,
None,
None, None, None,
None, None, None,
None, None, None)
def layer_norm(a, normalized_shape, weight, bias, eps):
return LayerNorm.apply(a, normalized_shape, weight, bias, eps)
def test_layer_norm(M, N, dtype, eps=1e-5, device='cuda'):
torch.manual_seed(0)
# create data
x_shape = (M, N)
w_shape = (x_shape[-1], )
weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
dy = .1 * torch.randn_like(x)
x.requires_grad_(True)
# forward pass
y_tri = layer_norm(x, w_shape, weight, bias, eps)
y_ref = torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps).to(dtype)
# backward pass (triton)
y_tri.backward(dy, retain_graph=True)
dx_tri, dw_tri, db_tri = [_.grad.clone() for _ in [x, weight, bias]]
x.grad, weight.grad, bias.grad = None, None, None
# backward pass (torch)
y_ref.backward(dy, retain_graph=True)
dx_ref, dw_ref, db_ref = [_.grad.clone() for _ in [x, weight, bias]]
# compare
triton.testing.assert_almost_equal(y_tri, y_ref)
triton.testing.assert_almost_equal(dx_tri, dx_ref)
triton.testing.assert_almost_equal(db_tri, db_ref, decimal=1)
triton.testing.assert_almost_equal(dw_tri, dw_ref, decimal=1)
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['N'],
x_vals=[512 * i for i in range(2, 32)],
line_arg='provider',
line_vals=['triton', 'torch'] + (['apex'] if HAS_APEX else []),
line_names=['Triton', 'Torch'] + (['Apex'] if HAS_APEX else []),
styles=[('blue', '-'), ('green', '-'), ('orange', '-')],
ylabel='GB/s',
plot_name='layer-norm',
args={'M': 4096, 'dtype': torch.float16, 'mode': 'forward'}
)
)
def bench_layer_norm(M, N, dtype, provider, mode, eps=1e-5, device='cuda'):
# create data
x_shape = (M, N)
w_shape = (x_shape[-1], )
weight = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
bias = torch.rand(w_shape, dtype=dtype, device='cuda', requires_grad=True)
x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device='cuda')
dy = .1 * torch.randn_like(x)
x.requires_grad_(True)
# utility functions
if provider == 'triton':
y_fwd = lambda: layer_norm(x, w_shape, weight, bias, eps)
if provider == 'torch':
y_fwd = lambda: torch.nn.functional.layer_norm(x, w_shape, weight, bias, eps)
if provider == 'apex':
apex_layer_norm = apex.normalization.FusedLayerNorm(w_shape).to(x.device).to(x.dtype)
y_fwd = lambda: apex_layer_norm(x)
# forward pass
if mode == 'forward':
gbps = lambda ms: 2 * x.numel() * x.element_size() / ms * 1e-6
ms, min_ms, max_ms = triton.testing.do_bench(y_fwd, rep=500)
# backward pass
if mode == 'backward':
gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
y = y_fwd()
ms, min_ms, max_ms = triton.testing.do_bench(lambda: y.backward(dy, retain_graph=True),
grad_to_none=[x], rep=500)
return gbps(ms), gbps(max_ms), gbps(min_ms)
# test_layer_norm(1151, 8192, torch.float16)
2022-06-27 00:48:22 +00:00
bench_layer_norm.run(save_path='.', print_data=True)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhj7ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h#h$j<00>j<00>default<6C>j}<7D>uh%j h&h'h(KOhh<>hhubhm)<29><>}<7D>(h<05>B**Total running time of the script:** ( 5 minutes 24.641 seconds)<29>h]<5D>(h <09>strong<6E><67><EFBFBD>)<29><>}<7D>(h<05>%**Total running time of the script:**<2A>h]<5D>h<11>!Total running time of the script:<3A><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhjMubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%jKhjGubh<11> ( 5 minutes 24.641 seconds)<29><><EFBFBD><EFBFBD><EFBFBD>}<7D>(h<05> ( 5 minutes 24.641 seconds)<29>hjGhhh&Nh(Nubeh}<7D>(h]<5D>h]<5D><>sphx-glr-timing<6E>ah]<5D>h]<5D>h!]<5D>uh%hlh&h'h(M<>hh<>hhubh<62>)<29><>}<7D>(h<05>A.. _sphx_glr_download_getting-started_tutorials_05-layer-norm.py:<3A>h]<5D>h}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>hČ<sphx-glr-download-getting-started-tutorials-05-layer-norm-py<70>uh%h<>h(M<>hh<>hhh&h'ubhb)<29><>}<7D>(hhh]<5D>h <09> container<65><72><EFBFBD>)<29><>}<7D>(hX.. container:: sphx-glr-download sphx-glr-download-python
2022-06-05 21:05:02 +00:00
:download:`Download Python source code: 05-layer-norm.py <05-layer-norm.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: 05-layer-norm.ipynb <05-layer-norm.ipynb>`<60>h]<5D>(jv)<29><>}<7D>(h<05>L:download:`Download Python source code: 05-layer-norm.py <05-layer-norm.py>`<60>h]<5D>hm)<29><>}<7D>(hj}h]<5D>h<00>download_reference<63><65><EFBFBD>)<29><>}<7D>(hj}h]<5D>h <09>literal<61><6C><EFBFBD>)<29><>}<7D>(hj}h]<5D>h<11>-Download Python source code: 05-layer-norm.py<70><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhj<>ubah}<7D>(h]<5D>h]<5D>(h<><68>download<61>eh]<5D>h]<5D>h!]<5D>uh%j<>hj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>refdoc<6F>h<EFBFBD><68> refdomain<69>h<06>reftype<70>j<EFBFBD><00> refexplicit<69><74><EFBFBD>refwarn<72><6E>h<EFBFBD><68>05-layer-norm.py<70><79>filename<6D><65>1935c0dd0fbeb4b2e69588471cbb2d4b2/05-layer-norm.py<70>uh%j<>h&h'h(M<>hjubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%hlh&h'h(M<>hj{ubah}<7D>(h]<5D>h]<5D>(<28>sphx-glr-download<61><64>sphx-glr-download-python<6F>eh]<5D>h]<5D>h!]<5D>uh%juhjwubjv)<29><>}<7D>(h<05>P:download:`Download Jupyter notebook: 05-layer-norm.ipynb <05-layer-norm.ipynb>`<60>h]<5D>hm)<29><>}<7D>(hj<>h]<5D>j<EFBFBD>)<29><>}<7D>(hj<>h]<5D>j<EFBFBD>)<29><>}<7D>(hj<>h]<5D>h<11>.Download Jupyter notebook: 05-layer-norm.ipynb<6E><62><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhj<>ubah}<7D>(h]<5D>h]<5D>(h<><68>download<61>eh]<5D>h]<5D>h!]<5D>uh%j<>hj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>refdoc<6F>h<EFBFBD><68> refdomain<69>h<06>reftype<70>j<EFBFBD><00> refexplicit<69><74><EFBFBD>refwarn<72><6E>h<EFBFBD><68>05-layer-norm.ipynb<6E>j<EFBFBD><00>4ae7fff29e1b574187bc930ed94bcc353/05-layer-norm.ipynb<6E>uh%j<>h&h'h(M<>hj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%hlh&h'h(M<>hj<>ubah}<7D>(h]<5D>h]<5D>(<28>sphx-glr-download<61><64>sphx-glr-download-jupyter<65>eh]<5D>h]<5D>h!]<5D>uh%juhjwubeh}<7D>(h]<5D>h]<5D>(<28>sphx-glr-footer<65><72>class<73><73>sphx-glr-footer-example<6C>eh]<5D>h]<5D>h!]<5D>uh%juhjrhhh&Nh(Nubah}<7D>(h]<5D>jqah]<5D>h]<5D><><sphx_glr_download_getting-started_tutorials_05-layer-norm.py<70>ah]<5D>h!]<5D>h<EFBFBD><68>html<6D>uh%hahhh&h'h(M<>hexpect_referenced_by_name<6D>}<7D>j<EFBFBD>jgs<>expect_referenced_by_id<69>}<7D>jqjgsubhb)<29><>}<7D>(hhh]<5D>hm)<29><>}<7D>(h<05>I`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_<>h]<5D>(h <09> reference<63><65><EFBFBD>)<29><>}<7D>(hjh]<5D>h<11>#Gallery generated by Sphinx-Gallery<72><79><EFBFBD><EFBFBD><EFBFBD>}<7D>(h<05>#Gallery generated by Sphinx-Gallery<72>hjubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>name<6D><65>#Gallery generated by Sphinx-Gallery<72><79>refuri<72><69> https://sphinx-gallery.github.io<69>uh%jhj<>ubh<62>)<29><>}<7D>(h<05># <https://sphinx-gallery.github.io><3E>h]<5D>h}<7D>(h]<5D><>#gallery-generated-by-sphinx-gallery<72>ah]<5D>h]<5D><>#gallery generated by sphinx-gallery<72>ah]<5D>h!]<5D><>refuri<72>juh%h<><68>
referenced<EFBFBD>Khj<>ubeh}<7D>(h]<5D>h]<5D><>sphx-glr-signature<72>ah]<5D>h]<5D>h!]<5D>uh%hlh&h'h(M<>hj<>hhubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>h<EFBFBD><68>html<6D>uh%hahhh&h'h(M<>hh<>ubeh}<7D>(h]<5D>(<28>layer-normalization<6F>h<EFBFBD>eh]<5D><>sphx-glr-example-title<6C>ah]<5D>(<28>layer normalization<6F><6E>3sphx_glr_getting-started_tutorials_05-layer-norm.py<70>eh]<5D>h!]<5D>uh%h<>hhhhh&h'h(Kj<>}<7D>j9h<>sj<73>}<7D>h<EFBFBD>h<EFBFBD>subeh}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>source<63>h'uh%h<01>current_source<63>N<EFBFBD> current_line<6E>N<EFBFBD>settings<67><73>docutils.frontend<6E><64>Values<65><73><EFBFBD>)<29><>}<7D>(h<>N<EFBFBD> generator<6F>N<EFBFBD> datestamp<6D>N<EFBFBD> source_link<6E>N<EFBFBD>
source_url<EFBFBD>N<EFBFBD> toc_backlinks<6B><73>entry<72><79>footnote_backlinks<6B>K<01> sectnum_xform<72>K<01>strip_comments<74>N<EFBFBD>strip_elements_with_classes<65>N<EFBFBD> strip_classes<65>N<EFBFBD> report_level<65>K<02>
halt_level<EFBFBD>K<05>exit_status_level<65>K<05>debug<75>N<EFBFBD>warning_stream<61>N<EFBFBD> traceback<63><6B><EFBFBD>input_encoding<6E><67> utf-8-sig<69><67>input_encoding_error_handler<65><72>strict<63><74>output_encoding<6E><67>utf-8<><38>output_encoding_error_handler<65>ja<00>error_encoding<6E><67>utf-8<><38>error_encoding_error_handler<65><72>backslashreplace<63><65> language_code<64><65>en<65><6E>record_dependencies<65>N<EFBFBD>config<69>N<EFBFBD> id_prefix<69>h<06>auto_id_prefix<69><78>id<69><64> dump_settings<67>N<EFBFBD>dump_internals<6C>N<EFBFBD>dump_transforms<6D>N<EFBFBD>dump_pseudo_xml<6D>N<EFBFBD>expose_internals<6C>N<EFBFBD>strict_visitor<6F>N<EFBFBD>_disable_config<69>N<EFBFBD>_source<63>h'<27> _destination<6F>N<EFBFBD> _config_files<65>]<5D><>pep_references<65>N<EFBFBD> pep_base_url<72><6C> https://www.python.org/dev/peps/<2F><>pep_file_url_template<74><65>pep-%04d<34><64>rfc_references<65>N<EFBFBD> rfc_base_url<72><6C>https://tools.ietf.org/html/<2F><> tab_width<74>K<08>trim_footnote_reference_space<63><65><EFBFBD>file_insertion_enabled<65><64><EFBFBD> raw_enabled<65>K<01>syntax_highlight<68><74>long<6E><67> smart_quotes<65><73><EFBFBD>smartquotes_locales<65>]<5D><>character_level_inline_markup<75><70><EFBFBD>doctitle_xform<72><6D><EFBFBD> docinfo_xform<72>K<01>sectsubtitle_xform<72><6D><EFBFBD>embed_stylesheet<65><74><EFBFBD>cloak_email_addresses<65><73><EFBFBD>env<6E>Nub<75>reporter<65>N<EFBFBD>indirect_targets<74>]<5D><>substitution_defs<66>}<7D><>substitution_names<65>}<7D><>refnames<65>}<7D><>refids<64>}<7D>(h<>]<5D>h<EFBFBD>ajq]<5D>jgau<61>nameids<64>}<7D>(j9h<>j8j4j<>jqjju<> nametypes<65>}<7D>(j9<00>j8Nj<4E><00>j<00>uh}<7D>(h<>h<EFBFBD>j4h<>jqjrjju<> footnote_refs<66>}<7D><> citation_refs<66>}<7D><> autofootnotes<65>]<5D><>autofootnote_refs<66>]<5D><>symbol_footnotes<65>]<5D><>symbol_footnote_refs<66>]<5D><> footnotes<65>]<5D><> citations<6E>]<5D><>autofootnote_start<72>K<01>symbol_footnote_start<72>K<00>
id_counter<EFBFBD><EFBFBD> collections<6E><73>Counter<65><72><EFBFBD>}<7D><><EFBFBD>R<EFBFBD><52>parse_messages<65>]<5D><>transform_messages<65>]<5D>(h <09>system_message<67><65><EFBFBD>)<29><>}<7D>(hhh]<5D>hm)<29><>}<7D>(hhh]<5D>h<11>YHyperlink target "sphx-glr-getting-started-tutorials-05-layer-norm-py" is not referenced.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%hlhj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>level<65>K<01>type<70><65>INFO<46><4F>source<63>h'<27>line<6E>Kuh%j<>ubj<62>)<29><>}<7D>(hhh]<5D>hm)<29><>}<7D>(hhh]<5D>h<11>bHyperlink target "sphx-glr-download-getting-started-tutorials-05-layer-norm-py" is not referenced.<2E><><EFBFBD><EFBFBD><EFBFBD>}<7D>(hhhj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D>uh%hlhj<>ubah}<7D>(h]<5D>h]<5D>h]<5D>h]<5D>h!]<5D><>level<65>K<01>type<70>j<EFBFBD><00>source<63>h'<27>line<6E>M<EFBFBD>uh%j<>ube<62> transformer<65>N<EFBFBD>
decoration<EFBFBD>Nhhub.