1ytic/warp-rnnt

CUDA-Warp RNN-Transducer

A GPU implementation of RNN Transducer (Graves 2012, 2013). This code is ported from the reference implementation by Awni Hannun and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, a nested loop populates a lattice of shape (T, U), and each value in this lattice depends on two previously computed cells, one from each dimension (e.g. in the forward pass).
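The recurrence over this lattice can be sketched in plain NumPy. This is a single-sequence sketch of the standard RNN-T forward pass, not the CUDA kernel; the (T, U+1, V) layout of `log_probs` and blank index 0 are assumptions for the illustration.

```python
import numpy as np

def logsumexp(a, b):
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def rnnt_forward(log_probs, labels, blank=0):
    """Fill the (T, U+1) alpha lattice for one sequence.

    log_probs: (T, U+1, V) log-softmax outputs of the joint network.
    labels:    (U,) target label indices.
    Each cell combines the cell below (emit blank, advance t)
    and the cell to the left (emit the next label, advance u).
    """
    T, U1, _ = log_probs.shape
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            if u > 0:
                emit = alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
                alpha[t, u] = emit if t == 0 else logsumexp(alpha[t, u], emit)
    # total log-likelihood: reach the last cell, then emit a final blank
    return alpha[T - 1, U1 - 1] + log_probs[T - 1, U1 - 1, blank]
```

With uniform log-probabilities the result reduces to the number of monotonic alignment paths times the per-emission probability, which gives a quick sanity check of the recurrence.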

CUDA executes threads in groups of 32 parallel threads called warps. Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly what is exploited to optimize the RNN Transducer: the lattice is split into warps along the T dimension. Within each warp, values are exchanged between threads using fast warp-level operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic of the forward pass is shown in the figure below, where T is the number of frames, U the number of labels, and W the warp size. A similar procedure for the backward pass runs in parallel.
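The resulting wavefront schedule can be modeled with a toy computation (an illustration only, not the kernel's actual scheduling logic): a warp-sized tile (i, j) of the lattice becomes runnable one round after both its neighbours (i-1, j) and (i, j-1) have finished.

```python
def tile_start_rounds(T, U, W=32):
    """Earliest round in which each warp tile can start.

    The lattice is split into ceil(T / W) tiles along the T dimension,
    one per warp; each tile waits for its left and lower neighbours.
    Toy model of the dependency pattern, not the CUDA code itself.
    """
    n_tiles_t = (T + W - 1) // W
    rounds = [[0] * U for _ in range(n_tiles_t)]
    for i in range(n_tiles_t):
        for j in range(U):
            deps = []
            if i > 0:
                deps.append(rounds[i - 1][j])
            if j > 0:
                deps.append(rounds[i][j - 1])
            rounds[i][j] = 1 + max(deps) if deps else 0
    return rounds
```

The last tile starts in round (⌈T/W⌉ - 1) + (U - 1), so the whole lattice completes in O(T/W + U) wavefront rounds instead of T·U/W sequential warp launches.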

Performance

NVIDIA Profiler shows the advantage of the warp implementation over the non-warp implementation.

This warp implementation:

The non-warp implementation, warp-transducer:

Unfortunately, in practice this advantage disappears because the memory operations take much longer, especially if memory is synchronized on each iteration.

|                       | warp_rnnt (gather=False) | warp_rnnt (gather=True) | warprnnt_pytorch | transducer (CPU) |
|-----------------------|--------------------------|-------------------------|------------------|------------------|
| **T=150, U=40, V=28** |                          |                         |                  |                  |
| N=1                   | 0.50 ms                  | 0.54 ms                 | 0.63 ms          | 1.28 ms          |
| N=16                  | 1.79 ms                  | 1.72 ms                 | 1.85 ms          | 6.15 ms          |
| N=32                  | 3.09 ms                  | 2.94 ms                 | 2.97 ms          | 12.72 ms         |
| N=64                  | 5.83 ms                  | 5.54 ms                 | 5.23 ms          | 23.73 ms         |
| N=128                 | 11.30 ms                 | 10.74 ms                | 9.99 ms          | 47.93 ms         |
| **T=150, U=20, V=5000** |                        |                         |                  |                  |
| N=1                   | 0.95 ms                  | 0.80 ms                 | 1.74 ms          | 21.18 ms         |
| N=16                  | 8.74 ms                  | 6.24 ms                 | 16.20 ms         | 240.11 ms        |
| N=32                  | 17.26 ms                 | 12.35 ms                | 31.64 ms         | 490.66 ms        |
| N=64                  | out-of-memory            | out-of-memory           | out-of-memory    | 944.73 ms        |
| N=128                 | out-of-memory            | out-of-memory           | out-of-memory    | 1894.93 ms       |
| **T=1500, U=300, V=50** |                        |                         |                  |                  |
| N=1                   | 5.89 ms                  | 4.99 ms                 | 10.02 ms         | 121.82 ms        |
| N=16                  | 95.46 ms                 | 78.88 ms                | 76.66 ms         | 732.50 ms        |
| N=32                  | out-of-memory            | 157.86 ms               | 165.38 ms        | 1448.54 ms       |
| N=64                  | out-of-memory            | out-of-memory           | out-of-memory    | 2767.59 ms       |

Benchmarked on a GeForce RTX 2070 Super GPU and an Intel i7-10875H CPU @ 2.30GHz.

Note

  • This implementation assumes that the input is the output of log_softmax.

  • In addition to the alphas/betas arrays, a counts array of shape (N, U * 2) is allocated and used as a scheduling mechanism.

  • core_gather.cu is a memory-efficient version that expects log_probs of shape (N, T, U, 2), containing only the blank and label values. It shows excellent performance with a large vocabulary.

  • Do not expect this implementation to greatly reduce the training time of an RNN Transducer model. Most likely, the main bottleneck will be the trainable joint network with an output of shape (N, T, U, V).

  • Also, there is a restricted version, called Recurrent Neural Aligner, with the assumption that the length of the input sequence is greater than or equal to the length of the target sequence.
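The compact (N, T, U, 2) layout can be illustrated with a NumPy gather from the full (N, T, U, V) tensor. This is a sketch only: padding channel 1 with the blank column at the final u position is an assumption for the illustration, and the actual core_gather.cu kernel may handle that boundary differently.

```python
import numpy as np

def gather_blank_and_labels(log_probs, labels, blank=0):
    """Compress full (N, T, U, V) log-probs to (N, T, U, 2).

    Channel 0 keeps the blank log-prob; channel 1 keeps the log-prob of
    the label emitted when leaving position u. The last u has no next
    label, so channel 1 is padded with the blank column there (assumed).
    labels has shape (N, U - 1).
    """
    N, T, U, _ = log_probs.shape
    out = np.empty((N, T, U, 2))
    out[..., 0] = log_probs[..., blank]
    # (N, U) label index per lattice position, blank-padded at the end
    idx = np.concatenate(
        [labels, np.full((N, 1), blank, dtype=labels.dtype)], axis=1)
    idx = np.broadcast_to(idx[:, None, :], (N, T, U))[..., None]
    out[..., 1] = np.take_along_axis(log_probs, idx, axis=-1)[..., 0]
    return out
```

Only two values per lattice cell survive, which is why this layout scales to large vocabularies where the full (N, T, U, V) tensor runs out of memory.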

Install

There are two bindings for the core algorithm:

Reference

