Tullio.jl

Tullio is a very flexible einsum macro. It understands many array operations written in index notation -- not just matrix multiplication and permutations, but also convolutions, stencils, scatter/gather, and broadcasting. For example:

```julia
@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j]    # sum over i,j, and create M
@tullio S[x] = P[x,y] * log(Q[x,y] / R[y])    # sum over y, and write into S
@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j]  # sum over k,l, and add to values in A
@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k])  # product over k
```

Used by itself the macro writes ordinary nested loops much like Einsum.@einsum. One difference is that it can parse more expressions, and infer ranges for their indices. Another is that it will use multi-threading (via Threads.@spawn) and recursive tiling, on large enough arrays. But it also co-operates with various other packages, provided they are loaded before the macro is called:

  • It uses LoopVectorization.@avx to speed many things up. (Disable with keyword avx=false.) On a good day this will match the speed of OpenBLAS for matrix multiplication.

  • It uses KernelAbstractions.@kernel to make a GPU version. (Disable with cuda=false.) This is somewhat experimental, and may not be fast.

The macro also tries to provide a gradient for use with Tracker or (via ChainRules) for Zygote, Yota, etc. (Disable with grad=false, or nograd=A.) This is done in one of two ways:

  • By default it takes a symbolic derivative of the right hand side expression. This works for reductions over + or min/max. The functions as typed must be known, mostly from DiffRules.

  • The option grad=Dual instead uses ForwardDiff to differentiate the right hand side (only for reductions over +). This allows for more complicated expressions, as sketched below.
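
For instance, the same scalar reduction can be differentiated either way (a small sketch, not from the package's own tests; f and g are illustrative names):

```julia
using Tullio, ForwardDiff, Zygote

# Default: symbolic derivative of the right hand side, since * and exp are in DiffRules.
f(A) = @tullio s := A[i]^2 * exp(A[i])

# grad=Dual: the right hand side is differentiated with ForwardDiff instead.
g(A) = @tullio s := A[i]^2 * exp(A[i])  grad=Dual

A = rand(3)
Zygote.gradient(f, A)[1] ≈ Zygote.gradient(g, A)[1]  # true
```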

The entire right hand side is summed over the full possible range of any indices not appearing on the left. Pipe operators |> or <| indicate functions to be performed outside the sum, for example:

```julia
@tullio lse[j] := log <| exp(mat[i,j])  # vec(log.(sum(exp.(mat), dims=1)))
```

The option @tullio verbose=true will cause it to print index ranges, symbolic derivatives, and notices when it is unable to use the packages mentioned above. And verbose=2 will print everything.

If it's useful in academic work, you can cite it with its DOI.

Notation

Index notation for some simple functions:

```julia
using Pkg; Pkg.add("Tullio")

using Tullio, Test

M = rand(1:20, 3, 7)

@tullio S[1,c] := M[r,c]  # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1)

@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c])  # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)

mult(M, Q) = @tullio P[x,y] := M[x,c] * Q[y,c]  # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M, Q) ≈ M * transpose(Q)

R = [rand(Int8, 3, 4) for δ in 1:5]

@tullio T[j,i,δ] := R[δ][i,j] + 10im  # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im

@tullio (max) X[i] := abs2(T[j,i,δ])  # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))

dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c]  # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r in 1:3, c in 1:7)
```

More complicated examples:

```julia
using Tullio

A = [abs2(i - 11) for i in 1:21]

# Downsample -- range of i is that allowed by both terms:
@tullio B[i] := (A[2i] + A[2i+1])/2  # 1:10 == intersect(1:10, 0:10)

# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1]  (j in 1:15)  # i in 1:7
@tullio M[i+_,j] := A[i+j]  (j in 1:15)  # i in 0:6, automatic shift "i+_"

using OffsetArrays  # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j]  # j ∈ -2:2 implies i ∈ 3:19

# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j]  # extrema(K) == (-1,2) implies i ∈ 3:17

# Wrapped & padded:
@tullio M[i,j] := A[mod(i+j)]  (j in 1:15, i in 1:15)  # wraps around, mod(i+j, axes(A,1))
@tullio M[i,j] := A[clamp(i+j)]  (j in 1:15, i in 1:15)  # instead repeats "100"
@tullio M[i+_,j] := A[pad(i+j, 3)]  (j in 1:15)  # fills with zeros

using FFTW  # Functions of the indices are OK:
S = [0,1,0,0,0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x)  (k in axes(S,1))

# Finalisers <| or |> are applied after sum (the two are equivalent):
@tullio N2[j] := sqrt <| M[i,j]^2  # N2 ≈ map(norm, eachcol(M))
@tullio n3[_] := A[i]^3 |> (_)^(1/3)  # n3[1] ≈ norm(A,3), with _ anon. func.

# Reduction over any function:
@tullio (*) P[i] := A[i+k]  (k in 0:2)  # product
@tullio (max) X[i,_] := D[i,j]  # maximum(D, dims=2), almost

min1(x, y) = ifelse(first(x) < first(y), x, y)  # findmin(D, dims=1), almost:
@tullio (min1) Ts[j+_] := (D[i,j], (i,j))  init=(typemax(Int), (0,0))

# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])

# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))

using NamedDims, AxisKeys  # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1]  (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1]  # sum over j,k
```

Fast & Slow

When used with LoopVectorization, on straightforward matrix multiplication of real numbers, @tullio tends to be about as fast as OpenBLAS, depending on the size, and on your computer. Here's a speed comparison on mine (see the v2.5 benchmark plot in the repository).

This race is a useful diagnostic, but isn't really the goal. There is little point in avoiding using BLAS libraries, if you want precisely what they are optimised to give you. One of the things @tullio is often very fast at is weird tensor contractions, for which you would otherwise need permutedims:

```julia
using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]  # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1)))  # true

@btime bmm_rev($A, $B);  # 317.526 μs, same speed as un-permuted
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1)));  # 1.478 ms, with MKL
```

Complex numbers aren't handled by LoopVectorization, so will be much slower.
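
For example, this contraction falls back to plain loops, which must compete with complex BLAS (a sketch; timings are omitted since they depend on the machine):

```julia
using Tullio, LoopVectorization, BenchmarkTools

A = randn(ComplexF64, 100, 100); B = randn(ComplexF64, 100, 100);

@btime @tullio C[i,k] := $A[i,j] * $B[j,k];  # plain loops, no @avx
@btime $A * $B;                              # complex BLAS
```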

Chained multiplication is also very slow, because it doesn't know there's a better algorithm. Here it just makes 4 loops, instead of multiplying sequentially, 30^4 instead of 2 * 30^3 operations:

```julia
M1, M2, M3 = randn(30,30), randn(30,30), randn(30,30);

@btime $M1 * $M2 * $M3;  # 3.525 μs
@btime @tullio M4[i,l] := $M1[i,j] * $M2[j,k] * $M3[k,l];  # 30.401 μs
```

Or slightly less obviously:

```julia
M, Σ = randn(100,100), randn(100,100);

@tullio R4[i,j] := (M[μ,i] - M[μ,j])' * Σ[μ,ν] * (M[ν,i] - M[ν,j]);

begin
    S = M' * Σ * M  # two N^3 operations, instead of one N^4
    @tullio R3[i,j] := S[i,i] + S[j,j] - S[i,j] - S[j,i]
end;
R3 ≈ R4  # true
```

Another thing Tullio can be very fast at is broadcast reductions, where it can avoid large allocations. Here LoopVectorization is speeding up log, and Tullio is handling tiled memory access and multi-threading:

```julia
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
sum_part(X, Y=X) = @tullio S[i] := X[i,j] * log(Y[j,i])

X = rand(1000,1000);

@btime sum_opp($X)  # 499.814 μs (93 allocations: 3.97 KiB)
@btime sum($X .* log.(transpose($X)))  # 8.759 ms (2 allocations: 7.63 MiB)

@btime sum_part($X)'  # 1.599 ms (not the same computer!)
@btime sum($X .* log.(transpose($X)), dims=2)  # 13.292 ms
```

At present indices using pad, clamp or mod are also slow. These result in extra checks or operations at every iteration, not just around the edges:

```julia
conv1(x, k) = @tullio y[i+_, j+_] := x[i+a, j+b] * k[a,b]
conv2(x, k) = @tullio y[i+_, j+_] := x[2i-a, 2j-b] * k[a,b]
conv3(x, k) = @tullio y[i+_, j+_] := x[pad(i-a, 3), pad(j-b, 3)] * k[a,b]

x100 = rand(100,100); k7 = randn(7,7);

@btime conv1($x100, $k7);  # 25.574 μs
@btime conv2($x100, $k7);  # 44.590 μs
@btime conv3($x100, $k7);  # 86.228 μs

using Flux

x104 = reshape(x100, (100,100,1,1)); k74 = reshape(k7, (7,7,1,1));

conv1(x100, k7) ≈ @btime CrossCor($k74, false)($x104)        # 586.694 μs
conv2(x100, k7) ≈ @btime Conv($k74, false, stride=2)($x104)  # 901.573 μs
conv3(x100, k7) ≈ @btime Conv($k74, false, pad=3)($x104)     # 932.658 μs

using DSP
@btime DSP.conv($x100, $k7);  # 198.331 μs
```

Gradients & GPU

```julia
using Tullio

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B)  # true

using Tracker  # or Zygote

ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B'  # true

using CUDA, KernelAbstractions  # Now defined with a GPU version:

mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) ≈ mul(cu(A), cu(B))  # true

cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1]  # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]
```

Some warnings are in order:

  • Complete reductions to a number will not work on the GPU at present. They were extremely slow, and a re-organisation of multi-threading for the CPU case killed them, sorry.
  • Gradients are not calculated for scalars, only arrays. Thus for example gradient(a -> (@tullio _ := $a * A[i]), 3.14) will be zero.
  • When using grad=Dual, the right hand side is evaluated a second time during the backward pass. This avoids needing memory to store partials, but if the function is expensive, it may be slow.

Larger Expressions

The expression need not be just one line, for example:

```julia
@tullio out[x, y] := @inbounds(begin  # sum over k
        a, b = off[k]
        mat[mod(x+a), mod(y+b)]
    end) (x in axes(mat,1), y in axes(mat,2))  grad=Dual nograd=off
```

Here the macro cannot infer the range of the output's indices x, y, so they must be provided explicitly. (If writing into an existing array, with out[x,y] = begin ... or +=, then ranges would be taken from there.) Because it sees assignment being made, it does not attempt to sum over a, b, and it assumes that indices could go out of bounds, so it does not add @inbounds for you. (Although in fact mod(x+a) == mod(x+a, axes(mat,1)) is safe.) It will also not be able to take a symbolic derivative, but dual numbers will work fine.

More examples:

```julia
using Tullio, OffsetArrays

# A convolution with cyclic indices
mat = zeros(10,10,1); mat[2,2] = 101; mat[10,10] = 1;
@tullio kern[i,j] := 1/(1 + i^2 + j^2)  (i in -3:3, j in -3:3)

@tullio out[x,y,c] := begin
        xi = mod(x+i, axes(mat,1))  # xi = ... means that it won't be summed,
        # yj = mod(y+j, axes(mat,2))
        @inbounds trunc(Int, mat[xi, mod(y+j), c] * kern[i,j])  # and disables automatic @inbounds,
    end (x in 1:10, y in 1:10)  # and prevents range of x from being inferred.

# A stencil?
offsets = [(a,b) for a in -2:2 for b in -2:2 if a >= b]  # vector of tuples

@tullio out[x,y,1] = begin
        a, b = offsets[k]
        i = clamp(x+a, extrema(axes(mat,1))...)
        # j = clamp(y+b, extrema(axes(mat,2))...)  # can be written clamp(y+b)
        @inbounds mat[i, clamp(y+b), 1] * 10
    end  # ranges of x,y read from out[x,y,1]

# Applying a vector of functions
fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])

using Zygote, ForwardDiff

rowmap(fs, xs) = @tullio ys[r,c] := (fs[r])(xs[r,c])  grad=Dual nograd=fs
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs]  # agrees
```

Keyword Options

The default setting is: @tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ... (A sketch combining several of these options appears after this list.)

  • threads=false turns off threading, while threads=64^3 sets a threshold size at which to divide the work (replacing the macro's best guess).
  • avx=false turns off the use of LoopVectorization, while avx=4 inserts @avx unroll=4 for i in ....
  • grad=false turns off gradient calculation, and grad=Dual switches it to use ForwardDiff (which must be loaded).
  • nograd=A turns off the gradient calculation just for A, and nograd=(A,B,C) does this for several arrays.
  • tensor=false turns off the use of TensorOperations.
  • Assignment xi = ... removes xi from the list of indices: its range is not calculated, and it will not be summed over. It also disables @inbounds, since this is now up to you.
  • verbose=true prints things like the index ranges inferred, and gradient calculations. verbose=2 prints absolutely everything.
  • A[i,j] := ... makes a new array, while A[i,j] = ... and A[i,j] += ... write into an existing one. A[row=i, col=j] := ... makes a new NamedDimsArray.
  • @tullio (*) A[i,j] := ... is a product, as is @tullio A[i,j] *= .... For other reductions, @tullio (f) A[i,j] ^= ... is an in-place update.
  • init=0.0 gives the initial value for reductions. For +, *, min, max, &, | it has sensible defaults; for other reductions it uses zero.
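
A small sketch combining several of these options (illustrative only, not from the package's tests):

```julia
using Tullio

A = rand(10, 10)

# Product reduction, single-threaded, printing the inferred ranges:
@tullio (*) P[i] := A[i,j] + 1  threads=false verbose=true

# In-place update: existing values of B take part in the max reduction.
B = fill(0.5, 10)
@tullio (max) B[i] ^= A[i,j]
```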

Implicit:

  • Indices without shifts must have the same range everywhere they appear, but those with shifts (even A[i+0]) run over the intersection of possible ranges.
  • Shifted output indices must start at 1, unless OffsetArrays is visible in the calling module.
  • The use of @avx, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays).
  • Gradient hooks are attached for any or all of ReverseDiff, Tracker & Zygote. These packages need not be loaded when the macro is run.
  • Gradients are only defined for reductions over (+) (the default) and min, max.
  • GPU kernels are only constructed when both KernelAbstractions and CUDA are visible. The default cuda=256 is passed to kernel(CUDA(), 256).
  • The CPU kernels from KernelAbstractions are called only when threads=false; they are not at present very fast, but perhaps useful for testing.

Extras:

  • A[i] := i^2  (i in 1:10) is how you specify a range for indices when this can't be inferred.
  • A[i] := B[i, $col] - C[i, 2] is how you fix one index to a constant (to prevent col being summed over), as sketched after this list.
  • A[i] := $d * B[i] is the preferred way to include other constants. Note that no gradient is calculated for d.
  • Within indexing, A[mod(i), clamp(j)] both maps i & j to lie within axes(A), and disables inference of their ranges from A.
  • Similarly, A[pad(i,3)] extends the range of i, inserting zeros outside of A. Instead of zero, pad=NaN uses this value as padding. The implementation of this (and mod, clamp) is not very fast at present.
  • On the left, when making a new array, an underscore like A[i+_] := inserts whatever shift is needed to make A one-based.
  • Tullio.@printgrad (x+y)*log(x/z) x y z prints out how symbolic derivatives will be done.
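
For example, the two $ forms together (a brief sketch; A1, A2 are illustrative names):

```julia
using Tullio

B = rand(5, 5)

col = 3
@tullio A1[i] := B[i, $col]    # col is fixed, so nothing is summed here

d = 2.5
@tullio A2[i] := $d * B[i, j]  # sum over j; no gradient is calculated for d
```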

Macros:

  • Tullio.@tensor is a macro which uses TensorOperations to evaluate expressions, but provides gradient definitions, as sketched after this list. (Previously this was automatic behaviour, when TensorOperations.jl was loaded & the expression was suitable.)
  • Tullio.@einsum is a variant with a few changes, to allow the running of Einsum.jl's tests.
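
For example (a sketch of using Tullio.@tensor for its gradient; mul_t is an illustrative name, and TensorOperations must be loaded):

```julia
using Tullio, TensorOperations, Zygote

mul_t(A, B) = Tullio.@tensor C[i,j] := A[i,k] * B[k,j]

A, B = rand(3, 40), rand(40, 5)
Zygote.gradient((A, B) -> sum(mul_t(A, B)), A, B)[1] ≈ ones(3, 5) * B'  # true
```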

How it Works

The following three macros all end up calling the same functions as does C = A * B:

```julia
@tensor C[i,j] := A[i,k] * B[k,j]         # TensorOperations.jl
@ein C[i,j] := A[i,k] * B[k,j]            # OMEinsum.jl
@matmul C[i,j] := sum(k) A[i,k] * B[k,j]  # TensorCast.jl
```

But this one writes its own for-loops:

```julia
@einsum C[i,j] := A[i,k] * B[k,j]  # Einsum.jl
```

expanding out to roughly this:

```julia
T = promote_type(eltype(A), eltype(B))
C = Array{T}(undef, size(A,1), size(B,2))
@inbounds for j in 1:size(B,2)
    for i in 1:size(A,1)
        acc = zero(T)
        for k in 1:size(A,2)
            acc += A[i,k] * B[k,j]
        end
        C[i,j] = acc
    end
end
```

Tullio does something similar, but working through a few functions. Taking a slightly more complicated example, this:

```julia
@tullio C[i,j] := tanh <| A[i,k] * B[k,j]
```

expands to roughly this:

```julia
function act!(::Type, C::AbstractArray{T}, A, B, ax_i, ax_j, ax_k, keep=nothing, final=true) where T
    @inbounds @fastmath for i in ax_i
        for j in ax_j
            acc = isnothing(keep) ? zero(T) : C[i,j]
            for k in ax_k
                acc += A[i,k] * B[k,j]
            end
            C[i,j] = isnothing(final) ? acc : tanh(acc)
        end
    end
end

function make(A, B)
    ax_i = axes(A,1)
    ax_j = axes(B,2)
    ax_k = axes(A,2)  # and check this is == axes(B,1)
    rhs(A,B,i,j,k) = tanh(A[i,k] * B[k,j])
    T = Core.Compiler.return_type(rhs, eltype.((A,B,1,1,1)))  # plus a fallback
    C = similar(A, T, (ax_i, ax_j))
    Tullio.threader(act!, Array{T}, C, (A,B), (ax_i,ax_j), (ax_k,), +, 64^3)
    return C
end

C = Tullio.Eval(make, ∇make)(A, B)
```

This division allows it to dispatch to other methods of act!: one generated with @avx if LoopVectorization is loaded, and one for ::CuArray if KernelAbstractions is loaded.

It also allows threader to divide the work, calling act! many times, from different threads, on small tiles made by dividing the longest axis (say ax_i) in half, repeatedly. If it divides up ax_k, these are done sequentially, with keep=true on all ranges except the first, and final=nothing on all except the last. But ax_i and ax_j are safe to do in parallel.
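
Schematically, if ax_k = 1:100 were divided once, the two sequential calls would look something like this (a sketch in terms of the act! above, not the exact internals):

```julia
act!(Array, C, A, B, ax_i, ax_j, 1:50, nothing, nothing)  # overwrite C, don't apply tanh yet
act!(Array, C, A, B, ax_i, ax_j, 51:100, true, true)      # accumulate into C, then apply tanh
```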

Finally, Eval exists to give Zygote and friends somewhere to attach themselves. The gradient calculation looks roughly like this:

```julia
@adjoint function (e::Eval)(AB...)
    C = e.fwd(AB...)
    C, ΔC -> e.rev(ΔC, C, AB...)
end

function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    for k in ax_k, i in ax_i, j in ax_j
        ex = ΔC[i,j] * (1 - C[i,j]^2)  # derivative of tanh, in terms of the output C
        ΔA[i,k] += ex * B[k,j]
        ΔB[k,j] += A[i,k] * ex
    end
end

function ∇make(ΔC, C, A, B)
    ΔA = similar(A) .= 0
    ΔB = similar(B) .= 0
    ax_i, ax_k = axes(A); ax_j = axes(B,2)
    Tullio.∇threader(∇act!, Array{T}, (ax_k,), (ax_i, ax_j), nothing)
    return (ΔA, ΔB)
end
```

In this case, it is the loop over k which can be safely broken among different threads, since both ΔA and ΔB have this index. Both ΔA and ΔB are filled in at once.

Notice that the derivative of y = tanh(z) has been written in terms of y (the final result of the forward pass) but free of z (the result of the sum, which was not saved). If this is not possible, it will fail.
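
You can inspect the symbolic rule it plans to use with Tullio.@printgrad, mentioned under Extras above; for tanh the rule is based on the identity tanh'(z) = 1 - tanh(z)^2:

```julia
using Tullio

Tullio.@printgrad tanh(z) z  # prints the symbolic derivative it will use
```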

If using grad=Dual, the gradient kernel looks different. This method cannot handle finalisers like tanh above, but for plain @tullio C[i,j] := A[i,k] * B[k,j] it would read:

```julia
function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    eps1 = ForwardDiff.Dual(0, (1,0))
    eps2 = ForwardDiff.Dual(0, (0,1))
    for k in ax_k, i in ax_i, j in ax_j
        res = (A[i,k] + eps1) * (B[k,j] + eps2)
        ΔA[i,k] += ForwardDiff.partials(res, 1) * ΔC[i,j]
        ΔB[k,j] += ForwardDiff.partials(res, 2) * ΔC[i,j]
    end
end
```

Writing @tullio verbose=2 will print all of these functions out.

Scalar reductions, such as @tullio s := A[i,j] * log(B[j,i]), are slightly different in that the act! function simply returns the sum, i.e. the variable acc above.
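
A minimal sketch of that scalar case (just the shape of it, not the generated code; act_scalar is an illustrative name):

```julia
# roughly what act! becomes for @tullio s := A[i,j] * log(B[j,i])
function act_scalar(A, B, ax_i, ax_j)
    acc = 0.0
    for i in ax_i, j in ax_j
        acc += A[i,j] * log(B[j,i])
    end
    return acc  # returned, rather than written into an output array
end
```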

Elsewhere

Back-end friends & relatives:

Front-end near-lookalikes:

Things you can't run:
