Analytic Differentiation of Folds
Introduction
The accelerate library is a high-level interface to GPU programming, implemented as a DSL in Haskell.
Internally, accelerate programs get compiled from a high-level AST to a lower-level one.
While this is convenient from a compiler author's perspective, it makes automatic differentiation harder: AD can no longer be implemented one scalar operation at a time, as it is in most libraries.
Any AD library for accelerate therefore has to differentiate folds, among other things, as primitives in their own right, not as functions built out of simpler operations.
In this blog post, I will use Sage to derive an "analytic" formula for the derivative of a fold, expressing it in terms of existing functional combinators like map, scan, and zipWith, together with the chain rule.
Computation
Let's get started!
First, we'll introduce some symbolic variables and functions.
var('u,v,w,x,y,z')
function('f')(u,v)
Next, we'll compute the symbolic Jacobian of a left fold to see if we can find any patterns. Concretely, we'll differentiate the result of the following Haskell expression.
foldl1 f [u,v,w,x,y,z]
The symbolic result is a mess, but it does have some structure we can pick apart.
jacobian(f(f(f(f(f(u,v),w),x),y),z),(u,v,w,x,y,z)).list()
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), diff(f(u, v), v)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
Looking at the structure of the symbolic Jacobian, a few things stand out.
- We can replace the original fold with a scan, since the Jacobian requires all the intermediate steps.
- The D[1](f) structure at the beginning of each line has an obvious pattern, which we can capture with a zipWith. The D[0](f) structure that forms the rest of the computation looks like a scan of some sort.
- With a little more juggling, we should be able to express the Jacobian in terms of some combinators.
Let's figure out exactly what's going on. We'll define some Sage versions of the combinators we'll need.
def zipWith(f, xs, ys):
    # Apply f pairwise to two lists, like Haskell's zipWith
    return [f(x, y) for x, y in zip(xs, ys)]

def scanl(f, x0, xs):
    # Left scan: every intermediate result of a left fold seeded with x0,
    # like Haskell's scanl
    res = [x0] + xs
    for i in range(len(xs)):
        res[i+1] = f(res[i], xs[i])
    return res

def scanl1(f, xs):
    # Left scan seeded with the first element of the list, like Haskell's scanl1
    return scanl(f, xs[0], xs[1:])
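As a quick sanity check on these definitions, scanl1 with addition should produce the running sums of its input:
scanl1(lambda a, b: a + b, [1, 2, 3, 4])  # expected: [1, 3, 6, 10]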
Next, let's define some basic terms on which we'll operate.
- f0 and f1 are the partial derivatives of f
- lst is the list of variables we are folding over
- fs is a scan over the input, using our function f
f0 = diff(f(u,v),u)
f1 = diff(f(u,v),v)
lst = [u,v,w,x,y,z]
fs = scanl1(f, lst)
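For reference, fs now holds every intermediate result of the fold, which is exactly why a scan can stand in for the original fold here:
fs  # expected: [u, f(u, v), f(f(u, v), w), ..., f(f(f(f(f(u, v), w), x), y), z)]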
We can now take care of the second item on our list of observations.
term1 = [1]+zipWith(f1, fs, lst[1:])
term1
[1, diff(f(u, v), v), D[1](f)(f(u, v), w), D[1](f)(f(f(u, v), w), x), D[1](f)(f(f(f(u, v), w), x), y), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
Let's push on ahead and compute the last structural component we noticed.
(Protip: [::-1] is a Python idiom for reversing a list.)
term2 = scanl1(lambda x,y:x*y, zipWith(f0, fs, lst[1:])[::-1])[::-1]+[1]
term2
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(f(f(u, v), w), x), y), z), 1]
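To see why the reverse-scan-reverse trick does the job, a small numeric example helps: it should turn a list into its suffix products, which is exactly the cascade of D[0](f) factors above.
scanl1(lambda a, b: a*b, [2, 3, 5][::-1])[::-1]  # expected: [30, 15, 5]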
Finally, multiplying the two terms elementwise recovers the symbolic Jacobian we computed directly at the start.
zipWith(lambda x,y:x*y,term1,term2)
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), diff(f(u, v), v)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
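As a sanity check, we can ask Sage to confirm that the elementwise product really agrees with the Jacobian we computed directly; this should evaluate to True.
manual = jacobian(f(f(f(f(f(u,v),w),x),y),z), (u,v,w,x,y,z)).list()
combined = zipWith(lambda x,y: x*y, term1, term2)
all(bool(a == b) for a, b in zip(manual, combined))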
Conclusion
accelerate has a few more primitives that will require similar treatment, including scans and segmented folds, as well as a few more fold variants.
However, I'm optimistic all of these can be tackled with a similar approach.
The results obtained here can be summarized like so.
def autodiff_foldl1(f, lst):
    fs = scanl1(f, lst)
    f0 = diff(f(u,v), u)  # symbolic partial of f in its first argument
    f1 = diff(f(u,v), v)  # symbolic partial of f in its second argument
    prod = lambda x,y: x*y
    term1 = [1]+zipWith(f1, fs, lst[1:])
    term2 = scanl1(prod, zipWith(f0, fs, lst[1:])[::-1])[::-1]+[1]
    return zipWith(prod, term1, term2)
The technique even handles the edge case of a fold over just two variables.
autodiff_foldl1(f, [u,v])
[diff(f(u, v), u), diff(f(u, v), v)]
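As one last check, the packaged function should also agree with Sage's jacobian on a three-element fold; this too should evaluate to True.
all(bool(a == b) for a, b in zip(jacobian(f(f(u,v),w), (u,v,w)).list(), autodiff_foldl1(f, [u,v,w])))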