Analytic Differentiation of Folds
Introduction
The accelerate
library is a high-level interface to GPU programming, implemented as a DSL in Haskell.
Internally, accelerate
programs get compiled from a high-level AST to a lower-level one.
While this is convenient from a compiler author's perspective, it is a bit challenging from the perspective of automatic differentiation, since it is no longer possible to implement AD at the operation-by-operation scale as it is in most libraries.
This means that any AD library for accelerate
will have to be able to differentiate folds, among other things, as a concept, and not as functions built out of simpler operations.
In this blog post, I will use sage
to derive an "analytic" formula for the derivative of a fold, which expresses the derivative of a fold in terms of existing functional combinators like map
, scan
, and zipWith
, as well as the chain rule.
Computation
Let's get started!
First, we'll introduce some symbolic variables and functions.
var('u,v,w,x,y,z') function('f')(u,v)
Next, we'll compute the Jacobian of a left fold to see if we can find any patterns. We'll compute the symbolic Jacobian of the following Haskell code.
foldl1 f [u,v,w,x,y,z]
The symbolic result is a mess, but it does have some structure we can pick apart.
jacobian(f(f(f(f(f(u,v),w),x),y),z),(u,v,w,x,y,z)).list()
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), diff(f(u, v), v)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
Looking at the structure of the symbolic Jacobian, a few things stand out.
- We can replace the original
fold
with ascan
, since the Jacobian requires all the intermediate steps. - The
D[1](f)
structure at the beginning of each line has an obvious pattern, which we can capture with azipWith
. TheD[0](f)
structure that forms the rest of the computation looks like ascan
of some sort. - With a little more juggling, we should be able to express the Jacobian in terms of some combinators.
Let's figure out exactly what's going on. We'll define some Sage versions of the combinators we'll need.
def zipWith(f,xs,ys): return [f(x,y) for x,y in zip(xs,ys)] def scanl(f, x0, xs): res = [x0]+xs for i in range(len(xs)): res[i+1] = f(res[i], xs[i]) return res def scanl1(f, xs): return scanl(f, xs[0], xs[1:])
Next, let's define some basic terms on which we'll operate.
f0
andf1
are the partial derivatives off
lst
is the list of variables we are folding overfs
is a scan over the input, using our functionf
f0 = diff(f(u,v),u) f1 = diff(f(u,v),v) lst = [u,v,w,x,y,z] fs = scanl1(f, lst)
We can now take care of the second item on our list of observations.
term1 = [1]+zipWith(f1, fs, lst[1:]) term1
[1, diff(f(u, v), v), D[1](f)(f(u, v), w), D[1](f)(f(f(u, v), w), x), D[1](f)(f(f(f(u, v), w), x), y), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
Let's push on ahead and compute the last structural component we noticed.
(Protip: [::-1]
is a Python idiom for reversing a list.)
term2 = scanl1(lambda x,y:x*y, zipWith(f0, fs, lst[1:])[::-1])[::-1]+[1] term2
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[0](f)(f(f(f(f(u, v), w), x), y), z), 1]
Finally, we can obtain the symbolic Jacobian we computed directly.
zipWith(lambda x,y:x*y,term1,term2)
[diff(f(u, v), u)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), diff(f(u, v), v)*D[0](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(u, v), w)*D[0](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(u, v), w), x)*D[0](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(u, v), w), x), y)*D[0](f)(f(f(f(f(u, v), w), x), y), z), D[1](f)(f(f(f(f(u, v), w), x), y), z)]
Conclusion
accelerate
has a few more primitives that will require similar treatment, including scans and segmented folds, as well as a few more fold variants.
However, I'm optimistic all of these can be tackled with a similar approach.
The results obtained here can be summarized like so.
def autodiff_foldl1(f, lst): fs = scanl1(f,lst) f0 = diff(f(u,v),u) # first partial (as a function) f1 = diff(f(u,v),v) # second partial (as a function) prod = lambda x,y: x*y term1 = [1]+zipWith(f1, fs, lst[1:]) term2 = scanl1(prod, zipWith(f0, fs, lst[1:])[::-1])[::-1]+[1] return zipWith(prod, term1, term2)
The technique even works with the edge case of a two-variable vector.
autodiff_foldl1(f, [u,v])
[diff(f(u, v), u), diff(f(u, v), v)]