Paper: Beautiful differentiation

I have another paper draft for submission to ICFP 2009. This one is called Beautiful differentiation, The paper is a culmination of the several posts I’ve written on derivatives and automatic differentiation (AD). I’m happy with how the derivation keeps getting simpler. Now I’ve boiled extremely general higher-order AD down to a Functor and Applicative morphism.

I’d love to get some readings and feedback. I’m a bit over the page the limit, so I’ll have to do some trimming before submitting.

The abstract:

Automatic differentiation (AD) is a precise, efficient, and convenient method for computing derivatives of functions. Its implementation can be quite simple even when extended to compute all of the higher-order derivatives as well. The higher-dimensional case has also been tackled, though with extra complexity. This paper develops an implementation of higher-dimensional, higher-order differentiation in the extremely general and elegant setting of calculus on manifolds and derives that implementation from a simple and precise specification.

In order to motivate and discover the implementation, the paper poses the question “What does AD mean, independently of implementation?” An answer arises in the form of naturality of sampling a function and its derivative. Automatic differentiation flows out of this naturality condition, together with the chain rule. Graduating from first-order to higher-order AD corresponds to sampling all derivatives instead of just one. Next, the notion of a derivative is generalized via the notions of vector space and linear maps. The specification of AD adapts to this elegant and very general setting, which even simplifies the development.

You can get the paper and see current errata here.

The submission deadline is March 2, so comments before then are most helpful to me.

Enjoy, and thanks!

Comparing formulations of higher-dimensional, higher-order derivatives

I just reread Jason Foutz’s post Higher order multivariate automatic differentiation in Haskell, as I’m thinking about this topic again. I like his trick of using an IntMap to hold the partial derivatives and (recursively) the partials of those partials, etc.

Some thoughts:

  • I bet one can eliminate the constant (C) case in Jason’s representation, and hence 3/4 of the cases to handle, without much loss in performance. He already has a fairly efficient representation of constants, which is a D with an empty IntMap.

  • I imagine there’s also a nice generalization of the code for combining two finite maps used in his third multiply case. The code’s meaning and correctness follows from a model for those maps as total functions with missing elements denoting a default value (zero in this case).

  • Jason’s data type reminds me of a sparse matrix representation, but cooler in how it’s infinitely nested. Perhaps depth n (starting with zero) is a sparse n-dimensional matrix.

  • Finally, I suspect there’s a close connection between Jason’s IntMap-based implementation and my LinearMap-based implementation described in Higher-dimensional, higher-order derivatives, functionally and in Simpler, more efficient, functional linear maps. For the case of Rn, my formulation uses a trie with entries for n basis elements, while Jason’s uses an IntMap (which is also a trie) with n entries (counting any implicit zeros).

I suspect Jason’s formulation is more efficient (since it optimizes the constant case), while mine is more statically typed and more flexible (since it handles more than Rn).

For optimizing constants, I think I’d prefer having a single constructor with a Maybe for the derivatives, to eliminate code duplication.

I am still trying to understand the paper Lazy Multivariate Higher-Order Forward-Mode AD, with its management of various epsilons.

A final remark: I prefer the term “higher-dimensional” over the traditional “multivariate”. I hear classic syntax/semantics confusion in the latter.

Simpler, more efficient, functional linear maps

A previous post described a data type of functional linear maps. As Andy Gill pointed out, we had a heck of a time trying to get good performance. This note describes a new representation that is very simple and much more efficient. It’s terribly obvious in retrospect but took me a good while to stumble onto.

The Haskell module described here is part of the vector-space library (version 0.5 or later) and requires ghc version 6.10 or better (for associated types).


  • 2008-11-09: Changed remarks about versions. The vector-space version 0.5 depends on ghc 6.10.
  • 2008-10-21: Fixed the vector-space library link in the teaser.

Continue reading ‘Simpler, more efficient, functional linear maps’ »

Functional linear maps

Two earlier posts described a simple and general notion of derivative that unifies the many concrete notions taught in traditional calculus courses. All of those variations turn out to be concrete representations of the single abstract notion of a linear map. Correspondingly, the various forms of mulitplication in chain rules all turn out to be implementations of composition of linear maps. For simplicity, I suggested a direct implementation of linear maps as functions. Unfortunately, that direct representation thwarts efficiency, since functions, unlike data structures, do not cache by default.

This post presents a data representation of linear maps that makes crucial use of (a) linearity and (b) the recently added language feature indexed type families (“associated types”).

For a while now, I’ve wondered if a library for linear maps could replace and generalize matrix libraries. After all, matrices represent of a restricted class of linear maps. Unlike conventional matrix libraries, however, the linear map library described in this post captures matrix/linear-map dimensions via static typing. The composition function defined below statically enforces the conformability property required of matrix multiplication (which implements linear map composition). Likewise, conformance for addition of linear maps is also enforced simply and statically. Moreover, with sufficiently sophisticated coaxing of the Haskell compiler, of the sort Don Stewart does, perhaps a library like this one could also have terrific performance. (It doesn’t yet.)

You can read and try out the code for this post in the module Data.LinearMap in version 0.2.0 or later of the vector-space package. That module also contains an implementation of linear map composition, as well as Functor-like and Applicative-like operations. Andy Gill has been helping me get to the bottom of some some severe performance problems, apparently involving huge amounts of redundant dictionary creation.


  • 2008-06-04: Brief explanation of the associated data type declaration.

Continue reading ‘Functional linear maps’ »

Higher-dimensional, higher-order derivatives, functionally

The post Beautiful differentiation showed some lovely code that makes it easy to compute not just the values of user-written functions, but also all of its derivatives (infinitely many). This elegant technique is limited, however, to functions over a scalar (one-dimensional) domain. Next, we explored what it means to transcend that limitation, asking and answering the question What is a derivative, really? The answer to that question is that derivative values are linear maps saying how small input changes result in output changes. This answer allows us to unify several different notions of derivatives and their corresponding chain rules into a single simple and powerful form.

This third post combines the ideas from the two previous posts, to easily compute infinitely many derivatives of functions over arbitrary-dimensional domains.

The code shown here is part of a new Haskell library, which you can download and play with or peruse on the web.

Continue reading ‘Higher-dimensional, higher-order derivatives, functionally’ »

What is a derivative, really?

The post Beautiful differentiation showed how easily and beautifully one can construct an infinite tower of derivative values in Haskell programs, while computing plain old values. The trick (from Jerzy Karczmarczuk) was to overload numeric operators to operate on the following (co)recursive type:

data Dif b = D b (Dif b)

This representation, however, works only when differentiating functions from a scalar (one-dimensional) domain, i.e., functions of type a -> b for a scalar type a. The reason for this limitation is that only in those cases can the type of derivative values be identified with the type of regular values.

Consider a function f :: (R,R) -> R, where R is, say, Double. The value of f at a domain value (x,y) has type R, but the derivative of f consists of two partial derivatives. Moreover, the second derivative consists of four partial second-order derivatives (or three, depending how you count). A function f :: (R,R) -> (R,R,R) also has two partial derivatives at each point (x,y), each of which is a triple. That pair of triples is commonly written as a two-by-three matrix.

Each of these situations has its own derivative shape and its own chain rule (for the derivative of function compositions), using plain-old multiplication, scalar-times-vector, vector-dot-vector, matrix-times-vector, or matrix-times-matrix. Second derivatives are more complex and varied.

How many forms of derivatives and chain rules are enough? Are we doomed to work with a plethora of increasingly complex types of derivatives, as well as the diverse chain rules needed to accommodate all compatible pairs of derivatives? Fortunately, not. There is a single, simple, unifying generalization. By reconsidering what we mean by a derivative value, we can see that these various forms are all representations of a single notion, and all the chain rules mean the same thing on the meanings of the representations.

This blog post is about that unifying view of derivatives.


  • 2008-05-20: There are several comments about this post on reddit.
  • 2008-05-20: Renamed derivative operator from D to deriv to avoid confusion with the data constructor for derivative towers.
  • 2008-05-20: Renamed linear map type from (:->) to (:-*) to make it visually closer to a standard notation.

Continue reading ‘What is a derivative, really?’ »