## Paper: Beautiful differentiation

I have another paper draft for submission to ICFP 2009. This one is called Beautiful differentiation, The paper is a culmination of the several posts I’ve written on derivatives and automatic differentiation (AD). I’m happy with how the derivation keeps getting simpler. Now I’ve boiled extremely general higher-order AD down to a `Functor` and `Applicative` morphism.

I’d love to get some readings and feedback. I’m a bit over the page the limit, so I’ll have to do some trimming before submitting.

The abstract:

Automatic differentiation (AD) is a precise, efficient, and convenient method for computing derivatives of functions. Its implementation can be quite simple even when extended to compute all of the higher-order derivatives as well. The higher-dimensional case has also been tackled, though with extra complexity. This paper develops an implementation of higher-dimensional, higher-order differentiation in the extremely general and elegant setting of calculus on manifolds and derives that implementation from a simple and precise specification.

In order to motivate and discover the implementation, the paper poses the question “What does AD mean, independently of implementation?” An answer arises in the form of naturality of sampling a function and its derivative. Automatic differentiation flows out of this naturality condition, together with the chain rule. Graduating from first-order to higher-order AD corresponds to sampling all derivatives instead of just one. Next, the notion of a derivative is generalized via the notions of vector space and linear maps. The specification of AD adapts to this elegant and very general setting, which even simplifies the development.

Enjoy, and thanks!

1. #### Anonymous:

Great paper, very nice to read! (Only at the end it’s getting a bit dense, but beware that I’m just an undergraduate student, so definitions of “dense” may vary.)

Small typos:

• page 3, figure 5: “dConst” is never introduced.
• page 6, first paragraph of sect. 6.4: s/This only differences/The only differences/
• page 7: s/two-by-three matrix/three-by-two matrix/
• page 7: s/A unifying perspective/An unifying perspective/ (?) (not native myself)
• page 11: Did you mean “I /don’t/ know whether the tricky[…]”?
2. #### Michael Robinson:

You know, automatic differentiation is really about doing algebra in the jet bundle of a manifold. There are some really nice books on the subject that capture the topology of such things. Here’s one: D. J. Saunders, “The geometry of Jet Bundles”, Cambridge 1989. Your implementation is nice, and captures the requisite structures in a pretty clear way at least on R^n.

An interesting direction for future work might be to look at how this works on more general manifolds. The major problem there is that you must require some explicit geometry (ie. a Riemannian metric) to get the higher derivatives to be well-defined. Though the jet bundle structure does exist from a topological standpoint, you can’t really “compute” in it. I guess the problem then is more a matter of how you specify the manifolds you’re working on. I get the sinking feeling that the usual definition of a smooth manifold (charts and transition maps) isn’t going to work neatly (the notion of a maximal atlas scares me, even throwing in generous laziness), but maybe an embedding-like mindset might work. That is, you may try working on algebraic varieties (very nifty, but probably tricky), or embedded manifolds in R^n (like you’re using in FieldTrip, for instance). In any event, nice paper.

3. #### Anonymous2:

@Anonymous:

“page 7: s/A unifying perspective/An unifying perspective/ (?) (not native myself)”

“A unifying perspective” is correct. The rule is whether or not the word starts with a vowel sound, not just a vowel letter.

4. #### Barak A. Pearlmutter:

Hey Conal! Very nice paper.

I will try to formulate my deeper technical comments later. But let me mention three things quickly. First, the paper is only about forward-mode accumulation automatic differentiation, i.e., forward AD. So “AD” should be qualified throughout. Maybe fAD, or AD(f), I dunno. Second, in the intro of a TOPLAS paper on reverse-mode AD, Jeff Siskind and I construct forward AD as a step in the definition of its adjoint. That construction involves compositionality, and has a bit of the flavour of some of the cool things you’re doing here. And third, the definition of the derivative is almost always simplified by reuse of variables calculated in the primal, typically the primal itself. Eg, the derivative of recip is -recip^2, so if you’ve already calculated (recip x) then you can avoid a division by reusing it. Not a fundamental point, but constant factors matter in numerics.

5. #### Marc:

I’ve been reading your work with interest (though not as thoroughly as I should; Haskell is still hard for me). You rhetorically ask in the introduction if calling AD more computationally efficient than analytical calculation is a straw dummy, and it seems to me that it is, because in an ideal world, one might express the ordinary analytical derivative computationally with the same code that one obtains from AD implicitly. I mean that if the AD was implemented using staged computation one could “output” the code for efficiently computing the analytical derivative. It’s just that this expression for the analytical derivation would be expressed in calculus+lambdacalculus instead of calculus alone.

In any case, it’s great to see your clean development of this ideas.

6. #### Paul Liu:

I’m not sure if the exact ideas have been implemented in Haskell before, but for me, I first learnt about it from Henrik Nilsson’s paper “Functional Automatic Differentiation with Dirac Impulses”.

7. #### Paul Liu:

BTW, the most basic operation on a high order derivative is to integrate it over an given interval. But it’s difficult to do this efficiently with the stream representation for AD in Haskell. So I would question the efficiency claim you made in the introduction.

It’s very easy to run into space leaks when there is nested recursion. I won’t get into the details, but it’s yet another problem caused by the loss of sharing in lazy evaluation. Explicit knot tying is a possible solution, but it’s ugly and utterly destroys the beauty of AD.

8. #### conal:

I’m not sure if the exact ideas have been implemented in Haskell before, but for me, I first learnt about it from Henrik Nilsson’s paper “Functional Automatic Differentiation with Dirac Impulses”.

Thanks for the reminder about Henrik’s paper, which I like very much. Henrik’s paper extends Jerzy Karczmarczuk’s higher-order AD to impulses (distributions), while mine extends it to vector spaces (foundation for general manifolds). I’ve added a citation to Henrik’s work. Thanks!

9. #### conal:

BTW, the most basic operation on a high order derivative is to integrate it over an given interval. But it’s difficult to do this efficiently with the stream representation for AD in Haskell. So I would question the efficiency claim you made in the introduction.

The comment about efficiency is about sampling the derivatives, not integrating them. For instance, to compute surface normals during rendering.

It’s very easy to run into space leaks when there is nested recursion. I won’t get into the details, but it’s yet another problem caused by the loss of sharing in lazy evaluation. Explicit knot tying is a possible solution, but it’s ugly and utterly destroys the beauty of AD.

Are you referring to the challenges with recursive integrals, or with derivatives?

10. #### conal:

Anonymous — Thanks a bunch for catching these mistakes. If you want to share your name with me, I’d enjoy acknowledging your help in the paper.

11. #### Paul Liu:

Sure, I agree that taking derivatives of AD is efficient by design. So integral is an issue outside the scope of your paper, but I feel like discussing it anyway – it’s part of the space leak puzzle that fascinated me

If you define an integral function for AD values, it’s likely recursive. For example, using Euler method with a fixed global dt:

```dt = 0.01
next (D v d@(D u _)) = D (v + u * dt) (next d)
integral x t = xs !! (truncate (t / dt))
where xs = x : (map next xs)
```

Then if you use it with any AD value (all high order AD values would be recursive, including constants), there is nested recusion. For example:

```e = D 1.0 e
```

Enter “integral e 1″ and “exp 1″. Well, approximately the same, and that’s good.

Enter “integral e 100″. Boom! It blows up immediately.

Note that by using the list “xs”, it’s already memoizing past values. So I don’t think there exists an easy memoization solution to this problem.

12. #### Freddie Manners:

Figure 2, page 2, you’ve implemented fromInteger in a Fractional instance; should this be fromRational?

13. #### conal:

Freddie — yes, thanks!

14. #### Jared Updike:

Typo on page 7: “an group” should be “a group”.

15. #### conal:

Thanks, Jared!

16. #### Omar:

I enjoyed the paper, your implementation really is pretty.

I just wanted to point out a mathematical naming issue:

In the abstract you mention “calculus on manifolds”, but your paper doesn’t use manifolds at all. Finite-dimensional real vector spaces are very simple special cases of manifolds so it is misleading to mention manifolds at all. Instead of “calculus on manifolds” you can use the traditional names for the calculus on vector spaces: “calculus of several variables” or “calculus on R^n”.

You should correct this lest more mathematicians get excited by the reference to manifolds in the abstract and later disappointed when they read the paper, as happened to me.

17. #### conal:

Thanks, Oma. I’ll downplay the “manifolds” thing some more.

I’m unsure what label (if any) to use, as the traditional names you mentioned are problematic for me. I hear “Calculus of several variables” (or “multi-variate” calculus) as a (traditional) syntax/semantics confusion, since functions don’t have variables (though syntax for them might). And “calculus on R^n” is a more specific setting than I have in mind. Perhaps “calculus on vector spaces”.

18. #### Omar:

You’re right that “calculus of several variables” is strictly speaking wrong. But this confusion is traditional in mathematics and the name “calculus of several variables” though nonsensical really is the name of the subject.

So you could call it that, using it not as description of the subject but literally just a name.

But your suggestion of “calculus on vector spaces” is even better; I hope you go with that.

By the way, in Calculus on Manifolds Spivak has a nice rant about the classical notation for derivatives, a notation which incorporates the variable confusion you mentioned.

19. #### conal:

this confusion is traditional in mathematics and the name “calculus of several variables” though nonsensical really is the name of the subject.

Hi Omar,

I relate to naming (more generally, language) as evolving, not static (as in “is” & “the”). I do believe muddled thinking produces muddled language, which then perpetuates muddled thinking. So I like to contribute my own bit of energy to evolving toward language (and indirectly, thinking) I’d like to see catch on. Given your encouragement to use a non-traditionally accurate replacement, I suspect you agree.

Thanks for mentioning Spivak’s rant. I vaguely remember reading & enjoying it. BTW, his book made a lasting impression on me when I was 19, and only last year was I finally able to use what I learned there.

20. #### Tim S:

I read Spivak’s rant about notation a few years ago at the same I was learning about Haskell. At the time I thought his suggestions almost amounted to writing a Haskell program, so it was very gratifying to read this excellent paper.

21. #### Yrogirg:

I believe there is a small typo in Figure 2 — recip (D x x’) should be

recip (D x x’) = D (recip x) (- x’ / sqr x)

(I mean minus is required)

22. #### conal:

Yrogirg – I fixed this typo in the paper and in the software (vector-space 0.6.0 and later). Thanks very much for this catch!