Conal Elliott » linear map

Paper: Beautiful differentiation

Conal — Tue, 24 Feb 2009 08:05:10 +0000

I have another paper draft for submission to ICFP 2009. This one is called Beautiful differentiation. The paper is a culmination of the several posts I’ve written on derivatives and automatic differentiation (AD). I’m happy with how the derivation keeps getting simpler. Now I’ve boiled extremely general higher-order AD down to a Functor and Applicative morphism.

I’d love to get some readings and feedback. I’m a bit over the page limit, so I’ll have to do some trimming before submitting.

The abstract:

Automatic differentiation (AD) is a precise, efficient, and convenient method for computing derivatives of functions. Its implementation can be quite simple even when extended to compute all of the higher-order derivatives as well. The higher-dimensional case has also been tackled, though with extra complexity. This paper develops an implementation of higher-dimensional, higher-order differentiation in the extremely general and elegant setting of calculus on manifolds and derives that implementation from a simple and precise specification.

In order to motivate and discover the implementation, the paper poses the question “What does AD mean, independently of implementation?” An answer arises in the form of naturality of sampling a function and its derivative. Automatic differentiation flows out of this naturality condition, together with the chain rule. Graduating from first-order to higher-order AD corresponds to sampling all derivatives instead of just one. Next, the notion of a derivative is generalized via the notions of vector space and linear maps. The specification of AD adapts to this elegant and very general setting, which even simplifies the development.

You can get the paper and see current errata here.

The submission deadline is March 2, so comments before then are most helpful to me.

Enjoy, and thanks!

Comparing formulations of higher-dimensional, higher-order derivatives

Conal — Sun, 25 Jan 2009 01:40:07 +0000

I just reread Jason Foutz’s post Higher order multivariate automatic differentiation in Haskell, as I’m thinking about this topic again. I like his trick of using an IntMap to hold the partial derivatives and (recursively) the partials of those partials, etc.

Some thoughts:

I bet one can eliminate the constant (C) case in Jason’s representation, and hence 3/4 of the cases to handle, without much loss in performance. He already has a fairly efficient representation of constants, which is a D with an empty IntMap.
I imagine there’s also a nice generalization of the code for combining two finite maps used in his third multiply case. The code’s meaning and correctness follow from a model for those maps as total functions with missing elements denoting a default value (zero in this case).
Jason’s data type reminds me of a sparse matrix representation, but cooler in how it’s infinitely nested. Perhaps depth n (starting with zero) is a sparse n-dimensional matrix.
Finally, I suspect there’s a close connection between Jason’s IntMap-based implementation and my LinearMap-based implementation described in Higher-dimensional, higher-order derivatives, functionally and in Simpler, more efficient, functional linear maps. For the case of Rⁿ, my formulation uses a trie with entries for n basis elements, while Jason’s uses an IntMap (which is also a trie) with n entries (counting any implicit zeros).

I suspect Jason’s formulation is more efficient (since it optimizes the constant case), while mine is more statically typed and more flexible (since it handles more than Rⁿ).

For optimizing constants, I think I’d prefer having a single constructor with a Maybe for the derivatives, to eliminate code duplication.

I am still trying to understand the paper Lazy Multivariate Higher-Order Forward-Mode AD, with its management of various epsilons.

A final remark: I prefer the term “higher-dimensional” over the traditional “multivariate”. I hear classic syntax/semantics confusion in the latter.

Simpler, more efficient, functional linear maps

Conal — Mon, 20 Oct 2008 01:26:05 +0000

A previous post described a data type of functional linear maps. As Andy Gill pointed out, we had a heck of a time trying to get good performance. This note describes a new representation that is very simple and much more efficient. It’s terribly obvious in retrospect but took me a good while to stumble onto.

The Haskell module described here is part of the vector-space library (version 0.5 or later) and requires ghc version 6.10 or better (for associated types).

Edits:

2008-11-09: Changed remarks about versions. The vector-space version 0.5 depends on ghc 6.10.
2008-10-21: Fixed the vector-space library link in the teaser.

Linear maps

Semantically, a linear map is a function f ∷ a → b such that, for all scalar values s and "vectors" u, v ∷ a, the following properties hold:

f (s ⋅ u) = s ⋅ f u

f (u + v) = f u + f v

By repeated application of these properties,

f (s₁ ⋅ u₁ + ⋯ + s_n ⋅ u_n) = s₁ ⋅ f u₁ + ⋯ + s_n ⋅ f u_n

Taking the u_i as basis vectors, this form implies that a linear function is determined by its behavior on any basis of its domain type.

Therefore, a linear function can be represented simply as a function from a basis, using the representation described in Vector space bases via type families.

type u :-* v = Basis u → v

The semantic function converts from (u :-* v) to (u → v). It decomposes a source vector into its coordinates, applies the basis function to basis representations, and linearly combines the results.

lapply ∷ ( VectorSpace u, VectorSpace v
         , Scalar u ~ Scalar v, HasBasis u ) ⇒
         (u :-* v) → (u → v)
lapply lm = λ u → sumV [s *^ lm b | (b,s) ← decompose u]

or, in point-free style,

lapply lm = linearCombo ∘ fmap (first lm) ∘ decompose

The reverse function is easier. Convert a function f, presumed linear, to a linear map representation:

linear ∷ (VectorSpace u, VectorSpace v, HasBasis u) ⇒
         (u → v) → (u :-* v)

It suffices to apply f to basis values:

linear f = f ∘ basisValue
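To make the function-from-basis representation concrete, here is a minimal, self-contained sketch specialised to the domain (Double, Double). It does not use the vector-space library: the names B2, V2, scale, add, and swapScale are hypothetical stand-ins chosen for illustration, and decompose and basisValue are hand-written here rather than taken from HasBasis.

```haskell
-- Sketch only: hand-rolled stand-ins for the vector-space classes,
-- specialised to the domain (Double, Double). B2, V2, scale, add, and
-- swapScale are hypothetical names, not part of the library.

-- Basis of the two-dimensional space.
data B2 = E1 | E2 deriving (Eq, Show)

type V2 = (Double, Double)

-- A linear map out of V2, represented by its values on the basis.
type LMap v = B2 -> v

basisValue :: B2 -> V2
basisValue E1 = (1, 0)
basisValue E2 = (0, 1)

-- Coordinates of a vector with respect to the basis.
decompose :: V2 -> [(B2, Double)]
decompose (x, y) = [(E1, x), (E2, y)]

scale :: Double -> V2 -> V2
scale s (x, y) = (s * x, s * y)

add :: V2 -> V2 -> V2
add (a, b) (c, d) = (a + c, b + d)

-- Semantic function: scale each basis image by its coordinate and sum.
lapply :: LMap V2 -> (V2 -> V2)
lapply lm u = foldr add (0, 0) [scale s (lm b) | (b, s) <- decompose u]

-- Sample a (presumed linear) function on the basis.
linear :: (V2 -> v) -> LMap v
linear f = f . basisValue

-- An example linear function on V2.
swapScale :: V2 -> V2
swapScale (x, y) = (2 * y, x)
```

In GHCi, lapply (linear swapScale) (3, 4) should agree with swapScale (3, 4), since swapScale really is linear.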

Memoization

The idea of the linear map representation is to reconstruct an entire (linear) function out of just a few samples. In other words, we can make a very small sampling of a function's domain, and re-use those values in order to compute the function's value at all domain values. As implemented above, however, this trick makes function application more expensive, not less. If lm = linear f, then each use of lapply lm can apply f to the value of every basis element, and then linearly combine results.

A simple trick fixes this efficiency problem: memoize the linear map. We could do the memoization privately, e.g.,

linear f = memo (f ∘ basisValue)

If lm = linear f, then no matter how many times lapply lm is applied, the function f can only get applied as many times as the dimension of the domain of f.

However, there are several other ways to make linear maps, and it would be easy to forget to memoize each combining form. So, instead of the function representation above, I ensure that the function be memoized by representing it as a memo trie.

type u :-* v = Basis u ↛ v

The conversion functions linear and lapply need just a little tweaking. Split memo into its definition untrie ∘ trie, and then move the second phase (untrie) into lapply. We'll also have to add HasTrie constraints:

linear ∷ ( VectorSpace u, VectorSpace v
         , HasBasis u, HasTrie (Basis u) ) ⇒
         (u → v) → (u :-* v)
linear f = trie (f ∘ basisValue)

lapply ∷ ( VectorSpace u, VectorSpace v, Scalar u ~ Scalar v
         , HasBasis u, HasTrie (Basis u) ) ⇒
         (u :-* v) → (u → v)
lapply lm = linearCombo ∘ fmap (first (untrie lm)) ∘ decompose

Now we can build up linear maps conveniently and efficiently by using the operations on memo tries shown in Composing memo tries. For instance, suppose that h is a linear function of two arguments (linear in both, not in each) and m and n are two linear maps. Then liftA2 h m n is the linear function that applies h to the results of m and n.

lapply (liftA2 h m n) a = h (lapply m a) (lapply n a)

Exploiting the applicative functor instance for functions, we get another formulation:

lapply (liftA2 h m n) = liftA2 h (lapply m) (lapply n)

In other words, the meaning of a liftA2 is the liftA2 of the meanings, as discussed in Simplifying semantics with type class morphisms.
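The trie representation and the liftA2 property can be tried out in a small sketch. This one assumes a hand-written two-element trie (B2Trie) in place of the MemoTrie library's HasTrie class, and fixes the domain to (Double, Double) and the range to Double. All names here are illustrative, not the library's.

```haskell
import Control.Applicative (liftA2)

-- Hypothetical two-element basis for (Double, Double).
data B2 = E1 | E2

-- A memo trie over B2: one lazily evaluated slot per basis element.
data B2Trie v = B2Trie v v

trie :: (B2 -> v) -> B2Trie v
trie f = B2Trie (f E1) (f E2)

untrie :: B2Trie v -> (B2 -> v)
untrie (B2Trie a _) E1 = a
untrie (B2Trie _ b) E2 = b

instance Functor B2Trie where
  fmap f (B2Trie a b) = B2Trie (f a) (f b)

instance Applicative B2Trie where
  pure a = B2Trie a a
  B2Trie f g <*> B2Trie a b = B2Trie (f a) (g b)

type V2 = (Double, Double)

-- Linear maps from V2 to Double, stored as a trie of basis images.
type LMap = B2Trie Double

basisValue :: B2 -> V2
basisValue E1 = (1, 0)
basisValue E2 = (0, 1)

-- Each slot holds a thunk for f at one basis vector, so f is evaluated
-- at most once per basis element, however often the map is applied.
linear :: (V2 -> Double) -> LMap
linear f = trie (f . basisValue)

lapply :: LMap -> (V2 -> Double)
lapply lm (x, y) = x * untrie lm E1 + y * untrie lm E2

-- Two sample linear maps.
m, n :: LMap
m = linear (\(x, y) -> 2 * x + 3 * y)
n = linear (\(x, y) -> x - y)
```

With h = (+), which is linear on pairs, lapply (liftA2 (+) m n) a and lapply m a + lapply n a should agree for any a.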

Functional linear maps

Conal — Wed, 04 Jun 2008 05:49:20 +0000

Two earlier posts described a simple and general notion of derivative that unifies the many concrete notions taught in traditional calculus courses. All of those variations turn out to be concrete representations of the single abstract notion of a linear map. Correspondingly, the various forms of multiplication in chain rules all turn out to be implementations of composition of linear maps. For simplicity, I suggested a direct implementation of linear maps as functions. Unfortunately, that direct representation thwarts efficiency, since functions, unlike data structures, do not cache by default.

This post presents a data representation of linear maps that makes crucial use of (a) linearity and (b) the recently added language feature indexed type families (“associated types”).

For a while now, I’ve wondered if a library for linear maps could replace and generalize matrix libraries. After all, matrices represent a restricted class of linear maps. Unlike conventional matrix libraries, however, the linear map library described in this post captures matrix/linear-map dimensions via static typing. The composition function defined below statically enforces the conformability property required of matrix multiplication (which implements linear map composition). Likewise, conformance for addition of linear maps is also enforced simply and statically. Moreover, with sufficiently sophisticated coaxing of the Haskell compiler, of the sort Don Stewart does, perhaps a library like this one could also have terrific performance. (It doesn’t yet.)

You can read and try out the code for this post in the module Data.LinearMap in version 0.2.0 or later of the vector-space package. That module also contains an implementation of linear map composition, as well as Functor-like and Applicative-like operations. Andy Gill has been helping me get to the bottom of some severe performance problems, apparently involving huge amounts of redundant dictionary creation.

Edits:

2008-06-04: Brief explanation of the associated data type declaration.

Linear maps

Semantically, a linear map is a function f :: a -> b such that, for all scalar values c and “vectors” u, v :: a, the following properties hold:

f (c *^ u)  == c *^ f u
f (u ^+^ v) == f u ^+^ f v

where (*^) and (^+^) are scalar multiplication and vector addition. (See VectorSpace details in a previous post.)

Although the semantics of a linear map will be a function, the representation will be a data structure.

data a :-* b = ...

The semantic function is

lapply :: (LMapDom a s, VectorSpace b s) =>
          (a :-* b) -> (a -> b)  -- result will be linear

The first constraint says that we know how to represent linear maps whose domain is the vector space a, which has the associated scalar field s. The second constraint says that b must be a vector space over that same scalar field.

Conversely, there is also a function to turn any linear function into a linear map:

linear :: LMapDom a s => (a -> b) -> (a :-* b)  -- argument must be linear

These two functions and the linear map data type are packaged up as the LMapDom type class:

-- | Domain of a linear map.
class VectorSpace a s => LMapDom a s | a -> s where
  -- | Linear map type
  data (:-*) a :: * -> *
  -- | Linear map as function
  lapply :: VectorSpace b s => (a :-* b) -> (a -> b)
  -- | Function (assumed linear) as linear map.
  linear :: (a -> b) -> (a :-* b)

The data definition means that the data type (a :-* b) (of linear maps from a to b) has a variety of representations, each one associated with a type a.

These two conversion functions are required to be inverses:

{-# RULES

"linear.lapply"   forall m. linear (lapply m) = m

"lapply.linear"   forall f. lapply (linear f) = f

 #-}

Scalar domains

Consider a linear function f over a scalar domain. Then

f s == f (s *^ 1)
    == s *^ f 1  -- by linearity

Therefore, f is fully determined by its value at 1, and so an adequate representation of f is then simply the value f 1.

This observation leads to LMapDom instances like the following:

instance LMapDom Double Double where
  data Double :-* o  = DoubleL o
  lapply (DoubleL o) = (*^ o)
  linear f           = DoubleL (f 1)

Non-scalar domains

Maps over non-scalar domains are a little trickier. Consider a linear function f over a domain of pairs of scalar values. Then

f (a,b) == f (a *^ (1,0) ^+^ b *^ (0,1))
        == f (a *^ (1,0)) ^+^ f (b *^ (0,1))  -- linearity
        == a *^ f (1,0) ^+^ b *^ f (0,1)      -- linearity twice more

So f is determined by f (1,0) and f (0,1) and thus can be represented by those two values.

instance LMapDom (Double,Double) Double where
  data (Double,Double) :-* o = PairD o o
  PairD ao bo `lapply` (a,b) = a *^ ao ^+^ b *^ bo
  linear f = PairD (f (1,0)) (f (0,1))

and similarly for triples, etc.

This definition works fine, but I want something compositional. I’d like linear maps over pairs of pairs and so on.

Composing domains

We can still use part of our linearity property. Using zeroV as the zero vector for arbitrary vector spaces,

f (a,b) == f ((a,zeroV) ^+^ (zeroV,b))
        == f (a,zeroV) ^+^ f (zeroV,b)

We see that f is determined by its behavior when either argument is zero.

In other words, f can be reconstructed from two other functions over simpler domains:

fa a = f (a,zeroV)
fb b = f (zeroV,b)

If f :: (a,b) -> o, then fa :: a -> o and fb :: b -> o.

Exercise: show that fa and fb are linear if f is. We can thus reduce the problem of representing the linear function f to the problems of representing fa and fb. This insight is captured in the following LMapDom instance:

instance (LMapDom a s, LMapDom b s) => LMapDom (a,b) s where
  data (a,b) :-* o = PairL (a :-* o) (b :-* o)
  PairL ao bo `lapply` (a,b) = ao `lapply` a ^+^ bo `lapply` b
  linear f = PairL (linear (\ a -> f (a,zeroV)))
                   (linear (\ b -> f (zeroV,b)))

Of course, there are similar instances for triples, etc, as well as for tuple variants with strict fields, such as OpenGL’s Vector2 and Vector3 types.
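The scalar and pairing instances can be exercised together in a compact sketch. To keep it self-contained, this version assumes a simplified vector-space class (called VS here) with the scalar type fixed to Double, so the functional dependency and the second class parameter disappear. It is a cut-down illustration, not the vector-space library's code.

```haskell
{-# LANGUAGE TypeFamilies, TypeOperators #-}

import Data.Kind (Type)

infixr 7 *^
infixl 6 ^+^

-- Assumption: a simplified vector space class over Double scalars.
class VS v where
  zeroV :: v
  (*^)  :: Double -> v -> v
  (^+^) :: v -> v -> v

instance VS Double where
  zeroV = 0
  (*^)  = (*)
  (^+^) = (+)

instance (VS a, VS b) => VS (a, b) where
  zeroV             = (zeroV, zeroV)
  s *^ (a, b)       = (s *^ a, s *^ b)
  (a, b) ^+^ (c, d) = (a ^+^ c, b ^+^ d)

-- Domain of a linear map, with the representation chosen per domain type.
class VS a => LMapDom a where
  data (:-*) a :: Type -> Type
  lapply :: VS o => (a :-* o) -> (a -> o)
  linear :: (a -> o) -> (a :-* o)

-- Scalar domain: store the image of 1.
instance LMapDom Double where
  data Double :-* o  = DoubleL o
  lapply (DoubleL o) = (*^ o)
  linear f           = DoubleL (f 1)

-- Pair domain: store one linear map per component.
instance (LMapDom a, LMapDom b) => LMapDom (a, b) where
  data (a, b) :-* o = PairL (a :-* o) (b :-* o)
  lapply (PairL ao bo) (a, b) = lapply ao a ^+^ lapply bo b
  linear f = PairL (linear (\a -> f (a, zeroV)))
                   (linear (\b -> f (zeroV, b)))

-- A linear function on a nested pair, and its data representation.
sample :: (Double, (Double, Double)) -> Double
sample (x, (y, z)) = 2 * x + 3 * y - z

sampleL :: (Double, (Double, Double)) :-* Double
sampleL = linear sample
```

Here sampleL is built from three samples of sample, one per basis vector of the nested pair, and lapply sampleL should agree with sample everywhere.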

What have we done?

If you’ve studied linear algebra, you may be thinking now about the idea of a basis of a vector space. A basis is a minimal set of vectors that can be combined linearly to cover the entire vector space. Any linear map is determined by its behavior on any basis. For scalars, the set {1} is a basis, while for pairs of scalars, {(1,0), (0,1)} is a basis. It is not just coincidental that exactly these basis vectors showed up in the definitions of linear for Double and (Double,Double).

In the general pairing instance of LMapDom above, bases are built up recursively. Each recursive call to linear results in a data structure that holds the values of fa over a basis for a and the values of fb over a basis for b. Each of those basis vectors corresponds to a basis vector for (a,b), by zeroV-padding.

The dimension of a vector space is the number of elements in a basis (which is independent of the particular choice of basis). For vector space types a and b,

dimension (a,b) == dimension a + dimension b

which corresponds to the fact that our linear map representation (as built by linear) contains samples for each basis element of a, plus samples for each basis element of b (all zeroV-padded).

Working with linear maps

Besides the instances above for creating and applying linear maps, what else can we do? For starters, let’s define the identity linear map. Since the identity function is already linear, simply convert it to a linear map:

idL :: LMapDom a s => a :-* a
idL = linear id

Another very useful tool is transforming a linear map by transforming a linear function.

inL :: (LMapDom c s, VectorSpace b s', LMapDom a s') =>
        ((a -> b) -> (c -> d)) -> ((a :-* b) -> (c :-* d))
inL h = linear . h . lapply

where the higher-order function h is assumed to map linear functions to linear functions.

We don’t have to stop at unary transformations of linear functions.

-- | Transform linear maps by transforming linear functions.
inL2 :: ( LMapDom c s, VectorSpace b s', LMapDom a s'
        , LMapDom e s, VectorSpace d s ) =>
        ((a  -> b) -> (c  -> d) -> (e  -> f))
     -> ((a :-* b) -> (c :-* d) -> (e :-* f))
inL2 h = inL . h . lapply

The type constraints are starting to get hairy. Fortunately, they’re entirely inferred by the compiler.

Let’s do some inlining and simplification to see what goes on inside inL2:

inL2 h m n
  == (inL . h . lapply) m n                -- inline inL2
  == inL (h (lapply m)) n                  -- inline (.)
  == (linear . (h (lapply m)) . lapply) n  -- inline inL
  == linear (h (lapply m) (lapply n))      -- inline (.)

Ternary transformations are defined similarly. I’ll spare you the type constraints this time.

inL3 :: ( ... ) =>
        ((a  -> b) -> (c  -> d) -> (e  -> f) -> (p  -> q))
     -> ((a :-* b) -> (c :-* d) -> (e :-* f) -> (p :-* q))
inL3 h = inL2 . h . lapply

Look what happens when these operations are composed. As an example,

inL h . inL g
  == (linear . h . lapply) . (linear . g . lapply)
  == linear . h . lapply . linear . g . lapply    -- associativity of (.)
  == linear . h . g . lapply                      -- rule "lapply.linear"
  == inL (h . g)

This transformation is not actually happening in the compiler yet. The “lapply.linear” rule is not firing, and I don’t know why. I’d appreciate suggestions.

There are a few more operations defined in Data.LinearMap. I’ll end with this simple, general definition of composition of linear maps:

-- | Compose linear maps
(.*) :: (VectorSpace c s, LMapDom b s, LMapDom a s) =>
        (b :-* c) -> (a :-* b) -> (a :-* c)
(.*) = inL2 (.)

Derivative towers again

A similar, but recursive, definition is used in the new definition of the general chain rule for infinite derivative towers, updated since the post Higher-dimensional, higher-order derivatives, functionally.

(@.) :: (LMapDom b s, LMapDom a s, VectorSpace c s) =>
        (b :~> c) -> (a :~> b) -> (a :~> c)
(h @. g) a0 = D c0 (inL2 (@.) c' b')
  where
    D b0 b' = g a0
    D c0 c' = h b0

Higher-dimensional, higher-order derivatives, functionally

Conal — Wed, 21 May 2008 05:29:32 +0000

The post Beautiful differentiation showed some lovely code that makes it easy to compute not just the values of user-written functions, but also all of their derivatives (infinitely many). This elegant technique is limited, however, to functions over a scalar (one-dimensional) domain. Next, we explored what it means to transcend that limitation, asking and answering the question What is a derivative, really? The answer to that question is that derivative values are linear maps saying how small input changes result in output changes. This answer allows us to unify several different notions of derivatives and their corresponding chain rules into a single simple and powerful form.

This third post combines the ideas from the two previous posts, to easily compute infinitely many derivatives of functions over arbitrary-dimensional domains.

The code shown here is part of a new Haskell library, which you can download and play with or peruse on the web.

The general setting: vector spaces

Linear maps (transformations) lie at the heart of the generalized idea of derivative described earlier. Talking about linearity requires a few simple operations, which are encapsulated in the abstract interface known from math as a vector space.

A vector space v has an associated type s of scalar values (a field) and a set of operations. In Haskell,

class VectorSpace v s | v -> s where
  zeroV   :: v              -- the zero vector
  (*^)    :: s -> v -> v    -- scale a vector
  (^+^)   :: v -> v -> v    -- add vectors
  negateV :: v -> v         -- additive inverse

In many cases, we’ll want to add inner (dot) products as well, to form an inner product space:

class VectorSpace v s => InnerSpace v s | v -> s where
  (<.>) :: v -> v -> s

Several other useful operations can be defined in terms of these five methods. For instance, vector subtraction and linear interpolation for vector spaces, and magnitude and normalization (rescaling to unit length) for inner product spaces. The vector-space library defines instances for Float, Double, and Complex, as well as pairs, triples, and quadruples of vectors, and functions with vector ranges. (By “vector” here, I mean any instance of VectorSpace, recursively).

It’s pretty easy to define new instances of your own. For instance, here is the library’s definition of functions as vector spaces, using the same techniques as before:

instance VectorSpace v s => VectorSpace (a->v) s where
  zeroV   = pure   zeroV
  (*^) s  = fmap   (s *^)
  (^+^)   = liftA2 (^+^)
  negateV = fmap   negateV

Linear transformations could perhaps be defined as an abstract data type, with primitives and a composition operator. I don’t know how to provide enough primitives for all possible types of interest. I also played with linear maps as a type family, indexed on the domain or range type, but it didn’t quite work out for me. For now, I’ll simply represent a linear map as a function, and define a type synonym as a reminder of intention:

type a :-* b = a -> b       -- linear map

This definition makes some things quite convenient. Function composition, (.), implements linear map composition. The function VectorSpace instance (above) gives the customary meaning for linear maps as vector spaces. Like (->), this new (:-*) operator is right-associative, so a :-* b :-* c means a :-* (b :-* c).

Derivative towers

A derivative tower contains a value and all derivatives of a function at a point. Previously, I’d suggested the following type for derivative towers.

data a :> b = D b (a :> (a :-* b))   -- old definition

The values in one of these towers have types b, a :-* b, a :-* a :-* b, …. So, for instance, a second derivative value is a linear map from a to linear maps from a to b. (Uncurrying a second derivative yields a bilinear map.)

Since making this suggestion, I’ve gotten simpler code using the following variation, which I’ll use instead:

data a :> b = D b (a :-* (a :> b))

Now a tower value is a regular value, plus a linear map that yields a tower for the derivative.

We can also write this second version more simply, without the linearity reminder:

data a :> b = D b (a :~> b)

where a :~> b is the type of infinitely differentiable functions, represented as a function that produces a derivative tower:

type a :~> b = a -> (a :> b)

Basics

As in Beautiful differentiation, constant functions have all derivatives equal to zero:

dConst :: VectorSpace b s => b -> a:>b
dConst b = b `D` const dZero

dZero :: VectorSpace b s => a:>b
dZero = dConst zeroV

Note the use of the standard Haskell function const, which makes constant functions (always returning the same value). Also, the use of the zero vector required me to use a VectorSpace constraint in the type signature. (I could have used 0 and Num instead, but Num requires more methods and so is less general than VectorSpace.)

The differentiable identity function plays a very important role. Its towers are sometimes called “the derivation variable” or similar, but it’s not really a variable. The definition is quite terse:

dId :: VectorSpace u s => u :~> u
dId u = D u (\ du -> dConst du)

What’s going on here? The differentiable identity function, dId, takes an argument u and yields a tower. The regular value (the 0th derivative) is simply the argument u, as one would expect from an identity function. The derivative (a linear map) turns a tiny input offset, du, into a resulting output offset, which is also du (also as expected from an identity function). The higher derivatives are all zero, so our first derivative tower is dConst du.

Linear functions

Returning, for a few moments, to thinking of derivatives as numbers, let’s consider the function f = \ x -> m * x + b for some values m and b. We’d usually say that the derivative of f is equal to m everywhere, and indeed f can be interpreted as a line with (constant) slope m and y-intercept b. In the language of linear algebra, the function f is affine in general, and is (more specifically) linear only when b == 0.

In the generalized view of derivatives as linear maps, we say instead that the derivative is \ x -> m * x. The derivative everywhere is almost the same as f itself. If we take b == 0 (so that f is linear and not just affine), then the derivative of f is exactly f, everywhere! Consequently, its higher derivatives are all zero.

In the generalized view of derivatives as linear maps, this relationship always holds. The derivative of a linear function f is f everywhere. We can encapsulate this general property as a utility function:

linearD :: VectorSpace v s => (u :-* v) -> (u :~> v)
linearD f u = D (f u) (\ du -> dConst (f du))

The dConst here sets up all of the higher derivatives to be zero. This definition can also be written more succinctly:

linearD f u = D (f u) (dConst . f)

You may have noticed a similarity between this discussion of linear functions and the identity function above. This similarity is more than coincidental, because the identity function is linear. With this insight, we can write a more compact definition for dId, replacing the one above:

dId = linearD id

As other examples of linear functions, here are differentiable versions of the functions fst and snd, which extract elements from a pair.

fstD :: VectorSpace a s => (a,b) :~> a
fstD = linearD fst

sndD :: VectorSpace b s => (a,b) :~> b
sndD = linearD snd
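The definitions in this section fit together into a small runnable sketch. It assumes linear maps are plain functions (as in this post) and uses a pared-down class VS that provides only zeroV, which is all these particular definitions need. The helpers value and derivAt and the sample tower example are hypothetical additions for inspecting results.

```haskell
{-# LANGUAGE TypeOperators #-}

-- Assumption: only the zero vector is needed for this sketch.
class VS v where
  zeroV :: v

instance VS Double where
  zeroV = 0

instance (VS a, VS b) => VS (a, b) where
  zeroV = (zeroV, zeroV)

-- Derivative tower: a value plus a map from input offsets to the
-- tower of the derivative (linear maps represented as functions).
data a :> b = D b (a -> (a :> b))

-- Infinitely differentiable functions.
type a :~> b = a -> (a :> b)

dConst :: VS b => b -> a :> b
dConst b = D b (const dZero)

dZero :: VS b => a :> b
dZero = dConst zeroV

-- A linear function is its own derivative; higher derivatives vanish.
linearD :: VS v => (u -> v) -> (u :~> v)
linearD f u = D (f u) (dConst . f)

dId :: VS u => u :~> u
dId = linearD id

fstD :: VS a => (a, b) :~> a
fstD = linearD fst

-- Hypothetical helpers for inspecting towers.
value :: (a :> b) -> b
value (D b _) = b

derivAt :: a -> (a :> b) -> (a :> b)
derivAt da (D _ d) = d da

-- Tower of fst at the point (3, 4).
example :: (Double, Double) :> Double
example = fstD (3, 4)
```

For instance, value example is the first component, derivAt (1,0) example carries the directional derivative along the first axis, and a second derivAt yields a zero tower.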

Numeric operations

Numeric operations can be specified much as they were previously. First, that definition again (with variable names changed),

instance Num b => Num (Dif b) where
  fromInteger               = dConst . fromInteger
  D u0 u' + D v0 v'         = D (u0 + v0) (u' + v')
  D u0 u' - D v0 v'         = D (u0 - v0) (u' - v')
  u@(D u0 u') * v@(D v0 v') = D (u0 * v0) (u' * v + u * v')

Now the new definition:

instance (Num b, VectorSpace b b) => Num (a:>b) where
  fromInteger               = dConst . fromInteger
  D u0 u' + D v0 v'         = D (u0 + v0) (u' + v')
  D u0 u' - D v0 v'         = D (u0 - v0) (u' - v')
  u@(D u0 u') * v@(D v0 v') =
    D (u0 * v0) (\ da -> (u * v' da) + (u' da * v))

The main change shows up in multiplication. It is no longer meaningful to write something like u' * v, because u' :: a :-* (a :> b), while v :: a :> b. Instead, v' gets applied to the small change in input before multiplying by u. Likewise, u' gets applied to the small change in input before multiplying by v.

The same sort of change has happened silently in the sum and difference cases, but is hidden by the numeric overloadings provided for functions. Written more explicitly:

  D u0 u' + D v0 v' = D (u0 + v0) (\ da -> u' da + v' da)

By the way, a bit of magic can also hide the “\ da -> ...” in the definition of multiplication:

  u@(D u0 u') * v@(D v0 v') = D (u0 * v0) ((u *) . v' + (* v) . u')

The derivative part can be deciphered as follows: transform (the input change) by v' and then pre-multiply by u; transform (the input change) by u' and then post-multiply by v; and add the results. If this sort of wizardry isn’t your game, forget about it and use the more explicit form.
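As a sanity check on the product rule, here is a minimal sketch of the Num instance in its explicit form, applied to a polynomial. To stay self-contained it assumes Num's 0 as the zero vector (the simplification mentioned earlier) instead of a VectorSpace constraint, and it leaves abs and signum undefined. The names poly, tower, value, and derivAt are hypothetical.

```haskell
{-# LANGUAGE TypeOperators #-}

-- Derivative tower with linear maps represented as functions.
data a :> b = D b (a -> (a :> b))

-- Assumption: Num's 0 stands in for the zero vector.
dConst :: Num b => b -> a :> b
dConst b = D b (const (dConst 0))

dId :: Num u => u -> (u :> u)
dId u = D u (\du -> dConst du)

instance Num b => Num (a :> b) where
  fromInteger = dConst . fromInteger
  D u0 u' + D v0 v' = D (u0 + v0) (\da -> u' da + v' da)
  D u0 u' - D v0 v' = D (u0 - v0) (\da -> u' da - v' da)
  -- Product rule: apply each derivative to the input change first.
  u@(D u0 u') * v@(D v0 v') =
    D (u0 * v0) (\da -> u * v' da + u' da * v)
  abs    = error "abs: not needed for this sketch"
  signum = error "signum: not needed for this sketch"

-- Hypothetical helpers for inspecting towers.
value :: (a :> b) -> b
value (D b _) = b

derivAt :: a -> (a :> b) -> (a :> b)
derivAt da (D _ d) = d da

-- An ordinary polymorphic polynomial ...
poly :: Num n => n -> n
poly x = x * x + 3 * x + 1

-- ... evaluated on the tower of the identity function at 5.
tower :: Double :> Double
tower = poly (dId 5)
```

Reading off the tower at unit input changes gives the value 41, first derivative 13, second derivative 2, and zero from then on, matching the derivatives of x² + 3x + 1 at 5.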

Composition — the chain rule

Here’s the chain rule we used earlier.

(>-<) :: (Num a) => (a -> a) -> (Dif a -> Dif a) -> (Dif a -> Dif a)
f >-< f' = \ u@(D u0 u') -> D (f u0) (f' u * u')

The new one differs just slightly:

(>-<) :: VectorSpace u s =>
         (u -> u) -> ((a :> u) -> (a :> s)) -> (a :> u) -> (a :> u)
f >-< f' = \ u@(D u0 u') -> D (f u0) (\ da -> f' u *^ u' da)

Or we can hide the da, as with multiplication:

f >-< f' = \ u@(D u0 u') -> D (f u0) ((f' u *^) . u')

With this change, all of the method definitions in Beautiful differentiation work essentially as before. For instance,

instance (Fractional b, VectorSpace b b) => Fractional (a:>b) where
  fromRational = dConst . fromRational
  recip        = recip >-< - sqr recip

See the library for details.

The chain rule pure and simple

The (>-<) operator above is a specialized form of the chain rule that is convenient for automatic differentiation. In its simplest and most general form, the chain rule says

deriv (f . g) x = deriv f (g x) . deriv g x

The composition on the right hand side is on linear maps (derivatives). You may be used to seeing the chain rule in one or more of its specialized forms, using some form of product (scalar/scalar, scalar/vector, vector/vector dot, matrix/vector) instead of composition. Those forms all mean the same as this general case, but are defined on various representations of linear maps, instead of linear maps themselves.

The chain rule above constructs only the first derivatives. Instead, we’ll construct all of the derivatives by using all of the derivatives of f and g.

(@.) :: (b :~> c) -> (a :~> b) -> (a :~> c)
(f @. g) a0 = D c0 (c' @. b')
  where
    D b0 b' = g a0
    D c0 c' = f b0
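A small sketch shows the shape of this tower-level composition in use. It transcribes the definition of (@.) as given above and tries it only on linear functions, where the expected towers are easy to state; it is not a test of the higher derivatives of nonlinear compositions. As before, Num's 0 is assumed as the zero vector, and double, triple, sixfold, value, and derivAt are hypothetical names.

```haskell
{-# LANGUAGE TypeOperators #-}

-- Towers and infinitely differentiable functions, as in the post.
data a :> b = D b (a :~> b)
type a :~> b = a -> (a :> b)

-- Assumption: Num's 0 stands in for the zero vector.
dConst :: Num b => b -> a :> b
dConst b = D b (const (dConst 0))

linearD :: Num v => (u -> v) -> (u :~> v)
linearD f u = D (f u) (dConst . f)

-- Chain rule on towers, transcribed from the definition above.
(@.) :: (b :~> c) -> (a :~> b) -> (a :~> c)
(f @. g) a0 = D c0 (c' @. b')
  where
    D b0 b' = g a0
    D c0 c' = f b0

-- Hypothetical helpers for inspecting towers.
value :: (a :> b) -> b
value (D b _) = b

derivAt :: a -> (a :> b) -> (a :> b)
derivAt da (D _ d) = d da

-- Two linear functions and the tower of their composition at 5.
double, triple :: Double :~> Double
double = linearD (2 *)
triple = linearD (3 *)

sixfold :: Double :> Double
sixfold = (triple @. double) 5
```

The composite multiplies by 6, so the tower's value at 5 is 30, its derivative applied to an input change of 1 is 6, and the next derivative is zero.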

Coming attractions

In this post, we’ve combined derivative towers with generalized derivatives (based on linear maps), for constructing infinitely many derivatives of functions over multi-dimensional (or scalar) domains. The inner workings are subtler than the previous code, but almost as simple to express and just as easy to use.

If you’re interested in learning more about generalized derivatives, I recommend the book Calculus on Manifolds.

Future posts will include:

A look at an efficiency issue and some possible solutions.
Elegant executable specifications of smooth surfaces, using derivatives for the surface normals used in shading.

What is a derivative, really?

Conal — Mon, 19 May 2008 05:01:08 +0000

The post Beautiful differentiation showed how easily and beautifully one can construct an infinite tower of derivative values in Haskell programs, while computing plain old values. The trick (from Jerzy Karczmarczuk) was to overload numeric operators to operate on the following (co)recursive type:

data Dif b = D b (Dif b)

This representation, however, works only when differentiating functions from a scalar (one-dimensional) domain, i.e., functions of type a -> b for a scalar type a. The reason for this limitation is that only in those cases can the type of derivative values be identified with the type of regular values.

Consider a function f :: (R,R) -> R, where R is, say, Double. The value of f at a domain value (x,y) has type R, but the derivative of f consists of two partial derivatives. Moreover, the second derivative consists of four partial second-order derivatives (or three, depending how you count). A function f :: (R,R) -> (R,R,R) also has two partial derivatives at each point (x,y), each of which is a triple. That pair of triples is commonly written as a two-by-three matrix.

Each of these situations has its own derivative shape and its own chain rule (for the derivative of function compositions), using plain-old multiplication, scalar-times-vector, vector-dot-vector, matrix-times-vector, or matrix-times-matrix. Second derivatives are more complex and varied.

How many forms of derivatives and chain rules are enough? Are we doomed to work with a plethora of increasingly complex types of derivatives, as well as the diverse chain rules needed to accommodate all compatible pairs of derivatives? Fortunately, not. There is a single, simple, unifying generalization. By reconsidering what we mean by a derivative value, we can see that these various forms are all representations of a single notion, and all the chain rules mean the same thing on the meanings of the representations.

This blog post is about that unifying view of derivatives.

Edits:

2008-05-20: There are several comments about this post on reddit.
2008-05-20: Renamed derivative operator from D to deriv to avoid confusion with the data constructor for derivative towers.
2008-05-20: Renamed linear map type from (:->) to (:-*) to make it visually closer to a standard notation.

What’s a derivative?

To get an intuitive sense of what’s going on with derivatives in general, let’s look at some examples. If you already know about calculus on manifolds, you might want to skip ahead.

One dimension

Start with a simple function on real numbers:

f1 :: R -> R
f1 x = x^2 + 3*x + 1

Writing the derivative of a function f as deriv f, let’s now consider the question: what is deriv f1? We might say that

deriv f1 x = 2*x+3

so e.g., deriv f1 5 = 13. In other words, f1 is changing 13 times as fast as its argument, when its argument is passing 5.

Rephrased yet again, if dx is a very tiny number, then f1(5+dx) - f1 5 is very nearly 13 * dx. If f1 maps seconds to meters, then deriv f1 5 is 13 meters per second. So already, we can see that the range of f (meters) and the range of deriv f (meters/second) disagree.
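The "very nearly" claim is easy to check numerically. The following is a small sketch; the names derivF1, actualChange, and predictedChange are hypothetical helpers for this illustration, and derivF1 is simply the hand-computed derivative from above.

```haskell
type R = Double

f1 :: R -> R
f1 x = x ^ (2 :: Int) + 3 * x + 1

-- Hand-computed derivative of f1.
derivF1 :: R -> R
derivF1 x = 2 * x + 3

-- How much f1 really changes when its argument moves from x to x + dx.
actualChange :: R -> R -> R
actualChange x dx = f1 (x + dx) - f1 x

-- The linear prediction: the derivative at x, times the step.
predictedChange :: R -> R -> R
predictedChange x dx = derivF1 x * dx
```

For f1 the gap between actualChange x dx and predictedChange x dx is dx^2 (up to floating-point rounding), so it shrinks much faster than the step itself.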

Two dimensions in and one dimension out

As a second example, consider a two-dimensional domain:

f2 :: (R,R) -> R
f2 (x,y) = 2*x*y + 3*x + 5*y + 7

Again, let’s consider some units, to get a guess of what kind of thing deriv f2 (x,y) really is. Suppose that f2 measures altitude of terrain above a plane, as a function of the position in the plane. (So f2 is a “height field”.) You can guess that deriv f2 (x,y) is going to have something to do with how fast the altitude is changing, i.e. the slope, at (x,y). But there isn’t a single slope. Instead, there’s a slope for every possible compass direction (a hiker’s degrees of freedom).

Now consider the conventional math answer to what is deriv f2 (x,y). Since f2 has a two-dimensional domain, it has two partial derivatives, and its derivative is commonly written as a pair of the two partials:

deriv f2 (x,y) = (2*y+3, 2*x+5)

In our example, these two pieces of information correspond to two of the possible slopes. The first is the slope if heading directly east, and the second if directly north (increasing x and increasing y, respectively).

What good does it do our hiker to be told just two of the infinitude of possible slopes at a point? The answer is perhaps magical: for well-behaved terrains, these two pieces of information are enough to calculate all (infinitely many) slopes, with just a bit of math. Every direction can be described as partly east and partly north (perhaps negatively for westish and southish directions). Given a direction angle ang (where east is zero and north is 90 degrees), the east and north components are cos ang and sin ang, respectively. When heading in the direction ang, the slope will be a weighted sum of the east-going slope and the north-going slope, where the weights are the east and north components (cos ang and sin ang).
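As a sketch of that weighted sum, here is the slope in an arbitrary compass direction, computed from just the two partials. The names derivF2 and slopeToward are hypothetical helpers; since Haskell's cos and sin take radians, the function converts from the degrees used in the text.

```haskell
type R = Double

-- The two partial derivatives of f2: (east-going slope, north-going slope).
derivF2 :: (R, R) -> (R, R)
derivF2 (x, y) = (2 * y + 3, 2 * x + 5)

-- Slope when heading in direction deg (degrees; east is 0, north is 90)
-- from the position p.
slopeToward :: R -> (R, R) -> R
slopeToward deg p = east * cos ang + north * sin ang
  where
    ang           = deg * pi / 180   -- degrees to radians
    (east, north) = derivF2 p
```

At the origin the partials are (3, 5), so heading east gives a slope of 3, heading north gives 5, and heading west gives -3 (each up to floating-point rounding in cos and sin).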

Instead of angles, our hiker may prefer thinking directly about the north and east components of a tiny step from the position (x,y). If the step is small enough and lands dx feet to the east and dy feet to the north, then the change in altitude, f2(x+dx,y+dy) - f2(x,y) is very nearly equal to (2*y+3)*dx + (2*x+5)*dy. If we use (<.>) to mean dot (inner) product, then this change in altitude is deriv f2 (x,y) <.> (dx,dy).
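The same check as in the one-dimensional case works here, with the dot product standing in for multiplication. This is only a sketch: derivF2 is the hand-computed pair of partials, and actualChange and predictedChange are hypothetical names for this illustration.

```haskell
type R = Double

f2 :: (R, R) -> R
f2 (x, y) = 2 * x * y + 3 * x + 5 * y + 7

-- Hand-computed partial derivatives of f2.
derivF2 :: (R, R) -> (R, R)
derivF2 (x, y) = (2 * y + 3, 2 * x + 5)

-- Dot (inner) product on pairs.
(<.>) :: (R, R) -> (R, R) -> R
(a, b) <.> (c, d) = a * c + b * d

-- Real change in altitude for a step (dx, dy) away from (x, y).
actualChange :: (R, R) -> (R, R) -> R
actualChange (x, y) (dx, dy) = f2 (x + dx, y + dy) - f2 (x, y)

-- Linear prediction of that change.
predictedChange :: (R, R) -> (R, R) -> R
predictedChange p step = derivF2 p <.> step
```

For this particular f2 the two differ by exactly 2*dx*dy (up to rounding), which is negligible next to the step when dx and dy are tiny.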

From this second example, we can see that the derivative value is not a range value, but also not a rate-of-change of range values. It’s a pair of such rates with the know-how to use those rates to determine output changes.

Two dimensions in and three dimensions out

Next, imagine moving around on a surface in space, say a torus, and suppose that the surface has grid marks to define a two-dimensional parameter space. As our hiker travels around in the 2D parameter space, his position in 3D space changes accordingly, more flexibly than just an altitude. This situation corresponds to a function from 2D to 3D:

f3 :: (R,R) -> (R,R,R)

At any position (s,t) in the parameter space, and for every choice of direction through parameter space, each of the coordinates of the position in 3D space has a rate of change. Again, if the function is mathematically well-behaved (differentiable), then all of these rates of change can be summarized in two partial derivatives. This time, however, each partial derivative has components in X, Y, and Z, so it takes six numbers to describe the 3D velocities for all possible directions in parameter space. These numbers are usually written as a 3-by-2 matrix m (the Jacobian of f3). Given a small parameter step (ds,dt), the resulting change in 3D position is very nearly equal to the product of the derivative matrix and the difference vector, i.e., m `timesVec` (ds,dt).
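To make this concrete, here is a sketch with one particular torus. The parametrization (major radius 2, minor radius 1) is my own choice for illustration, as are the names M32, jacobianF3, and stepError; only f3's type and timesVec come from the text above.

```haskell
type R  = Double
type V2 = (R, R)
type V3 = (R, R, R)

-- Hypothetical torus: major radius 2, minor radius 1.
f3 :: V2 -> V3
f3 (s, t) = ((2 + cos t) * cos s, (2 + cos t) * sin s, sin t)

-- A 3-by-2 matrix as three rows (X, Y, Z), each holding the (s, t) partials.
type M32 = (V2, V2, V2)

-- Hand-computed Jacobian of f3.
jacobianF3 :: V2 -> M32
jacobianF3 (s, t) =
  ( (negate ((2 + cos t) * sin s), negate (sin t * cos s))
  , ((2 + cos t) * cos s         , negate (sin t * sin s))
  , (0                           , cos t) )

-- Matrix-times-vector: dot each row with the step.
timesVec :: M32 -> V2 -> V3
timesVec (r1, r2, r3) v = (dot r1 v, dot r2 v, dot r3 v)
  where dot (a, b) (c, d) = a * c + b * d

-- Largest coordinate-wise gap between the real change in position and
-- the linear prediction for a parameter step.
stepError :: V2 -> V2 -> R
stepError (s, t) (ds, dt) = maximum [abs (ax - px), abs (ay - py), abs (az - pz)]
  where
    (x0, y0, z0) = f3 (s, t)
    (x1, y1, z1) = f3 (s + ds, t + dt)
    (ax, ay, az) = (x1 - x0, y1 - y0, z1 - z0)
    (px, py, pz) = jacobianF3 (s, t) `timesVec` (ds, dt)
```

For steps of around 1e-4 in each parameter, stepError comes out orders of magnitude smaller than the step itself, which is what "very nearly" means here.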

A common perspective

The examples above use different representations for derivatives: scalar numbers, a vector (pair of numbers), and a matrix. Common to all of these representations is the ability to turn a small step in the function’s domain into a resulting step in the range.

In f1, the (scalar) derivative c really means (c *), meaning multiply by c.
In f2, the (vector) derivative v means (v <.>).
In f3, the (matrix) derivative m means (m `timesVec`).

So, the common meaning of these derivative representations is a function, and not just any function, but a linear function–often called a “linear map” or “linear transformation”. For a function lf to be linear in this context means that

lf (u+v) == lf u + lf v, and
lf (c*v) == c * lf v, for scalar values c.
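The three readings and the two linearity laws can be sketched directly as functions. All names here (scalarMap, vectorMap, matrixMap, additiveAt, homogeneousAt) are hypothetical, and the law checkers only spot-check sample inputs with a floating-point tolerance; they are not proofs.

```haskell
type R  = Double
type V2 = (R, R)
type V3 = (R, R, R)

-- The meaning of a scalar derivative c: multiply by c.
scalarMap :: R -> (R -> R)
scalarMap c = (c *)

-- The meaning of a vector derivative v: dot with v.
vectorMap :: V2 -> (V2 -> R)
vectorMap (a, b) (dx, dy) = a * dx + b * dy

-- The meaning of a 3-by-2 matrix derivative (as rows): dot each row with the step.
matrixMap :: (V2, V2, V2) -> (V2 -> V3)
matrixMap (r1, r2, r3) v = (vectorMap r1 v, vectorMap r2 v, vectorMap r3 v)

-- Spot-check lf (u+v) == lf u + lf v for a map from pairs to scalars.
additiveAt :: (V2 -> R) -> V2 -> V2 -> Bool
additiveAt lf (u1, u2) (v1, v2) =
  abs (lf (u1 + v1, u2 + v2) - (lf (u1, u2) + lf (v1, v2))) < 1.0e-9

-- Spot-check lf (c*v) == c * lf v for a map from pairs to scalars.
homogeneousAt :: (V2 -> R) -> R -> V2 -> Bool
homogeneousAt lf c (v1, v2) =
  abs (lf (c * v1, c * v2) - c * lf (v1, v2)) < 1.0e-9
```

For instance, vectorMap (7,7) passes both checks on any sample inputs, while a non-linear function such as \(x,y) -> x*y fails the additivity check on most inputs.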

Now what about the different chain rules, saying to combine derivative values via various kinds of products (scalar/scalar, scalar/vector, vector/vector dot, matrix/vector)? Each of these products implements the same abstract notion, which is composition of linear maps.
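Here is a small sketch of one such case. The curve function is a hypothetical example of mine; composing f2 with it gives a function from R to R, whose derivative by the conventional chain rule is a vector-dot-vector product. Read as linear maps, that product is nothing more than function composition.

```haskell
type R  = Double
type V2 = (R, R)

-- A hypothetical curve in the plane.
curve :: R -> V2
curve x = (x, x * x)

-- Derivative of curve at x, read as a linear map (scalar-times-vector).
curveLin :: R -> (R -> V2)
curveLin x dx = (dx, 2 * x * dx)

f2 :: V2 -> R
f2 (x, y) = 2 * x * y + 3 * x + 5 * y + 7

-- Derivative of f2 at (x,y), read as a linear map (dot with the partials).
f2Lin :: V2 -> (V2 -> R)
f2Lin (x, y) (dx, dy) = (2 * y + 3) * dx + (2 * x + 5) * dy

-- Chain rule: the derivative of (f2 . curve) at x is the composition of
-- the two linear maps.
chainLin :: R -> (R -> R)
chainLin x = f2Lin (curve x) . curveLin x

-- The same derivative worked out by hand from
-- f2 (curve x) = 2x^3 + 5x^2 + 3x + 7.
directSlope :: R -> R
directSlope x = 6 * x * x + 10 * x + 3
```

At x = 1 the partials of f2 are (5, 7) and the curve's velocity is (1, 2), so the dot product is 19, which agrees with directSlope 1; chainLin 1 is the linear map "multiply by 19".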

What about `Dif`?

Now let’s return to the derivative towers we used before:

data Dif b = D b (Dif b)

As I mentioned above, this representation only works when derivative values can be represented just like range values. That punning of derivative values with range values works when the domain type is one-dimensional. For functions over higher-dimensional domains, we’ll have to use a different representation.

Assume a type of linear functions from a to b:

type a :-* b = . . .

(In Haskell, type constructors beginning with a colon are used infix.) Since the derivative type depends on domain as well as range, our derivative tower will have two type parameters instead of one. To make definitions prettier, I’ll change derivative towers to an infix operator as well.

data a :> b = D b (a :> (a :-* b))

An infinitely differentiable function is then one that produces a derivative tower:

type a :~> b = a -> (a:>b)
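To see how these types fit together, here is a sketch that builds the tower for f2 by hand. It rests on a loud assumption: the linear map type is left unspecified above, and I stand in ordinary Haskell functions for it, which is only one possible representation. The Zero class and the names zeroT, value, derivT, and f2T are likewise hypothetical, introduced just for this illustration.

```haskell
{-# LANGUAGE TypeOperators #-}

type R = Double

-- ASSUMPTION: linear maps represented as plain functions (a stand-in only).
type a :-* b = a -> b

-- Derivative tower: a value, then a tower of linear-map-valued derivatives.
data a :> b = D b (a :> (a :-* b))

-- Infinitely differentiable functions.
type a :~> b = a -> (a :> b)

-- Hypothetical class for zero values, used to end towers with zeros.
class Zero v where
  zero :: v

instance Zero Double where
  zero = 0

instance Zero b => Zero (a -> b) where
  zero = const zero

-- The all-zero tower. The signature is required: the recursion is at a
-- different type at each level (polymorphic recursion).
zeroT :: Zero b => a :> b
zeroT = D zero zeroT

value :: (a :> b) -> b
value (D b _) = b

derivT :: (a :> b) -> (a :> (a :-* b))
derivT (D _ d) = d

-- Hand-written tower for f2: value, first derivative, second derivative,
-- and zeros from there on.
f2T :: (R, R) :~> R
f2T (x, y) =
  D (2 * x * y + 3 * x + 5 * y + 7)
    (D (\(dx, dy) -> (2 * y + 3) * dx + (2 * x + 5) * dy)
       (D (\(dx', dy') (dx, dy) -> 2 * dy' * dx + 2 * dx' * dy)
          zeroT))
```

Notice how the type grows with each level: the first derivative is a linear map, the second is a linear map producing linear maps, and so on. In this sketch the tower is spelled out manually; nothing here derives it automatically.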

What’s next?

Perhaps now you’re wondering:

Are these lovely ideas workable in practice?
What happens to the code from Beautiful differentiation?
What use are derivatives, anyway?

These questions and more will be answered in upcoming installments.
