Reimagining matrices

The function of the imagination is not
to make strange things settled, so much as
to make settled things strange.

- G.K. Chesterton

Why is matrix multiplication defined so very differently from matrix addition? If we didn’t know these procedures, could we derive them from first principles? What might those principles be?

This post gives a simple semantic model for matrices and then uses it to systematically derive the implementations that we call matrix addition and multiplication. The development illustrates what I call “denotational design”, particularly with type class morphisms. On the way, I give a somewhat unusual formulation of matrices and accompanying definition of matrix “multiplication”.

For more details, see the linear-map-gadt source code.

Edits:

• 2012–12–17: Replaced lost $B$ entries in description of matrix addition. Thanks to Travis Cardwell.

Note: I’m using MathML for the math below, which appears to work well on Firefox but on neither Safari nor Chrome. I use Pandoc to generate the HTML+MathML from markdown+lhs+LaTeX. There’s probably a workaround using different Pandoc settings and requiring some tweaks to my WordPress installation. If anyone knows how (especially the WordPress end), I’d appreciate some pointers.

Matrices

For now, I’ll write matrices in the usual form: $\left(\begin{array}{ccc}{A}_{11}& \cdots & {A}_{1m}\\ ⋮& \ddots & ⋮\\ {A}_{n1}& \cdots & {A}_{nm}\end{array}\right)$

To add two matrices, we add their corresponding components. If $A=\left(\begin{array}{ccc}{A}_{11}& \cdots & {A}_{1m}\\ ⋮& \ddots & ⋮\\ {A}_{n1}& \cdots & {A}_{nm}\end{array}\right)\phantom{\rule{0.167em}{0ex}}\phantom{\rule{0.167em}{0ex}}\mathrm{\text{and}}\phantom{\rule{0.333em}{0ex}}B=\left(\begin{array}{ccc}{B}_{11}& \cdots & {B}_{1m}\\ ⋮& \ddots & ⋮\\ {B}_{n1}& \cdots & {B}_{nm}\end{array}\right),$ then $A+B=\left(\begin{array}{ccc}{A}_{11}+{B}_{11}& \cdots & {A}_{1m}+{B}_{1m}\\ ⋮& \ddots & ⋮\\ {A}_{n1}+{B}_{n1}& \cdots & {A}_{nm}+{B}_{nm}\end{array}\right).$ More succinctly, $\left(A+B{\right)}_{ij}={A}_{ij}+{B}_{ij}.$

Multiplication

Multiplication, on the other hand, works quite differently. If $A=\left(\begin{array}{ccc}{A}_{11}& \cdots & {A}_{1m}\\ ⋮& \ddots & ⋮\\ {A}_{n1}& \cdots & {A}_{nm}\end{array}\right)\phantom{\rule{0.167em}{0ex}}\phantom{\rule{0.167em}{0ex}}\mathrm{\text{and}}\phantom{\rule{0.333em}{0ex}}B=\left(\begin{array}{ccc}{B}_{11}& \cdots & {B}_{1p}\\ ⋮& \ddots & ⋮\\ {B}_{m1}& \cdots & {B}_{mp}\end{array}\right),$ then $\left(A\bullet B{\right)}_{ij}=\sum _{k=1}^{m}{A}_{ik}\cdot {B}_{kj}.$ This time, we form the dot product of each $A$ row and $B$ column.

Why are these two matrix operations defined so differently? Perhaps these two operations are implementations of more fundamental specifications. If so, then making those specifications explicit could lead us to clear and compelling explanations of matrix addition and multiplication.

Transforming vectors

Simplifying from matrix multiplication, we have transformation of a vector by a matrix. If $A=\left(\begin{array}{ccc}{A}_{11}& \cdots & {A}_{1m}\\ ⋮& \ddots & ⋮\\ {A}_{n1}& \cdots & {A}_{nm}\end{array}\right)\phantom{\rule{0.167em}{0ex}}\phantom{\rule{0.167em}{0ex}}\mathrm{\text{and}}\phantom{\rule{0.333em}{0ex}}x=\left(\begin{array}{c}{x}_{1}\\ ⋮\\ {x}_{m}\end{array}\right),$ then $A\cdot x=\left(\begin{array}{ccccc}{A}_{11}\cdot {x}_{1}& +& \cdots & +& {A}_{1m}\cdot {x}_{m}\\ ⋮& & \ddots & & ⋮\\ {A}_{n1}\cdot {x}_{1}& +& \cdots & +& {A}_{nm}\cdot {x}_{m}\end{array}\right)$ More succinctly, $\left(A\cdot x{\right)}_{i}=\sum _{k=1}^{m}{A}_{ik}\cdot {x}_{k}.$

$\left(A+B\right)\phantom{\rule{0.167em}{0ex}}x=A\phantom{\rule{0.167em}{0ex}}x+B\phantom{\rule{0.167em}{0ex}}x$

Matrix “multiplication” composes transformations:

$\left(A\bullet B\right)\phantom{\rule{0.167em}{0ex}}x=A\phantom{\rule{0.167em}{0ex}}\left(B\phantom{\rule{0.167em}{0ex}}x\right)$

What kinds of transformations?

Linear transformations

Matrices represent linear transformations. To say that a transformation (or “function” or “map”) $f$ is “linear” means that $f$ preserves the structure of addition and scalar multiplication. In other words, $\begin{array}{ccc}\hfill f\phantom{\rule{0.167em}{0ex}}\phantom{\rule{0.167em}{0ex}}\left(x+y\right)& \hfill =\hfill & f\phantom{\rule{0.167em}{0ex}}x+f\phantom{\rule{0.167em}{0ex}}y\hfill \\ \hfill f\phantom{\rule{0.167em}{0ex}}\phantom{\rule{0.167em}{0ex}}\left(c\cdot x\right)& \hfill =\hfill & c\cdot f\phantom{\rule{0.167em}{0ex}}x\hfill \end{array}$ Equivalently, $f$ preserves all linear combinations: $f\phantom{\rule{0.167em}{0ex}}\left({c}_{1}\cdot {x}_{1}+\cdots +{c}_{m}\cdot {x}_{m}\right)={c}_{1}\cdot f\phantom{\rule{0.167em}{0ex}}{x}_{1}+\cdots +{c}_{m}\cdot f\phantom{\rule{0.167em}{0ex}}{x}_{m}$

What does it mean to say that “matrices represent linear transformations”? As we saw in the previous section, we can use a matrix to transform a vector. Our semantic function will exactly be this use, i.e., the meaning of matrix is as a function (map) from vectors to vectors. Moreover, these functions will satisfy the linearity properties above.

Representation

For simplicity, I’m going structure matrices in a unconventional way. Instead of a rectangular arrangement of numbers, use the following generalized algebraic data type (GADT):

``````data a ⊸ b where
Dot   ∷ InnerSpace b ⇒
b → (b ⊸ Scalar b)
(:&&) ∷ VS3 a c d ⇒  -- vector spaces with same scalar field
(a ⊸ c) → (a ⊸ d) → (a ⊸ c × d)``````

I’m using the notation “`c × d`” in place of the usual “`(c,d)`”. Precedences are such that “`×`” binds more tightly than “`⊸`”, which binds more tightly than “`→`”.

This definition builds on the `VectorSpace` class, with its associated `Scalar` type and `InnerSpace` subclass. Using `VectorSpace` is overkill for linear maps. It suffices to use modules over semirings, which means that we don’t assume multiplicative or additive inverses. The more general setting enables many more useful applications than vector spaces do, some of which I will describe in future posts.

The idea here is that a linear map results in either (a) a scalar, in which case it’s equivalent to `dot v` (partially applied dot product) for some `v`, or (b) a product, in which case it can be decomposed into two linear maps with simpler range types. Each row in a conventional matrix corresponds to `Dot v` for some vector `v`, and the stacking of rows corresponds to nested applications of `(:&&)`.

Semantics

The semantic function, `apply`, interprets a representation of a linear map as a function (satisfying linearity):

``````apply ∷ (a ⊸ b) → (a → b)
apply (Dot b)   = dot b
apply (f :&& g) = apply f &&& apply g``````

where, `(&&&)` is from `Control.Arrow`.

``(&&&) ∷ Arrow (↝) ⇒ (a ↝ b) → (a ↝ c) → (a ↝ (b,c))``

For functions,

``(f &&& g) a = (f a, g a)``

Functions, linearity, and multilinearity

Functions form a vector space, with scaling and addition defined “pointwise”. Instances from the vector-space package:

``````instance AdditiveGroup v ⇒ AdditiveGroup (a → v) where
zeroV   = pure   zeroV
(^+^)   = liftA2 (^+^)
negateV = fmap   negateV

instance VectorSpace v ⇒ VectorSpace (a → v) where
type Scalar (a → v) = a → Scalar v
(*^) s = fmap (s *^)``````

I wrote the definitions in this form to fit a template for applicative functors in general. Inlining the definitions of `pure`, `liftA2`, and `fmap` on functions, we get the following equivalent instances:

``````instance AdditiveGroup v ⇒ AdditiveGroup (a → v) where
zeroV     = λ _ → zeroV
f ^+^ g   = λ a → f a ^+^ g a
negateV f = λ a → negateV (f a)

instance VectorSpace v ⇒ VectorSpace (a → v) where
type Scalar (a → v) = a → Scalar v
s *^ f = λ a → s *^ f a``````

In math, we usually say that dot product is “bilinear”, or “linear in each argument”, i.e.,

``````dot (s *^ u,v) ≡ s *^ dot (u,v)
dot (u ^+^ w, v) ≡ dot (u,v) ^+^ dot (w,v)``````

Similarly for the second argument:

``````dot (u,s *^ v) ≡ s *^ dot (u,v)
dot (u, v ^+^ w) ≡ dot (u,v) ^+^ dot (u,w)``````

Now recast the first of these properties in a curried form:

``dot (s *^ u) v ≡ s *^ dot u v``

i.e.,

``````dot (s *^ u)
≡ {- η-expand -}
λ v → dot (s *^ u) v
≡ {- "bilinearity" -}
λ v → s *^ dot u v
≡ {- (*^) on functions -}
λ v → (s *^ dot u) v
≡ {- η-contract -}
s *^ dot u``````

Likewise,

``````dot (u ^+^ v)
≡ {- η-expand -}
λ w → dot (u ^+^ v) w
≡ {- "bilinearity" -}
λ w → dot u w ^+^ dot v w
≡ {- (^+^) on functions -}
dot u ^+^ dot v``````

Thus, when “bilinearity” is recast in terms of curried functions, it becomes just linearity. (The same reasoning applies more generally to multilinearity.)

Note that we could also define function addition as follows:

``f ^+^ g = add ∘ (f &&& g)``

where

``add = uncurry (^+^)``

This uncurried form will come in handy in derivations below.

Deriving matrix operations

We’ll add two linear maps using the `(^+^)` operation from `Data.AdditiveGroup`.

``(^+^) ∷ (a ⊸ b) → (a ⊸ b) → (a ⊸ b)``

Following the principle of semantic type class morphisms, the specification simply says that the meaning of the sum is the sum of the meanings:

``apply (f ^+^ g) ≡ apply f ^+^ apply g``

which is half of the definition of “linearity” for `apply`.

The game plan (as always) is to use the semantic specification to derive (or “calculate”) a correct implementation of each operation. For addition, this goal means we want to come up with a definition like

``f ^+^ g = <rhs>``

where `<rhs>` is some expression in terms of `f` and `g` whose meaning is the same as the meaning as `f ^+^ g`, i.e., where

``apply (f ^+^ g) ≡ apply <rhs>``

Since Haskell has convenient pattern matching, we’ll use it for our definition of `(^+^)` above. Addition has two arguments, and our data type has two constructors, there are at most four different cases to consider.

First, add `Dot` and `Dot`. The specification

``apply (f ^+^ g) ≡ apply f ^+^ apply g``

specializes to

``apply (Dot b ^+^ Dot c) ≡ apply (Dot b) ^+^ apply (Dot c)``

Now simplify the right-hand side (RHS):

``````apply (Dot b) ^+^ apply (Dot c)
≡ {- apply definition -}
dot b ^+^ dot c
≡ {- (bi)linearity of dot, as described above -}
dot (b ^+^ c)
≡ {- apply definition -}
apply (Dot (b ^+^ c))``````

So our specialized specification becomes

``apply (Dot b ^+^ Dot c) ≡ apply (Dot (b ^+^ c))``

which is implied by

``Dot b ^+^ Dot c ≡ Dot (b ^+^ c)``

and easily satisfied by the following partial definition (replacing “`≡`” by “`=`”):

``Dot b ^+^ Dot c = Dot (b ^+^ c)``

Now consider the case of addition with two `(:&&)` constructors:

The specification specializes to

``apply ((f :&& g) ^+^ (h :&& k)) ≡ apply (f :&& g) ^+^ apply (h :&& k)``

As with `Dot`, simplify the RHS:

``````apply (f :&& g) ^+^ apply (h :&& k)
≡ {- apply definition -}
(apply f &&& apply g) ^+^ (apply h &&& apply k)
≡ {- See below -}
(apply f ^+^ apply h) &&& (apply g ^+^ apply k)
≡ {- induction -}
apply (f ^+^ h) &&& apply (g ^+^ k)
≡ {- apply definition -}
apply ((f ^+^ h) :&& (g ^+^ k))``````

I used the following property (on functions):

``(f &&& g) ^+^ (h &&& k) ≡ (f ^+^ h) &&& (g ^+^ k)``

Proof:

``````(f &&& g) ^+^ (h &&& k)
≡ {- η-expand -}
λ x → ((f &&& g) ^+^ (h &&& k)) x
≡ {- (&&&) definition for functions -}
λ x → (f x, g x) ^+^ (h x, k x)
≡ {- (^+^) definition for pairs -}
λ x → (f x ^+^ h x, g x ^+^ k x)
≡ {- (^+^) definition for functions -}
λ x → ((f ^+^ h) x, (g ^+^ k) x)
≡ {- (&&&) definition for functions -}
(f ^+^ h) &&& (g ^+^ k)``````

The specification becomes

``apply ((f :&& g) ^+^ (h :&& k)) ≡ apply ((f ^+^ h) :&& (g ^+^ k))``

which is easily satisfied by the following partial definition

``(f :&& g) ^+^ (h :&& k) = (f ^+^ h) :&& (g ^+^ k)``

The other two cases are (a) `Dot` and `(:&&)`, and (b) `(:&&)` and `Dot`, but they don’t type-check (assuming that pairs are not scalars).

Composing linear maps

I’ll write linear map composition as “`g ∘ f`”, with type

``(∘) ∷ (b ⊸ c) → (a ⊸ b) → (a ⊸ c)``

This notation is thanks to a `Category` instance, which depends on a generalized `Category` class that uses the recent `ConstraintKinds` language extension. (See the source code.)

Following the semantic type class morphism principle again, the specification says that the meaning of the composition is the composition of the meanings:

``apply (g ∘ f) ≡ apply g ∘ apply f``

In the following, note that the `∘` operator binds more tightly than `&&&`, so `f ∘ h &&& g ∘ h` means `(f ∘ h) &&& (g ∘ h)`.

Derivation

Again, since there are two constructors, we have four possible cases cases. We can handle two of these cases together, namely `(:&&)` and anything. The specification:

``apply ((f :&& g) ∘ h) ≡ apply (f :&& g) ∘ apply h``

Reasoning proceeds as above, simplifying the RHS of the constructor-specialized specification.

Simplify the RHS:

``````apply (f :&& g) ∘ apply h
≡ {- apply definition -}
(apply f &&& apply g) ∘ apply h
≡ {- see below -}
apply f ∘ apply h &&& apply g ∘ apply h
≡ {- induction -}
apply (f ∘ h) &&& apply (g ∘ h)
≡ {- apply definition -}
apply (f ∘ h :&& g ∘ h)``````

This simplification uses the following property of functions:

``(p &&& q) ∘ r ≡ p ∘ r &&& q ∘ r``

Sufficient definition:

``(f :&& g) ∘ h = f ∘ h :&& g ∘ h``

We have two more cases, specified as follows:

``````apply (Dot c ∘ Dot b) ≡ apply (Dot c) ∘ apply (Dot b)

apply (Dot c ∘ (f :&& g)) ≡ apply (Dot c) ∘ apply (f :&& g)``````

Based on types, `c` must be a scalar in the first case and a pair in the second. (`Dot b` produces a scalar, while `f :&& g` produces a pair.) Thus, we can write these two cases more specifically:

``````apply (Dot s ∘ Dot b) ≡ apply (Dot s) ∘ apply (Dot b)

apply (Dot (a,b) ∘ (f :&& g)) ≡ apply (Dot (a,b)) ∘ apply (f :&& g)``````

In the derivation, I won’t spell out as many details as before. Simplify the RHSs:

``````  apply (Dot s) ∘ apply (Dot b)
≡ dot s ∘ dot b
≡ dot (s *^ b)
≡ apply (Dot (s *^ b))``````
``````  apply (Dot (a,b)) ∘ apply (f :&& g)
≡ dot (a,b) ∘ (apply f &&& apply g)
≡ add ∘ (dot a ∘ apply f &&& dot b ∘ apply g)
≡ dot a ∘ apply f ^+^ dot b ∘ apply g
≡ apply (Dot a ∘ f ^+^ Dot b ∘ g)``````

I’ve used the following properties of functions:

``````dot (a,b)             ≡ add ∘ (dot a *** dot b)

(r *** s) ∘ (p &&& q) ≡ r ∘ p &&& s ∘ q

add ∘ (p &&& q)       ≡ p ^+^ q

apply (f ^+^ g)       ≡ apply f ^+^ apply g``````

Implementation:

`````` Dot s     ∘ Dot b     = Dot (s *^ b)
Dot (a,b) ∘ (f :&& g) = Dot a ∘ f ^+^ Dot b ∘ g``````

Cross products

Another `Arrow` operation handy for linear maps is the parallel composition (product):

``(***) ∷ (a ⊸ c) → (b ⊸ d) → (a × b ⊸ c × d)``

The specification says that `apply` distributes over `(***)`. In other words, the meaning of the product is the product of the meanings.

``apply (f *** g) = apply f *** apply g``

Where, on functions,

``````p *** q = λ (a,b) → (p a, q b)
≡ p ∘ fst &&& q ∘ snd``````

Simplify the specifications RHS:

``````  apply f *** apply g
≡ apply f ∘ fst &&& apply g ∘ snd``````

If we knew how to represent `fst` and `snd` via our linear map constructors, we’d be nearly done. Instead, let’s suppose we have the following functions.

``````compFst ∷ VS3 a b c ⇒ a ⊸ c → a × b ⊸ c
compSnd ∷ VS3 a b c ⇒ b ⊸ c → a × b ⊸ c``````

specified as follows:

``````apply (compFst f) ≡ apply f ∘ fst
apply (compSnd g) ≡ apply g ∘ snd``````

With these two functions (to be defined) in hand, let’s try again.

``````  apply f *** apply g
≡ apply f ∘ fst &&& apply g ∘ snd
≡ apply (compFst f) &&& apply (compSnd g)
≡ apply (compFst f :&& compSnd g)``````

Composing with `fst` and `snd`

I’ll elide even more of the derivation this time, focusing reasoning on the meanings. Relating to the representation is left as an exercise. The key steps in the derivation:

``````dot a     ∘ fst ≡ dot (a,0)

(f &&& g) ∘ fst ≡ f ∘ fst &&& g ∘ fst

dot b     ∘ snd ≡ dot (0,b)

(f &&& g) ∘ snd ≡ f ∘ snd &&& g ∘ snd``````

Implementation:

``````compFst (Dot a)   = Dot (a,zeroV)
compFst (f :&& g) = compFst f &&& compFst g

compSnd (Dot b)   = Dot (zeroV,b)
compSnd (f :&& g) = compSnd f &&& compSnd g``````

where `zeroV` is the zero vector.

Given `compFst` and `compSnd`, we can implement `fst` and `snd` as linear maps simply as `compFst id` and `compSnd id`, where `id` is the (polymorphic) identity linear map.

Reflections

This post reflects an approach to programming that I apply wherever I’m able. As a summary:

• Look for an elegant what behind a familiar how.
• Define a semantic function for each data type.
• Derive a correct implementation from the semantics.

You can find more examples of this methodology elsewhere in this blog and in the paper Denotational design with type class morphisms.

1. rdm:

Note that the derivation is a lot simpler if you start with a form of multiplication which follows the same structural rules as addition and build from that an “inner product” operation which does the multiply-and-sum thing.

2. conal:

in what way is this a “reimagining”? That matrices represent linear transformations is absolutely fundamental.

Thanks for asking. What I mean by “reimagining” is (a) packaging of linear maps via the `Category` & `Arrow` vocabulary (more explicit in the library), (b) structuring the representation and semantics to match the algebraic structure of `dot` and `(&&&)`, and (c) derivation of operations from semantics.

3. Matthew Brunelle:

I just finished up linear algebra class so this article was a very stimulating way to apply that information. Thanks for the post!

4. rebcabin:

my big big big question is : can we do arbitrary computation using just matrix addition and multiplication (and maybe transpose)… i.e., can we capture the lambda calculus or the universal turing machine in a little bit of linear algebra? I have managed to find McCarthy COND using only primitive matrix ops. Is this enough, or do I need NAND?

5. Franklin Chen:

Before I got to the comment section, I had almost exactly the same thought as @DrMathochist in my head, so I was glad I kept that on hold and continued on, and your response clarified what exactly you meant by “reimagining”. (In college, for my first introduction to linear algebra, we went quite far without ever seeing or using a matrix, finally getting there, and deriving the matrix operations, only when it became necessary as a concrete representation of a linear transformation.) I suspect that people seeing only the title and first paragraphs of this post might be confused, however.

6. rebcabin:

There is a lovely demonstration somewhere vaguely remembered in one of my favorite books http://matrixeditions.com/UnifiedApproach4th.html that Determinant is the UNIQUE multilinear, antisymmetric function of matrices. In other words, it ought to be derivable from those properties, just in the way you derived mat mul from linearity and composition. dreaming

7. Jonathan Fischoff:

I can’t shake this feeling that Arrow erupted from the bowels of Base and tried to take possession of Vect.

It seems more natural to express Vect as a symmetric bimonoidal category (http://ncatlab.org/nlab/show/bimonoidal+category) using the tensor product and direct sum. (***) is basically the tensor product, but I think the direct sum of transformations would be more like (+++).

8. Dave Abrahams:

I’m probably revealing a weakness in my jargon knowledge, but what do you mean when you use the word “preserves,” here? You’ve used that one word to describe multiple things that look like mathematical laws (e.g. distributivity) each of which has its own name.

9. Peteris Erins:

Hi Conal, great post, inspired me to “reimagine” tensors: http://peteriserins.tumblr.com/post/38783455201/typed-type-tensors-in-scala.

10. Jason Turner:

I am a mathematical physicist, lately working in software engineering – and somewhat new to the practice of functional programming; I came across your publications by way of listening to your HaskellCast re FRP and denotational design and find it very very interesting.

I think that you may find some interesting reading on a similar vein and with some curious twists in the book “Probability Theory, The Logic of Science” by E T Jaynes, specifically chapter 2. This derives the quantitative rules of probability theory from the ‘basic desiderata’ of ‘plausible reasoning’. What I find interesting is firstly another fine example of the approach you have applied here and from what I surmise quite widely elsewhere; the approach is the same and the particular steps are also similar although as the problem is more general the equations involved are more general functional equations, one being ‘The Associativity Equation’. But it is also interesting that the author took this approach, with some inspiration from G Polya, around 65 years ago and while active in applying computation to probability and statistics was certainly not of a theoretical computer science orientation.