## Reimagining matrices

The function of the imagination is not
to make strange things settled, so much as
to make settled things strange.

- G.K. Chesterton

Why is matrix multiplication defined so very differently from matrix addition? If we didn’t know these procedures, could we derive them from first principles? What might those principles be?

This post gives a simple semantic model for matrices and then uses it to systematically derive the implementations that we call matrix addition and multiplication. The development illustrates what I call “denotational design”, particularly with type class morphisms. On the way, I give a somewhat unusual formulation of matrices and accompanying definition of matrix “multiplication”.

For more details, see the linear-map-gadt source code.

Edits:

• 2012–12–17: Replaced lost $B$ entries in description of matrix addition. Thanks to Travis Cardwell.

Note: I’m using MathML for the math below, which appears to work well on Firefox but on neither Safari nor Chrome. I use Pandoc to generate the HTML+MathML from markdown+lhs+LaTeX. There’s probably a workaround using different Pandoc settings and requiring some tweaks to my WordPress installation. If anyone knows how (especially the WordPress end), I’d appreciate some pointers.

## Parallel speculative addition via memoization

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.

As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the $\left(n+1\right)$th sum & carry require knowing the $n$th carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.

One general technique for improving parallelism is speculation—doing more work than might be needed so that we don’t have to wait to find out exactly what will be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will be simply different type specializations of a single polymorphic definition with an extremely terse definition.

## A third view on trees

A few recent posts have played with trees from two perspectives. The more commonly used I call "top-down", because the top-level structure is most immediately apparent. A top-down binary tree is either a leaf or a pair of such trees, and that pair can be accessed without wading through intervening structure. Much less commonly used are "bottom-up" trees. A bottom-up binary tree is either a leaf or a single such tree of pairs. In the non-leaf case, the pair structure of the tree elements is accessible by operations like mapping, folding, or scanning. The difference is between a pair of trees and a tree of pairs.

As an alternative to the top-down and bottom-up views on trees, I now want to examine a third view, which is a hybrid of the two. Instead of pairs of trees or trees of pairs, this hybrid view is of trees of trees, and more specifically of bottom-up trees of top-down trees. As we’ll see, these hybrid trees emerge naturally from the top-down and bottom-up views. A later post will show how this third view lends itself to an in-place (destructive) scan algorithm, suitable for execution on modern GPUs.

Edits:

• 2011-06-04: "Suppose we have a bottom-up tree of top-down trees, i.e., `t ∷ TB (TT a)`. Was backwards. (Thanks to Noah Easterly.)
• 2011-06-04: Notation: "`f ➶ n`" and "`f ➴ n`".

Continue reading ‘A third view on trees’ »

## Parallel tree scanning by composition

My last few blog posts have been on the theme of scans, and particularly on parallel scans. In Composable parallel scanning, I tackled parallel scanning in a very general setting. There are five simple building blocks out of which a vast assortment of data structures can be built, namely constant (no value), identity (one value), sum, product, and composition. The post defined parallel prefix and suffix scan for each of these five "functor combinators", in terms of the same scan operation on each of the component functors. Every functor built out of this basic set thus has a parallel scan. Functors defined more conventionally can be given scan implementations simply by converting to a composition of the basic set, scanning, and then back to the original functor. Moreover, I expect this implementation could be generated automatically, similarly to GHC’s `DerivingFunctor` extension.

Now I’d like to show two examples of parallel scan composition in terms of binary trees, namely the top-down and bottom-up variants of perfect binary leaf trees used in previous posts. (In previous posts, I used the terms "right-folded" and "left-folded" instead of "top-down" and "bottom-up".) The resulting two algorithms are expressed nearly identically, but have differ significantly in the work performed. The top-down version does $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$ work, while the bottom-up version does only $\Theta \left(n\right)$, and thus the latter algorithm is work-efficient, while the former is not. Moreover, with a very simple optimization, the bottom-up tree algorithm corresponds closely to Guy Blelloch’s parallel prefix scan for arrays, given in Programming parallel algorithms. I’m delighted with this result, as I had been wondering how to think about Guy’s algorithm.

Edit:

• 2011-05-31: Added `Scan` and `Applicative` instances for `T2` and `T4`.

Continue reading ‘Parallel tree scanning by composition’ »

## Composable parallel scanning

The post Deriving list scans gave a simple specification of the list-scanning functions `scanl` and `scanr`, and then transformed those specifications into the standard optimized implementations. Next, the post Deriving parallel tree scans adapted the specifications and derivations to a type of binary trees. The resulting implementations are parallel-friendly, but not work-efficient, in that they perform $n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n$ work vs linear work as in the best-known sequential algorithm.

Besides the work-inefficiency, I don’t know how to extend the critical `initTs` and `tailTs` functions (analogs of `inits` and `tails` on lists) to depth-typed, perfectly balanced trees, of the sort I played with in A trie for length-typed vectors and From tries to trees. The difficulty I encounter is that the functions `initTs` and `tailTs` make unbalanced trees out of balanced ones, so I don’t know how to adapt the specifications when types prevent the existence of unbalanced trees.

This new post explores an approach to generalized scanning via type classes. After defining the classes and giving a simple example, I’ll give a simple & general framework based on composing functor combinators.

Edits:

• 2011-03-02: Fixed typo. "constant functor is easiest" (instead of "identity functor"). Thanks, frguybob.
• 2011-03-05: Removed final unfinished sentence.
• 2011-07-28: Replace "`assocL`" with "`assocR`" in `prefixScan` derivation for `g ∘ f`.

Continue reading ‘Composable parallel scanning’ »

## Deriving parallel tree scans

The post Deriving list scans explored folds and scans on lists and showed how the usual, efficient scan implementations can be derived from simpler specifications.

Let’s see now how to apply the same techniques to scans over trees.

This new post is one of a series leading toward algorithms optimized for execution on massively parallel, consumer hardware, using CUDA or OpenCL.

Edits:

• 2011-03-01: Added clarification about "`∅`" and "`(⊕)`".
• 2011-03-23: corrected "linear-time" to "linear-work" in two places.

Continue reading ‘Deriving parallel tree scans’ »

## Deriving list scans

I’ve been playing with deriving efficient parallel, imperative implementations of "prefix sum" or more generally "left scan". Following posts will explore the parallel & imperative derivations, but as a warm-up, I’ll tackle the functional & sequential case here.

### Folds

You’re probably familiar with the higher-order functions for left and right "fold". The current documentation says:

`foldl`, applied to a binary operator, a starting value (typically the left-identity of the operator), and a list, reduces the list using the binary operator, from left to right:

``foldl f z [x1, x2, ⋯, xn] ≡ (⋯((z `f` x1) `f` x2) `f`⋯) `f` xn``

The list must be finite.

`foldr`, applied to a binary operator, a starting value (typically the right-identity of the operator), and a list, reduces the list using the binary operator, from right to left:

``foldr f z [x1, x2, ⋯, xn] ≡ x1 `f` (x2 `f` ⋯ (xn `f` z)⋯)``

And here are typical definitions:

``foldl ∷ (b → a → b) → b → [a] → bfoldl f z []     = zfoldl f z (x:xs) = foldl f (z `f` x) xsfoldr ∷ (a → b → b) → b → [a] → bfoldr f z []     = zfoldr f z (x:xs) = x `f` foldr f z xs``

Notice that `foldl` builds up its result one step at a time and reveals it all at once, in the end. The whole result value is locked up until the entire input list has been traversed. In contrast, `foldr` starts revealing information right away, and so works well with infinite lists. Like `foldl`, `foldr` also yields only a final value.

Sometimes it’s handy to also get to all of the intermediate steps. Doing so takes us beyond the land of folds to the kingdom of scans.

### Scans

The `scanl` and `scanr` functions correspond to `foldl` and `foldr` but produce all intermediate accumulations, not just the final one.

``scanl ∷ (b → a → b) → b → [a] → [b]scanl f z [x1, x2,  ⋯ ] ≡ [z, z `f` x1, (z `f` x1) `f` x2, ⋯]scanr ∷ (a → b → b) → b → [a] → [b]scanr f z [⋯, xn_1, xn] ≡ [⋯, xn_1 `f` (xn `f` z), xn `f` z, z]``

As you might expect, the last value is the complete left fold, and the first value in the scan is the complete right fold:

``last (scanl f z xs) ≡ foldl f z xshead (scanr f z xs) ≡ foldr f z xs``

which is to say

``last ∘ scanl f z ≡ foldl f zhead ∘ scanr f z ≡ foldr f z``

The standard scan definitions are trickier than the fold definitions:

``scanl ∷ (b → a → b) → b → [a] → [b]scanl f z ls = z : (case ls of                     []   → []                     x:xs → scanl f (z `f` x) xs)scanr ∷ (a → b → b) → b → [a] → [b]scanr _ z []     = [z]scanr f z (x:xs) = (x `f` q) : qs                   where qs@(q:_) = scanr f z xs``

Every time I encounter these definitions, I have to walk through it again to see what’s going on. I finally sat down to figure out how these tricky definitions might emerge from simpler specifications. In other words, how to derive these definitions systematically from simpler but less efficient definitions.

Most likely, these derivations have been done before, but I learned something from the effort, and I hope you do, too.

Continue reading ‘Deriving list scans’ »