## Parallel speculative addition via memoization

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.

As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the $(n+1)$*th* sum & carry require knowing the $n$*th* carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.

One general technique for improving parallelism is *speculation*—doing more work than might be needed so that we don’t have to wait to find out exactly what *will* be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will be simply different type specializations of a single, extremely terse polymorphic definition.

### A full adder

Let’s start with an adder for two one-bit numbers. Because of the possibility of overflow, the result will be two bits, which I’ll call “sum” and “carry”. So that we can chain these one-bit adders, we’ll also add a carry input.

`addB ∷ (Bool,Bool) → Bool → (Bool,Bool)`

In the result, the first `Bool` will be the sum, and the second will be the carry. I’ve curried the carry input to make it stand out from the (other) addends.

There are a few ways to define `addB` in terms of logic operations. I like the following definition, as it shares a little work between sum & carry:

```
addB (a,b) cin = (axb ≠ cin, anb ∨ (cin ∧ axb))
 where
   axb = a ≠ b
   anb = a ∧ b
```
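To sanity-check the full adder, we can verify exhaustively that its two output bits encode the numeric sum of the three input bits. (A quick ASCII-syntax sketch, with a `bit` helper of my own; not part of the original development.)

```haskell
-- ASCII rendering of addB, for easy loading into GHCi.
addB :: (Bool,Bool) -> Bool -> (Bool,Bool)
addB (a,b) cin = (axb /= cin, anb || (cin && axb))
 where
   axb = a /= b
   anb = a && b

bit :: Bool -> Int
bit b = if b then 1 else 0

-- Exhaustive check over all eight inputs: sum + 2*carry == a + b + cin.
check :: Bool
check = and [ bit s + 2 * bit c == bit a + bit b + bit cin
            | a   <- [False,True]
            , b   <- [False,True]
            , cin <- [False,True]
            , let (s,c) = addB (a,b) cin ]

main :: IO ()
main = print check  -- True
```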

I’m using `(≠)` on `Bool` for exclusive or.

### A ripple carry adder

Now suppose we have not just two bits, but two *sequences* of bits, interpreted as binary numbers arranged from least to most significant bit. For simplicity, I’d like to assume that these sequences have the same length, so rather than taking a pair of bit lists, let’s take a list of bit pairs:

`add ∷ [(Bool,Bool)] → Bool → ([Bool],Bool)`

To implement `add`, traverse the list of bit pairs, threading the carries:

```
add []     c = ([]  , c)
add (p:ps) c = (s:ss, c'')
 where
   (s ,c' ) = addB p c
   (ss,c'') = add ps c'
```
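For instance, adding 6 and 7 as four-bit numbers (an ASCII-syntax sketch; the `toBits`/`fromBits` helpers are my own, not from the post, and bits are least significant first):

```haskell
addB :: (Bool,Bool) -> Bool -> (Bool,Bool)
addB (a,b) cin = (axb /= cin, anb || (cin && axb))
 where axb = a /= b
       anb = a && b

add :: [(Bool,Bool)] -> Bool -> ([Bool],Bool)
add []     c = ([]  , c)
add (p:ps) c = (s:ss, c'')
 where (s ,c' ) = addB p c
       (ss,c'') = add ps c'

-- Helpers (mine, not from the post): width-limited lsb-first conversions.
toBits :: Int -> Int -> [Bool]
toBits 0 _ = []
toBits w n = odd n : toBits (w-1) (n `div` 2)

fromBits :: [Bool] -> Int
fromBits = foldr (\b n -> (if b then 1 else 0) + 2*n) 0

main :: IO ()
main = do
  let (ss, cout) = add (zip (toBits 4 6) (toBits 4 7)) False
  print (fromBits ss + if cout then 16 else 0)  -- 13
```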

### State

This `add` definition contains a familiar pattern. The carry values act as a sort of *state* that gets updated in a linear (non-branching) way. The `State` monad captures this pattern of computation:

`newtype State s a = State (s → (a,s))`

By using `State` and its `Monad` instance, we can shorten our `add` definition. First we’ll need a new full adder definition, tweaked for `State`:

```
addB ∷ (Bool,Bool) → State Bool Bool
addB (a,b) = do cin ← get
                put (anb ∨ cin ∧ axb)
                return (axb ≠ cin)
 where
   anb = a ∧ b
   axb = a ≠ b
```

And then the multi-bit adder:

```
add ∷ [(Bool,Bool)] → State Bool [Bool]
add []     = return []
add (p:ps) = do s  ← addB p
                ss ← add ps
                return (s:ss)
```
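Running the `State` version with `runState` reproduces the ripple-carry behavior. A small ASCII-syntax check, assuming the `mtl` package that ships with GHC; here 3 + 1, least significant bit first:

```haskell
import Control.Monad.State  -- from the mtl package

addB :: (Bool,Bool) -> State Bool Bool
addB (a,b) = do cin <- get
                put (anb || cin && axb)
                return (axb /= cin)
 where anb = a && b
       axb = a /= b

add :: [(Bool,Bool)] -> State Bool [Bool]
add []     = return []
add (p:ps) = do s  <- addB p
                ss <- add ps
                return (s:ss)

main :: IO ()
main = print (runState (add [(True,True),(True,False)]) False)
-- ([False,False],True): 3 + 1 = 4, i.e. sum bits [0,0] with a carry out
```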

We don’t really need the `Monad` interface to define `add`. The simpler and more general `Applicative` interface suffices:

```
add []     = pure []
add (p:ps) = liftA2 (:) (addB p) (add ps)
```

This pattern also looks familiar. Oh — the `Traversable` instance for lists makes for a very compact definition:

`add = traverse addB`

Wow. The definition is now so simple that it doesn’t depend on the specific choice of lists. To find out the most general type `add` can have (with this definition), remove the type signature, turn off the monomorphism restriction, and see what GHCi has to say:

`add ∷ Traversable t ⇒ t (Bool,Bool) → State Bool (t Bool)`

This constraint is *very* lenient. `Traversable` can be derived automatically for *all* algebraic data types, including nested/non-regular ones. For instance,

```
data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Functor,Foldable,Traversable)
```

We can now specialize this general `add` back to lists:

```
addLS ∷ [(Bool,Bool)] → State Bool [Bool]
addLS = add
```

We can also specialize for trees:

```
addTS ∷ Tree (Bool,Bool) → State Bool (Tree Bool)
addTS = add
```
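As a quick check that the tree specialization computes the same bits as the list one, here is 3 + 1 again, over a two-leaf tree (ASCII syntax; I’ve added `Show` and `Eq` to the derivations for printing and comparison):

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
import Control.Monad.State  -- from the mtl package

data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Functor, Foldable, Traversable, Show, Eq)

addB :: (Bool,Bool) -> State Bool Bool
addB (a,b) = do cin <- get
                put (a && b || cin && (a /= b))
                return ((a /= b) /= cin)

-- The generic adder: the derived Traversable threads carries
-- left-to-right through the leaves.
add :: Traversable t => t (Bool,Bool) -> State Bool (t Bool)
add = traverse addB

main :: IO ()
main = print (runState (add (Branch (Leaf (True,True)) (Leaf (True,False)))) False)
-- (Branch (Leaf False) (Leaf False),True)
```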

Or for depth-typed perfect trees (e.g., as described in *From tries to trees*):

```
addTnS ∷ IsNat n ⇒
          T n (Bool,Bool) → State Bool (T n Bool)
addTnS = add
```

Binary trees are often better than lists for parallelism, because they allow quick recursive splitting and joining. In the case of ripple adders, we don’t really get parallelism, however, because of the single-threaded (linear) nature of `State`. Can we get around this unfortunate linearization?

### Speculation

The linearity of carry propagation interferes with parallel execution even when using a tree representation. The problem is that each `addB` (full adder) invocation must access the carry out from the previous (immediately less significant) bit position and so must wait for that carry to be computed. Since each bit addition must wait for the previous one to finish, we get linear running time, even with unlimited parallel processing available. If we didn’t have to wait for carries, we could instead get logarithmic running time using the tree representation, since subtrees could be added in parallel.

A way out of this dilemma is to speculatively compute the bit sums for *both* possibilities, i.e., for carry and no carry. We’ll do more work, but much less waiting.

### State memoization

Recall the `State` definition:

`newtype State s a = State (s → (a,s))`

Rather than using a *function* of `s`, let’s use a *table* indexed by `s`. Since `s` is `Bool` in our use, a table is simply a uniform pair, so we could replace `State Bool a` with the following:

`newtype BoolStateTable a = BST ((a,Bool), (a,Bool))`

*Exercise:* define `Functor`, `Applicative`, and `Monad` instances for `BoolStateTable`.

Rather than defining such a specialized type, let’s stand back and consider what’s going on. We’re replacing a function by an isomorphic data type. This replacement is exactly what memoization is about. So let’s define a general *memoizing state monad*:

`newtype StateTrie s a = StateTrie (s ⇰ (a,s))`

Note that the definition of memoizing state is nearly identical to `State`. I’ve simply replaced “`→`” by “`⇰`”, i.e., memo tries. For the (simple) source code of `StateTrie`, see the github project. (Poking around on Hackage, I just found monad-memo, which looks related.)

The full-adder function `addB` is restricted to `State`, but unnecessarily so. The most general type is inferred as

`addB ∷ MonadState Bool m ⇒ (Bool,Bool) → m Bool`

where the `MonadState` class comes from the mtl package.

With the type-generalized `addB`, we get a more general type for `add` as well:

```
add ∷ (Traversable t, Applicative m, MonadState Bool m) ⇒
      t (Bool,Bool) → m (t Bool)
add = traverse addB
```

Now we can specialize `add` to work with memoized state:

```
addLM ∷ [(Bool,Bool)] → StateTrie Bool [Bool]
addLM = add

addTM ∷ Tree (Bool,Bool) → StateTrie Bool (Tree Bool)
addTM = add
```

### What have we done?

The essential tricks in this post are to (a) boost parallelism by speculative evaluation (an old idea) and (b) express speculation as memoization (new, to me at least). The technique wins for binary addition thanks to the small number of possible states, which then makes memoization (full speculation) affordable.

I’m not suggesting that the code above has impressive parallel execution when compiled under GHC. Perhaps it could with some `par` and `pseq` annotations. I haven’t tried. This exploration helps me understand a little of the space of hardware-oriented algorithms.

The conditional sum adder looks quite similar to the development above. It has the twist, however, of speculating carries on blocks of a few bits rather than single bits. It’s astonishingly easy to adapt the development above for such a hybrid scheme, forming traversable structures of sequences of bits:

```
addH ∷ Tree [(Bool,Bool)] → StateTrie Bool (Tree [Bool])
addH = traverse (fromState ∘ add)
```

I’m using the adapter `fromState` so that the inner list additions will use `State` while the outer tree additions will use `StateTrie`, thanks to type inference. This adapter memoizes and rewraps the state transition function:

```
fromState ∷ HasTrie s ⇒ State s a → StateTrie s a
fromState = StateTrie ∘ trie ∘ runState
```

## Geoffrey Irving:

It’s probably worth mentioning the way addition (and similar associative operations) are normally parallelized, namely parallel prefix circuits.

27 November 2012, 9:25 pm

## conal:

Hi Geoffrey. I’m glad you brought up parallel prefix sum, and more generally scans, which I wrote about here recently. As I understand them, parallel prefix (aka “prefix scan”) circuits add many numbers and produce sums of all prefixes (and more generally, associative operations, as you mentioned). The tricky bit is efficiently calculating *all* of the prefix sums, maximizing parallelism (to reduce time) and minimizing redundant computation (to reduce work). If one wants just a single result, then a balanced tree fold is much simpler and faster.

In contrast, the problem I’m considering here is to add just a pair of numbers, operating on the *multiple bits* in parallel, producing a single, multi-bit sum. I don’t know whether there are any interesting connections between pair addition and either scan or fold.

28 November 2012, 11:02 am

## GrahamHutton:

Hi Conal – have you looked at ‘carry save addition’, which uses an alternative representation of binary numbers to avoid the rippling carry in addition? Many moons ago myself and Erik Meijer wrote a paper about this: http://www.cs.nott.ac.uk/~gmh/basics.pdf. It would be nice to see how this could be expressed using more modern FP ideas.

28 November 2012, 2:17 pm

## augustss:

Conal, I think Geoffrey is referring to one of the standard multibit adders which works by parallel prefix, but replacing the full adder by an associative operator.

28 November 2012, 2:25 pm

## conal:

Geoffrey & Lennart: Thanks! I’ll dig into these parallel prefix approaches.

28 November 2012, 7:45 pm

## conal:

Graham: thanks for the pointer to your paper and to carry save addition.

28 November 2012, 7:46 pm