## Parallel speculative addition via memoization

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.

As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the $\left(n+1\right)$th sum & carry require knowing the $n$th carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.

One general technique for improving parallelism is speculation—doing more work than might be needed so that we don’t have to wait to find out exactly what will be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will simply be different type specializations of a single, extremely terse polymorphic definition.

Let’s start with an adder for two one-bit numbers. Because of the possibility of overflow, the result will be two bits, which I’ll call “sum” and “carry”. So that we can chain these one-bit adders, we’ll also add a carry input:

addB ∷ (Bool,Bool) → Bool → (Bool,Bool)

In the result, the first Bool will be the sum, and the second will be the carry. I’ve curried the carry input to make it stand out from the (other) addends.

There are a few ways to define addB in terms of logic operations. I like the following definition, as it shares a little work between sum & carry:

addB (a,b) cin = (axb ≠ cin, anb ∨ (cin ∧ axb))
 where
   axb = a ≠ b
   anb = a ∧ b

I’m using (≠) on Bool for exclusive or.
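In plain ASCII Haskell, with (/=) for (≠), (&&) for (∧), and (||) for (∨), the same full adder can be written and checked directly (a self-contained sketch, not code from the post):

```haskell
-- Full adder on Bools: (sum, carry) from two addend bits and a carry-in.
-- (/=) on Bool is exclusive or.
addB :: (Bool, Bool) -> Bool -> (Bool, Bool)
addB (a, b) cin = (axb /= cin, anb || (cin && axb))
  where
    axb = a /= b  -- a XOR b, shared between sum and carry
    anb = a && b  -- a AND b
```

For instance, addB (True,True) False yields (False,True): one plus one is zero, carry one.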

Now suppose we have not just two bits, but two sequences of bits, interpreted as binary numbers arranged from least to most significant bit. For simplicity, I’d like to assume that these sequences have the same length, so rather than taking a pair of bit lists, let’s take a list of bit pairs:

add ∷ [(Bool,Bool)] → Bool → ([Bool],Bool)

To implement add, traverse the list of bit pairs, threading the carries:

add [] c     = ([]  , c)
add (p:ps) c = (s:ss, c'')
 where
   (s ,c' ) = addB p c
   (ss,c'') = add ps c'
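Here is a self-contained ASCII rendering of the ripple adder, with a worked example (the helper names match the definitions above; the Unicode operators become (/=), (&&), (||)):

```haskell
-- Full adder, as above.
addB :: (Bool, Bool) -> Bool -> (Bool, Bool)
addB (a, b) cin = (axb /= cin, anb || (cin && axb))
  where
    axb = a /= b
    anb = a && b

-- Ripple-carry addition over a list of bit pairs, least significant first.
-- Each step's carry-out feeds the next step's carry-in.
add :: [(Bool, Bool)] -> Bool -> ([Bool], Bool)
add []     c = ([], c)
add (p:ps) c = (s:ss, c'')
  where
    (s , c' ) = addB p c
    (ss, c'') = add ps c'
```

Adding 5 (bits [True,False,True]) and 6 (bits [False,True,True]) with no carry-in gives ([True,True,False], True), i.e., 11.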

### State

This add definition contains a familiar pattern. The carry values act as a sort of state that gets updated in a linear (non-branching) way. The State monad captures this pattern of computation:

newtype State s a = State (s → (a,s))

By using State and its Monad instance, we can shorten our add definition. First we’ll need a new full adder definition, tweaked for State:

addB (a,b) = do cin ← get
                put (anb ∨ cin ∧ axb)
                return (axb ≠ cin)
 where
   anb = a ∧ b
   axb = a ≠ b

The new add threads the carry implicitly:

add []     = return []
add (p:ps) = do s  ← addB p
                ss ← add ps
                return (s:ss)

We don’t really need the Monad interface to define add. The simpler and more general Applicative interface suffices:
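One way the Applicative version might look (a sketch against the State type from the transformers library, rather than the hand-rolled one above; liftA2 sequences its arguments’ effects left to right, which threads the carry in the right order):

```haskell
import Control.Applicative (liftA2)
import Control.Monad.Trans.State (State, get, put, runState)

-- Full adder in State, with the carry as the state (as in the post).
addB :: (Bool, Bool) -> State Bool Bool
addB (a, b) = do cin <- get
                 put (anb || (cin && axb))
                 return (axb /= cin)
  where
    anb = a && b
    axb = a /= b

-- No bind needed: Applicative combinators suffice.
add :: [(Bool, Bool)] -> State Bool [Bool]
add []     = pure []
add (p:ps) = liftA2 (:) (addB p) (add ps)
```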

This pattern also looks familiar. Oh — the Traversable instance for lists makes for a very compact definition:
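Spelled out, the compact definition is a one-liner: traverse does all the carry plumbing (a self-contained sketch; addB here repeats the State full adder above, in ASCII):

```haskell
import Control.Monad.Trans.State (State, get, put, runState)

-- Full adder in State, carry as the state.
addB :: (Bool, Bool) -> State Bool Bool
addB (a, b) = do cin <- get
                 put (anb || (cin && axb))
                 return (axb /= cin)
  where
    anb = a && b
    axb = a /= b

-- The entire adder: traverse threads the carry through any Traversable.
add :: Traversable t => t (Bool, Bool) -> State Bool (t Bool)
add = traverse addB
```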

Wow. The definition is now so simple that it doesn’t depend on the specific choice of lists. To find out the most general type add can have (with this definition), remove the type signature, turn off the monomorphism restriction, and see what GHCi has to say:

add ∷ Traversable t ⇒ t (Bool,Bool) → State Bool (t Bool)

This constraint is very lenient. Traversable can be derived automatically for all algebraic data types, including nested/non-regular ones.

For instance, with the DeriveFunctor, DeriveFoldable, and DeriveTraversable language extensions enabled,

data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Functor,Foldable,Traversable)

We can now specialize this general add back to lists (call the specialization, say, addLS):

addLS ∷ [(Bool,Bool)] → State Bool [Bool]
addLS = add

We can also specialize for trees:

addTS ∷ Tree (Bool,Bool) → State Bool (Tree Bool)
addTS = add

Or for depth-typed perfect trees (e.g., as described in From tries to trees):

T n (Bool,Bool) → State Bool (T n Bool)

Binary trees are often better than lists for parallelism, because they allow quick recursive splitting and joining. In the case of ripple adders, we don’t really get parallelism, however, because of the single-threaded (linear) nature of State. Can we get around this unfortunate linearization?

### Speculation

The linearity of carry propagation interferes with parallel execution even when using a tree representation. The problem is that each addB (full adder) invocation must access the carry out from the previous (immediately less significant) bit position and so must wait for that carry to be computed. Since each bit addition must wait for the previous one to finish, we get linear running time, even with unlimited parallel processing available. If we didn’t have to wait for carries, we could instead get logarithmic running time using the tree representation, since subtrees could be added in parallel.

A way out of this dilemma is to speculatively compute the bit sums for both possibilities, i.e., for carry and no carry. We’ll do more work, but much less waiting.

### State memoization

Recall the State definition:

newtype State s a = State (s → (a,s))

Rather than using a function of s, let’s use a table indexed by s. Since s is Bool in our use, a table is simply a uniform pair, so we could replace State Bool a with the following:

newtype BoolStateTable a = BST ((a,Bool), (a,Bool))

Exercise: define Functor, Applicative, and Monad instances for BoolStateTable.
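For reference, here is one possible solution sketch (these instance definitions are my own, assuming the intended semantics: the pair tabulates the state transition at both possible Bool states, fst for a False state-in and snd for True):

```haskell
-- Bool-indexed state table (name shortened from BoolStateTable).
newtype BST a = BST ((a, Bool), (a, Bool))

-- Look up the table at a given state-in.
runBST :: BST a -> Bool -> (a, Bool)
runBST (BST (f, _)) False = f
runBST (BST (_, t)) True  = t

instance Functor BST where
  fmap g (BST ((a, s), (b, t))) = BST ((g a, s), (g b, t))

instance Applicative BST where
  pure a    = BST ((a, False), (a, True))  -- leave the state unchanged
  mf <*> mx = BST (go False, go True)      -- thread the state, per table entry
    where go s0 = let (f, s1) = runBST mf s0
                      (x, s2) = runBST mx s1
                  in (f x, s2)

instance Monad BST where
  m >>= k = BST (go False, go True)
    where go s0 = let (a, s1) = runBST m s0
                  in runBST (k a) s1

-- get/put analogues, for trying it out:
getB :: BST Bool
getB = BST ((False, False), (True, True))

putB :: Bool -> BST ()
putB s = BST (((), s), ((), s))
```

Note that both table entries are always built, which is exactly the speculation: the computation proceeds for both possible carry-ins.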

Rather than defining such a specialized type, let’s stand back and consider what’s going on. We’re replacing a function by an isomorphic data type. This replacement is exactly what memoization is about. So let’s define a general memoizing state monad:

newtype StateTrie s a = StateTrie (s ⇰ (a,s))

Note that the definition of memoizing state is nearly identical to State. I’ve simply replaced “→” by “⇰”, i.e., memo tries. For the (simple) source code of StateTrie, see the github project. (Poking around on Hackage, I just found monad-memo, which looks related.)

The full-adder function addB is restricted to State, but unnecessarily so. The most general type is inferred as

addB ∷ MonadState Bool m ⇒ (Bool,Bool) → m Bool

where the MonadState class comes from the mtl package.

With the type-generalized addB, we get a more general type for add as well:

add ∷ (Traversable t, MonadState Bool m) ⇒ t (Bool,Bool) → m (t Bool)

Now we can specialize add to work with memoized state:

addTM ∷ Tree (Bool,Bool) → StateTrie Bool (Tree Bool)
addTM = add

### What have we done?

The essential tricks in this post are to (a) boost parallelism by speculative evaluation (an old idea) and (b) express speculation as memoization (new, to me at least). The technique wins for binary addition thanks to the small number of possible states, which then makes memoization (full speculation) affordable.

I’m not suggesting that the code above has impressive parallel execution when compiled under GHC. Perhaps it could with some par and pseq annotations. I haven’t tried. This exploration helps me understand a little of the space of hardware-oriented algorithms.

The conditional sum adder looks quite similar to the development above. It has the twist, however, of speculating carries on blocks of a few bits rather than single bits. It’s astonishingly easy to adapt the development above for such a hybrid scheme, forming traversable structures of sequences of bits:

addH ∷ Tree [(Bool,Bool)] → StateTrie Bool (Tree [Bool])

I’m using the adapter fromState so that the inner list additions will use State while the outer tree additions will use StateTrie, thanks to type inference. This adapter memoizes and rewraps the state transition function:

fromState ∷ HasTrie s ⇒ State s a → StateTrie s a
fromState = StateTrie ∘ trie ∘ runState
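Setting MemoTrie aside, the Bool case of this adapter is easy to spell out concretely (a self-contained sketch with a hand-rolled State and a pair playing the role of the Bool memo trie; fromStateB and BoolStateTrie are my names, standing in for fromState and StateTrie with s specialized to Bool):

```haskell
-- Hand-rolled state monad carrier, as in the post.
newtype State s a = State { runState :: s -> (a, s) }

-- A pair is a memo trie for functions out of Bool:
-- fst holds the value at False, snd the value at True.
newtype BoolStateTrie a = BoolStateTrie ((a, Bool), (a, Bool))

-- Tabulate the transition function at both possible state-ins,
-- speculatively: the Bool instance of the post's fromState.
fromStateB :: State Bool a -> BoolStateTrie a
fromStateB (State f) = BoolStateTrie (f False, f True)

-- Reading the table back recovers the original function.
runBoolStateTrie :: BoolStateTrie a -> Bool -> (a, Bool)
runBoolStateTrie (BoolStateTrie (f, _)) False = f
runBoolStateTrie (BoolStateTrie (_, t)) True  = t
```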

1. #### Geoffrey Irving:

It’s probably worth mentioning the way addition (and similar associative operations) are normally parallelized, namely parallel prefix circuits.

2. #### conal:

Hi Geoffrey. I’m glad you brought up parallel prefix sum, and more generally scans, which I wrote about here recently. As I understand them, parallel prefix (aka “prefix scan”) circuits add many numbers and produce sums of all prefixes (and more generally, associative operations, as you mentioned). The tricky bit is efficiently calculating all of the prefix sums, maximizing parallelism (to reduce time) and minimizing redundant computation (to reduce work). If one wants just a single result, then a balanced tree fold is much simpler and faster.

In contrast, the problem I’m considering here is to add just a pair of numbers operating on the multiple bits in parallel, producing a single, multi-bit sum. I don’t know whether there are any interesting connections between pair addition and either scan or fold.

3. #### GrahamHutton:

Hi Conal – have you looked at ‘carry save addition’, which uses an alternative representation of binary numbers to avoid the rippling carry in addition? Many moons ago myself and Erik Meijer wrote a paper about this: http://www.cs.nott.ac.uk/~gmh/basics.pdf. It would be nice to see how this could be expressed using more modern FP ideas.

4. #### augustss:

Conal, I think Geoffrey is referring to one of the standard multibit adders which works by parallel prefix, but replacing the full adder by an associative operator.

5. #### conal:

Geoffrey & Lennart: Thanks! I’ll dig into these parallel prefix approaches.

6. #### conal:

Graham: thanks for the pointer to your paper and to carry save addition.