I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.
As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the (k+1)th sum & carry require knowing the kth carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.
One general technique for improving parallelism is speculation—doing more work than might be needed so that we don’t have to wait to find out exactly what will be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will simply be different type specializations of a single, extremely terse polymorphic definition.
A full adder
Let’s start with an adder for two one-bit numbers. Because of the possibility of overflow, the result will be two bits, which I’ll call “sum” and “carry”. So that we can chain these one-bit adders, we’ll also add a carry input.
addB ∷ (Bool,Bool) → Bool → (Bool,Bool)
In the result, the first Bool will be the sum, and the second will be the carry. I’ve curried the carry input to make it stand out from the (other) addends.
There are a few ways to define addB in terms of logic operations. I like the following definition, as it shares a little work between sum & carry:
addB (a,b) cin = (axb ≠ cin, anb ∨ (cin ∧ axb))
 where
   axb = a ≠ b
   anb = a ∧ b
Here I’m using inequality (≠) on Bool for exclusive or.
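As a sanity check, the full adder above can be tested exhaustively against its arithmetic specification: the sum bit is the parity of the three inputs, and the carry-out is their majority. Here is a runnable ASCII sketch, assuming the post’s ≠, ∧, ∨ correspond to (/=), (&&), (||) on Bool (the helper ok is mine, not from the post):

```haskell
-- ASCII rendering of the post's full adder.
addB :: (Bool, Bool) -> Bool -> (Bool, Bool)
addB (a, b) cin = (axb /= cin, anb || (cin && axb))
 where
   axb = a /= b   -- exclusive or of the addends
   anb = a && b   -- carry generated by the addends alone

-- Specification check: sum = parity of inputs, carry-out = majority.
ok :: Bool -> Bool -> Bool -> Bool
ok a b c = (s, cout) == (odd total, total >= 2)
 where
   (s, cout) = addB (a, b) c
   total     = fromEnum a + fromEnum b + fromEnum c

main :: IO ()
main = print (and [ ok a b c | a <- bools, b <- bools, c <- bools ])
 where bools = [False, True]
```

Running main checks all eight input combinations at once.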
A ripple carry adder
Now suppose we have not just two bits, but two sequences of bits, interpreted as binary numbers arranged from least to most significant bit. For simplicity, I’d like to assume that these sequences have the same length, so rather than taking a pair of bit lists, let’s take a list of bit pairs:
add ∷ [(Bool,Bool)] → Bool → ([Bool],Bool)
To define add, traverse the list of bit pairs, threading the carries:
add []     c = ([], c)
add (p:ps) c = (s:ss, c'')
 where
   (s ,c' ) = addB p c
   (ss,c'') = add ps c'
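This ripple-carry adder really does implement addition, which we can check by converting to and from integers. A runnable sketch, again using (/=), (&&), (||) for the post’s operators; toBits and fromBits are helpers I’ve added for testing, not part of the post:

```haskell
addB :: (Bool, Bool) -> Bool -> (Bool, Bool)
addB (a, b) cin = (axb /= cin, anb || (cin && axb))
 where
   axb = a /= b
   anb = a && b

-- Ripple-carry addition over a list of bit pairs, threading the carry.
add :: [(Bool, Bool)] -> Bool -> ([Bool], Bool)
add []     c = ([], c)
add (p:ps) c = (s:ss, c'')
 where
   (s , c' ) = addB p c
   (ss, c'') = add ps c'

-- Least-significant-bit-first encoding, padded to width w.
toBits :: Int -> Int -> [Bool]
toBits w n = [ odd (n `div` 2^i) | i <- [0 .. w-1] ]

fromBits :: [Bool] -> Int
fromBits = foldr (\b acc -> fromEnum b + 2 * acc) 0

main :: IO ()
main = print (and [ run m n | m <- [0..15], n <- [0..15] ])
 where
   -- The carry-out carries weight 2^4 = 16 for 4-bit inputs.
   run m n = let (ss, cout) = add (zip (toBits 4 m) (toBits 4 n)) False
             in fromBits ss + 16 * fromEnum cout == m + n
```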
This add definition contains a familiar pattern. The carry values act as a sort of state that gets updated in a linear (non-branching) way. The State monad captures this pattern of computation:
newtype State s a = State (s → (a,s))
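The post relies on this State having the usual Functor, Applicative, and Monad instances, plus get and put. For completeness, here is a minimal self-contained sketch of what those look like for the hand-rolled newtype (the production versions live in the transformers package as a special case of StateT):

```haskell
newtype State s a = State (s -> (a, s))

runState :: State s a -> s -> (a, s)
runState (State f) = f

get :: State s s          -- read the current state
get = State (\s -> (s, s))

put :: s -> State s ()    -- overwrite the state
put s = State (\_ -> ((), s))

instance Functor (State s) where
  fmap f m = State (\s -> let (a, s') = runState m s in (f a, s'))

instance Applicative (State s) where
  pure a    = State (\s -> (a, s))
  mf <*> ma = State (\s -> let (f, s' ) = runState mf s
                               (a, s'') = runState ma s'
                           in (f a, s''))

instance Monad (State s) where
  m >>= k = State (\s -> let (a, s') = runState m s
                         in runState (k a) s')

main :: IO ()
main = print (runState (put 3 >> fmap (+ 1) get) (0 :: Int))
```

The demo sets the state to 3, reads it back, and adds one to the result, so it prints (4,3).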
Given State and its Monad instance, we can shorten our add definition. First we’ll need a new full adder definition, tweaked for State:
addB ∷ (Bool,Bool) → State Bool Bool
addB (a,b) = do cin ← get
                put (anb ∨ cin ∧ axb)
                return (axb ≠ cin)
 where
   anb = a ∧ b
   axb = a ≠ b
And then the multi-bit adder:
add ∷ [(Bool,Bool)] → State Bool [Bool]
add []     = return []
add (p:ps) = do s  ← addB p
                ss ← add ps
                return (s:ss)
We don’t really need the Monad interface to define add. The simpler and more general Applicative interface suffices:
add []     = pure []
add (p:ps) = liftA2 (:) (addB p) (add ps)
This pattern also looks familiar. Oh — the Traversable instance for lists makes for a very compact definition:
add = traverse addB
Wow. The definition is now so simple that it doesn’t depend on the specific choice of lists. To find out the most general type add can have (with this definition), remove the type signature, turn off the monomorphism restriction, and see what GHCi has to say:
add ∷ Traversable t ⇒ t (Bool,Bool) → State Bool (t Bool)
This constraint is very lenient.
Traversable can be derived automatically for all algebraic data types, including nested/non-regular ones.
data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving (Functor,Foldable,Traversable)
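Putting the pieces together, the traversal-based adder works on such a tree without any change to add. A runnable sketch, using the State monad from the mtl package (which ships with GHC) in place of the post’s hand-rolled one, and ASCII operators for ≠, ∧, ∨; the Eq and Show derivations are mine, added for testing:

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
import Control.Monad.State

data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Functor, Foldable, Traversable, Show, Eq)

addB :: (Bool, Bool) -> State Bool Bool
addB (a, b) = do cin <- get
                 put (anb || cin && axb)
                 return (axb /= cin)
 where
   anb = a && b
   axb = a /= b

-- Same one-liner as in the post, now usable at any Traversable shape.
add :: Traversable t => t (Bool, Bool) -> State Bool (t Bool)
add = traverse addB

main :: IO ()
main = print (runState (add t) False)
 where
   -- Leaves hold bit pairs, least significant first in traversal order.
   -- Adding 2 (bits [F,T]) and 3 (bits [T,T]) gives 5: sum bits [T,F]
   -- with a carry out of True (weight 4).
   t = Branch (Leaf (False, True)) (Leaf (True, True))
```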
We can now specialize this general add back to lists:
addLS ∷ [(Bool,Bool)] → State Bool [Bool]
addLS = add
We can also specialize for trees:
addTS ∷ Tree (Bool,Bool) → State Bool (Tree Bool)
addTS = add
Or for depth-typed perfect trees (e.g., as described in From tries to trees):
addTnS ∷ IsNat n ⇒ T n (Bool,Bool) → State Bool (T n Bool)
addTnS = add
Binary trees are often better than lists for parallelism, because they allow quick recursive splitting and joining. In the case of ripple adders, we don’t really get parallelism, however, because of the single-threaded (linear) nature of State. Can we get around this unfortunate linearization?
The linearity of carry propagation interferes with parallel execution even when using a tree representation. The problem is that each addB (full adder) invocation must access the carry out from the previous (immediately less significant) bit position and so must wait for that carry to be computed. Since each bit addition must wait for the previous one to finish, we get linear running time, even with unlimited parallel processing available. If we didn’t have to wait for carries, we could instead get logarithmic running time using the tree representation, since subtrees could be added in parallel.
A way out of this dilemma is to speculatively compute the bit sums for both possibilities, i.e., for carry and no carry. We’ll do more work, but much less waiting.
Recall the definition of State:
newtype State s a = State (s → (a,s))
Rather than using a function of s, let’s use a table indexed by s. Since s is Bool in our use, a table is simply a uniform pair, so we could replace State Bool a with the following:
newtype BoolStateTable a = BST ((a,Bool), (a,Bool))
and then define Functor, Applicative, and Monad instances for this new type.
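Here is a minimal sketch of what those instances might look like. The helpers unBST, tabulate, getT, and putT are hypothetical names of my own, not from the post; tabulate is where the speculation happens, since it evaluates the transition function at both possible input states up front:

```haskell
newtype BoolStateTable a = BST ((a, Bool), (a, Bool))

-- Read the table back as a state-transition function: the first entry
-- holds (result, new state) for input state False, the second for True.
unBST :: BoolStateTable a -> Bool -> (a, Bool)
unBST (BST (p, _)) False = p
unBST (BST (_, q)) True  = q

-- Build the table by evaluating a transition function at both states;
-- both entries are computed whether or not they end up being needed.
tabulate :: (Bool -> (a, Bool)) -> BoolStateTable a
tabulate f = BST (f False, f True)

instance Functor BoolStateTable where
  fmap f m = tabulate (\s -> let (a, s') = unBST m s in (f a, s'))

instance Applicative BoolStateTable where
  pure a    = tabulate (\s -> (a, s))
  mf <*> ma = tabulate (\s -> let (f, s' ) = unBST mf s
                                  (a, s'') = unBST ma s'
                              in (f a, s''))

instance Monad BoolStateTable where
  m >>= k = tabulate (\s -> let (a, s') = unBST m s
                            in unBST (k a) s')

-- State-like operations on the table representation.
getT :: BoolStateTable Bool
getT = tabulate (\s -> (s, s))

putT :: Bool -> BoolStateTable ()
putT b = tabulate (\_ -> ((), b))

main :: IO ()
main = print (unBST (putT True >> getT) False)
```

The demo sets the state to True and reads it back, starting from False.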
Rather than defining such a specialized type, let’s stand back and consider what’s going on. We’re replacing a function by an isomorphic data type. This replacement is exactly what memoization is about. So let’s define a general memoizing state monad:
newtype StateTrie s a = StateTrie (s ⇰ (a,s))
Note that the definition of memoizing state is nearly identical to State. I’ve simply replaced “→” by “⇰”, i.e., memo tries. For the (simple) source code of StateTrie, see the github project. (Poking around on Hackage, I just found monad-memo, which looks related.)
The full-adder function addB is restricted to State, but unnecessarily so. The most general type is inferred as
addB ∷ MonadState Bool m ⇒ (Bool,Bool) → m Bool
The MonadState class comes from the mtl package.
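The generalized adder really does type-check against MonadState, which we can confirm by running it in mtl’s own State monad. A runnable sketch with ASCII operators standing in for the post’s ≠, ∧, ∨:

```haskell
import Control.Monad.State

-- The generalized full adder: any monad with Bool state will do.
addB :: MonadState Bool m => (Bool, Bool) -> m Bool
addB (a, b) = do cin <- get
                 put (anb || cin && axb)
                 return (axb /= cin)
 where
   anb = a && b
   axb = a /= b

add :: (Traversable t, MonadState Bool m) => t (Bool, Bool) -> m (t Bool)
add = traverse addB

main :: IO ()
main = print (runState (add pairs) False)
 where
   -- 1 + 1 = 2: sum bits [False,True] (lsb first), no carry out.
   pairs = [(True, True), (False, False)]
```

The same add could equally be run with StateTrie, given its MonadState instance.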
With the type-generalized addB, we get a more general type for add as well:
add ∷ (Traversable t, Applicative m, MonadState Bool m) ⇒ t (Bool,Bool) → m (t Bool)
add = traverse addB
Now we can specialize add to work with memoized state:
addLM ∷ [(Bool,Bool)] → StateTrie Bool [Bool]
addLM = add

addTM ∷ Tree (Bool,Bool) → StateTrie Bool (Tree Bool)
addTM = add
What have we done?
The essential tricks in this post are to (a) boost parallelism by speculative evaluation (an old idea) and (b) express speculation as memoization (new, to me at least). The technique wins for binary addition thanks to the small number of possible states, which then makes memoization (full speculation) affordable.
I’m not suggesting that the code above has impressive parallel execution when compiled under GHC. Perhaps it could with some pseq annotations. I haven’t tried. This exploration helps me understand a little of the space of hardware-oriented algorithms.
The conditional sum adder looks quite similar to the development above. It has the twist, however, of speculating carries on blocks of a few bits rather than single bits. It’s astonishingly easy to adapt the development above for such a hybrid scheme, forming traversable structures of sequences of bits:
addH ∷ Tree [(Bool,Bool)] → StateTrie Bool (Tree [Bool])
addH = traverse (fromState ∘ add)
I’m using the adapter fromState so that the inner list additions will use State while the outer tree additions will use StateTrie, thanks to type inference. This adapter memoizes and rewraps the state transition function:
fromState ∷ HasTrie s ⇒ State s a → StateTrie s a
fromState = StateTrie ∘ trie ∘ runState