## Circuits as a bicartesian closed category

My previous few posts have been about cartesian closed categories (CCCs). In From Haskell to hardware via cartesian closed categories, I gave a brief motivation: typed lambda expressions and the CCC vocabulary are equally expressive, but have different strengths:

• In Haskell, the CCC vocabulary is overloadable and so can be interpreted more flexibly than lambda and application.
• Lambda expressions are friendlier for human programmers to write and read.

An interpretation I’m especially keen on—and the one that inspired this series of posts—is circuits, as described in this post.

## Optimizing CCCs

In the post Overloading lambda, I gave a translation from a typed lambda calculus into the vocabulary of cartesian closed categories (CCCs). This simple translation leads to unnecessarily complex expressions. For instance, the simple lambda term, “`λ ds → (λ (a,b) → (b,a)) ds`”, translated to a rather complicated CCC term:

``apply ∘ (curry (apply ∘ (apply ∘ (const (,) △ (id ∘ exr) ∘ exr) △ (id ∘ exl) ∘ exr)) △ id)``

(Recall from the previous post that `(∘)` binds more tightly than `(△)` and `(▽)`.)

However, we can do much better, translating to

``exr △ exl``

which says to pair the right and left halves of the argument pair, i.e., swap.

This post applies some equational properties to greatly simplify/optimize the result of translation to CCC form, including example above. First I’ll show the equational reasoning and then how it’s automated in the lambda-ccc library.

Haskell’s type class facility is a powerful abstraction mechanism. Using it, we can overload multiple interpretations onto a single vocabulary, with each interpretation corresponding to a different type. The class laws constrain these interpretations and allow reasoning that is valid over all (law-abiding) instances—even ones not yet defined.

As Haskell is a higher-order functional language in the heritage of Church’s (typed) lambda calculus, it also supports “lambda abstraction”.

Sadly, however, these two forms of abstraction don’t go together. When we use the vocabulary of lambda abstraction (“`λ x → ⋯`”) and application (“`u v`”), our expressions can only be interpreted as one type (constructor), namely functions. (Note that I am not talking about parametric polymorphism, which is available with both lambda abstraction and type-class-style overloading.) Is it possible to overload lambda and application using type classes, or perhaps in the same spirit? The answer is yes, and there are some wonderful benefits of doing so. I’ll explain the how in this post and hint at the why, to be elaborated in futures posts.

## From Haskell to hardware via cartesian closed categories

Since fall of last year, I’ve been working at Tabula, a Silicon Valley start-up developing an innovative programmable hardware architecture called “Spacetime”, somewhat similar to an FPGA, but much more flexible and efficient. I met the founder, Steve Teig, at a Bay Area Haskell Hackathon in February of 2011. He described his Spacetime architecture, which is based on the geometry of the same name, developed by Hermann Minkowski to elegantly capture Einstein’s theory of special relativity. Within the first 30 seconds or so of hearing what Steve was up to, I knew I wanted to help.

The vision Steve shared with me included not only a better alternative for hardware designers (programmed in hardware languages like Verilog and VHDL), but also a platform for massively parallel execution of software written in a purely functional language. Lately, I’ve been working mainly on this latter aspect, and specifically on the problem of how to compile Haskell. Our plan is to develop the Haskell compiler openly and encourage collaboration. If anything you see in this blog series interests you, and especially if have advice or you’d like to collaborate on the project, please let me know.

In my next series of blog posts, I’ll describe some of the technical ideas I’ve been working with for compiling Haskell for massively parallel execution. For now, I want to introduce a central idea I’m using to approach the problem.

## Reimagining matrices

The function of the imagination is not
to make strange things settled, so much as
to make settled things strange.

- G.K. Chesterton

Why is matrix multiplication defined so very differently from matrix addition? If we didn’t know these procedures, could we derive them from first principles? What might those principles be?

This post gives a simple semantic model for matrices and then uses it to systematically derive the implementations that we call matrix addition and multiplication. The development illustrates what I call “denotational design”, particularly with type class morphisms. On the way, I give a somewhat unusual formulation of matrices and accompanying definition of matrix “multiplication”.

For more details, see the linear-map-gadt source code.

Edits:

• 2012–12–17: Replaced lost $B$ entries in description of matrix addition. Thanks to Travis Cardwell.

Note: I’m using MathML for the math below, which appears to work well on Firefox but on neither Safari nor Chrome. I use Pandoc to generate the HTML+MathML from markdown+lhs+LaTeX. There’s probably a workaround using different Pandoc settings and requiring some tweaks to my WordPress installation. If anyone knows how (especially the WordPress end), I’d appreciate some pointers.

## Parallel speculative addition via memoization

I’ve been thinking much more about parallel computation for the last couple of years, especially since starting to work at Tabula a year ago. Until getting into parallelism explicitly, I’d naïvely thought that my pure functional programming style was mostly free of sequential bias. After all, functional programming lacks the implicit accidental dependencies imposed by the imperative model. Now, however, I’m coming to see that designing parallel-friendly algorithms takes attention to minimizing the depth of the remaining, explicit data dependencies.

As an example, consider binary addition, carried out from least to most significant bit (as usual). We can immediately compute the first (least significant) bit of the result, but in order to compute the second bit, we’ll have to know whether or not a carry resulted from the first addition. More generally, the $\left(n+1\right)$th sum & carry require knowing the $n$th carry, so this algorithm does not allow parallel execution. Even if we have one processor per bit position, only one processor will be able to work at a time, due to the linear chain of dependencies.

One general technique for improving parallelism is speculation—doing more work than might be needed so that we don’t have to wait to find out exactly what will be needed. In this post, we’ll see a progression of definitions for bitwise addition. We’ll start with a linear-depth chain of carry dependencies and end with logarithmic depth. Moreover, by making careful use of abstraction, these versions will be simply different type specializations of a single polymorphic definition with an extremely terse definition.

## A third view on trees

A few recent posts have played with trees from two perspectives. The more commonly used I call "top-down", because the top-level structure is most immediately apparent. A top-down binary tree is either a leaf or a pair of such trees, and that pair can be accessed without wading through intervening structure. Much less commonly used are "bottom-up" trees. A bottom-up binary tree is either a leaf or a single such tree of pairs. In the non-leaf case, the pair structure of the tree elements is accessible by operations like mapping, folding, or scanning. The difference is between a pair of trees and a tree of pairs.

As an alternative to the top-down and bottom-up views on trees, I now want to examine a third view, which is a hybrid of the two. Instead of pairs of trees or trees of pairs, this hybrid view is of trees of trees, and more specifically of bottom-up trees of top-down trees. As we’ll see, these hybrid trees emerge naturally from the top-down and bottom-up views. A later post will show how this third view lends itself to an in-place (destructive) scan algorithm, suitable for execution on modern GPUs.

Edits:

• 2011-06-04: "Suppose we have a bottom-up tree of top-down trees, i.e., `t ∷ TB (TT a)`. Was backwards. (Thanks to Noah Easterly.)
• 2011-06-04: Notation: "`f ➶ n`" and "`f ➴ n`".

Continue reading ‘A third view on trees’ »

## Parallel tree scanning by composition

My last few blog posts have been on the theme of scans, and particularly on parallel scans. In Composable parallel scanning, I tackled parallel scanning in a very general setting. There are five simple building blocks out of which a vast assortment of data structures can be built, namely constant (no value), identity (one value), sum, product, and composition. The post defined parallel prefix and suffix scan for each of these five "functor combinators", in terms of the same scan operation on each of the component functors. Every functor built out of this basic set thus has a parallel scan. Functors defined more conventionally can be given scan implementations simply by converting to a composition of the basic set, scanning, and then back to the original functor. Moreover, I expect this implementation could be generated automatically, similarly to GHC’s `DerivingFunctor` extension.

Now I’d like to show two examples of parallel scan composition in terms of binary trees, namely the top-down and bottom-up variants of perfect binary leaf trees used in previous posts. (In previous posts, I used the terms "right-folded" and "left-folded" instead of "top-down" and "bottom-up".) The resulting two algorithms are expressed nearly identically, but have differ significantly in the work performed. The top-down version does $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$ work, while the bottom-up version does only $\Theta \left(n\right)$, and thus the latter algorithm is work-efficient, while the former is not. Moreover, with a very simple optimization, the bottom-up tree algorithm corresponds closely to Guy Blelloch’s parallel prefix scan for arrays, given in Programming parallel algorithms. I’m delighted with this result, as I had been wondering how to think about Guy’s algorithm.

Edit:

• 2011-05-31: Added `Scan` and `Applicative` instances for `T2` and `T4`.

Continue reading ‘Parallel tree scanning by composition’ »

## Composable parallel scanning

The post Deriving list scans gave a simple specification of the list-scanning functions `scanl` and `scanr`, and then transformed those specifications into the standard optimized implementations. Next, the post Deriving parallel tree scans adapted the specifications and derivations to a type of binary trees. The resulting implementations are parallel-friendly, but not work-efficient, in that they perform $n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n$ work vs linear work as in the best-known sequential algorithm.

Besides the work-inefficiency, I don’t know how to extend the critical `initTs` and `tailTs` functions (analogs of `inits` and `tails` on lists) to depth-typed, perfectly balanced trees, of the sort I played with in A trie for length-typed vectors and From tries to trees. The difficulty I encounter is that the functions `initTs` and `tailTs` make unbalanced trees out of balanced ones, so I don’t know how to adapt the specifications when types prevent the existence of unbalanced trees.

This new post explores an approach to generalized scanning via type classes. After defining the classes and giving a simple example, I’ll give a simple & general framework based on composing functor combinators.

Edits:

• 2011-03-02: Fixed typo. "constant functor is easiest" (instead of "identity functor"). Thanks, frguybob.
• 2011-03-05: Removed final unfinished sentence.
• 2011-07-28: Replace "`assocL`" with "`assocR`" in `prefixScan` derivation for `g ∘ f`.

Continue reading ‘Composable parallel scanning’ »

## Deriving parallel tree scans

The post Deriving list scans explored folds and scans on lists and showed how the usual, efficient scan implementations can be derived from simpler specifications.

Let’s see now how to apply the same techniques to scans over trees.

This new post is one of a series leading toward algorithms optimized for execution on massively parallel, consumer hardware, using CUDA or OpenCL.

Edits:

• 2011-03-01: Added clarification about "`∅`" and "`(⊕)`".
• 2011-03-23: corrected "linear-time" to "linear-work" in two places.

Continue reading ‘Deriving parallel tree scans’ »