Parallel tree scanning by composition

My last few blog posts have been on the theme of scans, and particularly on parallel scans. In Composable parallel scanning, I tackled parallel scanning in a very general setting. There are five simple building blocks out of which a vast assortment of data structures can be built, namely constant (no value), identity (one value), sum, product, and composition. The post defined parallel prefix and suffix scan for each of these five "functor combinators", in terms of the same scan operation on each of the component functors. Every functor built out of this basic set thus has a parallel scan. Functors defined more conventionally can be given scan implementations simply by converting to a composition of the basic set, scanning, and then back to the original functor. Moreover, I expect this implementation could be generated automatically, similarly to GHC’s `DerivingFunctor` extension.

Now I’d like to show two examples of parallel scan composition in terms of binary trees, namely the top-down and bottom-up variants of perfect binary leaf trees used in previous posts. (In previous posts, I used the terms "right-folded" and "left-folded" instead of "top-down" and "bottom-up".) The resulting two algorithms are expressed nearly identically, but have differ significantly in the work performed. The top-down version does $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$ work, while the bottom-up version does only $\Theta \left(n\right)$, and thus the latter algorithm is work-efficient, while the former is not. Moreover, with a very simple optimization, the bottom-up tree algorithm corresponds closely to Guy Blelloch’s parallel prefix scan for arrays, given in Programming parallel algorithms. I’m delighted with this result, as I had been wondering how to think about Guy’s algorithm.

Edit:

• 2011-05-31: Added `Scan` and `Applicative` instances for `T2` and `T4`.

Scanning via functor combinators

In Composable parallel scanning, we saw the `Scan` class:

``class Scan f where  prefixScan, suffixScan ∷ Monoid m ⇒ f m → (m, f m)``

Given a structure of values, the prefix and suffix scan methods generate the overall `fold` (of type `m`), plus a structure of the same type as the input. (In contrast, the usual Haskell `scanl` and `scanr` functions on lists yield a single list with one more element than the source list. I changed the interface for generality and composability.) The post gave instances for the basic set of five functor combinators.

Most functors are not defined via the basic combinators, but as mentioned above, we can scan by conversion to and from the basic set. For convenience, encapsulate this conversion in a type class:

``class EncodeF f where  type Enc f ∷ * → *  encode ∷ f a → Enc f a  decode ∷ Enc f a → f a``

and define scan functions via `EncodeF`:

``prefixScanEnc, suffixScanEnc ∷  (EncodeF f, Scan (Enc f), Monoid m) ⇒ f m → (m, f m)prefixScanEnc = second decode ∘ prefixScan ∘ encodesuffixScanEnc = second decode ∘ suffixScan ∘ encode``

Lists

As a first example, consider

``instance EncodeF [] where  type Enc [] = Const () + Id × []  encode [] = InL (Const ())  encode (a : as) = InR (Id a × as)  decode (InL (Const ())) = []  decode (InR (Id a × as)) = a : as``

And declare a boilerplate `Scan` instance via `EncodeF`:

``instance Scan [] where  prefixScan = prefixScanEnc  suffixScan = suffixScanEnc``

I haven’t checked the details, but I think with this instance, suffix scanning has okay performance, while prefix scan does quadratic work. The reason is the in the `Scan` instance for products, the two components are scanned independently (in parallel), and then the whole second component is adjusted for `prefixScan`, while the whole first component is adjusted for `suffixScan`. In the case of lists, the first component is the list head, and second component is the list tail.

For your reading convenience, here’s that `Scan` instance again:

``instance (Scan f, Scan g, Functor f, Functor g) ⇒ Scan (f × g) where  prefixScan (fa × ga) = (af ⊕ ag, fa' × ((af ⊕) <\$> ga'))   where (af,fa') = prefixScan fa         (ag,ga') = prefixScan ga  suffixScan (fa × ga) = (af ⊕ ag, ((⊕ ag) <\$> fa') × ga')   where (af,fa') = suffixScan fa         (ag,ga') = suffixScan ga``

The lop-sidedness of the list type thus interferes with parallelization, and makes the parallel scans perform much worse than cumulative sequential scans.

Let’s next look at a more balanced type.

Binary Trees

We’ll get better parallel performance by organizing our data so that we can cheaply partition it into roughly equal pieces. Tree types allows such partitioning.

Top-down trees

We’ll try a few variations, starting with a simple binary tree.

``data T1 a = L1 a | B1 (T1 a) (T1 a) deriving Functor``

Encoding and decoding is straightforward:

``instance EncodeF T1 where  type Enc T1 = Id + T1 × T1  encode (L1 a)   = InL (Id a)  encode (B1 s t) = InR (s × t)  decode (InL (Id a))  = L1 a  decode (InR (s × t)) = B1 s tinstance Scan T1 where  prefixScan = prefixScanEnc  suffixScan = suffixScanEnc``

Note that these definitions could be generated automatically from the data type definition.

For balanced trees, prefix and suffix scan divide the problem in half at each step, solve each half, and do linear work to patch up one of the two halves. Letting $n$ be the number of elements, and $W\left(n\right)$ the work, we have the recurrence $W\left(n\right)=2\phantom{\rule{0.167em}{0ex}}W\left(n/2\right)+c\phantom{\rule{0.167em}{0ex}}n$ for some constant factor $c$. By the Master theorem, therefore, the work done is $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$. (Use case 2, with $a=b=2$, $f\left(n\right)=c\phantom{\rule{0.167em}{0ex}}n$, and $k=0$.)

Again assuming a balanced tree, the computation dependencies have logarithmic depth, so the ideal parallel running time (assuming sufficient processors) is $\Theta \left(\mathrm{log}n\right)$. Thus we have an algorithm that is depth-efficient (modulo constant factors) but work-inefficient.

Composition

A binary tree as defined above is either a leaf or a pair of binary trees. We can make this pair-ness more explicit with a reformulation:

``data T2 a = L2 a | B2 (Pair (T2 a)) deriving Functor``

where `Pair`, as in Composable parallel scanning, is defined as

``data Pair a = a :# a deriving Functor``

or even

``type Pair = Id × Id``

For encoding and decoding, we could use the same representation as with `T1`, but let’s instead use a more natural one for the definition of `T2`:

``instance EncodeF T2 where  type Enc T2 = Id + Pair ∘ T2  encode (L2 a)  = InL (Id a)  encode (B2 st) = InR (O st)  decode (InL (Id a)) = L2 a  decode (InR (O st)) = B2 st``

Boilerplate scanning:

``instance Scan T2 where  prefixScan = prefixScanEnc  suffixScan = suffixScanEnc``

for which we’ll need an applicative instance:

``instance Applicative T2 where  pure = L2  L2 f <*> L2 x = L2 (f x)  B2 (fs :# gs) <*> B2 (xs :# ys) = B2 ((fs <*> xs) :# (gs <*> ys))  _ <*> _ = error "T2 (<*>): structure mismatch"``

The `O` constructor is for functor composition.

With a small change to the tree type, we can make the composition of `Pair` and `T` more explicit:

``data T3 a = L3 a | B3 ((Pair ∘ T3) a) deriving Functor``

Then the conversion becomes even simpler, since there’s no need to add or remove `O` wrappers:

``instance EncodeF T3 where  type Enc T3 = Id + Pair ∘ T3  encode (L3 a)  = InL (Id a)  encode (B3 st) = InR st  decode (InL (Id a)) = L3 a  decode (InR st)     = B3 st``

Bottom-up trees

In the formulations above, a non-leaf tree consists of a pair of trees. I’ll call these trees "top-down", since visible pair structure begins at the top.

With a very small change, we can instead use a tree of pairs:

``data T4 a = L4 a | B4 (T4 (Pair a)) deriving Functor``

Again an applicative instance allows a standard `Scan` instance:

``instance Scan T4 where  prefixScan = prefixScanEnc  suffixScan = suffixScanEncinstance Applicative T4 where  pure = L4  L4 f   <*> L4 x   = L4 (f x)  B4 fgs <*> B4 xys = B4 (liftA2 h fgs xys)   where h (f :# g) (x :# y) = f x :# g y  _ <*> _ = error "T4 (<*>): structure mismatch"``

or a more explicitly composed form:

``data T5 a = L5 a | B5 ((T5 ∘ Pair) a) deriving Functor``

I’ll call these new variations "bottom-up" trees, since visible pair structure begins at the bottom. After stripping off the branch constructor, `B4`, we can get at the pair-valued leaves by means of `fmap`, `fold`, or `traverse` (or variations). For `B5`, we’d also have to strip off the `O` wrapper (functor composition).

Encoding is nearly the same as with top-down trees. For instance,

``instance EncodeF T4 where  type Enc T4 = Id + T4 ∘ Pair  encode (L4 a) = InL (Id a)  encode (B4 t) = InR (O t)  decode (InL (Id a)) = L4 a  decode (InR (O t))  = B4 t``

Scanning pairs

We’ll need to scan on the `Pair` functor. If we use the definition of `Pair` above in terms of `Id` and `(×)`, then we’ll get scanning for free. For using `Pair`, I find the explicit data type definition above more convenient. We can then derive a `Scan` instance by conversion. Start with a standard specification:

``data Pair a = a :# a deriving Functor``

And encode & decode explicitly:

``instance EncodeF Pair where  type Enc Pair = Id × Id  encode (a :# b) = Id a × Id b  decode (Id a × Id b) = a :# b``

Then use our boilerplate `Scan` instance for `EncodeF` instances:

``instance Scan Pair where  prefixScan = prefixScanEnc  suffixScan = suffixScanEnc``

We’ve seen the `Scan` instance for `(×)` above. The instance for `Id` is very simple:

``newtype Id a = Id ainstance Scan Id where  prefixScan (Id m) = (m, Id ∅)  suffixScan        = prefixScan``

Given these definitions, we can calculate a more streamlined `Scan` instance for `Pair`:

``  prefixScan (a :# b)≡  {- specification -}  prefixScanEnc (a :# b)≡  {- prefixScanEnc definition -}  (second decode ∘ prefixScan ∘ encode) (a :# b)≡  {- (∘) -}  second decode (prefixScan (encode (a :# b)))≡  {- encode definition for Pair -}  second decode (prefixScan (Id a × Id b))≡  {- prefixScan definition for f × g -}  second decode    (af ⊕ ag, fa' × ((af ⊕) <\$> ga'))     where (af,fa') = prefixScan (Id a)           (ag,ga') = prefixScan (Id b)≡  {- Definition of second on functions -}  (af ⊕ ag, decode (fa' × ((af ⊕) <\$> ga')))   where (af,fa') = prefixScan (Id a)         (ag,ga') = prefixScan (Id b)≡  {- prefixScan definition for Id -}  (af ⊕ ag, decode (fa' × ((af ⊕) <\$> ga')))   where (af,fa') = (a, Id ∅)         (ag,ga') = (b, Id ∅)≡  {- substitution -}  (a ⊕ b, decode (Id ∅ × ((a ⊕) <\$> Id ∅)))≡  {- fmap/(<\$>) for Id -}  (a ⊕ b, decode (Id ∅ × Id (a ⊕ ∅)))≡  {- Monoid law -}  (a ⊕ b, decode (Id ∅ × Id a))≡  {- decode definition on Pair -}  (a ⊕ b, (∅ :# a))``

Whew! And similarly for `suffixScan`.

Now let’s recall the `Scan` instance for `Pair` given in Composable parallel scanning:

``instance Scan Pair where  prefixScan (a :# b) = (a ⊕ b, (∅ :# a))  suffixScan (a :# b) = (a ⊕ b, (b :# ∅))``

Hurray! The derivation led us to the same definition. A "sufficiently smart" compiler could do this derivation automatically.

With this warm-up derivation, let’s now turn to trees.

Scanning trees

Given the tree encodings above, how does scan work? We’ll have to consult `Scan` instances for some of the functor combinators. The product instance is repeated above. We’ll also want the instances for sum and composition. Omitting the `suffixScan` definitions for brevity:

``data (f + g) a = InL (f a) | InR (g a)instance (Scan f, Scan g) ⇒ Scan (f + g) where  prefixScan (InL fa) = second InL (prefixScan fa)  prefixScan (InR ga) = second InR (prefixScan ga)newtype (g ∘ f) a = O (g (f a))instance (Scan g, Scan f, Functor f, Applicative g) ⇒ Scan (g ∘ f) where  prefixScan = second (O ∘ fmap adjustL ∘ zip)             ∘ assocR             ∘ first prefixScan             ∘ unzip             ∘ fmap prefixScan             ∘ unO``

This last definition uses a few utility functions:

``zip ∷ Applicative g ⇒ (g a, g b) → g (a,b)zip = uncurry (liftA2 (,))unzip ∷ Functor g ⇒ g (a,b) → (g a, g b)unzip = fmap fst &&& fmap sndassocR ∷ ((a,b),c) → (a,(b,c))assocR   ((a,b),c) =  (a,(b,c))adjustL ∷ (Functor f, Monoid m) ⇒ (m, f m) → f madjustL (m, ms) = (m ⊕) <\$> ms``

Let’s consider how the `Scan (g ∘ f)` instance plays out for top-down vs bottom-up trees, given the functor-composition encodings above. The critical definitions:

``type Enc T2 = Id + Pair ∘ T2type Enc T4 = Id + T4 ∘ Pair``

Focusing on the branch case, we have `Pair ∘ T2` vs `T4 ∘ Pair`, so we’ll use the `Scan (g ∘ f)` instance either way. Let’s consider the work implied by that instance. There are two calls to `prefixScan`, plus a linear amount of other work. The meanings of those two calls differ, however:

• For top-down trees (`T2`), the recursive tree scans are in `fmap prefixScan`, mapping over the pair of trees. The `first prefixScan` is a pair scan and so does constant work. Since there are two recursive calls, each working on a tree of half size (assuming balance), plus linear other work, the total work $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$, as explained above.
• For bottom-up trees (`T4`), there is only one recursive recursive tree scan, which appears in `first prefixScan`. The `prefixScan` in `fmap prefixScan` is pair scan and so does constant work but is mapped over the half-sized tree (of pairs), and so does linear work altogether. Since there only one recursive tree scan, at half size, plus linear other work, the total work is then proportional to $n+n/2+n/4+\dots \approx 2\phantom{\rule{0.167em}{0ex}}n=\Theta \left(n\right)$. So we have a work-efficient algorithm!

Looking deeper

In addition to the simple analysis above of scanning over top-down and over bottom-up, let’s look in detail at what transpires and how each case can be optimized. This section may well have more detail than you’re interested in. If so, feel free to skip ahead.

Top-down

Beginning as with `Pair`,

``  prefixScan t≡  {- specification -}  prefixScanEnc t≡  {- prefixScanEnc definition -}  (second decode ∘ prefixScan ∘ encode) t≡  {- (∘) -}  second decode (prefixScan (encode t))``

Take `T2`, with `T3` being quite similar. Now split into two cases for the two constructors of `T2`. First leaf:

``  prefixScan (L2 m)≡  {- as above -}  second decode (prefixScan (encode (L2 m)))≡  {- encode for L2 -}  second decode (prefixScan (InL (Id m)))≡  {- prefixScan for functor sum -}  second decode (second InL (prefixScan (Id m)))≡  {- prefixScan for Id -}  second decode (second InL (m, Id ∅))≡  {- second for functions -}  second decode (m, InL (Id ∅))≡  {- second for functions -}  (m, decode (InL (Id ∅)))≡  {- decode for L2 -}  (m, L2 ∅)``

Then branch:

``  prefixScan (B2 (s :# t))≡  {- as above -}  second decode (prefixScan (encode (B2 (s :# t))))≡  {- encode for L2 -}  second decode (prefixScan (InR (O (s :# t))))≡  {- prefixScan for (+) -}  second decode (second InR (prefixScan (O (s :# t))))≡  {- property of second -}  second (decode ∘ InR) (prefixScan (O (s :# t)))``

Focus on the `prefixScan` application:

``  prefixScan (O (s :# t)) =≡  {- prefixScan for (∘) -} ( second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan ∘ unzip ∘ fmap prefixScan ∘ unO ) (O (s :# t))≡  {- unO/O -}  ( second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan  ∘ unzip ∘ fmap prefixScan ) (s :# t)≡  {- fmap on Pair -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan ∘ unzip)    (prefixScan s :# prefixScan t)≡  {- expand prefixScan -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan ∘ unzip)    ((ms,s') :# (mt,t'))      where (ms,s') = prefixScan s            (mt,t') = prefixScan t≡  {- unzip -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan)    ((ms :# mt), (s' :# t')) where ⋯≡  {- first -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR)    (prefixScan (ms :# mt), (s' :# t')) where ⋯≡  {- prefixScan for Pair -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR)    ((ms ⊕ mt, (∅ :# ms)), (s' :# t')) where ⋯≡  {- assocR -}  (second (O ∘ fmap adjustL ∘ zip))    (ms ⊕ mt, ((∅ :# ms), (s' :# t'))) where ⋯≡  {- second -}  ( ms ⊕ mt  , (O ∘ fmap adjustL ∘ zip) ((∅ :# ms), (s' :# t')) ) where ⋯≡  {- zip -}  ( ms ⊕ mt  , (O ∘ fmap adjustL) ((∅,s') :# (ms,t')) )  where ⋯≡  {- fmap for Pair -}  ( ms ⊕ mt  , O (adjustL (∅,s') :# adjustL (ms,t')) )  where ⋯≡  {- adjustL -}  ( ms ⊕ mt  , O (((∅ ⊕) <\$> s') :# ((ms ⊕) <\$> t')) )  where ⋯≡  {- Monoid law (left identity) -}  ( ms ⊕ mt  , O ((id <\$> s') :# ((ms ⊕) <\$> t')) )  where ⋯≡  {- Functor law (fmap id) -}  ( ms ⊕ mt  , O (s' :# ((ms ⊕) <\$> t')) )      where (ms,s') = prefixScan s            (mt,t') = prefixScan t``

Continuing from above,

``  prefixScan (B2 (s :# t))≡  {- see above -}  second (decode ∘ InR) (prefixScan (O (s :# t)))≡  {- prefixScan focus from above -}  second (decode ∘ InR)    ( ms ⊕ mt    , O (s' :# ((ms ⊕) <\$> t')) )        where (ms,s') = prefixScan s              (mt,t') = prefixScan t≡  {- definition of second on functions -}    (ms ⊕ mt, (decode ∘ InR) (O (s' :# ((ms ⊕) <\$> t')))) where ⋯≡  {- (∘) -}    (ms ⊕ mt, decode (InR (O (s' :# ((ms ⊕) <\$> t'))))) where ⋯≡  {- decode for B2 -}    (ms ⊕ mt, B2 (s' :# ((ms ⊕) <\$> t'))) where ⋯``

This final form is as in Deriving parallel tree scans, changed for the new scan interface. The derivation saved some work in wrapping & unwrapping and method invocation, plus one of the two adjustment passes over the sub-trees. As explained above, this algorithm performs $\Theta \left(n\phantom{\rule{0.167em}{0ex}}\mathrm{log}\phantom{\rule{0.167em}{0ex}}n\right)$ work.

I’ll leave `suffixScan` for you to do yourself.

Bottom-up

What happens if we switch from top-down to bottom-up binary trees? I’ll use `T4` (though `T5` would work as well):

``data T4 a = L4 a | B4 (T4 (Pair a))``

The leaf case is just as with `T2` above, so let’s get right to branches.

``  prefixScan (B4 t)≡  {- as above -}  second decode (prefixScan (encode (B4 t)))≡  {- encode for L2 -}  second decode (prefixScan (InR (O t)))≡  {- prefixScan for (+) -}  second decode (second InR (prefixScan (O t)))≡  {- property of second -}  second (decode ∘ InR) (prefixScan (O t))``

As before, now focus on the `prefixScan` call.

``  prefixScan (O t) =≡  {- prefixScan for (∘) -} ( second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan ∘ unzip ∘ fmap prefixScan ∘ unO ) (O t)≡  {- unO/O -}  ( second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan  ∘ unzip ∘ fmap prefixScan ) t≡  {- prefixScan on Pair (derived above) -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan ∘ unzip)    fmap (λ (a :# b) → (a ⊕ b, (∅ :# a))) t≡  {- unzip/fmap -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR ∘ first prefixScan)    ( fmap (λ (a :# b) → (a ⊕ b)) t    , fmap (λ (a :# b) → (∅ :# a))   t )≡  {- first on functions -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR)    ( prefixScan (fmap (λ (a :# b) → (a ⊕ b)) t)    , fmap (λ (a :# b) → (∅ :# a))   t )≡  {- expand prefixScan -}  (second (O ∘ fmap adjustL ∘ zip) ∘ assocR)    ((mp,p'), fmap (λ (a :# b) → (∅ :# a)) t)   where (mp,p') = prefixScan (fmap (λ (a :# b) → (a ⊕ b)) t)≡  {- assocR -}  (second (O ∘ fmap adjustL ∘ zip))    (mp, (p', fmap (λ (a :# b) → (∅ :# a)) t))   where ⋯≡  {- second on functions -}  (mp, (O ∘ fmap adjustL ∘ zip) (p', fmap (λ (a :# b) → (∅ :# a)) t))    where ⋯≡  {- fmap/zip/fmap -}  (mp, O (liftA2 tweak p' t))    where tweak s (a :# _) = adjustL (s, (∅ :# a))          (mp,p') = prefixScan (fmap (λ (a :# b) → (a ⊕ b)) t)≡  {- adjustL, then simplify -}  (mp, O (liftA2 tweak p' t))    where tweak s (a :# _) = (s :# s ⊕ a)          (mp,p') = prefixScan (fmap (λ (a :# b) → (a ⊕ b)) t)``

Now re-introduce the context of `prefixScan (O t)`:

``  prefixScan (B4 t)≡  {- see above -}  second (decode ∘ InR) (prefixScan (O t))≡  {- see above -}  second (decode ∘ InR)    (mp, O (liftA2 tweak p' t))      where ⋯≡  {- decode for T4 -}  (mp, B4 (liftA2 tweak p' t))    where p = fmap (λ (e :# o) → (e ⊕ o)) t          (mp,p') = prefixScan p          tweak s (e :# _) = (s :# s ⊕ e)``

Notice how much this bottom-up tree scan algorithm differs from the top-down algorithm derived above. In particular, there’s only one recursive tree scan (on a half-sized tree) instead of two, plus linear additional work, for a total of $\Theta \left(n\right)$ work.

Guy Blelloch’s parallel scan algorithm

In Programming parallel algorithms, Guy Blelloch gives the following algorithm for parallel prefix scan, expressed in the parallel functional language NESL:

``function scan(a) =if #a ≡ 1 then [0]else  let es = even_elts(a);      os = odd_elts(a);      ss = scan({e+o: e in es; o in os})  in interleave(ss,{s+e: s in ss; e in es})``

This algorithm is nearly identical to the `T4` scan algorithm above. I was very glad to find this route to Guy’s algorithm, which had been fairly mysterious to me. I mean, I could believe that the algorithm worked, but I had no idea how I might have discovered it myself. With the functor composition approach to scanning, I now see how Guy’s algorithm emerges as well as how it generalizes to other data structures.

Nested data types and parallelism

Most of the recursive algebraic data types that appear in Haskell programs are regular, meaning that the recursive instances are instantiated with the same type parameter as the containing type. For instance, a top-down tree of elements of type `a` is either a leaf or has two trees whose elements have that same type `a`. In contrast, in a bottom-up tree, the (single) recursively contained tree is over elements of type `(a,a)`. Such non-regular data types are called "nested". The two tree scan algorithms above suggest to me that nested data types are particularly useful for efficient parallel algorithms.

1. Sjoerd Visscher:

I think prefixScan for T4 assumes a zip-like applicative instance for T4?

Guy Blelloch’s algorithm works on lists of any length, but your T4 trees can only contain 2^n elements. How should T4 be fixed so that can also contain any number of elements?

2. conal:

I think prefixScan for T4 assumes a zip-like applicative instance for T4?

For `T4`, `prefixScan` is defined simply as `prefixScanEnc`, without relying on an `Applicative` instance.

Guy Blelloch’s algorithm works on lists of any length, but your T4 trees can only contain 2^n elements.

In Programming parallel algorithms, right above Figure 11 (containing the `scan` implementation), Guy says “The particular code shown works only on sequences that have length equal to a power of two, but it is not hard to generalize it to work on sequences of any length.”

How should T4 be fixed so that can also contain any number of elements?

By using a more flexible data type, such as the binary random-access lists in Chris Okasaki’s Purely Functional Data Structures, section 9.2.1. Quoting from that section:

A binary random-access list of size n contains a a tree for each one in the binary representation of n. The rank of each tree corresponds to the rank of the corresponding digit; if the ith bit of n is one, then the random-access list contains a tree of size 2^i.

3. Sjoerd Visscher:

The optimized version of prefixScan (B4 t) ends up being (mp, B4 (liftA2 tweak p’ t)). Here “liftA2 tweak” lifts tweak to T4, right? And as the liftA2 comes from “zip = uncurry (liftA2 (,))”, I guess a zip-like instance is required?

4. conal:

Ah, I see. You’re right. I’d forgotten that the `Scan` instance for `T4` does indeed depend on there being an `Applicative` instance. And yes, I’d define it in a zip-like way. Because `T4` has a sum and its `Applicative` is zip-like, `(<*>)` and `liftA2` will be partial, which is less than ideal. I hadn’t noticed this issue, because I was working out this algorithm for depth-typed trees, which have no (non-degenerate) sums and hence have total `(<*>)` and `liftA2`.

5. Ryan Ingram:

T2’s applicative instance doesn’t follow the applicative laws, though.

for example:

pure id <*> x /= x

You can fix this by ‘cloning’ the leaf elements though:

L2 f <*> B2 (xs :# ys) = B2 (f xs :# f ys)

B2 (fs :# gs) <*> L2 x = B2 ((\$x) fs :# (\$x) gs)

I think a similar trick works for T4.

6. Conal Elliott » Blog Archive » A third view on trees:

[…] About « Parallel tree scanning by composition […]

7. conal:

Ryan: Oops! Thanks for the catch. I know the laws hold for depth-typed trees, since they’re tries (and thus isomorphic to functions, which do satisfy the laws), but I haven’t checked the laws for these non-depth-typed trees. I wonder whether they hold with your alteration.