Enter paradis — A new chapter in Rust's parallelism story

Rust is a great language for data-parallel problems with relatively simple parallel access patterns. It particularly excels when the data can be split up into contiguous chunks. On the other hand, things can get messy for parallel access patterns that are non-contiguous or less structured.

In this post, I will show you an uglier side of parallelism in Rust. Thankfully, it is not a fundamental flaw of the language — but we are currently missing key concepts to solve this type of problem in a principled way. Together, we will walk through a series of steps that will lead us to a fundamental low-level abstraction for parallel data structure access.

This abstraction is the keystone underpinning paradis, a new library for parallel processing with disjoint index sets. paradis simplifies writing parallel programs that are currently awkward to correctly express in Rust. In this blog post you will get a taste of both its low-level foundations and high-level functionality.

I am very interested in community feedback on the concepts that I cover here, on the library, and this blog post itself. Please get in touch!

The problem

Let's consider a contrived toy problem. Quite frankly, it doesn't make much sense on its own, but it turns out to be a useful springboard to more advanced real-world problems.

We have a vector of integers. We want to spawn a thread that mutates elements at even indices, and a separate thread for mutating elements at odd indices. In C++, this might look like:

size_t n = 1000;
std::vector<int> data(n, 0);

std::thread even_thread([n, &data]() {
    for (size_t i = 0; i < n; i += 2) {
        data[i] = 1;
    }
});

std::thread odd_thread([n, &data]() {
    for (size_t i = 1; i < n; i += 2) {
        data[i] = 2;
    }
});

even_thread.join();
odd_thread.join();

Arguably, it's easy to see what's going on. It might burn down your house if you perturb it slightly, but as shown it ought to do as advertised.

Surely this is simple to do in Rust. Let's give it a try! We first try an almost direct port of our C++ code:

use std::thread::scope;

let n = 1000;
let mut data = vec![0; n];

scope(|s| {
    s.spawn(|| {
        for i in (0 .. n).step_by(2) {
            data[i] = 1;
        }
    });

    s.spawn(|| {
        for i in (1 .. n).step_by(2) {
            data[i] = 2;
        }
    });
})

In Rust, we use scope because it lets us borrow data with non-'static lifetime, such as data. I bet that by now, experienced Rustaceans reading this post are getting agitated that I have yet to point out that this program cannot possibly compile. Indeed, the borrow checker wastes no time in letting us know:

error[E0499]: cannot borrow `data` as mutable more than once at a time

Now, if our only goal were to satisfy the borrow checker, we could wrap every element in a mutex, or use atomics. Since atomics don't generalize beyond a few primitive types, let's see what the code would look like with the mutex-based solution.

use std::iter::repeat_with;
use std::sync::Mutex;
use std::thread::scope;

let n = 1000;
let mut data: Vec<_> = repeat_with(|| Mutex::new(0))
    .take(n)
    .collect();

scope(|s| {
    s.spawn(|| {
        for i in (0 .. n).step_by(2) {
            *data[i].lock().unwrap() = 1;
        }
    });

    s.spawn(|| {
        for i in (1 .. n).step_by(2) {
            *data[i].lock().unwrap() = 2;
        }
    });
})

Hooray! It compiles! And it does what it should. But this code is also pretty bonkers. If we are at the point at which we want to use multiple threads, then we likely care about performance. Wrapping individual elements in mutexes when there is no overlapping access is excessive, and the performance overhead may be unacceptable.
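
For reference, here is a sketch of the atomics route mentioned earlier. It compiles and avoids locks, but as noted, it only works for the handful of primitive types that have atomic counterparts:

use std::sync::atomic::{AtomicI32, Ordering};
use std::thread::scope;

let n = 1000;
let data: Vec<AtomicI32> = (0 .. n).map(|_| AtomicI32::new(0)).collect();

scope(|s| {
    s.spawn(|| {
        for i in (0 .. n).step_by(2) {
            // Relaxed suffices: the threads never touch the same element
            data[i].store(1, Ordering::Relaxed);
        }
    });

    s.spawn(|| {
        for i in (1 .. n).step_by(2) {
            data[i].store(2, Ordering::Relaxed);
        }
    });
})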

The C++ program that we started with is very much a valid program, and we should be able to write the same program in Rust. But we might have to use unsafe. Gasp! It's time to get our hands dirty.

Thankfully, we can do whatever we want in unsafe land, so we'll just replace

data[i] = 1;

with

unsafe { *data.get_unchecked_mut(i) = 1; }

to get the borrow checker off our back.

error[E0499]: cannot borrow `data` as mutable more than once at a time

What? Oh, right. Turns out that we can't, in fact, do whatever we want in unsafe land. get_unchecked_mut still borrows data mutably, so we are no closer to our goal.
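
The signature makes this plain: even the unchecked accessor takes &mut self, so every call borrows the whole slice mutably. Simplified from the standard library:

// Simplified signature from the standard library:
// the unchecked accessor on [T] still takes &mut self
pub unsafe fn get_unchecked_mut<I>(&mut self, index: I) -> &mut I::Output
where
    I: SliceIndex<[T]>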

In fact, under no circumstances are we allowed to simultaneously materialize mutable references to data in both threads, so it follows that we can not use the API of Vec<T> at all. Provided that T: Send, we can, however, materialize mutable references to disjoint elements in different threads by going through raw pointers. Here's how this might look:

use std::thread::scope;

let n = 1000;
let mut data = vec![0; n];
let data_ptr = data.as_mut_ptr();

scope(|s| {
    s.spawn(|| {
        for i in (0 .. n).step_by(2) {
            unsafe { *data_ptr.add(i) = 1; }
        }
    });

    s.spawn(|| {
        for i in (1 .. n).step_by(2) {
            unsafe { *data_ptr.add(i) = 2; }
        }
    });
})

Alas, we are told that

error[E0277]: `*mut i32` cannot be shared between threads safely

Raw pointers implement neither Sync nor Send, so we cannot directly use them with threads. This may sound puzzling at first, but the reason is explained in the nomicon: if pointers were Send + Sync, then structs that contain them could automatically be too, which could lead to accidental unsoundness. What we are trying to do here is fine, but we essentially have to force the compiler to let us use our data pointer inside the two threads. To do so, we can wrap our pointer in a newtype that can be copied into each of the two threads. Then we can access the pointer inside each thread. The code might look like this:

use std::thread::scope;

#[derive(Copy, Clone)]
struct Ptr(*mut i32);
unsafe impl Sync for Ptr {}

let n = 1000;
let mut data = vec![0; n];
let data_ptr = Ptr(data.as_mut_ptr());

scope(|s| {
    s.spawn(|| {
        // spawn requires the closure to implement Send.
        // The closure is Send if all its contained
        // variables are Send. Since Ptr is Copy,
        // and we are moving it into the closure in the
        // next line, the closure will capture by immutable
        // borrow, so we need &Ptr: Send, which is the case
        // if and only if Ptr: Sync
        let data_ptr = data_ptr;
        for i in (0 .. n).step_by(2) {
            // Once we have our Ptr structure inside the
            // thread, we can access the embedded pointer
            unsafe { *data_ptr.0.add(i) = 1; }
        }
    });

    s.spawn(|| {
        let data_ptr = data_ptr;
        for i in (1 .. n).step_by(2) {
            unsafe { *data_ptr.0.add(i) = 2; }
        }
    });
})

And there you have it. Dead simple! Yeah... No.

The end result is terrible. Our Rust version is no safer than the C++ version, but the code is more complex, and we are forced to deal with the intricacies involved in making sure we maintain Rust's invariants for raw pointers. Arguably, it's easier to make mistakes this way.

Can we do better? Of course! Let's diagnose the problem.

What is wrong

Our latest attempt forces us to simultaneously deal with two sources of possible unsoundness:

- We must correctly manipulate raw pointers to obtain unsynchronized access to the data structure's elements, without violating Rust's rules for pointers and references.
- We must ensure that the two threads access disjoint sets of indices, so that we never materialize two mutable references to the same element.

The first point is bad enough in this example, but it gets substantially worse for more complex data structures, such as multi-dimensional arrays, possibly with non-uniform strides as supported by sufficiently flexible array/matrix/tensor libraries. Everyone who wants to mutably access a data structure in parallel in a similar fashion must write the same kind of tricky pointer manipulation.

Moreover, since pointers do not carry lifetime information, the call to as_mut_ptr does not borrow data mutably, and so there's nothing stopping us from mutating it after obtaining the pointer. For example, we can add data.push(0) to the second thread, and it will still happily compile our now-unsound code.
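
Concretely, this variant of the second thread still compiles, even though the push may reallocate the buffer and leave the pointer dangling (a sketch based on the Ptr version above):

s.spawn(|| {
    // Still compiles: data_ptr holds no borrow of data...
    data.push(0);
    let data_ptr = data_ptr;
    for i in (1 .. n).step_by(2) {
        // ...but if push reallocated, these writes are use-after-free:
        // undefined behavior, and not a peep from the compiler
        unsafe { *data_ptr.0.add(i) = 2; }
    }
});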

How to make it right

From the preceding analysis, we see that we need a low-level abstraction that encapsulates the internal details necessary for sound unsynchronized access. This abstraction needs to mutably borrow our data, so that we prevent illegal manipulation of the data while we are accessing it. Let's try something like:

use std::marker::PhantomData;

pub struct SliceParAccessMut<'a, T> {
    ptr: *mut T,
    marker: PhantomData<&'a mut T>
}

We've essentially given our pointer from before a lifetime. We need to tie this lifetime to a mutable slice, like this:

impl<'a, T: Send> SliceParAccessMut<'a, T> {
    pub fn from_slice_mut(slice: &'a mut [T]) -> Self {
        Self {
            ptr: slice.as_mut_ptr(),
            marker: PhantomData::default()
        }
    }
}

The T: Send bound is not technically necessary at this point, but it's probably convenient to deny construction of our parallel access object if the elements in the slice do not have a minimum of thread safety. The Send bound basically says that we want to be able to access the objects in the slice from different threads (but not at the same time). This bound also implies that &mut T: Send, which is ultimately what we want a little later on.
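
As a quick illustration of what the bound buys us: Rc is not Send, so a slice of reference-counted values is rejected at construction time:

use std::rc::Rc;

let mut ints = vec![0i32; 4];
// i32: Send, so constructing an access is fine
let _access = SliceParAccessMut::from_slice_mut(&mut ints);

let mut shared = vec![Rc::new(0); 4];
// Rc<i32> is not Send, so the following line fails to compile:
// let _access = SliceParAccessMut::from_slice_mut(&mut shared);
// error[E0277]: `Rc<i32>` cannot be sent between threads safely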

Next, the key ingredient:

impl<'a, T: Send> SliceParAccessMut<'a, T> {
    /// Return a mutable reference to the element at the given index.
    ///
    /// # Safety
    ///
    /// The provided index must be in bounds, and Rust's aliasing
    /// rules must be respected at all times.
    pub unsafe fn get_unsync_mut_unchecked(&self, idx: usize) -> &'a mut T {
        unsafe { &mut *self.ptr.add(idx) }
    }
}

This method is essentially a way for us to tell the compiler to back off and trust that we will satisfy Rust's aliasing rules. Note that we return a reference with lifetime 'a. This is longer than we strictly need for our current example, but it will be necessary to build parallel iterators on top later.

The final piece we need is Sync + Send for our access object:

unsafe impl<'a, T: Send> Sync for SliceParAccessMut<'a, T> {}
unsafe impl<'a, T: Send> Send for SliceParAccessMut<'a, T> {}

Strictly speaking, we'll only need Sync for our current example, but it's necessary to be able to send the access object to other threads when building higher-level abstractions, like rayon parallel iterators.

Finally, we can rewrite our example using our new access abstraction:

let n = 1000;
let mut data = vec![0; n];
let access = SliceParAccessMut::from_slice_mut(&mut data);

scope(|s| {
    s.spawn(|| {
        for i in (0 .. n).step_by(2) {
            unsafe { *access.get_unsync_mut_unchecked(i) = 1; }
        }
    });

    s.spawn(|| {
        for i in (1 .. n).step_by(2) {
            unsafe { *access.get_unsync_mut_unchecked(i) = 2; }
        }
    });
})

That's more like it!

Our final snippet looks very similar to the original C++ example. But, despite the need to use unsynchronized unsafe access very carefully, our Rust code still affords us a number of important safety protections compared to the C++ example. Since obtaining the access object mutably borrows data, we cannot modify the vector (push, resize etc.) while the access object is alive. The Send bound also prevents us from operating on data that is not thread safe. The access object also explicitly communicates to the compiler and developers that we can access the data structure mutably from several threads at the same time. This is a big step up from C++, where the coder must manually verify that the data structure in question is thread-safe.
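
To see the borrow at work: the push that silently compiled in the raw-pointer version is now rejected while the access is alive:

let mut data = vec![0; 10];
let access = SliceParAccessMut::from_slice_mut(&mut data);
// Fails to compile, since access mutably borrows data:
// data.push(0);
// error[E0499]: cannot borrow `data` as mutable more than once at a time
unsafe { *access.get_unsync_mut_unchecked(0) = 1; }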

A glimpse of paradis

Our toy example shows that separating data structure access from access pattern lets us simplify the code and achieve some level of safety. To generalize this notion, I've built the paradis crate. paradis offers:

- low-level abstractions for unsynchronized parallel access to data structures, and
- higher-level functionality for safely expressing parallel access patterns in terms of index lists with unique indices.

The intention is for crates that expose parallel access to data structures to depend on paradis-core, which is a very small crate that currently has no dependencies. It is intended to evolve slowly, and eventually stabilize. This allows users of paradis to use these data structures with higher-level parallel patterns.

Our SliceParAccessMut struct from the previous discussion can also be found in paradis-core, albeit in a slightly modified form. However, the methods that we implemented on it are instead part of a more general ParAccess trait, which at present looks roughly like this:

pub unsafe trait ParAccess<Index: Copy>: Sync + Send {
    type Record: Send;
    unsafe fn clone_access(&self) -> Self;
    unsafe fn get_unsync_unchecked(&self, index: Index) -> Self::Record;
}

paradis facilitates access to records in a collection. The ParAccess trait is very general and is not at all limited to array-like collections. Access to matrix and tensor-like data structures is an explicit design goal. In order to build safe abstractions on top, it is necessary to be able to communicate the bounds of the data structure. The trait BoundedParAccess allows "rectangular" data structures — arrays and multi-dimensional arrays — to describe their bounds, and should be implemented whenever possible.
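
To give a feel for it, such a bounds-reporting extension might look roughly like the following. This is a hypothetical sketch, not the actual paradis-core definition; consult the docs for the real trait:

// Hypothetical sketch only; the real BoundedParAccess in
// paradis-core may differ
unsafe trait BoundedParAccess<Index: Copy>: ParAccess<Index> {
    /// The extent of the data structure, e.g. the length of
    /// a slice or the shape of a matrix
    fn bounds(&self) -> Index;

    /// Whether the given index is within those bounds
    fn in_bounds(&self, index: Index) -> bool;
}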

Once a data structure provides parallel access, we may use it in the paradis API. Our contrived example of even-odd processing in separate threads can not yet be expressed with safe code. It will, however, be possible in the future, as I will discuss later. Instead, we will move forward with an example problem that is arguably much more common in practice: parallel mutable access to a subset of a data structure described by a list of unique indices. Let's look at a simple example:

use paradis::index::{IndexList, narrow_access};
use paradis::rayon::create_par_iter;
use rayon::iter::ParallelIterator;

let mut data = vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
let indices = vec![4, 7, 1]
    // A list of usize implements IndexList, but *not* UniqueIndexList
    // To ensure uniqueness, we can opt-in to checking uniqueness at runtime
    .check_unique()
    .expect("Indices are unique");

// Mutable slices implement IntoParAccess,
// prompting a visit from our old friend SliceParAccessMut
let access = narrow_access(data.as_mut_slice(), &indices)
    // This does not check individual indices, only the bounds
    // reported by the data structure and the index list.
    // See the docs for more details
    .expect("Indices are in bounds of the data structure");

// Set desired entries to zero, in parallel
create_par_iter(access).for_each(|x_i| *x_i = 0);

assert_eq!(data, vec![0, 0, 2, 3, 0, 5, 6, 0, 8, 9]);

Hopefully, the API is clear enough that the example feels intuitive. The main actors in this drama are the traits IndexList, UniqueIndexList and the function narrow_access. Vec<usize> implements IndexList, but not UniqueIndexList; the indices are not generally unique. Therefore, we can check uniqueness at runtime with the method IndexList::check_unique to obtain a UniqueIndexList.

Uniqueness of the indices is precisely the precondition for sound parallel mutation. With a list of unique indices and an access object that implements BoundedParAccess, we can narrow the access. This process produces a new access object that uses the provided indices to index into the underlying access object. The narrowed access object conceptually satisfies narrowed_access[i] == underlying_access[indices[i]] for i in the range 0 .. n, where n is the number of indices.
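
The identity is easy to illustrate with plain sequential code:

// Sequential illustration of the narrowing identity
let data = vec![10, 11, 12, 13, 14];
let indices = vec![4, 2, 0];
// narrowed[i] == data[indices[i]]
let narrowed: Vec<_> = indices.iter().map(|&i| data[i]).collect();
assert_eq!(narrowed, vec![14, 12, 10]);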

Once a narrowed access has been obtained, it can be used with create_par_iter to produce a rayon parallel iterator. Most of paradis is devoted to composing accesses and indices, and only a tiny portion of code is necessary to support rayon parallel iterators. Therefore, paradis is not married to rayon, and it should be straightforward to support other paradigms too.

Rust and its ecosystem typically enable parallelism by splitting up objects into disjoint, contiguous parts. You can think of access narrowing as a generalization that lifts the contiguity requirement, requiring only that the parts are disjoint. This enables far more general access patterns to be expressed.
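
For contrast, here is the contiguous paradigm in its simplest std-only form. Narrowing keeps the disjointness but drops the requirement that each part is a contiguous block:

use std::thread::scope;

let mut data = vec![0; 8];
// Split into two disjoint *contiguous* halves
let (left, right) = data.split_at_mut(4);

scope(|s| {
    s.spawn(|| left.fill(1));
    s.spawn(|| right.fill(2));
});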

Proving uniqueness by structured composition

In many cases, the indices are structured, meaning that the index list is available implicitly. In these cases, the overhead of explicitly storing a list of indices, as well as checking for uniqueness, may be unacceptable. The simplest example is Range<usize>, which is a list of indices that we already know are always unique. More elaborate examples can be constructed with index combinators, which can be chained much like iterator adapters. For example, the superdiagonal of a 3 x 4 matrix can be expressed as

(0 .. 3).index_zip(1 .. 4)

which amounts to the index list [ (0, 1), (1, 2), (2, 3) ]. As another example, let's take two consecutive columns of the matrix, for which we can use a Cartesian product:

(0 .. 3).index_product(1 ..= 2)

which produces [(0, 1), (0, 2), (1, 1), (1, 2), (2, 1), (2, 2)]. The uniqueness property of these lists is assured by the following properties:

- The zip of two index lists has unique elements if either of the two input lists has unique indices.
- The Cartesian product of two index lists has unique elements if both input lists have unique indices.
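
To spell these two lists out with plain std iterators (illustration only, not the paradis combinators):

// index_zip analogue
let zipped: Vec<(usize, usize)> = (0 .. 3).zip(1 .. 4).collect();
assert_eq!(zipped, vec![(0, 1), (1, 2), (2, 3)]);

// index_product analogue
let product: Vec<(usize, usize)> = (0 .. 3)
    .flat_map(|i| (1 ..= 2).map(move |j| (i, j)))
    .collect();
assert_eq!(product, vec![(0, 1), (0, 2), (1, 1), (1, 2), (2, 1), (2, 2)]);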

Other combinators preserve uniqueness under similar conditions. Combinators can of course be combined arbitrarily. In the following example, we use a runtime check for the row indices and rely on structured uniqueness for the columns:

vec![2, 1]
    .check_unique()
    .expect("row indices are unique")
    .index_product(1 ..= 2)

Structured uniqueness also extends to higher dimensions. The paradis repository includes an example where combinators are used to access a subset of indices for a four-dimensional array.

Let's go back to the case of mutating the superdiagonal of a matrix in parallel, and study a concrete example of how this can be done with paradis:

use nalgebra::dmatrix;
use paradis::index::{IndexList, narrow_access};
use paradis::rayon::create_par_iter;
use rayon::iter::ParallelIterator;

let mut matrix = dmatrix![1, 1, 1, 1, 1;
                          1, 1, 1, 1, 1;
                          1, 1, 1, 1, 1];

// Superdiagonal indices are [(0, 1), (1, 2), (2, 3)]
let indices = (0 .. 3).index_zip(1 .. 4);

// We omit the implementation of the access object, which hopefully may be
// provided by nalgebra itself in the future
let access = DMatrixParAccessMut::from_matrix_mut(&mut matrix);
let superdiagonal_access = narrow_access(access, &indices)
    .expect("Indices are in bounds");

create_par_iter(superdiagonal_access).for_each(|x_ij| *x_ij = 0);

assert_eq!(matrix,
           dmatrix![1, 0, 1, 1, 1;
                    1, 1, 0, 1, 1;
                    1, 1, 1, 0, 1]);

This example showcases how paradis is intended to be used: parallel access to the data structure is ideally provided by the origin crate (nalgebra), but access objects can also be provided externally, as it is here. This lets us encapsulate the internals of the data structure once, and then write high-level — and ideally safe — code on top that enables new parallel patterns.

Finally, it is perhaps worth noting that we don't have to use parallel iterators: paradis also provides a create_iter method that builds a sequential iterator for an access.
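
For instance, the earlier slice example could equally well be processed sequentially (a sketch, reusing the narrowed access from before):

// Sequential counterpart to create_par_iter
for x_i in create_iter(access) {
    *x_i = 0;
}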

Assuming uniqueness

paradis currently only provides a small number of combinators, and as a result the breadth of access patterns that can be expressed by structured uniqueness is still limited. Therefore, paradis provides an unsafe escape hatch:

let indices = vec![0, 2, 8, 7, 1];
// SAFETY: Indices are unique by inspection with eyeballs
let unique_indices = unsafe { indices.assume_unique() };

assume_unique has zero runtime cost, but as a user you have to pinky-promise that your indices really are unique, otherwise Unsoundness™ ensues. Over time, I hope that more patterns can be covered by safe combinators instead. Nevertheless, even when this is not possible, and explicit runtime checking is unacceptably expensive, we have reduced the unsafe responsibility of the user to ensuring that the indices are unique. This is arguably a substantial improvement over the status quo.
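
For comparison, the checked path catches such mistakes at runtime. Assuming check_unique reports failure through a Result, as the earlier .expect calls suggest:

let indices = vec![0, 2, 8, 2, 1];
// The duplicated index 2 is detected at runtime
// (assuming a Result-style API, per the .expect calls above)
assert!(indices.check_unique().is_err());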

Open question: accessing several offdiagonals

We've seen how paradis can be used to safely mutate entries in the superdiagonal of a matrix in parallel. But what if you wanted to do the same to, say, the first and second superdiagonal and the third subdiagonal? More generally, what if you wanted to mutate a list of particular offdiagonals?

While we can express each of the individual offdiagonals in isolation, we can not simply concatenate them together, because the result of concatenating two index lists is not guaranteed to have unique indices. I've racked my brain for a way to express this in terms of more elementary patterns and combinators. If such a thing is possible, it has certainly eluded me. This seems to suggest that Offdiagonals(UniqueIndexList) — offdiagonals parametrized by a list of unique indices — should be a basic pattern provided by paradis itself. By extrapolating from this example, it might be that many common patterns can not be built from simpler building blocks, and should be provided by paradis in the future.
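
To make the obstacle concrete: concatenating two individually-unique index lists does not, in general, yield a unique list:

// Two index lists, each with unique indices on its own
let a = vec![(0, 1), (1, 2)];
let b = vec![(1, 2), (2, 3)];

// Their concatenation contains (1, 2) twice, so parallel mutation
// through the combined list could alias
let combined: Vec<_> = a.iter().chain(&b).copied().collect();
assert_eq!(combined.iter().filter(|&&ij| ij == (1, 2)).count(), 2);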

Future work: Performance and codegen

One aspect that I have deliberately not devoted much time to yet is the performance of code using paradis. In principle, most of the abstractions could be zero-cost. However, similar to iterators, they are at the mercy of compiler optimizations. From my limited benchmarking (there are some benchmarks in the repo), performance is generally what you would expect, but the results can occasionally be surprising. I think the best way forward is to collect more real-world examples and see how well paradis works for these.

This is also an area where I would love some help, because I quite frankly don't have the requisite expertise to scrutinize assembly output and judge what could be done to improve codegen.

Future abstraction: collections of index lists

I promised you earlier in the post that I would explain how we might, in the future, solve the even-odd toy problem with safe code. I will honor that promise, but I will ask you to bear with me for a brief history lesson.

paradis was born out of a desire to be able to write parallel sparse matrix assembly for the Finite Element Method (FEM) with safe code. I wrote the fenris-paradis crate for this purpose, which is a more specialized precursor to paradis. It enables parallel matrix assembly for the experimental fenris library, which I built to support my research on finite element methods.

The functionality in present-day paradis is not yet sufficient to support parallel sparse matrix assembly, so paradis does not yet cover my motivating use case. This is mostly a matter of porting over and generalizing the functionality already found in fenris-paradis. Curiously, it turns out that the same abstraction that fenris needs in order to take advantage of paradis can be used to solve our toy even-odd problem with safe code. I'll give you a brief glimpse of how this might be done.

For brevity, I'll skip the FEM-specific details and the connections to graph coloring. At its core, the problem can be succinctly reduced to the following:

Given a collection of m index lists that each have unique indices, partition the collection into sub-collections, such that in each sub-collection the index lists contained therein are disjoint.

Such a partition is called a coloring of the index list collection.

This is awkward to explain and grok with words, so I'll give an example with familiar Rust types. Consider the following nested vector:

let collection = vec![
  vec![ 0, 2, 4],
  vec![ 3, 2, 4],
  vec![ 1, 5, 3],
  vec![ 1, 2, 4],
];

A possible coloring of collection could be:

let c1 = vec![ vec![0, 2, 4], vec![1, 5, 3] ];
let c2 = vec![ vec![3, 2, 4] ];
let c3 = vec![ vec![1, 2, 4] ];
let colors = vec![ c1, c2, c3 ];

That's a lot of vec!, but the point is that each color in colors is a collection of index lists that are disjoint. For the purposes of FEM assembly, this means that we can soundly mutate records at indices stored in a single color, because all the indices involved in the collection of index lists are unique.

Analogous to our IndexList and UniqueIndexList traits, we can model this concept with the traits IndexLists and DisjointIndexLists, which might look roughly like this:

trait IndexLists {
    type List: IndexList;

    fn get_list(&self, loc: usize) -> Self::List;
    fn num_lists(&self) -> usize;
}

trait DisjointIndexLists: IndexLists
where
  Self::List: UniqueIndexList
{}

Again, the invariant that needs to be satisfied here is that each individual list has unique indices, and all the index lists are disjoint. Conceptually, we can think of the coloring process as going from a single IndexLists to [DisjointIndexLists] — multiple instances of DisjointIndexLists.

For FEM assembly, we would then sequentially process each color (DisjointIndexLists instance), iterating over the disjoint index lists in parallel and mutating the elements associated with each individual index.
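
In imaginary code, built on the trait sketch above and rayon, the assembly loop would be structured roughly like this:

use rayon::iter::{IntoParallelIterator, ParallelIterator};

// Imaginary code: colors are processed sequentially, and the
// disjoint index lists within one color in parallel
for color in &colors {
    (0 .. color.num_lists()).into_par_iter().for_each(|l| {
        let _list = color.get_list(l);
        // Within a color, the lists are disjoint and each list's
        // indices are unique, so the records of _list may be
        // mutated without synchronization
    });
}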

OK, great, but how does any of this help us with the even-odd example?

Let's say we have an index list of odd integers, and a list of even integers. Together, this constitutes a collection of disjoint index lists, so we could model this as a DisjointIndexLists instance with two index lists. Just as we could narrow an access object to an index list, we can imagine that we may split an access into several sub-accesses, each associated with an index list, provided that the index lists are disjoint. The solution to our even-odd toy problem might then look something like this (warning: imaginary code!):

let n = 1000;
let mut data = vec![0; n];
let access = SliceParAccessMut::from_slice_mut(&mut data);

// "Deinterlace" a unique index list into 2 index lists,
// which provides a DisjointIndexLists instance where the first list
// has indices at locations 0, 2, 4, ...
// and the second one has indices at locations 1, 3, 5...
let index_lists = (0 .. n).index_deinterlace(2);

// Split the access according to the index lists.
// Produces a new access object, where each
// record is an access narrowed to one of the index lists
let access = split_access(access, &index_lists);

let mut access_iter = create_iter(access);
let even_access = access_iter.next().unwrap();
let odd_access = access_iter.next().unwrap();

scope(move |s| {
    s.spawn(move || {
        for value in create_iter(even_access) {
            *value = 1;
        }
    });

    s.spawn(move || {
        for value in create_iter(odd_access) {
            *value = 2;
        }
    });
})

Here, we have used a hypothetical general purpose combinator index_deinterlace which, when used on an instance of UniqueIndexList, produces a collection of disjoint index lists. This follows from the fact that our original list has unique indices, and so regardless of how we partition the list into sublists, the resulting sublists will be disjoint.
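
Spelled out with std iterators for n = 8, the deinterlacing amounts to this:

// Deinterlacing 0 .. 8 into two lists
let evens: Vec<usize> = (0 .. 8).step_by(2).collect();
let odds: Vec<usize> = (1 .. 8).step_by(2).collect();
assert_eq!(evens, vec![0, 2, 4, 6]);
assert_eq!(odds, vec![1, 3, 5, 7]);
// Both lists are unique, and they are disjoint,
// because 0 .. 8 itself has unique indices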

Naturally, when an access object is split up according to disjoint index lists, the resulting sub-accesses are also disjoint. Hence, it is safe to pass separate sub-accesses to different threads, where we may use create_iter to iterate over their records.

There is some modest design work that must be done before we can land this functionality in paradis. Nevertheless, I hope it's not too far off on the horizon.

Finally, another use case for collections of index lists is partitioning. Just as we might have an index_deinterlace combinator, we might have a combinator .index_chunks(chunk_size), which would divide an index list into chunks of the specified size, similar to slice::chunks. This would allow us to partition work across threads without using rayon, or we could use it with rayon to increase the size of each computational task that rayon sees, which may be beneficial for performance.
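
slice::chunks gives a feel for the intended semantics, on data rather than on indices:

// The std analogue on data: chunks of size 2; the last chunk may be smaller
let data = [1, 2, 3, 4, 5];
let mut chunks = data.chunks(2);
assert_eq!(chunks.next().unwrap(), &[1, 2]);
assert_eq!(chunks.next().unwrap(), &[3, 4]);
assert_eq!(chunks.next().unwrap(), &[5]);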

Next steps

paradis is currently experimental. It's deliberately lacking in tests for now, to make refactoring and potential changes to the design easier. I'm hoping that with this post, I might get some community feedback on the design and API.

If you have an application that can make use of paradis, then I'd like to hear about it. If you have an application that you think could maybe make use of paradis somehow, I'd also like to hear about it! Feel free to get in touch through the GitHub Discussions forum.

I'm very excited about seeing how far we together can take paradis. I hope it will simplify the process of writing fast, reliable and parallel Rust code.

Acknowledgements

Big thanks to Fabian Löschner and @sarah-ek for early feedback on this blog post!
