Parallelization across regions

Given that the most common data structure is a counts dict (whose keys are the region names in our dataset), we often want to call a function for each region in this dictionary:

result = {region: fn(counts[region]) for region in counts}

This pattern becomes more complicated if fn() returns a tuple, for example. Furthermore, the overall operation is “embarrassingly parallel” with respect to the regions being processed. To simplify our code, reduce redundancy, and gain the benefits of parallel execution, we introduce a new decorator, @parallelize_regions, which can be found in the subpackage lib5c.util.parallelization. This decorator allows you to write fn() just once, as if it processes only one matrix, and then call it with either one matrix or an entire counts dict, as is convenient. For example, we can write

from lib5c.util.parallelization import parallelize_regions

@parallelize_regions
def fn(matrix):
    return matrix + 1

and then call this function via

result_counts = fn(counts)

or alternatively,

result_matrix = fn(counts['Sox2'])

as is convenient for us.

Mechanism and caveats

The following sections dig into the mechanics behind the @parallelize_regions decorator and highlight some important features and caveats.

First positional argument dependence

The @parallelize_regions decorator works by first checking to see if the first argument passed to the decorated function is a dict. If it is not, the decorator does nothing, and the function is executed as normal. If it is a dict, the execution of the function is parallelized across the keys of that dict. This means that if the non-parallelized version of fn() expects a dict as its first positional argument, you will not be able to use the same name for both the parallel and non-parallel versions of the function. To work around this, you can define

from lib5c.util.parallelization import parallelize_regions

def fn(somedict):
    return somedict

fn_parallel = parallelize_regions(fn)

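and then you can call fn(somedict) when you want the non-parallelized version and fn_parallel(doubledict) when you want the parallelization, where doubledict is a dict whose keys are region names and whose values are the dicts that fn() expects.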
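The dispatch rule described in this section can be sketched without lib5c. The following is only an illustration under stated assumptions: parallelize_regions_sketch and count_keys are hypothetical names, the sketch runs the regions in a plain loop rather than in parallel, and it is not the library's actual implementation.

```python
import functools


def parallelize_regions_sketch(fn):
    """Hypothetical stand-in for @parallelize_regions (serial, for illustration)."""
    @functools.wraps(fn)
    def wrapper(first, *args, **kwargs):
        # First positional argument is not a dict: call fn() as normal.
        if not isinstance(first, dict):
            return fn(first, *args, **kwargs)
        # First positional argument is a dict: call fn() once per key.
        # The real decorator distributes these calls across processes.
        return {region: fn(first[region], *args, **kwargs) for region in first}
    return wrapper


def count_keys(somedict):
    # A function whose ordinary input is itself a dict.
    return len(somedict)


# Keep both versions under different names, as suggested above.
count_keys_parallel = parallelize_regions_sketch(count_keys)

plain = {'a': 1, 'b': 2}
doubledict = {'Sox2': {'a': 1, 'b': 2}, 'Klf4': {'a': 1}}

single_result = count_keys(plain)                  # 2
regional_result = count_keys_parallel(doubledict)  # {'Sox2': 2, 'Klf4': 1}
```

Calling count_keys_parallel(plain) would not give the intended result here, because the wrapper would treat the keys of plain as regions. This is why the two versions need separate names.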

Per-region args and kwargs

By default, @parallelize_regions will simply copy all the other args and kwargs to each region’s invocation of fn(). In other words, when you call fn(counts, arg_1, arg_2), the following will be executed:

fn(counts['region_1'], arg_1, arg_2)
fn(counts['region_2'], arg_1, arg_2)
...

However, if any arg or kwarg is a dict which has the same keys as the first positional argument (or, if the arg is a nested dict, if its second level has these same keys), the arg will be replaced with each region’s entry in that dict. In other words, if we call fn(counts, primermap), where primermap is a dict whose keys match counts, the following will be executed:

fn(counts['region_1'], primermap['region_1'])
fn(counts['region_2'], primermap['region_2'])
...

This substitution is performed on an arg-by-arg basis, so you can use any mixture of normal and “regional dictionary” arguments when calling the function.
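The arg-by-arg substitution can be sketched as follows. This is a simplified illustration, not lib5c's code: regionalize, call_per_region, and describe are hypothetical helpers, the calls run serially, and the nested-dict case mentioned above is deliberately left out.

```python
def regionalize(value, regions, region):
    """Return value[region] if value is a dict keyed by the same regions,
    otherwise return value unchanged."""
    if isinstance(value, dict) and set(value) == set(regions):
        return value[region]
    return value


def call_per_region(fn, first, *args, **kwargs):
    """Call fn() once per region, substituting regional dictionary arguments."""
    results = {}
    for region in first:
        region_args = [regionalize(a, first, region) for a in args]
        region_kwargs = {k: regionalize(v, first, region)
                         for k, v in kwargs.items()}
        results[region] = fn(first[region], *region_args, **region_kwargs)
    return results


def describe(matrix, primers, scale=1):
    # Stand-in for a real per-region function; integers stand in for matrices.
    return matrix * scale, primers


counts = {'region_1': 10, 'region_2': 20}
primermap = {'region_1': ['p1', 'p2'], 'region_2': ['p3']}

# primermap is split per region; scale=2 is copied to every call.
result = call_per_region(describe, counts, primermap, scale=2)
```

In this sketch, result['region_1'] is (20, ['p1', 'p2']) and result['region_2'] is (40, ['p3']).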

Automatic result unpacking

Let’s say fn() returns a tuple, for example:

from lib5c.util.parallelization import parallelize_regions

@parallelize_regions
def fn(matrix):
    return matrix + 1, matrix - 1

When we call fn() on a single matrix, we expect to see

bigger_matrix, smaller_matrix = fn(matrix)

The same thing will work when calling fn() on a counts dict:

bigger_counts_dict, smaller_counts_dict = fn(counts)

In this case bigger_counts_dict and smaller_counts_dict will each be dicts whose keys match the keys of counts.
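One way to picture the unpacking step is as a transposition from a dict of tuples into a tuple of dicts. The helper below, unpack_results, is a hypothetical name used only for this sketch, and integers stand in for matrices. How the real decorator detects and unpacks tuples may differ.

```python
def unpack_results(per_region):
    """Turn {region: (a, b, ...)} into ({region: a}, {region: b}, ...).

    Results that are not all tuples are returned unchanged.
    """
    values = list(per_region.values())
    if not values or not all(isinstance(v, tuple) for v in values):
        return per_region
    width = len(values[0])
    return tuple(
        {region: result[i] for region, result in per_region.items()}
        for i in range(width)
    )


# What a per-region run of fn() might produce before unpacking.
per_region = {'Sox2': (11, 9), 'Klf4': (6, 4)}

bigger_counts_dict, smaller_counts_dict = unpack_results(per_region)
```

Here bigger_counts_dict is {'Sox2': 11, 'Klf4': 6} and smaller_counts_dict is {'Sox2': 9, 'Klf4': 4}, so both keep the keys of the original counts dict.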

Fallback to series execution

If an error is encountered during the parallel processing, the decorator will attempt to re-run the same job in series, in hopes that this will result in a more readable stack trace.
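The fallback idea can be sketched with the standard library. This is an assumption-laden illustration: run_with_fallback and safe_sqrt are hypothetical names, and a thread pool from concurrent.futures stands in for whatever parallel backend the decorator actually uses.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_with_fallback(fn, counts):
    """Run fn() over every region in parallel, re-running serially on error."""
    regions = list(counts)
    try:
        with ThreadPoolExecutor() as pool:
            values = list(pool.map(fn, (counts[r] for r in regions)))
    except Exception:
        # Re-run in series: if the error recurs, it is raised from this
        # plain loop, which gives a shorter and more direct traceback.
        values = [fn(counts[r]) for r in regions]
    return dict(zip(regions, values))


def safe_sqrt(x):
    if x < 0:
        raise ValueError('negative input: %r' % x)
    return math.sqrt(x)


good = run_with_fallback(safe_sqrt, {'Sox2': 4.0, 'Klf4': 9.0})

try:
    run_with_fallback(safe_sqrt, {'Sox2': 4.0, 'Klf4': -1.0})
except ValueError as err:
    message = str(err)
```

The first call succeeds in the pool. The second call fails in the pool, is re-run in series, and raises the same ValueError from the serial loop.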

Signature preservation

@parallelize_regions is itself decorated by the @pretty_decorator meta-decorator, which can be found in lib5c.util.pretty_decorator. This allows the signature of the decorated function to be preserved through the decoration process.
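To see why signature preservation matters, compare a decorator that discards the signature with one that keeps it. The sketch below uses functools.wraps from the standard library as a stand-in. It does not reproduce @pretty_decorator, whose implementation is not shown here, and naive_decorator and preserving_decorator are hypothetical names.

```python
import functools
import inspect


def naive_decorator(fn):
    # The wrapper hides the signature of fn() behind *args and **kwargs.
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def preserving_decorator(fn):
    # functools.wraps copies the name and docstring and records fn as
    # __wrapped__, which inspect.signature() follows.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def fn(matrix, offset=1):
    """Add an offset to a matrix."""
    return matrix + offset


naive_sig = str(inspect.signature(naive_decorator(fn)))
preserved_sig = str(inspect.signature(preserving_decorator(fn)))
```

In this sketch naive_sig is '(*args, **kwargs)' while preserved_sig is '(matrix, offset=1)', so help() and other introspection tools still show the parameters of the original function.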