lib5c.util.counts module

Module containing utilities for manipulating 5C counts.

lib5c.util.counts.abs_diff_counts(a, b)

Computes the absolute value of the difference between two counts matrices in parallel across regions.

Parameters

a, b – The two counts dicts to take the absolute difference between.

Returns

The absolute value of the difference between two counts dicts.

Return type

Dict[str, np.ndarray]
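
Examples

A minimal sketch of the region-wise operation described above, written with plain NumPy rather than the library itself. The helper name abs_diff_sketch is hypothetical, and it assumes both dicts share the same region keys and matrix shapes.

```python
import numpy as np


def abs_diff_sketch(a, b):
    """Region-wise |a - b| for two counts dicts that share the same keys."""
    return {region: np.abs(a[region] - b[region]) for region in a}


a = {'r1': np.array([[1., 5.], [5., 2.]])}
b = {'r1': np.array([[3., 1.], [1., np.nan]])}
diff = abs_diff_sketch(a, b)
print(diff['r1'])  # nan in either input stays nan in the result
```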

lib5c.util.counts.apply_nonredundant(func, counts, primermap=None)

Applies a function to some counts over the non-redundant elements of the matrix or matrices.

Parameters
  • func (callable) – The function to apply.

  • counts (np.ndarray or dict of np.ndarray) – The counts to apply the function to.

  • primermap (primermap, optional) – If counts is a dict, pass a primermap to reconstruct the resulting counts dict.

Returns

The result of the operation.

Return type

np.ndarray or dict of np.ndarray
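
Examples

A rough sketch of what "applying a function over the non-redundant elements" can mean for a single symmetric matrix. It is not the library's implementation: the helper name apply_nonredundant_sketch is hypothetical, and it assumes func maps a 1d vector to a vector of the same length.

```python
import numpy as np


def apply_nonredundant_sketch(func, matrix):
    """Apply func to the lower-triangle values, then rebuild a symmetric matrix."""
    rows, cols = np.tril_indices(matrix.shape[0])
    flat_result = func(matrix[rows, cols])  # each (i, j) pair is seen once
    result = np.empty(matrix.shape, dtype=float)
    result[rows, cols] = flat_result
    result[cols, rows] = flat_result  # mirror into the upper triangle
    return result


m = np.array([[1., 4.], [4., 1.]])
# The mean of the non-redundant values [1, 4, 1] is 2, whereas the mean of
# the full matrix would be 2.5 because the off-diagonal 4 is counted twice.
centered = apply_nonredundant_sketch(lambda v: v - v.mean(), m)
print(centered)
```

The point of restricting to the non-redundant elements is that statistics such as the mean are not skewed by counting each off-diagonal entry twice.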

lib5c.util.counts.apply_nonredundant_parallel(func, counts, primermap=None)

Applies a function to some counts over the non-redundant elements of the matrix or matrices.

Parameters
  • func (callable) – The function to apply.

  • counts (np.ndarray or dict of np.ndarray) – The counts to apply the function to.

  • primermap (primermap, optional) – If counts is a dict, pass a primermap to reconstruct the resulting counts dict.

Returns

The result of the operation.

Return type

np.ndarray or dict of np.ndarray

lib5c.util.counts.calculate_pvalues(counts, distribution=<scipy.stats._continuous_distns.norm_gen object>, percentile_threshold=None)

Applies lib5c.util.counts.calculate_regional_pvalues() to each region in a counts dict independently and returns the results as a parallel counts dict containing the p-value information.

Parameters
  • counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

  • distribution (subclass of scipy.stats.rv_continuous) – The distribution to use to model the data.

  • percentile_threshold (float between 0 and 100 or None) – If passed, the distribution kwarg is ignored and p-value modeling is skipped. Instead, the returned data structure will contain dummy p-values, which will be 0.0 whenever the peak passes the percentile threshold, and 1.0 otherwise. This percentile threshold is applied independently for each region.

Returns

The first element of the tuple is a counts dict containing p-values for each region. The keys are the region names. The values are the arrays of p-values for that region. These arrays are square and symmetric. The second element of the tuple is a dict containing the values of the parameters used when modeling each region’s counts. The keys of this dict are region names, and its values are tuples of floats describing the parameters for that region. The number and order of the floats will match the return value of rv_continuous.fit() for the particular distribution specified by the distribution kwarg.

Return type

(dict of 2d numpy arrays, dict of tuples of floats) tuple

Notes

If you only need the p-values and don’t care about the stats, you can also just do a dict comprehension as shown here:

{
    region: calculate_regional_pvalues(counts[region])[1]
    for region in counts.keys()
}

lib5c.util.counts.calculate_regional_pvalues(regional_counts, distribution=<scipy.stats._continuous_distns.norm_gen object>, params=None, percentile_threshold=None)

Models the distribution of counts within a region using the distribution specified by the distribution kwarg (by default, a normal distribution with mean mu and standard deviation sigma), then returns an array of p-values for each pairwise interaction under that model.

Parameters
  • regional_counts (2d numpy array) – The square, symmetric array of counts for one region.

  • distribution (subclass of scipy.stats.rv_continuous) – The distribution to use to model the data.

  • params (tuple of floats or None) – The parameters to plug into the distribution specified by the distribution kwarg when modeling the data. If None is passed, the parameters will be automatically calculated using rv_continuous.fit(). The number and order of the floats should match the return value of rv_continuous.fit() for the particular distribution specified by the distribution kwarg.

  • percentile_threshold (float between 0 and 100 or None) – If passed, the distribution and params kwargs are ignored and p-value modeling is skipped. Instead, the returned data structure will contain dummy p-values, which will be 0.0 whenever the peak passes the percentile threshold, and 1.0 otherwise.

Returns

The tuple of floats contains the values of the parameters used to model the distribution. If percentile_threshold was passed, this tuple will contain only one float, which will be the value used for thresholding. The 2d numpy array is the p-value for each count.

Return type

(tuple of floats or None, 2d numpy array) tuple
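
Examples

A sketch of the percentile_threshold mode described above, using plain NumPy instead of the library. Several details are assumptions made for illustration: the helper name dummy_pvalues_sketch is hypothetical, the percentile is taken over every entry of the matrix, "passes" is read as strictly greater than the threshold, and nan entries simply fail the comparison.

```python
import numpy as np


def dummy_pvalues_sketch(regional_counts, percentile_threshold):
    """Return ((threshold,), dummy p-values) for one region."""
    threshold = float(np.nanpercentile(regional_counts, percentile_threshold))
    # 0.0 where the count exceeds the threshold, 1.0 everywhere else
    pvalues = np.where(regional_counts > threshold, 0.0, 1.0)
    return (threshold,), pvalues


counts = np.array([[1., 2., 3.],
                   [2., 4., 5.],
                   [3., 5., 6.]])
params, pvalues = dummy_pvalues_sketch(counts, 50)
print(params)
print(pvalues)
```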

lib5c.util.counts.convert_pvalues_to_interaction_scores(pvalues)

Calculates interaction scores from p-values.

Parameters

pvalues (np.ndarray) – An array of p-values for a single region

Returns

An array of interaction scores for a single region

Return type

np.ndarray

lib5c.util.counts.distance_filter(matrix, k=5)

Wipes the first k off-diagonals of matrix with np.nan.

Parameters
  • matrix (np.ndarray) – The matrix to distance filter.

  • k (int) – The number of off-diagonals to wipe.

Returns

The wiped matrix.

Return type

np.ndarray
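
Examples

A sketch of wiping short-range diagonals with plain NumPy. It is not the library's code: the helper name distance_filter_sketch is hypothetical, it reads "the first k off-diagonals" literally as offsets 1 through k on both sides of the main diagonal (leaving the main diagonal itself untouched), and it returns a copy rather than editing the input. The real function may differ on each of these points.

```python
import numpy as np


def distance_filter_sketch(matrix, k=5):
    """Set entries with 1 <= |i - j| <= k to nan on a copy of matrix."""
    i, j = np.indices(matrix.shape)
    offset = np.abs(i - j)
    wiped = matrix.astype(float)  # astype returns a copy by default
    wiped[(offset >= 1) & (offset <= k)] = np.nan
    return wiped


m = np.ones((4, 4))
out = distance_filter_sketch(m, k=1)
print(out)
```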

lib5c.util.counts.divide_regional_counts(*list_of_regional_counts)

Perform element-wise serial division on a list of regional counts matrices.

Parallelizable; see lib5c.util.counts.parallel_divide_counts().

Propagates nan’s. Emits nan’s when dividing by zero.

Parameters

list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to divide.

Returns

The quotient.

Return type

np.ndarray
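
Examples

A sketch of serial element-wise division with the nan behavior described above, written with plain NumPy. The helper name divide_serial_sketch is hypothetical and this is not the library's implementation; it assumes "serial" means the first matrix is divided by each later matrix in turn.

```python
import numpy as np


def divide_serial_sketch(*matrices):
    """Compute matrices[0] / matrices[1] / ... with nan on division by zero."""
    result = np.array(matrices[0], dtype=float)
    for divisor in matrices[1:]:
        divisor = np.asarray(divisor, dtype=float)
        with np.errstate(divide='ignore', invalid='ignore'):
            result = result / divisor
        result[divisor == 0] = np.nan  # replace inf from x / 0 with nan
    return result


a = np.array([[4., 9.], [np.nan, 1.]])
b = np.array([[2., 0.], [1., 1.]])
c = np.array([[2., 3.], [1., 0.]])
quotient = divide_serial_sketch(a, b, c)
print(quotient)  # only the (0, 0) entry is finite: 4 / 2 / 2
```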

lib5c.util.counts.extract_queried_counts(regional_counts, regional_primermap)

Starting from a square, symmetric counts matrix containing primer-level contact information, return a non-square, non-symmetric matrix where the 5’-oriented fragments sit in the rows of the matrix while the 3’-oriented fragments sit in the columns. This restricts the input matrix to only the pairwise contacts that were actually queried by the 5C assay.

Parameters
  • regional_counts (np.ndarray) – The classic square, symmetric counts matrix for this region.

  • regional_primermap (List[Dict[str, Any]]) – The primermap describing the fragments in this region. It must contain an ‘orientation’ metadata key so that regional_primermap[i]['orientation'] is "5'" when the fragment was targeted by a 5’-oriented primer and "3'" otherwise.

Returns

The np.ndarray is the queried counts matrix, as described above. The two lists of dicts are lists of the primers corresponding to the rows and columns, respectively, of the queried counts matrix.

Return type

np.ndarray, list of dict, list of dict
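
Examples

A sketch of the row and column selection described above, using plain NumPy. The helper name extract_queried_sketch, the toy primermap, and its 'name' key are hypothetical; the sketch also assumes rows and columns keep the order in which fragments appear in the primermap.

```python
import numpy as np


def extract_queried_sketch(regional_counts, regional_primermap):
    """Keep 5'-oriented fragments as rows and 3'-oriented fragments as columns."""
    rows = [i for i, p in enumerate(regional_primermap)
            if p['orientation'] == "5'"]
    cols = [i for i, p in enumerate(regional_primermap)
            if p['orientation'] == "3'"]
    queried = regional_counts[np.ix_(rows, cols)]
    return (queried,
            [regional_primermap[i] for i in rows],
            [regional_primermap[i] for i in cols])


primermap = [{'name': 'f0', 'orientation': "5'"},
             {'name': 'f1', 'orientation': "3'"},
             {'name': 'f2', 'orientation': "5'"},
             {'name': 'f3', 'orientation': "3'"}]
counts = np.arange(16.).reshape(4, 4)
counts = (counts + counts.T) / 2  # make the toy matrix symmetric
queried, row_primers, col_primers = extract_queried_sketch(counts, primermap)
print(queried)
```

Here rows correspond to fragments f0 and f2 and columns to f1 and f3, so the 4 x 4 symmetric matrix reduces to a 2 x 2 matrix of 5'-by-3' contacts.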

lib5c.util.counts.flatten_and_filter_counts(counts, min_filters=None, max_filters=None)

Flattens and filters multiple counts dicts (typically containing different types or stages of data) in parallel, applying customizable filters.

Parameters
  • counts (dict of dict of np.ndarray or dict of np.ndarray) – Outer keys are always names of the types of count dicts. Inner keys are optional and represent region names. If this level of the dict is omitted this function will flatten all the regional counts matrices. Values are always square symmetric counts matrices.

  • min_filters, max_filters – Map outer keys of counts to minimum or maximum values for that type of counts.

Returns

The dict’s values are the parallel flattened and filtered count vectors. It has an extra ‘dist’ key for interaction distance in bin units. The array is a boolean index into the original flattened counts shape representing which positions have been filtered. The list is the order the regions were flattened in, or None if counts had only one level of keys.

Return type

dict of np.ndarray, np.ndarray, list of str

Notes

To separately flatten each region, you can do:

flat, idx, _ = parallelize_regions(flatten_and_filter_counts)(
    {r: {t: counts[t][r] for t in types} for r in regions})

where flat will be a dict of dict of flattened vectors (outer keys are regions, inner keys are types) and idx will be a dict of boolean indices (keys are regions).

lib5c.util.counts.flatten_counts(counts, discard_nan=False)

Flattens each region in a counts dictionary into a flat, nonredundant list.

Parameters
  • counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

  • discard_nan (bool) – If True, nan’s will not be present in the returned lists.

Returns

The keys are the region names. The values are flat, nonredundant lists of counts for that region. The (i, j) th element of the counts for a region (for i >= j ) ends up at the (i*(i+1)/2 + j) th index of the flattened list for that region. If discard_nan is True, then the nan elements will be missing and this specific indexing will not be preserved.

Return type

dict of lists of floats

Examples

>>> import numpy as np
>>> from lib5c.util.counts import flatten_counts
>>> counts = {'a': np.array([[1, 2], [2, 3.]]),
...           'b': np.array([[np.nan, 4], [4, 5.]])}
>>> flat_counts = flatten_counts(counts)
>>> list(sorted(flat_counts.keys()))
['a', 'b']
>>> flat_counts['a']
array([1., 2., 3.])
>>> flat_counts['b']
array([nan,  4.,  5.])
>>> flat_counts = flatten_counts(counts, discard_nan=True)
>>> flat_counts['a']
array([1., 2., 3.])
>>> flat_counts['b']
array([4., 5.])

lib5c.util.counts.flatten_counts_to_list(counts, region_order=None, discard_nan=False)

Flattens counts for all regions into a single, flat, nonredundant list.

Parameters
  • counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

  • region_order (list of str) – List of region names in the order the regions should be concatenated when constructing the flat list. If None, the regions will be concatenated in arbitrary order.

  • discard_nan (bool) – If True, nan’s will not be present in the returned list.

Returns

The concatenated flattened regional counts. For information on the order in which each region’s counts are flattened, see lib5c.util.counts.flatten_regional_counts(). If discard_nan is True, then the nan elements will be missing and this specific indexing will not be preserved.

Return type

1d numpy array

Examples

>>> import numpy as np
>>> from lib5c.util.counts import flatten_counts_to_list
>>> counts = {'a': np.array([[1, 2], [2, 3.]]),
...           'b': np.array([[np.nan, 4], [4, 5.]])}
>>> flatten_counts_to_list(counts, region_order=['a', 'b'])
array([  1.,   2.,   3.,  nan,   4.,   5.])
>>> flatten_counts_to_list(counts, region_order=['b', 'a'])
array([nan,  4.,  5.,  1.,  2.,  3.])
>>> flatten_counts_to_list(counts, region_order=['a', 'b'],
...                        discard_nan=True)
array([1., 2., 3., 4., 5.])

lib5c.util.counts.flatten_obs_and_exp(obs, exp, discard_nan=True, log=False)

Convenience function for flattening observed and expected counts together.

Parameters
  • obs, exp – Regional matrices of observed and expected values, respectively. Pass counts dicts to redirect the call to flatten_obs_and_exp_counts().

  • discard_nan (bool) – Pass True to discard nan’s from the returned vectors.

  • log (bool) – Pass True to log the returned vectors.

Returns

The flattened vectors of observed and expected values, respectively.

Return type

np.ndarray, np.ndarray

lib5c.util.counts.flatten_obs_and_exp_counts(obs_counts, exp_counts, discard_nan=True, log=False)

Convenience function for flattening observed and expected counts together.

Parameters
  • obs_counts, exp_counts – Counts dicts of observed and expected values, respectively.

  • discard_nan (bool) – Pass True to discard nan’s from the returned vectors.

  • log (bool) – Pass True to log the returned vectors.

Returns

The flattened vectors of observed and expected values, respectively.

Return type

np.ndarray, np.ndarray

lib5c.util.counts.flatten_regional_counts(regional_counts, discard_nan=False)

Flattens the counts for a single region into a flat, nonredundant list.

Parameters
  • regional_counts (2d numpy array) – The square, symmetric array of counts for one region.

  • discard_nan (bool) – If True, nan’s will not be present in the returned list.

Returns

A flat, nonredundant list of counts. The (i, j) th element of the regional_counts array (for i >= j ) ends up at the (i*(i+1)/2 + j) th index of the flattened array. If discard_nan was True, these indices will not necessarily match up and it will not be possible to unflatten the array.

Return type

1d numpy array

Examples

>>> import numpy as np
>>> from lib5c.util.counts import flatten_regional_counts
>>> a = np.array([[ 1, 4, -7], [4, 5, np.nan], [-7, np.nan, 9.]])
>>> a
array([[ 1.,  4., -7.],
       [ 4.,  5., nan],
       [-7., nan,  9.]])
>>> flatten_regional_counts(a)
array([  1.,   4.,   5.,  -7.,  nan,   9.])
>>> flatten_regional_counts(a, discard_nan=True)
array([ 1.,  4.,  5., -7.,  9.])

lib5c.util.counts.flip_pvalues(regional_counts)

To some approximation, convert counts matrices containing left-tail p-values to right-tail p-values or vice-versa.

Parameters

regional_counts (np.ndarray) – The counts matrix containing p-values to flip.

Returns

The flipped p-values.

Return type

np.ndarray

lib5c.util.counts.fold_pvalues(regional_counts)

Folds one-tail p-values into two-tail p-values using convert_to_two_tail(). Only valid for p-values called using continuous distributions.

Parameters

regional_counts (np.ndarray) – An array of one-tail p-values for a single region.

Returns

An array of the corresponding two-tail p-values.

Return type

np.ndarray
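
Examples

A sketch of the conventional way to fold one-tail p-values into two-tail p-values, using plain NumPy. The helper name fold_pvalues_sketch is hypothetical, and the formula 2 * min(p, 1 - p) is an assumption about what convert_to_two_tail() computes, not something confirmed here. The formula relies on the left-tail and right-tail p-values summing to one, which is why it only holds for continuous distributions.

```python
import numpy as np


def fold_pvalues_sketch(pvalues):
    """Fold one-tail p-values into two-tail p-values as 2 * min(p, 1 - p)."""
    pvalues = np.asarray(pvalues, dtype=float)
    return 2 * np.minimum(pvalues, 1 - pvalues)


one_tail = np.array([[0.5, 0.01],
                     [0.01, 0.99]])
two_tail = fold_pvalues_sketch(one_tail)
print(two_tail)  # both extreme tails (0.01 and 0.99) fold to about 0.02
```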

lib5c.util.counts.impute_values(regional_counts, size=5)

Impute missing (nan) values in a counts matrix using a local median estimate.

Parameters
  • regional_counts (np.ndarray) – The counts matrix to impute.

  • size (int) – The size of the window used to compute the local median. Should be an odd integer.

Returns

The counts matrix with missing values filled in with the local median estimates.

Return type

np.ndarray
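
Examples

A sketch of local-median imputation with plain NumPy, intended only to illustrate the idea. The helper name impute_local_median_sketch is hypothetical, and the edge handling (windows are truncated at the matrix border, and a nan whose whole window is nan is left as nan) is an assumption rather than the library's documented behavior.

```python
import numpy as np


def impute_local_median_sketch(matrix, size=5):
    """Fill each nan with the median of the finite values in a size x size window."""
    half = size // 2
    filled = matrix.copy()
    for i, j in zip(*np.where(np.isnan(matrix))):
        window = matrix[max(0, i - half):i + half + 1,
                        max(0, j - half):j + half + 1]
        if not np.all(np.isnan(window)):
            filled[i, j] = np.nanmedian(window)  # ignores nan's in the window
    return filled


m = np.array([[1., 2., 3.],
              [4., np.nan, 6.],
              [7., 8., 9.]])
filled = impute_local_median_sketch(m, size=3)
print(filled)
```

An odd window size keeps the window centered on the missing entry, which matches the recommendation for the size parameter.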

lib5c.util.counts.log_regional_counts(regional_counts, pseudocount=1.0, base='e')

Logs a regional counts matrix.

Parallelizable; see lib5c.util.counts.parallel_log_counts().

Emits nan when logging a negative number, and -inf when logging zero.

Parameters
  • regional_counts (np.ndarray) – The counts matrix to log.

  • pseudocount (float) – Pseudocount to add before logging.

  • base (str or float) – The base to use when logging. Acceptable string values are ‘e’, ‘2’, or ‘10’.

Returns

The logged counts matrix.

Return type

np.ndarray

Examples

>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts
>>> a = np.exp(np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(a, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
>>> a -= 1 #  the default pseudocount will add this back before logging
>>> a[0, 0] = -2  # what happens to negative values?
>>> log_regional_counts(a)
array([[nan,  2.],
       [ 2.,  4.]])
>>> b = np.power(42, np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(b, base=42, pseudocount=0)
array([[1., 2.],
       [2., 4.]])

lib5c.util.counts.norm_counts(counts, order=1)

Attempt at defining a “norm” for counts dicts by simply summing a matrix p-norm over the regions.

Parameters
  • counts (Dict[str, np.ndarray]) – The counts dict to compute a norm for.

  • order (int) – The order of the matrix norm, as described by numpy.linalg.norm.

Returns

The norm of the counts dict.

Return type

float
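
Examples

A sketch of summing a matrix norm over regions with numpy.linalg.norm. The helper name norm_counts_sketch is hypothetical and this is not the library's code; in particular, it does nothing special about nan entries, which would make the result nan.

```python
import numpy as np


def norm_counts_sketch(counts, order=1):
    """Sum numpy.linalg.norm(matrix, ord=order) over all regions."""
    return float(sum(np.linalg.norm(counts[region], ord=order)
                     for region in counts))


counts = {'a': np.array([[1., 2.], [2., 3.]]),
          'b': np.array([[0., 4.], [4., 5.]])}
# For order=1 the matrix norm is the largest absolute column sum:
# 5 for region 'a' and 9 for region 'b'.
total = norm_counts_sketch(counts)
print(total)
```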

lib5c.util.counts.parallel_divide_counts(*list_of_regional_counts)

Perform element-wise serial division on a list of regional counts matrices.

Parallelizable; see lib5c.util.counts.parallel_divide_counts().

Propagates nan’s. Emits nan’s when dividing by zero.

Parameters

list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to divide.

Returns

The quotient.

Return type

np.ndarray

lib5c.util.counts.parallel_log_counts(regional_counts, pseudocount=1.0, base='e')

Logs a regional counts matrix.

Parallelizable; see lib5c.util.counts.parallel_log_counts().

Emits nan when logging a negative number, and -inf when logging zero.

Parameters
  • regional_counts (np.ndarray) – The counts matrix to log.

  • pseudocount (float) – Pseudocount to add before logging.

  • base (str or float) – The base to use when logging. Acceptable string values are ‘e’, ‘2’, or ‘10’.

Returns

The logged counts matrix.

Return type

np.ndarray

Examples

>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts
>>> a = np.exp(np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(a, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
>>> a -= 1 #  the default pseudocount will add this back before logging
>>> a[0, 0] = -2  # what happens to negative values?
>>> log_regional_counts(a)
array([[nan,  2.],
       [ 2.,  4.]])
>>> b = np.power(42, np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(b, base=42, pseudocount=0)
array([[1., 2.],
       [2., 4.]])

lib5c.util.counts.parallel_subtract_counts(*list_of_regional_counts)

Perform element-wise serial subtraction on a list of regional counts matrices.

Parallelizable; see lib5c.util.counts.parallel_subtract_counts().

Propagates nan’s.

Parameters

list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to subtract.

Returns

The difference.

Return type

np.ndarray

lib5c.util.counts.parallel_unlog_counts(regional_counts, pseudocount=1.0, base='e')

Unlogs a regional counts matrix.

Parallelizable; see lib5c.util.counts.parallel_unlog_counts().

Emits nan’s when the input counts are nan.

Parameters
  • regional_counts (np.ndarray) – The counts matrix to unlog.

  • pseudocount (float) – Pseudocount to subtract after unlogging.

  • base (str or float) – The base to use when unlogging. Acceptable string values are ‘e’, ‘2’, or ‘10’.

Returns

The unlogged counts matrix.

Return type

np.ndarray

Examples

>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts, unlog_regional_counts
>>> a = np.array([[1, 2], [2, 4.]])
>>> log_regional_counts(unlog_regional_counts(a))
array([[1., 2.],
       [2., 4.]])
>>> log_regional_counts(unlog_regional_counts(a, base=42), base=42)
array([[1., 2.],
       [2., 4.]])

lib5c.util.counts.propagate_nans(regional_counts_a, regional_counts_b)

Propagate nan values between two matrices.

Parameters

regional_counts_a, regional_counts_b – The matrices to propagate nan’s between. These should have the same shape.

Returns

The nan-propagated versions of the input matrices, in the order they were passed.

Return type

Tuple[np.ndarray, np.ndarray]
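
Examples

A sketch of nan propagation between two same-shaped matrices with plain NumPy. The helper name propagate_nans_sketch is hypothetical, and returning new arrays instead of modifying the inputs is an assumption made for the illustration.

```python
import numpy as np


def propagate_nans_sketch(a, b):
    """Set a position to nan in both outputs if it is nan in either input."""
    mask = np.isnan(a) | np.isnan(b)
    return np.where(mask, np.nan, a), np.where(mask, np.nan, b)


a = np.array([[1., np.nan], [np.nan, 4.]])
b = np.array([[np.nan, 2.], [2., 3.]])
a_out, b_out = propagate_nans_sketch(a, b)
print(a_out)
print(b_out)
```

After propagation both matrices are nan at exactly the same positions, so element-wise comparisons between them use a consistent set of entries.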

lib5c.util.counts.queried_counts_to_pvalues(queried_counts)

Convert a queried counts matrix to a matrix of equivalent right-tail p-values using the empirical CDF.

Parameters

queried_counts (np.ndarray) – The matrix of queried counts for this region. See lib5c.util.counts.extract_queried_counts().

Returns

The empirical p-value queried counts matrix for this region.

Return type

np.ndarray
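
Examples

A sketch of one way to turn counts into right-tail p-values with an empirical CDF, using plain NumPy. The helper name empirical_right_tail_sketch is hypothetical, and the convention used here (the p-value of x is the fraction of finite entries greater than or equal to x, with nan entries kept as nan) is an assumption; the library may treat ties or missing values differently.

```python
import numpy as np


def empirical_right_tail_sketch(queried_counts):
    """p-value of x = fraction of finite entries that are >= x."""
    finite = np.sort(queried_counts[np.isfinite(queried_counts)])
    n = len(finite)
    # searchsorted(..., side='left') counts the entries strictly below x
    n_greater_equal = n - np.searchsorted(finite, queried_counts, side='left')
    pvalues = n_greater_equal / n
    pvalues[np.isnan(queried_counts)] = np.nan
    return pvalues


queried = np.array([[1., 2.],
                    [np.nan, 4.]])
pvalues = empirical_right_tail_sketch(queried)
print(pvalues)  # the largest count gets the smallest p-value
```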

lib5c.util.counts.regional_counts_to_pvalues(regional_counts)

Convert a counts matrix to a matrix of equivalent right-tail p-values using the empirical CDF.

Parameters

regional_counts (np.ndarray) – The counts matrix for this region.

Returns

The empirical p-value counts matrix for this region.

Return type

np.ndarray

lib5c.util.counts.subtract_regional_counts(*list_of_regional_counts)

Perform element-wise serial subtraction on a list of regional counts matrices.

Parallelizable; see lib5c.util.counts.parallel_subtract_counts().

Propagates nan’s.

Parameters

list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to subtract.

Returns

The difference.

Return type

np.ndarray

lib5c.util.counts.unflatten_counts(flat_counts)

Apply unflatten_regional_counts() in parallel to a dict of flat regional counts to get back the original counts dict.

Parameters

flat_counts (Dict[str, List[float]]) – The keys are region names as strings. The values are flat, nonredundant lists of counts for that region. The (i*(i+1)/2 + j) th element of each list will end up at both the (i, j) th and the (j, i) th element of the returned array for that region.

Returns

The keys are region names as strings. The values are square, symmetric array representations of the counts for that region. The (i*(i+1)/2 + j) th element of flat_regional_counts will end up at both the (i, j) th and the (j, i) th element of this array.

Return type

Dict[str, np.ndarray]

Examples

>>> import numpy as np
>>> from lib5c.util.counts import unflatten_counts
>>> flattened_counts = {'a': np.array([1, 2, 3.]),
...                     'b': np.array([np.nan, 4, 5.])}
>>> counts = unflatten_counts(flattened_counts)
>>> list(sorted(counts.keys()))
['a', 'b']
>>> counts['a']
array([[1., 2.],
       [2., 3.]])
>>> counts['b']
array([[nan,  4.],
       [ 4.,  5.]])

lib5c.util.counts.unflatten_counts_from_list(flattened_counts_array, region_order, pixelmap)

Unflattens a single list of flattened counts from many regions into a standard counts dict structure.

Parameters
  • flattened_counts_array (1d numpy array) – The list of flattened counts to be unflattened. See lib5c.util.counts.flatten_counts_to_list().

  • region_order (list of str) – The list of region names in the order in which the regions were concatenated when making flattened_counts_array. See lib5c.util.counts.flatten_counts_to_list().

  • pixelmap (dict of list of dict) – A pixelmap or primermap. This will be used to determine the size of each region. See lib5c.parsers.primers.get_pixelmap() or lib5c.parsers.primers.get_primermap().

Returns

The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

Return type

dict of 2d numpy arrays

Examples

>>> import numpy as np
>>> from lib5c.util.counts import unflatten_counts_from_list
>>> flat_counts = np.array([1, 2, 3., np.nan, 4, 5.])
>>> pixelmap = {'a': [{}, {}], 'b': [{}, {}]}
>>> counts = unflatten_counts_from_list(flat_counts, ['a', 'b'], pixelmap)
>>> list(sorted(counts.keys()))
['a', 'b']
>>> counts['a']
array([[1., 2.],
       [2., 3.]])
>>> counts['b']
array([[nan,  4.],
       [ 4.,  5.]])

lib5c.util.counts.unflatten_regional_counts(flat_regional_counts)

Turn a list of flattened counts back into a square symmetric array.

Parameters

flat_regional_counts (1d numpy array) – A flat, nonredundant array of counts. The (i*(i+1)/2 + j) th element of this list will end up at both the (i, j) th and the (j, i) th element of the returned array.

Returns

A square, symmetric array representation of the counts. The (i*(i+1)/2 + j) th element of flat_regional_counts will end up at both the (i, j) th and the (j, i) th element of this array.

Return type

2d numpy array

Examples

>>> import numpy as np
>>> from lib5c.util.counts import unflatten_regional_counts
>>> b = np.array([ 1, 4, 5, -7, np.nan, 9.])
>>> b
array([  1.,   4.,   5.,  -7.,  nan,   9.])
>>> unflatten_regional_counts(b)
array([[ 1.,  4., -7.],
       [ 4.,  5., nan],
       [-7., nan,  9.]])

lib5c.util.counts.unlog_regional_counts(regional_counts, pseudocount=1.0, base='e')

Unlogs a regional counts matrix.

Parallelizable; see lib5c.util.counts.parallel_unlog_counts().

Emits nan’s when the input counts are nan.

Parameters
  • regional_counts (np.ndarray) – The counts matrix to unlog.

  • pseudocount (float) – Pseudocount to subtract after unlogging.

  • base (str or float) – The base to use when unlogging. Acceptable string values are ‘e’, ‘2’, or ‘10’.

Returns

The unlogged counts matrix.

Return type

np.ndarray

Examples

>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts, unlog_regional_counts
>>> a = np.array([[1, 2], [2, 4.]])
>>> log_regional_counts(unlog_regional_counts(a))
array([[1., 2.],
       [2., 4.]])
>>> log_regional_counts(unlog_regional_counts(a, base=42), base=42)
array([[1., 2.],
       [2., 4.]])