lib5c.util.counts module
Module containing utilities for manipulating 5C counts.
-
lib5c.util.counts.
abs_diff_counts
(a, b) Computes the absolute value of the difference between two counts matrices in parallel across regions.
- Parameters
a, b – The two counts dicts to take the absolute difference between.
- Returns
The absolute value of the difference between two counts dicts.
- Return type
Dict[str, np.ndarray]
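As a rough illustration of what "in parallel across regions" means, the sketch below computes the same quantity with plain NumPy. The helper name `abs_diff_sketch` is hypothetical and this is not lib5c's implementation; it assumes both dicts share the same region names and matrix shapes.

```python
import numpy as np


def abs_diff_sketch(a, b):
    """Hypothetical stand-in: per-region absolute difference of two counts dicts."""
    # Assumes a and b have identical region names and matching matrix shapes.
    return {region: np.abs(a[region] - b[region]) for region in a}


a = {'r1': np.array([[1., 5.], [5., 2.]])}
b = {'r1': np.array([[3., 1.], [1., 2.]])}
diff = abs_diff_sketch(a, b)
```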
-
lib5c.util.counts.
apply_nonredundant
(func, counts, primermap=None) Applies a function to some counts over the non-redundant elements of the matrix or matrices.
- Parameters
func (callable) – The function to apply.
counts (np.ndarray or dict of np.ndarray) – The counts to apply the function to.
primermap (primermap, optional) – If counts is a dict, pass a primermap to reconstruct the resulting counts dict.
- Returns
The result of the operation.
- Return type
np.ndarray or dict of np.ndarray
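To make "non-redundant elements" concrete, here is a minimal sketch for the single-matrix case. It assumes that func maps a 1-D vector to a 1-D vector of the same length, which the description above does not state; `apply_nonredundant_sketch` is a hypothetical name, not the library function.

```python
import numpy as np


def apply_nonredundant_sketch(func, matrix):
    """Hypothetical stand-in for the single-matrix case."""
    rows, cols = np.tril_indices(matrix.shape[0])
    # Each symmetric pair (i, j)/(j, i) is passed to func only once.
    result_flat = func(matrix[rows, cols])
    result = np.empty(matrix.shape, dtype=float)
    result[rows, cols] = result_flat
    result[cols, rows] = result_flat  # mirror into the upper triangle
    return result


m = np.array([[0., 3.], [3., 3.]])
# The non-redundant mean is (0 + 3 + 3) / 3 = 2, whereas the mean over the
# full matrix would be (0 + 3 + 3 + 3) / 4 = 2.25.
centered = apply_nonredundant_sketch(lambda v: v - v.mean(), m)
```

The point of the design is that statistics computed inside func are not biased by counting every off-diagonal interaction twice.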
-
lib5c.util.counts.
apply_nonredundant_parallel
(func, counts, primermap=None) Applies a function to some counts over the non-redundant elements of the matrix or matrices.
- Parameters
func (callable) – The function to apply.
counts (np.ndarray or dict of np.ndarray) – The counts to apply the function to.
primermap (primermap, optional) – If counts is a dict, pass a primermap to reconstruct the resulting counts dict.
- Returns
The result of the operation.
- Return type
np.ndarray or dict of np.ndarray
-
lib5c.util.counts.
calculate_pvalues
(counts, distribution=<scipy.stats._continuous_distns.norm_gen object>, percentile_threshold=None) Applies lib5c.util.counts.calculate_regional_pvalues() to each region in a counts dict independently and returns the results as a parallel counts dict containing the p-value information.
- Parameters
counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.
distribution (subclass of scipy.stats.rv_continuous) – The distribution to use to model the data.
percentile_threshold (float between 0 and 100 or None) – If passed, the distribution kwarg is ignored and p-value modeling is skipped. Instead, the returned data structure will contain dummy p-values, which will be 0.0 whenever the peak passes the percentile threshold, and 1.0 otherwise. This percentile threshold is applied independently for each region.
- Returns
The first element of the tuple is a counts dict containing p-values for each region. The keys are the region names. The values are the arrays of p-values for that region. These arrays are square and symmetric. The second element of the tuple is a dict containing information about the values of the parameters used when modeling each region’s counts. The keys of this dict are region names, and its values are tuples of floats describing these parameters for that region. The number and order of the floats will match the return value of rv_continuous.fit() for the particular distribution specified by the distribution kwarg.
- Return type
(dict of 2d numpy arrays, dict of tuples of floats) tuple
Notes
If you only need the p-values and don’t care about the stats, you can also just do a dict comprehension as shown here:
{ region: calculate_regional_pvalues(counts[region])[1] for region in counts.keys() }
-
lib5c.util.counts.
calculate_regional_pvalues
(regional_counts, distribution=<scipy.stats._continuous_distns.norm_gen object>, params=None, percentile_threshold=None) Models the distribution of counts within a region as a normal distribution with mean mu and standard deviation sigma, then returns an array of p-values for each pairwise interaction assuming that normal distribution.
- Parameters
regional_counts (2d numpy array) – The square, symmetric array of counts for one region.
distribution (subclass of scipy.stats.rv_continuous) – The distribution to use to model the data.
params (tuple of floats or None) – The parameters to plug into the distribution specified by the distribution kwarg when modeling the data. If None is passed, the parameters will be automatically calculated using rv_continuous.fit(). The number and order of the floats should match the return value of rv_continuous.fit() for the particular distribution specified by the distribution kwarg.
percentile_threshold (float between 0 and 100 or None) – If passed, the distribution and params kwargs are ignored and p-value modeling is skipped. Instead, the returned data structure will contain dummy p-values, which will be 0.0 whenever the peak passes the percentile threshold, and 1.0 otherwise.
- Returns
The tuple of floats contains the values of the parameters used to model the distribution. If percentile_threshold was passed, this tuple will contain only one float, which will be the value used for thresholding. The 2d numpy array is the p-value for each count.
- Return type
(tuple of floats or None, 2d numpy array) tuple
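The percentile_threshold behaviour described above can be sketched roughly as follows. This is an illustration only, under several assumptions that the text does not confirm: the threshold is taken over every finite entry of the matrix, "passes" means strictly greater than the threshold, and missing counts stay missing. The name `threshold_pvalues_sketch` is hypothetical.

```python
import numpy as np


def threshold_pvalues_sketch(regional_counts, percentile_threshold):
    """Hypothetical stand-in for the percentile_threshold branch only."""
    # Assumption: the threshold is the requested percentile of the finite counts.
    threshold = np.nanpercentile(regional_counts, percentile_threshold)
    # Assumption: "passes" means strictly greater than the threshold.
    pvalues = np.where(regional_counts > threshold, 0.0, 1.0)
    # Assumption: missing counts stay missing instead of becoming 1.0.
    pvalues[np.isnan(regional_counts)] = np.nan
    # Mirror the documented return order: (parameters, p-values).
    return (threshold,), pvalues


params, pvalues = threshold_pvalues_sketch(
    np.array([[1., 2.], [2., 10.]]), 50)
```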
-
lib5c.util.counts.
convert_pvalues_to_interaction_scores
(pvalues) Calculates interaction scores from p-values.
- Parameters
pvalues (np.ndarray) – An array of p-values for a single region
- Returns
An array of interaction scores for a single region
- Return type
np.ndarray
-
lib5c.util.counts.
distance_filter
(matrix, k=5) Wipes the first k off-diagonals of matrix with np.nan.
- Parameters
matrix (np.ndarray) – The matrix to distance filter.
k (int) – The number of off-diagonals to wipe.
- Returns
The wiped matrix.
- Return type
np.ndarray
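A minimal sketch of this kind of distance filter is shown below. It reads "the first k off-diagonals" literally, so entries with 1 <= |i - j| <= k are wiped and the main diagonal is left alone; whether the library also wipes the main diagonal is not stated here, so treat that as an assumption. `distance_filter_sketch` is a hypothetical name.

```python
import numpy as np


def distance_filter_sketch(matrix, k=5):
    """Hypothetical stand-in; wipes entries with 1 <= |i - j| <= k."""
    wiped = np.array(matrix, dtype=float)  # copy so the input is untouched
    i, j = np.indices(wiped.shape)
    distance = np.abs(i - j)
    # Assumption: the main diagonal (distance 0) is not wiped.
    wiped[(distance >= 1) & (distance <= k)] = np.nan
    return wiped


w = distance_filter_sketch(np.ones((4, 4)), k=1)
```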
-
lib5c.util.counts.
divide_regional_counts
(*list_of_regional_counts) Perform element-wise serial division on a list of regional counts matrices.
Parallelizable; see lib5c.util.counts.parallel_divide_counts().
Propagates nan’s. Emits nan’s when dividing by zero.
- Parameters
list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to divide.
- Returns
The quotient.
- Return type
np.ndarray
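The documented behaviour (left-to-right division, nan propagation, nan instead of inf on division by zero) can be approximated with NumPy as in the sketch below. `divide_sketch` is a hypothetical helper written for illustration, not the library code.

```python
from functools import reduce

import numpy as np


def divide_sketch(*matrices):
    """Hypothetical stand-in: ((m0 / m1) / m2) / ..., element-wise."""
    def divide_pair(numerator, denominator):
        # Swap zero divisors for nan so that x / 0 yields nan rather than inf.
        safe = np.where(denominator == 0, np.nan, denominator)
        return numerator / safe
    return reduce(divide_pair, matrices)


a = np.array([[8., 6.], [6., np.nan]])
b = np.array([[2., 0.], [0., 4.]])
c = np.array([[2., 3.], [3., 1.]])
quotient = divide_sketch(a, b, c)
```

Here only the top-left entry survives as a number (8 / 2 / 2); the others are nan because of a zero divisor or a nan numerator.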
-
lib5c.util.counts.
extract_queried_counts
(regional_counts, regional_primermap) Starting from a square, symmetric counts matrix containing primer-level contact information, return a non-square, non-symmetric matrix where the 5’-oriented fragments sit in the rows of the matrix while the 3’-oriented fragments sit in the columns. This restricts the input matrix to only the pairwise contacts that were actually queried by the 5C assay.
- Parameters
regional_counts (np.ndarray) – The classic square, symmetric counts matrix for this region.
regional_primermap (List[Dict[str, Any]]) – The primermap describing the fragments in this region. It must contain an ‘orientation’ metadata key so that regional_primermap[i]['orientation'] is "5'" when the fragment was targeted by a 5’-oriented primer and "3'" otherwise.
- Returns
The np.ndarray is the queried counts matrix, as described above. The two lists of dicts are lists of the primers corresponding to the rows and columns, respectively, of the queried counts matrix.
- Return type
np.ndarray, list of dict, list of dict
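The row/column selection described above can be sketched with `np.ix_`. The helper name `extract_queried_sketch` and the 'name' key in the toy primermap are hypothetical; only the 'orientation' key comes from the description.

```python
import numpy as np


def extract_queried_sketch(regional_counts, regional_primermap):
    """Hypothetical stand-in: rows are 5' fragments, columns are 3' fragments."""
    rows = [i for i, p in enumerate(regional_primermap)
            if p['orientation'] == "5'"]
    cols = [i for i, p in enumerate(regional_primermap)
            if p['orientation'] == "3'"]
    # np.ix_ builds the open mesh that selects the rows x cols sub-matrix.
    queried = regional_counts[np.ix_(rows, cols)]
    return (queried,
            [regional_primermap[i] for i in rows],
            [regional_primermap[i] for i in cols])


primermap = [{'name': 'f0', 'orientation': "5'"},
             {'name': 'f1', 'orientation': "3'"},
             {'name': 'f2', 'orientation': "5'"}]
counts = np.array([[0., 7., 1.],
                   [7., 0., 9.],
                   [1., 9., 0.]])
queried, row_primers, col_primers = extract_queried_sketch(counts, primermap)
```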
-
lib5c.util.counts.
flatten_and_filter_counts
(counts, min_filters=None, max_filters=None) Flattens and filters multiple counts dicts (typically containing different types or stages of data) in parallel, applying customizable filters.
- Parameters
counts (dict of dict of np.ndarray or dict of np.ndarray) – Outer keys are always names of the types of count dicts. Inner keys are optional and represent region names. If this level of the dict is omitted this function will flatten all the regional counts matrices. Values are always square symmetric counts matrices.
min_filters, max_filters – Map outer keys of counts to minimum or maximum values for that type of counts.
- Returns
The dict’s values are the parallel flattened and filtered count vectors. It has an extra ‘dist’ key for interaction distance in bin units. The array is a boolean index into the original flattened counts shape representing which positions have been filtered. The list is the order the regions were flattened in, or None if counts had only one level of keys.
- Return type
dict of np.ndarray, np.ndarray, list of str
Notes
To separately flatten each region, you can do:
flat, idx, _ = parallelize_regions(flatten_and_filter_counts)(
    {r: {t: counts[t][r] for t in types} for r in regions})
where flat will be a dict of dict of flattened vectors (outer keys are regions, inner keys are types) and idx will be a dict of boolean indices (keys are regions).
-
lib5c.util.counts.
flatten_counts
(counts, discard_nan=False) Flattens each region in a counts dictionary into a flat, nonredundant list.
- Parameters
counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.
discard_nan (bool) – If True, nan’s will not be present in the returned lists.
- Returns
The keys are the region names. The values are flat, nonredundant lists of counts for that region. The (i, j)th element of the counts for a region (for i >= j) ends up at the (i*(i+1)/2 + j)th index of the flattened list for that region. If discard_nan is True, then the nan elements will be missing and this specific indexing will not be preserved.
- Return type
dict of lists of floats
Examples
>>> import numpy as np
>>> from lib5c.util.counts import flatten_counts
>>> counts = {'a': np.array([[1, 2], [2, 3.]]),
...           'b': np.array([[np.nan, 4], [4, 5.]])}
>>> flat_counts = flatten_counts(counts)
>>> list(sorted(flat_counts.keys()))
['a', 'b']
>>> flat_counts['a']
array([1., 2., 3.])
>>> flat_counts['b']
array([nan,  4.,  5.])
>>> flat_counts = flatten_counts(counts, discard_nan=True)
>>> flat_counts['a']
array([1., 2., 3.])
>>> flat_counts['b']
array([4., 5.])
-
lib5c.util.counts.
flatten_counts_to_list
(counts, region_order=None, discard_nan=False) Flattens counts for all regions into a single, flat, nonredundant list.
- Parameters
counts (dict of 2d numpy arrays) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.
region_order (list of str) – List of string reference to region names in the order the regions should be concatenated when constructing the flat list. If None, the regions will be concatenated in arbitrary order.
discard_nan (bool) – If True, nan’s will not be present in the returned list.
- Returns
The concatenated flattened regional counts. For information on the order in which flattened regional counts are created, see lib5c.util.counts.flatten_regional_counts(). If discard_nan is True, then the nan elements will be missing and this specific indexing will not be preserved.
- Return type
1d numpy array
Examples
>>> import numpy as np
>>> from lib5c.util.counts import flatten_counts_to_list
>>> counts = {'a': np.array([[1, 2], [2, 3.]]),
...           'b': np.array([[np.nan, 4], [4, 5.]])}
>>> flatten_counts_to_list(counts, region_order=['a', 'b'])
array([ 1.,  2.,  3., nan,  4.,  5.])
>>> flatten_counts_to_list(counts, region_order=['b', 'a'])
array([nan,  4.,  5.,  1.,  2.,  3.])
>>> flatten_counts_to_list(counts, region_order=['a', 'b'],
...                        discard_nan=True)
array([1., 2., 3., 4., 5.])
-
lib5c.util.counts.
flatten_obs_and_exp
(obs, exp, discard_nan=True, log=False) Convenience function for flattening observed and expected counts together.
- Parameters
obs, exp – Regional matrices of observed and expected values, respectively. Pass counts dicts to redirect the call to flatten_obs_and_exp_counts().
discard_nan (bool) – Pass True to discard nan’s from the returned vectors.
log (bool) – Pass True to log the returned vectors.
- Returns
The flattened vectors of observed and expected values, respectively.
- Return type
np.ndarray, np.ndarray
-
lib5c.util.counts.
flatten_obs_and_exp_counts
(obs_counts, exp_counts, discard_nan=True, log=False) Convenience function for flattening observed and expected counts together.
- Parameters
obs_counts, exp_counts – Counts dicts of observed and expected values, respectively.
discard_nan (bool) – Pass True to discard nan’s from the returned vectors.
log (bool) – Pass True to log the returned vectors.
- Returns
The flattened vectors of observed and expected values, respectively.
- Return type
np.ndarray, np.ndarray
-
lib5c.util.counts.
flatten_regional_counts
(regional_counts, discard_nan=False) Flattens the counts for a single region into a flat, nonredundant list.
- Parameters
regional_counts (2d numpy array) – The square, symmetric array of counts for one region.
discard_nan (bool) – If True, nan’s will not be present in the returned list.
- Returns
A flat, nonredundant list of counts. The (i, j)th element of the regional_counts array (for i >= j) ends up at the (i*(i+1)/2 + j)th index of the flattened array. If discard_nan was True, these indices will not necessarily match up and it will not be possible to unflatten the array.
- Return type
1d numpy array
Examples
>>> import numpy as np
>>> from lib5c.util.counts import flatten_regional_counts
>>> a = np.array([[ 1, 4, -7], [4, 5, np.nan], [-7, np.nan, 9.]])
>>> a
array([[ 1.,  4., -7.],
       [ 4.,  5., nan],
       [-7., nan,  9.]])
>>> flatten_regional_counts(a)
array([ 1.,  4.,  5., -7., nan,  9.])
>>> flatten_regional_counts(a, discard_nan=True)
array([ 1.,  4.,  5., -7.,  9.])
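The index formula i*(i+1)/2 + j corresponds to walking the lower triangle row by row. The sketch below checks this with plain NumPy on the matrix from the example above; it uses `np.tril_indices` for illustration and makes no claim about how the library itself is implemented.

```python
import numpy as np

a = np.array([[1., 4., -7.],
              [4., 5., np.nan],
              [-7., np.nan, 9.]])

# Lower-triangle indices in row-major order: (0,0), (1,0), (1,1), (2,0), ...
rows, cols = np.tril_indices(a.shape[0])
flat = a[rows, cols]

# Position assigned to each (i, j) pair by the formula i*(i+1)/2 + j.
positions = rows * (rows + 1) // 2 + cols
```

Because `positions` comes out as consecutive integers starting at 0, indexing with the lower-triangle indices reproduces the documented ordering.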
-
lib5c.util.counts.
flip_pvalues
(regional_counts) To some approximation, convert counts matrices containing left-tail p-values to right-tail p-values or vice-versa.
- Parameters
regional_counts (np.ndarray) – The counts matrix containing p-values to flip.
- Returns
The flipped p-values.
- Return type
np.ndarray
-
lib5c.util.counts.
fold_pvalues
(regional_counts) Folds one-tail p-values into two-tail p-values using convert_to_two_tail(). Only valid for p-values called using continuous distributions.
- Parameters
regional_counts (np.ndarray) – An array of one-tail p-values for a single region.
- Returns
An array of the corresponding two-tail p-values.
- Return type
np.ndarray
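One common way to fold a one-tail p-value into a two-tail p-value for a continuous distribution is 2 * min(p, 1 - p). The sketch below applies that rule; it is an assumption that convert_to_two_tail() behaves this way, and `fold_sketch` is a hypothetical name.

```python
import numpy as np


def fold_sketch(one_tail_pvalues):
    """Hypothetical stand-in using the common rule 2 * min(p, 1 - p)."""
    p = np.asarray(one_tail_pvalues, dtype=float)
    # Values near either tail (close to 0 or close to 1) become small.
    return 2 * np.minimum(p, 1 - p)


folded = fold_sketch(np.array([[0.5, 0.01], [0.01, 0.99]]))
```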
-
lib5c.util.counts.
impute_values
(regional_counts, size=5) Impute missing (nan) values in a counts matrix using a local median estimate.
- Parameters
regional_counts (np.ndarray) – The counts matrix to impute.
size (int) – The size of the window used to compute the local median. Should be an odd integer.
- Returns
The counts matrix with missing values filled in with the local median estimates.
- Return type
np.ndarray
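A local-median imputation of this kind could look like the sketch below. It assumes that windows are truncated at the matrix edges and that a nan is left in place when its whole window is missing; neither detail is specified above. `impute_sketch` is a hypothetical name.

```python
import numpy as np


def impute_sketch(regional_counts, size=5):
    """Hypothetical stand-in: fill nan's with the median of a size x size window."""
    half = size // 2
    imputed = regional_counts.copy()
    for i, j in zip(*np.where(np.isnan(regional_counts))):
        # Assumption: windows are truncated at the matrix edges.
        window = regional_counts[max(i - half, 0):i + half + 1,
                                 max(j - half, 0):j + half + 1]
        if np.any(~np.isnan(window)):
            # nanmedian ignores the missing values inside the window.
            imputed[i, j] = np.nanmedian(window)
    return imputed


m = np.array([[1., 2., 3.],
              [2., np.nan, 4.],
              [3., 4., 5.]])
imputed = impute_sketch(m, size=3)
```

With size=3 the window around the missing centre covers the whole matrix, so the fill value is the median of the eight observed counts.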
-
lib5c.util.counts.
log_regional_counts
(regional_counts, pseudocount=1.0, base='e') Logs a regional counts matrix.
Parallelizable; see lib5c.util.counts.parallel_log_counts().
Emits nan when logging a negative number, and -inf when logging zero.
- Parameters
regional_counts (np.ndarray) – The counts matrix to log.
pseudocount (float) – Pseudocount to add before logging.
base (str or float) – The base to use when logging. Acceptable string values are ‘e’, ‘2’, or ‘10’.
- Returns
The logged counts matrix.
- Return type
np.ndarray
Examples
>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts
>>> a = np.exp(np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(a, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
>>> a -= 1  # the default pseudocount will add this back before logging
>>> a[0, 0] = -2  # what happens to negative values?
>>> log_regional_counts(a)
array([[nan,  2.],
       [ 2.,  4.]])
>>> b = np.power(42, np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(b, base=42, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
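For readers who want to see the arithmetic, the documented behaviour amounts to log_base(counts + pseudocount). The sketch below reproduces it with a change of base; the handling of string bases is an assumption and `log_sketch` is a hypothetical name, not the library function.

```python
import numpy as np


def log_sketch(regional_counts, pseudocount=1.0, base='e'):
    """Hypothetical stand-in: log_base(counts + pseudocount)."""
    # Assumption: string bases such as 'e', '2', '10' map to their numeric values.
    numeric_base = np.e if base == 'e' else float(base)
    with np.errstate(divide='ignore', invalid='ignore'):
        # np.log gives nan for negative input and -inf for zero.
        return np.log(regional_counts + pseudocount) / np.log(numeric_base)


logged = log_sketch(np.array([[7., 1.], [1., -1.]]), pseudocount=1.0, base='2')
negative = log_sketch(np.array([-3.]))
```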
-
lib5c.util.counts.
norm_counts
(counts, order=1) Attempt at defining a “norm” for counts dicts by simply summing a matrix p-norm over the regions.
- Parameters
counts (Dict[str, np.ndarray]) – The counts dict to compute a norm for.
order (int) – The order of the matrix norm, as described by numpy.linalg.norm.
- Returns
The norm of the counts dict.
- Return type
float
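Summing a matrix norm over regions can be sketched in one line with `numpy.linalg.norm`. The helper `norm_counts_sketch` is hypothetical, and the sketch assumes the matrices contain no nan's, since how missing values are treated is not described above.

```python
import numpy as np


def norm_counts_sketch(counts, order=1):
    """Hypothetical stand-in: sum of per-region matrix norms."""
    # Assumption: no nan's are present; np.linalg.norm would propagate them.
    return sum(np.linalg.norm(matrix, ord=order) for matrix in counts.values())


counts = {'a': np.array([[1., 2.], [2., 3.]]),
          'b': np.array([[0., 4.], [4., 5.]])}
# With ord=1 the matrix norm is the largest absolute column sum:
# region 'a' contributes 5.0 and region 'b' contributes 9.0.
total = norm_counts_sketch(counts)
```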
-
lib5c.util.counts.
parallel_divide_counts
(*list_of_regional_counts) Perform element-wise serial division on a list of regional counts matrices.
Parallelizable; see lib5c.util.counts.parallel_divide_counts().
Propagates nan’s. Emits nan’s when dividing by zero.
- Parameters
list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to divide.
- Returns
The quotient.
- Return type
np.ndarray
-
lib5c.util.counts.
parallel_log_counts
(regional_counts, pseudocount=1.0, base='e') Logs a regional counts matrix.
Parallelizable; see lib5c.util.counts.parallel_log_counts().
Emits nan when logging a negative number, and -inf when logging zero.
- Parameters
regional_counts (np.ndarray) – The counts matrix to log.
pseudocount (float) – Pseudocount to add before logging.
base (str or float) – The base to use when logging. Acceptable string values are ‘e’, ‘2’, or ‘10’.
- Returns
The logged counts matrix.
- Return type
np.ndarray
Examples
>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts
>>> a = np.exp(np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(a, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
>>> a -= 1  # the default pseudocount will add this back before logging
>>> a[0, 0] = -2  # what happens to negative values?
>>> log_regional_counts(a)
array([[nan,  2.],
       [ 2.,  4.]])
>>> b = np.power(42, np.array([[1, 2], [2, 4.]]))
>>> log_regional_counts(b, base=42, pseudocount=0)
array([[1., 2.],
       [2., 4.]])
-
lib5c.util.counts.
parallel_subtract_counts
(*list_of_regional_counts) Perform element-wise serial subtraction on a list of regional counts matrices.
Parallelizable; see lib5c.util.counts.parallel_subtract_counts().
Propagates nan’s.
- Parameters
list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to subtract.
- Returns
The difference.
- Return type
np.ndarray
-
lib5c.util.counts.
parallel_unlog_counts
(regional_counts, pseudocount=1.0, base='e') Unlogs a regional counts matrix.
Parallelizable; see lib5c.util.counts.parallel_unlog_counts().
Emits nan’s when the input counts are nan.
- Parameters
regional_counts (np.ndarray) – The counts matrix to unlog.
pseudocount (float) – Pseudocount to subtract after unlogging.
base (str or float) – The base to use when unlogging. Acceptable string values are ‘e’, ‘2’, or ‘10’.
- Returns
The unlogged counts matrix.
- Return type
np.ndarray
Examples
>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts, unlog_regional_counts
>>> a = np.array([[1, 2], [2, 4.]])
>>> log_regional_counts(unlog_regional_counts(a))
array([[1., 2.],
       [2., 4.]])
>>> log_regional_counts(unlog_regional_counts(a, base=42), base=42)
array([[1., 2.],
       [2., 4.]])
-
lib5c.util.counts.
propagate_nans
(regional_counts_a, regional_counts_b) Propagate nan values between two matrices.
- Parameters
regional_counts_a, regional_counts_b – The matrices to propagate nan’s between. These should have the same shape.
- Returns
The nan-propagated versions of the input matrices, in the order they were passed.
- Return type
Tuple[np.ndarray, np.ndarray]
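In other words, a position that is missing in either matrix ends up missing in both. A minimal NumPy sketch of that idea follows; `propagate_nans_sketch` is a hypothetical name and not the library function.

```python
import numpy as np


def propagate_nans_sketch(a, b):
    """Hypothetical stand-in: missing in either input means missing in both outputs."""
    missing = np.isnan(a) | np.isnan(b)
    # np.where builds new arrays, so the inputs are left unmodified.
    return np.where(missing, np.nan, a), np.where(missing, np.nan, b)


a = np.array([[1., np.nan], [np.nan, 4.]])
b = np.array([[np.nan, 2.], [2., 8.]])
a_out, b_out = propagate_nans_sketch(a, b)
```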
-
lib5c.util.counts.
queried_counts_to_pvalues
(queried_counts) Convert a queried counts matrix to a matrix of equivalent right-tail p-values using the empirical CDF.
- Parameters
queried_counts (np.ndarray) – The matrix of queried counts for this region. See lib5c.util.counts.extract_queried_counts().
- Returns
The empirical p-value queried counts matrix for this region.
- Return type
np.ndarray
-
lib5c.util.counts.
regional_counts_to_pvalues
(regional_counts) Convert a counts matrix to a matrix of equivalent right-tail p-values using the empirical CDF.
- Parameters
regional_counts (np.ndarray) – The counts matrix for this region.
- Returns
The empirical p-value counts matrix for this region.
- Return type
np.ndarray
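One way to picture a right-tail empirical p-value is p = 1 - ECDF(x), where ECDF(x) is the fraction of observed values that are at most x. The sketch below follows that definition. How the library treats ties, nan's, and the doubled off-diagonal entries of a symmetric matrix is not stated above, so the choices made here are assumptions, and `empirical_right_tail_sketch` is a hypothetical name.

```python
import numpy as np


def empirical_right_tail_sketch(regional_counts):
    """Hypothetical stand-in: p = 1 - ECDF(x), computed over the finite entries."""
    finite = np.isfinite(regional_counts)
    values = np.sort(regional_counts[finite])
    # ECDF(x) = fraction of observed values that are <= x.
    ecdf = np.searchsorted(values, regional_counts, side='right') / values.size
    pvalues = 1.0 - ecdf
    # Assumption: non-finite counts receive nan rather than a p-value.
    pvalues[~finite] = np.nan
    return pvalues


p = empirical_right_tail_sketch(np.array([[1., 2.], [2., 4.]]))
```

Under this definition the largest count always receives a p-value of 0, which is one reason a real implementation might use a slightly different convention.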
-
lib5c.util.counts.
subtract_regional_counts
(*list_of_regional_counts) Perform element-wise serial subtraction on a list of regional counts matrices.
Parallelizable; see lib5c.util.counts.parallel_subtract_counts().
Propagates nan’s.
- Parameters
list_of_regional_counts (List[np.ndarray]) – The list of regional counts matrices to subtract.
- Returns
The difference.
- Return type
np.ndarray
-
lib5c.util.counts.
unflatten_counts
(flat_counts) Apply unflatten_regional_counts() in parallel to a dict of flat regional counts to get back the original counts dict.
- Parameters
flat_counts (Dict[str, List[float]]) – The keys are region names as strings. The values are flat, nonredundant lists of counts for that region. The (i*(i+1)/2 + j)th element of each list will end up at both the (i, j)th and the (j, i)th element of the returned array for that region.
- Returns
The keys are region names as strings. The values are square, symmetric array representations of the counts for that region. The (i*(i+1)/2 + j)th element of each flat list will end up at both the (i, j)th and the (j, i)th element of this array.
- Return type
Dict[str, np.ndarray]
Examples
>>> import numpy as np
>>> from lib5c.util.counts import unflatten_counts
>>> flattened_counts = {'a': np.array([1, 2, 3.]),
...                     'b': np.array([np.nan, 4, 5.])}
>>> counts = unflatten_counts(flattened_counts)
>>> list(sorted(counts.keys()))
['a', 'b']
>>> counts['a']
array([[1., 2.],
       [2., 3.]])
>>> counts['b']
array([[nan,  4.],
       [ 4.,  5.]])
-
lib5c.util.counts.
unflatten_counts_from_list
(flattened_counts_array, region_order, pixelmap) Unflattens a single list of flattened counts from many regions into a standard counts dict structure.
- Parameters
flattened_counts_array (1d numpy array) – The list of flattened counts to be unflattened. See lib5c.util.counts.flatten_counts_to_list().
region_order (list of str) – The list of region names in the order that the regions were concatenated in when making the flattened_counts_array. See lib5c.util.counts.flatten_counts_to_list().
pixelmap (dict of list of dict) – A pixelmap or primermap. This will be used to determine the size of each region. See lib5c.parsers.primers.get_pixelmap() or lib5c.parsers.primers.get_primermap().
- Returns
The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.
- Return type
dict of 2d numpy arrays
Examples
>>> import numpy as np
>>> from lib5c.util.counts import unflatten_counts_from_list
>>> flat_counts = np.array([1, 2, 3., np.nan, 4, 5.])
>>> pixelmap = {'a': [{}, {}], 'b': [{}, {}]}
>>> counts = unflatten_counts_from_list(flat_counts, ['a', 'b'], pixelmap)
>>> list(sorted(counts.keys()))
['a', 'b']
>>> counts['a']
array([[1., 2.],
       [2., 3.]])
>>> counts['b']
array([[nan,  4.],
       [ 4.,  5.]])
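The pixelmap is needed because a region with n bins contributes n*(n+1)/2 values to the flat array, so the region sizes determine where to cut. The sketch below shows only that splitting step, reusing the data from the example above; it illustrates the idea and is not the library's code.

```python
import numpy as np

flat_counts = np.array([1, 2, 3., np.nan, 4, 5.])
region_order = ['a', 'b']
pixelmap = {'a': [{}, {}], 'b': [{}, {}]}

# A region with n bins contributes n * (n + 1) / 2 non-redundant values.
sizes = [len(pixelmap[r]) * (len(pixelmap[r]) + 1) // 2 for r in region_order]
# Cut points sit at the cumulative sizes, excluding the final total.
boundaries = np.cumsum(sizes)[:-1]
chunks = dict(zip(region_order, np.split(flat_counts, boundaries)))
```

Each chunk can then be unflattened on its own, as described for unflatten_regional_counts().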
-
lib5c.util.counts.
unflatten_regional_counts
(flat_regional_counts) Turn a list of flattened counts back into a square symmetric array.
- Parameters
flat_regional_counts (1d numpy array) – A flat, nonredundant array of counts. The (i*(i+1)/2 + j)th element of this list will end up at both the (i, j)th and the (j, i)th element of the returned array.
- Returns
A square, symmetric array representation of the counts. The (i*(i+1)/2 + j)th element of flat_regional_counts will end up at both the (i, j)th and the (j, i)th element of this array.
- Return type
2d numpy array
Examples
>>> import numpy as np
>>> from lib5c.util.counts import unflatten_regional_counts
>>> b = np.array([ 1, 4, 5, -7, np.nan, 9.])
>>> b
array([ 1.,  4.,  5., -7., nan,  9.])
>>> unflatten_regional_counts(b)
array([[ 1.,  4., -7.],
       [ 4.,  5., nan],
       [-7., nan,  9.]])
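The inverse mapping can be sketched by first solving n*(n+1)/2 = len(flat) for the matrix size n, then filling the lower triangle and mirroring it. `unflatten_sketch` is a hypothetical helper written for illustration and is not the library's implementation.

```python
import numpy as np


def unflatten_sketch(flat):
    """Hypothetical stand-in: rebuild a square, symmetric matrix from its lower triangle."""
    # Solve n * (n + 1) / 2 = len(flat) for the matrix size n.
    n = int(round((np.sqrt(8 * len(flat) + 1) - 1) / 2))
    rows, cols = np.tril_indices(n)
    matrix = np.empty((n, n), dtype=float)
    matrix[rows, cols] = flat
    matrix[cols, rows] = flat  # mirror into the upper triangle
    return matrix


b = np.array([1, 4, 5, -7, np.nan, 9.])
square = unflatten_sketch(b)
```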
-
lib5c.util.counts.
unlog_regional_counts
(regional_counts, pseudocount=1.0, base='e') Unlogs a regional counts matrix.
Parallelizable; see lib5c.util.counts.parallel_unlog_counts().
Emits nan’s when the input counts are nan.
- Parameters
regional_counts (np.ndarray) – The counts matrix to unlog.
pseudocount (float) – Pseudocount to subtract after unlogging.
base (str or float) – The base to use when unlogging. Acceptable string values are ‘e’, ‘2’, or ‘10’.
- Returns
The unlogged counts matrix.
- Return type
np.ndarray
Examples
>>> import numpy as np
>>> from lib5c.util.counts import log_regional_counts, unlog_regional_counts
>>> a = np.array([[1, 2], [2, 4.]])
>>> log_regional_counts(unlog_regional_counts(a))
array([[1., 2.],
       [2., 4.]])
>>> log_regional_counts(unlog_regional_counts(a, base=42), base=42)
array([[1., 2.],
       [2., 4.]])
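Unlogging is the mirror image of logging: raise the base to the logged value, then subtract the pseudocount. A minimal sketch follows; the mapping of string bases to numbers is an assumption, and `unlog_sketch` is a hypothetical name.

```python
import numpy as np


def unlog_sketch(regional_counts, pseudocount=1.0, base='e'):
    """Hypothetical stand-in: base ** counts - pseudocount."""
    # Assumption: string bases such as 'e', '2', '10' map to their numeric values.
    numeric_base = np.e if base == 'e' else float(base)
    # nan inputs stay nan because np.power propagates them.
    return np.power(numeric_base, regional_counts) - pseudocount


unlogged = unlog_sketch(np.array([[3., 1.], [1., np.nan]]), base='2')
```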