lib5c.algorithms.qnorm module
Module for quantile normalization.
Original author of qnorm(), _rank_data(), _average_rows(), and _sub_in_normed_val(): Dan Gillis
Note: data matrices in these functions are typically expected to be arranged with each column representing one replicate, except for the functions _rank_data(), _average_rows(), and _sub_in_normed_val(), which expect them to be arranged with each row representing one replicate.
The exposed functions are qnorm(), qnorm_parallel(), qnorm_fast(), qnorm_fast_parallel(), and the convenience function qnorm_counts_superdict().
-
lib5c.algorithms.qnorm.qnorm(data, tie='lowest', reference_index=None)
Quantile normalizes a data set.
Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_parallel().
- Parameters
data (2d numeric structure, or dict of 1d numeric structure) – Anything that can be cast to array. Should be passed as row-major. Quantile normalization will be performed on the columns of data.
tie ({'lowest', 'average'}, optional) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.
reference_index (int or str, optional) – If data is a row-major array, pass a column index to serve as a reference distribution. If data is a dict, pass a key of that dict that should serve as the reference distribution. Pass None to use the average of all distributions as the target distribution.
- Returns
The quantile normalized data. If data was passed as a dict, then a dict with the same keys is returned.
- Return type
2d numpy array, or dict of 1d numpy array
Notes
This function is nan-safe. As long as each column of the input data contains the same number of nan’s, nan’s will only get averaged with other nan’s, and they will get substituted back into their original positions. See the Examples section for an example of this.
Examples
>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm
>>> qnorm(np.array([[5, 4, 3],
...                 [2, 1, 4],
...                 [3, 4, 6],
...                 [4, 2, 8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]})
>>> list(sorted(res.items()))
[('A', array([5.66666667, 2.        , 3.        , 4.66666667])),
 ('B', array([4.66666667, 2.        , 4.66666667, 3.        ])),
 ('C', array([2.        , 3.        , 4.66666667, 5.66666667]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([6., 3., 6., 4.])),
 ('C', array([3., 4., 6., 8.]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C', tie='average')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([7., 3., 7., 4.])),
 ('C', array([3., 4., 6., 8.]))]
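To make the two tie modes concrete, the following is a minimal NumPy-only sketch of quantile normalization over the columns of a 2d array. It is not lib5c's implementation: the name qnorm_sketch is hypothetical, nan's are assumed to be absent, and only integer column indices are accepted for reference_index.

```python
import numpy as np


def qnorm_sketch(data, tie="lowest", reference_index=None):
    """Illustrative quantile normalization of the columns of a 2d array.

    Hypothetical helper, not part of lib5c; assumes the input has no nan's.
    """
    data = np.asarray(data, dtype=float)
    sorted_cols = np.sort(data, axis=0)

    # Target distribution: one reference column, or the mean across columns
    # of the values found at each rank.
    if reference_index is None:
        target = sorted_cols.mean(axis=1)
    else:
        target = sorted_cols[:, reference_index]

    result = np.empty_like(data)
    for j in range(data.shape[1]):
        col = data[:, j]
        for value in np.unique(col):
            tied = col == value
            # Zero-based ranks [lo, hi) occupied by this group of tied entries.
            lo = np.searchsorted(sorted_cols[:, j], value, side="left")
            hi = np.searchsorted(sorted_cols[:, j], value, side="right")
            if tie == "lowest":
                result[tied, j] = target[lo]
            elif tie == "average":
                result[tied, j] = target[lo:hi].mean()
            else:
                raise ValueError("tie must be 'lowest' or 'average'")
    return result


data = np.array([[5, 4, 3],
                 [2, 1, 4],
                 [3, 4, 6],
                 [4, 2, 8]])
lowest = qnorm_sketch(data, reference_index=2)
average = qnorm_sketch(data, tie="average", reference_index=2)
```

In this sketch, the two tied 4's in the middle column occupy ranks 3 and 4 of that column. With the last column as reference, 'lowest' assigns both the reference value at rank 3, while 'average' assigns both the mean of the reference values at ranks 3 and 4, which mirrors the difference between the last two dict examples above.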
-
lib5c.algorithms.qnorm.qnorm_counts_superdict(counts_superdict, primermap, tie='lowest', regional=False, condition_on=None, reference=None)
Convenience function for quantile normalizing a counts superdict data structure.
- Parameters
counts_superdict (Dict[Dict[np.ndarray]]) – The keys of the outer dict are replicate names, the keys of the inner dict are region names, the values are square symmetric arrays of counts for the specified replicate and region.
primermap (Dict[str, List[Dict[str, Any]]]) – The primermap describing the loci whose interaction counts are described in the counts_superdict.
tie ({'lowest', 'average'}) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.
regional (bool) – Pass True to quantile normalize regions separately. Pass False to quantile normalize all regions together.
condition_on (Optional[str]) – Pass a string key into the inner dicts of primermap to condition on that quantity. Current limitations: only works with regional=True and can only condition with exact equality (does not support conditioning on strata of a quantity). Pass None to not do conditional quantile normalization.
reference (Optional[str]) – Pass a string key into the counts_superdict to indicate a replicate that should be used as a reference distribution to quantile normalize to.
- Returns
The keys of the outer dict are replicate names, the keys of the inner dict are region names, the values are square symmetric arrays of the quantile normalized counts for the specified replicate and region.
- Return type
Dict[Dict[np.ndarray]]
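As an illustration of the data structure described above, the sketch below builds a toy counts superdict and arranges it with one column per replicate, which is the layout the qnorm functions expect. The replicate and region names, the helper flatten_replicate, and the choice to flatten each region's upper triangle are all assumptions made for this example; how lib5c itself flattens regions is not stated here.

```python
import numpy as np

# Hypothetical toy superdict: replicate name -> region name -> square
# symmetric array of counts.
counts_superdict = {
    "rep1": {"regionA": np.array([[1.0, 2.0], [2.0, 3.0]]),
             "regionB": np.array([[4.0]])},
    "rep2": {"regionA": np.array([[2.0, 4.0], [4.0, 8.0]]),
             "regionB": np.array([[6.0]])},
}


def flatten_replicate(regional_counts, region_order):
    """Concatenate the upper triangle of every region into one 1d vector.

    Hypothetical helper; the matrices are symmetric, so the upper triangle
    holds each interaction exactly once.
    """
    return np.concatenate([
        regional_counts[region][np.triu_indices(regional_counts[region].shape[0])]
        for region in region_order
    ])


rep_order = sorted(counts_superdict)
region_order = sorted(counts_superdict[rep_order[0]])

# One column per replicate, all regions stacked together: the arrangement
# that would correspond to normalizing all regions jointly.
matrix = np.column_stack([
    flatten_replicate(counts_superdict[rep], region_order)
    for rep in rep_order
])
```

Normalizing regions separately would instead build one such matrix per region, so that ranks are only compared within a region.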
-
lib5c.algorithms.qnorm.qnorm_fast(data, reference_index=None)
Quantile normalizes a data set.
Simpler, faster implementation compared to lib5c.algorithms.qnorm.qnorm(), but only supports tie='lowest' behavior and only takes an np.ndarray as input. This approach was developed and timed in this repository.
Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_fast_parallel().
- Parameters
data (np.ndarray) – Two dimensional, with the columns representing the replicates to be qnormed. Quantile normalization will be performed on the columns of data.
reference_index (int or str, optional) – Pass a column index to serve as a reference distribution. Pass None to use the average of all distributions as the target distribution.
- Returns
The quantile normalized data.
- Return type
np.ndarray
Notes
This function is nan-safe. As long as each column of the input data contains the same number of nan’s, nan’s will only get averaged with other nan’s, and they will get substituted back into their original positions. See the Examples section for an example of this.
Examples
>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm_fast
>>> qnorm_fast(np.array([[5, 4, 3],
...                      [2, 1, 4],
...                      [3, 4, 6],
...                      [4, 2, 8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
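The nan behavior described in the Notes can be reproduced with a short sort-and-lookup sketch. This is one plausible way to get 'lowest' tie handling and is not necessarily how qnorm_fast() is written; the name qnorm_fast_sketch is hypothetical. It relies on np.sort and np.searchsorted both ordering nan's after all other values.

```python
import numpy as np


def qnorm_fast_sketch(data, reference_index=None):
    """Illustrative sort-based quantile normalization with 'lowest' ties.

    Hypothetical helper, not part of lib5c.
    """
    data = np.asarray(data, dtype=float)

    # np.sort places nan's last, so they occupy the highest ranks of each
    # column. If every column has the same number of nan's, those ranks
    # contain only nan's and their mean stays nan.
    sorted_cols = np.sort(data, axis=0)
    if reference_index is None:
        target = sorted_cols.mean(axis=1)
    else:
        target = sorted_cols[:, reference_index]

    result = np.empty_like(data)
    for j in range(data.shape[1]):
        # side='left' gives each value the lowest rank among its ties;
        # nan's resolve to the first nan rank, whose target value is nan.
        ranks = np.searchsorted(sorted_cols[:, j], data[:, j], side="left")
        result[:, j] = target[ranks]
    return result


nan = np.nan
data = np.array([[5, nan, 3],
                 [2, 1, 4],
                 [nan, 4, 6],
                 [4, 2, nan]])
normed = qnorm_fast_sketch(data)
normed_to_ref = qnorm_fast_sketch(data, reference_index=1)
```

Because each nan maps to a rank whose target value is also nan, the nan's end up back in their original positions without any explicit masking step. If the columns held different numbers of nan's, some finite values would be averaged with nan's and the result would no longer be meaningful.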
-
lib5c.algorithms.qnorm.qnorm_fast_parallel(data, reference_index=None)
Quantile normalizes a data set.
Simpler, faster implementation compared to lib5c.algorithms.qnorm.qnorm(), but only supports tie='lowest' behavior and only takes an np.ndarray as input. This approach was developed and timed in this repository.
Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_fast_parallel().
- Parameters
data (np.ndarray) – Two dimensional, with the columns representing the replicates to be qnormed. Quantile normalization will be performed on the columns of data.
reference_index (int or str, optional) – Pass a column index to serve as a reference distribution. Pass None to use the average of all distributions as the target distribution.
- Returns
The quantile normalized data.
- Return type
np.ndarray
Notes
This function is nan-safe. As long as each column of the input data contains the same number of nan’s, nan’s will only get averaged with other nan’s, and they will get substituted back into their original positions. See the Examples section for an example of this.
Examples
>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm_fast
>>> qnorm_fast(np.array([[5, 4, 3],
...                      [2, 1, 4],
...                      [3, 4, 6],
...                      [4, 2, 8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
-
lib5c.algorithms.qnorm.qnorm_parallel(data, tie='lowest', reference_index=None)
Quantile normalizes a data set.
Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_parallel().
- Parameters
data (2d numeric structure, or dict of 1d numeric structure) – Anything that can be cast to array. Should be passed as row-major. Quantile normalization will be performed on the columns of data.
tie ({'lowest', 'average'}, optional) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.
reference_index (int or str, optional) – If data is a row-major array, pass a column index to serve as a reference distribution. If data is a dict, pass a key of that dict that should serve as the reference distribution. Pass None to use the average of all distributions as the target distribution.
- Returns
The quantile normalized data. If data was passed as a dict, then a dict with the same keys is returned.
- Return type
2d numpy array, or dict of 1d numpy array
Notes
This function is nan-safe. As long as each column of the input data contains the same number of nan’s, nan’s will only get averaged with other nan’s, and they will get substituted back into their original positions. See the Examples section for an example of this.
Examples
>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm
>>> qnorm(np.array([[5, 4, 3],
...                 [2, 1, 4],
...                 [3, 4, 6],
...                 [4, 2, 8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]})
>>> list(sorted(res.items()))
[('A', array([5.66666667, 2.        , 3.        , 4.66666667])),
 ('B', array([4.66666667, 2.        , 4.66666667, 3.        ])),
 ('C', array([2.        , 3.        , 4.66666667, 5.66666667]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([6., 3., 6., 4.])),
 ('C', array([3., 4., 6., 8.]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C', tie='average')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([7., 3., 7., 4.])),
 ('C', array([3., 4., 6., 8.]))]