lib5c.algorithms.qnorm module

Module for quantile normalization.

Original author of qnorm(), _rank_data(), _average_rows(), and _sub_in_normed_val(): Dan Gillis

Note: data matrices in these functions are typically expected to be arranged with each column representing one replicate, except for the functions _rank_data(), _average_rows(), and _sub_in_normed_val(), which expect them to be arranged with each row representing one replicate.

The exposed functions are qnorm(), qnorm_parallel(), qnorm_fast(), qnorm_fast_parallel(), and the convenience function qnorm_counts_superdict().

lib5c.algorithms.qnorm.qnorm(data, tie='lowest', reference_index=None)

Quantile normalizes a data set.

Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_parallel().

Parameters
  • data (2d numeric structure, or dict of 1d numeric structure) – Anything that can be cast to array. Should be passed as row-major. Quantile normalization will be performed on the columns of data.

  • tie ({'lowest', 'average'}, optional) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.

  • reference_index (int or str, optional) – If data is a row-major array, pass a column index to serve as a reference distribution. If data is a dict, pass a key of that dict that should serve as the reference distribution. Pass None to use the average of all distributions as the target distribution.

Returns

The quantile normalized data. If data was passed as a dict, then a dict with the same keys is returned.

Return type

2d numpy array, or dict of 1d numpy array

Notes

This function is NaN-safe. As long as each column of the input data contains the same number of NaNs, NaNs will only be averaged with other NaNs, and they will be substituted back into their original positions. See the Examples section for an example of this.
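The NaN behavior described in this note can be illustrated with a minimal NumPy sketch. It is not the library's implementation: qnorm_nan_sketch is a hypothetical name, it only covers the default tie='lowest' behavior with no reference column, and it assumes every column holds the same number of NaNs.

```python
import numpy as np


def qnorm_nan_sketch(data):
    """Hypothetical helper: NaN-preserving quantile normalization of columns."""
    data = np.asarray(data, dtype=float)
    # np.sort places NaNs at the bottom of each column, so with equal NaN
    # counts the NaN rows line up and are only averaged with other NaNs.
    sorted_cols = np.sort(data, axis=0)
    target = sorted_cols.mean(axis=1)
    out = np.full_like(data, np.nan)
    for j in range(data.shape[1]):
        valid = ~np.isnan(data[:, j])
        n_valid = int(valid.sum())
        # Rank of each finite value among the finite values of its column;
        # side="left" gives tied values the lowest tied rank.
        ranks = np.searchsorted(sorted_cols[:n_valid, j], data[valid, j], side="left")
        # Finite entries get the target value; NaNs stay where they were.
        out[valid, j] = target[ranks]
    return out


data = np.array([[5, np.nan, 3],
                 [2, 1, 4],
                 [np.nan, 4, 6],
                 [4, 2, np.nan]])
normed = qnorm_nan_sketch(data)
```

Here normed keeps a NaN at every position where data had one, and the finite entries of each column are replaced by the per-rank means of the finite values.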

Examples

>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm
>>> qnorm(np.array([[5,    4,    3],
...                 [2,    1,    4],
...                 [3,    4,    6],
...                 [4,    2,    8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]})
>>> list(sorted(res.items()))
[('A', array([5.66666667, 2.        , 3.        , 4.66666667])),
 ('B', array([4.66666667, 2.        , 4.66666667, 3.        ])),
 ('C', array([2.        , 3.        , 4.66666667, 5.66666667]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([6., 3., 6., 4.])),
 ('C', array([3., 4., 6., 8.]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C', tie='average')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([7., 3., 7., 4.])),
 ('C', array([3., 4., 6., 8.]))]
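The difference between tie='lowest' and tie='average' in the last two examples can be reproduced with a short NumPy sketch. The function name qnorm_ties_sketch is hypothetical, the code assumes array input without NaNs, and it is an illustration of the documented semantics rather than the code used by qnorm().

```python
import numpy as np


def qnorm_ties_sketch(data, tie="lowest", reference_index=None):
    """Hypothetical helper: column-wise quantile normalization, no NaN handling."""
    data = np.asarray(data, dtype=float)
    sorted_cols = np.sort(data, axis=0)
    # Target distribution: the sorted reference column, or the mean of each
    # rank across all sorted columns when no reference is given.
    if reference_index is None:
        target = sorted_cols.mean(axis=1)
    else:
        target = sorted_cols[:, reference_index]
    # Prefix sums allow averaging the target over any run of tied ranks.
    prefix = np.concatenate(([0.0], np.cumsum(target)))
    out = np.empty_like(data)
    for j in range(data.shape[1]):
        # Tied values occupy ranks lo .. hi - 1 in the sorted column.
        lo = np.searchsorted(sorted_cols[:, j], data[:, j], side="left")
        hi = np.searchsorted(sorted_cols[:, j], data[:, j], side="right")
        if tie == "lowest":
            out[:, j] = target[lo]
        elif tie == "average":
            out[:, j] = (prefix[hi] - prefix[lo]) / (hi - lo)
        else:
            raise ValueError("tie must be 'lowest' or 'average'")
    return out


data = np.array([[5, 4, 3],
                 [2, 1, 4],
                 [3, 4, 6],
                 [4, 2, 8]])
lowest = qnorm_ties_sketch(data, tie="lowest", reference_index=2)
average = qnorm_ties_sketch(data, tie="average", reference_index=2)
```

In the second column the two 4s are tied for ranks 2 and 3 of the reference [3, 4, 6, 8]. With tie='lowest' both become 6; with tie='average' both become (6 + 8) / 2 = 7.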
lib5c.algorithms.qnorm.qnorm_counts_superdict(counts_superdict, primermap, tie='lowest', regional=False, condition_on=None, reference=None)

Convenience function for quantile normalizing a counts superdict data structure.

Parameters
  • counts_superdict (Dict[Dict[np.ndarray]]) – The keys of the outer dict are replicate names, the keys of the inner dict are region names, the values are square symmetric arrays of counts for the specified replicate and region.

  • primermap (Dict[str, List[Dict[str, Any]]]) – The primermap describing the loci whose interaction counts are described in the counts_superdict.

  • tie ({'lowest', 'average'}) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.

  • regional (bool) – Pass True to quantile normalize regions separately. Pass False to quantile normalize all regions together.

  • condition_on (Optional[str]) – Pass a string key into the inner dicts of primermap to condition on that quantity. Current limitations: only works with regional=True and can only condition with exact equality (does not support conditioning on strata of a quantity). Pass None to not do conditional quantile normalization.

  • reference (Optional[str]) – Pass a string key into the counts_superdict to indicate a replicate that should be used as a reference distribution to quantile normalize to.

Returns

The keys of the outer dict are replicate names, the keys of the inner dict are region names, the values are square symmetric arrays of the quantile normalized counts for the specified replicate and region.

Return type

Dict[Dict[np.ndarray]]
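To make the shape of a counts superdict concrete, the sketch below shows one way the regional=False case could work: flatten the upper triangle of every region, stack the replicates as columns, quantile normalize the columns, and rebuild the symmetric matrices. This is an assumption about the general approach, not the library's code. All function names, the replicate names and the region names are hypothetical, and primermap, tie, condition_on and reference are left out.

```python
import numpy as np


def flatten_upper(matrix):
    """Upper-triangular entries (diagonal included) as a 1d array."""
    rows, cols = np.triu_indices(matrix.shape[0])
    return matrix[rows, cols]


def unflatten_upper(values, size):
    """Rebuild a square symmetric matrix from its upper-triangular entries."""
    rows, cols = np.triu_indices(size)
    out = np.zeros((size, size))
    out[rows, cols] = values
    out[cols, rows] = values
    return out


def qnorm_columns(data):
    """Column-wise quantile normalization; ties take the lowest rank; no NaNs."""
    sorted_cols = np.sort(data, axis=0)
    target = sorted_cols.mean(axis=1)
    out = np.empty_like(data)
    for j in range(data.shape[1]):
        ranks = np.searchsorted(sorted_cols[:, j], data[:, j], side="left")
        out[:, j] = target[ranks]
    return out


def qnorm_superdict_sketch(counts_superdict):
    """Normalize all regions together, one column per replicate."""
    reps = sorted(counts_superdict)
    regions = sorted(counts_superdict[reps[0]])
    sizes = {r: np.asarray(counts_superdict[reps[0]][r]).shape[0] for r in regions}
    data = np.column_stack([
        np.concatenate([
            flatten_upper(np.asarray(counts_superdict[rep][r], dtype=float))
            for r in regions
        ])
        for rep in reps
    ])
    normed = qnorm_columns(data)
    result = {}
    for j, rep in enumerate(reps):
        result[rep] = {}
        start = 0
        for r in regions:
            n = sizes[r] * (sizes[r] + 1) // 2
            result[rep][r] = unflatten_upper(normed[start:start + n, j], sizes[r])
            start += n
    return result


counts_superdict = {
    "rep1": {"region_a": np.array([[1, 2], [2, 3]]), "region_b": np.array([[10]])},
    "rep2": {"region_a": np.array([[2, 4], [4, 6]]), "region_b": np.array([[20]])},
}
normed_superdict = qnorm_superdict_sketch(counts_superdict)
```

The result has the same nesting as the input: replicate name, then region name, then a square symmetric array. Because both replicates here have the same rank order, they end up with identical normalized matrices.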

lib5c.algorithms.qnorm.qnorm_fast(data, reference_index=None)

Quantile normalizes a data set.

Simpler, faster implementation compared to lib5c.algorithms.qnorm.qnorm(), but only supports tie='lowest' behavior and only takes an np.ndarray as input. This approach was developed and timed in this repository.

Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_fast_parallel().
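Restricting the problem to tie='lowest' and plain arrays makes a loop-free formulation possible. The sketch below is one such formulation and is only a guess at the kind of simplification involved; it is not taken from qnorm_fast() itself. The name qnorm_lowest_vectorized is hypothetical, and the code assumes the input contains no NaNs.

```python
import numpy as np


def qnorm_lowest_vectorized(data, reference_index=None):
    """Hypothetical helper: vectorized quantile normalization, ties -> lowest rank."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Sort every column and remember where each sorted entry came from.
    order = np.argsort(data, axis=0, kind="stable")
    sorted_cols = np.take_along_axis(data, order, axis=0)
    if reference_index is None:
        target = sorted_cols.mean(axis=1)
    else:
        target = sorted_cols[:, reference_index]
    # Mark the first entry of every run of equal values in each sorted column.
    is_new = np.ones(sorted_cols.shape, dtype=bool)
    is_new[1:] = sorted_cols[1:] != sorted_cols[:-1]
    # Carry the index of the run start forward, so tied entries share
    # the lowest rank of their run.
    run_start = np.where(is_new, np.arange(n)[:, None], 0)
    lowest_rank = np.maximum.accumulate(run_start, axis=0)
    # Scatter the target values back to the original row positions.
    out = np.empty_like(data)
    np.put_along_axis(out, order, target[lowest_rank], axis=0)
    return out


data = np.array([[5, 4, 3],
                 [2, 1, 4],
                 [3, 4, 6],
                 [4, 2, 8]])
normed_fast = qnorm_lowest_vectorized(data)
```

For the second column, [4, 1, 4, 2], the two 4s form one run in the sorted column, so both receive the target value at rank 2.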

Parameters
  • data (np.ndarray) – Two dimensional, with the columns representing the replicates to be qnormed. Quantile normalization will be performed on the columns of data.

  • reference_index (int or str, optional) – Pass a column index to serve as a reference distribution. Pass None to use the average of all distributions as the target distribution.

Returns

The quantile normalized data.

Return type

np.ndarray

Notes

This function is NaN-safe. As long as each column of the input data contains the same number of NaNs, NaNs will only be averaged with other NaNs, and they will be substituted back into their original positions. See the Examples section for an example of this.

Examples

>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm_fast
>>> qnorm_fast(np.array([[5,    4,    3],
...                      [2,    1,    4],
...                      [3,    4,    6],
...                      [4,    2,    8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
lib5c.algorithms.qnorm.qnorm_fast_parallel(data, reference_index=None)

Quantile normalizes a data set.

Simpler, faster implementation compared to lib5c.algorithms.qnorm.qnorm(), but only supports tie='lowest' behavior and only takes an np.ndarray as input. This approach was developed and timed in this repository.

Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_fast_parallel().

Parameters
  • data (np.ndarray) – Two dimensional, with the columns representing the replicates to be qnormed. Quantile normalization will be performed on the columns of data.

  • reference_index (int or str, optional) – Pass a column index to serve as a reference distribution. Pass None to use the average of all distributions as the target distribution.

Returns

The quantile normalized data.

Return type

np.ndarray

Notes

This function is NaN-safe. As long as each column of the input data contains the same number of NaNs, NaNs will only be averaged with other NaNs, and they will be substituted back into their original positions. See the Examples section for an example of this.

Examples

>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm_fast
>>> qnorm_fast(np.array([[5,    4,    3],
...                      [2,    1,    4],
...                      [3,    4,    6],
...                      [4,    2,    8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm_fast(np.array([[     5, np.nan,      3],
...                      [     2,      1,      4],
...                      [np.nan,      4,      6],
...                      [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
lib5c.algorithms.qnorm.qnorm_parallel(data, tie='lowest', reference_index=None)

Quantile normalizes a data set.

Parallelizable if data is a 2d np.ndarray; see lib5c.algorithms.qnorm.qnorm_parallel().

Parameters
  • data (2d numeric structure, or dict of 1d numeric structure) – Anything that can be cast to array. Should be passed as row-major. Quantile normalization will be performed on the columns of data.

  • tie ({'lowest', 'average'}, optional) – Pass 'lowest' to set all tied entries to the value of the lowest rank. Pass 'average' to set all tied entries to the average value across the tied ranks.

  • reference_index (int or str, optional) – If data is a row-major array, pass a column index to serve as a reference distribution. If data is a dict, pass a key of that dict that should serve as the reference distribution. Pass None to use the average of all distributions as the target distribution.

Returns

The quantile normalized data. If data was passed as a dict, then a dict with the same keys is returned.

Return type

2d numpy array, or dict of 1d numpy array

Notes

This function is NaN-safe. As long as each column of the input data contains the same number of NaNs, NaNs will only be averaged with other NaNs, and they will be substituted back into their original positions. See the Examples section for an example of this.

Examples

>>> import numpy as np
>>> from lib5c.algorithms.qnorm import qnorm
>>> qnorm(np.array([[5,    4,    3],
...                 [2,    1,    4],
...                 [3,    4,    6],
...                 [4,    2,    8]]))
...
array([[5.66666667, 4.66666667, 2.        ],
       [2.        , 2.        , 3.        ],
       [3.        , 4.66666667, 4.66666667],
       [4.66666667, 3.        , 5.66666667]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]))
...
array([[5.        ,        nan, 2.        ],
       [2.        , 2.        , 3.33333333],
       [       nan, 5.        , 5.        ],
       [3.33333333, 3.33333333,        nan]])
>>> qnorm(np.array([[     5, np.nan,      3],
...                 [     2,      1,      4],
...                 [np.nan,      4,      6],
...                 [     4,      2, np.nan]]), reference_index=1)
...
array([[ 4., nan,  1.],
       [ 1.,  1.,  2.],
       [nan,  4.,  4.],
       [ 2.,  2., nan]])
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]})
>>> list(sorted(res.items()))
[('A', array([5.66666667, 2.        , 3.        , 4.66666667])),
 ('B', array([4.66666667, 2.        , 4.66666667, 3.        ])),
 ('C', array([2.        , 3.        , 4.66666667, 5.66666667]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([6., 3., 6., 4.])),
 ('C', array([3., 4., 6., 8.]))]
>>> res = qnorm({'A': [5, 2, 3, 4],
...              'B': [4, 1, 4, 2],
...              'C': [3, 4, 6, 8]}, reference_index='C', tie='average')
>>> list(sorted(res.items()))
[('A', array([8., 3., 4., 6.])),
 ('B', array([7., 3., 7., 4.])),
 ('C', array([3., 4., 6., 8.]))]