lib5c.parsers.counts module

Module for parsing .counts files.

lib5c.parsers.counts.load_cis_trans_counts(countsfile, primermap, name_parser=default_primer_parser, force_nan='always', region_order=None)

Loads the counts values from a primer-primer pair .counts file into a single square, symmetric array, and returns it.

Parameters
  • countsfile (str) – Path to the .counts file to load counts from.

  • primermap (Dict[str, List[Dict[str, Any]]]) –

    The keys of the outer dict are region names. The values are lists, where the ith entry represents the ith primer in that region. Primers are represented as dicts with the following structure:

    {
        'chrom' : str,
        'start' : int,
        'end'   : int
    }
    

    See lib5c.parsers.primers.get_primermap().

  • name_parser (Optional[Callable[[str], Dict[str, Any]]]) –

    Function that takes a primer name from the countsfile and returns a dict of key-value pairs with the information required to identify the primer. At a minimum, this dict must have the following structure:

    {
        'region': str
    }
    

    This information is necessary to deduce what region a given primer in the countsfile belongs to.

  • force_nan (Optional[str]) – If ‘always’ is passed and the primermap contains strand information, impossible ligations will always be set to nan. If ‘implicit’ is passed, impossible ligations will be set to nan when implied by the strand information in the primermap, but not when the ligations are explicitly present in the countsfile. If ‘never’ is passed, strand information will be ignored and impossible ligations will not be identified.

  • region_order (Optional[List[str]]) – If passed, this list determines the order in which the regions will be concatenated. If not passed, the regions will be concatenated in order of genomic coordinate.

Returns

The square, symmetric array of counts.

Return type

np.ndarray
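
The way per-region primer lists map onto one square matrix can be sketched as follows. This is a minimal illustration, not lib5c's implementation: the helper region_offsets and the toy primermap are hypothetical, and sorting regions by the chrom and start of their first primer is an assumption about what "order of genomic coordinate" means.

```python
import numpy as np


def region_offsets(primermap, region_order=None):
    """Hypothetical helper: map each region to its first row/column index."""
    if region_order is None:
        # Assumption: order regions by (chrom, start) of their first primer.
        region_order = sorted(
            primermap,
            key=lambda r: (primermap[r][0]['chrom'], primermap[r][0]['start']),
        )
    offsets, total = {}, 0
    for region in region_order:
        offsets[region] = total
        total += len(primermap[region])
    return offsets, total


# Toy primermap following the structure documented above.
primermap = {
    'Sox2': [{'chrom': 'chr3', 'start': 100, 'end': 200},
             {'chrom': 'chr3', 'start': 300, 'end': 400}],
    'Klf4': [{'chrom': 'chr4', 'start': 50, 'end': 150}],
}

offsets, size = region_offsets(primermap)
matrix = np.zeros((size, size))

# Record a trans contact between Sox2 primer 1 and Klf4 primer 0,
# writing both (i, j) and (j, i) so the matrix stays symmetric.
i = offsets['Sox2'] + 1
j = offsets['Klf4'] + 0
matrix[i, j] = matrix[j, i] = 7.0

# An explicit region_order overrides the coordinate-based ordering.
custom_offsets, _ = region_offsets(primermap, region_order=['Klf4', 'Sox2'])
```

Cis blocks sit on the diagonal of the combined matrix and trans blocks off the diagonal, so changing region_order only permutes the blocks.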

lib5c.parsers.counts.load_counts(countsfile, primermap, force_nan='always', dtype=float)

Loads the counts values from a primer-primer pair .counts file into square, symmetric arrays and returns them.

Parameters
  • countsfile (str) – Path to the .counts file to load counts from.

  • primermap (Dict[str, List[Dict[str, Any]]]) –

    The keys of the outer dict are region names. The values are lists, where the ith entry represents the ith primer in that region. Primers are represented as dicts with the following structure:

    {
        'chrom' : str,
        'start' : int,
        'end'   : int
    }
    

    See lib5c.parsers.primers.load_primermap().

  • force_nan (Optional[str]) – If ‘always’ is passed and the primermap contains strand information, impossible ligations will always be set to nan. If ‘implicit’ is passed, impossible ligations will be set to nan when implied by the strand information in the primermap, but not when the ligations are explicitly present in the countsfile. If ‘never’ is passed, strand information will be ignored and impossible ligations will not be identified.

  • dtype ({int, float}) – Sets the dtype for the matrix. If the value column contains strings, this will be ignored and the dtype will be set to ‘U25’.

Returns

The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

Return type

Dict[str, np.ndarray]
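
The three force_nan modes described above can be sketched for a single region as follows. This is an illustration under stated assumptions, not the library's code: the helper apply_force_nan is hypothetical, and treating same-strand primer pairs as the "impossible ligations" is an assumption.

```python
import numpy as np


def apply_force_nan(counts, strands, explicit_pairs, force_nan='always'):
    """Hypothetical helper; modifies one region's square array in-place.

    strands        -- list of '+' or '-', one entry per primer
    explicit_pairs -- set of (i, j) index pairs listed in the countsfile
    """
    if force_nan not in ('always', 'implicit', 'never'):
        raise ValueError(f'unknown force_nan mode: {force_nan!r}')
    if force_nan == 'never':
        return  # strand information is ignored entirely
    n = len(strands)
    for i in range(n):
        for j in range(n):
            # Assumption: primers on the same strand cannot ligate.
            if strands[i] != strands[j]:
                continue
            listed = (i, j) in explicit_pairs or (j, i) in explicit_pairs
            # 'always' wipes every impossible pair; 'implicit' keeps the
            # ones that the countsfile listed explicitly.
            if force_nan == 'always' or not listed:
                counts[i, j] = np.nan


strands = ['+', '-', '+']
explicit = {(0, 2)}  # a same-strand pair that the file lists anyway

always = np.ones((3, 3))
apply_force_nan(always, strands, explicit, 'always')

implicit = np.ones((3, 3))
apply_force_nan(implicit, strands, explicit, 'implicit')

never = np.ones((3, 3))
apply_force_nan(never, strands, explicit, 'never')
```

In this toy case the (0, 2) entry becomes nan under ‘always’ but keeps its value under ‘implicit’, while ‘never’ leaves the whole array untouched.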

lib5c.parsers.counts.load_counts_by_name(countsfile, name_list=None, primermap=None, locusmap=None, force_nan='always', region_order=None)

Loads the counts values from any .counts file into a single square, symmetric array, and returns it.

Parameters
  • countsfile (str) – Path to the .counts file to load counts from.

  • name_list (Optional[List[str]]) – Ordered list of locus names as strings.

  • primermap (Optional[Dict[str, List[Dict[str, Any]]]]) –

    The keys of the outer dict are region names. The values are lists, where the ith entry represents the ith primer in that region. Primers are represented as dicts with the following structure:

    {
        'chrom' : str,
        'start' : int,
        'end'   : int
    }
    

    See lib5c.parsers.primers.get_primermap().

  • locusmap (Optional[LocusMap]) – Locus information as a LocusMap object.

  • force_nan (Optional[str]) – If ‘always’ is passed and the primermap contains strand information, impossible ligations will always be set to nan. If ‘implicit’ is passed, impossible ligations will be set to nan when implied by the strand information in the primermap, but not when the ligations are explicitly present in the countsfile. If ‘never’ is passed, strand information will be ignored and impossible ligations will not be identified.

  • region_order (Optional[List[str]]) – If passed, this list determines the order in which the regions will be concatenated. If not passed, the regions will be concatenated in order of genomic coordinate. If name_list is passed, this kwarg is ignored.

Returns

The square, symmetric array of counts.

Return type

np.ndarray
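
As a rough sketch of how an ordered name_list can drive the matrix layout, the example below fills a symmetric array from name-name-value rows. The helper counts_from_rows, the locus names, and the three-column tab-separated row layout are all assumptions made for illustration, not a description of the library's parser.

```python
import numpy as np


def counts_from_rows(rows, name_list):
    """Hypothetical helper: place each (name, name, value) row symmetrically."""
    # The position of a name in name_list is its row/column index.
    index = {name: i for i, name in enumerate(name_list)}
    counts = np.zeros((len(name_list), len(name_list)))
    for left, right, value in rows:
        i, j = index[left], index[right]
        counts[i, j] = counts[j, i] = float(value)
    return counts


# Assumed file content: one locus pair and its count per line.
text = "locus_a\tlocus_b\t4\nlocus_b\tlocus_c\t2.5\n"
rows = [line.split('\t') for line in text.splitlines()]
counts = counts_from_rows(rows, ['locus_a', 'locus_b', 'locus_c'])
```

Because the index comes directly from name_list, no region ordering is needed in this mode, which is consistent with region_order being ignored when name_list is passed.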

lib5c.parsers.counts.load_counts_legacy(countsfile, name_parser=default_bin_parser, pixelmap=None)

Loads the counts values from a binned .counts file into square, symmetric arrays and returns them.

Parameters
  • countsfile (str) – Path to the .counts file to load counts from.

  • name_parser (Optional[Callable[[str], Dict[str, Any]]]) –

    Function that takes a bin name from the bin name column of the countsfile and returns a dict of key-value pairs with the information required to identify the bin. At a minimum, this dict must contain the region key; with the recommended index key, it has the following structure:

    {
        'region': str,
        'index': int
    }
    

    This information is necessary to deduce what region a given bin in the countsfile belongs to. The index key is optional, but recommended. If present, its value should be the zero-based index of the bin within the region. If not present, the pixelmap will be searched to identify the bin index.

  • pixelmap (Optional[Dict[str, List[Dict[str, Any]]]]) –

    The keys of the outer dict are region names. The values are lists, where the ith entry represents the ith bin in that region. Bins are represented as dicts with the following structure:

    {
        'chrom': str,
        'start': int,
        'end'  : int,
        'name' : str
    }
    

    See lib5c.parsers.get_pixelmap(). The pixelmap is used to identify the index of a bin within a region. If name_parser returns an index key, you can pass None here since the index will be determined from the bin name.

Returns

The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

Return type

Dict[str, np.ndarray]

Notes

This function casts the counts values in the countsfile to floats, so it will work even if the countsfile actually contains pseudocounts or other non-integer values.
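
A custom name_parser and the pixelmap fallback lookup might look like the sketch below. The bin-name format `<region>_BIN_<index>` is assumed purely for illustration, and both example_bin_parser and find_bin_index are hypothetical helpers rather than lib5c functions.

```python
import re


def example_bin_parser(name):
    """Hypothetical parser for bin names shaped like 'Sox2_BIN_003'."""
    match = re.match(r'^(?P<region>.+)_BIN_(?P<index>\d+)$', name)
    if match is None:
        raise ValueError(f'unrecognized bin name: {name!r}')
    return {
        'region': match.group('region'),
        'index': int(match.group('index')),  # zero-based index in the region
    }


def find_bin_index(name, region, pixelmap):
    """Fallback when the parser gives no 'index': search the pixelmap by name."""
    for i, bin_info in enumerate(pixelmap[region]):
        if bin_info['name'] == name:
            return i
    raise KeyError(name)


# Toy pixelmap following the structure documented above.
pixelmap = {
    'Sox2': [
        {'chrom': 'chr3', 'start': 0, 'end': 4000, 'name': 'Sox2_BIN_000'},
        {'chrom': 'chr3', 'start': 4000, 'end': 8000, 'name': 'Sox2_BIN_001'},
    ],
}

parsed = example_bin_parser('Sox2_BIN_001')
looked_up = find_bin_index('Sox2_BIN_001', parsed['region'], pixelmap)
```

When the parser returns the index directly, the linear search through the pixelmap is unnecessary, which is why returning the index key is recommended.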

lib5c.parsers.counts.main()

lib5c.parsers.counts.set_cis_trans_nans(counts, aggregated_primermap)

Sets nan values in a complete cis and trans counts matrix for ligations considered impossible according to a primermap with strand information.

Parameters
  • counts (np.ndarray) – Square, symmetric array storing the complete cis and trans counts, with the regions arranged as implied by the aggregated_primermap.

  • aggregated_primermap (List[Dict[str, Any]]) –

    The dicts in the list represent primers, equal in number and order to the side length of the counts matrix, and have the following structure:

    {
        'chrom'  : str,
        'start'  : int,
        'end'    : int,
        'strand' : '+' or '-'
    }
    

    See lib5c.parsers.primers.get_primermap() and lib5c.util.primers.aggregate_primermap().

Notes

If the aggregated primermap passed has no strand information, this function will do nothing.

This function operates in-place.
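
The in-place behavior and the no-op case can be sketched with a NumPy boolean mask. This is not the library's implementation: the helper mask_same_strand is hypothetical, and the rule that same-strand primer pairs are the impossible ligations is an assumption.

```python
import numpy as np


def mask_same_strand(counts, aggregated_primermap):
    """Hypothetical helper: set same-strand entries to nan, in-place."""
    if not all('strand' in primer for primer in aggregated_primermap):
        return  # no strand information: leave counts untouched
    strands = np.array([primer['strand'] for primer in aggregated_primermap])
    # Broadcasting a column against a row compares every primer pair.
    same = strands[:, None] == strands[None, :]
    counts[same] = np.nan


primers = [
    {'chrom': 'chr3', 'start': 100, 'end': 200, 'strand': '+'},
    {'chrom': 'chr3', 'start': 300, 'end': 400, 'strand': '-'},
    {'chrom': 'chr4', 'start': 50, 'end': 150, 'strand': '+'},
]

counts = np.ones((3, 3))
mask_same_strand(counts, primers)  # modifies counts; returns None

# Without strand keys the call does nothing.
unstranded = [{k: v for k, v in p.items() if k != 'strand'} for p in primers]
untouched = np.ones((3, 3))
mask_same_strand(untouched, unstranded)
```

The array must have a floating-point dtype for nan to be representable, so an integer counts matrix would need to be cast before masking.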

lib5c.parsers.counts.set_nans(counts, primermap)

Sets nan values in a counts dict for ligations considered impossible according to a primermap with strand information.

Parameters
  • counts (Dict[str, np.ndarray]) – The keys are the region names. The values are the arrays of counts values for that region. These arrays are square and symmetric.

  • primermap (Dict[str, List[Dict[str, Any]]]) –

    The keys of the outer dict are region names. The values are lists, where the ith entry represents the ith primer in that region. Primers are represented as dicts with the following structure:

    {
        'chrom'  : str,
        'start'  : int,
        'end'    : int,
        'strand' : '+' or '-'
    }
    

    See lib5c.parsers.primers.get_primermap().

Notes

If the primermap passed has no strand information, this function will do nothing.

This function operates in-place.