lib5c.structures.dataset module

Module for the Dataset class, which provides a wrapper around a pandas DataFrame allowing for representation of 5C data across replicates and stages of data processing both on disk and in memory.

class lib5c.structures.dataset.Dataset(df, pixelmap=None, repinfo=None)[source]

Bases: object

Wrapper around a pandas DataFrame.

df

Contains the core data in the Dataset. Columns should be either non-hierarchical or hierarchical, with the lower level of the hierarchy matching the replicate names. The row index of this DataFrame must be ‘<upstream_fragment_name>_<downstream_fragment_name>’.

Type

pd.DataFrame

pixelmap

A pixelmap to provide information about the fragments.

Type

pixelmap, optional

repinfo

A DataFrame describing the replicates. Its row index should be the replicate names; its columns can provide arbitrary information about each replicate, such as its condition.

Type

pd.DataFrame, optional
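As a rough illustration of the layout described for df and repinfo, the following sketch builds both with plain pandas. The fragment names, replicate names, and the condition column are hypothetical, and the snippet does not call lib5c itself.

```python
import numpy as np
import pandas as pd

# Row index entries follow '<upstream_fragment_name>_<downstream_fragment_name>'.
# The fragment names used here are hypothetical.
index = ["fragA_fragA", "fragA_fragB", "fragB_fragB"]

# Hierarchical columns: the top level names the kind of data, the lower
# level holds the replicate names ('rep1' and 'rep2' are hypothetical).
columns = pd.MultiIndex.from_product([["counts"], ["rep1", "rep2"]])

df = pd.DataFrame(
    np.array([[10.0, 12.0], [3.0, 4.0], [8.0, 9.0]]),
    index=index,
    columns=columns,
)

# A repinfo-style DataFrame: one row per replicate, arbitrary columns.
repinfo = pd.DataFrame({"condition": ["ES", "ES"]}, index=["rep1", "rep2"])

# Selecting the top level yields one column per replicate.
replicate_names = df["counts"].columns.tolist()
```

In this sketch the repinfo row index lines up with the lower level of the column hierarchy, which is the relationship the attribute descriptions above call for.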

add_column_from_counts(counts, name)[source]

Adds a new column to this Dataset’s df.

The counts dict passed is assumed to match the pixelmap bound on this Dataset. If no pixelmap is bound, a ValueError will be raised.

Parameters
  • counts (dict of np.ndarray) – Should contain the values that will make up the new column.

  • name (str) – The name of the new column.

add_columns_from_counts_superdict(counts_superdict, name, rep_order=None)[source]

Adds a new group of columns to the Dataset from a counts superdict structure.

Parameters
  • counts_superdict (dict of dict of np.ndarray) – The outer keys are replicate names as strings, the inner keys are region names as strings, and the values are square, symmetric arrays of values for each replicate and region.

  • name (str) – The name to use for the new group of columns.

  • rep_order (list of str, optional) – Pass a list of replicate names to load the listed replicates in a specific order. Pass None to use the arbitrary order of the outer keys of counts_superdict.
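To make the counts superdict structure concrete, here is a minimal sketch built with NumPy alone. The replicate names, region names, and values are hypothetical; the loop only checks the shape requirements stated above and is not part of lib5c.

```python
import numpy as np

# Outer keys: replicate names; inner keys: region names (all hypothetical).
region_a = np.array([[5.0, 2.0], [2.0, 7.0]])
region_b = np.array([[1.0, 0.0, 3.0], [0.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
counts_superdict = {
    "rep2": {"regionA": region_a * 2, "regionB": region_b * 2},
    "rep1": {"regionA": region_a, "regionB": region_b},
}

# Every value should be a square, symmetric matrix.
for rep, counts in counts_superdict.items():
    for region, matrix in counts.items():
        assert matrix.shape[0] == matrix.shape[1]
        assert np.allclose(matrix, matrix.T)

# An explicit replicate order, of the kind one might pass as rep_order.
rep_order = sorted(counts_superdict)
ordered_reps = [(rep, counts_superdict[rep]) for rep in rep_order]
```

Passing an explicit order like this avoids depending on the key order of the outer dict.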

apply_across_replicates(fn, inputs, outputs, **kwargs)[source]

Applies a matrix-to-matrix function over the Dataset.

This is useful for functions that don’t operate independently on each replicate of the Dataset.

The main advantage of this function is that it handles the unboxing of the replicates after a matrix-to-matrix function is applied. If you are looking to apply a matrix-to-vector function over the Dataset, you can do it with a one-liner, assigning the vector result(s) to the new column(s) immediately.

Parameters
  • fn (Callable) – The function to apply. It should take in np.ndarrays as its inputs and return np.ndarrays with the same size and shape. If some inputs are specified as individual columns, they will be passed to fn as np.ndarrays shaped as column vectors.

  • inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be passed to fn as column vectors.

  • outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
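The boxing and unboxing that this method is described as handling can be sketched with plain pandas. This is an emulation under stated assumptions, not lib5c's implementation: center_across_replicates is a hypothetical matrix-to-matrix function, and the column and replicate names are made up.

```python
import numpy as np
import pandas as pd

reps = ["rep1", "rep2"]
df = pd.DataFrame(
    [[10.0, 20.0], [4.0, 8.0], [6.0, 6.0]],
    index=["fragA_fragA", "fragA_fragB", "fragB_fragB"],
    columns=pd.MultiIndex.from_product([["counts"], reps]),
)


def center_across_replicates(matrix):
    """Matrix-to-matrix: subtract each row's mean across replicates."""
    return matrix - matrix.mean(axis=1, keepdims=True)


# Box the replicates into one matrix, apply the function, then unbox the
# result into one new hierarchical column per replicate.
result = center_across_replicates(df["counts"].to_numpy())
for i, rep in enumerate(reps):
    df[("centered", rep)] = result[:, i]

# Matrix-to-vector case: a one-liner, since no unboxing is needed.
df[("mean", "")] = df["counts"].mean(axis=1)
```

The last line shows why the matrix-to-vector case needs no helper: the vector result can be assigned to a new column directly.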

apply_across_replicates_per_region(fn, inputs, outputs, initial_values=0.0, **kwargs)[source]

Applies a matrix-to-matrix function over the Dataset in a per-region manner.

This is useful for functions that don’t operate independently on each replicate of the Dataset, but which operate independently on each region of the Dataset.

The main advantage of this function is that it handles the unboxing of the replicates after a matrix-to-matrix function is applied. If you are looking to apply a matrix-to-vector function over the Dataset in a per-region manner, you can do it with apply_per_region(), feeding a hierarchical column as an input.

Parameters
  • fn (Callable) – The function to apply. It should take in np.ndarrays as its inputs and return np.ndarrays with the same size and shape. If some inputs are specified as individual columns, they will be passed to fn as np.ndarrays shaped as column vectors.

  • inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be passed to fn as column vectors.

  • outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.

  • initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.

apply_per_region(fn, inputs, outputs, initial_values=0.0, **kwargs)[source]

Apply a function over the Dataset on a per-region basis.

Parameters
  • fn (Callable) – The function to apply. It should take in pd.Series or pd.DataFrame objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.

  • inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. Omit the second level of a hierarchical column to pass all replicates to fn as a single pd.DataFrame. A single string or tuple will be wrapped in a list automatically.

  • outputs (list of (str or tuple of str)) – Names of output columns to be added to the Dataset. Use a tuple of strings to create hierarchical columns.

  • initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.
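A per-region application can be pictured as follows. This is a pandas-only sketch, not lib5c's implementation: the region_of_row Series is a hypothetical stand-in for the region information a bound pixelmap would provide, and fraction_of_region_total is a made-up function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"counts": [4.0, 6.0, 10.0, 30.0]},
    index=["a1_a1", "a1_a2", "b1_b1", "b1_b2"],
)

# Hypothetical region label for each row; in a Dataset this information
# would be derived from the pixelmap rather than supplied by hand.
region_of_row = pd.Series(
    ["regionA", "regionA", "regionB", "regionB"], index=df.index
)


def fraction_of_region_total(counts):
    """Takes a pd.Series for one region and returns a 1D vector."""
    return (counts / counts.sum()).to_numpy()


# Initialize the output column first (this fixes its dtype), then fill it
# in one region at a time.
df["fraction"] = 0.0
for region, rows in df.groupby(region_of_row).groups.items():
    df.loc[rows, "fraction"] = fraction_of_region_total(df.loc[rows, "counts"])
```

Initializing the column before the loop mirrors the role the initial_values parameter is described as playing.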

apply_per_replicate(fn, inputs, outputs, **kwargs)[source]

Applies a function over the Dataset on a per-replicate basis.

Parameters
  • fn (Callable) – The function to apply. It should take in pd.Series objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.

  • inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be broadcast across all replicates.

  • outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
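The broadcasting behaviour described for inputs can be emulated with plain pandas. The sketch below is an assumption-laden illustration rather than lib5c's code: scale_by_distance, the distance column, and all names are hypothetical.

```python
import pandas as pd

reps = ["rep1", "rep2"]
index = ["fragA_fragA", "fragA_fragB"]
counts = pd.DataFrame([[10.0, 20.0], [4.0, 8.0]], index=index, columns=reps)

# Build a frame with a hierarchical 'counts' column and a
# non-hierarchical 'distance' column (empty second level).
df = pd.concat({"counts": counts}, axis=1)
df[("distance", "")] = pd.Series([1.0, 2.0], index=index)


def scale_by_distance(counts_col, distance_col):
    """Takes one pd.Series per input and returns a 1D vector."""
    return (counts_col * distance_col).to_numpy()


# The hierarchical input determines the replicates to loop over; the
# non-hierarchical input is reused (broadcast) for every replicate.
for rep in reps:
    df[("scaled", rep)] = scale_by_distance(
        df[("counts", rep)], df[("distance", "")]
    )
```

Each pass through the loop produces one second-level entry under the new top-level output column.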

apply_per_replicate_per_region(fn, inputs, outputs, initial_values=0.0, **kwargs)[source]

Applies a function over the Dataset on a per-replicate, per-region basis.

Parameters
  • fn (Callable) – The function to apply. It should take in pd.Series objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.

  • inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be broadcast across all replicates.

  • outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.

  • initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.

counts(name='counts', rep=None, region=None, fill_value=None, dtype=None)[source]

Converts this Dataset to a regional_counts matrix, a counts dict, a counts_superdict, or a regional_counts_superdict.

Parameters
  • name (str) – The top-level column name to extract.

  • rep (str, optional) – If name corresponds to a hierarchical column, pass a rep name to extract only one rep (return type will be a counts dict). Pass None to return a counts_superdict with all reps. If name corresponds to a normal column, this kwarg will be ignored.

  • region (str, optional) – Pass a region name as a string to extract data for only one region. If name corresponds to a hierarchical column and rep was not passed, the return type will be a regional_counts_superdict. Otherwise, the return type will be a regional_counts matrix. Pass None to extract data for all regions.

  • fill_value (any, optional) – The fill value for the counts_superdict (for entries not present in the Dataset). Pass None to use np.nan.

  • dtype (dtype, optional) – The dtype to use for the np.ndarrays in the counts_superdict. Pass None to guess it from the Dataset. If the data being extracted are strings, ‘U25’ will be assumed.

Returns

The data requested. See Parameters for an explanation of the return type. The general philosophy is that a counts_superdict will be returned, but any single-key levels will be squeezed.

Return type

regional_counts matrix, counts dict, counts_superdict, or regional_counts_superdict
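The squeezing rule can be illustrated on plain dicts. The helper below is hypothetical and only mirrors the rule as described for a hierarchical column; it is not lib5c's implementation and ignores fill_value and dtype.

```python
import numpy as np


def squeeze_superdict(counts_superdict, rep=None, region=None):
    """Hypothetical helper: select a rep and/or region, squeezing any
    level that is left with a single key."""
    selected = counts_superdict if rep is None else {rep: counts_superdict[rep]}
    if region is not None:
        # Squeeze the region level: each rep now maps straight to a matrix.
        selected = {r: by_region[region] for r, by_region in selected.items()}
    # Squeeze the replicate level if a single rep was requested.
    return selected if rep is None else selected[rep]


matrix = np.eye(2)
superdict = {
    "rep1": {"regionA": matrix, "regionB": matrix * 2},
    "rep2": {"regionA": matrix * 3, "regionB": matrix * 4},
}

everything = squeeze_superdict(superdict)                      # counts_superdict
one_rep = squeeze_superdict(superdict, rep="rep1")             # counts dict
one_region = squeeze_superdict(superdict, region="regionB")    # regional_counts_superdict
one_matrix = squeeze_superdict(superdict, rep="rep2", region="regionA")
```

The four calls correspond to the four return types listed above.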

dropna(name='counts', reps=None)[source]

Drops NA values from the underlying DataFrame.

Parameters
  • name (str) – The name of the column whose NA values determine which rows are dropped.

  • reps (list of str, optional) – If name refers to a hierarchical column, pass a list of rep names to only drop based on these reps. Pass None to drop based on the presence of an NA in any rep. If name does not refer to a hierarchical column, this kwarg is ignored.
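The difference between reps=None and an explicit list of reps can be sketched with boolean masks in plain pandas. This is an emulation with hypothetical names and values, assuming rows are what gets dropped; it does not call lib5c.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0], [np.nan, 5.0], [7.0, np.nan], [3.0, 4.0]],
    index=["fragA_fragA", "fragA_fragB", "fragA_fragC", "fragB_fragB"],
    columns=pd.MultiIndex.from_product([["counts"], ["rep1", "rep2"]]),
)

# reps=None: drop a row if any replicate of 'counts' holds an NA.
dropped_any = df[df["counts"].notna().all(axis=1)]

# reps=['rep1']: only NAs in the listed replicates cause a drop.
reps = ["rep1"]
dropped_rep1 = df[df["counts"][reps].notna().all(axis=1)]
```

In the second case the row with an NA only in rep2 is kept.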

classmethod from_counts_superdict(counts_superdict, pixelmap, name='counts', repinfo=None, rep_order=None)[source]

Creates a Dataset from a counts_superdict and associated pixelmap.

Parameters
  • counts_superdict (counts_superdict) – Contains the data that will be put into the Dataset.

  • pixelmap (pixelmap) – Needed to establish the row index on the Dataset.

  • name (str) – Top-level column name for the data.

  • repinfo (repinfo-style pd.DataFrame or list of str, optional) – Repinfo to bind to the Dataset. Pass a list of condition names to automatically create a repinfo object.

  • rep_order (list of str, optional) – Pass this to guarantee the order of the columns for the replicates. Pass None to accept an arbitrary order.

Returns

The new Dataset.

Return type

Dataset

classmethod from_table_file(table_file, name='counts', sep=None, pixelmap=None, repinfo=None)[source]

Creates a Dataset from a table file.

The first column of the table file should be an FFLJ ID.

The remaining columns should be count values for each replicate.

The first row should specify the replicate names for each column.

Parameters
  • table_file (str) – The table file to read counts from.

  • name (str) – Top-level column name for the data.

  • sep (str) – The separator to use when parsing the table file: ‘\t’ for tsv tables, ‘,’ for csv tables. Pass None to guess this from the filename.

  • pixelmap (pixelmap, optional) – A pixelmap to bind to the Dataset.

  • repinfo (repinfo-style pd.DataFrame, optional) – Repinfo to bind to the Dataset.

Returns

The new Dataset.

Return type

Dataset
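The table layout described for from_table_file (ID column first, one column per replicate, replicate names in the first row) can be pictured with a small in-memory example parsed by pandas. The IDs, replicate names, and the empty header cell over the ID column are assumptions made for illustration; this sketch does not call lib5c.

```python
import io

import pandas as pd

# A tab-separated table: first column holds the row IDs, the remaining
# columns hold count values, and the first row names the replicates.
table_text = (
    "\trep1\trep2\n"
    "fragA_fragA\t10\t12\n"
    "fragA_fragB\t3\t4\n"
)

table = pd.read_csv(io.StringIO(table_text), sep="\t", index_col=0)

# Place the replicate columns under a top-level name, as the name
# parameter is described as doing.
table.columns = pd.MultiIndex.from_product([["counts"], table.columns])
```

After the last line the replicate names sit on the lower level of the column hierarchy, under the top-level name 'counts'.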

classmethod load(filename, sep=None)[source]

Loads a Dataset from disk.

Parameters
  • filename (str) – The .csv or .tsv file to load the Dataset from. If a pixelmap or repinfo file is found next to this file, these files will also be loaded into the Dataset.

  • sep (str, optional) – The separator to use when parsing the .csv/.tsv. Pass None to deduce this automatically from the file extension.

Returns

The loaded Dataset.

Return type

Dataset

save(filename, sep=None)[source]

Writes this Dataset to disk as a .csv/.tsv. If a pixelmap and/or repinfo exists in the Dataset, each is also written to disk right next to it.

Parameters
  • filename (str) – The filename to write to.

  • sep (str, optional) – The separator to use when writing the file. If filename ends with .csv or .tsv and sep is None, the separator will be determined automatically by the extension, but you can pass a value here to override it.
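Deducing the separator from the file extension, with an explicit sep taking precedence, might look like the following. guess_sep is a hypothetical helper, and raising ValueError for other extensions is an assumption; the source does not say what lib5c does in that case.

```python
import os


def guess_sep(filename, sep=None):
    """Hypothetical helper: pick a separator from the file extension
    unless the caller passed one explicitly."""
    if sep is not None:
        return sep  # an explicit value overrides the extension
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".csv":
        return ","
    if extension == ".tsv":
        return "\t"
    # Assumption: with no recognizable extension, ask the caller for sep.
    raise ValueError("cannot deduce separator from %r; pass sep" % filename)


csv_sep = guess_sep("dataset.csv")
tsv_sep = guess_sep("dataset.tsv")
override_sep = guess_sep("dataset.tsv", sep=",")
```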

select(name='counts', rep=None, region=None)[source]

Get a subset of this Dataset’s DataFrame corresponding to a desired column, replicate, and/or region.

Parameters
  • name (str) – The column name of a hierarchical or non-hierarchical column.

  • rep (str, optional) – If name refers to a hierarchical column, you must specify which replicate you want to select data from by passing its name here.

  • region (str, optional) – To select data from only one region, pass its name here. Pass None to select data from all regions.
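A pandas-only sketch of this kind of selection follows. select_sketch is a hypothetical function, and region_of_row is a hypothetical stand-in for the region information a pixelmap would supply; neither is lib5c's implementation.

```python
import pandas as pd

index = ["a1_a1", "a1_a2", "b1_b1"]
df = pd.DataFrame(
    [[10.0, 12.0], [3.0, 4.0], [8.0, 9.0]],
    index=index,
    columns=pd.MultiIndex.from_product([["counts"], ["rep1", "rep2"]]),
)

# Hypothetical region label for each row.
region_of_row = pd.Series(["regionA", "regionA", "regionB"], index=index)


def select_sketch(df, region_of_row, name="counts", rep=None, region=None):
    """Pick a column (and replicate, for hierarchical columns), then
    optionally keep only the rows belonging to one region."""
    key = name if rep is None else (name, rep)
    subset = df[key]
    if region is not None:
        subset = subset[region_of_row == region]
    return subset


rep2_region_a = select_sketch(df, region_of_row, rep="rep2", region="regionA")
rep1_all = select_sketch(df, region_of_row, rep="rep1")
```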