lib5c.structures.dataset module
Module for the Dataset class, which provides a wrapper around a pandas DataFrame allowing for representation of 5C data across replicates and stages of data processing both on disk and in memory.
-
class
lib5c.structures.dataset.
Dataset
(df, pixelmap=None, repinfo=None)
Bases: object
Wrapper around a Pandas DataFrame.
-
df
Contains the core data in the Dataset. Columns should be either not hierarchical, or hierarchical with the lower level of the hierarchy matching the replicate names. The row index of this DataFrame must be ‘<upstream_fragment_name>_<downstream_fragment_name>’.
- Type
pd.DataFrame
-
pixelmap
A pixelmap to provide information about the fragments.
- Type
pixelmap, optional
-
repinfo
Its row index should be the replicate names; its columns can provide arbitrary information about each replicate, such as its condition.
- Type
pd.DataFrame, optional
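The layout described for df and repinfo can be sketched with plain pandas objects. This is only an illustration of the expected shapes, not lib5c code: the fragment names, replicate names, and condition labels below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical fragment and replicate names, chosen only for illustration.
index = ["fragA_fragA", "fragA_fragB", "fragB_fragB"]
reps = ["rep1", "rep2"]

# Hierarchical columns: the top level names the kind of data,
# the lower level matches the replicate names.
columns = pd.MultiIndex.from_product([["counts"], reps])
df = pd.DataFrame(
    np.array([[10.0, 12.0], [3.0, 4.0], [8.0, 9.0]]),
    index=index,      # '<upstream_fragment_name>_<downstream_fragment_name>'
    columns=columns,
)

# repinfo: the row index is the replicate names, the columns are free-form.
repinfo = pd.DataFrame({"condition": ["ES", "NPC"]}, index=reps)

# Selecting the top level yields one column per replicate.
print(df["counts"])
```

A DataFrame and repinfo shaped like this are what the constructor arguments above describe; the pixelmap is left out because its structure is not specified in this section.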
-
add_column_from_counts
(counts, name)
Adds a new column to this Dataset’s df.
The counts dict passed is assumed to match the pixelmap bound on this Dataset. If no pixelmap is bound, a ValueError will be raised.
- Parameters
counts (dict of np.ndarray) – Should contain the values that will make up the new column.
name (str) – The name of the new column.
-
add_columns_from_counts_superdict
(counts_superdict, name, rep_order=None)
Adds a new group of columns to the Dataset from a counts superdict structure.
- Parameters
counts_superdict (dict of dict of np.ndarray) – The outer keys are replicate names as strings, the inner keys are region names as strings, and the values are square, symmetric arrays of values for each replicate and region.
name (str) – The name to use for the new group of columns.
rep_order (list of str, optional) – Pass a list of replicate names to load the listed replicates in a specific order. Pass None to use the arbitrary order of the outer keys of counts_superdict.
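To make the counts superdict structure concrete, here is a minimal sketch that flattens one into a frame with hierarchical columns. It is not the lib5c implementation: the helper flatten_superdict and the fragment_names mapping (a simplified stand-in for a pixelmap) are hypothetical, and storing only the upper triangle of each matrix is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a pixelmap: region name -> ordered fragment names.
fragment_names = {"Sox2": ["s0", "s1"], "Klf4": ["k0", "k1"]}

# counts superdict: outer keys are replicate names, inner keys are region
# names, values are square, symmetric arrays.
counts_superdict = {
    "rep2": {"Sox2": np.array([[5.0, 1.0], [1.0, 6.0]]),
             "Klf4": np.array([[7.0, 2.0], [2.0, 8.0]])},
    "rep1": {"Sox2": np.array([[1.0, 2.0], [2.0, 3.0]]),
             "Klf4": np.array([[4.0, 0.0], [0.0, 9.0]])},
}


def flatten_superdict(counts_superdict, fragment_names, name, rep_order=None):
    """Flatten a counts superdict into a frame with (name, rep) columns."""
    # Without rep_order, fall back to the order of the outer keys.
    reps = list(rep_order) if rep_order is not None else list(counts_superdict)
    columns = {}
    for rep in reps:
        values = {}
        for region, frags in fragment_names.items():
            matrix = counts_superdict[rep][region]
            # The matrices are symmetric, so the upper triangle is enough.
            for i in range(len(frags)):
                for j in range(i, len(frags)):
                    values[f"{frags[i]}_{frags[j]}"] = matrix[i, j]
        columns[(name, rep)] = pd.Series(values)
    return pd.DataFrame(columns)


df = flatten_superdict(counts_superdict, fragment_names, "counts",
                       rep_order=["rep1", "rep2"])
```

Passing rep_order fixes the column order to rep1, rep2 even though the outer keys of the dict were inserted in the opposite order.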
-
apply_across_replicates
(fn, inputs, outputs, **kwargs)
Applies a matrix-to-matrix function over the Dataset.
This is useful for functions that don’t operate independently on each replicate of the Dataset.
The main advantage of this function is that it handles the unboxing of the replicates after a matrix-to-matrix function is applied. If you are looking to apply a matrix-to-vector function over the Dataset, you can do it with a one-liner, assigning the vector result(s) to the new column(s) immediately.
- Parameters
fn (Callable) – The function to apply. It should take in np.ndarrays as its inputs and return np.ndarrays with the same size and shape. If some inputs are specified as individual columns, they will be passed to fn as np.ndarrays shaped as column vectors.
inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be passed to fn as column vectors.
outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
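The boxing and unboxing that this method handles can be sketched by hand with pandas and NumPy. The example below is an illustration under assumptions, not the library's code: the function center_across_replicates, the output name "centered", and the data are all invented for the sketch.

```python
import numpy as np
import pandas as pd

reps = ["rep1", "rep2"]
df = pd.DataFrame(
    [[1.0, 3.0], [2.0, 6.0], [4.0, 12.0]],
    index=["a_a", "a_b", "b_b"],
    columns=pd.MultiIndex.from_product([["counts"], reps]),
)


def center_across_replicates(matrix):
    """Matrix-to-matrix function: subtract each row's mean across replicates."""
    return matrix - matrix.mean(axis=1, keepdims=True)


# Boxing: gather all replicates of the input column into one 2D array
# (rows are fragment pairs, columns are replicates).
matrix = df["counts"].to_numpy()
result = center_across_replicates(matrix)

# Unboxing: write each column of the result back under a new top-level
# name, with the replicate names filling the second level.
for k, rep in enumerate(df["counts"].columns):
    df[("centered", rep)] = result[:, k]
```

The function cannot be applied replicate by replicate because each output value depends on the whole row, which is the situation this method is described as targeting.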
-
apply_across_replicates_per_region
(fn, inputs, outputs, initial_values=0.0, **kwargs)
Applies a matrix-to-matrix function over the Dataset in a per-region manner.
This is useful for functions that don’t operate independently on each replicate of the Dataset, but which operate independently on each region of the Dataset.
The main advantage of this function is that it handles the unboxing of the replicates after a matrix-to-matrix function is applied. If you are looking to apply a matrix-to-vector function over the Dataset in a per-region manner, you can do it with apply_per_region(), feeding a hierarchical column as an input.
- Parameters
fn (Callable) – The function to apply. It should take in np.ndarrays as its inputs and return np.ndarrays with the same size and shape. If some inputs are specified as individual columns, they will be passed to fn as np.ndarrays shaped as column vectors.
inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be passed to fn as column vectors.
outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.
-
apply_per_region
(fn, inputs, outputs, initial_values=0.0, **kwargs)
Applies a function over the Dataset on a per-region basis.
- Parameters
fn (Callable) – The function to apply. It should take in pd.Series or pd.DataFrame objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.
inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. Omit the second level of a hierarchical column to pass all replicates to fn as a single pd.DataFrame. A single string or tuple will be wrapped in a list automatically.
outputs (list of (str or tuple of str)) – Names of output columns to be added to the Dataset. Use a tuple of strings to create hierarchical columns.
initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.
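A per-region application can be sketched in plain pandas as follows. This is a simplified illustration rather than the library's implementation: the region labels are supplied by hand (how lib5c assigns rows to regions is not covered in this section), and scale_to_region_mean and the column name "scaled" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"counts": [4.0, 2.0, 6.0, 10.0, 30.0]},
    index=["s0_s0", "s0_s1", "s1_s1", "k0_k0", "k0_k1"],
)
# Hypothetical row-to-region labels, supplied by hand for this sketch.
region = pd.Series(["Sox2", "Sox2", "Sox2", "Klf4", "Klf4"], index=df.index)


def scale_to_region_mean(counts):
    """Takes one region's pd.Series and returns a 1D vector of equal length."""
    return (counts / counts.mean()).to_numpy()


# Initialize the output column first; the initial value fixes its dtype.
df["scaled"] = 0.0

# Apply the function to each region's rows and write the result back.
for region_name, rows in df.groupby(region).groups.items():
    df.loc[rows, "scaled"] = scale_to_region_mean(df.loc[rows, "counts"])
```

Initializing the column before the loop mirrors the role described for initial_values: the column exists with a known dtype before any region's results are filled in.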
-
apply_per_replicate
(fn, inputs, outputs, **kwargs)
Applies a function over the Dataset on a per-replicate basis.
- Parameters
fn (Callable) – The function to apply. It should take in pd.Series objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.
inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be broadcast across all replicates.
outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
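The broadcasting behavior described for inputs can be sketched like this. The example is an assumption-laden illustration, not lib5c code: the shared expected Series, the function observed_over_expected, and the output name "obs_over_exp" are invented for the sketch.

```python
import pandas as pd

reps = ["rep1", "rep2"]
df = pd.DataFrame(
    [[10.0, 20.0], [30.0, 60.0]],
    index=["a_a", "a_b"],
    columns=pd.MultiIndex.from_product([["counts"], reps]),
)
# A single per-row vector standing in for a non-hierarchical input column.
expected = pd.Series([5.0, 10.0], index=df.index)


def observed_over_expected(counts, expected):
    """Takes two pd.Series and returns a 1D vector of the same length."""
    return (counts / expected).to_numpy()


# The hierarchical input determines which replicates to loop over;
# the shared vector is passed unchanged to every call (broadcast).
for rep in df["counts"].columns:
    df[("obs_over_exp", rep)] = observed_over_expected(
        df[("counts", rep)], expected
    )
```

Each replicate gets its own output column under the new top-level name, while the same expected values are reused for all of them.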
-
apply_per_replicate_per_region
(fn, inputs, outputs, initial_values=0.0, **kwargs)
Applies a function over the Dataset on a per-replicate, per-region basis.
- Parameters
fn (Callable) – The function to apply. It should take in pd.Series objects as its args, in the same order as inputs, and it should return 1D vectors, in the same order as outputs.
inputs (list of (str or tuple of str)) – The list of columns to pass as inputs to fn. Use a tuple of strings to access hierarchical columns. At least one input must refer to the top level of a hierarchical column; the first such column encountered will be used to determine the replicates to apply over. Non-hierarchical columns, or hierarchical columns fully specified by a tuple of strings, will be broadcast across all replicates.
outputs (list of str) – Names of top-level output columns to be added to the Dataset. The second level will be automatically filled in with the replicate names.
initial_values (list of any) – The values with which the new columns will be temporarily initialized. This should control the dtype of the new columns.
-
counts
(name='counts', rep=None, region=None, fill_value=None, dtype=None)
Converts this Dataset to a regional_counts matrix, a counts dict, a counts_superdict, or a regional_counts_superdict.
- Parameters
name (str) – The top-level column name to extract.
rep (str, optional) – If name corresponds to a hierarchical column, pass a rep name to extract only one rep (return type will be a counts dict). Pass None to return a counts_superdict with all reps. If name corresponds to a normal column, this kwarg will be ignored.
region (str, optional) – Pass a region name as a string to extract data for only one region. If name corresponds to a hierarchical column and rep was not passed, the return type will be a regional_counts_superdict. Otherwise, the return type will be a regional_counts matrix. Pass None to extract data for all regions.
fill_value (any, optional) – The fill value for the counts_superdict (for entries not present in the Dataset). Pass None to use np.nan.
dtype (dtype, optional) – The dtype to use for the np.array’s in the counts_superdict. Pass None to guess them from the Dataset. If the data being extracted is strings, ‘U25’ will be assumed.
- Returns
regional_counts matrix, counts dict, counts_superdict, or regional_counts_superdict – The data requested. See Parameters for an explanation of the return type. The general philosophy is that a counts_superdict will be returned, but any single-key levels will be squeezed.
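The innermost conversion, from one flat column to a square regional_counts matrix, can be sketched as below. This is a simplified illustration, not the library's implementation: to_regional_counts and the fragment list are hypothetical, and the sketch covers only one replicate and one region.

```python
import numpy as np
import pandas as pd

# Hypothetical fragment order for a single region.
frags = ["s0", "s1", "s2"]

# One flat column for one replicate; some fragment pairs are absent.
column = pd.Series({"s0_s0": 4.0, "s0_s1": 2.0, "s1_s1": 6.0, "s1_s2": 1.0})


def to_regional_counts(column, frags, fill_value=None, dtype=None):
    """Rebuild a square, symmetric matrix from a flat column."""
    # None means "use np.nan", as described for fill_value.
    fill = np.nan if fill_value is None else fill_value
    # None means "guess the dtype from the data".
    if dtype is None:
        dtype = column.dtype
    matrix = np.full((len(frags), len(frags)), fill, dtype=dtype)
    for i, upstream in enumerate(frags):
        for j in range(i, len(frags)):
            key = f"{upstream}_{frags[j]}"
            if key in column.index:
                # Mirror the value so the matrix stays symmetric.
                matrix[i, j] = matrix[j, i] = column[key]
    return matrix


matrix = to_regional_counts(column, frags)
```

Entries missing from the column, such as s0_s2 and s2_s2 here, keep the fill value. Repeating this per region would give a counts dict, and repeating that per replicate would give a counts_superdict.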
-
dropna
(name='counts', reps=None)
Drops NAs from the underlying DataFrame.
- Parameters
name (str) – The name of the column whose NA values determine which rows are dropped.
reps (list of str, optional) – If name refers to a hierarchical column, pass a list of rep names to only drop based on these reps. Pass None to drop based on the presence of an NA in any rep. If name does not refer to a hierarchical column, this kwarg is ignored.
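The difference between reps=None and an explicit list of reps can be sketched with plain pandas filtering. This only illustrates the described semantics on a made-up frame; it does not call lib5c, and unlike the method it builds new frames instead of modifying a Dataset.

```python
import numpy as np
import pandas as pd

reps = ["rep1", "rep2", "rep3"]
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [np.nan, 5.0, 6.0], [7.0, 8.0, np.nan]],
    index=["a_a", "a_b", "b_b"],
    columns=pd.MultiIndex.from_product([["counts"], reps]),
)

# Equivalent of reps=None: drop a row if any replicate has an NA.
any_rep = df[df["counts"].notna().all(axis=1)]

# Equivalent of reps=['rep1', 'rep2']: only these replicates are checked,
# so the NA in rep3 does not cause row b_b to be dropped.
some_reps = df[df["counts"][["rep1", "rep2"]].notna().all(axis=1)]
```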
-
classmethod
from_counts_superdict
(counts_superdict, pixelmap, name='counts', repinfo=None, rep_order=None)
Creates a Dataset from a counts_superdict and associated pixelmap.
- Parameters
counts_superdict (counts_superdict) – Contains the data that will be put into the Dataset.
pixelmap (pixelmap) – Needed to establish the row index on the Dataset.
name (str) – Top-level column name for the data.
repinfo (repinfo-style pd.DataFrame or list of str, optional) – Repinfo to bind to the Dataset. Pass a list of condition names to automatically create a repinfo object.
rep_order (list of str, optional) – Pass this to guarantee the order of the columns for the replicates. Pass None to accept a random order.
- Returns
The new Dataset.
- Return type
Dataset
-
classmethod
from_table_file
(table_file, name='counts', sep=None, pixelmap=None, repinfo=None)
Creates a Dataset from a table file.
The first column of the table file should be a FFLJ ID.
The remaining columns should be count values for each replicate.
The first row should specify the replicate names for each column.
- Parameters
table_file (str) – The table file to read counts from.
name (str) – Top-level column name for the data.
sep (str) – The separator to use when parsing the table file: ‘\t’ for tsv tables, ‘,’ for csv tables. Pass None to guess this from the filename.
pixelmap (pixelmap, optional) – A pixelmap to bind to the Dataset.
repinfo (repinfo-style pd.DataFrame, optional) – Repinfo to bind to the Dataset.
- Returns
The new Dataset.
- Return type
Dataset
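The table layout described above (an ID column first, replicate names in the first row) can be sketched with pandas directly. This is an illustration under assumptions, not the method's implementation: the header label "id", the fragment names, and the in-memory table are made up, and the real method reads from a file path.

```python
import io

import pandas as pd

# A tiny tab-separated table: first column is the row ID, the first row
# names the replicates. The "id" header label is a hypothetical placeholder.
table = (
    "id\trep1\trep2\n"
    "fragA_fragA\t10\t12\n"
    "fragA_fragB\t3\t4\n"
)

# Parse the table, using the first column as the row index.
raw = pd.read_csv(io.StringIO(table), sep="\t", index_col=0)

# Nest the replicate columns under a top-level column name.
raw.columns = pd.MultiIndex.from_product([["counts"], raw.columns])
```

The resulting frame has the hierarchical layout expected for df, with "counts" playing the role of the name argument.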
-
classmethod
load
(filename, sep=None)
Loads a Dataset from disk.
- Parameters
filename (str) – The .csv or .tsv file to load the Dataset from. If a pixelmap or repinfo file is found next to this file, these files will also be loaded into the Dataset.
sep (str, optional) – The separator to use when parsing the .csv/.tsv. Pass None to deduce this automatically from the file extension.
- Returns
The loaded Dataset.
- Return type
Dataset
-
save
(filename, sep=None)
Writes this Dataset to disk as a .csv/.tsv file. If a pixelmap and/or repinfo exists in the Dataset, the corresponding files are also written to disk right next to it.
- Parameters
filename (str) – The filename to write to.
sep (str, optional) – The separator to use when writing the file. If filename ends with .csv or .tsv and sep is None, the separator will be determined automatically by the extension, but you can pass a value here to override it.
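The separator rule described for sep can be sketched as a small helper. The function guess_sep is hypothetical and is not part of lib5c. Raising a ValueError for an unrecognized extension is an assumption of this sketch, since the behavior in that case is not documented here.

```python
import os


def guess_sep(filename, sep=None):
    """Return the separator to use for filename, honoring an explicit sep."""
    # An explicitly passed separator always overrides the extension.
    if sep is not None:
        return sep
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".csv":
        return ","
    if extension == ".tsv":
        return "\t"
    # Assumption of this sketch: fail loudly on unknown extensions.
    raise ValueError(f"cannot guess separator for {filename!r}")


print(repr(guess_sep("dataset.csv")))           # ','
print(repr(guess_sep("dataset.tsv")))           # '\t'
print(repr(guess_sep("dataset.txt", sep=";")))  # ';'
```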
-
select
(name='counts', rep=None, region=None)
Gets a subset of this Dataset’s DataFrame corresponding to a desired column, replicate, and/or region.
- Parameters
name (str) – The column name of a hierarchical or non-hierarchical column.
rep (str, optional) – If name refers to a hierarchical column, you must specify which replicate you want to select data from by passing its name here.
region (str, optional) – To select data from only one region, pass its name here. Pass None to select data from all regions.
-