Core data structures and file types

This section introduces the core data structures and file types used throughout lib5c.

Core data structures

The core data structures used throughout lib5c are counts dicts and primermaps/pixelmaps, which serve as representations of contact matrices and locus information, respectively.

Representing contact matrices

The 5C assay attempts to quantify interactions between pairs of genomic loci. These genomic loci do not span the entire genome, as in Hi-C. Instead, they are restricted by the 5C primer design to contiguous blocks, which we will refer to as regions.

Throughout lib5c, we represent the pairwise cis (intra-region) interaction frequencies between loci as a square, symmetric matrix whose number of rows and columns is equal to the number of loci in the region. When there are multiple regions, we put one matrix per region into a dictionary whose keys are the region names as strings. Therefore, we end up with expressions like:

counts[region][i, j] = 23.5

where region is the name of the region as a string, i and j are integer indices corresponding to loci within the region, and counts[region][i, j] gives the value of the interaction frequency between the i-th and the j-th locus of the region, which may be an integer or floating-point number.

We will commonly call these data structures “counts dicts”. More formally, the Python type annotation for a “counts dict” is:

Dict[str, np.ndarray]
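
As a minimal sketch of this structure (not part of the lib5c API), the snippet below builds a small counts dict by hand and checks the properties described above. The region names, region sizes, and the helper functions set_interaction and is_valid_counts_dict are hypothetical and exist only for illustration.

```python
from typing import Dict

import numpy as np

# Hypothetical region names and locus counts, chosen only for illustration.
region_sizes = {'Sox2': 4, 'Klf4': 3}

# One square matrix per region, keyed by the region name.
counts: Dict[str, np.ndarray] = {
    region: np.zeros((n_loci, n_loci), dtype=float)
    for region, n_loci in region_sizes.items()
}


def set_interaction(counts, region, i, j, value):
    """Hypothetical helper: write a value to both (i, j) and (j, i)."""
    counts[region][i, j] = value
    counts[region][j, i] = value


def is_valid_counts_dict(counts):
    """Hypothetical helper: check keys are strings and matrices are
    square and symmetric (NaNs are treated as equal to each other)."""
    return all(
        isinstance(region, str)
        and matrix.ndim == 2
        and matrix.shape[0] == matrix.shape[1]
        and np.allclose(matrix, matrix.T, equal_nan=True)
        for region, matrix in counts.items()
    )


set_interaction(counts, 'Sox2', 0, 2, 23.5)
```

Writing each value to both (i, j) and (j, i) is one simple way to keep the matrix symmetric; code that only fills one triangle would fail the symmetry check above.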

Representing information about loci

A contact matrix is meaningless when it is separated from information about what specific genomic loci it describes. For every dictionary of contact matrices, we will usually also have a separate object that stores information about the genomic loci whose interactions are quantified in the contact matrices. This will have the form:

primermap[region][i] = {
    'chrom': 'chr3',
    'start': 34107373,
    'end': 34109022,
    'name': '5C_329_Sox2_REV_1',
    'strand': '-'
}

where region is the name of the region as a string, i is the index of the locus within the region, and primermap[region][i] is a dict storing information about the i-th locus in the region. At a minimum, it must indicate the chromosome, start, and end of the locus. In practice, it can also include additional information such as the name of the locus or (for loci that are restriction fragments) the strand that the 5C primer for this fragment was designed to.

We will commonly call these data structures “primermaps” when the loci they describe are primers, and “pixelmaps” when the loci they describe are bins (in reference to the “pixels” on a 5C heatmap). More formally, the Python type annotation for either of these data structures is:

Dict[str, List[Dict[str, Any]]]

where the keys to the outer dict are region names and the inner dict must have at least the keys ‘chrom’, ‘start’, and ‘end’.
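
The following sketch, which is not lib5c code, shows one way to check that a primermap or pixelmap has the required keys and that it lines up with a counts dict, on the assumption that locus i of the primermap corresponds to row and column i of the matrix for the same region. The second locus entry is made up, and the helpers missing_keys and matches_counts are hypothetical.

```python
from typing import Any, Dict, List

import numpy as np

REQUIRED_KEYS = {'chrom', 'start', 'end'}

primermap: Dict[str, List[Dict[str, Any]]] = {
    'Sox2': [
        {'chrom': 'chr3', 'start': 34107373, 'end': 34109022,
         'name': '5C_329_Sox2_REV_1', 'strand': '-'},
        # Made-up second locus with only the required keys.
        {'chrom': 'chr3', 'start': 34109022, 'end': 34111000},
    ]
}


def missing_keys(primermap):
    """Hypothetical helper: list (region, index, missing keys) for every
    locus that lacks one of the required keys."""
    problems = []
    for region, loci in primermap.items():
        for i, locus in enumerate(loci):
            missing = REQUIRED_KEYS - set(locus)
            if missing:
                problems.append((region, i, sorted(missing)))
    return problems


def matches_counts(primermap, counts):
    """Hypothetical helper: same regions, and one locus per matrix row."""
    return set(primermap) == set(counts) and all(
        counts[region].shape == (len(loci), len(loci))
        for region, loci in primermap.items()
    )


# A counts dict with one 2 x 2 matrix for the two loci above.
counts = {'Sox2': np.zeros((2, 2))}
```

Keeping locus information in a separate object means the same primermap can describe several counts dicts (for example, replicates), as long as each matrix has one row and column per locus.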

Core file types

The core file types used as inputs and outputs throughout lib5c are countsfiles and primerfiles/bin bedfiles.

Representing contact matrices

Countsfiles are used to represent contact matrices.

Representing information about loci

Special bedfiles called primerfiles are used to represent the loci whose interactions are contained in countsfiles.