Core data structures and file types¶
This section introduces the core data structures and file types used
throughout lib5c.
Core data structures¶
The core data structures used throughout lib5c are counts dicts and
primermaps/pixelmaps, which serve as representations of contact matrices and
locus information, respectively.
Representing contact matrices¶
The 5C assay attempts to quantify interactions between pairs of genomic loci. Unlike in Hi-C, these genomic loci do not span the entire genome. Instead, they are restricted by the 5C primer design to contiguous blocks, which we will refer to as regions.
Throughout lib5c, we represent the pairwise cis (intra-region) interaction
frequencies between loci as a square, symmetric matrix whose number of rows and
columns is equal to the number of loci in the region. When there are multiple
regions, we put one matrix per region into a dictionary whose keys are the
region names as strings. Therefore, we end up with expressions like:
counts[region][i, j] = 23.5
where region is the name of the region as a string, i and j are integer
indices corresponding to loci within the region, and counts[region][i, j]
gives the value of the interaction frequency between the i-th and the j-th
locus of the region, which may be an integer or floating-point number.
We will commonly call these data structures “counts dicts”. More formally, the Python type annotation for a “counts dict” is:
Dict[str, np.ndarray]
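As a minimal sketch of what such an object looks like in practice, the snippet below builds a counts dict by hand with NumPy. The region names, matrix sizes, and values are made up for illustration, and the helpers make_symmetric and is_valid_counts_dict are hypothetical names introduced here, not functions provided by lib5c.

```python
from typing import Dict

import numpy as np


def make_symmetric(upper: np.ndarray) -> np.ndarray:
    """Mirror the upper triangle of a square matrix onto its lower triangle."""
    return np.triu(upper) + np.triu(upper, k=1).T


def is_valid_counts_dict(counts: Dict[str, np.ndarray]) -> bool:
    """Check that every region name maps to a square, symmetric 2-D matrix."""
    for region, matrix in counts.items():
        if not isinstance(region, str):
            return False
        if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
            return False
        if not np.allclose(matrix, matrix.T, equal_nan=True):
            return False
    return True


# Illustrative values only: two made-up regions with 3 and 2 loci.
counts: Dict[str, np.ndarray] = {
    'Sox2': make_symmetric(np.array([[10.0, 23.5, 4.0],
                                     [0.0, 12.0, 7.5],
                                     [0.0, 0.0, 9.0]])),
    'Klf4': make_symmetric(np.array([[5.0, 1.0],
                                     [0.0, 6.0]])),
}

# Interaction frequency between locus 0 and locus 1 of the 'Sox2' region.
value = counts['Sox2'][0, 1]
```

Because each matrix is symmetric, counts['Sox2'][0, 1] and counts['Sox2'][1, 0] hold the same value, so either index order can be used to look up an interaction.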
Representing information about loci¶
A contact matrix is meaningless when it is separated from information about what specific genomic loci it describes. For every dictionary of contact matrices, we will usually also have a separate object that stores information about the genomic loci whose interactions are quantified in the contact matrices. This will have the form:
primermap[region][i] = {
    'chrom': 'chr3',
    'start': 34107373,
    'end': 34109022,
    'name': '5C_329_Sox2_REV_1',
    'strand': '-'
}
where region is the name of the region as a string, i is the index of
the locus within the region, and primermap[region][i] is a dict storing
information about the i-th locus in the region. At a minimum, it must
indicate the chromosome, start, and end of the locus. In practice, it can also
include additional information such as the name of the locus or (for loci that
are restriction fragments) the strand to which the 5C primer for this fragment
was designed.
We will commonly call these data structures “primermaps” when the loci they describe are primers, and “pixelmaps” when the loci they describe are bins (in reference to the “pixels” on a 5C heatmap). More formally, the Python type annotation for either of these data structures is:
Dict[str, List[Dict[str, Any]]]
where the keys to the outer dict are region names and the inner dict must have at least the keys ‘chrom’, ‘start’, and ‘end’.
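The requirement on the inner dicts can be sketched as a small check. This is an illustration only: check_locus_map and REQUIRED_KEYS are hypothetical names, not part of lib5c, and the second locus below is made up.

```python
from typing import Any, Dict, List

# The three keys every locus dict must carry, per the description above.
REQUIRED_KEYS = {'chrom', 'start', 'end'}


def check_locus_map(locus_map: Dict[str, List[Dict[str, Any]]]) -> None:
    """Raise ValueError if any locus lacks one of the required keys."""
    for region, loci in locus_map.items():
        for i, locus in enumerate(loci):
            missing = REQUIRED_KEYS - set(locus)
            if missing:
                raise ValueError(
                    f'locus {i} in region {region!r} is missing '
                    f'{sorted(missing)}'
                )


primermap: Dict[str, List[Dict[str, Any]]] = {
    'Sox2': [
        {'chrom': 'chr3', 'start': 34107373, 'end': 34109022,
         'name': '5C_329_Sox2_REV_1', 'strand': '-'},
        # Made-up second locus carrying only the required keys.
        {'chrom': 'chr3', 'start': 34109023, 'end': 34111500},
    ],
}

check_locus_map(primermap)

# Number of loci per region.
n_loci = {region: len(loci) for region, loci in primermap.items()}
```

Since the i-th entry of primermap[region] describes the i-th row and column of the matching contact matrix, n_loci[region] would be expected to equal counts[region].shape[0] for a counts dict describing the same loci.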
Core file types¶
The core file types used as inputs and outputs throughout lib5c are
countsfiles and primerfiles/bin bedfiles.
Representing contact matrices¶
Countsfiles are used to represent contact matrices.
Representing information about loci¶
Special bedfiles called primerfiles are used to represent the loci whose interactions are contained in countsfiles.