Clustering¶
Sometimes, we want to group 5C interactions into clusters according to some
perceived spatial and architectural related-ness. This can happen when we see
several nearby “pixels” on a 5C heatmap with high intensities. This section
explains the functionality of the lib5c
clustering framework, which can be
used to identify these clusters using any of a suite of clustering algorithms.
Command-line interfaces¶
A command-line interface to this clustering framework exists as a separate repository, which can be found at https://bitbucket.org/creminslab/3d-clusterbot
Exposed functionality¶
The algorithms which make up the clustering framework can be found in the
lib5c.algorithms.clustering
subpackage.
Core data structures¶
- peak (
Dict[str, numeric]
) Withing the clustering framework, the term peak is used to refer to an object that represents a single interaction. The data structure used to represent a peak is a dictionary of the form:
{ 'x': 4, 'y': 7, 'value': 12.43 }
we can think of peaks as an unfolded contact matrix, where each entry in the contact matrix gets represented as a separate dictionary. This representation may seem overly verbose, but it has the dual advantage of becoming sparse when interactions when peaks with values below some threshold are excluded from clustering, and of providing a simple representation of clusters as lists of peaks (see below).
- cluster (
List[Dict[str, numeric]]
) Within the clustering framework, the term cluster is used to refer to a list of peaks that have been determined to belong together.
Core API¶
Many different paradigms for clustering operations exist. There are three broadly-defined clustering operations:
cluster assembly (the initial construction of clustered from a flat list of candidate peaks)
cluster merging (the merging of two clusters into one cluster)
cluster splitting (the splitting of clusters into subclusters)
Cluster assembly¶
The API for cluster assembly is to define a single function
make_clusters(peaks, **kwargs)
that takes in a list of candidate peaks and returns a list of clusters.
The cluster assembly paradigms available are:
- knn, via
lib5c.algorithms.clustering.knn.make_clusters()
Assembles clusters by classifying each peak into a cluster according to the cluster membership of its nearest neighbors.
- adjacency, via
lib5c.algorithms.clustering.adjacency.make_clusters()
Assumes the clusters are simply the connected components made up of peaks, as judged by spatial adjacency.
- greedy, via
lib5c.algorithms.clustering.greedy.make_clusters()
Attempts to grow clusters in a greedy fashion by absorbing all nearby peaks and clusters.
Cluster merging¶
The API for cluster merging is to define a single function
merge_to_which(clusters)
which takes in a list of clusters and returns the index of the cluster that
clusters[0]
should be merged into, or -1 if clusters[0]
shouldn’t be
merged.
The cluster merging paradigms available are:
- adjacency, via
lib5c.algorithms.clustering.adjacency.merge_to_which()
Merges adjacent clusters together.
- enclave, via
lib5c.algorithms.clustering.enclave.merge_to_which()
Merges clusters together if one cluster completely surrounds another.
To recursively merge a list of clusters using a merge_to_which()
function, a
utility function is provided as
lib5c.algorithms.clustering.util.merge_clusters()
which takes in the list of clusters and a reference to the merge_to_which()
function that defines the merge rule to use, and returns a list of
recursively-merged clusters.
Cluster splitting¶
The API for cluster splitting is to define a single function
split_clusters(clusters, **kwargs)
which takes in a list of clusters and returns a list of clusters that have been
recursively split according to some splitting rule. By convention, the splitting
rule should be defined in a function called split_cluster()
, but different
splitting paradigms may use different signatures for this function.
The cluster splitting paradigms available are:
- valley, via
lib5c.algorithms.clustering.valley.split_clusters()
Splits clusters by identifying “valleys” between the “mountains” that correspond to true clusters.
- quasicontiguity, via
lib5c.algorithms.clustering.quasicontiguity.split_clusters()
Splits clusters according to a “quasicontiguity” criterion where clusters that are physically separated into two sufficiently large and sufficiently distant subcomponents get split into separate clusters.