Clustering

Sometimes, we want to group 5C interactions into clusters according to some perceived spatial and architectural related-ness. This can happen when we see several nearby “pixels” on a 5C heatmap with high intensities. This section explains the functionality of the lib5c clustering framework, which can be used to identify these clusters using any of a suite of clustering algorithms.

Command-line interfaces

A command-line interface to this clustering framework exists as a separate repository, which can be found at https://bitbucket.org/creminslab/3d-clusterbot

Exposed functionality

The algorithms which make up the clustering framework can be found in the lib5c.algorithms.clustering subpackage.

Core data structures

peak (Dict[str, numeric])

Withing the clustering framework, the term peak is used to refer to an object that represents a single interaction. The data structure used to represent a peak is a dictionary of the form:

{
    'x': 4,
    'y': 7,
    'value': 12.43
}

we can think of peaks as an unfolded contact matrix, where each entry in the contact matrix gets represented as a separate dictionary. This representation may seem overly verbose, but it has the dual advantage of becoming sparse when interactions when peaks with values below some threshold are excluded from clustering, and of providing a simple representation of clusters as lists of peaks (see below).

cluster (List[Dict[str, numeric]])

Within the clustering framework, the term cluster is used to refer to a list of peaks that have been determined to belong together.

Core API

Many different paradigms for clustering operations exist. There are three broadly-defined clustering operations:

  1. cluster assembly (the initial construction of clustered from a flat list of candidate peaks)

  2. cluster merging (the merging of two clusters into one cluster)

  3. cluster splitting (the splitting of clusters into subclusters)

Cluster assembly

The API for cluster assembly is to define a single function

make_clusters(peaks, **kwargs)

that takes in a list of candidate peaks and returns a list of clusters.

The cluster assembly paradigms available are:

knn, via lib5c.algorithms.clustering.knn.make_clusters()

Assembles clusters by classifying each peak into a cluster according to the cluster membership of its nearest neighbors.

adjacency, via lib5c.algorithms.clustering.adjacency.make_clusters()

Assumes the clusters are simply the connected components made up of peaks, as judged by spatial adjacency.

greedy, via lib5c.algorithms.clustering.greedy.make_clusters()

Attempts to grow clusters in a greedy fashion by absorbing all nearby peaks and clusters.

Cluster merging

The API for cluster merging is to define a single function

merge_to_which(clusters)

which takes in a list of clusters and returns the index of the cluster that clusters[0] should be merged into, or -1 if clusters[0] shouldn’t be merged.

The cluster merging paradigms available are:

adjacency, via lib5c.algorithms.clustering.adjacency.merge_to_which()

Merges adjacent clusters together.

enclave, via lib5c.algorithms.clustering.enclave.merge_to_which()

Merges clusters together if one cluster completely surrounds another.

To recursively merge a list of clusters using a merge_to_which() function, a utility function is provided as

lib5c.algorithms.clustering.util.merge_clusters()

which takes in the list of clusters and a reference to the merge_to_which() function that defines the merge rule to use, and returns a list of recursively-merged clusters.

Cluster splitting

The API for cluster splitting is to define a single function

split_clusters(clusters, **kwargs)

which takes in a list of clusters and returns a list of clusters that have been recursively split according to some splitting rule. By convention, the splitting rule should be defined in a function called split_cluster(), but different splitting paradigms may use different signatures for this function.

The cluster splitting paradigms available are:

valley, via lib5c.algorithms.clustering.valley.split_clusters()

Splits clusters by identifying “valleys” between the “mountains” that correspond to true clusters.

quasicontiguity, via lib5c.algorithms.clustering.quasicontiguity.split_clusters()

Splits clusters according to a “quasicontiguity” criterion where clusters that are physically separated into two sufficiently large and sufficiently distant subcomponents get split into separate clusters.