Intro to Clustering
DBSCAN
Parameters
eps: the radius of the neighborhood around a point; two points are neighbors if the distance between them is at most eps.
min_samples: the minimum number of points (counting the point itself) that must lie within eps of a point for it to be a core point.
Types of Vertices [Points]
Core Vertices: vertices with at least min_samples neighbors within distance eps.
Border Vertices: vertices that are not core themselves but lie within eps of at least one core vertex; they join that core vertex's cluster.
Noise Vertices: vertices that are neither core nor border; they are assigned to no cluster.
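A minimal sketch of how these three types could be computed, assuming plain 2-D points under Euclidean distance; the point set and the eps / min_samples values below are only for illustration:

import numpy as np

def classify_points(points, eps, min_samples):
    """Label each point as 'core', 'border', or 'noise' under the definitions above."""
    X = np.asarray(points, dtype=float)
    # pairwise Euclidean distances; a point counts as its own neighbor here,
    # following the common convention that min_samples includes the point itself
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbor_sets = [np.flatnonzero(row <= eps) for row in dists]
    is_core = [len(nbrs) >= min_samples for nbrs in neighbor_sets]

    labels = []
    for i, nbrs in enumerate(neighbor_sets):
        if is_core[i]:
            labels.append("core")      # dense enough on its own
        elif any(is_core[j] for j in nbrs):
            labels.append("border")    # within eps of some core point
        else:
            labels.append("noise")     # reachable from no core point
    return labels

# toy example: one tight group, one sparse pair, and one stray point
pts = [(0, 0), (0, 0.5), (0.5, 0), (5, 5), (5, 5.4), (10, 10)]
print(classify_points(pts, eps=1.0, min_samples=3))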
Algorithm [Black-Boxing]
# core(v) and neighbors(v) depend on eps and min_samples:
#   neighbors(v) -> all points within distance eps of v
#   core(v)      -> True if v has at least min_samples such neighbors
from collections import deque

clusters = {}        # core point that seeded a cluster -> that cluster's members
noise = []           # points absorbed by no cluster
unvisited = set(V)   # V: the full set of points

for v in V:
    if v not in unvisited or not core(v):
        continue
    # start a new cluster seeded at the core point v
    clusters[v] = [v]
    unvisited.discard(v)
    q = deque(n for n in neighbors(v) if n in unvisited)
    for n in q:
        unvisited.discard(n)
    # breadth-first expansion: only core points spread the cluster further
    while q:
        curr = q.popleft()
        clusters[v].append(curr)
        if core(curr):
            for n in neighbors(curr):
                if n in unvisited:
                    unvisited.discard(n)
                    q.append(n)

# whatever was never reached from any core point is noise
for v in unvisited:
    noise.append(v)
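As a quick sanity check, the same kind of clustering can be reproduced with scikit-learn's DBSCAN implementation (the data and parameter values below are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0, 0.5], [0.5, 0], [5, 5], [5, 5.4], [10, 10]])
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
print(labels)  # one cluster index per point; -1 marks noise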
Consensus-Driven Propagation
Pipeline
An analogy for the concept: the Base Model stands before a court and presents proposals, Committees 1 to N each judge every proposal, and the Mediator decides the final outcome. Here each proposal introduces a pair of vertices; once all proposals have been presented and all judgements made, the accepted pairs form our final clusters of vertices.
1) Training on labelled data (Supervised)
2) Consensus-Driven Propagation
Next we extract deep features by running inference on the unlabelled data. Specifically, for each sample $x$ we have $f_B(x)$ and $f_{C_1}(x), \dots, f_{C_N}(x)$, each representing a vector in the feature space of the Base Model or of one of the $N$ Committees. Each feature space carries information on the deep features of a sample; to cluster, we run a cosine-similarity K-NN algorithm in every feature space, producing $N + 1$ graphs.
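A sketch of the cosine-similarity K-NN graph construction, assuming the extracted features are stacked in an array of shape (num_samples, feature_dim); the helper name and the value of k are illustrative, not taken from the text:

from sklearn.neighbors import NearestNeighbors

def build_knn_graph(features, k=20):
    """For each sample, return the indices and cosine similarities of its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    dist, idx = nn.kneighbors(features)
    # cosine distance = 1 - cosine similarity; column 0 is the sample itself, so drop it
    return idx[:, 1:], 1.0 - dist[:, 1:]

# one graph per feature space: the Base Model's and each of the N Committees'
# (features_base and features_committees are assumed (num_samples, dim) arrays)
# nbrs_base, sims_base = build_knn_graph(features_base)
# committee_graphs = [build_knn_graph(f) for f in features_committees]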
From these graphs, we will be able to reach a consensus for each pair that the Base Model generated. Specifically, we will use the following metrics:
The relationship between two nodes $i$ and $j$ is $1$ if they are neighbors in the view of a particular graph and $0$ otherwise. That is, for each view $k \in \{B, C_1, \dots, C_N\}$ with K-NN neighbor set $\mathcal{N}_k(\cdot)$:
$$r_k(i, j) = \begin{cases} 1 & \text{if } j \in \mathcal{N}_k(i) \\ 0 & \text{otherwise} \end{cases}$$
The affinity between two nodes is computed as the cosine similarity of the mapped features of the two nodes:
$$a_k(i, j) = \cos\bigl(f_k(i), f_k(j)\bigr) = \frac{f_k(i) \cdot f_k(j)}{\lVert f_k(i) \rVert \, \lVert f_k(j) \rVert}$$
The local structure of each node is the distribution of similarities between that node and its nearest neighbors. By computing cosine similarities between each node $i$ and every neighbor $x \in \mathcal{N}_k(i)$, we generate a distribution as a result, that is:
$$D_k(i) = \{\, \cos\bigl(f_k(i), f_k(x)\bigr) : x \in \mathcal{N}_k(i) \,\}$$
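A small sketch of these three metrics for a single candidate pair (i, j) in one view, reusing the illustrative build_knn_graph output above (neighbor indices plus their similarities) and the raw feature matrix; all names here are assumptions, not notation from the text:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_metrics(i, j, features, nbr_idx, nbr_sims):
    """Relationship, affinity and local-structure distributions of the pair (i, j) in one view."""
    relationship = 1 if j in nbr_idx[i] else 0       # r_k(i, j)
    affinity = cosine(features[i], features[j])      # a_k(i, j)
    dist_i = nbr_sims[i]                             # D_k(i): similarities to i's neighbors
    dist_j = nbr_sims[j]                             # D_k(j)
    return relationship, affinity, dist_i, dist_j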
Based on these metrics, we can now enter the Consensus Phase, where we determine which vertex-vertex pairs to keep. The Mediator gathers the information by combining each metric into a vector and then concatenating everything together. Specifically, for each pair of vertices $(i, j)$ proposed by the Base Model we have:
Relationship vector $R(i, j) = \bigl(r_{C_1}(i, j), \dots, r_{C_N}(i, j)\bigr)$, where each element is $1$ if the pair are neighbors in that Committee's graph and $0$ otherwise.
Affinity vector $A(i, j) = \bigl(a_B(i, j), a_{C_1}(i, j), \dots, a_{C_N}(i, j)\bigr)$, where each element is the cosine similarity of the pair's features in that view.
From the distributions $D_k(\cdot)$ generated in each view for the pair, we compute the mean and the variance for each vertex of the pair in each view (including the Base Model), that is:
Mean vector $E(i, j) = \bigl(\mathrm{E}[D_B(i)], \dots, \mathrm{E}[D_{C_N}(i)], \mathrm{E}[D_B(j)], \dots, \mathrm{E}[D_{C_N}(j)]\bigr)$
Variance vector $V(i, j) = \bigl(\mathrm{Var}[D_B(i)], \dots, \mathrm{Var}[D_{C_N}(i)], \mathrm{Var}[D_B(j)], \dots, \mathrm{Var}[D_{C_N}(j)]\bigr)$
Combining all of these vectors, we obtain for each pair a vector of dimension $N + (N+1) + 2(N+1) + 2(N+1) = 6N + 5$, which is fed into the Mediator MLP, trained on the labelled data. The MLP re-weights the opinions of the Committees and assigns a probability to each edge that the Base Model proposes; a low probability indicates that the pair does not share the same identity (i.e. the vertices belong to different clusters). Using the trained MLP, we filter out any edge proposed by the Base Model that does not meet a high positive-probability threshold; doing this for every edge forms the different clusters that we output.
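A minimal sketch of the Mediator input and the MLP itself, reusing the illustrative build_knn_graph / pair_metrics helpers above (PyTorch here; the layer sizes, the 0.9 threshold, and all names are assumptions, not values from the text):

import numpy as np
import torch
import torch.nn as nn

def mediator_input(i, j, views):
    """Concatenate relationship, affinity, mean and variance features for the pair (i, j).

    views: list of (features, nbr_idx, nbr_sims) tuples, Base Model first, then the N Committees."""
    rel, aff, means, variances = [], [], [], []
    for k, (features, nbr_idx, nbr_sims) in enumerate(views):
        r, a, d_i, d_j = pair_metrics(i, j, features, nbr_idx, nbr_sims)
        if k > 0:                 # the relationship vector uses the Committee views only
            rel.append(r)
        aff.append(a)
        means += [d_i.mean(), d_j.mean()]
        variances += [d_i.var(), d_j.var()]
    return np.array(rel + aff + means + variances, dtype=np.float32)  # length 6N + 5

N = 4                             # number of Committees (illustrative)
mediator = nn.Sequential(
    nn.Linear(6 * N + 5, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # probability that i and j share the same identity
)

# after training the MLP on labelled pairs, keep only the high-confidence edges:
# kept_edges = [(i, j) for (i, j) in proposed_pairs
#               if mediator(torch.from_numpy(mediator_input(i, j, views))).item() > 0.9]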
Recall that we trained on labelled data in the previous stages; for the unlabelled data, we implement a label propagation algorithm with the following pseudocode:
# find_component(G): connected components of the thresholded graph G
# disconnect(component, max_size): splits an oversized component into smaller
#     pieces, e.g. by dropping its lowest-probability edges
components = find_component(G)
final_graph_clusters = []

queue = list(components)
while queue:
    curr_component = queue.pop()
    if curr_component.size > max_size:
        # too large: cut its weakest edges and re-queue the resulting pieces
        weaker_components = disconnect(curr_component, max_size)
        queue.extend(weaker_components)
    else:
        # small enough: accept it as a final cluster
        final_graph_clusters.append(curr_component)
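The pseudocode treats disconnect as a black box; one possible sketch, assuming each component is a networkx Graph whose edges carry the Mediator probability as a "prob" attribute (the attribute name and step size are assumptions), is to keep raising the edge threshold until every piece is small enough:

import networkx as nx

def disconnect(component, max_size, step=0.05):
    """Split an oversized component by repeatedly pruning its lowest-probability edges."""
    threshold = min(p for _, _, p in component.edges(data="prob"))
    pieces = [component]
    while any(len(p) > max_size for p in pieces):
        threshold += step
        next_pieces = []
        for piece in pieces:
            if len(piece) <= max_size:
                next_pieces.append(piece)
                continue
            # drop every edge below the raised threshold, then re-split what remains
            pruned = nx.Graph()
            pruned.add_nodes_from(piece.nodes)
            pruned.add_edges_from((u, v, d) for u, v, d in piece.edges(data=True)
                                  if d["prob"] >= threshold)
            next_pieces += [pruned.subgraph(c).copy()
                            for c in nx.connected_components(pruned)]
        pieces = next_pieces
    return pieces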
Finally, we can jointly train on the labelled data and the now pseudo-labelled unlabelled data, using the same CNN architecture and shared weights. The result is a set of graph clusters whose underlying weights were trained on labelled data, deep-learnt from unlabelled data, and whose composition is controlled by parameters such as the probability threshold and the maximum cluster size.
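A rough sketch of the joint training step, assuming the propagated clusters have been turned into pseudo-labels for the unlabelled samples (the toy data, backbone, and hyperparameters below are placeholders for illustration only):

import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# toy stand-ins: real usage would pair the labelled images with their annotations and
# the unlabelled images with the cluster indices produced by the propagation step above
labelled_ds = TensorDataset(torch.randn(64, 128), torch.randint(0, 10, (64,)))
pseudo_ds = TensorDataset(torch.randn(64, 128), torch.randint(10, 30, (64,)))
loader = DataLoader(ConcatDataset([labelled_ds, pseudo_ds]), batch_size=32, shuffle=True)

# "same architecture, shared weights": one backbone and one classifier head whose
# output space covers both the labelled identities and the pseudo-label clusters
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # placeholder for the CNN
classifier = nn.Linear(64, 30)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=0.01, momentum=0.9)

for features, labels in loader:
    loss = criterion(classifier(backbone(features)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()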