Intro to Clustering

DBSCAN

Parameters

  • eps
  • min_samples

Types of Vertices [Points]

Core Vertices: $V \in S : |E_V| \geq \text{minSamples}$

Border Vertices: $V \in S : |E_V| < \text{minSamples}$

Noise Vertices: $V \notin S$
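
These definitions translate directly into code. Below is a minimal sketch, assuming the points live in a NumPy array X (one point per row) with Euclidean distances; X is a hypothetical placeholder, and eps / min_samples are the parameters above.

import numpy as np

def neighbors(v):
    # E_v: all other points within eps of point v
    dists = np.linalg.norm(X - X[v], axis=1)
    return [u for u in np.flatnonzero(dists <= eps) if u != v]

def core(v):
    # a core vertex has at least min_samples points in its eps-neighborhood
    return len(neighbors(v)) >= min_samples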

Algorithm [Black-Boxing]

# core(v) and neighbors(v) are black boxes that depend on eps and min_samples
from collections import deque

clusters = {}          # maps each core vertex to the cluster grown from it
noise = []
unvisited = set(V)     # vertices not yet assigned to any cluster

for v in V:
    if v not in unvisited or not core(v):
        continue
    # start a new cluster at the core vertex v and grow it breadth-first
    unvisited.discard(v)
    clusters[v] = [v]
    q = deque()
    for neighbor in neighbors(v):
        if neighbor in unvisited:
            unvisited.discard(neighbor)
            q.append(neighbor)
    while q:
        curr = q.popleft()
        clusters[v].append(curr)
        # only core vertices expand the cluster; border vertices are only absorbed
        if core(curr):
            for neighbor in neighbors(curr):
                if neighbor in unvisited:
                    unvisited.discard(neighbor)
                    q.append(neighbor)

# anything never reached from a core vertex is noise
noise = list(unvisited)
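
A quick way to sanity-check the hand-rolled version above is scikit-learn's DBSCAN, which uses the same eps / min_samples parameterisation; points labelled -1 are noise.

from sklearn.cluster import DBSCAN

# eps, min_samples and X as above
# note: scikit-learn counts the point itself toward min_samples
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
num_noise = (labels == -1).sum()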

Consensus-Driven Propagation

Pipeline

An analogy for the concept: the Base Model $B$ stands before the court, and for each proposal that $B$ presents, Committees $C_1$ through $C_N$ judge that proposal and the Mediator decides the final outcome. Each proposal introduces a pair of vertices; once all judgements are made, the accepted proposals from $B$ form our final clusters of vertices.

1) Training on labelled data (Supervised)

$$\mathcal{F}_B : D_l \rightarrow \mathcal{Z}$$
$$\mathcal{F}_{C_i} : D_l \rightarrow \mathcal{Z}, \quad i=1,2,3,...,N$$
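
A minimal sketch of this stage, assuming PyTorch/torchvision, an image dataset for $D_l$, and hypothetical names labelled_loader and num_identities; the architectures are illustrative choices for the Base Model and the committee members.

import torch
import torchvision.models as models

# Base Model B plus N committee members; different architectures give the
# committees diverse "opinions" in their feature spaces.
base = models.resnet50(num_classes=num_identities)
committees = [
    models.resnet18(num_classes=num_identities),
    models.resnet34(num_classes=num_identities),
    models.densenet121(num_classes=num_identities),
]

criterion = torch.nn.CrossEntropyLoss()
for model in [base, *committees]:
    # each model is trained independently with a standard supervised loss on D_l
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for images, labels in labelled_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()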

2) Consensus-Driven Propagation

Given $\mathcal{F}_B, \mathcal{F}_{C_1}, ..., \mathcal{F}_{C_N}$, we extract deep features by running inference on the unlabelled data. Specifically, we compute $\mathcal{F}_B(D_u)$ and $\mathcal{F}_{C_i}(D_u)$, each a vector in the feature space $\mathcal{Z}$ of the Base Model or of a Committee. Each feature space carries the deep-feature information of a sample; to cluster, we build a cosine-similarity K-NN graph for each model, producing $N+1$ graphs, as sketched below.
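
A minimal sketch of the graph construction, assuming features is a hypothetical list of $N+1$ arrays, one (num_samples, dim) feature matrix per model, with index 0 being the Base Model.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(feats, k=20):
    # unit-normalise so cosine distances are well behaved
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(feats)
    dist, idx = nn.kneighbors(feats)       # first column is the point itself
    sims = 1.0 - dist[:, 1:]               # cosine distance -> cosine similarity
    return idx[:, 1:], sims                # neighbor indices and edge weights

# one graph per model: index 0 is the Base Model, 1..N are the committees
graphs = [knn_graph(f) for f in features]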

From these graphs we can reach a consensus on each pair that the Base Model generates. Specifically, we use the following metrics:

The relationship $R$ between two nodes $n_0$ and $n_1$ is $1$ if they are neighbors in the view of a particular graph and $0$ otherwise. That is:

$$R^{(n_0,n_1)}_{C_i} = \begin{cases} 1 & \text{if } neighbor(n_0, n_1) \\ 0 & \text{otherwise} \end{cases}, \quad i = 1,2,...,N$$

The affinity between two nodes is the cosine similarity of the mapped features of the two nodes:

$$A_{C_i}^{(n_0,n_1)} = \cos(\langle \mathcal{F}_{C_i}(n_0), \mathcal{F}_{C_i}(n_1) \rangle), \quad i = 0,1,2,...,N$$

The local structure of each node is the distribution of similarities between that node and its $k$ nearest neighbors. By computing cosine similarities between a node $x$ and each of its neighbors $x_k$, we obtain this distribution, that is:

$$D^{x}_{C_i} = \{ \cos(\langle \mathcal{F}_{C_i}(x), \mathcal{F}_{C_i}(x_k) \rangle) \}, \quad k=1,2,...,K$$
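
A hedged sketch of these three metrics for a single model, reusing the outputs of knn_graph above; neighbors_i and sims_i are that model's neighbor-index and similarity arrays, feats_i its feature matrix, and all names are illustrative.

import numpy as np

def relationship(neighbors_i, n0, n1):
    # R = 1 if n1 appears among n0's k nearest neighbors in this model's graph
    return 1 if n1 in neighbors_i[n0] else 0

def affinity(feats_i, n0, n1):
    # cosine similarity between the two mapped feature vectors
    a, b = feats_i[n0], feats_i[n1]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def local_structure(sims_i, x):
    # D^x: similarities between x and its k nearest neighbors in this graph
    return sims_i[x]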

Based on these metrics, we can now enter the Consensus Phase, where we determine which vertex-vertex pairs to keep. The Mediator gathers the information, combining each metric into a vector and then concatenating everything together. Specifically, for each pair of vertices proposed by the Base Model we have:

Relationship vector $I_R \in \mathbb{R}^N$, where each element is $R^{(n_0,n_1)}_{C_i}$

Affinity vector $I_A \in \mathbb{R}^{N+1}$, where each element is $A_{C_i}^{(n_0,n_1)}$

From the distribution for each vertex generated by each committee, we compute mean and variance vectors covering both vertices of the pair across each committee (including the Base Model), that is:

Mean vector $I_{D_{mean}} \in \mathbb{R}^{2(N+1)}$, with $I_{D_{mean}} = (..., \mu(D^{n_0}_{C_i}), ..., \mu(D^{n_1}_{C_i}), ...)$

Variance vector $I_{D_{var}} \in \mathbb{R}^{2(N+1)}$, with $I_{D_{var}} = (..., \sigma(D^{n_0}_{C_i}), ..., \sigma(D^{n_1}_{C_i}), ...)$
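
A sketch of assembling the Mediator's input for one Base-Model pair $(n_0, n_1)$, assuming the relationship / affinity / local_structure helpers and the graphs / features lists from the sketches above, with index 0 as the Base Model and 1..N as the committees.

import numpy as np

def mediator_input(n0, n1, graphs, features):
    # graphs[i] = (neighbors_i, sims_i), features[i] = that model's feature matrix
    I_R = [relationship(graphs[i][0], n0, n1) for i in range(1, len(graphs))]   # N
    I_A = [affinity(features[i], n0, n1) for i in range(len(graphs))]           # N + 1
    I_D_mean = [np.mean(local_structure(graphs[i][1], n))
                for i in range(len(graphs)) for n in (n0, n1)]                  # 2(N + 1)
    I_D_var = [np.var(local_structure(graphs[i][1], n))
               for i in range(len(graphs)) for n in (n0, n1)]                   # 2(N + 1)
    return np.concatenate([I_R, I_A, I_D_mean, I_D_var])                        # 6N + 5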

Concatenating all of these vectors gives a vector of dimension $6N+5$, which is fed into the Mediator, an MLP trained on the labelled data $D_l$. The MLP re-weights the opinions of the Committees and assigns each edge that the Base Model proposes a probability that the pair shares the same identity; a low probability indicates that the pair belongs to different clusters. Using the trained MLP, we filter out every edge proposed by the Base Model that does not meet a high positive-probability threshold; doing this for every edge yields the clusters we output.
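
A minimal Mediator sketch, assuming PyTorch; the hidden size is an illustrative choice and N stands for the number of committees.

import torch.nn as nn

mediator = nn.Sequential(
    nn.Linear(6 * N + 5, 50),
    nn.ReLU(),
    nn.Linear(50, 2),   # logits for "same identity" vs. "different identity"
)
# Train with cross-entropy on pairs drawn from D_l (where identities are known),
# then score every edge the Base Model proposes and keep only edges whose
# probability of "same identity" exceeds the chosen threshold.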

Recall that the previous stages were trained on labelled data. To assign labels to the unlabelled data, we implement a label-propagation algorithm with the following pseudocode:

# find_components(G) and disconnect(component, max_size) are black boxes here:
# find_components returns the connected components of the threshold-filtered graph,
# disconnect cuts the weakest edges of an oversized component to split it up.
from collections import deque

queue = deque(find_components(G))
final_graph_clusters = []

while queue:
    curr_component = queue.pop()
    if curr_component.size > max_size:
        # component too large: split it and re-queue the pieces
        weaker_components = disconnect(curr_component, max_size)
        queue.extend(weaker_components)
    else:
        final_graph_clusters.append(curr_component)
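
One possible realisation of find_components, assuming the threshold-filtered graph is stored as a SciPy sparse adjacency matrix over the kept edges; disconnect remains a black box that cuts the lowest-probability edges of an oversized component.

import numpy as np
from scipy.sparse.csgraph import connected_components

def find_components(adjacency):
    # connected components of the undirected, threshold-filtered graph;
    # each component is returned as an array of vertex indices
    n_comp, labels = connected_components(adjacency, directed=False)
    return [np.flatnonzero(labels == c) for c in range(n_comp)]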

Finally, we can jointly train on the labelled and the (now pseudo-labelled) unlabelled data using the same CNN architecture and shared weights. The result is a set of graph clusters produced by a model whose weights were trained on labelled data and deep-learnt from unlabelled data, with the clustering controlled by parameters such as the probability threshold and the maximum cluster size.