This is one of the most basic graph problem: find the set of connected components. It has lots of applications: network connectivity, image processing, Kruskal's minimum spanning tree algorithm, least common ancestor and others.
Given a set of nodes partitioned into a number of disjoint subsets,
what is the best data structure which can efficiently
 tell if two elements are in the same subset
 find the number of elements inside one element's subset

apply an union between two subsets
The graph has a fixed population of nodes but it may grow over time by having edges appear between certain pairs of nodes.
Consider this set of 8 nodes
and assume they are gathered into 3 subsets as follow:
Every subset could be identified by one node: respectively 1, 5 and 2.
Hence, we can set up an array id
of size N=8
to associate each node to one
subset. At start, when there is any edge yet, the array is setup with id[i]=i
for every node.
i  1  2  3  4  5  6  7  8 

id[i]  1  2  1  5  5  1  5  1 
Two nodes i
and j
are in the same subset if id[i]==id[j]
#define N 1000000
int id[N];
// id has to be initialized:
// for (int i = 0; i < N; i++) {
// id[i] = i;
// }
bool areConnected(int p, int q) {
return id[p] == id[q];
}
void unite(int p, int q) {
int pid = id[p];
for (int i = 0; i < N; i++) {
if (id[i] == pid) {
id[i] = id[q];
}
}
}
Let's take a look at two unite steps:
i  1  2  3  4  5  6  7  8 

id[i]  1  2  1  5  5  1  5  1 
unite(3, 7)  5 
2  5 
5  5  5 
5  5 
unite(1, 2)  2 
2  2 
2 
2 
2 
2 
2 
Every unite operation need to look through the whole array which is not really acceptable if we want to deal with huge dimensions.
To improve this bound, we'll use a better data structure with trees. Every
subset will be a graph with a unique root. Instead of giving the root, id[i]
will represent its ancestor.
i  1  2  3  4  5  6  7  8 

id[i]  1  2  8  5  5  1  5  1 
When merging two subsets, we can create a connection between root of subset 1 and root of the subset 2.
For instance, if node 6 and 7 are united, the algorithm will look for the roots
of 6 and 7 and sets id
to 5: id[root(6)]=root(7)
i  1  2  3  4  5  6  7  8 

id[i]  5  2  8  5  5  1  5  1 
Two nodes are in the same subset if the root of the node 1 and root of node 2
are equal id[...id[i]...] == id[...id[j]...]
Implementation:
#define N 1000000
int id[N];
int root(int i) {
while (i != id[i]) {
i = id[i];
}
return i;
}
bool areConnected(int p, int q) {
return root(p) == root(q);
}
void unite(int p, int q) {
id[root(p)] = root(q);
}
As you can see on the figure, the trees are likely to be imbalanced and every
tall. Due to this fact, the function root
might be very inefficient.
To remedy that, we'll keep track of the trees size in another array sz
.
Smaller tree is connected to the greater.
Path Compression
Moreover, we can make the tree even flater if for every explored nodes,
we make it point to its grandparent, namely id[i] = id[id[i]]
.
#define N 1000000
int id[N];
int sz[N];
// Initialization:
// for (int i = 0; i < N; i++) {
// id[i] = i;
// sz[i] = 1;
// }
int root(int i) {
while (i != id[i]) {
id[i] = id[id[i]];
i = id[i];
}
return i;
}
bool areConnected(int p, int q) {
return root(p) == root(q);
}
void unite(int p, int q) {
int i = root(p);
int j = root(q);
if (sz[i] < sz[j]) {
id[j] = i;
sz[i] += sz[j];
} else {
id[i] = j;
sz[j] += sz[i];
}
}
References:
 Disjointset data structure  wikipedia
 Algorithm Design  Jon Kleinberg & Éva Tardos
 Algorithms  Robert Sedgewick and Kevin Wayne  algs4.cs.princeton.edu