# Union-find

#### December 28, 2015

This is one of the most basic graph problem: find the set of connected components. It has lots of applications: network connectivity, image processing, Kruskal's minimum spanning tree algorithm, least common ancestor and others.

Given a set of nodes partitioned into a number of disjoint subsets,
what is the best data structure which can efficiently

• tell if two elements are in the same subset
• find the number of elements inside one element's subset
• ## apply an union between two subsets

The graph has a fixed population of nodes but it may grow over time by having edges appear between certain pairs of nodes.

Consider this set of 8 nodes

and assume they are gathered into 3 subsets as follow:

Every subset could be identified by one node: respectively 1, 5 and 2. Hence, we can set up an array `id` of size `N=8` to associate each node to one subset. At start, when there is any edge yet, the array is setup with `id[i]=i` for every node.

i 1 2 3 4 5 6 7 8
id[i] 1 2 1 5 5 1 5 1

Two nodes `i` and `j` are in the same subset if `id[i]==id[j]`

``````#define N 1000000
int id[N];
// id has to be initialized:
// for (int i = 0; i < N; i++) {
//   id[i] = i;
// }

bool areConnected(int p, int q) {
return id[p] == id[q];
}

void unite(int p, int q) {
int pid = id[p];
for (int i = 0; i < N; i++) {
if (id[i] == pid) {
id[i] = id[q];
}
}
}``````

Let's take a look at two unite steps:

i 1 2 3 4 5 6 7 8
id[i] 1 2 1 5 5 1 5 1
unite(3, 7)
5
2
5
5 5
5
5
5
unite(1, 2)
2
2
2
2
2
2
2
2

Every unite operation need to look through the whole array which is not really acceptable if we want to deal with huge dimensions.

To improve this bound, we'll use a better data structure with trees. Every subset will be a graph with a unique root. Instead of giving the root, `id[i]` will represent its ancestor.

i 1 2 3 4 5 6 7 8
id[i] 1 2 8 5 5 1 5 1

When merging two subsets, we can create a connection between root of subset 1 and root of the subset 2.

For instance, if node 6 and 7 are united, the algorithm will look for the roots of 6 and 7 and sets `id` to 5: `id[root(6)]=root(7)`

i 1 2 3 4 5 6 7 8
id[i] 5 2 8 5 5 1 5 1

Two nodes are in the same subset if the root of the node 1 and root of node 2 are equal `id[...id[i]...] == id[...id[j]...]`

Implementation:

``````#define N 1000000
int id[N];

int root(int i) {
while (i != id[i]) {
i = id[i];
}
return i;
}

bool areConnected(int p, int q) {
return root(p) == root(q);
}

void unite(int p, int q) {
id[root(p)] = root(q);
}``````

As you can see on the figure, the trees are likely to be imbalanced and every tall. Due to this fact, the function `root` might be very inefficient.

To remedy that, we'll keep track of the trees size in another array `sz`. Smaller tree is connected to the greater.

### Path Compression

Moreover, we can make the tree even flater if for every explored nodes, we make it point to its grandparent, namely `id[i] = id[id[i]]`.

``````#define N 1000000
int id[N];
int sz[N];

// Initialization:
// for (int i = 0; i < N; i++) {
//   id[i] = i;
//   sz[i] = 1;
// }

int root(int i) {
while (i != id[i]) {
id[i] = id[id[i]];
i = id[i];
}
return i;
}

bool areConnected(int p, int q) {
return root(p) == root(q);
}

void unite(int p, int q) {
int i = root(p);
int j = root(q);
if (sz[i] < sz[j]) {
id[j] = i;
sz[i] += sz[j];
} else {
id[i] = j;
sz[j] += sz[i];
}
}``````

## References:

1. Disjoint-set data structure - wikipedia
2. Algorithm Design - Jon Kleinberg & Éva Tardos
3. Algorithms - Robert Sedgewick and Kevin Wayne - algs4.cs.princeton.edu