Union-find

December 28, 2015

This is one of the most basic graph problem: find the set of connected components. It has lots of applications: network connectivity, image processing, Kruskal's minimum spanning tree algorithm, least common ancestor and others.


Given a set of nodes partitioned into a number of disjoint subsets,
what is the best data structure which can efficiently

  • tell if two elements are in the same subset
  • find the number of elements inside one element's subset
  • apply an union between two subsets

The graph has a fixed population of nodes but it may grow over time by having edges appear between certain pairs of nodes.

Consider this set of 8 nodes

figure 0

and assume they are gathered into 3 subsets as follow:

figure 1

Every subset could be identified by one node: respectively 1, 5 and 2. Hence, we can set up an array id of size N=8 to associate each node to one subset. At start, when there is any edge yet, the array is setup with id[i]=i for every node.

i 1 2 3 4 5 6 7 8
id[i] 1 2 1 5 5 1 5 1

Two nodes i and j are in the same subset if id[i]==id[j]

#define N 1000000
int id[N];
// id has to be initialized:
// for (int i = 0; i < N; i++) {
//   id[i] = i;
// }

bool areConnected(int p, int q) {
  return id[p] == id[q];
}

void unite(int p, int q) {
  int pid = id[p];
  for (int i = 0; i < N; i++) {
    if (id[i] == pid) {
      id[i] = id[q];
    }
  }
}

Let's take a look at two unite steps:

i 1 2 3 4 5 6 7 8
id[i] 1 2 1 5 5 1 5 1
unite(3, 7)
5
2
5
5 5
5
5
5
unite(1, 2)
2
2
2
2
2
2
2
2

Every unite operation need to look through the whole array which is not really acceptable if we want to deal with huge dimensions.

To improve this bound, we'll use a better data structure with trees. Every subset will be a graph with a unique root. Instead of giving the root, id[i] will represent its ancestor.

figure 2

i 1 2 3 4 5 6 7 8
id[i] 1 2 8 5 5 1 5 1

When merging two subsets, we can create a connection between root of subset 1 and root of the subset 2.

For instance, if node 6 and 7 are united, the algorithm will look for the roots of 6 and 7 and sets id to 5: id[root(6)]=root(7)

figure 3

i 1 2 3 4 5 6 7 8
id[i] 5 2 8 5 5 1 5 1

Two nodes are in the same subset if the root of the node 1 and root of node 2 are equal id[...id[i]...] == id[...id[j]...]

Implementation:

#define N 1000000
int id[N];

int root(int i) {
  while (i != id[i]) {
    i = id[i];
  }
  return i;
}

bool areConnected(int p, int q) {
  return root(p) == root(q);
}

void unite(int p, int q) {
  id[root(p)] = root(q);
}

As you can see on the figure, the trees are likely to be imbalanced and every tall. Due to this fact, the function root might be very inefficient.

To remedy that, we'll keep track of the trees size in another array sz. Smaller tree is connected to the greater.

Path Compression

Moreover, we can make the tree even flater if for every explored nodes, we make it point to its grandparent, namely id[i] = id[id[i]].

#define N 1000000
int id[N];
int sz[N];

// Initialization:
// for (int i = 0; i < N; i++) {
//   id[i] = i;
//   sz[i] = 1;
// }

int root(int i) {
  while (i != id[i]) {
    id[i] = id[id[i]];
    i = id[i];
  }
  return i;
}

bool areConnected(int p, int q) {
  return root(p) == root(q);
}

void unite(int p, int q) {
  int i = root(p);
  int j = root(q);
  if (sz[i] < sz[j]) {
    id[j] = i;
    sz[i] += sz[j];
  } else {
    id[i] = j;
    sz[j] += sz[i];
  }
}

References:

  1. Disjoint-set data structure - wikipedia
  2. Algorithm Design - Jon Kleinberg & Éva Tardos
  3. Algorithms - Robert Sedgewick and Kevin Wayne - algs4.cs.princeton.edu