Balanced Iterative Reducing and Clustering using Hierarchies — BIRCH

The exploratory nature of data analysis and data mining makes clustering one of the most common tasks in these kinds of projects. Increasingly, these projects come from application areas such as biology, text analysis and signal analysis that involve ever larger datasets, both in the number of examples and in the number of attributes. The biggest challenge for clustering in real-life scenarios is the volume of the data and the consequent increase in complexity and in the computational power required.
These problems have opened up a search for algorithms able to reduce this data overload. Some solutions come from the side of data preprocessing, either by transforming the data to a lower-dimensional manifold that represents its structure, or by summarizing the dataset with a smaller subset of examples that carries equivalent information. A different perspective is to modify classical clustering algorithms, or to derive new ones, so that they can cluster larger datasets. This perspective relies on many different strategies: techniques such as sampling, on-line processing, summarization, data distribution and efficient data structures have been applied to the problem of scaling clustering algorithms.
In this article we will discuss one such algorithm: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), developed in 1996 by Tian Zhang, Raghu Ramakrishnan, and Miron Livny.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.

The BIRCH clustering algorithm consists of two stages:

  1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. A CF entry can itself be composed of other CF entries (see the sketch after this list). Optionally, we can condense this initial CF tree into a smaller CF tree.
  2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.
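
To make the CF idea concrete, here is a minimal sketch in Python of a CF entry and of merging two entries. The `CF` class is a hypothetical helper written for illustration, not part of any library; it shows how the CF of a union of clusters is simply the component-wise sum of the children's CFs, which is what lets a non-leaf entry summarize its child nodes.

```python
import numpy as np

class CF:
    """Minimal Clustering Feature: (N, LS, SS) for a set of d-dimensional points."""
    def __init__(self, points):
        points = np.atleast_2d(points)
        self.n = len(points)              # N: number of points in the cluster
        self.ls = points.sum(axis=0)      # LS: linear sum of the points (a vector)
        self.ss = (points ** 2).sum()     # SS: squared sum of the points (a scalar)

    def merge(self, other):
        """CF additivity: the CF of the union is the component-wise sum of the CFs."""
        merged = CF.__new__(CF)
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

# Two small sub-clusters and the CF summarizing both of them
a = CF([[1.0, 2.0], [2.0, 2.0]])
b = CF([[8.0, 9.0]])
ab = a.merge(b)
print(ab.n, ab.ls, ab.ss)   # 3 [11. 13.] 158.0
```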

Because of this two-step process, BIRCH is also called Two-Step Clustering. Let us now dive deeper into these steps to get a holistic understanding.

Cluster Features

BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. For clustering purposes, these summary statistics constitute a CF, and represent a sufficient substitute for the actual data.
A CF is a set of three summary statistics that represent a set of data points in a single cluster.
These statistics are as follows:

Count: the number of data values in the cluster.

Linear Sum: the sum of the individual coordinates. This is a measure of the location of the cluster.

Squared Sum: the sum of the squared coordinates. This is a measure of the spread of the cluster.

Together with the count, the linear sum and the squared sum are sufficient to compute the mean and the variance of the data points in the cluster.
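
As a quick sketch (reusing the hypothetical N, LS and SS values from the example above), the centroid and the spread of a cluster can be recovered from the summary alone, without ever revisiting the raw points:

```python
import numpy as np

def cf_mean(n, ls):
    """Centroid of the cluster: LS / N."""
    return ls / n

def cf_variance(n, ls, ss):
    """Mean squared distance of the points from the centroid: SS/N - ||LS/N||^2."""
    centroid = ls / n
    return ss / n - np.dot(centroid, centroid)

n, ls, ss = 3, np.array([11.0, 13.0]), 158.0   # the merged CF from the earlier sketch
print(cf_mean(n, ls))           # approx [3.67 4.33]: location of the cluster
print(cf_variance(n, ls, ss))   # approx 20.44: spread of the cluster around its centroid
```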

CF Tree

The building process of the CF Tree can be summarized as:

  1. For each given record, BIRCH compares the location of that record with the location of each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record to the root node CF closest to the incoming record.
  2. The record then descends down to the non-leaf child nodes of the root node CF selected in step 1. BIRCH compares the location of the record with the location of each non-leaf CF. BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
  3. The record then descends down to the leaf child nodes of the non-leaf node CF selected in step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH tentatively passes the incoming record to the leaf closest to the incoming record.
  4. Perform one of (a) or (b):

(a) If the radius of the chosen leaf including the new record does not exceed the Threshold T, then the incoming record is assigned to that leaf. The leaf and all of its parent CFs are updated to account for the new data point.

(b) If the radius of the chosen leaf including the new record does exceed the Threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.

If step 4(b) is executed and the leaf node already contains the maximum of L entries, then the leaf node is split into two leaf nodes. The two most distant leaf CFs are used as seeds for the new leaf nodes, with the remaining CFs assigned to whichever seed is closer. If the parent node is also full, it is split in turn, and so on up the tree. Note that the radius of a cluster may be calculated even without knowing the data points, as long as we have the count N, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without needing to scan the original dataset.
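
The radius check in step 4 can be written directly in terms of the CF summary. Below is a minimal sketch (with hypothetical helper names, not the internals of any particular implementation) of how a tentative insertion of a point x into a leaf with summary (N, LS, SS) could be evaluated against the threshold T:

```python
import numpy as np

def radius_after_insert(n, ls, ss, x):
    """Radius of the leaf cluster if point x were added, computed from the CF only:
    radius = sqrt(SS'/N' - ||LS'/N'||^2), where N', LS', SS' include x."""
    n2 = n + 1
    ls2 = ls + x
    ss2 = ss + np.dot(x, x)
    centroid = ls2 / n2
    return np.sqrt(max(ss2 / n2 - np.dot(centroid, centroid), 0.0))

def try_absorb(n, ls, ss, x, threshold):
    """Step 4: absorb x into the leaf if the new radius stays within T, else signal a split."""
    if radius_after_insert(n, ls, ss, x) <= threshold:
        return n + 1, ls + x, ss + np.dot(x, x)   # case (a): update the leaf CF
    return None                                    # case (b): start a new leaf for x

leaf = (2, np.array([3.0, 4.0]), 13.0)             # CF of the two-point cluster from earlier
print(try_absorb(*leaf, np.array([1.5, 2.0]), threshold=0.5))
```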

Clustering the Sub-Clusters

Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF leaf entries) to combine them into clusters. The clustering task becomes much easier, since the number of sub-clusters is much smaller than the number of data points. Because the summary statistics are easy to update when a new data value arrives, the computation stays efficient.
In the original paper, the authors used agglomerative hierarchical clustering.
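
As a rough sketch of this global step (not the paper's exact procedure), each leaf sub-cluster's centroid can be treated as a single representative point and handed to an off-the-shelf agglomerative clusterer; the `centroids` array below is a made-up stand-in for the LS/N values of the CF leaves:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical centroids of the CF leaf sub-clusters (LS / N for each leaf)
centroids = np.array([
    [1.5, 2.0], [1.8, 2.2],     # two nearby sub-clusters
    [8.0, 9.0], [8.3, 9.1],     # two more nearby sub-clusters
])

# Merge the sub-clusters into 2 final clusters
global_labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
print(global_labels)   # e.g. [0 0 1 1]: each leaf sub-cluster receives a final cluster label
```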

There are three parameters in this algorithm that need to be tuned. Unlike K-means, the optimal number of clusters (k) need not be supplied by the user; it can be left to emerge from the algorithm.

threshold : the maximum radius that a sub-cluster in a leaf node of the CF tree may have after absorbing a new sample. If merging the new sample into the closest sub-cluster would push its radius above this value, a new sub-cluster is started.

branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node (internal node).

n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is complete i.e., number of clusters after the final clustering step. If set to None, the final clustering step is not performed and intermediate clusters are returned.
Scikit-learn offers a robust implementation for Python.
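
A short usage sketch with scikit-learn follows (the dataset here is synthetic, generated only for illustration, and the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points around 4 centers
X, _ = make_blobs(n_samples=10_000, centers=4, cluster_std=0.8, random_state=0)

# threshold: maximum radius of a leaf sub-cluster
# branching_factor: maximum number of CF entries per node
# n_clusters: number of clusters produced by the final global clustering step
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X)

print(len(birch.subcluster_centers_))   # number of CF leaf sub-clusters found
print(np.bincount(labels))              # sizes of the 4 final clusters
```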

One drawback of BIRCH clustering is that, because of the tree structure inherent in the CF tree, the clustering solution may depend on the order in which the data records are presented. To mitigate this, the data analyst may wish to apply BIRCH to a few different random orderings of the data and look for consensus among the results. On the other hand, a benefit of BIRCH clustering is that the analyst is not required to select the best choice of k, the number of clusters, as is the case with some other clustering methods; rather, the number of clusters in a BIRCH solution is an outcome of the tree-building process.
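
One simple way to check this order sensitivity in practice is to run BIRCH on a few shuffled copies of the data and compare the resulting labelings. The sketch below reuses the synthetic X from the previous example and uses the adjusted Rand index as an agreement score:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
baseline = Birch(threshold=0.5, n_clusters=4).fit_predict(X)

for _ in range(3):
    order = rng.permutation(len(X))
    shuffled = Birch(threshold=0.5, n_clusters=4).fit_predict(X[order])
    # Map the labels back to the original row order before comparing partitions
    labels = np.empty_like(shuffled)
    labels[order] = shuffled
    print(adjusted_rand_score(baseline, labels))   # 1.0 means identical partitions
```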

Please feel free to share your suggestions.
Thanks for Reading! Stay Safe!

References

Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114.

FAQs

Balanced Iterative Reducing and Clustering using Hierarchies — BIRCH? ›

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large datasets (https://en.wikipedia.org/wiki/BIRCH). It clusters huge datasets by first producing a smaller, compact summary of the large dataset that maintains as much information as feasible; instead of clustering the whole dataset, this smaller summary is clustered.

What is balanced iterative reducing and clustering using Hierarchies? ›

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible.

Is BIRCH hierarchical clustering? ›

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm that performs hierarchical clustering over large data sets.

What is the formula for the BIRCH algorithm? ›

In standard BIRCH, a clustering feature is the triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS is their linear sum and SS is their squared sum.

Is BIRCH agglomerative or divisive? ›

In its first stage, BIRCH constructs an in-memory CF tree (a height-balanced tree data structure). Its second stage, global clustering, is agglomerative: it employs agglomerative hierarchical clustering with linkage criteria to cluster the small-diameter, dense sub-clusters represented by CF vectors in the leaf nodes of the CF tree.

When to use BIRCH clustering? ›

At a high level, Balanced Iterative Reducing and Clustering using Hierarchies, or BIRCH for short, deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the original dataset.

What is the difference between BIRCH and K-Means? ›

K-Means is a partitioning-based clustering technique, whereas BIRCH is a hierarchical clustering technique.

What are the disadvantages of the BIRCH algorithm? ›

However, BIRCH has one major drawback – it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space i.e., no categorical attributes should be present.

What are the two 2 types of hierarchical clustering? ›

Types of hierarchical clustering

The two main choices are divisive methods and agglomerative methods. Understanding each method can help you make a more informed decision about how best to structure the algorithm you're using for your data.

What are the advantages of the BIRCH algorithm? ›

An important advantage of the BIRCH algorithm is that it requires only a single scan of the database. Compared to the popular K-means algorithm, BIRCH performs exceptionally well in terms of lower memory consumption, faster performance, less order-sensitivity, and high accuracy [49].

What is the time complexity of BIRCH clustering? ›

“How effective is BIRCH?” The time complexity of the algorithm is O(n), where n is the number of objects to be clustered. Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data.

What are the different phases of BIRCH? ›

Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve its inherent clustering structure. Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree.

What are the weaknesses of hierarchical clustering? ›

Disadvantages
  • Time complexity: hierarchical clustering involves many iterations of pairwise distance calculations, so its time complexity is high.
  • Space complexity: large intermediate structures, such as the pairwise distance matrix, must be kept in memory, so its space complexity is also high.
  • Poor performance on large datasets.

Why is BIRCH clustering appropriate for streaming data? ›

BIRCH is especially appropriate for very large data sets, or for streaming data, because of its ability to find a good clustering solution with only a single scan of the data. Optionally, the algorithm can make further scans through the data to improve the clustering quality.
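
For stream-like use, scikit-learn's Birch also exposes partial_fit, so the CF tree can be grown one chunk at a time; the chunks below are synthetic stand-ins for batches arriving from a stream:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic "stream": one dataset split into chunks that arrive one at a time
X, _ = make_blobs(n_samples=10_000, centers=4, random_state=0)
chunks = np.array_split(X, 10)

birch = Birch(threshold=0.5, n_clusters=None)   # None: just grow the CF tree for now
for chunk in chunks:
    birch.partial_fit(chunk)                    # each call only updates the CF tree

# When enough data has been seen, run the global clustering step over the sub-clusters
birch.set_params(n_clusters=4)
birch.partial_fit()               # called without data, this only redoes the global step
labels = birch.predict(X)         # assign all points seen so far to the final clusters
print(np.bincount(labels))
```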

What is hierarchical model of clustering? ›

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

Which method is best for hierarchical clustering? ›

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows a bottom-up approach: the algorithm starts by treating each data point as its own cluster and then repeatedly merges the closest pair of clusters.

What is hierarchical clustering vs K clustering? ›

The difference between K-means and hierarchical clustering is that in K-means the number of clusters is pre-defined and denoted by "K", whereas in hierarchical clustering the number of clusters is not fixed in advance: the hierarchy can be cut at any level, from a single cluster down to one cluster per observation.
