Balanced Iterative Reducing and Clustering using Hierarchies — BIRCH

The exploratory nature of data analysis and data mining makes clustering one of the most common tasks in these kinds of projects. Increasingly, these projects come from application areas such as biology, text analysis and signal analysis that involve ever larger datasets, both in the number of examples and in the number of attributes. The biggest challenge for clustering in real-life scenarios is the volume of the data and the consequent increase in complexity and in the computational power required.
These problems have opened up a search for algorithms able to reduce this data overload. Some solutions come from the side of data preprocessing, either by transforming the data to a lower-dimensional manifold that represents its structure, or by summarizing the dataset with a smaller subset of examples that carries equivalent information. A different perspective is to modify classical clustering algorithms, or to derive new ones, so that they can cluster larger datasets. This perspective relies on many different strategies: techniques such as sampling, on-line processing, summarization, data distribution and efficient data structures have been applied to the problem of scaling clustering algorithms.
In this article we will discuss one such algorithm: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), developed in 1996 by Tian Zhang, Raghu Ramakrishnan, and Miron Livny.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.

The BIRCH clustering algorithm consists of two stages:

  1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. A CF entry can itself be composed of other CF entries (see the sketch after this list). Optionally, we can condense this initial CF tree into a smaller CF tree.
  2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.
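
To make the CF idea concrete, here is a minimal sketch in Python of a CF entry and of merging two entries. The `CF` class is a hypothetical helper written for illustration, not part of any library; it shows how the CF of a union of clusters is simply the component-wise sum of the children's CFs, which is what lets a non-leaf entry summarize its child nodes.

```python
import numpy as np

class CF:
    """Minimal Clustering Feature: (N, LS, SS) for a set of d-dimensional points."""
    def __init__(self, points):
        points = np.atleast_2d(points)
        self.n = len(points)              # N: number of points in the cluster
        self.ls = points.sum(axis=0)      # LS: linear sum of the points (a vector)
        self.ss = (points ** 2).sum()     # SS: squared sum of the points (a scalar)

    def merge(self, other):
        """CF additivity: the CF of the union is the component-wise sum of the CFs."""
        merged = CF.__new__(CF)
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

# Two small sub-clusters and the CF summarizing both of them
a = CF([[1.0, 2.0], [2.0, 2.0]])
b = CF([[8.0, 9.0]])
ab = a.merge(b)
print(ab.n, ab.ls, ab.ss)   # 3 [11. 13.] 158.0
```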

Because of this two-step process, BIRCH is also called Two-Step Clustering. Let us now dive deeper into these steps to get a holistic understanding.

Cluster Features

BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. For clustering purposes, these summary statistics constitute a CF, and represent a sufficient substitute for the actual data.
A CF is a set of three summary statistics that represent a set of data points in a single cluster.
These statistics are as follows:

Count: the number of data values in the cluster.

Linear Sum: the sum of the individual coordinates. This is a measure of the location of the cluster.

Squared Sum: the sum of the squared coordinates. This is a measure of the spread of the cluster.

Together with the count, the linear sum and the squared sum are sufficient to compute the mean and the variance of the data points in the cluster.
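
As a quick sketch (reusing the hypothetical N, LS and SS values from the example above), the centroid and the spread of a cluster can be recovered from the summary alone, without ever revisiting the raw points:

```python
import numpy as np

def cf_mean(n, ls):
    """Centroid of the cluster: LS / N."""
    return ls / n

def cf_variance(n, ls, ss):
    """Mean squared distance of the points from the centroid: SS/N - ||LS/N||^2."""
    centroid = ls / n
    return ss / n - np.dot(centroid, centroid)

n, ls, ss = 3, np.array([11.0, 13.0]), 158.0   # the merged CF from the earlier sketch
print(cf_mean(n, ls))           # approx [3.67 4.33]: location of the cluster
print(cf_variance(n, ls, ss))   # approx 20.44: spread of the cluster around its centroid
```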

CF Tree

The building process of the CF Tree can be summarized as:

  1. For each given record, BIRCH compares the location of that record with the location of each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record to the root node CF closest to the incoming record.
  2. The record then descends down to the non-leaf child nodes of the root node CF selected in step 1. BIRCH compares the location of the record with the location of each non-leaf CF. BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
  3. The record then descends down to the leaf child nodes of the non-leaf node CF selected in step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH tentatively passes the incoming record to the leaf closest to the incoming record.
  4. Perform one of (a) or (b):

(a) If the radius of the chosen leaf including the new record does not exceed the Threshold T, then the incoming record is assigned to that leaf. The leaf and all of its parent CFs are updated to account for the new data point.

(b) If the radius of the chosen leaf including the new record does exceed the Threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.

If step 4(b) is executed and the leaf node already contains the maximum of L entries, then the leaf node is split into two leaf nodes. The two most distant leaf CFs are used as seeds for the new leaf nodes, with the remaining CFs assigned to whichever seed is closer. If the parent node is also full, it is split in turn, and so on up the tree. Note that the radius of a cluster may be calculated even without knowing the data points, as long as we have the count N, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without needing to scan the original dataset.
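
The radius check in step 4 can be written directly in terms of the CF summary. Below is a minimal sketch (with hypothetical helper names, not the internals of any particular implementation) of how a tentative insertion of a point x into a leaf with summary (N, LS, SS) could be evaluated against the threshold T:

```python
import numpy as np

def radius_after_insert(n, ls, ss, x):
    """Radius of the leaf cluster if point x were added, computed from the CF only:
    radius = sqrt(SS'/N' - ||LS'/N'||^2), where N', LS', SS' include x."""
    n2 = n + 1
    ls2 = ls + x
    ss2 = ss + np.dot(x, x)
    centroid = ls2 / n2
    return np.sqrt(max(ss2 / n2 - np.dot(centroid, centroid), 0.0))

def try_absorb(n, ls, ss, x, threshold):
    """Step 4: absorb x into the leaf if the new radius stays within T, else signal a split."""
    if radius_after_insert(n, ls, ss, x) <= threshold:
        return n + 1, ls + x, ss + np.dot(x, x)   # case (a): update the leaf CF
    return None                                    # case (b): start a new leaf for x

leaf = (2, np.array([3.0, 4.0]), 13.0)             # CF of the two-point cluster from earlier
print(try_absorb(*leaf, np.array([1.5, 2.0]), threshold=0.5))
```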

Clustering the Sub-Clusters

Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF leaf entries) to combine them into clusters. The clustering task becomes much easier, since the number of sub-clusters is much smaller than the number of data points. Because the summary statistics are easy to update when a new data value arrives, the computation stays efficient.
In the original paper, the authors used agglomerative hierarchical clustering.
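
As a rough sketch of this global step (not the paper's exact procedure), each leaf sub-cluster's centroid can be treated as a single representative point and handed to an off-the-shelf agglomerative clusterer; the `centroids` array below is a made-up stand-in for the LS/N values of the CF leaves:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical centroids of the CF leaf sub-clusters (LS / N for each leaf)
centroids = np.array([
    [1.5, 2.0], [1.8, 2.2],     # two nearby sub-clusters
    [8.0, 9.0], [8.3, 9.1],     # two more nearby sub-clusters
])

# Merge the sub-clusters into 2 final clusters
global_labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
print(global_labels)   # e.g. [0 0 1 1]: each leaf sub-cluster receives a final cluster label
```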

There are three parameters in this algorithm that need to be tuned. Unlike K-means, the optimal number of clusters (k) need not be supplied by the user; it can be left to emerge from the algorithm.

threshold : the maximum radius that a sub-cluster in a leaf node of the CF tree may have after absorbing a new sample. If merging the new sample into the closest sub-cluster would push its radius above this value, a new sub-cluster is started.

branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node (internal node).

n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is complete i.e., number of clusters after the final clustering step. If set to None, the final clustering step is not performed and intermediate clusters are returned.
Scikit-learn offers a robust implementation for Python.
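
A short usage sketch with scikit-learn follows (the dataset here is synthetic, generated only for illustration, and the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points around 4 centers
X, _ = make_blobs(n_samples=10_000, centers=4, cluster_std=0.8, random_state=0)

# threshold: maximum radius of a leaf sub-cluster
# branching_factor: maximum number of CF entries per node
# n_clusters: number of clusters produced by the final global clustering step
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X)

print(len(birch.subcluster_centers_))   # number of CF leaf sub-clusters found
print(np.bincount(labels))              # sizes of the 4 final clusters
```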

One drawback of BIRCH clustering is that, because of the tree structure inherent in the CF tree, the clustering solution may depend on the order in which the data records are presented. To mitigate this, the data analyst may wish to apply BIRCH to a few different random orderings of the data and look for consensus among the results. On the other hand, a benefit of BIRCH clustering is that the analyst is not required to select the best choice of k, the number of clusters, as is the case with some other clustering methods; rather, the number of clusters in a BIRCH solution is an outcome of the tree-building process.
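
One simple way to check this order sensitivity in practice is to run BIRCH on a few shuffled copies of the data and compare the resulting labelings. The sketch below reuses the synthetic X from the previous example and uses the adjusted Rand index as an agreement score:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
baseline = Birch(threshold=0.5, n_clusters=4).fit_predict(X)

for _ in range(3):
    order = rng.permutation(len(X))
    shuffled = Birch(threshold=0.5, n_clusters=4).fit_predict(X[order])
    # Map the labels back to the original row order before comparing partitions
    labels = np.empty_like(shuffled)
    labels[order] = shuffled
    print(adjusted_rand_score(baseline, labels))   # 1.0 means identical partitions
```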

Please feel free to share your suggestions.
Thanks for Reading! Stay Safe!

References

Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114.

FAQs

Balanced Iterative Reducing and Clustering using Hierarchies — BIRCH? ›

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large datasets (https://en.wikipedia.org/wiki/BIRCH). It clusters huge datasets by first producing a smaller, compact summary of the large dataset that maintains as much information as feasible; instead of clustering the whole dataset, this smaller summary is clustered.

What is balanced iterative reducing and clustering using Hierarchies? ›

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible.

Is BIRCH hierarchical clustering? ›

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm that performs hierarchical clustering over large data sets.

What is the formula for the BIRCH algorithm? ›

In standard BIRCH, a clustering feature is the triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS is their linear sum and SS is their squared sum.

Is BIRCH agglomerative or divisive? ›

In its first stage, BIRCH constructs an in-memory CF tree (a height-balanced tree data structure). Its second stage, global clustering, is agglomerative: it employs agglomerative hierarchical clustering with linkage criteria to cluster the small-diameter, dense sub-clusters represented by CF vectors in the leaf nodes of the CF tree.

When to use BIRCH clustering? ›

At a high level, Balanced Iterative Reducing and Clustering using Hierarchies, or BIRCH for short, deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the original dataset.

What is the difference between BIRCH and K-Means? ›

K-Means is a partitioning-based clustering technique, whereas BIRCH is a hierarchical clustering technique.

What are the disadvantages of the BIRCH algorithm? ›

However, BIRCH has one major drawback – it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space i.e., no categorical attributes should be present.

What are the two 2 types of hierarchical clustering? ›

Types of hierarchical clustering

The two main choices are divisive methods and agglomerative methods. Understanding each method can help you make a more informed decision about how best to structure the algorithm you're using for your data.

What are the advantages of the BIRCH algorithm? ›

An important advantage of the BIRCH algorithm is that it requires only a single scan of the database. Compared to the popular K-means algorithm, BIRCH performs exceptionally well in terms of lower memory consumption, faster performance, less order-sensitivity, and high accuracy [49].

What is the time complexity of BIRCH clustering? ›

“How effective is BIRCH?” The time complexity of the algorithm is O(n), where n is the number of objects to be clustered. Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data.

What are the different phases of BIRCH? ›

Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve its inherent clustering structure. Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree.

What are the weaknesses of hierarchical clustering? ›

Disadvantages
  • Time complexity: hierarchical clustering involves many iterations of pairwise distance calculations, so its time complexity is high.
  • Space complexity: large intermediate structures, such as the pairwise distance matrix, must be kept in memory, so its space complexity is also high.
  • Poor performance on large datasets.

Why is BIRCH clustering appropriate for streaming data? ›

BIRCH is especially appropriate for very large data sets, or for streaming data, because of its ability to find a good clustering solution with only a single scan of the data. Optionally, the algorithm can make further scans through the data to improve the clustering quality.
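
For stream-like use, scikit-learn's Birch also exposes partial_fit, so the CF tree can be grown one chunk at a time; the chunks below are synthetic stand-ins for batches arriving from a stream:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic "stream": one dataset split into chunks that arrive one at a time
X, _ = make_blobs(n_samples=10_000, centers=4, random_state=0)
chunks = np.array_split(X, 10)

birch = Birch(threshold=0.5, n_clusters=None)   # None: just grow the CF tree for now
for chunk in chunks:
    birch.partial_fit(chunk)                    # each call only updates the CF tree

# When enough data has been seen, run the global clustering step over the sub-clusters
birch.set_params(n_clusters=4)
birch.partial_fit()               # called without data, this only redoes the global step
labels = birch.predict(X)         # assign all points seen so far to the final clusters
print(np.bincount(labels))
```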

What is hierarchical model of clustering? ›

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

Which method is best for hierarchical clustering? ›

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows a bottom-up approach: the algorithm starts by treating each data point as its own cluster and then repeatedly merges the closest pair of clusters.

What is hierarchical clustering vs K clustering? ›

The difference between K-means and hierarchical clustering is that in K-means the number of clusters is pre-defined and denoted by "K", whereas in hierarchical clustering the number of clusters is not fixed in advance: the hierarchy can be cut at any level, from a single cluster down to one cluster per observation.
