- Last updated March 18, 2024
The BIRCH clustering algorithm is provided as an alternative to MiniBatchKMeans. It converts the data into a tree structure from which centroids are read off the leaves, and these centroids can be the final cluster centroids or the input for other clustering algorithms such as AgglomerativeClustering.
Clustering is the process of dividing a large dataset into smaller groups of similar items. It is an unsupervised learning problem. We usually perform clustering when the analysis requires extracting an interesting pattern from the data, for example, grouping similar user behaviour in a customer database.
Many clustering algorithms are available, each with its own characteristics and use cases, because no two datasets are alike: some algorithms are designed for small datasets, while others are built to handle very large ones. What they share is the goal of finding the natural groups in the feature space of the input data.
Examples of clustering algorithms are:
- Agglomerative clustering
- DBSCAN
- K-means
- Spectral clustering
- BIRCH
In this article, we are going to discuss the BIRCH clustering algorithm. The article assumes that the reader has basic knowledge of clustering algorithms and their terminology.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Basic algorithms like K-means and agglomerative clustering are among the most commonly used clustering methods. But when clustering very large datasets, advanced algorithms such as BIRCH and DBSCAN are more suitable for producing precise results. BIRCH is especially attractive because it is easy to implement.
Before moving on to the implementation, we will discuss the algorithm and features of BIRCH.
Without going into the mathematics of BIRCH, more formally, BIRCH is a clustering algorithm that first compresses the dataset into small summaries and then clusters those summaries; it never clusters the raw dataset directly. This is why BIRCH is often used with other clustering algorithms: once the summaries are built, they can also be clustered by another clustering algorithm.
Scikit-learn provides it as an alternative to MiniBatchKMeans. It converts the data into a tree structure with the centroids being read off the leaves, and these centroids can be the final cluster centroids or the input for other clustering algorithms such as AgglomerativeClustering.
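Concretely, scikit-learn's Birch accepts another clustering estimator as its n_clusters argument, in which case that estimator is fit on the subcluster centroids to produce the final clustering. A minimal sketch of this combination (the dataset and parameter values here are illustrative assumptions):

from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# The subcluster centroids read off the CF tree leaves become the
# input samples for AgglomerativeClustering
model = Birch(threshold=0.5, branching_factor=50,
              n_clusters=AgglomerativeClustering(n_clusters=4))
labels = model.fit_predict(X)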
BIRCH is a scalable clustering method based on hierarchical clustering that requires only a single scan of the dataset, which makes it fast on large datasets. The algorithm is built around the CF (clustering features) tree: BIRCH constructs this tree-structured summary of the given data, called the clustering feature tree (CF tree), and uses it to create the clusters.
Within the CF tree, the algorithm compresses the data into a set of CF nodes. Nodes that contain several sub-clusters are said to hold CF subclusters, and these CF subclusters live in the non-terminal (non-leaf) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features, holding the information about the given data needed for further hierarchical clustering. This removes the need to work with the whole input dataset. Each clustering feature (CF) summarizing a set of data points is represented by three quantities (N, LS, SS):
- N = number of data points in the subcluster
- LS = linear sum (vector sum) of the data points
- SS = sum of the squared data points, component-wise
A small worked example makes these numbers concrete. Suppose a subcluster holds five samples: (3, 4), (2, 6), (4, 5), (4, 7), (3, 8). Then N = 5, LS = (16, 30) and SS = (54, 190).
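These quantities are easy to verify with a few lines of NumPy (a quick sanity check, not part of any BIRCH library):

import numpy as np

# The five samples from the example above
points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(points)                  # number of points: 5
LS = points.sum(axis=0)          # linear sum: [16 30]
SS = (points ** 2).sum(axis=0)   # component-wise sum of squares: [54 190]
print(N, LS, SS)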
The CF tree structure can be described as follows. Each non-leaf node holds at most B entries, each pointing to a child node, while each leaf node holds at most L clustering features (CFs). Every CF in a leaf node must satisfy the threshold T, the maximum diameter (or radius) allowed for the subcluster it summarizes. Each leaf entry is therefore a sub-cluster, more formally a summary, not an individual data point.
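To make the threshold test concrete, here is a deliberately simplified sketch of a single CF entry, assuming the component-wise (N, LS, SS) representation above (illustrative code, not scikit-learn's internal implementation):

import numpy as np

class CFEntry:
    # A single clustering feature (N, LS, SS) with component-wise SS
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p ** 2

    def radius_if_merged(self, point):
        # Radius of the subcluster after hypothetically absorbing `point`:
        # sqrt(mean squared distance of the points from the new centroid)
        p = np.asarray(point, dtype=float)
        N, LS, SS = self.N + 1, self.LS + p, self.SS + p ** 2
        centroid = LS / N
        return np.sqrt(max(SS.sum() / N - (centroid ** 2).sum(), 0.0))

    def try_merge(self, point, threshold):
        # Absorb the point only if the merged radius stays within T
        if self.radius_if_merged(point) <= threshold:
            p = np.asarray(point, dtype=float)
            self.N, self.LS, self.SS = self.N + 1, self.LS + p, self.SS + p ** 2
            return True
        return False  # the caller would start a new CF entry instead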
Let's look at the basics of the algorithm. BIRCH follows four main phases:
- Scanning data into memory.
- Condense data (resize data).
- Global clustering.
- Refining clusters.
Of these four phases, two (condensing the data and refining the clusters) are optional; they come into play when more precision is required. Scanning the data is much like loading data into a model: the algorithm scans the whole dataset and fits it into a CF tree. Condensing resets and resizes the data so that it fits the CF tree better. Global clustering sends the CF tree for clustering by an existing clustering algorithm. Finally, refining fixes the problem of the CF tree where points with the same value may be assigned to different leaf nodes.
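Scikit-learn's Birch exposes the scanning phase through its partial_fit method, which lets a large dataset be streamed through the CF tree in chunks instead of being loaded all at once. A minimal sketch (the chunk count, dataset and parameter values are illustrative assumptions):

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

model = Birch(threshold=0.5, n_clusters=8)
for chunk in np.array_split(X, 10):  # feed the data in 10 chunks
    model.partial_fit(chunk)         # each chunk updates the CF tree

labels = model.predict(X)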
Scikit-learn provides a direct implementation of BIRCH under its cluster module (sklearn.cluster.Birch). We need to set its parameters according to our requirements.
There are three main parameters in the BIRCH algorithm:
- Threshold – the maximum radius of a subcluster in a leaf node of the CF tree; a new sample is merged into its closest subcluster only if the merged subcluster's radius stays below this value.
- Branching_factor – the maximum number of CF subclusters in each node.
- N_clusters – the number of final clusters after the tree-building step; it can be an integer, None (the leaf subclusters are returned as the final clusters) or another clustering estimator, as shown in the sketch after this list.
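The effect of threshold and branching_factor can be inspected after fitting through the subcluster_centers_ attribute, which holds the centroids read off the leaves. A small sketch with illustrative values, showing that a larger threshold leaves fewer subclusters:

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=12, cluster_std=0.5, random_state=0)

for t in (0.5, 1.5, 3.0):  # illustrative threshold values
    model = Birch(threshold=t, branching_factor=50, n_clusters=None).fit(X)
    print(t, model.subcluster_centers_.shape[0])  # subcluster count drops as t grows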
Implementation of BIRCH using Python
Importing the required libraries
Input:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch
Generating a dataset using make_blobs.
Input:
data, clusters = make_blobs(n_samples = 1000, centers = 12, cluster_std = 0.50, random_state = 0)
data.shape
Output:
(1000, 2)
Creating a BIRCH model
Input:
model = Birch(branching_factor = 50, n_clusters = None, threshold = 1.5)
Fitting the model to the dataset.
Input:
model.fit(data)
Output:
Predicting cluster labels for the dataset using the fitted model.
Input:
pred = model.predict(data)
Making a scatter plot to check the results.
Input:
plt.scatter(data[:, 0], data[:, 1], c = pred)
plt.show()
Output:
Here in the output, we can see the 12 clusters of randomly generated samples built with make_blobs, and the algorithm is clearly working well. The main attraction of BIRCH is its CF tree, which makes it well suited to problems involving huge amounts of data, such as:
- Pixel classification in images.
- Image blending.
- Audio data classification.
BIRCH is a good algorithm with the advantage of requiring only a single scan of the data, and its CF tree also improves the quality of the clusters; where it lags is that it works only on numeric (vector) data.