Why are clustering techniques relevant in the world of data science?
Dealing with data is hard, and the life of a data scientist is even harder. Some tools and techniques make the job easier for anyone working with data. One such technique, clustering, is discussed in this article.
In today’s world of information explosion, knowledge is derived by classifying information. Analytics draws its strength from eliminating random perturbations and identifying the underlying structure. Every customer who walks into a superstore is different. No two people will buy identical items in identical quantities. Customers who are young and single are likely to buy certain items, whereas elderly couples are likely to buy others. To promote sales while using resources cost-effectively, superstores need to understand the requirements of different groups of customers. Instead of a blanket advertisement policy, where all items are advertised to all customers, a much more efficient way of increasing sales is to classify customers into mutually exclusive groups and devise a tailor-made advertising policy for each group.
Clustering lies at the core of this kind of marketing strategy. In recent times many other application areas of clustering have emerged. Clustering documents on the web helps to extract and classify information. Clustering of genes helps to identify various properties, including their disease-carrying propensity. Clustering is also a powerful tool for data mining and for pattern recognition in image analysis.
Clustering is an unsupervised technique. The underlying assumption is that the observed data come from multiple populations. To elaborate on the marketing example, it is assumed that distinct populations exist within a supermarket's customer base. The differences among populations may not be based on demographics alone. A set of complex characteristics based on demography, socio-economic strata and other conditions delineates the populations, which together form a partition of the customers. Once the clusters are identified, they can be studied more closely, and different strategies can be applied to win more business from the targeted groups.
So, by definition, we can say that “Clustering is an unsupervised learning technique to partition the data into homogeneous segments. Within a cluster the observations may be assumed to come from one single population. Observations belonging to different clusters are assumed to represent different populations.”
Why is clustering important?
- Clustering groups observations so that similar observations belong to the same group, whereas observations in different groups are dissimilar.
- Clustering results can be used as a preprocessing step for other algorithms.
- Visualization of clusters may reveal important information about the data.
- Clustering can be used as a standalone tool to gain insight into the distribution of the data.
Different methods of clustering
There are two primary approaches to clustering: hierarchical (agglomerative) clustering and k-means clustering.
In hierarchical clustering, the closest points are combined in a pairwise manner to form the clusters. It is an iterative procedure: at every step, two points or two clusters are combined to form a bigger cluster. At the end of the process all points are combined into a single cluster. The number of clusters is not predetermined. Hierarchical clustering may correspond to meaningful taxonomies. Its main disadvantages are its time complexity, which makes it unsuitable for larger data sets, and its sensitivity to outliers.
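The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest points), neither of which is specified in the text, and the function names and sample points are made up for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points (an assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, n_clusters=1):
    """Naive agglomerative clustering.

    Starts with every point in its own cluster and repeatedly merges the two
    closest clusters until only n_clusters remain.
    Returns (clusters, history), where history lists each merged pair in order.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > n_clusters:
        # Find the indices of the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history


# Hypothetical sample data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
three_clusters, _ = agglomerative(points, n_clusters=3)
one_cluster, merges = agglomerative(points, n_clusters=1)
```

Running the procedure all the way down to one cluster records the full merge history, and stopping it earlier gives any intermediate number of clusters. That is why the number of clusters need not be fixed in advance. The nested search for the closest pair at every step also shows why the naive version scales poorly with the number of points.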
In the k-means clustering process, k denotes the number of clusters, which has to be predetermined. Once k is fixed, each observation is allocated to one and only one cluster, so that points close to each other fall in the same cluster. The sizes of the clusters are not controlled. K-means clustering is widely used in applications with large data sets.
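As a rough illustration of how observations get allocated once k is fixed, here is a minimal sketch of the standard iterative scheme (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. The text does not describe a particular algorithm, so the Euclidean distance, the random initialization, and the function names and sample data are all assumptions made for this example.

```python
import random
from math import dist


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct observations as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid is the coordinate-wise mean of the cluster's points.
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return assign(points, centroids), centroids


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

Each point receives exactly one label, and nothing in the procedure constrains how many points end up with each label, which matches the remark that cluster sizes are not controlled. On less clearly separated data the result can depend on the initial centroids, so in practice the algorithm is usually run from several starting points.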
Common applications of clustering include:
- Image processing: images can be clustered based on their visual content.
- Web: web pages can be clustered based on their content, and web users can be clustered based on their page access patterns.
- Finance: cluster analysis can be used for creating balanced portfolios.
- Market segmentation: customers can be grouped into clusters based on demographic information and transaction history, and a marketing strategy is tailored for each segment.