Data clustering: 50 years beyond K-means☆
Introduction
Advances in sensing and storage technology and dramatic growth in applications such as Internet search, digital imaging, and video surveillance have created many high-volume, high-dimensional data sets. It is estimated that the digital universe consumed approximately 281 exabytes in 2007, and it is projected to be 10 times that size by 2011 (1 exabyte is ∼1018 bytes or 1,000,000 terabytes) (Gantz, 2008). Most of the data is stored digitally in electronic media, thus providing huge potential for the development of automatic data analysis, classification, and retrieval techniques. In addition to the growth in the amount of data, the variety of available data (text, image, and video) has also increased. Inexpensive digital and video cameras have made available huge archives of images and videos. The prevalence of RFID tags or transponders due to their low cost and small size has resulted in the deployment of millions of sensors that transmit data regularly. E-mails, blogs, transaction data, and billions of Web pages create terabytes of new data every day. Many of these data streams are unstructured, adding to the difficulty in analyzing them.
The increase in both the volume and the variety of data requires advances in methodology to automatically understand, process, and summarize the data. Data analysis techniques can be broadly classified into two major types (Tukey, 1977): (i) exploratory or descriptive, meaning that the investigator does not have pre-specified models or hypotheses but wants to understand the general characteristics or structure of the high-dimensional data, and (ii) confirmatory or inferential, meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of assumptions given the available data. Many statistical techniques have been proposed to analyze the data, such as analysis of variance, linear regression, discriminant analysis, canonical correlation analysis, multi-dimensional scaling, factor analysis, principal component analysis, and cluster analysis to name a few. A useful overview is given in (Tabachnick and Fidell, 2007).
In pattern recognition, data analysis is concerned with predictive modeling: given some training data, we want to predict the behavior of the unseen test data. This task is also referred to as learning. Often, a clear distinction is made between learning problems that are (i) supervised (classification) or (ii) unsupervised (clustering), the first involving only labeled data (training patterns with known category labels) while the latter involving only unlabeled data (Duda et al., 2001). Clustering is a more difficult and challenging problem than classification. There is a growing interest in a hybrid setting, called semi-supervised learning (Chapelle et al., 2006); in semi-supervised classification, the labels of only a small portion of the training data set are available. The unlabeled data, instead of being discarded, are also used in the learning process. In semi-supervised clustering, instead of specifying the class labels, pair-wise constraints are specified, which is a weaker way of encoding the prior knowledge. A pair-wise must-link constraint corresponds to the requirement that two objects should be assigned the same cluster label, whereas the cluster labels of two objects participating in a cannot-link constraint should be different. Constraints can be particularly beneficial in data clustering (Lange et al., 2005, Basu et al., 2008), where precise definitions of underlying clusters are absent. In the search for good models, one would like to include all the available information, no matter whether it is unlabeled data, data with constraints, or labeled data. Fig. 1 illustrates this spectrum of different types of learning problems of interest in pattern recognition and machine learning.
Section snippets
Data clustering
The goal of data clustering, also known as cluster analysis, is to discover the natural grouping(s) of a set of patterns, points, or objects. Webster (Merriam-Webster Online Dictionary, 2008) defines cluster analysis as “a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics.” An example of clustering is shown in Fig. 2. The objective is to develop an automatic
User’s dilemma
In spite of the prevalence of such a large number of clustering algorithms, and their success in a number of different application domains, clustering remains a difficult problem. This can be attributed to the inherent vagueness in the definition of a cluster, and the difficulty in defining an appropriate similarity measure and objective function.
The following fundamental challenges associated with clustering were highlighted in (Jain and Dubes, 1988), which are relevant even to this day.
- (a)
What
Trends in data clustering
Information explosion is not only creating large amounts of data but also a diverse set of data, both structured and unstructured. Unstructured data is a collection of objects that do not follow a specific format. For example, images, text, audio, video, etc. On the other hand, in structured data, there are semantic relationships within each object that are important. Most clustering approaches ignore the structure in the objects to be clustered and use a feature vector based representation for
Summary
Organizing data into sensible groupings arises naturally in many scientific fields. It is, therefore, not surprising to see the continued popularity of data clustering. It is important to remember that cluster analysis is an exploratory tool; the output of clustering algorithms only suggest hypotheses. While numerous clustering algorithms have been published and new ones continue to appear, there is no single clustering algorithm that has been shown to dominate other algorithms across all
Acknowledgements
I would like to acknowledge the National Science Foundation and the Office of Naval research for supporting my research in data clustering, dimensionality reduction, classification, and semi-supervised learning. I am grateful to Rong Jin, Pang-Ning Tan, and Pavan Mallapragada for helping me prepare the King-Sun Fu lecture as well as this manuscript. I have learned much from and enjoyed my fruitful collaborations in data clustering with Eric Backer, Joachim Buhmann, (late) Richard Dubes, Mario
References (144)
- et al.
Clustering techniques: User’s dilemma
Pattern Recognition
(1976) - et al.
Rock: A robust clustering algorithm for categorical attributes
Inform. Systems
(2000) - et al.
A scalable framework for cluster ensembles
Pattern Recognition
(2009) - Aggarwal, Charu C., Han, Jiawei, Wang, Jianyong, Yu, Philip S., 2003. A framework for clustering evolving data streams....
- Agrawal, Rakesh, Gehrke, Johannes, Gunopulos, Dimitrios, Raghavan, Prabhakar, 1998. Automatic subspace clustering of...
Cluster Analysis for Applications
(1973)- Andrews, Nicholas, O., Fox, Edward A., 2007. Recent developments in document clustering. Technical report TR-07-35....
- et al.
Cluster analysis in marketing research
Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets
(1978)- Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X., 2006. Group formation in large social networks: Membership,...
DNA Microarrays and Gene Expression
Pattern Recognition with Fuzzy Objective Function Algorithms
Conceputal clustering in information retrieval
IEEE Trans. Systems Man Cybernet.
Pattern Recognition and Machine Learning
Latent dirichlet allocation
J. Machine Learn. Res.
Efficient large-scale sequence comparison by locality-sensitive hashing
Bioinformatics
Writer adaptation for online handwriting recognition
IEEE Trans. Pattern Anal. Machine Intell.
Metric Methods for Analyzing Partially Ranked Data
Maximum likelihood from incomplete data via the EM algorithm
J. Roy. Statist. Soc.
Clustering large graphs via the singular value decomposition
Machine Learn.
Pattern Classification
A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters
J. Cybernet.
A Bayesian analysis of some nonparametric problems
Ann. Statist.
Unsupervised learning of finite mixture models
IEEE Trans. Pattern Anal. Machine Intell.
Knowledge acquisition via incremental conceptual clustering
Machine Learn.
Cited by (7184)
Rising frequency of ozone-favorable synoptic weather patterns contributes to 2015–2022 ozone increase in Guangzhou
2025, Journal of Environmental Sciences (China)A novel hybridization approach to improve the critical distance clustering algorithm: Balancing speed and quality
2024, Expert Systems with ApplicationsMulti-task hierarchical convolutional network for visual-semantic cross-modal retrieval
2024, Pattern RecognitionA survey on semi-supervised graph clustering
2024, Engineering Applications of Artificial IntelligenceSpectral clustering with linear embedding: A discrete clustering method for large-scale data
2024, Pattern RecognitionA co-evolutionary algorithm based on sparsity clustering for sparse large-scale multi-objective optimization
2024, Engineering Applications of Artificial Intelligence
- ☆
This paper is based on the King-Sun Fu Prize lecture delivered at the 19th International Conference on Pattern Recognition (ICPR), Tempa, FL, December 8, 2008.