Data clustering: 50 years beyond K-means

https://doi.org/10.1016/j.patrec.2009.09.011

Abstract

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Even though K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

Introduction

Advances in sensing and storage technology and dramatic growth in applications such as Internet search, digital imaging, and video surveillance have created many high-volume, high-dimensional data sets. It is estimated that the digital universe consumed approximately 281 exabytes in 2007, and it is projected to be 10 times that size by 2011 (1 exabyte is ∼10¹⁸ bytes or 1,000,000 terabytes) (Gantz, 2008). Most of the data is stored digitally in electronic media, thus providing huge potential for the development of automatic data analysis, classification, and retrieval techniques. In addition to the growth in the amount of data, the variety of available data (text, image, and video) has also increased. Inexpensive digital and video cameras have made available huge archives of images and videos. The prevalence of RFID tags or transponders due to their low cost and small size has resulted in the deployment of millions of sensors that transmit data regularly. E-mails, blogs, transaction data, and billions of Web pages create terabytes of new data every day. Many of these data streams are unstructured, adding to the difficulty in analyzing them.

The increase in both the volume and the variety of data requires advances in methodology to automatically understand, process, and summarize the data. Data analysis techniques can be broadly classified into two major types (Tukey, 1977): (i) exploratory or descriptive, meaning that the investigator does not have pre-specified models or hypotheses but wants to understand the general characteristics or structure of the high-dimensional data, and (ii) confirmatory or inferential, meaning that the investigator wants to confirm the validity of a hypothesis/model or a set of assumptions given the available data. Many statistical techniques have been proposed to analyze the data, such as analysis of variance, linear regression, discriminant analysis, canonical correlation analysis, multi-dimensional scaling, factor analysis, principal component analysis, and cluster analysis to name a few. A useful overview is given in (Tabachnick and Fidell, 2007).

In pattern recognition, data analysis is concerned with predictive modeling: given some training data, we want to predict the behavior of the unseen test data. This task is also referred to as learning. Often, a clear distinction is made between learning problems that are (i) supervised (classification) or (ii) unsupervised (clustering), the former involving only labeled data (training patterns with known category labels) and the latter only unlabeled data (Duda et al., 2001). Clustering is a more difficult and challenging problem than classification. There is a growing interest in a hybrid setting, called semi-supervised learning (Chapelle et al., 2006); in semi-supervised classification, the labels of only a small portion of the training data set are available. The unlabeled data, instead of being discarded, are also used in the learning process. In semi-supervised clustering, instead of specifying the class labels, pair-wise constraints are specified, which is a weaker way of encoding the prior knowledge. A pair-wise must-link constraint corresponds to the requirement that two objects should be assigned the same cluster label, whereas the cluster labels of two objects participating in a cannot-link constraint should be different. Constraints can be particularly beneficial in data clustering (Lange et al., 2005, Basu et al., 2008), where precise definitions of underlying clusters are absent. In the search for good models, one would like to include all the available information, no matter whether it is unlabeled data, data with constraints, or labeled data. Fig. 1 illustrates this spectrum of different types of learning problems of interest in pattern recognition and machine learning.
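As a concrete illustration (my sketch, not from the paper), the must-link and cannot-link constraints described above can be checked against a candidate partition. The object names and the example constraints below are hypothetical.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the pair-wise constraints that a cluster assignment violates.

    labels      : dict mapping object id -> cluster label
    must_link   : pairs (a, b) that should share a cluster label
    cannot_link : pairs (a, b) that should have different labels
    """
    violations = []
    for a, b in must_link:
        if labels[a] != labels[b]:          # must-link broken
            violations.append(("must-link", a, b))
    for a, b in cannot_link:
        if labels[a] == labels[b]:          # cannot-link broken
            violations.append(("cannot-link", a, b))
    return violations

labels = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
print(violated_constraints(labels,
                           must_link=[("x1", "x2"), ("x2", "x3")],
                           cannot_link=[("x1", "x4")]))
# -> [('must-link', 'x2', 'x3')]
```

A constrained clustering algorithm would either forbid such violations outright or penalize them in its objective function.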


Data clustering

The goal of data clustering, also known as cluster analysis, is to discover the natural grouping(s) of a set of patterns, points, or objects. Webster (Merriam-Webster Online Dictionary, 2008) defines cluster analysis as “a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics.” An example of clustering is shown in Fig. 2. The objective is to develop an automatic
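The algorithm at the heart of this survey can be sketched compactly. Below is a minimal, illustrative implementation of Lloyd's K-means heuristic in pure Python; the toy data and the deterministic initialization (first k points) are my own choices, not the paper's — in practice several random restarts are used.

```python
def kmeans(points, k, iters=100):
    """A minimal sketch of Lloyd's K-means: alternately assign each point
    to its nearest center, then move each center to the mean of its
    assigned points, until the centers stop changing."""
    centers = list(points[:k])              # naive deterministic seeding
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each non-empty cluster's center moves to its mean.
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:          # converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated 2-D groups (toy data, not from the paper).
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = kmeans(data, k=2)
```

On this toy data the procedure converges in a few iterations to the two natural groups; on harder data it only reaches a local minimum of its objective, which is one reason initialization matters.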

User’s dilemma

In spite of the prevalence of such a large number of clustering algorithms, and their success in a number of different application domains, clustering remains a difficult problem. This can be attributed to the inherent vagueness in the definition of a cluster, and the difficulty in defining an appropriate similarity measure and objective function.
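To make the role of the objective function concrete (my illustration, not the paper's), the sketch below computes the within-cluster sum of squared errors — the criterion K-means minimizes — for two hypothetical partitions of the same 1-D points. The objective sharply prefers one partition; a different objective could rank them differently, which is part of the dilemma.

```python
def sse(clusters):
    """Within-cluster sum of squared errors: for each cluster, the squared
    Euclidean distance of every point to the cluster mean, summed over all
    clusters. This is the objective that K-means minimizes."""
    total = 0.0
    for cl in clusters:
        mean = tuple(sum(xs) / len(xs) for xs in zip(*cl))
        total += sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in cl)
    return total

# Two candidate partitions of the same four 1-D points (toy example).
good = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
bad  = [[(0.0,), (10.0,)], [(1.0,), (11.0,)]]
print(sse(good), sse(bad))  # the first partition has far lower error
```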

The following fundamental challenges associated with clustering, highlighted in (Jain and Dubes, 1988), are still relevant today.

  • (a) What

Trends in data clustering

Information explosion is not only creating large amounts of data but also a diverse set of data, both structured and unstructured. Unstructured data is a collection of objects that do not follow a specific format, such as images, text, audio, and video. In structured data, on the other hand, there are important semantic relationships within each object. Most clustering approaches ignore the structure in the objects to be clustered and use a feature vector based representation for
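As an illustration of the feature-vector representation mentioned above (a minimal sketch, not from the paper), unstructured text can be mapped to fixed-length count vectors before clustering. Real systems typically add proper tokenization and weighting such as tf-idf; the plain bag-of-words below discards all structure, which is exactly the limitation the section points out.

```python
from collections import Counter

def bag_of_words(docs):
    """Map free-text documents to fixed-length word-count vectors over a
    shared vocabulary -- the simplest feature-vector representation that
    standard clustering algorithms can consume."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts[w] for w in vocab])  # missing words count 0
    return vocab, vectors

# Hypothetical toy corpus.
docs = ["cluster the data", "the data stream", "video and image data"]
vocab, vectors = bag_of_words(docs)
```

The resulting vectors can be fed directly to K-means or any other vector-based clustering method, at the cost of losing word order and all within-object structure.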

Summary

Organizing data into sensible groupings arises naturally in many scientific fields. It is, therefore, not surprising to see the continued popularity of data clustering. It is important to remember that cluster analysis is an exploratory tool; the output of clustering algorithms only suggest hypotheses. While numerous clustering algorithms have been published and new ones continue to appear, there is no single clustering algorithm that has been shown to dominate other algorithms across all

Acknowledgements

I would like to acknowledge the National Science Foundation and the Office of Naval Research for supporting my research in data clustering, dimensionality reduction, classification, and semi-supervised learning. I am grateful to Rong Jin, Pang-Ning Tan, and Pavan Mallapragada for helping me prepare the King-Sun Fu lecture as well as this manuscript. I have learned much from and enjoyed my fruitful collaborations in data clustering with Eric Backer, Joachim Buhmann, (late) Richard Dubes, Mario

References (144)

  • Richard C. Dubes et al.

    Clustering techniques: User’s dilemma

    Pattern Recognition

    (1976)
  • Sudipto Guha et al.

    Rock: A robust clustering algorithm for categorical attributes

    Inform. Systems

    (2000)
  • Prodip Hore et al.

    A scalable framework for cluster ensembles

    Pattern Recognition

    (2009)
  • Aggarwal, Charu C., Han, Jiawei, Wang, Jianyong, Yu, Philip S., 2003. A framework for clustering evolving data streams....
  • Agrawal, Rakesh, Gehrke, Johannes, Gunopulos, Dimitrios, Raghavan, Prabhakar, 1998. Automatic subspace clustering of...
  • M.R. Anderberg

    Cluster Analysis for Applications

    (1973)
  • Andrews, Nicholas O., Fox, Edward A., 2007. Recent developments in document clustering. Technical report TR-07-35....
  • P. Arabie et al.

    Cluster analysis in marketing research

  • Eric Backer

    Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets

    (1978)
  • Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X., 2006. Group formation in large social networks: Membership,...
  • P. Baldi et al.

    DNA Microarrays and Gene Expression

    (2002)
  • Ball, G., Hall, D., 1965. ISODATA, a novel method of data analysis and pattern classification. Technical report NTIS AD...
  • Banerjee, Arindam, Merugu, Srujana, Dhillon, Inderjit, Ghosh, Joydeep, 2004. Clustering with Bregman divergences. J....
  • Banerjee, Arindam, Basu, Sugato, Merugu, Srujana, 2007a. Multi-way clustering on relation graphs. In: Proc. 7th SIAM...
  • Banerjee, S., Ramanathan, K., Gupta, A., 2007b. Clustering short texts using Wikipedia. In: Proc....
  • Bar-Hillel, Aaron, Hertz, T., Shental, Noam, Weinshall, Daphna, 2003. Learning distance functions using equivalence...
  • Basu, Sugato, Banerjee, Arindam, Mooney, Raymond, 2002. Semi-supervised clustering by seeding. In: Proc. 19th Internat....
  • Basu, Sugato, Bilenko, Mikhail, Mooney, Raymond J., 2004. A probabilistic framework for semi-supervised clustering. In:...
  • Basu, Sugato, Davidson, Ian, Wagstaff, Kiri (Eds.), 2008. Constrained Clustering: Advances in Algorithms, Theory and...
  • Bekkerman, Ron, El-Yaniv, Ran, McCallum, Andrew, 2005. Multi-way distributional clustering via pairwise interactions....
  • Belkin, Mikhail, Niyogi, Partha, 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering....
  • Ben-David, S., Ackerman, M., 2008. Measures of clustering quality: A working set of axioms for clustering. Advances in...
  • J.C. Bezdek

    Pattern Recognition with Fuzzy Objective Function Algorithms

    (1981)
  • S. Bhatia et al.

    Conceptual clustering in information retrieval

    IEEE Trans. Systems Man Cybernet.

    (1998)
  • Christopher M. Bishop

    Pattern Recognition and Machine Learning

    (2006)
  • Blake, C.L., Merz, C.J., 1998. UCI repository of machine learning...
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Machine Learn. Res.

    (2003)
  • Bradley, P.S., Fayyad, U., Reina, C., 1998. Scaling clustering algorithms to large databases. In: Proc. 4th...
  • J. Buhler

    Efficient large-scale sequence comparison by locality-sensitive hashing

    Bioinformatics

    (2001)
  • Busse, Ludwig M., Orbanz, Peter, Buhmann, Joachim M., 2007. Cluster analysis of heterogeneous rank data. In: Proc. 24th...
  • Cao, F., Ester, M., Qian, W., Zhou, A., 2006. Density-based clustering over an evolving data stream with noise. In:...
  • Cheng, Yizong, Church, George M., 2000. Biclustering of expression data. In: Proc. Eighth Internat. Conf. on...
  • S.D. Connell et al.

    Writer adaptation for online handwriting recognition

    IEEE Trans. Pattern Anal. Machine Intell.

    (2002)
  • D. Critchlow

    Metric Methods for Analyzing Partially Ranked Data

    (1985)
  • Datta, Ritendra, Joshi, Dhiraj, Li, Jia, Wang, James Z., 2008. Image retrieval: Ideas, influences, and trends of the...
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    J. Roy. Statist. Soc.

    (1977)
  • Dhillon, I., Modha, D., 1999. A data-clustering algorithm on distributed memory multiprocessors. In: Proc. KDD’99...
  • Dhillon, Inderjit S., Mallela, Subramanyam, Guyon, Isabelle, Elisseeff, André, 2003. A divisive information-theoretic...
  • Dhillon, Inderjit S., Guan, Yuqiang, Kulis, Brian, 2004. Kernel k-means: Spectral clustering and normalized cuts. In:...
  • Ding, Chris, He, Xiaofeng, Simon, Horst D., 2005. On the equivalence of nonnegative matrix factorization and spectral...
  • P. Drineas et al.

    Clustering large graphs via the singular value decomposition

    Machine Learn.

    (1999)
  • R. Duda et al.

    Pattern Classification

    (2001)
  • J.C. Dunn

    A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters

    J. Cybernet.

    (1973)
  • Eschrich, S., Ke, Jingwei, Hall, L.O., Goldgof, D.B., 2003. Fast accurate fuzzy clustering through data reduction. IEEE...
  • Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, Xu, Xiaowei, 1996. A density-based algorithm for discovering clusters in...
  • Thomas S. Ferguson

    A Bayesian analysis of some nonparametric problems

    Ann. Statist.

    (1973)
  • Mario Figueiredo et al.

    Unsupervised learning of finite mixture models

    IEEE Trans. Pattern Anal. Machine Intell.

    (2002)
  • Figueiredo, M.A.T., Chang, D.S., Murino, V., 2006. Clustering under prior knowledge with application to image...
  • Douglas H. Fisher

    Knowledge acquisition via incremental conceptual clustering

    Machine Learn.

    (1987)

    This paper is based on the King-Sun Fu Prize lecture delivered at the 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, December 8, 2008.
