Which clustering method should you choose?

Goal of this tutorial

The aim of this tutorial is to help XLSTAT users pick an appropriate cluster analysis tool for their data.

What is cluster analysis?

Cluster analysis methods group objects (observations or individuals) into classes (clusters) in such a way that objects belonging to the same class are more similar to one another than to objects belonging to other classes. Proximity between objects is computed from a set of variables measured on all the objects. Cluster analysis methods are widely used in exploratory data mining. Here are a few examples:

In expression data (transcriptomics, proteomics, metabolomics, etc.), these methods detect individuals that have similar expression profiles, or features that have similar expression patterns.

In market research, clustering methods detect different consumer profiles from survey data.

In ecology, these methods help identify groups of sites that hold similar communities.

Available methods in XLSTAT

XLSTAT offers four clustering methods, available under the Analyzing data button:

k-means clustering

Agglomerative hierarchical clustering (AHC)

Gaussian mixture models

Univariate clustering

And one method in the XLSTAT-LG option:

Latent class cluster models

These methods only work on quantitative variables (except for latent class cluster models). Binary variables can also be used in AHC. If you need to cluster objects based on qualitative variables, we recommend running a Multiple Correspondence Analysis first and using the observation scores on the first axes (factors) as the dataset for clustering.

In the same spirit, you may also use the observation scores provided by any exploratory analysis, including Correspondence Analysis.

Which clustering method to choose

Each method has its own characteristics, summarized in the table below.

| | AHC | k-means | Gaussian mixture | Univariate clustering | Latent class cluster model |
|---|---|---|---|---|---|
| Number of variables | At least 1 | At least 1 | At least 1 | At most 1 | At least 1 |
| Input variable types | Quantitative continuous | Quantitative continuous | Quantitative continuous | Quantitative continuous | Quantitative continuous, quantitative ordinal, nominal |
| Should the number of classes be chosen prior to computations? | Optional | Mandatory | Mandatory | Mandatory | Mandatory (but the optimal number of classes can be determined by the model) |
| Results: class membership* | Deterministic | Deterministic | Probabilistic | Deterministic | Probabilistic |
| Results: special features | Dendrogram, profile plot | Profile plot | Parameter estimation of classes, mixture model plots, MAP classification plot | - | Variable contribution to each class, possibility to predict class membership of new observations (scoring equation) |

Going further

After computations, the class membership of every observation is reported differently depending on the clustering method. The deterministic way assigns every object to a single class, whereas the probabilistic way gives the probability that an observation belongs to each class.
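To make the difference between the two kinds of membership concrete, here is a minimal sketch in plain Python. It is not XLSTAT's implementation: it assumes a hypothetical univariate Gaussian mixture with two classes whose weights, means and standard deviations are made up for illustration, and all function names are ours. The probabilistic result is the list of posterior probabilities; the deterministic result keeps only the most probable class.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def membership_probabilities(x, components):
    """Probabilistic membership: posterior probability of each class for x.

    components is a list of (weight, mean, std) tuples, one per class.
    """
    weighted = [w * normal_pdf(x, m, s) for (w, m, s) in components]
    total = sum(weighted)
    return [v / total for v in weighted]

def hard_assignment(probabilities):
    """Deterministic membership: index of the single most probable class."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])

# Hypothetical two-class mixture (illustrative values, not fitted to data):
# (weight, mean, standard deviation)
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

for x in (0.5, 2.0, 3.5):
    probs = membership_probabilities(x, components)
    print(x, [round(p, 3) for p in probs], "-> class", hard_assignment(probs))
```

An observation halfway between the two class means gets a probability of 0.5 for each class, which is information a deterministic assignment cannot express.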

Very large datasets can be handled by combining different methods. For example, the clusters obtained by the k-means method can be used as the observations of an agglomerative hierarchical clustering. This tutorial will guide you.
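The two-stage idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the XLSTAT procedure: the toy data, the starting centroids, the choice of average linkage, and every function name are hypothetical, and the hierarchical step treats all centroids equally regardless of how many observations each one summarizes.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))

def kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = list(init_centroids)
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, [nearest(p, centroids) for p in points]

def agglomerate(items, n_classes):
    """Average-linkage agglomerative clustering; returns one class index per item."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(items)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

# Toy data: four tight groups that form two larger groups (illustrative only).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1),
          (1.0, 0.0), (1.2, 0.1), (1.1, -0.1),
          (10.0, 10.0), (10.2, 10.1), (10.1, 9.9),
          (11.0, 10.0), (11.2, 10.1), (11.1, 9.9)]

# Stage 1: k-means compresses the observations into 4 centroids.
centroids, kmeans_labels = kmeans(points, [points[0], points[3], points[6], points[9]])

# Stage 2: hierarchical clustering runs on the centroids only.
centroid_classes = agglomerate(centroids, n_classes=2)

# Each observation inherits the class of its k-means centroid.
final_labels = [centroid_classes[k] for k in kmeans_labels]
print(final_labels)
```

The pairwise distance computation, which is the expensive part of hierarchical clustering, is only performed on the centroids, so its cost depends on the number of k-means clusters rather than on the number of observations.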
