Purpose:
The purpose of cluster analysis is to identify the groups into which individual cases naturally fall, such that cases within a cluster are more alike than they are to cases outside that cluster.
The idea is to discover how cases (typically people) cluster into groups where members of those groups have characteristics in common.
We naturally do this all the time, even if we are unaware that we are doing it. When we call someone lazy, emotional, or outgoing, we are essentially identifying their membership in a group (e.g., the “lazy group”) based upon observations of behavioral attributes that together make these people more similar to lazy people than to hardworking people.
Similarity:
As cluster analysis is all about clustering together things that are most similar and keeping apart things that are most different, there obviously needs to be a way to measure similarity/difference.
Euclidean distance
The most common method is to compute the Euclidean distance between each pair of objects. For p variables (x1, x2, … xp), the distance between the ith and jth objects is

d(i, j) = sqrt[ (xi1 − xj1)² + (xi2 − xj2)² + … + (xip − xjp)² ]
A problem arises if variables are measured on different scales. For example, a 1 unit difference on a scale from 1 to 5 is substantial, but a 1 unit difference on a 1 to 100 scale is negligible. As a result, the usual recommendation is to standardize the variables by converting them to z-scores.
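As a sketch of both steps above, standardization and Euclidean distance can be written in a few lines of pure Python (the cases below are hypothetical, measured on a 1–5 scale and a 1–100 scale):

```python
import math
from statistics import mean, pstdev

def zscore(values):
    """Convert raw scores to z-scores (mean 0, population SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def euclidean(a, b):
    """Euclidean distance between two cases measured on p variables."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Three hypothetical cases: variable 1 on a 1-5 scale, variable 2 on a 1-100 scale
raw = [[4, 80], [5, 20], [1, 75]]

# Standardize each variable (column) before computing distances,
# so neither scale dominates the result
cols = [zscore(col) for col in zip(*raw)]
standardized = [list(case) for case in zip(*cols)]

print(euclidean(standardized[0], standardized[1]))
print(euclidean(standardized[0], standardized[2]))
```

Without the standardization step, the 1–100 variable would swamp the 1–5 variable, since a raw difference of 60 on the second variable dwarfs any possible difference on the first.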
City-block (Manhattan) distance
For p variables (x1, x2, … xp), the distance between the ith and jth objects is

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
This is often referred to as the Manhattan distance: whereas the Euclidean distance takes the most direct route between two points (“as the crow flies”), the city-block distance is computed as if one were walking between two points in a city laid out on a grid pattern, such as Manhattan.
Chebychev distance
The Chebychev distance is the maximum of the absolute differences among the values of the clustering variables:

d(i, j) = max over k of |xik − xjk|
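The three distance measures above can be compared side by side. A minimal sketch in pure Python (the example points are invented for illustration):

```python
import math

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def chebychev(a, b):
    """Chebychev distance: the single largest absolute difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

x, y = [1, 2, 5], [4, 6, 5]
print(euclidean(x, y))   # 5.0 -> direct route
print(manhattan(x, y))   # 7   -> walking the grid
print(chebychev(x, y))   # 4   -> worst single variable
```

Note the ordering that always holds: Chebychev ≤ Euclidean ≤ Manhattan, since taking only the largest difference discards information, while summing every difference along the grid can only lengthen the route.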
Strategy:
From Mooi and Sarstedt (2011). A Concise Guide to Market Research. Springer-Verlag: Berlin, p. 240.
- Decide on the variables (i.e., characteristics) upon which you want to group cases.
- Decide on clustering method.
- Different methods require different procedures prior to analysis.
- Each method uses a different approach to determining cluster membership, with the overall goal of maximizing similarity among members within a cluster and minimizing similarity between members of different clusters.
- Methods
- hierarchical
- k-means
- two-step
- Decide how many clusters
- Interpret results and label clusters
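Of the methods listed, k-means is the easiest to sketch in code. Below is a minimal, illustrative implementation of the k-means (Lloyd's) algorithm for one-dimensional data; the function name and the toy data are invented for illustration, not taken from Mooi and Sarstedt:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means for 1-D data: alternate between assigning each
    point to its nearest centroid and recomputing each centroid as the
    mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k observed points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # two obvious groups
centroids, clusters = kmeans(data, k=2)
print(sorted(centroids))
```

The two alternating steps mirror the overall goal stated above: the assignment step maximizes within-cluster similarity given the current centroids, and the update step re-estimates the cluster centers given the current memberships.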
Illustration:
To take a simple example, the table below contains the mean January temperatures in degrees Fahrenheit for 19 U.S. states. The scatterplot on the following page depicts the mean temperature plotted for each state. It is apparent from the figure that these 19 states can be viewed as falling into two distinct clusters, which we can label the “Hot States” and the “Cold States”.
The output from a cluster analysis can then be used to simplify further analyses. For example, a business could apply separate marketing strategies aimed at each of the clusters rather than at individual states, or environmental policies could be developed for just two categories of states rather than individually for each state.
| State Index | State | Jan Temp (F) |
| 1 | AZ | 42.27 |
| 2 | CO | 23.71 |
| 3 | FL | 58.09 |
| 4 | IL | 24.58 |
| 5 | IA | 17.84 |
| 6 | ME | 13.58 |
| 7 | MA | 24.87 |
| 8 | MN | 7.94 |
| 9 | MS | 44.21 |
| 10 | NB | 22.73 |
| 11 | NH | 18.17 |
| 12 | NM | 34.39 |
| 13 | ND | 7.90 |
| 14 | OK | 36.11 |
| 15 | SC | 44.12 |
| 16 | TN | 36.32 |
| 17 | TX | 45.63 |
| 18 | VA | 34.48 |
| 19 | WY | 19.18 |
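The two-cluster solution for this table can be reproduced with a minimal two-cluster k-means on the single temperature variable (a sketch in pure Python; initializing the centroids at the coldest and hottest observed values is an illustrative choice, not a general recommendation):

```python
# Mean January temperatures (degrees F) from the table above
temps = {
    "AZ": 42.27, "CO": 23.71, "FL": 58.09, "IL": 24.58, "IA": 17.84,
    "ME": 13.58, "MA": 24.87, "MN": 7.94, "MS": 44.21, "NB": 22.73,
    "NH": 18.17, "NM": 34.39, "ND": 7.90, "OK": 36.11, "SC": 44.12,
    "TN": 36.32, "TX": 45.63, "VA": 34.48, "WY": 19.18,
}

# Two-cluster k-means on one variable: repeatedly assign each state
# to the nearer of two centroids, then move each centroid to the
# mean of its assigned states.
lo, hi = min(temps.values()), max(temps.values())
for _ in range(10):
    cold = {s: t for s, t in temps.items() if abs(t - lo) <= abs(t - hi)}
    hot = {s: t for s, t in temps.items() if abs(t - lo) > abs(t - hi)}
    lo = sum(cold.values()) / len(cold)
    hi = sum(hot.values()) / len(hot)

print("Cold states:", sorted(cold))
print("Hot states:", sorted(hot))
```

This converges to ten “Cold States” (centroid about 18.1 °F) and nine “Hot States” (centroid about 41.7 °F), matching the two groups visible in the scatterplot.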
