Preliminary steps
Start by loading the necessary packages (which must be installed
beforehand if they are not already installed) : TraMineR
and TraMineRextras
for sequence analysis, cluster
and WeightedCluster
for clustering, FactoMineR
et ade4
for
correspondence analysis, RColorBrewer
for color
palettes, questionr
et descriptio
for descriptive statistics, dplyr
et purrr
for data
manipulation, ggplot2
for plots and seqhandbook
which accompanies the handbook.
library(TraMineR)
library(TraMineRextras)
library(cluster)
library(WeightedCluster)
library(FactoMineR)
library(ade4)
library(RColorBrewer)
library(questionr)
library(descriptio)
library(dplyr)
library(purrr)
library(ggplot2)
library(seqhandbook)
The data used in the handbook is then loaded. The data on employment
trajectories are in the data frame trajact
: there are 500
observations, i.e. 500 individual trajectories, and 37 variables,
corresponding to the activity status observed each year between the ages
of 14 and 50.
# loading the trajectories
data(trajact)
str(trajact)
'data.frame': 500 obs. of 37 variables:
$ sact14: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact15: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact16: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact17: num 2 1 1 1 1 1 2 2 2 2 ...
$ sact18: num 2 1 2 2 1 1 2 2 2 2 ...
$ sact19: num 2 3 2 2 1 1 2 2 2 2 ...
$ sact20: num 2 3 2 6 2 2 5 2 6 2 ...
$ sact21: num 2 3 2 2 2 2 5 6 6 2 ...
$ sact22: num 2 3 2 2 2 2 5 6 2 2 ...
$ sact23: num 2 3 2 2 2 2 5 2 2 2 ...
$ sact24: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact25: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact26: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact27: num 2 3 2 2 2 2 3 2 2 2 ...
$ sact28: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact29: num 2 3 2 2 2 2 3 2 2 2 ...
$ sact30: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact31: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact32: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact33: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact34: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact35: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact36: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact37: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact38: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact39: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact40: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact41: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact42: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact43: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact44: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact45: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact46: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact47: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact48: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact49: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact50: num 2 3 3 2 2 2 2 2 2 2 ...
The first step in sequence analysis is to create a corpus of
sequences, i.e. a stslist
class object using the
seqdef
function. First the labels of the 6 states and a
palette of 6 colours are defined (this is optional: seqdef
creates labels and palette automatically if not provided).
# defining of a corpus of sequences
<- c("studies","full-time","part-time","short jobs","inactivity","military")
labs <- brewer.pal(length(labs), 'Set2')
palette <- seqdef(trajact, labels=labs, cpal=palette) seqact
Our corpus of 500 sequences consists of 377 distinct sequences, which confirms the interest of using a statistical procedure to group together sequences that are similar.
# number of distinct sequences
seqtab(seqact, idx=0) %>% nrow
[1] 377
The state distribution plot of all the sequences shows the preponderance of full-time employment and the non-negligible weight of inactivity.
# state distribution plot
seqdplot(seqact, xtlab=14:50, cex.legend=0.7)
A data frame is also loaded containing some socio-demographic
variables on the individuals. Note that the categorical variables are in
factor
format.
# loading the socio-demographic variables
data(socdem)
str(socdem)
'data.frame': 500 obs. of 9 variables:
$ annais : num 1950 1937 1933 1948 1944 ...
$ nbenf : Factor w/ 4 levels "aucun","un","deux",..: 4 3 3 3 3 2 4 4 4 1 ...
$ nbunion : Factor w/ 3 levels "aucune","une",..: 2 2 2 2 3 2 2 2 2 1 ...
$ mereactive : Factor w/ 2 levels "non","oui": 1 1 2 2 1 1 2 1 1 2 ...
$ sexe : Factor w/ 2 levels "homme","femme": 1 2 2 1 2 2 2 1 1 2 ...
$ PCS : Factor w/ 8 levels "NA","agric","indep",..: 7 8 5 7 6 5 6 7 7 4 ...
$ PCSpere : Factor w/ 8 levels "agric","indep",..: 7 1 5 2 6 3 6 5 6 6 ...
$ diplome : Factor w/ 4 levels "aucun","<bac",..: 1 2 2 2 3 4 2 2 2 2 ...
$ nationalite: Factor w/ 2 levels "francaise","etrangere": 1 1 1 1 1 1 1 1 1 1 ...
Building a distance matrix
Synthetic indicators
As an example, a distance matrix is constructed from indicators describing the number of episodes in the different states. The first individual has spent his entire trajectory in full-time employment, while the second has experienced one full-time episode but also one episode of study and two of part-time employment.
<- seqinepi(seqact)
indics head(indics)
nepi1 nepi2 nepi3 nepi4 nepi5 nepi6
1 0 1 0 0 0 0
2 1 1 2 0 0 0
3 1 2 1 0 2 0
4 1 2 0 0 0 1
5 1 1 0 0 0 0
6 1 2 0 0 1 0
The matrix can be calculated directly from the indicators or after a Principal Component Analysis (PCA) step, here by retaining the first 5 dimensions.
# distance matrix from the indicator
<- dist(indics, method='euclidean') %>% as.matrix
dissim
# distance matrix from PCA results
<- PCA(indics, scale.unit=FALSE, ncp=5, graph=FALSE)$ind$coord
acp_coords <- dist(acp_coords, method='euclidean') %>% as.matrix dissim
Other synthetic indicators (durations, states visited, etc.) can be
calculated simply from the functions seqistatd
,
seqi1epi
, seqifpos
, seqindic
or
seqpropclust
.
Disjunctive coding and QHA
In the case of complete disjunctive coding, the distance matrix can be calculated directly, with the Euclidean distance or the chi2 distance, or after a Principal Component Analysis (PCA) or Multiple Correspondence Analysis (MCA) step, here by retaining the first 5 dimensions.
NB: map_df
allows you to apply the same function to all
the columns of a data frame. Here, this function is used to convert
columns from numerical format to factor
format.
# complete disjunctive coding
<- as.data.frame(tab.disjonctif(seqact))
disjo <- disjo[,colSums(disjo)>0]
disjo
# euclidian distance
<- dist(disjo, method='euclidean') %>% as.matrix
dissim
# chi2 distance
<- map_df(disjo, as.factor) %>%
dissim dudi.acm(scannf=FALSE, nf=ncol(disjo)) %>%
dist.dudi() %>%
as.matrix
# after a PCA step
<- PCA(disjo, scale.unit=FALSE, ncp=5, graph=FALSE)$ind$coord
acp_coords <- dist(acp_coords, method='euclidean') %>% as.matrix
dissim
# after a MCA step
<- purrr::map_df(disjo, as.factor) %>%
acm_res MCA(ncp=5, graph=FALSE)
<- dist(acm_res$ind$coord, method='euclidean') %>% as.matrix dissim
For the qualitative harmonic analysis (QHA), the calculation of the distance matrix can be done directly (chi2 distance) or after a correspondence analysis (CA), here using the first 5 dimensions.
# QHA coding
<- seq2qha(seqact, c(1,3,7,10,15,20,28))
ahq <- ahq[,colSums(ahq)>0]
ahq
# chi2 distance
<- dudi.coa(ahq, scannf=FALSE, nf=ncol(ahq)) %>%
dissim dist.dudi() %>%
as.matrix
# after a CA step
<- CA(ahq, ncp=5, graph=FALSE)$row$coord
afc_coord <- dist(afc_coord, method='euclidean') %>% as.matrix dissim
Optimal Matching and alternatives
For Optimal Matching, the construction of a distance matrix
between sequences is done with the seqdist
function. This
also involves defining a substitution cost matrix between states (with
the seqsubm
function). Here the substitution costs are
constant and equal to 2 and the indel cost is equal to 1.5.
# building the distance matrix
<- seqsubm(seqact, method="CONSTANT", cval=2)
couts <- seqdist(seqact, method="OM", sm=couts, indel=1.5) dissim
From experience, Optimal Matching with the parameters
adopted here is a choice that allows the different dimensions of
sequence temporality to be taken into account - sequencing,
calendar (timing), duration (in different states or episodes,
duration and spell duration). If you wish to focus on
one of these dimensions, you can follow the recommendations of Studer
& Ritschard (2016, see in particular pages 507-509), and choose one
of the many other metrics implemented in the TraMineR
extension.
# sequencing
<- seqdist(seqact, method="OMstran", otto=0.1, sm=couts, indel=1)
dissim <- seqdist(seqact, method="OMspell", expcost=0, sm=couts, indel=1)
dissim <- seqdist(seqact, method="SVRspell", tpow=0)
dissim
# timing
<- seqdist(seqact, method="HAM", sm=couts)
dissim <- seqdist(seqact, method="CHI2", step=1)
dissim
# duration (distribution aver the entire period)
<- seqdist(seqact, method="EUCLID", step=37)
dissim
# duration (spell lengths)
<- seqdist(seqact, method="OMspell", expcost=1, sm=couts, indel=1)
dissim <- seqdist(seqact, method="LCS") dissim
Note that methods using the complete disjunctive coding or QHA are
also implemented in the seqdist
function (“EUCLID” and
“CHI2” methods).
Typology of sequences
Building a typology
A hierarchical agglomerative clustering (HAC) is then carried out
using Ward’s aggregation criterion, using the agnes
function of the cluster
extension.
NB: With a high number of sequences, HAC may require a significant
amount of computing time. However, there is a much faster implementation
in the fastcluster
package (hclust
function).
# hierarchical agglomerative clustering
<- as.dist(dissim) %>% agnes(method="ward", keep.diss=FALSE) agnes
In order to explore the solutions of a hierarchical agglomerative clustering, one usually starts by examining the dendrogram.
# dendrogram
as.dendrogram(agnes) %>% plot(leaflab="none")
The following graph combines dendrogram and index plot: the sequences of the index plot index are sorted according to their position in the dendrogram, which is shown in the margin of the graph.
# heatmap (dendrogram + index plot)
seq_heatmap(seqact, agnes)
Examination of inertia gaps can also be useful in determining the number of clusters in the typology. For example, it can be seen that there is a noticeable difference in inertia between partitions in 5 and 6 clusters.
# plot of inertia
plot(sort(agnes$height, decreasing=TRUE)[1:20], type="s", xlab="number of clusters", ylab="inertia")
There are also a number of indicators of partition quality (silhouette, Calinski-Harabasz, pseudo-R2, etc.; see Studer, 2013).
# indicators of quality
<- as.clustrange(agnes, diss=dissim)
wardRange summary(wardRange, max.rank=2)
1. N groups 1. stat 2. N groups 2. stat
PBC 5 0.79612318 4 0.79012613
HG 5 0.92899997 4 0.92259604
HGSD 5 0.92672338 4 0.92027518
ASW 3 0.53491169 2 0.52958820
ASWw 3 0.53862662 2 0.53182999
CH 2 156.86409716 3 103.74539004
R2 20 0.60648740 19 0.59473652
CHsq 2 324.53910100 3 245.89231855
R2sq 20 0.82862437 19 0.82073439
HC 5 0.04010082 4 0.04392402
The quality of the partitions for different numbers of clusters for the silhouette, pseudo-R2 and Calinski-Harabasz indicators is shown graphically here.
plot(wardRange, stat=c('ASW','R2','CH'), norm="zscore")
In the end, we opt for a partition in 5 clusters, by “cutting the
tree” of the HAC using the cutree
function.
# choosing the partition in 5 clusters
<- 5
nbcl <- cutree(agnes, nbcl) part
It is possible to “consolidate” the partition using the PAM algorithm
(Partition Around Medoids) and the wcKMedoids
function of the WeightedCluster
package. This results in a
very similar distribution of sequences between the clusters (see the
table crossing the clusters before and after consolidation) but the
quality of the consolidated partition is slightly higher (the R² goes
from 61 to 64%).
# consolidating the partition
<- wcKMedoids(dissim, k=nbcl, initialclust=part, cluster.only=TRUE)
newpart table(part, newpart)
newpart
part 28 34 103 307 480
1 104 241 1 0 0
2 9 24 2 7 0
3 0 6 6 63 1
4 0 0 17 0 0
5 8 0 0 1 10
wcClusterQuality(dissim, part)$stats['R2sq'] %>% round(3)
R2sq
0.607
wcClusterQuality(dissim, newpart)$stats['R2sq'] %>% round(3)
R2sq
0.64
If you wish to keep the consolidated partition :
<- as.numeric(as.factor(newpart)) part
NB: Another option, the “fuzzy” clustering, here with the Fanny
algorithm (extension cluster
). Unlike HAC or PAM, each
sequence does not belong to a single cluster but is characterised by
degrees of membership to the different clusters. The following table
presents the degrees of membership to the 6 clusters of the first 3
sequences of the corpus. The first sequence belongs 99% to cluster 1,
but the second is more “balanced”, mainly between clusters 2 and 5.
# fuzzy clustering
<- as.dist(dissim) %>% fanny(k=5, metric='euclidean', memb.exp=1.2)
fanny $membership %>% round(2) %>% .[1:3,] fanny
[,1] [,2] [,3] [,4] [,5]
1 0.99 0.00 0.00 0.01 0.00
2 0.06 0.51 0.02 0.13 0.28
3 0.00 0.11 0.80 0.00 0.08
Describing the typology: plots
The graphical representations give a quick and intuitive idea of the nature of the clusters in the typology. The most commonly used type of graph is the state distribution plot.
# state distribution plots of the typology
seqdplot(seqact, group=part, xtlab=14:50, border=NA, cex.legend=0.8)
The index plots are also very common.
# index plots of the typology
seqIplot(seqact, group=part, xtlab=14:50, yaxis=FALSE, cex.legend=0.8)