Rankclust: An R package for clustering multivariate partial rankings

Rankcluster is the first R package proposing both modelling and clustering tools for ranking data, potentially multivariate and partial. Ranking data are modelled by the Insertion Sorting Rank (isr) model, a meaningful model parametrized by a central ranking and a dispersion parameter. A conditional independence assumption makes it possible to handle multivariate rankings, and clustering is performed by means of mixtures of multivariate isr models. The clusters’ parameters (central rankings and dispersion parameters) help practitioners interpret the clustering. Moreover, the Rankcluster package provides an estimate of the missing ranking positions when rankings are partial. After an overview of the mixture of multivariate isr models, the Rankcluster package is described and its use is illustrated through the analysis of two real datasets.


Introduction
Ranking data occur when a number of subjects are asked to rank a list of objects $O_1, \ldots, O_m$ according to their personal order of preference. The resulting ranking can be denoted by its ordering representation $x = (x_1, \ldots, x_m) \in P_m$, which signifies that object $O_{x_h}$ is ranked $h$th ($h = 1, \ldots, m$), where $P_m$ is the set of permutations of the first $m$ integers. These data are of great interest in human activities involving preferences, attitudes or choices, such as politics, economics, biology, psychology and marketing. For instance, the single transferable vote system used in Ireland, Australia and New Zealand is based on preferential voting.
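To make the ordering representation concrete, here is a minimal Python sketch. It is not part of the package: the helper names and the example values are hypothetical. It checks that a vector is an element of P_m and converts an ordering (which object occupies each rank) into the complementary "ranking" view (which rank each object receives), shown here only for contrast.

```python
def is_ordering(x):
    """True if x is a permutation of 1..m, i.e. an element of P_m."""
    return sorted(x) == list(range(1, len(x) + 1))


def ordering_to_ranking(x):
    """Convert an ordering (x[h-1] = index of the object ranked h-th) into a
    ranking (r[i-1] = rank given to object O_i), i.e. the inverse permutation."""
    if not is_ordering(x):
        raise ValueError("x must be a permutation of 1..m")
    ranking = [0] * len(x)
    for h, obj in enumerate(x, start=1):
        ranking[obj - 1] = h
    return ranking


# Hypothetical example with m = 4 objects O_1, ..., O_4.
x = [3, 1, 4, 2]                  # O_3 ranked first, O_1 second, O_4 third, O_2 last
ranking = ordering_to_ranking(x)  # ranks of O_1, O_2, O_3, O_4 in turn
```

Applying the conversion twice returns the original ordering, since the two representations are inverse permutations of each other.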

Mixture of multivariate ISR model
Starting from the assumption that a rank datum is the result of a sorting algorithm based on paired comparisons, and that the judge who ranks the objects uses the insertion sort because of its optimality properties, [1] state the following isr model:

$$p(x; \mu, \pi) = \frac{1}{m!} \sum_{y \in P_m} \pi^{G(x, y, \mu)} (1 - \pi)^{A(x, y) - G(x, y, \mu)},$$

where $\mu \in P_m$ is a location parameter and $\pi \in [\frac{1}{2}, 1]$ is a scale parameter. The numbers $G(x, y, \mu)$ and $A(x, y)$ are respectively the number of good paired comparisons and the total number of paired comparisons of objects during the sorting process (see [1] for more details). Recently, [2] proposed a model-based clustering algorithm for multivariate rankings, i.e. when a datum is composed of several rankings, potentially partial (when some objects have not been ranked). For this, they extend the isr model by assuming that, given a group $k$, the components of a multivariate ranking are independent:

$$p(x^1, \ldots, x^p; \theta) = \sum_{k=1}^{K} p_k \prod_{j=1}^{p} p(x^j; \mu_k^j, \pi_k^j),$$

where the model parameters $\theta = (\pi_k^j, \mu_k^j, p_k)_{k=1,\ldots,K,\; j=1,\ldots,p}$ are estimated by means of a SEM-Gibbs algorithm. The resulting algorithm is able to cluster ranking data sets with full and/or partial rankings, univariate or multivariate. To the best of our knowledge, this is the only clustering algorithm for ranking data with such a wide application scope.
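The following brute-force Python sketch illustrates how an isr probability and the conditional-independence mixture could be evaluated for a small number of objects. It rests on several assumptions that should be checked against [1] and [2]: the isr probability is taken to be the average, over all m! presentation orders y, of π^G (1 − π)^(A − G); the insertion sort is assumed to scan the already sorted objects from left to right; and all function names are hypothetical. It is not the package's C++ implementation, and its cost grows as m!, so it is usable only for very small m.

```python
from itertools import permutations
from math import factorial


def count_comparisons(x, y, mu):
    """Return (G, A): good and total paired comparisons made by an insertion
    sort that receives the objects in order y and ends with ordering x.
    Assumption: each new object is compared with the sorted ones left to right."""
    pos_x = {obj: h for h, obj in enumerate(x)}
    pos_mu = {obj: h for h, obj in enumerate(mu)}
    sorted_objs = [y[0]]
    good = total = 0
    for obj in y[1:]:
        insert_at = len(sorted_objs)
        for i, other in enumerate(sorted_objs):
            total += 1
            before_in_x = pos_x[obj] < pos_x[other]    # answer leading to x
            before_in_mu = pos_mu[obj] < pos_mu[other]  # answer agreeing with mu
            if before_in_x == before_in_mu:
                good += 1
            if before_in_x:
                insert_at = i
                break
        sorted_objs.insert(insert_at, obj)
    return good, total


def isr_probability(x, mu, pi):
    """Average of pi^G * (1 - pi)^(A - G) over all presentation orders y."""
    m = len(x)
    acc = 0.0
    for y in permutations(range(1, m + 1)):
        g, a = count_comparisons(x, y, mu)
        acc += pi ** g * (1.0 - pi) ** (a - g)
    return acc / factorial(m)


def mixture_probability(xs, proportions, mus, pis):
    """Mixture over K groups of a product over the p rankings in xs.
    mus[k][j] and pis[k][j] are the parameters of group k, dimension j."""
    total = 0.0
    for p_k, mu_k, pi_k in zip(proportions, mus, pis):
        component = p_k
        for x_j, mu_kj, pi_kj in zip(xs, mu_k, pi_k):
            component *= isr_probability(x_j, mu_kj, pi_kj)
        total += component
    return total
```

Under these assumptions the probabilities sum to one over P_m for any fixed µ and π, and with π = 1 all the mass sits on x = µ, which gives a quick sanity check of the sketch.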

The Rankclust package
This algorithm has been implemented in C++ and is provided by the Rankclust package for R, available on the author's webpage and soon on the CRAN website.
The main function rankclust() performs cluster analysis for multivariate rankings and is able to take partial rankings into account. This function has only one mandatory argument: data, a matrix composed of the observed ranks in their ordering representation. The user can specify the number of clusters to estimate (1 by default) or provide a list of candidate numbers of clusters. In the latter case, the user can choose either the BIC or the ICL criterion to select the best number of clusters from the list. The outputs of rankclust() are of different kinds:
• the estimates of the model parameters, as well as the 'distances' between the final estimate and the current value at each iteration of the SEM-Gibbs algorithm. These distances can be used as indicators of the estimation variability.
• the estimated partition. Additionally, for each cluster, the probability and the entropy of all the cluster's members are given. This information helps the user interpret the clusters.
• for each partial ranking, an estimate of the missing positions.
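The selection criteria and the per-member entropy mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's code: the criteria are written in the "smaller is better" convention, ICL is taken to be BIC plus twice the total membership entropy, the number of free parameters is supplied by the caller, and all function names are hypothetical. The conventions actually used by rankclust() may differ.

```python
import math


def membership_entropy(posterior):
    """Entropy of one observation's cluster membership probabilities.
    It is 0 for a clear-cut assignment and log(K) for a uniform one."""
    return -sum(t * math.log(t) for t in posterior if t > 0.0)


def bic(log_likelihood, n_free_params, n_obs):
    """BIC in the 'smaller is better' convention (an assumption here)."""
    return -2.0 * log_likelihood + n_free_params * math.log(n_obs)


def icl(log_likelihood, n_free_params, n_obs, posteriors):
    """ICL sketched as BIC plus twice the total membership entropy, so that
    poorly separated partitions are penalised."""
    total_entropy = sum(membership_entropy(t) for t in posteriors)
    return bic(log_likelihood, n_free_params, n_obs) + 2.0 * total_entropy


def best_number_of_clusters(criterion_by_k):
    """Pick the number of clusters K with the smallest criterion value."""
    return min(criterion_by_k, key=criterion_by_k.get)
```

With this convention, ICL coincides with BIC when every observation is assigned to its cluster with probability one, and it exceeds BIC as the clusters overlap more.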

Application
The use of the Rankclust package will be illustrated by the analysis of the European countries' votes at the Eurovision Song Contest from 2007 to 2012.