A multi-dimensional multivariate dataset
comprises several variables. One popular way of displaying such
a dataset is to use a scatterplot matrix, or SPLOM. In a SPLOM,
each possible pair of variables is displayed as a scatterplot
within a matrix structure.
SPLOMs do not handle categorical variables very well
due to overplotting. For example, in this scatterplot matrix,
it is not obvious which combination of month and region
has the most sales or which product has the highest sales.
Another way of displaying multi-dimensional multivariate datasets,
parallel coordinates, shows pairs of adjacent variables and their relationships,
with each line connecting the values of all variables for a single observation.
Here again, categorical variables are hard to read,
being displayed as complete bipartite graphs in this instance.
We propose a new technique called the generalized plot matrix,
or GPLOM, which shows different charts depending on the type of variables
displayed in each matrix cell,
in a fashion similar to the generalized pairs plot of Emerson et al.
(2013). Furthermore, the interactive features in GPLOM
allow casual data exploration and analysis.
In a GPLOM, categorical and continuous variables
are segregated on each axis; furthermore, one variable
is removed from each axis to avoid the same variable being plotted against itself.
The upper triangular part of the matrix is not displayed,
to allow room for interactive features. Different types of charts are used to show
relationships between different types of variables.
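The layout just described can be sketched roughly as follows. This is a minimal illustration, not the actual GPLOM implementation: the function name gplom_cells, the variable "profit", and the specific chart names (heatmap, bar chart, scatterplot) are assumptions made for the example.

```python
def gplom_cells(categorical, continuous):
    """List the visible cells of a lower-triangular matrix as (row, column, chart)."""
    # Segregate variable types on each axis: categorical first, then continuous.
    ordered = [(name, "categorical") for name in categorical]
    ordered += [(name, "continuous") for name in continuous]

    # Drop one variable from each axis so no variable faces itself.
    columns = ordered[:-1]
    rows = ordered[1:]

    cells = []
    for r, (row_name, row_kind) in enumerate(rows):
        for c, (col_name, col_kind) in enumerate(columns):
            if c > r:
                continue  # upper triangle is not displayed
            kinds = {row_kind, col_kind}
            # Chart names below are assumed for illustration only.
            if kinds == {"categorical"}:
                chart = "heatmap"
            elif kinds == {"continuous"}:
                chart = "scatterplot"
            else:
                chart = "bar chart"
            cells.append((row_name, col_name, chart))
    return cells


cells = gplom_cells(["month", "region"], ["sales", "profit"])
```

With two categorical and two continuous variables, this yields six cells, one per unordered pair of distinct variables.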
Charts with a categorical variable show an aggregate of the data,
computed with a user-selectable aggregation function.
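As a rough sketch of what such aggregation could look like, the snippet below groups rows by a categorical variable and applies a chosen function to a continuous one. The AGGREGATIONS table, the aggregate function, and the sample rows are hypothetical and only serve the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical menu of aggregation functions a user could pick from.
AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "average": mean,
    "minimum": min,
    "maximum": max,
}


def aggregate(rows, category, measure, how):
    """Group rows by a categorical variable and aggregate a continuous one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[measure])
    func = AGGREGATIONS[how]
    return {key: func(values) for key, values in groups.items()}


# Made-up sample data.
rows = [
    {"region": "East", "month": "Jan", "sales": 10.0},
    {"region": "East", "month": "Feb", "sales": 30.0},
    {"region": "West", "month": "Jan", "sales": 5.0},
]
totals = aggregate(rows, "region", "sales", "sum")
averages = aggregate(rows, "region", "sales", "average")
```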
Due to the limited amount of space relative to the cardinality
of variables, labels for data are shown through tooltips
and bendy highlights. Bendy highlights connect related marks between charts
while tooltips contain a brief summary of the data inside the mark.
The infobox contains more information about the particular mark under the cursor,
as well as a kernel density estimate of the underlying distribution.
The kernel density estimate also highlights the average,
allowing the analyst to see at a glance whether or not the average
accurately summarizes the data.
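To illustrate why showing the average on top of a kernel density estimate is useful, here is a minimal Gaussian kernel density sketch. The fixed bandwidth, the function name gaussian_kde, and the sample values are assumptions for the example; the narration does not say which kernel or bandwidth GPLOM uses.

```python
import math
from statistics import mean


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels of fixed bandwidth."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up bimodal data: the average lands between the two clusters.
samples = [1.0, 1.2, 0.9, 1.1, 9.0, 9.1, 8.8, 9.2]
density = gaussian_kde(samples, bandwidth=0.5)
average = mean(samples)

density_at_average = density(average)
density_at_cluster = density(1.0)
```

Here the density at the average is far lower than the density near either cluster, which is the kind of situation where the average is a poor summary of the data.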
Clicking on a particular mark highlights it and shows a part-to-whole comparison
on the other charts. For aggregation functions which are not additive,
such as the average, minimum or maximum, a dot is displayed.
For aggregation functions which are additive, such as the sum
or count, partial highlighting shows the contribution
of the selected value to the whole.
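The distinction between additive and non-additive functions can be sketched as follows. This is an interpretation, not the actual GPLOM code: the function names are hypothetical, and placing the dot at the aggregate of the selected rows is an assumption about what the dot represents.

```python
from collections import defaultdict
from statistics import mean


def additive_highlight(rows, category, measure, is_selected):
    """For a sum, return (selected part, whole) per category value."""
    whole = defaultdict(float)
    part = defaultdict(float)
    for row in rows:
        whole[row[category]] += row[measure]
        if is_selected(row):
            part[row[category]] += row[measure]
    return {key: (part.get(key, 0.0), total) for key, total in whole.items()}


def dot_value(rows, category, measure, is_selected, func=mean):
    """For a non-additive function, aggregate only the selected rows (assumed)."""
    groups = defaultdict(list)
    for row in rows:
        if is_selected(row):
            groups[row[category]].append(row[measure])
    return {key: func(values) for key, values in groups.items()}


def in_january(row):
    return row["month"] == "Jan"


# Made-up sample data.
rows = [
    {"region": "East", "month": "Jan", "sales": 10.0},
    {"region": "East", "month": "Feb", "sales": 30.0},
    {"region": "West", "month": "Jan", "sales": 5.0},
]
bars = additive_highlight(rows, "region", "sales", in_january)
dots = dot_value(rows, "region", "sales", in_january)
```

A sum can be split into a highlighted part and a remainder because the parts add up to the whole; an average cannot, which is why a separate marker is needed instead.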
Corresponding values are also highlighted in the scatterplots
and displayed in the kernel density estimate part of the infobox.
It is also possible to quickly find a value
in the matrix using textual search, which highlights the entered value.
Double-clicking on a particular mark in the matrix
filters the data, showing only the data contained in that mark.
GPLOM supports dimensional shedding, meaning that the filtered
dimension is removed from the matrix upon filtering;
one might think of a GPLOM as a hyperlinked
pyramid of sub-GPLOMs.
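Dimensional shedding can be sketched as a filter that also drops the filtered variable. The function name filter_and_shed and the sample rows are hypothetical, and a real implementation would likely work on a more efficient data structure.

```python
def filter_and_shed(rows, dimension, value):
    """Keep rows where dimension == value, then remove that dimension."""
    return [
        {key: val for key, val in row.items() if key != dimension}
        for row in rows
        if row[dimension] == value
    ]


# Made-up sample data.
rows = [
    {"region": "East", "month": "Jan", "sales": 10.0},
    {"region": "East", "month": "Feb", "sales": 30.0},
    {"region": "West", "month": "Jan", "sales": 5.0},
]
east = filter_and_shed(rows, "region", "East")
east_january = filter_and_shed(east, "month", "Jan")
```

Each filtering step leaves one fewer variable to plot, so repeated filtering walks down the pyramid toward smaller matrices.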