GPLOM - The generalized plot matrix for visualizing multidimensional multivariate data

A multidimensional multivariate dataset comprises several variables. One popular way of displaying such a dataset is a scatterplot matrix, or SPLOM, in which each possible pair of variables is displayed as a scatterplot within a matrix structure. SPLOMs do not handle categorical variables well because of overplotting. For example, in this scatterplot matrix, it is not obvious which combination of month and region has the most sales, or which product has the highest sales. Another way of displaying multidimensional multivariate datasets, parallel coordinates, shows the relationships between pairs of adjacent variables, with each line connecting the values of all variables for a single observation. Here again, categorical variables are hard to read; in this instance they are displayed as complete bipartite graphs.

We propose a new technique called the generalized plot matrix, or GPLOM, which shows different charts depending on the types of the variables displayed in each matrix cell, in a fashion similar to the generalized pairs plot of Emerson et al. (2013). Furthermore, the interactive features of GPLOM allow casual data exploration and analysis. In a GPLOM, categorical and continuous variables are segregated on each axis, and one variable is removed from each axis to avoid plotting a variable against itself. The upper triangular part of the matrix is not displayed, leaving room for interactive features. Different types of charts are used to show relationships between different types of variables. Charts that involve a categorical variable show an aggregate of the data, computed by a user-selectable aggregation function.

Because space is limited relative to the cardinality of the variables, data labels are shown through tooltips and bendy highlights. Bendy highlights connect related marks between charts, while tooltips contain a brief summary of the data inside the mark.
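The layout rules described above (categorical and continuous variables segregated on each axis, one variable dropped per axis, lower triangle only, chart type chosen per cell) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Variable` class, the function names, and the specific chart names (heatmap, bar chart, scatterplot) are assumptions made for the example. The variables `month`, `region`, `product`, and `sales` come from the narration; `profit` is a hypothetical second continuous variable added so that a scatterplot cell appears.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    categorical: bool


def order_variables(variables):
    # Segregate the axis: categorical variables first, then continuous.
    # sorted() is stable, so the original order is kept within each group.
    return sorted(variables, key=lambda v: not v.categorical)


def chart_type(a, b):
    # Assumed mapping from variable types to chart types (illustrative only).
    if a.categorical and b.categorical:
        return "heatmap"        # aggregate per pair of categories
    if a.categorical or b.categorical:
        return "bar chart"      # aggregate of the continuous variable per category
    return "scatterplot"        # raw observations


def gplom_cells(variables):
    ordered = order_variables(variables)
    columns = ordered[:-1]      # drop the last variable from the horizontal axis
    rows = ordered[1:]          # drop the first variable from the vertical axis
    cells = {}
    for r, row_var in enumerate(rows):
        for c, col_var in enumerate(columns):
            # Keep only the lower triangle; the upper triangle is left empty.
            if c <= r:
                cells[(row_var.name, col_var.name)] = chart_type(row_var, col_var)
    return cells


example = [
    Variable("sales", categorical=False),
    Variable("month", categorical=True),
    Variable("profit", categorical=False),   # hypothetical variable
    Variable("region", categorical=True),
    Variable("product", categorical=True),
]
cells = gplom_cells(example)
for (row_name, col_name), kind in cells.items():
    print(f"{row_name} x {col_name}: {kind}")
```

Because the first variable is removed from the rows and the last from the columns, no cell ever pairs a variable with itself, and n variables yield n(n-1)/2 cells.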
The infobox contains more information about the mark under the cursor, as well as a kernel density estimate of the underlying distribution. The kernel density estimate also highlights the average, allowing the analyst to see at a glance whether the average accurately summarizes the data. Clicking on a mark highlights it and shows a part-to-whole comparison on the other charts. For aggregation functions that are not additive, such as the average, minimum, or maximum, a dot is displayed. For aggregation functions that are additive, such as the sum or count, partial highlighting shows the contribution of the selected value to the whole. Corresponding values are also highlighted in the scatterplots and displayed in the kernel density estimate part of the infobox.

It is also possible to quickly find a value in the matrix using textual search, which highlights the entered value. Double-clicking on a mark filters the data, showing only the data contained in that mark. GPLOM supports dimensional shedding, meaning that the filtered dimension is removed from the matrix upon filtering; one might think of a GPLOM as a hyperlinked pyramid of sub-GPLOMs.
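The part-to-whole comparison and dimensional shedding described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not GPLOM's actual code: the function names, the dictionary-based row format, and the tiny dataset are all made up for the example, and the "dot" for non-additive functions is assumed to mark the aggregate of the selected subset.

```python
from collections import defaultdict

# Aggregation functions over a list of values (names are illustrative).
AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "mean": lambda xs: sum(xs) / len(xs),
    "min": min,
    "max": max,
}
ADDITIVE = {"sum", "count"}  # a part can be drawn as a fraction of the whole


def aggregate(rows, by, measure, agg_name):
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[measure])
    func = AGGREGATIONS[agg_name]
    return {key: func(values) for key, values in groups.items()}


def highlight(rows, by, measure, agg_name, selected):
    # Compare the aggregate of the selected rows with the aggregate of all rows.
    whole = aggregate(rows, by, measure, agg_name)
    part = aggregate([r for r in rows if selected(r)], by, measure, agg_name)
    # Additive: fill part of the mark. Non-additive: assumed dot at the subset value.
    style = "partial fill" if agg_name in ADDITIVE else "dot"
    return {
        key: {"whole": whole[key], "selected": part.get(key), "style": style}
        for key in whole
    }


def filter_and_shed(rows, dimension, value):
    # Keep only matching rows, then drop the filtered dimension entirely.
    return [
        {k: v for k, v in row.items() if k != dimension}
        for row in rows
        if row[dimension] == value
    ]


# Made-up data for illustration only.
rows = [
    {"month": "Jan", "region": "East", "sales": 10.0},
    {"month": "Jan", "region": "West", "sales": 30.0},
    {"month": "Feb", "region": "East", "sales": 20.0},
    {"month": "Feb", "region": "West", "sales": 40.0},
]


def is_east(row):
    return row["region"] == "East"


by_month_sum = highlight(rows, "month", "sales", "sum", is_east)
by_month_mean = highlight(rows, "month", "sales", "mean", is_east)
east_only = filter_and_shed(rows, "region", "East")
```

After `filter_and_shed`, the `region` key no longer exists in any row, so a matrix rebuilt from `east_only` has one dimension fewer; repeating the step on the result is what gives the pyramid of sub-GPLOMs its shape.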