New Visualization Techniques
Vol.34 No.1 February 2000
Visualizing Multi-Dimensional Data
Stephen G. Eick
With Microsoft’s release of Office/2000 and SQL Server 7.0, multi-dimensional databases, commonly called data cubes, are becoming an increasingly common data structure for storing aggregated transactions. Multi-dimensional data (or measures) are organized into cells that are indexed by a long a sequence of categories, called dimensions. To help understand multi-dimensional databases, we have developed three perspectives, Single, Multiple and Anchored Measures, for visualizing one, two and an arbitrary number of measures.
Our approach applies visual query and analysis to display much larger cubes than conventional PivotTable or Cross Tab reports, navigate through complex cubes and apply interactive operations to help users identify important structures. Our software integrates with Microsoft Excel and can access data directly from SQL Server 7.0 or other cube databases, and uses Microsoft Office for result set distribution and reporting.
With the decreasing cost of storage and increased bandwidth of networks it has become technically feasible and cost effective to store huge volumes of fine grain data. This data typically consists of transactions, sales records, customer information, and is stored in warehouses or data marts. When properly analyzed, it provides a rich analysis source for understanding business data. Transactions collected by operational systems are frequently stored in relational tables.
For a variety of reasons, data cleanliness, scalability, efficiency of the relational method, difficulty in building schemas and computational complexity, it is difficult to report, analyze, distribute and make business decisions using raw relational tables. The problem is that the relational model and SQL  standard interface for manipulating relational tables is not well suited for analysis tasks [2, 6]. Analysis queries submitted against warehouses engineered for fast transaction archiving frequently run extremely slowly. (Note: A multi-million dollar warehouse may only be able to support one or two power analysis users.) An approach for overcoming the analysis problems associated with relational databases promulgated by the business intelligence software vendors involves aggregating transactions into multi-dimensional databases or data cubes.
A data cube, the raw data structure in a multi-dimensional database, organizes information along a sequence of categories. The categorizing variables are called dimensions and the data, called measures, are stored in cells. The cube dimensions are frequently organized into hierarchies and usually include a dimension representing time. Multi-dimensional databases automatically aggregate measures across hierarchical dimensions, support hierarchical navigation, expand and collapse dimensions, enable drill-down, drill-up or drill-across and facilitate comparisons through time.
Perhaps the most common uses of data cubes is to store aggregated transaction information. In this case, for example, the cube dimensions might be product, store, department, customer number, region, month, and the measures might be COGS, sales and profit. The dimensions are predefined indices into a cube cell and the measures in a cell are roll-ups over the transactions. The roll-up or aggregations are usually sums but may include other functions such as averages, standard deviations, percentages, etc.
For example, the values for the dimensions may be:
Then the cell corresponding to [north] [shirts][Feb] is the total sales of shirts for the northern region for the month of February.
The hierarchical levels for the dimension time may be year, quarter, month, day, and hour. Thus [north][shirts][Q1] = sales [north][shirts][Jan] + ___ + sales [north][shirts][Mar].
It is not necessary that all cells be populated nor is it a requirement that the hierarchies be symmetric for all dimensions.
Figure 1: Microsoft Excel’s PivotTable.
The standard interface for understanding and manipulating data cubes is called a PivotTable or Cross Tab. Although there are variations among particular vendor’s implementations, Figure 1 shows an example of a Microsoft Excel PivotTable. Cells are arranged in a row by column by page grid, with one page showing at any time. The values of the row (Product), column (State) and page dimensions (QTR) index the table cells and adjust the visible page. Each cell contains five measures: Sales, Expenses, Profit, COGS and Marketing. Margins are totaled for each measure along the edges with grand totals in the lower right hand corner (not shown). The row dimension, Product, is organized into a two-level hierarchy by Product nested within Product Type.
A PivotTable is an interactive textual report. Standard PivotTable manipulations include:
Multi-Dimensional OLAP Cube Visualization
A problem with PivotTable reports is that they are not an effective tool for understanding multi-dimensional databases. From Figure 1 for example, even if the font were readable about the only thing it shows is that certain products are not sold in certain states. Seeing patterns, discovering trends, navigating hierarchies in the multi dimensional data, showing what changed from quarter to quarter and finding relationships between sales and profits is impossible. Seeming simple analysis tasks such as identifying the three largest cells, locating the two rows with the smallest totals, drilling into a subset of the cells or finding the biggest growth trends are time consuming and tedious at best and frequently impossible.
To overcome this problem, we have developed a series of techniques for visual discovery in multi-dimensional data cubes. This effort is part of an on-going effort to develop a complete family of tools, called ADVIZOR, targeting different aspects of visual query and analysis.
Our techniques are organized into three perspectives. A perspective is a set of linked visual components (or views) that are displayed together on the same screen. The views in a perspective work together to enable a particular type of visual analysis. One perspective, focused on visualizing a single measure, answers "what" questions. Another, focused on showing two or three measures simultaneously, answers "why" questions. The third explores ways to show three to 20 measures using a variant on parallel coordinates.
Figure 2: A landscape visualization of the multi-dimensional data from the PivotTable in Figure 1 for the profits measure.
Figure 3: 3D Multiscape’s navigation controls. Top: popup menu; Bottom: 3D Multiscape toolbar.
Figure 4: Four frames from a 3D Multiscape animation that help users understand transitions.
Figure 5: Multiple Measures perspective showing sales and profits.
Figure 6: Scatterplot navigation controls. Top: popup menu, bottom: Scatterplot toolbar.
Figure 7: Anchored Measure perspective.
Figure 8: Toolbar supports rich navigation.
Figure 9: Comparing profits, Coffee and Espresso.
Figure 10: Profitability Analysis. Top: most profitable product and state overall. Bottom: most and least profitable state-product combinations.
Figure 11: High-level software design.
Figure 2 shows a Single Measure perspective visualizing the PivotTable pictured in Figure 1. The Single Measure perspective presents the entire PivotTable for one measure, profits in this case.
The perspective in Figure 2 is organized into three parts. Controls across the top follow the standard Microsoft conventions. Along the left are three interactive Bar Charts, one for each dimension, showing profits totaled by state, product and product_type within product. In the center of the display a three-dimensional landscape view, called a 3D Multiscape , shows profit by product-state combination. The height (or depth for negative values) of each 3D Multiscape bar along the z-axis encodes profits or losses. The Bar Charts present PivotTable marginal totals, the aggregation of profits by dimension value, whereas the 3D Multiscape shows cell details.
The colors in the perspective, set in the top control panel, are tied to the state dimension. The Bar Charts along the left perform two functions: first to provide measure summary detail with the color slices showing contributions by state and second to act as filters in the visual analysis process.
The 3D Multiscape provides "big picture" overview of the PivotTable. It shows, for example, how profits vary by state, by product and product type. The down-pointing 3D Multiscape bars represent negative profits (losses).
3D Multiscape’s navigation control and popup menu, shown in Figure 3, provides a rich interface for tuning the visualization. Classes of interactive operations include:
Figure 5 shows a multiple measures perspective visualizing both profit and sales. The biggest difference between a single and a multiple measures is that the 3D Multiscape is replaced with Scatterplot with two measures, one each on the x- and y-axes, and a third measure (or dimension) tied to color.
This leads to visual insights involving relationships among three measures.
The Data Sheet  below the Scatterplot shows line-item details for the points in the Scatterplot. Mousing over any individual point both labels the point coordinates and shows its corresponding details for all of the measures in the Data Sheet.
The Multiple Measures perspective shows patterns between two (or three if color is used) measures and is intended to answer "why" questions. Users discover what happened using the single measure perspective, e.g., profits are down for a particular state-product combination, and use the multiple measures perspective to discover why, e.g., marketing costs were unusually high for this combination.
The anchored measure perspective, shown in Figure 7 focuses on displaying an arbitrary number of measures, say between three and 20, using techniques from multidimensional visualization. Anchored measures combines a weighted ParaBox , a combination of bubble plots, parallel coordinates  and Box plots. (Note: The outer box in the box plot represents the range of values from the 5% to 95% percentiles. The inner, dark grey box represents the range into which the middle 50% of the values fall (i.e. the 25-75th percentiles). Outliers are values that are plotted outside (above or below) the outer box (outside the 95th percentile).
The dimensions and measures are organized along a series of parallel axes, as with a parallel coordinates plot. The first three axes, indicated by the bubble plots, correspond to the dimensions of the original pivot table and the remainder, indicated by the Box plots, correspond to the measures. Lines are drawn between the bubble and box plots connecting dimensions and measure values. Each line on Figure 7 shows the measure values for one cell. The sizes of the bubbles in the State, Product Type and Product columns show the number of cells have each respective value. The values of the measures from each row by column cell of the pivot table are plotted as points on the axis of the corresponding box plot. Lines are drawn between the bubble and box plots connecting dimensions and measure.
From this perspective one can easily identify measure values that are "outliers" or extreme values since they touch extreme regions in the Box plot. The lines in Figure 7 are colored by Product Type. The red lines corresponding to monthly green tea purchases in Nevada show losses in profit (negative profits) since they touch the bottom of the Profit axis. By noticing the values of the other measures for the red lines, we can easily tell that the reasons for losses are: low sales, high COGS and high marketing costs.
Visually Navigating PivotTables
The analysis power of our perspectives (and virtually any other visualization tool) is increased by providing efficient techniques to navigate through the visualization. Figure 8 shows our toolbar, which serves as a command and control center for PivotTable navigation.
There are eight classes of navigational controls:
Using the mouse, users may sweep out regions on the views. The items in the sweep out region become the selection set and are drawn in color. The unselected set is drawn in gray. Following the standard Windows model, there are four selection modes:
In addition, since selection is so important, we have extended these modes by adding buttons for:
Visibility: Excluding and Restoring
A key capability in visual analysis is the ability to focus in on particular regions of interest. There are two aspects of the visibility capability: excluding and restoring data. The exclude button eliminates the unselected data items from the display so that only the selected set is visible. The restore button brings them back, making them visible again. Working together, selecting and excluding are extremely powerful. Starting with a large PivotTable users can quickly and easily identify unusual patterns and interactively select them. Using the exclude button, users can easily drill-in and focus on the interesting regions.
The Totals Table (lower left corner) provides immediate feedback on what is selected and excluded. This helps users navigate without getting lost. If the PivotTable data came from a SQL Server 7 OLAP cube, drill-down and drill-up operations cause ADVIZOR to attach directly to the cube to fetch data. This significantly increases scalability since data is fetched only when needed.
To compare, for example, how profits vary between the Coffee and Espresso, a convenient strategy is to select those products and exclude the other product types. To accomplish this in the Single Measure perspective, the user would use the dimension Bar Charts as filtering tools and would select just the bars corresponding to Coffee and Espresso Product Type Bar Chart. Selecting the Exclude button simplifies the perspective to these two Product Types, making it immediately apparent that Colombian Coffee is quite profitable in Massachusetts and Espresso is extremely profitable in New York (See Figure 9).
Write-Back: Exporting Result Sets
A visual analysis involves posing questions, formulating hypotheses and discovering results. As part of a holistic analysis process these results must translate into business actions that yield value. Our approach for supporting the complete analysis process involves both visual discovery and the creation of result sets.
A result set, a sub-cube created by selecting and excluding, may be exported (also called "write-back") to Excel where it appears as a new PivotTable on a new worksheet. Furthermore, by integrating with Microsoft Office 97 and Office 2000, important visualizations may be copied to the clipboard and inserted in PowerPoint presentations, printed, analyzed further in Excel, saved as .html files and browsed in Internet Explorer, Netscape Navigator or distributed as text for further action.
Case Study: A Profitability Analysis
The PivotTable shown in Figure 1 involves a multi-quarter business profitability study. The complete dataset, stored as an Excel PivotTable, contains a wide range of business metrics. The analysis goal is to study profitability by product and market, identify profitability problems and highlight reasons for the problems. The business strategy is to maximize profitability in Nevada by adjusting product mix.
In the master PivotTable the original dimensions included: QTR, Months, Market, State, Mrkt Size, Product Type, Product and Decaf. The master PivotTable also included the measures: Profit, Margin, Sales, COGS, Tot Exp, Marktg Payroll, Misc, Inventory, Opening, Additions, Ending, Margin Rat, Profit Ratio, Bdgt Profit, Bdgt Margin, Bdgt Sales, Bdgt COGS, Bdgt Payroll and Bdgt Additions.
Focusing in on profitability, Figure 2 shows profitability by product-state combination. Identifying the tallest bars by interactively touching the dimensional Bar Charts with the mouse, shows that Colombian is the most profitable product overall and California is the most profitable state (Figure 10). Rotating the 3D Multiscape for better viewing of the bars and labeling the tallest bars shows that the most profitable product in any one state is Columbian coffee (in Massachusetts) followed by Columbian (in California) (Figure 10).
Notice also that one bar in Figure 10 stands out since it is large and points down (indicating loss). Labeling this bar shows that it corresponds to Green Tea sales in Nevada. Selecting, excluding and switching to the Multiple Measures perspective shows the performance of Green Tea in Nevada over the last several quarters (Figure 5).
Switching to the Multiple Measures perspective in Figure 5 and Anchored Measures perspective in Figure 7, we can dig into why there are losses by looking at the relationship of five measures: profits, sales, COGS, Tot Exp and Marktg. There is a consistent pattern: profits and sales of Green Tea are low where COGS, Tot Exp and Marktg are high. The business problem is clear: a lack of Green Tea sales along with high costs are causing a large loss in Nevada. The data shown in Figure 5 can then be written back as a reduced PivotTable for further Excel analysis.
Dimensional databases are a particularly useful way to organize business metrics. PivotTables are a widely used tool for manipulating and reporting on these metrics. The problem that we address involves PivotTable understanding. It is difficult for users to understand and make sense of PivotTable data, e.g., see patterns, identify trends and spot outliers because of their size and textual nature. Our experience is that understanding even a tiny PivotTable with 20 or more cells is hard, and understanding bigger tables is impossible. Graphical tools supplied with Excel and other vendors are also not scalable and not particularly useful. In practice, the way users understand large PivotTables is by breaking them up into multiple small tables.
There are three disadvantages in using smaller, reduced size PivotTables:
We overcome these problems using the following "sense making" operations:
Our initial implementation, named ADVIZOR/2000, integrates with MS Office as an Excel add-in and is launched from Excel’s pivot table toolbar. An architectural view of its high-level design is shown in Figure 11.
To run ADVIZOR/2000, the user creates a PivotTable in Excel using Excel’s PivotTable wizard. The source data can originate from an Excel worksheet, a relational database or an OLAP cube accessible from OLAP services. When ADVIZOR/2000 is launched its initial Single Measure perspective is automatically populated with data scraped from the PivotTable in the active worksheet. If the Excel PivotTable was created from an SQL Server 7.0 data cube, ADVIZOR/2000 connects to SQL Server and uses the cube schema harvested from the PivotTable. Dimensional browsing and other cube navigation operations such as drilling-up, -down and across are then accomplished by selecting cube slices and pulling raw slice data from OLAP services.
The ADVIZOR/2000 container application is written in Visual Basic 6.0. It consists essentially of a thin container, about a dozen controls for navigation, and visualization. Internally, PivotTable data is kept in the Data Pool, an in-memory store that supports manipulations and case-based linking .
ADVIZOR/2000’s architecture is quite flexible. We have created prototype stand-alone versions of ADVIZOR/2000 that attach to OLAP services, integrate with Business Objects (Note: http://www.businessobjects.com) and attach to Knosys’ Pro Clarity.
We have developed a tool that explores techniques for visualizing multi-dimensional databases. It displays data cubes in three ways:
We extend multi-dimensional data analysis techniques in at least four ways:
This project has involved many key engineers, developers and creative thinkers on the Visual Insights staff including Tim Barg, Sue Burkwald, Brenda Garity, Dianne Hackborn, Bill Hammond, Barbara Mirel, John Pyrce, Kurt Rivard, Bill Swanson and Michael Tatelman.
Stephen G. Eick is the Founder and Vice President of Research and Development at Visual Insights, an emerging growth startup that spun out from Lucent Technologies. Visual Insights’ mission is to apply its patented RealTime3D™ graphics technology to understand e-business as it happens.
The copyright of articles and images printed remains with the author unless otherwise indicated.