Data Cubes is a data format especially useful for heavy reads. It has been popularized in business environments where the main use for data was to make reports (many reads). This also links with the OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing) concepts, where one is optimized for reads and the other for writes.

The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing or- ders, inventories kept up to date, etc), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions.

The Data Model

Data Cubes-20241213154509445 We basically have some dimensions that index the data (which can be more than three in the image), and we have some information for each. The important part is that they are **dense** meaning we usually have information for every entry.

Fact Tables

Clearly this model can be decomposed into a relational table, which is called fact table in this context. For example, the following is a $2\times 2\times 2$ cube: Data Cubes-20241213154807702

Operations

Slicing ๐ŸŸจ–

We just take a subset of the cube, for example, we can take all the data for a specific value of one of the dimensions. This should be the equivalent of a projection, we only care about specific values. Equivalent of a selection on the fact table.

  • Definition: Slicing is the process of selecting a specific slice (subset) of the data cube by fixing a value for one or more dimensions. It reduces the dimensionality of the cube by keeping certain dimensions constant and viewing the data along the remaining dimensions.
  • Example: Consider a data cube with dimensions: Time, Product, and Location. Slicing might involve selecting data for a specific Time (e.g., January 2025) to analyze sales across Product and Location dimensions.
Data Cubes-20250128105132230

Dicers ๐ŸŸจ–

If we slice in a way such that we only have (1,2,3)-dimensions left, then we have a nice way to represent that information in 2D table. This is what dicing means, it is a slice plus a nice organization in an human readable manner of the information.

  • Definition: Dicing is similar to slicing but involves selecting a sub-cube by specifying a range or set of values for multiple dimensions. This operation creates a smaller cube with more refined data.
  • Example: Using the same data cube, dicing could involve selecting sales data for the Time range (e.g., January to March 2025), specific Products (e.g., Laptops and Tablets), and Locations (e.g., Zurich and Milan).

Hierarchies

We can have hierarchies in the dimensions, for example, we can have a hierarchy of time (year, month, day).

  • Definition: Hierarchies define levels of granularity within a dimension and facilitate roll-up and drill-down operations. A hierarchy organizes data within a dimension into a tree-like structure, enabling movement between detailed and summarized views.
  • Example:
    • For the Location dimension, a hierarchy might be: City โ†’ Country โ†’ Continent.
    • For the Time dimension, a hierarchy might be: Day โ†’ Month โ†’ Quarter โ†’ Year.
  • Significance: Hierarchies are essential for summarization, enabling roll-up (aggregation to higher levels) and drill-down (exploration to lower levels).

Roll-Up

  • Definition: Roll-up is the process of aggregating data along a dimension hierarchy, moving from more detailed data to summarized or higher-level data. This typically involves grouping data and applying aggregation functions such as sum, average, or count.
  • Example: In the same data cube, rolling up along the Location hierarchy might aggregate sales data from cities (e.g., Zurich, Milan) into countries (e.g., Switzerland, Italy) or continents (e.g., Europe). Similarly, rolling up along Time might aggregate daily data into monthly or yearly data.

Implementation

In this section we will present how data cubes are usually implemented.

ROLAP

ROLAP is the storage of the data cube in a relational database. This is done by having a fact table and multiple dimension tables.

Pivoting

We can pivot the data cube to have a 2D table, which is useful for visualization. The easiest example is when we have a column in a table to be an enum, then we can explode this column into multiple columns, one for each value of the enum. This is called Pivoting. The reverse process is called unpivoting.

Satellite Tables

These are tables that contain information about the dimensions. The main table just contains the keys to these tables, the full information is in the satellite tables and can be recovered with joins. If the satellited tables are also normalized (see Structured Query Language) we derive the snowflake schema (if not, is called star schema).

Querying Cube Data

TODO: (the main problem is doing with multiple aggregates).

MOLAP

In this case the support for cubes is implemented from scratch. We will not cover this method. MOLAP stands for Multidimensional OLAP. They also have a special query language called MDX (Multi Dimensional Expressions).