Data Warehousing


  • An enterprise may have dozens of different transaction processing systems. Each of these systems is complex and needs a team of people to maintain it.
  • These OLTP systems are usually expected to be highly available and to process transactions with low latency. Database administrators therefore closely guard their OLTP databases, and they are usually reluctant to let business analysts run ad hoc analytic queries on them, since those queries are often expensive and can harm the performance of concurrently executing transactions.

Introduction

  • A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations.
  • The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
  • Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL).
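
To make the three ETL steps concrete, here is a minimal sketch in Python. The table names (orders, fact_sales), column layout, and cleanup rules are all hypothetical; a real pipeline would be far more involved.

import sqlite3

def etl(oltp_path, warehouse_path):
    oltp = sqlite3.connect(oltp_path)
    warehouse = sqlite3.connect(warehouse_path)

    # Extract: a periodic dump of rows from the OLTP database.
    rows = oltp.execute(
        "SELECT order_id, product_id, quantity, order_date FROM orders"
    ).fetchall()

    # Transform: clean up and reshape into an analysis-friendly form,
    # e.g. drop bad records and derive a date key for the fact table
    # (assumes order_date is a string like '2013-01-01').
    transformed = [
        (order_id, product_id, quantity, order_date.replace("-", ""))
        for (order_id, product_id, quantity, order_date) in rows
        if quantity > 0
    ]

    # Load: append into the warehouse's fact table in bulk.
    warehouse.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)", transformed
    )
    warehouse.commit()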


A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.

Stars and Snowflakes: Schemas for Analytics

  • A sample star schema entity-relationship diagram shows a single fact table connected to multiple dimension tables.

The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.

  • Facts are captured as individual events, because this allows maximum flexibility of analysis later. Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated).
  • Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.
    • Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays.
  • A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
    • For example, there could be separate tables for brands and product categories, and each row in the dim_product table could reference the brand and category as foreign keys, rather than storing them as strings in the dim_product table.
    • Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with.
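
As a sketch of this snowflake variation, the schema might look as follows, here embedded in a small Python/SQLite script. The table and column names follow the examples above; the exact types and the remaining columns are assumptions.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Subdimensions: brand and category get their own tables...
    CREATE TABLE dim_brand    (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);

    -- ...and dim_product references them as foreign keys instead of
    -- storing brand and category as strings.
    CREATE TABLE dim_product (
        product_sk  INTEGER PRIMARY KEY,
        description TEXT,
        brand_id    INTEGER REFERENCES dim_brand (brand_id),
        category_id INTEGER REFERENCES dim_category (category_id)
    );

    -- Even dates get a dimension table, so extra metadata such as
    -- public holidays can be attached to them.
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,
        year       INTEGER,
        weekday    TEXT,
        is_holiday INTEGER
    );

    -- The fact table sits in the middle of the star: a few attribute
    -- columns plus foreign keys into the dimension tables.
    CREATE TABLE fact_sales (
        date_key   INTEGER REFERENCES dim_date (date_key),
        product_sk INTEGER REFERENCES dim_product (product_sk),
        quantity   INTEGER,
        net_price  REAL
    );
""")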

In a typical data warehouse, fact tables often have over 100 columns, sometimes several hundred. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysis.

Column-Oriented Storage

(“SELECT *” queries are rarely needed for analytics)

  • If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of rows), so in this section we will concentrate primarily on storage of facts.

  • Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time.

  • For example, the following query analyzes whether people are more inclined to buy fresh fruit or candy, depending on the day of the week:

SELECT
  dim_date.weekday, dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date ON fact_sales.date_key = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2013 AND
  dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday, dim_product.category;
  • How can we execute this query efficiently?
    • In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other.
    • In order to process a query like the one above, you may have indexes on fact_sales.date_key and/or fact_sales.product_sk that tell the storage engine where to find all the sales for a particular date or for a particular product.
    • But then, a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that don’t meet the required conditions. That can take a long time.

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.
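
Here is a toy illustration of the difference. Each "file" is just a Python list; a real engine would keep each column in its own file on disk.

# Row-oriented: all values of one row are stored together.
rows = [
    # (date_key, product_sk, quantity, ... potentially 100+ attributes)
    (20130101, 31, 1),
    (20130101, 69, 3),
    (20130102, 31, 2),
]

# Column-oriented: all values of each column are stored together.
# The nth entry of each column file belongs to the nth row.
columns = {
    "date_key":   [20130101, 20130101, 20130102],
    "product_sk": [31, 69, 31],
    "quantity":   [1, 3, 2],
}

# A query touching only two columns reads only those two "files";
# the other 100+ attribute columns are never loaded at all.
total = sum(
    q for d, q in zip(columns["date_key"], columns["quantity"])
    if d == 20130101
)
print(total)  # 4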

Column Compression

  • Fortunately, column-oriented storage often lends itself very well to compression.
  • The values in a column often look quite repetitive, which is a good sign for compression.
  • Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is bitmap encoding.
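
A minimal sketch of bitmap encoding, assuming a column whose number of distinct values is small compared to the number of rows: one bitmap per distinct value, with one bit per row. A real implementation would pack the bits and typically run-length encode sparse bitmaps as well.

def bitmap_encode(column):
    """Return {value: bits}, where bit i is set iff row i has that value."""
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for i, value in enumerate(column):
        bitmaps[value][i] = 1
    return bitmaps

product_sk = [69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31]
bitmaps = bitmap_encode(product_sk)

# WHERE product_sk IN (30, 31): a bitwise OR of the two bitmaps
# selects the matching rows without scanning the raw column.
matches = [a | b for a, b in zip(bitmaps[30], bitmaps[31])]
print(matches)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]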

Sort Order in Column Storage

  • In a column store, it doesn’t necessarily matter in which order the rows are stored.
  • It’s easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the column files.
  • An advantage of sorted order is that it can help with compression of columns.
    • If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row.
    • A simple run-length encoding could compress that column down to a few kilobytes, even if the table has billions of rows.
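
A minimal run-length encoding sketch: a sorted, low-cardinality column collapses into (value, run_length) pairs, one pair per run rather than one entry per row.

def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

# After sorting by date_key, the same date repeats in long runs.
date_key = [20130101] * 5 + [20130102] * 7 + [20130103] * 4
print(rle_encode(date_key))
# [[20130101, 5], [20130102, 7], [20130103, 4]]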

Writing to Column-Oriented Storage

  • Column-oriented storage, compression, and sorting all help to make the read queries faster.
  • However, they have the downside of making writes more difficult.
    • If you wanted to insert a row in the middle of a sorted table, you would most likely have to rewrite all the column files. As rows are identified by their position within a column, the insertion has to update all columns consistently.
    • Fortunately, we have already seen a good solution earlier in this chapter: LSM-trees. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk. This is essentially what Vertica does [62].
  • Queries need to examine both the column data on disk and the recent writes in memory, and combine the two.
    • However, the query optimizer hides this distinction from the user. From an analyst’s point of view, data that has been modified with inserts, updates, or deletes is immediately reflected in subsequent queries.
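
A rough sketch of this write path, assuming two columns sorted by date_key; the class and method names are illustrative, not Vertica's actual implementation.

import bisect

class ColumnStore:
    def __init__(self):
        # On-disk column "files", kept sorted by date_key.
        self.disk_columns = {"date_key": [], "quantity": []}
        # In-memory sorted structure that absorbs all writes first.
        self.memtable = []  # sorted list of (date_key, quantity) rows

    def insert(self, date_key, quantity):
        # Writes only touch the in-memory sorted structure.
        bisect.insort(self.memtable, (date_key, quantity))

    def flush(self):
        # Merge accumulated writes with the on-disk columns and
        # rewrite the column files in bulk.
        merged = list(zip(self.disk_columns["date_key"],
                          self.disk_columns["quantity"])) + self.memtable
        merged.sort()
        self.disk_columns["date_key"] = [d for d, _ in merged]
        self.disk_columns["quantity"] = [q for _, q in merged]
        self.memtable = []

    def total_quantity(self, date_key):
        # Queries combine on-disk columns with recent in-memory writes,
        # so new data is visible before it has been flushed.
        on_disk = sum(
            q for d, q in zip(self.disk_columns["date_key"],
                              self.disk_columns["quantity"])
            if d == date_key
        )
        in_memory = sum(q for d, q in self.memtable if d == date_key)
        return on_disk + in_memory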