- Precomputation is a common technique used in information retrieval and analysis, including index, materialized view, cube and more.
- It’s a trade-off between time and space, query speed and update flexibility, online processing and offline processing.
- A few megatrends that make precomputation essential to the big data era.
- A real example of 200x acceleration of an OLAP query using different types of precomputation.
- In the near future, how AI and automation will improve precomputation and how that impacts the TCO of big data systems.
Precomputation is a common technique used in information retrieval and analysis. The basic idea is to calculate and store the intermediate results ahead of time, and reuse them to speed up further inquiries.
The oldest form of precomputation, perhaps, is the multiplication table used by the Babylonians about 4000 years ago. Nowadays, children memorize multiplication tables like the one below in grade school. So we can do multiplication of small numbers in our head by simply memorizing the precomputed result instead of calculating them by ourselves.
Figure 1: Multiplication Table from wikipedia
Precomputation in Databases
In database technology the use of precomputation is also very common. For example, an index in a relational database is a type of precomputation. In order to retrieve data faster, databases actively maintain index data structures which encompass one or more columns of an underlying table.
Once computed, indexes are used to quickly locate data without having to search every row in a table every time it is accessed. With index precomputation, the data retrieval time can be reduced from O(N) to O(log(N)) or even O(1), where N is the number of rows in a table.
Indexes as a precomputation have their own cost. Additional calculation and storage are required when rows are inserted into a table. While more indexes make more queries faster, it also means more precomputation, which can slow down data manipulation significantly.
The following chart shows how the number of indexes impacts the insert DB operation performance of a table.
Figure 2: Insert Performance by Number of Indexes from use-the-index-luke
Summary tables, often implemented as materialized views, are another form of precomputation in databases. A summary table is essentially the aggregated result of the original table. For example, a billion-row transaction table may be reduced to a few thousand rows after grouping by date. Analysis to the original data table can then take place on the summary table. Thanks to the reduced size, summary tables can make interactive data exploration hundreds or thousands of times faster, comparing to analysis on the original big table. Building a summary table is not cheap, and keeping it updated with changes in the original table is expensive. However, given the valuable speed boost at analysis time summary tables are still a widely used technique in modern data analysis.
Precomputation in OLAP and Data Cubes
In 1993, Edgar F. Codd, the father of the relational database, coined the term OLAP for On-Line Analytical Processing. Databases then diverge into OLTP focused transactional databases and OLAP focused analytical databases. As you may have guessed, OLAP databases usually leverage precomputation to an even greater extent than OLTP databases.
Cube-oriented systems are a special kind of OLAP databases that push the concept of precomputation to the extreme. A cube is a multi-dimensional array of data, given that data can have an arbitrary number of dimensions. Loading relational data into a cube is a precomputation process of joining tables and aggregating measures. A fully loaded cube is equivalent to 2n summary tables, where n is the number of dimensions. This is a tremendous amount of precomputation that can take hours to complete!
The pros and cons of cube can also be extreme:
On the pro side: once built, a cube provides the fastest analysis experience simply because everything has been precomputed. From whatever dimensions you want to look into the data, the result is already there. Almost no on-line calculation is needed apart from fetching the result from the cube and visualizing. That means low latency and high concurrency.
On the con side: a cube is inflexible and expensive to maintain. This is not just because precomputation and cube storage have associated costs. It’s also because loading a cube from a relational database often requires manual engineering of data pipelines. Every time the business user wants a change, a new development cycle is required to update the pipelines and the cube. This costs both time and money.
Figure 3: Multi-Dimensional Cube from wikipedia
Megatrends that Make Precomputation Essential to the Big Data Era
Precomputation does not exist in a vacuum. It’s interesting now because circumstances in computing demand it. Let’s take a look at some of the megatrends that make precomputation the right technology at the right time for the big data era and the promise of pervasive analytics.
More Data, More Compute
Data volume is growing continuously. You will have more data to analyze and that means more computation power has to be provided to process the additional data each year.
Figure 4: Growing Worldwide Data Created from statista.com
The End of Moore’s Law
Moore’s Law has failed in the past few years and is not going to recover. Research from University of Texas has shown that the effect of Moore’s Law has been greatly attenuated over the last decade from chip manufacturing perspective. Also the cloud price of computation has been basically flat in the past few years. This means, very likely, your computation cost will go up at the same rate as data volume grows!
Figure 5: Computation Cloud Price in the Past Few Years from redmonk.com
Pervasive High-Concurrency Analytics
The number of data consumers will grow significantly. Data is only valuable when it is used to make decisions. To let the “new oil” drive business, we want every employee and team in a company to be able to use data. That could mean tens or hundreds times of more concurrent users on your analytical system. The days of citizen analysts is coming.
Big Data Challenges and Opportunities for Precomputation
We have seen the love and the hate of precomputation. But let’s objectively consider the challenges and opportunities for precomputation looking forward. In the big data era, precomputation will become one of the cornerstone technologies in the data service layer as data volumes and the number of concurrent data users continue to increase.
To understand why this is the case let’s first sum up the salient technical characteristics of precomputation:
- Improved Response Times: precomputation trades space for time. When response time is a key concern, people will gladly trade space (storage) for time and should consider precomputation.
- Improved Concurrency: precomputation increases data preparation time/cost and decreases data serving time/cost. To have high concurrency and serve more consumers, consider precomputation.
- Data Pipelines and Latency: precomputation lengthens data pipeline and increases end-to-end data latency. This is a downside to be managed and we will discuss below.
Given these opportunities and challenges, let’s now consider how precomputation will help address some of the fundamental analytical needs of the modern business.
Maintaining query response times as data volumes keep growing
Practically speaking, how can you keep query response time constant while data volumes grow? When a query is served by on-line computation, typically by a MPP (Massively Parallel Processing) database, the query’s time complexity is O(N) at minimal, which means the needed computation has to grow linearly with data growth. If a query runs for 3 seconds today and data volume doubles, then the same query will become 6 seconds.
To avoid a wholesale revolt by data analysts, you will have to double MPP system resources – and double the infrastructure costs – to keep the query response time within 3 seconds.
However, when a query is served by precomputation, it is almost immune to data growth. The query’s time complexity is close to O(1) since most – if not all – of the results have been precomputed. Fetching and returning the results won’t be much different after the data volume is doubled. The query response time will remain 3 seconds.
Figure 6: Time Complexity On-line Computation vs. Precomputation
Supporting Thousands of Concurrent Citizen Analysts
For on-line computation, the impact of consumer growth is similar to that of data growth. The computation resources demand increases linearly with the number of concurrent users. The MPP vendors will encourage you to double the cluster size (and cost as well) to support double the number of analysts, though your budget may not agree. On the other side, since precomputation minimizes the resource needed to serve a query, the resource increase to support more users is also minimized.
Maximizing TCO as Both Data Volumes and Concurrent Users are Growing
The cloud is remarkable in that all resource consumption can be quantified by cost. The chart below shows a real cost comparison between a MPP data service and a precomputation data service in AWS. Both data preparation costs and query serving costs are included. The workload used is TPC-H, a decision support benchmark with 1 TB data. If today we have 40 analysts and each are running 100 queries per day, the question is what will the total cost be one year later if data grows by 25% and user count grows by 5 times?
Figure 7: A Real TCO Comparison: MPP vs. Precomputation
This shows that precomputation has a significant lead in TCO when the number of queries (or users) are high. Particularly, when the number of queries reaches 20,000 per day, the TCO of the precomputation data service is a third of the MPP service. And the growth of data volume further amplifies this advantage.
Example of 200x Speed up of an OLAP Query
As a practical example, we will examine how Apache Kylin (a big data OLAP engine) accelerates TPC-H queries to get a 200X speed-up using precomputation. TPC-H is a widely used decision support benchmark for databases.
With 100 GB data, the TPC-H query 07 runs for 35.23 seconds on a MPP engine, which is Hive+Tez in this case. The query has a sub-query and is not simple. As we can see from the execution plan, the query involves joining of multiple tables and then aggregating for the final result. These are the two biggest bottlenecks at execution time.
Figure 8: TPC-H Query 07, Original Execution Plan
From a precomputation perspective, we can easily think of creating a materialized view to have the joins precomputed. Doing that manually would result in something like the below.
Figure 9: TPC-H Query 07, Execution Plan with Manual Precomputation
Note the execution plan is much simplified because the joins are replaced by the precomputed result. The drawback of this solution is the materialized view requires manual effort to maintain, and application-level SQL needs a rewrite to query the new view instead of the original tables, which is costly for real-world use. Also, the heavy aggregation operation is still computed online.
To make precomputation better, Apache Kylin makes a few improvements:
- Introduces the cuboid concept, which is basically an enhanced materialized view that precomputes joins as well as aggregates.
- Helps to automate the maintenance of cuboids.
- Optimizes the original execution plan to leverage cuboid precomputation automatically.
Figure 10: TPC-H Query 07, Optimized Execution Plan in Apache Kylin
With Apache Kylin, the same query can return in 0.17 seconds on the same hardware. Precomputation removes both join and aggregate bottlenecks. And there is no need to rewrite application-level SQLs as the execution plan is optimized automatically to leverage the best cuboid available.
Future of Precomputation
Precomputation is the critical technology in the big data era to convert data into value at scale. For a wide variety of data platforms – such as cloud data warehouses and data lakes – it can provide fast response time, high concurrency, and low TCO at the same time. One potential drawback is the additional data preparation step, but this problem is actively being addressed. Another potential drawback of precomputation is that it can add latency in the data pipeline and requires extra maintenance operations. The good news is, this shortcoming will be resolved or eased greatly in the next two years.
Gartner predicts that through 2022, manual data management tasks will be reduced by 45% through the addition of machine learning and automated service-level management. In the near future, we will see that precomputation will be widely adopted at the data service layer to support bigger datasets and larger populations of citizen analysts.
With AI and automation technology, the manual data preparation part of precomputation will be fully automated. For example, Snowflake applies small materialized aggregates [Moerkotte98] on low-level data blocks fully automatically. The SMA is a kind of precomputation considered as a flexible and versatile alternative for cubes, and for certain queries, it can give speed up to two orders of magnitude. Apache Kylin, a big data OLAP engine, allows the user to define a multi-dimensional cube in a drag-and-drop GUI and configure dimension combinations that need precomputation by rules. The system then automates the remaining part of precomputation by running a Spark job to load the relational data into the partial cube. The whole process of data preparation is semi-automated and requires no programming or big data skills.
OLAP databases will start to feature smart or transparent precomputation. Such databases will be able to switch between on-line computation and precomputation transparently. When a query wants to see the latest fresh data, it can be served from an MPP engine without delay in the data pipeline. Or if a query can hit a precomputed result, then the rapid results will be used to minimize query cost and raise system throughput. New databases will be able to automatically decide when to precompute, what to precompute, and leverage precomputation to support operational goals like fast response time, high concurrency, and low TCO. All these will be transparent to the end user. This is what needs to happen to realize the goal of pervasive analytics.
About the Author
Yang Li is Kyligence co-founder and CTO, Apache Kylin co-creator and PMC member. Tech lead and architect of Kylin, focus on big data analysis, parallel computation, data index, relational algebra, approximation algorithm, and other technologies. Was senior architect of eBay Analytic Data Infrastructure department. Was tech lead of IBM InfoSphere BigInsights, responsible for Hadoop open source platform, winner of Outstanding Technical Achievement Award. Was Vice President of Morgan Stanley, responsible for global regulatory reporting platform.