The Batch Layer

The goal of a BI system is to answer any question (within reason) asked of it. In the Lambda architecture, any question can be implemented as function that takes all the data as input – unfortunately something that consumes the whole dataset is likely not to perform.

In the Lambda architecture, the batch layer precomputes the master dataset into batch views so that queries can be run with low latency. This requires balancing what needs to be precomputed  & what needs to be computed on the fly at execution time (Rather like aggregates in a star schema), the key is precompute just enough information to enable the query to return in an acceptable time.

The batch layer runs functions over the master dataset to precompute intermediate data called batch views. The batch views are loaded by the serving layer, which indexes them to allow rapid access to that data.

The speed layer compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed into a batch view.

(Rather like aggregates/caching in memory, with more esoteric queries going to the relational engine).

Queries are then satisfied by processing data from the serving layer views and the speed layer views, and merging the results

*you should take the opportunity to thoroughly explore the data & connect diverse pieces of data together! – assuming you have a priori knowledge of the necessary ‘joined’ datasets!

A naive strategy for computing on the batch layer would be to precompute all possible queries and cache the results in the serving layer. Unfortunately you can’t always precompute everything. Consider the pageviews-over-time query as an example. If you wanted to precompute every potential query, you’d need to determine the answer for every possible range of hours for every URL. But the number of ranges of hours within a given time frame can be huge. In a one-year period, there are approximately 380 million distinct hour ranges. To precompute the query, you’d need to precompute and index 380 million values for every URL. This is obviously infeasible and an unworkable solution

…Yet this is very much achievable using OLAP tools


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s