BigData: Principles & Practices of scalable real-time data systems

Nathan Marz
BigData: Principles & Practices of scalable real-time data systems

“Web-scale applications like real-time analytics or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional RDBMS. These systems require architectures built around clusters of machines to store & process data of any size, or speed. Forunately, scale and simplicity are not mutually exclusive.

…Build big data systems using an architecture designed specifically to capture and analyse web-scale data. This book presents the Lambda Architecturea scalable, easy-to-understand approach that can be built and run by a small team. You’ll explore the theory of bigdata systems and how to implement them in practice. In addition to discovering a general framework for processing bigdata, you’ll learn specif technologies like Hadoop, Storm & NoSQL DBs.

What is the Lambda Architecture?
Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems.

Batch layer
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures.

Speed layer
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the “gap” caused by the batch layer’s lag in providing views based on the most recent data. This layer’s views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer’s views for the same data become available. Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL databases.

Serving layer
Output from the batch and speed layers are stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data. Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers.[7]Dedicated stores used in the serving layer include Apache Cassandra or Apache HBase for speed-layer output, and Elephant DB orCloudera Impala for batch-layer output.

The premise behind the Lambda architecture is you should be able to run ad-hoc queries against all of your data to get results, but doing so is unreasonably expensive in terms of resource. Technically it is now feasible to run ad-hoc queries against your Big Data (Cloudera Impala), but querying a petabyte dataset everytime you want to compute the number of pageviews for a URL may not always be the most efficient approach. So the idea is to precompute the results as a set of views, and you query the views. I tend to call these Question Focused Datasets (e.g. pageviews QFD).

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up. Here’s how it looks like, from a high-level perspective:

LA overview

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
  3. The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.

(Re)installing Apache Tika

Install a Windows subversion client (to make life easier)

Download Tika
This will install all the required Tika packages
SVN > Checkout
URL of repository:

Download Java SDK
Before downloading Maven, you need the java SDK

Install it to a pathname without spaces, such as c:\j2se1.6.

Once Java is installed, you must ensure that the commands from the Java SDK are in your PATH environment variable.

Set environmental variable JAVA_HOME to installation dir
Control Panel\System and Security\System
> Advanced system settings > SYSTEM VARIABLES > NEW

Append the full path of the Java compiler to the system path
Append the string C:\jdk1.7.0_79 to the end of the system variable ‘PATH’

Download Maven

  1. navigate in explorer to the Maven directory
  2. go to a dir IN the bin
  3. copy the address in the address bar(must end with bin)
  4. go to Start and type in “env”
  5. Select “edit the system evironment variables”
  6. find the PATH variable which must also have an existing value for Java as Maven needs Java.
  7. append a ; + paste the path.
  8. restart to update system
  9. run “mvn install” in the client ???

Download the ‘binaries’? *BIN* & *SRC*

Extract to
& C:\apache-maven-3.2.5\src

Maven in PATH

You run Maven by invoking a command-line tool: mvn.bat from the bin directory of the Maven. To do this conveniently, c:\mvn3.0.4\bin must be in your PATH, just like the J2SE SDK commands. You can add directories to your PATH in the control panel; the details vary by Windows version.

Add M2_HOME to sys vars

Update PATH variable, append Maven bin folder – %M2_HOME%\bin, so that you can run the Maven’s command everywhere.

Check install using
mvn –version
Should echo back windows etc.

Go to the src dir
Run mvn install


Now to connect Tika with Pentaho!

Scala Actions

first – returns the first element of the RDD (typically a file header)

take – returns a given number of records (like *nix HEAD)

collect – returns the whole dataset

Actions, by default, return results to the local process. The saveAsTextFile actions saves the contents of an RDD to persistent storage, such as HDFS
-This action creates a directory & writes out each partition as a file within it.

foreach println – prints out each value in the array on its own line

foreach printlin is an example of functional programming, where one function (foreach) is passed as an argument to another function (println)

The Spark Programming Model

Spark programming usually starts with a dataset or few, usually residing in some form of distributed, persistent storage like HDFS. Writing a spark program usually consists of a few related steps

  • Defining a set of transformations on input data sets
  • Invoking actions that output the transformed datasets to persistent storage or return results (to local memory)
  • Running local computations that operate on the results computed in a distributed fashion – these can help you decide what the next transformations and actions could be

Spark functions revolve around storage and execution

SCALA uses type inference

Whenever we create a new variable in Scala, we must preface it with either val or var
– val are immutable and cannot be changed to refer to another value once assigned
-var can be changed to refer to different objects of the same type.

Advanced Analytics with Apache Spark: What is Spark?

The challenges of Data Science

The hard truths:

First, the majority of work that goes into conducting successful analyses lie in preprocessing data. Data is messy, and cleansing, munging, fusing, mushing and many other verbs are prerequisites for doing anything useful with it…A typical data (processing) pipeline requires spending far more time in feature engineering and selection than in choosing and writing algorithms. The Data Scientist must choose from a wide variety of potential features (data). Each comes with its own challenges in converting to vectors fit for machine learning algorithms. A system needs to support more flexible transformations than turning a 2-d array (tuple) of doubles into a mathematical model.

Second, iteration. Models typically require multiple passes/scans over the same data to reach convergence. Iteration is also crucial within the Data Scientist’s own workflow, usually the results of a query inform the next query that should run. Feature and algorithm selection, significance testing and finding the right hyper-parameters all require multiple experiments.

Third, implementation. Models need to be operationalised and become part of a production service and may need to be rebuilt periodically or even in real time – analytics in the ‘lab’ vs analytics in the ‘factory’. Exploratory analytics usually occurs in e.g. R, whereas production applications’ data pipelines are written in C++ or Java.

Enter Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast toHadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.[1] By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.[2]

Spark is an open-source framework that combines an engine for distributed programs across clusters of machines with an elegant model for writing programs atop it.

Word Count

In this SCALA example, we build a dataset of (String, Int) pairs called counts and then save it to a file:

val file = spark.textFile(“hdfs://…”)
val counts = file.flatMap(line => line.split(” “))
.map(word => (word, 1))
.reduceByKey(_ + _)

One way to understand Spark is to compare it to its predecessor, MapReduce. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in 3 important ways:

First, rather than relying on a rigid map-then-reduce format, Spark’s engine can execture a more general directed acyclic graph (DAG) of operators. 

This means that (rather than MapReduce writing out intermediate results to the distributed filesystem), Spark can pass them directly to the next step in the pipeline.

Secondly, Spark has a rich set of transformations that enable users to express computation more naturally. Its streamlined API can represent complex pipelines in a few lines of code.

Thirdly, Spark leverages in-memory processing (rather than distributed disk). Developers can materialise any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same dataset need not recompute it from disk.

Perhaps most importantly, Spark’s collapsing of the full pipeline, from preprocessing to model evaluation, into a single programming environment can dramatically speed up productivity. By packaging an expressive programming model with a set of analytic libraries under REPL it avoids round trips and moving data back and forth. Spark spans the gap between lab and factory analytics, Spark is better at being an operational system than most exploratory systems and better for data exploration that the technologies commonly used in operational systems. Spark also has tremendous versatility offering functionality like SQL, being able to ingest data from streaming services like Flume & Kafka, it has machine learning tools and graph processing capabilities.

Building Microservices

Reading this

Ch1. Microservices
Ch2. The Evolutionary Architect
Ch3. How to Model Services
Ch4. Integration
Ch5. Splitting the Monolith
Ch6. Deployment
Ch7. Testing
Ch8. Monitoring
Ch9. Security
Ch10. Conway’s Law & System Design
Ch11. Microservices at scale
Ch12. Bringing It All Together

I’ve learned something today as I’d never heard of “Conway’s Law”. Here’s a list of Google results

Conway states: Conway’s law was not intended as a joke or a Zen koan, but as a valid sociological observation. It is a consequence of the fact that two software modules A and B cannot interface correctly with each other unless the designer and implementer of A communicates with the designer and implementer of B. Thus the interface structure of a software system necessarily will show a congruence with the social structure of the organization that produced it.

Wiki: Conway’s law is an adage named after computer programmer Melvin Conway, who introduced the idea in 1968; it was first dubbed Conway’s law by participants at the 1968 National Symposium on Modular Programming.[1] It states that

organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations

—M. Conway[2]

A nice link, which took me to another nice link, which provides a nice, succint definition (same author as the book)

In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare mininum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

Which arrived me at another link

To start explaining the microservice style it’s useful to compare it to the monolithic style: a monolithic application built as a single unit. Enterprise Applications are often built in three main parts: a client-side user interface (consisting of HTML pages and javascript running in a browser on the user’s machine) a database (consisting of many tables inserted into a common, and usually relational, database management system), and a server-side application. The server-side application will handle HTTP requests, execute domain logic, retrieve and update data from the database, and select and populate HTML views to be sent to the browser. This server-side application is a monolith – a single logical executable[2]. Any changes to the system involve building and deploying a new version of the server-side application.

{Here endeth today’s lesson. More tomorrow}

ES: Filter types

and filter
bool filter
exists filter
geo bounding box filter
geo distance filter
geo distance range filter
geo polygon filter
geoshape filter
geohash cell filter
has child filter
has parent filter
ids filter
indices filter
limit filter
match all filter
missing filter
nested filter
not filter
or filter
prefix filter
query filter
range filter
regexp filter
script filter
term filter
terms filter
type filter

Chapter 12: Structured search

Combining Filters

SELECT product
FROM products
WHERE (price 20 OR productID = ‘XXYYZZ’)
AND (price !=30)

Bool Filter

The bool filter is comprised 3 sections

“bool” : {
“must”:          [ ],
“should”:       [ ],
“must_not”:   [ ],

-MUST: All of the clauses must match. The equivalent of AND
-SHOULD: At least ONE of the clauses must match. The equivalent of OR
-MUST_NOT: All of the clauses must NOT match. The equivalent of NOT


To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:

GET /my_store/products/_search
   "query" : {
      "filtered" : { 
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              "must_not" : {
                 "term" : {"price" : 30} 
Note that we still need to use a filtered query to wrap everything.
These two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match.
If a product has a price of 30, it is automatically excluded because it matches a must_notclause.

Our search results return two hits, each document satisfying a different clause in the bool filter:

"hits" : [
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
          "price" :     10,
          "productID" : "XHDK-A-1293-#fJ3" 
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20, 
          "productID" : "KDKE-B-9947-#kL5"
Matches the term filter for productID = "XHDK-A-1293-#fJ3"
Matches the term filter for price = 20