Streaming Data

Have begun reading
https://www.manning.com/books/streaming-data

51k8Fv94dBL._SX397_BO1,204,203,200_

What is a real-time system?

Real time systems are defined as hard, soft & near. Being defined by their latency and tolerance for delay. These definitions are somewhat fuzzy.

Anti-lock brakes would be hard – immediate application and critical consequences of delay. Skype would a near RT system.

Differences between RT & streaming systems

These systems have 2 parts: A computation/processing system and a consumption system(s).

DEFINITION: STREAMING DATA SYSTEM – in many scenarios the consumption part of the system is operating in a non-hard RT fashion, however the clients may not be consuming the data in RT due to network delays, application design or perhaps a client application isn’t running. The clients consume data when they need it. This is a streaming data system.

Typical architectural components

Collection Tier >
Message Queuing Tier >
Analysis Tier (> Optional persistent storage) >
In-Memory Data Store >
Data Access Tier (consumers)

[See also An Architecture for Fast and General Data Processing on Large Clusters by Matei Zaharia]

Put another way:
[Ref: Databricks]

At a high level, modern distributed stream processing pipelines execute as follows:

  1. Receive streaming data from data sources (e.g. live logs, system telemetry data, IoT device data, etc.) into some data ingestion system like Apache Kafka, Amazon Kinesis, etc.
  2. Process the data in parallel on a cluster. This is what stream processing engines are designed to do, as we will discuss in detail next.
  3. Output the results out to downstream systems like HBase, Cassandra, Kafka, etc.

To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:

  • There is a set of worker nodes, each of which run one or more continuous operators.
  • Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.
  • There are “source” operators for receiving data from ingestion systems, and “sink” operators that output to downstream systems.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s