Have begun reading
What is a real-time system?
Real-time systems are classified as hard, soft & near real-time, according to their latency and tolerance for delay. These definitions are somewhat fuzzy.
Anti-lock brakes would be hard – the brakes must be applied immediately, and the consequences of delay are critical. Skype would be a near RT system.
Differences between RT & streaming systems
These systems have 2 parts: a computation/processing system and one or more consumption systems.
DEFINITION: STREAMING DATA SYSTEM – in many scenarios the consumption part of the system operates in a non-hard RT fashion: the clients may not be consuming the data in RT due to network delays, application design, or because a client application isn’t running. The clients consume data when they need it. This is a streaming data system.
Typical architectural components
Collection Tier >
Message Queuing Tier >
Analysis Tier (> Optional persistent storage) >
In-Memory Data Store >
Data Access Tier (consumers)
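The tiers above can be sketched as a toy in-process pipeline. This is purely illustrative – the queue, store, and function names are stand-ins I've chosen, not a real framework:

```python
from queue import Queue

message_queue = Queue()   # Message Queuing Tier
in_memory_store = {}      # In-Memory Data Store

def collect(events):
    # Collection Tier: push incoming events onto the queue
    for e in events:
        message_queue.put(e)

def analyze():
    # Analysis Tier: drain the queue, aggregating a running sum per sensor
    while not message_queue.empty():
        event = message_queue.get()
        key = event["sensor"]
        in_memory_store[key] = in_memory_store.get(key, 0) + event["value"]

def data_access(key):
    # Data Access Tier: consumers read results whenever they need them
    return in_memory_store.get(key)

collect([{"sensor": "a", "value": 1}, {"sensor": "a", "value": 2}])
analyze()
print(data_access("a"))  # 3
```

Note how the queue decouples collection from analysis, and the in-memory store decouples analysis from consumption – consumers read on their own schedule, which is the "streaming" property described above.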
Put another way:
At a high level, modern distributed stream processing pipelines execute as follows:
- Receive streaming data from data sources (e.g. live logs, system telemetry data, IoT device data) into a data ingestion system like Apache Kafka, Amazon Kinesis, etc.
- Process the data in parallel on a cluster. This is what stream processing engines are designed to do, as we will discuss in detail next.
- Output the results to downstream systems like HBase, Cassandra, Kafka, etc.
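The three steps can be mimicked in miniature with the standard library. A hedged sketch – the records and "downstream" dict are stand-ins for a real ingestion system and sink, and the thread pool stands in for a cluster:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Step 1 stand-in: records as they might arrive from Kafka/Kinesis
raw_records = [
    '{"level": "ERROR", "msg": "disk full"}',
    '{"level": "INFO",  "msg": "ok"}',
    '{"level": "ERROR", "msg": "timeout"}',
]

def process(record):
    # Step 2: per-record processing, run in parallel across workers
    return json.loads(record)["level"]

with ThreadPoolExecutor(max_workers=4) as pool:
    levels = list(pool.map(process, raw_records))

# Step 3 stand-in: write a summary to a "downstream" store
downstream = {"error_count": levels.count("ERROR")}
print(downstream)  # {'error_count': 2}
```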
To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:
- There is a set of worker nodes, each of which runs one or more continuous operators.
- Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.
- There are “source” operators for receiving data from ingestion systems, and “sink” operators that output to downstream systems.
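The continuous operator model maps naturally onto chained generators: each operator pulls one record at a time from the previous one and forwards its output. A minimal sketch (operator names are my own, not from any engine):

```python
def source(records):
    # "Source" operator: receives records from the ingestion system
    for r in records:
        yield r

def uppercase_operator(stream):
    # Continuous operator: processes one record at a time, forwards downstream
    for record in stream:
        yield record.upper()

def sink(stream, out):
    # "Sink" operator: writes results to a downstream system
    for record in stream:
        out.append(record)

out = []
sink(uppercase_operator(source(["a", "b"])), out)
print(out)  # ['A', 'B']
```

In a real engine each operator would run on a worker node with records flowing over the network, but the record-at-a-time pull-and-forward shape is the same.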