Spark Streaming: Session 2

Ankur Ranjan
4 min read · Feb 5, 2023

In the first article of the Spark Streaming series, we covered some important concepts.

We also wrote word count code to understand those concepts. For those reading this article without having read the first one, I recommend reading the previous article or watching the video before starting this one. I have also created a YouTube video covering the same concepts. Please follow these links.

First session links:

In this session, we will mainly focus on the following concepts:

  1. Different streaming sources
  2. Stateful vs Stateless transformation

So let’s start and discuss these concepts in layman’s terms.

Different Streaming Sources

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources — socket connections, the file system
  • Advanced sources — Kafka, Pub/Sub, Kinesis

Basic sources like socket connections and the file system come in very handy when you are learning streaming concepts. They are very easy to set up for producing streaming data.

We have already seen one example of a socket source in the word count code, where we produced data using the nc -lk 9999 command. There we simply treated localhost with port 9999 as our source.

Producing stream data from a file-based source is also very easy. Some external process keeps adding new files to a source folder, and the Spark Streaming application keeps polling that folder for new files after a certain interval of time.

Basic sources are directly available in the StreamingContext API.
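As a sketch (assuming the Python API, pyspark, running against a local Spark installation; the input directory path is hypothetical), both basic sources are created directly from a StreamingContext:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it
sc = SparkContext("local[2]", "BasicSourcesDemo")
ssc = StreamingContext(sc, batchDuration=5)  # a new RDD every 5 seconds

# Basic source 1: socket connection (pairs with `nc -lk 9999`)
socket_lines = ssc.socketTextStream("localhost", 9999)

# Basic source 2: file system; Spark polls this folder for new files
file_lines = ssc.textFileStream("/tmp/stream-input")  # hypothetical path

socket_lines.pprint()  # print a few records from each batch
ssc.start()
ssc.awaitTermination()
```

This sketch only wires up the sources; it runs until interrupted and needs a live Spark environment, so treat it as a shape of the API rather than a full program.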

Advanced sources like Kafka require some more effort, such as setting up a Kafka cluster on your local system.

Advanced sources like Kafka, Kinesis, etc. are available through extra utility classes and require linking against extra dependencies.

With time, we will explore all these sources through code. So don’t worry if some of the concepts related to these sources are not clear yet. For now, just understand that there are different sources that produce data in streaming use cases.

Now, let’s try to understand the concepts of window and sliding interval before starting the stateless vs. stateful transformation discussion.

Window Operations

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.

  • Window length: the duration of the window, in seconds.
  • Sliding interval: the interval at which the window operation is performed, in seconds. Both parameters must be multiples of the batch interval.

Let’s try to make it clearer with an example :)

You may remember that in the last article we wrote word count code.

  • There we used a batch duration of 5 seconds.
  • It means that every 5 seconds a new RDD is created.
  • Let’s suppose our window size is 10 seconds. It means that we are only interested in the last 2 RDDs.
  • Now suppose that the sliding interval is 5 seconds (it must be a multiple of the batch duration). It means that after every 5 seconds one old RDD goes out of the window and one new RDD comes in.
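Note that Spark requires both the window length and the sliding interval to be multiples of the batch interval, so with 5-second batches the slide must be at least 5 seconds. The mechanics can be simulated in plain Python (this is an illustration of the concept, not Spark API):

```python
# Simulate windowed computation: batch interval 5 s, window 10 s,
# sliding interval 5 s -> each window covers 2 batches and advances
# one batch at a time.

def windows(batches, window_len, slide, batch_interval):
    """Yield the list of batches covered by each window position."""
    n = window_len // batch_interval   # batches per window
    step = slide // batch_interval     # batches to advance per slide
    for end in range(n, len(batches) + 1, step):
        yield batches[end - n:end]

# Four 5-second batches (one number per batch for simplicity)
batches = [3, 1, 4, 1]
print(list(windows(batches, window_len=10, slide=5, batch_interval=5)))
# -> [[3, 1], [1, 4], [4, 1]]
```

Each step drops the oldest batch from the window and pulls in the newest one, which is exactly the "one old RDD goes out, one new RDD comes in" behaviour described above.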

I hope you now have some understanding of window length and sliding interval. Next, let’s understand stateless and stateful transformations.

Stateless Transformation

  • It forgets the previous state.
  • Here we always perform operations on a single RDD.
  • We are not worried about past records; our calculations involve only the current records.
  • In this sense, every transformation in batch processing is a stateless transformation.

Stateful Transformation

  • Here we aggregate over more than one RDD.
  • With stateful transformations we have two choices: we can aggregate over the entire stream, or we can aggregate over a window of time.
  • Consider a streaming application that runs for 1 hour with a batch duration of 5 minutes. A new RDD is created every 5 minutes, so over one hour it creates 12 RDDs.
  • If we want to operate over the last 10 minutes, we always consider only the last 2 RDDs.

Let’s suppose we want to apply sum() to each batch of incoming streaming data; that is a stateless transformation. But if you want to calculate a running sum across batches, that is a stateful operation.
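The difference can be shown in plain Python (again an illustration of the idea, not Spark API): a stateless sum looks only at the current batch, while a running sum carries state forward across batches.

```python
# Stateless: each batch is summed on its own; previous batches are forgotten.
def stateless_sums(batches):
    return [sum(batch) for batch in batches]

# Stateful: a running total is carried across batches.
def running_sums(batches):
    total, out = 0, []
    for batch in batches:
        total += sum(batch)
        out.append(total)
    return out

batches = [[1, 2], [3], [4, 5]]
print(stateless_sums(batches))  # [3, 3, 9]
print(running_sums(batches))    # [3, 6, 15]
```

The stateless version produces the same answer for a batch no matter what came before it; the stateful version’s answer depends on all earlier batches.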

For stateful operations, you have two choices.

You can calculate the running sum over the entire stream. In the example above, the entire streaming time is 1 hour, i.e. 12 RDDs. We can apply one stateful transformation and find the running sum over the whole hour of streaming data.

Another choice is to use a window length and sliding interval. With each slide, old records are removed from the calculation and new records are added. Here we only consider the most recent results, such as the last 10 minutes’ running sum, or the last 30 minutes’ running sum with a 2-minute sliding interval.
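The windowed variant can be sketched the same way in plain Python: only batches inside the window contribute to the sum, so old batches drop out as the window slides. The figures mirror the example above (5-minute batches, 10-minute window, i.e. the last 2 batches).

```python
# Windowed sum: batch interval 5 min, window 10 min -> last 2 batches count.
def windowed_sums(batches, window_batches=2):
    out = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_batches + 1): i + 1]
        out.append(sum(sum(b) for b in window))
    return out

batches = [[10], [20], [30], [40]]   # one batch every 5 minutes
print(windowed_sums(batches))        # [10, 30, 50, 70]
```

Unlike the full-stream running sum, each result here forgets everything older than the window, which is what keeps windowed state bounded no matter how long the stream runs.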

I hope you have now got some idea about stateful vs. stateless transformations, and a basic understanding of window length and sliding interval.

In the next article, we will see some stateful transformation examples like:

  • reduceByWindow
  • updateStateByKey
  • reduceByKeyAndWindow
  • countByWindow

We will write code to understand the above transformations. Let’s meet in the next article to discuss more about Spark Streaming.

Feel free to subscribe to my YouTube channel, The Big Data Show. I have started a Spark Streaming playlist where I explain its core concepts through video illustrations.

And thank you for giving a writer the most precious gift: your time.

Originally published at https://www.linkedin.com.


Ankur Ranjan

Data Engineer III @Walmart | Contributor of The Big Data Show