Spark Streaming — Part 1

Ankur Ranjan
6 min readFeb 5, 2023

A few months back, I was handed a codebase that used Spark Streaming and was written in Scala. We were supposed to make major changes and release a new version of the project. I had worked with Spark Structured Streaming before, but not with Spark Streaming. I found that, because Spark Streaming is a legacy project, the Apache Software Foundation no longer actively develops it. Hence there are very few good articles on Spark Streaming that explain its functionality, problems and use cases in layman’s terms and in good depth.

I think understanding Spark Streaming is a good way to enter the streaming world, as it is easy to learn. It also exposes a lower-level API, which makes it easier to visualize the concepts of streaming. So let’s start learning Spark Streaming in layman’s terms through this series of articles.

This article assumes that you know the basic terms of Spark and have written some batch-processing code using Spark.

What is Spark Streaming?

  • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Data can be ingested from different sources like sockets, files, Kafka, Kinesis, etc., and can be processed with high-level Spark Streaming functions like map, reduce and join.
  • Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
  • Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
  • DStream is nothing but a continuous stream of RDDs (Spark abstraction). One can consider it as Seq[RDD]. Every RDD in DStream contains data from a certain interval.
  • Any operation on a DStream applies to all the underlying RDDs. The DStream abstraction hides these details and provides the developer with a high-level API for convenience, as the short sketch right after this list illustrates.
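To make that last point concrete, here is a minimal sketch in Scala. The lines DStream is hypothetical and is assumed to come from some source, such as the socket source used later in this article. An operation expressed on the DStream is conceptually the same as applying it to every underlying RDD, which the transform operation makes explicit.

import org.apache.spark.streaming.dstream.DStream

// `lines` is a hypothetical DStream[String], e.g. created from a socket
// source as in the word count example later in this article.
def toUpperCaseStream(lines: DStream[String]): DStream[String] = {
  // A high-level operation written against the DStream...
  val upper: DStream[String] = lines.map(_.toUpperCase)

  // ...is conceptually the same as applying the operation to every
  // underlying RDD of each batch interval; transform makes this explicit.
  val upperViaRdds: DStream[String] = lines.transform(rdd => rdd.map(_.toUpperCase))

  upper
}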

But Ankur, how do we understand this flow of live data and the Spark Streaming concept in simpler terms?

One has to understand a few terms in order to follow the flow of live data with Spark Streaming: batch duration, messages, RDDs and DStreams.

Let’s try to understand them using one of my experiences. I started my career at a start-up in Mumbai (India), where I used to see people collecting water from the public water system every day.

Suppose the Bombay Municipal Corporation (BMC) allowed the tap to run for the whole day; we can then consider it a producer producing data continuously. This is like a streaming use case where data flows continuously.

Now suppose, for our calculation, that the tap runs for 1 hour and every person is given 2 minutes to fill their bucket.

Let’s map this situation onto a streaming use case. Here, 2 minutes is the batch duration and every bucket is one RDD. The droplets of water are the messages or records.

As we know, the flow of water is not the same at every moment, so some buckets may contain more water and some less. In the same way, in streaming some RDDs might have more messages and some fewer: in one batch duration we might receive many messages, whereas in the next batch duration we might receive fewer.

Now, as the stream lasts 1 hour and the batch duration is 2 minutes, 30 people can fill their 30 buckets.

Here the group of buckets can be considered the DStream and each bucket one RDD.

So basically,

DStream -> RDD -> Messages (entities)

So here the DStream has 30 RDDs, and each RDD can contain many messages, but when we use Spark Streaming, operations are expressed on the DStream. Any operation on a DStream applies to all the underlying individual RDDs.

So underneath, Spark still works in a batch style; usually the batch duration is so small that we get the feeling of real-time streaming.
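As a small sketch of how this micro-batch behaviour shows up in code (messages here is a hypothetical DStream of records, like the ones discussed above), foreachRDD hands you exactly one RDD per batch interval:

import org.apache.spark.streaming.dstream.DStream

// `messages` is a hypothetical DStream[String] of incoming records.
def inspectBatches(messages: DStream[String]): Unit = {
  // foreachRDD is invoked once per batch interval, and the RDD it
  // receives holds only that interval's records, one "bucket" at a time.
  messages.foreachRDD { (rdd, time) =>
    println(s"Batch at $time contains ${rdd.count()} messages")
  }
}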

I hope my enthusiastic readers now understand the meaning of batch duration, messages, RDDs and DStreams.

A Quick Example

Before we start discussing the concepts of Spark Streaming in depth, let’s look at some quick code. I think a word count example is a really good way to start with any computing engine.

  • Step 1: Open your console or terminal and run the nc -lk 9000 command to simulate a producer.
  • Step 2: Now that our producer is ready, we have to start our consumer application to listen for the data. Here Spark Streaming will act as the consumer.
  • Step 3: In Spark Streaming one has to start a StreamingContext. The StreamingContext is the main entry point for all streaming functionality. Here, we will create a local StreamingContext (using the local[*] master, i.e. multiple execution threads) with a batch interval of 5 seconds.
  • Step 4: Then we will create a DStream that connects to a hostname:port pair, like localhost:9000.

For our word count example, we will simulate the producer from our console.

Socket => IP Address + Port number => { localhost + 9000}
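For example, once nc -lk 9000 is running, every line typed into that terminal becomes one record on the socket (the port number 9000 is an arbitrary choice; it just has to match the port used in the code below):

$ nc -lk 9000
hello world
hello spark streaming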

Let’s look at the code, and at the comments in the snippet, to understand the flow. The code below is written in Scala.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

// Enclosing object (the name is arbitrary) so that the example compiles as-is.
object WordCount {

  val BATCH_INTERVAL_TIME = 5

  def main(args: Array[String]): Unit = {

    // Initiating the Spark session, which will further help us
    // start the Spark streaming context.
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("theBigDataShow.com")
      .getOrCreate()

    // Setting the Spark log level to ERROR. This helps while debugging in the console.
    spark.sparkContext.setLogLevel("ERROR")

    // Start the streaming context.
    // This is the entry point for Spark Streaming.
    // In our code base the batch interval is 5 seconds,
    // which means a new RDD will be created every 5 seconds.
    val sc: SparkContext = spark.sparkContext

    val ssc = new StreamingContext(
      sc,
      batchDuration = Seconds(BATCH_INTERVAL_TIME)
    )

    // One could enable checkpointing, but as this is a very basic
    // example it is left commented out.
    // ssc.checkpoint("./checkpointing")

    // Using the streaming context, we create a DStream that
    // represents streaming data from a TCP source,
    // specified as hostname (e.g. localhost) and port (e.g. 9000).
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 9000)

    // This lines DStream represents the stream of data that
    // will be received from the data server.
    // Each record in this DStream is a line of text.
    // Next, we want to split the lines by space characters into words.
    val words: DStream[String] = lines.flatMap(x => x.split(" "))

    // flatMap is a one-to-many DStream operation that creates a
    // new DStream by generating multiple new records from each
    // record in the source DStream. In this case, each line will
    // be split into multiple words and the stream of words is
    // represented as the words DStream. Next, we want to count
    // these words.
    val wordsFrequencyMap: DStream[(String, Int)] = words.map(w => (w, 1))

    // Count each word in each batch.
    val wordsCounts: DStream[(String, Int)] = wordsFrequencyMap.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this
    // DStream to the console.
    wordsCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
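Once both the producer (the nc terminal) and this application are running, every line typed into the terminal is counted within its 5-second batch. For example, typing hello world hello during one batch interval should produce output similar to the following (the timestamp is only illustrative):

-------------------------------------------
Time: 1675590000000 ms
-------------------------------------------
(hello,2)
(world,1)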
In this article, we have tried to build the foundation of Spark Streaming, i.e. DStreams, RDDs and messages/entities/records.

In the next article of this series, we will try to understand a few more concepts:

  • Stateless and stateful transformations. We will understand the difference between them using the updateStateByKey and reduceByKey transformations.
  • The concepts of window size and sliding interval. We will use reduceByKeyAndWindow and reduceByWindow to understand them.
  • Joins in streaming, like stream-stream joins and stream-dataset joins.
  • Output operations in Spark Streaming.

I hope this small article sparks some interesting thoughts about Spark Streaming. Let’s meet in the next article to discuss the above topics.

Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.

Above all, thank you for the most precious gift you can give me as a writer, i.e. your time.

Originally published at https://www.linkedin.com.
