Do big data stream processing in the stream way

Reading: Years in Big Data. Months with Apache Flink. 5 Early Observations With Stream Processing:

The article suggests adopting the right solution, Flink, for big data processing. Flink is interesting because it is built for stream processing.

The broader takeaway may be to solve each problem with the right solution. History and current practice are full of painful attempts to do otherwise: handling very large-scale data in traditional databases, processing unstructured data in relational databases, processing graphs as tables, processing streams as micro-batches, and so on. A specific problem should be handled by a solution built for that problem, since that solution is likely to be the most efficient and convenient one.

Some good examples and points from the article:

“In reality, however, processing data with as low latency as possible has been a challenge for a long time….a customer asked me how to produce an up-to-date aggregation over a tumbling five-minute window of a growing table using Hive.”
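To make the "tumbling five-minute window" in that quote concrete, here is a minimal sketch in plain Python of what an in-stream aggregation does. It is not Flink's API and not the article's code: the function name `tumbling_window_sums`, the `(timestamp, value)` event shape, and the assumption that events arrive in timestamp order are all mine, for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def tumbling_window_sums(
    events: Iterable[Tuple[int, int]], window_seconds: int = 300
) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, sum) for each tumbling window.

    Assumption: events are (epoch_seconds, value) pairs in timestamp order.
    A window's result is emitted as soon as an event for a later window
    arrives, so only the state of the current window is kept in memory.
    """
    current_start = None
    total = 0
    for ts, value in events:
        start = ts - ts % window_seconds  # align to the window boundary
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it and reset the state.
            yield current_start, total
            current_start, total = start, 0
        total += value
    if current_start is not None:
        # Flush the last, still-open window when the input ends.
        yield current_start, total


if __name__ == "__main__":
    sample = [(0, 1), (120, 2), (299, 3), (300, 4), (610, 5)]
    for window_start, window_sum in tumbling_window_sums(sample):
        print(window_start, window_sum)
    # prints:
    # 0 6
    # 300 4
    # 600 5
```

The point of the sketch is that each event is processed once as it arrives and only the running state is kept, instead of re-scanning a growing table every time a fresh result is needed.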

“the customer and business user really need: a representation of data as a stream and the ability to do in-stream complex/stateful analytics.”

“Customers and end-users wrangle with the latency gap in all kinds of interesting and expensive ways.”

“it’s refreshing to be given constructs of stream, state, time and snapshots as the building blocks of event processing rather than incomplete concepts of keys, values, and execution phases.”

“The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead–hence the proportion of the overhead increases as you try to reduce latency.”
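The overhead argument in that quote can be checked with a little arithmetic. The sketch below assumes a very simple model of my own: every micro-batch pays the same fixed overhead (scheduling, task launch and so on), so the share of time lost to overhead is the overhead divided by the batch interval. The 50 ms figure is a made-up number for illustration, not a measurement of any real system.

```python
def overhead_fraction(fixed_overhead_ms: float, batch_interval_ms: float) -> float:
    """Share of each batch interval spent on fixed per-batch overhead.

    Assumption: the overhead per batch is constant and independent of
    the batch size, which is a simplification.
    """
    return fixed_overhead_ms / batch_interval_ms


if __name__ == "__main__":
    FIXED_OVERHEAD_MS = 50  # hypothetical per-batch cost
    for interval_ms in (10_000, 1_000, 100):
        share = overhead_fraction(FIXED_OVERHEAD_MS, interval_ms)
        print(f"interval {interval_ms} ms -> overhead {share:.1%}")
    # prints:
    # interval 10000 ms -> overhead 0.5%
    # interval 1000 ms -> overhead 5.0%
    # interval 100 ms -> overhead 50.0%
```

Under this model, shrinking the batch interval from ten seconds to a hundred milliseconds raises the overhead share from negligible to half of the available time, which is the effect the quote describes.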

“However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real-time?”

“requirements around low latency processing and complex analysis cannot be met in an inexpensive, scalable and fault-tolerant way.”

Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

One comment:

  1. Hi,
    Interesting article. I want to know if there is a research field that focuses on processing data in motion versus data at rest, because analysing IoT data streams in transit is better than batch processing in the cloud.
