Do big data stream processing in the stream way

By Eric Ma, Nov 27, 2018 / Nov 21, 2019

Reading: Years in Big Data. Months with Apache Flink. 5 Early Observations With Stream Processing: https://data-artisans.com/blog/early-observations-apache-flink.

The article suggests adopting the right solution, Flink, for big data processing. Flink is interesting and is built for stream processing.

The broader takeaway may be to solve each problem with the right solution. We have seen many painful attempts in history, and still see them in current practice: handling very large-scale data in traditional databases, processing unstructured data in relational databases, doing graph processing with tables, doing stream processing in micro-batches, and so on. A specific problem should be handled by a solution built for that problem, and that solution can be the most efficient and convenient one.

Some good examples and points from the article.

“In reality, however, processing data with as low latency as possible has been a challenge for a long time….a customer asked me how to produce an up-to-date aggregation over a tumbling five-minute window of a growing table using Hive.”
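To make the "tumbling five-minute window" concrete, here is a minimal sketch in plain Python of what a per-event aggregation could look like. It is not Flink code and not the approach from the article; the class `TumblingSum`, the helper `window_start`, the key `sensor-a`, and the sample events are all hypothetical names chosen for illustration. The point is only that each event updates the running result for its window as it arrives, instead of re-scanning a growing table.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # five-minute tumbling window (assumed unit: seconds)


def window_start(timestamp: int) -> int:
    """Return the start of the tumbling window that contains `timestamp`."""
    return timestamp - (timestamp % WINDOW_SECONDS)


class TumblingSum:
    """Hypothetical aggregator: one running sum per (key, window start)."""

    def __init__(self):
        self.state = defaultdict(float)

    def on_event(self, key: str, timestamp: int, value: float):
        """Update the sum for the event's window and return the fresh result."""
        slot = (key, window_start(timestamp))
        self.state[slot] += value
        return slot, self.state[slot]


def demo():
    """Feed a few made-up events through the aggregator."""
    agg = TumblingSum()
    events = [
        ("sensor-a", 0, 1.0),
        ("sensor-a", 120, 2.0),
        ("sensor-a", 299, 3.0),   # still in the window starting at 0
        ("sensor-a", 300, 10.0),  # first event of the next window
    ]
    for key, ts, value in events:
        agg.on_event(key, ts, value)
    return dict(agg.state)


if __name__ == "__main__":
    for slot, total in sorted(demo().items()):
        print(slot, total)
```

This sketch ignores out-of-order events, window expiry, and fault tolerance, which are exactly the parts a stream processor is built to handle.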

“the customer and business user really need: a representation of data as a stream and the ability to do in-stream complex/stateful analytics.”

“Customers and end-users wrangle with the latency gap in all kinds of interesting and expensive ways.”

“it’s refreshing to be given constructs of stream, state, time and snapshots as the building blocks of event processing rather than incomplete concepts of keys, values, and execution phases.”

“The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead–hence the proportion of the overhead increases as you try to reduce latency.”
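The overhead argument in this quote can be checked with simple arithmetic. The sketch below assumes a simplified model in which every micro-batch pays the same fixed cost (for scheduling, task launch, and so on) regardless of its size; the function name `overhead_fraction` and the 50 ms figure are made-up assumptions for illustration, not measurements of any real system.

```python
def overhead_fraction(batch_interval_ms: float, fixed_overhead_ms: float) -> float:
    """Share of each batch interval spent on fixed per-batch overhead.

    Simplified model: every batch costs `fixed_overhead_ms` no matter how
    little data it contains. The result is capped at 1.0, the point where
    the whole interval is consumed by overhead.
    """
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    return min(1.0, fixed_overhead_ms / batch_interval_ms)


if __name__ == "__main__":
    FIXED_OVERHEAD_MS = 50.0  # hypothetical per-batch cost
    for interval_ms in (5000.0, 500.0, 100.0, 50.0):
        share = overhead_fraction(interval_ms, FIXED_OVERHEAD_MS)
        print(f"interval={interval_ms:>6.0f} ms  overhead share={share:.0%}")
```

Under these assumed numbers, shrinking the batch interval from 5 seconds to 100 ms raises the overhead share from 1% to 50%, which is the effect the quote describes: the closer the batch interval gets to the fixed cost, the less time is left for useful work.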

“However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real-time?”

“requirements around low latency processing and complex analysis cannot be met in an inexpensive, scalable and fault-tolerant way.”

One Comment

Yassir says:

May 7, 2019 at 12:43 am

Hi,
Interesting article. I want to know whether there is a research field that focuses on processing data in motion versus data at rest, because analysing IoT data streams in transit is better than batch processing in the cloud.
