How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?

Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period.
However, you realize that in some instances data can arrive late or out of order.
How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
A. Set a single global window to capture all the data.
B. Set sliding windows to capture all the lagged data.
C. Use watermarks and timestamps to capture the lagged data.
D. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.


2 thoughts on “How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?”

  1. The correct answer is C.
    A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that is inside the window but older than the watermark, that data is considered late data. Dataflow tracks watermarks because:
    - Data is not guaranteed to arrive in time order or at predictable intervals.
    - Data events are not guaranteed to appear in pipelines in the same order in which they were generated.
    (A minimal windowing/trigger sketch of this follows after these comments.)

  2. Great question, this one. Not easy.

    A is a direct no: if the data don’t have timestamps, we only have the processing time and not the “event time”.
    B isn’t right either; sliding windows are not meant for this. Hopping/sliding windowing is useful for taking running averages of the data, but not for processing late data.

    D looks correct but is missing one concept: the watermark, which tells us whether processing time has kept up with event time. I’m not 100% sure it’s incorrect. Since we have a “predictable time period”, maybe it would do. I mean, if our dashboard is only shown after the last input record has arrived (single global window), this should be OK; we’d effectively have a “perfect watermark”. Either way, we’d still need triggering.

    C is, I think, the correct answer. A watermark is different from late data: in the implementation, a watermark is a monotonically increasing timestamp, and when Beam/Dataflow sees a record with an event timestamp earlier than the watermark, that record is treated as late data.

    I’ll try to explain: late data is inherent to Beam’s model for out-of-order processing. What does it mean for data to be late? The definition and its properties are intertwined with watermarks, which track the progress of each computation across the event-time domain. The simple intuition behind handling lateness is this: only late input should result in late data anywhere in the pipeline.

    So it’s not easy to decide between C and D. If you ask me, I’d say C, since D requires us to make some extra assumptions. (The second sketch after these comments shows how timestamps get attached so the watermark can do its job.)
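Here is a minimal Apache Beam (Python SDK) sketch of what answer C looks like in practice: fixed event-time windows whose output is driven by the watermark, with allowed lateness so records that arrive after the watermark still land in the window they belong to. The Pub/Sub topic name, window size, and lateness values are illustrative assumptions, not anything from the question.

```python
# Sketch only: watermark-driven trigger plus allowed lateness, so late records
# are still assigned to the event-time window they belong to instead of dropped.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

options = PipelineOptions(streaming=True)  # Pub/Sub source => streaming mode

with beam.Pipeline(options=options) as p:
    _ = (
        p
        # Hypothetical topic name, used only for illustration.
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),            # 1-minute windows in event time
            trigger=AfterWatermark(             # main pane fires when the watermark
                late=AfterProcessingTime(60)),  # passes the window end; late panes fire for stragglers
            allowed_lateness=600,               # keep window state 10 min for late data
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | "CountPerWindow" >> beam.combiners.Count.Globally().without_defaults()
    )
```

The allowed_lateness value is the main trade-off knob here: a larger value keeps window state around longer in exchange for catching more stragglers.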
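And a sketch of the point raised about option D: bounded (batch) records often carry no event-time timestamp, so one has to be attached before windowing; only then can the watermark reason about lateness in event time rather than arrival time. The file path and column layout below are made-up assumptions.

```python
# Hedged sketch: attach an event-time timestamp to each batch record so that
# downstream windowing/triggers (and the watermark) operate in event time.
import csv

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


def with_event_time(line):
    """Parse a CSV line and re-emit it stamped with its own event time."""
    fields = next(csv.reader([line]))
    event_time = float(fields[0])  # assumption: first column is a Unix timestamp
    return TimestampedValue(fields, event_time)


with beam.Pipeline() as p:
    _ = (
        p
        # Hypothetical file path, used only for illustration.
        | "ReadBatchFile" >> beam.io.ReadFromText("gs://my-bucket/events.csv")
        | "AttachEventTime" >> beam.Map(with_event_time)
        # ...followed by the same WindowInto / trigger setup as in the first sketch.
    )
```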
