You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
A. Add a SideInput that returns a Boolean if the element is corrupt.
B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
B
Correct answer is B.
The question asks you to filter out the corrupt data, so use a ParDo transform with a DoFn that discards bad records, which leaves a PCollection of good data. If you wanted to keep the corrupt data instead, option C would also work, since Partition can split the PCollection into two (valid and corrupt) based on a per-record check.
B. ParDo is a Beam transform for generic parallel processing. ParDo is useful for common data processing operations, including:
a. Filtering a data set. You can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.
b. Formatting or type-converting each element in a data set.
c. Extracting parts of each element in a data set.
d. Performing computations on each element in a data set.
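A minimal sketch of the filtering DoFn that option B describes. It is written in plain Python so it runs without Beam installed; in a real pipeline you would subclass `beam.DoFn` and apply it with `beam.ParDo`. The `is_corrupt` check and the record fields are hypothetical placeholders, not part of the question.

```python
def is_corrupt(record):
    # Hypothetical corruption check: here, a record missing "device_id"
    # is treated as corrupt. Replace with your own schema validation.
    return "device_id" not in record

class FilterCorrupt:
    """Mimics a Beam DoFn: process() yields zero or one output per input."""
    def process(self, element):
        # Yield only valid elements; corrupt ones are simply dropped,
        # which is exactly the "filter" use of ParDo.
        if not is_corrupt(element):
            yield element

def run_pardo(dofn, pcollection):
    # Stand-in for beam.ParDo: apply process() to every element.
    for element in pcollection:
        yield from dofn.process(element)

records = [
    {"device_id": "d1", "temp": 21.5},
    {"temp": 19.0},                      # corrupt: no device_id
    {"device_id": "d2", "temp": 22.1},
]
clean = list(run_pardo(FilterCorrupt(), records))  # only the two valid records remain
```

In the Beam SDK itself this would be `pcoll | beam.ParDo(FilterCorrupt())`, or even just `beam.Filter(lambda r: not is_corrupt(r))` for a simple predicate.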
A. A side input alone does not filter anything; you would still need a transform to act on the Boolean, so this does not help.
C. Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections. It separates the corrupt data rather than discarding it, so it does not directly answer the question.
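For completeness, a sketch of the Partition approach mentioned above, again in plain Python so it runs standalone (a real pipeline would use `beam.Partition` with a partition function returning an index in `0..n-1`). The corruption check is a hypothetical example.

```python
def partition_fn(record, num_partitions):
    # Route valid records to partition 0 and corrupt ones to partition 1.
    # Hypothetical check: a record missing "device_id" counts as corrupt.
    return 1 if "device_id" not in record else 0

def run_partition(fn, num_partitions, pcollection):
    # Stand-in for beam.Partition: bucket each element by fn's index.
    outputs = [[] for _ in range(num_partitions)]
    for element in pcollection:
        outputs[fn(element, num_partitions)].append(element)
    return outputs

records = [
    {"device_id": "d1", "temp": 21.5},
    {"temp": 19.0},                      # corrupt: no device_id
    {"device_id": "d2", "temp": 22.1},
]
valid, corrupt = run_partition(partition_fn, 2, records)
```

This is the shape you would want if, as the last comment suggests, the corrupt records should be kept (e.g. written to a dead-letter table) instead of discarded.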
D. GroupByKey is a Beam transform for processing collections of key/value pairs. It is a way to aggregate data that shares a key, not a filtering tool.
Hence, B.
The data is corrupt and of no use anyway. Answer: B.
Ideally, we should not discard corrupt elements. We should keep the corrupt data separately for later analysis.