What should you do?

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
A. Add a SideInput that returns a Boolean if the element is corrupt.
B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.


5 thoughts on “What should you do?”

  1. Correct answer is B.
    The question asks you to filter out the corrupt data, so use a ParDo transform with a DoFn that discards the bad records and outputs a PCollection of good data. If you wanted to keep the bad data, option C would fit instead, since Partition can split the PCollection into two (good and bad) based on whether each record is valid.
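A minimal sketch of the ParDo approach from this comment, using the Apache Beam Python SDK. The question does not say what makes a record corrupt, so the rule here (JSON object with a `device_id` and a numeric `temperature`) is an assumption, as are the names `parse_if_valid`, `DropCorrupt`, and `run_local_demo`. A real pipeline would read from Pub/Sub and write to BigQuery rather than use `beam.Create` and `print`.

```python
import json


def parse_if_valid(raw):
    """Yield the decoded record if it looks well formed; yield nothing otherwise.

    The validity rule is hypothetical: a JSON object that has a "device_id"
    key and a numeric "temperature" value.
    """
    try:
        record = json.loads(raw)
    except (ValueError, TypeError):
        return  # not parseable as JSON: emit nothing, which drops the element
    if (isinstance(record, dict)
            and "device_id" in record
            and isinstance(record.get("temperature"), (int, float))):
        yield record


def run_local_demo(raw_messages):
    """Run the filter on in-memory messages with Beam's default local runner."""
    # Imported here so parse_if_valid can be tested without the Beam SDK installed.
    import apache_beam as beam

    class DropCorrupt(beam.DoFn):
        """DoFn that outputs valid records and discards corrupt ones."""

        def process(self, element):
            yield from parse_if_valid(element)

    with beam.Pipeline() as pipeline:
        (pipeline
         | "CreateInput" >> beam.Create(raw_messages)
         | "DropCorrupt" >> beam.ParDo(DropCorrupt())
         | "Print" >> beam.Map(print))
```

Because a DoFn may emit zero outputs for an element, simply not yielding is what discards the corrupt record; no separate filter step is needed.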

  2. B. ParDo is a Beam transform for generic parallel processing. ParDo is useful for common data processing operations, including:
    a. Filtering a data set. You can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.
    b. Formatting or type-converting each element in a data set.
    c. Extracting parts of each element in a data set.
    d. Performing computations on each element in a data set.
    A does not help: a side input supplies extra data to a transform; it does not remove elements by itself.
    C. Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections. Again, this does not help, since the goal is to discard the corrupt data, not to keep it in a separate collection.
    D. GroupByKey is a Beam transform for processing collections of key/value pairs. GroupByKey is a good way to aggregate data that has something in common, but it does not filter elements.
    Hence B.

  3. Ideally, we should not discard corrupt elements. We should keep the corrupt data separately for later analysis.
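If the corrupt records are to be kept for later analysis, as this comment suggests, one option is the Partition transform from answer C. The sketch below uses the Apache Beam Python SDK; the validity rule and the names `is_valid`, `by_validity`, and `run_local_demo` are assumptions for illustration, not part of the question.

```python
import json

# Partition indexes; the ordering is an arbitrary choice for this sketch.
VALID, CORRUPT = 0, 1


def is_valid(raw):
    """Hypothetical rule: a JSON object with "device_id" and a numeric "temperature"."""
    try:
        record = json.loads(raw)
    except (ValueError, TypeError):
        return False
    return (isinstance(record, dict)
            and "device_id" in record
            and isinstance(record.get("temperature"), (int, float)))


def by_validity(raw, num_partitions):
    """Partition function: Beam passes the element and the number of partitions."""
    return VALID if is_valid(raw) else CORRUPT


def run_local_demo(raw_messages):
    """Split in-memory messages into valid and corrupt collections and print both."""
    # Imported here so the helpers above can be tested without the Beam SDK installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        valid, corrupt = (
            pipeline
            | "CreateInput" >> beam.Create(raw_messages)
            | "SplitByValidity" >> beam.Partition(by_validity, 2))
        # In a real pipeline, valid would go to BigQuery and corrupt to a
        # separate sink for later inspection.
        valid | "PrintValid" >> beam.Map(lambda m: print("valid:", m))
        corrupt | "PrintCorrupt" >> beam.Map(lambda m: print("corrupt:", m))
```

Both output collections remain available downstream, so nothing is lost; the trade-off is an extra sink to maintain for the corrupt records.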
