What is the most likely cause of the delay for this query?

You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:

What is the most likely cause of the delay for this query?
A. Users are running too many concurrent queries in the system
B. The [myproject:mydataset.mytable] table has too many partitions
C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew

SHOW ANSWERS

Download Printable PDF. VALID exam to help you PASS.

2 thoughts on “What is the most likely cause of the delay for this query?”

Answer is D since too many identical values (or even NULLs) in the grouping column ‘country’ causes data skew. See https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
Answer is not C since city and state are not grouping columns and hence the NULLs don’t matter.

This is evil, we have to learn the colors of the big query analizer 🙁
so, on the left is blue

A is not correct, if we are using more than 2000 slots it will not start
B looks right, if we have many partitions querys will go slow on aggregated selects like this since it’ll have to read many files (as much as 2000)
C is not correct, BigQuery won’t index null values / won’t scan null values on where clause for efficiency
D does not look right, that happends when we use join in unbalanced data.

2 thoughts on “What is the most likely cause of the delay for this query?”

Leave a Reply Cancel reply