You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:
What is the most likely cause of the delay for this query?
A. Users are running too many concurrent queries in the system
B. The [myproject:mydataset.mytable] table has too many partitions
C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
Answer is D since too many identical values (or even NULLs) in the grouping column ‘country’ causes data skew. See https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
Answer is not C since city and state are not grouping columns and hence the NULLs don’t matter.
This is evil, we have to learn the colors of the big query analizer 🙁
so, on the left is blue
A is not correct, if we are using more than 2000 slots it will not start
B looks right, if we have many partitions querys will go slow on aggregated selects like this since it’ll have to read many files (as much as 2000)
C is not correct, BigQuery won’t index null values / won’t scan null values on where clause for efficiency
D does not look right, that happends when we use join in unbalanced data.