SQL tips: get exclusive data set
In data science & data engineering, it is often required to extracting exclusive data set from a large-scale table in a business date comparing with a history data, e.g. daily new product incremental, weekly new users, ….. Here two ways are discussed using an example of how to calculate daily new products.
Method 1: use LEFT JOIN operator, e.g.
-- daily_biz_tb is table to store daily product with schema like -- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING) DROP TABLE IF EXISTS daily_inc_tb; CREATE TABLE daily_inc_tb LIFECYCLE 7 AS SELECT t1.* FROM ( SELECT product_id, title FROM daily_biz_tb WHERE ds='${bizdate}' )t1 LEFT JOIN ( SELECT product_id, title FROM daily_biz_tb WHERE ds='${yesterday}' )t2 ON t1.product_id=t2_product_id WHERE t2.product_id IS NULL ;
The logic is that if product_id in today but not in yesterday, after left join, the column t2.product_id will be NULL in today.
Method 2: left anti join. The LEFT ANTI JOIN will be much elegant. It will exclude all item based on key product_id.
-- daily_biz_tb is table to store daily product with schema like -- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING) DROP TABLE IF EXISTS daily_inc_tb; CREATE TABLE daily_inc_tb LIFECYCLE 7 AS SELECT t1.* FROM ( SELECT product_id, title FROM daily_biz_tb WHERE ds='${bizdate}' )t1 LEFT ANTI JOIN ( SELECT product_id, title FROM daily_biz_tb WHERE ds='${yesterday}' )t2 ON t1.product_id=t2_product_id ;
Which one is recommended? Of course, left anti join. It is not only elegant but also performance better when from a big table excluding data in a small table. e.g. 1T table excluding 1K data in small table. In practice, method 1 sometimes has performance issue while method 2 fast.