Tag Archives: sql

SQL tips: filter first before join

When JOIN multiple tables, the direct and simple way is to sequentially join one-by-one, e.g. if you have a table with schema as, which record a pair of samples, saying ima & imb are item ids.

CREATE TABLE IF NOT EXISTS pair (
   ima BIGINT,
   imb BIGINT
)
PARTITION BY (
  ds STRING
)
LIFECYCLE 7
; 

Then each item is represented by a high-dimensional (e.g. 1024 dimension) vector, store in another table, like

CREATE TABLE IF NOT EXISTS emb (
   imid BIGINT,
   vec STRING
)
PARTITION BY (
  ds STRING
)
LIFECYCLE 7
; 

Note: vec is concatenating all vector elements into a string for easily store in SQL table. e.g. a 5-dimension vector [1,2,3,4,5], which is stored as "1,2,3,4,5" if comma as separator

In practice, emb table is often very large because it stores all history vector calculated. e.g. if the vector is image embedding. The item already calculated in the past is not re-calculated today for reduce duplicate calculation. While the item in pair table is often a subset of emb. In case of millions of items, emb table size may be ~TB/PB. If we need to calculate the cosine similarity of item pair in pair table, like

CREATE TABLE pair_cosine LIFECYCLE 7 AS
SELECT p.ima,p.imb, COSINE(a1.vec,a2.vec) FROM (
  SELECT ima, imb FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.ima=a1.imid
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a2
ON p.imb=a2.imid
;

Because emb table is TB/PB scale, there will be performance issue using the above naive script. A lot of time spends on loading emb table.

Considering the items in pair table being a relatively small subset of emb, it is more efficient if collecting subset embedding in temporary table, and then do above operation. The example script like

CREATE TABLE emb_a LIFECYCLE 7 AS
SELECT p.ima,a1.vec FROM (
  SELECT DISTINCT ima FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.ima=a1.imid
;

CREATE TABLE emb_b LIFECYCLE 7 AS
SELECT p.imb,a1.vec FROM (
  SELECT DISTINCT imb FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.imb=a1.imid
;

CREATE TABLE pair_cosine LIFECYCLE 7 AS
SELECT p.ima,p.imb, COSINE(a1.vec,a2.vec) FROM (
  SELECT ima, imb FROM pair WHERE ds=bizdate
)p
JOIN emb_a a1
ON p.ima=a1.ima
JOIN emb_b a2
ON p.imb=a2.imb
;

After above process optimization, performance will be improved.

Remember SQL cannot do all possible optimization for you. It is best practice to analyze the performance bottleneck and try to optimize block-by-block. Sometime it is not a big issue if enough resource. But resource is not always enough because big data always exp increasing.

SQL optimization – long tail affecting performance

It often happens that you need join multiple tables (more than two) into one and do some computation. the you will find many instances are long tails. All other instances complete, long tail instances still running. For example, normal instances may 1 hour completed but long tail may more than 24 hours even worse it halt there.

For example, write a sql to do text information retrieval, ie, a query document table to match a target document database, with constrain of only selected query and target document pair are needed to calculate score.

Let saying 3 tables are query, target, candidate. Normally, you can join 3 at one time, like

Select other-score ……
from query q
Join target t
On q.word=t.word
Join candidate c
On q.id=c.query_id
And t.id=c.target_id
……

In big data, query, target and candidate maybe millions, which is normal. But word distribution along document is very skew, following 80-20 rule or Pareto principle. It will cause above join operation having serious long tail issue.

In my practice, some normal instances completed about a few hours, but long tails running more than 1 day and still halting there without any progress.

After analyzing, change 3 join at one time into first join query and candidate, and their result join target. Only this logic change, the task complete at 1 hour without any long tail, data skew. Comparing with before , the task cannot successfully completed even after running 24 hours.

It is amazing performance improvement.