Category Archives: Programming language

SQL tips: filter first before join

When JOIN multiple tables, the direct and simple way is to sequentially join one-by-one, e.g. if you have a table with schema as, which record a pair of samples, saying ima & imb are item ids.

CREATE TABLE IF NOT EXISTS pair (
   ima BIGINT,
   imb BIGINT
)
PARTITION BY (
  ds STRING
)
LIFECYCLE 7
; 

Then each item is represented by a high-dimensional (e.g. 1024 dimension) vector, store in another table, like

CREATE TABLE IF NOT EXISTS emb (
   imid BIGINT,
   vec STRING
)
PARTITION BY (
  ds STRING
)
LIFECYCLE 7
; 

Note: vec is concatenating all vector elements into a string for easily store in SQL table. e.g. a 5-dimension vector [1,2,3,4,5], which is stored as "1,2,3,4,5" if comma as separator

In practice, emb table is often very large because it stores all history vector calculated. e.g. if the vector is image embedding. The item already calculated in the past is not re-calculated today for reduce duplicate calculation. While the item in pair table is often a subset of emb. In case of millions of items, emb table size may be ~TB/PB. If we need to calculate the cosine similarity of item pair in pair table, like

CREATE TABLE pair_cosine LIFECYCLE 7 AS
SELECT p.ima,p.imb, COSINE(a1.vec,a2.vec) FROM (
  SELECT ima, imb FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.ima=a1.imid
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a2
ON p.imb=a2.imid
;

Because emb table is TB/PB scale, there will be performance issue using the above naive script. A lot of time spends on loading emb table.

Considering the items in pair table being a relatively small subset of emb, it is more efficient if collecting subset embedding in temporary table, and then do above operation. The example script like

CREATE TABLE emb_a LIFECYCLE 7 AS
SELECT p.ima,a1.vec FROM (
  SELECT DISTINCT ima FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.ima=a1.imid
;

CREATE TABLE emb_b LIFECYCLE 7 AS
SELECT p.imb,a1.vec FROM (
  SELECT DISTINCT imb FROM pair WHERE ds=bizdate
)p
JOIN (
  SELECT imid, vec FROM emb WHERE ds=bizdate
)a1
ON p.imb=a1.imid
;

CREATE TABLE pair_cosine LIFECYCLE 7 AS
SELECT p.ima,p.imb, COSINE(a1.vec,a2.vec) FROM (
  SELECT ima, imb FROM pair WHERE ds=bizdate
)p
JOIN emb_a a1
ON p.ima=a1.ima
JOIN emb_b a2
ON p.imb=a2.imb
;

After above process optimization, performance will be improved.

Remember SQL cannot do all possible optimization for you. It is best practice to analyze the performance bottleneck and try to optimize block-by-block. Sometime it is not a big issue if enough resource. But resource is not always enough because big data always exp increasing.

SQL tips: get exclusive data set

In data science & data engineering, it is often required to extracting exclusive data set from a large-scale table in a business date comparing with a history data, e.g. daily new product incremental, weekly new users, ….. Here two ways are discussed using an example of how to calculate daily new products.

Method 1: use LEFT JOIN operator, e.g.

-- daily_biz_tb is table to store daily product with schema like 
-- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING)

DROP TABLE IF EXISTS daily_inc_tb;
CREATE TABLE daily_inc_tb LIFECYCLE 7 AS 
SELECT t1.* FROM (
  SELECT product_id, title 
  FROM daily_biz_tb 
  WHERE ds='${bizdate}'
)t1
LEFT JOIN (
  SELECT product_id, title 
  FROM daily_biz_tb 
  WHERE ds='${yesterday}'
)t2
ON t1.product_id=t2_product_id
WHERE t2.product_id IS NULL
;

The logic is that if product_id in today but not in yesterday, after left join, the column t2.product_id will be NULL in today.

Method 2: left anti join. The LEFT ANTI JOIN will be much elegant. It will exclude all item based on key product_id.

-- daily_biz_tb is table to store daily product with schema like 
-- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING)

DROP TABLE IF EXISTS daily_inc_tb;
CREATE TABLE daily_inc_tb LIFECYCLE 7 AS 
SELECT t1.* FROM (
  SELECT product_id, title 
  FROM daily_biz_tb 
  WHERE ds='${bizdate}'
)t1
LEFT ANTI JOIN (
  SELECT product_id, title 
  FROM daily_biz_tb 
  WHERE ds='${yesterday}'
)t2
ON t1.product_id=t2_product_id
;

Which one is recommended? Of course, left anti join. It is not only elegant but also performance better when from a big table excluding data in a small table. e.g. 1T table excluding 1K data in small table. In practice, method 1 sometimes has performance issue while method 2 fast.

Tips: mapjoin in long-tail optimization in SQL

When joining a big table, saying A, (e.g.10M records), with a small table, saying B (e.g. 1000 records), based on some keys, the long-tail issue may popup. For example, if you have 10M product & category name records, A, and another product name and category name identity table, B. Now you want a new table patch product, category name, & category name identity together, you may write code like,

DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT A.product, A.category_name, B.cat_id 
FROM A
JOIN B
ON A.category_name=B.category_name
;

Then you find the job running slowly, even instances allocated many, saying 1000. When you investigate the logview, it is found many instances completed very fast, only a few, saying 10 instances running very slow. It is the long-tail issue, i.e. some instances allocated too many samples than others, which causes overall performance very worse.

To solve the issue, what you need is to allocate almost same samples among all instances. In the case, mapjoin, can help you. The magic mapjoin can significantly improve SQL code performance.

DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT /*+ mapjoin(B) */
A.product, A.category_name, B.cat_id 
FROM A
JOIN B
ON A.category_name=B.category_name
;

Rather than single small table, mapjoin can support join multiple small tables at the same time. For example, if there is another table, saying C, which mapping category name into nation, i.e. category name is nation dependent. Just write code like

DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT /*+ mapjoin(B,C) */
A.product, A.category_name, B.cat_id, C.nation 
FROM A
JOIN B
ON A.category_name=B.category_name
JOIN C
ON A.category_name=C.category_name
;

How to setup PostgreSQL in Python

  • Install Psycopg2 package in Python
pip install psycopg2 or
pipenv install psycopg2 (virtual python environment)
  • Install postgresql in Ubuntu
sudo apt update
sudo apt install postgresql postgresql-contrib
  • After successfully install postgresql, the default role is postgres. So when you connect postgresql service in your own Ubuntu user, you cannot connect it, reporting the username not in the role. Thus create a role using your own username, and add your username in ROLE of postgres
psql postgres 

postgres=# CREATE ROLE ubuntu_username superuser;
postgres=# ALTER ROLE ubuntu_username WITH LOGIN;
  • In python, connect postgresql, e.g.
import psycopg2

conn = psycopg2.connect("dbname=test user=user_name")

Learn programming – C++ basic

I will share some slides being prepared to teach basic C++ to children. C++ is beautiful and useful, particularly when your system has high requirements on efficiency with limited resources. Although high level languages such as Python, are widely used in machine learning, the backbone library is in written in C++ most of times. These slides only include basic C++, not including advanced topics such as class, template, memory management, ……

Practice coding is best way to learning any programming. For each lesson, samples codes are taught. If you are interest, please contact me.

Lesson 1

Lesson 1

Lesson 3

Lesson 4

Lesson 5

Lesson 6

Lesson 7

« Older Entries