Tips: mapjoin in long-tail optimization in SQL
When joining a big table, saying A, (e.g.10M records), with a small table, saying B (e.g. 1000 records), based on some keys, the long-tail issue may popup. For example, if you have 10M product & category name records, A, and another product name and category name identity table, B. Now you want a new table patch product, category name, & category name identity together, you may write code like,
DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT A.product, A.category_name, B.cat_id
FROM A
JOIN B
ON A.category_name=B.category_name
;
Then you find the job running slowly, even instances allocated many, saying 1000. When you investigate the logview, it is found many instances completed very fast, only a few, saying 10 instances running very slow. It is the long-tail issue, i.e. some instances allocated too many samples than others, which causes overall performance very worse.
To solve the issue, what you need is to allocate almost same samples among all instances. In the case, mapjoin, can help you. The magic mapjoin can significantly improve SQL code performance.
DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT /*+ mapjoin(B) */
A.product, A.category_name, B.cat_id
FROM A
JOIN B
ON A.category_name=B.category_name
;
Rather than single small table, mapjoin can support join multiple small tables at the same time. For example, if there is another table, saying C, which mapping category name into nation, i.e. category name is nation dependent. Just write code like
DROP TABLE IF EXISTS prd_cat_id;
CREATE TABLE IF NOT EXISTS prd_cat_id LIFECYCLE 7 AS
SELECT /*+ mapjoin(B,C) */
A.product, A.category_name, B.cat_id, C.nation
FROM A
JOIN B
ON A.category_name=B.category_name
JOIN C
ON A.category_name=C.category_name
;