
Learn programming – C++ basic

I am sharing some slides that I am preparing to teach basic C++ to children. C++ is beautiful and useful, particularly when a system must be highly efficient with limited resources. Although high-level languages such as Python are widely used in machine learning, the underlying libraries are written in C++ most of the time. These slides cover only basic C++ and leave out advanced topics such as classes, templates, and memory management.

Practicing coding is the best way to learn any programming language. Each lesson comes with sample code. If you are interested, please contact me.

Lesson 1

Lesson 2

Lesson 3

Lesson 4

Lesson 5

Lesson 6

Lesson 7

SQL optimization – long tail affecting performance

It often happens that you need to join multiple tables (more than two) into one and do some computation. Then you will find that many instances become long tails: all other instances complete, but the long-tail instances are still running. For example, a normal instance may complete in 1 hour, while a long-tail instance may run for more than 24 hours or, even worse, hang.

For example, suppose you write SQL for text information retrieval, i.e., matching a table of query documents against a database of target documents, with the constraint that a score is needed only for selected pairs of query and target documents.

Let's say the 3 tables are query, target, and candidate. Normally, you would join all 3 at once, like this:

SELECT other-score ……
FROM query q
JOIN target t
  ON q.word = t.word
JOIN candidate c
  ON q.id = c.query_id
 AND t.id = c.target_id
……

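With big data, query, target, and candidate may each hold millions of rows, which is normal. But the distribution of words across documents is highly skewed, following the 80-20 rule (the Pareto principle). Because the first join matches on word, a few very common words produce most of the joined rows, and this gives the join above a serious long-tail problem.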
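To make the skew concrete, here is a minimal Python sketch that counts how many rows a join on word would emit for each word. The toy data and the helper name `join_rows_per_word` are made up for illustration and are not from the original job; the word "the" is deliberately over-represented to mimic a Pareto-like distribution. In engines that send all rows with the same join key to the same instance (a common design, assumed here), the instance that receives the hot word would become the long tail.

```python
from collections import Counter

# Hypothetical toy data: (doc_id, word) rows for the query and target tables.
# Every document contains the common word "the" plus one rare word of its own.
query_rows = [(q, "the") for q in range(1000)] + [(q, f"rare{q}") for q in range(1000)]
target_rows = [(t, "the") for t in range(1000)] + [(t, f"rare{t}") for t in range(1000)]


def join_rows_per_word(left, right):
    """Return {word: number of rows an equi-join on word emits for that word}."""
    left_counts = Counter(word for _, word in left)
    right_counts = Counter(word for _, word in right)
    # For one join key, the join emits (left matches) x (right matches) rows.
    return {w: left_counts[w] * right_counts[w] for w in left_counts if w in right_counts}


per_word = join_rows_per_word(query_rows, target_rows)
total = sum(per_word.values())
hot_word, hot_rows = max(per_word.items(), key=lambda kv: kv[1])

print(f"hot word: {hot_word!r}, rows for it: {hot_rows}, rows in total: {total}")
```

In this toy setup almost all joined rows belong to a single word, so whichever instance handles that word does nearly all the work while the others finish quickly.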

In my practice, normal instances completed in a few hours, but the long-tail instances ran for more than 1 day and were stuck without any progress.

After analyzing the problem, I changed the single 3-way join into two steps: first join query and candidate, then join that result with target. With only this logic change, the task completes in 1 hour without any long tail or data skew. Before the change, the task could not complete even after running for 24 hours.

It is an amazing performance improvement.
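The rewrite described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the original job ran on a different engine, the sample rows are invented, and `COUNT(*)` of shared words stands in for the real score, which the post does not show. The name `query_candidate` for the intermediate result is also hypothetical. SQLite chooses its own join order, so the sketch checks only that both query shapes return the same rows, not that one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE query (id INTEGER, word TEXT);
    CREATE TABLE target (id INTEGER, word TEXT);
    CREATE TABLE candidate (query_id INTEGER, target_id INTEGER);
""")
# Invented sample rows; "the" plays the role of a very common word.
conn.executemany("INSERT INTO query VALUES (?, ?)",
                 [(1, "the"), (1, "cat"), (2, "the"), (2, "dog")])
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(10, "the"), (10, "cat"), (20, "the"), (20, "dog")])
conn.executemany("INSERT INTO candidate VALUES (?, ?)", [(1, 10), (2, 20)])

# Original shape: join query and target on word first, then filter by candidate.
ONE_SHOT = """
    SELECT q.id, t.id, COUNT(*) AS score
    FROM query q
    JOIN target t ON q.word = t.word
    JOIN candidate c ON q.id = c.query_id AND t.id = c.target_id
    GROUP BY q.id, t.id
    ORDER BY q.id, t.id
"""

# Rewritten shape: join query and candidate first, then join target on
# both the document id and the word, so a common word no longer pairs
# every query document with every target document.
TWO_STAGE = """
    WITH query_candidate AS (
        SELECT q.id AS query_id, q.word, c.target_id
        FROM query q
        JOIN candidate c ON q.id = c.query_id
    )
    SELECT qc.query_id, t.id, COUNT(*) AS score
    FROM query_candidate qc
    JOIN target t ON qc.target_id = t.id AND qc.word = t.word
    GROUP BY qc.query_id, t.id
    ORDER BY qc.query_id, t.id
"""

one_shot_rows = conn.execute(ONE_SHOT).fetchall()
two_stage_rows = conn.execute(TWO_STAGE).fetchall()
print(one_shot_rows == two_stage_rows, two_stage_rows)
```

The key difference is that the second join matches on the document id as well as the word, so the join key is no longer dominated by a handful of common words. On an engine that does not reorder joins automatically, the first step may need to be written as a subquery or a temporary table.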