Tag Archives: technology

SQL optimization – long tail affecting performance

It often happens that you need to join multiple tables (more than two) into one and do some computation on the result. You will then find that some instances become long tails: all other instances have completed, but the long-tail instances are still running. For example, a normal instance may complete in 1 hour, while a long-tail instance may run for more than 24 hours or, even worse, hang.

For example, suppose you write SQL for text information retrieval: a table of query documents is matched against a database of target documents, with the constraint that scores are calculated only for selected (query, target) document pairs.

Say the three tables are query, target, and candidate. Normally, you would join all three in one statement, like

SELECT other-score ……
FROM query q
JOIN target t
  ON q.word = t.word
JOIN candidate c
  ON q.id = c.query_id
  AND t.id = c.target_id

In big data settings, query, target, and candidate may each contain millions of rows, which is normal. But the distribution of words across documents is highly skewed, following the 80-20 rule (Pareto principle). A few very frequent words then account for most of the rows matched on q.word=t.word, so the instances handling those words get far more work than the rest. This gives the join above a serious long-tail problem.

In my practice, the normal instances completed in a few hours, but the long-tail instances ran for more than 1 day and were still stuck without any progress.

After analyzing the problem, I changed the single three-way join into two steps: first join query with candidate, then join that result with target. With only this logic change, the task completed in 1 hour without any long tail or data skew. Before the change, the task could not complete even after running for 24 hours.

It is an amazing performance improvement.
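
The rewrite described above can be sketched on toy data. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than the big data engine from the original task, the selected columns stand in for the elided score computation, and the rows are made up. SQLite may reorder joins on its own, so the sketch shows that the two forms return the same rows and how large the word-only join gets, not the speedup itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE query     (id INTEGER, word TEXT);
    CREATE TABLE target    (id INTEGER, word TEXT);
    CREATE TABLE candidate (query_id INTEGER, target_id INTEGER);
""")

# Toy data (made up): "the" appears in every document, so it is the skewed key.
# Only one (query, target) pair is selected in candidate.
conn.executemany("INSERT INTO query VALUES (?, ?)",
                 [(1, "the"), (1, "sql"), (2, "the"), (2, "join")])
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(10, "the"), (10, "sql"), (20, "the"), (20, "join")])
conn.executemany("INSERT INTO candidate VALUES (?, ?)", [(1, 10)])

# Original form: all three tables joined in one statement.
ONE_SHOT = """
    SELECT q.id, t.id, q.word
    FROM query q
    JOIN target t ON q.word = t.word
    JOIN candidate c ON q.id = c.query_id AND t.id = c.target_id
    ORDER BY q.id, t.id, q.word
"""

# Rewritten form: join query with candidate first, then join target.
TWO_STEP = """
    SELECT qc.query_id, t.id, qc.word
    FROM (
        SELECT q.id AS query_id, q.word, c.target_id
        FROM query q
        JOIN candidate c ON q.id = c.query_id
    ) qc
    JOIN target t ON t.id = qc.target_id AND t.word = qc.word
    ORDER BY qc.query_id, t.id, qc.word
"""

# Size of the intermediate result if query and target are matched on word alone.
WORD_ONLY = "SELECT COUNT(*) FROM query q JOIN target t ON q.word = t.word"

one_shot_rows = conn.execute(ONE_SHOT).fetchall()
two_step_rows = conn.execute(TWO_STEP).fetchall()
word_only_count = conn.execute(WORD_ONLY).fetchone()[0]

print(one_shot_rows == two_step_rows)  # both forms return the same rows
print(word_only_count, len(two_step_rows))
```

In the two-step form the second join is keyed on the target id as well as the word, so a frequent word no longer pulls every matching row into one group.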

Cookie – Tracking user behavior & recommendation

A cookie is a short code used to track user behavior while a user surfs the internet: reading news and articles, watching videos, and listening to podcasts and audio programs. From the data collected through cookies, we can understand who clicked which content, where and when, and how long they dwelled on it. When you search on Google, a Google cookie assigns you a unique identity (UUID) and traces you; the same happens on Baidu and Bing. The UUID is different on Google, Baidu, and Bing because each site sets its own cookie, and it is not shared across browsers either. But when you log in from different browsers with the same email account, these UUIDs can be linked and identified as a single user.
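
The linking step can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical log of (cookie UUID, login email) pairs; the function name, the log format, and the sample values are made up and do not describe any particular vendor's system.

```python
from collections import defaultdict

def link_cookies_by_login(login_events):
    """Group cookie UUIDs by the email account that logged in with them.

    login_events: iterable of (cookie_uuid, email) pairs (hypothetical format).
    """
    cookies_by_user = defaultdict(set)
    for cookie_uuid, email in login_events:
        # Normalize the email so that case differences do not split one user.
        cookies_by_user[email.strip().lower()].add(cookie_uuid)
    return dict(cookies_by_user)

# Made-up events: the same account seen in two browsers, plus a second user.
events = [
    ("uuid-chrome-1", "alice@example.com"),
    ("uuid-firefox-7", "Alice@example.com"),
    ("uuid-safari-3", "bob@example.com"),
]
linked = link_cookies_by_login(events)
print(linked)
```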

Different cookies are used to track different user behavior. For example, the cookie that tracks a user reading news is different from the one that tracks watching TV programs or listening to radio channels. Media companies often use third-party cookie services to support news, audio program, and video program recommendation. Many DSPs (demand-side platforms), DMPs (data management platforms), and SSPs (supply-side platforms) provide such technology services, e.g. cxense, lotame, ……

Media companies often require a customized recommendation system. Third-party services provide cookies and a widget toolkit to satisfy customized requirements. For news recommendation, the customer can use the widget settings to configure news category, keywords, named entities, term weighting, period, and blacklist & whitelist. These functions satisfy basic business requirements for news recommendation. This is a traditional information retrieval application in news, and it cannot do personalized news recommendation, which is widely applied in Google, Facebook, or Microsoft Bing search. An in-house data science team can exploit internal audience data to understand user interests and build machine learning models for personalized recommendation. In practice, most companies have no such capability.

For audio / podcast and video program recommendation, most of the time it is still treated as a text information retrieval problem. These programs have text metadata such as captions, short program descriptions, editor or reporter names, and program director and actor names. Using this available metadata, recommendation can fulfill most business requirements. Audio and video/image processing and content understanding are not widely used. This is not only because of limited manpower but also because processing audio and images is hungry for computing resources. In terms of ROI (return on investment), they may not be a good investment.
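
Treating program recommendation as text retrieval over metadata can be sketched as follows. This is a minimal example under assumptions: plain TF-IDF with cosine similarity is one common choice, not necessarily what any third-party service uses, and the program IDs and descriptions are made up.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Turn each metadata string into a sparse {term: weight} TF-IDF vector."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(programs, seed_id, top_k=2):
    """Rank the other programs by metadata similarity to the seed program."""
    ids = list(programs)
    vectors = dict(zip(ids, tfidf_vectors([programs[i] for i in ids])))
    scored = [(cosine(vectors[seed_id], vectors[i]), i) for i in ids if i != seed_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [program_id for _, program_id in scored[:top_k]]

# Made-up metadata: description text plus names, concatenated per program.
programs = {
    "p1": "cooking show chef pasta italian kitchen",
    "p2": "italian pasta recipes with chef mario",
    "p3": "football league highlights weekly",
}
print(recommend(programs, "p1", top_k=1))
```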

Media company as publisher platform – 1

Media companies, such as SPH and MediaCorp in Singapore, CCTV in China, and the Washington Post, are publisher platforms. They create high-quality content (e.g. audio, video, news) to engage users. How do they earn money to support their business? One way is subscription fees. But for a national broadcasting company such as MediaCorp, most content, such as the broadcasts Yes 933, Channel News Asia (CNA), Channel 5, Channel 8, Suria, and Vasantham, is free. They earn most of their money from advertisers. In general, a media company is a publishing platform that bridges users (audience, content consumers) and advertisers (who market their products to consumers).

Publisher platform:

  • Platform [ Create ] high-quality content [Engage] ==> audience <== [ Consume ] product [ Create ] Advertiser

In media companies, reporters, editors, and media creators generate creative, original, high-quality content (news, audio, and video). Although its volume is relatively small compared with UGC (user generated content) data in internet companies such as Google and Facebook, its quality is high and trustworthy, which is important to brand products and companies.

Business is the core of a media company, so they prefer to use third-party services to exploit their content and serve audiences and advertisers. However, they also have a strong intention to build in-house data science technology to fully mine their gold data and serve their customers. With increasingly strict government policies on user privacy and data security, it is impossible to fully expose in-house data to third parties. In some business application scenarios, a customized solution is preferred, and outsourced or third-party services do not suffice.

Music summary

A music summary is a short clip extracted from a music recording to represent its content, and it is used to engage consumers to buy the recording. A simple way is to use the beginning of the audio, but that may not capture the most engaging part of the music. I developed an algorithm for music structure analysis and repeated pattern identification. The repeated pattern or segment may reflect the most engaging content in the recording, and it is used as the music summary. Refer to https://aisengtech.com/project#music-summary.
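
The idea of picking the most repeated segment can be illustrated with a toy sketch. It is not the algorithm referenced above: it assumes the recording has already been reduced to one symbolic label per frame (a hypothetical simplification), whereas a real system would compare audio features. The function name and the sample sequence are made up.

```python
def most_repeated_segment(frames, length, min_match=0.8):
    """Return (start, repeat_count) of the window that recurs most often.

    frames: sequence of per-frame labels (hypothetical symbolic features).
    length: window length in frames.
    min_match: fraction of frames that must agree for two windows to match.
    """
    n_windows = len(frames) - length + 1
    best_start, best_count = 0, -1
    for start in range(n_windows):
        window = frames[start:start + length]
        count = 0
        for other in range(n_windows):
            if abs(other - start) < length:
                continue  # skip overlapping windows, including the window itself
            matches = sum(a == b for a, b in zip(window, frames[other:other + length]))
            if matches / length >= min_match:
                count += 1
        if count > best_count:  # ties keep the earliest window
            best_start, best_count = start, count
    return best_start, best_count

# Made-up song: intro "ab", verse "cdef", chorus "wxyz", verse variant "cdeg",
# chorus, bridge "hi", chorus.
song = list("ab" + "cdef" + "wxyz" + "cdeg" + "wxyz" + "hi" + "wxyz")
start, repeats = most_repeated_segment(song, length=4)
print(start, repeats, "".join(song[start:start + 4]))
```

On this sequence the chorus is the only four-frame window that recurs, so it is chosen as the summary instead of the intro.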

Learn a metric oriented classifier

The objective function is the mathematical formulation of how classifier parameters are estimated. The classical objective function is derived from maximizing the log-likelihood of the training samples under the proposed classifier, and the classifier parameters are estimated by optimizing it. But log-likelihood is not directly related to the performance metric: training uses likelihood, while the preferred evaluation metric may be F1, accuracy, or ranking. Because of this gap between the training and evaluation criteria, a classifier trained on log-likelihood is not optimal for F1, classification error, or ranking. This is the motivation of our work on MFoM-based classifier learning. The work is updated at https://aisengtech.com/project#mfom. Besides MFoM, there are many papers in the research community on learning classifiers for a specified metric; learning to rank is the most famous, and it is now a core module of modern search engines.
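
The gap between the training criterion and the evaluation metric can be seen in a small example. This sketch is not MFoM: it only shows that the decision rule natural for a likelihood-trained classifier (predict positive when the probability is at least 0.5) can be far from the best one for F1, using post-hoc threshold tuning as a stand-in. The labels and probabilities are made up.

```python
def f1_score(labels, predictions):
    """F1 for binary labels and predictions (1 = positive)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_at_threshold(labels, probs, threshold):
    """F1 when predicting positive for probabilities at or above threshold."""
    return f1_score(labels, [int(p >= threshold) for p in probs])

def best_f1_threshold(labels, probs):
    """Search the observed probabilities for the threshold with the highest F1."""
    best_t, best_f1 = 0.5, f1_at_threshold(labels, probs, 0.5)
    for t in sorted(set(probs)):
        f1 = f1_at_threshold(labels, probs, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Made-up scores from a classifier that ranks well but is under-confident
# on the rare positive class.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1, 0.1, 0.05]

default_f1 = f1_at_threshold(labels, probs, 0.5)
tuned_t, tuned_f1 = best_f1_threshold(labels, probs)
print(default_f1, tuned_t, tuned_f1)
```

Metric-oriented learning aims to close this gap during training rather than after it.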

Audio/music search

Around 2013, music search was becoming a hot application in the internet industry with the increasing coverage of mobile phones. Its goal is to provide a music / song search experience from a music clip recorded by a mobile phone anywhere, anytime. Its challenges are extracting audio fingerprints that are both robust (to diverse noise, e.g. town hall, road, audio editing, pitch shift) and compact, and responding quickly enough to support real-time search. I developed an audio landmark binary feature as the fingerprint and an inverted document index framework for audio search (C++). If you are interested in learning more, please refer to https://aisengtech.com/project#speech-recognition
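
A landmark fingerprint with an inverted index can be sketched in Python. This is an illustrative sketch and not the C++ system mentioned above: it assumes spectral peaks are already available as (time, frequency bin) pairs, pairs nearby peaks into a packed integer hash, and votes on the time offset between query and reference. The bit layout, the class and function names, and the peak values are all assumptions.

```python
from collections import Counter, defaultdict

def landmark_hashes(peaks, fan_out=3):
    """Pair each peak with the next fan_out peaks and pack each pair into an int.

    peaks: list of (time, freq_bin) sorted by time (assumed already extracted).
    Returns a list of (hash, anchor_time).
    """
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            # Assumed layout: 10 bits per frequency bin, 6 bits for the time gap.
            code = ((f1 & 0x3FF) << 16) | ((f2 & 0x3FF) << 6) | (dt & 0x3F)
            hashes.append((code, t1))
    return hashes

class FingerprintIndex:
    """Inverted index from landmark hash to (track_id, time) postings."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, track_id, peaks):
        for code, t in landmark_hashes(peaks):
            self.postings[code].append((track_id, t))

    def search(self, peaks):
        """Return (track_id, offset, votes) for the best match, or None."""
        votes = Counter()
        for code, t_query in landmark_hashes(peaks):
            for track_id, t_ref in self.postings.get(code, ()):
                # True matches agree on one consistent time offset.
                votes[(track_id, t_ref - t_query)] += 1
        if not votes:
            return None
        (track_id, offset), count = votes.most_common(1)[0]
        return track_id, offset, count

# Made-up peaks for two reference tracks.
index = FingerprintIndex()
index.add("song_a", [(0, 10), (1, 25), (2, 40), (3, 18), (4, 33), (5, 52)])
index.add("song_b", [(0, 12), (1, 30), (2, 45), (3, 21), (4, 37), (5, 50)])

# Query clip: the part of song_a that starts at time 2, with its own clock at 0.
clip = [(0, 40), (1, 18), (2, 33), (3, 52)]
result = index.search(clip)
print(result)
```

Because each lookup is a dictionary access on a compact integer, the search cost depends on the number of query hashes and their postings, not on the total length of the indexed audio.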