Category Archives: technology

A Brief Note on Self-Dependency in Task Scheduling

In big data processing, an algorithm or a piece of business logic sometimes needs the task's own historical output; for example, the current business date depends on T-1, i.e. yesterday's result. This is self-dependency, and on the task scheduling graph it shows up as a cycle pointing back to the task itself.

Take T-1 as an example. If yesterday's instance has not finished, today's business date keeps waiting until it does. The hidden risk is that if an instance fails and is not detected and restarted in time, every later business date is blocked as well. Worse, recovery has to start from the earliest failed date: restart it, wait for it to succeed, then restart the next business date, one by one in order. That is very tedious. So how do we solve this?

The solution depends on the concrete problem. If the T-1 dependency is a hard business requirement, for example you must compute today's newly added data strictly against yesterday's result, then the dependency has to stay. In big data processing, however, such a strict requirement is rare.

If the self-dependency exists only to make the algorithm more efficient, i.e. data already processed yesterday need not be processed again today and yesterday's result can simply be reused, then the self-dependency can be removed entirely, which makes the system more stable and robust. The core idea is a cache: store the required historical data, e.g. the T-1 result, in a cache table, either a separate table or the same result table with a special maximum-business-date partition such as 99991231. When the current business date is computed, the increment is taken from the cache, so the task no longer depends on itself. The cache always holds the latest processed result: when today's business date finishes, the job updates not only the current date's partition but also the cache table (or the maximum-business-date partition). Even if many historical instances fail, the data can be repaired quickly, because the cache always contains data; the reference for the incremental computation is simply no longer strictly T-1, which in practice changes nothing, since the whole point of incremental computation is to avoid repeated work. Most of the time the increment is merged back with the cached data, so each day still produces a full result.
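
To make the idea concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: partitions are modelled as entries in a dict keyed by business date, 99991231 is the reserved cache partition, and the "computation" is just a dict merge; in a real warehouse these would be table partitions updated by the daily job.

# 99991231 is the reserved "latest result" (cache) partition.
CACHE_DS = "99991231"

def run_daily_job(store: dict, ds: str, new_records: dict) -> None:
    # Read the cache; it always holds the latest full result, so the job
    # never has to wait for yesterday's instance to finish.
    cache = store.get(CACHE_DS, {})
    # Incremental computation: only records not already in the cache.
    increment = {k: v for k, v in new_records.items() if k not in cache}
    full_result = {**cache, **increment}   # merge the increment with the cache
    store[ds] = full_result                # write today's partition
    store[CACHE_DS] = full_result          # refresh the cache partition

store = {}
run_daily_job(store, "20240101", {"a": 1})
run_daily_job(store, "20240103", {"a": 1, "b": 2})  # 20240102 failed? no waiting
print(store[CACHE_DS])  # {'a': 1, 'b': 2}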

How to design the cache depends on the specific algorithm and business. If some already-processed records need to be recomputed under certain conditions, the cache logic can be more elaborate, caching only the part of the data that satisfies the conditions. You can also add a switch so that the cache can be turned on or off at any time.

If we want a stable system with low maintenance cost, so that our time and ingenuity go to more valuable work, we need to plan and design for this up front, during system development.

Device graph – cross-device linking

In modern society many of us own multiple devices, such as a desktop PC, a mobile phone and a tablet, and surf the internet from all of them. Some people also use several browsers, such as Chrome, Firefox or Safari. Cookies are unique per device and per browser. For example, I read news on CNA using Chrome and am assigned a cookie ID. Next time I read news on CNA using Firefox on the same device and am assigned another cookie ID. From the cookie IDs alone I have two different identities, i.e. I look like two different people. The question is: can we identify that the two cookie IDs refer to one person?

A simple way is to ask users to create an account and log in every time they consume content on your platform. The user account then links the different cookie IDs across devices and browsers.

But if users do not want to log in, is there still a way? Technically, yes. It is called a device graph: based on the available cookie data, link different cookie IDs to one unique identity, i.e. to one person.

A cookie record captures the behavior of a surfing session, e.g. dwell time, visited URL, clicks, IP address, device type (e.g. iPhone, Samsung S20, Huawei P40, PC, …) and browser type (e.g. Chrome, Firefox, Safari, …). It looks like:

  • Cookie-id = 1234, IP: 0.0.0.0, visited URL: wordpress.com at 21:00:00 20210908, browser: Firefox, device: PC, …

For each cookie ID we can aggregate its historical behavior over different windows, e.g. the last 30, 60 or 90 days, and extract a high-dimensional feature vector that characterizes the cookie ID.
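
A rough sketch of such per-cookie aggregation, assuming the raw log sits in a pandas DataFrame with columns cookie_id, url, device, browser, dwell_sec and a datetime column ts (all names are illustrative):

import pandas as pd

def cookie_features(log: pd.DataFrame, days: int = 30) -> pd.DataFrame:
    cutoff = log["ts"].max() - pd.Timedelta(days=days)
    recent = log[log["ts"] >= cutoff]
    # Numeric behaviour statistics per cookie.
    stats = recent.groupby("cookie_id").agg(
        visits=("url", "count"),
        distinct_urls=("url", "nunique"),
        avg_dwell=("dwell_sec", "mean"),
    )
    # One-hot style counts of the categorical attributes (device, browser, ...).
    cats = pd.get_dummies(recent[["cookie_id", "device", "browser"]],
                          columns=["device", "browser"])
    cats = cats.groupby("cookie_id").sum()
    return stats.join(cats, how="left").fillna(0)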

Next we need to group similar cookie IDs into clusters. From a pattern recognition and machine learning point of view this is an unsupervised clustering problem. If some logged-in users are available, they provide golden answers about which cookies must be in the same cluster, turning it into a semi-supervised clustering problem.

Clustering can also be viewed through graph theory. We compute a similarity score between pairs of cookies to estimate the probability that the pair comes from the same identity (only high-probability candidate pairs need to be kept). This builds a cookie graph whose nodes are cookie IDs and whose edge weights measure link strength. Any graph-cut algorithm can then be used to solve the clustering problem: the cut identifies sub-graphs in which all cookie IDs are treated as the same identity.
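
A toy sketch of the graph view, assuming pairwise probabilities are already available; connected components over the thresholded graph are used here as a crude stand-in for a proper graph-cut or community-detection algorithm:

import networkx as nx

def link_cookies(pair_scores, threshold=0.8):
    """pair_scores: iterable of (cookie_a, cookie_b, prob_same_person)."""
    g = nx.Graph()
    for a, b, p in pair_scores:
        if p >= threshold:              # keep only high-probability candidates
            g.add_edge(a, b, weight=p)
    # Each connected component is treated as one person (one identity).
    return [set(c) for c in nx.connected_components(g)]

clusters = link_cookies([("c1", "c2", 0.93), ("c2", "c3", 0.88), ("c4", "c5", 0.40)])
print(clusters)  # [{'c1', 'c2', 'c3'}]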

A device graph is a very useful technology for ads targeting and personalized recommendation.

Ads targeting

As publisher platforms, broadcast companies create content to attract users and earn money from ads operations or content subscriptions. Because a media company is business driven and does not have the manpower to develop its own ads technology (user tracking, ads placement, ads optimization, …), it typically relies on a third-party service such as Google DoubleClick. Through it, the media company gets real-time data about how users react to the ads shown to them, e.g. which ads unit was displayed to a user (an impression) and whether the user clicked it. From the click information, in-house data scientists can develop a machine learning model (a lookalike model) to predict how likely a user is to click an ad. With that, ads targeting can be implemented, i.e. showing ads to a precisely tailored audience, which improves the click-through rate (CTR) and drives traffic.

Big internet companies such as Google, Facebook and Baidu all offer ads targeting products. Even so, media companies often prefer to build the technology in-house, both for data privacy and because they do not want to depend too heavily on a third-party service. In-house technology also lets them customize the targeting model for niche business requirements and respond to the business more quickly.

How do we build an ads targeting model?

First, formulate the problem: given an ad–user pair, predict whether the user will click the ad. From a machine learning point of view this is a binary classification problem.

Second, collect data and prepare training samples. Gather ad–user pairs that have already been displayed on the platform. If you use Google DoubleClick, you can obtain the real-time impression log, which tells you which ads unit was shown to which user and whether the user clicked it. If the user clicked, the impression is positive (1); otherwise it is negative (0). Each ad–user log pair is therefore labelled 1 or 0.
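
A hedged sketch of this labelling step, assuming two pandas DataFrames, impressions (user_id, ad_id, impression_id) and clicks (impression_id); the column names are illustrative, not DoubleClick's actual schema:

import pandas as pd

def label_impressions(impressions: pd.DataFrame, clicks: pd.DataFrame) -> pd.DataFrame:
    clicked = set(clicks["impression_id"])
    out = impressions.copy()
    # 1 if the impression led to a click, 0 otherwise.
    out["label"] = out["impression_id"].isin(clicked).astype(int)
    return out[["user_id", "ad_id", "label"]]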

Third, represent the ad–user pair as a feature vector. A lookalike model tries to find potential audiences whose behavior is similar to the audiences it already knows, so the ad-side information can be ignored; we only need a vector that represents the user and characterizes their historical behavior on the platform along various dimensions. For how to build such user features, please refer to your-browsing-behavior-expose-your-gender-age-ethnicity.

Lastly, you can train any supervised machine learning model to do the prediction. In my case a simple weighted linear classifier worked well; it is roughly the mean of the positive samples and the mean of the negative samples, plus discriminative information relative to a background model. An A/B test on some ads units showed promising results.

Once the model is learned, we rank the audience by the predicted probability of clicking the ad and select the top-N users for targeting.
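
Below is a minimal sketch in the spirit of that weighted linear classifier; the exact formulation is my own guess from the description above, not the production model. Users are scored with a weight vector built from the positive and negative class means after removing a background mean, then ranked to pick the top-N.

import numpy as np

def fit_prototype_scorer(X: np.ndarray, y: np.ndarray):
    mu_bg = X.mean(axis=0)                    # background model
    mu_pos = X[y == 1].mean(axis=0) - mu_bg   # discriminative positive part
    mu_neg = X[y == 0].mean(axis=0) - mu_bg   # discriminative negative part
    w = mu_pos - mu_neg                       # weight vector of a linear scorer
    return lambda X_new: (X_new - mu_bg) @ w  # higher score = more likely to click

X = np.random.rand(100, 8)                    # toy user feature vectors
y = (X[:, 0] > 0.5).astype(int)               # toy click labels
score = fit_prototype_scorer(X, y)
top_users = np.argsort(-score(X))[:10]        # rank the audience, keep top-N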

SQL optimization – long tail affecting performance

It often happens that you need to join multiple tables (more than two) into one and do some computation on the result. Then you find that many instances become long tails: all the other instances complete while the long-tail instances are still running. For example, a normal instance may finish in 1 hour while a long-tail instance runs for more than 24 hours, or even hangs there.

For example, suppose you write SQL to do text information retrieval, i.e. match a query-document table against a target-document database, with the constraint that scores only need to be computed for selected query–target document pairs.

Say the three tables are query, target and candidate. The straightforward way is to join all three at once, like:

SELECT <score columns> ...
FROM query q
JOIN target t
  ON q.word = t.word
JOIN candidate c
  ON q.id = c.query_id
 AND t.id = c.target_id
...

In big data, the query, target and candidate tables may each hold millions of rows, which is normal. But the distribution of words over documents is very skewed, following the 80-20 rule (Pareto principle). That gives the join above a serious long-tail problem.

In my practice, the normal instances completed in a few hours, but the long tails ran for more than a day and then just hung there without any progress.

After some analysis, I changed the single three-way join into two steps: first join query with candidate, then join that result with target. With only this logical change the task completed in about 1 hour without any long tail or data skew, whereas before the change it could not finish even after running for 24 hours.
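
The reordering itself is the whole fix. Purely to make the join order concrete, here is a rough pandas equivalent (column names follow the SQL example above; the actual change was made in the SQL itself):

import pandas as pd

def score_pairs(query: pd.DataFrame, target: pd.DataFrame,
                candidate: pd.DataFrame) -> pd.DataFrame:
    # Step 1: join query with the small, selective candidate table first,
    # so every query row is expanded only to its allowed target ids.
    qc = query.merge(candidate, left_on="id", right_on="query_id")
    # Step 2: only now bring in target, matching on both the candidate target
    # id and the shared word, so the heavily skewed word join never explodes.
    t = target.rename(columns={"id": "target_id"})
    return qc.merge(t, on=["target_id", "word"])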

It is an amazing performance improvement.

Viewer forecasting – predicting how many users will read or watch your content

Media companies create news articles, audio and video content to engage users on their platform, and provide ads services to make money. The content creators must understand which topics interest the audience most and how popular their content is. For popular content, i.e. content that attracts a large volume of viewers in a short time, the creators may plan to follow the audience's interest with deeper reporting on the topic. It is therefore necessary to predict the viewer volume of a published article.

For example, a CNA reporter published an article at 06:25AM on 22 Sep 2021, COVID-19: Home recovery patients ‘anxious’ without clear instructions, set up Telegram group for support (https://www.channelnewsasia.com/singapore/covid-19-home-recovery-quarantine-art-self-test-kit-telegram-support-group-2191691). The reporter wants to know how many viewers the article will attract in the next 24 or 72 hours.

Predicting the viewer count of an article is a classical time-series regression problem. Based on the article's publish date and time, the calendar day, the day of the week and the article's viewer history, a regression model can be learned and used for forecasting. The steps are as follows.

  • Data collection and cleaning
    • Collect historical articles together with their viewer counts, which form a series. E.g. for article A published 4 hours ago, the hourly viewer numbers are (0th, 0), (1st, 10), (2nd, 100), (3rd, 1000). Collect many such sequences from published articles.
  • Prepare a training set for the regression model
    • The training data is a set of pairs (x, y), where x is the feature (the observed evidence) and y is the target value (the ground truth). If we only forecast the next hour's viewers, y is just a number. For the series above, the (x, y) pairs could look like ([0, 10], 100) and ([10, 100], 1000), i.e. use the past 2 hours of viewers to predict the next hour (a minimal sketch of this step follows the list). In practice it is more complicated than this simple case; for example, the project I worked on at the media company had to predict the next 72 hours of viewers.
  • Feature extraction
    • Feature extraction is the most important step in forecasting. If the features are bad, the regression accuracy will be poor regardless of which state-of-the-art machine learning model is used. For a news article the viewer series is only one source of features; others include the publish date, time of day, day of week, channel and so on. Many such extra features help improve forecasting precision.
    • Because the viewer count is an integer, it is better to use it in the LOG domain, e.g. use LOG(1000) rather than 1000 as the feature; the LOG applies a non-linear re-scaling to the number.
  • Machine learning model
    • Once the features and training data are ready, any regression model can be applied, e.g. xgboost, decision trees or a neural network. I finally applied a DNN with metric-oriented learning (my earlier research, see "Learn a metric oriented classifier" below: training the network to directly optimize metrics such as mean squared error (MSE) or adjusted R2).
  • Forecasting performance metrics
    • Popular metrics are mean squared error, mean absolute error and adjusted R2.
  • After the model is ready, the next step is to deploy the forecasting as a service. You can use Flask (https://flask.palletsprojects.com/en/2.0.x/) to build the service, and your data engineering team can call it to do real-time forecasting.
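
Here is a hedged end-to-end sketch of the steps above: build (x, y) pairs from an hourly viewer series in the LOG domain and fit a regressor. The two-hour window and the xgboost choice are only for illustration; the production model used a much longer horizon, extra features (publish time, channel, …) and a DNN.

import numpy as np
import xgboost as xgb

def make_pairs(series, window=2):
    """series: hourly viewer counts of one article, e.g. [0, 10, 100, 1000]."""
    logs = np.log1p(series)                    # LOG-domain re-scaling
    xs, ys = [], []
    for i in range(window, len(logs)):
        xs.append(logs[i - window:i])          # past `window` hours
        ys.append(logs[i])                     # next hour (the target)
    return np.array(xs), np.array(ys)

X, y = make_pairs([0, 10, 100, 1000, 1500, 1700])
model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)
pred_viewers = np.expm1(model.predict(X[-1:]))  # back to the linear domain

The trained model can then be wrapped behind a Flask endpoint so that the data engineering team can call it in real time.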

I worked on forecasting not only for news articles but also on audience prediction for broadcast channel programs (which is a little different from articles, because a broadcast program is scheduled, e.g. 1-3 program-A, 4-5 program-B) and on video program prediction.

You can also apply forecasting to the exchange rate between two currencies, e.g. predicting the USD/SGD exchange rate over the next few days. I tried the forecasting model on USD/RMB exchange rate prediction and it looked good.

If you find the topic interesting and want to know more, please contact me.

Music summary

A music summary is a short clip extracted from a music recording to represent its content; it is used to entice consumers to buy the recording. A simple way is to use the beginning of the audio, but that may not capture the most engaging part of the music. I developed a music structure analysis and repeated-pattern identification algorithm; the repeated pattern or segment tends to reflect the most engaging content of the recording, and it is used as the music summary. Refer to https://aisengtech.com/project#music-summary.

Learn a metric oriented classifier

The objective function is the mathematical formulation by which classifier parameters are estimated. The classical objective is derived from the maximum log-likelihood of the training samples under the proposed classifier, and the parameters are found by optimizing it. But log-likelihood is not directly related to the performance metric: we train on likelihood while the preferred evaluation metric may be F1, accuracy or a ranking measure. This gap between the training and evaluation criteria means a classifier trained on log-likelihood is not optimal for F1, classification error or ranking. That is the motivation of our work on MFoM-based classifier learning; the work is summarized at https://aisengtech.com/project#mfom. After MFoM, many research papers on learning classifiers for a specified metric appeared in the community, of which learning-to-rank is the most famous; learning-to-rank is now a core module of modern search engines.

Audio/music search

Around 2013, music search became a hot application in the internet industry as mobile phone coverage grew. The idea is to offer music/song search from a short clip recorded with a mobile phone, anywhere and anytime. The challenges are extracting a compact audio fingerprint that is robust to diverse conditions (town-hall and road noise, audio edits, pitch shift) and achieving the quick response needed for real-time search. I developed an audio-landmark binary feature as the fingerprint and an inverted-index framework for audio search (in C++). If you are interested in learning more, please refer to https://aisengtech.com/project#speech-recognition.
