Understanding data and business in data science

Data science is about data and business: how a company exploits its data to drive business decisions and strategy. It is not about science for its own sake, nor about AI, ML or deep models; these are only tools that help a company mine insights from its data. Business sense and the ability to identify the right business problem come first. Whether the model is a simple decision tree, a linear model, boosting or a fancy deep learning model does not matter; a successful model is one that solves the business problem with reasonable resources and a good ROI. In many real business scenarios there is not enough good data for a data scientist to tune a deep model, and the resources that can be allocated are limited.

The first step is to understand the business problem and translate its requirements into a machine learning problem, e.g. to decide whether it is actually a regression, clustering or classification problem. The next step is to work out which types of data are needed and how to collect them from the company’s infrastructure. Data cleaning, feature engineering and preliminary data analysis are a must. Data analysis may reveal an insightful feature that significantly improves model performance. Blindly trying fancy models is not good practice, and a deep model cannot do everything for you. Traditional models can solve most business problems in a business-driven company. In some companies, the available resources will not allow you to use the latest fancy model, because the investment is too high and the benefit is not attractive enough.

The above is my personal thinking; comments and discussion are welcome.

Cookie – Tracking user behavior & recommendation

A cookie is a small piece of data that a website stores in the browser and uses to track user behavior while the user surfs the internet: reading news and articles, watching videos, listening to podcasts and audio programs. From the data collected through cookies, we can understand who clicked which content, where and when, and the dwell time. When you search on Google, a Google cookie assigns a unique identifier (UUID) to your browser and traces you; the same happens when you use Baidu or Bing. The UUID differs between Google, Baidu and Bing because each site sets its own cookie, and it also differs between browsers because cookies are not shared across browsers. But when you log in from different browsers with the same email account, these UUIDs can be linked and identified as a single user.
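The linking step described above can be sketched in a few lines. This is only an illustration under assumptions: the `(cookie_id, email)` log format, the function name `link_cookie_ids` and the sample IDs are all hypothetical, and a real system would key on a hashed account ID rather than a raw email.

```python
from collections import defaultdict

def link_cookie_ids(login_events):
    """Group cookie IDs that were seen logging in with the same email.

    login_events: iterable of (cookie_id, email) pairs (hypothetical format).
    Returns a dict mapping each normalised email to its set of cookie IDs.
    """
    users = defaultdict(set)
    for cookie_id, email in login_events:
        # Normalise the email so "Reader@..." and "reader@..." match.
        users[email.strip().lower()].add(cookie_id)
    return dict(users)

# Made-up login events from three browsers.
events = [
    ("chrome-1a2b", "reader@example.com"),
    ("firefox-9z8y", "Reader@example.com"),
    ("safari-5c6d", "other@example.com"),
]
linked = link_cookie_ids(events)
print(linked)
```

Here the Chrome and Firefox cookies end up under one user because both logged in with the same account, while the Safari cookie stays separate.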

Different cookies are used to track different user behaviors. For example, the cookie that tracks news reading is different from the one that tracks watching TV programs or listening to radio channels. Media companies often use third-party cookie services to support news, audio program and video program recommendation. There are many DSPs (demand-side platforms), DMPs (data management platforms) and SSPs (supply-side platforms) that provide such technology services, e.g. cxense, lotame, ……

Media companies often require a customized recommendation system. Third-party services provide cookies and a widget toolkit to satisfy such requirements. For news recommendation, the customer can configure news category, keywords, named entities, term weighting, time period, and blacklist & whitelist through the widget settings. These functions satisfy the basic business requirements of news recommendation. This is a traditional information retrieval application and cannot do the personalized news recommendation that is widely applied at Google, Facebook or Microsoft Bing. An in-house data science team can exploit internal audience data to understand user interests and build machine learning models for personalized recommendation. In practice, most companies have no such capability.
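To make the widget idea concrete, here is a minimal sketch of rule-based news ranking driven by such a configuration. It is not any vendor's API: `WidgetConfig`, `recommend`, the article fields and the sample data are hypothetical, and the whitelist behavior (whitelisted articles skip the category and age filters and rank first) is my own assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class WidgetConfig:
    # Hypothetical widget settings, loosely mirroring the options listed above.
    categories: set
    keyword_weights: dict
    max_age: timedelta
    blacklist: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)

def recommend(articles, config, now, top_k=3):
    """Rank articles by configured keyword weights; no per-user signal is used."""
    scored = []
    for art in articles:
        if art["id"] in config.blacklist:
            continue  # blacklisted articles are never shown
        pinned = art["id"] in config.whitelist
        if not pinned:
            # Assumption: whitelisted articles bypass category and age filters.
            if art["category"] not in config.categories:
                continue
            if now - art["published"] > config.max_age:
                continue
        words = art["title"].lower().split()
        score = sum(config.keyword_weights.get(w, 0.0) for w in words)
        scored.append((pinned, score, art["id"]))
    # Pinned articles first, then by descending keyword score.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [art_id for _, _, art_id in scored[:top_k]]

now = datetime(2024, 1, 10)
articles = [
    {"id": "a1", "category": "sports", "title": "Football final tonight",
     "published": datetime(2024, 1, 9)},
    {"id": "a2", "category": "sports",
     "title": "Football transfer rumours and football gossip",
     "published": datetime(2024, 1, 8)},
    {"id": "a3", "category": "politics", "title": "Election football",
     "published": datetime(2024, 1, 9)},
    {"id": "a4", "category": "sports", "title": "Old football story",
     "published": datetime(2023, 12, 1)},
    {"id": "a5", "category": "sports", "title": "Tennis open",
     "published": datetime(2024, 1, 9)},
]
config = WidgetConfig(
    categories={"sports"},
    keyword_weights={"football": 2.0, "final": 1.0, "tennis": 0.5},
    max_age=timedelta(days=7),
    blacklist={"a2"},
)
print(recommend(articles, config, now))  # ['a1', 'a5']
```

Every visitor gets the same ranked list for a given configuration, which is why this kind of setup covers basic requirements but is not personalized recommendation.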

Audio/podcast and video program recommendation is, most of the time, still treated as a text information retrieval problem. These programs have text metadata such as captions, a short description of the program, editor or reporter names, and program director and actor names. Using this available metadata, recommendation can fulfill most business requirements. Audio and video/image processing and content understanding are not widely used, not only because of limited manpower but also because processing audio and images is hungry for computing resources. In terms of ROI (return on investment), they may not be a good investment.
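As an illustration of treating program recommendation as text retrieval, the sketch below builds TF-IDF vectors from program metadata and ranks other programs by cosine similarity. TF-IDF is my choice of a common baseline, not something the services above are known to use; the function names and the sample metadata are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_tfidf(docs):
    """docs: dict of program id -> metadata text. Returns id -> sparse TF-IDF vector."""
    tokenized = {pid: tokenize(text) for pid, text in docs.items()}
    n = len(tokenized)
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))  # document frequency: count each word once per program
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vectors = {}
    for pid, words in tokenized.items():
        tf = Counter(words)
        vectors[pid] = {w: tf[w] * idf[w] for w in tf}
    return vectors

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similar_programs(program_id, vectors, top_k=2):
    """Return up to top_k other programs with non-zero metadata similarity."""
    target = vectors[program_id]
    scores = [(cosine(target, vec), pid)
              for pid, vec in vectors.items() if pid != program_id]
    scores.sort(reverse=True)
    return [pid for score, pid in scores[:top_k] if score > 0]

# Made-up program metadata (caption and description merged into one string).
programs = {
    "p1": "cooking show with chef Tan, local hawker food",
    "p2": "travel and food show visiting hawker stalls",
    "p3": "evening business news with market updates",
}
vectors = build_tfidf(programs)
print(similar_programs("p1", vectors, top_k=1))  # ['p2']
```

Only the text fields are used; no audio or video signal is processed, which keeps the computing cost low.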

Media company as publisher platform – 1

Media companies such as SPH and MediaCorp in Singapore, CCTV in China, and the Washington Post are publisher platforms. They create high-quality content (e.g. audio, video, news) to engage users. How do they earn money to support their business? One way is subscription fees. But for a national broadcaster such as MediaCorp, most of the content, such as the broadcasts of Yes 933, Channel News Asia (CNA), Channel 5, Channel 8, Suria and Vasantham, is free. Such companies earn most of their money from advertisers. In general, a media company is a publishing platform that bridges the users (audience, content consumers) and the advertisers (who market their products to consumers).

Publisher platform:

  • Platform [Create] high-quality content [Engage] ==> audience <== [Consume] product [Create] Advertiser

In media companies, reporters, editors and media creators generate creative, original, high-quality content (news, audio and video). Although its volume is relatively small compared with the UGC (user-generated content) of internet companies such as Google and Facebook, its quality is high and it is trustworthy, which is important for branding products and companies.

Business is the core of a media company, so media companies prefer to use third-party services to exploit their content and serve audiences and advertisers. However, they also have a strong intention to build in-house data science technology to fully mine their valuable data and serve their customers. With increasingly strict government policy on user privacy and data security, it is impossible to fully expose in-house data to third parties. In some business scenarios a customized solution is preferred, and outsourced or third-party services are not satisfactory.