Once upon a time, the mere concept of big data was exciting; now, everyone’s trying to figure out how to put it to work. The result: data science has never been a sexier profession.

If you’re a data scientist, people either want you or want to be you: job postings for data scientists increased a whopping 15,000% between 2011 and 2012, and universities (along with online education providers like Coursera and Udacity) have come out with data science programs to meet the demand.

The rising popularity of the search term “data science” tells a similar story:

[Chart: search popularity of the term “data science” over time]

Data science is an exciting field, but the challenges often aren’t as sexy as most people imagine. While data science encompasses juicy topics like deep learning (automatically extracting the most important aspects of a dataset), a lot of crucial upfront work goes into making the juicy stuff happen.

At Boomtrain, we’ve written about personalizing to users quite a bit. What we haven’t written about is the work that goes into making personalization possible at scale for our customers. Many people see personalization as something you can just switch on or off; if you’re implementing truly robust personalization, that’s not quite the case. 

Objectives and Constraints

Each business has unique goals and context-driven rules, which we’ll discuss in terms of objectives and constraints. The objective is what you’re optimizing for, and the business rules are the constraints on that optimization (e.g. you might want to optimize for engagement, but not want to ruin your brand in pursuit of clicks).

Most media companies want to increase ad revenue, which comes from ad impressions; a company might define engagement as the total number of pageviews and optimize for that metric. The company could also get more specific and optimize for engagement on specific channels.

For brand marketers, on the other hand, money from ad impressions means virtually nothing; they’re generally trying to build stronger relationships with their customers by increasing readership, so they often optimize for conversions in the form of newsletter signups or app downloads.

eCommerce companies optimize for purchases, though not always the same ones: some companies might want to optimize for first-time purchases (acquisition), while others might want to optimize for repeat purchases (retention).

Businesses of all kinds strive to build brand affinity — trust in the quality of a brand — and we measure their success by seeing how often people come back, and through which channels. Organic visits are the most promising.

Recommending what’s technically the best piece of content doesn’t always work with a business’s unique constraints. A machine could eventually learn the constraints on its own, but it’s often faster and easier to account for the constraints ourselves.

A few examples of constraints (or, in the case of #2, a lack thereof):

  1. If you’re a news site, you don’t want to show month-old content
  2. If you’re a recipe site, you definitely want to show month-old content — in all likelihood, nearly all your content is evergreen
  3. If you host TV shows or web series, you want to maximize content discovery, so you probably don’t want to recommend three episodes from the exact same show — or maybe you do want to recommend three episodes in a row, but not four
  4. If you sell clothing, you might not want to recommend hats to someone who’s already bought four of them in the past year (that user is probably done buying hats for a while)

No matter how strong a machine learning algorithm is, constraints like these — built through human intuition by the people working closest to the content and brand — offer useful guidance to improve your overall user experience.
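
To make the idea concrete, here is a minimal sketch in Python of how rules like the ones above could be layered on top of a ranked list of recommendations. It is an illustration only, not a description of any particular production system: the `Item` fields, the `apply_constraints` function, and the thresholds (30 days, two items per show) are hypothetical names and values, and the sample items are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Item:
    item_id: str
    series: str          # e.g. the show an episode belongs to
    published: datetime
    score: float         # relevance score from a (hypothetical) model


def apply_constraints(ranked, now, max_age=None, max_per_series=None, limit=3):
    """Walk a ranked list and keep the first `limit` items that pass the rules."""
    picked = []
    per_series = {}
    for item in ranked:
        if max_age is not None and now - item.published > max_age:
            continue  # too old for this site (news-style rule)
        count = per_series.get(item.series, 0)
        if max_per_series is not None and count >= max_per_series:
            continue  # already showing enough from this series
        picked.append(item)
        per_series[item.series] = count + 1
        if len(picked) == limit:
            break
    return picked


# Made-up candidates, sorted by model score (highest first).
now = datetime(2015, 6, 1)
candidates = [
    Item("a1", "show-a", now - timedelta(days=1), 0.9),
    Item("a2", "show-a", now - timedelta(days=2), 0.8),
    Item("a3", "show-a", now - timedelta(days=3), 0.7),
    Item("b1", "show-b", now - timedelta(days=45), 0.6),
    Item("c1", "show-c", now - timedelta(days=5), 0.5),
]
ranked = sorted(candidates, key=lambda item: item.score, reverse=True)

# A site that wants fresh content and variety across shows.
fresh_and_varied = apply_constraints(
    ranked, now, max_age=timedelta(days=30), max_per_series=2
)

# An evergreen site with no constraints simply takes the top scores.
unconstrained = apply_constraints(ranked, now)
```

With the constraints switched on, the third episode of the same show and the 45-day-old item are skipped in favor of a lower-scoring item from a different show. The model's ranking is untouched; the rules only decide which of its suggestions are allowed through.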

Wrangling Data

Separating the signal from the noise
When setting out to personalize for all users (the vast majority of whom are browsing anonymously), it’s important for the data input to reflect that reality. This task becomes difficult if a business isn’t collecting the right data from the get-go. For example, if a publication only collects data from subscribers or otherwise engaged users, it’s working with a biased data set.

Even if you do collect data on all your users, you’ll need to clean it to make sure you’re getting actionable insights. It’s important to create different groups of users based on their behavior. Building a single model based on everyone doesn’t work; it’ll describe an “average” user who doesn’t really exist.

If you’re in the business of content creation, your readers likely fall into a 1-9-90 ratio, where the top 1% of people account for a huge proportion of your traffic, the next 9% contribute a bit more, and the remaining 90% of readers do almost nothing. Among eCommerce companies, it’s common for 5% of customers to generate a third of revenue.

If you’re looking to expand your user base, you might not get much value out of focusing on that hyper-engaged 1-5% of users (although those users are worth paying attention to either way, since they’re your MVPs). You also don’t want to throw all your traffic in one bucket — you’ll want to tease out the new users and give them a different experience from the one you give everyone else.
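
If you want to check how skewed your own traffic is, a rough sketch along these lines can help. It assumes you already have a total pageview count per user; the `traffic_share_by_tier` function and the tier names are hypothetical, and the numbers below are toy data invented purely to show the calculation.

```python
def traffic_share_by_tier(pageviews_by_user):
    """Share of total pageviews from the top 1%, next 9%, and remaining users."""
    counts = sorted(pageviews_by_user.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no pageviews recorded")
    n = len(counts)
    top_n = max(1, round(n * 0.01))   # most active 1% of users
    next_n = max(1, round(n * 0.09))  # the following 9%
    tiers = {
        "top_1": counts[:top_n],
        "next_9": counts[top_n:top_n + next_n],
        "bottom_90": counts[top_n + next_n:],
    }
    return {name: sum(values) / total for name, values in tiers.items()}


# Toy data: 1 heavy reader, 9 regulars, 90 casual visitors.
pageviews = {"power-user": 500}
pageviews.update({f"regular-{i}": 30 for i in range(9)})
pageviews.update({f"casual-{i}": 1 for i in range(90)})

shares = traffic_share_by_tier(pageviews)
for tier, share in shares.items():
    print(f"{tier}: {share:.1%}")
```

The same grouping can then be used to decide which bucket each model or experience is built for, rather than averaging everyone together.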

As far as user base expansion goes, the most value comes from understanding people with low engagement: the ones who took a few actions and have begun to slip away. It’s ten times easier to get someone to repeat something they’ve already done than it is to get an unengaged user to take their first action. The path to increasing readership is easier if you focus on occasional readers; the path to increasing revenue is easier if you focus on one-time buyers.

Putting content in a consistent format
Now we’re getting to the really unsexy stuff: making sure your data has a standard format. Any personalization system needs clean data to deliver the best results. We realize it’s a big ask; content creators — journalists, marketers, web designers, etc. — don’t create with clean, standardized data in mind. Their goal is to make their websites appealing to users through their content (and rightly so).

Every site we work with at Boomtrain is completely different from the ones before, and many of the people who own and manage those sites can’t initially deliver all the useful pieces (e.g. titles, body text, categorical data) to our personalization system. A few examples of problems we’ve seen:

  • Sites where different pieces of content are tagged according to different standards
  • Difficulty pulling in body text, which prevents us from using semantic analysis to generate keywords for articles (for more information on semantic analysis, take a look at this post)
  • A lack of canonical URLs, which makes our personalization system think that a slideshow article with 10 subpages is actually 10 different, strikingly similar articles
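
The canonical URL problem is the easiest of these to illustrate. When a page declares its own canonical URL, that is the best thing to use; when it doesn't, a fallback heuristic like the sketch below can collapse near-duplicate URLs into one. This is a simplified example using Python's standard `urllib.parse` module, not a description of Boomtrain's technology. The `canonicalize` function and the list of parameters to ignore are assumptions that would need tuning for each site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: query parameters that only pick a slide or track a campaign.
NON_CANONICAL_PARAMS = {"page", "slide", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url):
    """Collapse URL variants of the same article into a single form."""
    parts = urlsplit(url)
    # Keep only the query parameters that identify distinct content.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CANONICAL_PARAMS
    ]
    # Treat "/article/" and "/article" as the same path.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


# Ten subpages of one slideshow article (example.com is a placeholder domain).
slides = [f"http://Example.com/slideshow/best-hats/?page={n}" for n in range(1, 11)]
unique_articles = {canonicalize(url) for url in slides}
```

Here the ten slideshow URLs reduce to a single entry, so the article is counted once instead of ten times.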

Cleaning data is hard work, and the reality is that most companies won’t have 100% perfect data quality. That’s why at Boomtrain, we’ve invested in building technology to smooth over these gaps and inconsistencies.

Advice for Companies Implementing Personalization

  1. Be clear on your end goal. What experience do you want to give, and which metrics will you use to evaluate success?
  2. Be prepared to take the time to clean your data (if you’re with us, we can help). With any personalization system, garbage in means garbage out.

Once those two things are out of the way, the sexy part — personalizing to your users — can finally begin.
