On Monday, Gizmodo published an article about the curation practices behind Facebook’s Trending module, and the extent to which a curator’s personal biases affect what’s shown to Facebook’s billion-plus active users. (As you can imagine, this has caused some controversy.) Although Gizmodo focused on the role of human curators, those of us who work with algorithms and machine learning have had to confront the fact that biases can end up deeply encoded into supposedly objective systems -- including Klout’s own topic classifier, the system that identifies your Expert topics as well as the relevant articles for the Explore tab. Here’s a quick roundup of how we on Klout’s Data Science team think about keeping the topic system as unbiased as possible.
Klout’s topic classifier in 30 seconds or less
The major inputs to our topic system are:

- a human-curated topic ontology, supplemented by a dictionary of concepts and entities derived from Freebase; and
- the URLs and posts shared on social networks, which we classify against those topics.
Even without getting into the weeds of the pipeline that brings those two types of input together -- more on that below -- critical readers will already be able to spot a few areas where we’re vulnerable to bias. Let’s walk through them.
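To make those two types of input concrete, here’s a deliberately tiny sketch of how a curated topic dictionary might be matched against the text of a shared URL. All topic names and phrases below are invented for illustration; Klout’s real pipeline is far larger and more sophisticated than substring matching.

```python
# Input 1 (hypothetical): a curated ontology mapping topics to surface phrases.
ONTOLOGY = {
    "Molecular Biology": {"crispr", "gene editing", "protein folding"},
    "Politics": {"election", "senate", "campaign"},
}

def classify(text):
    """Return the topics whose phrases appear in the text."""
    lowered = text.lower()
    return sorted(
        topic
        for topic, phrases in ONTOLOGY.items()
        if any(phrase in lowered for phrase in phrases)
    )

# Input 2: a URL shared on social media, represented here by its title.
shared_title = "CRISPR gene editing enters a new election-year debate"
print(classify(shared_title))  # ['Molecular Biology', 'Politics']
```

Even a toy like this makes the bias surface visible: whoever writes the ontology decides which phrases count as which topic.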
Any classification system contains value choices
Should “Autism” be placed under “Diseases” or “Neurology”? Is “the Tea Party movement” distinct from “conservative politics”? Is “Wizard Rock” really a thing? How we answer these questions shapes the experience of our users, and inevitably gives our ontology a point of view.
Longtime users may remember the early days of Klout, when topics were a, shall we say, messy combination of user-submitted tags and data-mined concepts. As the Data Science team has worked on regularizing and improving the ontology, we’ve relied on a few guiding principles: documentation, consistency, and validation.
Even so, any time your application or audience changes, it’s important to reassess your classification scheme. One major flaw in Klout’s topic ontology is that it was developed for a U.S. audience, and still needs significant work for other countries and languages.
Staying alert for sins of omission
In addition to the human-curated topics in the ontology, we also use a dictionary of concepts and entities derived from Freebase. Freebase is a widely used resource in the data science world, but “widely used” is by no means the same thing as “perfect”. The biggest issue with Freebase is what it leaves out: like Wikipedia, it was collectively sourced, so it’s biased toward the interests of its editors and sparse in some areas, like cosmetic products and fashion terms, which required us to develop ways to supplement the dictionary. The moral of the story: it pays to look critically at any pre-packaged data set you plan to use.
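As a sketch of what supplementing a pre-packaged dictionary can look like (every term and category below is invented for illustration), one simple pattern is to overlay an in-house dictionary so that in-house curation wins any conflicts:

```python
# Hypothetical: a sparse, pre-packaged term dictionary (e.g. one derived
# from Freebase) ...
prepackaged_terms = {
    "mascara": "Cosmetics",
    "autism": "Neurology",
}
# ... plus an in-house supplement covering areas the base dictionary
# left sparse.
inhouse_supplement = {
    "bb cream": "Cosmetics",
    "athleisure": "Fashion",
}

# The supplement is merged last, so its entries overwrite the base
# dictionary's on any conflict.
term_to_topic = {**prepackaged_terms, **inhouse_supplement}
```

The design choice worth noting is the merge order: putting the in-house data last encodes the decision that your own curators outrank the pre-packaged source.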
Next, let’s consider the URLs we collect for the Explore tab. The majority are URLs that have been shared on social media, which means they are dominated by the topics most discussed on social media: politics, celebrity news, music, and so on. What we sometimes call “niche topics”, like molecular biology, Wicca, or wheelchairs, are naturally present in fewer URLs. Does that count as a bias? It’s unclear, but it makes for a poor end-user experience and risks making some users feel marginalized. As a result, we’ve had to develop backup strategies to increase coverage for less common topics.
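One such backup strategy, simplified to a sketch (this is not our production logic, and the quota numbers are arbitrary), is to reserve a minimum number of feed slots per topic before filling the rest by popularity:

```python
from collections import defaultdict

def build_feed(urls, min_per_topic=1, size=3):
    """urls: (url, topic, share_count) tuples. Guarantees each topic at
    least min_per_topic slots, then fills remaining slots by share count."""
    ranked = sorted(urls, key=lambda u: -u[2])
    by_topic = defaultdict(list)
    for item in ranked:
        by_topic[item[1]].append(item)

    feed = []
    # First pass: every topic present gets its guaranteed slots.
    for items in by_topic.values():
        feed.extend(items[:min_per_topic])
    # Second pass: top up with the most-shared leftovers, then truncate.
    feed.extend(item for item in ranked if item not in feed)
    return feed[:size]
```

With share-count ranking alone, a lightly shared topic would never surface; the quota pass guarantees it at least one slot at the cost of bumping a popular item. That trade-off is itself a value choice, which is rather the point of this post.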
The fuzzy line between human bias and business logic
One of the more ironic tidbits in Gizmodo’s article was that Facebook’s curators were told to suppress news about Facebook -- that is, to interfere with the Trending algorithm to avoid the appearance that Facebook was interfering with the Trending algorithm. But that kind of decision is probably familiar to the product managers in the audience, whose goal is to preserve the user experience. Similarly, a discovery feed like our Explore tab might recommend porn, or spam, or hate speech, and need to be tuned or overridden. To make it even more complicated, the definitions of porn, spam, and hate speech may change from region to region. Keeping those decisions from being made inconsistently or thoughtlessly is genuinely difficult; our approach has been to designate a single owner who both documents the rules and is available to discuss individual cases. As others have pointed out, Facebook’s mistake may not have been having curatorial tools, but isolating the employees using them.
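One way to keep such decisions consistent is to centralize them as data that a single owner maintains, rather than scattering conditionals through the codebase. A sketch, with the labels and regions invented for illustration:

```python
# Hypothetical, centralized suppression rules: one documented owner
# maintains this table; the rest of the code only reads it.
SUPPRESSED = {
    "default": {"spam", "adult"},
    # Regional rule sets extend the default where local norms differ.
    "region_x": {"spam", "adult", "regional_hate_speech_term"},
}

def is_allowed(label, region):
    """True if content carrying this moderation label may be recommended
    in the given region."""
    return label not in SUPPRESSED.get(region, SUPPRESSED["default"])
```

Because every override lives in one table, each entry can carry a documented rationale, and "why was this suppressed?" has a single place to look.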
Fine, but what about the actual topic algorithm?
Eagle-eyed readers will have noticed that we haven’t touched on the nuts and bolts of how Klout’s topic system actually assigns topics. The challenges of data modeling and debugging machine learning algorithms are pretty well surveyed elsewhere, and how we handle those challenges at Klout would require a dedicated blog post. However, there’s less discussion of how to handle human biases when collecting training or validation data -- how people’s points of view get encoded into the data a given algorithm is trying to approximate. The two approaches often recommended could be described as micromanaging versus crowdsourcing: either a) have an in-house process with well-defined guidelines, trained judges, and a reconciliation process for disagreements, or b) have lightweight guidelines but a large number of judges, in the hope that individual biases will be muted. There are tradeoffs to either approach; our team has recently relied mostly on in-house validation data, largely because it fits our development schedule. What’s more important, in our experience, is that the potential weaknesses of the training/validation data are known, discussed, and documented ahead of time, so that they can be distinguished from problems with the model itself.
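For the crowdsourcing approach, the reconciliation step can be as simple as a majority vote plus an agreement threshold that flags contentious items for explicit discussion. A sketch, with the threshold chosen arbitrarily:

```python
from collections import Counter

def reconcile(judge_labels, min_agreement=0.6):
    """Return (majority_label, needs_review) for one item's judge labels.
    Items whose winning label falls below the agreement threshold get
    flagged for the kind of reconciliation an in-house process formalizes."""
    winner, votes = Counter(judge_labels).most_common(1)[0]
    agreement = votes / len(judge_labels)
    return winner, agreement < min_agreement

print(reconcile(["Music", "Music", "Jazz"]))  # ('Music', False)
```

Three judges agreeing two-to-one clears the bar; a five-way near-split gets flagged. The flagged items are often the interesting ones, since heavy disagreement usually signals an ambiguous guideline rather than a careless judge.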
No system is perfect, and keeping out bias takes continual work. A focus on documentation, consistency, and validation will take you a long way, but the very best defense against unintentional bias is a diverse team that can bring multiple points of view to the table. Want to come work with us?
Sarah Ellinger is the Lead Data Analyst for Klout/Lithium’s Data Science team. She is responsible for overseeing the content of the topic ontology, as well as monitoring the performance of the topic classification system. Sarah attended U.C. Berkeley’s School of Information and has over a decade of experience in taxonomy and web content classification at tech companies large and small. She can be found on Twitter discussing information science and Game of Thrones spoilers @sarahellinger.