Every day across the nation, people doing work for Google log in to their computers and start watching YouTube. They look for violence in videos. They seek out hateful language in video titles. They decide whether to classify clips as “offensive” or “sensitive.” They are Google’s so-called “ads quality raters,” temporary workers hired by outside agencies to render judgments machines still can’t make all on their own. And right now, Google appears to need these humans’ help—urgently.
YouTube, the Google-owned video giant, sells ads that accompany millions of the site’s videos each day. Automated systems determine where those ads appear, and advertisers often don’t know which specific videos their ads will show up next to. Recently, that uncertainty has turned into a big problem for Google. The company has come under scrutiny after multiple reports revealed that it had allowed ads to run against YouTube videos promoting hate and terrorism. Advertisers such as Walmart, PepsiCo, and Verizon ditched the platform and much of the wider Google ad network.
Google has scrambled to control the narrative, saying the media has overstated the problem of ads showing up adjacent to offensive videos. Flagged videos received “less than 1/1000th of a percent of the advertisers’ total impressions,” the company says. Google’s chief business officer, Philipp Schindler, says the issue affected a “very very very small” number of videos. But ad raters say the company is marshalling them as a force to keep the problem from getting worse.
Because Google derives 90 percent of its revenue from advertisers, it needs to keep more from fleeing by targeting offensive content—fast. But users upload nearly 600,000 hours of new video to YouTube daily; it would take a small city of humans working around the clock to watch it all. That’s why the tech giant has emphasized that it’s hard at work developing artificially intelligent content filters, software that can flag offensive videos at a greater clip than ever before. “The problem cannot be solved by humans and it shouldn’t be solved by humans,” Schindler recently told Bloomberg.
The problem is, the company still needs humans to train that AI. So Google still depends on a phalanx of human workers to identify and flag offensive material to build the trove of data its AI will learn from. But eight current and former raters tell WIRED that, at a time when the company is increasingly reliant on ad raters’ work, poor communication with Google and a lack of job stability are impairing their ability to do their jobs well.
“I’m not saying this is the entire reason for the current crisis,” says a former Google ad rater, who was not authorized to speak with WIRED about the program. “But I do believe the instability in the program is a factor. We raters train AI, but we know very well that human eyes—and human brains—need to put some deliberate thought into evaluating content.”
Tech companies have long employed content moderators, and as people upload and share more and more content, this work has become increasingly important to these internet giants. The ad raters WIRED spoke with explained that their role goes beyond monitoring videos. They read comment sections to flag abusive banter between users. They check all kinds of websites served by Google’s ad network to ensure they meet the company’s standards of quality. They classify sites by category, such as retail or news, and click links in ads to see if they work. And, as their name suggests, they rate the quality of ads themselves.
In March, however, in the wake of advertiser boycotts, Google asked raters to set that other work aside in favor of a “high-priority rating project” that would consume their workloads “for the foreseeable future,” according to an email the company sent them. This new project meant focusing almost exclusively on YouTube—checking the content of videos or entire channels against a list of things that advertisers find objectionable. “It’s been a huge change,” says one ad rater.
Raters say their workload suggests that volume and speed are more of a priority than accuracy. In some cases, they’re asked to review hours-long videos in less than two minutes. On anonymous online forums, raters swap time-saving techniques—for instance, looking up rap video lyrics to scan quickly for profanity, or skipping through a clip in 10-second chunks instead of watching the entire thing. A timer keeps track of how long they spend on each video, and while it is only a suggested deadline, raters say it adds a layer of pressure. “I’m worried if I take too long on too many videos in a row I’ll get fired,” one rater tells WIRED.
Ad raters don’t just flag videos as inappropriate. They are asked to make granular assessments of each video’s title and contents, classifying them, for instance, as containing “Inappropriate Language,” such as “profanity,” “hate speech,” or “other.” Or “Violence,” with the subcategories “terrorism,” “war and conflict,” “death and tragedy,” or “other.” There’s also “Drugs” and “Sex/Nudity” (with the subcategories “abusive,” “nudity,” or “other”). The system also provides the ad rater with an option for “other sensitive content,” if, say, someone is sharing extreme political views. (AdAge recently reported that Google is now allowing clients to opt out of advertising alongside “sexually suggestive” and “sensational and shocking” content, as well as content containing “profanity and rough language.”)
Some material doesn’t fit neatly into the provided categories, ad raters say. In those cases, raters label the material as “unrateable.” One current rater described how he had to evaluate two Spanish-speaking people engaged in a rap battle. “I checked it as unrateable because of the foreign language,” he told WIRED. “I also added a comment that said it seems like this is a video of people insulting each other in a foreign language, but I can’t exactly tell if they are using profanity.” (Judging from recent ad-rating job openings, one former rater said, Google seems to be prioritizing hiring bilingual raters. Workers can also check a box when a video is in a language they don’t understand.)
Multiple ad raters say they have been asked to watch videos with shocking content. “The graphic stuff is far more graphic lately… someone trying to commit suicide with their dog in their truck,” one rater said. The person set the truck on fire, the rater said, then exited the truck and committed suicide with a shot to the head. In the online forums frequented by ad raters, anonymous posters said they had seen videos of violence against women, children, and animals. Several posters said they needed to take breaks after watching several such videos in a row. Ad raters said they don’t know how Google selects the videos they will watch; they see only the title and thumbnail of the video before they rate it, not a rationale. Other typical content in the videos raters are tasked with watching includes people talking about video games, politics, and conspiracy theories.
Taken together, the scope of the work and the nuance required in assessing videos show that Google still needs human help in dealing with YouTube’s ad problems. “We have many sources of information, but one of our most important sources is people like you,” Google tells raters in a document describing the purpose of their ad-rating work. Only machine intelligence can grapple with YouTube’s scale, as company execs and representatives have stressed again and again. But until Google’s machines, or anyone else’s, get smart enough to distinguish, say, truly offensive speech from other forms of expression on their own, such efforts will still need to rely on people.
“We have always relied on a combination of technology and human reviews to analyze content that has been flagged to us because understanding context in video can be subjective,” says Chi Hea Cho, a spokesperson for Google. “Recently we added more people to accelerate the reviews. These reviews help train our algorithms so they keep improving over time.”
The ads quality rater program started in 2004, two sources told WIRED. It was modeled after Google’s (much talked-about) search quality evaluation program, and initially served Google’s core ad initiatives: AdWords, which generates ads that correspond to search results, and AdSense, which places ads on websites through Google. The original hiring agency for ad raters, ABE, paid ad raters $20 an hour. They could work full-time and even overtime, one former rater said. In 2006, WorkForceLogic acquired ABE, after which raters say working conditions became less favorable. A company called ZeroChaos bought WorkForceLogic in 2012 and contracts with ad raters today.
Ad rating work often attracts people who prefer more flexible working conditions, among them college graduates who have just entered the workforce, workers nearing retirement age, stay-at-home parents, and individuals with physical disabilities. Ad raters can work wherever and whenever they want, as long as they fulfill the 10-hour weekly minimum work requirement. Raters only need their own desktop computer and mobile device to work with.
But the inherent instability of the job can take a toll on many workers. “Most of us love this job,” one ad rater tells WIRED, “but we have no chance of becoming permanent, full-time employees.”
Most of the ad raters who spoke to WIRED were hired through ZeroChaos—just one of several agencies that provide temporary workers to tech companies. ZeroChaos hires ad raters on one-year contracts, and at least until very recently, they could not stay on the job after two years of continuous work. Some workers believed this limit deprived the company of experienced raters best qualified to do the work. (In early April, during our reporting on this story, ZeroChaos notified ad raters that it was ending the two-year limit.) Ad raters do not get raises—they earn $15 per hour and can work a maximum of 29 hours a week. They get no paid time off. They can sign up for benefits if they work at least 25 hours per week, but they have no assurance they will have enough tasks to meet that threshold. Workers say they can find themselves dismissed suddenly, without warning and without a reason given—something multiple employees say has happened to them, one after only a week on the job. The company notifies workers of termination with a perfunctory email message.
“Google strives to work with vendors that have a strong track record of good working conditions,” says Cho. “When issues come to our attention, we alert these vendors about their employees’ concerns and work with them to address any issues. We will look into this matter further.” ZeroChaos declined to comment.
A lack of clear communication with Google itself compounds the feelings of job insecurity ad raters have, they say. They don’t meet anyone they work for in person—including during the job interview process—and Google gives raters only a generic Google email for the “Ads Evaluation Administrative team,” telling the raters to use it only for task-related issues. When raters email the address, they only receive an auto-response. “Because of the volume of reports received, administrators do not respond to individual problem reports: instead, we monitor incoming reports to detect system-wide problems as quickly as possible,” Google’s reply reads. “If you need an individualized response, or a specific action taken on your account, contact your contract administrator instead.”
“The communication from Google was totally nonexistent,” one former rater said. “Google is legendary for not communicating.”
“The people at the other end of this pipeline in Mountain View are like the wizard behind the curtain,” another former rater said. “We would like very much to communicate with them, be real colleagues, but no.”
For its part, Google does inform raters that they’re doing important work, even if it doesn’t spell out exactly why.
“We won’t always be able to tell you what [each] task is for, but it’s always something we consider important,” the company explains in orientation materials for ad raters. “You won’t often hear about the results of your work. In fact, it sometimes might seem like your work just flows into a black hole… Even though you don’t always see the impact, your work is very important, and many people at Google review it very, very closely.”
Sometimes too closely for some workers’ comfort. Google incorporates already-reviewed content into ad raters’ assignments to gauge their performance. “These exams appear as normal tasks, and you will acquire them along with your regular work,” an email to an ad rater from Google reads. “You will not be able to tell which tasks you are being tested on … We use exam scores to evaluate your performance. Very low scores may result in your assignment being ended.”
Embedding questions with known answers is a common practice in crowdsourcing research, according to Georgia Tech AI researcher Mark Riedl. The strategy is often used to determine whether a researcher should throw out data from an individual who might be clicking randomly, he explains, and it’s often jokingly called the Turing Test among practitioners.
But Riedl says he doesn’t care for the Turing Test reference. “It perpetuates an attitude that crowd workers are machines when instead we need to recognize that crowd workers are humans, for whom we have an ethical and moral responsibility to design tasks that recognize the dignity of the worker,” he says.
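The known-answer checks described above can be sketched in a few lines of Python. This is only an illustration of the general crowdsourcing technique, not Google’s actual system: the task IDs, the labels, the function names, and the 0.7 accuracy threshold are all hypothetical.

```python
from typing import Dict


def gold_accuracy(responses: Dict[str, str], gold: Dict[str, str]) -> float:
    """Return the fraction of embedded known-answer tasks a rater labeled correctly.

    `responses` maps task ID -> the rater's label for every task they completed.
    `gold` maps task ID -> the previously agreed label for the hidden check tasks.
    """
    # Only tasks that have a known answer can be scored; the rest are regular work.
    checked = [task for task in responses if task in gold]
    if not checked:
        raise ValueError("rater completed no known-answer tasks")
    correct = sum(responses[task] == gold[task] for task in checked)
    return correct / len(checked)


def is_reliable(responses: Dict[str, str], gold: Dict[str, str],
                min_accuracy: float = 0.7) -> bool:
    """Flag whether a rater meets a (hypothetical) minimum accuracy threshold."""
    return gold_accuracy(responses, gold) >= min_accuracy


# Hypothetical data: three hidden check tasks mixed into each rater's queue.
GOLD = {"v1": "violence", "v4": "none", "v7": "profanity"}
careful = {"v1": "violence", "v2": "none", "v4": "none", "v7": "profanity"}
random_clicker = {"v1": "none", "v3": "drugs", "v4": "violence", "v7": "none"}
```

A researcher using this approach would typically discard or down-weight the labels of anyone for whom `is_reliable` returns `False`. The rater cannot tell which tasks are in `GOLD`, which is what makes the check hard to game.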
To be sure, not all ad raters share the complaints raised by some of their fellow workers. The $15-per-hour rate is still above most cities’ minimum wages. One ad rater told me he was grateful for the opportunity ZeroChaos gave him. “[ZeroChaos] didn’t care about a criminal background when even McDonald’s turned me down,” the rater said. Multiple raters said they’d been close to homelessness or needing to go on food stamps when this job came along.
But others say the flexibility often doesn’t end up working in their favor, even as they come to depend on this job. Working from home and choosing one’s own hours are perks. But according to a ZeroChaos FAQ, ad raters are prohibited from working for other companies at the same time. One former ad rater says she is now doing temp work for another company through ZeroChaos and would also like to resume doing ad rater work to help make ends meet but can’t because of the restriction. “If I could work jobs simultaneously, that would be great, a living wage,” she says. “Right now, I’m earning $40 a week more than I did on unemployment. That’s not sustainable.”
The Human-AI Connection
Big companies across the tech industry employ temporary workers to participate in repetitive tasks meant to train AI systems, according to multiple ad raters WIRED spoke with. One ad rater described a job several years ago rating Microsoft Bing search results, in which human evaluators were expected to go through as many as 80 pages of search results an hour. LinkedIn and Facebook employ humans for similar tasks too, raters told me—LinkedIn for data annotation, and Facebook for rating “sponsored posts” from fan pages.
(Microsoft declined to comment, while LinkedIn could not confirm such a program. Facebook did not respond to a request for comment.)
The overall job insecurity of temp work and widespread turnover unsettle current and former employees, who argue Google is losing the institutional knowledge possessed by workers who have spent more time on the job. “They’re wasting money by taking the time to train new people then booting them out the door,” says one former ad rater.
But churning through human ad raters may just reflect best practices for making AI smarter. Artificial intelligence researchers and industry experts say a regular rotation of human trainers inputting data is better for training AI. “AI needs many perspectives, especially in areas like offensive content,” says Jana Eggers, CEO of AI startup Nara Logics. Even the Supreme Court could not describe obscenity, she points out, citing the “I know it when I see it” threshold test. “Giving ‘the machine’ more eyes to see is going to be a better result.”
But while AI researchers agree in general that poor human morale doesn’t necessarily cause poor machine learning, there may be more subtle effects that stem from one’s work environment and experiences. “One often hears the perspective that getting large amounts of diverse inputs is the way to go for training AI models,” says Bart Selman, a Cornell University AI professor. “This is often a good general guideline, but when it comes to ethical judgments it is also known that there are significant ingrained biases in most groups.” For example, Selman says, the perception that men are better at certain types of jobs than women, and vice versa. “So, if you train your AI hiring model on the perceptions of a regular group of opinions, or past hiring decisions, you will get the hidden biases present in the general population.” And if it turns out you’re training your AI mainly on the perceptions of anxious temp workers, they could wind up embedding their own distinct biases in those systems.
“You would not want to train an AI ethics module by having it observe what regular people do in everyday life,” Selman says. “You want to get input from people that have thought about potential biases and ethical issues more carefully.”
Googlers at the company’s sprawling Mountain View headquarters enjoy a picturesque campus, free gourmet cafeteria food, and rec room games like pool and foosball. That’s a far cry from the life of a typical ad rater. These days, working for the world’s most valuable tech companies can mean luxurious perks and huge paydays. It can also mean toiling away as a temp worker at rote tasks, training these companies’ machines to do the same work.