An international group of researchers has developed an algorithmic tool that uses Twitter to automatically predict exactly where you live in a matter of minutes, with more than 90 percent accuracy. It can also predict where you work, where you pray, and other information you might rather keep private, like, say, whether you’ve frequented a certain strip club or gone to rehab.
The tool, called LPAuditor (short for Location Privacy Auditor), exploits what the researchers call an "invasive policy" Twitter deployed after it introduced the ability to tag tweets with a location in 2009. For years, users who chose to geotag tweets with any location, even something as geographically broad as “New York City,” also automatically gave their precise GPS coordinates. Users wouldn’t see the coordinates displayed on Twitter. Nor would their followers. But the GPS information would still be included in the tweet’s metadata and accessible through Twitter’s API.
Twitter didn't change this policy across its apps until April of 2015. Now, users must opt in to share their precise location, and, according to a Twitter spokesperson, a very small percentage of people do. But the GPS data people shared before the update remains available through the API to this day.
The researchers developed LPAuditor to analyze those geotagged tweets and infer detailed information about people’s most sensitive locations. They outline this process in a new, peer-reviewed paper that will be presented at the Network and Distributed System Security Symposium next month. By analyzing clusters of coordinates, as well as timestamps on the tweets, LPAuditor was able to suss out where tens of thousands of people lived, worked, and spent their private time.
A member of Twitter's site integrity team told WIRED that sharing location data on Twitter has always been voluntary and that the company has always given users a way to delete that data in its help section. "We recognized in 2015 that we could be even clearer with people about that, but our overarching perspective on location sharing has always been that it’s voluntary and that users can choose what they do and don't want to share," the Twitter employee said.
It's true that it's always been up to users to geotag their tweets or not. But there's a big difference between choosing to share that you're in Paris and choosing to share exactly where you live in Paris. And yet, for years, regardless of the square mileage of the locations users chose to share, Twitter was choosing to share their locations down to the GPS coordinates. The fact that these details were spelled out in Twitter's help section wouldn't do much good to users who didn't know they needed help in the first place.
"If you're not aware of the problem, you're never going to go remove that data," says Jason Polakis, a co-author of the study and an assistant professor of computer science at the University of Illinois at Chicago specializing in privacy and security. And according to the study, that data can reveal a lot.
In November of 2016, well after Twitter changed its settings, Polakis and researchers at the Foundation for Research and Technology in Crete began pulling Twitter metadata from the company’s API. They were building on prior research that showed it was possible to infer private information from geotagged tweets, but they wanted to see if they could do it at scale and with more precision, using automation.
The researchers analyzed a pool of about 15 million geotagged tweets from about 87,000 users. Some of the location data attached to those tweets may have come from users who wanted to share their exact locations, like, say, a museum or music venue. But there were also plenty of users who shared nothing more than a city or general vicinity, only to have their GPS location shared anyway.
From there, LPAuditor set to work assigning each tweet to a physical spot on a map, and locating it by time zone. That generated clusters of tweets around the map, some busier than others, indicating locations where a given user spends a lot of time—or at least, a lot of time tweeting.
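The article does not say which clustering method LPAuditor uses, so the following is only a minimal sketch of the general idea: snapping each tweet's coordinates onto a coarse grid and counting how many tweets land in each cell. The function names, the grid size, and the sample coordinates are all illustrative assumptions, not details from the paper.

```python
from collections import Counter

def grid_cell(lat, lon, cell_deg=0.001):
    """Snap a coordinate onto a grid cell (hypothetical helper).

    0.001 degrees of latitude is roughly 111 meters; this cell size is an
    assumption chosen for illustration, not a value from the study.
    """
    return (round(lat / cell_deg), round(lon / cell_deg))

def cluster_counts(points, cell_deg=0.001):
    """Count how many tweets fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)

# Made-up coordinates: three tweets close together, one farther away.
points = [
    (41.8702, -87.6501),
    (41.8703, -87.6502),
    (41.8701, -87.6499),
    (41.8902, -87.6240),
]

counts = cluster_counts(points)
busiest_cell, busiest_count = counts.most_common(1)[0]
print(busiest_cell, busiest_count)
```

A real system would likely use a proper spatial clustering algorithm rather than a fixed grid, since a grid can split one location across neighboring cells.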
To predict which cluster might correspond to a user’s home, the researchers directed LPAuditor to look for locations where people spent the longest time span tweeting over the weekend. The thinking was: During the week, you might tweet in the morning, at night, and on your day off, in an unpredictable pattern, but home is where most people spend the bulk of their time on weekends.
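One way to read the weekend heuristic is sketched below. For each cluster, it measures the span between the first and last tweet on each weekend day, adds those spans up, and picks the cluster with the largest total. This is an interpretation of the description above, not the researchers' code; `guess_home`, the cluster labels, and the sample timestamps are hypothetical, and the timestamps are assumed to already be in the user's local time.

```python
from collections import defaultdict
from datetime import datetime

def guess_home(tweets):
    """Pick the cluster with the longest total weekend tweeting span.

    tweets: iterable of (cluster_id, local_datetime) pairs.
    Returns a cluster_id, or None if there are no weekend tweets.
    """
    # cluster -> calendar date -> weekend tweet times on that date
    by_cluster_day = defaultdict(lambda: defaultdict(list))
    for cluster_id, ts in tweets:
        if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            by_cluster_day[cluster_id][ts.date()].append(ts)

    spans = {}
    for cluster_id, days in by_cluster_day.items():
        # Span per day = last tweet minus first tweet, summed over weekend days.
        spans[cluster_id] = sum(
            (max(times) - min(times)).total_seconds() for times in days.values()
        )
    return max(spans, key=spans.get) if spans else None

# Made-up data: January 3, 2015 was a Saturday, January 5 a Monday.
sample = [
    ("A", datetime(2015, 1, 3, 9, 0)),
    ("A", datetime(2015, 1, 3, 22, 0)),
    ("A", datetime(2015, 1, 4, 10, 0)),
    ("B", datetime(2015, 1, 3, 13, 0)),
    ("B", datetime(2015, 1, 3, 14, 0)),
    ("B", datetime(2015, 1, 5, 10, 0)),  # weekday, ignored here
]
home = guess_home(sample)
print(home)
```

Measuring span rather than tweet count means a place visited for one busy hour does not outrank a place where tweets are spread across a whole day.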
When it came to finding work locations, they did the opposite, analyzing tweet patterns during the week. LPAuditor analyzed the locations where users tweeted the most (not including home), then studied the time frames during which those tweets were sent. That gave the researchers a sense of whether the tweets might have been sent over the course of a typical eight-hour shift, even if that shift was overnight. Finally, the tool looked for the time frame that appeared most often during the week and decided that the location with the most tweets in that time frame was most likely the person’s place of work.
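The work heuristic can be sketched in a similarly simplified way. The version below looks only at weekday tweets outside the home cluster, finds the eight-hour window (allowed to wrap past midnight) that contains the most of them, and returns the cluster with the most tweets inside that window. The real tool's logic is likely more involved; `guess_work`, `in_window`, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

SHIFT_HOURS = 8  # length of a typical shift, as described in the article

def in_window(hour, start):
    """True if `hour` falls in the 8-hour window beginning at `start`.

    The modulo lets the window wrap past midnight, so overnight shifts work.
    """
    return (hour - start) % 24 < SHIFT_HOURS

def guess_work(tweets, home_cluster):
    """tweets: iterable of (cluster_id, local_datetime) pairs."""
    weekday = [
        (cluster_id, ts.hour)
        for cluster_id, ts in tweets
        if ts.weekday() < 5 and cluster_id != home_cluster
    ]
    if not weekday:
        return None
    # Window start hour that captures the most weekday tweets.
    best_start = max(
        range(24), key=lambda s: sum(in_window(h, s) for _, h in weekday)
    )
    counts = Counter(c for c, h in weekday if in_window(h, best_start))
    return counts.most_common(1)[0][0]

# Made-up data: January 5, 2015 was a Monday.
sample = [
    ("A", datetime(2015, 1, 5, 7, 0)),    # home cluster, excluded
    ("B", datetime(2015, 1, 5, 9, 0)),
    ("B", datetime(2015, 1, 5, 11, 0)),
    ("B", datetime(2015, 1, 6, 14, 0)),
    ("B", datetime(2015, 1, 6, 16, 0)),
    ("C", datetime(2015, 1, 7, 20, 0)),
]
work = guess_work(sample, home_cluster="A")
print(work)
```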
When it came time to check their answers, the researchers identified a group of roughly 2,000 users to serve as a sort of ground truth. Compiling this group was a manual process that required two graduate students to independently sift through all of the tweets in the collection to find key phrases that might confirm a person really was home or at work when they sent it. Terms like “I’m home” or “at the office,” for instance, might provide a clue. They inspected each tweet for context that might provide additional information.
They then compared the locations of those tweets to the tool's predictions and found they were highly accurate, identifying people’s homes correctly 92.5 percent of the time. It wasn’t as good at predicting where people worked, getting that right just 55.6 percent of the time. But that, Polakis says, could simply mean that the location they identified as “work” is actually a school or a place where the person spends what would otherwise be working hours.
Finally, the researchers set about identifying sensitive locations a user might have visited. To do that, they compared the tweet locations to Foursquare’s directory of businesses and venues. They were looking for places like hospitals, urgent care centers, places of worship, and also strip clubs and gay bars. Any venue that appeared within 27 yards of the geotagged tweet would be considered as a potential location. Then, they conducted a similar keyword analysis, searching for words associated with health, religion, sex, and nightlife, to check whether a user was likely where they seemed to be. Using this method, the researchers found that LPAuditor was right about sensitive locations about 80 percent of the time.
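The proximity check can be illustrated with a great-circle distance calculation. The sketch below uses the standard haversine formula to find venues within 27 yards (about 24.7 meters) of a tweet. The venue list is invented, and the code does not call Foursquare or any other real service; `nearby_venues` and the data layout are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000      # mean Earth radius in meters
MAX_DISTANCE_M = 27 * 0.9144    # 27 yards is about 24.7 meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby_venues(tweet, venues, max_m=MAX_DISTANCE_M):
    """Return names of venues within max_m meters of the tweet.

    tweet: (lat, lon); venues: list of (name, lat, lon) tuples.
    """
    lat, lon = tweet
    return [
        name for name, vlat, vlon in venues
        if haversine_m(lat, lon, vlat, vlon) <= max_m
    ]

# Made-up venues: one about 11 meters away, one about 111 meters away.
venues = [
    ("clinic", 41.8701, -87.6500),
    ("church", 41.8710, -87.6500),
]
matches = nearby_venues((41.8700, -87.6500), venues)
print(matches)
```

Because several venues can fall inside the same radius, a match is only a candidate, which is why the researchers followed up with keyword analysis.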
Of course, if a user is tweeting about, say, being at the doctor while they’re at the doctor, one might argue that they’re not so concerned about privacy. But Polakis says, “The location might give away more information than the user wants to say.” In one case, the researchers found a user who was tweeting about a doctor from a location that the GPS coordinates revealed to be a rehab facility. “That’s a lot more sensitive context than what they were willing to disclose,” he says.
Even when tweets didn’t include context clues, LPAuditor was still able to predict whether a person had actually spent time at a sensitive location by studying how long people spent there and how many times they returned. The researchers were, however, unable to measure the accuracy of these specific predictions.
The majority of this research was based on tweets that were sent prior to Twitter's policy change in April 2015. That change, Polakis says, made a huge difference in terms of how much precise location data was available through the API. To measure just how huge, the researchers excluded all of the tweets they collected prior to April 2015 and found that they were only able to positively identify key locations for about one-fifteenth of the users they were studying. In other words, Polakis says, "That kind of invasive Twitter behavior increased the amount of people we could attack by 15 times."
The fact that Twitter changed its policies is a good thing. The problem is, so much of that pre-2015 location data is still available through the API. Asked why Twitter didn't scrub it after changing the policy, the Twitter site integrity employee said, "We didn’t feel it would be appropriate for us to go back and unilaterally make the decision to change people’s tweets without their consent."
This is not the first study to reveal what can be inferred from location data, or even geotagged tweets. But, according to Henry Kautz, a computer scientist at the University of Rochester who has conducted similar research, this paper makes key contributions. "The advancement here is that they studied two types of locations—work and home—rather than one, and they did a larger study with a more systematic evaluation and a more highly tuned algorithm, so it got the right answer a higher percentage of the time," Kautz says. LPAuditor isn't exclusive to Twitter data either. It could be applied to any set of location data.
Kautz argues that Twitter is of relatively small concern compared to other apps that continue to use invasive location data practices today. Government officials in Los Angeles recently filed a lawsuit against the IBM-owned Weather Channel app for allegedly collecting and selling users' geolocation data under the guise of helping users "personaliz[e] local weather data, alerts, and forecasts." And just this week, Motherboard reported that bounty hunters are using location data purchased from T-Mobile, Sprint, and AT&T to track individuals using their phones. That's despite the companies' public promises to stop selling such data. Then, of course, there are apps that get infected with malware and gobble up location data.
"The big problem today is not nefarious people looking at your geotagged tweets. The problem is compromised cell phone apps that steal your entire GPS history," Kautz says. "From that data one can extract not just your home and work locations, but a huge number of significant places in your life."
And yet, Polakis says the fact that Twitter no longer attaches GPS coordinates to all geotagged tweets isn't enough, given that developers still have access to years' worth of data from before 2015. Yes, some of that information might now be stale. People move. They change jobs. But even outdated information can be useful to an attacker, and other sensitive information, like, say, a person's sexuality, seems unlikely to change. This study proves not only that it is possible to infer this kind of information from location data, but also that a machine can do it almost instantly.
For now, Polakis says, the most people can do is delete their location data today—and think twice before sharing it in the future.