Turn Privacy Debate on Its Head, Says Researcher

Could a lack of privacy regulations in the U.S. and abusive practices lead to a backlash that negatively affects scientific research for the greater social good? That worries Tom Mitchell, a Carnegie Mellon professor and machine learning researcher, whose profile appears this week in the pages of Computerworld.

As smart phones diligently record people's locations, movements and other activities, machine learning and real-time data mining can be used for the greater good. For example, real-time positioning and movement data from your smart phone is already being used to track traffic congestion. Soon it could be used to change traffic light patterns in order to optimize traffic flows.

Machine learning algorithms feed on such data to make predictions for good -- or ill. Patient data could be analyzed to inform you that yesterday you came in contact with someone who has a contagious disease. But if you have the disease, do you want that information made public? What about entities that might use machine learning tools to identify you in random groups of photos that you or others have posted on the Web? How about identifying your mother or your child?

The current approach to privacy needs to be turned on its head, Mitchell argues. "I would like to see a shift from the current idea that the data capturer owns the data to see the person that the data is about have final control over it."

But that won't happen without an informed debate on how data mining goals can be achieved while preserving anonymity. Right now, he says, policy makers are not getting that information. They may be led to believe that it's all or nothing, and that any attempt to provide guidelines through privacy regulation will shut the doors to innovation.

Not necessarily.

"Even though there are guidelines out there about how to deal with privacy issues, a lot of those guidelines and a lot of the debates I've heard are underinformed about what the available technology is for assisting in protecting privacy," he says.

Mining data while protecting patients

For example, says Mitchell, take the patient data from all of the hospitals in the U.S.

If researchers could collect all of the data from all H1N1 patients in hospitals in the U.S., they would have a tremendous data set. Machine learning algorithms, which work best with very large data sets, could then be used to determine which treatments allow patients to recover quickly for a given pattern of disease symptoms. In a rapidly moving disease outbreak, such as a pandemic, machine learning algorithms could help doctors figure out what works for very specific patient profiles -- and what doesn't. What works best when the patient is under 12 and has a high fever but doesn't work so well if no fever is present? "Those patterns exist out there but we can't discover them quickly if we don't have the data," Mitchell says.

But if hospitals were willing to participate, researchers could do that work without copying or directly accessing patient data.

"Most people think that we'd have to send all of that data to one place to run all of our data mining algorithms on it. The truth is there are privacy enhancing data mining methods that allow people to leave the data there. There are techniques to collect the statistics you need through a combination of cryptographic methods and passing around the statistics rather than the data," Mitchell says.

"Suppose we had six hospitals and we wanted to run some data mining algorithms to see which symptoms tend to lead to a particular treatment being successful. In the inner loop of that algorithm there's a repeating set of questions being asked, like how many patients are under 12. You can answer those questions without sharing the data," he says.

"You make up a random number, a really big number like 6453522, and you give it to hospital one and ask, 'How many of your H1N1 patients are under 12?' They pass that number plus that random number -- the total -- to the next hospital. The next hospital gets that random number and they add in theirs and so forth. When it comes back to you, you subtract the random number you started out with and you've got the total. No hospital had any information about any other hospitals' data. There was no information shared at all, even at the level of how many patients satisfied the criteria, and yet at the end, because you knew the random number, you could figure out the total. You don't know anything about the individual hospitals," says Mitchell.

"That's an example of privacy-preserving ways of collecting statistics," according to Mitchell.
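The counting scheme Mitchell describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any system actually in use: the hospital counts are made up, the function name secure_sum is hypothetical, and the arithmetic is done modulo a large number (a common refinement that Mitchell does not mention) so that the running total looks random to every hospital that handles it.

```python
import secrets

# Assumption: all arithmetic wraps around a large modulus, so the masked
# running total reveals nothing about the counts added so far.
MODULUS = 2**64


def secure_sum(local_counts, modulus=MODULUS):
    """Pass a masked running total from site to site and unmask it at the end.

    Returns the true total and the list of values each site received,
    so the caller can see that no site ever handled an unmasked count.
    """
    mask = secrets.randbelow(modulus)  # the coordinator's secret random number
    running = mask
    received = []
    for count in local_counts:
        received.append(running)               # what this hospital is handed
        running = (running + count) % modulus  # hospital adds its own count
    total = (running - mask) % modulus         # coordinator subtracts the mask
    return total, received


# Made-up answers from six hospitals to "how many H1N1 patients are under 12?"
hospital_counts = [12, 7, 30, 0, 5, 19]
total, received = secure_sum(hospital_counts)
print(total)  # 73
```

The coordinator learns only the total, and each hospital sees only a masked number. The sketch assumes the hospitals do not compare notes: two hospitals on either side of a third could subtract the values they saw and recover that hospital's count.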

Ten years ago no one even thought about this kind of approach, Mitchell says. "But because of the increasing interest in privacy, people are trying to design algorithms like that to achieve the same result without the privacy implications."

It takes a technologist to explain that to policy makers.

"Most of the discussions I've heard in Washington about privacy and mining personal data are just not informed about the existence of these techniques. So, it's really important that technologists insert themselves into this discussion to make sure that when we're weighing all of the tradeoffs we're informed about what the options really are," Mitchell says.
