Hadoop brings far-flung people together across time and space
With its powerful data mining capabilities, Hadoop is bringing together people across different places and even across different generations.
While Hadoop continues to grow in popularity, most reported use cases around the open-source data-processing platform revolve around ad targeting or some other specialized task. But at O’Reilly’s Strata-Hadoop World, held last week in New York, a number of Internet services talked about how they use Hadoop to bring people together.
Ancestry.com is using Hadoop as the cornerstone of a new service that allows users to submit a sample of their DNA and then have Ancestry.com look for matches to far-flung relatives, both alive and long-deceased. And social dating service eHarmony uses Hadoop to refine its process of matching its millions of members.
In both cases, Hadoop has excelled at comparing hundreds or even thousands of variables across millions of different entities, a job much too large for traditional relational databases or even data warehouses.
“Hadoop is one of those key tools that has allowed us to create a massively scalable system,” said Ancestry.com Chief Technology Officer Scott Sorensen in one presentation. The service is moving from proprietary tools to Hadoop to parse its large and ever-growing amount of data, he said.
Ancestry.com generates around US$480 million a year in revenue from people who use the service to chart their ancestry, using their own documents as well as Ancestry.com’s repository of 12 billion public records, about 10 petabytes’ worth of data.
Hadoop powers a new service offered by the company called AncestryDNA. A user can send in a saliva sample, along with $99, and the company will read 700,000 SNPs (single-nucleotide polymorphisms, pronounced “snips”) from the DNA in the sample and load the results into Hadoop, which will compare the snips to more than 200,000 other samples collected by the company. The company can then provide a list of far-flung relatives, whose family connections can go back 10 generations or more.
Half of a person’s DNA comes from each biological parent. “Small changes in that DNA over generations leave bread crumbs that are like a view into history,” Sorensen said. Ancestry.com can use the snips to determine a user’s mix of ethnicities, as well as match the user with distant relatives.
Hadoop proved to be uniquely suited for this task in that it excels at taking 700,000 snips and then comparing those to snips from hundreds of thousands of other people’s DNA to find matches. The service can find, on average, 40 fourth cousins for every customer who submits a sample. That result will only improve as more people submit their DNA, Sorensen said.
The company used a number of algorithms developed in academia for finding hidden matches in DNA. But the engineers at Ancestry.com had to parallelize the algorithms to run them across a multinode Hadoop deployment. Using traditional scale-up architectures, it would take Ancestry.com up to four weeks to compare 120,000 sets of DNA.
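The all-against-all comparison described above can be illustrated with a minimal Python sketch. It is not Ancestry.com's algorithm, which the article does not detail. It assumes each sample is a string of genotype calls at the same marker positions, and the names `shared_fraction` and `find_matches`, the 0.8 threshold and the sample data are all hypothetical.

```python
from itertools import combinations


def shared_fraction(a, b):
    """Fraction of marker positions where two equal-length samples agree."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same markers")
    if not a:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


def find_matches(samples, threshold=0.8):
    """Return (id1, id2, score) for every pair scoring at or above threshold.

    Each pair is scored independently of every other pair, so this loop is
    the part a cluster could divide among many nodes.
    """
    matches = []
    for (id1, s1), (id2, s2) in combinations(sorted(samples.items()), 2):
        score = shared_fraction(s1, s2)
        if score >= threshold:
            matches.append((id1, id2, score))
    return matches


# Toy data, invented for illustration only.
samples = {
    "alice": "AAGTCCGTTA",
    "bob": "AAGTCCGTAA",
    "carol": "TTCAGGACCT",
}
result = find_matches(samples)
```

If every sample were compared with every other, 120,000 samples would produce 120,000 × 119,999 / 2, or about 7.2 billion, pairs, which gives a sense of why the work has to be spread across many machines.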
Also at the conference, Vaclav Petricek, director of machine learning at eHarmony, described how the online dating service uses Hadoop to make better matches among its customers.
As with Ancestry.com’s DNA service, the fundamental problem eHarmony tackles is a massively parallel one. The service wants to find a set of potential suitors for each member of the service, which involves doing many comparisons across a large number of factors, while slimming down the result sets to manageable proportions.
“We want to give people enough options to keep them engaged, but we don’t want to overwhelm them,” Petricek said. “Because this is an embarrassingly parallel problem, you can run this on Hadoop in parallel.”
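What "embarrassingly parallel" means here can be sketched in a few lines of Python: each member's candidate list is computed without reference to any other member's list, so the per-member jobs can run side by side. This is an illustration under stated assumptions, not eHarmony's system. The `compatibility` score, the function names and the three-trait profiles are hypothetical, and a thread pool stands in for the nodes of a Hadoop cluster.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def compatibility(a, b):
    """Hypothetical score: higher when two trait vectors are closer."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def top_candidates(member_id, profiles, k=2):
    """Score every other member and keep only the k best."""
    me = profiles[member_id]
    scored = (
        (compatibility(me, other), other_id)
        for other_id, other in profiles.items()
        if other_id != member_id
    )
    return [other_id for _, other_id in heapq.nlargest(k, scored)]


def match_all(profiles, k=2, workers=4):
    """Run the per-member search independently for each member."""
    ids = sorted(profiles)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda m: top_candidates(m, profiles, k), ids)
        return dict(zip(ids, results))


# Toy profiles with three made-up trait scores each.
profiles = {
    "m1": (1, 1, 1),
    "m2": (1, 1, 2),
    "m3": (5, 5, 5),
    "m4": (5, 4, 5),
}
matches = match_all(profiles, k=1)
```

Capping each list at `k` entries mirrors the goal Petricek describes: enough options to keep members engaged without overwhelming them. Threads in CPython do not speed up CPU-bound scoring; they are used here only to show that the jobs share no state.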
EHarmony customers fill out an extensive questionnaire, which helps to estimate the user’s personality across 29 different dimensions.
The system first uses algorithms to predict how happy two potential matches would be if they were married, using scientific studies that describe the personality traits of people in both happy and “distressed” marriages, Petricek said. If they have personality types that would indicate they would be happy in a marriage together, they are considered for pairing.
This is only the first step, however. EHarmony must also predict how attracted two potential matches would be to one another.
“There is no guarantee that people who have compatible personalities would be interested in each other,” Petricek said.
Gauging attractiveness between two people is where the use of big-data-style machine learning comes in. The service keeps track of a wide range of additional variables about its members, from the types of devices used to interact with eHarmony to whether each individual is single or divorced. The company also keeps track of the flow of messages among its members, charting which exchanges led to successful matches and trying to find indicators among all the known variables as to why these matches were successful.
For instance, one fairly predictable variable has been distance. The farther apart two people are geographically, the less likely they are to pick one another from a list of candidates. Another variable is the difference in heights between a potential heterosexual couple. On average, the two people are most likely to communicate if the male is 4 to 8 inches taller than the female. The company does not know ahead of time which factors will prove to be predictors of compatibility, so Hadoop churns through all the combinations of all the variables looking for clues.
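One simple way to look for such clues is to group past interactions by the value of a variable and compare how often each group led to communication. The Python sketch below does that for a single variable. It is an assumption about the general approach, not a description of eHarmony's models, and the `rate_by_bucket` function, the field names and the records are invented for illustration.

```python
from collections import defaultdict


def rate_by_bucket(records, feature, bucket_size):
    """Group records into buckets of a feature; return the reply rate per bucket."""
    totals = defaultdict(int)
    replies = defaultdict(int)
    for rec in records:
        # Floor the value to the start of its bucket, e.g. 5 -> 4 for size 4.
        bucket = (rec[feature] // bucket_size) * bucket_size
        totals[bucket] += 1
        replies[bucket] += 1 if rec["replied"] else 0
    return {b: replies[b] / totals[b] for b in sorted(totals)}


# Invented records for illustration; not eHarmony data.
records = [
    {"height_diff_in": 1, "replied": False},
    {"height_diff_in": 2, "replied": True},
    {"height_diff_in": 5, "replied": True},
    {"height_diff_in": 6, "replied": True},
    {"height_diff_in": 7, "replied": False},
    {"height_diff_in": 10, "replied": False},
]
rates = rate_by_bucket(records, "height_diff_in", 4)
```

Repeating this calculation for every variable, and for combinations of variables, is independent work per variable, which is why it suits a cluster.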
The use of Hadoop is improving the service eHarmony offers, Petricek said. According to a third-party study, the divorce rate among couples who met on eHarmony and married between 2005 and 2012 was about half the rate of those who met offline. The sample set is limited, Petricek said, but it still is a “very encouraging” sign for the use of big-data analysis.