The U.S. National Institute of Standards and Technology (NIST) wants to bring some metrics and rigor to the nascent but rapidly growing field of data science.
The government agency is embarking on a project to develop, by 2016, a framework that all industries can use to understand how to apply data science and big data projects and how to measure their results.
NIST, an agency of the U.S. Department of Commerce, is holding a symposium on Tuesday and Wednesday at its Gaithersburg, Maryland, headquarters with big data specialists and data scientists to better understand the challenges around the emerging discipline.
“Data science is pretty convoluted because it involves multiple data types, structured and unstructured,” said event speaker Ashit Talukder, NIST chief for its information access division. “So metrics to measure the performance of data science solutions is going to be pretty complex.”
Starting with this symposium, the organization plans to seek feedback from industry about the challenges and successes of data science and big data projects. It then hopes to build a common taxonomy with the community that can be used across different domains of expertise, allowing best practices to be shared among multiple industries, Talukder said.
While computer-based data analysis is nothing new, many of the speakers at the event talked about a fundamental industry shift now underway in data analysis.
Doug Cutting, who originally created the Hadoop data processing platform, noted that what made Hadoop unique is that it took a different approach to working with data. Instead of moving the data to a place where it can be analyzed, the approach used with data warehouses, the analysis takes place where the data itself is stored.
“You can’t move [large] data sets without major performance penalties,” Cutting said. Since its creation in 2005, Apache Hadoop has set the stage for storing and analyzing data sets so large that they cannot fit into a standard relational database, hence the term “big data.”
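The idea Cutting describes, running the analysis where the data is stored and moving only small results, can be illustrated with a toy sketch. This is not Hadoop code: the node names, log lines and function names below are hypothetical, and the "nodes" are simulated in a single Python process purely to show the shape of the approach.

```python
from collections import Counter

# Hypothetical stand-in for data blocks stored on three separate nodes.
NODE_BLOCKS = {
    "node-a": ["error disk full", "info job started"],
    "node-b": ["error timeout", "info job finished", "error disk full"],
    "node-c": ["info heartbeat"],
}


def local_count(lines):
    """Meant to run on the node holding the block: count log levels locally."""
    return Counter(line.split()[0] for line in lines)


def merge_counts(partials):
    """Meant to run on a coordinator: combine the small per-node summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def analyze(blocks):
    # Only the compact Counter objects would cross the network,
    # not the raw log lines themselves.
    partials = [local_count(lines) for lines in blocks.values()]
    return merge_counts(partials)


if __name__ == "__main__":
    print(dict(analyze(NODE_BLOCKS)))
```

In a real cluster the per-node step would be scheduled on the machines that hold each block, so the cost of moving data is limited to the summaries rather than the full data set.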
As these data sets grow larger, the tools for working with them are changing as well, noted Volker Markl, a professor and chair of the database systems and information management group at the Technische Universität Berlin.
“Data analysis is becoming more complex,” Markl said. As a discipline, data science is challenging in that it requires understanding both the technologies that handle the data, such as Hadoop and R, and the statistics and other forms of mathematics needed to harvest useful information from the data, Markl said.
“A lot of companies are finding that they thought they were getting data science when they purchased Hadoop, but then they have to hire a data scientist to really do something useful with it,” said symposium attendee Brand Niemann, director and senior enterprise architect at the data management consulting firm Semantic Community.
Another emerging problem with data science is that a data analysis system is very difficult to maintain over time, given its complexity. As the people who developed the algorithms to analyze data move on to other jobs or retire, an organization may have difficulty finding others who understand how the code works, Markl said.
Another challenge will be visualization, said Pak Chung Wong, chief scientist at the Department of Energy’s Pacific Northwest National Laboratory. Visualization has long been a proven technique to help humans pinpoint trends and unusual events buried in large amounts of data, such as log files.
Standard visualization techniques may not work well with petabyte- and exabyte-sized datasets, Wong warned. Such datasets may be arranged in hierarchies that can go 60 levels deep. “How can you represent that?” he asked.