The quest to harness today’s big data technology is like trying to steer a runaway stagecoach across the Wild West. It is challenging, fast-moving and a little bumpy.
Every day, scientists, engineers, medical staff, statisticians and others generate staggering amounts of new and different data. Collecting data is not new, but with more information comes a new set of problems, the biggest being how to make sense of it all and how to make use of it all. As scientists barrel toward promising new discoveries, they need resources to help them sort through their massive amounts of data. They need the training to wrangle a project in hopes of a breakthrough.
Hridesh Rajan, a professor of computer science and the Kingland Professor of Data Analytics at Iowa State University, helps rein it in by creating tools that help researchers who are struggling to organize their mountains of information. Researchers across the university and across the globe use his work. To be a part of a university that is internationally known for its innovations in technology is a major perk of his job, he said.
“Iowa State faculty are doing the latest and greatest in the field of data-driven research,” Rajan said. “Today, in terms of data-driven research, in terms of data-driven education, we are in the big leagues. There is absolutely no doubt about that.”
Addressing new needs in a data-driven world
“There is no field today that is not touched by data,” Rajan said.
However, scientists investigating hypotheses rarely have the resources they need to sort through their data and analyze it for meaningful conclusions.
“In the data-driven world, there aren’t solutions that allow scientists to just jump in,” he said. “Most of the time, people give up somewhere along the way. They decide they don’t have the expertise or resources.”
Rajan was tired of seeing researchers give up. In May of 2012, he and a team began creating “Boa,” a “domain-specific language and infrastructure that eases mining software repositories.” In other words, it is an easy-to-use, incredibly fast web-based tool that simplifies data mining.
A year later, Rajan and his team presented Boa at the International Conference on Software Engineering, where it was well received. Soon after, they opened it for public use. Interest was immediate. Today, nearly 900 research groups across the world use the web app to answer critical questions such as how to improve identification of defective software and which software development practices are most effective. Rajan is currently creating similar infrastructures to answer traffic engineering questions such as which highway segments consistently bottleneck and genomics questions such as what genome assembler is used most in a node of the tree of life.
Big data tools in practice
Rajan, of course, uses Boa, too. One of his toughest projects has been documenting software libraries, collections of code and scripts that developers use to build modern, complex software. Like a collection of standard and custom parts for machines, software libraries are handy when their purpose and functionality are understood, but without documentation they can be challenging to use.
Documenting libraries also enables verification methods to check whether the software does what it claims to do. The problem is, very few libraries have been documented due to the intricate and time-intensive process of doing so. When Rajan began his project, only one library in his dataset had been partially documented. That partial documentation had been done through manual effort over the last 23 years.
Using Boa, he documented more than 10 complete libraries in just two years.
“This will have a huge impact on safety-critical systems as developers would be able to rely on these formal documentations that we have created to do safety checks for their own systems,” he said.
Rajan and his team plan to document the top 200 software libraries in one programming language by 2018.
“I think we may be able to do it,” Rajan said. “Even if we are at 150, that will still be quite good.”
Shifting the foundations of research
Data-driven science is shifting the foundations of research. The need to harness massive amounts of data extends to all scientists, but many researchers do not have the specialized computing skills or equipment to use big data. So, together with a team of 12 researchers across Iowa State, Rajan will take steps to solve the problem by creating data-science infrastructures that open the door to big-data analysis. The initial cyber infrastructures will help researchers improve traffic and air safety and better understand how biological systems work and evolve – two strong areas of research at Iowa State that have immediate needs.
Rajan received seed funding for the project in 2016 through ISU’s Presidential Initiative for Interdisciplinary Research, which helps establish research teams to tackle emerging societal challenges.
“In the 21st century, shared data science infrastructures will be as important for data-driven science and engineering in a domain as telescopes are for astronomy,” Rajan said.
Their proposal states, “This project brings together a transdisciplinary team to decrease the barrier to entry for data-driven science for ISU researchers and other data-driven scientists around the world.”
Investing in the next generation
Rajan’s work is highly influential in pushing data research forward. In addition to his numerous projects, he also founded the Midwest Big Data Summer School, a one-week, intensive curriculum hosted at Iowa State to help early career researchers get started in data-driven research. He serves as a Steering Committee member of the Midwest Big Data Hub (MBDH), and in May this year, was named the Kingland Professor of Data Analytics as part of a $1.5 million gift from Kingland to the Colleges of Liberal Arts and Sciences, Business and Engineering.
“When I search [for jobs], more than half of the jobs are related to big data and data science,” Zahra Khoshmanesh (’21 Ph.D. computer science) said. “I think this is the future path of every major.” Khoshmanesh, surrounded by resources at ISU, attended the Big Data Summer School to further prepare for a career in data science.
“There is so much momentum around data-driven research on campus,” Rajan said. “I think every year we are going to see more and more leadership, at both the national and the local level.”
Khoshmanesh has worked with Rajan since the summer on a project about synthetic data that combines statistics with data science. The work has helped jumpstart her career in an emerging area that is exploding with need for skilled data scientists. She said Rajan’s mentorship has been invaluable.
“I recommend this school and Dr. Rajan to every person who wants to know about big data and data science and wants to have a career in that area,” she said.
Flourishing at Iowa State
Rajan graduated from the Indian Institute of Technology in 2000 and came to graduate school in the United States with a plan to go into industry. At the urging of his advisor, he pursued a research project in graduate school and got a taste of how energetic the field can be.
“Following my first conference presentation, people were engaged and interested in what I was doing,” he said. “They asked follow-up questions and that was very, very powerful. In some sense, it emphasized, at least to me, that it was important to be in a job where public disclosure of research is an integral part of it all.”
Research allowed him to investigate areas he would not have been able to otherwise and to explore new ideas in a variety of areas. It also opened up a clear avenue to mentor the up-and-coming scientists in data fields.
“Working with students and motivating them to take on challenges – that was even more powerful,” he said. “For our students, this is an opportunity to learn about these ideas that perhaps five years down the road will be put in a textbook.”
In the yet-to-be-tamed world of data science, Rajan creates a sense of calm with solutions that make a difference, ensuring Iowa State remains a research powerhouse with the tools to push data-driven science forward.
Hridesh Rajan’s work is exceptional in part because it relies on interdisciplinary collaboration. Research breakthroughs are a team effort involving fellow LAS and ISU colleagues, as well as collaborations with esteemed scientists around the world. Key team members include: Gianfranco Ciardo (professor and chair of the Department of Computer Science at Iowa State), Stephen Holland (associate professor of aerospace engineering at Iowa State), Heike Hofmann (professor of statistics at Iowa State), Jim Reecy (professor of animal science and associate vice president for research at Iowa State), Andrew Severin (adjunct assistant professor and manager of the Genome Informatics Facility at Iowa State), Anuj Sharma (associate professor of civil, construction and environmental engineering at Iowa State), Jin Tian (associate professor of computer science at Iowa State), Hoan Nguyen (’16 Ph.D. computer engineering), Robert Dyer (’06 B.S., ’08 M.S., ’13 Ph.D. computer science), Samantha Khairunnesa (’19 Ph.D. computer science), John Singleton (Ph.D. candidate at the University of Central Florida), Gary Leavens (affiliate professor of computer science at Iowa State, and professor and chair of the University of Central Florida’s Department of Computer Science), and Tien N. Nguyen (associate professor of computer science at the University of Texas at Dallas).