What Is the State of Data Science Today?
In recent years, the field of data science has exploded into the mainstream and into our daily lives. Data science now touches nearly every aspect of society and affects how governments, the private sector, the healthcare profession, and many other vital sectors operate.
It’s for that reason that four data science experts—including Jeannette Wing, Columbia executive vice president for research and professor of computer science, and associate professor of applied mathematics and systems biology Chris Wiggins—decided to write Data Science in Context: Foundations, Challenges, Opportunities, a new book on the state of the field published this fall. They worked with co-authors Peter Norvig, a fellow at Stanford's Human-Centered Artificial Intelligence Institute who previously served as director of research and search quality at Google, and Alfred Spector, a visiting scholar at MIT who worked as vice president of research and special initiatives at Google. Columbia News caught up with Wing and Wiggins to discuss how the book got off the ground, and what it aims to say about data science today and in the future.
Where did the idea for this book come from? Did one of the four authors have the original idea and connect with the others?
Jeannette Wing: Alfred Spector, whom I’ve known since our days together at Carnegie Mellon, approached me to co-author a book on data science. After I hesitated, given the time commitment needed to do a good job, he suggested bringing Peter Norvig and Chris into the project, knowing that collectively we have perspectives and strengths that would make it possible to pull this book off in a timely manner.
Chris Wiggins: Jeannette and I had known each other since her arrival at Columbia, and I got to know Alfred thanks to a Data Science Institute event. I had pressed him on a talk he gave about data science in which he pointed out problems but didn’t really suggest solutions. Then, later, I visited him at Two Sigma, a hedge fund where he worked at the time, to ask how he had thought about the ethics of data-empowered algorithms. Not everyone likes hard questions, but Alfred, to his credit, suggested we look into those questions (solutions, including ethics) together, which became part of the fabric of the book.
How would you describe the perspective that you each wanted to contribute to the book, the thing you felt it might not have included without you?
Wing: My academic perspective complemented Alfred’s and Peter’s industry perspective. In running Columbia’s Data Science Institute, I learned that the questions the academic community cares about differ from those that industry cares about.
I had already written journal articles and given many talks about data science. I had been promoting “Data for Good,” a campaign for Columbia’s Data Science Institute, and working on this book gave me a chance to explain and elaborate on its dual meaning: doing good with data and using data in a responsible manner.
I felt strongly that ethics should be part of data science, and I was glad Chris was a co-author, since he and Columbia history professor Matt Jones had been thinking about a data ethics follow-on course to “Data: Past, Present, and Future,” a class that they’ve taught together since 2017. Interestingly, we decided that rather than separate out ethics, say as a chapter, we would try to weave it in throughout the book.
Wiggins: Back in 2015, I spoke at a dinner for Columbia undergraduates hosted by Matt Jones. I had already co-taught a class with him about data science for journalism students, working with Mark Hansen and Cathy O’Neil, but after that dinner, we started thinking about a class on the history of data science. When we finally started teaching, with the help of Columbia’s Collaboratory program, it was clear that students were interested not only in the history but also in the ethics of data, along with how the history relates to our present challenges. So, I was excited to contribute topics on ethics as well as a few points about how data relates to news and journalism. I’ve been helping The New York Times for several years now to build out a data science team, which has also informed my understanding of the gap between what academics and people in industry mean when they say “data science,” a gap the book gave me an opportunity to explore in more detail.
The book points out that the term “data science” only came to be used widely in 2010. What current use of data science could you not have imagined in 2010?
Wing: The most obvious answer is deep neural networks, an artificial intelligence approach to building a computer inspired by modeling the neural connections in the brain. Deep neural networks have a plethora of applications and are having a disruptive and transformative impact on almost every sector. Only in 2012, with the advent of big data and big compute, did the research community and then the private sector see how these networks could “solve” AI tasks such as speech recognition and image classification that had been studied since the 1960s. The breakthrough came about because of enormous amounts of digital data, data used to train deep neural networks.
Wiggins: To this, I’ll add the real pervasiveness of data science across different industries. The job description “data scientist” became prominent at LinkedIn and Facebook in the first decade of the new millennium; William Cleveland of AT&T earlier used the term in a paper in 2001 to propose a new field. But in 2010 it was an aspiration that making sense of data in a way that transforms your business could be possible not just for “big tech” companies like AT&T, Facebook, or LinkedIn, but for a wide variety of companies. It has certainly been transformative at The New York Times. Similarly, a wide variety of academic fields are now transformed by machine learning. In 2010 it was clear that machine learning was having a huge impact in a few branches of natural science, like computational biology, but now almost every academic field has a locus of research activity around how machine learning is opening up new questions and answers!
Your book outlines some of the major promises and perils of data science. If you had to name a single biggest promise of data science, something that isn’t happening yet that you’re most excited about, what would it be?
Wing: The biggest promise of data science is to address societal challenges like health care and climate change. We can use medical images, health records, and genetic data to better predict whether someone will get a disease or even how someone might respond to a specific treatment. We can use machine learning and physics-based simulations to build better climate models. While we are seeing early forays into using AI and data science for these challenges, so much more can be done.
The biggest challenge is addressing the issue of fairness. For example, an individual judge may rule differently depending on the time of day and different judges may rule differently depending on their own biases. Using automated tools, one hopes to smooth out those differences in judgment. However, current AI techniques, such as deep neural networks, rely on large amounts of data to build such an automated decision system. If historical data is used to produce this system, then it will capture and reflect the same biased human judgments of the past. What we’ve discovered is that it is difficult technically and philosophically to build “fair” systems.
I am currently advocating a research agenda called “Trustworthy AI,” which is a call to arms for three computer science communities (the AI community, the cybersecurity community, and the formal methods community) to work together to address both the promise and perils of AI.
What are you each teaching this year at Columbia?
Wing: In spring 2019 I taught a graduate-level course on privacy-preserving technologies. Based on my work at Microsoft, I wanted our students to know that there exist industry-strength point solutions to point problems in privacy. These scalable computational solutions draw on hardware, cryptography, statistics, and mathematics. These ideas made it into chapter 10 of our book.
Wiggins: In the fall I teach the capstone course for applied mathematics majors, working with groups of students to do original research on topics of their own interest, and to present to their peers. Over the decades I’ve taught this class, more and more projects have been around data, machine learning, and the impact of data. This term we had presentations on gerrymandering and mathematical modeling of migration, for example. Students are able to do analyses they couldn’t have done years ago, with great open-source machine learning methods; what’s more, students are far more aware of the ethical consequences of these methods. It’s continually a class in which the students teach me the future.
In the spring, professor Matt Jones and I will teach our “Data: Past, Present, and Future” course again. Developing this class has really opened my eyes to a historical appreciation for data, and how our world came to be shaped by data and data-empowered algorithms. One lesson here is that the future is in our hands, with no fate but what we make. In class, we discuss that future as an unstable three-player game among corporations, governments, and the individuals who provide the data and talent to these corporations. I’m optimistic about how our students, technologists and humanists alike, are so engaged with understanding data and our role in shaping data’s future.