Can Texting a Friend Get More People to Vote? This Data Scientist Wants to Find Out.
The field of data science is still emerging, and Aaron Schein, a postdoc as comfortable coding as he is talking politics, is helping to define it. Schein was part of the Data Science Institute’s inaugural class of fellows. This summer he joins Columbia’s counterpart at the University of Chicago.
The son of linguistics professors (his mom at MIT, his dad at USC), Schein initially studied political science and linguistics as an undergraduate. He planned to go on for a PhD in Near East Studies until a project involving the analysis of Persian-language social media at scale turned him on to statistics and machine learning. He switched focus to computer science, and after finishing his PhD at the University of Massachusetts, Amherst, landed at Columbia.
Columbia News caught up with Schein as he was getting ready to leave New York City for a tenure-track position in statistics at the University of Chicago.
How has your experience at Columbia’s Data Science Institute prepared you for Chicago’s?
At the Institute, I’ve been able to lead collaborations between social/political scientists and statisticians/computer scientists and to learn how to do truly interdisciplinary research. I’ve also helped organize interdisciplinary programs, like the distinguished lecture series. At UChicago, I’ll look to collaborate with the many amazing social and political scientists on campus in bringing modern data sets to bear on questions of policy importance. I’m also eager to help build up their institute and define data science as an emerging discipline.
What made you switch from studying foreign policy, linguistics, and even Farsi, to data science?
My first exposure to data science was as an undergrad at UMass when I audited Hanna Wallach’s computational social science seminar. It was a thrill, but I lacked the formal training in machine learning to run with it. I then interned at a federal research lab studying public opinion in Iran via Persian-language blogs. The goal was to provide U.S. policymakers with nuanced views of political thought in Iran at a time when there was still hope for a softening of U.S.-Iran relations. I worked on natural language processing methods to characterize the sentiment and topics in these blogs. This got me excited about computer science and statistics, and set me on the path to eventually do a PhD in computer science, which I did back at UMass with Hanna Wallach.
Advice for others struggling to find their academic focus?
I'm probably the wrong person to ask. I was never good at learning what I wasn't already curious about. (I'm still not.) I got lucky that I ended up in a hot field, with jobs. I'd love to say that following your curiosity is a good strategy, but I think that would just be propagating survivorship bias. I will plug data science and statistics, though. As a profession, it lets you move around. I focus on political science, but I’ve collaborated with geneticists, economists, and neuroscientists. It’s hard to be bored as a methodologist!
Is texting a friend as effective as knocking on doors to get out the vote? Are there any caveats?
This is the question I’ve been asking with David Blei, Donald Green, and others. We’ve been conducting large-scale randomized field experiments on Outvote, an app that lets Americans text their friends reminders to vote. We found that Outvote users had about an eight-percentage-point effect on getting friends to vote in the 2018 Midterms. That’s large relative to what has been measured for door-to-door canvassing, phone-banking, and other get-out-the-vote actions. But we ran another experiment during the 2020 presidential election and found much weaker effects. That wasn’t unexpected, since nudges are typically less effective during presidential elections, but we're waiting for the 2022 Midterms to see if we can replicate our 2018 results. Stay tuned!
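For readers curious what an effect measured in percentage points means in a randomized experiment, here is a minimal sketch. It is not the study's actual estimator or data: the function name `turnout_effect` and all counts are hypothetical, and the example assumes a simple comparison of turnout between friends who were randomly assigned to receive a text and friends who were not.

```python
from math import sqrt


def turnout_effect(voted_treated, n_treated, voted_control, n_control):
    """Difference in turnout rates between two randomized groups.

    Returns the estimate and a rough 95% interval, all in percentage
    points, using the normal approximation for a difference of proportions.
    """
    p_treated = voted_treated / n_treated
    p_control = voted_control / n_control
    diff = p_treated - p_control
    # Standard error of a difference between two independent proportions.
    se = sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )
    return 100 * diff, 100 * (diff - 1.96 * se), 100 * (diff + 1.96 * se)


# Made-up counts for illustration only; these are not results from the study.
effect, low, high = turnout_effect(560, 1000, 500, 1000)
print(f"{effect:.1f} points (95% interval {low:.1f} to {high:.1f})")
```

A percentage-point effect is the raw gap between the two turnout rates, not a relative change, which is why it can be compared directly across canvassing, phone-banking, and texting studies.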
Any predictions about the upcoming Midterms?
I have predictions, but they probably aren't any better than yours. Text your friends to remind them to register and to vote.
You’ve co-led a popular workshop at NeurIPS, the world’s top machine-learning conference, on beautiful ideas that don’t work. Why?
Machine learning research these days is increasingly metricized and competitive. Researchers are incentivized to produce new methods that beat baselines, rather than understand fundamental principles or develop new approaches to problems. These workshops are meant to promote negative results, highlight gaps between theory and practice, and solicit "beautiful" ideas that don't necessarily "work" (yet).
You’re the first postdoc I’ve met with a probability distribution named after you. What is the “Schein” distribution?
We introduced it in a 2019 paper at NeurIPS, Poisson-Randomized Gamma Dynamical Systems. We called it the "shifted confluent hypergeometric (SCH) distribution" because it's a variation on a previously known distribution. Characterizing this distribution was one ingredient in an algorithm for fitting a model of country-to-country interaction data. Such data are composed of micro-records of the form “country i took action a to country j at time t,” and there are millions of such events. Ours was a time-series model that could characterize uncertainty about unobserved or future events.
One of my co-authors, Scott Linderman, co-wrote a follow-up paper building a model for neuroscience data that used similar methodology, and they renamed it the "Schein distribution." My mom has a printout of the paper posted on her office door.
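To make the data format concrete, records of the form "country i took action a to country j at time t" can be tallied into counts indexed by sender, action, receiver, and time step. The sketch below only illustrates that bookkeeping; it is not the model from the paper. The country labels, action names, and the helper `count_events` are all hypothetical.

```python
from collections import Counter


def count_events(events):
    """Tally (sender, action, receiver, time) records into sparse counts.

    Each key is one cell of a four-way count array; cells that never
    occur are simply absent and read back as zero.
    """
    return Counter(events)


# Hypothetical records for illustration: (sender, action, receiver, time step).
events = [
    ("A", "consult", "B", 0),
    ("A", "consult", "B", 0),
    ("B", "threaten", "C", 1),
    ("A", "aid", "C", 1),
]

counts = count_events(events)
print(counts[("A", "consult", "B", 0)])  # how often A consulted B at step 0
print(counts[("C", "aid", "A", 5)])      # a cell with no recorded events
```

With millions of events, most sender-action-receiver-time cells are empty, so a sparse tally like this is usually a more practical starting point than a dense array.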
Does this happen to data scientists often?
Having a distribution named for them? No! It’s rare and awesome!
You were a political organizer growing up in Brookline, Mass. Has that shaped your work?
Not really, but it influenced my life at Columbia in one important way. I recently learned that a long-lost friend and co-campaigner for John Kerry in 2004 now owns the Hungarian Pastry Shop. We reconnected one day when I was ordering pastries.
What should everyone know about data science?
It's popular these days to wonder which methods will become obsolete once we collect enough data. But my hot take is that there's no such thing as “Big Data.” It depends on your questions; if your data are "big" for the questions you're asking, then maybe you should ask bigger questions! We will always need theory, domain knowledge, and tailored methodology to answer those big questions.
How is working with social scientists different from working with physical scientists?
My sense is that social scientists emphasize theory to guide their empirical work and are generally less willing to embrace a purely inductive approach to science. That makes sense, given the traditional sparsity of social scientific data. But those data are increasingly rich, so this may change. For now, I think computational social science means engaging with theory.