ACL wraps up today. Here’s my rundown of what was most exciting at this year’s conference.
Let me start off by describing my bias: my Ph.D. research was in conversational discourse analysis, focusing on social behaviors in language. I also wrote LightSIDE, the open source tool for feature extraction, machine learning model building, and in-depth error analysis for text data. Since I left Carnegie Mellon, I’ve been focusing on automated writing evaluation, with applications to both essay grading and generation of formative feedback directly to students.
Because of this background, I spent very little time paying attention to machine translation talks. I don’t have the mathematics background to really contribute to discussions of parsing or machine learning optimization papers. I spent most sessions in the talks on social behaviors, dialogue, and some of the more creative fields, like summarization and generation. I also really like off-kilter applications of NLP, and growing fields like digital humanities.
My criteria were that the paper was full of innovative ideas and applicable to real-world problems; I care less about accuracy numbers and pushing the diminishing returns on well-known corpora and tasks. With that being said, here are my top 10 papers from this year’s conference.
1. Translating Italian connectives into Italian Sign Language
Camillo Lugaresi and Barbara Di Eugenio
Barbara’s work here targeted a specific problem in translation from a standard written language (Italian) into Italian Sign Language: most function words, like conjunctions and prepositions, are implicit in sign language. They only rarely need to be gestured at all, and when they are, they’re often embedded into existing signs rather than standing alone.
The best part of this work is that it’s truly end-to-end. They started with an annotated corpus, did a deep linguistic dive into the gritty details of the manual parse trees, developed a theory behind what they saw, and designed intuitive categories around those qualitative findings. Then and only then did they start working on a machine learning problem. The quantitative results are so much stronger as a result; frankly, it’s impressive that they could fit so many different stages of a project into a single conference paper.
The research was also culturally aware in a way that most research isn’t. While many papers on sign language focus on standalone lexical recognition, which is almost useless to natural sign language where words flow and are often altered on the fly, this research focused on a real use case for the Deaf, including the ways that signs are transformed in natural language, the spatial awareness that’s used to connect signs, and the challenge of spoken words being omitted entirely. That level of integration with a real use case is sorely lacking from many ACL papers and this was a model for others.
2. What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain
Annie Louis and Ani Nenkova
This work out of Penn attempted to recognize whether a science journalism article would be evaluated as interesting and exciting based only on the text of that article. They did several things right.
First, they made use of existing documents and built an enticing classification task around cheap and plentiful metadata; I always like seeing people find novel problems to solve where they could have just as easily fallen into a tedious and well-trodden sentiment or topic classification task. Ani Nenkova’s team has always been good at doing this (Regina Barzilay at MIT is also very good at defining tasks like this).
Second, their results really focus on explaining why their models performed well. Good scientific articles use visual imagery at the beginning and end of articles, but not throughout; consistent use of visual words throughout an entire article is actually a net negative. Phrases with high cross-entropy (“plasticky woman”) get picked out in a composite feature. Interview and pure narrative formats bore people to tears. These are all interesting, even if they’re simplifications of the big picture.
Finally, they also admit that unigrams beat out everything clever that they just did. Not enough people in the field are willing to admit that a bag-of-words will get you almost all of the accuracy that you’re ever going to get, if all you care about is the topline number. This results in a field obsessed with squeezing out tenths of percentage points in the optimal best-case scenario, and it’s stale. Here, we see models that are clearly less accurate than the dumb, brute-force approach; they can be added to a unigram baseline to push performance up by a point or two, but that’s not the focus. Instead, we’ve learned about the data in a way that we never could have if the focus had been a table of precision and recall values.
3. Learning Latent Personas of Film Characters
David Bamman, Brendan O’Connor, and Noah A. Smith
This paper again takes a novel corpus and task – looking at movies and trying to use complex graphical modeling to discover common tropes across personas. We see villains, heroes, and romantic leads emerge, along with behaviors of each (villains hatch things far more than other personas). Brendan has a good rundown of the technical details.
Noah’s group at CMU has a history of clever applications of machine learning techniques to real-world text. This is the next work in that lineage and it does a great job of making use of generative models to tell a real story about their data. It’s a rare case of NLP researchers butting into a well-studied but non-technical field (in this case, film criticism) and actually coming out with insightful findings really drawing on the benefits of “big data,” rather than replicating a poor shadow of what a human could have done.
A big problem with big unwieldy graphical models is once again that they’re often an excuse to put heavy machinery to work to get incremental additive results where a trivial baseline would have gotten you 90% of the way there. This work is different; it’s really trying to learn something new about a domain that you simply couldn’t get with vocabulary alone. They fix some of their generative variables and tell us what happens, which topics emerge. While many papers (especially those built around LDA topic modeling) cannot ever hope to explain their word distributions, this model’s output feels natural and intuitive at a glance.
4. The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia
Oliver Ferschke, Iryna Gurevych, and Marc Rittberger
Oliver was interested in predicting the types of flaw templates that warn Wikipedia authors of the deficiency of an article. These templates have catchy names like “peacock” and “weasel” but they represent real problems – getting writers to follow a style guide that doesn’t resemble 5-paragraph essays, creative fiction, or anything else they might be used to.
He ran into a problem, though – topic bias is huge in these prediction tasks. Consider the “In-Universe” flaw, which states that an article is written as if you were a reporter inside of fiction like the Harry Potter world or the universe of Marvel comics. It turns out that this just isn’t much of a problem unless you’re writing about superheroes and science fiction, so those topics are what a baseline model will learn to predict. It’s useless for getting at the fundamental writing flaws that characterize the template.
He solved this problem through time travel. Rather than build a naïve training dataset, he looked at revisions of individual articles. The flaws in an article are going to be most obvious, he posited, at the revision where the template was added by an editor. The flaws will have been fixed most cleanly at the revision date where the warning template was removed.
By making use of this and other information about the articles he was trying to classify, his models were suddenly learning totally different feature weights. All of his performance numbers changed. Suddenly and for the first time, these tasks were being evaluated based on their merits, rather than merely building topic classifiers. This clear view into the actual writing quality tasks he was trying to perform is unprecedented and very exciting to hear about.
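To make the sampling idea concrete, here is a minimal Python sketch of pulling one flawed/fixed pair out of a single article’s revision history. This is not Oliver’s code: the `Revision` class, its fields, and `flaw_revision_pair` are hypothetical names made up for illustration, and the sketch assumes you already know which revisions carry the template.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Revision:
    """One revision of an article (hypothetical structure for illustration)."""
    rev_id: int
    text: str
    has_template: bool  # True if the flaw template is present in this revision


def flaw_revision_pair(history: List[Revision]) -> Optional[Tuple[Revision, Revision]]:
    """Return (flawed, fixed) revisions of the same article, or None.

    flawed: the first revision in which the template appears.
    fixed:  the first later revision in which the template is gone.
    `history` is assumed to be in chronological order.
    """
    flawed = None
    for rev in history:
        if flawed is None:
            if rev.has_template:
                flawed = rev  # the editor just added the warning
        elif not rev.has_template:
            return flawed, rev  # the warning was just removed
    return None  # template never added, or never removed
```

Because both halves of the pair come from the same article, topic is held roughly constant, so a classifier trained on such pairs has to look at the writing rather than the subject matter.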
5. A computational approach to politeness with application to social factors
Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts
All the way back to his earliest work at Cornell, Cristian’s work on social behaviors in real-world data has been stellar. This is the newest chapter of what’s becoming a substantial library of papers on similar topics with similar methods, each of which dives into a new issue. Here, we learn about politeness, with an emphasis on Wikipedia administrators and StackOverflow question answering.
Cristian’s work is a great combination of annotation, prediction, and analysis. He starts by defining a coding scheme and building an annotated corpus (this time with Mechanical Turk). Next he shows that this annotation scheme can be reliably automated. Finally he runs that model over a large pool of hundreds of thousands of previously unlabeled examples, and finds some great trends within that data that tell us about social behaviors.
Each of these could be an interesting paper on its own, and it shows again my bias towards work that proceeds through a pipeline of research, showing how different components of a researcher’s toolbox interact with one another. I feel like I’ve learned not just about a particular algorithm but also about how work gets done in computational linguistics. If you dig into the paper, you’ll also see many examples of actual lines of text from his dataset, which makes all of the aggregate results more intuitive and reasonable.
6. Discriminative state tracking for spoken dialog systems
Angeliki Metallinou, Dan Bohus, and Jason Williams
The team at Microsoft Research manages to make some very innovative advances in what is rapidly becoming an overdone task. This paper is essentially a pointer to the much larger body of work on the Let’s Go dialogue system. That ground has been covered many times over and it’s tough to make an impact with any one paper after the dozens that have come out previously, especially with this year’s upcoming Dialog State Tracking Challenge.
In particular, this paper starts asking some hard questions about the difference between generative and discriminative models in machine learning for dialog. Because the class labels are changing throughout the course of a dialog, generative models are preferred in most work, because they can accommodate a very large number of possible output states. Discriminative models really strain under the standard formulation of state tracking.
This paper takes a new approach to linking features and class values. By collapsing all but the most likely classes into an overall “everything else” class, and doing some linear algebra, this paper shows that you can build a discriminative classifier that can manage an essentially unlimited number of possible states. The math is about as simple as you could possibly hope for and it’s based in an intuitive goal that applied machine learning scientists can grasp.
This is important. Most optimization papers, things that involve changing the actual parameter space and weighting of feature functions of a model, are hopelessly locked away from applied researchers. It’s hard for researchers with poor math skills to even make an informed decision about generative versus discriminative models. We need more papers that give less technical researchers examples of approaches beyond feature engineering, making the whole breadth of machine learning available if they think it’s appropriate for their task.
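For a feel of the “everything else” trick, here is a minimal Python sketch of just the collapsing step. It is an illustration under my own assumptions, not the paper’s formulation: `collapse_to_top_k`, the `REST` label, and the choice to sum the leftover scores into one number are all hypothetical, and the paper’s actual feature construction is not reproduced here.

```python
from typing import Dict, List, Tuple

REST = "<rest>"  # hypothetical label for the pooled "everything else" class


def collapse_to_top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring hypotheses and pool all others into REST.

    However many dialog-state hypotheses come in, exactly k + 1 classes
    come out, so a classifier with a fixed number of outputs can score them.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:k]
    rest_mass = sum(score for _, score in ranked[k:])  # assumption: pool by summing
    return top + [(REST, rest_mass)]
```

The point of the sketch is only the shape of the output: the class set no longer grows with the number of things a caller might have said, which is what makes a discriminative model workable here.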
7. Word Association Profiles and their Use for Automated Scoring of Essays
Beata Beigman Klebanov and Michael Flor
I’m obviously biased – I think automated essay evaluation is a field with a huge amount of room for growth and an exciting breadth of options for where to improve. ETS has historically been the flagship company for publishing research on this topic, and this paper continues that tradition. In particular, their approach has always been about designing features, rather than spending too much time on defining tasks or altering machine learning architecture. Their features are always aimed at intuition and interpretability, rather than topline accuracy improvement, and that’s exciting.
This work starts with a natural goal. Almost every feature that correlates with essay quality also correlates with essay length – long essays are just better, most of the time. The variance explained by those correlated features plummets if you’re also controlling for document length. What needs to happen, then, is that the field needs to discover features that correlate with score in a way that’s orthogonal to the number of words in an essay.
They’ve pushed in that direction here by focusing on pairs of words, looking at how the vocabulary of an essay fits together to form a profile of the text, and particularly focusing on how naturally words fit together in context. They’ve discovered that most of the word pairs in the best essays are either very related – “dog” and “barks” – or extremely unusual to place together (look back at Annie Louis’s work in my #2 paper for another result that says nearly the same thing). Pairs that are only moderately related, neither highly coherent nor highly unusual, make up the only range that really correlates inversely with score.
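Here is a toy Python sketch of what a pairwise profile like this could look like. It assumes association is measured with pointwise mutual information estimated from document-level co-occurrence in a background corpus, and it uses only three bins; both are choices made here for illustration. The function names and the `low`/`high` thresholds are hypothetical, not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple

Pair = Tuple[str, str]


def pmi_from_corpus(docs: Sequence[Sequence[str]]) -> Dict[Pair, float]:
    """PMI (log base 2) for every word pair that co-occurs in some document."""
    n = len(docs)
    word_df: Counter = Counter()
    pair_df: Counter = Counter()
    for doc in docs:
        vocab = sorted(set(doc))  # sorting keeps each pair in one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    # PMI = log2( P(a, b) / (P(a) * P(b)) ) with document-frequency estimates
    return {
        (a, b): math.log2(count * n / (word_df[a] * word_df[b]))
        for (a, b), count in pair_df.items()
    }


def association_profile(essay: Sequence[str], pmi: Dict[Pair, float],
                        low: float = 0.0, high: float = 1.0) -> Dict[str, float]:
    """Share of the essay's known word pairs in the low, mid, and high PMI bins."""
    vocab = sorted(set(essay))
    values = [pmi[p] for p in combinations(vocab, 2) if p in pmi]
    if not values:
        return {"low": 0.0, "mid": 0.0, "high": 0.0}
    total = len(values)
    return {
        "low": sum(v < low for v in values) / total,
        "mid": sum(low <= v < high for v in values) / total,
        "high": sum(v >= high for v in values) / total,
    }
```

In this toy version, the finding described above would show up as good essays putting more of their mass in the “low” and “high” bins and less in “mid”.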
More importantly, this new measure of vocabulary profiling fits the goal I described earlier. It’s an attribute of a text that tells you something about writing quality that gives you new information about the essay, after you’ve accounted for length. That’s exciting in a field where a single feature explains so much of the variance in the final output.
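One common way to check whether a feature says anything beyond length is a partial correlation: correlate the feature with score after removing what length explains in both. The sketch below uses the standard first-order partial correlation formula; it is my illustration of the idea, not necessarily the analysis the authors ran, and the function and argument names are made up.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(feature: Sequence[float], score: Sequence[float],
                 length: Sequence[float]) -> float:
    """Correlation of feature with score, controlling for essay length.

    Uses r_fs.l = (r_fs - r_fl * r_sl) / sqrt((1 - r_fl^2) * (1 - r_sl^2)).
    Undefined (division by zero) if feature or score is perfectly
    correlated with length.
    """
    r_fs = pearson(feature, score)
    r_fl = pearson(feature, length)
    r_sl = pearson(score, length)
    return (r_fs - r_fl * r_sl) / math.sqrt((1 - r_fl ** 2) * (1 - r_sl ** 2))
```

A feature that is mostly a proxy for word count will see its partial correlation shrink toward zero, while a feature that is orthogonal to length keeps its raw correlation with score.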
8. Reconstructing an Indo-European Family Tree from Non-native English Texts
Ryo Nagata and Edward Whittaker
Sometimes a really interesting task is hidden in a dataset that everyone else has been using for something else entirely. This paper illustrates that well in the International Corpus of Learner English, a large corpus that has always been focused on the problem of language learning.
The errors that these non-native English speakers are making have a signature to them; you make different types of errors depending on what language family you’re coming from. Spanish speakers tend to scatter determiners indiscriminately, in places they don’t belong, because “el” and “la” are so much more flexible than “a” and “the” in English. This same problem doesn’t occur with Slavic-language writers, who instead are adapting to English’s lack of cases and its fixed word order.
Ryo’s work asks a fantastic new question: can we do something like linguistic cladistics using this corpus and computational methods? It’s an attempt to use non-native English writing to build up language families, recognizing similarities in the types of mechanical writing errors that modern writers are making, rather than looking at historical documents like general corpus linguistics would. It’s an innovative bunch of work.
There’s also some work on metrics that I find very inspiring. Looking at trigram lists and vocabularies can be numbing; it’s very difficult to find gem features with rich qualitative interpretability. This paper tries something new, measuring the extent to which the ablative removal of a single feature impacts the structure of the agglomerative clusters that come out of their reconstructed family trees. This is a great way to target your attention and focus on what matters, rather than being overwhelmed by a high-dimensional feature space. It’s similar in its goals to what we’ve done with LightSIDE’s error analysis for researchers and I hope to learn from it as we continue to develop our own tools.
9. Offspring from Reproduction Problems: What Replication Failure Teaches Us
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen, and Nuno Freire
It’s common within our field to build abstractly on someone else’s work, whether that means reusing a corpus, extending an algorithm, or mimicking a feature extractor. It’s rarer to try to replicate results exactly before moving on to your own innovations. The researchers on this paper give a pointer as to why: with an 8-page paper, it’s almost impossible to fully specify what you actually did.
Our focus in publications is usually on what’s novel. Little time is spent on details about tokenization, experimental setup and train/test splits, and default assumptions that might be only a small part of your innovative component. This is for good reason (those details pull attention away from your contribution in a short paper), but they’re crucial for full transparency in a field that relies on corpora.
Antske’s paper is an unusual mix of position paper, case study, and call to action for researchers in our field. It draws out a series of parameters to be aware of, issues that are likely to cause headaches for reproducibility, and guidelines for how to help others avoid those issues when you release your own research. To me, it really pushes for an extension of the traditional publication model, and everyone that hopes to see others build off of their own work should read it, reflect on it, and act on its recommendations moving forward.
10. Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms
Elijah Mayfield, David Adamson, and Carolyn Penstein Rosé
I would be remiss if I didn’t mention my own work. My most recent publication with my collaborators at CMU attempts to discover empowered attitudes in support group discussions for people living with cancer. In this work, I tried to emphasize all of the best characteristics of the work I’ve described in the rest of this list.
I created my own classification task (empowerment detection) based on a careful reading of the psychological and sociolinguistic literature. I annotated a real-world corpus around this task, using real data that looks a lot like what future researchers and developers in the field will see when they dive into their own data. We did some novel things with machine learning, but we focused more on how to organize the workflow of our machine learning – defining novel tasks and performing them in a useful way for end-to-end pipelines – rather than trying to push optimization and hyperparameters to the limit of diminishing returns.
That’ll probably be my last conference paper with CMU now that I’ve left to work on my company. I’m immensely proud of the work that I’ve done there and I look forward to hearing more about what comes out of Carolyn’s group without me.
All in all, the field is moving in a very exciting direction. More than at any time since I’ve been coming to ACL, there’s less focus on trivial (but statistically significant) improvements in accuracy. Generative models like LDA are finally coming into their own since they took off a decade ago, and people are starting to actually understand how to interpret their output in a usable way. Novel tasks and new directions, from digital humanities to quantitative social science, are appearing in every direction. With data easier to come by than ever before, it’s a great sign that people are choosing to really dig into it in novel ways, and not retread the tasks of the past.
I can’t wait for next year in Baltimore.
LightSide is a company based in Pittsburgh, PA, USA. We’re building state-of-the-art tools for formative feedback for student writing and making automated essay assessment available to all. We’re part of the Bill & Melinda Gates Foundation’s Literacy Courseware Challenge, and our partners include the College Board and CTB McGraw-Hill.
For the research community, we previously built and still contribute to LightSIDE, a tool from Carnegie Mellon’s Language Technologies Institute. The open source, Java tool provides an intuitive interface for feature extraction, model building, and error analysis with an emphasis on supervised text classification problems. Everyone from social scientists with minimal background in machine learning to state-of-the-art researchers at conferences like ACL can make use of it, either through the point-and-click UI or through the highly modular plugin architecture.
We have an opening for a researcher interested in doing research of the type and caliber that I’ve highlighted here, trying to improve the state of the art in writing education. If you’re in the market for a job in upcoming months, contact me at email@example.com and tell me about your work. I’m looking forward to hearing from you.
If our work excites you and you’d like to keep reading posts like this, follow us on Twitter and keep up with this blog.