We propose the use of modal dependency parses (MDPs) aligned with syntactic dependency parse trees as an avenue for the novel task of claim extraction. MDPs provide a document-level structure that links linguistic expressions of events to the conceivers responsible for those expressions. By defining the event-conceiver links as claims and using subgraph pattern matching to exploit the complementarity of these modal links and syntactic claim patterns, we outline a method for aggregating and classifying claims, with the potential to supply a novel perspective on large natural language data sets. Abstracting away from the task of claim extraction, we prototype an interpretable information extraction (IE) paradigm over sentence- and document-level parse structures, framing inference as subgraph matching and learning as subgraph mining. Our code is open-sourced at https://github.com/BBN-E/nlp-graph-pattern-matching-and-mining.
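The claim-as-subgraph idea above can be illustrated with a minimal sketch: a toy document graph whose nodes carry modal roles and whose edges carry relation labels, plus the simplest claim-shaped pattern, a conceiver linked to an event by a modal edge. The node and edge schema here is hypothetical, chosen only for illustration, and is not the paper's actual data format.

```python
# Toy document graph: token nodes with modal roles, labeled edges
# (hypothetical schema for illustration, not the paper's format).
nodes = {"reuters": "conceiver", "said": "event", "rose": "event"}
edges = [("reuters", "said", "modal"), ("said", "rose", "ccomp")]

def match_claims(nodes, edges):
    """Find every (conceiver, event) pair joined by a modal edge --
    the simplest claim-shaped subgraph pattern."""
    return [
        (src, tgt)
        for src, tgt, rel in edges
        if rel == "modal"
        and nodes[src] == "conceiver"
        and nodes[tgt] == "event"
    ]

print(match_claims(nodes, edges))  # -> [('reuters', 'said')]
```

Richer patterns (e.g., constraints on the syntactic path between conceiver and event) would extend the predicate inside the comprehension; dedicated subgraph-isomorphism tooling becomes worthwhile once patterns have more than one edge.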
Multistage Collaborative Knowledge Distillation from Large Language Models
Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, and 5 more authors
We study semi-supervised sequence prediction tasks where labeled data are too scarce to effectively finetune a model and, at the same time, few-shot prompting of a large language model (LLM) has suboptimal performance. This happens when a task, such as parsing, is expensive to annotate and also unfamiliar to a pretrained LLM. In this paper, we present a discovery that student models distilled from a prompted LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we propose a new distillation method, multistage collaborative knowledge distillation from an LLM (MCKD), for such tasks. MCKD first prompts an LLM using few-shot in-context learning to produce pseudolabels for unlabeled data. Then, at each stage of distillation, a pair of students is trained on disjoint partitions of the pseudolabeled data. Each student subsequently produces new and improved pseudolabels for the partition it has not seen, which supervise the next round of students. We show the benefit of multistage cross-partition labeling on two constituency parsing tasks. On CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches the performance of supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively.
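The cross-partition loop described above can be sketched as a short control-flow skeleton. The `llm_pseudolabel` and `train_student` stubs below are hypothetical stand-ins for few-shot prompting and student finetuning; only the partitioning and cross-labeling structure reflects the method.

```python
import random

def llm_pseudolabel(x):
    """Stand-in for few-shot prompting of an LLM (hypothetical stub)."""
    return f"label({x})"

def train_student(pairs):
    """Stand-in for finetuning a student on pseudolabeled pairs;
    returns a callable 'model' (hypothetical stub)."""
    memo = dict(pairs)
    return lambda x: memo.get(x, f"label({x})")

def mckd(unlabeled, num_stages=3, seed=0):
    random.seed(seed)
    random.shuffle(unlabeled)
    half = len(unlabeled) // 2
    parts = [unlabeled[:half], unlabeled[half:]]
    # Stage 0: the prompted LLM pseudolabels both partitions.
    labels = [[(x, llm_pseudolabel(x)) for x in p] for p in parts]
    for _ in range(num_stages):
        # Train one student per partition, then cross-label:
        # each student pseudolabels the partition it never saw.
        students = [train_student(labels[0]), train_student(labels[1])]
        labels = [
            [(x, students[1](x)) for x in parts[0]],
            [(x, students[0](x)) for x in parts[1]],
        ]
    # Final student trained on all cross-partition pseudolabels.
    return train_student(labels[0] + labels[1])
```

The key design choice is that a student is always evaluated on data it never trained on, so each stage's pseudolabels come from held-out generalization rather than memorization.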
2021
Graph Convolutional Encoders for Syntax-aware AMR Parsing
Graph Convolutional Networks (GCNs), a natural architecture for modeling graph-structured data, have recently entered the playing field of NLP as sentence encoders over dependency structure. Contemporary setups of semantic role labeling (SRL), neural machine translation (NMT), and event extraction have demonstrated the superiority of GCNs to CNN and RNN encoders, which expect inherently grid-like inputs. In this thesis, we explore GCN encoders in a fully neural paradigm of AMR parsing, taking Cai and Lam (2020)’s state-of-the-art parser as the framework. We hypothesize that GCN encoders are especially well suited for this problem, following the intuition that syntactic structure strongly informs graph-based semantic structure and can be viewed as an intermediate step towards obtaining it from sequential input. Unlike in previous setups, our GCN encoder has to compete with the extremely successful Transformer baseline (the parser’s default encoder), and performs only modestly worse while 1) having an order of magnitude fewer parameters, 2) incorporating explicit syntactic information, and 3) not relying on positional encoding. Our extensive experiments around GCN and Transformer (as well as BiLSTM and GAT) encoder configurations shed light on some of the settings that contribute to the successes of the respective architectures. We confirm that the “syntactic GCN” is the best-performing GCN layer, make empirical observations about Transformers and GCNs based on comparative results and dependency tree statistics, and draw parallels between the Transformer and GCN models in terms of their ability to learn relational structure.
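A minimal sketch of the "syntactic GCN" layer mentioned above (in the style of Marcheggiani and Titov, 2017) illustrates its defining features: separate weight matrices per edge direction (incoming, outgoing, self-loop) and label-specific biases. The toy tree, dimensions, and random parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size

# Toy dependency tree over 3 tokens: token 1 heads 0 (nsubj) and 2 (obj).
edges = [(1, 0, "nsubj"), (1, 2, "obj")]
H = rng.standard_normal((3, d))                            # input token states
W = {k: rng.standard_normal((d, d)) for k in ("in", "out", "self")}
b = {lab: rng.standard_normal(d) for lab in ("nsubj", "obj", "self")}

def syntactic_gcn_layer(H, edges):
    """One syntactic GCN layer: direction-specific weights,
    label-specific biases, ReLU nonlinearity."""
    out = np.zeros_like(H)
    for v in range(len(H)):
        out[v] += H[v] @ W["self"] + b["self"]             # self loop
    for head, dep, lab in edges:
        out[dep] += H[head] @ W["out"] + b[lab]            # head -> dependent
        out[head] += H[dep] @ W["in"] + b[lab]             # dependent -> head
    return np.maximum(out, 0.0)                            # ReLU

H1 = syntactic_gcn_layer(H, edges)
```

Stacking such layers lets information propagate along multi-hop syntactic paths, which is the intuition behind using the dependency tree as scaffolding for semantic graph prediction.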
Timely responses from policy makers to mitigate the impact of the COVID-19 pandemic rely on a comprehensive grasp of events, their causes, and their impacts. These events are reported at such a speed and scale as to be overwhelming. In this paper, we present ExcavatorCovid, a machine reading system that ingests open-source text documents (e.g., news and scientific publications), extracts COVID-19 related events and relations between them, and builds a Temporal and Causal Analysis Graph (TCAG). Excavator will help government agencies alleviate the information overload, understand likely downstream effects of political and economic decisions and events related to the pandemic, and respond in a timely manner to mitigate the impact of COVID-19. We expect the utility of Excavator to outlive the COVID-19 pandemic: analysts and decision makers will be empowered by Excavator to better understand and solve complex problems in the future. A demonstration video is available at https://vimeo.com/528619007.
The speculative clause in Aguaruna presents us with two distinctive and interacting semantic phenomena – evidentiality and focus – both of which have been objects of recent interest cross-linguistically. Following the alternative semantics theory of focus developed by Rooth (1992), I analyze Aguaruna’s alternating speculative focus enclitics, and incorporate the evidentiality-focus complex into a compositional semantics for Aguaruna. By formally modeling the interplay of evidentiality and focus, this analysis hopes to glean a more precise understanding of each phenomenon individually, and to contribute to a more complete typology of both.
We present the first Universal Dependencies treebank for Hittite. This paper expands on earlier efforts at Hittite corpus creation (Molina and Molin, 2016; Molina, 2016) and discussions of annotation guidelines for Hittite within the UD framework (Inglese, 2015; Inglese et al., 2018). We build on the expertise of the above works to create a small corpus which we hope will serve as a stepping-stone to more expansive UD treebanking for Hittite.
We provide system updates and performance analysis regarding the 2020 version of the BBN Panorama multi-modal processing pipeline, as submitted to the 2020 Streaming Media Knowledge Base Population track.