From Theory to Practice

Corpus, parsing, and applications

A formalism justifies itself through implementation. eRST comes with substantial infrastructure: a large multilayer corpus, annotation tools, a formally defined parsing task with new evaluation metrics, a baseline system, and concrete applications that go beyond what RST alone supports.

The GUM Corpus

The Georgetown University Multilayer corpus provides the primary testing ground for eRST. GUM is a growing corpus created through a classroom annotation project, where students annotate texts across multiple formalisms over a semester. The result is unusually rich: each document carries morphosyntactic annotations following Universal Dependencies guidelines, nested entity annotations, coreference and bridging anaphora, complete RST trees, and now eRST annotations with signals and secondary edges.

At version 9, GUM encompassed 213 documents across 12 spoken and written genres: interviews, news stories, travel guides, how-to guides, academic papers, biographies, fiction, web forum posts, casual conversations, speeches, vlogs, and textbooks. The corpus totals over 200,000 tokens and 26,000 EDUs, making it the largest English RST corpus—surpassing the RST Discourse Treebank's 21,789 EDUs.

The genre diversity matters. Discourse structure varies across text types: vlogs exhibit different patterns from academic papers, and conversations from news. A framework claiming generality must demonstrate it across this variation.

For cross-corpus comparison, the Zeldes et al. study also annotated the RST-DT test set (38 Wall Street Journal documents) with DMs and secondary edges. Due to licensing restrictions, these annotations are released separately from the underlying text.

Building the Annotations

Since primary RST trees already existed for GUM and RST-DT, eRST annotation proceeded in three phases.

DM identification and alignment began with automatic preprocessing. DisCoDisCo, the winning system from the DISRPT 2021 shared task on connective detection, identified candidate connectives with high recall. A script then associated each predicted connective with the nearest compatible relation in the tree hierarchy, using PDTB's connective definitions and an RST-PDTB relation mapping. Connectives that could not be aligned were flagged as potential orphans.
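
To make the alignment step concrete, here is a minimal sketch of the idea, not the project's actual script: walk up the tree from the EDU containing a detected connective until a relation compatible with it is found. The node structure and the tiny compatibility table are assumptions for illustration.

```python
# Sketch: attach a detected connective to the nearest compatible relation
# by walking up the RST tree from the EDU containing it. Data structures
# and the compatibility table are illustrative, not the project's own code.

# Hypothetical mapping from DM lemmas to relation labels they may signal.
COMPATIBLE = {
    "because": {"causal-cause", "explanation-justify"},
    "but":     {"adversative-contrast", "adversative-concession"},
    "if":      {"contingency-condition"},
}

class Node:
    def __init__(self, relation=None, parent=None):
        self.relation = relation   # label of the relation this node participates in
        self.parent = parent       # None at the root

def align_connective(dm_lemma, edu_node):
    """Return the nearest ancestor relation compatible with the DM, or None (orphan)."""
    allowed = COMPATIBLE.get(dm_lemma, set())
    node = edu_node
    while node is not None:
        if node.relation in allowed:
            return node.relation
        node = node.parent
    return None  # no compatible relation found: flag as a potential orphan

# Toy example: "because" inside an EDU whose parent carries causal-cause.
root = Node()
cause = Node(relation="causal-cause", parent=root)
edu = Node(relation=None, parent=cause)
print(align_connective("because", edu))  # -> "causal-cause"
print(align_connective("but", edu))      # -> None (potential orphan)
```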

Five annotators manually corrected the entire GUM dataset, using rstWeb—an open-source web interface for RST annotation extended to support signal marking and secondary edges. Inter-annotator agreement reached F-scores above 92 for DM identification and above 88 for relation association, indicating reliable annotation.

Secondary edge annotation proved more difficult. An initial agreement experiment showed substantial disagreements, with scores well below primary tree annotation levels. Inspection revealed that disagreements centered on two issues: whether certain items were connectives at all (especially sentence-initial "And" or "So" in spoken genres) and the exact scope of relations (whether to include trailing elements like bibliographical citations in academic text).

After guideline refinement addressing these specific issues, a second experiment produced markedly improved agreement—levels only about 16 points below human agreement on primary relations. Given that secondary edges involve inherently difficult cases (concurrent relations, tree-breaking structures), this represents substantial agreement.

Non-DM signal annotation employed semi-automatic methods leveraging GUM's existing layers. Graphical signals (parentheses, question marks, colons) were tagged automatically from token forms. Reference signals were identified by aligning gold coreference chains with eligible relation types. Syntactic signals (relative clauses, reported speech, imperatives) were detected using dependency tree editing scripts applied to gold syntax annotations.
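
The graphical pass amounts to a lookup over surface token forms; a minimal sketch follows, with an inventory that covers only a fraction of the real guidelines.

```python
# Sketch: tag graphical signal candidates from surface token forms.
# The inventory below is a simplification of the kinds of items mentioned
# in the text (parentheses, question marks, colons), not the full guideline set.

GRAPHICAL = {
    "(": "parentheses", ")": "parentheses",
    "?": "question-mark",
    ":": "colon",
    ";": "semicolon",
    '"': "quotation-marks",
}

def tag_graphical_signals(tokens):
    """Return (token index, signal subtype) pairs for graphical signal candidates."""
    return [(i, GRAPHICAL[tok]) for i, tok in enumerate(tokens) if tok in GRAPHICAL]

tokens = ["The", "result", "(", "see", "Table", "2", ")", "was", "clear", ":"]
print(tag_graphical_signals(tokens))
# -> [(2, 'parentheses'), (6, 'parentheses'), (9, 'colon')]
```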

Lexical signals required more care. The annotation drew on PDTB's AltLex inventory, chi-square-associated terms from corpus statistics, and items noticed during manual review. The approach proved nearly error-free: if an evaluative word like "pretty" appears in an already-annotated EVALUATION relation, it almost certainly signals that relation.
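
The chi-square association mentioned above can be pictured as a contingency test per lemma and relation type; the sketch below uses scipy with invented counts and is not the study's actual procedure.

```python
# Sketch: score how strongly a lemma is associated with a relation type
# using a 2x2 chi-square test. Counts here are invented for illustration.
from scipy.stats import chi2_contingency

def association(lemma_in_rel, lemma_elsewhere, other_in_rel, other_elsewhere):
    """Chi-square statistic and p-value for a lemma/relation contingency table."""
    table = [[lemma_in_rel, lemma_elsewhere],
             [other_in_rel, other_elsewhere]]
    stat, p, _, _ = chi2_contingency(table)
    return stat, p

# e.g. "pretty" inside EVALUATION vs. everywhere else (toy numbers)
stat, p = association(lemma_in_rel=40, lemma_elsewhere=60,
                      other_in_rel=500, other_elsewhere=9400)
print(f"chi2={stat:.1f}, p={p:.2g}")
```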

Semantic lexical chains—related but non-coreferring words like "power" and "influence"—used MIT's ConceptNet (34 million conceptual relations) and stem matching. A script proposed candidate chains, which annotators then verified manually, yielding 1,280 confirmed instances covering about 2,825 tokens.
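
A toy version of the stem-matching half of chain proposal, leaving the ConceptNet lookup aside; the stemmer choice and span format are assumptions.

```python
# Sketch: propose lexical chain candidates across the two spans of a relation
# by stem matching (the ConceptNet lookup used in the actual pipeline is omitted).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def chain_candidates(span_a, span_b):
    """Return token pairs whose stems match across the two spans."""
    stems_b = {stemmer.stem(tok): tok for tok in span_b}
    pairs = []
    for tok in span_a:
        stem = stemmer.stem(tok)
        if stem in stems_b and tok.lower() != stems_b[stem].lower():
            pairs.append((tok, stems_b[stem]))
    return pairs

print(chain_candidates(
    ["the", "president", "expanded", "his", "powers"],
    ["this", "powerful", "influence", "grew"],
))
# -> [('powers', 'powerful')]: a candidate chain for an annotator to confirm
```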

Human-versus-human agreement on all signal types reached F-scores around 80-85, with the automatic system performing comparably. Lexical chain disagreements were most common, followed by indicative words, while syntactic and coreference-based signals proved nearly always correct.

What the Data Reveals

The completed annotations reveal patterns that could not be observed without eRST's expanded representation.

Roughly 13% of discourse markers in GUM are orphans, about one in eight. These are connectives with no corresponding primary relation, indicating relations the tree structure cannot represent. In RST-DT the proportion is higher, at 17%. The proportion of secondary edges, by contrast, is identical in both corpora (3.37%), suggesting this reflects a stable property of English discourse rather than an artifact of genre or annotation practices.

Genre variation is substantial. Secondary edges are most common in vlogs (6.56% of all relations), driven by frequent sentence-initial "And" and "So" in informal spoken narration. They are rarest in how-to guides (1.95%), where procedural discourse tends toward sequential structure that trees represent well.

Academic text presents a counterintuitive finding. Despite common assumptions about academic writing's explicitness, academic papers fall below average in DMs per relation. They compensate, however, with syntactic cues and graphical signals (section headings, formatting), achieving the highest overall signaling rate: 73.2% of relations are marked by some signal. News text comes next at 68.4%, followed by textbooks (66.4%) and how-to guides (66.1%). The overall signaling rate across GUM is 63%—lower than previous estimates for RST-DT, though GUM news approaches those higher figures.

Signal distribution varies dramatically by relation class. ATTRIBUTION is signaled in 99.94% of cases, primarily by speech and cognition verbs like "said," "think," and "know." CONTINGENCY is heavily DM-marked (96.72%), usually by "if." At the other extreme, JOINT relations—including temporal SEQUENCE and LIST—are the least signaled class (34.17%), often inferred from implicit chronological order or parallel structure rather than any overt marker.

EVALUATION relies on open-class lexical items ("good," "very," "important," "remarkable"), with a long-tailed distribution: the top items account for about 14% of tokens, but frequencies drop quickly to single attestations. This contrasts sharply with the closed-class DM inventory, where a handful of items ("and," "but," "because," "if") dominate.

The Parsing Task

eRST introduces a new parsing task with extended evaluation metrics. Standard RST parsing is evaluated using Parseval metrics: Span (correct constituent boundaries), Nuclearity (correct boundaries plus correct nucleus-satellite assignment), Relation (correct boundaries plus correct label), and Full (all three). These metrics apply directly to the primary tree.
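
In rough outline, the four scores are overlap measures over labeled constituents; the sketch below treats them as F1 over sets of (span, nuclearity, relation) triples, which captures the shape of the metrics without reproducing the official evaluation script.

```python
# Sketch: Parseval-style scores as F1 over labeled constituents.
# Each constituent is (span, nuclearity, relation); span is an EDU-index pair.

def f1(gold, pred):
    """Micro F1 between two sets of tuples."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec) if tp else 0.0

def parseval(gold, pred):
    scores = {}
    scores["Span"]       = f1({(s,)   for s, n, r in gold}, {(s,)   for s, n, r in pred})
    scores["Nuclearity"] = f1({(s, n) for s, n, r in gold}, {(s, n) for s, n, r in pred})
    scores["Relation"]   = f1({(s, r) for s, n, r in gold}, {(s, r) for s, n, r in pred})
    scores["Full"]       = f1(set(gold), set(pred))
    return scores

gold = {((1, 2), "NS", "elaboration"), ((1, 4), "NN", "joint")}
pred = {((1, 2), "SN", "elaboration"), ((1, 4), "NN", "joint")}
print(parseval(gold, pred))
# -> Span and Relation perfect, Nuclearity and Full penalized for the flipped nucleus
```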

For signals, new metrics assess whether the correct signal types are predicted for each edge (signal detection) and whether the correct token spans are identified (signal anchoring). Because multiple signals of the same type may apply to the same edge, evaluation requires an optimal pairing procedure between predicted and gold signals. For secondary edges, the four Parseval metrics apply with "nuclearity" replaced by "direction"—secondary edges carry directionality but not prominence.
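
One way to realize that pairing is as an assignment problem: match predicted to gold anchors so that total token overlap is maximal. The sketch below uses scipy's Hungarian-algorithm solver with Jaccard overlap as the score, an illustrative choice rather than the official metric definition.

```python
# Sketch: optimally pair predicted and gold signal anchors on one edge,
# maximizing token overlap (Jaccard). Details differ from the official metric.
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_signals(gold_spans, pred_spans):
    """Return (gold index, pred index, overlap) for the best one-to-one pairing."""
    cost = np.zeros((len(gold_spans), len(pred_spans)))
    for i, g in enumerate(gold_spans):
        for j, p in enumerate(pred_spans):
            cost[i, j] = -jaccard(g, p)          # negate: the solver minimizes
    rows, cols = linear_sum_assignment(cost)
    return [(i, j, -cost[i, j]) for i, j in zip(rows, cols)]

gold = [[3, 4], [10]]          # token indices of gold signal anchors
pred = [[10, 11], [3, 4, 5]]   # token indices of predicted anchors
print(pair_signals(gold, pred))
# -> pairs gold[0] with pred[1] and gold[1] with pred[0]
```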

The Zeldes et al. study developed a baseline system by combining existing state-of-the-art components rather than building an end-to-end architecture. Primary trees use DMRST, a top-down neural parser that remains competitive on RST relation classification. Connective detection uses DisCoDisCo. Morphosyntactic features and coreference use the AMALGUM pipeline, designed to predict the same annotations present in GUM.

A novel component handles DM-to-relation association and secondary edge prediction: an Electra-based transformer classifier that receives two text spans connected by a relation (one containing a marked DM) and predicts whether the DM signals that relation. At test time, candidates are generated for all plausible secondary edges—any primary edge path containing a compatible DM—and ranked by classification probability.
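
Schematically, the candidate-and-rank step might look like the following, with score_pair standing in for the Electra classifier; the data structures, threshold, and dummy scores are assumptions.

```python
# Sketch: generate secondary-edge candidates and keep the best-scoring ones.
# `score_pair` stands in for the Electra-based classifier described above;
# candidate generation, the threshold, and data structures are illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str      # text of the source span
    target: str      # text of the target span (contains the DM)
    dm: str          # the discourse marker licensing the candidate
    relation: str    # hypothesized relation label

def score_pair(candidate):
    """Placeholder for the transformer classifier's probability output."""
    return 0.9 if candidate.dm == "so" else 0.2   # dummy scores for the sketch

def predict_secondary_edges(candidates, threshold=0.5):
    scored = [(score_pair(c), c) for c in candidates]
    return [c for prob, c in sorted(scored, key=lambda x: x[0], reverse=True)
            if prob >= threshold]

cands = [
    Candidate("I woke up late.", "So I skipped breakfast.", "so", "causal-result"),
    Candidate("I woke up late.", "And the weather was nice.", "and", "joint-list"),
]
for c in predict_secondary_edges(cands):
    print(c.relation, "<-", c.dm)   # -> causal-result <- so
```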

Results illuminate the task's difficulty. With gold primary trees and gold NLP preprocessing, secondary edge Span reaches 0.389 and Full 0.184. These numbers may seem low, but consider what correct prediction requires: identifying that a relation exists, confirming no primary edge already expresses it, finding a sufficient trigger, and choosing correct attachment points, direction, and label—all with fewer than 1,000 training examples.

With predicted primary trees, scores collapse. Secondary edge Span drops to 0.101, Full to 0.030. The culprit is cascading errors: if the primary tree is wrong, even correctly identified secondary relations may be penalized (the relation might be primary in the gold data), and missing primary edges may appear to license secondary edges that should not exist.

Signal detection achieves 0.925 overall F-score with gold inputs. Syntactic signals remain robust even with predicted trees (~0.83), because syntactic parsing is reliable and the structures involved (relative clauses, complement clauses) are comparatively easy to identify. Orphan detection is the hardest subtask, since it depends on accurate primary parsing and secondary edge prediction together.

The critical bottleneck is primary tree accuracy. Without reliable trees, all downstream eRST tasks suffer from cascading errors. This finding has implications for system development: effort invested in improving primary parsing yields benefits across the entire eRST pipeline.

What eRST Enables

All existing RST applications remain available. Primary trees support extractive summarization via nucleus traversal, central discourse unit detection, topic segmentation, and targeted relation extraction. Because eRST graphs reduce trivially to RST trees (ignore secondary edges and signals), existing tools and methods remain applicable.
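
That reduction is a pure filtering step; a minimal sketch under an assumed edge representation.

```python
# Sketch: reduce an eRST graph to a plain RST tree by dropping secondary edges
# and signal annotations. The edge dictionaries are an assumed representation.
def to_rst_tree(erst_edges):
    """Keep only primary edges and strip their signal lists."""
    return [
        {k: v for k, v in edge.items() if k != "signals"}
        for edge in erst_edges
        if edge.get("kind") == "primary"
    ]

erst = [
    {"kind": "primary",   "src": 2, "trg": 1, "relation": "elaboration-additional",
     "signals": [("dm", "also")]},
    {"kind": "secondary", "src": 3, "trg": 1, "relation": "causal-result",
     "signals": [("dm", "so")]},
]
print(to_rst_tree(erst))
# -> only the primary elaboration edge, without its signals
```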

But eRST enables capabilities RST alone cannot provide.

Additional relation instances. For some labels, secondary edges constitute a substantial proportion of all instances. CAUSAL-RESULT has 14.1% secondary instances, EXPLANATION-JUSTIFY 11.7%, ADVERSATIVE-CONCESSION 10.3%. Even common relations like ELABORATION-ADDITIONAL have 5.5% secondary instances. A system relying on primary trees alone systematically misses these relations.

Signal-based relation subtypes. Because signals are anchored and typed, analysts can extract all instances of a particular relation-signal combination without defining new labels. The non-conditional explanatory "if" in "I have oregano if you want any" marks a subtype of EXPLANATION-JUSTIFY distinct from conditional "if"—retrievable directly by querying the DM-relation pairing. Similarly: temporal relations signaled by date expressions, elaborations discussing meronyms, contrasts marked by antonyms. The signal taxonomy enables fine-grained subtyping without taxonomy proliferation.
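
Retrieving such a subtype is then a filter over relation records and their anchored DMs; a minimal sketch, with the record format assumed.

```python
# Sketch: retrieve all instances of a relation signaled by a particular DM.
# The record format is an assumption for illustration.
relations = [
    {"label": "explanation-justify", "dms": ["if"],
     "text": "I have oregano if you want any"},
    {"label": "contingency-condition", "dms": ["if"],
     "text": "if it rains, we stay home"},
    {"label": "explanation-justify", "dms": [],
     "text": "take a coat: it is cold outside"},
]

def subtype(relations, label, dm):
    """All instances of `label` explicitly signaled by the DM `dm`."""
    return [r for r in relations if r["label"] == label and dm in r["dms"]]

for r in subtype(relations, "explanation-justify", "if"):
    print(r["text"])   # -> only the non-conditional explanatory "if"
```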

Comprehensive attribution extraction. RST identifies ATTRIBUTION scope; eRST additionally exposes the source (the entity speaking or thinking), the predicate mode (speech verb like "said" versus cognitive verb like "think" versus newspaper-style quotation with no predicate), and polarity (ATTRIBUTION-POSITIVE versus ATTRIBUTION-NEGATIVE for denials). In GUM's multilayer context, sources link to canonical entity identifiers and lemmatized predicates via the aligned coreference and entity linking layers.
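
Assembled from those pieces, an attribution record might look like the following; the field names and the toy entity-linking table are hypothetical stand-ins for GUM's aligned layers.

```python
# Sketch: assemble attribution records from eRST relations plus aligned layers.
# Field names and the toy "entity linking" dictionary are hypothetical.
ENTITY_IDS = {"the spokesperson": "person_17"}   # stand-in for coreference/linking layers

def extract_attributions(relations):
    records = []
    for r in relations:
        if not r["label"].startswith("attribution"):
            continue
        records.append({
            "source": ENTITY_IDS.get(r["source"], r["source"]),
            "predicate": r.get("predicate"),          # e.g. "say" vs. "think" vs. None
            "polarity": "negative" if r["label"].endswith("negative") else "positive",
            "content": r["content"],
        })
    return records

rels = [
    {"label": "attribution-positive", "source": "the spokesperson",
     "predicate": "say", "content": "the merger will go ahead"},
    {"label": "attribution-negative", "source": "she",
     "predicate": "think", "content": "the plan would work"},
]
for rec in extract_attributions(rels):
    print(rec["source"], rec["predicate"], rec["polarity"], "->", rec["content"])
```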

Evaluation content analysis. EVALUATION-COMMENT provides scope; indicative word signals identify which terms convey the evaluation. About 62% of EVALUATION relations have an associated indicative item, with over 200 lemma types represented. The long-tailed distribution—"good" at 90 instances down to hapax legomena—reveals the open-ended nature of evaluative language, quite different from the constrained DM inventory.

Built-in explainability. Signals provide a rationalization mechanism for discourse parses. Downstream applications can filter for only explicitly signaled relations to increase confidence. Analysts can inspect the evidence supporting each parser output. Even when predictions are wrong, signals indicate what the system thought it was detecting. This matters increasingly as NLP systems are deployed in contexts requiring interpretability.

Looking Forward

The eRST infrastructure supports several research directions. The data enables studies of how discourse relations and signals distribute across texts, genres, and domains. Correlations between relation types and signal types can inform both theoretical accounts and practical systems. The extent to which discourse relations are predictable from localizable signals—versus requiring global pragmatic inference—becomes an empirical question with tractable data.

For system development, the baseline numbers indicate that primary tree parsing remains the key bottleneck. Improvements there yield cascading benefits for signal detection and secondary edge prediction. Connective detection, already a high-performance task, provides a strong foundation for DM-related components.

Multilingual extension is an obvious direction. The Georgetown Chinese Discourse Treebank follows the same RST annotation scheme as GUM; many tools and scripts can be adapted with relative ease. Other languages with RST treebanks are candidates for extension.

Large language models present both opportunities and evaluation targets. Zero-shot and few-shot performance on eRST tasks may reveal what levels of discourse awareness LLMs possess. Conversely, LLMs may be used to bootstrap eRST annotations for domains or languages where manual annotation is scarce.

The GUM corpus continues to grow, now encompassing 24 genres with plans for more. Each new genre tests the framework's generality and reveals patterns specific to different text types. eRST is designed to be a living framework, refined through application and community feedback.

The fragmentation that characterized discourse parsing for three decades need not be permanent. Different researchers solved different pieces of the puzzle; eRST assembles those pieces into a coherent picture. The tree structure that made RST useful remains, enriched with the concurrent relations SDRT recognized, the signal anchoring PDTB championed, and a taxonomy broad enough to capture the full range of discourse-marking devices. Whether this synthesis proves durable is for the field to determine. The infrastructure exists; the experiments can begin.