DELPH-IN

Computational linguists from research sites world-wide have joined forces in a collaborative effort aimed at ‘deep’ linguistic processing of human language. The goal is the combination of linguistic and statistical processing methods for getting at the meaning of texts and utterances. The partners have adopted Head-Driven Phrase Structure Grammar (HPSG) and Minimal Recursion Semantics (MRS), two advanced models of formal linguistic analysis. They have also committed themselves to a shared format for grammatical representation and to a rigid scheme of evaluation, as well as to the general use of open-source licensing and transparency.

DeepBank

The DeepBank project has the goal of annotating the one million words of 1989 Wall Street Journal text (the same set of sentences annotated in the original Penn Treebank project) with the English Resource Grammar, augmented with a robust approximating PCFG for complete coverage. DeepBank contains rich linguistic annotation on both syntactic and semantic structures of the sentences and is available in a variety of representation formats (see the description on formats below).

The project is hosted at the Department of Computational Linguistics of Saarland University and the Language Technology Lab of the German Research Center for Artificial Intelligence in Saarbrücken, Germany, and is carried out in close collaboration with CSLI Stanford. Other institutes, including (but not limited to) Humboldt University of Berlin and the University of Oslo, have also contributed to the development and release of the resource. In the long term, DeepBank will be further supported by the DELPH-IN community with updates and maintenance.

Redwoods

The LinGO Redwoods Treebank is a collection of hand-annotated corpora analysed with the LinGO ERG. For each utterance from a corpus, the treebank records (in principle) all analyses hypothesized by the grammar, together with an annotator decision as to which reading is preferred in context.
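
As a rough illustration of the record described above, the following Python sketch models one treebank item as a list of candidate analyses plus an annotator's choice. The class and field names (`Analysis`, `TreebankItem`, `preferred_id`) and the sample readings are hypothetical, chosen only for this sketch; they do not reflect the actual Redwoods storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Analysis:
    """One reading proposed by the grammar (hypothetical, simplified)."""
    analysis_id: int
    derivation: str  # placeholder for the full syntacto-semantic analysis


@dataclass
class TreebankItem:
    """An utterance, all of its candidate analyses, and the annotator's choice."""
    utterance: str
    analyses: List[Analysis] = field(default_factory=list)
    preferred_id: Optional[int] = None  # None: no reading selected (yet)

    def preferred(self) -> Optional[Analysis]:
        """Return the analysis the annotator selected, if any."""
        for analysis in self.analyses:
            if analysis.analysis_id == self.preferred_id:
                return analysis
        return None

    def dispreferred(self) -> List[Analysis]:
        """Return the readings that were recorded but not selected."""
        return [a for a in self.analyses if a.analysis_id != self.preferred_id]


# Illustrative data only: an ambiguous utterance with two recorded readings.
item = TreebankItem(
    utterance="I saw her duck",
    analyses=[
        Analysis(0, "reading with 'duck' as a noun"),
        Analysis(1, "reading with 'duck' as a verb"),
    ],
    preferred_id=1,
)
```

Keeping the unselected readings next to the selected one is what lets the record serve both as a gold-standard annotation and as a set of negative examples.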

The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally available broad-coverage grammar of English, viz. the LinGO English Resource Grammar. In contrast to existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank (and, consequently, less dissemination effort is needed to establish this representation). Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG; tools are provided to extract many different types of linguistic information at varying granularity.

Other relevant aspects of the Redwoods Treebank include the integration of alternate, though dispreferred, analyses for each utterance and the dynamic nature of the annotations: as the underlying grammar evolves and improves its analyses, there is a provision for a (nearly) fully automated update of the treebank against a version of the original corpus analysed with the revised grammar. As a methodological result, part of the Redwoods data is now regularly maintained as part of the grammar regression cycle with each new release of the ERG.
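
The text above does not say how the update works internally, so the following Python sketch is only a deliberately simplified illustration of the idea. It assumes that analyses can be compared by exact match and that any utterance whose earlier preferred reading no longer appears is handed back to an annotator. The function name `update_treebank` and the string-valued analyses are hypothetical and are not part of any Redwoods tool.

```python
from typing import Dict, List, Optional, Tuple


def update_treebank(
    old_choices: Dict[str, Optional[str]],
    new_analyses: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Carry annotator choices over to a re-parsed corpus (simplified).

    old_choices maps each utterance to the analysis previously preferred
    (or None); new_analyses maps each utterance to the analyses produced
    by the revised grammar. Analyses are plain strings here for simplicity.
    """
    carried_over: Dict[str, str] = {}
    needs_review: List[str] = []
    for utterance, candidates in new_analyses.items():
        previous = old_choices.get(utterance)
        if previous is not None and previous in candidates:
            # The old preferred reading survives in the new parse results.
            carried_over[utterance] = previous
        else:
            # Reading changed, vanished, or never existed: leave to a human.
            needs_review.append(utterance)
    return carried_over, needs_review


# Illustrative data only.
old = {"u1": "tree-A", "u2": "tree-B"}
new = {"u1": ["tree-A", "tree-C"], "u2": ["tree-D"], "u3": ["tree-E"]}
carried, review = update_treebank(old, new)
```

In this toy run, `u1` keeps its earlier choice automatically, while `u2` and `u3` are queued for manual inspection, which mirrors the "(nearly) fully automated" qualification in the text.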

Sources: http://moin.delph-in.net, http://moin.delph-in.net/DeepBank, http://moin.delph-in.net/RedwoodsTop