University of Southern Denmark
World of VISL -> World of VISL  Visual Interactive Syntax Learning  
Syddansk Universitet
 
 


Treebanks

VISL produces two kinds of treebanks, (a) small, pedagogically structured teaching treebanks with selected sentences, and (b) large treebanks over running text, or treebanks with a randomized inventory of real corpus sentences. In all, 22 teaching languages and 5 research languages are covered. While teaching treebanks will be largely hand-made, research corpora are built with automatic parsers, usually using a hybrid 2-stage system involving (1) Constraint Grammar analysis and (2) PSG or dependency tree transformation. The resulting analyses are then manually revised by VISL linguists or students. Experimentally, so-called "jungle" treebanks are built, providing extensive data, but without revision.

Research Treebanks and search interfaces
  • Arboretum (Danish treebanks)
  • Floresta sintá[c]tica (Portuguese treebanks)
  • L'Arboratoire (French treebanks)
  • Arborest (Estonian treebanks)
  • Corpuseye (search interface overview)
  • Corpus inventory (list of corpora, sources and sizes)

    VISL treebank manuals
  • Cafeteria (VISL Category labelling principles)
  • Treebank formalism (Treebank design principles)
  • Category definitions(Danish examples)
  • Morphological categories (Terms and abbreviations)
  • Semantic prototype ontology (Definitions and abbreviations)
  • Semantic roles (Manual)
  • Teaching treebanks and java visualiser
    Treebank formats Treebank tools

    The graphical format: VISL's java-based tree-visualiser will represent syntactic trees in an interactive interface, allowing both step-by-step inspection and manipulation ("rebuilding" and "retagging") of the tree. Each constituent is represented as a node containing both form and function information (e.g. Od:pron = a direct object, which is a pronoun). Trees can be manipulated in 3 ways:

  • inspection: The tree is shown in its entirety, or layered top-down by clicking on the node to be expanded.
  • tree-building: Words and non-terminals can be moved with the mous-pointer (drag-&-drop), "mounting" dependents onto heads. E.g. 'the' and 'little' onto the head 'pige''girl', in the little girl). Alternatively, a mother-node can be assembled by clicking all its daughters and using the combine node-function.
  • labelling: Words and non-terminal nodes in the finished tree can be labelled with category symbols from the category bar (word class, syntactic function, phrase and clause type).

    In evaluation-mode the java-program will keep count with your error rate, as well as provide hints and explanations along the way.


    The source format: Internally, for format filtering, searching and manual revision, trees are stored in the VISL-format, as "horizontal" trees with a separate line for each terminal or non-terminal node, with indentation marking depth. A constituent's daughters will thus be listed below the mother node, with an indentation level increased by 1. Cf. the example from "The World of Sophie" below:

    STA:fcl
    fA:fcl
    =SUB:conj-s("når") Når
    =S:np
    ==DN:prop("Sofie" GEN) Sofies
    ==H:n("mor" UTR S IDF NOM) mor
    =P:v-fin("være" IMPF AKT) var
    =Cs:adjp
    ==H:adj("sur" UTR S IDF NOM) sur
    ==DA:pp
    ===H:prp("over") over
    ===DP:pron-indef("en_eller_anden" NEU S NOM) et_eller_andet
    P:v-fin("ske" IMPF AKT) skete
    Sf:pron-pers("den" NEU 3S NOM) det
    S:fcl
    =SUB:conj-s("at") at
    =S:pron-pers("hun" UTR 3S NOM) hun
    =P:v-fin("kalde" IMPF AKT) kaldte
    =Od:np
    ==DN:pron-poss("de" 3P GEN) deres
    ==H:n("hus" NEU S IDF NOM) hus
    =Co:pp
    ==H:prp("for") for
    ==DP:np
    ===DN:art("en" NEU S IDF) et
    ===DN:adj("dårlig" COM nG nN nD NOM) værre
    ===H:n("menageri" NEU S IDF NOM) menageri


    VISL constituent trees are built from Constraint Grammar parser's flat dependency output using a function-based PSG and VISL's open source psg-compiler. Treebank revision is performed first at CG-level, and again after tree-generation, drawing robustness from the CG-system and depth from the PSG-grammar.

    VISL dependency trees are constructed directly from word based CG input using structural transformation filters based on Prolog (S. Harder) or Perl (E. Bick). In source annotation, the result is ordinary CG enriched with token and head id's. =4:2 or #4->2 on a tag line means that the token in question (number 4) attaches to head token number 2. There are two modules that can be used to add dependency numbering links to CG input.

    • the Depsplicator), implemented in Prolog by Søren Harder. This program, working with an internal, so-called word/9 structure, also calculates relative dependency link numbers (e.g. +2 for a head word 2 positions to the right), and outputs parameter lists for a choice of graphical formats (VISL dependency and DTAG, shown below).
    • cg2dep, implemented - along with the TIGER and MALT filters - as a Perl Grammar by Eckhard Bick.

    <s_id="sofie-da43">

    Når [når] KS @SUB #1->4
    Sofies [Sofie] PROP GEN @>N #2->3
    mor [mor] N UTR S IDF NOM @SUBJ> #3->4
    var [være] V IMPF AKT @FS-ADVL> #4->9
    sur [sur] ADJ UTR S IDF NOM @4
    over [over] PRP @A< #6->5
    et=eller=andet [en=eller=anden] DET NEU S NOM @P< #7->6
    $, #8->0
    skete [ske] V IMPF AKT @FS-STA #9->0
    det [den] PERS NEU 3S NOM @F-9
    at [at] KS @SUB #11->13
    hun [hun] PERS UTR 3S NOM @SUBJ> #12->13
    kaldte [kalde] V IMPF AKT @FS-9
    deres [de] PERS 3P GEN @>N #14->15
    hus [hus] N NEU S IDF NOM @13
    for [for] PRP @13
    et [en] ART NEU S IDF @>N #17->19
    værre [dårlig] ADJ COM nG nN nD NOM @>N #18->19
    menageri [menageri] N NEU S IDF NOM @P< #19->16
    $. #20->0
    </s>


    VISL dependency trees:

    DTAG export format:

    TIGER exchange format: This is the treebank exchange format agreed upon by the Nordic Treebank Network, allowing free data exchange and the use of tools developed by the international TIGER project community. VISL constituent trees can be filtered into TIGER constituent format using the program visl2tiger.pl. In TIGER format, edge labels contain the original syntactic function tags, and the (non-teminal) cat category contains phrase and clause forms (graphical example).

    <s id="s43" ref="sofie-da43" source="Sofie-da" forest="5/7" text="Når Sofies mor var sur over et eller andet, skete det at hun kaldte deres hus for et værre menageri. ">
      <graph root="s43_500">
        <terminals>
          <t id="s43_1" word="Når" lemma="når" pos="conj-s" morph="--" extra="--"/>
          <t id="s43_2" word="Sofies" lemma="Sofie" pos="prop" morph="GEN" extra="hum"/>
          <t id="s43_3" word="mor" lemma="mor" pos="n" morph="UTR S IDF NOM" extra="--"/>
          <t id="s43_4" word="var" lemma="være" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_5" word="sur" lemma="sur" pos="adj" morph="UTR S IDF NOM" extra="--"/>
          <t id="s43_6" word="over" lemma="over" pos="prp" morph="--" extra="--"/>
          <t id="s43_7" word="et_eller_andet" lemma="en_eller_anden" pos="pron-indef" morph="NEU S NOM" extra="--"/>
          <t id="s43_8" word="skete" lemma="ske" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_9" word="det" lemma="den" pos="pron-pers" morph="NEU 3S NOM" extra="--"/>
          <t id="s43_10" word="at" lemma="at" pos="conj-s" morph="--" extra="--"/>
          <t id="s43_11" word="hun" lemma="hun" pos="pron-pers" morph="UTR 3S NOM" extra="--"/>
          <t id="s43_12" word="kaldte" lemma="kalde" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_13" word="deres" lemma="de" pos="pron-poss" morph="--" extra="--"/>
          <t id="s43_14" word="hus" lemma="hus" pos="n" morph="NEU S IDF NOM" extra="--"/>
          <t id="s43_15" word="for" lemma="for" pos="prp" morph="--" extra="--"/>
          <t id="s43_16" word="et" lemma="en" pos="art" morph="NEU S IDF" extra="--"/>
          <t id="s43_17" word="værre" lemma="dårlig" pos="adj" morph="COM nG nN nD NOM" extra="--"/>
          <t id="s43_18" word="menageri" lemma="menageri" pos="n" morph="NEU S IDF NOM" extra="--"/>
        </terminals>

        <nonterminals>
          <nt id="s43_500" cat="s">
            <edge label="STA" idref="s43_501"/>
          </nt>
          <nt id="s43_501" cat="fcl">
            <edge label="fA" idref="s43_502"/>
            <edge label="P" idref="s43_8"/>
            <edge label="Sf" idref="s43_9"/>
            <edge label="S" idref="s43_506"/>
          </nt>
          <nt id="s43_502" cat="fcl">
            <edge label="SUB" idref="s43_1"/>
            <edge label="S" idref="s43_503"/>
            <edge label="P" idref="s43_4"/>
            <edge label="Cs" idref="s43_504"/>
          </nt>
          <nt id="s43_503" cat="np">
            <edge label="DN" idref="s43_2"/>
            <edge label="H" idref="s43_3"/>
          </nt>
          <nt id="s43_504" cat="adjp">
            <edge label="H" idref="s43_5"/>
            <edge label="DA" idref="s43_505"/>
          </nt>
          <nt id="s43_505" cat="pp">
            <edge label="H" idref="s43_6"/>
            <edge label="DP" idref="s43_7"/>
          </nt>
          <nt id="s43_506" cat="fcl">
            <edge label="SUB" idref="s43_10"/>
            <edge label="S" idref="s43_11"/>
            <edge label="P" idref="s43_12"/>
            <edge label="Od" idref="s43_507"/>
            <edge label="Co" idref="s43_508"/>
          </nt>
          <nt id="s43_507" cat="np">
            <edge label="DN" idref="s43_13"/>
            <edge label="H" idref="s43_14"/>
          </nt>
          <nt id="s43_508" cat="pp">
            <edge label="H" idref="s43_15"/>
            <edge label="DP" idref="s43_509"/>
          </nt>
          <nt id="s43_509" cat="np">
            <edge label="DN" idref="s43_16"/>
            <edge label="DN" idref="s43_17"/>
            <edge label="H" idref="s43_18"/>
          </nt>
        </nonterminals>
      </graph>
    </s>

    TIGER tree example

    TIGER dependency format: This format is derived from TIGER constituent trees using a special Perl program, called tiger2dep.pl. In this format, word-terminals are "identified" with their dependency node by using the empty edge label '--'.

    <s id="s43" ref="sofie-da43" source="Sofie-da" forest="5/7" text="Når Sofies mor var sur over et eller andet, skete det at hun kaldte deres hus for et værre menageri. ">
      <graph root="s43_500">
        <terminals>
          <t id="s43_1" word="Når" lemma="når" pos="conj-s" morph="--" extra="--"/>
          <t id="s43_2" word="Sofies" lemma="Sofie" pos="prop" morph="GEN" extra="hum"/>
          <t id="s43_3" word="mor" lemma="mor" pos="n" morph="UTR S IDF NOM" extra="--"/>
          <t id="s43_4" word="var" lemma="være" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_5" word="sur" lemma="sur" pos="adj" morph="UTR S IDF NOM" extra="--"/>
          <t id="s43_6" word="over" lemma="over" pos="prp" morph="--" extra="--"/>
          <t id="s43_7" word="et_eller_andet" lemma="en_eller_anden" pos="pron-indef" morph="NEU S NOM" extra="--"/>
          <t id="s43_8" word="skete" lemma="ske" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_9" word="det" lemma="den" pos="pron-pers" morph="NEU 3S NOM" extra="--"/>
          <t id="s43_10" word="at" lemma="at" pos="conj-s" morph="--" extra="--"/>
          <t id="s43_11" word="hun" lemma="hun" pos="pron-pers" morph="UTR 3S NOM" extra="--"/>
          <t id="s43_12" word="kaldte" lemma="kalde" pos="v-fin" morph="IMPF AKT" extra="--"/>
          <t id="s43_13" word="deres" lemma="de" pos="pron-poss" morph="--" extra="--"/>
          <t id="s43_14" word="hus" lemma="hus" pos="n" morph="NEU S IDF NOM" extra="--"/>
          <t id="s43_15" word="for" lemma="for" pos="prp" morph="--" extra="--"/>
          <t id="s43_16" word="et" lemma="en" pos="art" morph="NEU S IDF" extra="--"/>
          <t id="s43_17" word="værre" lemma="dårlig" pos="adj" morph="COM nG nN nD NOM" extra="--"/>
          <t id="s43_18" word="menageri" lemma="menageri" pos="n" morph="NEU S IDF NOM" extra="--"/>
        </terminals>

        <nonterminals>
          <nt id="s43_500" cat="s">
            <edge label="STA" idref="s43_501"/>
          </nt>
          <nt id="s43_501" cat="v-fin">
            <edge label="fA" idref="s43_502"/>
            <edge label="--" idref="s43_8"/>
            <edge label="Sf" idref="s43_9"/>
            <edge label="S" idref="s43_506"/>
          </nt>
          <nt id="s43_502" cat="v-fin">
            <edge label="SUB" idref="s43_1"/>
            <edge label="S" idref="s43_503"/>
            <edge label="--" idref="s43_4"/>
            <edge label="Cs" idref="s43_504"/>
          </nt>
          <nt id="s43_503" cat="n">
            <edge label="DN" idref="s43_2"/>
            <edge label="--" idref="s43_3"/>
          </nt>
          <nt id="s43_504" cat="adj">
            <edge label="--" idref="s43_5"/>
            <edge label="DA" idref="s43_505"/>
          </nt>
          <nt id="s43_505" cat="prp">
            <edge label="--" idref="s43_6"/>
            <edge label="DP" idref="s43_7"/>
          </nt>
          <nt id="s43_506" cat="v-fin">
            <edge label="SUB" idref="s43_10"/>
            <edge label="S" idref="s43_11"/>
            <edge label="--" idref="s43_12"/>
            <edge label="Od" idref="s43_507"/>
            <edge label="Co" idref="s43_508"/>
          </nt>
          <nt id="s43_507" cat="n">
            <edge label="DN" idref="s43_13"/>
            <edge label="--" idref="s43_14"/>
          </nt>
          <nt id="s43_508" cat="prp">
            <edge label="--" idref="s43_15"/>
            <edge label="DP" idref="s43_509"/>
          </nt>
          <nt id="s43_509" cat="n">
            <edge label="DN" idref="s43_16"/>
            <edge label="DN" idref="s43_17"/>
            <edge label="--" idref="s43_18"/>
          </nt>
        </nonterminals>
      </graph>
    </s>


    MALT dependency format: This format was developed by Joakim Nivre at Växjö University. For evaluation purposes and compatibility, VISL data can be transformed into MALT, using either visldep2malt (from CG dependency format) or visltiger2malt (from VISL-tree format).

    <sentence id="s43" ref="sofie-da43" source="Sofie-da" forest="5/7" text="Når Sofies mor var sur over et eller andet, skete det at hun kaldte deres hus for et værre menageri. ">
      <word id=1 form="Når" lemma="når" pos="conj-s" morph="--" extra="--" deprel="SUB" head="4"/>
      <word id=2 form="Sofies" lemma="Sofie" pos="prop" morph="GEN" extra="hum" deprel="DN" head="3"/>
      <word id=3 form="mor" lemma="mor" pos="n" morph="UTR S IDF NOM" extra="--" deprel="S" head="4"/>
      <word id=4 form="var" lemma="være" pos="v-fin" morph="IMPF AKT" extra="--" deprel="fA" head="8"/>
      <word id=5 form="sur" lemma="sur" pos="adj" morph="UTR S IDF NOM" extra="--" deprel="Cs" head="4"/>
      <word id=6 form="over" lemma="over" pos="prp" morph="--" extra="--" deprel="DA" head="5"/>
      <word id=7 form="et_eller_andet" lemma="en_eller_anden" pos="pron-indef" morph="NEU S NOM" extra="--" deprel="DP" head="6"/>
      <word id=8 form="skete" lemma="ske" pos="v-fin" morph="IMPF AKT" extra="--" deprel="STA" head="0"/>
      <word id=9 form="det" lemma="den" pos="pron-pers" morph="NEU 3S NOM" extra="--" deprel="Sf" head="8"/>
      <word id=10 form="at" lemma="at" pos="conj-s" morph="--" extra="--" deprel="SUB" head="12"/>
      <word id=11 form="hun" lemma="hun" pos="pron-pers" morph="UTR 3S NOM" extra="--" deprel="S" head="12"/>
      <word id=12 form="kaldte" lemma="kalde" pos="v-fin" morph="IMPF AKT" extra="--" deprel="S" head="8"/>
      <word id=13 form="deres" lemma="de" pos="pron-poss" morph="--" extra="--" deprel="DN" head="14"/>
      <word id=14 form="hus" lemma="hus" pos="n" morph="NEU S IDF NOM" extra="--" deprel="Od" head="12"/>
      <word id=15 form="for" lemma="for" pos="prp" morph="--" extra="--" deprel="Co" head="12"/>
      <word id=16 form="et" lemma="en" pos="art" morph="NEU S IDF" extra="--" deprel="DN" head="18"/>
      <word id=17 form="værre" lemma="dårlig" pos="adj" morph="COM nG nN nD NOM" extra="--" deprel="DN" head="18"/>
      <word id=18 form="menageri" lemma="menageri" pos="n" morph="NEU S IDF NOM" extra="--" deprel="DP" head="15"/>
    </sentence>

    Transformation Tools: The table below provides an overview of format transformation programs and filters. The pipe symbol '|' means that the transformation may be achieved by chaining a number of step-by-step programs. Red tools are Perl based (Eckhard Bick), blue ones are Prolog based (Søren Harder). NTN-tools are available through the Nordic Treebank Network. cg2visl (green) is not one program, but a suite of language dependent phrase structure grammars and the VISL's open source C++ rule compiler.

    CG CG-dep VISL VISL-dep TIGER TIGER-dep MALT-dep DTAG-dep PENN
    CG
    -
    cg2dep (+ grammar)
    depsplicator
    cg2visl
    (vislpsg + grammar) OR cg2dep | dep2tree
    depsplicator cg2visl | visl2tiger.pl cg2visl | visl2tiger.pl
    | tiger2dep.pl
    OR cg2dep | visldep2malt | malt2tigerdep
    cg2dep | visldep2malt depsplicator cg2dep | dep2tree | visl2penn
    CG-dep perl -wnpe 's/#.*//'
    -
    dep2tree dep2tree | visl2tiger.pl visldep2malt | malt2tigerdep visldep2malt via TIGER-dep (NTN) dep2tree | visl2penn
    VISL tree2cg (tree2cg | cg2dep)
    -
    visl2tiger.pl visl2tiger.pl | tiger2dep.pl visl2tiger.pl | tiger2dep.pl
    | tigerdep2malt
    OR visl2malt
    via TIGER-dep (NTN) visl2penn
    VISL-dep
    -
    TIGER
    -
    tiger2dep.pl tiger2dep.pl | tigerdep2malt via TIGER-dep (NTN)
    TIGER-dep
    -
    tigerdep2malt, (NTN tools) (NTN tools)
    MALT (NTN tools)
    -
    DTAG (NTN tools)
    -

  •  


    Copyright 1996-2017 | Report a Problem / Contact Us | Printable Version