University of Southern Denmark
Visual Interactive Syntax Learning

Semantic Prototype Project Description
Please note: The development and implementation of the semantic prototypes are work in progress.

Project Researchers
Eckhard Bick
Lone Hegelund

Project Period
Feb. 1 - Dec. 31, 2001.

Short Description
The goal of the Semantic Prototype Project was to develop and implement a semantics- and valency-based systematics for Danish nouns. The semantic tags for Danish nouns are used within the overall Constraint Grammar (CG) to further disambiguate, at the syntactic level, the automatic analyses of running Danish text (open corpora). The present systematics has been developed and tested on a corpus of originally 80,000 Danish nouns. Of these, approximately 75% have one or more tags that can be used within CG. Of the remaining 25%, a large number have tentative tags. Several prototype categories have been checked and further defined.

Semantics and Prototypes
Semantics in general is about defining the meaning of a word: What does a word mean? However, within the Constraint Grammar (CG) approach applied in the VISL project, semantics is used to distinguish between meanings, and thus potential readings, of a noun in a syntactic context. The semantic tags applied thus do not uniquely define a given noun. Instead, the tags designate one or several prototypes for each noun.
            A prototype is an (idealized) best instance of a given class of entities. Prototypes are thus often what one calls "class hyperonyms", e.g. "animal", "human being" and "plant". In other words, a class hyperonym is the most general term that can serve as a common reference for the members of its class: "human being", for example, covers individuals who may be characterized as an architect, a child, an Italian, or that annoying guy hanging over the counter.
            The prototypes used in the Semantic Prototype Project are based on earlier work done by Eckhard Bick for Portuguese (Bick 2000). At present the systematics includes approximately 150 prototypes and is under continuous revision. Ultimately the systematics should be developed to such an extent that it will enhance the automatic machine translation of Danish by improving the ability of the CG system to handle semantic disambiguation. This development is already well under way for Portuguese (cf. this site and Bick 2000).

Constraint Grammar and Prototypes
CG is a grammar of constraint rules that are applied to a text which has been automatically tagged with all possible morphological and syntactic readings. Out of this multitude of possible readings, the CG formalism eliminates all those that are impossible in context. Thus only one possible reading is left or, in the case of a truly ambiguous text, several (for more information, see Constraint Grammar).
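The elimination step can be illustrated with a small Python sketch. It is not the VISL rule formalism or any real CG engine: the tag names (`DET`, `N`, `V`), the two-word English example and the single rule are assumptions made purely for illustration. Each word starts with all its possible readings, and a rule removes a reading when its context condition holds, but never the last remaining one.

```python
# Toy illustration of Constraint Grammar-style disambiguation.
# Tags, words and the rule are hypothetical, not taken from the VISL grammars.

def remove_verb_after_determiner(cohorts, i):
    """Return the reading to discard at position i, or None."""
    if i == 0:
        return None
    previous = cohorts[i - 1]["readings"]
    # Context condition: the previous word is unambiguously a determiner.
    if previous == ["DET"] and "V" in cohorts[i]["readings"]:
        return "V"
    return None


def disambiguate(cohorts, rules):
    """Apply the rules repeatedly until no reading can be removed."""
    changed = True
    while changed:
        changed = False
        for i, cohort in enumerate(cohorts):
            for rule in rules:
                target = rule(cohorts, i)
                # The last remaining reading of a word is never removed.
                if target is not None and len(cohort["readings"]) > 1:
                    cohort["readings"].remove(target)
                    changed = True
    return cohorts


# "the walk": before disambiguation, "walk" can be a noun or a verb.
sentence = [
    {"word": "the", "readings": ["DET"]},
    {"word": "walk", "readings": ["N", "V"]},
]
result = disambiguate(sentence, [remove_verb_after_determiner])
```

After the call, the verb reading of "walk" has been discarded and only the noun reading remains; a word with a single reading is left untouched even if a rule targets it.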
            In earlier work, Bick has shown that it is possible to extend CG to a semantic level through semantic tagging (Bick 2000). Semantic tags can work as secondary tags for the resolution of syntactic ambiguity. However, if the long-term goal of machine translation is to be achieved, it is important to distinguish between different readings of polysemous words in order to translate correctly. The semantic tags then become primary tags and are themselves subject to disambiguation through the use of atomic semantic features. Again, neither the prototypes nor their atomic semantic features are meant to uniquely define a noun. The CG formalism allows for a highly context-dependent and thus detailed disambiguation of semantic prototypes, both as secondary and as primary tags.
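The case where semantic tags are themselves disambiguated can be sketched as follows. Everything in the example is hypothetical: the prototype tags (`H` for human, `A` for animal, `tool`), the two small lexicons and the English words are invented for illustration and are not the project's tag set, and the sketch does not model atomic semantic features. It only shows how context (here, the verb governing the noun as subject) can narrow down the prototypes of a polysemous noun.

```python
# Hypothetical prototype tags: "H" = human, "A" = animal, "tool" = tool/machine.
NOUN_PROTOTYPES = {
    "crane": {"A", "tool"},  # polysemous: a bird or a lifting machine
    "pilot": {"H"},
}

# Hypothetical selection restrictions: prototypes a verb accepts as its subject.
SUBJECT_RESTRICTIONS = {
    "eat": {"H", "A"},
    "rust": {"tool"},
}


def select_prototypes(noun, verb):
    """Keep the noun's prototypes that fit the verb's subject slot.

    If the verb is unknown or no prototype fits, all prototypes are kept,
    so the selection never discards every reading.
    """
    candidates = NOUN_PROTOTYPES[noun]
    allowed = SUBJECT_RESTRICTIONS.get(verb)
    if allowed is None:
        return set(candidates)
    fitting = candidates & allowed
    return fitting if fitting else set(candidates)
```

With these toy lexicons, `select_prototypes("crane", "eat")` keeps only the animal prototype, while `select_prototypes("crane", "rust")` keeps only the tool prototype.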

Valency and Valency Tags
Valency refers to the number and types of bonds which syntactic elements may form with each other. Valency is typically connected with the verb and its dependent elements, referred to as e.g. arguments, complements or valents. Different verbs require a different number of arguments and are accordingly considered monovalent, bivalent or trivalent. "To disappear" is a monovalent verb, as it requires only a subject in order to form a well-formed sentence such as "He disappeared". "To buy", on the other hand, is bivalent, as the verb requires a subject and an object: "She bought a new computer".
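Verb valency as described above can be written down as a small lexicon of required argument slots. The following Python sketch is only an illustration: the slot names and the data structure are assumptions, and the trivalent entry for "give" is an added example that does not come from the project.

```python
# Each verb maps to the argument slots it requires.
VERB_VALENCY = {
    "disappear": ("subject",),                      # monovalent
    "buy": ("subject", "object"),                   # bivalent
    "give": ("subject", "object", "indirect object"),  # trivalent (added example)
}

VALENCY_NAMES = {1: "monovalent", 2: "bivalent", 3: "trivalent"}


def valency_class(verb):
    """Name the valency class from the number of required slots."""
    return VALENCY_NAMES[len(VERB_VALENCY[verb])]


def is_well_formed(verb, arguments):
    """True if every slot required by the verb is filled in `arguments`."""
    return all(slot in arguments for slot in VERB_VALENCY[verb])
```

For instance, `is_well_formed("buy", {"subject": "she"})` is false because the object slot is empty, whereas adding `"object": "a new computer"` makes the check succeed.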
            However, in this project valency refers to nouns and their ability to form different kinds of syntactic relations, for example a noun which can be the dependency attachment point of a non-finite or finite subclause, i.e. the subclauses are postnominal:

            1. Non-finite subclause: Vi har forskellige behov, forskellige måder at udtrykke vores følelser og drifter på. ("We have different needs, different ways of expressing our feelings and urges.")
            2. Finite subclause: Han tog den beslutning, at han ville holde op med at ryge. ("He made the decision that he would stop smoking.")

At present, only a few valency tags have been applied in the corpus. However, nouns have been tagged for "prepositional valency": prepositions that typically co-occur with the noun. This tagging is not valency marking in a strict (or even a less strict) sense, but rather a probabilistic/statistical hint.
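The text does not say how these co-occurrence hints were derived. One plausible way, shown here purely as an assumption, is to count which prepositions directly follow a noun in a tokenized corpus and to turn the counts into relative frequencies. The preposition list and the English toy corpus in this Python sketch are illustrative only.

```python
from collections import Counter

# Illustrative, non-exhaustive preposition list (an assumption of this sketch).
PREPOSITIONS = {"for", "of", "in", "on", "to"}


def preposition_profile(tokens, noun):
    """Relative frequency of the prepositions directly following `noun`."""
    counts = Counter(
        nxt
        for cur, nxt in zip(tokens, tokens[1:])
        if cur == noun and nxt in PREPOSITIONS
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prep: n / total for prep, n in counts.items()}


# Toy corpus: "need" is followed twice by "for" and once by "of".
tokens = "the need for rest and the need for food and the need of help".split()
profile = preposition_profile(tokens, "need")
```

In the toy corpus "for" accounts for two thirds of the prepositions after "need" and "of" for one third, which is the kind of statistical hint, rather than strict valency marking, that the paragraph describes.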

Related Projects
Other projects have dealt with semantic tagging of corpora. The project SIMPLE - Semantic Information for Multifunctional Plurilingual LExicons - aimed at developing wide-coverage semantic lexicons for 12 European languages. SIMPLE used a common semantic model and semantic coding formalism and distinguished 139 semantic types. Each semantic type was defined as a cluster of structured semantic information, e.g. semantic class, domain, argument structure of predicative expressions, selectional restrictions and qualia roles (cf. Pustejovsky 1995).
            The SIMPLE project was an extension of the European PAROLE project: Preparatory Action for Linguistic Resources Organisation for Language Engineering. The main objective of PAROLE was to produce corpora and lexical resources for each of the 14 participating European languages. One of the results of the PAROLE project was a set of harmonised lexica containing a minimum of 20,000 entries provided with morphosyntactic and syntactic information: 12,000 nouns, 3,000 verbs, 3,000 adjectives, 500 adverbs and 1,500 entries with other POS categories.
            For information on SIMPLE and PAROLE for Danish, see the documentation on SIMPLE and PAROLE provided by the Center for Language Technology.

References
Bick, Eckhard. 2000. The Parsing System "Palavras". Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Århus: Aarhus University Press.
Karlsson, Fred et al. (eds.). 1995. Constraint Grammar. A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton de Gruyter.
Pustejovsky, James. 1995. The Generative Lexicon. Cambridge, MA/London, England: MIT Press.

 




Copyright 1996-2020