Named entities in CG corpus annotation: categories, principles and statistics
(Eckhard Bick, March 2002)
For Danish corpus examples (Corpus 2000), cf. corp.hum.sdu.dk
1. General principles
Prior to classifying names, there should be a stable definition of the parent category ...
problem: How to handle upper case spelling variants of the same names? Unless listed in the lexicon (difficult for such a productive class), the parser will assume an unknown PROP for upper case, and an unknown N for lower case.
problem: Should the mere potential of inflectability be enough to force a noun reading: en Volvo - Volvo'en, and if so, what about en Mercedes 600 SL, where only the first part in isolation can be inflected. Suggestion: Keep PROP, unless in lower case or inflected
problem: As in the previous point, potential inflectability is, as an introspective test, not very accessible to the automatic tagging of a productive word class
problem: Upper case nouns that are not generic and do not allow articles or modifiers, could still sensibly be treated as real PROP names in the lexicon: Hjemmeværnet, Kommunekontoret
problem: As in the last point - an analytic-morphological reading may conflict with a lexicon decision.
2. Name categories
Three levels: 6 core categories, 10 other operational categories, preferably classifiable as subcategories of the core-categories, 4 experimental categories
2.1. Human names <hum>
People's names can be lexically recognized as a chain of registered simplex personal names: Christian, middle and surnames. Morphologically, certain structural lower case words may precede a surname: von, van, de, de la, di, du, da, do, das, dos, y, ten, zu, bin, ibn. These elements may either head the name chain or break the regular flow of upper case words, creating a preprocessor lumping task. In a few cases structural markers become part of a surname as such, as in Gaelic Mac (MacMillan, McNamara), the Scandinavian patronym endings -sen/-son and -dottir (Jensen, Axelsson, Kristindottir) or the Slavic endings -ajev and -owa.
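The lumping task described above can be sketched as follows. This is a minimal illustration, not the Danish parser's actual preprocessor: the function name, the underscore joiner and the treatment of "la" as a separate particle (to cover "de la") are assumptions, and every capitalised word is treated as a potential name part, so sentence-initial words will overgenerate.

```python
# Lower-case particles listed in the text. "la" is added as an assumption
# so that the two-word particle "de la" is covered token by token.
PARTICLES = {"von", "van", "de", "la", "di", "du", "da", "do", "das",
             "dos", "y", "ten", "zu", "bin", "ibn"}


def is_capitalised(token):
    """True if the token starts with an upper case letter."""
    return token[:1].isupper()


def skip_particles(tokens, pos):
    """Return the index of the first non-particle token at or after pos."""
    while pos < len(tokens) and tokens[pos] in PARTICLES:
        pos += 1
    return pos


def lump_name_chains(tokens):
    """Join runs of capitalised words, optionally headed or interrupted by
    particles, into single tokens with '_' as the joiner."""
    out, i = [], 0
    while i < len(tokens):
        j = skip_particles(tokens, i)
        if j < len(tokens) and is_capitalised(tokens[j]):
            end = j + 1
            # Extend the chain while (particles +) a capitalised word follows.
            while True:
                k = skip_particles(tokens, end)
                if k < len(tokens) and is_capitalised(tokens[k]):
                    end = k + 1
                else:
                    break
            out.append("_".join(tokens[i:end]))
            i = end
        else:
            # Ordinary word, or a particle with no name following it.
            out.append(tokens[i])
            i += 1
    return out
```

A particle is only fused when a capitalised word follows it, so an isolated lower case "van" or "de" is left untouched. Since Danish "de" is also a pronoun, a real system would need contextual checks beyond this sketch.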
In Danish, only a few names, most notably Hans and Otte, are - in sentence initial position - co-ambiguous with non-propria classes. Surnames, however, are frequently co-ambiguous with place names, like in Sprogø, Togeby, Svendstrup etc., and are sometimes used for creating names for certain cultural concepts or creations:
a) diseases: M. Crohn, M. Parkinson, M. Cushing, used with either M., Morbus or as a naked name
b) prizes: en Bodil, Nobel-prisen, Pulitzer-prisen
c) paintings: en Picasso, en ægte van Gogh
d) stipends: Betty Jensen Legat
Note that some of these classes, unlike personal names as such, allow articles.
Human name chains can be complements of title nouns, as in Baronesse Blixen, Jomfru Ane, Hr Jensen, Frøken Mathilde, Mrs Smith, Broder Jakob, Dr. Schnelling. Though the noun's <+prop> valency does suggest internal structure in these cases, I have in the Danish parser opted for a non-analytic 1-token name reading integrating the titles. One distinction justifying this solution as opposed to the one chosen for profession noun + name (gravøren Peter Jensen, danserinden Mia Maertens), is the fact that titles cannot be definiteness inflected before names, while profession nouns can, suggesting that the former are under the inherent definiteness scope of the (larger) name chain.
Given this distinction, both preprocessing and name recognition use a list of title nouns and abbreviations in several languages.
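As an illustration of how such a title list might be used in preprocessing, the sketch below fuses a title with a following capitalised token, while leaving profession nouns alone. The list contents are taken from the examples above; the function name and the one-step lookahead are assumptions made for the example.

```python
# Title nouns and abbreviations from the examples in the text (lower-cased).
# A real list would be far longer and cover several languages.
TITLES = {"hr", "hr.", "frøken", "jomfru", "baronesse", "broder",
          "dr.", "mrs", "mrs."}


def fuse_titles(tokens):
    """Fuse title + following capitalised token into one name token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok.lower() in TITLES and nxt[:1].isupper():
            out.append(tok + "_" + nxt)
            i += 2
        else:
            # Profession nouns (gravøren, danserinden) are not in TITLES
            # and therefore stay separate from the name.
            out.append(tok)
            i += 1
    return out
```

Because a title is only fused when a capitalised token follows, a title without a name stays a separate noun token.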
Mythical names denoting humanoids, literary heroes or gods (Snehvide, Odin, Herkules, Mumrik) are also treated as <hum>, but suffer a fair deal of ambiguity due to loan usage in astronomy, titles and other fields.
Title+profession: hr. professor Wiedewelt, hr. dyrehandler Johnny Rasmussen, mrs. filmproducer Gale Anne Hurd. Since title and proper noun bracket the profession nouns in these cases, and the latter cannot inflect, the whole name chain should be treated as one token in these cases, i.e. the same as for title+name without an interfering profession noun.
Articles and demonstratives:
en hr. hvemsomhelst - titles are treated as nouns with complements if no name is present
den Peter jeg husker - the article implies added definiteness, thus implying a generic reading for Peter in isolation. While this supports a noun reading, upper case morphology favours a name reading, as does the fact that generic names have been allowed in other cases (scientific biological names, cars, brands)
lille Ida - modifiers are unusual for names, and one might consider fusing lille onto the name token. However, the fact that no article is provided already nicely distinguishes this construction from ordinary noun-np's.
Alexander den Store - the attribute is a kind of fixed expression, with unproductive syntactical word order. It should either be fused into the name token, or else treated as an apposition.
2.2. Place names <top>
The prototypical Danish place name is written as one token without an article, though international exceptions do exist (Den Haag, O Porto, La Santa, Trinidad y Tobago, San Francisco). In particular, Sankt, San, Santo, Santa, São, Saint etc. are very productive first parts of topologica in Christian countries. Morphologically, place names can often be recognized by geographical elements: -vig, -havn, -bjerg, or the town-specific -rup, -strup, -by, -lev, -sted.
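The morphological cues mentioned here lend themselves to a simple heuristic check. The following is a sketch only: the suffix and prefix lists are those given in the text, while the function name and the decision to return a plain boolean are assumptions. As noted in section 2.1, surnames such as Togeby match the same patterns, so a positive result is a candidate reading rather than a decision.

```python
# Geographical and town-specific endings listed in the text.
TOP_SUFFIXES = ("vig", "havn", "bjerg", "rup", "strup", "by", "lev", "sted")
# Productive "saint" first parts listed in the text.
SAINT_PREFIXES = {"Sankt", "San", "Santo", "Santa", "São", "Saint"}


def looks_like_place(tokens):
    """Heuristic <top> candidate test for an already lumped name chain."""
    if not tokens or not tokens[-1][:1].isupper():
        return False
    # Polylexical names headed by Sankt, San, Santa etc.
    if len(tokens) > 1 and tokens[0] in SAINT_PREFIXES:
        return True
    # One-token names with a typical geographical ending.
    return len(tokens) == 1 and tokens[0].lower().endswith(TOP_SUFFIXES)
```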
Syntactically, place names often have a characteristic left verbal context (<va+LOC>, <vta+LOC>, <va+DIR>, <vta+DIR>, MOVE-verbs) or left prepositional context. In particular, ved and i, and to a lesser degree fra and til, are suggestive of topologicals, though many contextual restrictions apply.
However, place names for human settlements (countries, towns, villages etc.) in particular may also function as +HUM subjects of cognitive verbs, as genitive-marked "owners" or with the case role of AGENT (constructing, sending), blurring the distinctional line between <hum> and <top>. For semantic reasons, and to allow for CG rules using +HUM and -HUM syntactic contexts, I have introduced the civitas <civ> category for these names: Danmark, Ikast, USA, London, Folkerepublikken_Kina. The new category also nicely covers terms like Det Tredje Rige, Romerriget, Sovjetunionen, which are neither typical topologicals nor typical organisations.
Buildings with a geographical value, like churches (St. Peters Katedralen, Ribe Domkirke, Vor Frue Kirke, Bedsted Kirke), are treated as <top>, if all parts are in upper case, or as a name + noun unit, if the last element is in lower case (Uppenskij katedralen). Many buildings, however, have an institutional value that allows them, if only metaphorically, to assume +HUM traits in many sentence contexts. Thus, places like Mønsterbageriet, diskoteket SiSi, Louvre, Legoland, Kommunekontoret, Det Hvide Hus can offer, invite, earn like names from the person or organisation classes, while at the same time allowing for locational prepositions and the somewhat less human, ergative, opening and closing actions. For these cases, I use the category institution <inst>, a kind of hybrid between <org> and <top>, a topologically defined multi-human unit. A characteristic, but unsafe, prepositional context is non-literal på. Compare: på SiSi, på Station Nord versus på Ribe Domkirke. Individual hotels, restaurants and supermarkets (Illum, London Hilton, Rådhuskælderen) are treated as <inst>, while chains are <org> (Dansk Supermarked, MacDonalds, Best Western).
A good introspective test for <top>, <civ> and <inst> alike is subject-hood for ligge + LOC. Although humans and things can also "lie" somewhere (i seng, på bordet), the semantic difference can be tested with the question: Hvor ligger X henne?
Place names can form composita, often with a hyphen, marking either suburbs, motorway access points and postal districts (Århus-Syd, Odense-NØ), geographical ambiguity (Nykøbing Mors) or "fused" locations (Köln-Wahn). These cases are difficult to distinguish from route denominations like Århus-Kalundborg, Dover-Calais, or trekanten Berlin-Wien-Rom, which therefore are also treated as <top> (though so far it is hard to distinguish them from <civ>).
Other problematic cases involving topologicals are:
soccer or other sports "pairings": Danmark-Norge 2:1, semifinalerne Rusland-Spanien (tirsdag) og Frankrig-Italien (onsdag). I aften spiller Lillerød-Triton (<top/civ>?) og GBK-KM (<org>?). Both <civ>, <org> and metaphorical <top> can be parts in these constructions.
sporting or political events preferably tagged <occ>, but involving place names: han vandt Wimbledon, Paris-Dakar startede i går, efter Maastricht, siden Watergate
addresses: Helgolandsgade 8 st.th., 8800 Viborg. The question is how much of an address to lump into one name token. At present, the Danish CG parser treats 8_st._th. as a postnominal dependent of the street name, and fuses postal codes into town name tokens.
geographical "titles": (i) han bor på Slot Belvedere, (ii) det gamle slot Belvedere, (iii) Slottet Belvedere er smuk. While the non-inflected upper case form in (i) suggests one fused name token, it can be argued that the definiteness markers in (ii) and (iii) support a separate noun (Slottet) + complement (Belvedere) reading.
banks: Mostly used as organisation names <org>, but can denote the building: (i) Unibanken nedsatte renten vs. (ii) Han stod i kø i en halv time henne i Den Danske Bank.
2.3. Organisations <org>
This category covers companies, organisations, ideological movements and the like. Though in principle lexicographically accessible, this category is very productive, more in the style of personal names and unlike place names, which form a more stable inventory. Morphologically, full <org> names are often polylexicals, involving "safe" company markers like A/S, K/S, I/S, ApS, Ltd, GmbH, & Co., & bros., nominal compound parts like ...selskab, ...forening, Selskab for ..., or classifier second parts like ... Foods, ... Electronics, ... Industries, ... Airlines. Complex <org> names may involve articles (marked as upper case) as in Den Danske Bank, Den Danske Forening, or prepositions, in particular for and af (usually not in upper case): Sammenslutningen af Alternative Behandlere, Medicinsk Forening for Akupunktur. Fairly safe graphical markers, used in simplex names, are upper case letters in mid-word: GrammarSoft, ItauTech, and - in little proud and umbilical Denmark - names initiating in Dan... or Scan...
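The markers named in this paragraph can be collected as independent pieces of evidence for an <org> reading. The sketch below is a hypothetical illustration: the evidence labels, the function name and the regular expression for mid-word capitals are assumptions, and the test names Hansen A/S and Dansoft are invented. Note that surnames like MacMillan also show a mid-word capital, and first names like Daniel share the Dan... prefix, so the evidence would still have to be weighed against other readings.

```python
import re

# "Safe" one-token company markers from the text ("& Co." and "& bros."
# span several tokens and are left out of this sketch).
COMPANY_MARKERS = {"A/S", "K/S", "I/S", "ApS", "Ltd", "GmbH", "Gmbh"}
# Classifier second parts from the text.
CLASSIFIERS = {"Foods", "Electronics", "Industries", "Airlines"}
# Nominal compound parts from the text.
ORG_NOUN_ENDINGS = ("selskab", "forening")
# One upper case letter in mid-word, e.g. GrammarSoft, ItauTech.
MID_CAPITAL = re.compile(r"^[A-ZÆØÅ][a-zæøå]+[A-ZÆØÅ][a-zæøå]+$")


def org_evidence(name):
    """Return the list of <org> cues found in a candidate name string."""
    tokens = name.split()
    evidence = []
    if any(t in COMPANY_MARKERS for t in tokens):
        evidence.append("company_marker")
    if len(tokens) > 1 and tokens[-1] in CLASSIFIERS:
        evidence.append("classifier")
    if any(t.lower().endswith(ORG_NOUN_ENDINGS) for t in tokens):
        evidence.append("org_noun")
    if len(tokens) == 1 and MID_CAPITAL.match(tokens[0]):
        evidence.append("mid_word_capital")
    if len(tokens) == 1 and tokens[0].startswith(("Dan", "Scan")):
        evidence.append("dan_scan_prefix")
    return evidence
```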
Typical of movements and organisations, and to a certain degree of companies, are abbreviations of 2-3 or more capital letters (NATO, SF, WHO, AUDI, AEG, AGF), though there is considerable ambiguity with chemical formulae (CO, NOX), events (OL, VM) and other types of abbreviations, calling for maximal lexicographical treatment of abbreviations. A well-defined sub-category of <org> is that of political parties <party>, whose names are often abbreviated, and quite common in newspaper corpora.
Especially in (sports) club names, <org> names can consist of an abbreviation followed by a place name: FC København, MC Herning.
Semantically, the <org> category shares a lot of its functional distribution with both person names and institutions (agent case role, +HUM subject-hood with cognitive and speech verbs), but can be distinguished from the former by allowing the preposition i to the left, and from the latter by being -PLACE, i.e. not allowing narrow place prepositions like ved. An introspective semantic test is that <org> can be founded and joined, but not situated and touched. <org> is an abstract entity, while <inst> is a physical entity.
A special case are names of newspapers, radio channels and tv stations, most of which are used as both <org> names and title names <tit>. They can be read in, watched or listened to (like books, films and songs), but at the same time allow for a degree of +HUM agenthood and subjecthood for cognitive verbs: Jeg har læst i Jyllandsposten, at .... Jyllandsposten har ansat 20 nye medarbejdere, BT tør hvor andre tier ... In order to capture this double potential, I have introduced the name category of media <media>: Miljø Nyt, Bådnyt, Ekstrabladet, Aftonbladet, Le Figaro, Sunday Times, Frankfurter Allgemeine.
2.4. Events and occasions <occ>
This category is used for both natural and organized events, periods and time names in general. A good distributional test are time-prepositions, in particular efter, indtil and siden. As subjects, most members of this category allow time verbs like vare, stå på (duration feature), foregå, forløbe (process, activity), ske (event), begynde, slutte etc. (not starte, stoppe, which also work for moving things).
In Danish, many lexically fixed "time names", like the names of days, months and Christian holidays (tirsdag, januar, jul, påske), are written in lower case, thus, if not contradicting, then at least discouraging proper name analysis in terms of morphological word class, though most of these do not inflect in number or definiteness, which might be considered criteria in favour of a proper noun reading. There are, however, many upper case polylexicals where either the first or last part is a noun denoting an event or occasion, thus allowing the name in question to be classified as <occ>: Operation Barbarossa, US Open, Tour de France, Tønder Festival, Første Verdenskrig, Golfkrigen, Europamesterskaberne.
The last two examples are morphologically different, since they are inflecting, and could also be read as composita consisting of a proper noun (Golf, Europa) and a common noun. Such an analysis has, in fact, proven to be very robust. Being analytical, the latter method makes optimal use of the parser's existing lexicon, without the need for a more heuristical name recognizer, or individual lexicon additions.
Expositions: Expo 98, "Keltiske fyrster" (<tit>?), den 5. internationale Arkitekturudstilling
Projects: PaNoLa, Cordial Syn, Hjerteugen (<tit>?)
Sites for events: Siden Kranskaja Gora 1988, with a metaphoric place -> event transfer
2.5. Book, film and music titles <tit>
Titles can be simplex words ("Skyggen", Iliaden, Genesis, Eddaen), but will often consist of more than one word, and display a syntactic structure of their own, while still functioning as constituent units in higher order syntactic structures. The strategy of the Danish CG parser is to assign internal structure only where titles appear in isolation, and to lump titles as one-name units in all other cases (titles as subject, object, argument of preposition):
"Frøken_Smillas_fornemmelse_for_sne" blev filmatiseret på engelsk.
Graphical criteria for recognizing titles are quotes and - where surviving in electronic corpora - italics. Furthermore, the 1. word in a title, as well as most content words, will be in upper case, though there seems to be a great deal of variation as to this point, not least because many title names in Danish texts are in English, German, French or other languages. Recognizing title names is thus largely a preprocessing task, and once recognized as a token, most titles are fairly unambiguous.
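A minimal sketch of the quote-based part of this preprocessing step is given below, producing the underscore-joined form used in the example above. It assumes straight double quotes only; the function name is hypothetical. It does not implement the exception for titles in isolation, nor recognition by italics or capitalisation.

```python
import re

# Any non-empty stretch between two straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def lump_quoted_titles(sentence):
    """Join the words of each quoted stretch with '_' so that the title
    reaches the tagger as one token (quotes are kept)."""
    def join_words(match):
        return '"' + "_".join(match.group(1).split()) + '"'
    return QUOTED.sub(join_words, sentence)
```

One-word quoted titles pass through unchanged, since there is nothing to join.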
Semantically, a distinction can be made between literary "running text" titles on the one hand ("The Quick and the Dead", "Det lille hus på prærien", "Mit liv som hund", "Cosi fan tutte", "På loftet sidder nissefar"), and "classifying" titles on the other. This latter category is rarely quote marked, does not exceed np-structure, and can usually be recognized by a classifying key nominal element: Den Danske Ordbog, Bill of Rights, Lov om ..., Grundloven, Warszawa-pagten. Again, a number of these could also be read as name+noun composita.
Collections: Skagens Samlingen, Top 10, Top 50, Kollektionen "Purple Pal" (<brand>?)
Stipends etc.: H.V.Jensens Legat, Betty Jensen Fondats (<org>?, <brand>?)
Prizes: Nobelprisen, De Gyldne Palmer (<brand>?)
Name titles: Der står stadig "Pavarotti" på dem. Filmen "Oscar". Here a name of another category, <hum>, fills all of the title name. My current view is that <hum> would be the "analytical" internal reading, while <tit> should be maintained in sentence context, quotes and syntax overruling internal name recognition.
A related category is that of <genre>, which subsumes the names of literary or philosophical traditions (Science Fiction, Islam), areas of study (Anatomi, Fysiologi), games (Backgammon, "4-på stribe") and dances (Cha cha cha, Square Dance, Togtur til Vejle, Paso Doble). These cases are somewhat reminiscent of the <tit> category, since they are names for cognitive creations, and sometimes use quotes ("4-på-stribe") or some internal structure (Togtur til Vejle). Dance names in particular are often co-extensive with corresponding music names. On the other hand, members of the <genre> category are more generic, less individual names, and thus closer to the realm of common nouns. In Danish, there is some corresponding telltale fluctuation in the usage of upper and lower case, where the individual writer can express a varying degree of name-hood by using either one or the other.
An introspective test for the <genre> class is direct object-hood for dyrke, lære, undervise, forkynde.
Unquoted book and film titles: de kan høre Something to believe in. There are no quotes in this case (or possibly lost italics) making preprocessing tokenisation very hard, but the initial capital letter and the chaining of foreign language words could in theory allow token recognition.
2.6. Brand names <brand>
Brand names cover a wide range of products, for instance foods (Corn flakes), drinks (Coca Cola), operating systems (Linux), sanitary products (Reponse Shampoo) etc., defying a homogeneous semantic description and turning the category into the naming system's obvious waste bin category, with the notable exception of vehicles (to be discussed below). A morphological feature that nevertheless helps characterize this category is its members' strong tendency to turn into "real" inflecting common nouns: Volvo'en, hans gamle Macintosh, at drikke 4 Tuborg, spise en After Eight, tage 3 kodimagnyler/Kodimagnyler, at smøre Kærgården på brødet. Unless listed in the lexicon in inflected form, the Danish CG parser will treat such inflected words as common nouns in spite of their name etymology and upper case first letter, assigning gender, number, case and definiteness features. In fact, there is a tendency for very common brand names to shed the upper case initial and become ordinary nouns (which then would deserve to be listed in the lexicon as such): betale med dankort, drikke cola, købe en pc'er. Semantically, brand names denote things, concrete movable physical entities, that can be had and brought (not to mention bought).
Brand names are often derived from company names, allowing for a certain ambiguity (Tuborg, Volvo). In other cases, the company name enters as first part of a polylexical brand name (Tuborg Gold, VW 1300, Peugeot 603 Cashmere, Apple II, Konica E240 Super SR), allowing the name recognizer to assign a <brand> tag on the grounds of a recognized company name and a variable second part, in many cases consisting of a well-patterned combination of Arabic or Roman numerals, capital letters and brand type specific key words, for instance coupé, sedan or station car for cars. Typical for brand names in a more general sense are superlative markers like Ultra, Super, Extra, de Luxe.
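The "recognized company name + variable second part" strategy might be sketched as below. The company lexicon is a hypothetical stand-in filled with the examples from the text, and the acceptance criterion for the second part (a number, a capitalised token, or one of the key words and superlative markers mentioned above) is a deliberately loose assumption rather than the parser's actual pattern.

```python
# Hypothetical mini-lexicon of company names, taken from the examples.
KNOWN_COMPANIES = {"Tuborg", "VW", "Peugeot", "Apple", "Konica", "Volvo"}
# Type-specific key words and superlative markers from the text (lower-cased).
KEY_WORDS = {"coupé", "sedan", "station", "car",
             "ultra", "super", "extra", "de", "luxe"}


def is_model_part(token):
    """Accept numerals, capitalised parts (E240, II, Gold) and key words."""
    return token.isdigit() or token[:1].isupper() or token.lower() in KEY_WORDS


def is_brand_candidate(name):
    """True for 'company name + patterned second part' chains."""
    tokens = name.split()
    if len(tokens) < 2 or tokens[0] not in KNOWN_COMPANIES:
        return False
    return all(is_model_part(t) for t in tokens[1:])
```

A bare company name such as Tuborg is deliberately rejected here, since in isolation it remains ambiguous between <org> and <brand>.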
Wine names are usually derived from regions or other place denominations, and are thus ambiguous between <brand> and <top>, unless a year number or type specific extension (Appellation, Cru, Sec, Blanco) forces the distinction.
Ship names can often be recognized by a systematic first part (USS, HMS, S/S, M/S), followed by a variable name part. In the present tagging system, following the semantic prototyping of ordinary nouns, cars, ships, planes and space shuttles are lumped together as vehicles <V>, since they differ from other brand names in that they allow movement verbs, like members of the +ANIM classes <hum> and <A>. A distinction not made explicit at present, is that between generic and individual vehicle names, the ship names above being examples of the latter, car names an example of the former. A similar distinction holds for biological names vs. pet animal names. In the lexicon, but not yet in the parser, <v> and <a> are used for generic names, <V> and <A> for individual names.
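The prefix cue for ship names and the lexicon-side <V>/<v> distinction can be illustrated with a small sketch. The function name and the optional generic-vehicle lexicon parameter are assumptions, and the ship name in the usage note is an illustrative example not taken from the corpus.

```python
# Systematic first parts of ship names listed in the text.
SHIP_PREFIXES = {"USS", "HMS", "S/S", "M/S"}


def vehicle_tag(name, generic_vehicle_lexicon=frozenset()):
    """Return '<V>' for individual vehicle names recognised by prefix,
    '<v>' for generic names found in the (hypothetical) lexicon,
    and None if neither cue applies."""
    tokens = name.split()
    if len(tokens) > 1 and tokens[0] in SHIP_PREFIXES:
        return "<V>"
    if name in generic_vehicle_lexicon:
        return "<v>"
    return None
```

For instance, vehicle_tag("HMS Victory") yields "<V>", while vehicle_tag("Volvo", {"Volvo"}) yields "<v>".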
2.7. Substance and material names <mat>
Apart from the experimental categories, this is the name category that most systematically contradicts upper case usage, though there is still a great deal of orthographical variation in case spelling. The semantics of the category is parallel to the corresponding common noun class (træ, gummi, klister, salt), the difference being the degree of "artificiality" and "scientific specificity". The largest group are pharmaceuticals: salvarsan, kodimagnyl, agiolax. To a certain degree, these can be heuristically captured via endings like -am, -cid, -lax, or additions like retard or forte.
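The ending-based heuristic can be sketched as follows, using only the endings and additions named above. The function name is hypothetical, and the check is case-insensitive because of the orthographical variation just mentioned. As the examples show, the heuristic captures only part of the class (agiolax, but not salvarsan), and it will also accept ordinary words with the same endings, so it is best seen as a fallback for words not found in the lexicon.

```python
# Endings and additions named in the text.
PHARMA_ENDINGS = ("am", "cid", "lax")
PHARMA_ADDITIONS = {"retard", "forte"}


def looks_like_pharmaceutical(name):
    """Heuristic <mat> candidate test for unknown words."""
    tokens = name.lower().split()
    if not tokens:
        return False
    # "X retard", "X forte"
    if len(tokens) > 1 and tokens[-1] in PHARMA_ADDITIONS:
        return True
    # One-token names with a typical ending.
    return len(tokens) == 1 and tokens[0].endswith(PHARMA_ENDINGS)
```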
Another <mat> group consists of chemical abbreviations: NaCl, NO2, H2O etc., though their lower case long forms (natriumklorid etc.), somewhat illogically, are still treated as nouns.
Pharmaceuticals, lotions, household substances etc.: Should upper case / lower case as a token feature force the N/PROP distinction, or should the lexicon decide? If the former is chosen, how to decide in sentence initial position?
2.8. Animal and plant names <A>, <B>
This is the category used for scientific biological species names of the type Mus musculus, Bacillus subtilis, Arenomya arenaria, Quercus robur, where the first, higher order part of a typically polylexical name is in upper case, the rest in lower case. Heuristic morphological recognition can be attempted using certain characteristical Latin inflexion and derivation endings.
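A sketch of such a heuristic is given below. The shape test (capitalised first part, lower case remainder) follows the description above, whereas the list of Latin endings is an assumption made for illustration, since no specific endings are listed here. Ordinary sentence-initial word pairs can satisfy both tests, so the result is only a candidate reading.

```python
import re

# Assumed sample of Latin endings; not a list taken from the parser.
LATIN_ENDINGS = ("us", "a", "um", "is", "ae", "ii")

GENUS = re.compile(r"[A-Z][a-z]+")
EPITHET = re.compile(r"[a-z]+")


def looks_like_species(name):
    """Heuristic test for scientific names of the type 'Mus musculus'."""
    tokens = name.split()
    if len(tokens) < 2:
        return False
    if not GENUS.fullmatch(tokens[0]):
        return False
    if not all(EPITHET.fullmatch(t) for t in tokens[1:]):
        return False
    # Require at least one part with a Latin-looking ending.
    return any(t.lower().endswith(LATIN_ENDINGS) for t in tokens)
```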
Other sets of names tagged <A> are pet names like Hundi, Fido, Rex, Blacky etc., and mythical beasts like Pegasus, Cerberus etc., both denoting individuals rather than a species. Like with vehicles, the parser does not make this distinction explicit, whereas the lexicon distinguishes between <A> for pet names and <a> for species names.
2.9. Astronomical and astrological names <astro>
This category subsumes names of planets, stars, moons and other celestial bodies, as well as constellations and human space installations that are not vehicles (i.e. space stations): Mars, Merkur, Io, Ganymed, Halley, Mir, Karlsvognen, Orion, Sirius etc.
Most of these names were inspired by mythical names from the Greek and Roman classics tradition, and are thus ambiguous between <astro> and <hum>, demanding contextual disambiguation, relying, among other things, on the instantiation of ± HUM selection restrictions by CG rules.
2.10. Other, experimental categories
A number of minor name categories can be defined, but not easily subsumed under any of the larger categories. Current candidates are <disease>, <ling>, <race> and <wea>.
Disease names <disease> are sometimes based on surnames (Parkinson, M. Parkinson, Morbus Parkinson, Hodgkin, Menière), but there is also a Latin scientific classification (Diabetes mellitus), and in some cases ordinary Danish disease nouns are spelled in upper case, suggesting a certain name consciousness on the part of the writer (Mæslinger, Tuberkulose, Leddegigt).
Language names <ling> are usually treated as nouns, but (increasing?) upper case usage, or the lack of the usual nationality adjective analogue can suggest a name reading (latin, esperanto, pidgin, volapyk).
Race or ethnicity names <race> are also commonly treated as nouns (in fact often language names at the same time), which is supported by the fact that many instances are inflected in the definite plural (Yanomamierne, Irokeserne, Maori, Xhosa, Yoruba). However, words from this category are often spelled in upper case in Danish, mimicking English usage.
Weather phenomena <wea> are sometimes assigned names too, most notably storms. A fashionable example is El Niño. Weather names are spelled with upper case initials, and could be subsumed under the event category <occ>.
Feature bundling in the major name categories (synopsis)
In the table above, certain feature bundling structures become evident. The yellow cell block shows, how the features +COGNITION and +LOCATION can be used to distinguish <hum>/<org> from <top> and <inst>/<civ>, respectively, while <brand> and <occ> have neither of these features. The blue cells lump together discrete physical entities, which may possess both generic and individual names, with the former exhibiting a certain tendency towards inflexion and lower case. The green cell block makes the necessary distinctions within the entity block, using the 4 possible permutations of ± LIFE and ± MOVE. The red block shows the verbal-semantic tests necessary to classify names within the "human-made" categories.
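The yellow-block distinction can be made concrete as a small lookup. The feature values below are one reading of the sentence above together with sections 2.2 and 2.3 (<civ> and <inst> combining +COGNITION and +LOCATION, <org> being +COGNITION only, <top> +LOCATION only); they are an interpretation for illustration, not a transcription of the table, and the function name is hypothetical.

```python
# (cognition, location) per major name category; an interpretation of
# the prose, not a copy of the synopsis table.
FEATURES = {
    "<hum>": (True, False),
    "<org>": (True, False),
    "<top>": (False, True),
    "<inst>": (True, True),
    "<civ>": (True, True),
    "<brand>": (False, False),
    "<occ>": (False, False),
}


def candidates(cognition, location):
    """Name categories compatible with the observed feature values."""
    return sorted(tag for tag, feats in FEATURES.items()
                  if feats == (cognition, location))
```

A name seen both as subject of a cognitive verb and after a locational preposition would thus be narrowed down to <civ> or <inst>, leaving the remaining choice to other tests.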
Name type incidences in running text
The statistics were computed on a so-called "quote corpus" of mixed running text (DSL's Corpus 2000, in preparation), with a 4% proper noun incidence. Name type data for the first 100.000 words were compared with incidences in the whole corpus.
As the table indicates, almost half of all names were found to be personal names <hum>, a third were place names <top> (of these 3/4 "humanoid") and 12-16% were "place-less" human non-physical entities <org>. Title frequency <tit> was around 4%, while events <occ> and brands/others <brand> hover around the 1% mark.
In roughly two thirds of all tokens a primary proper noun subclass reading was assigned from the lexicon, in one third a primary reading was assigned by morphological, contextual or heuristic means. In both cases, ambiguity was resolved by a Constraint Grammar module. This module also contains context sensitive mapping rules which can force a name subclass reading other than the primary reading, or increase ambiguity by adding further readings before the disambiguation rules are run.
Two 100.000 word subcorpora from the quote corpus were inspected for name tagging errors, evaluating all name readings in one case, and only heuristically based readings in the other. Error types were 'wrong major class' (6 name classes) and 'wrong minor class' (but correct major class), as well as PoS errors concerning the PROP tag itself, i.e. false positive and false negative proper noun readings. The latter also include tokenisation errors, i.e. fusing too much into a PROP token (false positive), or too little (false negative). The false negative figure is somewhat unsafe, since evaluation focused on PROP contexts only. A special source of errors was the corpus type as such: In spite of a tailor made automatic preprocessing module, there were a fair number of sentence chunking errors (irremediable due to the mixed sentence order of the quote corpus), as well as many upper case first and second words, presumably used as a kind of bold facing, but creating false positive PROP readings. In all, such problems accounted for an additional error percentage of 1.3% of all PROP readings, and were not included in the table below. Naturally, most of these cases involved non-lexicon readings.
Name type errors in running text
As can be seen from the table, major semantic class errors are nearly twice as frequent for names without a lexicon entry. Subclass errors within the same major class are fairly rare, possibly due to the absolute rareness of certain subclasses (cf. previous table). Word class (PoS) and tokenisation (chunking) errors concerning the PROP category run at an error rate of about 1.6 %, which is slightly worse than a Constraint Grammar parser's average PoS error rate, a probable reason being the fact that proper noun recognition is much more dependent on tokenisation and preprocessing than the recognition of other word classes, which to a higher degree can be based on grammatical features and context conditions alone.