Chapter 17. Tags

Table of Contents

Tag Order
Literal String Modifiers
Regular Expressions
Line Matching
Variable Strings
Numerical Matches
Stream Metadata
Stream Static Tags
Global Variables
Local Variables
Fail-Fast Tag
STRICT-TAGS
LIST-TAGS

First some example tags as we know them from CG-2 and VISLCG:

    "<wordform>"
    "baseform"
    <W-max>
    ADV
    @error
    (<civ> N)
  

Now some example tags as they may look in VISL CG-3:

    "<Wordform>"i
    "^[Bb]ase.*"r
    /^@<?ADV>?$/r
    <W>65>
    (<F>=15> <F<=30>)
    !ADV
    ^<dem>
    (N <civ>)
  

The tag '>>>' is added to the 0th (invisible) cohort in the window, and the tag '<<<' is added to the last cohort in the window. They can be used as markers to see if scans have reached those positions.
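
As a minimal sketch of using these markers in contextual tests (the tag DET is a placeholder for whatever your tagset uses):

        # Select a determiner reading if the previous cohort is the invisible 0th cohort,
        # i.e. the target is the first real cohort in the window.
        SELECT (DET) IF (-1 (>>>)) ;

        # Remove a determiner reading from the last cohort in the window.
        REMOVE (DET) IF (0 (<<<)) ;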

Tag Order

Starting with the last example, (N <civ>): this merely signifies that tags with multiple parts do not have to match in order; (N <civ>) is the same as (<civ> N). This is different from previous versions of CG, but I deemed it unnecessary to spend extra time checking the tag order when hash lookups can verify the existence so quickly.
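
For illustration, assuming a tagset with N and <civ>, the two sets below should be interchangeable (the set names are made up for this sketch):

        LIST CivA = (N <civ>) ;
        LIST CivB = (<civ> N) ;

        # Either rule selects readings that carry both N and <civ>, in whatever order they occur.
        SELECT CivA ;
        SELECT CivB ;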

Literal String Modifiers

The first two additions to the feature sheet both display what I refer to as literal string modifiers, of which there are two: 'i' for case-insensitive, and 'r' for a regular expression match. Using these modifiers will significantly slow down the matching, as a hash lookup will no longer be enough. You can combine them as 'ir' for case-insensitive regular expressions. Regular expressions are evaluated via ICU, so the ICU documentation is a good source. Regular expressions may also contain groupings that can later be used in variable string tags (see below).

Due to tags themselves needing the occasional escaping, regular expressions need double-escaping of symbols that have special meaning to CG-3. E.g. literal non-grouping () need to be written as "a\\(b\\)c"r. Metacharacters also need double-escaping, so \w needs to be written as \\w.
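
A short sketch of the double-escaping in rules (the tag V and the patterns are illustrative only):

        # Targets a baseform with literal parentheses, such as "a(b)c".
        SELECT ("a\\(b\\)c"r) ;

        # \w must be written as \\w; this targets wordforms made of word characters ending in 'ing'.
        SELECT (V) + ("<\\w+ing>"r) ;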

This will not work for wordforms used as the first qualifier of a rule, e.g.:

        "<wordform>"i SELECT (tag) ;
      

but those can be rewritten in a form such as

        SELECT ("<wordform>"i) + (tag) ;
      

which will work, but be slightly slower.

Regular Expressions

Tags in the form //r, //i, and //ri are general-purpose regular expression and case-insensitive matches that may act on any tag type, and unlike Literal String Modifiers they can do partial matches. Thus a tag like /^@<?ADV>?$/r will match any of @<ADV, @<ADV>, @ADV>, and plain @ADV. A tag like /word/ri will match any tag containing a substring with any case variation of the text 'word'. Aside from that, the rules and gotchas are the same as for Literal String Modifiers.
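
For example, a sketch using the two tags above as rule targets:

        # Select readings with any of @<ADV, @<ADV>, @ADV>, or @ADV.
        SELECT (/^@<?ADV>?$/r) ;

        # Remove readings with any tag containing 'word', 'Word', 'WORD', and so on.
        REMOVE (/word/ri) ;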

Line Matching

Tags in the form //l are regular expressions matched on the whole literal reading. They are used to match exact tag sequences. The special helper __ will expand to (^|$| | .+? ) and can be used to ensure there are only whole tags before/after/between something.
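
A hedged sketch, assuming readings in which N and NOM occur as separate tags in that order (both tag names are placeholders):

        # __ ensures N and NOM are matched as whole tags, optionally with other whole tags
        # between them, rather than as substrings of longer tags such as PRON.
        SELECT (/__N__NOM__/l) ;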

Variable Strings

Variable string tags contain markers that are replaced with matches from the previously run grouping regular expression tag. Regular expression tags with no groupings will not have any effect on this behavior. Time also has no effect, so one could theoretically perform a group match in a previous rule and use the results later, though that would be highly unpredictable in practice.

Variable string tags are in the form of "string"v, "<string>"v, and <string>v, where variables matching $1 through $9 will be replaced with the corresponding group from the regular expression match. Multiple occurrences of a single variable are allowed, so e.g. "$1$2$1"v would contain group 1 twice.

An alternative syntax is prefixing with VSTR:. This is used to build tags that are not textual, or tags that need secondary processing such as regex or case-insensitive matching. E.g., VSTR:@m$1 would create a mapping tag, and VSTR:"$1.*"r would create a regex tag. To include spaces in such tags, escape them with a backslash, e.g. VSTR:"$1\ $2"r (otherwise it is treated as two tags).
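
A minimal sketch of the VSTR: form, assuming the group is captured from the wordform in the rule target:

        # Maps a tag made of @m followed by whatever group 1 of the wordform regex captured.
        MAP (VSTR:@m$1) TARGET ("<(.*)>"r) ;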

One can also manipulate the case of the resulting tag via %U, %u, %L, and %l. %U upper-cases the entire following string. %u upper-cases the following single letter. %L lower-cases the entire following string. %l lower-cases the following single letter. The case folding is performed right-to-left one-by-one.
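
As a sketch of the case markers, assuming the wordform is captured in the rule target:

        # Adds an upper-case <WORDFORM> to all readings.
        ADD (<%U$1>v) TARGET ("<(.*)>"r) ;

        # Adds the wordform with its first letter upper-cased and the rest left as-is.
        ADD (<%u$1>v) TARGET ("<(.*)>"r) ;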

It is also possible to include references to unified $$sets or &&sets in {} where they will be replaced with the tags that the unification resulted in. If there are multiple tags, they will be delimited by an underscore _.

It should be noted that you can use varstring tags anywhere, not just when manipulating tags. When used in a contextual test they are fleshed out with the information available at the time and then matched.

        # Adds a lower-case <wordform> to all readings.
        ADD (<%L$1>v) TARGET ("<(.*)>"r) ;

        # Adds a reading with a normalized baseform for all suspicious wordforms ending in 'ies'
        APPEND ("$1y"v N P NOM) TARGET N + ("<(.*)ies>"r) IF (1 VFIN) ;

        # Merge results from multiple unified $$sets into a single tag
        LIST ROLE = human anim inanim (bench table) ;
        LIST OTHER = crispy waffles butter ;
        MAP (<{$$ROLE}/{$$OTHER}>v) (target tags) (-1 $$OTHER) (-2C $$ROLE) ;
      

Numerical Matches

Then there are the numerical matches, e.g. <W>65>. This will match tags such as <W:204> and <W=156> but not <W:32>. The second tag, (<F>=15> <F<=30>), matches values 15 <= F <= 30. These constructs are also slower than simple hash lookups.
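
A short sketch of numerical tags used as rule targets (W and F are the example names from above):

        # Select readings whose W value is greater than 65, e.g. <W:204> or <W=156>.
        SELECT (<W>65>) ;

        # Remove readings whose F value is at least 15 and at most 30.
        REMOVE (<F>=15> <F<=30>) ;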

The two special values MIN and MAX (both case-sensitive) will scan the cohort for their respective minimum or maximum value, and use that for the comparison. Internally the value is stored in a double, and the range is capped between -281474976710656.0 and +281474976710655.0; values beyond that range are treated as those limits.

        # Select the maximum value of W. Readings with no W will also be removed.
        SELECT (<W=MAX>) ;

        # Remove the minimum F. Readings with no F will not be removed.
        REMOVE (<F=MIN>) ;
      

Table 17.1. Valid Operators

Operator    Meaning
=           Equal to
!=          Not equal to
<           Less than
>           Greater than
<=          Less than or equal to
>=          Greater than or equal to
<>          Not equal to


Anywhere that an = is valid you can also use : for backwards compatibility.

Table 17.2. Comparison Truth Table

A       B       Result
= x     = y     True if x = y
= x     != y    True if x != y
= x     < y     True if x < y
= x     > y     True if x > y
= x     <= y    True if x <= y
= x     >= y    True if x >= y
< x     != y    Always true
< x     < y     Always true
< x     > y     True if x > y
< x     <= y    Always true
< x     >= y    True if x > y
> x     != y    Always true
> x     > y     Always true
> x     <= y    True if x < y
> x     >= y    Always true
<= x    != y    Always true
<= x    <= y    Always true
<= x    >= y    True if x >= y
>= x    != y    Always true
>= x    >= y    Always true
!= x    != y    True if x = y


Stream Metadata

CG-3 will store and forward any data between cohorts, attached to the preceding cohort. META:/.../r and META:/.../ri let you query and capture from this data with regular expressions. Data before the first cohort is not accessible.

      ADD (@header) (*) IF (-1 (META:/<h\d+>$/ri)) ;
    

I recommend keeping META tags in the contextual tests, since they cannot currently be cached and will be checked every time.

Stream Static Tags

In the CG and Apertium stream formats, it is allowed to have tags after the word form / surface form. These tags behave as if they contextually exist in every reading of the cohort - they will not be seen by rule targets.
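
As a sketch, assuming the CG stream format and a hypothetical static tag <heading> placed after the wordform:

        "<Introduction>" <heading>
                "introduction" N
                "introduction" V

A rule can then test for the static tag contextually at position 0, since it is not visible to the rule target:

        SELECT (N) IF (0 (<heading>)) ;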

Global Variables

Global variables are manipulated with rule types SETVARIABLE and REMVARIABLE, plus the stream commands SETVAR and REMVAR. Global variables persist until unset and are not bound to any window, cohort, or reading.

You can query a global variable with the form VAR:name, or query whether a variable has a specific value with VAR:name=value. Both the name and the value test can be in the form of // regular expressions, such as VAR:/ame/r=value or VAR:name=/val.*/r, including capturing parts.

The runtime value of the mapping prefix is stored in the special variable named _MPREFIX.

      REMOVE (@poetry) IF (0 (VAR:news)) ;
      SELECT (<historical>) IF (0 (VAR:year=1764)) ;
    

I recommend keeping VAR tags in the contextual tests, since they cannot currently be cached and will be checked every time.
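
A minimal sketch of setting, clearing, and querying variables from rules. The targets @headline, @stanza, and "Jupiter" are placeholders, and the argument order (variable, value, target) is given as I understand it from the SETVARIABLE and REMVARIABLE rule descriptions:

        # Set 'news' to 1 when a reading tagged @headline is seen.
        SETVARIABLE (news) (1) (@headline) ;

        # Set 'year' to a specific value.
        SETVARIABLE (year) (1764) ("Jupiter") ;

        # Unset 'news' again when a reading tagged @stanza is seen.
        REMVARIABLE (news) (@stanza) ;

        # Query the value with a regular expression instead of a literal.
        SELECT (<historical>) IF (0 (VAR:year=/17.*/r)) ;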

Local Variables

Almost identical to global variables, but uses LVAR instead of VAR, and variables are bound to windows.

Global variables are remembered on a per-window basis. When the current window has no more possible rules, the current variable state is recorded in the window. Later windows looking back with W can then query what a given variable's value was at that time. It is also possible to query future windows' variable values if the stream contains SETVAR and the window is in the lookahead buffer.

When LVAR queries the current window, it is the same as VAR.
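
A hedged sketch of looking back across windows; it assumes the W position flag for spanning window boundaries and reuses the variable name from the global variable examples:

        # Remove @poetry if some cohort to the left, possibly in an earlier window,
        # belongs to a window in which 'news' was set.
        REMOVE (@poetry) IF (*-1W (LVAR:news)) ;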

Fail-Fast Tag

A Fail-Fast tag is one with the ^ prefix, such as ^<dem>. It is checked first in a set and, if found, blocks the set from matching, regardless of whether later independent tags could match. It is mostly useful for sets such as LIST SetWithFail = (N <bib>) (V TR) ^<dem>. This set will never match a reading with a <dem> tag, even if the reading matches (V TR).
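
The set above could be used in a rule like this sketch:

        LIST SetWithFail = (N <bib>) (V TR) ^<dem> ;

        # Selects readings matching (N <bib>) or (V TR), but never a reading that also has <dem>.
        SELECT SetWithFail ;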

STRICT-TAGS

If you are worried about typos or need to otherwise enforce a strict tagset, STRICT-TAGS is your friend. You can add tags to the list of allowed tags with STRICT-TAGS += ... ; where ... is a list of tags to allow. Any tag parsed while the STRICT-TAGS list is non-empty will be checked against the list, and an error will be thrown if the tag is not on the list.

It is currently only possible to add to the list, hence +=. Removing and assigning can be added if anyone needs those.

      STRICT-TAGS += N V ADJ etc ... ;
    

By default, STRICT-TAGS always allows wordforms, baseforms, regular expressions, case-insensitive tags, and VISL-style secondary tags ("<…>", "…", <…>), since those are too prolific to list individually. If you are extra paranoid, you can change that with OPTIONS.

To get a list of unique used tags, pass --show-tags to CG-3. To filter this list to the default set of interesting tags, cg-strictify can be used:

        cg-strictify grammar-goes-here
      

For comparison, this yields 285 tags for VISL's 10000-rule Danish grammar. Edit the resulting list to remove any tags you can see are typos or should otherwise not be allowed, stuff it at the top of the grammar, and recompile the grammar. Any errors you get will be lines where forbidden tags are used, which can be whole sets if those sets aren't used in any rules.

Once you have a suitable STRICT-TAGS list, you can further trim the grammar by taking advantage of the fact that any tag listed in STRICT-TAGS may be used as an implicit set that contains only the tag itself. No more need for LIST N = N ; constructs.
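
As a small sketch with a deliberately tiny tag list:

        STRICT-TAGS += N V ADJ ;

        # N and V are used as set names here without any LIST N = N ; or LIST V = V ; definitions.
        SELECT N IF (1 V) ;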

LIST-TAGS

Very similar to STRICT-TAGS, but only performs the final part of making LIST N = N ; superfluous. Any tag listed in LIST-TAGS has an implicit set created for it.
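
A sketch, assuming LIST-TAGS takes the same += form as STRICT-TAGS:

        LIST-TAGS += N V ADJ ;

        # Implicit sets exist for N, V, and ADJ, but other tags are not rejected.
        SELECT N IF (1 V) ;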