com.sun.labs.minion.lexmorph
Class LiteMorphRule
java.lang.Object
com.sun.labs.minion.lexmorph.LiteMorphRule
public class LiteMorphRule
- extends java.lang.Object
A Rule typically matches a pattern at the right end of a word, removes
some characters from the right end to produce a stem, and generates a
list of alternative forms of the word by adding each of a specified list
of alternative endings to the stem.
Each rule specifies a pattern and a list of modifications to be made to
the input word to produce different variant forms of the word. The rule
is represented as a string of pattern elements, separated by spaces,
followed by a right arrow (" -> "), delimited by spaces, followed by
a sequence of modification patterns, separated by commas, as in:
".aeiouy t + a -> as,ae,um,ums,on,ons,ic,ical",
// (e.g., chordata, data, errata, sonata, toccata)
In this case, the pattern element ".aeiouy" will match a vowel anywhere
to the left of a "t" that is immediately to the left of a final "a",
and it will remove the "a" from the end to produce a stem and then add
each of the endings "as", "ae", "um", ... in turn to that stem to
produce different variant forms of the word. The plus sign ("+") in
this rule indicates that the rule is a suffix rule that is anchored
to the right end of the word, with the "+" indicating the position
where the suffix is to be removed from the word to form the stem.
The dot (".") operator before the "aeiouy" indicates that one of these
characters must be found somewhere in the word to the left of the "t".
Pattern elements are matched right-to-left starting with the pattern
element just to the left of the arrow (->) in the rule, with the matching
usually starting with the rightmost letter of the word being analyzed.
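As a rough illustration of the rule syntax described above, the following sketch splits the example rule at the arrow and applies a hand-coded check for this one pattern. The class and method names (SuffixRuleSketch, patternElements, modifications, expand) are hypothetical and are not part of LiteMorph, whose matcher interprets pattern elements generically.

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixRuleSketch {
    static final String RULE = ".aeiouy t + a -> as,ae,um,ums,on,ons,ic,ical";

    /** Pattern elements: everything left of the arrow, split on spaces. */
    static String[] patternElements(String rule) {
        return rule.substring(0, rule.indexOf("->")).trim().split("\\s+");
    }

    /** Modifications: everything right of the arrow, split on commas. */
    static String[] modifications(String rule) {
        return rule.substring(rule.indexOf("->") + 2).trim().split(",");
    }

    /** Hand-coded equivalent of the pattern ".aeiouy t + a" (an assumption for illustration). */
    static List<String> expand(String word) {
        List<String> variants = new ArrayList<>();
        int n = word.length();
        // Anchored part, checked right to left: a final "a", then a "t".
        if (n < 3 || word.charAt(n - 1) != 'a' || word.charAt(n - 2) != 't') {
            return variants;
        }
        // Unanchored part: a vowel anywhere to the left of the "t".
        boolean vowelSeen = false;
        for (int i = n - 3; i >= 0 && !vowelSeen; i--) {
            vowelSeen = "aeiouy".indexOf(word.charAt(i)) >= 0;
        }
        if (!vowelSeen) {
            return variants;
        }
        // The "+" sits before the "a", so only the "a" is removed to form the stem.
        String stem = word.substring(0, n - 1);
        for (String ending : modifications(RULE)) {
            variants.add(stem + ending);
        }
        return variants;
    }

    public static void main(String[] args) {
        System.out.println(expand("data"));
    }
}
```

For the input "data" the stem is "dat", so the sketch lists datas, datae, datum, datums, daton, datons, datic and datical.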
The different kinds of pattern elements that can occur are:
-
A letter group (e.g., aeiou) will match any of the letters in the group.
This can also be represented with vertical bars separating the letters
to express disjunction (e.g., a|e|i|o|u). A letter group can also be
specified by a parameter name, indicated by an initial dollar sign ("$")
which is defined as part of the initialization of a LiteMorph localization
class (e.g. "$Vowel"). Such parameters are defined by expressions of
the form:
defVar("$Vowel", "aeiouy");
-
A letter group prefixed with a negation sign (~) will match any letter
that is not in the group.
-
A letter group prefixed with a period (.) must be matched somewhere
preceding the match of its subsequent pattern element. A letter group
of this type is unanchored. (A pattern element that is to be matched at
a specified position is "anchored," while one that can be matched at
any position to the left of some specified position is "unanchored.")
-
A letter group prefixed with a question mark (?) may be matched once, or
skipped, immediately preceding the match of its subsequent pattern element.
A letter group of this type is anchored if the pattern element to its
right is anchored.
-
A letter group prefixed with an asterisk (*) may be matched zero or more
times immediately preceding the match of its subsequent pattern element.
A letter group of this type is anchored if the pattern element to its
right is anchored.
-
A letter group prefixed with a plus sign (+) must be matched one or more
times immediately preceding the match of its subsequent pattern element.
A letter group of this type is anchored if the pattern element to its
right is anchored.
-
An isolated plus sign (+) as a pattern element marks a point in the
pattern to the right of which the matching letters will be removed
to form the stem. There should be no unanchored letter groups after
the plus sign, and there should be at most one plus sign in the pattern
(otherwise only the leftmost will count). A plus sign also marks the
pattern as being anchored to the right end of a word, so that the
rightmost pattern element can only match the rightmost character of
the input word. In particular, a plus sign as the rightmost pattern
element marks the right end as the anchor point, although a # (see next)
is stylistically preferable in this case, and you may choose to put
an explicit # at the end of any suffix rule to emphasize the right anchor.
(A # sign is traditionally used by linguists to denote a word boundary.)
-
An isolated pound sign (#) as the first pattern element marks the
pattern as anchored at the left end of the word. A pound sign as the
rightmost pattern element before the arrow marks the pattern as anchored
to the right end of the word. A pound sign as a pattern element anywhere
else in the pattern will be ignored. There may be pound signs at both
ends of a pattern to mark the pattern as anchored at both ends.
-
An ampersand (&) as a pattern element will match a letter that is the
same as its preceding letter in the word.
-
A left angle bracket (<) as a pattern element marks the left context
of a substitution rule that can replace a portion of the middle of a
word with each of the modifications in the right-hand part of the rule.
This should occur somewhere to the left of a right angle bracket that
marks the end of the portion to be replaced (see next).
-
A right angle bracket (>) as a pattern element marks the right context
of a substitution rule that can replace a portion of the middle of a
word. The portion of the word that matches the pattern elements between
a pair of angle brackets will be replaced by each of the modifications
in the right-hand part of the rule. For example:
"< a e > +$Consonant e # -> <a>",
will replace an "ae" by a single "a" when it occurs before a sequence
of one or more consonants preceding a final e. The substitution operators
< and > cannot be used together with the + or - operators in the same rule.
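One way to get a feel for the pattern elements listed above is to translate a pattern into a java.util.regex expression. The sketch below is an approximation under stated assumptions, not how LiteMorph itself matches: it covers only plain letter groups, the ~ ? * + and . prefixes, the isolated + and a leading #, it assumes the pattern is anchored at the right end of the word, and it rejects $ parameters, & and the < > operators. The names PatternToRegexSketch, toRegex and stem are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternToRegexSketch {
    /** Translates a right-anchored pattern (subset of elements) into a regex. */
    static String toRegex(String pattern) {
        String[] elems = pattern.trim().split("\\s+");
        StringBuilder kept = new StringBuilder();     // elements left of the isolated +
        StringBuilder removed = new StringBuilder();  // elements right of the isolated +
        boolean leftAnchored = false;
        boolean seenPlus = false;
        for (int i = 0; i < elems.length; i++) {
            String e = elems[i];
            StringBuilder out = seenPlus ? removed : kept;
            if (e.contains("$") || e.equals("&") || e.equals("<") || e.equals(">")) {
                throw new UnsupportedOperationException("not covered by this sketch: " + e);
            } else if (e.equals("#")) {
                if (i == 0) leftAnchored = true;  // a # anywhere else adds nothing here
            } else if (e.equals("+")) {
                seenPlus = true;                  // stem boundary
            } else if (e.length() > 1 && ".~?*+".indexOf(e.charAt(0)) >= 0) {
                String group = e.substring(1).replace("|", "");
                char op = e.charAt(0);
                if (op == '.') out.append('[').append(group).append("].*");   // unanchored
                else if (op == '~') out.append("[^").append(group).append(']'); // negated
                else out.append('[').append(group).append(']').append(op);     // ? * +
            } else {
                out.append('[').append(e.replace("|", "")).append(']');        // plain group
            }
        }
        // Group 1 is the stem; group 2 is the part removed from the right end.
        return "(" + (leftAnchored ? "" : ".*") + kept + ")(" + removed + ")";
    }

    /** Returns the stem if the whole word matches, otherwise null. */
    static String stem(String pattern, String word) {
        Matcher m = Pattern.compile(toRegex(pattern)).matcher(word);
        return m.matches() ? m.group(1) : null;
    }
}
```

With this translation, stem(".aeiouy t + a", "sonata") yields "sonat", while stem("~aeiou + y", "day") yields null because the letter before the final "y" is a vowel.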
The modification entries in the right-hand side of a rule are typically
sequences of characters to be appended as suffixes to the stem that
was determined by the pattern of the rule. However, there are various
operators that can modify this behavior:
-
An ampersand (&) as the first character of a character sequence in an
alternative modification indicates a repeat of the letter that ends the
stem.
-
An under bar (_) as an alternative modification represents the empty
string, indicating that nothing is to be added to the stem for that
alternative.
-
A modification beginning with an asterisk (*) indicates that the rules
are to be reapplied recursively to the form obtained from doing this
modification.
-
A modification beginning with a name in parentheses, as in (name), indicates
that the rule set indicated by the name is to be applied to the form
obtained from doing this modification. If the parentheses are preceded
by the keyword TRY, then this rule will be considered to have failed if
nothing is found by the named rule set, and the next rule after this one
will be tried. Otherwise, if this rule pattern matched, but the rule
found nothing, no further rules would be tried. If the name in the
parentheses is prefixed with an exclamation point (!), then instead
of invoking a named rule set, the method computeMorph will be called
to process the modified stem, with the name following the exclamation
point as its argument. This provides a universal escape mechanism that
can be used to provide features not otherwise available in the rules.
Specifically, a localization subclass of LiteMorph can redefine the
method computeMorph to do whatever is necessary, and the argument
provided by the name following the ! is effectively a subroutine name
within the computeMorph method. For example, the computeMorph method
is used effectively in LiteMorph_de.java to deal with the separable and
inseparable prefixes in German. If there is no name inside the parens,
then the unnamed rule set (:unnamed) is used -- this is equivalent to
the asterisk (*) modification operator.
-
A modification beginning with a left angle bracket (<) indicates that a
substitution is to be made corresponding to a < > pair in the left-hand-side
pattern. The normal case is a simple pattern where the string between
the angle brackets is substituted for the portion of the input between
the < and > in the pattern. A more complicated case allows the addition
of material at the ends of the input word in addition to making the
substitution. The formats for such a modification are <subst>/rightadd
and <subst>/leftadd_rightadd, where the contents of the angle brackets are to
be substituted as before, and the material that follows the slash is to be
added to the ends, with _ (when present) indicating the position of the modified
input between the left and right add strings. (If there is no _ specified, then
everything following the / is added to the right end.) In the first case, the
substitution is made and the string rightadd is added to the end of the result.
In the second case, the substitution is made, leftadd is added to the beginning
of the result, and rightadd is added to the end of the result.
E.g. (in a German morphology):
"$AllConsonant < a e > u m + e ?n -> e,en,<ä>/_e,<ä>/_en"
produces:
LiteMorph: adding variation: Baeume
LiteMorph: adding variation: Baeumen
LiteMorph: adding variation: Bäume
LiteMorph: adding variation: Bäumen
when applied to Baeume
-
A modification beginning with a right angle bracket (>) indicates that a
pattern>subst substitution is to be applied in the right-hand-side of the rule,
after possibly adding something to the beginning or end of the input.
There are three formats for these substitution rules:
">*foo>fie/_" //substitute fie for foo after adding _
"><foo>fie/_" //in a mode determined by 2nd character:
">>foo>fie/_" //* = every, < = leftmost, > = rightmost
The second character (after the initial >) is a mode character indicating
whether the substitution is to happen repeatedly as many times as
possible (*), just once at the leftmost position (<), or just once at the
rightmost position (>). The material after this character, up to the >,
is the pattern to be searched for, the material from the > to the / is
the string to be substituted for it, and the material after the / indicates
any transformations to be applied to the input before substituting. (If
this is simply _, then there are no such transformations.) E.g.:
"$AllConsonant a e u m + e ?n -> e,en,>>ae>ä/_e,>>ae>ä/_en"
produces the same result as the example above for the input Baeume, but does
the pattern substitution action in the right-hand side, instead of using
< > in the left-hand side. This type of substitution allows for repeated
substitutions and simple pattern>subst substitutions in the right-hand
side, while the previous substitution format, using < > operators in the
left-hand side provides more capabilities for determining exactly where a
substitution is to happen.
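The simpler modification operators described above can be sketched as plain string operations. This is a minimal illustration and not the LiteMorph implementation: ModificationSketch, append and substitute are hypothetical names. It assumes that the material after the / is written as leftadd_rightadd with _ standing for the stem, and that the * mode is a single left-to-right pass over all occurrences.

```java
public class ModificationSketch {
    /** Plain endings, "_" (empty string) and a leading "&" (repeat the last stem letter). */
    static String append(String stem, String mod) {
        if (mod.equals("_")) {
            return stem;
        }
        if (mod.startsWith("&")) {
            return stem + stem.charAt(stem.length() - 1) + mod.substring(1);
        }
        return stem + mod;
    }

    /** Right-hand-side substitution: '>' + mode + pattern + '>' + subst + '/' + affixes. */
    static String substitute(String stem, String mod) {
        char mode = mod.charAt(1);              // *, < or >
        int gt = mod.indexOf('>', 2);           // separates pattern from subst
        int slash = mod.indexOf('/', gt);       // separates subst from affixes
        String pattern = mod.substring(2, gt);
        String subst = mod.substring(gt + 1, slash);
        String affixes = mod.substring(slash + 1);
        // Add material around the stem first; "_" marks where the stem goes.
        int us = affixes.indexOf('_');
        String word = us < 0
                ? stem + affixes
                : affixes.substring(0, us) + stem + affixes.substring(us + 1);
        if (mode == '*') {
            return word.replace(pattern, subst);   // every occurrence
        }
        int at;
        if (mode == '<') {
            at = word.indexOf(pattern);            // leftmost occurrence
        } else if (mode == '>') {
            at = word.lastIndexOf(pattern);        // rightmost occurrence
        } else {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        return at < 0 ? word : word.substring(0, at) + subst + word.substring(at + pattern.length());
    }
}
```

Applied to the stem "Baeum" from the German example, the modification >>ae>ä/_e first forms "Baeume" and then replaces the rightmost "ae", giving "Bäume". The doubling case is shown with an invented stem: append("run", "&ing") gives "running".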
Rules are grouped in blocks and labeled (often by a common final letter
sequence) and are ordered within each group so that after a matching
rule is found no further rules are to be tried (except when invoked
explicitly by a modification operator in an alternative modification in
the right-hand side of the rule).
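The ordering behavior can be pictured as a loop that stops at the first rule whose pattern matches. The sketch below uses two invented rules for words ending in "s"; RuleBlockSketch, firstMatch and S_RULES are hypothetical names, and a rule is modeled simply as a function that returns null when its pattern does not match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class RuleBlockSketch {
    /** Tries the rules in order; the first rule that matches ends the search. */
    static List<String> firstMatch(List<Function<String, List<String>>> rules, String word) {
        for (Function<String, List<String>> rule : rules) {
            List<String> variants = rule.apply(word);
            if (variants != null) {
                return variants;   // later rules in the block are never consulted
            }
        }
        return null;               // no rule in the block matched
    }

    // A hypothetical block of rules, most specific ending first.
    static final List<Function<String, List<String>>> S_RULES = Arrays.asList(
        w -> w.endsWith("ies") ? Arrays.asList(w.substring(0, w.length() - 3) + "y") : null,
        w -> w.endsWith("s") ? Arrays.asList(w.substring(0, w.length() - 1)) : null
    );
}
```

Here "ponies" is handled by the first rule and yields "pony"; the second rule, which would have produced "ponie", is not tried. This is why the more specific ending has to come first in the block.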
- See Also:
LiteMorph
Field Summary
static boolean authorFlag
  The following static final boolean variable authorFlag is a flag for
  use by localization authors when developing morphological rules.
static boolean debugFlag
static boolean traceFlag
  For tracing the testing of LiteMorph rules.
static boolean traceMatchFlag
  For tracing behavior of rule matching.
Constructor Summary
LiteMorphRule(java.lang.String expression, java.lang.String ruleName, LiteMorph morph)
  Create a Rule
Method Summary
java.lang.String[] getExpansions()
java.util.Vector match(java.lang.String word, int depth, int skipnum)
  Determines if a word matches the rule
java.lang.String toString()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
authorFlag
public static final boolean authorFlag
- The following static final boolean variable authorFlag is a flag for
use by localization authors when developing morphological rules.
This flag will be set to false for a delivered run-time version, but
can be set true when a morphological rule set is being developed.
This flag is used to enable format checking and tracing that is important
during rule development, but which is unnecessary in the run-time
rule system, after the rule developer has used this facility to ensure
that the rules are well-formed. It is a static final variable so
that the compiler will optimize the extra code away when the variable
is false so that the run-time class files will be smaller. When
authorFlag is false, all of the code associated with the tracing
mechanism will automatically be eliminated by the compiler.
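The idiom described here is Java's usual form of conditional compilation: a block guarded by a static final boolean that is a compile-time constant false is left out of the class file by javac. A minimal sketch, with AuthorFlagSketch, AUTHOR_FLAG and expand as hypothetical names and the checks inside the guard invented for illustration:

```java
public class AuthorFlagSketch {
    // Compile-time constant; when false, javac omits the guarded block below.
    static final boolean AUTHOR_FLAG = false;

    static String expand(String stem, String ending) {
        if (AUTHOR_FLAG) {
            // Development-only format checking and tracing.
            if (ending.contains(" ")) {
                throw new IllegalArgumentException("malformed ending: " + ending);
            }
            System.err.println("trace: " + stem + " + " + ending);
        }
        return stem + ending;
    }
}
```

Switching the constant to true and recompiling restores the checks. The flag cannot be toggled at run time, which is the price of having the code removed entirely.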
- See Also:
- Constant Field Values
debugFlag
public static boolean debugFlag
traceMatchFlag
public static boolean traceMatchFlag
- For tracing behavior of rule matching.
traceFlag
public static boolean traceFlag
- For tracing the testing of LiteMorph rules.
LiteMorphRule
public LiteMorphRule(java.lang.String expression,
java.lang.String ruleName,
LiteMorph morph)
- Create a Rule
- Parameters:
expression - A String representing the ending pattern described previously.
ruleName - the name of the rule
morph - the LiteMorph to make morphological variants
getExpansions
public java.lang.String[] getExpansions()
match
public java.util.Vector match(java.lang.String word,
int depth,
int skipnum)
- Determines if a word matches the rule
toString
public java.lang.String toString()
- Overrides:
toString
in class java.lang.Object