by Jim Bretti » Fri May 07, 2004 3:00 pm
Beta version 2.047B includes support for Masks and Regular Expressions. Since it isn't documented anywhere yet, I'll try to explain here.
For those not familiar with the term, Regular Expressions are a pattern matching language used for searching and parsing text. I won't get into regular expressions much here; if you search the web you can find plenty of references.
Regular Expressions are extremely powerful, but can be intimidating the first time you see them. So in addition to Regular Expression support, the Basic Pronunciation Editor also supports something called "Masks". A Mask is really a regular expression under the surface, but Masks are a little easier to construct, and do not require that you know anything about regular expressions.
In the Basic Pronunciation Editor, you can now use the Word field to enter either a Mask, or a Regular Expression. To illustrate how to use both, the following real problem will be used: you're using a voice engine that pronounces year numbers (like 1987) as "one thousand nine hundred and eighty seven". We would like the year pronounced as "nineteen eighty seven".
To use a Regular Expression, you would enter the following in the Word field:
{{re=\b19(\d\d)(\b)}}
The characters {{re= indicate the beginning of the regular expression. The trailing }} characters are also required to end the expression.
Inside the expression, \b is a "metacharacter" that indicates a word delimiter. Each \d matches a single numeric character. So the pattern we're searching for is a word delimiter, followed by the string "19", two numeric characters and another word delimiter.
Notice there are also two sets of parentheses in the expression. The first set contains the last two digits of the year number, and the second set contains the trailing word delimiter.
In the Pronunciation field, enter the following:
<s>nineteen $1$2
The leading <s> indicates a space. Since we're matching on something that starts with the word delimiter preceding a year number, we need a leading space in the substituted string. After the space comes the string "nineteen". Finally, the $1 and $2 strings point at the first and second subexpressions mentioned above. This gets us the last two digits of the year, and any trailing punctuation following the year number.
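For anyone who wants to try the pattern outside the editor first, here is a rough sketch of the same substitution in Python. It is only an approximation, not how the Pronunciation Editor does it internally: the function name speak_years is made up for illustration, Python writes the subexpressions as \1 and \2 instead of $1 and $2, and in Python's re module \b is a zero-width boundary, so the second group is always empty and the surrounding space and punctuation are left in place. For that reason the sketch leaves out the leading <s> space.

```python
import re

# Same pattern as in the Word field, without the {{re= ... }} wrapper.
# In Python, \b matches a position (zero width), not a character.
YEAR_PATTERN = re.compile(r"\b19(\d\d)(\b)")

def speak_years(text: str) -> str:
    """Replace 19xx year numbers with 'nineteen xx' (hypothetical helper)."""
    # \1 and \2 play the role of $1 and $2 in the Pronunciation field.
    # No leading space is added here, because Python's \b does not
    # consume the space in front of the year.
    return YEAR_PATTERN.sub(r"nineteen \1\2", text)

print(speak_years("Born in 1987, moved in 1992."))
# prints: Born in nineteen 87, moved in nineteen 92.
```

Numbers that merely contain "19" plus two digits, such as 19870, are left alone because there is no word boundary after the fourth digit.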
The idea of the mask is that it basically does the same thing, but you can look at some simple help instead of figuring out how to write a regular expression. The following mask characters are predefined:
# - numeric character
$ - any alpha character
@ - any alphanumeric character
? - any character
_ - underscore is a word separator.
\ - escape character
To handle the year problem above, you would enter the following mask into the Word field:
{{mask=_19(##)(_)}}
This is very similar to the regular expression above, including how the parentheses are used. The Pronunciation field ends up the same:
<s>nineteen $1$2
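To make the "a Mask is really a regular expression under the surface" idea concrete, here is a small Python sketch of how a mask could be translated into a regular expression. This is a guess at the general approach, not the editor's actual code: the function mask_to_regex and the exact regex chosen for each mask character (for example [A-Za-z] for $) are assumptions made for illustration.

```python
import re

# Assumed translation of each mask character into a regex fragment.
MASK_TO_REGEX = {
    "#": r"\d",           # numeric character
    "$": "[A-Za-z]",      # any alpha character
    "@": "[A-Za-z0-9]",   # any alphanumeric character
    "?": ".",             # any character
    "_": r"\b",           # word separator
}

def mask_to_regex(mask: str) -> str:
    """Translate a mask body such as '_19(##)(_)' into a regex (hypothetical)."""
    out = []
    i = 0
    while i < len(mask):
        ch = mask[i]
        if ch == "\\" and i + 1 < len(mask):
            # Escape character: the next character is taken literally.
            out.append(re.escape(mask[i + 1]))
            i += 2
            continue
        if ch in MASK_TO_REGEX:
            out.append(MASK_TO_REGEX[ch])
        elif ch in "()":
            # Parentheses pass through unchanged so $1, $2 still work.
            out.append(ch)
        else:
            # Anything else is matched literally.
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

print(mask_to_regex("_19(##)(_)"))
# prints: \b19(\d\d)(\b)
```

Under these assumptions the year mask comes out as exactly the regular expression shown earlier, which is why the Pronunciation field can stay the same for both.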
For those of you who are interested, that should be enough to get started. I'd appreciate any feedback.
Thanks