Pronunciation Editor - Masks and Regular Expressions

Forum for info exchange on beta tests of new versions/features, and in-depth discussions of issues related to nextup products with Power Users. You must register with the forum system in order to have access to this section.

Moderators: kdwhite, Jim Bretti, D.Leikin

Postby Jim Bretti » Fri May 07, 2004 3:00 pm

Beta version 2.047B includes support for Masks and Regular Expressions. Since it isn't documented anywhere yet, I'll try to explain here.

For those not familiar with the term, Regular Expressions are a pattern-matching language used for searching and parsing text. I won't go into regular expressions much here; if you search the web you can find plenty of references.

Regular Expressions are extremely powerful, but can be intimidating the first time you see them. So in addition to Regular Expression support, the Basic Pronunciation Editor also supports something called "Masks". A Mask is really a regular expression under the surface, but Masks are a little easier to construct, and do not require that you know anything about regular expressions.

In the Basic Pronunciation Editor, you can now use the Word field to enter either a Mask, or a Regular Expression. To illustrate how to use both, the following real problem will be used: you're using a voice engine that pronounces year numbers (like 1987) as "one thousand nine hundred and eighty seven". We would like the year pronounced as "nineteen eighty seven".

To use a Regular Expression, you would enter the following in the Word field:

{{re=\b19(\d\d)(\b)}}

The characters {{re= mark the beginning of the regular expression, and the trailing }} characters are required to end it.

Inside the expression, \b is a "metacharacter" that matches a word delimiter (a word boundary). Each \d matches a single numeric character. So the pattern we're searching for is a word delimiter, followed by the string "19", two numeric characters, and another word delimiter.

Notice there are also two sets of parentheses in the expression. The first set captures the last two digits of the year number; the second set captures the trailing word delimiter.

In the Pronunciation field, enter the following:
<s>nineteen $1$2

The leading <s> indicates a space. Since we're matching on something that starts with a word delimiter preceding a year number, we need a leading space in the substituted string. After the space comes the string "nineteen". Finally, $1 and $2 refer to the first and second subexpressions mentioned above, which gives us the last two digits of the year and whatever the trailing word delimiter matched, so any punctuation following the year number is preserved.
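
For readers who want to try the pattern outside TextAloud, here is a minimal Python sketch of the same substitution. It assumes Python's re module behaves like the engine TextAloud uses, which this thread does not confirm. The helper name speak_years is made up for illustration, Python writes the references as \1 and \2 rather than $1 and $2, and the <s> prefix is left out because \b in Python consumes no characters.

```python
import re

# Same pattern as the Word field entry, minus the {{re= ... }} wrapper.
YEAR_PATTERN = re.compile(r"\b19(\d\d)(\b)")


def speak_years(text: str) -> str:
    """Rewrite 19xx years as 'nineteen xx' (hypothetical helper)."""
    # \1 is the last two digits; \2 is whatever the second group matched.
    return YEAR_PATTERN.sub(r"nineteen \1\2", text)


print(speak_years("She was born in 1987, not 1978."))
```

Five-digit numbers such as 19876 are left alone, because the closing \b requires the year to end after four digits.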

The idea of the mask is that it does basically the same thing, but you can look at some simple help instead of figuring out how to write a regular expression. The following mask characters are predefined:
# - numeric character
$ - any alpha character
@ - any alphanumeric character
? - any character
_ - underscore is a word separator.
\ - escape character

To handle the year problem above, you would enter the following mask into the Word field:

{{mask=_19(##)(_)}}

This is very similar to the regular expression above, including how the parentheses are used. The Pronunciation field ends up the same:

<s>nineteen $1$2
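
Since a mask is described as a regular expression under the surface, the translation can be sketched roughly as follows. This is an illustration only, not TextAloud's actual implementation: the mapping table and the function name mask_to_regex are assumptions, and the sketch only knows the six mask characters listed above plus parentheses.

```python
import re

# Assumed mapping from mask characters to regex fragments (illustrative only).
MASK_TO_REGEX = {
    "#": r"\d",           # numeric character
    "$": r"[A-Za-z]",     # any alpha character
    "@": r"[A-Za-z0-9]",  # any alphanumeric character
    "?": r".",            # any character
    "_": r"\b",           # word separator
}


def mask_to_regex(mask: str) -> str:
    """Translate a mask into a regular expression (hypothetical helper)."""
    out = []
    i = 0
    while i < len(mask):
        ch = mask[i]
        if ch == "\\" and i + 1 < len(mask):
            # Escape character: take the next character literally.
            out.append(re.escape(mask[i + 1]))
            i += 2
            continue
        if ch in MASK_TO_REGEX:
            out.append(MASK_TO_REGEX[ch])
        elif ch in "()":
            out.append(ch)  # parentheses group subexpressions, as in a regex
        else:
            out.append(re.escape(ch))  # everything else is literal text
        i += 1
    return "".join(out)


print(mask_to_regex("_19(##)(_)"))
```

Under these assumptions the year mask translates to the same expression that was entered by hand in the regular expression example.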

For those of you who are interested, that should be enough to get started. I'd appreciate any feedback.

Thanks
Jim Bretti
NextUp.com
Listen and Learn Anywhere
http://www.NextUp.com
Jim Bretti
 
Posts: 1226
Joined: Wed Oct 29, 2003 11:07 am

Postby SFCurley » Fri May 07, 2004 11:32 pm

Hi Jim,

I tried both the mask and the reg exp using NeoSpeech Paul 16 and had no luck. I copied the reg exp into the Word field, exactly as you had it, and <s>nineteen $1$2 into the Pronunciation field. It still pronounced 1987 as a number, not as a nineteen-year. I also tried the mask, with the same result. Any reason it wouldn't work with Paul? I did all this in the basic editor AND I AM running 2.047B.

Thanks.
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby Jim Bretti » Sat May 08, 2004 12:13 am

Well, it looks like I made a dumb mistake there. :oops: We'll put up another build shortly that should fix it.

Postby Jim Bretti » Sat May 08, 2004 12:54 am

New build (2.048B) is available now. Can you grab it and give it another try?

Postby Harald » Sat May 08, 2004 11:24 am

Hi Jim,

This is absolutely fabulous!
I just deleted a few hundred entries from the pronunciation editor and replaced them with ONE RegExp.

The Oracle documentation I read is full of so-called V$ views. They all share the same pattern, V$something or GV$something. The $ is normally not spoken by TA, which makes the text hard to follow.
I had added 'v dollar something' in the pronunciation editor for each known V$ view.

Now I have deleted them all and added the following RegExp:
{{re=(GV|V)\$(\w+)}} == $1 dollar $2

It works fantastically.
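
A rule like this can be checked outside TextAloud with a short Python sketch. It assumes Python's re module is close enough to the engine TextAloud uses, and the sample sentence is invented. The sketch also contrasts an alternation group, (GV|V), with a character class, [GV|V], which matches exactly one character (G, V or |). Both read V$ and GV$ names the same way, because an unmatched G is simply left in the text, but only the group captures the full prefix.

```python
import re

text = "Check V$SESSION and GV$INSTANCE"

# Alternation group: matches the two-letter prefix GV or the single letter V.
group_rule = re.compile(r"(GV|V)\$(\w+)")
# Character class: matches exactly one character out of G, V or |.
class_rule = re.compile(r"([GV|V])\$(\w+)")

spoken = group_rule.sub(r"\1 dollar \2", text)
print(spoken)

# What each rule captures as the prefix of GV$INSTANCE.
print(group_rule.search("GV$INSTANCE").group(1))
print(class_rule.search("GV$INSTANCE").group(1))
```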

-Harald
Harald
 
Posts: 11
Joined: Thu Apr 08, 2004 11:58 am
Location: Ede, The Netherlands

masks and URLs

Postby DaveH » Sat May 08, 2004 6:44 pm

Hi

This is really useful and masks simplify things a lot. To avoid reading web links of type:

http://www.lnksrv.com/m.asp?i=1177549&u=123303583

I simply typed:

{{mask=http://www.(.*)(u=)(#*)(_ )}}
<s>" "

and it worked first time!

Many thanks for a great new feature.

Dave.
Dave UK
DaveH
 
Posts: 178
Joined: Tue Feb 17, 2004 11:54 am
Location: UK

masks documentation

Postby DaveH » Sun May 09, 2004 6:56 am

Hi

I have been looking at the 'help' documentation. I found it well structured and clear, however I was surprised that there is no discussion of regular expressions and masks. This is such a powerful and useful feature that it would be a great pity not to include it. The above discussion by Jim is very clear and could be used almost unchanged.

I imagine inclusion could lead to a spate of queries from users unfamiliar with the ideas, but this could be handled with a FAQ page, maybe giving an example solution for each type of query. To save the time of Ken and Jim, queries might be addressed to the forum where experienced users could help out.

These are only suggestions. I am delighted to have access to Jim's description above and think that it deserves to be highlighted in the formal documentation.

Dave

Postby SFCurley » Sun May 09, 2004 4:57 pm

Worked PERFECTLY for me. Thanks, Jim.

Postby Jim Bretti » Mon May 10, 2004 9:05 am

Thanks for the responses, all very good examples of where this should be useful.

We'll definitely document this in the online help; hopefully the next help file will include something, probably similar to the initial post with a few more examples.

Mask for Bible citations?

Postby SLSettles » Tue May 11, 2004 12:46 pm

I have been fiddling with this and have been unable to get any results. If I have a Bible citation like:

Psa. 50:3

I would like it read as:

Psalms chapter fifty, verse three.

I wrote a mask "{{mask=_Psa._(#):(#)}}" and a pronunciation, <s>Psalms chapter $1 verse $2 following the examples above but can't seem to get it to work. Can anyone offer a hand?

Thanks in advance!
SLSettles
 
Posts: 4
Joined: Wed Mar 03, 2004 2:09 pm

Postby Jim Bretti » Tue May 11, 2004 1:19 pm

One of the problems you're going to run into here is that the # mask character matches exactly one digit. So #:# will match 5:3 but not 50:3.

I need to think a little about handling something like this in a mask, but it is pretty simple in a regular expression.

Try this in the word field instead:

{{re=\bPsa\.\s(\d+):(\d+)}}

Use the same text you're already using for the pronunciation and I think it should work.
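
The difference between one digit and one or more digits can be seen in a small Python sketch. It assumes Python's re module as a stand-in for TextAloud's engine, and it writes \1 and \2 where the Pronunciation field uses $1 and $2.

```python
import re

citation = "Psa. 50:3"

# One digit per side, which is what the mask #:# amounts to.
single_digit = re.compile(r"\bPsa\.\s(\d):(\d)")
# One or more digits per side, as in the suggested expression.
multi_digit = re.compile(r"\bPsa\.\s(\d+):(\d+)")

# The single-digit pattern finds nothing in "Psa. 50:3".
print(single_digit.search(citation))
# The multi-digit pattern rewrites the citation.
print(multi_digit.sub(r"Psalms chapter \1 verse \2", citation))
```

The numbers stay as digits in the rewritten text; reading them aloud as "fifty" and "three" is left to the voice engine.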

Postby SFCurley » Tue May 11, 2004 2:26 pm

Jim, I know you don't want to become a reg ex coach, but I've evaluated this expression on two online reg ex calculators and it seems to match the target text there, but it doesn't match in the pronunciation editor. Am I missing something? Thanks in advance.

Exp: {{re=\bwww\.(.*)\.com}}
Target String: www.amazon.com

Postby Jim Bretti » Tue May 11, 2004 3:02 pm

I don't see a problem with that reg expression ... I tried it here and it worked on some sample text I used. Can you post a sample of text that fails?

Postby SFCurley » Tue May 11, 2004 3:15 pm

In the basic pronunciation editor, I have the reg exp in the Word field exactly as shown above. In the Pronunciation field, I have tried both "test" and "<s>test".

I then create a new TA article with the following text:

"this link is www.amazon.com."

I would expect it to be read as "this link is test," but it is actually read as "this link is www amazon com".

Postby Jim Bretti » Tue May 11, 2004 3:39 pm

Do you possibly have other entries in the basic editor that modify the URL? For example, if you had an entry that changed &www& to "w w w", that substitution is applied before the re, so the re would fail.

Maybe email me the file UserTranslations.chl from your TextAloud install directory. I can check for something like the above, or anything else that might be causing a problem.

Postby SFCurley » Tue May 11, 2004 5:01 pm

Yep, THAT was it! I had "com" being pronounced as "dot com" and that was interfering. All fixed now. Thanks so much.

Bible citations

Postby SLSettles » Tue May 11, 2004 5:48 pm

Wow! That works great Jim! Thanks so much. :D

Postby daniel » Sat Jun 12, 2004 8:14 pm

Can we predict/control the order that the text-regex substitutions are applied? Also, when are they applied? Are they applied to the whole article, one at a time, before passing the whole thing off to the engine, or is the article first parsed into sentences or paragraphs and each one processed sequentially? Thanks.
daniel
 
Posts: 11
Joined: Wed Jun 09, 2004 3:38 pm

Postby Jim Bretti » Mon Jun 14, 2004 10:33 am

The ordering goes something like this: all plain entries (those that are neither regular expressions nor masks) are applied first, in the order the words appear in the left panel (alphabetical order). Then the re's and masks are applied, in the same order.
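
The two-pass ordering can be sketched in Python as follows. This is a rough model under several assumptions, not TextAloud's code: the function names are invented, the alphabetical sort is taken to be case-insensitive, plain entries are matched only where no letter or digit touches them, masks are not handled, and pronunciations are inserted as literal text (no $1-style references).

```python
import re

RE_PREFIX, RE_SUFFIX = "{{re=", "}}"


def is_regex_entry(word: str) -> bool:
    return word.startswith(RE_PREFIX) and word.endswith(RE_SUFFIX)


def apply_entries(text: str, entries: dict) -> str:
    """Apply plain entries first, then regex entries, each group sorted."""
    ordered = sorted(entries, key=str.lower)  # assumed alphabetical order
    plain = [w for w in ordered if not is_regex_entry(w)]
    regex = [w for w in ordered if is_regex_entry(w)]

    for word in plain:
        # Assumed whole-word rule: no word character on either side.
        pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
        text = re.sub(pattern, lambda m, p=entries[word]: p, text)

    for word in regex:
        # Strip the {{re= ... }} wrapper to get the bare expression.
        pattern = word[len(RE_PREFIX):-len(RE_SUFFIX)]
        text = re.sub(pattern, lambda m, p=entries[word]: p, text)
    return text


sample = "this link is www.amazon.com."
url_rule = {r"{{re=\bwww\.(.*)\.com}}": "test"}

# With only the regular expression, the URL is replaced.
print(apply_entries(sample, url_rule))
# Add a plain entry for "com" and the regular expression no longer matches.
print(apply_entries(sample, {**url_rule, "com": "dot com"}))
```

The second call reproduces the interference described earlier in the thread: the plain "com" entry rewrites the text before the regular expression gets a chance to see ".com".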

Text is normally buffered to the TTS engines when it exceeds a certain size (the default is 5K). When buffering the text like this, the pronunciation editor changes are applied to each buffer before it is sent to the engine. The buffering tries to stay on sentence boundaries, so as long as you don't define re's or masks that cross sentences, you should not lose any pronunciation edits caused by text being split across buffers.

Does that answer the question?
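
To see why an expression that crosses a sentence boundary can be lost, here is a minimal Python sketch of sentence-aligned buffering. It is an assumption-laden illustration: the function name, the sentence-splitting rule, and the handling of oversized sentences are all invented, and only the 5K default comes from the description above.

```python
import re


def split_into_buffers(text: str, limit: int = 5 * 1024) -> list:
    """Group whole sentences into buffers of at most `limit` characters.

    A single sentence longer than `limit` is emitted as its own buffer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    buffers, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > limit:
            buffers.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        buffers.append(current)
    return buffers


article = "One end. Start two."
crossing = re.compile(r"end\.\sStart")  # spans two sentences

buffers = split_into_buffers(article, limit=10)
print(buffers)
# The pattern matches the whole article but none of the individual buffers.
print(bool(crossing.search(article)), [bool(crossing.search(b)) for b in buffers])
```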

Exhaustive list of supported regular expressions/masks

Postby tytrate » Mon Jun 14, 2004 3:39 pm

Where can I find an exhaustive list of TextAloud supported regular expressions/masks? I don't want to research features/syntax that is not supported.

Thanks

Richard
tytrate
 
Posts: 36
Joined: Sat Apr 10, 2004 5:13 pm

Postby daniel » Tue Jun 15, 2004 6:31 am

... so the order of expressions in the file is irrelevant!?

Is it my imagination, or does plain substitution ignore punctuation characters? I tried to replace "-" without using a regexp but felt like I was ignored. Then when I used {{re=-}} it seemed to work. Did I just imagine this?

Also, is it the case that plain substitution is applied to whole words only while regexps are just applied to the raw characters?

Thanks.

Postby Jim Bretti » Tue Jun 15, 2004 8:30 am

Richard ... We don't have our own regular expression reference; try taking a look at http://www.regular-expressions.info/. Another good reference is at http://www.sweeting.org/mark/html/revalid.php

The mask characters are documented in the very first post on this topic, toward the bottom of the post. They're also documented in the current version of help.

Let me know if you have more questions.

Postby Jim Bretti » Tue Jun 15, 2004 8:53 am

Daniel,

As for the expression order, it really is relevant, but you currently can't control it. If you set up two regular expressions that happen to match the same string, you can't choose which will be applied: the {{re strings in the Word field are sorted alphabetically, and the first one in the list is the one that will be used. Is that what you're asking?

On your question about plain substitutions, the default behavior is to perform substitutions only on word boundaries. You should be able to change the pronunciation of a character like "-" (hyphen symbol), but if there is text immediately before / after the hyphen you need to use a special substring character. An "&" symbol on either side of the word you're defining means you don't care about word boundaries on that side of the word.

So if you want to match the hyphen character in "First-Class", use the & symbol on both sides of the hyphen ... instead of defining "-" in the pronunciation editor, define the word "&-&".

None of this applies when regular expressions or masks are used. You can force regular expression matches to look for word boundaries using the \b metacharacter in the expression.
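
The effect of the & symbol on plain entries can be modelled with a short Python sketch. The helper names are hypothetical, and the definition of a word boundary (no letter, digit or underscore on that side) is an assumption chosen to agree with the examples in this thread rather than a documented TextAloud rule.

```python
import re


def plain_entry_to_regex(word: str) -> str:
    """Turn a plain Word field entry into a regex (hypothetical helper)."""
    open_left = len(word) > 1 and word.startswith("&")
    if open_left:
        word = word[1:]
    open_right = len(word) > 1 and word.endswith("&")
    if open_right:
        word = word[:-1]
    # Without "&", require that no word character touches that side.
    left = "" if open_left else r"(?<!\w)"
    right = "" if open_right else r"(?!\w)"
    return left + re.escape(word) + right


def substitute(text: str, word: str, pronunciation: str) -> str:
    return re.sub(plain_entry_to_regex(word), lambda m: pronunciation, text)


# A bare "-" only matches a free-standing hyphen.
print(substitute("First-Class seats - cheap", "-", "dash"))
# "&-&" ignores word boundaries on both sides.
print(substitute("First-Class", "&-&", " dash "))
```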

Postby daniel » Tue Jun 15, 2004 11:49 pm

So, just to be sure that I understand you,
{{re=\babc\b}}
will be applied before
{{re=\Wabc\W}}.

Thanks Jim.

BTW, I really appreciate the addition of regexs to TA ...

Postby Jim Bretti » Wed Jun 16, 2004 9:15 am

Daniel, that's right ... the expressions will be applied in that order.
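
Whether \b sorts before \W depends on how the alphabetical sort treats upper and lower case, which the thread does not spell out. The Python sketch below shows that the confirmed order is what a case-insensitive sort produces, while a plain character-code sort would reverse it. Treat the case-insensitive reading as an inference, not a documented behavior.

```python
entries = [r"{{re=\Wabc\W}}", r"{{re=\babc\b}}"]

# Case-insensitive alphabetical order: "b" comes before "w".
case_insensitive = sorted(entries, key=str.lower)
# Character-code order: uppercase "W" comes before lowercase "b".
code_point = sorted(entries)

print(case_insensitive)
print(code_point)
```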

Ordering

Postby bjsafdie » Thu Jul 15, 2004 12:49 pm

Another vote for an interface that allows us to order RegEx/Mask evaluations.

--BJ Safdie
bjsafdie
 
Posts: 43
Joined: Wed Apr 21, 2004 4:03 pm

Postby SFCurley » Thu Jul 15, 2004 12:57 pm

Yes, THAT would be nice.

Postby Jim Bretti » Thu Jul 15, 2004 2:15 pm

Yep, we need to do something like this. It's on the list.

Postby BrienMalone » Fri Apr 21, 2006 11:05 am

bump.
BrienMalone
 
Posts: 17
Joined: Sat Apr 08, 2006 1:03 am

Re: masks and URLs

Postby Melker63 » Fri Jun 02, 2006 2:23 pm

DaveH wrote:Hi

This is really useful and masks simplify things a lot. To avoid reading web links of type:

http://www.lnksrv.com/m.asp?i=1177549&u=123303583

I simply typed:

{{mask=http://www.(.*)(u=)(#*)(_ )}}
<s>" "

and it worked first time!


Well, it doesn't work for me. I put {{mask=http://www.(.*)(u=)(#*)(_ )}} in the Word field and <s>" " in the Pronunciation field, but to no avail. I have the latest TextAloud, v2.185, installed.
Melker63
 
Posts: 44
Joined: Sat Nov 06, 2004 5:34 pm
Location: Stockholm
