Helpful Regular Expressions

Moderators: kdwhite, Jim Bretti, D.Leikin

Postby SFCurley » Sun Jul 25, 2004 12:07 pm

The short list below contains a few regular expressions I've created that help with pronunciation. The part before the "/" goes in the word field; the part after the "/" goes in the pronunciation field. The part in brackets is just a description. The first group is for all voices; the second group is specific to some NeoSpeech voice anomalies.

words like iPod, eBay
{{re=\b([ei])([A-Z]\w+)\b}} / $1 $2

dot com web addresses:
{{re=\b(\w*)\.?(\w+)\.(com|org|gov|us|cc|bus|net)}} / $1 dot $2 dot $3

Most year entries from 1600 through 1999:
{{re=\b(16|17|18|19)([1-9]\d)\b}} / $1 $2

Year entries like "19-oh-1" or "17-oh-9":
{{re=\b(16|17|18|19)0([1-9])\b}} / $1 o $2

Year entries ending in "00" that should be read "yy-hundred":
{{re=\b(16|17|18|19)00\b}} / $1 hundred

e.g.:
{{re=\be\.g\.}} / for example

i.e.:
{{re=\bi\.e\.}} / that is
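
To see what the general-purpose entries above do to a line of text, here is a minimal sketch using Python's re module. This is only an approximation: TextAloud's own regex engine may differ in details, the $1-style back-references have been rewritten as \1 for Python, and the names GENERAL_RULES and apply_rules are made up for the illustration.

```python
import re

# The all-voices entries from the list above, rewritten for Python's re module.
# TextAloud writes back-references as $1; Python's replacement syntax uses \1.
GENERAL_RULES = [
    (r"\b([ei])([A-Z]\w+)\b", r"\1 \2"),          # iPod, eBay
    (r"\b(16|17|18|19)([1-9]\d)\b", r"\1 \2"),    # 1999 -> 19 99
    (r"\b(16|17|18|19)0([1-9])\b", r"\1 o \2"),   # 1901 -> 19 o 1
    (r"\b(16|17|18|19)00\b", r"\1 hundred"),      # 1800 -> 18 hundred
    (r"\be\.g\.", "for example"),
    (r"\bi\.e\.", "that is"),
]

def apply_rules(text, rules=GENERAL_RULES):
    """Apply each (pattern, replacement) pair in order, case-sensitively."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = "The iPod, e.g., sold on eBay in 1999, 1901 and 1800."
print(apply_rules(sample))
# The i Pod, for example, sold on e Bay in 19 99, 19 o 1 and 18 hundred.
```

Matching here is case-sensitive, which matters for the first entry.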


The following entries are specific to a few NeoSpeech voice quirks:

Fixes an occasional problem where the second word in a hyphenated word combination doesn't get read:
{{re=\b(\w+)-(\w+)}} / $1 $2

Fixes a problem with some two-word combinations (_*_al adv_*_ like "financial advisor" or "funeral administrator" get pronounced with the first word pluralized):
{{re=\b(\w+[aeiouyAEIOUY]l+)\b\s(ad[mv]\w+)\b}} / $1,$2

These next three fix the problem where Mr., Mrs. and Ms. are all spelled out:
{{re=\bmr\.\s(\w*)}} / mister $1
{{re=\bmrs\.\s}} / misses
{{re=\bms\.\s}} / miz
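
A rough way to try the three title entries outside TextAloud is the Python sketch below. It assumes case-insensitive matching (the patterns are written in lower case, so "Mr." only matches if case is ignored), and it adds a trailing space to the "misses" and "miz" replacements because Python's re.sub replaces the matched whitespace too. How TextAloud itself spaces the result is not covered here, and the name expand_titles is invented for the example.

```python
import re

# The three title entries, with $1 rewritten as \1 for Python.
# The trailing spaces in "misses " and "miz " are an assumption made for
# Python, where the matched \s would otherwise be dropped.
TITLE_RULES = [
    (r"\bmr\.\s(\w*)", r"mister \1"),
    (r"\bmrs\.\s", "misses "),
    (r"\bms\.\s", "miz "),
]

def expand_titles(text):
    """Apply the title rules in order, ignoring case."""
    for pattern, replacement in TITLE_RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(expand_titles("Mr. Smith met Mrs. Jones and Ms. Lee."))
# mister Smith met misses Jones and miz Lee.
```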
Last edited by SFCurley on Sun Jul 25, 2004 4:02 pm, edited 1 time in total.
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby Bunger Henry » Sun Jul 25, 2004 3:10 pm

I used your "Mr." workaround to correct the problem of "Dr." being pronounced as "D. R."
Bunger Henry
 
Posts: 149
Joined: Thu Apr 15, 2004 8:17 pm

Postby kdwhite » Mon Jul 26, 2004 7:10 pm

You are getting really good at this.
Ken White
NextUp.com
The Power of Spoken Audio
http://www.NextUp.com

** TextAloud - The world's most popular Text To Speech tool.
http://www.nextup.com/TextAloud/
kdwhite
Site Admin
 
Posts: 2627
Joined: Mon Sep 29, 2003 11:34 am

Postby SFCurley » Mon Jul 26, 2004 9:59 pm

Thanks. All learned in the last few weeks . . . mostly from the sweeting.org website and its reg ex validator. Pretty amazing tool, that reg ex thing.
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby SFCurley » Thu Aug 05, 2004 12:18 pm

Fixes a problem with some other two-word combinations (_*_le adv_*_ like "responsible advisor" or "reasonable administrator" get pronounced with the first word pluralized). This is similar to the entry above where the first word ends in al, el, etc. This one covers first words ending in *le.

{{re=\b(\w+l+e)\b\s+(ad[mv]\w+)\b}} / $1 , $2
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby Jim Bretti » Sun Mar 27, 2005 11:58 am

Here is one people ask about frequently ... how to keep the items in a bulleted list from running together. Most times I've seen the bullet character, it's been a hex 95 character, so you may need to tweak the expression if a different character is used for the bullet.

This expression finds strings starting with a bullet character, followed by alphanumerics and spaces, then one or more carriage returns (no punctuation between the bullet and the carriage return(s)). The pronunciation sticks a period at the end of the string.

Word: {{re=\x95([\w\s]+)((\r\n)+)}}
Pronunciation: $1.$2
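
For anyone who wants to check the bullet expression outside TextAloud, here is a small Python sketch. It assumes the text really contains the raw hex 95 character and Windows-style \r\n line endings, and it uses Python's re module rather than TextAloud's engine; the name end_bullets_with_period is made up for the example.

```python
import re

# Bullet (hex 95), then word characters and whitespace, then one or more CR/LF pairs.
BULLET_PATTERN = r"\x95([\w\s]+)((\r\n)+)"

def end_bullets_with_period(text):
    """Drop the bullet and put a period before the line break(s)."""
    return re.sub(BULLET_PATTERN, r"\1.\2", text)

bulleted = "\x95First item\r\n\x95Second item\r\n"
print(repr(end_bullets_with_period(bulleted)))
# 'First item.\r\nSecond item.\r\n'
```

Note that the replacement does not include the bullet itself, so only the item text and the added period remain.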
Jim Bretti
NextUp.com
Listen and Learn Anywhere
http://www.NextUp.com
Jim Bretti
 
Posts: 1223
Joined: Wed Oct 29, 2003 11:07 am

Postby SFCurley » Tue Jul 26, 2005 9:51 am

This one fixes a very minor and specific pronunciation bug in the NeoSpeech voices, where the phrase "put it" is incorrectly pronounced "putS it" when it appears after a word ending in "y" and when either a colon or comma follows:

word: {{re=\b(\w+y)\sput\sit[:,]\s+(\w+)\b}}
pronunciation: $1 puuht it {{pause=0.25}} $2


Example: As Orwell once brilliantly put it: "Insincerity is the enemy of clear language."

Go figure!
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby RmData » Thu Aug 11, 2005 11:49 pm

The regex in the original post,
words like iPod, eBay
{{re=\b([ei])([A-Z]\w+)\b}} / $1 $2

causes the word
each
to be spelled out.
RmData
 
Posts: 6
Joined: Wed Aug 10, 2005 12:06 am
Location: The Pinery, Colorado

Postby SFCurley » Fri Aug 12, 2005 9:21 am

I should have mentioned that you have to check the Case Sensitive box . . . that will fix the "each" problem. Also, make sure that the expression is entered with capital "A-Z" as shown above.
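
The effect of the Case Sensitive box can be approximated in Python with the re.IGNORECASE flag. This is a sketch only: the flag stands in for the unchecked box, Python's re module is not TextAloud's engine, and the function name respell is invented for the example.

```python
import re

IPOD_PATTERN = r"\b([ei])([A-Z]\w+)\b"

def respell(text, case_sensitive):
    """Split words like iPod/eBay; optionally ignore case like an unchecked box."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.sub(IPOD_PATTERN, r"\1 \2", text, flags=flags)

# Ignoring case, [A-Z] also matches lower-case letters, so "each" is split.
print(respell("each eBay", case_sensitive=False))  # e ach e Bay

# With case-sensitive matching, only the intended word is split.
print(respell("each eBay", case_sensitive=True))   # each e Bay
```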
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby RmData » Fri Aug 12, 2005 11:09 am

What about sentences that start with the word

Each
?

That's the one that hit me. At first I didn't realize it was one of the regular expressions causing the problem, and tried to enter pronunciations specifically for the word to no avail. Then when I renamed the pronunciation file so it wasn't found, the problem disappeared so I realized I'd just wasted my time trying to tell it how to pronounce the word and the real problem was one of the regexes!

Thanks,
Mark.
RmData
 
Posts: 6
Joined: Wed Aug 10, 2005 12:06 am
Location: The Pinery, Colorado

Postby Jim Bretti » Fri Aug 12, 2005 12:11 pm

So the problem is with matching either eBay or EBay ?

The simplest way to do it would be to use the case sensitive checkbox as Sean indicated, and add "E" and "I" (uppercase) to the expression:

{{re=\b([eiEI])([A-Z]\w+)\b}} / $1 $2

There is also a modifier you can embed in an expression to toggle the "ignore case" switch within the expression. (?i) sets ignore case, and (?-i) turns it off. Here's how you could use the modifier in this case:

{{re=\b(?i)([ei])(?-i)([A-Z]\w+)\b}} / $1 $2

For more on modifiers, see http://regular-expressions.org/refadv.html
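
If you want to experiment with the same idea in Python, note that Python's re module handles inline flags differently: instead of switching (?i) on and (?-i) off partway through, it offers a scoped group, (?i:...), that ignores case only inside the group. The sketch below is a Python equivalent of the idea, not the TextAloud syntax, and the name respell_scoped is made up.

```python
import re

# Only the first letter ignores case; the second letter must be a real capital.
SCOPED_PATTERN = r"\b(?i:([ei]))([A-Z]\w+)\b"

def respell_scoped(text):
    """Split eBay/EBay-style words while leaving each/Each alone."""
    return re.sub(SCOPED_PATTERN, r"\1 \2", text)

print(respell_scoped("EBay eBay Each each"))
# E Bay e Bay Each each
```

"Each" and "each" are left alone because the second letter is tested case-sensitively.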
Jim Bretti
NextUp.com
Listen and Learn Anywhere
http://www.NextUp.com
Jim Bretti
 
Posts: 1223
Joined: Wed Oct 29, 2003 11:07 am

Postby RmData » Fri Aug 12, 2005 2:43 pm

Hi, Jim.
My goal in posting was merely to raise the awareness that if a regular expression replacement isn't specific enough it can cause more problems than the help it gives.

Thanks,
Mark.
RmData
 
Posts: 6
Joined: Wed Aug 10, 2005 12:06 am
Location: The Pinery, Colorado

Using the | to remove pause

Postby Arlo » Tue Aug 23, 2005 2:59 pm

The pause in between the numbers for this expression bugged me.
{{re=\b(16|17|18|19)([1-9]\d)\b}} / $1 $2
Example: 1642 / 16 42

I tried changing the pause using {{Pause=0.1}}, but this just adds to the initial pause instead of replacing it.

I finally found that the pipe character "|" will slur the numbers the way I wanted.

{{re=\b(16|17|18|19)([1-9]\d)\b}} / $1|$2

I'm not sure the pipe is the correct character to use but it seems to be a neutral character and has the effect I want.

Thanks for the tips,
Arlo
Arlo
 
Posts: 7
Joined: Tue Aug 16, 2005 3:25 pm

Postby BrienMalone » Thu Apr 20, 2006 1:22 am

bump.

This is an excellent RE post.
BrienMalone
 
Posts: 17
Joined: Sat Apr 08, 2006 1:03 am

Re: Helpful Regular Expressions

Postby Melker63 » Fri Jun 02, 2006 2:59 pm

SFCurley wrote: The part before the "/" goes in the word field; the part after the "/" goes in the pronunciation field. The part in brackets is just a description.

dot com web addresses:
{{re=\b(\w*)\.?(\w+)\.(com|org|gov|us|cc|bus|net)}} / $1 dot $2 dot $3


Again, I just can't get the above to work despite following the initial explanation to the letter. What am I doing wrong? I tried checking and unchecking case-sensitive, but it doesn't work either way.
Melker63
 
Posts: 44
Joined: Sat Nov 06, 2004 5:34 pm
Location: Stockholm

Postby SFCurley » Fri Jun 02, 2006 3:29 pm

Three thoughts:

1. Can you copy exactly what you have in each field and post it here?

2. Are any reg-exes working for you?

3. If the answer to question 2 is yes, but THIS particular regex is not working for you, it could be a conflict. If any other non-regex/non-mask pronunciation entry matches something in the web address, then this regex won't ever be evaluated, because an entry with higher precedence has already matched. Also, if any other regex with higher precedence matches, that would pre-empt this one, too.

I think the precedence of matching goes:

Pronunciation editor entries, then
Masks, then
Reg-exes (in order of regex string size, from largest to smallest).
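
As a sketch of the ordering guessed at above (this is not confirmed TextAloud behavior, just the stated guess written out), the Python snippet below sorts a mixed list of entries by kind and then sorts regexes by string length, longest first. The Rule class, the kind names, and the mask text are all hypothetical.

```python
from dataclasses import dataclass

# Assumed precedence: plain entries first, then masks, then regexes.
KIND_RANK = {"entry": 0, "mask": 1, "regex": 2}

@dataclass
class Rule:
    kind: str  # "entry", "mask" or "regex" (hypothetical labels)
    word: str  # contents of the word field

def precedence_key(rule):
    """Sort by kind, and within regexes by string length, longest first."""
    length_rank = -len(rule.word) if rule.kind == "regex" else 0
    return (KIND_RANK[rule.kind], length_rank)

rules = [
    Rule("regex", r"{{re=\bmr\.\s(\w*)}}"),
    Rule("entry", "&:"),
    Rule("regex", r"{{re=\b(\w*)\.?(\w+)\.(com|org|gov|us|cc|bus|net)}}"),
    Rule("mask", "example mask"),  # placeholder text, not real mask syntax
]

ordered = sorted(rules, key=precedence_key)
for rule in ordered:
    print(rule.kind, rule.word)
```

Under this guess, a short plain entry would still be tried before any regex, which is the kind of conflict described in point 3.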
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby Melker63 » Sat Jun 03, 2006 2:26 am

Reply to SFCurley:

The only other item I have in the pronunciation-window is:
&: / {{Pause=0.8}}<s>

Your added item is:
{{re=\b(\w*)\.?(\w+)\.(com|org|gov|us|cc|bus|net)}} / $1 dot $2 dot $3
Melker63
 
Posts: 44
Joined: Sat Nov 06, 2004 5:34 pm
Location: Stockholm

Postby Jim Bretti » Sat Jun 03, 2006 2:33 pm

Melker63 -

It might help if you post the following:
1. What version of TextAloud you're running
2. A small sample of text containing a URL that isn't being handled by the expression
3. How you're hearing TextAloud pronounce the URL.

Hopefully that will be enough to figure out what the problem is.
Jim Bretti
NextUp.com
Listen and Learn Anywhere
http://www.NextUp.com
Jim Bretti
 
Posts: 1223
Joined: Wed Oct 29, 2003 11:07 am

Postby Melker63 » Sun Jun 11, 2006 7:17 am

Jim Bretti wrote:1. What version of TextAloud you're running
2. A small sample of text containing a URL that isn't being handled by the expression
3. How you're hearing TextAloud pronounce the URL.


1: I have the latest v2.185 version installed.
2: SFCurley's suggestion works nicely with the following address:
http://news.bbc.co.uk/2/hi/science/nature/3686106.stm

But not with these two below:
http://www.911podcasts.com/default.php? ... pi=0&typ=0
http://www.dn.se/DNet/jsp/polopoly.jsp? ... nderType=6

3: TA spells out, letter for letter, anything that isn't a word in the two links above.
Melker63
 
Posts: 44
Joined: Sat Nov 06, 2004 5:34 pm
Location: Stockholm

Postby SFCurley » Sun Jun 11, 2006 9:58 am

This one should do it:

word: {{re=www\.([A-Za-z0-9-]+)\.(com|net|org|gov|biz|us|cc|se)([?&A-Za-z0-9/+=,._-]*)(#)*}}

Pronunciation: w w w dot $1 dot $2 <s>

What's the .se domain by the way?
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby D.Leikin » Sun Jun 11, 2006 10:59 am

Sean, .se is the Internet country code top-level domain for Sweden.
Last edited by D.Leikin on Wed Jun 14, 2006 3:42 pm, edited 1 time in total.
D.Leikin
 
Posts: 682
Joined: Sat Jan 14, 2006 2:15 pm

Postby SFCurley » Sun Jun 11, 2006 11:16 am

Duh! Should've probably known that. (I would've guessed that if it was .sw). Thanks.
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

Postby D.Leikin » Sun Jun 11, 2006 12:10 pm

DNS now supports short names. For example, to go to NextUp one might simply type “nextup” instead of “www.nextup.com” in the address string. I just thought maybe there is no need to speak the www aloud.
Last edited by D.Leikin on Wed Jun 14, 2006 3:50 pm, edited 1 time in total.
D.Leikin
 
Posts: 682
Joined: Sat Jan 14, 2006 2:15 pm

Postby SFCurley » Tue Jun 13, 2006 4:16 pm

Here's an improvement over the regex discussed above:

word: {{re=(\w+)\.([A-Za-z0-9-]+)\.(com|net|org|gov|biz|us|cc)([?&A-Za-z0-9/+=,._-]*)(#)*}}

pronunciation: $1 dot $2 dot $3 <s>

This one accounts for cases where "www" is not the first part of the web address, for example:
http://money.cnn.com/magazines/fortune/ ... /index.htm
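
To compare the original dot-com entry with this improved one, here is a Python sketch run against a made-up address (money.example.com is a placeholder, not one of the links from this thread). It uses Python's re module as a stand-in for TextAloud's engine and leaves the TextAloud-specific <s> tag out of the replacement; the names OLD_PATTERN, NEW_PATTERN and speak_url are invented.

```python
import re

# Original entry from the first post: matches only the host name.
OLD_PATTERN = r"\b(\w*)\.?(\w+)\.(com|org|gov|us|cc|bus|net)"

# Improved entry: also swallows the path and query string after the domain.
NEW_PATTERN = (
    r"(\w+)\.([A-Za-z0-9-]+)\.(com|net|org|gov|biz|us|cc)"
    r"([?&A-Za-z0-9/+=,._-]*)(#)*"
)

def speak_url(pattern, text):
    """Replace the matched part of a web address with its spoken form."""
    return re.sub(pattern, r"\1 dot \2 dot \3", text)

url = "http://money.example.com/a/b.htm?x=1"

# The old entry leaves the path behind, which would then be read separately.
print(speak_url(OLD_PATTERN, url))  # http://money dot example dot com/a/b.htm?x=1

# The new entry consumes the path, so only the host name is spoken.
print(speak_url(NEW_PATTERN, url))  # http://money dot example dot com
```

The leftover path in the first result is the kind of text that was reported as being spelled out letter for letter.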
SFCurley
 
Posts: 361
Joined: Wed Dec 10, 2003 1:12 pm

