Removing Superscripts/Reference citations

Forum for TextAloud version 3

Moderator: Jim Bretti

gtaus
Posts: 31
Joined: Fri Sep 19, 2008 12:23 pm

Removing Superscripts/Reference citations

Post by gtaus »

I want to import professional documents into TextAloud to convert to .mp3 files for listening to while driving. I don't want to have to listen to every reference number cited in the article as it disrupts my thought process while listening. Some of my target articles are heavily referenced and it would take too much time and effort to manually delete each citation. Does anybody know how to remove these annoying reference citations (superscripts) in these articles other than deleting them manually? I would like to automatically delete the reference citations directly in TextAloud, if possible.

example: "ADHD is characterized by developmentally inappropriate degrees of inattention, impulsivity and hyperactivity.3 Before diagnosis, ..."

Where "3" is the reference citation, which in the original document is also in superscript. Any help appreciated.
Jim Bretti
Posts: 1558
Joined: Wed Oct 29, 2003 11:07 am

Re: Removing Superscripts/Reference citations

Post by Jim Bretti »

You should be able to remove the superscripts using TextAloud's pronunciation editor. The document is handled as plain text in TextAloud, so we're not able to look for superscript fonts and filter them. But the pronunciation editor does support regular expressions, which are basically a text pattern-matching language. So you could create a pronunciation dictionary entry that uses a text pattern match to find the citation numbers and filter them.

From the TextAloud menu click Tools -> Text Processing -> Pronunciation Dictionary Maintenance. Create a new dictionary. Set the Text Matching dropdown to "Regular Expression", and for the expression, try this:
([a-z][.?!])(\d+)

Then set the Pronounce Using dropdown to "Respell", and for the respelling field use this:
$1

In the expression, [a-z] matches any alphabetic character, [.?!] matches any one of the three characters in the square brackets, and \d+ matches 1 or more digits. In the Respell field, $1 references the text matched by the first pair of parentheses. This 'respelling' of the matched text should exclude the citation numbers.
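For anyone who wants to try the pattern outside TextAloud, here is a minimal Python sketch of the same idea. It is only an approximation: Python writes the back-reference as \1 rather than $1, the function name strip_citations is made up for illustration, and the re.IGNORECASE flag is an assumption standing in for however TextAloud treats letter case.

```python
import re

# Same expression as in the dictionary entry above.
# Assumption: re.IGNORECASE stands in for TextAloud's own case handling.
CITATION = re.compile(r"([a-z][.?!])(\d+)", re.IGNORECASE)


def strip_citations(text: str) -> str:
    """Keep group 1 (letter + sentence punctuation) and drop the digits."""
    # r"\1" is Python's spelling of the $1 back-reference in the Respell field.
    return CITATION.sub(r"\1", text)


sample = "inattention, impulsivity and hyperactivity.3 Before diagnosis, ..."
print(strip_citations(sample))
# inattention, impulsivity and hyperactivity. Before diagnosis, ...

# A decimal such as 2.5 is left alone, because no letter precedes the period.
print(strip_citations("a dose of 2.5 mg"))
# a dose of 2.5 mg
```

Requiring a letter before the punctuation is what keeps ordinary decimals from being treated as citations.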

It's possible we'll need to tweak the regular expression, but this should get you started.
Jim Bretti
NextUp.com
gtaus
Posts: 31
Joined: Fri Sep 19, 2008 12:23 pm

Re: Removing Superscripts/Reference citations

Post by gtaus »

Thanks, that worked great for not speaking single citations, as given in my example: "ADHD is characterized by developmentally inappropriate degrees of inattention, impulsivity and hyperactivity.3 Before diagnosis, ..." The "3" is not spoken at all. Fantastic.

As to tweaking, I now noticed that there are a smaller number of multiple citations in my document, such as:

example: "Stimulants help control behavioral symptoms in 75% to 90% of those with ADHD.10,14 It is believed that these agents...."

Where "10,14" are reference citations after that sentence. Using your regular expression above, TextAloud correctly eliminates speaking "10", but does speak the following reference "14".

I tried to write another regular expression to eliminate the following references in addition to that first reference. Based on your previous regular expression, I wrote this: ([,])(\d+) to speak $1

In the Test box, it correctly does not speak the second reference "14", but now speaks "10". I saved the new expression, thinking that if both expressions worked together in the document, then any multiple references would not be spoken. When I went back to the main document, with both these regular expressions in the dictionary, it indeed correctly does not speak the multiple references. So I am very happy with the results.

My question: is there any way to combine the two expressions into one, or am I better off just leaving it the way it is with two regular expressions? Thanks.
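The two-entry approach described in this post can be tried out in Python as a rough sketch. It assumes the two dictionary entries are simply applied one after the other, which may not be exactly how TextAloud processes them. The name apply_both and the re.IGNORECASE flag are also assumptions made for illustration.

```python
import re

# First entry: letter + sentence punctuation, then the citation digits.
# Assumption: re.IGNORECASE stands in for TextAloud's own case handling.
FIRST = re.compile(r"([a-z][.?!])(\d+)", re.IGNORECASE)
# Second entry: a comma followed by digits.
SECOND = re.compile(r"([,])(\d+)")


def apply_both(text: str) -> str:
    """Apply the two entries in sequence (an assumed processing order)."""
    text = FIRST.sub(r"\1", text)
    return SECOND.sub(r"\1", text)


print(apply_both("those with ADHD.10,14 It is believed"))
# those with ADHD., It is believed

# The second entry has no sentence-end anchor, so it also clips ordinary numbers.
print(apply_both("about 1,000 patients"))
# about 1, patients
```

In this sketch both citation numbers disappear, but a stray comma is left behind and numbers written with a thousands separator get clipped, which is one reason a single combined expression may be preferable.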
Jim Bretti
Posts: 1558
Joined: Wed Oct 29, 2003 11:07 am

Re: Removing Superscripts/Reference citations

Post by Jim Bretti »

You should be able to get one expression to handle both cases. Try this:

Text Matching: Regular Expression
([a-z][.?!])(\d+)(,\d+)?

Pronounce Using: Respell
$1

I just added this: (,\d+)? The part in parentheses looks for a comma followed by one or more digits. The question mark outside the parentheses means '0 or 1 occurrences', so the expression should match either case.
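Here is a small Python sketch of the combined expression, for checking it against both kinds of citation. As before, this is an approximation rather than TextAloud itself: \1 plays the role of $1, strip_citations is a made-up name, and re.IGNORECASE is an assumption about case handling.

```python
import re

# Combined expression: digits, optionally followed by one ",digits" group.
# Assumption: re.IGNORECASE stands in for TextAloud's own case handling.
PATTERN = re.compile(r"([a-z][.?!])(\d+)(,\d+)?", re.IGNORECASE)


def strip_citations(text: str) -> str:
    """Replace the whole match with group 1 only, dropping the citations."""
    return PATTERN.sub(r"\1", text)


# Single citation: the optional group matches zero times.
print(strip_citations("impulsivity and hyperactivity.3 Before diagnosis"))
# impulsivity and hyperactivity. Before diagnosis

# Double citation: the optional group matches once.
print(strip_citations("those with ADHD.10,14 It is believed"))
# those with ADHD. It is believed
```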
Jim Bretti
NextUp.com
Jim Bretti
Posts: 1558
Joined: Wed Oct 29, 2003 11:07 am

Re: Removing Superscripts/Reference citations

Post by Jim Bretti »

One other thing I should have thought of: the '0 or 1' part I mentioned might not be enough. It's possible you may have citations that look like this:

Stimulants help control behavioral symptoms in 75% to 90% of those with ADHD.10,14,18,20 It is believed that these agents.

The expression above allows '0 or 1' occurrences of (,\d+). If you need to allow more than one, use this instead:

Text Matching: Regular Expression
([a-z][.?!])(\d+)(,\d+)*

Pronounce Using: Respell
$1

So instead of the ? symbol outside the parentheses to match 0 or 1, you can use the asterisk symbol to match '0 or more'.
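To see the difference between the two quantifiers side by side, here is a short Python sketch run on a longer citation list. It is illustrative only: the constant names are made up, \1 plays the role of $1, and re.IGNORECASE is an assumption about how case is handled.

```python
import re

# Assumption: re.IGNORECASE stands in for TextAloud's own case handling.
ZERO_OR_ONE = re.compile(r"([a-z][.?!])(\d+)(,\d+)?", re.IGNORECASE)
ZERO_OR_MORE = re.compile(r"([a-z][.?!])(\d+)(,\d+)*", re.IGNORECASE)

sample = "those with ADHD.10,14,18,20 It is believed"

# With ?, only the first ",14" is consumed; the rest of the list survives.
print(ZERO_OR_ONE.sub(r"\1", sample))
# those with ADHD.,18,20 It is believed

# With *, the ",digits" group repeats until the list ends.
print(ZERO_OR_MORE.sub(r"\1", sample))
# those with ADHD. It is believed
```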
Jim Bretti
NextUp.com
gtaus
Posts: 31
Joined: Fri Sep 19, 2008 12:23 pm

Re: Removing Superscripts/Reference citations

Post by gtaus »

Thank you so much for your help. The last expression you gave me works the best.

Text Matching: Regular Expression
([a-z][.?!])(\d+)(,\d+)*

Pronounce Using: Respell
$1

I ran that expression on my documents and it correctly omits speaking both single and multiple reference citations without affecting the other text. This will save me a lot of time. Thank you, again.
LisaMcDJ
Posts: 1
Joined: Thu Oct 10, 2019 12:42 pm

Re: Removing Superscripts/Reference citations

Post by LisaMcDJ »

My son, who can read code like other people read novels, helped me tweak the code to omit reading multiple references as above "3, 6, 7" as well as a range of references such as "3-7".

Text Matching: Regular Expression
([a-z][.?!])(\d+)([,-]\d+)*

Pronounce Using: Respell
$1

The addition is the square brackets [,-] inside the last set of parentheses. We are telling TextAloud to find the end of a sentence with ([a-z][.?!]). Then we ask it to find one or more digits (\d+). Finally, with ([,-]\d+)*, we tell it to find either a comma or hyphen, followed by digits, zero to many times.
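A quick Python sketch of this last expression, for anyone who wants to test it on their own text first. It is an approximation of the dictionary entry, not TextAloud's own engine: \1 stands in for $1, strip_citations is a made-up name, and re.IGNORECASE is an assumption about case handling.

```python
import re

# [,-] accepts either a comma or a hyphen before each further number.
# Assumption: re.IGNORECASE stands in for TextAloud's own case handling.
PATTERN = re.compile(r"([a-z][.?!])(\d+)([,-]\d+)*", re.IGNORECASE)


def strip_citations(text: str) -> str:
    """Keep the sentence ending (group 1) and drop the citation list."""
    return PATTERN.sub(r"\1", text)


print(strip_citations("impulsivity and hyperactivity.3-7 Before diagnosis"))
# impulsivity and hyperactivity. Before diagnosis

print(strip_citations("those with ADHD.10,14-18,20 It is believed"))
# those with ADHD. It is believed

# The pattern allows no spaces, so a list written "3, 6, 7" is only partly removed.
print(strip_citations("hyperactivity.3, 6, 7 Before diagnosis"))
# hyperactivity., 6, 7 Before diagnosis
```

The last case shows a limit of the pattern as written: it expects the numbers to be packed together with no spaces after the commas.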