Can you account for a word that may or may not be included in a string with an expression?

Michael_Latchuk
Confirmed Champ

Hello There,

 

Question: Can you account for a word that may or may not be included in a string with an expression? In this example, *and* may or may not be in the value.

 

I have been writing a few expressions in order to better match names within our company's documents.

 

I have been able to achieve our desired results so far but there seems to be one thing that has me stumped.

 

The issue I am having is that some employees write *and* between two names, and if *and* is included in the value, it throws off the desired results.

 

Example: Mary and Bill Jetson

 

Produces: Mary a Bill (first name, middle initial, and last name), which is not our intended result

 

Within my expression I have been trying the following, though I may have misinterpreted the documentation I have been reading:

 

\band\b*

\band\b?

\band*\b - I realize the issue with this one but was just hoping it may work

 

I have tried most things I can think of and couldn't find a definite answer. Is there actually a way to account for this?

 

Or is it possible that this is a time where the "|" should be used, essentially giving two expressions?

 

Any help or guidance would be appreciated.

 

Thank you,

 

Mike

1 ACCEPTED ANSWER

Hi Mike,

 

Thanks for the additional information.  In the current configuration that you have, you could modify the expression to detect the presence of "and" and remove it from the final extracted value, perhaps by inserting a clause like this into your existing expression in the appropriate location:

 

(?:\sand\s)?

 

This syntax looks for either zero or one instance of <space>and<space>, using the non-capturing group syntax. I'm assuming that you are extracting first name, middle initial, and last name with actual capture groups so you can reconstruct them into the final value. We want a grouping around the possible "and", but we don't want it to be a capture group because it won't become part of the final value. This is the difference between using ( ... ) and (?: ... ) to indicate a logical group in the expression.
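To make the capture versus non-capture distinction concrete, here is a minimal sketch in Python's `re` module. The name pattern is hypothetical and is not Mike's actual expression. Because this pattern already consumes the space after the first name, the optional clause is written as `(?:and\s)?` rather than the `(?:\sand\s)?` form quoted above. Group numbering in the engine used by Advanced Capture is assumed to behave the same way, but that is not verified here.

```python
import re

# Hypothetical name pattern for illustration only.
# group 1 = first name, group 2 = optional middle name/initial, group 3 = last name
NAME = re.compile(
    r"^([A-Za-z]+)\s"          # group 1: first name, then a separator
    r"(?:and\s)?"              # optional "and "; non-capturing, gets no group number
    r"(?:([A-Za-z]+)\.?\s)?"   # group 2: optional middle name or initial
    r"([A-Za-z-]+)$"           # group 3: last name (hyphens allowed)
)

# Same pattern, but "and" is wrapped in a capturing group, so it takes
# group number 2 and pushes the later groups down by one.
NAME_CAPTURING = re.compile(
    r"^([A-Za-z]+)\s(and\s)?(?:([A-Za-z]+)\.?\s)?([A-Za-z-]+)$"
)


def name_parts(text):
    """Return (first, middle, last) or None if the text does not match."""
    match = NAME.match(text)
    return match.groups() if match else None


print(name_parts("Mary and Bill Jetson"))   # ('Mary', 'Bill', 'Jetson')
print(name_parts("Michael Latchuk"))        # ('Michael', None, 'Latchuk')
print(NAME_CAPTURING.match("Mary and Bill Jetson").groups())
# ('Mary', 'and ', 'Bill', 'Jetson')
```

With the non-capturing form, "and" is consumed but never shows up in the numbered groups, so the first/middle/last positions stay where the rest of the configuration expects them.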

 

Having said all that, I'm not sure this will entirely give you what you want. In the case of "Michael and Jane Latchuk" you'd end up with "Michael Jane Latchuk" in one keyword, but I'm guessing what you really want is two separate instances of the keyword, with "Michael Latchuk" in one and "Jane Latchuk" in the other. Accomplishing this gets a little trickier than just using a regular expression. You'd probably be better served by a VB script hook on the zone that runs after the regex is processed but before the value is finalized. The script could look at the raw original value (OCRDoc.RawText) to see whether it contains " and ". If so, it could count the remaining name parts, figure out where the last name is (presumably the last value in the string), and use the VB script string functions to rebuild two separate final strings, "Michael Latchuk" and "Jane Latchuk". One would be set back as the result from the script (OCRDoc.FieldText), and the other could be added by the script as a new separate instance of the keyword value (the OCRDoc.CreateNewResultByXXX family).
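The string handling described above can be sketched outside the product. The following is a rough Python illustration of the splitting logic only. It is not VB script and does not touch the OCRDoc object; `split_joint_names` is a hypothetical helper name, and the assumption that the last token is the shared last name comes from the description above.

```python
def split_joint_names(raw_text):
    """Split 'First and Second Last' into one full name per person.

    Assumes the final token is the shared last name. Text without a
    standalone "and", or with too few parts, is returned unchanged.
    """
    tokens = raw_text.split()
    if "and" not in tokens:
        return [" ".join(tokens)]

    idx = tokens.index("and")
    first_person = tokens[:idx]       # name parts before "and"
    rest = tokens[idx + 1:]           # second person's parts plus last name
    if not first_person or len(rest) < 2:
        return [" ".join(tokens)]     # unexpected shape: leave as-is

    last_name = rest[-1]
    second_person = rest[:-1]
    return [
        " ".join(first_person + [last_name]),
        " ".join(second_person + [last_name]),
    ]


print(split_joint_names("Michael and Jane Latchuk"))
# ['Michael Latchuk', 'Jane Latchuk']
print(split_joint_names("Michael Latchuk"))
# ['Michael Latchuk']
```

In a real script hook, the first returned name would go back as the field result and the second would be added as a new keyword instance, using whatever calls the scripting API actually provides.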

 

Hope this helps...


4 REPLIES

Hi Mike,

 

There are a few different ways to approach this, but in order to offer the most appropriate suggestion, I have a few additional questions that I'd like to ask first:

 

(a) You're doing this using Advanced Capture, correct?  (Advanced Capture is tagged as the module on your post, so I am assuming this is true, but wanted to verify).

 

(b) Are you trying to use the built-in 'person name extraction' type of keyword extraction zone or a more general 'find regular expression' zone?  I ask because it has consequences for how the individual name parts would be stored - for example whether each (full) name is going into a single keyword, or you're trying to break the name parts up into separate keywords for first [middle] last, or something else?

 

(c) If (a) is true and one or the other of (b) applies - are the documents that you're processing structured such that you can define specific zone(s) where you know where the names will be [roughly] - or - you don't know where the names will be located so you are doing a more general unstructured search through the entire document/specific page(s) to find all possible names?   You mentioned these are company directories but I wasn't sure if we're talking like a free-flowing list of names (like a phone book) or maybe a one person (and possibly spouse) per page with a biography on the rest of the page, something else not like either of these, etc.?

Hello Steve,

 

Thank you for your reply,

 

(a) - That is correct, Advanced Capture for sure

 

(b) - Using a more generalized way of 'find regular expression' - I currently have this set up to account for and produce the correct values of First name[1], Middle Initial[2], and Last Name[3] - it does account for the following - Hyphenated last name, two separate last names, possibility of having no middle name at all or just the middle initial, commas or hyphens between characters - works for just about all cases so far

 

(c) - Our documents are technically structured, so the layout should be the same every time. However, I have noticed that the forms shift so much that we were pulling unwanted information at times. When I was initially learning I didn't use the person name extraction properly, which I admit could be user error, but I ended up learning expressions and now just write one for the information we need to pull

 

Example: We have a customer form and the capture area has a label of "Customer Name(s)"

 

Most times - "Michael Latchuk" is how the information is presented

 

Occurrence to account for: "Michael and Jane Latchuk" is what I am attempting to handle, if possible

 

It's just a form our employees submit so we can process in the back end for our members

 

I just can't figure out how to say *and* may be there

 

Hopefully this clarifies it 🙂

 

Thank you,

 

Mike

 

 

Hello Steve,

 

Beautiful!! 

 

This is wonderful to know, I am excited to see what we can accomplish now!

 

Thank you 😄 

 

Have a great week!

 

Much Appreciated,

 

Mike