Data Cleansing and Standardization Using Regular Expression
Data Quality is one of the major priorities of any data warehouse or any data integration project. We use different tools for data quality and data standardization implementation. But tools may not be the right solution for small projects which involve couple of data feeds. Regular Expression is an alternative approach for such small projects. In this article lets discuss about data quality implementation using Regular Expression or RegEx in Informatica PowerCenter.
What is Regular Expression or RegEx
A regular expression provides a concise and flexible means to recognize strings of text, such as particular characters, words, or patterns of characters. Regular expressions are used when you want to search for specify lines of text containing a particular pattern.
Just like simple string search operators (%, _ ) used in SQL, Regular Expressions have a full set of matching operators. The following table provides regular expression syntax guidelines.
Syntax
|
Description
|
.
|
A period matches any one character.
|
[a-z]
|
Matches one instance of a character in lower case. For example, [a-z] matches ab. Use [A-Z] to match characters in upper case.
|
\d
|
Matches one instance of any digit from 0-9.
|
\s
|
Matches a whitespace character.
|
\w
|
Matches one alphanumeric character, including underscore (_)
|
()
|
Groups an expression. For example, the parentheses in (\d-\d-\d\d) groups the expression \d\d-\d\d, which finds any two numbers followed by a hyphen and any two numbers, as in 12-34.
|
{}
|
Matches the number of characters. For example, \d{3} matches any three numbers, such as 650 or 510. Or, [a-z]{2} matches any two letters, such as CA or NY.
|
?
|
Matches the preceding character or group of characters zero or one time. For example, \d{3}(-{d{4})? matches any three numbers, which can be followed by a hyphen and any four numbers.
|
*
|
Matches zero or more instances of the values that follow the asterisk. For example, *0 is any value that precedes a 0.
|
+
|
Matches one or more instances of the values that follow the plus sign. For example, \w+ is any value that follows an alphanumeric character.
|
Regular Expression Example
Following regular expression finds 5-digit U.S.A. zip codes, such as 93930, and 9-digit zip codes, such as 93930-5407
\d{5}(-\d{4})?
\d{5} refers to any five numbers, such as 93930. The parentheses surrounding -\d{4} group this segment of the expression. The hyphen represents the hyphen of a 9-digit zip code, as in 93930-5407. \d{4} refers to any four numbers, such as 5407. The question mark states that the hyphen and last four digits are optional or can appear one time.
Regular Expression Implementation in Informatica PowerCenter
Informatica PowerCenter provides couple of functions to implement regular expression. These function can be used just like any other function in an expression. Lets see the functions in detail.
REG_EXTRACT : Extracts sub patterns of a regular expression within an input value. For example, from a regular expression pattern for a full name, you can extract the first name or last name.
Syntax : REG_EXTRACT ( subject, pattern, subPatternNum )
Example : REG_EXTRACT( Employee_Name, '(\w+)\s+(\w+)', 2 ), Extracts the last name from the Employee Name Column.
Example : REG_EXTRACT( Employee_Name, '(\w+)\s+(\w+)', 2 ), Extracts the last name from the Employee Name Column.
REG_MATCH : Returns whether a value matches a regular expression pattern. This lets you validate data patterns, such as IDs, telephone numbers, postal codes, and state names.
Syntax : REG_MATCH ( subject, pattern )
Example : REG_MATCH (Phone_Number, '(\d\d\d-\d\d\d-\d\d\d\d)' ), This expression to validate the 10 digit telephone numbers
Example : REG_MATCH (Phone_Number, '(\d\d\d-\d\d\d-\d\d\d\d)' ), This expression to validate the 10 digit telephone numbers
REG_REPLACE : Replaces characters in a string with a another character pattern. By default, REG_REPLACE searches the input string for the character pattern you specify and replaces all occurrences with the replacement pattern. You can also indicate the number of occurrences of the pattern you want to replace in the string.
Syntax : REG_REPLACE ( subject, pattern, replace, numReplacements )
Example : REG_REPLACE( Employee_Name, ‘\s+’, ‘ ’), Removes additional spaces from the Employee name
Real Time Scenario
Consider a scenario where you get a flat file with a date column, which comes in three different formats MM-DD-YYYY, YYYY-MM-DD and DD/MM/YYYY. We need to load this date column to the target in DD/MM/YYYY format.
Expression given below will check for the date format from the DATE column and convert it to DD/MM/YYYY format.
IIF(REG_MATCH(DATE,'(\d\d-\d\d-\d\d\d\d)'),TO_DATE(DATE,'mm-dd-yyyy'),
IIF(REG_MATCH(DATE,'(\d\d\d\d-\d\d-\d\d)'),TO_DATE(DATE,'yyyy-mm-dd'))))
Apart from the small given above, there is a lot of standard RegEx available for validations such as email, Phone Number, Zip Code etc.
eMail :- ^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$
Phone Number :- ^[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}$
Zip Code :- ^\d{5}(-\d{4})?$
Like you see in the table above, RegEx has a rich list of operators, which helps create any complex data validation rules.
Phone Number :- ^[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}$
Zip Code :- ^\d{5}(-\d{4})?$
Like you see in the table above, RegEx has a rich list of operators, which helps create any complex data validation rules.
We can make these validations as a reusable expression and can be used as the data validation standard across different projects.
Hope you enjoyed this tutorial, Please let us know if you have any questions.