8i | 9i | 10g | 11g | 12c | 13c | 18c | 19c | 21c | 23c | Misc | PL/SQL | SQL | RAC | WebLogic | Linux
FUZZY_MATCH and PHONIC_ENCODE Data Quality Operators in Oracle Database 23c
Oracle database 23c introduced the FUZZY_MATCH
and PHONIC_ENCODE
data quality operators to perform fuzzy string matching.
The UTL_MATCH
package was introduced in Oracle 10g Release 2, but first documented (and therefore supported) in Oracle 11g Release 2. It contains a variety of functions that are helpful for testing the level of similarity/difference between strings. The FUZZY_MATCH
and PHONIC_ENCODE
operators extend the fuzzy string matching functionality of the database.
Related articles.
Setup
The examples in this article require the following test table.
drop table if exists match_tab; create table match_tab ( id number, col1 varchar2(15), col2 varchar2(15), constraint match_tab_pk primary key (id) ); insert into match_tab values (1, 'Peter Parker', 'Pete Parker'), (2, 'Peter Parker', 'peter parker'), (3, 'Clark Kent', 'Claire Kent'), (4, 'Wonder Woman', 'Ponder Woman'), (5, 'Superman', 'Superman'), (6, 'The Hulk', 'Iron Man'); commit;
Notice the COL1
and COL2
values have varying degrees of similarity including exact matches, near matches and completely different values.
FUZZY_MATCH
The FUZZY_MATCH
operator is language-neutral. It determines the similarity between two strings and supports several algorithms listed here.
set linesize 100 column col1 format a12 column col2 format a12 select col1, col2, fuzzy_match(levenshtein, col1, col2) as levenshtein, fuzzy_match(jaro_winkler, col1, col2) as jaro_winkler, fuzzy_match(bigram, col1, col2) as bigram, fuzzy_match(trigram, col1, col2) as trigram, fuzzy_match(whole_word_match, col1, col2) as wwm, fuzzy_match(longest_common_substring, col1, col2) as lcs from match_tab; COL1 COL2 LEVENSHTEIN JARO_WINKLER BIGRAM TRIGRAM WWM LCS ------------ ------------ ----------- ------------ ---------- ---------- ---------- ---------- Peter Parker Pete Parker 92 92 90 70 50 58 Peter Parker peter parker 84 88 72 60 0 41 Clark Kent Claire Kent 82 90 60 44 50 45 Wonder Woman Ponder Woman 92 94 100 90 50 91 Superman Superman 100 100 100 100 100 100 The Hulk Iron Man 0 41 0 0 0 12 6 rows selected. SQL>
By default the output is a percentage similarity, but the UNSCALED
keyword can be added to return the raw value.
select col1, col2, fuzzy_match(levenshtein, col1, col2, unscaled) as levenshtein, fuzzy_match(jaro_winkler, col1, col2, unscaled) as jaro_winkler, fuzzy_match(bigram, col1, col2, unscaled) as bigram, fuzzy_match(trigram, col1, col2, unscaled) as trigram, fuzzy_match(whole_word_match, col1, col2, unscaled) as wwm, fuzzy_match(longest_common_substring, col1, col2, unscaled) as lcs from match_tab; COL1 COL2 LEVENSHTEIN JARO_WINKLER BIGRAM TRIGRAM WWM LCS ------------ ------------ ----------- ------------ ---------- ---------- ---------- ---------- Peter Parker Pete Parker 1 .92 10 7 1 7 Peter Parker peter parker 2 .88 8 6 0 5 Clark Kent Claire Kent 2 .9 6 4 1 5 Wonder Woman Ponder Woman 1 .94 11 9 1 11 Superman Superman 0 1 7 6 1 8 The Hulk Iron Man 8 .41 0 0 0 1 6 rows selected. SQL>
If you have used the UTL_MATCH
package, the scaled verses unscaled values represent the following calls.
Algorithm | Scaled | Unscaled |
---|---|---|
LEVENSHTEIN | UTL_MATCH.EDIT_DISTANCE_SIMILARITY | UTL_MATCH.EDIT_DISTANCE |
JARO_WINKLER | UTL_MATCH.JARO_WINKLER_SIMILARITY | UTL_MATCH.JARO_WINKLER |
By default the output is scaled by the length of the longer input string. The RELATE_TO_SHORTER
keyword alters this to scale to the shorter input string.
select col1, col2, fuzzy_match(levenshtein, col1, col2, relate_to_shorter) as levenshtein, fuzzy_match(jaro_winkler, col1, col2, relate_to_shorter) as jaro_winkler, fuzzy_match(bigram, col1, col2, relate_to_shorter) as bigram, fuzzy_match(trigram, col1, col2, relate_to_shorter) as trigram, fuzzy_match(whole_word_match, col1, col2, relate_to_shorter) as wwm, fuzzy_match(longest_common_substring, col1, col2, relate_to_shorter) as lcs from match_tab; COL1 COL2 LEVENSHTEIN JARO_WINKLER BIGRAM TRIGRAM WWM LCS ------------ ------------ ----------- ------------ ---------- ---------- ---------- ---------- Peter Parker Pete Parker 91 92 100 77 50 63 Peter Parker peter parker 84 88 72 60 0 41 Clark Kent Claire Kent 80 90 66 50 50 50 Wonder Woman Ponder Woman 92 94 100 90 50 91 Superman Superman 100 100 100 100 100 100 The Hulk Iron Man 0 41 0 0 0 12 6 rows selected. SQL>
The EDIT_TOLERANCE
keyword can be used with the WHOLE_WORD_MATCH
algorithm. The tolerance is a percentage of characters in a word that can be different, while still considering it to be the same word.
select col1, col2, fuzzy_match(whole_word_match, col1, col2) as wwm, fuzzy_match(whole_word_match, col1, col2, edit_tolerance 20) as wwm20, fuzzy_match(whole_word_match, col1, col2, edit_tolerance 82) as wwm82 from match_tab; COL1 COL2 WWM WWM20 WWM82 ------------ ------------ ---------- ---------- ---------- Peter Parker Pete Parker 50 100 50 Peter Parker peter parker 0 100 50 Clark Kent Claire Kent 50 100 50 Wonder Woman Ponder Woman 50 100 100 Superman Superman 100 100 100 The Hulk Iron Man 0 0 0 6 rows selected. SQL>
PHONIC_ENCODE
The PHONIC_ENCODE
operator converts text into language-specific codes based on the pronunciation of the text. It implements the Double Metaphone algorithm and an alternative algorithm.
set linesize 100 column col1 format a12 column col2 format a12 column col1_dm format a8 column col2_dm format a8 column col1_dma format a8 column col2_dma format a8 select col1, col2, phonic_encode(double_metaphone, col1) as col1_dm, phonic_encode(double_metaphone, col2) as col2_dm, phonic_encode(double_metaphone_alt, col1) as col1_dma, phonic_encode(double_metaphone_alt, col2) as col2_dma from match_tab; COL1 COL2 COL1_DM COL2_DM COL1_DMA COL2_DMA ------------ ------------ -------- -------- -------- -------- Peter Parker Pete Parker PTRP PTPR PTRP PTPR Peter Parker peter parker PTRP PTRP PTRP PTRP Clark Kent Claire Kent KLRK KLRK KLRK KLRK Wonder Woman Ponder Woman ANTR PNTR FNTR PNTR Superman Superman SPRM SPRM SPRM SPRM The Hulk Iron Man 0LK ARNM TLK ARNM 6 rows selected. SQL>
When using DOUBLE_METAPHONE_ALT
, if there is no alternative code the primary code is returned.
The maximum length of the code is controlled by an optional third parameter, which accepts integer values from 1 to 12.
column col1_dm1 format a9 column col2_dm1 format a9 column col1_dm6 format a9 column col2_dm6 format a9 column col1_dm12 format a9 column col2_dm12 format a9 select col1, col2, phonic_encode(double_metaphone, col1, 1) as col1_dm1, phonic_encode(double_metaphone, col2, 1) as col2_dm1, phonic_encode(double_metaphone, col1, 6) as col1_dm6, phonic_encode(double_metaphone, col2, 6) as col2_dm6, phonic_encode(double_metaphone, col1, 12) as col1_dm12, phonic_encode(double_metaphone, col2, 12) as col2_dm12 from match_tab; COL1 COL2 COL1_DM1 COL2_DM1 COL1_DM6 COL2_DM6 COL1_DM12 COL2_DM12 ------------ ------------ --------- --------- --------- --------- --------- --------- Peter Parker Pete Parker P P PTRPRK PTPRKR PTRPRKR PTPRKR Peter Parker peter parker P P PTRPRK PTRPRK PTRPRKR PTRPRKR Clark Kent Claire Kent K K KLRKKN KLRKNT KLRKKNT KLRKNT Wonder Woman Ponder Woman A P ANTRMN PNTRMN ANTRMN PNTRMN Superman Superman S S SPRMN SPRMN SPRMN SPRMN The Hulk Iron Man 0 A 0LK ARNMN 0LK ARNMN 6 rows selected. SQL>
PL/SQL Support
In this release there is no direct support for the FUZZY_MATCH
or PHONIC_ENCODE
operators in PL/SQL, so a direct assignment is not possible.
declare l_output number; begin l_output := fuzzy_match(levenshtein, 'Peter Parker', 'peter parker'); end; / * ERROR at line 4: ORA-06550: line 4, column 15: PLS-00201: identifier 'FUZZY_MATCH' must be declared ORA-06550: line 4, column 3: PL/SQL: Statement ignored SQL> declare l_output varchar2(10); begin l_output := phonic_encode(double_metaphone, 'Peter Parker'); end; / * ERROR at line 4: ORA-06550: line 4, column 15: PLS-00201: identifier 'PHONIC_ENCODE' must be declared ORA-06550: line 4, column 3: PL/SQL: Statement ignored SQL>
We can use a SELECT ... INTO
to make the assignment.
declare l_output number; begin select fuzzy_match(levenshtein, 'Peter Parker', 'peter parker') into l_output; end; / PL/SQL procedure successfully completed. SQL> declare l_output varchar2(10); begin select phonic_encode(double_metaphone, 'Peter Parker') into l_output; end; / PL/SQL procedure successfully completed. SQL>
For more information see:
- Data Quality Operators
- Fuzzy String Matching
- UTL_MATCH : String Matching by Testing Levels of Similarity/Difference
Hope this helps. Regards Tim...