I would like to do some name matching but my names contain accented characters.
For example, how can I match Helene against Hélène ?
Best Answer
S
Subham Dixit
said
over 2 years ago
Consider these 3 main points:
data modeling
enricher
matcher
Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.
Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.
Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization.
PostgreSQL example enricher: unaccent(upper(GivenName))
Oracle/SQL Server example enricher: translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON') Note: the list of accented characters to translate must be much larger.
Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.
Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example.
1 Comment
S
Subham Dixit
said
over 2 years ago
Answer
Consider these 3 main points:
data modeling
enricher
matcher
Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.
Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.
Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization.
PostgreSQL example enricher: unaccent(upper(GivenName))
Oracle/SQL Server example enricher: translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON') Note: the list of accented characters to translate must be much larger.
Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.
Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example.
Subham Dixit
I would like to do some name matching but my names contain accented characters.
For example, how can I match Helene against Hélène ?
Consider these 3 main points:
Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.
Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.
Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization.
unaccent(upper(GivenName))
translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON')
Note: the list of accented characters to translate must be much larger.
Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.
Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example.
Subham Dixit
Consider these 3 main points:
Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.
Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.
Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization.
unaccent(upper(GivenName))
translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON')
Note: the list of accented characters to translate must be much larger.
Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.
Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example.
-
Extend a model with new entities or attributes
-
Data types in xDM
-
Effective date on entities
-
Search using wild cards
-
Export a model from production and import on a development environment
-
"Allow Delete" vs "Allow Removal" privileges
-
LOV label in Named Query
-
Select location on a map and save coordinates
-
Is there a way to set up a master-detail relationship on browse mode?
-
Choose Either a Stepper or A Workflow Based on The User Privileges
See all 276 topics