Start a new topic
Answered

Name matching for accented characters

I would like to do some name matching but my names contain accented characters.

For example, how can I match Helene against Hélène ?


Best Answer

Consider these 3 main points: 

  • data modeling
  • enricher
  • matcher

Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.

Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.

Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization. 

  • PostgreSQL example enricher: unaccent(upper(GivenName))
  • Oracle/SQL Server example enricher: translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON')
    Note: the list of accented characters to translate must be much larger.

Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.

Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example. 

1 Comment

Answer

Consider these 3 main points: 

  • data modeling
  • enricher
  • matcher

Modeling best practice: In most cases you should keep the original field unmodified and create separate field to hold the normalized value. For example: GivenName & GivenNameNormalized.

Enriching: The default solution to provide this enrichment is the "Text Normalization and Transliteration" plug-in which ships with xDM. The transformation "NORMALIZE" removes accents as well as performing lots of other normalizations. This is simplest and best in lots of cases.

Enriching variation: If data volumes are high and performance is a key consideration, then performing the enrichment in the database may be better. Enriching in the database will normally be faster, but it will not cover as many normalizations. PostgreSQL has the unaccent extension which may be installed and is quite close to xDM's text normalization. 

  • PostgreSQL example enricher: unaccent(upper(GivenName))
  • Oracle/SQL Server example enricher: translate(upper(GivenName), 'ÀÉÊÈÔÑ','AEEEON')
    Note: the list of accented characters to translate must be much larger.

Matching best practice: Use normalized fields in matchers. (Use GivenNameNormalized in this example, do not use the raw GivenName.) Matching using equality comparisons of fields will be much faster than matching which uses functions in the match rules. When it's possible to enrich first and then match on normalized fields, this is preferred.

Notes: I wrote "keep the original field unmodified", but really it's fine to do very light normalization in-place. It's common to transform the data to UPPERCASE, for example. 

Login to post a comment