Start a new topic

Best SemQL functions to use for requiring a certain special character is present in a string

I'm looking for the best way to include a clause in my match condition to require a special character such as space and hypen to delimit to parts of a compound name as true to be a match. Thanks!


Hi Hilaire, yes this would ideally be considered a match with this rule.


Postgres 


Thank you

Hi David, just to clarify, would you like "Li-Xiao-Ping" to match "Li Ping" (different separator, common tokens) with this partial match rule ?

Are you using Oracle, Postgres or SQL Server ?

Thanks for the response Hilaire. I would like to elaborate that we use enrichers in the certification process, and this specific match rule would be an exception to the standard (using SourceFirstName, SourceLastname instead of FirstName and LastName).


The requirement I am trying to meet is related to compound names hat we are looking to make partial matches on. For example, "Li-Xiao-Ping" would hit one of our partial match rules on "Xiao" or "Ping". Similarly, "Anna Mae" would match on either part of that name.


With our current partial name match rules, "LiXiaoPing" is how the name would be in the certified FirstName attribute, which wouldn't be a match. This is why I posted this to try and include a delimiter into the match rule.

Hi David, even though we try to provide semql functions as generic as possible, the databases that we currently support to host a data location have very different capabilities regarding regular expressions support, so it would really to know which database you're working on.


Besides that, I tend to think that the matching is not the best place to validate / verify a string format. In most cases, it will be way more efficient to standardize the strings earlier in the certification process. For example, using an enricher to make sure all strings are uppercase and that hyphens are replaced by spaces (the result can be stored as a technical matching attribute if you don't want to alter the source "raw" value).

Let's say you're loading 10 000 new or updated records in a hub that host a stock of 10 millions master records : in that case you only have to populate the technical StndName using an enricher, the cost will be limited to 10 000 function calls or so, whereas if you check the format in the matching rule, you could end up issuing up to 10 000 times 10 millions calls (if you don't have a binning expression and if you use only distance calculation functions).

Also, removing the function calls from the matching clause will make it easier for the database optimizer to leverage custom matching indexes (function calls prevent index usage, except for function-based indexes that have a huge maintenance cost, but that's another story)


Fell free to elaborate on the requirement if you think I missed something.

Hope this helps

Login to post a comment