Issue
Integration jobs in Semarchy xDM may experience performance degradation during matching phases (including but only on steps like "Find and Match Duplicate" and "Compute Transitive Closure for System Dups").
Symptoms may include:
- Batches taking an abnormal long time to process.
- Integration jobs appearing stuck indefinitely.
- Excessive processing time during duplicate detection.
- Long-running transitive closure calculations.
- High database activity with little actual progress.
Cause
Slow matching performance is usually caused by one or several of the following factors.
1. Inefficient Match Rule Design
The most common cause is poorly optimized match rules.
Examples may include:
- Using functions directly inside binning conditions.
- Using functions inside matching conditions (TRIM, UPPER...).
- Excessive fuzzy matching.
- Match rules generating very large candidate clusters.
Functions inside matching logic prevent efficient index usage and force expensive computations.
2. Giant Match Clusters
A poorly designed rule or incorrect dataset may create extremely large matching groups.
This heavily impacts duplicate detection, transitive closure processing and match graph computations.
3. Missing or Inefficient Indexes
Matching relies heavily on binning attributes, matching attributes and internal matching columns.
Without optimized indexes, full scans occur frequently and matching joins become extremely expensive.
4. Insufficient Database Statistics
The database optimizer depends on accurate statistics.
Default histogram sizes are often too small for large MI and DU tables, leading to poor execution plans.
5. Inefficient Transitive Closure Processing
By default, transitive closure uses a cursor-based implementation.
For large datasets, this approach may become inefficient.
6. Database Storage/Page Configuration
Frequent updates on MI tables can suffer from page fragmentation and excessive write overhead when fill factor settings are not optimized.
Solution
Resolving slow matching performance may require optimization at multiple levels:
- Match rule design.
- Calculated attributes.
- Database indexing.
- Database statistics.
- Application server configuration.
/!\ These steps are advanced and may require assistance from Database Administrators and/or System Administrators.
Step 1 - Optimize Match Rules
Avoid functions in matching and binning conditions
As they force runtime computation and prevent optimization.
Instead, the recommended approach is to create calculated attributes that already contain transformed values (example: NAME_TRIM, NAME_UPPER, PHONE_NORMALIZED).
Then use these attributes directly in binning and matching conditions.
> This allows indexes to be used efficiently.
Step 2 - Validate Overmatching
Large clusters may indicate a problematic rule.
Run validation queries against MI tables.
SELECT count(*), B_MATCHGRP From MI_xxx Group by B_MATCHGRP Order by count(*) desc;
Replace MI_xxx with your actual entity.
If extremely large clusters are found:
- Review recent match rule changes.
- Review incoming datasets.
- Tighten matching criteria.
Step 3 - Update Existing Records
If new calculated attributes were added or rules have been reviewed, existing MI records must be updated, otherwise only new incoming records benefit from optimization.
In order to do so, we recommend replaying matching as per described in our documentation.
Step 4 - Add Database Indexes
Create indexes on attributes heavily used by match rules.
Recommended targets:
- Binning columns
- Frequently matched attributes
- Internal matching columns
Proper indexing can dramatically reduce matching duration.
Step 5 - Enable Optimized Transitive Closure Mode
Instead of the default cursor-based implementation, enable the more efficient set-based mode.
To do so, add the following Java option:
CATALINA_OPTS="$CATALINA_OPTS -Dmdm.integjob.defaultTransitiveClosureMode=UPDATE_WHILE"
This needs to be done wherever you are configuring your Catalina options.
After enabling, restart the application server and redeploy the model.
This regenerates job definitions using the new mode.
/!\ Always validate this change in a DEV environment first.
Step 6 - Improve Database Statistics and Fill Factor
For large MI and DU tables, increase statistics precision.
Here is an example that needs to be adapted to your context:
alter table mi_xxx alter column b_batchid set statistics 10000; alter table mi_xxx alter column b_matchgrp set statistics 10000; alter table mi_xxx alter column b_sourceid set statistics 10000; alter table mi_xxx alter column b_xgrp set statistics 10000; analyze mi_xxx; alter table mi_xxx set (fillfactor = 80); vacuum analyze mi_xxx;
Also optimize indexes:
alter index <index_name> set (fillfactor = 80); reindex index <index_name>;
And improve DU statistics:
alter table du_xxx alter column b_sourceid set statistics 10000; alter table du_xxx alter column b_sourceid2 set statistics 10000; analyze du_xxx;
This may help for:
- Better execution plans.
- Improved cardinality estimation.
- Faster HOT updates.
- Reduced page fragmentation.
- More efficient joins.
/!\ These database operations should be performed by a DBA.
Step 7 - Verify Job Configuration
Explore adjusting the parameter "PARAM_ANALYZE_STATS" in your integration job to appropriately reflect when using custom statistics strategies.