Issue

Integration jobs in Semarchy xDM may experience performance degradation during matching phases (including but only on steps like "Find and Match Duplicate" and "Compute Transitive Closure for System Dups").


Symptoms may include:

  • Batches taking an abnormal long time to process.
  • Integration jobs appearing stuck indefinitely.
  • Excessive processing time during duplicate detection.
  • Long-running transitive closure calculations.
  • High database activity with little actual progress.


Cause

Slow matching performance is usually caused by one or several of the following factors.


1. Inefficient Match Rule Design

The most common cause is poorly optimized match rules.

Examples may include:

  • Using functions directly inside binning conditions.
  • Using functions inside matching conditions (TRIM, UPPER...).
  • Excessive fuzzy matching.
  • Match rules generating very large candidate clusters.

Functions inside matching logic prevent efficient index usage and force expensive computations.


2. Giant Match Clusters

A poorly designed rule or incorrect dataset may create extremely large matching groups.

This heavily impacts duplicate detection, transitive closure processing and match graph computations.


3. Missing or Inefficient Indexes

Matching relies heavily on binning attributes, matching attributes and internal matching columns.

Without optimized indexes, full scans occur frequently and matching joins become extremely expensive.


4. Insufficient Database Statistics

The database optimizer depends on accurate statistics.

Default histogram sizes are often too small for large MI and DU tables, leading to poor execution plans.


5. Inefficient Transitive Closure Processing

By default, transitive closure uses a cursor-based implementation.

For large datasets, this approach may become inefficient.


6. Database Storage/Page Configuration

Frequent updates on MI tables can suffer from page fragmentation and excessive write overhead when fill factor settings are not optimized.


Solution

Resolving slow matching performance may require optimization at multiple levels:

  • Match rule design.
  • Calculated attributes.
  • Database indexing.
  • Database statistics.
  • Application server configuration.

/!\ These steps are advanced and may require assistance from Database Administrators and/or System Administrators.


Step 1 - Optimize Match Rules

Avoid functions in matching and binning conditions

As they force runtime computation and prevent optimization.

Instead, the recommended approach is to create calculated attributes that already contain transformed values (example: NAME_TRIM, NAME_UPPER, PHONE_NORMALIZED).

Then use these attributes directly in binning and matching conditions.

> This allows indexes to be used efficiently.


Step 2 - Validate Overmatching

Large clusters may indicate a problematic rule.

Run validation queries against MI tables.

SELECT count(*), B_MATCHGRP
From MI_xxx
Group by B_MATCHGRP
Order by count(*) desc;

Replace MI_xxx with your actual entity.


If extremely large clusters are found:

  • Review recent match rule changes.
  • Review incoming datasets.
  • Tighten matching criteria.


Step 3 - Update Existing Records

If new calculated attributes were added or rules have been reviewed, existing MI records must be updated, otherwise only new incoming records benefit from optimization.

In order to do so, we recommend replaying matching as per described in our documentation.


Step 4 - Add Database Indexes

Create indexes on attributes heavily used by match rules.

Recommended targets:

  • Binning columns
  • Frequently matched attributes
  • Internal matching columns

Proper indexing can dramatically reduce matching duration.


Step 5 - Enable Optimized Transitive Closure Mode

Instead of the default cursor-based implementation, enable the more efficient set-based mode.

To do so, add the following Java option:

CATALINA_OPTS="$CATALINA_OPTS -Dmdm.integjob.defaultTransitiveClosureMode=UPDATE_WHILE"

This needs to be done wherever you are configuring your Catalina options.

After enabling, restart the application server and redeploy the model.

This regenerates job definitions using the new mode.

/!\ Always validate this change in a DEV environment first.


Step 6 - Improve Database Statistics and Fill Factor

For large MI and DU tables, increase statistics precision.

Here is an example that needs to be adapted to your context:

alter table mi_xxx alter column b_batchid set statistics 10000;
alter table mi_xxx alter column b_matchgrp set statistics 10000;
alter table mi_xxx alter column b_sourceid set statistics 10000;
alter table mi_xxx alter column b_xgrp set statistics 10000;

analyze mi_xxx;
alter table mi_xxx set (fillfactor = 80);
vacuum analyze mi_xxx;

Also optimize indexes:

alter index <index_name> set (fillfactor = 80);
reindex index <index_name>;

And improve DU statistics:

alter table du_xxx alter column b_sourceid set statistics 10000;
alter table du_xxx alter column b_sourceid2 set statistics 10000;
analyze du_xxx;

This may help for:

  • Better execution plans.
  • Improved cardinality estimation.
  • Faster HOT updates.
  • Reduced page fragmentation.
  • More efficient joins.

/!\ These database operations should be performed by a DBA.


Step 7 - Verify Job Configuration

Explore adjusting the parameter "PARAM_ANALYZE_STATS" in your integration job to appropriately reflect when using custom statistics strategies.