Need
Deploying the Semarchy xDM Native App on Snowflake requires sizing several Snowflake resources appropriately to deliver the expected performance while optimizing infrastructure costs.
Unlike a traditional xDM deployment, where the application server and database are managed independently, the Native App relies on Snowflake services to provide both application execution and database processing.
Sizing therefore involves evaluating three independent components:
- Snowpark Container Services Compute Pools, which host the xDM application runtime.
- Snowflake Virtual Warehouses, which execute SQL processing and integration workloads.
- Snowflake Storage, which stores application data and metadata.
Each of these components has different sizing drivers and scaling mechanisms. Understanding how they interact is essential when planning a new deployment or expanding an existing one.
This article provides practical guidance to help architects, Snowflake administrators, and implementation teams estimate an appropriate starting configuration and identify when scaling adjustments are required.
Summarized Solution
The xDM Native App should be sized by considering each Snowflake resource independently.
- Compute Pools should be sized according to application workload, including user sessions, APIs, workflows, security model complexity, and memory consumption.
- Virtual Warehouses should be sized according to SQL workload, including integration jobs, matching operations, joins, transformations, and concurrent processing.
- Storage should be estimated from expected data volumes, enrichment strategy, historization, and anticipated growth.
As a general recommendation:
- Start with a modest configuration.
- Measure actual CPU, memory, and SQL performance.
- Increase Compute Pool size when the application runtime becomes the bottleneck.
- Increase Warehouse size when SQL execution becomes the bottleneck.
- Separate interactive and batch workloads whenever possible.
- Reassess storage estimates after the first implementation iterations using actual production data.
Detailed Solution
Understanding the xDM Native App Architecture
The xDM Native App relies on three distinct Snowflake resources.
Compute Pool
The Compute Pool hosts the xDM application runtime.
It is responsible for:
- User sessions
- REST APIs
- Workflow execution
- Model loading
- Application orchestration
Virtual Warehouse
Virtual Warehouses execute SQL processing inside Snowflake, including:
- Integration jobs
- Matching and consolidation
- Data transformations
- User queries
- Analytics
Storage
Storage contains:
- Business data
- Metadata
- Match tables
- Workflow data
- Historical records
Each component scales independently and therefore should be sized independently.
Horizontal vs Vertical Scaling
Horizontal Scaling
Traditional xDM deployments often scale horizontally by adding application nodes behind a load balancer.
Additional nodes can simply increase overall concurrency, or they can be dedicated to specific workloads such as:
- REST APIs
- Interactive users
- Background processing
The xDM Native App currently supports a single active application instance.
Consequently, horizontal scaling is not available.
The Compute Pool therefore behaves similarly to a single virtual machine, and its instance class determines the amount of available CPU and memory.
Although small Compute Pools may sustain moderate concurrent activity, there is no universal sizing rule because performance depends heavily on:
- Data model complexity
- Security model
- Workflow configuration
- Business views
- API workload
- User concurrency
Vertical Scaling
Most processing scalability is achieved at the database level through Snowflake Virtual Warehouses.
While the Compute Pool manages application execution, the Warehouse performs the SQL work.
This includes:
- Bulk ingestion
- Integration jobs
- Match and merge operations
- Transformations
- User queries
As data volumes increase, warehouse sizing becomes the primary factor affecting processing performance.
Compute Pool Sizing
Several factors influence Compute Pool sizing.
Security Model Complexity
One of the largest consumers of application memory is the security model.
Memory usage increases with the number of distinct role combinations that users may impersonate.
Reducing unnecessary role combinations and simplifying workflow role mappings decreases memory consumption and leaves more resources available for user sessions.
Entity Views
Default entity views consume runtime resources.
If some default views are not required by the application, disabling them reduces both CPU and memory usage, particularly for UI-intensive deployments.
CPU vs Memory
Two different bottlenecks can occur.
Memory Bottleneck
Typical symptoms include:
- Out-of-memory errors
- Frequent garbage collection
- Application instability
- Random slowdowns
Memory should generally be considered the first limiting factor.
CPU Bottleneck
When sufficient memory is available, CPU becomes the limiting factor.
Typical symptoms, which appear while memory usage remains stable, include:
- Increasing response times
- Slow API calls
- Sluggish UI performance
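The distinction between the two bottlenecks can be sketched as a simple check on monitoring samples. The Python sketch below is illustrative only: the RuntimeSample structure, the classify_bottleneck function, and both thresholds are hypothetical and are not part of xDM or Snowflake. It evaluates memory first, in line with the guidance that memory is generally the first limiting factor.

```python
from dataclasses import dataclass


@dataclass
class RuntimeSample:
    """One monitoring sample of the application runtime (hypothetical shape)."""
    cpu_pct: float     # CPU utilization, 0-100
    memory_pct: float  # memory utilization, 0-100


# Illustrative thresholds only (assumptions); tune them against real observations.
MEMORY_HIGH_PCT = 85.0
CPU_HIGH_PCT = 80.0


def classify_bottleneck(samples: list[RuntimeSample]) -> str:
    """Return 'memory', 'cpu', or 'none', checking memory before CPU."""
    if not samples:
        return "none"
    avg_memory = sum(s.memory_pct for s in samples) / len(samples)
    avg_cpu = sum(s.cpu_pct for s in samples) / len(samples)
    if avg_memory >= MEMORY_HIGH_PCT:
        return "memory"
    if avg_cpu >= CPU_HIGH_PCT:
        return "cpu"
    return "none"
```

In practice the samples would come from whatever monitoring is available for the Compute Pool, and the averages would be taken over a representative period of activity rather than a single reading.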
Practical Recommendation
For applications primarily serving interactive users and moderate API traffic, start with a Small or Medium Compute Pool and monitor CPU and memory usage.
For API-intensive environments with higher concurrency, consider selecting a larger Compute Pool from the beginning, since additional application instances cannot currently be added.
Virtual Warehouse Sizing
Warehouse sizing depends primarily on SQL workload rather than the number of connected users.
Important sizing factors include:
- Volume of processed data
- Number of joins
- Sorting operations
- Grouping operations
- Window functions
- Ranking
- Deduplication logic
- Concurrent integration jobs
Large matching operations and integration jobs typically benefit from larger warehouse sizes.
Scale Up vs Scale Out
Snowflake provides two complementary scaling strategies.
Scale Up
Increasing warehouse size allocates more compute resources to each query.
This generally improves performance for:
- Large joins
- Heavy transformations
- Match and merge jobs
- Complex SQL processing
If a single integration or matching job is too slow, scaling up is usually the appropriate first step.
Scale Out
Multi-cluster warehouses increase concurrency by adding clusters.
They do not necessarily reduce the execution time of a single query.
Instead, they allow more queries to execute simultaneously.
Consider multi-cluster warehouses when:
- Multiple integration jobs execute concurrently
- Interactive users compete with batch processing
- Queries spend significant time waiting in queue
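The choice between the two strategies can be expressed as a small decision helper. The Python sketch below is a hedged illustration: the function name, its parameters, and the 25% queue-share threshold are assumptions made for this example, not values recommended by Snowflake or Semarchy. It suggests scaling up when execution time exceeds a target and scaling out when a significant share of elapsed time is spent queued; both can apply at once.

```python
# Illustrative assumption: queueing is "significant" when it accounts for
# at least 25% of total elapsed time (queue + execution).
QUEUE_SHARE_THRESHOLD = 0.25


def recommend_warehouse_actions(avg_queue_seconds: float,
                                avg_execution_seconds: float,
                                target_execution_seconds: float) -> list[str]:
    """Return the scaling actions suggested by average query timings."""
    actions = []

    # Slow individual jobs: a larger warehouse gives each query more compute.
    if avg_execution_seconds > target_execution_seconds:
        actions.append("scale up")

    # Queries waiting in queue: additional clusters increase concurrency.
    total_seconds = avg_queue_seconds + avg_execution_seconds
    if total_seconds > 0 and avg_queue_seconds / total_seconds >= QUEUE_SHARE_THRESHOLD:
        actions.append("scale out")

    return actions
```

The timings passed to the helper would typically be averaged per warehouse from query history, separately for interactive and batch workloads.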
Separating Workloads
Although the application runtime consists of a single instance, database workloads can still be isolated.
A common architecture consists of dedicated warehouses such as:
- WH_APP for UI interactions and REST APIs
- WH_BATCH for integration jobs and heavy processing
Integration jobs can be configured to use a dedicated datasource connected to a specific warehouse by setting the job parameter PARAM_DATASOURCE_NAME_SUFFIX.
Separating workloads improves predictability and prevents heavy batch processing from affecting interactive users.
Estimating Storage Requirements
Storage estimation depends on several design decisions.
Important factors include:
- Number of entities
- Entity type
- Average record size
- Number of standardized attributes
- Number of technical attributes
- Match keys
- Historization
- Update frequency
Storage requirements can vary significantly depending on the implementation.
For example:
- Preserving both source and standardized values consumes considerably more storage than overwriting source values.
- Match keys, phonetic values, concatenated strings, and technical attributes increase row size.
- Historization increases storage proportionally to update frequency.
Estimating Average Record Size
Average row size should represent the expected stored content rather than the maximum defined column size.
For example, a VARCHAR(500) column rarely contains 500 characters.
If the average stored value is approximately 20 characters, the estimate should be based on roughly 20 bytes (assuming mostly single-byte characters) rather than 500.
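One way to obtain such an average is to measure a sample of real values. The Python sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, values are measured as UTF-8 bytes, NULL values count as zero bytes, and compression and storage overhead are ignored.

```python
from typing import Iterable, Optional


def average_stored_bytes(sample_values: Iterable[Optional[str]]) -> float:
    """Average UTF-8 size per row for one column, counting NULL as 0 bytes."""
    sizes = [0 if value is None else len(value.encode("utf-8"))
             for value in sample_values]
    return sum(sizes) / len(sizes) if sizes else 0.0


# Sample values for a hypothetical CITY column declared as VARCHAR(500).
city_sample = ["Paris", "Lyon", None, "Zürich"]
print(average_stored_bytes(city_sample))  # far below the declared maximum of 500
```

Summing this figure across all columns of an entity gives an approximate average row size that reflects stored content rather than declared column widths.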
Estimating from Existing Data
When sample source files are available, a practical estimate can be obtained by dividing the file size by the number of records.
This provides an approximate raw row size.
Additional storage should then be added for:
- Standardized attributes
- Enriched values
- Technical columns
- Match keys
Many implementations preserve both source and standardized values.
In such cases, a standardization ratio of approximately two is a reasonable starting point: the stored data volume roughly doubles because each business attribute is accompanied by its standardized counterpart.
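The calculation described in this section can be sketched as follows. This Python example is illustrative: the function and parameter names are hypothetical, the default ratio of 2.0 reflects the case where both source and standardized values are preserved, and the technical overhead (technical columns, match keys) is left as an input to be estimated per implementation.

```python
def estimate_stored_row_bytes(file_size_bytes: int,
                              record_count: int,
                              standardization_ratio: float = 2.0,
                              technical_overhead_bytes: float = 0.0) -> float:
    """Estimate the stored size of one record from a sample source file."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")

    # Approximate raw row size: file size divided by number of records.
    raw_row_bytes = file_size_bytes / record_count

    # Add standardized counterparts, then technical columns and match keys.
    return raw_row_bytes * standardization_ratio + technical_overhead_bytes


# A 1 MB sample file containing 10,000 records, with an assumed 50 bytes
# of technical columns and match keys per record.
print(estimate_stored_row_bytes(1_000_000, 10_000, technical_overhead_bytes=50))
```

The result is an uncompressed approximation intended for order-of-magnitude planning, to be refined against actual table sizes once data is loaded.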
Growth Estimation
Storage estimation should also include:
- Initial data volume
- Expected daily inserts
- Daily updates
- Record historization
- Expected retention period
When match ratios are unknown, conservative default assumptions should be used initially and refined after the first production iterations.
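The growth factors listed above can be combined into a rough projection. The Python sketch below is a simplified model built on assumptions that should be checked against the actual implementation: names are hypothetical, every insert adds one row, every update on a historized entity adds one historical row, nothing is purged during the projection horizon, and match-related tables are not included.

```python
def project_storage_bytes(initial_records: int,
                          daily_inserts: int,
                          daily_updates: int,
                          horizon_days: int,
                          row_bytes: float,
                          historized: bool = True) -> float:
    """Project uncompressed storage at the end of the horizon, in bytes."""
    # Current rows: the initial load plus all inserts over the horizon.
    current_rows = initial_records + daily_inserts * horizon_days

    # Historical rows: one extra version per update when historization is on.
    history_rows = daily_updates * horizon_days if historized else 0

    return (current_rows + history_rows) * row_bytes


# One year of activity on a 1,000,000-record entity with 250-byte rows.
with_history = project_storage_bytes(1_000_000, 1_000, 5_000, 365, 250)
without_history = project_storage_bytes(1_000_000, 1_000, 5_000, 365, 250,
                                        historized=False)
print(with_history, without_history)
```

Comparing the two figures shows how strongly historization combined with a high update frequency can dominate the estimate.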
Storage Costs
Although hybrid tables store data in multiple formats internally, storage costs generally remain significantly lower than compute costs.
For most Native App deployments, the primary cost drivers are:
- Compute Pools
- Virtual Warehouses
- AI services such as Cortex
Applications that make extensive use of workflows should also take into account that workflow execution currently relies on polling, which may reduce opportunities for warehouse auto-suspend.
Best Practices
- Start with modest Compute Pool and Warehouse sizes.
- Monitor actual CPU, memory, and query execution metrics before increasing capacity.
- Optimize the security model to reduce memory consumption.
- Disable unused entity views whenever possible.
- Separate interactive and batch workloads across different warehouses.
- Scale Warehouse size when individual SQL jobs are slow.
- Scale Warehouse concurrency when query queueing becomes frequent.
- Re-evaluate storage estimates after the first implementation iterations using actual production data.