When ingesting historical event data into mParticle with Warehouse Sync, it is important to consider historical data handling, data quality, data retention, and platform limits. Historical data is processed differently by mParticle, so pipelines that ingest it have special considerations and requirements.
This guide outlines key points and best practices for handling historical and large-volume data ingestion.
There are two approaches to ingesting data with Warehouse Sync:
Incremental pipelines run on a schedule and ingest only rows that are new or updated since the previous run, while full pipelines ingest the entire result of your data model on each run. When ingesting historical data, you can use a combination of these two pipeline types to backfill old data and then switch to an ongoing incremental sync.
As a general rule, any event with a timestamp older than 30 days is considered historical. This data requires special handling to ensure it is processed correctly and made available for long-term use cases like audience segmentation with extended lookback windows.
Data that you flag as historical is handled differently from real-time data in how it is forwarded and made available to downstream features; all other processing rules for identity resolution and storage remain the same for historical data.
A core concept in mParticle is the relationship between user data (attributes and identities) and event data. Understanding this is critical for a successful historical data ingest.
Data is sent to mParticle in batches (multiple events batched together) to optimize throughput. Each batch contains event data, and provides context about the user in the form of user attributes and identities.
For example, a product_view event captures that a user looked at a product; event data is captured and stored as events using mParticle's event data format. A user attribute such as membership_tier: gold describes the user themselves. User attributes and identities are stored in a persistent User Profile, which creates a complete, 360-degree view of your user.
When ingesting historical data, you can include user attributes and identities in the same batch as your events. However, in many backfill scenarios, it's more effective to ingest them separately. This is particularly true when your goal is to have the final user profile reflect the most recent information about a user, rather than the state they were in when a historical event occurred. For example, you can send a batch containing only user_attributes and user_identities to update a user's profile so that it is accurate and up-to-date, then ingest the user's events once the profile reflects the most recent information.
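As a rough illustration (the identity type and attribute names below are placeholders, not required values), a profile-only batch in mParticle's JSON batch format could look like this:

{
  "environment": "production",
  "user_identities": {
    "customer_id": "cust_12345"
  },
  "user_attributes": {
    "membership_tier": "gold"
  },
  "events": []
}

Because this batch contains no events, only the user's profile is updated; subsequent batches can then carry the historical events themselves.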
mParticle enforces service limits specific to Warehouse Sync, including limits on the number of pipelines you can create and the volume of data you can ingest; see the Default Service Limits documentation for current values.
Warehouse Sync ingests data at a rate according to your account's configured limits and expected data volumes. When the Warehouse Sync API or UI shows a “success” status, it means your data has been ingested and can be used by downstream features and sent to output integrations. Keep in mind that while recent non-historical data may be available quickly, some downstream features and integrations may require additional processing time before all ingested data is available. This is especially true for large or historical data loads, which are processed differently from real-time data, and for connected outputs, which have their own processing times.
Based on your estimates, coordinate with your mParticle Customer Success team if you anticipate exceeding any limits.
When ingesting historical data, it's essential to distinguish between batch timestamps and event timestamps, and to flag historical batches with the source_info.is_historical field. This approach ensures that both event and profile data are processed accurately and in the correct order, while avoiding issues with data retention, forwarding, and audience availability.
There are two common strategies for ingesting historical data efficiently:
Incremental pipeline with a backdated from date – Set the from value to the start of the period you want to ingest, and leave until blank. The initial run will backfill the entire range, after which the pipeline will automatically switch to incremental updates on its schedule.
Full On Demand pipeline with a WHERE clause – Create a full On Demand pipeline and use a WHERE clause in your data model to limit the dataset (for example, WHERE event_timestamp BETWEEN '2024-01-01' AND '2024-06-30'). Trigger the pipeline whenever you need to ingest another slice, updating the clause between runs as needed. Once you have finished ingesting the historical data, disable the full pipeline and create a new incremental pipeline to sync ongoing new data on a regular schedule, setting the from value to the last date you ingested with the full pipeline. A configuration sketch for this approach follows below.
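For the second strategy, the pipeline's sync mode and schedule are what make it a full On Demand pipeline. As a hedged sketch only (the exact request schema is defined in the Warehouse Sync API Reference, and the shape shown here is an illustrative assumption), the relevant portion of the pipeline configuration might look like:

{
  "sync_mode": {
    "type": "full"
  },
  "schedule": {
    "type": "on_demand"
  }
}

With an on-demand schedule, each run is triggered manually after you update the WHERE clause for the next historical slice.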
This section provides an overview of creating a Warehouse Sync pipeline specifically for ingesting historical data. For a complete, detailed walkthrough of how to create a Warehouse Sync pipeline, refer to the main Warehouse Sync setup guide.
The first step is to connect mParticle to your data warehouse. This process is the same for all Warehouse Sync pipelines.
Your data model is the SQL query that defines what data to pull from your warehouse. When ingesting historical data, your query needs special consideration.
To ensure mParticle correctly identifies and processes your historical data, you must explicitly flag it when creating your data model. The “Sync Historical Data” setting in your Warehouse Sync pipeline only controls how far back the pipeline reads from your warehouse; it does not automatically mark old data as historical.
Add a CASE statement to your SQL query to create an is_historical column. Example SQL:
SELECT
*,
CASE
WHEN event_timestamp < CURRENT_DATE - INTERVAL '30 days' THEN TRUE
ELSE FALSE
END AS is_historical
FROM your_table
Warehouse Sync provides two complementary ways to control which rows are pulled from your warehouse:
WHERE clauses – Add predicates such as WHERE event_timestamp >= DATE '2023-01-01' or a bounded BETWEEN statement inside the query you save with the data model. The filter executes in your warehouse on every run. This is recommended for filtering on fields other than the iterator field, such as event timestamp, event type, or user region, and is useful for validating your pipeline by limiting the dataset to a small subset of data.
Pipeline iterator window (from / until) – Use the Sync Historical Data setting in the UI or use the from and until fields on sync_mode when calling the API to set the minimum and maximum iterator values a pipeline will ever request. This is recommended for a seamless transition between your initial backfill and ongoing incremental syncs, and makes your pipeline runs more easily auditable via the pipeline status APIs (see the sketch below).
A row must satisfy both the iterator window and your SQL to be ingested, giving you granular control over the time span and the business logic applied to your data.
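As a hedged example of the second mechanism (field names here follow our reading of the Warehouse Sync API Reference and the values are illustrative, so confirm the exact schema there before use), a bounded iterator window on an incremental pipeline might be expressed as:

{
  "sync_mode": {
    "type": "incremental",
    "iterator_field": "updated_at",
    "iterator_data_type": "timestamp_ltz",
    "from": "2023-01-01T00:00:00Z",
    "until": "2024-06-30T23:59:59Z"
  }
}

Leaving until unset keeps the window open-ended, which is what the backdated-from backfill strategy described earlier relies on.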
When using incremental pipelines in Warehouse Sync, you must specify an iterator column—a timestamp field (such as datetime, date, or Unix timestamp) that tracks which rows have already been processed. This iterator column is essential for reliable incremental updates and should be distinct from your event timestamp field whenever possible.
Best practices for iterator columns:
Keep the event timestamp as its own column, separate from the iterator, so it remains available for filtering (for example, in a WHERE clause) and for mapping to the appropriate mParticle field.
If upstream jobs take time to finish loading rows into your warehouse, set the delay field to 1d to account for this upstream processing time.
In the “Create data mapping” step of your Warehouse Sync setup, map the is_historical column from your SQL query to the source_info.is_historical field. You should also set a channel for the data. For more details on this process, see Field Transformations.
Example Field Transformation Mapping:
[
{
"mapping_type": "column",
"source": "is_historical",
"destination": "source_info.is_historical"
},
{
"mapping_type": "static",
"destination": "source_info.channel",
"value": "server_to_server"
}
]
Your sync settings determine whether your pipeline runs on a schedule (incremental) or on-demand (full), and over what time period. These settings work together with the WHERE clause in your data model to control what data is ingested.
For an incremental backfill, set the from date to the beginning of your historical period. Subsequent runs will automatically pick up where the last run left off.
Before activating your pipeline, review all settings to ensure they match your intended historical ingestion strategy. It's a best practice to start with a small test batch to validate your configuration before running a large backfill.
Successfully ingesting large volumes of historical data requires thoughtful preparation. In addition to the specific practices embedded in the steps above, keep the following general guidelines in mind:
Evaluate connected systems and features: confirm that the downstream outputs and features connected to your workspace are prepared to receive the historical data you plan to ingest.
Plan your pipeline configuration up front, because a pipeline's sync_mode can't be changed after creation.
Use from/until iterator windows to precisely control which rows are ingested and avoid overlapping backfill runs.
Your mParticle Customer Success team can help you plan your ingestion strategy and adjust account limits if needed. Engage your Customer Success team early in the planning process for large or unusual data ingests.