Warehouse Sync is mParticle’s reverse-ETL solution, allowing you to use an external warehouse as a data source for your mParticle account. By ingesting data from your warehouse using either one-time, on-demand, or recurring syncs, you can benefit from mParticle’s data governance, personalization, and prediction features without having to modify your existing data infrastructure.
Use the Warehouse Sync UI to configure a complete Warehouse sync, including:
You can also use the UI to view the status of, edit, or delete, existing Warehouse Sync pipelines.
All functionality of Warehouse Sync is also available from an API. To learn how to create and manage syncs between mParticle and your data warehouse using the API, visit the mParticle developer documentation.
Before you can ingest data from your warehouse, you must configure your warehouse to be accessible by mParticle by:
To begin the setup wizard:
You can also enter the setup wizard from the Integrations Directory:
After selecting your warehouse provider, the Warehouse Sync setup wizard opens where you enter your warehouse details, create your data model, review how data from your warehouse will map to data in mParticle, and set your sync schedule.
The setup wizard presents different configuration options depending on the warehouse provider you select. Use the tabs below to view the specific setup instructions for Amazon Redshift, Google BigQuery, and Snowflake.
The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.
The database name identifies your database in Amazon Redshift. This must be a valid Amazon Redshift name, and it must exactly match the name for the database you want to connect to.
This is the unique string used to identify an IAM (Identity and Access Management) role within your AWS account. AWS IAM Role ARNs follow the format arn:aws:iam::account:role/role-name-with-path where account is replaced with your AWS account number, role is replaced with the role type in AWS, and role-name-with-path is replaced with the name and location for the role in your AWS account.
Learn more about AWS identifiers in IAM identifiers in the AWS documentation.
mParticle uses the host name and port number when connecting to, and ingesting data from, your warehouse.
Provide the username and password associated with the AWS IAM Role you entered in step 1.4. mParticle will use these credentials when logging into AWS before syncing data.
The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.
Enter the ID for the project in BigQuery containing the dataset. You can find your Project ID from the Google BigQuery Console.
Enter the ID for the dataset you’re connecting to. You can find your Dataset ID from the Google BigQuery Console.
Enter the region where your dataset is localized. For example, aws-us-east-1
or aws-us-west-2
. You can find the region for your data set from the Google BigQuery Console.
Finally, enter the service account ID and upload a JSON file containing your source account key associated with the the Project ID you entered in step 1.3. mParticle uses this information to log into BigQuery on your behalf when syncing data.
Your service account ID must match the source account key.
The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.
Select either Prod or Dev depending on whether your want your warehouse data sent to the mParticle development or production environments. This setting determines how mParticle processes ingested data.
Your Snowflake account ID uniquely identifies your Snowflake account. You can find your snowflake account ID by logging into your Snowflake account, and copying everything appearing before .snowflake.com
in your web browsers URL bar.
The Snowflake warehouse name is used to find the specific database you are connecting to. Each Snowflake warehouse can contain multiple databases.
The specific database you want to sync data from.
Enter the region identifier for where your Snowflake data is localized.
Finally, you must provide the username, password, and role specific to the database in Snowflake you are connecting to. These credentials are independent from the username and password you use to access the main Snowflake portal. If you don’t have these credentials, contact the Snowflake database administrator on your team.
Your data model describes which columns from your warehouse to ingest into mParticle, and which mParticle user profile attributes each column maps to. Data models are written as SQL queries, but warehouse providers process SQL functions differently so it is important to use the correct SQL syntax for the warehouse provider you select.
For a detailed reference of all SQL commands Warehouse Sync supports alongside real-world example SQL queries, see Warehouse Sync SQL reference.
mParticle submits the SQL query you provide to your warehouse to retrieve specific columns of data. Depending on the SQL operators and functions you use, the columns selected from your database are transformed, or mapped, to user profile attributes in your mParticle account.
If you use an attribute name in your SQL query that doesn’t exist in your mParticle account, mParticle creates an attribute with the same name and maps this data to the new attribute.
mParticle automatically maps matching column names in your warehouse to reserved mParticle user attributes and device ids. For example, if your database contains a column named customer_id
, it is automatically mapped to the customer_id
user identifier in mParticle. For a complete list of reserved attribute names, see Reserved mParticle user or device identity column names.
Refer to the SQL query written for a hypothetical database that includes following tables:
Table name: mp.demo.attr
Columns | date_updated | customer_id | firstname |
---|---|---|---|
Data rows | … | … | … |
Table name: mp.demo.calc
Columns | propensity_to_buy | customer_id |
---|---|---|
Data rows | … | … |
Table name: mp.demo.favs
Columns | value | customer_id |
---|---|---|
Data rows | … | … |
Example SQL query:
SELECT
a.date_updated,
a.customer_id,
a.firstname AS "$firstname",
c.propensity_to_buy AS "propensity_to_buy",
ARRAY_AGG(f.value) WITHIN GROUP (ORDER BY f.value ASC) AS "favorite_categories"
FROM mp.demo.attr a
JOIN mp.demo.calc c ON a.customer_id = c.customer_id
JOIN mp.demo.favs f ON a.customer_id = f.customer_id
GROUP BY
a.date_updated,
a.customer_id,
a.firstname,
c.propensity_to_buy
The SQL query above performs several selections and operations on the data in the sample tables:
date_updated
, customer_id
, and firstname
are selected from the database table mp.demo.att
.firstname
maps to the mParticle attribute firstname
using the AS
SQL operator. date_updated
and customer_id
are automatically mapped to their corresponding reserved attributes in mParticle.propensity_to_buy
in the table mp.demo.calc
maps to the mParticle attribute propensity_to_buy
.ARRAY_ARG()
creates an array named favorite_categories
containing each user’s favorite categories in ascending order from the table mp.demo.favs
. Note that the query joins the three tables on their shared customer_id
column. The result of the query as it exists in mParticle would then be:
mParticle attributes names | date_updated | customer_id | firstname | propensity_to_buy |
---|---|---|---|---|
Attribute values | … | … | … | … |
Verify that the columns selected in your warehouse map to the correct attributes on a user profile in your mParticle account. Any columns or attributes using the reserved word identifiers are automatically be used by mParticle during the identity resolution process. Warehouse Sync supports the same user identifiers as other input methods and adheres to the identity strategy you’ve configured.
To change your data mapping, click Back, modify and rerun your SQL query, then click Next to return to Review mapping.
The sync frequency settings determine when the initial sync with your database will occur, and how frequently any subsequent syncs will be executed.
Select either Prod or Dev depending on whether your want your warehouse data sent to the mParticle development or production environments. This setting determines how mParticle processes ingested data.
Input protection levels determine how data ingested from your warehouse can contribute to new or existing user profiles in mParticle:
To learn more about these settings and how they can be used in different scenarios, see Input protections.
There are two sync modes: incremental and full.
The remaining setting options change depending on the mode you select from the Sync mode dropdown menu. Navigate between the two configuration options using the tabs below:
Following are the sync settings for incremental syncs:
The main difference between full and incremental sync configurations is the use of an iterator field for incremental syncs. Both full and incremental syncs support all three sync modes (Interval, Once, and On Demand).
Select the column name from your warehouse using the Iterator dropdown menu. The options are generated from the SQL query you ran when creating your data model.
mParticle uses the iterator to determine which data to include and exclude during each incremental sync.
Select the iterator data type using the Iterator data type dropdown. This value must match the datatype of the iterator as it exists in your warehouse.
Select one of the following schedule types:
Sync frequencies can be either Hourly, Daily, Weekly, or Monthly.
The date and time you select for your initial sync is used to calculate the date and time for the next sync. For example, if you select Hourly for your frequency and 11/15/2023 07:00 PM UTC
for your initial sync, then the next sync will occur at 11/15/2023 08:00 PM UTC
.
Use the date and time picker to select the calendar date and time (in UTC) for your initial sync. Subsequent interval syncs will be scheduled based on this initial date and time.
Following are the sync settings for full syncs:
The main difference between full and incremental sync configurations is the use of an iterator field for incremental syncs. Both full and incremental syncs support all three sync modes (Interval, Once, and On Demand).
Select one of the following schedule types:
Sync frequencies can be either Hourly, Daily, Weekly, or Monthly.
The date and time you select for your initial sync is used to calculate the date and time for the next sync. For example, if you select Hourly for your frequency and 11/15/2023 07:00 PM UTC
for your initial sync, then the next sync will occur at 11/15/2023 08:00 PM UTC
.
Use the date and time picker to select the calendar date and time (in UTC) for your initial sync. Subsequent interval syncs will be scheduled based on this initial date and time.
The value you select for Sync Start Date determines how much old, historical data mParticle ingests from your warehouse in your initial sync. When determining how much historical data to ingest, mParticle uses to the column in your database you selected as the Timestamp field in the Create Data Model step.
After your initial sync begins, mParticle begins ingesting any historical data. If mParticle hasn’t finished ingesting historical data before the time a subsequent sync is due to start, the subsequent sync is still executed, and the historical data continues syncing in the background.
Historical data syncing doesn’t contribute to any rate limiting on subsequent syncs.
After entering your sync settings, click Next.
mParticle generates a preview for Data Warehouse syncs that have been configured, but not yet activated. Use this page and the sample enriched user profiles to confirm the following:
After reviewing your sync configuration, click Activate to activate your sync.
Any configured warehouse syncs are listed on this page, grouped by warehouse provider. Expand a warehouse provider to view and manage a sync.
To view details for an existing sync, select it from the list of syncs. A summary page is displayed, showing the current status (Active or Paused), sync frequency, and a list of recent or in-progress syncs.
To pause a sync, click Pause Sync. Paused syncs will only resume running on their configured schedule after you click Resume.
To run an on-demand sync, click Run Sync under Sync Frequency.
Use the Data Model, Mapping, and Settings tabs to view and edit your sync configuration details. Clicking Edit from any of these tabs opens the respective step of the setup wizard where you can make and save your changes.
Was this page helpful?