Warehouse Sync

Warehouse Sync is mParticle’s reverse-ETL solution, allowing you to use an external warehouse as a data source for your mParticle account. By ingesting data from your warehouse using either one-time, on-demand, or recurring syncs, you can benefit from mParticle’s data governance, personalization, and prediction features without having to modify your existing data infrastructure.

Supported warehouse providers

You can use Warehouse Sync to ingest both user data and event data from the following warehouse providers:

  • Amazon Redshift
  • Google BigQuery
  • Snowflake
  • Databricks

Warehouse Sync setup overview

  1. Prepare your data warehouse before connecting it to mParticle
  2. Create an input feed in your mParticle account for your warehouse
  3. Connect your warehouse to your new mParticle input feed
  4. Specify the data you want to ingest into mParticle by creating a SQL data model
  5. Map your warehouse data to fields in mParticle

    • For user data pipelines, this mapping is done by your data model
    • For event data pipelines, you must complete an additional data mapping step
  6. Configure when and how often data is ingested from your warehouse

Warehouse Sync API

All functionality of Warehouse Sync is also available from an API. To learn how to create and manage syncs between mParticle and your data warehouse using the API, visit the mParticle developer documentation.
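As a rough illustration of driving Warehouse Sync programmatically, the sketch below assembles a JSON body for a sync-pipeline configuration. The field names and structure here are illustrative only, not the official mParticle API schema; consult the developer documentation for the real request contract.

```python
# Hypothetical sketch: building a JSON body for a Warehouse Sync API request.
# The field names below are illustrative, not mParticle's actual API schema.
import json

def build_sync_config(name, provider, schedule_type, frequency=None):
    """Assemble an illustrative sync-pipeline configuration."""
    config = {
        "name": name,
        "provider": provider,          # e.g. "snowflake"
        "schedule": {"type": schedule_type},
    }
    if schedule_type == "interval":
        config["schedule"]["frequency"] = frequency  # e.g. "hourly"
    return json.dumps(config)

payload = build_sync_config("demo-sync", "snowflake", "interval", "hourly")
```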

Warehouse Sync setup tutorial

1 Prepare your data warehouse

Before you can ingest data from your warehouse, you must configure your warehouse to be accessible by mParticle by:

  1. Whitelist the mParticle IP address range in your warehouse so that mParticle can submit API requests.
  2. Create credentials and permissions that mParticle can use to access your warehouse.

Step-by-step instructions for how to complete these two requirements for each supported provider can be found in the Warehouse Sync developer docs.

2 Create a new warehouse input

  1. Log into your mParticle account
  2. Navigate to Setup > Inputs in the left-hand nav bar and select the Feeds tab
  3. Under Add Feed Input, search for and select your data warehouse provider.

You can also create a new warehouse input from the Integrations Directory:

  1. Log into your mParticle account, and click Directory in the left-hand nav.
  2. Search for either Google BigQuery, Snowflake, Amazon Redshift, or Databricks.

After selecting your warehouse provider, the Warehouse Sync setup wizard opens, where you will:

  1. Enter your warehouse details
  2. Create your data model
  3. Create any necessary mappings between your warehouse data and mParticle fields
  4. Enter your sync schedule settings

3 Connect warehouse

The setup wizard presents different configuration options depending on the warehouse provider you select. Use the tabs below to view the specific setup instructions for Amazon Redshift, Google BigQuery, Snowflake, and Databricks.

Connect to Amazon Redshift

3.1 Enter a configuration name

The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.

3.2 Enter your Amazon Redshift database name

The database name identifies your database in Amazon Redshift. This must be a valid Amazon Redshift name, and it must exactly match the name for the database you want to connect to.

3.3 Enter your Redshift AWS IAM Role ARN

This is the unique string used to identify an IAM (Identity and Access Management) role within your AWS account. AWS IAM Role ARNs follow the format arn:aws:iam::account:role/role-name-with-path where account is replaced with your AWS account number, role is replaced with the role type in AWS, and role-name-with-path is replaced with the name and location for the role in your AWS account.
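A quick sanity check of the ARN's shape before entering it in the wizard can catch copy/paste mistakes. The snippet below is an illustrative sketch; the account number and role name in the example are made up.

```python
# Sketch: validating the expected shape of an AWS IAM Role ARN.
# The account number and role name below are made-up examples.
import re

ARN_PATTERN = re.compile(r"^arn:aws:iam::(\d{12}):role/([\w+=,.@/-]+)$")

def parse_role_arn(arn):
    """Return (account, role_name_with_path), or None if malformed."""
    match = ARN_PATTERN.match(arn)
    return match.groups() if match else None

account, role = parse_role_arn("arn:aws:iam::123456789012:role/mparticle/redshift-role")
```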

Learn more about AWS identifiers in IAM identifiers in the AWS documentation.

3.4 Enter your warehouse host and port

mParticle uses the host name and port number when connecting to, and ingesting data from, your warehouse.

3.5 Enter your Amazon Redshift credentials

Provide the username and password associated with the AWS IAM Role you entered in step 3.3. mParticle will use these credentials when logging into AWS before syncing data.

Connect to Google BigQuery

3.1 Enter a configuration name

The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.

3.2 Enter your BigQuery project ID

Enter the ID for the project in BigQuery containing the dataset. You can find your Project ID from the Google BigQuery Console.

3.3 Enter your BigQuery Dataset ID

Enter the ID for the dataset you’re connecting to. You can find your Dataset ID from the Google BigQuery Console.

3.4 Enter the region

Enter the region where your dataset is located. For example, aws-us-east-1 or aws-us-west-2. You can find the region for your dataset from the Google BigQuery Console.

3.5 Add your BigQuery credentials

Finally, enter the service account ID and upload a JSON file containing the service account key associated with the Project ID you entered in step 3.2. mParticle uses this information to log into BigQuery on your behalf when syncing data.

Your service account ID must match the service account key.

Connect to Snowflake

3.1 Enter a configuration name

The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.

3.2 Select the environment type where your data originates

Select either Prod or Dev depending on whether you want your warehouse data sent to the mParticle development or production environments. This setting determines how mParticle processes ingested data.

3.3 Enter your Snowflake account ID

Your Snowflake account ID uniquely identifies your Snowflake account. You can find your Snowflake account ID by logging into your Snowflake account and copying everything that appears before .snowflakecomputing.com in your web browser's URL bar.
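The extraction step above can be sketched as a small helper, assuming a classic console URL of the form <account>.snowflakecomputing.com. The URL in the example is made up.

```python
# Illustrative sketch: pulling the account identifier out of a classic
# Snowflake console URL (assumed form https://<account>.snowflakecomputing.com).
# The URL below is a made-up example.
from urllib.parse import urlparse

def account_id_from_url(url, suffix=".snowflakecomputing.com"):
    """Return everything before the Snowflake domain suffix, or None."""
    host = urlparse(url).hostname or ""
    return host[: -len(suffix)] if host.endswith(suffix) else None

account = account_id_from_url("https://abc12345.us-east-1.snowflakecomputing.com/console")
```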

3.4 Enter your Snowflake warehouse name

The Snowflake warehouse name is used to find the specific database you are connecting to. Each Snowflake warehouse can contain multiple databases.

3.5 Enter your Snowflake database name

Enter the name of the specific database you want to sync data from.

3.6 Enter the region

Enter the region identifier for where your Snowflake data is located.

3.7 Add your Snowflake credentials

Finally, you must provide the username, password, and role specific to the database in Snowflake you are connecting to. These credentials are independent from the username and password you use to access the main Snowflake portal. If you don’t have these credentials, contact the Snowflake database administrator on your team.

Connect to Databricks

3.1 Enter a configuration name

The configuration name is specific to mParticle and will appear in your list of warehouse inputs. You can use any configuration name, but it must be unique since this name is used when configuring the rest of your sync settings.

3.2 Enter your Databricks provider

The value you enter for your Databricks provider must match the name of the Databricks organization that contains the schema you want to ingest data from. This is the same value that you saved in Step 4 of the data warehouse setup, Find and save your Databricks provider name, before creating your warehouse connection.

3.3 Enter your Databricks schema

Enter the name of the schema in your Databricks account that you want to ingest data from. Databricks uses the terms database and schema interchangeably, so in this situation the schema is the specific collection of tables and views that mParticle will access through this Warehouse Sync connection.

4 Create data model

Your data model describes which columns from your warehouse to ingest into mParticle, and which mParticle fields each column should map to. While mParticle data models are written in SQL, all warehouse providers process SQL slightly differently, so it is important to use the correct SQL syntax for the warehouse provider you select.

For a detailed reference of all SQL commands Warehouse Sync supports alongside real-world example SQL queries, see Warehouse Sync SQL reference.

Mapping requirements for user and event data

Warehouse Sync can be used to ingest user data or events data.

User data mapping

If you are creating a user data pipeline, your SQL data model maps your ingested data to fields in the mParticle JSON schema.

Event data mapping

If you are creating an event data pipeline, you must complete an additional mapping step after creating your data model.

Steps to create a data model

  1. Write a SQL query following the guidelines outlined below and the Warehouse Sync SQL reference. Make sure to use SQL commands supported by your warehouse provider.
  2. Enter the SQL query in the SQL query text box, and click Run Query.
  3. Click Next.

mParticle submits the SQL query you provide to your warehouse to retrieve specific columns of data. Depending on the SQL operators and functions you use, the columns selected from your database are transformed, or mapped, to user profile attributes in your mParticle account.

If you use an attribute name in your SQL query that doesn’t exist in your mParticle account, mParticle creates an attribute with the same name and maps this data to the new attribute.

mParticle automatically maps matching column names in your warehouse to reserved mParticle user attributes and device ids. For example, if your database contains a column named customer_id, it is automatically mapped to the customer_id user identifier in mParticle. For a complete list of reserved attribute names, see Reserved mParticle user or device identity column names.
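The automatic mapping described above can be thought of as splitting the selected columns into reserved identities and custom attributes. The reserved list in this sketch is a small illustrative subset, not mParticle's full list of reserved identity columns.

```python
# Sketch: splitting data model columns into auto-mapped reserved identities
# vs. custom attributes. RESERVED_IDENTITIES is an illustrative subset only,
# not mParticle's complete reserved-column list.
RESERVED_IDENTITIES = {"customer_id", "email", "ios_advertising_id"}

def classify_columns(columns):
    """Return (auto-mapped identity columns, custom attribute columns)."""
    identities = [c for c in columns if c in RESERVED_IDENTITIES]
    custom = [c for c in columns if c not in RESERVED_IDENTITIES]
    return identities, custom

identities, custom = classify_columns(["customer_id", "firstname", "propensity_to_buy"])
```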

Example database and SQL query

Below are several example data tables and a SQL query that creates a data model based on them:

Table name: mp.demo.attr

Columns: date_updated, customer_id, firstname

Table name: mp.demo.calc

Columns: propensity_to_buy, customer_id

Table name: mp.demo.favs

Columns: value, customer_id

Example SQL query:

SELECT
    a.date_updated,
    a.customer_id,
    a.firstname AS "$firstname",
    c.propensity_to_buy AS "propensity_to_buy",
    ARRAY_AGG(f.value) WITHIN GROUP (ORDER BY f.value ASC) AS "favorite_categories"
FROM mp.demo.attr a
JOIN mp.demo.calc c ON a.customer_id = c.customer_id
JOIN mp.demo.favs f ON a.customer_id = f.customer_id
GROUP BY
    a.date_updated,
    a.customer_id,
    a.firstname,
    c.propensity_to_buy

The SQL query above performs several selections and operations on the data in the sample tables:

  1. The columns date_updated, customer_id, and firstname are selected from the database table mp.demo.attr.
  2. The column firstname maps to the mParticle attribute firstname using the AS SQL operator. date_updated and customer_id are automatically mapped to their corresponding reserved attributes in mParticle.
  3. The column propensity_to_buy in the table mp.demo.calc maps to the mParticle attribute propensity_to_buy.
  4. The aggregate function ARRAY_AGG() creates an array named favorite_categories containing each user’s favorite categories in ascending order from the table mp.demo.favs.

Note that the query joins the three tables on their shared customer_id column. The result of the query as it exists in mParticle would then be:

mParticle attribute names: date_updated, customer_id, firstname, propensity_to_buy, favorite_categories

5 Create data mapping

Use the dropdown menu titled Type of data to sync to select either User Attributes & Identities Only or Events, depending on whether you want to ingest user data or event data.

User Attributes & Identities Only

If you selected User Attributes & Identities Only, then the data model you created in step 4 maps your ingested data to their corresponding fields in mParticle.

Verify that each column under Column in Warehouse maps to the correct attribute under mParticle User Profile Attribute.

To make changes to your user data mapping:

  1. Click Back.
  2. Modify and rerun your SQL query according to how you want your user data to map to mParticle attributes.
  3. After making changes to, and re-running, your SQL query, click Next.
  4. Review your new mappings, and click Next.

Events

If you selected Events, then you must manually create mappings between each ingested field and its corresponding field in mParticle.

To create a mapping:

  1. Click Add Mapping to add a new mapping.
  2. In the New Mapping modal, select the type of mapping you want to use from the Mapping Type dropdown:

    • Column - maps a column from your database to a field in mParticle
    • Static - maps a static value that you define to a field in mParticle
    • Ignore - prevents a field that you specify from being ingested into mParticle

The next steps vary depending on the mapping type you select:

Column mapping
  1. Under Column in warehouse, select the name of the column in your database you are mapping fields from.
  2. Under Field in mParticle, select the field in mParticle you are mapping fields to.
  3. Click Save.

Static mapping
  1. Under Input Value select either String, Number, or Boolean for the data type of the static value you are mapping.
  2. In the entry box, enter the static value you are mapping.
  3. Under Field in mParticle, select the field you want to map your static value to.
  4. Click Save.

Ignore mapping
  1. Under Column in warehouse, select the name of the column that you want your pipeline to ignore when ingesting data.
  2. Click Save.

To add additional mappings, click Add Mapping. You must create a mapping for every column you selected in your data model.

When you have finished creating your mappings, click Next.
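The three mapping types can be sketched conceptually as rules applied to each ingested row. This is an illustrative model of the behavior described above, not mParticle's implementation, and the field names in the example are made up.

```python
# Conceptual sketch of the three mapping types (Column, Static, Ignore)
# applied to one warehouse row. Field and column names are illustrative.
def apply_mappings(row, mappings):
    """Produce an mParticle-bound record from a warehouse row."""
    record = {}
    for m in mappings:
        if m["type"] == "column":
            record[m["field"]] = row[m["column"]]    # copy a warehouse value
        elif m["type"] == "static":
            record[m["field"]] = m["value"]          # fixed value you define
        # "ignore" mappings simply leave the column out of the output
    return record

mappings = [
    {"type": "column", "column": "plan_tier", "field": "subscription_plan"},
    {"type": "static", "field": "source", "value": "warehouse"},
    {"type": "ignore", "column": "internal_notes"},
]
record = apply_mappings({"plan_tier": "pro", "internal_notes": "skip me"}, mappings)
```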

6 Set sync settings

The sync frequency settings determine when the initial sync with your database will occur, and how frequently any subsequent syncs will be executed.

6.1 Select environment

Select either Prod or Dev depending on whether you want your warehouse data sent to the mParticle development or production environments. This setting determines how mParticle processes ingested data.

6.2 Select input protection level

Input protection levels determine how data ingested from your warehouse can contribute to new or existing user profiles in mParticle:

  • Create & Update: the default setting for all inputs in mParticle. This setting allows ingested user data to initiate the creation of a new profile or to be added to an existing profile.
  • Update Only: allows ingested data to be added to existing profiles, but not initiate the creation of new profiles.
  • Read Only: prevents ingested data from updating or creating user profiles.

To learn more about these settings and how they can be used in different scenarios, see Input protections.
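The effect of each protection level on profile writes can be summarized in a small decision function. This mirrors the behavior described above as a sketch only; it is not mParticle's implementation.

```python
# Sketch of how the three input protection levels gate profile writes.
# This mirrors the documented behavior; it is not mParticle's implementation.
def allowed_action(protection_level, profile_exists):
    """Return "create", "update", or "drop" for an ingested user record."""
    if protection_level == "create_update":
        return "update" if profile_exists else "create"
    if protection_level == "update_only":
        return "update" if profile_exists else "drop"
    if protection_level == "read_only":
        return "drop"
    raise ValueError(f"unknown level: {protection_level}")
```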

6.3 Select sync mode and schedule

There are two sync modes: incremental and full.

  • Incremental: Use this sync mode to ingest only data that has changed or been added between sync runs, as indicated by the warehouse column you use as an iterator. The first run of an incremental sync is always a full sync.
  • Full: Use this sync mode to sync all data from your warehouse each time you execute a sync run. Use caution when selecting this sync mode, as it can incur very high costs due to the volume of data ingested.
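Conceptually, an incremental sync narrows the data model query using the iterator column. The sketch below illustrates the idea; the clause shape and quoting are made up and are not the exact SQL mParticle generates.

```python
# Conceptual sketch: how an incremental sync narrows the data model query
# using the iterator column. The generated clause is illustrative only,
# not the exact SQL mParticle produces.
def incremental_query(data_model_sql, iterator, last_value):
    """Wrap the data model so only rows past the last-synced iterator value load."""
    if last_value is None:  # first run: always a full sync
        return data_model_sql
    return (
        f"SELECT * FROM ({data_model_sql}) AS model "
        f"WHERE model.{iterator} > '{last_value}'"
    )

sql = incremental_query("SELECT * FROM mp.demo.attr", "date_updated", "2023-11-15")
```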

The remaining setting options change depending on the mode you select from the Sync mode dropdown menu. Navigate between the two configuration options using the tabs below:

Following are the sync settings for incremental syncs:

The main difference between full and incremental sync configurations is the use of an iterator field for incremental syncs. Both full and incremental syncs support all three schedule types (Interval, Once, and On Demand).

Iterator

Select the column name from your warehouse using the Iterator dropdown menu. The options are generated from the SQL query you ran when creating your data model.

mParticle uses the iterator to determine which data to include and exclude during each incremental sync.

Iterator data type

Select the iterator data type using the Iterator data type dropdown. This value must match the datatype of the iterator as it exists in your warehouse.

Sync schedule type

Select one of the following schedule types:

  • Interval: for recurring syncs that are run automatically based on a set frequency.
  • Once: for a single sync between a warehouse and mParticle that will not be repeated.
  • On Demand: for a sync that can run multiple times, but must be triggered manually.

Frequency of sync

Sync frequencies can be either Hourly, Daily, Weekly, or Monthly.

The date and time you select for your initial sync is used to calculate the date and time for the next sync. For example, if you select Hourly for your frequency and 11/15/2023 07:00 PM UTC for your initial sync, then the next sync will occur at 11/15/2023 08:00 PM UTC.

Date & time of sync

Use the date and time picker to select the calendar date and time (in UTC) for your initial sync. Subsequent interval syncs will be scheduled based on this initial date and time.

Following are the sync settings for full syncs:

The main difference between full and incremental sync configurations is the use of an iterator field for incremental syncs. Both full and incremental syncs support all three schedule types (Interval, Once, and On Demand).

Sync schedule type

Select one of the following schedule types:

  • Interval: for recurring syncs that are run automatically based on a set frequency.
  • Once: for a single sync between a warehouse and mParticle that will not be repeated.
  • On Demand: for a sync that can run multiple times, but must be triggered manually.

Frequency of sync

Sync frequencies can be either Hourly, Daily, Weekly, or Monthly.

The date and time you select for your initial sync is used to calculate the date and time for the next sync. For example, if you select Hourly for your frequency and 11/15/2023 07:00 PM UTC for your initial sync, then the next sync will occur at 11/15/2023 08:00 PM UTC.

Date & time of initial sync

Use the date and time picker to select the calendar date and time (in UTC) for your initial sync. Subsequent interval syncs will be scheduled based on this initial date and time.
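The schedule arithmetic described above (next sync = initial date/time plus one frequency step, in UTC) can be sketched as follows. Monthly is omitted here because calendar months vary in length and need date arithmetic beyond a fixed timedelta.

```python
# Sketch of the documented schedule arithmetic: the next interval sync is
# the initial date/time plus one frequency step (UTC). Monthly is omitted
# because calendar months vary in length.
from datetime import datetime, timedelta

FREQUENCIES = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_sync(initial, frequency):
    """Return the date/time of the sync after `initial`."""
    return initial + FREQUENCIES[frequency]

# The doc's example: hourly syncs starting 11/15/2023 07:00 PM UTC
nxt = next_sync(datetime(2023, 11, 15, 19, 0), "hourly")
```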

6.4 Sync historical data

The value you select for Sync Start Date determines how much historical data mParticle ingests from your warehouse in your initial sync. When determining how much historical data to ingest, mParticle uses the column in your database you selected as the Timestamp field in the Create Data Model step.

After your initial sync begins, mParticle begins ingesting any historical data. If mParticle hasn’t finished ingesting historical data before the time a subsequent sync is due to start, the subsequent sync is still executed, and the historical data continues syncing in the background.

Historical data syncing doesn’t contribute to any rate limiting on subsequent syncs.

After entering your sync settings, click Next.

7 Review

mParticle generates a preview for Data Warehouse syncs that have been configured, but not yet activated. Use this page and the sample enriched user profiles to confirm the following:

  • Your data model correctly maps columns from your database to mParticle attributes
  • Your sync is scheduled to occur at the correct interval
  • Your initial sync is scheduled to occur at the correct time
  • Your initial sync includes any desired historical data in your warehouse

After reviewing your sync configuration, click Activate to activate your sync.

View and manage existing warehouse syncs

  1. Log into your mParticle account.
  2. Navigate to Setup > Inputs in the left-hand nav bar and select the Feeds tab.
  3. Any configured warehouse syncs are listed on this page, grouped by warehouse provider. Expand a warehouse provider to view and manage a sync.

    • To edit an existing sync, select it from the list under a warehouse provider. This loads the Warehouse Sync setup wizard, where you can modify your sync settings.
    • Connect a sync to an output by clicking the green + icon under Connected Outputs.
    • Configure rules for a sync by clicking the + Setup button under Rules Applied.
    • Delete a sync configuration by clicking the trash icon under Actions.
    • To add a new sync for a warehouse provider, click the + icon next to the provider.

Manage a sync

To view details for an existing sync, select it from the list of syncs. A summary page is displayed, showing the current status (Active or Paused), sync frequency, and a list of recent or in-progress syncs.

To pause a sync, click Pause Sync. Paused syncs will only resume running on their configured schedule after you click Resume.

To run an on-demand sync, click Run Sync under Sync Frequency.

Use the Data Model, Mapping, and Settings tabs to view and edit your sync configuration details. Clicking Edit from any of these tabs opens the respective step of the setup wizard where you can make and save your changes.
