Warehouse Sync API Tutorial

Use this tutorial to configure your first Warehouse Sync pipeline using the mParticle Postman collection, and use the data from your pipeline to create an mParticle Audience. Postman is an easy and friendly environment for both developers and non-developers to use APIs.

This tutorial is not a complete guide to all Warehouse Sync features. For a complete API reference, see the Warehouse Sync API Reference.

Prerequisites

  • Install the latest version of the Postman desktop application. You can download Postman from https://www.postman.com/downloads/.
  • Fork the mParticle Warehouse Sync Postman Collection to your workspace:

    Run in Postman

    A copy of the Warehouse Sync environment is included. You can download it again here.

Step 1. mParticle setup

Create Platform API credentials

You need credentials to use the Platform API to create Warehouse Sync pipelines.

To create a Platform API credential:

  1. After signing in to the mParticle app as a user with the Admin role, click the gear icon in the bottom left corner.

  1. Click Platform.
  2. Select the API Credentials tab.
  3. Click the green Add Credential button in the top right.
  4. Give the credential a name, check the Platform checkbox and select Admin from the Permissions dropdown menu. Click the green Save button.

    Add api credential screen
  5. Copy the Client ID and Client Secret values for use in a later step.
  6. Click Done.

Step 2. Data warehouse setup

Work with your warehouse administrator or IT team to ensure your warehouse is reachable and accessible by mParticle.

  1. Whitelist the mParticle IP address range so your warehouse will be able to accept inbound API requests from mParticle.
  2. Ask your database administrator to perform the following steps in your warehouse to create a new role that mParticle can use to access your database. Select the correct tab for your warehouse (Snowflake, Google BigQuery, Amazon Redshift, or Databricks) below.

Run the following commands from your Snowflake instance:

USE ROLE ACCOUNTADMIN;

// mParticle recommends creating a unique role for warehouse sync 
CREATE ROLE IF NOT EXISTS {{role_name}};

GRANT USAGE ON WAREHOUSE {{warehouse}} TO ROLE {{role_name}};
GRANT USAGE ON DATABASE {{database}} TO ROLE {{role_name}};
GRANT USAGE ON SCHEMA {{database}}.{{schema}} TO ROLE {{role_name}};

// Grant SELECT privilege on any tables/views mP needs to access
GRANT SELECT ON TABLE {{database}}.{{schema}}.{{table}} TO ROLE {{role_name}};

// mParticle recommends creating a unique user for mParticle
CREATE OR REPLACE USER {{user_name}} PASSWORD = "{{unique_secure_password}}";
GRANT ROLE {{role_name}} TO USER {{user_name}};

CREATE OR REPLACE STORAGE INTEGRATION {{storage_integration_name}}
          WITH TYPE = EXTERNAL_STAGE
          STORAGE_PROVIDER = S3
          ENABLED = TRUE
          STORAGE_AWS_ROLE_ARN = "arn:aws:iam::{{mp_pod_aws_account_id}}:role/ingest-pipeline-data-external-{{mp_org_id}}-{{mp_acct_id}}"
          STORAGE_AWS_OBJECT_ACL = "bucket-owner-full-control"
          STORAGE_ALLOWED_LOCATIONS = ("s3://{{mp_pod}}-ingest-pipeline-data/{{mp_org_id}}/{{mp_acct_id}}");


GRANT USAGE ON INTEGRATION {{storage_integration_name}} TO ROLE {{role_name}};

Where:

  • role_name: The ID of the role mParticle will assume while executing SQL commands on your Snowflake instance. mParticle recommends creating a unique role for warehouse sync.
  • warehouse: The ID of the Snowflake virtual warehouse compute cluster where SQL commands will be executed.
  • database: The ID of the database in your Snowflake instance from which you want to sync data.
  • schema: The ID of the schema in your Snowflake instance containing the tables you want to sync data from.
  • table: The ID of the table containing data you want to sync. Grant SELECT privileges on any tables/views mParticle needs to access.
  • user_name: The ID of the user mParticle will log in as while executing SQL commands on your Snowflake instance. mParticle recommends creating a unique role for warehouse sync.

    • If creating a new user, use a new unique_secure_password.
  • storage_integration_name: The ID of a Snowflake external storage integration allowing mParticle to unload data from your Snowflake instance to an S3 storage bucket.
  • mp_pod: The mParticle region ID of your data hosting location, one of US1, US2, AU1, or EU1.
  • mp_pod_aws_account_id: The mParticle provided ID for the data hosting location where your organization resides. Use the corresponding value for your mParticle instance:

    • US1: 338661164609
    • US2: 386705975570
    • AU1: 526464060896
    • EU1: 583371261087
  • mp_org_id: The mParticle provided ID of your organization where this connection will be stored. This can be found from your API client setup in step 1.
  • mp_acct_id: The mParticle provided ID of the account where this connection will be stored. This can be found from your API client setup in step 1.

Create a new service account for mParticle

  1. Go to console.cloud.google.com, log in, and navigate to IAM & Admin > Service Accounts.
  2. Select Create Service Account.
  3. Enter a new identifier for mParticle in Service account ID. In the example below, the email address is the service account ID. Save this value for your Postman setup.

  1. Under Grant this service account access to project, select BigQuery Job User under the Role dropdown menu, and click DONE.

  1. Select your new service account and navigate to the Keys tab.
  2. Click ADD KEY and select Create new key. The value for service_account_key will be the contents of the generated JSON file. Save this value for your Postman setup.

Identify your BigQuery warehouse details

Navigate to your BigQuery instance from console.cloud.google.com.

  • Your project_id is the first portion of Dataset ID (the portion before the .). In the example above, it is mp-project.
  • Your dataset_id is the second portion of Dataset ID (the portion immediately after the .) In the example above, it is mp-dataset.
  • Your region is the Data location. This is us-east4 in the example above.

Grant access to the dataset in BigQuery

  1. From your BigQuery instance in console.cloud.google.com, click Sharing and select Permissions.
  2. Click Add Principle.
  3. Assign two Roles, one for BigQuery Data Viewer, and one for BigQuery User.
  4. Click Save.

  1. Navigate to your AWS Console, log in with your administrator account, and navigate to your Redshift cluster details.
  2. Run the following SQL statements to create a new user for mParticle, grant the necessary schema permissions to the new user, and grant the necessary access to your tables/views.
-- Create a unique user for mParticle
CREATE USER {{user_name}} WITH PASSWORD '{{unique_secure_password}}'

-- Grant schema usage permissions to the new user
GRANT USAGE ON SCHEMA {{schema_name}} TO {{user_name}}

-- Grant SELECT privilege on any tables/views mP needs to access to the new user
GRANT SELECT ON TABLE {{schema_name}}.{{table_name}} TO {{user_name}}
  1. Navigate to the Identity And Access Management (IAM) dashboard, select Roles under the left hand nav bar, and click Create role.

  1. In Step 1 Select trusted entity, click AWS service under Trusted entity
  2. Select Redshift from the dropdown menu titled “Use cases for other AWS services”, and select Redshift - Customizable. Click Next.

  1. In Step 2 Add permissions, click Create Policy.

  1. Click JSON in the policy editor, and enter the following permissions before clicking Next.
  • Replace {mp_pod_aws_account_id} with one of the following values according to your mParticle instance’s location:

    • US1 = 338661164609
    • US2 = 386705975570
    • AU1 = 526464060896
    • EU1 = 583371261087
{
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Resource": "arn:aws:iam::{{mp_pod_aws_account_id}}:role/ingest-pipeline-data-external-{{mp_org_id}}-{{mp_acct_id}}",
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}

  1. Enter a meaningful name for your new policy, such as mparticle_redshift_assume_role_policy, and click Create policy.

  1. Return to the Create role tab, click the refresh button, and select your new policy. Click Next.

  1. Enter a meaningful name for your new role, such as mparticle_redshift_role, and click Create role.

Your configuration will differ between Amazon Redshift and Amazon Redshift Serverless. To complete your configuration, follow the appropriate steps for your use case below.

Make sure to save the value of your new role’s ARN. You will need to use this when setting up Postman in the next section.

Amazon Redshift (not serverless)

  1. Navigate to your AWS Console, then navigate to Redshift cluster details. Select the Properties tab.

  1. Scroll to Associated IAM roles, and select Associate IAM Roles from the Manage IAM roles dropdown menu.

  1. Select the new role you just created. The name for the role in this example is mparticle_redshift_role.

Amazon Redshift Serverless

  1. Navigate to your AWS Console, then navigate to your Redshift namespace configuration. Select the Security & Encryption tab, and click Manage IAM roles.

  1. Select Associate IAM roles from the Manage IAM roles dropdown menu.
  2. Select the new role you just created. The name for the role in this example is mparticle_redshift_role.

Warehouse Sync uses the Databricks-to-Databricks Delta Sharing protocol to ingest data from Databricks into mParticle.

Complete the following steps to prepare your Databricks instance for Warehouse Sync.

1. Enable Delta Sharing

  1. Log into your Databricks account and navigate to the Account Admin Console. You must have Account Admin user privileges.
  2. Click Data from the left hand nav bar, and select the metastore you want to ingest data from.
  3. Select the Configuration tab. Under Delta Sharing, check the box labeled “Enable Delta Sharing to allow a Databricks user to share data outside their organization”

2. Configure a Delta Sharing recipient for mParticle

  1. From the Unity Catalog explorer in your Databricks account, expand the Delta Sharing dropdown menu in the left hand nav, and select Shared by me.
  2. Click New Recipient in the top right corner.
  3. Within the Create a new recipient window, enter mParticle_{YOUR-DATA-POD} under Recipient name where {YOUR-DATA-POD} is either us1, us2, eu1, or au1 depending on the location of the data pod configured for your mParticle account.
  4. In Sharing identifier, enter one of the following identifiers below, depending on the location of your mParticle account’s data pod:

    • US1: aws:us-east-1:e92fd7c1-5d24-4113-b83d-07e0edbb787b
    • US2: aws:us-east-1:e92fd7c1-5d24-4113-b83d-07e0edbb787b
    • EU1: aws:eu-central-1:2b8d9413-05fe-43ce-a570-3f6bc5fc3acf
    • AU1: aws:ap-southeast-2:ac9a9fc4-22a2-40cc-a706-fef8a4cd554e

3. Share your Databricks tables and schema with your new Delta Sharing recipient

  1. From the Unity Catalog explorer in your Databricks account, expand the Delta Sharing dropdown menu in the left hand nav, and select Share Data.
  2. Click Share Data in the top right.
  3. Within the Create a new share window, enter mparticle_{YOUR-MPARTICLE-ORG-ID}_{YOUR-MPARTICLE-ACCOUNT-ID} under Share name where {YOUR-MPARTICLE-ORG-ID} and {YOUR-MPARTICLE-ACCOUNT-ID} are your mParticle Org and Account IDs.

    • To find your Org ID, log into the mParticle app. View the page source. For example, in Google Chrome, go to View > Developer > View Page Source. In the resulting source for the page, look for “orgId”:xxx. This number is your Org ID.
    • Follow a similar process to find your Account ID (“accountId”:yyy) and Workspace ID (“workspaceId”:zzz).

  1. Click Create.
  2. From the Share details page, select the Assets tab and use the Add assets dropdown to add the schema and tables you want to send to mParticle.

  1. From the Recipients tab, make sure to add your new mParticle recipient you created in step 2.

4. Find and save your Databricks provider name

  1. Navigate to the Account Admin Console for your Databricks account.
  2. Click Data in the left hand nav bar.
  3. Select the metastore you are ingesting data from.
  4. Save the value displayed under Organization name. You will use this value for your provider name when creating the connection between the mParticle Warehouse Sync API and Databricks.

Unsupported data types between mParticle and Databricks

Databricks Delta Sharing does not currently support the TIMESTAMP_NTZ data type.

Other data types that are not currently supported by the Databricks integration for Warehouse Sync (for both user and events data) include:

If you are ingesting events data through Warehouse Sync, the following data types are unsupported:

While multi-dimensional, or nested, arrays are unsupported, you can still ingest simple arrays with events data.

Step 3. Postman setup

Once you have installed Postman, configure the collection environment settings and variables.

Update Postman environment settings

  1. Ensure you forked the mParticle Warehouse Sync API Postman Collection, as described in the Prerequisites section of this tutorial. In Postman, click the Environments tab from the left navigation menu.
  2. If you successfully forked the Warehouse Sync API collection, you’ll see it in the list of Environment configurations. You can rename it to something more meaningful by right-clicking on the next to the name and choosing the Rename option.

    Postman page with environment displayed
  3. Replace the the placeholders (replace_me) with the correct values for your environment. You must update the values under column labeled Current value.

    • Replace PLATFORM_API_CLIENT_ID and PLATFORM_API_CLIENT_SECRET with your new Platform API credentials.
    • Replace WORKSPACE_ID, ACCOUNT_ID, and ORG_ID with the corresponding values for your mParticle account.

      • To find your Org ID, log into the mParticle app. View the page source. For example, in Google Chrome, go to View > Developer > View Page Source. In the resulting source for the page, look for “orgId”:xxx. This number is your Org ID.
      • Follow a similar process to find your Account ID (“accountId”:yyy) and Workspace ID (“workspaceId”:zzz).
    • Replace POD with the regional pod your mParticle account is deployed on. Look at the URL in your browser where you are signed into mParticle. The POD is one of the following values: US1, US2, EU1, AU1.
  4. Enter the data warehouse usernames and passwords you saved from “Step 2. Data Warehouse Setup.” according to the data warehouse you are using:

    • For Snowflake, replace SNOWFLAKE_PASSWORD and SNOWFLAKE_STORAGE_INTEGRATION with the values you saved in step 2. Please refer to the Snowflake documentation to determine your account_identifier and region.
    • For BigQuery, replace BIG_QUERY_SERVICE_ACCOUNT_ID with the service account ID you used in BigQuery, and BIG_QUERY_SERVICE_ACCOUNT_KEY with the key from the generated JSON file in step 2.
    • For Redshift, replace REDSHIFT_USER, REDSHIFT_PASSWORD, and REDSHIFT_AWS_IAM_ROLE_ARN with the values created in step 2.
  5. After updating all necessary values, run COMMAND-S (or CTRL-S) to save your changes.

Update the Postman collection

  1. Ensure you forked the mParticle Warehouse Sync API Postman Collection as described in the Prerequisites section. In Postman, click the Collections tab on the left hand navigation.
  2. Once successfully forked, you’ll see the collection in the list of available collections.
  3. Click Warehouse Sync API, then select the Variables tab.
  4. Replace replace_me placeholders with the values corresponding to your environment. Ensure you update the values in the Current value column.

    • Replace INGEST_PIPELINE_SLUG and INGEST_PIPELINE_NAME with the slug and name you want to use to identify your new pipeline.
    • Replace SQL_QUERY with the database SQL query mParticle will use to retrieve data from your warehouse. SQL is a powerful language and you can use advanced expressions to filter, aggregate, join, etc. your data. Work with your database administrator if you need help crafting the right SQL query:

      • Your query can contain a timestamp column that mParticle will use to keep track of which rows need to be loaded.
      • Your query should contain one or more user identity columns that mParticle will use to perform identity resolution to ensure that data ends up on the correct user profile.
      • As part of the SQL query, you must specify how columns in the query will map to attributes on a user’s profile. You do this by using column aliasing in SQL. For example, in the following query, the column cid in Snowflake is being mapped to the mParticle attribute customer_id.

        sql example

        If you don’t provide an alias, mParticle will use the name of the column in your database. If an attribute of this name does not already exist on the user’s profile, mParticle will create a new attribute with this name.

    • Before using the query in mParticle, test the query outside of mParticle to ensure it returns the data you expect.
    • To learn more checkout the Warehouse Sync SQL reference
  5. After updating all necessary values, run COMMAND-S (or CTRL-S) to save your changes.

Step 4: Create your first Warehouse Sync pipeline

Creating a Warehouse Sync pipeline requires completing the following steps:

  1. Create a partner feed.
  2. Create a connection.
  3. Create a data model.
  4. Create a field transformation
  5. Create the pipeline.

After configuration, you can monitor the pipeline.

Create the partner feed

First, you must create a data feed by submitting a request to the Feeds API. mParticle uses this feed for rules and connections in your workspace. You must provide a value for module_name that corresponds with the data warehouse you are using.

Valid values for module_name are:

  • Redshift
  • BigQuery
  • Snowflake
  • Databricks

These values are case sensitive. For example, if you use snowflake instead of Snowflake, you will encounter errors later in your configuration.

  1. In Postman, ensure the environment drop-down is pointed to the Environment configuration you recently imported.
  2. Expand the Warehouse Sync Collection and open the Feeds folder.
  3. Click Create Warehouse Sync Feed.
  4. Click the Body tab to see the information you will pass to the API in order to create the feed.

    Postman page showing body of post request
    • module_name must be one of Snowflake, BigQuery, or Redshift

    Values surrounded by double braces (for example: {{WORKSPACE_ID}}) are taken from the variables you updated in previous steps of this tutorial.

  5. Verify all values you changed, and click the blue Send button. mParticle returns a success message with details about your new feed. If the request fails, mParticle returns an error message with additional information.

Create the connection

The next step is to create a connection between mParticle and your data warehouse.

  1. In Postman, ensure the environment drop-down is pointed to the Environment configuration you recently imported.
  2. Expand the Warehouse Sync Collection and open the Connections folder.
  3. Click POST Create Snowflake/BigQuery/Redshift Connection, selecting the appropriate endpoint for the data warehouse you are using.
  4. Click the Body tab and replace each "replace_me" placeholder with the correct value for your specific warehouse.

    The values in {{Sample Values}} are taken from the environment variables you updated in earlier steps. Make sure these values match the values for your organization’s data warehouse. You may need to work with your database administrator to ensure you have the correct values.

  5. Verify all values you changed, and click the blue Send button. mParticle returns a success message with details about the configuration you just created. If the request fails, mParticle returns an error message with additional information.

    Postman page showing body of post request

Create the data model

The next step is to create a data model. A data model is a SQL query that mParticle sends to your warehouse specifying exactly what columns, rows, and fields of data you want to ingest through your pipeline.

In the case of pipelines that ingest user profile data, the data model is also responsible for mapping ingested data fields to mParticle user attributes.

To create a data model using the Warehouse Sync API:

  1. In Postman, expand Warehouse Sync API and click Data Models.
  2. Click Create Data Model.
  3. Select the Body tab and enter your data model.

    Postman page showing body of post request
  4. The values in {{Sample Values}} are taken from the variables you updated in previous steps.
  5. Once you are confident all values are correct, click the blue Send button. mParticle returns a success message with details about the data model you just created. If the request fails, mParticle returns an error message with additional information.

For more details about using SQL to create a data model, with example queries and best practices, see the Warehouse Sync SQL Reference.

Create the field transformation

You can create a field transformation to specify exactly how fields in your database should map to fields in the mParticle JSON schema.

For detailed instructions on how to create a field transformation, read Event Data Mapping.

To use your field transformation, add the field_transformation_id to the request body of your API call when creating your pipeline in the next step.

Create the pipeline

The final step is to create the pipeline. You pass in the connection and data model configurations previously created along with your sync mode and scheduling settings.

  1. In Postman, expand Warehouse Sync Collection and open the Pipelines folder.
  2. Click Create Pipeline.
  3. Select the Body tab, and update the sync_mode and schedule settings as follows:

    • For sync_mode:

      • Set type to either incremental or full
      • If you set type to incremental, Set iterator_field to the name of the column in your sql query mParticle will monitor to track changes that need to be synced
      • If you set type to incremental, Set iterator_data_type to the datatype of your iterator

        • Valid values for Snowflake are: timestamp_ltz, timestamp_ntz, timestamp_tz, datetime, date, timestamp_unixtime_ms, timestamp_unixtime_s.
        • Valid values for Google BigQuery are datetime, date, timestamp, timestamp_unixtime_ms, timestamp_unixtime_s.
        • Valid values for AWS Redshift are date, timestamp, timestamptz, timestamp_unixtime_ms, timestamp_unixtime_s.
        • Valid values for Databricks are date, timestamp, timestamp_unixtime_ms, timestamp_unixtime_s.
  • For schedule:

    • Set type to interval, once, or on_demand
    • If you set type to interval, set frequency to hourly, daily, weekly, or monthly
    • Set start to the date-time value you want recurring syncs to begin
<img src="/images/dwi/api-v2-tutorial/5.png" alt="Postman page showing request results"> 

The values in `{{Sample Values}}` will be taken from the variables you updated in previous steps. You can optionally update the `environment` variable. It is currently set to target your mParticle development environment, but you can change it to target your production environment.
  1. Set the field_transformation_id to the ID of a a custom field transformation to map data from your warehouse to specific fields in the mParticle JSON schema. To learn more about custom field transformations and how to create them, see Field Transformations API.
  2. You can associate an optional data plan with your pipeline to help improve the quality of, and grant you more control over, the data imported to mParticle. In the request body of API request, include the body parameters plan_id and plan_version and set these to the values of the data plan and version you want to use. You must use a data plan version that is active and exists in the mParticle workspace you are using.
  3. Once you are confident all values are correct, click the blue Send button. If successful, mParticle returns a success message with details about the configuration you just created. If it was not successful, mParticle returns an error message with additional information.

Monitor the pipeline

Once a pipeline has been created, you can monitor its status using the additional requests provided in the Postman collection.

  1. In Postman, expand the Warehouse Sync Collection and open the Pipelines folder.
  2. Click Get Pipeline Report.
  3. Click the blue Send button.
  4. mParticle sends a detailed message with the pipeline’s current status. After creating a pipeline, there is an approximate one-minute delay until the pipeline is created in the mParticle backend, so submitting a Get Pipeline Report request results in a Not Founderror. Try again after several minutes.

    Postman page showing success status
  5. While a pipeline is ingesting data, you can monitor it in mParticle as you would with any other input. From mParticle, go to Data Master > Live Stream to inspect the incoming data from Snowflake.

    mParticle live stream page
  6. Once the data is ingested, the data points appear on the user’s profile. Go to Activity > User Activity and look up a sample profile. If the attributes do not appear as expected, validate the mapping you provided in the SQL query provided earlier.

    mParticle live stream page with data points

Step 5. Activate the data

Now that the data has been loaded into mParticle, it’s time to put it to use by creating an audience using the newly ingested data and sending it to a downstream integration.

Create an audience

Create an audience that uses one of the attributes we ingested as a qualifying criteria for it:

  1. In the mParticle app, go to Audiences > Standard.
  2. Click the green New Standard Audience in the upper right corner.
  3. Give the audience a name.
  4. Select a date range, or choose “All available data.”
  5. Select the Warehouse Sync feed you created in mParticle Setup.
  6. Add an audience criteria that leverages one of the data points you ingested from your warehouse. In the example below, we only want to consider users who have a propensity-to-buy score that’s greater than 0.7.

    mparticle page showing user attribute
  7. Add any other criteria you want to be considered. Click Save As Draft when you are done.
  8. Click Calculate to run the audience calculation.

Connect the audience to an output

In the final step, we will send users who qualified for this audience to Iterable for further targeting. If your organization doesn’t use Iterable, pick a different integration that your organization uses.

After the audience has been fully calculated, connect it to an output:

  1. If you aren’t still there, go to Audiences > Standard.
  2. In the row for the audience you just created, click the Connect button in the Actions column.

    mparticle audiences page
  3. Click Connect Output.

    mparticle audience connect page
  4. Click the Add Output green button.
  5. Click the Iterable tile in the integrations directory, or pick another integration your organization uses if you don’t have Iterable.
  6. Select the Audience checkbox and click Configure.

    mparticle audience setup page
  7. Enter a configuration name, your Iterable API key, and the user ID that is used to identify users, and then click Save & Open in Connections.

    mparticle audience configuration page
  8. Provide the ID of the list in Iterable that the user data should be loaded into, and click Add Connection.

    mparticle connect output page
  9. Click the Send button to send the audience to Iterable.

    mparticle standard audience page
  10. Data starts flowing to Iterable.

    mparticle standard audience page showing data being sent

Now you can open the audience in Iterable.

Was this page helpful?