AWS S3 (Snowplow Schema)

Prerequisites:

Your Snowplow unified log must be stored in an S3 bucket, and you must be able to apply a policy to that bucket that grants Analytics programmatic access to it.

If you need additional enrichments, such as joins with user property tables or custom derived user_ids, please contact us.

Instructions:

Adding a Data Source In Analytics

  1. In Analytics, click on the gear icon and select Project Settings.

    Project Settings screen

  2. Select the Data Sources tab.

    Data Sources tab

  3. Select New Data Source.

    New Data Source screen

  4. Select Connect via Data Warehouse or Lake.

    Connect via Data Warehouse or Lake

  5. Select S3 as your data connection and Snowplow as the connection schema, then click Connect.

    Connect via Data Warehouse or Lake

  6. You should see the S3 + Snowplow Overview screen. Click Next.

    S3 + Snowplow Overview screen

Connection Information

Connection Information screen

  1. Sign in to the AWS Management Console.
  2. From the Services dropdown, select S3 (listed under Storage).

    S3 selection screen

  3. Click on the bucket that contains your Snowplow data.
  4. Enter the Bucket Name into the Analytics UI.

    Bucket Name input screen

  5. Open your bucket and note the folder path that leads to your Snowplow data. Enter that path in the File Path field in the Analytics UI.

    File Path input screen

    In this example, the File Path to enter in Analytics is /main/enriched/good.

  6. Click Next.
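
If you only have the location of your data as an S3 URI, the sketch below shows one way to split it into the two values requested above. It is illustrative only: the function name `split_s3_uri` and the bucket name `example-snowplow-bucket` are made up for this example, and it assumes the File Path field expects a leading slash and no trailing slash, as in the /main/enriched/good example.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket name, file path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Keep a leading slash and drop any trailing slash, matching the
    # /main/enriched/good form used in the example (an assumption).
    file_path = "/" + parsed.path.strip("/")
    return parsed.netloc, file_path


# Hypothetical bucket name, for illustration only.
bucket_name, file_path = split_s3_uri(
    "s3://example-snowplow-bucket/main/enriched/good/"
)
```

Here `bucket_name` is the value for the Bucket Name field and `file_path` is the value for the File Path field.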

Grant Permissions

Grant Permissions screen

  1. Click the box that contains the policy to copy it to your clipboard. You will need it in step 4 of this section.

    Policy box

  2. Go back to the AWS Console. Select the bucket and click on the Permissions tab.

    Permissions tab

  3. Click on Bucket Policy.

    Bucket Policy screen

  4. Paste the policy you copied in step 1 into the editor and click Save.
  5. Click Next in Analytics.
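
The policy to paste is always the one Analytics shows you in step 1. For orientation only, the sketch below builds a generic read-only S3 bucket policy so you can recognize the general shape of what you are saving. Everything specific in it is an assumption: the source does not state which actions the Analytics policy grants, and the principal ARN and bucket name are hypothetical placeholders.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Build a generic read-only bucket policy (illustrative, not the Analytics policy)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "AllowGetObjects",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical values, for illustration only.
policy = read_only_bucket_policy(
    "example-snowplow-bucket",
    "arn:aws:iam::111122223333:role/example-analytics-reader",
)
policy_json = json.dumps(policy, indent=2)
```

Note that bucket-level actions and object-level actions use different resource ARNs, which is why a policy of this kind usually has two statements.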

Event Modeling

Event Modeling screen

  1. In the Structured Event Name section, select the field that Analytics should use to derive event names. Analytics reads this field first; if its value is null, it falls back to the event_name field, and if that is also null, to the event field.

    1. se_action
    2. se_category
    3. se_label
    4. None - Select this option if you’re not using Snowplow’s structured events.
  2. For Timestamp, select the field that represents the time at which the event was performed. If you are unsure, leave it as derived_tstamp.
  3. For Vendor Name, enter the Snowplow vendor names you use so that Analytics can simplify your event property names.
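
The fallback order described in step 1 can be sketched as follows. This is a minimal illustration, not the Analytics implementation: the function name `derive_event_name` is made up, events are modeled as plain dictionaries, and the behavior when None is selected (skipping straight to event_name) is an assumption.

```python
from typing import Optional


def derive_event_name(event: dict, structured_field: Optional[str]) -> Optional[str]:
    """Return the first non-null value among the selected field, event_name, and event."""
    candidates = []
    if structured_field:  # e.g. "se_action"; None means no structured field was selected
        candidates.append(structured_field)
    candidates += ["event_name", "event"]
    for field in candidates:
        value = event.get(field)
        if value is not None:
            return value
    return None


# Structured event: the selected field wins.
structured = {"se_action": "add_to_cart", "event_name": "event", "event": "struct"}
name_a = derive_event_name(structured, "se_action")

# Selected field is null: fall back to event_name.
page_view = {"se_action": None, "event_name": "page_view", "event": "page_view"}
name_b = derive_event_name(page_view, "se_action")
```

With these inputs, `name_a` is "add_to_cart" and `name_b` is "page_view".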

User Identification (Aliasing)

User Identification screen

For more information on User Identification (Aliasing), please refer to this article.

Note: If you do not want to use aliasing, set the Authenticated ID Type to None and click Next.

  1. Select the Type for the Unauthenticated ID:

    1. Atomic - This lets you choose between the domain_userid and network_userid fields, which are part of the standard Snowplow event structure. We typically recommend domain_userid because it is set by a first-party cookie. Click here for more information.
    2. Context - If the unauthenticated ID is part of a Snowplow context, choose this option. Enter the values for Vendor, Name, Version, and Field.
    3. Other - If the unauthenticated ID is in neither of the locations above, specify where we can find it in the data.
  2. Select the Type for Authenticated ID:

    1. Atomic - Enter the name of the field that identifies known users. Typically, this is the user_id field in the raw enriched event archive data.
    2. Context - If the authenticated ID is part of a Snowplow context, choose this option. Enter the values for Vendor, Name, Version, and Field.
    3. Other - If the authenticated ID is in neither of the locations above, specify where we can find it in the data.
    4. None - Choose this option to skip aliasing.
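
To make the Atomic, Context, and None options more concrete, the sketch below looks up an ID in a single event. It is a rough illustration under stated assumptions, not how Analytics reads your data: the `IdSource` class and `resolve_id` function are made up, the event is modeled as a dictionary whose `contexts` key holds self-describing JSON objects (`schema` plus `data`), and the Version is assumed to be entered in full form such as 1-0-0. The vendor `com.example` and context name `user` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdSource:
    type: str          # "atomic", "context", or "none"
    field: str = ""    # field holding the ID
    vendor: str = ""   # context options only
    name: str = ""
    version: str = ""


def resolve_id(event: dict, source: IdSource) -> Optional[str]:
    """Look up an ID in one event according to the chosen source type."""
    if source.type == "none":
        return None
    if source.type == "atomic":
        return event.get(source.field)
    if source.type == "context":
        schema = f"iglu:{source.vendor}/{source.name}/jsonschema/{source.version}"
        for context in event.get("contexts", []):
            if context.get("schema") == schema:
                return context.get("data", {}).get(source.field)
        return None
    raise ValueError(f"unknown source type: {source.type!r}")


# Hypothetical event, for illustration only.
event = {
    "domain_userid": "anon-123",
    "user_id": None,
    "contexts": [
        {"schema": "iglu:com.example/user/jsonschema/1-0-0",
         "data": {"account_id": "acct-42"}},
    ],
}
unauthenticated = resolve_id(event, IdSource("atomic", field="domain_userid"))
authenticated = resolve_id(
    event, IdSource("context", "account_id", "com.example", "user", "1-0-0")
)
```

In this example the unauthenticated ID comes from the atomic domain_userid field, and the authenticated ID comes from a custom context.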

Scheduling

Scheduling screen

  1. Select the Schedule Interval to adjust the frequency at which new data is available in Analytics.
  2. Set the Schedule Time at which data should be extracted from your S3 bucket. All of the data for the interval must be in the bucket by this time; otherwise, Analytics may load partial data.
  3. Select Next.

Waiting for Data

Waiting for Data screen

Advanced Settings

For additional advanced settings such as excluding certain events and properties, please refer to this page.
