The Snowplow Unified Log is stored in an S3 bucket, so you need to write an IAM policy that grants Analytics programmatic access to that bucket.
If there are additional enrichments required, such as joining with user property tables or deriving custom user_ids, please contact us.
To connect your real-time Snowplow data to Analytics, follow the instructions below:
Your AWS Lambda needs to have an Execution Role that allows it to use the Kinesis Stream and CloudWatch. (For more information on setting up IAM Roles, please see the official AWS tutorial.)
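As a rough illustration of what such an Execution Role might allow, the sketch below builds a policy document in Python and prints it as JSON. The action list is modeled on the read and logging permissions found in AWS's managed AWSLambdaKinesisExecutionRole policy, and the stream ARN is a placeholder. Treat all of it as an assumption to check against the AWS tutorial, not as the required policy for this integration.

```python
import json

# Placeholder ARN (assumption): replace with the ARN of your enriched-events stream.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/snowplow-enriched-good"


def build_execution_policy(stream_arn: str) -> dict:
    """Return an IAM policy document allowing Kinesis reads and CloudWatch logging."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read access to the single stream that feeds the Lambda.
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            },
            {
                # Lets the function write its logs to CloudWatch Logs.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_execution_policy(STREAM_ARN), indent=2))
```

The printed JSON can be pasted into the policy editor in the IAM console when creating the role.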
The Lambda function can be created either directly through the AWS Console or through other tools such as the AWS CLI. For this integration, the recommended memory setting is 256 MB. Because the JVM has to cold start when the function is first called on a new instance, you should also set a high timeout value; 90 seconds should be safe.
As with the IAM Role, we will use the AWS Console to get the Lambda function up and running. Make sure you are in the same region as your Kinesis streams.
The Lambda has been created, although it does not do anything yet. We need to provide the code and configure the function:
a. Take a look at the Function code box. In the Handler textbox paste: com.snowplowanalytics.indicative.LambdaHandler::recordHandler
b. From the Code entry type dropdown pick Upload a file from Amazon S3. A textbox labeled S3 Link URL will appear. We are hosting the code through our hosted assets. You will need to choose the S3 bucket in the same region as your AWS Lambda function. For example, if your Lambda is in the us-east-1 region, paste the following URL in the textbox: s3://snowplow-hosted-assets-us-east-1/relays/indicative/indicative-relay-0.4.0.jar
Take a look at this table to pick the right bucket name for your region. Make sure Runtime is Java 8.
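If you prefer to script the function setup rather than click through the Console, the settings above can be collected in one place. The sketch below only builds a configuration dictionary; it does not call AWS. The keys follow the parameter names of boto3's Lambda `create_function` call, and the function name and role ARN are hypothetical placeholders. The us-east-1 URL is the one given above; for other regions, take the bucket name from the table rather than guessing a pattern.

```python
from urllib.parse import urlparse

# URL from the instructions above (us-east-1 only).
JAR_URL = "s3://snowplow-hosted-assets-us-east-1/relays/indicative/indicative-relay-0.4.0.jar"


def build_function_config(jar_url: str, role_arn: str, function_name: str) -> dict:
    """Collect the recommended Lambda settings into a single dict."""
    parsed = urlparse(jar_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3:// URL, got {jar_url!r}")
    return {
        "FunctionName": function_name,  # hypothetical name, choose your own
        "Runtime": "java8",             # Lambda identifier for Java 8
        "Role": role_arn,               # ARN of the Execution Role
        "Handler": "com.snowplowanalytics.indicative.LambdaHandler::recordHandler",
        "Code": {
            "S3Bucket": parsed.netloc,           # bucket part of the URL
            "S3Key": parsed.path.lstrip("/"),    # object key without leading slash
        },
        "MemorySize": 256,  # MB, as recommended above
        "Timeout": 90,      # seconds, generous to cover JVM cold starts
    }


if __name__ == "__main__":
    config = build_function_config(
        JAR_URL,
        role_arn="arn:aws:iam::123456789012:role/indicative-relay-role",  # placeholder
        function_name="indicative-relay",                                # placeholder
    )
    print(config)
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed as `boto3.client("lambda").create_function(**config)`.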
Below Function code settings you will find a section called Environment variables.
a. In the first row, first column (the key), type INDICATIVE_API_KEY. In the second column (the value), paste your API Key.
b. The relay lets you configure the following filters:
- UNUSED_EVENTS: events that will not be relayed to Analytics;
- UNUSED_ATOMIC_FIELDS: fields of the canonical Snowplow event that will not be relayed to Analytics;
- UNUSED_CONTEXTS: contexts whose fields will not be relayed to Analytics.
Out of the box, the relay is configured to use the following defaults:
| Unused events | Unused atomic fields | Unused contexts |
| --- | --- | --- |
| app_heartbeat | etl_tstamp | application_context |
| app_initialized | collector_tstamp | application_error |
| app_shutdown | dvce_created_tstamp | duplicate |
| app_warning | event | geolocation_context |
| create_event | txn_id | instance_identity_document |
| emr_job_failed | name_tracker | java_context |
| emr_job_started | v_tracker | jobflow_step_status |
| emr_job_status | v_collector | parent_event |
| emr_job_succeeded | v_etl | performance_timing |
| incident | user_fingerprint | timing |
| incident_assign | geo_latitude | |
| incident_notify_of_close | geo_longitude | |
| incident_notify_user | ip_isp | |
| job_update | ip_organization | |
| load_failed | ip_domain | |
| load_succeeded | ip_netspeed | |
| page_ping | page_urlscheme | |
| s3_notification_event | page_urlport | |
| send_email | page_urlquery | |
| send_message | page_urlfragment | |
| storage_write_failed | refr_urlscheme | |
| stream_write_failed | refr_urlport | |
| task_update | refr_urlquery | |
| wd_access_log | refr_urlfragment | |
| | pp_xoffset_min | |
| | pp_xoffset_max | |
| | pp_yoffset_min | |
| | pp_yoffset_max | |
| | br_features_pdf | |
| | br_features_flash | |
| | br_features_java | |
| | br_features_director | |
| | br_features_quicktime | |
| | br_features_realplayer | |
| | br_features_windowsmedia | |
| | br_features_gears | |
| | br_features_silverlight | |
| | br_cookies | |
| | br_colordepth | |
| | br_viewwidth | |
| | br_viewheight | |
| | dvce_ismobile | |
| | dvce_screenwidth | |
| | dvce_screenheight | |
| | doc_charset | |
| | doc_width | |
| | doc_height | |
| | tr_currency | |
| | mkt_clickid | |
| | etl_tags | |
| | dvce_sent_tstamp | |
| | refr_domain_userid | |
| | refr_device_tstamp | |
| | derived_tstamp | |
| | event_vendor | |
| | event_name | |
| | event_format | |
| | event_version | |
| | event_fingerprint | |
| | true_tstamp | |
To change the defaults, you can pass in your own lists of events, atomic fields or contexts to be filtered out. For example:
| Environment variable key | Environment variable value |
| --- | --- |
| UNUSED_EVENTS | page_ping,file_download |
| UNUSED_ATOMIC_FIELDS | name_tracker,event_vendor |
| UNUSED_CONTEXTS | performance_timing,client_context |
As with the API key, set the first column (key) to the environment variable name in all caps. The second column (value) is your own list as a comma-separated string with no spaces.
If you only specify the environment variable name but do not provide a list of values, then nothing will be filtered out.
If you do not set any of the environment variables, the defaults will be used.
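The three cases described above (variable unset, variable set but empty, variable set to a list) can be summarized in a few lines. The Python sketch below is only an illustration of those documented semantics, not the relay's own code; the function names and the abbreviated default set are hypothetical.

```python
from typing import Iterable, Mapping, Set

# Abbreviated stand-in for the default list of unused events (assumption).
DEFAULT_UNUSED_EVENTS = {"app_heartbeat", "page_ping"}


def resolve_filter(env: Mapping[str, str], key: str, defaults: Set[str]) -> Set[str]:
    """Return the set of names to filter out for one environment variable."""
    raw = env.get(key)
    if raw is None:
        # Variable not set at all: fall back to the defaults.
        return set(defaults)
    # Variable set: use exactly what was provided. An empty value yields an
    # empty set, so nothing is filtered out. Items are not trimmed, which is
    # why the list must contain no spaces.
    return {item for item in raw.split(",") if item}


def to_env_value(names: Iterable[str]) -> str:
    """Join names into the comma-separated, space-free format."""
    return ",".join(names)


if __name__ == "__main__":
    print(resolve_filter({}, "UNUSED_EVENTS", DEFAULT_UNUSED_EVENTS))
    print(resolve_filter({"UNUSED_EVENTS": ""}, "UNUSED_EVENTS", DEFAULT_UNUSED_EVENTS))
    print(to_env_value(["page_ping", "file_download"]))
```

Note that overriding a variable replaces its default list rather than extending it, so include any defaults you still want filtered.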
Take a look at the Configure triggers section, which appears below. Choose the Kinesis stream that contains your Snowplow enriched events. Set the batch size to your liking; 100 is a reasonable setting. Note that this is a maximum batch size: the function can be triggered with fewer records. For the starting position we recommend Trim horizon, which starts processing from the oldest record still available in the stream. (Alternatively, you can select At timestamp to start sending data from a particular date.) Make sure Enable trigger is selected, then click the Add button to finish the trigger configuration.
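For those scripting the setup, the trigger settings map onto an event source mapping. The sketch below builds the configuration as a plain dictionary without calling AWS. The keys follow the parameter names of boto3's `create_event_source_mapping` call; the stream ARN and function name are hypothetical placeholders.

```python
from datetime import datetime, timezone
from typing import Optional


def build_trigger_config(
    stream_arn: str,
    function_name: str,
    batch_size: int = 100,
    start_at: Optional[datetime] = None,
) -> dict:
    """Build Kinesis trigger settings matching the recommendations above."""
    config = {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "Enabled": True,                     # equivalent of "Enable trigger"
        "BatchSize": batch_size,             # a maximum, batches may be smaller
        "StartingPosition": "TRIM_HORIZON",  # oldest record still in the stream
    }
    if start_at is not None:
        # Start from a particular point in time instead of the oldest record.
        config["StartingPosition"] = "AT_TIMESTAMP"
        config["StartingPositionTimestamp"] = start_at
    return config


if __name__ == "__main__":
    arn = "arn:aws:kinesis:us-east-1:123456789012:stream/snowplow-enriched-good"  # placeholder
    print(build_trigger_config(arn, "indicative-relay"))
    print(build_trigger_config(arn, "indicative-relay",
                               start_at=datetime(2020, 1, 1, tzinfo=timezone.utc)))
```

A dictionary of this shape could then be passed as `boto3.client("lambda").create_event_source_mapping(**config)`, assuming boto3 is installed and credentials are configured.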
Go to your Indicative project to check if you are receiving data. You can also go to the debug console to troubleshoot the relay in real time.