Firehose use case: Exporting data to Google BigQuery

With a BlueConic firehose connection, you can stream BlueConic events and profile data to Google Pub/Sub. From here, you can use this data in various ways, including sending the data from Google Pub/Sub to the analytics data warehouse tool Google BigQuery.

The example shown below illustrates one approach to streaming data from BlueConic through Google Pub/Sub to Google BigQuery. Here's an outline of the process:

  1. Set up a connection from BlueConic to Google Pub/Sub. 
  2. Prepare a Google BigQuery table to receive this information. 
  3. Use Google Dataflow to connect a Google Pub/Sub topic to the BigQuery table.

The technology stack resembles this illustration:

firehose-g.png

Note: For importing or exporting data in batch mode (but not BlueConic events) directly from BlueConic to Google BigQuery, you can also use our Google BigQuery connection.

Before you get started, a few things to keep in mind:

  • Sending the data from Google Pub/Sub to BigQuery requires a virtual machine and temporary storage. Google will bill you for this.
  • Be prepared to ingest a lot of data, and to solve problems along the way. BlueConic cannot give in-depth support for issues with systems outside of BlueConic.

Prerequisites

To send profile and event data from BlueConic to Google BigQuery, you'll need the following:

  • BlueConic installed and set up, with a Google Cloud Pub/Sub firehose connection configured.
  • A Google Cloud account with billing set up.

Step 1: Send data from BlueConic to a Google Cloud Pub/Sub topic

Start by setting up a BlueConic Google Cloud Pub/Sub firehose connection. This will send data to a specific Pub/Sub topic. See Google Cloud Pub/Sub connection for details. When you set up the connection, record the name of the topic you’re sending data to. The example outlined below uses a topic named "bctobq" in the project "pubsub-example-221115".
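
To verify that firehose messages are arriving on the topic before you wire up BigQuery, you can attach a temporary subscription and pull a few messages. Below is a minimal sketch using the google-cloud-pubsub Python client; the "bctobq-debug" subscription name is hypothetical and used only for this spot check:

    from google.cloud import pubsub_v1

    project_id = "pubsub-example-221115"
    topic = f"projects/{project_id}/topics/bctobq"
    # Hypothetical subscription name, used only for this spot check.
    subscription = f"projects/{project_id}/subscriptions/bctobq-debug"

    subscriber = pubsub_v1.SubscriberClient()
    with subscriber:
        subscriber.create_subscription(request={"name": subscription, "topic": topic})
        response = subscriber.pull(request={"subscription": subscription, "max_messages": 5})
        for received in response.received_messages:
            print(received.message.data.decode("utf-8"))  # raw firehose JSON payload
        if response.received_messages:
            subscriber.acknowledge(request={
                "subscription": subscription,
                "ack_ids": [received.ack_id for received in response.received_messages],
            })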

Step 2: Set up a Google BigQuery table

  1. Open the Google BigQuery window in the Google console.
  2. You need a BigQuery project; either select an existing project or create a new one.
    image__1_.png
  3. Next, you need a dataset within this project; either select an existing dataset or create a new one.
    dataset.png
  4. Within the dataset, you need to create a new table. 
    create_table.png
  5. Create a table schema that matches the data you want to receive. See Creating and Using Tables in the Google BigQuery documentation for details on the standard format. The schema must also reflect the settings you configured for the Google Cloud Pub/Sub connection in BlueConic, so that it accommodates the data streaming through the firehose. One approach is to let BigQuery auto-detect the schema from a JSON example file containing your BlueConic data, as sketched after this list.
    query_editor_schema.png
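
As a rough sketch of the auto-detection approach, assuming the google-cloud-bigquery Python client and a newline-delimited JSON file of sample firehose messages (the file name here is hypothetical), you can create the table by loading the sample and letting BigQuery infer the schema:

    from google.cloud import bigquery

    client = bigquery.Client(project="pubsub-example-221115")

    # Let BigQuery infer the schema from a newline-delimited JSON sample.
    # The file name is hypothetical; use an export of your own firehose messages.
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    with open("sample_firehose_messages.json", "rb") as source:
        job = client.load_table_from_file(
            source, "pubsub-example-221115.bc.example", job_config=job_config
        )
    job.result()  # wait for the load; the table now exists with the inferred schema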

Make note of your topic, project, dataset, and table names. This example uses these sample names:

Topic: bctobq
Project: pubsub-example-221115
Dataset: bc
Table: example
Bucket: bc-pubsub

See Sample firehose messages for details on the JSON format of the messages the Google Cloud Pub/Sub connection streams through the firehose.

Be aware that changes to the BlueConic profile properties being exported are not automatically carried over to the BigQuery schema. Any profile property you add will also require a change to the BigQuery table, as in the sketch below.
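
For example, if you add a profile property in BlueConic and want to extend the table rather than recreate it, one option is to append a nullable column with the google-cloud-bigquery Python client; the "loyalty_tier" column name is a hypothetical example:

    from google.cloud import bigquery

    client = bigquery.Client(project="pubsub-example-221115")
    table = client.get_table("pubsub-example-221115.bc.example")

    # Append a nullable column for the new profile property ("loyalty_tier" is
    # a hypothetical name); existing columns must be left unchanged.
    table.schema = list(table.schema) + [
        bigquery.SchemaField("loyalty_tier", "STRING", mode="NULLABLE")
    ]
    client.update_table(table, ["schema"])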

Step 3: Connect Google Cloud Pub/Sub to BigQuery through Google Dataflow

Next, you connect the Pub/Sub topic to BigQuery with a Google Dataflow job. The BigQuery table serves as the endpoint for firehose messages; Dataflow also needs a Cloud Storage bucket for temporary files.

Set up temporary storage

  1. Open the Google storage console: https://console.cloud.google.com/storage/browser
  2. If there is already a bucket, you can choose to reuse it. Otherwise, click Create Bucket to create a new one. The default options are sufficient for this purpose.
    create_bucket.png
  3. Write down the bucket name, for example "bc-pubsub".
    bucket_values3.png
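
If you prefer to script this step, here is a minimal sketch using the google-cloud-storage Python client, with the example bucket name from this article (the "US" location is an assumption; pick one close to your data):

    from google.cloud import storage

    client = storage.Client(project="pubsub-example-221115")
    # The default options are sufficient; Dataflow only uses the bucket for temp files.
    bucket = client.create_bucket("bc-pubsub", location="US")
    print(f"Created gs://{bucket.name}")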

Enable the Google Dataflow API

The Google Dataflow API is not enabled by default. Enable it at https://console.developers.google.com/apis/library/dataflow.googleapis.com or from the Google Cloud APIs and Services menu.
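
Alternatively, you can enable the API from code. This sketch assumes the google-api-python-client library and the Service Usage API, and that your credentials are allowed to enable services on the project:

    from googleapiclient.discovery import build

    # Enable the Dataflow API on the project, equivalent to flipping the
    # switch in the console. Returns a long-running operation.
    serviceusage = build("serviceusage", "v1")
    operation = serviceusage.services().enable(
        name="projects/pubsub-example-221115/services/dataflow.googleapis.com"
    ).execute()
    print(operation)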

Set up the Dataflow

  1. Open the Google Dataflow console: https://console.cloud.google.com/dataflow
    dataflow.png
  2. Select Create a job from a template.
  3. Enter a job name and select the Google Pub/Sub to BigQuery template.
    This should give you a list of fields.
  4. Enter the field names using the notes you’ve taken in the steps above. In our example it would be:
    Google Cloud Pub/Sub input topic: projects/pubsub-example-221115/topics/bctobq
    BigQuery output table: pubsub-example-221115:bc.example
    Temporary location: gs://bc-pubsub/tmp

    Note that the input topic uses forward slashes (/), but the output table uses a colon (:) and a period (.).
    bigquery_dataflow.png
  5. Click the Run job button. This creates the job and starts the streaming pipeline.
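
The same job can also be launched programmatically. The sketch below uses the google-api-python-client library against the Dataflow templates API, with the example names from this article; the template path and parameter names follow Google's public Pub/Sub-to-BigQuery template, so verify them against the current Dataflow documentation before relying on them:

    from googleapiclient.discovery import build

    project = "pubsub-example-221115"
    dataflow = build("dataflow", "v1b3")

    response = dataflow.projects().templates().launch(
        projectId=project,
        # Google's hosted streaming template for Pub/Sub to BigQuery.
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "bc-pubsub-to-bigquery",  # hypothetical job name
            "parameters": {
                "inputTopic": f"projects/{project}/topics/bctobq",
                "outputTableSpec": f"{project}:bc.example",
            },
            "environment": {"tempLocation": "gs://bc-pubsub/tmp"},
        },
    ).execute()
    print(response["job"]["id"])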

After you start streaming data to Google BigQuery, any change in BlueConic (adding a profile property, for example) requires a matching update to your Google BigQuery table. In this case, we suggest creating a new BigQuery table.

Alternative method

Another way to move data through the Google Cloud Pub/Sub firehose stream is to write the data to Google Cloud Storage as text files (GCS Text) and then load those files into Google BigQuery. This saves you the step of customizing the BigQuery table whenever the BlueConic profile properties sent through the connection change.

Learn more about BlueConic firehose connections