Automate Data Warehouse ETL from an API with Python and Google Cloud Platform

Connor Clancy
4 min read · May 27, 2021

Learn to set up an ETL pipeline triggered by a Google Cloud PubSub message to extract data from an API, stage it in a Google Cloud Storage Bucket, transform the data using a Google Cloud Function and Python, and load the data into a Google BigQuery data warehouse.

Introduction

In this article, we will be automating a simple ETL pipeline to warehouse weather data from a RESTful API. The process outlined in this article for Google Cloud Platform could also be set up in Amazon Web Services using AWS Lambda Functions and AWS Redshift or in Microsoft Azure using Azure Functions and Azure SQL Data Warehouse.

Before jumping into this article, there are a few things you should already have:

  - A Google Cloud Platform project with billing enabled
  - An API key for a weather API that returns current conditions
  - A basic working knowledge of Python

This article will follow 3 high-level steps:

  1. BigQuery data warehouse and Cloud Storage setup
  2. API data extraction with Cloud Storage, PubSub, Scheduler, and Cloud Function
  3. Transform and load data into BigQuery with a second Cloud Function

BigQuery Data Warehouse and Cloud Storage Setup

BigQuery Setup

Open BigQuery and create a new dataset named “weather”. Then create an empty table named “tbl_condition” with six columns (condition_timestamp, zipcode, city, temperature_f, temperature_c, condition).
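
If you would rather script this setup than click through the console, the sketch below uses the google-cloud-bigquery client. The column types (STRING for zipcode and condition, FLOAT for the temperatures, TIMESTAMP for condition_timestamp) are my assumptions; adjust them to whatever your weather API returns.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create the "weather" dataset; exists_ok avoids an error on re-runs.
client.create_dataset("weather", exists_ok=True)

# The six-column schema for tbl_condition. The types are assumptions.
schema = [
    bigquery.SchemaField("condition_timestamp", "TIMESTAMP"),
    bigquery.SchemaField("zipcode", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("temperature_f", "FLOAT"),
    bigquery.SchemaField("temperature_c", "FLOAT"),
    bigquery.SchemaField("condition", "STRING"),
]

table = bigquery.Table(f"{client.project}.weather.tbl_condition", schema=schema)
client.create_table(table, exists_ok=True)
```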

Cloud Storage Setup

Go to Cloud Storage and create a bucket with a globally unique name; your bucket cannot share a name with any existing Cloud Storage bucket. I usually add a unique prefix or suffix to my bucket names so I can still use a descriptive, common name.

If you are trying to minimize costs, you can also change the location type to a more restrictive selection. For this tutorial, the “Region” selection should be acceptable; however, for larger production datasets needed at high availability around the world, you may want to pick one of the other options.
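
If you want to script the bucket as well, here is a minimal sketch with the google-cloud-storage client; the bucket name and region are placeholders, so swap in your own.

```python
from google.cloud import storage

client = storage.Client()

# Bucket names are globally unique; the prefix here is a placeholder.
bucket = client.bucket("yourprefix-weather-staging")
bucket.storage_class = "STANDARD"

# A single region keeps costs down for a tutorial-sized workload.
client.create_bucket(bucket, location="us-central1")
```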

API Data Extraction

Create PubSub Topic

Next, create a PubSub topic called “get_weather_api”.
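
If you prefer the client library to the console, a minimal sketch with google-cloud-pubsub is below; the project ID is a placeholder.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

# Replace "your-project-id" with your own GCP project ID.
topic_path = publisher.topic_path("your-project-id", "get_weather_api")
publisher.create_topic(name=topic_path)
```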

Create Cloud Scheduler Job

Then, create a Cloud Scheduler job. Since the weather API publishes updates every 15 minutes, we will match that frequency with the unix-cron expression */15 * * * *. The job’s target should be the PubSub topic we just created. The scheduler requires you to include a message body, but our process will not use its contents.

Create API Cloud Function

Create the first Cloud Function and set its trigger to the get_weather_api PubSub topic. Click “Save” and “Next”.

Change the runtime to Python 3.8, set the entry point to your function’s name, and paste in your main.py and requirements.txt (on GitHub).
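
The real main.py is on GitHub; the sketch below only illustrates the general shape of a PubSub-triggered extract function. The entry-point name, API endpoint, API key, zip codes, and bucket name are all placeholders rather than the values from the repository.

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

BUCKET_NAME = "yourprefix-weather-staging"           # your staging bucket
API_URL = "https://api.example-weather.com/current"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                             # better kept in an env variable
ZIPCODES = ["02108", "10001"]                        # placeholder locations


def get_weather_api(event, context):
    """Triggered by the get_weather_api PubSub topic; stages raw API responses in GCS."""
    bucket = storage.Client().bucket(BUCKET_NAME)

    for zipcode in ZIPCODES:
        response = requests.get(API_URL, params={"key": API_KEY, "zip": zipcode})
        response.raise_for_status()

        # One timestamped JSON file per location; the load function picks these up.
        blob_name = f"raw/{zipcode}_{datetime.now(timezone.utc):%Y%m%d%H%M%S}.json"
        bucket.blob(blob_name).upload_from_string(
            json.dumps(response.json()), content_type="application/json"
        )
```

A requirements.txt for a function like this would list at least requests and google-cloud-storage.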

Transform and Load Into BigQuery

Create a second Cloud Function that is triggered whenever a new file is created in your storage bucket.

As before, change the runtime to Python 3.8, set the entry point to your function’s name, and paste in your main.py and requirements.txt (on GitHub). This function performs a few quick transformations and loads the data into the BigQuery table we created above.
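
Again, the working code is on GitHub; this sketch just shows the general shape of a storage-triggered load function. The raw JSON field names and the Fahrenheit-to-Celsius conversion are assumptions about what the API returns.

```python
import json

from google.cloud import bigquery, storage

TABLE_ID = "your-project-id.weather.tbl_condition"  # replace with your project ID


def load_weather(event, context):
    """Triggered when a new file lands in the staging bucket; loads it into BigQuery."""
    storage_client = storage.Client()
    bq_client = bigquery.Client()

    # The trigger event carries the bucket and object name of the new file.
    blob = storage_client.bucket(event["bucket"]).blob(event["name"])
    raw = json.loads(blob.download_as_text())

    # Light transformations: rename fields and derive Celsius from Fahrenheit.
    temp_f = float(raw["temp_f"])
    row = {
        "condition_timestamp": raw["observation_time"],
        "zipcode": raw["zipcode"],
        "city": raw["city"],
        "temperature_f": temp_f,
        "temperature_c": round((temp_f - 32) * 5 / 9, 1),
        "condition": raw["condition"],
    }

    errors = bq_client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

Here the requirements.txt would list google-cloud-storage and google-cloud-bigquery.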

Run your Cloud Scheduler job, wait a few seconds, and you should now see data in your BigQuery table.
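
You can check from the BigQuery console, or with a quick query like this sketch against the table we created earlier:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Pull the most recent rows to confirm the pipeline is landing data.
query = """
    SELECT condition_timestamp, zipcode, city, temperature_f, condition
    FROM `weather.tbl_condition`
    ORDER BY condition_timestamp DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```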

Key Takeaways

We now have a simple, easy-to-implement framework for ingesting data from an API and loading it into our BigQuery data warehouse. We used Google Cloud Functions as the primary host of our process, triggered by a PubSub message generated from a Cloud Scheduler job. Adding more API sources would just require additional API-calling functions triggered by the same PubSub message.

In practice, this could be used to pull data from third-party tools or from your homegrown ecosystem into your data warehouse.

What’s Next?

Watch for a follow-up article on building analyses and visualizations on top of this dataset.

Thank you for reading my first solo Medium article; check back soon for more!
