Automate Data Warehouse ETL from an API with Python and Google Cloud Platform
Learn to set up an ETL pipeline triggered by a Google Cloud PubSub message to extract data from an API, stage it in a Google Cloud Storage Bucket, transform the data using a Google Cloud Function and Python, and load the data into a Google BigQuery data warehouse.
Introduction
In this article, we will be automating a simple ETL pipeline to warehouse weather data from a RESTful API. The process outlined in this article for Google Cloud Platform could also be set up in Amazon Web Services using AWS Lambda Functions and AWS Redshift or in Microsoft Azure using Azure Functions and Azure SQL Data Warehouse.
Before jumping into this article, there are a few things you should already have:
- A Google Cloud Platform (GCP) account and a GCP project
- A free weatherapi.com account with your API key set up in GCP Secrets Manager and with the proper IAM secrets role assignments
- Working knowledge of Python3 and RESTful APIs
This article will follow 3 high-level steps:
- BigQuery data warehouse and Cloud Storage setup
- API data extraction with Cloud Storage, PubSub, Scheduler, and Cloud Function
- Transform and load data into BigQuery with a second Cloud Function
BigQuery Data Warehouse and Cloud Storage Setup
BigQuery Setup
Open BigQuery and create a new dataset named “weather”. Then create an empty table named “tbl_condition” with six columns: condition_timestamp, zipcode, city, temperature_f, temperature_c, and condition.
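If you prefer to script this step instead of clicking through the console, a minimal sketch with the google-cloud-bigquery client might look like the following. The project ID is a placeholder and the column types are assumptions; adjust them to fit your needs.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project ID

# Create the "weather" dataset
client.create_dataset("weather", exists_ok=True)

# Define the six columns for tbl_condition (types are assumed)
schema = [
    bigquery.SchemaField("condition_timestamp", "TIMESTAMP"),
    bigquery.SchemaField("zipcode", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("temperature_f", "FLOAT"),
    bigquery.SchemaField("temperature_c", "FLOAT"),
    bigquery.SchemaField("condition", "STRING"),
]

# Create the empty table
table = bigquery.Table("your-gcp-project-id.weather.tbl_condition", schema=schema)
client.create_table(table, exists_ok=True)
```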
Cloud Storage Setup
Go to Cloud Storage and create a bucket with a globally unique name. Your bucket cannot share a name with any existing Cloud Storage bucket, so I usually add a unique prefix or suffix to my bucket names; that way I can still use a descriptive, common name.
If you are trying to minimize costs, you can also restrict the location type. For this tutorial, the “Region” selection should be acceptable; however, for larger production datasets that must be highly available around the world, you may want to pick one of the other options.
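The bucket can likewise be created with a few lines of Python using the google-cloud-storage client. The bucket name and region below are placeholders, not requirements of the pipeline.

```python
from google.cloud import storage

client = storage.Client(project="your-gcp-project-id")  # placeholder project ID

# Bucket names must be globally unique, hence the prefix
bucket = client.bucket("yourprefix-weather-api-staging")
bucket.storage_class = "STANDARD"

# A single-region bucket keeps costs down for this tutorial
client.create_bucket(bucket, location="us-central1")
```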
API Data Extraction
Create PubSub Topic
Next, create a PubSub topic called “get_weather_api”.
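You can create the topic in the console, or as a quick alternative with the google-cloud-pubsub client (the project ID is a placeholder):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-gcp-project-id", "get_weather_api")

# Create the topic that Cloud Scheduler will publish to
publisher.create_topic(request={"name": topic_path})
```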
Create Cloud Scheduler Job
Then, create a Cloud Scheduler job. Since the weather API publishes updates every 15 minutes, we will match that frequency with the unix-cron expression “*/15 * * * *”. The job’s target should be the PubSub topic we just created above. The scheduler requires you to include a message body, but this text will not be used in our process.
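The job can also be created programmatically. Here is a minimal sketch with the google-cloud-scheduler client; the project ID, region, job name, and message body are all placeholder assumptions.

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/your-gcp-project-id/locations/us-central1"  # placeholder project and region

job = scheduler_v1.Job(
    name=f"{parent}/jobs/get_weather_api_job",
    schedule="*/15 * * * *",  # every 15 minutes, matching the API update cadence
    time_zone="Etc/UTC",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name="projects/your-gcp-project-id/topics/get_weather_api",
        data=b"run",  # required message body; unused by our functions
    ),
)

client.create_job(parent=parent, job=job)
```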
Create API Cloud Function
Create our first Cloud Function, triggered by the get_weather_api PubSub topic. Click “save” and “next”.
Change the runtime to Python 3.8, change the entry point to your function name and enter your main.py and requirements.txt (on GitHub).
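The exact code lives in the main.py on GitHub. As a rough sketch of the idea, a PubSub-triggered function along these lines pulls the API key from Secret Manager, calls weatherapi.com, and drops the raw JSON into the staging bucket. The bucket name, secret path, and zip code below are placeholder assumptions, not the exact values from the repo.

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import secretmanager, storage

BUCKET_NAME = "yourprefix-weather-api-staging"  # assumed bucket name
SECRET_NAME = "projects/your-gcp-project-id/secrets/weather-api-key/versions/latest"  # assumed secret path
ZIPCODE = "10001"  # assumed zip code to query


def get_weather(event, context):
    """Triggered by the get_weather_api PubSub topic."""
    # Fetch the weatherapi.com key from Secret Manager
    secrets = secretmanager.SecretManagerServiceClient()
    api_key = secrets.access_secret_version(request={"name": SECRET_NAME}).payload.data.decode("utf-8")

    # Call the current-conditions endpoint
    resp = requests.get(
        "http://api.weatherapi.com/v1/current.json",
        params={"key": api_key, "q": ZIPCODE},
        timeout=30,
    )
    resp.raise_for_status()

    # Stage the raw JSON in Cloud Storage; the upload triggers the second function
    blob_name = "weather_" + datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") + ".json"
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(blob_name).upload_from_string(resp.text, content_type="application/json")
```

A requirements.txt for a sketch like this would list requests, google-cloud-secret-manager, and google-cloud-storage.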
Transform and Load Into BigQuery
Create a second function that is triggered by a new file creation in your storage bucket.
Change the runtime to Python 3.8, change the entry point to your function name, and enter your main.py and requirements.txt (on GitHub). This function performs a few quick transformations and loads the data into the BigQuery table we created above.
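Again, the authoritative main.py is on GitHub. A Cloud Storage–triggered function in this spirit reads the staged JSON, flattens it into our six columns, and streams a row into the warehouse table. The project ID is a placeholder, and the weatherapi.com field paths (especially the zip code field) are assumptions.

```python
import json
from datetime import datetime, timezone

from google.cloud import bigquery, storage

TABLE_ID = "your-gcp-project-id.weather.tbl_condition"  # placeholder project ID


def load_weather(event, context):
    """Triggered when a new file lands in the staging bucket."""
    # Read the raw JSON that the first function staged
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    data = json.loads(blob.download_as_text())

    # Light transformation: flatten the payload into our six columns
    observed = datetime.fromtimestamp(data["current"]["last_updated_epoch"], tz=timezone.utc)
    row = {
        "condition_timestamp": observed.isoformat(),
        "zipcode": data["location"].get("postcode", ""),  # assumed field; the repo may carry the zip from the request instead
        "city": data["location"]["name"],
        "temperature_f": data["current"]["temp_f"],
        "temperature_c": data["current"]["temp_c"],
        "condition": data["current"]["condition"]["text"],
    }

    # Stream the row into BigQuery
    errors = bigquery.Client().insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

A function like this would need google-cloud-storage and google-cloud-bigquery in its requirements.txt.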
Run your Cloud Scheduler job, wait a few seconds, and you should now see data in your BigQuery table.
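To sanity-check the load without leaving Python, a quick query such as the following (project ID assumed) should return your newest rows:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project ID
query = """
    SELECT condition_timestamp, city, temperature_f, condition
    FROM `weather.tbl_condition`
    ORDER BY condition_timestamp DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))
```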
Key Takeaways
We now have a simple and easy-to-implement framework to ingest data from an API and load it into our BigQuery data warehouse. We used Google Cloud Functions as the primary host of our process, triggered by a PubSub message generated from a Scheduler job. Adding more API sources would just require additional API-call functions triggered by the same PubSub message.
In practice, this could be used to add data from third-party tools or from your homegrown ecosystem into your data warehouse.
What’s Next?
Watch for a follow-up article on how we can build analyses and visualizations on top of this dataset.
Thank you for reading my first solo Medium article; check back soon for more!