Automate Data Warehouse ETL from an API with Python and Google Cloud Platform
Learn to set up an ETL pipeline triggered by a Google Cloud PubSub message to extract data from an API, stage it in a Google Cloud Storage Bucket, transform the data using a Google Cloud Function and Python, and load the data into a Google BigQuery data warehouse.
Introduction
In this article, we will be automating a simple ETL pipeline to warehouse weather data from a RESTful API. The process outlined in this article for Google Cloud Platform could also be set up in Amazon Web Services using AWS Lambda Functions and AWS Redshift or in Microsoft Azure using Azure Functions and Azure SQL Data Warehouse.
Before jumping into this article, there are a few things you should already have:
- A Google Cloud Platform (GCP) account and a GCP project
- A free weatherapi.com account with your API key set up in GCP Secrets Manager and with the proper IAM secrets role assignments
- Working knowledge of Python3 and RESTful APIs
This article will follow 3 high-level steps:
- BigQuery data warehouse and Cloud Storage setup
- API data extraction with Cloud Storage, PubSub, Scheduler, and Cloud Function
- Transform and load data into BigQuery with a second Cloud Function
BigQuery Data Warehouse and Cloud Storage Setup
BigQuery Setup
Open BigQuery and create a new dataset named “weather”. Then create an empty table named “tbl_condition” with six columns: condition_timestamp, zipcode, city, temperature_f, temperature_c, and condition.
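If you prefer to script this step instead of clicking through the console, a minimal sketch with the google-cloud-bigquery client might look like the following. The project ID is a placeholder and the column types are assumptions; adjust them to fit your needs.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project ID

# Create the "weather" dataset
client.create_dataset("weather", exists_ok=True)

# Define the six columns for tbl_condition (types are assumed)
schema = [
    bigquery.SchemaField("condition_timestamp", "TIMESTAMP"),
    bigquery.SchemaField("zipcode", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("temperature_f", "FLOAT"),
    bigquery.SchemaField("temperature_c", "FLOAT"),
    bigquery.SchemaField("condition", "STRING"),
]

# Create the empty table
table = bigquery.Table("your-gcp-project-id.weather.tbl_condition", schema=schema)
client.create_table(table, exists_ok=True)
```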
Cloud Storage Setup
Go to Cloud Storage and create a bucket with a globally unique name. Your bucket cannot share a name with any existing Cloud Storage bucket, so I usually add a unique prefix or suffix to my bucket names; that way I can still use a descriptive, common name.
If you are trying to minimize costs, you can also restrict the location type. For this tutorial, the “Region” selection should be acceptable; however, for larger production datasets that must be highly available around the world, you may want to pick one of the other options.
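The bucket can likewise be created with a few lines of Python using the google-cloud-storage client. The bucket name and region below are placeholders, not requirements of the pipeline.

```python
from google.cloud import storage

client = storage.Client(project="your-gcp-project-id")  # placeholder project ID

# Bucket names must be globally unique, hence the prefix
bucket = client.bucket("yourprefix-weather-api-staging")
bucket.storage_class = "STANDARD"

# A single-region bucket keeps costs down for this tutorial
client.create_bucket(bucket, location="us-central1")
```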
API Data Extraction
Create PubSub Topic
Next, create a PubSub topic called “get_weather_api”.
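You can create the topic in the console, or as a quick alternative with the google-cloud-pubsub client (the project ID is a placeholder):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-gcp-project-id", "get_weather_api")

# Create the topic that Cloud Scheduler will publish to
publisher.create_topic(request={"name": topic_path})
```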
Create Cloud Scheduler Job
Then, create a Cloud Scheduler job. Since the weather API publishes updates every 15 minutes, we will match that frequency with the unix-cron expression “*/15 * * * *”. The job’s target should be the PubSub topic we just created above. The scheduler requires you to include a message body, but this text will not be used in our process.
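The job can also be created programmatically. Here is a minimal sketch with the google-cloud-scheduler client; the project ID, region, job name, and message body are all placeholder assumptions.

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/your-gcp-project-id/locations/us-central1"  # placeholder project and region

job = scheduler_v1.Job(
    name=f"{parent}/jobs/get_weather_api_job",
    schedule="*/15 * * * *",  # every 15 minutes, matching the API update cadence
    time_zone="Etc/UTC",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name="projects/your-gcp-project-id/topics/get_weather_api",
        data=b"run",  # required message body; unused by our functions
    ),
)

client.create_job(parent=parent, job=job)
```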
Create API Cloud Function
Create our first Cloud Function, triggered by the get_weather_api PubSub topic. Click “save” and “next”.
Change the runtime to Python 3.8, change the entry point to your function name and enter your main.py and requirements.txt (on GitHub).
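The exact code lives in the main.py on GitHub. As a rough sketch of the idea, a PubSub-triggered function along these lines pulls the API key from Secret Manager, calls weatherapi.com, and drops the raw JSON into the staging bucket. The bucket name, secret path, and zip code below are placeholder assumptions, not the exact values from the repo.

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import secretmanager, storage

BUCKET_NAME = "yourprefix-weather-api-staging"  # assumed bucket name
SECRET_NAME = "projects/your-gcp-project-id/secrets/weather-api-key/versions/latest"  # assumed secret path
ZIPCODE = "10001"  # assumed zip code to query


def get_weather(event, context):
    """Triggered by the get_weather_api PubSub topic."""
    # Fetch the weatherapi.com key from Secret Manager
    secrets = secretmanager.SecretManagerServiceClient()
    api_key = secrets.access_secret_version(request={"name": SECRET_NAME}).payload.data.decode("utf-8")

    # Call the current-conditions endpoint
    resp = requests.get(
        "http://api.weatherapi.com/v1/current.json",
        params={"key": api_key, "q": ZIPCODE},
        timeout=30,
    )
    resp.raise_for_status()

    # Stage the raw JSON in Cloud Storage; the upload triggers the second function
    blob_name = "weather_" + datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") + ".json"
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(blob_name).upload_from_string(resp.text, content_type="application/json")
```

A requirements.txt for a sketch like this would list requests, google-cloud-secret-manager, and google-cloud-storage.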
Transform and Load Into BigQuery
Create a second function that is triggered by a new file creation in your storage bucket.
Change the runtime to Python 3.8, change the entry point to your function name, and enter your main.py and requirements.txt (on GitHub). This function performs a few quick transformations and loads the data into the BigQuery table we created above.
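Again, the authoritative main.py is on GitHub. A Cloud Storage–triggered function in this spirit reads the staged JSON, flattens it into our six columns, and streams a row into the warehouse table. The project ID is a placeholder, and the weatherapi.com field paths (especially the zip code field) are assumptions.

```python
import json
from datetime import datetime, timezone

from google.cloud import bigquery, storage

TABLE_ID = "your-gcp-project-id.weather.tbl_condition"  # placeholder project ID


def load_weather(event, context):
    """Triggered when a new file lands in the staging bucket."""
    # Read the raw JSON that the first function staged
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    data = json.loads(blob.download_as_text())

    # Light transformation: flatten the payload into our six columns
    observed = datetime.fromtimestamp(data["current"]["last_updated_epoch"], tz=timezone.utc)
    row = {
        "condition_timestamp": observed.isoformat(),
        "zipcode": data["location"].get("postcode", ""),  # assumed field; the repo may carry the zip from the request instead
        "city": data["location"]["name"],
        "temperature_f": data["current"]["temp_f"],
        "temperature_c": data["current"]["temp_c"],
        "condition": data["current"]["condition"]["text"],
    }

    # Stream the row into BigQuery
    errors = bigquery.Client().insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

A function like this would need google-cloud-storage and google-cloud-bigquery in its requirements.txt.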
Run your Cloud Scheduler job, wait a few seconds, and you should now see data in your BigQuery table.
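To sanity-check the load without leaving Python, a quick query such as the following (project ID assumed) should return your newest rows:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project ID
query = """
    SELECT condition_timestamp, city, temperature_f, condition
    FROM `weather.tbl_condition`
    ORDER BY condition_timestamp DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))
```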
Key Takeaways
We now have a simple and easy-to-implement framework to ingest data from an API and load it into our BigQuery data warehouse. We used Google Cloud Functions as the primary host of our process, triggered by a PubSub message generated from a Scheduler job. Adding more API sources would just require additional API-call functions triggered by the same PubSub message.
In practice, this could be used to add data from third-party tools or from your homegrown ecosystem into your data warehouse.
What’s Next?
Watch for a follow-up article on how we can build analyses and visualizations on top of this dataset.
Thank you for reading my first solo Medium article; check back soon for more!