This project is part of my full portfolio.

Introduction

The purpose of this Tokyo Olympics Cloud Data Engineering Project is to showcase my skills in ETL functions on cloud platforms.

The project was done within the limitations of the free trial Azure subscription for new accounts.

Overview

[Data pipeline architecture diagram: GitHub → Azure Data Factory → Data Lake Storage Gen2 (raw) → Azure Databricks → Data Lake Storage Gen2 (transformed) → Azure Synapse Analytics → Power BI]

Objective

To build an end-to-end data pipeline on a cloud platform, from a GitHub repository to a Power BI dashboard.

The process will entail ETL into Microsoft Azure using Azure Data Factory (Extract), Azure Databricks (Transform), and Azure Data Lake Storage Gen2 (Load).

The aggregated data will then be analysed and visualised using Azure's built-in Synapse Analytics platform and Power BI dashboards.

Tokyo Olympics: Data Engineering Project (Step-by-step)

In a stepwise format, I will cover how I managed to complete this entire project.

Before I started the project, I created a Resource Group, which acts as the point of reference for all the Azure resources used in this project.

I named it tokyo-olympics.
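
For illustration, the same resource group could also be created programmatically with the Azure Python SDK rather than through the portal; a minimal sketch (the subscription ID is a placeholder, and the region is an assumption based on the Southeast Asia deployment used later):

Python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder subscription ID; authenticate with the default Azure credential chain
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) the resource group that will hold every resource in this project
client.resource_groups.create_or_update(
    "tokyo-olympics", {"location": "southeastasia"}
)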

1. Create Storage Account and Raw & Transformed Directories

Before I began the ETL process proper, I had to make sure the infrastructure was set up correctly.

Firstly, I created a Storage Account under the free Azure Trial Subscription.

Under the Storage Account, I created the raw and transformed data directories, which would house the data later on.
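
The same container and folders could also be created programmatically; a minimal sketch using the azure-storage-file-datalake SDK (the storage account name and access key below are placeholders):

Python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute the real storage account name and access key
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<storage-account-key>"
)

# Create the container, then the raw and transformed folders inside it
file_system = service.create_file_system(file_system="tokyo-olympic-data")
file_system.create_directory("raw-data")
file_system.create_directory("transformed-data")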

2. Extract and Load the Data from GitHub Using Azure Data Factory

Next, I created a Data Factory, which I will be using to ingest the data from a data source.

I created a Pipeline within the Data Factory, and used a simple copy function to connect to a data source.

For the data source, I connected to the raw CSV files in my GitHub repository using HTTP requests.

I loaded the Athletes, Coaches, EntriesGender, Medals, and Teams datasets.

I stored the data as DelimitedText sink datasets and set their file paths within the Data Lake, under the raw-data folder.

I linked them to run in a pipeline and validated that they loaded successfully.
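
The HTTP linked service simply points at the raw GitHub file URLs. As a quick sanity check before wiring a URL into the copy activity, a small Python snippet like this can confirm the file is reachable (the repository path below is a hypothetical placeholder):

Python
import requests

# Hypothetical raw GitHub URL of the kind used by the HTTP linked service
url = "https://raw.githubusercontent.com/<user>/<repo>/main/Athletes.csv"
response = requests.get(url)
print(response.status_code)           # 200 means the file is reachable
print(response.text.splitlines()[0])  # header row of the CSV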

3. Initializing Azure Databricks for Transforming the Data

I first created a new Azure Databricks workspace in the Southeast Asia region.

I picked the Premium pricing tier, then deployed the workspace.

Once deployed, I created a new compute resource: a single-node cluster with the Standard_DS3_v2 node type.

I then created a new notebook and attached it to the single-node cluster I had created earlier.

Before proceeding any further, I created an App Registration and named it app01.

Then I noted down its Application (Client) ID and Directory (Tenant) ID in a notepad.

I created a Client Secret and noted down its value in a notepad.

In real deployments, these keys can be stored securely in an Azure Key Vault instead.
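
As an illustration, a sketch of how those values could be read from a Key Vault-backed Databricks secret scope instead of being hardcoded (the scope and key names are hypothetical):

Python
# Fetch the app registration credentials from a Key Vault-backed secret scope
# (scope and key names are hypothetical placeholders)
client_id = dbutils.secrets.get(scope="tokyo-olympics-kv", key="app01-client-id")
tenant_id = dbutils.secrets.get(scope="tokyo-olympics-kv", key="app01-tenant-id")
client_secret = dbutils.secrets.get(scope="tokyo-olympics-kv", key="app01-client-secret")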

4. Importing the Data using Azure Databricks

I then opened the PySpark notebook, defined the configs, and mounted the data source (the Storage Account) using the storage account name.

Python
# OAuth configuration for the app registration (app01) used to access ADLS Gen2
# The client secret is redacted here; in practice it should come from a Key Vault-backed secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "bcb75087-d02c-4221-9763-3e4206cda9b8",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/03938f9d-f8c2-4b5f-9749-7637c1509bd0/oauth2/token"
}

# Mount the container onto DBFS (format: container@storageaccount)
dbutils.fs.mount(
    source="abfss://tokyo-olympic-data@tokyoolympicsjustin.dfs.core.windows.net",
    mount_point="/mnt/tokyoolymic",
    extra_configs=configs
)

I then went to the Storage Account and, under the Access Control (IAM) section, assigned the app01 application the Storage Blob Data Contributor role.

Back in the notebook, I listed the files in the mounted directory to confirm the mount was successful.

ShellScript
%fs
ls "/mnt/tokyoolymic"

I then started a Spark session by typing spark. (Azure Databricks already provides a SparkSession, exposed as spark, in every notebook.)

Python
spark

I could then proceed to import all the datasets using PySpark functions.

Python
# Read all datasets into apache spark
athletes = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/athletes.csv")
coaches = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/coaches.csv")
entriesgender = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/entriesgender.csv")
medals = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/medals.csv")
teams = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/tokyoolymic/raw-data/teams.csv")

# inferSchema tells Spark to infer the data type of each column automatically

I then used the .show() and .printSchema() methods to check that the data had been loaded in the right format.

Because the inferSchema option was set when reading, Spark inferred the correct data type for each column from the get-go.
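
For example, a quick check on the athletes dataset could look like this:

Python
# Preview the first few rows and confirm the inferred column types
athletes.show(5)
athletes.printSchema()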

5. Transforming the Data using Azure Databricks

To find the countries with the highest number of gold medals, I used the orderBy function to sort in descending order of Gold medals, then chose which columns to keep using the .select() function.

Python
# Find the top countries with the highest number of gold medals

top_gold_medals_by_countries = medals.orderBy("Gold", ascending=False).select("Team_Country", "Gold")
top_gold_medals_by_countries.show()

And finally, I used the .show() method to preview the transformed dataset.

I then calculated the average number of entries by gender for each discipline.
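
A minimal sketch of one way to compute this, assuming the entriesgender dataset has Female, Male, and Total columns (the column names are assumptions):

Python
# Compute the proportion of female and male entries for each discipline
# (column names Female, Male, and Total are assumptions about the dataset)
average_entries_by_gender = entriesgender.withColumn(
    "Avg_Female", entriesgender["Female"] / entriesgender["Total"]
).withColumn(
    "Avg_Male", entriesgender["Male"] / entriesgender["Total"]
)
average_entries_by_gender.show()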

6. Writing Transformed Data into Data Lake Storage

I then wrote the transformed data into Data Lake Storage Gen2 as CSV files.

Python
# Writes transformed data into transformed datalake container
athletes.repartition(1).write.mode("overwrite").option("header",'true').csv("/mnt/tokyoolymic/transformed-data/athletes")
coaches.repartition(1).write.mode("overwrite").option("header",'true').csv("/mnt/tokyoolymic/transformed-data/coaches")
entriesgender.repartition(1).write.mode("overwrite").option("header",'true').csv("/mnt/tokyoolymic/transformed-data/entriesgender")
medals.repartition(1).write.mode("overwrite").option("header",'true').csv("/mnt/tokyoolymic/transformed-data/medals")
teams.repartition(1).write.mode("overwrite").option("header",'true').csv("/mnt/tokyoolymic/transformed-data/teams")

Data lakes are centralized repositories that ingest, store, and allow for processing of large volumes of data in its original form.

They can accommodate all types of data, which can then be used to power big data analytics, machine learning, and other forms of intelligent action.

7. Visualize Using Azure Synapse Analytics

8. Creating a Power BI Dashboard

Final Thoughts

This was a great learning exercise as I hit multiple roadblocks along the way and overcame them.

I am quite proud of the end result.

A further step would be to ensure the data pipeline is maintained over time so that stakeholders can rely on timely, self-serve analytics through the Power BI dashboards.