Building a Sports Data Lake on AWS with S3, Glue, and Athena (VS Code Edition)

My journey in creating a data lake to analyze NBA player data, while utilizing VS Code's AWS integration

Created by Ron-tino on January 04, 2025

In this blog post, we'll walk through building a data lake on AWS to analyze NBA player data. We'll be using Amazon S3 for storage, AWS Glue for data cataloging, and AWS Athena for querying. The unique part? Instead of using the AWS CLI in Cloud Shell, we'll leverage VS Code’s AWS integration for a more streamlined development experience. This project is perfect for anyone looking to get hands-on with AWS data services, learn more about data lake architecture, and work directly with real data. Let's get started!

Project Overview

Our data lake will follow this workflow:

  • Data Storage (S3): Raw NBA player data in JSON format will be stored in an S3 bucket.
  • Data Cataloging (Glue): AWS Glue will crawl the S3 bucket to infer the schema and create a metadata catalog, making it queryable.
  • Data Querying (Athena): AWS Athena will query the data in S3 using the metadata catalog defined by Glue, allowing us to explore the data with SQL.

Detailed Step-by-Step Guide

Step 1: Setting Up Your Development Environment with VS Code and AWS

Before we dive in, make sure you have:

  • VS Code: Download and install VS Code if you haven't already.
  • AWS Toolkit for VS Code: Install the AWS Toolkit extension in VS Code.
  • AWS Credentials: Configure your AWS credentials within the AWS Toolkit. This allows VS Code to interact with your AWS account securely.
  • Python 3.x: Python 3.x installed in your development environment, with VS Code configured to use the relevant Python interpreter.
  • AWS Account: An active AWS account.

Step 2: Create Python Project Structure

  • Create a new directory for your project and name it something relevant, like `nba-data-lake`.
  • Inside that directory, create a new directory named `src`. This is where you will store your Python scripts.
  • Create a new file called `setup_nba_datalake.py` inside the `src` folder. This will house the script to create our S3 bucket, set up Glue, and configure Athena.

Step 3: The Python Script (`setup_nba_datalake.py`)

Here’s the Python code. Copy this into your `setup_nba_datalake.py` file:

                        
import os
import json
import boto3
import requests
import time
from dotenv import load_dotenv
load_dotenv()

# AWS Configurations (Environment variables)
region = os.getenv("AWS_REGION")
bucket_name = os.getenv("AWS_S3_BUCKET_NAME")
glue_database_name = os.getenv("AWS_GLUE_DATABASE_NAME")
athena_output_location = os.getenv("AWS_ATHENA_OUTPUT_LOCATION")
nba_api_key = os.getenv("NBA_API_KEY")


# Create AWS clients
s3_client = boto3.client('s3', region_name=region)
glue_client = boto3.client('glue', region_name=region)
athena_client = boto3.client('athena', region_name=region)


# Function to create S3 bucket
def create_s3_bucket():
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket {bucket_name} exists.")
    except s3_client.exceptions.ClientError:
        print(f"Creating bucket {bucket_name}")
        try:
            if region == "us-east-1":
                s3_client.create_bucket(Bucket=bucket_name)
            else:
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
            print(f"Successfully created bucket {bucket_name}")
        except Exception as e:
            print(f"Error creating bucket: {e}")

# Function to create Glue database
def create_glue_database():
    try:
        glue_client.get_database(Name=glue_database_name)
        print(f"Glue database {glue_database_name} exists.")
    except glue_client.exceptions.EntityNotFoundException:
        print(f"Creating Glue database {glue_database_name}.")
        try:
            glue_client.create_database(
                DatabaseInput={
                    'Name': glue_database_name,
                    'Description': 'Glue database for NBA data lake'
                }
            )
            print(f"Successfully created Glue database {glue_database_name}")
        except Exception as e:
            print(f"Error creating Glue database: {e}")


# Function to fetch data from the API
def fetch_nba_data():
    api_url = f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={nba_api_key}"
    try:
        response = requests.get(api_url)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching NBA data from API: {e}")
        return None


# Function to upload data to S3
def upload_data_to_s3(data):
    try:
        s3_client.put_object(
            Bucket=bucket_name,
            Key="raw-data/nba_players.json",
            Body=json.dumps(data),
            ContentType='application/json'
        )
        print("Successfully uploaded NBA player data to S3.")
    except Exception as e:
        print(f"Error uploading data to S3: {e}")

# Function to create Glue table
def create_glue_table():
    try:
        glue_client.get_table(DatabaseName=glue_database_name, Name="nba_players")
        print("Glue table nba_players already exists")
    except glue_client.exceptions.EntityNotFoundException:
        try:
            glue_client.create_table(
                DatabaseName=glue_database_name,
                TableInput={
                    'Name': 'nba_players',
                    'Description': 'NBA player data table',
                    'StorageDescriptor': {
                        'Columns': [
                            {'Name': 'PlayerID', 'Type': 'int'},
                            {'Name': 'FirstName', 'Type': 'string'},
                            {'Name': 'LastName', 'Type': 'string'},
                            {'Name': 'Position', 'Type': 'string'},
                            {'Name': 'Team', 'Type': 'string'},
                            {'Name': 'BirthCity', 'Type': 'string'},
                            {'Name': 'BirthState', 'Type': 'string'},
                            {'Name': 'College', 'Type': 'string'},
                            {'Name': 'Jersey', 'Type': 'string'}
                        ],
                        'Location': f's3://{bucket_name}/raw-data/',
                        'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
                        'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
                        'SerdeInfo': {
                            'Name': 'jsonSerDe',
                            'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
                            'Parameters': {
                                'paths': 'PlayerID,FirstName,LastName,Position,Team,BirthCity,BirthState,College,Jersey'
                            }
                        }
                    },
                    'TableType': 'EXTERNAL_TABLE'
                }
            )
            print("Successfully created Glue table nba_players")
        except Exception as e:
            print(f"Error creating Glue table: {e}")

# Function to configure Athena
def configure_athena():
    try:
        print("Configuring the Athena output location")
        athena_client.start_query_execution(
            QueryString="SELECT * FROM nba_players LIMIT 10;",
            QueryExecutionContext={'Database': glue_database_name},
            ResultConfiguration={'OutputLocation': athena_output_location}
        )
        print("Successfully configured Athena")
    except Exception as e:
        print(f"Error configuring Athena: {e}")

def main():
    create_s3_bucket()
    time.sleep(5)
    nba_data = fetch_nba_data()
    if nba_data:
      upload_data_to_s3(nba_data)
      create_glue_database()
      create_glue_table()
      configure_athena()
    else:
      print("Failed to fetch NBA data")


if __name__ == "__main__":
    main()
                        
                    

Step 4: Setting Up Environment Variables (`.env` file):

  • Create a file named `.env` in the root of your project (`nba-data-lake`).
  • Add the following environment variables, replacing the placeholders with your actual values:
  • Replace the placeholders with values unique to your environment: choose a globally unique S3 bucket name, set a unique Athena output location, and obtain your API key from the sportsdata.io web page.
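
The variable names below are the ones the script reads with `os.getenv`; the values are placeholders you must replace with your own:

```
AWS_REGION=us-east-1
AWS_S3_BUCKET_NAME=your-unique-bucket-name
AWS_GLUE_DATABASE_NAME=nba_datalake
AWS_ATHENA_OUTPUT_LOCATION=s3://your-unique-bucket-name/athena-output/
NBA_API_KEY=your_sportsdata_io_api_key
```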

Step 5: Running the Script:

  • Open the `setup_nba_datalake.py` file in VS Code.
  • Ensure the correct Python interpreter is selected in VS Code (Python: Select Interpreter in the Command Palette).
  • Run the script from your terminal: `python src/setup_nba_datalake.py`.

Step 6: Verify the Resources

  • S3: In the AWS console, open the S3 service and check that the bucket was created and the JSON file is inside the `raw-data` folder.
  • Glue: In the AWS console, open the Glue service and check that the database and the table were created.
  • Athena: In the AWS console, open the Athena service and run `SELECT * FROM nba_players LIMIT 10;` to confirm the data is available for querying.
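
You can also run the verification query from Python with boto3 rather than the console. A minimal sketch: the helper below (the function name is my own) builds the arguments you would pass to `athena_client.start_query_execution(**...)`, using the same table and columns the setup script defines.

```python
# Builds the request arguments for athena_client.start_query_execution.
# The table and column names match the Glue table from the setup script;
# database and output_location come from your .env values.
def build_verification_query(database, output_location):
    return {
        "QueryString": "SELECT FirstName, LastName, Team FROM nba_players LIMIT 10;",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }
```

Then `athena_client.start_query_execution(**build_verification_query(glue_database_name, athena_output_location))` submits the query, and the results land in your Athena output location.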

Step 7: Clean Up Resources

If you wish to delete the resources created use the following script:

                        
import boto3
import os
import time
from dotenv import load_dotenv
load_dotenv()

# AWS Configurations
region = os.getenv("AWS_REGION")
bucket_name = os.getenv("AWS_S3_BUCKET_NAME")
glue_database_name = os.getenv("AWS_GLUE_DATABASE_NAME")

# Create AWS clients
s3_client = boto3.client('s3', region_name=region)
glue_client = boto3.client('glue', region_name=region)
athena_client = boto3.client('athena', region_name=region)


# Function to delete S3 bucket and its contents
def delete_s3_bucket():
    try:
        print(f"Deleting all objects in the bucket {bucket_name}")
        response = s3_client.list_objects_v2(Bucket=bucket_name)
        if 'Contents' in response:
            for obj in response['Contents']:
                s3_client.delete_object(Bucket=bucket_name, Key=obj['Key'])
                print(f"Deleted object {obj['Key']} in bucket {bucket_name}")
        print(f"Deleting the bucket {bucket_name}")
        s3_client.delete_bucket(Bucket=bucket_name)
        print(f"Successfully deleted bucket {bucket_name}")
    except Exception as e:
        print(f"Error deleting S3 bucket: {e}")
# Function to delete glue table
def delete_glue_table():
    try:
        print(f"Deleting the table nba_players in {glue_database_name}")
        glue_client.delete_table(DatabaseName=glue_database_name, Name='nba_players')
        print(f"Successfully deleted table nba_players in {glue_database_name}")
    except Exception as e:
        print(f"Error deleting glue table: {e}")
# Function to delete Glue database
def delete_glue_database():
    try:
         print(f"Deleting the database {glue_database_name}")
         glue_client.delete_database(Name=glue_database_name)
         print(f"Successfully deleted database {glue_database_name}")
    except Exception as e:
         print(f"Error deleting Glue database: {e}")
# Function to delete Athena output location
def delete_athena_output():
    try:
        print("Deleting Athena query results")
        response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix="athena-output/")
        if 'Contents' in response:
            for obj in response['Contents']:
                s3_client.delete_object(Bucket=bucket_name, Key=obj['Key'])
                print(f"Deleted athena output object {obj['Key']} in bucket {bucket_name}")
            print("Successfully deleted the Athena output folder")
        else:
           print("No athena output folder found")
    except Exception as e:
          print(f"Error deleting athena output : {e}")
def main():
    delete_glue_table()
    time.sleep(5)
    delete_glue_database()
    delete_athena_output()
    time.sleep(5)
    delete_s3_bucket()

if __name__ == "__main__":
    main()
                        
                     
  • Create a new file in the src folder and name it `delete_resources.py`.
  • Copy the above python code into that file and run it.
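
One caveat with the cleanup script: `list_objects_v2` returns at most 1,000 keys per call, so a bucket with more objects than that would not be fully emptied. If you end up with a larger bucket, a paginator handles this; a sketch (the `s3` argument is any boto3 S3 client):

```python
# Deletes every object in the bucket, not just the first 1,000 keys that a
# single list_objects_v2 call returns. `s3` is a boto3 S3 client.
def delete_all_objects(s3, bucket):
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # Pages with no objects have no 'Contents' key.
        for obj in page.get("Contents", []):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
            deleted += 1
    return deleted
```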

Differences from the Original Tutorial

  • VS Code instead of Cloud Shell: You ran the Python script in VS Code rather than the AWS Cloud Shell. This gives you a persistent local development environment with your dependencies installed, whereas Cloud Shell does not retain its environment state from session to session.
  • Environment Variables: Instead of hardcoding AWS configurations and API keys in the code, this version uses environment variables, which keeps your API keys and other secrets out of publicly shared code.
  • Local Execution: The project code is run from your local machine instead of a cloud instance.
  • Resource Deletion: You created a separate resource deletion script, rather than having a combined script.

Python Code Explanation

Okay, let's break down this Python script line by line, explaining its purpose, the AWS services it uses, and the overall data flow.

                        
import boto3
import json
import time
import requests
from dotenv import load_dotenv
import os
                        
                     
  • `import boto3`: Imports the AWS SDK for Python, allowing interaction with AWS services.
  • `import json`: Imports the json library for working with JSON data.
  • `import time`: Imports the time library for adding pauses (like `time.sleep`).
  • `import requests`: Imports the requests library for making HTTP calls to APIs.
  • `from dotenv import load_dotenv`: Imports the `load_dotenv` function to load environment variables from a `.env` file.
  • `import os`: Imports the `os` library for interacting with the operating system (including environment variables).
                        
# Load environment variables from .env file
load_dotenv()
                        
                     
  • `load_dotenv()`: Loads environment variables from a `.env` file into the script's environment. This is used to store sensitive information such as API keys.
                        
# AWS configurations
region = os.getenv("AWS_REGION")
bucket_name = os.getenv("AWS_S3_BUCKET_NAME")
glue_database_name = os.getenv("AWS_GLUE_DATABASE_NAME")
athena_output_location = os.getenv("AWS_ATHENA_OUTPUT_LOCATION")
nba_api_key = os.getenv("NBA_API_KEY")
                        
                     
  • `region = os.getenv("AWS_REGION")`: Sets the default AWS region from environment variables.
  • `bucket_name = os.getenv("AWS_S3_BUCKET_NAME")`: Sets a unique S3 bucket name from environment variables.
  • `glue_database_name = os.getenv("AWS_GLUE_DATABASE_NAME")`: Sets a default AWS Glue database name from environment variables.
  • `athena_output_location = os.getenv("AWS_ATHENA_OUTPUT_LOCATION")`: Sets the S3 path for Athena query results from environment variables.
  • `nba_api_key = os.getenv("NBA_API_KEY")`: Retrieves the NBA_API_KEY from environment variables.
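
Because `os.getenv` returns `None` for any variable missing from the `.env` file, a quick guard right after these assignments can save you from confusing boto3 errors later. A sketch (the helper name is my own):

```python
import os

# The five variables the setup script expects in the .env file.
REQUIRED_VARS = [
    "AWS_REGION",
    "AWS_S3_BUCKET_NAME",
    "AWS_GLUE_DATABASE_NAME",
    "AWS_ATHENA_OUTPUT_LOCATION",
    "NBA_API_KEY",
]

def check_required_env():
    """Return the names of any required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not os.getenv(name)]
```

Calling `check_required_env()` right after `load_dotenv()` and aborting if the result is non-empty fails fast with a clear message instead of passing `None` values into boto3.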
                        
# Create AWS clients
s3_client = boto3.client("s3", region_name=region)
glue_client = boto3.client("glue", region_name=region)
athena_client = boto3.client("athena", region_name=region)
                        
                     
  • `s3_client = boto3.client("s3", region_name=region)`: Creates an S3 client object to interact with AWS S3.
  • `glue_client = boto3.client("glue", region_name=region)`: Creates a Glue client object to interact with AWS Glue.
  • `athena_client = boto3.client("athena", region_name=region)`: Creates an Athena client object to interact with AWS Athena.
                        
def create_s3_bucket():
    """Create an S3 bucket for storing sports data."""
    try:
        if region == "us-east-1":
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        print(f"S3 bucket '{bucket_name}' created successfully.")
    except Exception as e:
        print(f"Error creating S3 bucket: {e}")
                        
                     
  • `def create_s3_bucket()`: Defines a function to create an S3 bucket.
  • `try`: Starts a try block to handle any errors that occur during S3 bucket creation.
  • `if region == "us-east-1"`: Checks if the region is `us-east-1` to avoid specifying the region during creation.
    • `s3_client.create_bucket(Bucket=bucket_name)`: Creates an S3 bucket in `us-east-1`.
  • `else`: if the region is not `us-east-1`, specify the location during creation.
    • `s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={"LocationConstraint": region})`: Creates an S3 bucket with the specified location constraint (for other AWS regions).
  • `print(f"S3 bucket '{bucket_name}' created successfully.")`: Prints a success message if the bucket is created.
  • `except Exception as e`: Catches any exceptions during bucket creation.
  • `print(f"Error creating S3 bucket: {e}")`: Prints an error message if bucket creation fails.
                        
def create_glue_database():
    """Create a Glue database for the data lake."""
    try:
        glue_client.create_database(
            DatabaseInput={
                "Name": glue_database_name,
                "Description": "Glue database for NBA data lake.",
            }
        )
        print(f"Glue database '{glue_database_name}' created successfully.")
    except Exception as e:
        print(f"Error creating Glue database: {e}")
                        
                     
  • `def create_glue_database()`: Defines a function to create an AWS Glue database.
  • `try`: Starts a try block to catch errors during database creation.
  • `glue_client.create_database(...)`: Creates an AWS Glue database with the given configurations (name and description).
  • `print(f"Glue database '{glue_database_name}' created successfully.")`: Prints a success message if the database is created.
  • `except Exception as e`: Catches any exceptions that may occur.
  • `print(f"Error creating Glue database: {e}")`: Prints an error message if the database creation fails.
                        
def fetch_nba_data():
    """Fetch NBA player data from sportsdata.io."""
    try:
        headers = {"Ocp-Apim-Subscription-Key": nba_api_key}
        response = requests.get(f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={nba_api_key}", headers=headers)
        response.raise_for_status()  # Raise an error for bad status codes
        print("Fetched NBA data successfully.")
        return response.json()  # Return JSON response
    except Exception as e:
        print(f"Error fetching NBA data: {e}")
        return []
                        
                     
  • `def fetch_nba_data()`: Defines a function to fetch NBA player data.
  • `try`: Starts a try block to handle API call errors.
  • `headers = {"Ocp-Apim-Subscription-Key": nba_api_key}`: Sets the API key in the HTTP header.
  • `response = requests.get(f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={nba_api_key}", headers=headers)`: Makes an HTTP GET request to the Sportsdata.io API endpoint.
  • `response.raise_for_status()`: Raises an HTTPError for bad responses (4xx or 5xx status codes).
  • `print("Fetched NBA data successfully.")`: Prints a success message.
  • `return response.json()`: Returns the JSON response.
  • `except Exception as e`: Catches any exceptions during API call.
  • `print(f"Error fetching NBA data: {e}")`: Prints an error if the API call fails.
  • `return []`: returns an empty list if the API call fails.
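
The player objects the API returns carry many more fields than the nine columns the Glue table declares. Trimming each record down to just those columns before uploading keeps the raw data aligned with the table schema; a sketch (the helper name is my own):

```python
# The columns declared in the Glue table definition.
TABLE_COLUMNS = [
    "PlayerID", "FirstName", "LastName", "Position", "Team",
    "BirthCity", "BirthState", "College", "Jersey",
]

def project_players(players):
    """Keep only the fields the Glue table defines; missing fields become None."""
    return [{col: p.get(col) for col in TABLE_COLUMNS} for p in players]
```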
                        
def upload_data_to_s3(data):
    """Upload NBA data to the S3 bucket."""
    try:
        s3_client.put_object(
            Bucket=bucket_name,
            Key="raw-data/nba_players.json",
            Body=json.dumps(data),
            ContentType='application/json'
        )
        print("Successfully uploaded NBA player data to S3.")
    except Exception as e:
        print(f"Error uploading data to S3: {e}")
                        
                    
  • `def upload_data_to_s3(data)`: Defines a function to upload data to S3.
  • `try`: Starts a try block for error handling.
  • `s3_client.put_object(...)`: Uploads the JSON data to the S3 bucket.
  • `print("Successfully uploaded NBA player data to S3.")`: Prints a success message indicating the file location.
  • `except Exception as e`: Catches exceptions during S3 upload.
  • `print(f"Error uploading data to S3: {e}")`: Prints an error message if upload fails.
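
One caveat worth flagging here: the `org.openx.data.jsonserde.JsonSerDe` the Glue table uses parses one JSON object per line, while `json.dumps(data)` serializes the whole API response as a single JSON array. If your Athena queries come back empty, converting the payload to newline-delimited JSON before `put_object` is a common fix; a sketch:

```python
import json

# The OpenX JSON SerDe expects one JSON object per line (NDJSON), not a
# single JSON array, so serialize each record on its own line.
def to_line_delimited_json(records):
    return "\n".join(json.dumps(record) for record in records)
```

You would then pass `Body=to_line_delimited_json(data)` to `put_object` instead of `Body=json.dumps(data)`.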
                        
    
                            
def configure_athena():
    """Set up Athena output location."""
    try:
        athena_client.start_query_execution(
            QueryString="CREATE DATABASE IF NOT EXISTS nba_analytics",
            QueryExecutionContext={"Database": glue_database_name},
            ResultConfiguration={"OutputLocation": athena_output_location},
        )
        print("Athena output location configured successfully.")
    except Exception as e:
        print(f"Error configuring Athena: {e}")
                                                        
                                                     
  • `def configure_athena()`: Defines a function to setup Athena configuration.
  • `try`: Starts a try block for handling any exceptions.
  • `athena_client.start_query_execution(...)`: Configures Athena by creating the specified database and output location.
  • `print("Athena output location configured successfully.")`: Prints a success message.
  • `except Exception as e`: Catches exceptions during Athena configuration.
  • `print(f"Error configuring Athena: {e}")`: Prints an error message.
                                                        
# Main workflow
def main():
    print("Setting up data lake for NBA sports analytics...")
    create_s3_bucket()
    time.sleep(5)  # Ensure bucket creation propagates
    create_glue_database()
    nba_data = fetch_nba_data()
    if nba_data:  # Only proceed if data was fetched successfully
        upload_data_to_s3(nba_data)
    create_glue_table()
    configure_athena()
    print("Data lake setup complete.")
                                                        
                                                     
  • `def main()`: Defines the main function of the program to manage the workflow.
  • `print("Setting up data lake for NBA sports analytics...")`: Prints the start message.
  • `create_s3_bucket()`: Calls the function to create the S3 bucket.
  • `time.sleep(5)`: Adds a 5 second delay to allow the S3 bucket creation to take effect.
  • `create_glue_database()`: Calls the function to create a Glue database.
  • `nba_data = fetch_nba_data()`: Calls the function to fetch NBA data.
  • `if nba_data`: Checks if data is present before proceeding.
    • `upload_data_to_s3(nba_data)`: Calls the function to upload the NBA data to the S3 bucket, if the data was successfully fetched.
  • `create_glue_table()`: Calls the function to create a Glue table.
  • `configure_athena()`: Calls the function to configure Athena.
  • `print("Data lake setup complete.")`: Prints the completion message.
                                                        
if __name__ == "__main__":
    main()
                                                        
                                                     
  • `if __name__ == "__main__":`: Makes sure that the main function is only called when this script is executed directly.
  • `main()`: Calls the main function to run the main workflow when this script is directly executed.

Data Flow

  • Environment Setup: The script starts by loading environment variables.
  • AWS Setup: It creates connections to AWS S3, AWS Glue, and AWS Athena services.
  • S3 Bucket Creation: Creates an S3 bucket for data storage.
  • Glue Database Creation: Creates a Glue database to organize data.
  • Data Fetching: Retrieves NBA player data using the Sportsdata.io API, using an API key.
  • Data Upload: Uploads the JSON data into the specified S3 bucket.
  • Glue Table Creation: Creates a table in the AWS Glue database referencing the S3 data.
  • Athena Setup: Configures Athena for querying data in S3.
  • Completion: Prints a message to indicate that the data lake is set up successfully.

Key AWS Services

  • Amazon S3 (Simple Storage Service): Used as the storage layer for raw NBA data and for Athena query results.
  • AWS Glue: Used as the metadata layer to create databases and tables over the raw data in S3.
  • Amazon Athena: Used to query the data stored in S3 and described by Glue tables.

End-to-End Data Flow

  • Raw data is pulled from the Sportsdata.io API as JSON.
  • The JSON data is saved to the specified S3 bucket.
  • AWS Glue creates an external table definition using the raw S3 data.
  • AWS Athena is then setup so that it can query and use the table that was created in Glue.

This detailed explanation should provide a good understanding of the script, the AWS services it uses, and the flow of data throughout the application.