
College Football Data Analysis (Part 1)

  • Writer: Stephen Dawkins
  • Dec 22, 2024
  • 3 min read

Updated: Jan 2

Overview

The objective of this project was to explore a data science application in the realm of college football. With the vast abundance of data available on team performance, individual player statistics, and game results, I wanted to investigate whether it was possible to predict the outcomes of a season using data from the prior year. Specifically, the project aimed to determine whether a machine learning model could predict team performance metrics such as total wins or improvement in performance year over year.

To achieve this, I used a combination of Python for data manipulation and modeling, Airflow for orchestrating data pipelines, and AWS S3 for data storage and retrieval. Below, I outline the steps taken to clean and normalize the data, build and evaluate predictive models, and assess the challenges encountered during the process.

Methodology

  1. Data Collection:

    • College football data was sourced and stored in AWS S3, ensuring scalability and secure access.

    • The data was organized into JSON and CSV formats, representing various aspects such as team statistics, player performance, and game records.

  2. Data Pipeline:

    • Apache Airflow was used to orchestrate the data pipeline, automating tasks such as data extraction, transformation, and storage.

    • Python operators in Airflow were used to load raw data from S3, clean and normalize it, and save the processed outputs back to S3.

  3. Code Example for Data Normalization: The snippet below creates the binary target label and writes the processed dataset back to S3:

def process_college_football_data(**kwargs):
    # Load the normalized dataset from S3 (load_s3_csv is a small helper
    # that reads a CSV object into a pandas DataFrame)
    data = load_s3_csv("normalized_college_football_data.csv")

    # Create the binary 'improvement' target: did total wins increase
    # from the prior season?
    data['improvement'] = (data['new_total.wins'] > data['total.wins']).astype(int)

    # Save the processed data back to S3 (hook is an Airflow S3Hook
    # created earlier in the task)
    processed_filename = "processed_college_football_data.csv"
    data.to_csv(f"/tmp/{processed_filename}", index=False)
    hook.load_file(
        filename=f"/tmp/{processed_filename}",
        key=f"cleaned-data/{processed_filename}",
        bucket_name=BUCKET_NAME,
        replace=True
    )
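The extract-transform-load flow that the Airflow task wraps can be sketched as three plain Python functions. This is a simplified stand-in: a local directory plays the role of the S3 bucket, and the file names are illustrative; in the real DAG each stage runs as its own PythonOperator against S3.

```python
import csv
import os
import tempfile

# Local directory standing in for the S3 bucket.
def extract(bucket, key):
    """Read a CSV object into a list of dict rows."""
    with open(os.path.join(bucket, key), newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Add the binary 'improvement' label: did total wins increase?"""
    for row in rows:
        row["improvement"] = int(int(row["new_total.wins"]) > int(row["total.wins"]))
    return rows

def load(bucket, key, rows):
    """Write the processed rows back out as CSV."""
    with open(os.path.join(bucket, key), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

bucket = tempfile.mkdtemp()
with open(os.path.join(bucket, "raw.csv"), "w", newline="") as f:
    f.write("team,total.wins,new_total.wins\nGeorgia,11,13\nAuburn,6,5\n")

rows = transform(extract(bucket, "raw.csv"))
load(bucket, "processed.csv", rows)
print([r["improvement"] for r in rows])  # [1, 0]
```

Keeping each stage as a separate function is what makes the Airflow wrapping straightforward: each one maps onto a task, and intermediate artifacts live in storage rather than in memory between tasks.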

Steps Taken

  1. Data Cleaning and Normalization:

    • Extracted relevant features from the raw data (e.g., total wins, rushing yards, passing accuracy).

    • Flattened nested JSON data to ensure compatibility with machine learning models.

    • Scaled numerical features to ensure equal contribution during modeling.

    • Created new columns like "improvement" to represent whether a team’s total wins improved year over year.

  2. Modeling:

    • Initial Attempt: Predict Total Wins:

      • Multiple regression models (Linear Regression, Lasso, Random Forest, Gradient Boosting) were tested.

      • Results:

        • Models struggled to predict total wins accurately; a negative R² score means the model fit worse than simply predicting the mean.

        • Example:

          • Mean Squared Error (MSE): 12.53

          • R² Score: -0.66

    • Alternative Attempt: Predict Improvement:

      • Shifted focus to binary classification (“Did a team improve?”).

      • Used Logistic Regression, Random Forest, and Gradient Boosting for classification.

      • Results:

        • Logistic Regression achieved a moderate AUC-ROC of 0.7018 but struggled with class imbalance.

        • Example classification report (class 1 = team improved):

                         Precision   Recall   F1-Score   Support
              0            0.86       0.63      0.73        19
              1            0.36       0.67      0.47         6
              Accuracy                          0.64        25

  3. Evaluation:

    • The results indicated that predicting team performance improvement or outcomes is challenging given the available features.

    • The imbalance in the target variable (fewer teams improving) further complicated classification tasks.
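The flattening and scaling steps above can be sketched with pandas. The nested record below is a hypothetical stand-in for the raw team JSON (the real field names may differ); pd.json_normalize produces the dot-separated columns like "total.wins" used elsewhere in the pipeline.

```python
import pandas as pd

# Hypothetical nested records shaped like the raw team JSON.
raw = [
    {"team": "Georgia", "total": {"wins": 11, "losses": 2},
     "offense": {"rushing_yards": 2100, "passing_accuracy": 0.66}},
    {"team": "Auburn", "total": {"wins": 6, "losses": 7},
     "offense": {"rushing_yards": 1800, "passing_accuracy": 0.58}},
]

# Flatten nested JSON into dot-separated columns (e.g. "total.wins").
df = pd.json_normalize(raw)

# Scale numeric features to zero mean / unit variance so no single
# feature dominates during modeling.
num_cols = df.select_dtypes("number").columns
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

print(sorted(df.columns))
```

After this step every numeric column contributes on the same scale, which matters most for the linear models; tree-based models are largely scale-invariant.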

Challenges Encountered

  1. Feature Engineering:

    • Many features did not correlate strongly with the target variables.

    • Creating meaningful derived features (e.g., efficiency metrics) could improve results.

  2. Data Quality:

    • Some datasets had missing or inconsistent values, requiring imputation and careful validation.

  3. Model Selection:

    • Linear models underperformed, indicating potential non-linear relationships between features and outcomes.

    • Tree-based models performed better but were still limited by the available features.

  4. Class Imbalance:

    • The "improvement" target variable was imbalanced, necessitating techniques like oversampling (SMOTE) or class weighting.
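As a minimal sketch of the class-weighting option, scikit-learn's class_weight="balanced" reweights each class inversely to its frequency, so the minority "improved" class is not drowned out. The features below are synthetic, generated only to mimic an imbalance similar to the roughly one-in-four improvement rate seen in the classification report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the team features; the threshold is chosen so
# only about a quarter of the labels are positive.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.7).astype(int)

# "balanced" weights each class by n_samples / (n_classes * class_count),
# boosting the penalty for misclassifying the rare positive class.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
probs = clf.predict_proba(X)[:, 1]
print(f"positive rate: {y.mean():.2f}")
```

Class weighting is often a simpler first step than SMOTE, since it needs no resampling and works unchanged inside cross-validation.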

Future Steps

  1. Expand Dataset:

    • Include additional features such as individual player statistics, coach history, and game-level data.

    • Gather data from more seasons to improve model robustness.

  2. Advanced Feature Engineering:

    • Create interaction terms (e.g., passing accuracy * rushing yards) to capture complex relationships.

    • Incorporate advanced metrics like Expected Points Added (EPA).

  3. Hyperparameter Tuning:

    • Use GridSearchCV or RandomizedSearchCV to optimize model hyperparameters.

  4. Try Advanced Models:

    • Experiment with deep learning models (e.g., neural networks) to capture non-linear relationships.

    • Use ensemble methods like XGBoost or CatBoost for improved accuracy.

  5. Address Imbalance:

    • Implement oversampling techniques or focus on multi-class classification instead of binary targets.
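The hyperparameter tuning step can be sketched with GridSearchCV under stated assumptions: synthetic features stand in for the team data, and the grid is deliberately small and hypothetical; a real search would cover more values and models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic stand-in for the scaled team features.
X = rng.normal(size=(120, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Exhaustively evaluate each parameter combination with 3-fold CV,
# scoring on AUC-ROC to stay robust to class imbalance.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Scoring on AUC-ROC rather than accuracy keeps the search honest when one class dominates, which directly addresses the imbalance issue noted earlier.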

Conclusion

This project demonstrated the challenges of applying data science to complex domains like college football. While the models struggled to predict outcomes effectively, the process highlighted areas for improvement and the potential for richer datasets and advanced methodologies to yield better results. The use of Airflow for orchestration, Python for modeling, and AWS S3 for storage ensured a scalable and reproducible pipeline. Future iterations will aim to refine feature engineering, address data limitations, and explore more sophisticated modeling techniques.
