Predicting Basketball Shot Outcomes Using Machine Learning: A Kaggle Case Study

Introduction

Basketball, a globally loved sport, offers a rich field for exploring the power of data science and machine learning. As the game becomes more analytically driven, predicting shot outcomes has become a key area of interest. By leveraging detailed player and game statistics, data scientists can build models to predict whether a player will make or miss a shot.

In this blog, we’ll walk through a machine learning approach to predicting basketball shot outcomes, developed for a Kaggle competition. The methodology covers cleaning and preparing the data, training a Logistic Regression model, and generating predictions on the test data. The project originated from the Seal Neaward GitHub repository, which provided the foundation for the data and code used in the Kaggle competition.

The Problem

The goal of the competition is to predict the likelihood of a shot being made or missed. Using a dataset of basketball shots, you must build a machine learning model that predicts whether a shot will be successful based on various features such as:

  • Shot type: 2-point, 3-point, layup, etc.
  • Shot zone: Where the shot is taken on the court (e.g., 3-point line, key, paint).
  • Game context: Time remaining, quarter of the game, score differences, and more.

Each shot is labeled with a target variable: SHOT_MADE_FLAG, where:

  • 1 indicates a successful shot,
  • 0 indicates a missed shot.

The challenge is to use these features to build a model that can accurately predict whether a shot will be made or missed.

Step-by-Step Breakdown

Let’s walk through the process of developing this shot prediction model, from data preparation to submitting predictions to Kaggle.

1. Loading and Preprocessing the Data

The first step is loading the dataset, which comes in two parts: a training set and a test set. The training data (train.csv) contains the features and target variable, while the test data (solution_no_answer.csv) is missing the target variable but contains the same features.

Preprocessing involves several key tasks:

  • One-Hot Encoding: This step converts categorical variables (e.g., ACTION_TYPE, SHOT_TYPE, SHOT_ZONE_AREA) into dummy/indicator variables. This is necessary because machine learning algorithms require numerical input.
  • Dropping Irrelevant Columns: Some columns, like PLAYER_ID, GAME_DATE, and TEAM_NAME, are not relevant for predicting shot outcomes, so we remove them.
import pandas as pd

# Load the training data
nba_k = pd.read_csv('train.csv')

# One-hot encode categorical variables
action_type_k = pd.get_dummies(nba_k["ACTION_TYPE"])
shot_type_k = pd.get_dummies(nba_k["SHOT_TYPE"])
shot_zone_area_k = pd.get_dummies(nba_k["SHOT_ZONE_AREA"])
shot_zone_basic_k = pd.get_dummies(nba_k["SHOT_ZONE_BASIC"])
shot_zone_range_k = pd.get_dummies(nba_k["SHOT_ZONE_RANGE"])

# Drop irrelevant columns
nba_k = nba_k.drop(["ACTION_TYPE", "SHOT_TYPE", "SHOT_ZONE_AREA", "SHOT_ZONE_BASIC", 
                    "SHOT_ZONE_RANGE", "EVENTTIME", "GAME_DATE", "GAME_ID", "HTM", 
                    "LOC_X", "LOC_Y", "MINUTES_REMAINING", "PERIOD", "PLAYER_ID", 
                    "PLAYER_NAME", "SHOT_ATTEMPTED_FLAG", "SHOT_TIME", "TEAM_ID", 
                    "TEAM_NAME", "VTM"], axis=1)

# Concatenate dummy variables back into the dataframe
nba_k = pd.concat([nba_k, action_type_k, shot_type_k, shot_zone_area_k, 
                    shot_zone_basic_k, shot_zone_range_k], axis=1)

2. Feature and Target Variable

The target variable is SHOT_MADE_FLAG, which indicates whether a shot was successful (1) or missed (0). The features consist of the remaining columns: the one-hot-encoded action, shot type, and shot zone indicators, along with the numeric game-context fields that survive the column drop.

X_train_k = nba_k.drop(["SHOT_MADE_FLAG", "GAME_EVENT_ID"], axis=1)
y_train_k = nba_k["SHOT_MADE_FLAG"]

3. Model Training: Logistic Regression

Now that the data is ready, we can train a machine learning model. For this competition, we use Logistic Regression as the baseline model. Logistic Regression is a good starting point for binary classification problems like this one, where we are predicting whether a shot will be made or missed.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# max_iter raised from the default to help the solver converge
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train_k, y_train_k)

After fitting the model to the training data, we evaluate its performance using the accuracy score and the classification report. Note that these metrics are computed on the same data the model was fit to, so they will be optimistic compared with performance on unseen shots.

predictions_k = logmodel.predict(X_train_k)
print(classification_report(y_train_k, predictions_k))
print("Accuracy: {}".format(accuracy_score(y_train_k, predictions_k)))

These metrics give us insights into the model’s performance on the training data, including precision, recall, and F1-score.
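A quick way to get a more honest estimate is to hold out a validation split before fitting. The sketch below is a minimal, self-contained illustration of the idea: synthetic features stand in for the real shot data, and the variable names (`X_tr`, `X_val`, etc.) are illustrative rather than from the original code.

```python
# Hold-out validation sketch: fit on 80% of the rows, score on the
# remaining 20%. Synthetic data stands in for the real shot features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"feat_{i}" for i in range(5)])
y = (X["feat_0"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("train accuracy:", accuracy_score(y_tr, model.predict(X_tr)))
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```

The gap between the two numbers gives a rough sense of how much the training-set metrics overstate real performance.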

4. Making Predictions on Test Data

Once the model is trained, we apply it to the test data (solution_no_answer.csv). The test dataset contains the same features but lacks the SHOT_MADE_FLAG (the target variable). Our objective is to predict whether each shot in the test set will be made or missed.

nba = pd.read_csv('solution_no_answer.csv')

# One-hot encode the test data
action_type = pd.get_dummies(nba["ACTION_TYPE"])
shot_type = pd.get_dummies(nba["SHOT_TYPE"])
shot_zone_area = pd.get_dummies(nba["SHOT_ZONE_AREA"])
shot_zone_basic = pd.get_dummies(nba["SHOT_ZONE_BASIC"])
shot_zone_range = pd.get_dummies(nba["SHOT_ZONE_RANGE"])

# Drop irrelevant columns from test data
nba = nba.drop(["ACTION_TYPE", "SHOT_TYPE", "SHOT_ZONE_AREA", "SHOT_ZONE_BASIC", 
                "SHOT_ZONE_RANGE", "EVENTTIME", "GAME_DATE", "HTM", "LOC_X", 
                "LOC_Y", "MINUTES_REMAINING", "PERIOD", "PLAYER_ID", "PLAYER_NAME", 
                "SHOT_ATTEMPTED_FLAG", "SHOT_TIME", "TEAM_ID", "TEAM_NAME", "VTM"], axis=1)

# Concatenate dummy variables for the test set
nba = pd.concat([nba, action_type, shot_type, shot_zone_area, 
                 shot_zone_basic, shot_zone_range], axis=1)

# Prepare the test data for predictions
X_test = nba.drop('GAME_EVENT_ID', axis=1)
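One pitfall with encoding the train and test sets separately is that pd.get_dummies only creates columns for the categories actually present in each file, so the test matrix can end up with different columns than the model was fit on. A defensive fix (not part of the original code) is to reindex the test dummies against the training columns; the toy example below shows the mechanics:

```python
# Align test-set dummy columns with the training columns: categories
# missing from the test file become all-zero columns, and the column
# order matches what the model saw during fitting.
import pandas as pd

train = pd.DataFrame({"SHOT_TYPE": ["2PT Field Goal", "3PT Field Goal"]})
test = pd.DataFrame({"SHOT_TYPE": ["2PT Field Goal"]})  # no 3PT shots here

train_dummies = pd.get_dummies(train["SHOT_TYPE"])
test_dummies = pd.get_dummies(test["SHOT_TYPE"])

# Reindex so the test matrix has exactly the training columns, in order
test_aligned = test_dummies.reindex(columns=train_dummies.columns,
                                    fill_value=0)
print(test_aligned.columns.tolist())  # same columns as train_dummies
```

Applying the same `reindex` to the real `X_test` before calling `predict` guards against a column-mismatch error at prediction time.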

After preparing the test data, we use the trained model to predict shot outcomes.

predictions = logmodel.predict(X_test)
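Since the problem statement talks about predicting the likelihood of a shot being made, it's worth noting that `predict` returns hard 0/1 labels, while `predict_proba` exposes the estimated make probability. Whether the leaderboard rewards probabilities or labels depends on the competition's scoring rules; this self-contained sketch (with synthetic data) just shows the difference:

```python
# Probability estimates vs. hard labels from a fitted LogisticRegression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X[:5])[:, 1]  # P(shot made) for first 5 rows
labels = model.predict(X[:5])             # hard 0/1 predictions
print(proba, labels)
```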

Finally, we generate the predictions in the required format and save them to a CSV file, which can be submitted to Kaggle.

competition_entry = pd.DataFrame({"GAME_EVENT_ID": nba["GAME_EVENT_ID"], "SHOT_MADE_FLAG": predictions})
competition_entry.to_csv("competition-entry.csv", index=False)

Kaggle Submission

The generated competition-entry.csv contains the predicted shot outcomes (SHOT_MADE_FLAG) for each GAME_EVENT_ID. This file is then submitted to the Kaggle competition, where it will be evaluated based on accuracy and other classification metrics.

Conclusion

Predicting basketball shot outcomes is an exciting problem that blends data science, machine learning, and sports analytics. By following this process, we’ve built a Logistic Regression model to predict whether a player will make or miss a shot based on various game and shot features.

This approach serves as a great starting point for exploring sports analytics with machine learning. While Logistic Regression is a strong baseline, future work could involve experimenting with more complex models like Random Forest, XGBoost, or even Neural Networks to improve accuracy.

This project was originally developed using the data and inspiration from the Seal Neaward GitHub repository. It has been further refined and adapted for submission to a Kaggle competition, where it serves as a practical example of applying machine learning to real-world sports data.

Next Steps:

  • Feature Engineering: Create new features or refine existing ones to capture more complex patterns in the data.
  • Model Tuning: Use cross-validation, hyperparameter tuning, and model selection techniques to improve performance.
  • Advanced Models: Experiment with ensemble methods like Random Forests or Gradient Boosting Machines (GBM) for potentially better predictive performance.
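As a sketch of the last two directions combined, cross-validation makes it easy to compare a Random Forest against the logistic baseline. The data here is synthetic (a deliberately non-linear target that logistic regression cannot separate), standing in for the real shot features:

```python
# Compare the logistic baseline against a Random Forest with 5-fold CV.
# The target depends on a feature interaction, so the linear model
# struggles while the tree ensemble can pick it up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear (XOR-like) target

log_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
rf_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5)

print("logistic CV accuracy: %.3f" % log_scores.mean())
print("random forest CV accuracy: %.3f" % rf_scores.mean())
```

On the real shot data the gap may be much smaller, but the same comparison pattern applies.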

Feel free to explore the original code and datasets on Seal Neaward’s GitHub and build on this foundational analysis!
