Future Work
Our baseball analytics capstone project lasted only eight weeks, and we could not spend as much time on it as we would have liked. We plan to keep the website going, improve the existing tools, and add new tools and analyses over time.
More Testing
The large datasets and limited time kept us from testing as much as we would have liked. The database may contain missing or inaccurately coded data. In the future, we would like to verify the data systematically.
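As a rough idea of what systematic verification could look like, here is a minimal sketch that flags missing values and out-of-range values in a batch of records. It is not our production code: the function name, the field names (`pitch_type`, `release_speed`), and the speed range are all assumptions chosen for illustration.

```python
def find_problem_rows(rows, required, ranges):
    """Return (row_index, field, reason) tuples for rows that fail basic checks.

    rows:     list of dicts, one per record (hypothetical layout)
    required: field names that must be present and non-empty
    ranges:   {field: (low, high)} inclusive bounds for numeric fields
    """
    problems = []
    for i, row in enumerate(rows):
        # Check 1: required fields must not be missing or empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, field, "missing"))
        # Check 2: numeric fields must fall inside their plausible range.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value in (None, ""):
                continue  # nothing to range-check when the value is absent
            if not (low <= value <= high):
                problems.append((i, field, "out of range"))
    return problems


# Made-up sample records; the field names are illustrative assumptions.
sample_rows = [
    {"pitch_type": "FF", "release_speed": 95.2},
    {"pitch_type": "", "release_speed": 88.0},    # missing pitch type
    {"pitch_type": "SL", "release_speed": 880.0},  # likely a data-entry error
]

problems = find_problem_rows(
    sample_rows,
    required=["pitch_type", "release_speed"],
    ranges={"release_speed": (30.0, 110.0)},  # assumed plausible range in mph
)
print(problems)
```

Checks like these only catch the obvious problems, but running them on every data load would at least tell us where to look first.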
Improving Performance
Our work has focused on delivering the important features for our dashboards and tools. There is room for performance improvement, which may include optimizing code and adding more ETL data subsets. In the future, we will devote some effort to improving performance.
Pitcher Hall of Fame Prediction
Currently, Hall of Fame prediction in the Player Exploration Dashboard only works with batters. This is primarily because there is a lack of data on pitchers and our current model is not accurate. We would like to revisit this and work on improving accuracy.
Improving Team Similarity Models
The Team Similarity Tool needs improvement. The current models are somewhat naïve and may benefit from more feature engineering. Additionally, there are probably opportunities to predict season outcomes by modeling players' prior-year performance. This might also be useful for fantasy baseball players.
More Data Download Options
Currently, we are only providing access to the pitches data in our database. We also have significant data on batters and pitchers and hope to make this available as well.
Lessons Learned
In addition to our findings described above, we learned a few lessons on our data science journey. Here are a few thoughts.
Data is Messy
This isn’t a new insight and was discussed in our very first data science class, Being a Data Scientist. However, it is constantly reinforced. Even something that should be simple, like agreeing on the three-letter codes for baseball teams, is less consistent across sources than it should be. Data presentation on websites is also quite dynamic, and a web scraping tool that worked perfectly last week might not work at all today.
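To make the team code problem concrete, here is a minimal sketch of mapping variant abbreviations onto a single canonical code before joining data from different sources. The alias table and the choice of canonical codes are illustrative assumptions, not the mapping used on our site.

```python
# Illustrative alias table: variant abbreviations mapped to one canonical code.
# Both the variants and the canonical choices are assumptions for this example.
TEAM_ALIASES = {
    "CHW": "CWS", "CHA": "CWS",  # Chicago White Sox
    "KCR": "KC", "KCA": "KC",    # Kansas City Royals
    "SDP": "SD", "SDN": "SD",    # San Diego Padres
    "TBR": "TB", "TBA": "TB",    # Tampa Bay Rays
}


def normalize_team(code):
    """Return the canonical code for a team abbreviation.

    Whitespace and letter case are cleaned up first; codes without an
    entry in the alias table are returned unchanged (after cleanup).
    """
    cleaned = code.strip().upper()
    return TEAM_ALIASES.get(cleaned, cleaned)


print(normalize_team(" chw "))  # variant with stray whitespace and lowercase
print(normalize_team("NYY"))    # no alias entry, so it passes through
```

Normalizing once, at load time, keeps every later join and filter from having to know about the variants.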
Production Deployment isn’t Always Simple
We knitted together multiple different technologies to make our project work. Part of this was due to our budget constraints. Deploying a live application means making a lot of moving parts work together seamlessly and always worrying about a breakdown in a single part of the overall system.
Usability is Critical and Difficult
Our project was not a “simple” analysis but was intended for use by non-technical people with a love of baseball. We couldn’t always rely on everyone knowing things like the abbreviations for pitch types. We spent a large amount of time on the user interface and on making data and labels accessible to non-data scientists.
Ethical Considerations
This project and website were developed to support our academic goals and personal interests. They are provided to the public for entertainment purposes only.
We are mindful that we are presenting data from disparate sources, and that combining them may produce analyses not readily available elsewhere. Further, we are aware that sports-related data could be used for gambling. There is a risk that data presented purely for analysis could be taken out of context to support gambling decisions, with detrimental effects. Our data comes from independent sources outside our control, and we cannot guarantee its accuracy or its relevance to gambling decisions.
Our analyses and tools are not provided to support gambling. We discourage gambling because it might keep you out of the Hall of Fame.
Final Thoughts
Thanks for your attention. We know this was a long series of introductory blogs and appreciate your playing all nine innings with us.
We hope our love of baseball and data science training have brought brand new ways to explore baseball data to everyone and we hope you enjoy exploring our work.
Please visit us again as we further develop BaseballML.com.