Trouble with the Curve: Predicting Future MLB Players Using Scouting Reports

Recommended citation: Danovitch, J. (2019). Trouble with the Curve: Predicting Future MLB Players Using Scouting Reports. Poster session to be presented at the 2019 Carnegie Mellon Sports Analytics Conference, Pittsburgh, PA.

GitHub


Work primarily completed during an internship at Microsoft.

In baseball, a scouting report is a written profile of a player that describes their characteristics and traits, usually for use in player valuation. This work presents a first-of-its-kind dataset of over 5,000 scouting reports for minor league, international, and draft prospects. Compiled from MLB.com, each report consists of a written description of the player, numerical grades for their key attributes (assigned on the “20-80 scale”), metadata (such as draft position and signing bonus), and unique IDs that reference their profiles on popular resources like MLB.com, FanGraphs, and Baseball-Reference.
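
To make the shape of a single report concrete, the record described above could be sketched roughly as follows. This is only an illustration: the class name, field names, and the example values are assumptions, not the schema of the released dataset. The only constraint taken from the description is that grades fall on the 20-80 scale.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ScoutingReport:
    """One scouting report. All field names are hypothetical, chosen
    to mirror the four parts listed in the abstract."""

    player_name: str
    report_text: str                                       # written description
    grades: Dict[str, int] = field(default_factory=dict)   # attribute -> grade
    draft_position: Optional[int] = None                   # metadata
    signing_bonus: Optional[float] = None                  # metadata
    external_ids: Dict[str, str] = field(default_factory=dict)  # site -> ID

    def __post_init__(self) -> None:
        # Reject grades that fall outside the 20-80 scouting scale.
        for attribute, grade in self.grades.items():
            if not 20 <= grade <= 80:
                raise ValueError(
                    f"grade for {attribute!r} must be in [20, 80], got {grade}"
                )


# Placeholder values only; this is not a real report from the dataset.
example = ScoutingReport(
    player_name="Example Player",
    report_text="Shows a plus fastball with late life.",
    grades={"fastball": 60, "control": 45},
    external_ids={"mlb": "placeholder-id"},
)
print(example.grades["fastball"])
```

Keeping the grades in a dictionary rather than as fixed fields is a design guess: pitchers and hitters are graded on different attributes, so a fixed set of columns would leave many values empty.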

With this dataset, we employ several deep neural networks to predict whether minor league players will reach MLB given their scouting reports. We open-source the data to share with the community and present a web application demonstrating how language varies between the reports of successful and unsuccessful prospects.