Data Mining on PITCHf/x

[DRAFT]

Overview

Background

Pitch Type

According to TexasLeaguers, PITCHf/x now classifies every pitch into one of 15 types (UN for Unknown).

PITCHf/x

Goal

Pitch Type Classification

Pitch Type Prediction

Pitch Quality Evaluation

Analysis on Pitcher Fatigue Factor

Approach

Fetch Raw Data

PITCHf/x data is provided on mlb.com, mainly in XML format. Parsing and querying these files remotely is inconvenient, so the first thing to do is to download the whole site.

The wget utility handles this job in a simple, decent way:

wget -r -p -np -k -c -e robots=off --wait 0.25 http://gd2.mlb.com/components/game/mlb/year_2012

It is worth mentioning that you should always be polite, especially when robots.txt tells you not to crawl the site: add the --wait parameter when using robots=off, or your requests may overload the server.

BTW, thanks to Jiayang Gao for bringing this method up. Before that, I had started to implement a script that generates all the game ids as filenames to download. In fact, someone wrote a complicated 200-line Ruby script that does the same thing.

In order to obtain all the game data, it seems that I have to generate all possible

Build Database

In the very first step of this project, I'm only interested in pitching data and pitchers' profiles. Hence, unrelated data won't be touched at this stage.

Game Data

Traverse All Games

Each day's directory contains several subdirectories, one per game, each holding that game's data. Their names have the form gid* (game id), and they are documented in the corresponding game event schedule file, grid.xml.
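
As a rough sketch of this traversal, assuming the site has already been mirrored to a local directory with wget, the function below lists the game directories found under one day's directory. The function name list_game_dirs is hypothetical; the only convention taken from the text is that game directory names start with "gid".

```python
from pathlib import Path


def list_game_dirs(day_dir):
    """Return the sorted game subdirectories of one day's directory.

    Assumption: every game directory name starts with "gid"; any other
    entry in the day's directory (files, other folders) is ignored.
    """
    day_path = Path(day_dir)
    return sorted(
        entry for entry in day_path.iterdir()
        if entry.is_dir() and entry.name.startswith("gid")
    )
```

The names found this way could then be cross-checked against the game ids listed in grid.xml. That file's layout is not shown here, so the check is left out of the sketch.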

Pitfalls and Notes

  1. The last inning of a game might not have a bottom.
  2. The bottom of the last inning might not have 3 batting appearances, since it can end before three outs are recorded.

This particular case is the consequence of a walk-off hit. For instance,

<bottom>
    <action b="0" s="0" o="0" des="Pitcher Change: Michael Dubee replaces Michael Crotta, batting 7th. " event="Pitching Substitution" tfs="161228" tfs_zulu="2012-03-08T21:12:28Z" player="452774" pitch="4"/>
    <atbat num="81" b="3" s="2" o="0" start_tfs="161712" start_tfs_zulu="2012-03-08T21:17:12Z" batter="428642" stand="R" b_height="6-1" pitcher="452774" p_throws="R" des="Lou Montanez homers (1) on a fly ball to left field. " event="Home Run" score="T" home_team_runs="5" away_team_runs="4">
        <pitch des="Swinging Strike" id="514" type="S" tfs="161714" tfs_zulu="2012-03-08T21:17:14Z" x="71.24" y="157.15" cc="" mt=""/>
        <pitch des="Swinging Strike" id="515" type="S" tfs="161715" tfs_zulu="2012-03-08T21:17:15Z" x="71.24" y="157.15" cc="" mt=""/>
        <pitch des="Ball" id="516" type="B" tfs="161717" tfs_zulu="2012-03-08T21:17:17Z" x="71.24" y="157.15" cc="" mt=""/>
        <pitch des="Ball" id="517" type="B" tfs="161718" tfs_zulu="2012-03-08T21:17:18Z" x="71.24" y="157.15" cc="" mt=""/>
        <pitch des="Ball" id="518" type="B" tfs="161719" tfs_zulu="2012-03-08T21:17:19Z" x="71.24" y="157.15" cc="" mt=""/>
        <pitch des="In play, run(s)" id="519" type="X" tfs="161721" tfs_zulu="2012-03-08T21:17:21Z" x="71.24" y="157.15" cc="Michael Dubee had Lou Montanez down 0-2 but could not put him away." mt=""/>
        <runner id="428642" start="" end="" event="Home Run" score="T" rbi="T" earned="T"/>
    </atbat>
</bottom>
  3. A batting appearance might not contain any pitch.

For instance:

<atbat num="58" b="0" s="0" o="3" start_tfs="153550" start_tfs_zulu="2012-03-31T19:35:50Z" batter="457775" stand="R" b_height="6-2" pitcher="462985" p_throws="L" 
des="With Desmond Jennings batting, Sean Rodriguez picked off and caught stealing 2nd base, pitcher Franklin Morales to first baseman Mauro Gomez to second baseman Heiker Meneses.  " event="Runner Out">
    <po des="Pickoff Attempt 1B"/>
    <runner id="446481" start="1B" end="" event="Picked off stealing 2B"/>
</atbat>
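
The three pitfalls above suggest parsing defensively. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes each inning is an <inning> element wrapping a <top> and an optional <bottom>; the <inning> and <top> element names, the function name iter_pitches, and the trimmed sample data are assumptions for illustration, since only <bottom>, <atbat> and <pitch> appear in the excerpts.

```python
import xml.etree.ElementTree as ET


def iter_pitches(inning):
    """Yield (half, atbat_num, pitch_attributes) for every pitch in an inning."""
    for half in ("top", "bottom"):
        half_elem = inning.find(half)
        if half_elem is None:
            # Pitfall 1: the last inning may have no bottom half.
            continue
        # Pitfall 2: never assume a fixed number of at-bats; take what is there.
        for atbat in half_elem.findall("atbat"):
            # Pitfall 3: an at-bat may have no <pitch> children (e.g. a pickoff),
            # in which case this inner loop simply yields nothing.
            for pitch in atbat.findall("pitch"):
                yield half, atbat.get("num"), dict(pitch.attrib)


# Made-up, heavily trimmed sample (assumed <inning>/<top> wrapper elements).
sample = """<inning num="9">
  <top>
    <atbat num="80" event="Runner Out"><po des="Pickoff Attempt 1B"/></atbat>
  </top>
  <bottom>
    <atbat num="81" event="Home Run">
      <pitch des="Ball" id="516" type="B"/>
      <pitch des="In play, run(s)" id="519" type="X"/>
    </atbat>
  </bottom>
</inning>"""

pitches = list(iter_pitches(ET.fromstring(sample)))
for half, atbat_num, attrs in pitches:
    print(half, atbat_num, attrs["des"])
```

Yielding plain dictionaries keeps the parsing step separate from whatever database layout is chosen later.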