Use Clustering to Identify Baseball Pitch Type

Pitcher ID: 506433

Pitcher Name: Darvish Yu

Step One: Feature Selection

First we need to choose the features (pitchfx attributes) to feed into the clustering procedure. Based on earlier research and a quick inspection of the data set, the features most likely to distinguish one pitch type from another are:

  • Release point (x0, y0, z0)
  • Acceleration (ax, ay, az)
  • Initial Velocity (vx0, vy0, vz0)
  • Deviation on the x and y axes (pfx_x, pfx_y)
  • Spin direction and spin rate
  • Break angle, break length (on x-axis), and break_y (on y-axis)

Together, these 16 features should cover the main physical aspects of a pitch.
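As a rough illustration of this step, the sketch below pulls the 16 feature columns out of a data frame. The data here is random stand-in data, and the column names `spin_dir`, `spin_rate`, `break_angle`, and `break_length` are assumptions, since the text only describes those attributes in words. The real column names may differ (for example, they may carry an `X.` prefix).

```r
# Feature columns, grouped as in the list above.
feature_cols <- c(
  "x0", "y0", "z0",                         # release point
  "ax", "ay", "az",                         # acceleration
  "vx0", "vy0", "vz0",                      # initial velocity
  "pfx_x", "pfx_y",                         # deviation on the x and y axes
  "spin_dir", "spin_rate",                  # assumed names for spin direction / rate
  "break_angle", "break_length", "break_y"  # first two names are assumed
)

# Random stand-in for the pitch data (hypothetical, for illustration only).
set.seed(1)
n <- 10
pitches <- as.data.frame(
  matrix(rnorm(n * length(feature_cols)), nrow = n,
         dimnames = list(NULL, feature_cols))
)
pitches$pitch_type <- sample(c("FF", "SL"), n, replace = TRUE)

# Keep only the clustering features; the reference label is left out.
features <- pitches[, feature_cols]
stopifnot(ncol(features) == 16)
```

Keeping the pitchfx label out of `features` means it can still serve as a reference when judging the clusters later.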

Step Two: Dimension Reduction

Noise Elimination

Examining the data set with the pitchfx pitch type as a reference (shown below; the bottom row gives the number of pitches of each type) reveals some rarely thrown pitch types.

> table(r$X.pitch_type)

CH  CU  FA  FC  FF  FS  FT  IN  PO  SL 
 1 303   1 572 977 193 642   4   5 657

Clearly, CH, FA, IN, and PO are either rarely thrown or misclassified by the pitchfx system. Pitches of these types are treated as noise and removed. After noise elimination, the remaining pitch types and their counts are:

> table(r$X.pitch_type)

  CU  FC  FF  FS  FT  SL 
 303 572 977 193 642 657
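One way to reproduce this filtering is sketched below. The data frame is rebuilt from the counts in the first table so that the snippet runs on its own. The cutoff `min_count` is a hypothetical value chosen for illustration; the text does not state which threshold, if any, was actually used.

```r
# Rebuild a stand-in data frame from the pitch type counts shown above.
counts <- c(CH = 1, CU = 303, FA = 1, FC = 572, FF = 977,
            FS = 193, FT = 642, IN = 4, PO = 5, SL = 657)
r <- data.frame(X.pitch_type = rep(names(counts), times = counts),
                stringsAsFactors = FALSE)

# Hypothetical cutoff: types with fewer pitches than this count as noise.
min_count <- 10

type_counts <- table(r$X.pitch_type)
keep_types  <- names(type_counts)[type_counts >= min_count]

# Keep only the rows whose pitch type survives the cutoff.
r_clean <- r[r$X.pitch_type %in% keep_types, , drop = FALSE]
table(r_clean$X.pitch_type)
```

Filtering by count rather than by a hard-coded list of type codes lets the same snippet be reused for another pitcher with a different repertoire.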

R offers two functions for principal component analysis: princomp computes the covariance matrix and takes its eigendecomposition, while prcomp uses a different technique, singular value decomposition (SVD).
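The difference between the two functions can be checked directly. The sketch below uses the built-in `iris` measurements as stand-in data (an assumption made for illustration; it is not the pitch data set) and compares the two results. The standard deviations differ only because princomp divides by n while prcomp divides by n - 1, and the component directions agree up to sign.

```r
x <- as.matrix(iris[, 1:4])   # stand-in numeric data
n <- nrow(x)

pc_eig <- princomp(x)   # eigendecomposition of the covariance matrix (divisor n)
pc_svd <- prcomp(x)     # SVD of the centered data (divisor n - 1)

# Rescale princomp's standard deviations to the n - 1 convention.
sdev_rescaled <- unname(pc_eig$sdev) * sqrt(n / (n - 1))
all.equal(sdev_rescaled, pc_svd$sdev)

# Loadings match up to the sign of each component.
max_diff <- max(abs(abs(unclass(pc_eig$loadings)) - abs(pc_svd$rotation)))
max_diff
```

Both functions therefore lead to the same components; prcomp is generally the numerically preferred choice.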

Step Three: DBSCAN

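The DBSCAN results are shown below. Column 0 holds the points labeled as noise, and the remaining columns are clusters. With MinPts=20 and eps=0.5, almost all pitches fall into a single cluster; with MinPts=15 and eps=0.33, five clusters emerge and 543 pitches are left as noise.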

dbscan Pts=3344 MinPts=20 eps=0.5
         0    1
border 228  256
seed     0 2860
total  228 3116

dbscan Pts=3344 MinPts=15 eps=0.33
         0    1   2   3  4  5
border 543  225 116  77 12 14
seed     0 1917 307 127  3  3
total  543 2142 423 204 15 17
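To make the roles of eps and MinPts, and the seed/border/noise split in these tables, more concrete, here is a minimal dependency-free sketch of DBSCAN in base R. It is not the implementation that produced the output above (that format resembles what the fpc package prints), and `dbscan_sketch`, `toy`, and the parameter values are hypothetical names and settings chosen for illustration.

```r
# Minimal DBSCAN sketch: returns a cluster label per point (0 = noise)
# and a flag marking seed (core) points.
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  neighbors <- lapply(seq_len(nrow(d)), function(i) which(d[i, ] <= eps))
  # A point is a seed if its eps-neighborhood (itself included, a convention
  # of this sketch) holds at least min_pts points.
  is_seed <- lengths(neighbors) >= min_pts
  cluster <- integer(nrow(d))                  # 0 = noise or not yet assigned
  current <- 0L
  for (i in which(is_seed)) {
    if (cluster[i] != 0L) next                 # already absorbed by a cluster
    current <- current + 1L
    cluster[i] <- current
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next
      cluster[j] <- current                    # seed or border point joins
      if (is_seed[j]) queue <- c(queue, neighbors[[j]])  # only seeds expand
    }
  }
  list(cluster = cluster, is_seed = is_seed)
}

# Toy data: two tight blobs plus one far-away outlier.
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.05), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.05), ncol = 2),
  c(10, 10)
)
fit <- dbscan_sketch(toy, eps = 0.5, min_pts = 5)

# Cross-tabulate point role against cluster label, similar to the tables above.
table(ifelse(fit$is_seed, "seed", "border"), fit$cluster)
```

On real data, a packaged implementation would be the more practical choice, for example `fpc::dbscan(data, eps = 0.33, MinPts = 15)`, assuming the fpc package is installed. Lowering eps or raising MinPts makes it harder for a point to qualify as a seed, which tends to split large clusters and push more points into the noise column.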