What are the Data I'm Using?
So, given a minimum of 100 IP in a single season from 2006 to 2012, I had 986 total records from 306 different pitchers to work with. For simplicity's sake, I'm going to treat each season as independent, even though we might expect different seasons from the same pitcher to be related.
Creating the Basic Model
I used these 986 data points to create a linear model predicting strikeout rate (K/9 IP) from swinging strike percentage. I originally considered adding the season to the model, as strikeout rates have steadily risen over the time period I'm looking at. While season was a significant predictor of K/9 IP when it was included in the model (more recent = more strikeouts), it added little to the overall fit and had only a very minor impact on the model. In ecology, we often describe this as statistically significant, but not biologically significant. It often happens when you have a large sample size. Also, in general, I strongly prefer the simplest model possible, especially when adding variables only marginally improves the fit. Guess I'm just a big fan of Occam's Razor. Without further ado, the relationship between swinging strike rate and strikeout rate is:
K/9 = 71.8*(Swinging Strike %) + 0.7892
The R^{2} for this model was 0.6693, meaning that swinging strike rate explains 67% of the variability in K/9. That is a huge proportion, but it does leave room for improvement.
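As a quick illustration of how the fitted line is used, here is a minimal Python sketch that plugs a swinging strike rate into the equation above. It assumes the rate is entered as a fraction (0.10 for 10%), which is the scale the 71.8 slope seems to imply; the function name and the three example rates are hypothetical, not taken from the data set.

```python
# Coefficients copied from the fitted model above.
SLOPE = 71.8
INTERCEPT = 0.7892


def predicted_k9(swinging_strike_rate):
    """Predicted K/9 for a swinging strike rate given as a fraction (assumed scale)."""
    return SLOPE * swinging_strike_rate + INTERCEPT


# Hypothetical pitchers at 6%, 9%, and 12% swinging strikes.
for rate in (0.06, 0.09, 0.12):
    print(f"{rate:.0%} swinging strikes -> {predicted_k9(rate):.2f} K/9")
```

Under this fit, each extra percentage point of swinging strikes is worth roughly 0.72 K/9.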
Examining the Residuals
A residual is the difference between the actual value and the value the model predicts. A positive residual means that the pitcher had more strikeouts than our swinging strike model thought they should, while a negative residual means the opposite. In general, residuals closer to zero mean a better fit for the model, but you can also look at the residuals to see if there is anything consistent about them. In our case, we’ll look at the percent of swings on pitches in the zone. To do this, I ran another linear regression, trying to predict the residuals using the zone swing percent. This showed a significant trend, but with a much lower predictive value than the swinging strike rate (R^{2}=0.136). When your residuals have a trend, it means that your model isn’t capturing something, but for our purposes, it is a good thing. It means that when pitchers have a lower rate of swings against pitches in the zone, they tend to have a higher strikeout rate than their swinging strike rate alone would predict.
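The two-step idea (compute residuals from the first model, then regress them on zone swing percent) can be sketched roughly as follows. This is not the original analysis code: the five pitcher-seasons are made up purely for illustration, the swinging strike rate is assumed to be a fraction, and `fit_line` is a hand-rolled least-squares helper rather than any particular statistics package.

```python
def predicted_k9(swstr_rate):
    """First-stage model from above; swstr_rate is a fraction (assumed scale)."""
    return 71.8 * swstr_rate + 0.7892


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up pitcher-seasons, for illustration only:
# (swinging strike rate, zone swing %, actual K/9)
seasons = [
    (0.07, 58.0, 7.2),
    (0.09, 62.0, 7.9),
    (0.08, 66.0, 6.3),
    (0.11, 70.0, 8.1),
    (0.10, 64.0, 8.0),
]

# Residual = actual K/9 minus the K/9 the swinging strike model predicts.
residuals = [k9 - predicted_k9(swstr) for swstr, _, k9 in seasons]
zone_swing = [zone for _, zone, _ in seasons]

slope, intercept, r_squared = fit_line(zone_swing, residuals)
print(f"slope = {slope:.3f} K/9 per point of zone swing, R^2 = {r_squared:.3f}")
```

A negative slope here is the pattern described above: fewer swings in the zone, higher residual. The toy numbers fit far more tightly than the real data, where the R^{2} was only 0.136.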
The Low Swing Pitchers
Over the entire data set, the average swing rate at pitches in the zone was 65%. Unlike some other stats, there isn’t much variability in swing rates at pitches in the zone. The lowest average belongs to Doug Fister (55.5%, over the course of three seasons), while the highest belongs to Scott Baker (72%, over two seasons). Another way to look at the data is to examine whether or not the pitchers with the low swing rates on pitches in the zone have a high residual. In short, the answer is yes. Four pitchers averaged a zone swing rate of 60% or lower and had multiple seasons on record. Here are their numbers – remember, a positive residual means that a pitcher was striking out more batters than the model predicted.
| Name | Seasons | Residual (K/9) | Zone Swing % |
|---|---|---|---|
| Doug Fister | 3 | 0.873 | 55.5% |
| C.J. Wilson | 3 | 1.678 | 57.5% |
| Jake Arrieta | 3 | 0.992 | 59.2% |
| Mike Mussina | 3 | 1.476 | 59.4% |

So, we see a consistent pattern at the extreme ends. But, does swing rate on pitches in the strike zone explain most of the variability in K/9?
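To put a single number on the low-swing group, the four rows in the table above can be averaged. This is just arithmetic on the published table values, sketched in Python; it is an unweighted mean of per-pitcher averages, which is harmless here because all four pitchers have three seasons each.

```python
# (name, average residual in K/9, zone swing %) copied from the table above.
low_swing = [
    ("Doug Fister", 0.873, 55.5),
    ("C.J. Wilson", 1.678, 57.5),
    ("Jake Arrieta", 0.992, 59.2),
    ("Mike Mussina", 1.476, 59.4),
]

residuals = [residual for _, residual, _ in low_swing]
mean_residual = sum(residuals) / len(residuals)
all_positive = all(r > 0 for r in residuals)

print(f"mean residual: {mean_residual:.2f} K/9, all positive: {all_positive}")
```

All four residuals are positive, and together they average about 1.25 K/9 above what the swinging strike model predicts.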
The Biggest Departures from the Model
To examine this, I looked at the pitchers who had (1) the highest average residuals and (2) at least three seasons of data. If the biggest departures from our expected results (largest residuals) all have a lower than average swing rate on pitches within the zone, that would provide even stronger support for the idea that swing rate within the strike zone matters. Of the 15 pitchers with the highest average residuals, only two have swing rates below 60% (Mussina and Wilson), while an additional six have zone swing rates slightly below average (60%–64%), and seven have zone swing rates that are about average (64%–66%). None of the pitchers with consistently high residuals has a zone swing rate well above average.
Interestingly, many of the high residual pitchers who had average to slightly below average zone swing rates also walked a lot of batters. If you increase the number of walks you hand out, you'll also increase the number of batters you face per inning, which in turn should lead to a higher K/9. For example, think about two different pitchers. Pitcher A strikes out 25% of the batters he faces and walks 10% (Yovani Gallardo) and pitcher B strikes out 25% of the batters he faces and walks 6%. Pitcher B is clearly the superior pitcher, but pitcher A will likely have a higher K/9.
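The walk effect is easy to see with a little arithmetic. The sketch below converts per-batter strikeout and walk rates into an approximate K/9. It rests on one loud assumption that is not in the data: 70% of batters who neither strike out nor walk are retired. It also ignores hit batters, double plays, and outs on the bases.

```python
def k_per_9(k_rate, bb_rate, out_rate_on_contact=0.70):
    """Approximate K/9 from per-batter strikeout and walk rates.

    out_rate_on_contact is an assumed share of non-strikeout, non-walk
    plate appearances that end in an out; 0.70 is an illustrative guess.
    """
    contact_rate = 1.0 - k_rate - bb_rate
    outs_per_batter = k_rate + contact_rate * out_rate_on_contact
    batters_per_9 = 27.0 / outs_per_batter  # 27 outs in nine innings
    return k_rate * batters_per_9


pitcher_a = k_per_9(k_rate=0.25, bb_rate=0.10)  # same K%, more walks
pitcher_b = k_per_9(k_rate=0.25, bb_rate=0.06)  # same K%, fewer walks
print(f"Pitcher A: {pitcher_a:.2f} K/9, Pitcher B: {pitcher_b:.2f} K/9")
```

Under that assumption, pitcher A comes out around 9.6 K/9 and pitcher B around 9.2, even though both strike out exactly 25% of the batters they face. The extra walks simply give pitcher A more batters per inning.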
| Name | Seasons | Residual (K/9) | Zone Swing % |
|---|---|---|---|
| Yovani Gallardo | 4 | 2.065 | 62.0% |
| C.J. Wilson | 3 | 1.678 | 57.5% |
| Erik Bedard | 3 | 1.623 | 64.2% |
| Mike Mussina | 3 | 1.476 | 59.4% |
| David Price | 4 | 1.343 | 64.8% |
| Ubaldo Jimenez | 4 | 1.272 | 63.7% |
| Tim Lincecum | 5 | 1.258 | 63.4% |
| Zack Greinke | 5 | 1.179 | 64.7% |
| Jonathan Sanchez | 4 | 1.176 | 62.3% |
| Clayton Kershaw | 5 | 1.169 | 63.4% |
| Oliver Perez | 3 | 1.154 | 64.7% |
| Tommy Hanson | 3 | 1.148 | 63.0% |
| Josh Beckett | 6 | 1.134 | 64.9% |
| Daniel Cabrera | 3 | 1.119 | 64.8% |
| Jon Lester | 4 | 1.112 | 65.7% |
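For anyone who wants to check the groupings, here is a small Python sketch that sorts the 15 pitchers in the table above into zone swing bands. The values are copied from the table; the band boundaries (below 60%, 60% up to 64%, 64% and up) are one reading of the ranges described in the text.

```python
# Zone swing % for the 15 pitchers in the table above.
zone_swing = {
    "Yovani Gallardo": 62.0,
    "C.J. Wilson": 57.5,
    "Erik Bedard": 64.2,
    "Mike Mussina": 59.4,
    "David Price": 64.8,
    "Ubaldo Jimenez": 63.7,
    "Tim Lincecum": 63.4,
    "Zack Greinke": 64.7,
    "Jonathan Sanchez": 62.3,
    "Clayton Kershaw": 63.4,
    "Oliver Perez": 64.7,
    "Tommy Hanson": 63.0,
    "Josh Beckett": 64.9,
    "Daniel Cabrera": 64.8,
    "Jon Lester": 65.7,
}


def band(pct):
    """Assign a zone swing percentage to one of three bands (assumed cutoffs)."""
    if pct < 60.0:
        return "below 60%"
    if pct < 64.0:
        return "60% to 64%"
    return "64% and up"


counts = {}
for pct in zone_swing.values():
    label = band(pct)
    counts[label] = counts.get(label, 0) + 1

for label in ("below 60%", "60% to 64%", "64% and up"):
    print(f"{label}: {counts.get(label, 0)} pitchers")
```

With the table's numbers, this gives 2, 6, and 7 pitchers in the three bands, and the highest value in the group is Jon Lester's 65.7%.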

So What Does This All Mean?
I think there are several conclusions we can draw here. First, a lower swing rate at pitches in the strike zone does consistently go along with a higher K/9 than swinging strike rate alone predicts. Second, if a pitcher has a high swing rate at pitches in the zone, they probably won't outperform their predicted K/9 based on swinging strike rate. This makes sense; lots of swings will lead to more balls in play, which in turn will lead to fewer strikeouts. Third, there are other factors that aren't captured by just looking at swinging strike rate and zone swing rate. Most of the pitchers with the highest residuals in the model did not have extremely low zone swing rates. This may be attributable to "stuff" in general, so perhaps delving into PITCHf/x data would lead to a clearer picture there.
Great stuff, Bill. It makes a lot of sense that pitchers with high swing rates, particularly in the zone, will have fewer strikeouts than expected if you are looking at K/9. Would it be hard to do the analysis on K%, which in some ways is more useful than K/9?
It wouldn't be hard to do the analysis on K%, although since I didn't include K% in my custom report I'd be starting from scratch. Although K% does have some advantages over K/9, I elected to go with K/9 because I think it is a lot more accessible to most folks.