Saturday, May 11, 2013

An Extremely Unusual Start

On Friday night, Alex Cobb had one of the most unusual lines you'll ever see from a starting pitcher.

4 2/3 IP, 13 K, 2 BB, 5 H, 2 ER

So, Cobb got a total of 14 outs. Twelve were via the strikeout (the 13th strikeout victim reached first on a wild pitch). Two of the hits were home runs. That means that Cobb allowed only five balls in play. Of those five balls in play, two were infield singles and two were ground outs.

Looking at the underlying numbers, Cobb was just as dominant. Of Cobb's 117 pitches, 77 were strikes, an about-average share. However, aside from the two home runs, the Padres were clearly lost at the plate. They swung at 57 pitches and missed 22 of them (38.6% of swings, or 18.8% of all pitches). For comparison, a swinging strike rate of 10% of pitches is good, and Yu Darvish has the highest swinging strike rate on the season (16.7%).
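The arithmetic behind this line is easy to check by hand, but here is a minimal Python sketch that recomputes it from the box-score numbers quoted above. The variable names are mine, and it assumes both non-strikeout outs came on balls in play. It reports misses both per swing and per pitch, since swinging strike rate is normally quoted per pitch.

```python
# Cobb's line, as quoted in the post.
outs_recorded = 4 * 3 + 2        # 4 2/3 innings pitched
strikeouts = 13
strikeouts_reaching_base = 1     # the wild-pitch strikeout
hits, home_runs = 5, 2
pitches, strikes = 117, 77
swings, whiffs = 57, 22

# Outs that came via the strikeout, and what is left over.
strikeout_outs = strikeouts - strikeouts_reaching_base
outs_in_play = outs_recorded - strikeout_outs       # assumed to be batted balls
balls_in_play = (hits - home_runs) + outs_in_play   # home runs are not "in play"

strike_rate = strikes / pitches
whiffs_per_swing = whiffs / swings
whiffs_per_pitch = whiffs / pitches   # the usual swinging strike rate

print(f"outs: {outs_recorded}, via strikeout: {strikeout_outs}, balls in play: {balls_in_play}")
print(f"strike rate: {strike_rate:.1%}")
print(f"misses per swing: {whiffs_per_swing:.1%}, misses per pitch: {whiffs_per_pitch:.1%}")
```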

To top it all off, Cobb managed to strike out four batters in the third inning. Despite that, he allowed a run, as Will Venable reached on a wild-pitch strikeout, stole second, stole third, and then scored on a balk. Cobb then struck out Yonder Alonso to end the inning. It was only the 30th time that an AL pitcher has struck out four batters in an inning and the 60th overall. Striking out four batters in an inning is one of the rarest occurrences in baseball, just behind a perfect game and the unassisted triple play.

All told, this might have been the most dominant start when a starter didn't even manage to qualify for a win.

Wednesday, May 1, 2013

On swinging strikes, swings in the zone (or lack thereof), and strike outs

Following up on the comment on my Clay Buchholz piece, I wanted to examine how the swing percentage on pitches in the strike zone impacted strikeout rate. Thanks to FanGraphs' excellent and customizable leaderboards, I was able to get data for pitchers from 2006 to 2012 who threw 100 or more innings. I did not want to go much farther back, both because strikeout rates have changed considerably over the last few years and because swinging strike percentage and swing rate on pitches in the zone are not available prior to 2005.

What are the Data I'm Using?
So, given 100 IP in a single season from 2006 to 2012, I had 986 total records from 306 different pitchers to go with. For simplicity's sake, I'm going to treat each season as independent, even though we might expect different seasons from the same pitcher to be related.

Creating the Basic Model
I used these 986 data points to create a linear model predicting strikeout rate in K/9 IP using the swinging strike percentage. I originally considered adding the season to the model, as strikeout rates have steadily risen over the time period I'm looking at. While season was a significant predictor of K/9 IP when it was included in the model (more recent = more strikeouts), it didn't add much to the overall fit and only had a very minor impact on the model overall. In ecology, we often describe this as statistically significant, but not biologically significant. It often happens when you have a large sample size. Also, in general, I strongly prefer the simplest model possible, especially when adding variables only marginally improves the fit. Guess I'm just a big fan of Occam's Razor. Without further ado, the relationship between swinging strike rate and strikeout rate is:
K/9 = 71.8*(Swinging Strike %) + 0.7892

The R² for this model was 0.6693, meaning that swinging strike rate explains 67% of the variability in K/9. That is a huge proportion, but it does leave room for improvement.
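To get a feel for what the fitted line implies, here is a small Python sketch that plugs a few swinging strike rates into the equation above. One assumption on my part: the swinging strike rate goes in as a fraction (0.10 for 10%), since that is the scale on which the coefficients give realistic K/9 values. The function and constant names are just for illustration.

```python
# Coefficients from the fitted line quoted in the post.
SLOPE = 71.8
INTERCEPT = 0.7892


def predicted_k9(swinging_strike_rate: float) -> float:
    """Predicted K/9 for a swinging strike rate given as a fraction.

    Assumption: 0.10 means 10% of pitches were swinging strikes.
    """
    return SLOPE * swinging_strike_rate + INTERCEPT


for rate in (0.06, 0.08, 0.10, 0.12):
    print(f"{rate:.0%} swinging strikes -> {predicted_k9(rate):.2f} K/9")
```

Under that assumption, each additional percentage point of swinging strikes is worth about 0.72 K/9.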

Examining the Residuals
A residual is the difference between the actual value and the value the model predicts. A positive residual means that the pitcher had more strikeouts than our swinging strike model thought they should, while a negative residual means the opposite. In general, residuals closer to zero mean a better fit for the model, but you can also look at the residuals to see if there is anything consistent about them. In our case, we’ll look at the percent of swings on pitches in the zone. To do this, I ran another linear regression, trying to predict the residuals using the zone swing percent. This showed a significant trend, but with a much lower predictive value than the swinging strike rate (R² = 0.136). When your residuals have a trend, it means that your model isn’t capturing something, but for our purposes, it is a good thing. It means that when pitchers have a lower rate of swings against pitches in the zone, they have a higher strikeout rate than their swinging strike rate alone would predict.
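For anyone who wants to try the same two-step approach, here is a rough sketch in plain Python: fit K/9 on swinging strike rate, take the residuals (actual minus predicted), then fit those residuals on zone swing rate. The four data points are invented toy values to make the example run; they are not the FanGraphs data, and the helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y, slope, intercept):
    """Actual minus predicted: positive means more strikeouts than predicted."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum(r ** 2 for r in residuals(x, y, slope, intercept))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented toy values for illustration only (NOT the real data set).
swstr = [0.06, 0.08, 0.10, 0.12]        # swinging strike rate, as a fraction
zone_swing = [0.68, 0.60, 0.70, 0.62]   # swing rate on pitches in the zone
k9 = [4.8, 7.0, 7.5, 9.7]

# Step 1: K/9 from swinging strike rate.
slope1, intercept1 = fit_line(swstr, k9)
resid = residuals(swstr, k9, slope1, intercept1)

# Step 2: do the leftovers trend with zone swing rate?
slope2, intercept2 = fit_line(zone_swing, resid)
r2_resid = r_squared(zone_swing, resid, slope2, intercept2)

print(f"step 1 slope: {slope1:.1f}, step 2 slope: {slope2:.1f}, R^2: {r2_resid:.3f}")
```

A negative slope in the second step is the pattern described above: fewer swings in the zone go with more strikeouts than the first model expects.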

The Low Swing Pitchers
Over the entire data set, the average swing rate at pitches in the zone was 65%. Unlike some other stats, there isn’t much variability in swing rates at pitches in the zone. The lowest average belongs to Doug Fister (55.5%, over the course of three seasons), while the highest belongs to Scott Baker (72%, over two seasons). Another way to look at the data is to examine whether or not the pitchers with low swing rates on pitches in the zone have a high residual. In short, the answer is yes. Four pitchers averaged a zone swing rate of 60% or lower and had multiple seasons on record. Here are their numbers – remember, a positive residual means that a pitcher was striking out more batters than the model predicted.

Name              Seasons   Residual (K/9)   Zone Swing %
Doug Fister       3         0.873            55.5%
C.J. Wilson       3         1.678            57.5%
Jake Arrieta      3         0.992            59.2%
Mike Mussina      3         1.476            59.4%

So, we see a consistent pattern at the extreme ends. But, does swing rate on pitches in the strike zone explain most of the variability in K/9?

The Biggest Departures from the Model
To examine this, I looked at the pitchers who had (1) the highest residuals and (2) at least three years of data. If the biggest departures from our expected results (the largest residuals) all came with a lower-than-average swing rate on pitches within the zone, that would provide even stronger support for the idea that zone swing rate drives how far a pitcher beats the model. Of the 15 pitchers with the highest average residuals, only two have zone swing rates below 60% (Mussina and Wilson), while an additional six have zone swing rates slightly below average (60-64%), and seven have zone swing rates that are about average (64-66%). None of the pitchers with consistently high residuals have above-average zone swing rates. Interestingly, many of the high-residual pitchers who had average to slightly below-average zone swing rates also walked a lot of batters. If you increase the number of walks you hand out, you'll also increase the number of batters you face per inning, which in turn should lead to a higher K/9. For example, think about two different pitchers. Pitcher A strikes out 25% of the batters he faces and walks 10% (think Yovani Gallardo), and pitcher B strikes out 25% of the batters he faces and walks 6%. Pitcher B is clearly the superior pitcher, but pitcher A will likely have a higher K/9.

Name              Seasons   Residual (K/9)   Zone Swing %
Yovani Gallardo   4         2.065            62.0%
C.J. Wilson       3         1.678            57.5%
Erik Bedard       3         1.623            64.2%
Mike Mussina      3         1.476            59.4%
David Price       4         1.343            64.8%
Ubaldo Jimenez    4         1.272            63.7%
Tim Lincecum      5         1.258            63.4%
Zack Greinke      5         1.179            64.7%
Jonathan Sanchez  4         1.176            62.3%
Clayton Kershaw   5         1.169            63.4%
Oliver Perez      3         1.154            64.7%
Tommy Hanson      3         1.148            63.0%
Josh Beckett      6         1.134            64.9%
Daniel Cabrera    3         1.119            64.8%
Jon Lester        4         1.112            65.7%
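The walk argument from the pitcher A and pitcher B example can be put into rough numbers. The Python sketch below converts per-batter strikeout and walk rates into K/9 by counting how many batters it takes to record 27 outs. The 70% out rate on all other plate appearances is a round number I picked for illustration, not a figure from the data, and the sketch treats every strikeout as an out.

```python
def k_per_9(k_rate: float, bb_rate: float, out_rate_on_other_pa: float = 0.70) -> float:
    """Approximate K/9 from per-batter strikeout and walk rates.

    out_rate_on_other_pa is an ASSUMED share of non-strikeout, non-walk
    plate appearances that end in an out (illustrative, not measured).
    """
    # Outs recorded per batter faced: strikeouts plus outs on everything else.
    outs_per_batter = k_rate + (1.0 - k_rate - bb_rate) * out_rate_on_other_pa
    # Batters needed to record nine innings' worth of outs.
    batters_per_9 = 27.0 / outs_per_batter
    return k_rate * batters_per_9


pitcher_a = k_per_9(k_rate=0.25, bb_rate=0.10)  # same strikeout rate, more walks
pitcher_b = k_per_9(k_rate=0.25, bb_rate=0.06)

print(f"Pitcher A: {pitcher_a:.2f} K/9")
print(f"Pitcher B: {pitcher_b:.2f} K/9")
```

With identical strikeout rates per batter, the pitcher who walks more hitters needs more batters to get through nine innings, so his K/9 comes out higher even though he is the worse pitcher.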

So What Does This All Mean?
I think there are several conclusions we can draw here. First, a lower swing rate at pitches in the strike zone does tend to go with a higher K/9 than swinging strike rate alone predicts. Second, if a pitcher has a high swing rate at pitches in the zone, they probably won't outperform their predicted K/9 based on swinging strike rate. This makes sense; lots of swings will lead to more balls in play, which in turn will lead to fewer strikeouts. Third, there are other factors that aren't captured by just looking at swinging strike rate and zone swing rate. Most of the pitchers with the highest residuals in the model did not have extremely low zone swing rates. This may be attributable to "stuff" in general, so perhaps delving into PITCHf/x data would lead to a clearer picture there.