These are some notes to myself on a baseball presentation I’m doing in a couple of weeks. Actually, I’ve been doing it pretty much constantly since last summer, just to different groups at work. I’m just trying to capture my thought process and the issues I’ve seen while developing a baseball analytical model that finds the factors most critical in a home team victory. As always, comments are welcome.
I had to relearn some basic statistics for a work assignment on Analytics (which is statistics on steroids), so I thought, what better subject than baseball? There’s tons of data available, there are statistics galore, and there are all sorts of things you can try to predict.
There are two issues that I’ve now come across while building a lab exercise based on game results – the results tend to make baseball people say, “Well, duh!” and more frighteningly, not all people understand baseball. WTF?
The “Well, duh!” part is actually a good thing – it means the statistical results may actually be correct, or at least believable in the baseball universe (this is called domain knowledge when you’re doing analytics). Domain knowledge is what allows you to know that if you’re trying to predict the home team’s score, using the home team’s RBI totals is probably cheating – since nearly every run that scores is also counted as an RBI, so the two numbers are almost the same.
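Here is a rough sketch of how you might catch that kind of cheating without knowing baseball: flag any input that tracks the target almost perfectly. The numbers are made up for illustration (they are not from the real dataset), and the names `pearson`, `flag_leaky_features`, and the 0.95 threshold are my own assumptions, not anything SPSS Modeler does for you.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_leaky_features(features, target, threshold=0.95):
    """Return names of features that correlate suspiciously well with the target."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Made-up per-game values, for illustration only.
home_runs_scored = [3, 5, 0, 7, 2, 4, 1, 6]
features = {
    "home_rbi": [3, 5, 0, 6, 2, 4, 1, 6],     # nearly a copy of the target
    "home_hits": [9, 8, 4, 11, 6, 10, 5, 9],  # related, but not a copy
}

leaky = flag_leaky_features(features, home_runs_scored)
print(leaky)
```

A check like this only raises a flag. It still takes domain knowledge to decide whether a flagged input is really the answer in disguise.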
That’s where the second problem comes in – not everyone knows what an RBI is. WTF? People know Honey Boo-Boo’s waist size and they don’t know how runs batted in are counted or what they mean?
So, now I have a presentation coming up and I expected to have to explain how the basic statistical model works, which is giving me enough heartburn. I’m using IBM SPSS Modeler (a really fun toy if it’s from work or a really sophisticated analytics platform if you’re trying to get your boss to buy it) to build a model based on MLB games from 2000-2012. (That’s a lot of games – I think there are over 24,000 records in the dataset. However, most analytics models would have more records than that.) The model looks at the factors that influence a home victory – which basically means the home team’s score is greater than the visiting team’s. (Well, duh!)
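For anyone who wants the “Well, duh!” definition spelled out, this is a minimal sketch of how the target could be labeled. The `GameResult` class and its field names are hypothetical (they are not the Retrosheet column names), and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class GameResult:
    """One game record; field names are hypothetical, not Retrosheet's."""
    visitor_score: int
    home_score: int


def is_home_win(game):
    """A home victory means the home team outscored the visitors."""
    return game.home_score > game.visitor_score


# Made-up scores, for illustration only.
games = [
    GameResult(visitor_score=3, home_score=5),
    GameResult(visitor_score=7, home_score=2),
    GameResult(visitor_score=1, home_score=4),
    GameResult(visitor_score=2, home_score=3),
]

labels = [is_home_win(g) for g in games]
home_win_rate = sum(labels) / len(labels)
print(labels, home_win_rate)
```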
This is a major advantage of baseball – there are no ties, unless you have an idiot commissioner and the managers run out of pitchers. In the real universe, somebody is going to win. So, you can predict (try to predict) victories.
The other advantage is that baseball is a logical game of progression – you’re not going to have an interception, for example, and you can’t run out the clock. You have a specific number of outs, and each batter sees a limited count of balls and strikes. The total number of pitches may vary, but three strikes and you’re out (this is the origin of that phrase, in case you really don’t know baseball).
So, I will have to go over the basics of linear regression – trying to predict one value based on one or more other values, and then go over the baseball terms to explain why they are important. Oh, and explain SPSS Modeler to an audience that has never seen it.
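To make “predict one value based on another” concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. It is not how SPSS Modeler does it internally. The function name `fit_line` is my own, and the hits and runs are made-up numbers chosen to fall exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: hits in a game and runs scored, for illustration only.
hits = [4, 6, 8, 10]
runs = [1, 2, 3, 4]

intercept, slope = fit_line(hits, runs)

# Predict runs for a game with 7 hits using the fitted line.
predicted_runs = intercept + slope * 7
print(intercept, slope, predicted_runs)
```

With more than one input (hits and walks for both teams, say), the idea is the same, but you would normally hand the arithmetic to a statistics package.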
I really didn’t think I would have to cover all that.
It’s interesting – in the three or four years I’ve had AirHogs tickets, I’ve learned a lot about baseball, but I always knew the basics, so I assumed everyone did. My dad took me to one Rangers game that I can remember (David Clyde was pitching), and I never actually played baseball – I played softball in a corporate league in my thirties (I was a pitcher) – but I still knew the basics. Now, we have a generation that doesn’t necessarily know. Oops.
For the record, given the games from 2000-2012 (thank you, www.retrosheet.org), the most important factors in predicting a home victory are:
- The number of hits by the visiting team
- The number of hits by the home team
- The number of visitors’ walks
- The number of home walks
- and some other factors (errors, home runs) which have much less impact
The interesting part about this exercise to me has been realizing how important domain knowledge really is. If you don’t know much about baseball, you won’t look at the factors and think, “Wow, pitching is pretty important.” (The visitors’ hits and walks are largely what the home team’s pitchers give up.) Now, to baseball people, that’s obvious, but to a fan who is used to someone swinging for the fences, it may not be obvious that the visitors are swinging for the fences, as well – and stopping them is an important part of the game.
If you watch the movie Moneyball, it begins with the “epic struggle” between the statistics nerd and the old school “just have a feelin’ about him” scouts. However, I think they are basically very similar – the statistics tend to prove what old school baseball people take as gospel (except for the dating the hot chick theory) – they just don’t know why they know it. Also, the statistics and analytics may prove that some of the gospel is wrong – which is the premise of Moneyball in the first place.
Now, if you don’t care about baseball, then none of this is very meaningful, because the results are just gibberish. However, these lessons apply to business as well – if you are running a business and making decisions based on hunches, analytics can show whether the hunches are correct or not. Maybe you’re right – in which case, you know your business well. If you’re not, either you’re in the wrong business, or you need to do research before making decisions, and not just guess.
In fact, from a modeling standpoint, building a model to look at baseball is not much different from building a model to check credit scores to approve credit card applications. The only things that change are the domain knowledge and the actual data.
The reason I picked baseball in the first place was that almost all of the analytical models I had seen built were for mobile phone churn (customers leaving for other carriers) or banking – what happens if you’re not in either of those industries? So, I assumed baseball was a universal subject that people would have some idea about. That may have been an incorrect assumption – but I’d rather explain baseball to a crowd of people than the mobile phone industry.