Written By

Andrew Powell

May 5, 2016

Data is funny.

We use it to tell us all sorts of things. We call it empirical. We talk about how the data doesn’t lie. We look at numbers, look at trends, and we draw conclusions – not just the data scientists in the crowd, but everybody. How much money did you make last year? How profitable is the latest Captain America movie? Who is the most successful batter of all time? The data will tell us.

But … will it?

Let’s look at a couple of baseball players and examine their batting averages. This is real data I’m using here, and the math is pretty easy. For the sake of conversation, let’s try to determine who was a better batter – Derek Jeter or David Justice. To make things simple, let’s examine a data set of just two years, 1995 and 1996, and let’s talk about each player’s batting average – that’s the number of hits a batter gets divided by his number of at-bats.

Derek Jeter’s batting average for 1995 was .250 and for 1996 was .314

David Justice’s batting average for 1995 was .253 and for 1996 was .321

What does the data tell us? It’s pretty clear, right? If you’re gonna pick a better batter for 1995 and 1996, you’d choose David Justice. He was a more successful batter than Derek Jeter was. He hit the ball with more reliability. That’s not my opinion – the data says so!

Not so fast. Let’s combine the two years:

For the two-year period combined, Derek Jeter’s batting average was .310

For the same period, David Justice’s batting average was .270

Wait, what?

That’s not a typo, that’s Simpson’s Paradox in action. Edward Simpson first described his statistical finding this way: “Trends which appear in groups of data may disappear or reverse when the groups are combined.” Seems unbelievable, right? It’s not. It’s just math.

Let’s look at the raw data. The “winner” in each data set is marked with an asterisk.

1995             Hits    At Bats    Average
Derek Jeter        12         48       .250
David Justice     104        411       .253 *

1996             Hits    At Bats    Average
Derek Jeter       183        582       .314
David Justice      45        140       .321 *

Combined         Hits    At Bats    Average
Derek Jeter       195        630       .310 *
David Justice     149        551       .270

The data doesn’t lie. David Justice had a higher batting average in 1995 and a higher batting average in 1996 … and when you combine the two years, Derek Jeter is the better batter. The at-bat counts explain the flip: most of Jeter’s at-bats came in 1996, when both players hit well, while most of Justice’s came in 1995, when both hit closer to .250. Sorry, David; when you aggregate data, sometimes there’s just no justice.
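If you’d rather check the reversal than take my word for it, here is a minimal Python sketch that recomputes it from the hits and at-bats in the tables above. The names `SEASONS`, `average`, `combined_average`, and `leader` are my own illustrative choices, and the sketch assumes the combined average is calculated by pooling hits and at-bats across both years, which is how the combined table was built.

```python
from fractions import Fraction

# (hits, at_bats) per season, copied from the tables above.
SEASONS = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def average(hits, at_bats):
    """Batting average as an exact fraction: hits divided by at-bats."""
    return Fraction(hits, at_bats)


def combined_average(seasons):
    """Pool hits and at-bats across all seasons, then divide once."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return average(total_hits, total_at_bats)


def leader(averages):
    """Name of the player with the highest average."""
    return max(averages, key=averages.get)


# Who leads in each individual year?
yearly_leaders = {
    year: leader({name: average(*s[year]) for name, s in SEASONS.items()})
    for year in (1995, 1996)
}

# Who leads once the two years are pooled?
combined_leader = leader(
    {name: combined_average(s) for name, s in SEASONS.items()}
)

for year, name in yearly_leaders.items():
    print(f"{year}: {name}")
print(f"Combined: {combined_leader}")
```

Using `Fraction` keeps the comparisons exact, so a close call like .250 versus .253 can’t be decided by floating-point rounding. Note that the combined figure is not the mean of the two yearly averages; each year is weighted by its at-bats, and that weighting is where the reversal comes from.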

I’m not saying data can’t be trusted – that’s not the point at all. Data can always be trusted. It’s empirical, remember. Data doesn’t lie. The paradox is that both cases are true. David Justice had a higher batting average than Derek Jeter in both 1995 and 1996. This is a fact. Derek Jeter’s 1995/1996 Combined batting average is higher. This is also true. It seems like these things can’t both be true, but they are.

And that’s the point.

The world isn’t binary. We think if A is true then B must be false, and that’s almost never the case. We think, if we’re right about something, then others must be wrong. We think if what the data tells us is true, then what the data doesn’t tell us is surely false.

All too often, we’re wrong.

Let’s talk about movies for a second. Which movie was more successful, The Avengers or Furious 7? Let me give you some data to help you figure this out:

Movie            Worldwide Gross
The Avengers     $1,517,557,910
Furious 7        $1,516,045,991

The answer is obvious. The Avengers was more successful, right? The data says so. The math is clear. The Avengers made $1.5 million more than Furious 7. Box office numbers don’t lie! But there’s more to the data than that. Dig a bit deeper and look at each movie’s budget:

Movie            Budget
The Avengers     $220,000,000
Furious 7        $85,000,000

So The Avengers cost $135 million more to make than Furious 7 did, and only made $1.5 million more than Furious 7 did. Doesn’t that mean Furious 7 was more successful?

I guess it depends on how you define successful. And that brings us closer to something you can take away and think about. If you define a movie’s success to be a measure of tickets sold (and dollars earned) at the box office, you are correct in asserting that The Avengers is more successful. If you define a movie’s success as the function of the movie’s box office receipts less the movie’s budget, you are correct in asserting that Furious 7 is more successful.
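To make the two definitions concrete, here is a small Python sketch that ranks the same two movies under each one. It uses only the gross and budget figures quoted above; the names `MOVIES`, `by_gross`, `by_gross_less_budget`, and `most_successful` are hypothetical, chosen just for this illustration.

```python
# Figures copied from the two tables above, in US dollars.
MOVIES = {
    "The Avengers": {"gross": 1_517_557_910, "budget": 220_000_000},
    "Furious 7": {"gross": 1_516_045_991, "budget": 85_000_000},
}


def by_gross(movie):
    """Definition 1: success is box office receipts."""
    return movie["gross"]


def by_gross_less_budget(movie):
    """Definition 2: success is box office receipts less the budget."""
    return movie["gross"] - movie["budget"]


def most_successful(metric):
    """Title of the movie that scores highest under the given metric."""
    return max(MOVIES, key=lambda title: metric(MOVIES[title]))


print("By gross:", most_successful(by_gross))
print("By gross less budget:", most_successful(by_gross_less_budget))
```

The data never changes between the two calls. Only the definition of success does, and the winner changes with it.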

Your binary instincts tell you that only one or the other can be true, but the data confirms that both are.

It’s all about how you look at it.

Consider this the next time you find yourself in a disagreement with someone about something. What if the fact that you’re right doesn’t mean the other person is also wrong? What if you’re facing Simpson’s Paradox? What if you’re both right?

It’s not always about who is right. Sometimes, everybody is.

When they are, learn from them.


About the Author

Andrew Powell joined OST as a managing consultant in 2014. His experience in application development spans more than two decades, working as an application developer, an architect, a technical writer, a trainer, a consultant, a manager, a designer, and a business owner. Andrew’s career has led him to work for companies as small as his own start-ups and as large as Meijer Corporation and Farmers Insurance. In one shape or another, he has spent the whole of his adult life working in the application development space.