Thursday, August 25, 2011

How is a species like a baseball player?

Biomass is to runs as species is to player, and as ecologist is to Brad Pitt.

Community ecology and major league baseball have a lot to learn from each other.

Let's back up. As a community ecologist, I think about how species assemble into communities, and the consequences for ecosystems when species disappear. I'm especially interested in using traits of species to address these issues. For the grassland plants that I often work with, the traits are morphological (for example, plant height and leaf thickness), physiological (leaf nitrogen concentration, photosynthetic rate), and life history (timing and mode of reproduction).

As a baseball fan, I spend a lot of time watching baseball. Actually, I'm watching my Red Sox now (multitasking as usual; I freely admit there's a lot of down time in between pitches). I care about how the team does, mostly in terms of beating the Yankees. I'm especially interested in how individual players are doing at any time; for fielders I care about their batting average and defensive skills, and for the pitchers I care about how few runs they allow and how many strikeouts they get.

So my vocation and avocation have some similarities. Both ecology and baseball have changed in the last decade or so to become more focused on 'granular' data at the individual level. In ecology this has been touted as a revolutionary shift in perspective, but is really a return to the important aspects of what roles organisms play in ecosystems, and how ecosystems are shaped by the organisms in them. This trait-based approach has shifted the collection and sharing of data on organism morphology, physiology, and life history into warp speed, to the great benefit of quantitatively-minded ecologists everywhere.

In baseball, the ability to collate and analyze data on every pitch and every play has led to an explosion of new metrics to evaluate players. One of the simplest of these new metrics, which even the traditionalists in baseball now value, is "on base plus slugging" (OPS, see all the details here). This data-intensive approach to analyzing player performance was most famously championed by the general manager of the Oakland Athletics in the late 1990s, who is played by Brad Pitt in the upcoming movie Moneyball.
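For the curious, OPS really is as simple as its name: on-base percentage plus slugging percentage. Here is a minimal sketch in Python using the standard definitions of those two components. The function names and the season line at the bottom are made up for illustration and do not come from the database used in this post.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / opportunities if opportunities else 0.0


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats if at_bats else 0.0


def ops(singles, doubles, triples, home_runs,
        walks, hit_by_pitch, at_bats, sacrifice_flies):
    """On base plus slugging: the sum of the two rates above."""
    hits = singles + doubles + triples + home_runs
    obp = on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies)
    slg = slugging_percentage(singles, doubles, triples, home_runs, at_bats)
    return obp + slg


# Hypothetical season line, invented for this example.
season_ops = ops(singles=100, doubles=30, triples=5, home_runs=15,
                 walks=60, hit_by_pitch=5, at_bats=500, sacrifice_flies=5)
print(f"OPS: {season_ops:.3f}")
```

The two rates have different denominators, which is part of why OPS is a convenient shorthand and not a rigorous measure.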

There is no one ecologist in particular who can claim credit for popularizing trait-based approaches in community ecology, but for the sake of laughs let's make Owen Petchey the Brad Pitt analogue.

What can we do with this analogy? For pure nerd fun, we can think about what these two worlds can learn from each other.

What can baseball learn from community ecology?

One of the most notable trait-centric innovations in community ecology has been the use of functional diversity (FD), which represents how varied the species in a community are in terms of their functional traits. Many flavors of FD exist (one of which was authored by Owen Petchey, above), but the goal is to use one value to summarize the variation in functional traits of species in a community. A high value for a set of communities indicates greater distinctiveness among the community members, and is taken to represent greater niche complementarity.
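To make the idea concrete, here is a rough sketch in Python of one very simple flavor of trait dispersion: the mean pairwise distance between community members after standardizing each trait across the whole pool. This is a stand-in chosen for brevity. It is not Petchey's dendrogram-based FD and not necessarily the measure behind the results in this post, and the rosters and stats below are invented.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev


def standardize_pool(teams):
    """Z-score each trait across ALL players, so teams stay comparable."""
    everyone = [player for roster in teams.values() for player in roster]
    columns = list(zip(*everyone))
    centers = [mean(col) for col in columns]
    # A trait with zero spread would divide by zero; fall back to 1.0.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [[(v - c) / s for v, c, s in zip(player, centers, spreads)]
               for player in roster]
        for name, roster in teams.items()
    }


def mean_pairwise_distance(roster):
    """Average Euclidean distance between every pair of members."""
    pairs = list(combinations(roster, 2))
    if not pairs:
        return 0.0
    return mean(dist(p, q) for p, q in pairs)


# Invented rosters: (batting average, home runs, stolen bases) per player.
teams = {
    "similar": [[0.280, 20, 10], [0.285, 22, 12], [0.275, 18, 9]],
    "varied": [[0.220, 40, 1], [0.330, 5, 45], [0.270, 15, 10]],
}
scaled = standardize_pool(teams)
dispersion = {name: mean_pairwise_distance(r) for name, r in scaled.items()}
for name, value in dispersion.items():
    print(name, round(value, 2))
```

Standardizing across the pooled players, not within each team, matters here: scaling each team by its own spread would erase exactly the between-team differences in variation that the index is meant to capture.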

For fun, I've taken stats from a fantastic baseball database[i] and calculated the FD of all baseball teams from 1871 to 2010. I used a select set of batting, fielding, and pitching statistics[ii], and you can see the data here. For the two teams that I pay the most attention to, I plotted their FD against wins, with World Series victories highlighted:

Given that these FD values represent how different the members of a team are, it's surprising that there is much of a pattern at all. But the negative relationship between wins and FD is strong and significant by several measures[iii]. So: the more similar a team is in terms of player statistics, the better the team does!

This pattern of less dissimilarity among players correlating with better performance at the team level has apparently been noticed before, by Stephen Jay Gould, who also extended this pattern across teams to explain the gradual shrinking of differences among players over time:

"if general play has improved, with less variation among a group of consistently better players, then disparity among teams should also decrease"

and so:

"As play improves and bell curves march towards right walls, variation must shrink at the right tail." (from "Full House", thanks to Marc for this quote!).

Interesting, but is it useful? One obvious drawback in this approach of examining variation in individual performance is that it ignores the fact that in baseball, we know that a high number of earned runs allowed is bad for a pitcher, and a low number for hits is bad for a hitter. In contrast, a high value for specific leaf area is neither good nor bad for a plant, just an indication of its nutrient acquisition strategy.

There are many exponentially nerdier avenues for applying community ecology tools to baseball data, but I'll spare you that for now!

What can community ecology learn from baseball?

One new baseball stat that gets a lot of attention during trades is 'wins above replacement'. This is such a complicated statistic to calculate that the "simple" definition is that for fielders, you add together wRAA and UZR, while for pitchers it is based off of FIP. I hope that cleared things up.

The point in the end is to say how many wins a player is worth when compared to a replacement-level player, the kind of readily available substitute a team could call up from the minors. In ecology, the concept of 'wins above replacement' has at least two analogies.

First, community ecologists have been doing competition experiments since the dawn of time. The goal is to figure out what the effect of a species is at the community level, although fully factorial competition experiments at the community level are challenging to carry out. For example, Weigelt and colleagues showed that there can be non-additive effects of competitor plant species on a target species, but could rank the effect of competitors. This result allowed them to predict the effect of adding or removing a competitor species from a mixture, in a roughly similar way to how a general manager would want to know how a trade would change his or her team's performance.

Second, ecologists have shown that both niche complementarity and a 'sampling effect' are responsible for driving the positive relationship between biodiversity and ecosystem functioning. The sampling effect refers to the increasing chance of including a particularly influential species when the number of species increases. Large-scale experiments in grasslands have been carried out where plants are grown in monoculture and then many combinations, up to 60 species. The use of the monocultures allows an analysis similar in spirit to 'wins above replacement', by testing how much the presence of a particular species, versus the number of species, alters the community performance.
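One common way to put numbers on this is an additive partition of the net biodiversity effect, usually associated with Loreau and Hector, which splits the gain of a mixture over its monoculture expectation into a complementarity term and a selection term (a close cousin of the sampling effect). The Python sketch below follows that standard formula. The yields are invented, and nothing here is taken from the experiments mentioned above.

```python
from statistics import mean


def additive_partition(monoculture_yields, mixture_yields, planted=None):
    """Split the net biodiversity effect into complementarity + selection.

    monoculture_yields: yield of each species grown alone (M_i)
    mixture_yields:     yield of each species within the mixture (Y_i)
    planted:            expected relative yields; defaults to 1/N each
    """
    n = len(monoculture_yields)
    if planted is None:
        planted = [1 / n] * n

    # Deviation of observed relative yield from the expected relative yield.
    delta_ry = [y / m - p
                for y, m, p in zip(mixture_yields, monoculture_yields, planted)]
    mean_dry = mean(delta_ry)
    mean_mono = mean(monoculture_yields)
    # Population covariance (divide by N) between delta RY and M.
    covariance = mean((d - mean_dry) * (m - mean_mono)
                      for d, m in zip(delta_ry, monoculture_yields))

    expected = sum(p * m for p, m in zip(planted, monoculture_yields))
    return {
        "net": sum(mixture_yields) - expected,
        "complementarity": n * mean_dry * mean_mono,
        "selection": n * covariance,
    }


# Invented yields (say, g per square meter) for a three-species mixture.
effects = additive_partition(monoculture_yields=[100, 200, 300],
                             mixture_yields=[40, 80, 180])
print(effects)
```

By construction the two terms sum to the net effect, so a positive selection term means the species that are most productive alone also gained the most in mixture.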

We could take this analogy further, and think of communities more like teams. A restoration ecologist might calculate 'wins above replacement' for all the species in a set of communities, and then create All Star communities from the top performers.
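As a toy version of that idea, here is a Python sketch that scores each species by how much better plots do with it than without it. This is a hypothetical metric made up for this post, not an established ecological statistic. It ignores confounding between species and richness, and the plot data are invented.

```python
from statistics import mean


def value_above_replacement(plots):
    """Mean performance of plots containing a species minus plots lacking it.

    plots: list of (set_of_species, performance) pairs, for example biomass.
    Species found in every plot or in none are skipped, since there is
    nothing to compare them against.
    """
    species = set().union(*(members for members, _ in plots))
    scores = {}
    for sp in sorted(species):
        with_sp = [perf for members, perf in plots if sp in members]
        without_sp = [perf for members, perf in plots if sp not in members]
        if with_sp and without_sp:
            scores[sp] = mean(with_sp) - mean(without_sp)
    return scores


# Invented plots: species composition and biomass.
plots = [
    ({"A", "B"}, 10.0),
    ({"A", "C"}, 12.0),
    ({"B", "C"}, 6.0),
    ({"A", "B", "C"}, 11.0),
]
scores = value_above_replacement(plots)
all_stars = sorted(scores, key=scores.get, reverse=True)
print(all_stars, scores)
```

A real analysis would want to control for the number of species per plot, much as baseball's version adjusts for position and ballpark.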

Lessons learned

A. Shockingly, there are baseball nerds, and there are ecology nerds, and there are even double-whammy basebology nerds.

B. There are quantitative approaches to analyzing individual performance in these crazily disparate realms which might be useful to each other.

C. I might need to spend more time writing papers and less time geeking out about baseball!

More analogies to consider:

Reciprocal transplants: trades?

Trophic levels: minor league system?

Nitrogen fertilization: steroids?


[i] One of the most astonishing databases around: complete downloadable stats for every player since 1871. This database is what NEON should aspire to be, except that this one was compiled completely privately by some single-minded and visionary baseball geeks!

[ii] Batting: Hits, at bats, runs batted in, stolen bases, walks, home runs

Fielding: Put outs, assists, errors, zone rating

Pitching: Earned run average, home runs allowed, walks, strike outs.

[iii] E.g. even after taking into account other more typical measures of success in offense (runs, R) and defense (runs allowed, RA), within years, there is still a negative slope for FD on wins:

library(nlme)
lme(win ~ R + RA + FD, random = ~ 1 | yearID, data = team)

              Value  Std.Err    DF  t-value  p-value
(Intercept)  80.289   0.7411  2159    108.3   <0.001
R             0.107   0.0009  2159    116.8   <0.001
RA           -0.105   0.0009  2159   -115.6   <0.001
FD           -1.729   0.8083  2159     -2.1   0.0325

8 comments:

Aaron Berdanier said...

I love ideas like this. Another analogy that isn't specifically for baseball but sports in general is home field advantage, which is probably relevant beyond these authors' use for litter decomposition:
http://warnercnr.colostate.edu/~edayres/Pubs/2009%20Ayres%20et%20al%20Soil%20Biol%20Biochem.pdf

I'd say that if you can articulate the ecological analog of some of these metrics they'd be publication worthy, which would allow you to have your cake (baseball geekery) and eat it too (scientific productivity)!

Dan Flynn said...

Yes! Although for plant growth there is a distinct home field *disadvantage*
eg
Bever 1994
http://www.esajournals.org/doi/abs/10.2307/1941601
Petermann 2008
http://www.esajournals.org/doi/full/10.1890/07-2056.1

Which for the baseball analogy is like the accumulation of noxious fans at home, driving the team's performance down, perhaps.
Analogies are fun.

Jim Bouldin said...

Not to be overly critical but I'm having difficulty figuring out how you define FD for a baseball team (either your method in particular, or any method in general). I don't see any straightforward analogy with community structure or diversity. In baseball there's a clear goal: outscore your opponent. How to define FD, and how such relates to that goal, are quite unclear. Only on offense can I see that FD might be defined in some meaningful way (i.e. there are different ways (strategies) of scoring runs such as "small ball" (walks and hits, steals, sacrifices etc) vs "big ball" (home runs)). On defense I see no such possible variations in strategy--the same skill set (i.e. speed, catching, throwing) is always required. If you can't get to balls, or can't catch them when you do, or can't throw guys out, well then tough, there's really no substitute method available. Pitching is probably somewhere in between the two--e.g. strikeout vs ground ball pitchers or something like that.

In community ecology there is no such clear goal. Species are out to maximize their relative fitnesses, and the community is whatever results from that.

Dan Flynn said...

Hey Jim -- Thanks for your comment. The analogy is certainly imperfect, but in this case FD of a team in a year is reflecting how similar the players are across all of their performance stats. I agree that there isn't quite the same tradeoff axis like a leaf economics spectrum, although your batting and pitching examples are similar.

What is at least mildly interesting is that teams with more similar players did better! Perhaps when the lineup is composed of all similar types of hitters, there are fewer intentional walks and more consistent run scoring.

For community ecology, assemblages of species which are more distinct from one another are predictive of some ecosystem functions -- particularly biomass accumulation for grassland plants. We chose traits for FD which have a link to the ecosystem function of interest, but don't invoke any goal-directed behavior on behalf of the individual members, clearly a big difference from the baseball analogy.

Jim Bouldin said...

I'd bet somebody at FanGraphs or Stathead, or Bill James (and certainly Brad Pitt!) have looked at this question.

As a first cut I'd hypothesize (as above) that there is little or no possibility for FD on defense and therefore it comes from the offense and pitching. Then I'd look very selectively at something like the ratio of righties/lefties on the staff, broken out by starters and relievers, and/or power pitchers vs location pitchers. On offense I'd look at something that summarizes power and speed (probably there is a stat already). I wouldn't take a kitchen sink approach without first taking principal components or a similar data reduction--too much covariance amongst the various stats I'd bet.

Dan Flynn said...

The sabermetricians certainly are way advanced here -- I think the next step for me actually is to think about what we can borrow from them and apply to community ecology, rather than spend time working to predict team success (although it is fun).

But about evenness of offense: last night the Red Sox got at least one hit from every member of the lineup, and beat the Yankees... QED.

Jim Bouldin said...

I'd doubt there is much that they can provide in that regard. They don't have any statistical methods not readily available to everyone. In fact, I'd guess they are behind what you can find in the ecological literature, and implemented in R.

Jim Bouldin said...

Brad Pitt, he's the guy to ask...