On Monday we published something a little different than most of the graphics we make – a running, updating tracker of how much money major league teams are paying to players on the disabled list.
I love sports, but I’m not a huge baseball fan and I’m neutral on the Yankees scale – I don’t really hate them but I can’t say I care whether they win or lose. But early in the year, I remember seeing a fun New Yorker cover that planted a seed:
Talking to some friends and colleagues, Joe Ward and I thought it would be fun to do something that put a dollar figure on the Yankees’ disabled list. We certainly weren’t the first people to notice this – in addition to coverage from traditional outlets, the Onion wrote about how “stacked” the D.L. was and there was a well-circulated blog post when their payroll approached $100 million in annual salaries – but we wanted to make something that showed all major league teams and was updated throughout the season. To do that, you only need two data sources: salaries for every player in the league and a list of all major league transactions, both of which are updated regularly.
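(If you’re curious about the arithmetic, it’s roughly this: a player’s daily cost is his annual salary divided by the number of days in the season, summed over everyone on the list each day. Below, a rough sketch of that calculation with made-up salaries, made-up field names and a guessed-at season length – not the production script:)

```javascript
// Rough sketch of the core calculation, not the production script.
// Salaries, stints and the season length below are all placeholders.
var DAYS_IN_SEASON = 183; // an assumption for this sketch

var salaries = { "Alex Rodriguez": 28000000, "Mark Teixeira": 22500000 };

// One record per DL stint, derived from the transaction log.
var dlStints = [
  { player: "Alex Rodriguez", team: "NYY", on: "2013-04-01", off: null },
  { player: "Mark Teixeira",  team: "NYY", on: "2013-04-01", off: "2013-05-31" }
];

// Dollars a team has paid to disabled players through a given date.
function dollarsOnDL(team, asOf) {
  return dlStints
    .filter(function(d) { return d.team === team; })
    .reduce(function(total, d) {
      var start = new Date(d.on),
          end = d.off ? new Date(d.off) : asOf,
          days = Math.max(0, (Math.min(end, asOf) - start) / 864e5); // ms -> days
      return total + days * salaries[d.player] / DAYS_IN_SEASON;
    }, 0);
}

dollarsOnDL("NYY", new Date("2013-06-15")); // running total for one team
```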
We wanted users to be able to find their own team, but also to see the big picture. Some of our original sketches focused on the amount spent per team per day. Below, a chart where each line represents one team’s amount paid to players per day (the jumps and dips represent players coming on and off the list):
Another sketch showed the teams as small multiples:
And another used stacked bars (poorly):
Or one that just showed the players on the bench and how long they’d been on it, regardless of team or salary:
But the one that stuck out the most in the end was the simplest – an aggregate per-team calculation:
With that, we started developing things in the browser. The following are sketches made with D3 based on the previous R charts.
Originally, this started as an idea for a phone-sized graphic with just a couple of numbers per team. (These sketches are old and the numbers are calculated incorrectly… I screwed some things up.)
But we also wanted to see individual players. Here, a first attempt at the data join in D3:
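(If you’ve never seen a D3 data join, the bare-bones version looks something like this – dummy numbers, a made-up container id and class name, not the actual sketch code:)

```javascript
// One div per team, sized by dollars paid to disabled players.
// The data, the #chart container and the class name are placeholders.
var teams = [
  { name: "Yankees", dollars: 52000000 },
  { name: "Mets",    dollars: 12000000 }
];

var bars = d3.select("#chart").selectAll("div.team")
    .data(teams, function(d) { return d.name; }); // join on team name

bars.enter().append("div")       // new teams get a div
    .attr("class", "team");

bars.style("width", function(d) { return d.dollars / 1e6 + "px"; })
    .text(function(d) { return d.name; });

bars.exit().remove();            // teams no longer in the data go away
```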
Later, hooking it up to real salary data:
And making it a little less boring (or “adding sugar,” as Shan says):
Before coming to the version that’s online now:
We still kept a mobile view that I think turned out as well as or better than the desktop version:
Is this an earth-shattering example of data journalism? I suppose it is not. It’s two data sets, a timer and a giant photograph of A-Rod updated a couple times a day. But I must say I like it. It’s fun and engaging for the users it’s aimed at; it’s not tied to a single news event but it’s not aimless either; it was developed and published in less than two weeks; it works on all sorts of devices and it updates every day (originally an R script running on a crontab, now a node script). It’s also a good example of using D3 to make data-driven applications without using SVG at all.
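(To make that last point concrete: D3’s selections and timers are perfectly happy driving plain HTML. A toy version of the ticking dollar figure – with a made-up starting total and a made-up per-second rate – might look like this:)

```javascript
// D3 driving plain HTML: no SVG anywhere.
// The starting total and per-second rate below are placeholders.
var total = 52000000,   // dollars already paid (made up)
    perSecond = 15;     // rate at which the figure ticks up (made up)

var counter = d3.select("body").append("div")
    .attr("class", "dl-counter");

var format = d3.format(",.0f");

d3.timer(function(elapsed) {
  counter.text("$" + format(total + perSecond * elapsed / 1000));
  return false; // returning false keeps the timer running
});
```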
Normally I show what we did in print, but in this case, we didn’t make anything. Most of the fun of this is seeing the numbers tick up in front of you (Shan’s idea) as you’re on the page. In print, it’s just another bar chart. At the same time, if something happens, we’ll be ready on short notice with all the data we need.
Last week we published an interactive graphic about the N.F.L. draft. Our goal was to show an odd reality: even though N.F.L. teams do tend to pick the “best” players early in the draft, there’s a tremendous amount of chance involved. The best 10 eventual N.F.L. performers will not be the first 10 players drafted – or even close.
How do we know both of those things are true, and how do we decide which is more important? We used draft and performance data from pro-football-reference.com. (One note: N.F.L. performance is hard to measure across positions – how do you decide whether a tight end is “better” than a linebacker or a defensive tackle? Most analyses use a combination of games started and Pro Bowl selections; the one developed by pro-football-reference uses both of those but adds some fine-tuning by position.)
So, for every pick in the draft, we have one number encompassing that player’s N.F.L. performance. Here are the top 20 since 1995:
Here’s a first sketch, where every dot represents one player. The Y axis is “how good” every player is, and the X axis is where in the draft they were selected. I actually screwed something up here – there aren’t more than 250 or so picks in a draft – but otherwise the distribution is more or less right:
My colleague Mike Bostock cleaned this up by coloring the picks by round and adding some labels:
Although that shows all the data, it’s too noisy to really interpret. Wanting to simplify this, I tried taking the average of all players who went at a certain round and certain pick – here, each dot represents the average value of all players at a certain pick (for example, the players drafted at Round 1, Pick 1, or Round 2, Pick 13). As before, the dots are colored by round:
The dot on the top-left represents the average value of all first picks in the draft since 1995 – on average, this group, which includes Peyton Manning, Cam Newton, Andrew Luck, Michael Vick, Keyshawn Johnson and others, clearly outperforms the other picks. (This might be obvious, but then again, the group also includes Tim Couch and JaMarcus Russell.)
I admit I liked this chart more than I probably should have. (My colleagues corrected me!) Averaging this way is a little misleading because every round doesn’t have the same number of picks (the league has grown and there are extra picks at the end of each round, which leads to some funny business with the math), and hiding the distribution oversimplifies things a little. But this chart does make a simple point – the better players tend to go first.
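(For what it’s worth, the averaging itself is a couple of lines in D3 – something like this, with hypothetical round/pick/value fields rather than the actual sketch code:)

```javascript
// Hypothetical records: where a player was taken and his career value.
var picks = [
  { round: 1, pick: 1,  value: 96 },
  { round: 1, pick: 1,  value: 54 },
  { round: 2, pick: 13, value: 22 }
];

// Average career value at each draft slot.
var averages = d3.nest()
    .key(function(d) { return d.round + "-" + d.pick; })
    .rollup(function(group) { return d3.mean(group, function(d) { return d.value; }); })
    .entries(picks);
// -> [{ key: "1-1", values: 75 }, { key: "2-13", values: 22 }]
```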
Instead, Mike offered a boxplot, which shows the distribution without being so noisy:
Even this was a little too busy for the point we wanted to make, so we settled on a small bar chart.
What we wanted to focus on was the reality that there’s much more randomness in the draft than people realize. Cade Massey and Richard H. Thaler, behavioral economists, analyzed the draft and found that not only is there no persistent skill among teams in picking players – teams have good years and bad years in equal measure – but that across all players and positions, teams only picked a player better than the person who went next at that position 52 percent of the time. Their academic paper is here, but Massey explained this in a much more accessible way in a recent talk at the MIT Sloan Sports Analytics Conference.
I took a stab at replicating some of their findings just to see what it would look like. Here’s a rough chart of the percentage of teams picking a player who ended up being better than the guy drafted after him at the same position. For example, if you chose Peyton Manning (Pick 1 in the 1998 draft) over Ryan Leaf (Pick 2), your guy is better than the next guy at that position, but if you chose Spergon Wynn (Pick 183 in the 2000 draft) over Tom Brady (Pick 199), you did not. (Sorry, Cleveland Browns.)
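(My rough version of that calculation looked more or less like this – the field names are hypothetical, and the real thing also broke the results out by round:)

```javascript
// For each position in each draft, walk the picks in order and ask whether
// the earlier pick outperformed the next pick at that position.
// Assumes records like { year: 1998, position: "QB", pick: 1, value: 120 }.
function pctBetterThanNext(players) {
  var better = 0, comparisons = 0;
  d3.nest()
      .key(function(d) { return d.year + "-" + d.position; })
      .entries(players)
      .forEach(function(group) {
        var picks = group.values.sort(function(a, b) { return a.pick - b.pick; });
        for (var i = 0; i < picks.length - 1; i++) {
          comparisons++;
          if (picks[i].value > picks[i + 1].value) better++;
        }
      });
  return better / comparisons; // ~0.52 in Massey and Thaler's data
}
```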
Simply put, teams don’t pick the “right” player as often as you think, and tend to do better than a coin flip only in the first round. This chart goes under 50 percent after the third round, but that reflects some noise in the data towards the end of the draft – most of these players don’t actually get in the game, so it’s not very meaningful to say that one benchwarmer is marginally better than another. But this concept is hard to explain in a chart like this (the title would be something like “percent of players who were better than the next player at the same position by round”), so we took a simpler approach.
I had been tinkering with a chart that showed where the best eventual players were drafted:
This chart highlights where the 10 “best” players in each draft were picked. My colleague Joe Ward thought it would look good in print, where we have more space, and this chart ended up closely resembling what was eventually printed:
Online, Shan Carter suggested an interface that showed this uncertainty with two sentences: the percent of the best players that came in the first round and the percent that came after:
A slider and about a hundred commits later, you have a tool that lets you explore where the best N players in each year’s draft were picked.
Mike also made a similar implementation based on the Fisher-Yates shuffle, which is a thing I learned about when he showed me, but it wasn’t the right application for this data, and anyway it was getting too late to change our minds:
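(In case it’s new to you too: Fisher-Yates is just the standard way to put an array in random order in linear time. A compact version:)

```javascript
// Fisher-Yates: walk the array from the end, swapping each element
// with a randomly chosen element that hasn't been placed yet.
function shuffle(array) {
  var m = array.length, t, i;
  while (m) {
    i = Math.floor(Math.random() * m--);
    t = array[m];
    array[m] = array[i];
    array[i] = t;
  }
  return array;
}
```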
These charts and sketches were made in R and D3. Normally, at the end of these posts, I note that other people made the best parts of the graphic; this time it’s especially true.
One of the great things about working in a department with a staff of 25 people is that you can be in big trouble three days before something publishes. Then you make a phone call to San Francisco and everything works out fine.
A couple weeks back, we used PitchFX data to show the relative “nastiness” (for lack of a better word) of the Mets’ pitcher Matt Harvey. The chart below shows pitches that batters swung at outside the strike zone during a recent game against the Phillies.
Just over a week ago we published a graphic – more of an interactive short blog post without a blog, really – that accompanied Tyler Kepner’s piece about strikeouts for the Times’ 2013 baseball preview. The subject of both pieces was the steep increase in strikeouts across the board in the past decade: last year, ten Major League clubs set franchise records for strikeouts.
The fact Tyler came to us with was one he’d found on his own: 18 teams struck out at least 1,200 times last season; through 2005, there had never been a season in which more than two teams topped that total. Below, the first sketch, based on that stat – the number of teams with 1,200 strikeouts or more in a season going back to 1968:
That’s a compelling chart, but it’s also a little misleading because the league has expanded a few times and not all seasons are the same length.
Instead, Joe Ward and I thought about making small multiples of the teams and arranging them in a sort of histogram, much as my colleague Bill Marsh did with exit polls in 2008 and 2012.
Here are the first nine teams in alphabetical order, with the league average in grey:
We didn’t really care for these, and I complained about it to my colleague and cubicle-partner Alicia DeSantis, who suggested I make it look like the climate change “hockey stick” charts. (FYI, the image below, one of the better ones from Wikipedia, is meant to convey the form, not to wade into the “hockey stick controversy,” if you believe there is one.)
Here’s what the first R sketch of that idea looked like – every team’s average strikeouts per game per year. (Red is the league average.)
At this point, we had a chart we liked and the process went forward like many of our other projects do. However, there was a key difference with this one that’s worth mentioning – all the rest of the sketches, edits and design improvements happened in a web browser. (More on this later.)
Here are a few successions of this chart, made using D3:
Getting this data from baseball-reference.com requires a bit of scraping, and this project sold me for life on R’s XML package, which makes scraping fast and shamefully easy.
In the final project, there are three interactive charts and a table on the page, and they are all generated in D3 with just one data file. The whole chart form – line selection, tooltip, calculating averages – is easily abstracted out, and for the first time I felt some of the same sketching power in a browser that I’d seen only with R: the concept that if you can make one chart, you can make a hundred with the same effort. But with D3, the sketches are already in a browser and wired for interaction! From a development point of view, it felt tremendously powerful. (For many of you this might be obvious, but old habits die hard.)
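(The abstraction itself is nothing fancy – a configurable chart function you call once per team, in the spirit of the reusable-chart pattern. A stripped-down sketch of the idea, not the project’s actual code:)

```javascript
// One chart function, called once per team, each element bound to that
// team's rows (e.g. [{ year: 1968, so: 3.1 }, ...] — hypothetical fields).
function strikeoutChart() {
  var width = 200, height = 100;

  function chart(selection) {
    selection.each(function(teamData) {
      var x = d3.scale.linear()
              .domain(d3.extent(teamData, function(d) { return d.year; }))
              .range([0, width]),
          y = d3.scale.linear()
              .domain([0, d3.max(teamData, function(d) { return d.so; })])
              .range([height, 0]);

      var line = d3.svg.line()
          .x(function(d) { return x(d.year); })
          .y(function(d) { return y(d.so); });

      d3.select(this).append("svg")
          .attr("width", width)
          .attr("height", height)
        .append("path")
          .attr("d", line(teamData))
          .style("fill", "none")
          .style("stroke", "#888");
    });
  }

  chart.width = function(w) { width = w; return chart; };
  return chart;
}

// Usage: bind each team's rows to a container, then selection.call(strikeoutChart());
```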
Also, thanks to the open-source SVG Crowbar bookmarklet developed by Shan Carter, this project represented a recent change in development process, for me, at least. Instead of developing both print and online charts separately, we were able to generate all the charts for print in a web browser at precisely the sizes we wanted, then save them down to Illustrator. Aside from being a useful shift in thinking, it saved a ton of time. (This isn’t the first time the department has done something like this – just the first time I did.)
For example, we included the small multiples in print, but we made them in D3 first:
Here’s the two-page spread in print. Again, all these charts were produced in a browser, saved to SVG and edited lightly in Illustrator.
Finally, for the record, most of the best parts of this graphic were made by Shan while I was on vacation (with standard last-minute triage from Amanda Cox and Mike Bostock), and all the meaningful annotation was from Joe Ward, who, did you know, played D1 baseball and was a scout for the Cleveland Indians before coming to the Times?
On Thursday Facebook had the third-largest I.P.O. ever. In the week leading up to it, my colleague Amanda Cox spent some time thinking about how best to explain and contextualize this offering to readers. What follows is a series of sketches from Amanda, who shared her project folder with me for this post, and Matt Ericson, who edited the piece.
The universe of initial public offerings is seemingly simple: about 2,400 tech companies since 1980, compiled by Jay Ritter, a professor of finance at the University of Florida.
As a first step, Amanda charted the companies by I.P.O. date (x-axis) and value at I.P.O. (y-axis), and colored them by their 3-year return. (The key’s not included in her sketch, but for these purposes, know that red is bad and green is good.)
This chart’s not bad (even if, like me, you have low standards), but it doesn’t say much other than that there was a dot-com boom, that most of those companies didn’t do so well, and that Facebook is worth a ton of money.
Next, a plot of 3-year return by I.P.O. date:
Trying to add more nuance to this picture, she shaded the companies by their price-to-sales ratio at I.P.O. and included Facebook in a random position just for size:
But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue. So she tried another form: a histogram of 3-year returns, colored by I.P.O. date:
Or the same chart but piled into three time periods (not that anyone asked me, but I really like this one):
By the way, even the queen bee of statistical charting screws up that chart the first time (be conservative with your “cex” values, folks):
Another idea, vaguely reminiscent of the balloons from “Up,” is sales vs. market cap at I.P.O. colored by year. I won’t lie, I don’t get this one:
Going back to time series, which many readers are more accustomed to reading and understanding, Amanda focused on one thing that always gets talked about with I.P.O.s: almost all of the companies have a bump in market cap after their first day of trading. So she charted the “trails” of companies over their first day on the market (a log scale makes equal percentage changes look the same):
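(A quick illustration of the log-scale point, with made-up numbers: a doubling covers the same vertical distance whether it starts at $1 or at $100.)

```javascript
// Three decades over 300 pixels, so every doubling is the same ~30 pixels.
var y = d3.scale.log().domain([1, 1000]).range([300, 0]);

y(1)   - y(2);   // ≈ 30.1 pixels
y(100) - y(200); // ≈ 30.1 pixels
```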
The trails felt promising, so she pursued them with sales, too. (Along with some screw-ups.) Again, full transparency here, I don’t get this one either, but since there are some screw-ups in there I think we’re safe:
At this point, there were a lot of charts made, but no clear answers about form or the best things to show. Matt Ericson, eyeing the looming deadlines, looked through Amanda’s analysis and offered a compromise of sorts, related to the histogram she had generated earlier, and suggested a slightly different form:
Which turned into this:
And, ultimately, into this:
If you’ve seen the web version, though, you know it doesn’t look like this. [Amanda thinks print graphics can be smarter than web graphics.] For one, the browser window doesn’t give us this kind of space. But the medium itself plays a part too. Online, if you’re not engaged in 10 seconds, you’re not going to stay on the page, so they needed to keep it fun. For that, Amanda and Matt got some help from three (pretty badass) colleagues: Jeremy Ashkenas, Matt Bloch and Shan Carter. Together, they made an interactive chart that stepped through a handful of the steps above, slowly explaining the dataset, with each step building on the last:
A couple major design processes are at work in this piece. First, sketching with data is massively important. Only by looking at the data in multiple forms, from different angles, did this group of visual journalists really peel back what was most interesting about it. Here, we saw histograms, crazy arrow charts, bubble charts, time series and others – all shaded with different variables. All but one, more or less, got cut.
Second, and related, is that you go with the chart you have when the deadline comes – or that you’re only as good as the last chart you threw away. (Her words, not mine.)
To be quite honest, Amanda wasn’t thrilled with her graphics that went in the paper and online. (She is always searching for The Perfect Form, whether or not it’s there.) If the I.P.O. were delayed another week, there would be another dozen charts in the trash can and maybe something else would be the last good chart. But you go to print with the charts you have, not the charts you want. So, you know, make a lot of them.
Last week the Times published its interactive electoral map. Although a medium-sized team of reporters, editors, designers and developers (including, but not limited to, Jeremy Ashkenas, Matt Ericson, Alan McLean, David Nolen and Derek Willis) had a hand in designing and building the project, Shan Carter did much of the development of the main visualization, and he agreed to let me post some of his sketches here. (I had no hand in this – I’m just the image copy-paster this evening.)
You’ll notice some similarities to past versions – there is analysis for every state and the option to share your own map. But they wanted to explore some different options this year, too. First, Shan started by making a cartogram in Illustrator, overlaid on a (pretty terrible) hand trace of the US:
And then slowly tinkering with it:
One idea was to take the geography out of the graphic completely:
Or at least minimize it further by dividing states into regions:
Another was to compare two maps side-by-side, similar to the “split screen” view of the Senate in 2008:
But no one was really super thrilled with maps as the main conduit for the analysis. Instead, they decided on minimizing the geography and using “bins” for states. (Shan has sort of been obsessed with “bins” since 2008, when his dream of having states magically fall into buckets on election night ultimately didn’t pan out. I personally had to cheer him up after that and it was not pretty.)
Anyway, an early prototype of that concept:
And how that part of the graphic ultimately looked:
If you’ve seen this piece by now, you’ll notice that they didn’t make just one decision – they expanded on a few of them in a compelling mix of interactive and linear storytelling that told a few different stories and also let you make your own and share it wherever you wanted.
It’s also a fun insight into Shan’s workflow, which is to mostly experiment directly with markup rather than with flat outputs from R or Adobe Illustrator mockups, which many of us do. (OK, technically, he tells me the cartograms, being more art than science, were hand-made in Illustrator and then their xy positions were exported to D3, but still, he’s on the record saying “mockups are for suckers.”)
Also, this was made using D3 and implemented a technique that let the graphic function properly even in Internet Explorer 8. (A sharp guy named Jim Vallandingham chronicled this in extreme detail if you’re interested in doing this sort of thing.)
One of the best things about working at a newspaper is that you can come into work and do something different every day. Yesterday I had planned on spending the day doing some longer-term work in preparation for the Olympics and generally phoning it in Friday-style when a handful of us got assigned a daily – a graphic that looked back on Mariano Rivera’s career in light of his A.C.L. injury on Thursday. I was totally going to do an insane 3D-video that analyzed his cutter, but apparently someone did that already, so we went with charts instead. I looked at saves over time of top pitchers while my colleague Tom Giratikanon, who just started this week, compared Rivera across different categories.
We had a broad idea for what we were going for, which Matt Ericson sketched out by hand:
I scraped the data for the players with the most saves from baseball-reference.com (using an old template Shan Carter made using hpricot, which I learned is now “over”), then sketched the top 250 or so in R. This only takes a couple seconds to read about, but it was in fact at least two hours of screw-ups and swearing before I saw this chart:
Which eventually turned to this (we export odd colors to pick them up easily in Adobe Illustrator):
And the final print version:
Online, we took basically the same approach, except we wanted to make them interactive, so Shan Carter pitched in some D3 expertise and Tom made his in Raphael, and six painless hours later, after all the programming, browser checking, conditional loading (which might not be a term) and Matt Ericson VPNing in from New Jersey to fix everything, we had a nice interactive, mostly mobile-friendly graphic:
Our approach wasn’t revolutionary or anything – in fact, Amanda and I used an identical charting form to chart home runs a couple years ago – but the package worked well, and if anything, Rivera stands out more in the saves chart than Barry Bonds does in the homers chart. And it was a promising start to the possibility of turning around this kind of work on deadline.
Elisabeth Bumiller’s recent profile of Jeremy Bernard, the first man and the first openly gay person to serve as White House social secretary, used an interesting dataset: a list of everyone who has attended a state dinner in the Obama administration. I don’t have a ton of experience with Styles (or with “style,” for that matter), but this was a good chance to do something different with a new section. Except not that different, since charts are pretty much the only trick.
Alicia Parlapiano and I ended up using a sort of spiral plot, which we then just joined together in Illustrator. I remembered that we had used a similar technique in one of my first graphics at the Times to visualize which countries were good at which sports. (Then, as now, Amanda did the hard stuff.) So I ported that code from ActionScript to use for this one, this time also sizing by frequency of visits.
Here’s the sketch:
And how it looked in print:
Matt Ericson and Amanda Cox helped out on a late night to make a fun interactive version, perfect for gawking at all those people who were invited instead of you.
In last Sunday’s paper, Mike McIntire and Michael Luo published their investigation into White House visits by large Democratic donors. As simple as the chart was, we pondered many complex options before publishing it.
Early on, I thought some large-scale visualization of all major donors might be interesting, so I plotted a couple hundred of the top donors (matched loosely on first and last names) with donations and White House visits on the same axes to see if there was any meaningful pattern. It looked like this:
Although it looked sort of cool (in a meaningless data-art kind of way), nothing there illuminated the real focus of the story – namely, the possibility that large donors might get more access to the White House. Really, that was my only idea, and I was being annoying and complaining about it when Amanda Cox matter-of-factly told me to make a sketch that showed the percent chance of visiting the White House based on one’s total donation size. An hour later, I had this:
We all liked it right away. Most of the remaining work went into matching the databases of donors and visitors as well as we could. That data work is important, but horribly unsexy and not really conducive to sketches. In general, we matched on middle initials where we could, and Matt Ericson helped me implement his handy Mr. People gem to get the various names parsed in a uniform fashion. Otherwise, all the data work was done in R, with a typically heavy-bordering-on-embarrassing level of assistance from Amanda.
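(Once the matching was done, the calculation behind the sketch was simple – roughly this, with made-up records and bucket cutoffs standing in for the real data:)

```javascript
// Hypothetical, already-matched records: total given and whether the donor
// appears in the visitor logs. Field names and cutoffs are placeholders.
var donors = [
  { donated: 250,    visited: false },
  { donated: 40000,  visited: true },
  { donated: 750000, visited: true }
];

var buckets = [1000, 5000, 10000, 50000, 100000, 500000]; // hypothetical cutoffs

// Percent of donors in each donation bucket who show up in the visitor logs.
var pctVisited = d3.nest()
    .key(function(d) { return d3.bisect(buckets, d.donated); })
    .rollup(function(group) {
      return 100 * d3.mean(group, function(d) { return d.visited ? 1 : 0; });
    })
    .entries(donors);
```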
Once we published, there was some discussion about the form of the chart on Twitter, and I admit it’s slightly odd. We had a lot of discussion about form on our end, too. So I present four options, each named for a delightful animal (we do a lot of animal-based filenames in the department, for some reason):
First, the “Blue Whale,” arguably the most straightforward, accessible approach. This form makes the trend the focus of the graphic:
“Polar Bear” is perhaps the best chart for a more technical audience…
…but it might mean fewer people understand it. And is it me, or do the horizontal segments look like error margins instead of donation ranges? It’s not quite a scatterplot, since the percentages plotted represent “buckets” of donation sizes rather than individual points.
A slightly different approach, the “Tree Lobster” might indeed be the most accurate representation of this dataset:
But where’s the continuity? And seriously, how boring are bar charts? Also, labeling is hard on this thing, which is not a trivial problem.
Lastly, (Dull) Giraffe:
Seriously, this one is dull and maybe not worth discussing. Or is it? Discuss. Any discussion of these forms might happen on Twitter under the hashtag #chartingSpiritAnimals until I figure out how to put comments into this site, which, let’s face it, isn’t ever going to happen.
If you’ve seen the graphic online or in print, you’ll know that we went with the Blue Whale. Aside from carrying the crucial Steve Duenes/Matt Ericson/Amanda Cox voting bloc (their decisions somehow track the majority vote 100% of the time), it felt suited to the data and the story it was published with.
(It looks fine online too, but it’s sort of stranded on its own URL.)
Finally, as a disclaimer, the data plotted in these examples is slightly different from what went into print last week, as we did some manual tweaking on a handful of names, which moved a couple percentages up or down a tiny bit.
I look forward to seeing whether any data visualizers tweet silly animal names this week. I’ll go first…
This week the graphics department published a couple graphics based on exit poll data. The first one, made mostly by Shan Carter, was similar in many respects to the one he made in 2008 to show the differences in voters supporting Barack Obama and Hillary Clinton. (Known internally as the “delightful dancing boxes.”)
This view, which focused on Mitt Romney and Rick Santorum, was perfect for capturing the differences between their supporters, but we also wanted to show the influence of the other candidates, who have won substantial numbers of delegates.
Shan addressed this with a quick sketch:
Next they tried a ternary plot (I had to look it up myself), which is apparently beloved in geology and frequently used to describe soil samples. Anyway, I came on to the project late, after the concept had been more or less decided.
First, a sketch showing how voters in a single demographic group voted in seven different states. (Groups that supported Mitt Romney are farther to the right; groups for Santorum are farther to the left; groups supporting anyone else are toward the bottom.)
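(For the curious, the underlying math is the standard ternary-coordinate conversion – a group’s three vote shares, which sum to one, map to a single point inside a triangle. A sketch of that conversion, not necessarily the exact code in the published piece:)

```javascript
// Standard ternary coordinates: Santorum at the left corner, Romney at the
// right, "other" at the third corner. In the published chart that corner
// points down, which is just a flipped y range.
function ternary(romney, santorum, other) {
  var sum = romney + santorum + other; // normalize in case of rounding
  return {
    x: (romney + other / 2) / sum,        // Romney pulls right, Santorum left
    y: Math.sqrt(3) / 2 * other / sum     // "other" pulls toward the third corner
  };
}

ternary(0.45, 0.35, 0.20); // -> { x: 0.55, y: ~0.17 }
```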
A different approach, and one we eventually went with, showed all the groups across a single state. This is for Iowa.
Then we just tried to show this as best we could. One thought was to label the biggest groups and draw lines for the shift from another state. Here’s who Michigan voters supported, with the lines emphasizing the main groups’ change from New Hampshire.
We really liked the lines in print, but once you animate the transitions you don’t really need them, since the motion has the same effect. (Plus, I didn’t know how to program the lines anyway.)
Then we just had to build the thing, which we made using D3. In Flash this would not have been hard, but in D3 it was slow going at the beginning. But we’re as good as anyone at copy/pasting from demos, so it wasn’t too long before this: