Saturday, 25 July 2015

Difference Charts with d3.js. Science vs Style

The following post is a portion of the book D3 Tips and Tricks which is free to download from Leanpub. To use this post in context, consider it with the others in the blog or just download the the book as a pdf / epub or mobi .
----------------------------------------------------------

Difference Chart: Science vs Style.

Dear readers, please forgive me for including this example in D3 Tips and Tricks. While it demonstrates a really cool graphing technique, I have chosen to apply it to a topic that has a potential to raise a couple of sets of eyebrows in the form of Messrs Roger Peng and Jeff Leek. Both work at the Johns Hopkins Bloomberg School of Public Health where Roger is an Associate Professor of Biostatistics, and Jeff is an Associate Professor of Biostatistics and Oncology.
While both are doing amazing work to improve peoples health and well-being (amongst other things), both are also authors of highly successful books published by Leanpub. In particular Roger has written R Programming for Data Science and Exploratory Data Analysis with R while Jeff has penned The Elements of Data Analytic Style. As we could anticipate, there is a possibility that there is something of a competitive elementto publishing for both gentlemen as they see the the number of downloads of their books climb ever higher.
While I would hate to promote an increase to these tensions, The opportunity was too attractive given that I had access to some data on the number of downloads that each of the books had been achieving and I really wanted to write about difference charts using d3.js (and the method of sourcing the data for the book Raspberry Pi: Measure, Record, Explore).
So at the risk of providing some form of offence to these fine gentlemen or inciting an increased rivalry, I have forged ahead and hopefully the worst that will happen is that someone interested in d3.js will also find some interesting reading in R Programming for Data Science,Exploratory Data Analysis with R or The Elements of Data Analytic Style. Ultimately we should be left with a graph that will look something like this;

Science vs Style - Daily Leanpub Book Sales

Purpose

A difference chart is a variation on a bivariate area chart. This is a line chart that includes two lines that are interlinked by filling the space between the lines. A difference chart as demonstrated in the example here by Mike Bostock is able to highlight the differences between the lines by filling the area between them with different colours depending on which line is the greater value.
As Mike points out in his example, this technique harks back at least as far as William Playfair when he was describing the time series of exports and imports of Denmark and Norway in 1786.


William Playfair’s Time Series of Exports and Imports of Denmark and Norway

All that remains is for us to work out how d3.js can help us out by doing the job programmatically. The example that I use here is based on that of Mike Bostock’s, with the addition of a few niceties in the form of a legend, a title, and some minor changes.
We will start with a simple example of the code and we will add blocks to finally arrive at the example with Legends and title.

The Code

The following is the code for the simple difference chart. A live version is available online at bl.ocks.org or GitHub. It is also available as the files ‘diff-basic.html’ and ‘downloads.csv’ as a download with the book D3 Tips and Tricks (in a zip file) when you download the book from Leanpub.


<!DOCTYPE html>
<meta charset="utf-8">
<style>

body { font: 10px sans-serif;}

text.shadow {
  stroke: white;
  stroke-width: 2px;
  opacity: 0.9;
}

.axis path,
.axis line {
  fill: none;
  stroke: #000;
  shape-rendering: crispEdges;
}

.x.axis path { display: none; }

.area.above { fill: rgb(252,141,89); }
.area.below { fill: rgb(145,207,96); }

.line {
  fill: none;
  stroke: #000;
  stroke-width: 1.5px;
}

</style>
<body>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>

var title = "Science vs Style - Daily Leanpub Book Sales";

var margin = {top: 20, right: 20, bottom: 50, left: 50},
    width = 960 - margin.left - margin.right,
    height = 500 - margin.top - margin.bottom;

var parsedtg = d3.time.format("%Y-%m-%d").parse;

var x = d3.time.scale().range([0, width]);
var y = d3.scale.linear().range([height, 0]);

var xAxis = d3.svg.axis().scale(x).orient("bottom");
var yAxis = d3.svg.axis().scale(y).orient("left");

var lineScience = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y(function(d) { return y(d["Science"]); });

var lineStyle = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y(function(d) { return y(d["Style"]); });

var area = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y1(function(d) { return y(d["Science"]); });

var svg = d3.select("body").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform",
        "translate(" + margin.left + "," + margin.top + ")");

d3.csv("downloads.csv", function(error, dataNest) {

  dataNest.forEach(function(d) {
      d.dtg = parsedtg(d.date_entered);
      d.downloaded = +d.downloaded;
  });

  var data = d3.nest()
      .key(function(d) {return d.dtg;})
      .entries(dataNest);

  data.forEach(function(d) {
      d.dtg = d.values[0]['dtg'];
      d["Science"] = d.values[0]['downloaded'];
      d["Style"] = d.values[1]['downloaded'];
  });

  for(i=data.length-1;i>0;i--) {
          data[i].Science   = data[i].Science  -data[(i-1)].Science ;
          data[i].Style     = data[i].Style    -data[(i-1)].Style ;
   }

  data.shift(); // Removes the first element in the array

    x.domain(d3.extent(data, function(d) { return d.dtg; }));
    y.domain([
//      d3.min(data, function(d) {
//          return Math.min(d["Science"], d["Style"]); }),
//      d3.max(data, function(d) {
//          return Math.max(d["Science"], d["Style"]); })
      0,1400
    ]);

    svg.datum(data);

    svg.append("clipPath")
        .attr("id", "clip-above")
      .append("path")
        .attr("d", area.y0(0));

    svg.append("clipPath")
        .attr("id", "clip-below")
      .append("path")
        .attr("d", area.y0(height));

    svg.append("path")
        .attr("class", "area above")
        .attr("clip-path", "url(#clip-above)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

    svg.append("path")
        .attr("class", "area below")
        .attr("clip-path", "url(#clip-below)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

    svg.append("path")
        .attr("class", "line")
        .style("stroke", "darkgreen")
        .attr("d", lineScience);

    svg.append("path")
        .attr("class", "line")
        .style("stroke", "red")
        .attr("d", lineStyle);

    svg.append("g")
        .attr("class", "x axis")
        .attr("transform", "translate(0," + height + ")")
        .call(xAxis);

    svg.append("g")
        .attr("class", "y axis")
        .call(yAxis);

});

</script>
</body>

A sample of the associated csv file (downloads.csv) is formatted as follows;


date_entered,downloaded,book_name
2015-04-19,5481,R Programming for Data Science
2015-04-19,23751,The Elements of Data Analytic Style
2015-04-20,5691,R Programming for Data Science
2015-04-20,23782,The Elements of Data Analytic Style
2015-04-21,6379,R Programming for Data Science
2015-04-21,23820,The Elements of Data Analytic Style
2015-04-22,7281,R Programming for Data Science
2015-04-22,23857,The Elements of Data Analytic Style
2015-04-23,7554,R Programming for Data Science
2015-04-23,23881,The Elements of Data Analytic Style
2015-04-24,9331,R Programming for Data Science
2015-04-24,23932,The Elements of Data Analytic Style

Description

The graph has some portions that are common to the simple line graph example.
We start the HTML file, load some styling for the upcoming elements, set up the margins, time formatting scales, ranges and axes.
Because the graph is composed of two lines we need to declare two separate line functions;


var lineScience = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y(function(d) { return y(d["Science"]); });

var lineStyle = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y(function(d) { return y(d["Style"]); });

To fill an area we declare an area function using one of the lines as the baseline (y1) and when it comes time to fill the area later in the script we declare y0 separately to define the area to be filled as an intersection of two paths.


var area = d3.svg.area()
    .interpolate("basis")
    .x(function(d) { return x(d.dtg); })
    .y1(function(d) { return y(d["Science"]); });

In this instance we are using the green ‘Science’ line as the y1 line.
The svg area is then set up using the height, width and margin values and we load our csv files with our number of downloads for each book. We then carry out a standard forEach to ensure that the time and numerical values are formatted correctly.
Nesting the data
The data that we are starting with is formatted in a way that we could reasonably expect data to be available in this instance where a value is saved for distinct elements on an element by element basis. This style of recording data makes it easy to add new elements into the data stream or a database rather than relying on having them as discrete columns.


date_entered,downloaded,book_name
2015-04-19,5481,R Programming for Data Science
2015-04-19,23751,The Elements of Data Analytic Style
2015-04-20,5691,R Programming for Data Science
2015-04-20,23782,The Elements of Data Analytic Style
2015-04-21,6379,R Programming for Data Science
2015-04-21,23820,The Elements of Data Analytic Style

In this case, we will need to ‘pivot’ the data to produce a multi-column representation where we have a single row for each date, and the number of downloads for each book as seperate columns as follows;


date_entered,R Programming for Data Science,The Elements of Data Analytic Sty\
le
2015-04-19,5481,23751
2015-04-20,5691,23782
2015-04-21,6379,23820

This can be achieved using the d3 nest function.


  var data = d3.nest()
      .key(function(d) {return d.dtg;})
      .entries(dataNest);

We declare our new array’s name as data and we initiate the nest function;


 var data = d3.nest()

We assign the key for our new array as dtg. A ‘key’ is like a way of saying “This is the thing we will be grouping on”. In other words our resultant array will have a single entry for each unique date (dtg) which will have the values of the number of downloaded books associated with it.


  .key(function(d) {return d.dtg;})

Then we tell the nest function which data array we will be using for our source of data.


  }).entries(dataNest);

Wrangle the data
Once we have our pivoted data we can format it in a way that will suit the code for the visualisation. This involves storing the values for the ‘Science’ and ‘Style’ variables as part of a named index.


  data.forEach(function(d) {
      d.dtg = d.values[0]['dtg'];
      d["Science"] = d.values[0]['downloaded'];
      d["Style"] = d.values[1]['downloaded'];
  });

We then loop through the ‘Science’ and ‘Style’ array to convert the incrementing value of the total number of downloads into a value of the number that have been downloaded each day;


  for(i=data.length-1;i>0;i--) {
          data[i].Science   = data[i].Science  -data[(i-1)].Science ;
          data[i].Style     = data[i].Style    -data[(i-1)].Style ;
   }

Finally because we are adjusting from total downloaded to daily values we are left with an orphan value that we need to remove from the front of the array;


  data.shift();

Cheating with the domain
The observant d3.js reader will have noticed that the setting of the y domain has a large section commented out;


    x.domain(d3.extent(data, function(d) { return d.dtg; }));
    y.domain([
//      d3.min(data, function(d) {
//          return Math.min(d["Science"], d["Style"]); }),
//      d3.max(data, function(d) {
//          return Math.max(d["Science"], d["Style"]); })
      0,1400
    ]);

That’s because I want to be able to provide an ideal way for the graph to represent the data in an appropriate range, but because we are using the basis smoothing modifier, and the data is ‘peaky’, there is a tendency for the y scale to be fairy broad and the resultant graph looks a little lost;


Using automatic range

Alternatively, we could remove the smoothing and let the true data be shown;


Using automatic range and removing the basis smoothing

It should be argued that this is a truer representation of the data, but in this case I feel comfortable sacrificing accuracy for aesthetics (what have I become?).
Therefore, the domain for the y axis is set manually to between 0 and 1400, but feel free to remove that at the point when you introduce your own data :-).
data vs datum
One small line gets its own section. That line is;


    svg.datum(data);

A casual d3.js user could be forgiven for thinking that this doesn’t seem too fearsome a line, but it has hidden depths.
As Mike Bostock explains here, if we want to bind data to elements as a group we would be *.data, but if we want to bind that data to individual elements, we should use *.datum.
It’s a function of how the data is stored. If there is an expectation that the data will be dynamic then data is the way to go since it has the feature of preparing enter and exit selections. If the data is static (it won’t be changing) then datum is the way to go.
In our case we are assigning data to individual elements and as a result we will be using datum.
Setting up the clipPaths
The clipPath operator is used to define an area that is used to create a shape by intersecting one area with another.
In our case we are going to set up two clip paths. One is the area above the green ‘Science’ line (which we defined earlier as being the y1 component of an area selection);


The ‘clip-above’ clip path

This is declared via this portion of the code;


    svg.append("clipPath")
        .attr("id", "clip-above")
      .append("path")
        .attr("d", area.y0(0));

Then we set up the clip path that will exist for the area below the green ‘Science’ line ;


The ‘clip-below’ clip path

This is declared via this portion of the code;


    svg.append("clipPath")
        .attr("id", "clip-below")
      .append("path")
        .attr("d", area.y0(height));

Each of these paths has an ‘id’ which can be subsequently used by the following code.
Clipping and adding the areas
Now we come to clipping our shape and filling it with the appropriate colour.
We do this by having a shape that represents the area between the two lines and applying our clip path for the values above and below our reference line (the green ‘Science’ line). Where the two intersect, we fill it with the appropriate colour. The code to fill the area above the reference line is as follows;


    svg.append("path")
        .attr("class", "area above")
        .attr("clip-path", "url(#clip-above)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

Here we have two lines that are defining the shape between the two science and style lines;


    svg.append("path")
        ....
        ....
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

If we were to look at the shape that this produces it would look as follows (greyed out for highlighting);


The shape between the science and style lines

We apply a class to the shape so that is filled with the colour that we want;


        .attr("class", "area above")

.. and apply the clip path so that only the areas that intersect the two shapes are filled with the appropriate colour;


        .attr("clip-path", "url(#clip-above)")

Here the intersection of those two shapes is shown as pink;


The intersection of the shapes

Then we do the same for the area below;


    svg.append("path")
        .attr("class", "area below")
        .attr("clip-path", "url(#clip-below)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

With the corresponding areas showing the intersection of the two shapes coloured differently;


The intersection of the shapes

Draw the lines and the axes
The final part of our basic difference chart is to draw in the lines over the top so that they are highlighted and to add in the axes;


    svg.append("path")
        .attr("class", "line")
        .style("stroke", "darkgreen")
        .attr("d", lineScience);

    svg.append("path")
        .attr("class", "line")
        .style("stroke", "red")
        .attr("d", lineStyle);

    svg.append("g")
        .attr("class", "x axis")
        .attr("transform", "translate(0," + height + ")")
        .call(xAxis);

    svg.append("g")
        .attr("class", "y axis")
        .call(yAxis);

Et viola! we have our difference chart!


The basic difference chart

As mentioned earlier, the code for the simple difference chart is available online at bl.ocks.org or GitHub. It is also available as the files ‘diff-basic.html’ and ‘downloads.csv’ as a download with the book D3 Tips and Tricks (in a zip file) when you download the book from Leanpub.

Adding a bit more to our difference chart.

The chart itself is a thing of beauty, but given the subject matter (it’s describing two books after all) we should include a bit more information on what it is we’re looking at and provide some links so that a fascinated viewer of the graphs can read the books!
Add a Y axis label
Because it’s not immediately obvious what we’re looking at on the Y axis we should add in a nice subtle label on the Y axis;


    svg.append("g")
        .attr("class", "y axis")
        .call(yAxis)
      .append("text")
        .attr("transform", "rotate(-90)")
        .attr("y", 6)
        .attr("dy", ".71em")
        .style("text-anchor", "end")
        .text("Daily Downloads from Leanpub");

Add a title
Every graph should have a title. The following code adds this to the top(ish) centre of the chart and provides a white drop-shadow for readability;


    // ******* Title Block ********
    svg.append("text") // Title shadow
      .attr("x", (width / 2))
      .attr("y", 50 )
      .attr("text-anchor", "middle")
      .style("font-size", "30px")
      .attr("class", "shadow")
      .text(title);

    svg.append("text") // Title
      .attr("x", (width / 2))
      .attr("y", 50 )
      .attr("text-anchor", "middle")
      .style("font-size", "30px")
      .style("stroke", "none")
      .text(title);

Adding the legend
A respectable legend in this case should provide visual context of what it is describing in relation to the graph (by way of colour) and should actually name the book. We can also go a little bit further and provide a link to the books in the legend so that potential readers can access them easily.
Firstly the rectangles filled with the right colour, sized appropriately and arranged just right;


    var block = 300;   // rectangle width and position

    svg.append("rect") // Style Legend Rectangle
      .attr("x", ((width / 2)/2)-(block/2))
      .attr("y", height+(margin.bottom/2) )
      .attr("width", block)
      .attr("height", "25")
      .attr("class", "area above");

    svg.append("rect") // Science Legend Rectangle
      .attr("x", ((width / 2)/2)+(width / 2)-(block/2))
      .attr("y", height+(margin.bottom/2) )
      .attr("width", block)
      .attr("height", "25")
      .attr("class", "area below");

Then we add the text (with a drop-shadow) and a link;


    svg.append("text") // Style Legend Text shadow
      .attr("x", ((width / 2)/2))
      .attr("y", height+(margin.bottom/2) + 5)
      .attr("dy", ".71em")
      .attr("text-anchor", "middle")
      .style("font-size", "18px")
      .attr("class", "shadow")
      .text("The Elements of Data Analytic Style");

    svg.append("text") // Science Legend Text shadow
      .attr("x", ((width / 2)/2)+(width / 2))
      .attr("y", height+(margin.bottom/2) + 5)
      .attr("dy", ".71em")
      .attr("text-anchor", "middle")
      .style("font-size", "18px")
      .attr("class", "shadow")
      .text("R Programming for Data Science");

    svg.append("a")
      .attr("xlink:href", "https://leanpub.com/datastyle")
      .append("text") // Style Legend Text
      .attr("x", ((width / 2)/2))
      .attr("y", height+(margin.bottom/2) + 5)
      .attr("dy", ".71em")
      .attr("text-anchor", "middle")
      .style("font-size", "18px")
      .style("stroke", "none")
      .text("The Elements of Data Analytic Style");

    svg.append("a")
      .attr("xlink:href", "https://leanpub.com/rprogramming")
      .append("text") // Science Legend Text
      .attr("x", ((width / 2)/2)+(width / 2))
      .attr("y", height+(margin.bottom/2) + 5)
      .attr("dy", ".71em")
      .attr("text-anchor", "middle")
      .style("font-size", "18px")
      .style("stroke", "none")
      .text("R Programming for Data Science");

I’ll be the first to admit that this could be done more efficiently with some styling via css, but then it would leave nothing for the reader to try :-).
As a last touch we can include the links to the respective books in the shading for the graph itself;


    svg.append("a")
      .attr("xlink:href", "https://leanpub.com/datastyle")
      .append("path")
        .attr("class", "area above")
        .attr("clip-path", "url(#clip-above)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

    svg.append("a")
      .attr("xlink:href", "https://leanpub.com/rprogramming")
      .append("path")
        .attr("class", "area below")
        .attr("clip-path", "url(#clip-below)")
        .attr("d", area.y0(function(d) { return y(d["Style"]); }));

Perhaps not strictly required, but a nice touch none the less.
The final result
And here it is;


The full difference chart

The code for the full difference chart is available online at bl.ocks.org or GitHub. It is also available as the files ‘diff-full.html’ and ‘downloads.csv’ as a download with the book D3 Tips and Tricks (in a zip file) when you download the book from Leanpub.
The description above (and heaps of other stuff) is in the D3 Tips and Tricks book that can be downloaded for free (or donate if you really want to :-)).

No comments:

Post a Comment