Sunday, 5 April 2015

Exploring Event Data by Combination Scatter Plot and Interactive Line Graphs

Purpose

While implementing a method of measuring and displaying the passage of a cat through a cat-door (as described in the book ‘Raspberry Pi: Measure, Record, Explore’) I built a graph that showed events with date and time on separate axes. It struck me that this would be useful for exploring event data, or any data that exists as a series of date/time stamps signifying that a particular ‘thing’ has occurred. In the cat door example it was the use of the door by the cat, but the approach applies to a huge range of data sets.
One that I thought of straight away was the dates and times that people downloaded the book D3 Tips and Tricks. Leanpub has an API for accessing the history of book activity and I was able to download it and store it in a database for examination.
Ultimately what I developed was a scatter plot that shows the date of the events on the X axis and the time of the events on the Y axis. This was augmented by two line graphs that showed the accumulated sums of each axis on their respective sides.
Data Event Exploration
The full code for this example is available online at bl.ocks.org or GitHub. It is also available as the files ‘book-downloads.html’ and ‘downloads.zip’ as a download with the book D3 Tips and Tricks (in a zip file) when you download the book from Leanpub. The ‘downloads.zip’ file contains downloads.json, which is zipped up because it would otherwise be a bit too large for Leanpub. For the ideal viewing experience, check it out in full screen mode.

There is also a separate blog post describing the information that I learned from looking at the data here.
To make the information slightly more accessible, when the user hovers their mouse over the scatter plot the pointer position is projected across to the other graphs to show how they relate, and the appropriate values of date, time and number downloaded by date and time are displayed.
This graph is a relatively complex combination of a range of different techniques presented in the book, including wrangling and nesting of data, combination of multiple graphs and the use of mouse movement to display tool-tips and additional data.

The Code

The code is extremely lengthy, so in lieu of placing it in the book it can be found on bl.ocks.org or GitHub. It is liberally commented to assist readers, and I will describe particular sections of the code below, which will hopefully help where required.
Wrangling the data
The graph uses four sets of data.
  1. The raw event data (an array called events)
  2. The scatter plot data (an array called data)
  3. The date graph data (an array called dataDate)
  4. The time graph data (an array called dataTime)
The raw event data is ingested from an external JSON file using the standard d3.json call.
The data itself is simply a collection of dates.
{"dtg":"2013-01-24 09:10:59"},
{"dtg":"2013-01-24 09:17:37"},
{"dtg":"2013-01-24 09:48:48"},
{"dtg":"2013-01-24 15:01:59"},
{"dtg":"2013-01-24 18:11:44"},
{"dtg":"2013-01-24 18:47:05"},
{"dtg":"2013-01-24 18:47:23"},
{"dtg":"2013-01-24 19:55:53"},
{"dtg":"2013-01-24 22:37:39"},
{"dtg":"2013-01-25 01:22:48"},
{"dtg":"2013-01-25 06:37:38"},
{"dtg":"2013-01-25 08:28:20"},
Each date represents the time that a book was downloaded.
Once loaded we run a forEach over the file to put it in a format for manipulation into the remaining three data sets.
    // parse and format all the event data
    events.forEach(function(d) {
        d.dtg = d.dtg.slice(0,-4)+'0:00';  // get the 10 minute block
        var dtgSplit = d.dtg.split(" ");   // split on the space
        d.date = dtgSplit[0];              // get the date separately
        d.time = dtgSplit[1];              // get the time separately
        d.number_downloaded = 1;           // Number of downloads
    });
The first thing we do is to slice off the last four characters of the dtg string and replace them with 0:00. This leaves us with a set of dtg values that are only represented by the 10 minute window in which they were downloaded.
We then split the dtg string on the space that separates the date and the time and we designate one half date and the other half time.
Lastly we represent the number of books downloaded for each event as 1 (this helps us sum them up later).
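As a small worked example of the steps above, here is the same transformation applied to a single record. The helper name toTenMinuteBlock and the sample object are my own for illustration and do not appear in the original code.

```javascript
// Hypothetical helper (not in the original code): reduce a
// "YYYY-MM-DD HH:MM:SS" string to the start of its 10 minute block.
function toTenMinuteBlock(dtg) {
    // drop the last four characters (e.g. "7:23" in "18:47:23") and
    // append "0:00" so that only the tens-of-minutes digit survives
    return dtg.slice(0, -4) + "0:00";
}

var sample = { dtg: "2013-01-24 18:47:23" };
sample.dtg = toTenMinuteBlock(sample.dtg);   // "2013-01-24 18:40:00"

var parts = sample.dtg.split(" ");           // split on the space
sample.date = parts[0];                      // "2013-01-24"
sample.time = parts[1];                      // "18:40:00"
sample.number_downloaded = 1;                // one download per event
```

Note that this relies on every dtg string having exactly the same fixed-width layout, which is the case for the data shown above.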
Using the events data we create the data-set for the scatter plot (data) by nesting the information on the 10 minute dtg value of date/time and by summing the number of downloads;
    var data = d3.nest()
        .key(function(d) { return d.dtg;})
        .rollup(function(d) {
            return d3.sum(d,function(g) {return g.number_downloaded; });
            })
        .entries(events);
We carry out a similar process for the date…
    var dataDate = d3.nest()
        .key(function(d) { return d.date;})
        .rollup(function(d) {
            return d3.sum(d,function(g) {return g.number_downloaded; });
            })
        .entries(events);
… and the time;
    var dataTime = d3.nest()
        .key(function(d) { return d.time;})
        .sortKeys(d3.ascending)
        .rollup(function(d) {
            return d3.sum(d,function(g) {return g.number_downloaded; });
            })
        .entries(events);
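If the nesting is hard to picture, the following plain JavaScript sketch does roughly the same ‘group by key and sum’ job without D3. The function countByKey and the sample records are hypothetical. The {key, values} shape of the result is my understanding of what version 3 of d3.nest returns from entries() when a rollup is used, so treat that as an assumption and check it against your own D3 version.

```javascript
// Plain JavaScript sketch of a "group by key and sum" step.
// countByKey is a hypothetical helper, not a D3 function.
function countByKey(events, keyName) {
    var totals = {};   // key -> running total of downloads
    var order = [];    // keys in the order they were first seen
    events.forEach(function(d) {
        var key = d[keyName];
        if (!totals.hasOwnProperty(key)) {
            totals[key] = 0;
            order.push(key);
        }
        totals[key] += d.number_downloaded;
    });
    // one object per distinct key, holding the summed downloads
    return order.map(function(key) {
        return { key: key, values: totals[key] };
    });
}

// Made-up sample records in the format produced by the forEach above
var sampleEvents = [
    { dtg: "2013-01-24 18:40:00", date: "2013-01-24", time: "18:40:00", number_downloaded: 1 },
    { dtg: "2013-01-24 18:40:00", date: "2013-01-24", time: "18:40:00", number_downloaded: 1 },
    { dtg: "2013-01-25 01:20:00", date: "2013-01-25", time: "01:20:00", number_downloaded: 1 }
];

var byBlock = countByKey(sampleEvents, "dtg");   // two blocks: totals 2 and 1
var byDate = countByKey(sampleEvents, "date");   // two dates: totals 2 and 1
```

The time version in the original additionally sorts its keys with sortKeys(d3.ascending), which this sketch leaves out.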
Sizing Everything Up
The size of the graph is determined by a number of fixed variables which are fairly self-explanatory;
  • scatterplotHeight (which is also the height of the time graph)
  • dateGraphHeight
  • timeGraphWidth
But we need to let the width of the scatter plot (and the date graph) be a function of the number of days that have been collected. This variable is handled by;
  • scatterplotWidth
This set-up is handled in the following block of code;
    var oneDay = 24*60*60*1000; // hours*minutes*seconds*milliseconds
    var dateStart = d3.min(data, function(d) { return d.date; });
    var dateFinish = d3.max(data, function(d) { return d.date; });
    var numberDays = Math.round(Math.abs((dateStart.getTime() -
                               dateFinish.getTime())/(oneDay)));

    var margin = {top: 20, right: 20, bottom: 20, left: 50},
        scatterplotHeight = 520,
        scatterplotWidth = numberDays * 1.5,
        dateGraphHeight = 220,
        timeGraphWidth = 220;
The overall size of the graphic (height and width) is therefore a combination of these variables;
    var height = scatterplotHeight + dateGraphHeight,
        width = scatterplotWidth + timeGraphWidth;
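To get a feel for the numbers, here is a stand-alone sketch of the same arithmetic. The helper daysBetween and the two example dates are assumptions chosen purely for illustration (the real code takes its start and finish dates from the data), and the dates are built with Date.UTC so that daylight saving changes cannot affect the result.

```javascript
var oneDay = 24 * 60 * 60 * 1000; // hours*minutes*seconds*milliseconds

// Hypothetical helper: the number of whole days between two Date objects.
function daysBetween(start, finish) {
    return Math.round(Math.abs(start.getTime() - finish.getTime()) / oneDay);
}

// Example dates for illustration only (months are zero-based)
var exampleStart = new Date(Date.UTC(2013, 0, 24));   // 24 January 2013
var exampleFinish = new Date(Date.UTC(2015, 3, 5));   // 5 April 2015

var exampleDays = daysBetween(exampleStart, exampleFinish);

// Same sizing rules as above: 1.5 pixels per day for the scatter plot,
// plus the fixed sizes of the time graph and the date graph
var exampleScatterWidth = exampleDays * 1.5;
var exampleWidth = exampleScatterWidth + 220;   // + timeGraphWidth
var exampleHeight = 520 + 220;                  // scatterplotHeight + dateGraphHeight
```

The point to notice is that the graphic keeps getting wider as more days of data are collected, while its height stays fixed.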
The Scatter Plot
There is no real surprise with the scatter plot itself. The only thing slightly unusual is the use of a time scale for both the X and Y axes;
    var x = d3.time.scale().range([0, scatterplotWidth]);
    var y = d3.time.scale().range([0, scatterplotHeight]);
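The domains of the two scales are not shown in this excerpt. For the Y axis, one way to picture what the scale does is as simple proportional arithmetic over a single day. The sketch below is not D3 code and timeOfDayToPixels is a hypothetical helper. It assumes midnight sits at the top of the plot (pixel 0) and the end of the day at the bottom.

```javascript
// Hypothetical helper: map an "HH:MM:SS" string onto the pixel range
// [0, height], with midnight at 0 and the end of the day at height.
function timeOfDayToPixels(time, height) {
    var parts = time.split(":").map(Number);            // [hours, minutes, seconds]
    var seconds = parts[0] * 3600 + parts[1] * 60 + parts[2];
    var secondsPerDay = 24 * 60 * 60;
    return (seconds / secondsPerDay) * height;
}

var middayY = timeOfDayToPixels("12:00:00", 520);   // half way down a 520 pixel plot
var eveningY = timeOfDayToPixels("18:00:00", 520);  // three quarters of the way down
```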
When the circles are drawn, the size of the circle is determined by the radius, which is the number of downloads multiplied by 1.5. I know that this is a bit of a visualization ‘no-no’ because the area of the circle should be representative of the number, not the radius, but I tried it both ways and to my simple way of viewing the data, the radius adjustment provided the best comparison.
    svg.selectAll(".dot")
        .data(data)
      .enter().append("circle")
        .attr("class", "dot")
        .attr("r", function(d) { return d.number_downloaded*1.5; })
        .style("opacity", 0.3)
        .style("fill", "#e31a1c" )
        .attr("cx", function(d) { return x(d.date); })
        .attr("cy", function(d) { return y(d.time); }); 
I know that this is a topic of some academic debate, and it is fascinating, so here are both results for comparison;
Circle Area Representing Downloads
Circle Radius Representing Downloads
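For anyone wanting to try both encodings, the difference comes down to a single square root. The function names below are hypothetical, and reusing the 1.5 multiplier for the area version is my assumption rather than something taken from the original code.

```javascript
var scaleFactor = 1.5;   // multiplier used for the radius version above

// Radius proportional to the count (the approach used in this graph)
function radiusFromCount(n) {
    return n * scaleFactor;
}

// Area proportional to the count: area = pi * r * r, so the radius
// has to grow with the square root of the count
function radiusFromArea(n) {
    return Math.sqrt(n) * scaleFactor;
}

var r4count = radiusFromCount(4);   // four downloads, radius encoding
var r4area = radiusFromArea(4);     // four downloads, area encoding
```

With the radius encoding, four downloads produce a circle with sixteen times the area of a single download. With the area encoding the circle has four times the area, which is why it is usually regarded as the more faithful option even though it makes differences harder to spot.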
Date and Time Graphs
Both of these graphs are fairly routine. The time graph has the X and Y axes reversed from what would be ordinarily expected, but otherwise not much else to write home about.
Mouse Movement Information Display
This portion of the graph is an expansion of the ‘Favorite tool tip’ method from the previous section in this chapter. We expand the number of dynamically updated elements to about 10, each of which is designated with its own class.
We append the rectangle to capture the mouse movement over the scatter plot;
    svg.append("rect")
        .attr("width", scatterplotWidth)
        .attr("height", scatterplotHeight)
        .style("fill", "none")
        .style("pointer-events", "all")
        .on("mouseover", function() { focus.style("display", null); })
        .on("mouseout", function() { focus.style("display", "none"); })
        .on("mousemove", mousemove);
We capture the position of the mouse and convert it to figures we can use to compare to our data;
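    function mousemove() {
        var xpos = d3.mouse(this)[0],    // mouse x position in pixels
            x0 = x.invert(xpos),         // the date under the mouse
            y0 = d3.mouse(this)[1],      // mouse y position in pixels
            y1 = y.invert(y0),           // the time under the mouse
            date1 = d3.mouse(this)[0];
        // ... the focus.select statements follow here (see the full code)
    }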
And then we place our dynamic text and lines with our focus.select statements.
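The invert calls are simply the scale run backwards, from pixels to a value. As a rough illustration, here is a plain JavaScript sketch that turns a vertical pixel position into the label of the 10 minute block under the pointer. All of the function names are hypothetical, the one day midnight-to-midnight domain is an assumption, and snapping to the 10 minute block is my guess at how a pointer position could be matched to the nested data rather than a description of the original code.

```javascript
// Hypothetical stand-in for scale.invert on a linear scale whose
// range starts at 0: pixel position -> value in the domain
function invertLinear(pixel, rangeMax, domainMin, domainMax) {
    return domainMin + (pixel / rangeMax) * (domainMax - domainMin);
}

// Left-pad a number to two digits, e.g. 6 -> "06"
function pad(n) {
    return (n < 10 ? "0" : "") + n;
}

// Convert a y pixel position into the "HH:MM:00" label of the
// 10 minute block under the pointer
function pixelToTimeBlock(ypos, height) {
    var secondsPerDay = 24 * 60 * 60;
    var seconds = invertLinear(ypos, height, 0, secondsPerDay);
    var block = Math.floor(seconds / 600) * 600;       // round down to 10 minutes
    block = Math.min(block, secondsPerDay - 600);      // keep the bottom edge in range
    var hours = Math.floor(block / 3600);
    var minutes = Math.floor((block % 3600) / 60);
    return pad(hours) + ":" + pad(minutes) + ":00";
}

var label = pixelToTimeBlock(260, 520);   // pointer half way down a 520 pixel plot
```

A label in this form could then be compared against the key values of the time data to find the matching total.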
Labeling
The last order of business is to place some labels.
The location of labeling in this example is an interesting problem in itself. I’m personally torn between the desire to maintain simplicity and to ensure clarity. Hopefully what I have is enough to satisfy both requirements, but as always, each user and requirement will differ, so label as desired.
If there are additional parts of the code that you would like explained, please feel free to get in touch.

3 comments:

  1. thank you! really useful stuff.

  2. Hi!

    This is a great visualisation!

    I'd like to use your visualization in an OSS project (https://github.com/lwindolf/polscan, project is GPLv3+, most JS is MIT licensed) and would like to know if you consider this visualization as open source? If yes, what license do you prefer?

    With Best Regards
    Lars

    1. Hi Lars. Yes I would consider it as open source and I would go with an MIT licence. I have annotated the visualisation on GitHub https://gist.github.com/d3noob/a0cbcddc6bf0eb9569fe. Enjoy
