Andrew's Data Blog

creatively titled, I know

dc.js and crossfilter

Posted at — Jun 8, 2019

I find dc.js visuals more efficient in the exploratory phase than writing lots of SQL and R. Once you’ve built a couple, they’re easy to replicate, and they can also be a perfect tool for presenting to others. I personally use this webpage as a starting template for new projects, but it’s a tutorial as well. Instead of following the “flow of data” from reading it in to placing a graph, I’ll start with the end in mind and work backward from there, at least as much as possible. For a forward-looking version, the annotated examples on the main dc.js site are good.

We’ll get the boilerplate up in step 0, since boilerplate comes before you start using the actual tools. After that, the steps come in a different order than you’d follow on an actual project.

The steps are roughly:

0/6. get data, make the boilerplate html doc, and load libraries
6/6. div for graph
1/6. read data (and create javascript section)
5/6. make graph variable with noncreative names
2/6. clean data
3/6. make dimensions
4/6. make the graph

I have a lot of repetition of code. Seeing what you have, what you’ll add, adding it yourself, and then seeing it added should provide the repetition needed to learn it quickly.

I’ll put the html we’ve done so far with the section that we’ll add as:

////////////////////////
// new code goes here //
////////////////////////

Followed by the code that goes there. The next section will have the code pasted in with a new section. Start with the boilerplate html, and then copy in the new section of code into your html document. d3.js and dc.js have a lot of copy paste since it needs to connect items from different areas. Might as well get some practice at it.

A side note, I like using Brackets with the livereload feature, but any workflow you can iterate with will work.

Step 0/6.

In step 0, we’ll take care of some prerequisites for the tutorial (though getting a cleaned dataset is usually the hardest). We’ll take the ubiquitous mtcars data set and write it as a csv. I typically use a csv since it’s so easy to handle at all stages of the process. I’ll often grab the data using SQL or an api, preprocess it using R, and then write it as a csv, ready for js to get a hold of it.

data(mtcars)
write.csv(mtcars, file = 'mtcars.csv', row.names = FALSE)

The simplest place to put the file is in the same folder as your html.

Step 0.5/6.

Next is boilerplate html.

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

</head>
<body>

</body>
</html>

It’s familiar to anyone who’s written html. If you haven’t, there it is. We’ll place each step into this to build the entire viz.

Step 0.75/6.

Step 0.75 is horribly named. I’m trying to gloss over a lot of stuff that isn’t the “meat” of the problem. Anyway, let’s load all the code others wrote to make this easy for us! This includes js libraries and related css files.

Overview,

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    ////////////////////////////////
    // import your libraries here //
    ////////////////////////////////
    
</head>
<body>

</body>
</html>

These should work fine with this tutorial, but you’ll want to use newer versions when they come out. I don’t like managing files for my smaller projects, so I pull them from a third party. There are other ways of handling this fyi.

<!-- load all the libraries and related styles -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.12/crossfilter.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/dc/3.0.9/dc.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/dc/3.0.9/dc.min.css" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/datatables/1.10.19/js/jquery.dataTables.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/datatables/1.10.19/js/jquery.dataTables.min.css" />

That’s it for that. Since this takes up so much space both height and width, I’m going to replace it in the rest of the examples with a comment and a single example. The final version will have the full code so it works. If you do an intermediate step, you must add these back in! I just don’t want to see this boilerplate stuff every time.

On to the first step!

6/6. div for graph

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>

</head>
<body>

    ////////////////////////////////////////
    // the viz html scaffolding goes here //
    ////////////////////////////////////////
    
</body>
</html>

The html scaffolding can get as complex as you want, just like a regular webpage. Use jquery, nested divs, whatever. Here we’ll keep it simple. You can put text in it, wrap it in another div or 3, whatever you need.

<div id="carbpie" class="pie"></div>

This is the final destination for a single graph generated from dc.js. Once it works, you can always rearrange later. It will be tied to the carb data in mtcars and drawn as a pie, so we’ll call it “carbpie”.

1/6. read data (and create javascript section)

Next up is to read the data and set up the boilerplate javascript code.

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
    
</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    ///////////////////////////////////////
    // read data in a javascript section //
    ///////////////////////////////////////
    
</body>
</html>

We’ll start a section for javascript and also put the d3.js read function here.

<script type = 'text/javascript'>

'use strict';

console.log(window.location.pathname);

d3.csv("mtcars.csv").then(function (cars) {

    // print one row of data to the console (cars[0] is the first row, cars[1] the second); it helps debugging so much
    console.log(cars[1]);
    
    // don't forget this. this actually draws/renders the graphs
    dc.renderAll()

})
</script>

“use strict” means you can’t write super sloppy code. console.log writes notes to yourself that show up in the developer console of a modern browser like Chrome. window.location.pathname tells you where your viz is looking from when it tries to find your data. If your setup is a bit complicated, this can help you solve a missing data problem: if your data isn’t in the same folder as your html, you might need a longer path like “../data/mtcars.csv” to load it correctly.

d3.csv(“mtcars.csv”).then(function (cars) {}) is the d3 function that reads your data from a csv file and makes it ready for use. The rest of your code goes inside the {} so that the data is read before anything tries to use it. Otherwise problems abound.

I like using console.log(cars[1]) to expose a row of data in your browser console (cars[0] is the first row, cars[1] the second). This helps in prepping the data in the next step. Lastly, don’t forget “dc.renderAll()”. If your graph won’t show up and there isn’t a javascript error in your browser, you probably forgot to render your graphs.
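To make the path business concrete, here’s a minimal sketch of how a relative path like “mtcars.csv” gets resolved against the page’s address. The localhost address and folder names are made up for illustration; the resolving itself is done by the standard URL constructor, which follows the same rules the browser uses.

```javascript
// Hypothetical page address; substitute whatever window.location.href shows for you.
const pageUrl = "http://localhost:8000/viz/index.html";

// A bare file name resolves to the same folder as the html page.
const sameFolder = new URL("mtcars.csv", pageUrl);
console.log(sameFolder.pathname); // "/viz/mtcars.csv"

// "../data/" climbs one folder up, then goes into data/.
const dataFolder = new URL("../data/mtcars.csv", pageUrl);
console.log(dataFolder.pathname); // "/data/mtcars.csv"
```

If the path printed here doesn’t match where the csv actually lives, that’s your missing data problem.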

5/6. make graph variable with noncreative names

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>

</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        ////////////////////////////////////
        // make your dc.js graph variable //
        ////////////////////////////////////
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

dc.js graph variables are easy to make, and you’ll need one for each graph. You’ll need to track code between html and javascript, so make these as clear and noncreative as possible. Some combination of what it shows and what type it is often works, e.g. revenuebar or customertypepie.

We’ll use “carbpie” for now to match the id we gave the div in the html scaffolding.

var carbpie = dc.pieChart('#carbpie');

When developing, I’ll often add these as I make the actual graph. I’ve seen some people put this near the top in a grouping, but I often put them right above the graph so that I can change it more easily.

2/6. clean data

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
    
</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        //////////////////////////
        // clean your data here //
        //////////////////////////
        
        var carbpie = dc.pieChart('#carbpie');
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

Next we’ll fix up an artifact from reading the data in using d3. It reads every value in as text, so we need to convert the numeric columns to numbers. It’s easy using the unary “+” operator.

cars.forEach(function(d) {
    d.carb = +d.carb;    // overwrite the string with a number
    d.cylnum = +d.cyl;   // new numeric field; d.cyl stays a string
});

console.log(cars[1]);

I added a second variable, cylnum, to show how to do multiple at once, and to show that you can create variables. We overwrote the carb string variable with a carb numeric variable, and we now have a numeric version of cyl named cylnum while still retaining the original string variable cyl.
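Here’s a small sketch of what the “+” is doing, using a made-up row in the shape d3.csv hands back (every value a string). The row literal and its note field are assumptions for illustration only. The NaN case is worth knowing about because it fails silently.

```javascript
// A hypothetical row as d3.csv would hand it over: all values are strings.
const row = { mpg: "21", cyl: "6", carb: "4", note: "n/a" };

// Unary plus converts a numeric string to a number.
row.carb = +row.carb;    // overwrite the string with a number
row.cylnum = +row.cyl;   // new numeric field; row.cyl stays a string

console.log(typeof row.carb, typeof row.cyl, typeof row.cylnum);
// number string number

// Non-numeric text becomes NaN and an empty string becomes 0,
// so check for both if your csv has blanks or placeholders.
console.log(Number.isNaN(+row.note)); // true
console.log(+"");                     // 0
```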

You can do some transformation here, but ideally you clean it up in the original dataset since processing here will slow down your load time.

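The second console.log(cars[1]); allows you to see a before and after of the data cleaning. In the first log the carb variable shows as a string, carb: “4”, and in the second it shows as a number, carb: 4, alongside a new numeric cylnum. That means it worked!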

3/6. make dimensions

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
   
</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        cars.forEach(function(d) {
            d.carb = +d.carb;
            d.cylnum = +d.cyl;
        });

        console.log(cars[1]);
        
        ///////////////////////////////
        // make your dimensions here //
        ///////////////////////////////
        
        var carbpie = dc.pieChart('#carbpie');
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

Here comes the magic that is crossfilter.js. You want your graphs to connect! But you have to tell it how to connect, and it can get complicated if you need it. Here it’s simple.

Make the crossfilter object that ties everything together, which we’ll call ndx, and then tell it which variables are dimensions. Each dimension is created from ndx, and that is what makes the connections.

var ndx = crossfilter(cars);

var carbdim = ndx.dimension(function (d) {return d.carb});

At a high level, a dimension is a filter you can refer to, and the crossfilter tracks which rows are currently filtered or not. Put as many of these as you need.
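To get a feel for what “tracks which rows are currently filtered” buys you, here’s a plain javascript stand-in. This is not how crossfilter is implemented (it is much smarter about it), and countBy is a hypothetical helper I made up for the sketch. The rows are a small sample in the shape of the cleaned data.

```javascript
// A few rows in the shape of the cleaned data (sample values for illustration).
const rows = [
  { carb: 4, mpg: 21.0 },
  { carb: 1, mpg: 22.8 },
  { carb: 2, mpg: 18.7 },
  { carb: 4, mpg: 14.3 },
  { carb: 4, mpg: 10.4 },
];

// Hypothetical helper: count rows per key, like the slices of a pie chart.
function countBy(data, keyFn) {
  const counts = {};
  data.forEach(function (d) {
    const key = keyFn(d);
    counts[key] = (counts[key] || 0) + 1;
  });
  return counts;
}

// No filter: every row counts toward the carb slices.
const allCarb = countBy(rows, function (d) { return d.carb; });
console.log(allCarb); // carb 1: 1 row, carb 2: 1 row, carb 4: 3 rows

// "Brush" the mpg dimension to 10 <= mpg < 15 and the carb counts follow.
const lowMpg = rows.filter(function (d) { return d.mpg >= 10 && d.mpg < 15; });
const lowMpgCarb = countBy(lowMpg, function (d) { return d.carb; });
console.log(lowMpgCarb); // only carb 4 remains, with 2 rows
```

One difference from this sketch: in crossfilter, a group ignores the filter on its own dimension, which is why a chart you click on keeps showing all of its slices.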

4/6. make the graph

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
   
</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        cars.forEach(function(d) {
            d.carb = +d.carb;
            d.cylnum = +d.cyl;
        });

        console.log(cars[1]);
        
        var ndx = crossfilter(cars);

        var carbdim = ndx.dimension(function (d) {return d.carb});
        
        var carbpie = dc.pieChart('#carbpie');
        
        //////////////////////
        // make your graph! //
        //////////////////////
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

Finally we can write out how we want the graph to look!

carbpie
  .width(120)
  .height(120)
  .radius(60)
  .innerRadius(0)
  .dimension(carbdim)
  .group(carbdim.group());

It reads more like a declarative language such as SQL than a procedural one. It takes the graph variable we made earlier and adds specific descriptions. Most are straightforward, and you can refer to the dc.js API reference for more options.

We have a number of size and visual options. Height and width (in pixels) are about what you’d expect. Radius and innerRadius are specific to pie graphs, which is what we asked for with “dc.pieChart(‘#carbpie’)” earlier. Setting innerRadius above 0 is how you make a donut instead of a pie.

Next we have two items that are in nearly every graph: dimension and group.

Dimension must refer to one of the dimensions you specified earlier. We made carbdim for this carbpie, so we’ll use that. Now your viz knows that clicking on this pie chart will filter the other charts by the function in carbdim, which is just the carb variable.

And last we have group. This is the most basic version, since this is a basic example: take the dimension you just used and make it a “group” on the spot by calling .group(), which counts the rows for each distinct value. Note that .group() is a method call, so it needs the parentheses.
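If it helps to picture what the chart receives from a default group, here’s a plain javascript stand-in that produces the same kind of result: one {key, value} entry per distinct value, where value is a row count. countGroup is a hypothetical helper for this sketch, not part of crossfilter, and the carb values are a small sample.

```javascript
// Sample carb values for illustration (not the full data set).
const carbValues = [4, 4, 1, 1, 2, 1, 4];

// Hypothetical stand-in for a default (count) group:
// one {key, value} entry per distinct key, sorted by key.
function countGroup(values) {
  const counts = new Map();
  values.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });
  return Array.from(counts, function (entry) {
    return { key: entry[0], value: entry[1] };
  }).sort(function (a, b) { return a.key - b.key; });
}

const slices = countGroup(carbValues);
console.log(slices);
// three entries: key 1 with value 3, key 2 with value 1, key 4 with value 3
```

Each entry becomes one pie slice: the key is the label and the value sets the size.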

It’s possible to make groups beforehand (probably named carbsumgroup or the like). You would do that to make a custom aggregation: instead of a count, which is the default, you could do a sum to combine partially preaggregated data, which is smart for performance reasons.
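Here’s a rough sketch of why the sum matters for preaggregated data, again as a plain javascript stand-in rather than crossfilter itself. The column n and the helper sumGroup are assumptions for illustration. In crossfilter, the equivalent would be a group built ahead of time with reduceSum, something like carbdim.group().reduceSum(function (d) {return d.n}).

```javascript
// Hypothetical preaggregated rows: one row per (carb, cyl) pair, with a count column n.
const preagg = [
  { carb: 1, cyl: 4, n: 5 },
  { carb: 1, cyl: 6, n: 2 },
  { carb: 2, cyl: 4, n: 6 },
  { carb: 2, cyl: 8, n: 4 },
];

// Hypothetical stand-in for a sum group: add up valueFn(d) for each key.
function sumGroup(data, keyFn, valueFn) {
  const totals = new Map();
  data.forEach(function (d) {
    const key = keyFn(d);
    totals.set(key, (totals.get(key) || 0) + valueFn(d));
  });
  return Array.from(totals, function (entry) {
    return { key: entry[0], value: entry[1] };
  });
}

const carbTotals = sumGroup(
  preagg,
  function (d) { return d.carb; },
  function (d) { return d.n; }
);
console.log(carbTotals);
// carb 1 totals 7 and carb 2 totals 10; a plain row count would report 2 and 2
```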

Final code for a single graph

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
    
</head>
<body>

    <div id="carbpie" class="pie"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        cars.forEach(function(d) {
            d.carb = +d.carb;
            d.cylnum = +d.cyl;
        });

        console.log(cars[1]);
        
        var ndx = crossfilter(cars);

        var carbdim = ndx.dimension(function (d) {return d.carb});
        
        var carbpie = dc.pieChart('#carbpie');
        
        carbpie
          .width(120)
          .height(120)
          .radius(60)
          .innerRadius(0)
          .dimension(carbdim)
          .group(carbdim.group());
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

This has all the pieces for a single graph. Anything more is more advanced or just more graphs. However the entire purpose of dc.js (in my view) is to gain synergy between exploring different variables in different graphs.

7/6. Add a second graph

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>

</head>
<body>

    <div id="carbpie" class="pie"></div>
    ////////////////////
    // add an mpg div //
    ////////////////////
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        cars.forEach(function(d) {
            d.carb = +d.carb;
            d.cylnum = +d.cyl;
            ///////////////
            // clean mpg //
            ///////////////
        })

        console.log(cars[1]);
        
        var ndx = crossfilter(cars);

        var carbdim = ndx.dimension(function (d) {return d.carb});
        ////////////////////
        // add an mpg dim //
        ////////////////////
        
        var carbpie = dc.pieChart('#carbpie');
        carbpie
          .width(120)
          .height(120)
          .radius(60)
          .innerRadius(0)
          .dimension(carbdim)
          .group(carbdim.group());
        
        ///////////////////////////////////
        // add an mpg variable and chart //
        ///////////////////////////////////
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

So let’s add a bar graph for mpg. Thinking ahead, this will allow us to click on a carburetor count in the pie and learn about the mpg distribution and not just an average. We can also select an mpg range and see the proportion of each carburetor count that makes it up.

This is the part that will happen iteratively since we already have the boilerplate stuff done. We add the html scaffolding, clean up the variable, add the dim, and add the graph variable and graph. You can usually copy the first item you have and change variable names.

So let’s get all those parts.

    <div id="mpgbar"></div>
    
    d.mpg = +d.mpg;
    
    var mpgdim = ndx.dimension(function (d) {return d.mpg});
    
    var mpgbar = dc.barChart('#mpgbar');
    mpgbar
      .width(320)
      .height(120)
      .margins({top: 10, right: 50, bottom: 30, left: 40})
      .dimension(mpgdim)
      .group(mpgdim.group())
      .x(d3.scaleLinear().domain([10,34])) // chart goal posts, mpg varies from 10.4 to 33.9
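Rather than eyeballing the goal posts, you can work them out from the data. Here’s a minimal sketch in plain javascript with a handful of sample mpg values; in the page you would pull the values out of cars instead (d3.extent will also give you the min and max in one call). The variable names are my own.

```javascript
// Sample mpg values for illustration; in the page you would map over cars.
const mpgValues = [21.0, 22.8, 10.4, 33.9, 18.7];

// Widen the observed min and max to whole numbers so the end bars are not clipped.
const lo = Math.floor(Math.min.apply(null, mpgValues));
const hi = Math.ceil(Math.max.apply(null, mpgValues));
const domain = [lo, hi];

console.log(domain); // 10 and 34 for these values
```

The result would then go into the chart as .x(d3.scaleLinear().domain(domain)).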

Not a whole lot of tough code to add another graph. You can start batching things like cleaning all your variables at once, then batching dimension creation. So what’s the result?

        'use strict';
    
        console.log(window.location.pathname);

        d3.csv("mtcars.csv").then(function (cars) {
            console.log(cars[1]);
        
            cars.forEach(function(d) {
                d.carb = +d.carb;
                d.cylnum = +d.cyl;
                d.mpg = +d.mpg;
            });
    
            console.log(cars[1]);
            
            var ndx = crossfilter(cars);
    
            var carbdim = ndx.dimension(function (d) {return d.carb});
            var mpgdim = ndx.dimension(function (d) {return d.mpg});
            
            var carbpie = dc.pieChart('#carbpie');
            carbpie
              .width(120)
              .height(120)
              .radius(60)
              .innerRadius(0)
              .dimension(carbdim)
              .group(carbdim.group())
            
            var mpgbar = dc.barChart('#mpgbar');
            mpgbar
              .width(400)
              .height(120)
              .margins({top: 10, right: 40, bottom: 20, left: 20})
              .dimension(mpgdim)
              .group(mpgdim.group())
              .x(d3.scaleLinear().domain([10,34]))
        
            dc.renderAll()
    
        })
[Interactive charts: carb pie chart and mpg bar chart]

Pretty nifty! We can click through the different pie slices and see that, generally, the higher the carburetor count, the lower the mpg. We can find a similar thing by selecting around 10-14 on the mpg graph and then sliding the selection right: the carb pie shifts toward higher proportions of lower carburetor counts.

Here’s the final two-graph page (one pie, one bar) for jump starting new projects. Just copy and paste, change variables to match your data, and add more elements as needed! Now you’re slicing and dicing in a way that would otherwise take hundreds of SQL statements or dozens of lines of R, and in a way that lets you understand the data better than a bunch of one-off chunks of numbers.

<!DOCTYPE html>
<html lang="en">
<head>

    <title>mtcars analysis</title>
    <meta charset="UTF-8">

    <!-- load all the libraries and related styles -->
    <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.12/crossfilter.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/dc/3.0.9/dc.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/dc/3.0.9/dc.min.css" />
    <script src="https://cdnjs.cloudflare.com/ajax/libs/datatables/1.10.19/js/jquery.dataTables.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/datatables/1.10.19/js/jquery.dataTables.min.css" />
    
</head>
<body>

    <div id="carbpie" class="pie"></div>
    <div id="mpgbar"></div>
    
    <script type = 'text/javascript'>

    'use strict';

    console.log(window.location.pathname);

    d3.csv("mtcars.csv").then(function (cars) {
        console.log(cars[1]);
    
        cars.forEach(function(d) {
            d.carb = +d.carb;
            d.cylnum = +d.cyl;
            d.mpg = +d.mpg;
        });

        console.log(cars[1]);
        
        var ndx = crossfilter(cars);

        var carbdim = ndx.dimension(function (d) {return d.carb});
        var mpgdim = ndx.dimension(function (d) {return d.mpg});
    
        var carbpie = dc.pieChart('#carbpie');
        carbpie
          .width(120)
          .height(120)
          .radius(60)
          .innerRadius(0)
          .dimension(carbdim)
          .group(carbdim.group());
        
        var mpgbar = dc.barChart('#mpgbar');
        mpgbar
          .width(320)
          .height(120)
          .margins({top: 10, right: 50, bottom: 30, left: 40})
          .dimension(mpgdim)
          .group(mpgdim.group())
          .x(d3.scaleLinear().domain([10,34])) // chart goal posts, mpg varies from 10.4 to 33.9
    
        dc.renderAll()

    })
</script>
    
</body>
</html>

There’s plenty more

These are the very basics of structure, but there are lots of interesting things in the API that you can use for many use cases. Those will have their own posts.
