Blogs are some of the best data science resources available. However, the paths to an efficient data-blogging workflow are many and perilous! This post walks through how I got the process down to two commands, with the forked teachings of Yihui Xie (… and how you can, too).

Those two commands in full:

  1. new_post("Post title!")

    • Creates an Rmarkdown file with the necessary boilerplate, in a new subdirectory within the blog
    • Opens the Rmarkdown file in the RStudio editor
    • Serves an interactive live-preview of the site in the RStudio viewer window, which updates whenever you save changes
  2. blog_push()

    Rebuilds the site locally, then uploads any changes to an Amazon S3 bucket (which this domain name points to).


That’s it.

These seemingly mighty functions are tiny wrappers I wrote around the superb knitr-jekyll system put together by R’s high-priest of dynamic documents, Yihui. After having found integrating data analysis with other blog systems a pain, I was surprised by how incredibly easy this was, perhaps easy enough to get more data scientists & statisticians writing about their work.

I was also surprised that it had never made it on to r-bloggers!

As the repo went up on GitHub without any fanfare, I thought I’d write up my experiences (with a gentle introduction to static-site-generation) in case it might motivate others to get blogging!

Why bother?

Before getting into the nuts and bolts, it seems like a good idea to establish some of the reasons a data scientist might want to write a blog in the first place. Here are a few:

  • They’re some of the best resources available

    How many times has a blog post walked you through a complex subject with accessible writing, and code you can easily try out?

    More times than you deserve, and fewer times than you’d like!

  • You know less than you thought, and your memory is horrible

    The process of writing, especially publicly for your peers, is the best way to challenge and hone your understanding. At the same time, a blog post fills the effort/usefulness chasm between a research paper and crappy notes in your code.

  • Exploit the kindness/boredom of experts

    Twitter, r-bloggers, Hacker News, and the rest are teeming with people much cleverer than you, often willing to offer advice and encouragement.

  • Bonus Reason: Because Hadley Wickham told you to

    “If you want to give back, I think writing a blog is a great way. Many of the things that you struggle with will be common problems. Think about how to solve them well and describe your solution to others.”

    From Hadley’s Reddit AMA a few weeks ago

    So there. Switching to snake_case_variables was just the beginning.

What is a blog?

If you haven’t cobbled together a website before, it’s worth mentioning that blogs are usually static websites. Because everyone sees the same stuff, you don’t need to keep a server running in the background, reacting to specific users and their inputs (like a Shiny app, or Gmail, for example).

This means that running a blog is both easy and cheap. Blogs tend to just be a few HTML files thrown up on a file server such as Amazon S3, GitHub Pages, or even Dropbox. There’s no server-side code to write.

Blogging infrastructure just takes your ‘content’ (words, pictures, perhaps code and plots) from some format which is productive for you to write in, turns it into an HTML file, adds a few bells and whistles, and lets you preview the result in a browser. If you’re happy with it, you can upload it to a remote file hosting service, which a domain name points to.

Figure 1.   The general process-flow for blog-aware static-site-generators, like Jekyll. Your content is converted from Markdown to HTML, and added to an HTML ‘layout’ or template (containing things like menus and metadata). The resulting HTML file (and its dependencies, like JavaScript and CSS files, as well as any assets used, like images), can then be copied to an online file-host, such as Amazon’s S3 service.

Integrating Data Analysis into Your Workflow

With the process above, it’s quite possible to blog with R output, writing stats into the text as you go, saving plots as images and adding links to them. However, it’s also possible to blog using a reproducible workflow, which automates all the copying between documents. A reproducible workflow also means that, if you wanted to, you could press a button to regenerate all the analysis from scratch. This is increasingly advocated in scientific research to ease replication and collaboration.

A reproducible workflow is useful to me, even though my most common collaborator is myself-a-few-weeks-ago. If I have a code trail for my work, it’s always much easier and quicker for me to figure out what I was doing, and pick up where I left off.

At the same time, copying and keeping track of extra files is incredibly boring, and knowing that you have to faff-around whenever you change something can really put you off making iterative improvements.

But most importantly, I’m lazy as hell. My main motivation is to reduce the amount of work/thought required to write a blog post, so there’s more chance I’ll actually do it.

knitr-jekyll makes this all very easy.

Figure 2.   The process flow used for this site. The usual Jekyll system is used, but with the posts written in .Rmd format. knitr parses the input file, executing code (e.g. data analysis), and saving any output assets to appropriate folders in the blog’s directory structure (e.g. plots/graphs/statistics). knitr also outputs an .md format of the content, which is then fed to Jekyll to continue the process as normal. The entire local blog-generation process from .Rmd to HTML happens automatically.

Here’s how all the parts fit together:

  • rmarkdown lets you write in a productive format

    Markdown is a joy to write in, and integrating R code gives you the benefits of reproducibility.

  • knitr automatically generates and integrates R output

    It simply converts the .Rmd file to .md, saving all the outputs from your R code to the right places.

  • jekyll adds some blog-themed bells and whistles

    These are the multitude of little things that make a blog a blog and without even realising, you probably expect quite a few of them. “Blog stuff” (a sensible directory structure, HTML templating, index pages, RSS feeds, etc.) sounds trivial, but having an established system ready to organise it all will make your life much easier. More established blog-generators tend to have better options for providing and organising these bells and whistles, and Jekyll is the most mature and widely used.

All of the above happens automatically, each time you save your .Rmd file. You just write, and watch the results compile live in the RStudio viewer pane.
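As a rough illustration of the knitr step, here is a minimal sketch that knits a toy .Rmd file to .md by hand. The file names, front matter, and chunk are made up for the example; knitr-jekyll runs the equivalent for you on every save, using its own build script and directory layout.

```r
library(knitr)

# Build a toy post in a temporary directory (paths are illustrative only)
post_dir <- tempfile("post")
dir.create(post_dir)
rmd <- file.path(post_dir, "example.Rmd")
md  <- file.path(post_dir, "example.md")

# Three backticks, assembled here so they don't clash with this code block
fence <- strrep("`", 3)

# A tiny post: Jekyll front matter, one sentence, and one R chunk
writeLines(c(
  "---",
  "layout: post",
  "title: An example",
  "---",
  "",
  "The answer is below.",
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "2 + 2",
  fence
), rmd)

# knit() runs the R chunk and writes plain Markdown, ready for Jekyll
knit(input = rmd, output = md, quiet = TRUE)

md_lines <- readLines(md)
cat(md_lines, sep = "\n")
```

The resulting .md file keeps the front matter and replaces the chunk with its output, which is all Jekyll needs to carry on as normal.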

Yihui’s example blog post, written using this system, has a more detailed description of what’s going on under the hood to make this all work.

The Snag: htmlwidgets

This system works flawlessly for static plots, but a few extra steps are required if you’d like to use one of the new htmlwidgets packages.

This is because rmarkdown::render (used behind the scenes by RStudio), in addition to running knitr, performs some other black magic for injecting js/css dependencies.

This doesn’t mean that you can’t use these packages with knitr-jekyll, but it does mean that you need to copy the required js/css files from the R package to your blog. This only takes a minute, but it is slightly annoying.

I have some ideas about how to solve this. I think the htmlwidgets packages look great, but I’m yet to use them regularly myself, so this isn’t a big deal for me. But if it is for you, beware!

Customising Jekyll

Using Jekyll is easy. Although it’s written in Ruby, it’s a command-line programme – you don’t need to know any Ruby to use it. While knitr-jekyll handles the process of generating blog posts, you’ll probably want to customise the appearance of your blog before you start publishing. I recommend reading the official documentation, but here are the main things I did.

Getting it on the internet

Using a file host (e.g. S3)

This is the option I went with. Amazon S3 is like a very cheap, flexible, and ugly version of Dropbox, commonly used to host static websites. Aside from the web interface, there’s awscli, a command-line interface which makes synchronising local and remote files easy. I have the following set up as a bash alias:

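# Sync the built site to S3. aws-cli filters are globs, so each pattern gets its own --exclude
alias push_blog='aws s3 sync /home/br/projects/brendanrocks.com/_site \
                 s3://brendanrocks.com --exclude "cache/*" --exclude "README.md" --delete'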

This synchronises the files in my local blog directory with the remote directory (or ‘bucket’) on S3, which the URL brendanrocks.com points to. I use the same command in blog_push().

Setting up S3 to host static websites takes an hour or so of fiddling around. There’s a short guide here.

If you like this system, but don’t want to give your money to Amazon, there are many equivalent services.

Using GitHub Pages

If you like the sound of all this, but are put off by installing Ruby, arranging file hosting, buying a domain name, etc., you can use GitHub Pages.

Simply git push markdown files to a configured repo, and the HTML will be built via Jekyll, and hosted on GitHub’s servers, for free. Doing this means that you have a little less flexibility and control over your blog, but there are work-arounds for the most common needs. The knitr-jekyll system works great for this.

Helper functions

The helper/wrapper functions I wrote are not especially clever (nor essential to use knitr-jekyll), and live in my personal R package. If they might be useful, do pull them apart for your own purposes. Here’s the source code.

  • new_post("Post title!")

    As explained at the top of the post, this sets up RStudio (or your default applications) for writing a blog post. I use a slightly different directory structure from normal, with each blog post in its own folder, just to keep the files tidy. This assumes that you want that, too. It then runs…

  • blog_serve()

    Which is a tiny wrapper around Yihui’s servr::jekyll(serve = TRUE), accounting for my extra directory levels above.

  • blog_gen()

    Generates all the static files (without running a local webserver). Wraps servr::jekyll(serve = FALSE).

  • blog_opts()

    A set of knitr chunk options I find useful for blogging. In general I want to show plots, not the code that derived them, so adding this to my boilerplate saves me typing echo=FALSE, warning=FALSE, ... in every chunk.

  • blog_push()

    The laziest wrapper function of all, this just runs blog_gen() followed by an arbitrary system() command, which I use to push this site up on S3.
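For a sense of scale, the helpers above could be approximated along these lines. These are hypothetical stand-ins rather than the code in the package: the _sketch names, the _source default directory, the front matter fields, and the message = FALSE option are all assumptions made for illustration.

```r
# Hypothetical stand-in for new_post(): one folder per post, with a
# boilerplate .Rmd inside. Opening the file and starting the live
# preview are left out of this sketch.
new_post_sketch <- function(title, posts_dir = "_source") {
  # "Post title!" -> "post-title"
  slug <- gsub("^-+|-+$", "", gsub("[^a-z0-9]+", "-", tolower(title)))
  name <- paste0(Sys.Date(), "-", slug)

  post_dir <- file.path(posts_dir, name)
  dir.create(post_dir, recursive = TRUE, showWarnings = FALSE)

  # Minimal Jekyll front matter (titles containing quotes would need escaping)
  rmd <- file.path(post_dir, paste0(name, ".Rmd"))
  writeLines(c(
    "---",
    "layout: post",
    paste0("title: \"", title, "\""),
    "---",
    ""
  ), rmd)

  invisible(rmd)
}

# Chunk defaults in the spirit of blog_opts(): show output, hide code.
# echo and warning are mentioned in the post; message is an assumption.
blog_opts_sketch <- function() {
  knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
}

# blog_push() in spirit: rebuild the site, then run a shell command.
# Returns the exit status of the command.
blog_push_sketch <- function(command,
                             build = function() servr::jekyll(serve = FALSE)) {
  build()
  system(command)
}

# Try the first two in a throwaway directory
rmd <- new_post_sketch("Post title!", posts_dir = tempfile("blog"))
blog_opts_sketch()

# Pushing needs Jekyll and a real destination, so it is only shown here:
# blog_push_sketch("aws s3 sync _site s3://example-bucket --delete")
```

Passing the build step in as an argument is just a convenience for the sketch, since it lets you swap in a no-op while trying things out.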

Get going!

While I’ve covered lots of details, my main message is that knitr-jekyll allows you to get a blog compiling on your machine in less than 10 minutes! If you’ve ever wanted to blog about R/data-science, but have been put off by the effort to get started, I hope you give it a go!

Here’s what you need to get going:

1. Install R dependencies

install.packages(c("knitr", "servr", "devtools"))     # To process .Rmd files
devtools::install_github("hadley/lubridate")         # brocks reqs dev version
devtools::install_github("brendan-r/brocks")         # My lazy wrapper funs

2. Install Ruby & Jekyll

3. Clone or download Yihui’s knitr-jekyll repo

4. Open up knitr-jekyll/knitr-jekyll.Rproj, and get blogging!

library(brocks)
new_post("My first blog post!")

Acknowledgements

Thanks of course to Yihui Xie for the fantastic knitr-jekyll repo, and to Carson Sievert for drawing my attention to it on Twitter. Graphs produced with GraphViz via the DiagrammeR package – hat-tip to Rich Iannone, and Benoit Thieurmel!




… Postscript: Alternatives

The method I’ve ended up using reflects my own computational biases, as well as my neurotic desire for everything to be automated, reproducible, and open-source. Happily, this personality defect is not a pre-requisite for writing about statistics & programming. Here are some alternatives.

R Packages

Unfortunately, none of the packages below are on CRAN, or under active development. At the time of writing, they lack enough bells and whistles to keep me using Jekyll. However, I’m keen to acknowledge the work that’s gone into them, and hope they might develop further!

  • Samantha

    By David Springate. No development since 2013, and no longer used by the author for his own blog, but is in active use by @rmflight for his personal blog.

  • rsmith

    By Hadley Wickham (with contributions from Gábor Csárdi). No development for about a year. I’m not aware of any websites that currently use it.

  • Poirot / Slidify

    Poirot is an all-in-one R package for blogging, written by Ramnath Vaidyanathan of Slidify/rCharts/HTMLWidgets fame. It’s since been retired and merged back in to Slidify.

Blog-aware static-site-generators in other languages

Pythonistas are likely to find Pelican useful (see Alyssa Frazee’s blog for an example). Go has Hugo, used by ggplot2-fan and venture capitalist Tom Tunguz. JavaScript has Metalsmith, Wintersmith, and Hexo. The JavaScript SSGs were especially attractive to me, not just because I write a little JS, but also because build tools like Grunt can be handy for things like compressing images and minifying source code.

Hosted GUIs

  • Ghost

    Open source & markdown friendly! Used by Oliver Keyes.

  • WordPress.com

    The hosted, easier version of the open-source WordPress server-app. WordPress in some form is used by many prominent data scientists, including the esteemed Hilarys Mason and Parker, Erin Shelman, and r-bloggers itself.

  • Medium

    Beautiful, proprietary, blog-oriented social-network/content-farm, with a built-in distribution network. Not much activity from data scientists (so far). The terms of service and privacy policy are more social-network-like than the others (though they do respect Do Not Track).