Blogs are some of the best data science resources available. However, the paths to an efficient data-blogging workflow are many and perilous! This post walks through how I got the process down to two commands, with the forked teachings of Yihui Xie (… and how you can, too).
Those two commands in full:
- Creates an Rmarkdown file with the necessary boilerplate, in a new subdirectory within the blog
- Opens the Rmarkdown file in the Rstudio editor
- Serves an interactive live-preview of the site in the Rstudio viewer window, which updates whenever you save changes
Rebuilds the site locally, then uploads any changes to an Amazon S3 bucket (which points to this domain name).
These seemingly mighty functions are tiny wrappers I wrote around the superb knitr-jekyll system put together by R’s high-priest of dynamic documents, Yihui. After having found integrating data analysis with other blog systems a pain, I was surprised by how incredibly easy this was; perhaps easy enough to get more data scientists & statisticians writing about their work.
I was also surprised that it had never made it on to r-bloggers!
As the repo went up on GitHub without any fanfare, I thought I’d write up my experiences (with a gentle introduction to static-site-generation) in case it might motivate others to get blogging!
Before getting in to the nuts and bolts, it seems like a good idea to establish some of the reasons a data scientist might want to write a blog in the first place. Here’s a few:
They’re some of the best resources available
How many times has a blog post walked you through a complex subject with accessible writing, and code you can easily try out?
More times than you deserve, and fewer times than you’d like!
You know less than you thought, and your memory is horrible
The process of writing, especially publicly for your peers, is the best way to challenge and hone your understanding. At the same time, a blog post fills the effort/usefulness chasm between a research paper and crappy notes in your code.
Exploit the kindness/boredom of experts
Twitter, r-bloggers, Hacker News, and the rest are teeming with people much cleverer than you, often willing to offer advice and encouragement.
Bonus Reason: Because Hadley Wickham told you to
“If you want to give back, I think writing a blog is a great way. Many of the things that you struggle with will be common problems. Think about how to solve them well and describe your solution to others.”
From Hadley’s Reddit AMA a few weeks ago
So there. Switching to snake_case_variables was just the beginning.
What is a blog?
If you haven’t cobbled together a website before, it’s worth mentioning that blogs are usually static websites. Because everyone sees the same stuff, you don’t need to keep a server running in the background, reacting to specific users and their inputs (like a Shiny app, or Gmail, for example).
This means that running a blog is both easy and cheap. Blogs tend to just be a few HTML files thrown up on a file server such as Amazon S3, GitHub Pages, or even Dropbox. There’s no server-side code to write.
Blogging infrastructure just takes your ‘content’ (words, pictures, perhaps code and plots) from some format which is productive for you to write in, turns it into an HTML file, adds a few bells and whistles, and lets you preview the result in a browser. If you’re happy with it, you can upload it to a remote file hosting service, which a domain name points to.
Integrating Data Analysis into Your Workflow
With the process above, it’s quite possible to blog with R output, writing stats into the text as you go, saving plots as images and adding links to them. However, it’s also possible to blog using a reproducible work-flow, automating all the copying between documents. It also means that if you wanted to, you could press a button to regenerate all the analysis from scratch. This is increasingly advocated in scientific research to ease replication and collaboration.
A reproducible workflow is useful to me, even though my most common collaborator is myself-a-few-weeks-ago. If I have a code trail for my work, it’s always much easier and quicker for me to figure out what I was doing, and pick up where I left off.
At the same time, copying and keeping track of extra files is incredibly boring, and knowing that you have to faff-around whenever you change something can really put you off making iterative improvements.
But most importantly, I’m lazy as hell. My main motivation is to reduce the amount of work/thought required to write a blog post, so there’s more chance I’ll actually do it.
knitr-jekyll makes this all very easy.
Here’s how all the parts fit together:
rmarkdown lets you write in a productive format
Markdown is a joy to write in, and integrating R code gives you the benefits of reproducibility.
knitr automatically generates and integrates R output
It simply converts the .Rmd file to .md, saving all the outputs from your R code to the right places.
jekyll adds some blog-themed bells and whistles
These are the multitude of little things that make a blog a blog. Without even realising it, you probably expect quite a few of them. “Blog stuff” (a sensible directory structure, HTML templating, index pages, RSS feeds, etc.) sounds trivial, but having an established system ready to organise it all will make your life much easier. More established blog-generators tend to have better options for providing and organising these bells and whistles, and Jekyll is the most mature and widely used.
All of the above happens automatically, each time you save your .Rmd file. You just write and watch the results compiled live, in real-time, in the Rstudio viewer pane.
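To make the knitr step concrete, here is a minimal sketch of converting a single post by hand. The file names and post content are made up for illustration, and the build script that ships with knitr-jekyll does more than this (figure and cache paths, for instance), so treat it as a picture of the idea rather than the actual implementation.

```r
library(knitr)

# Hypothetical post, written to a temporary directory for the demo
post_dir <- tempfile("post")
dir.create(post_dir)
rmd <- file.path(post_dir, "example.Rmd")
md  <- file.path(post_dir, "example.md")

fence <- strrep("`", 3)  # builds the chunk delimiter without typing it literally

writeLines(c(
  "---",
  "title: Example",
  "---",
  "",
  "The mean of 1:10 is `r mean(1:10)`.",  # inline R, evaluated by knitr
  "",
  paste0(fence, "{r}"),                   # an ordinary R chunk
  "summary(cars$speed)",
  fence
), rmd)

# .Rmd in, plain .md out; Jekyll only ever sees the .md file
knit(rmd, output = md, quiet = TRUE)

md_lines <- readLines(md)
```

The resulting markdown contains the evaluated inline value and the chunk's printed output, which is what Jekyll then turns into HTML.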
Yihui’s example blog post, written using this system, has a more detailed description of what’s going on under the hood to make this all work.
The Snag: htmlwidgets
This system works flawlessly for static plots, but a few extra steps are required if you’d like to use one of the new htmlwidgets packages. This is because rmarkdown::render (used behind the scenes by RStudio), in addition to running knitr, performs some other black magic for injecting js/css dependencies.
This doesn’t mean that you can’t use these packages with knitr-jekyll, but it does mean that you need to copy the required js/css files from the R package to your blog. This only takes a minute, but it is slightly annoying.
I have some ideas about how to solve this. I think the htmlwidgets packages look great, but I’m yet to use them regularly myself, so this isn’t a big deal for me. But if it is for you, beware!
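As a rough sketch of the copying step: htmlwidgets packages conventionally keep their js/css under an htmlwidgets folder inside the installed package, so a small helper can copy that folder into the blog. The function name and the assets/ destination below are my own assumptions for illustration, not part of knitr-jekyll, and your page templates would still need to reference the copied files.

```r
# Hypothetical helper: copy a widget package's bundled js/css into the blog.
# `src` defaults to the package's installed htmlwidgets folder.
copy_widget_assets <- function(pkg, blog_dir,
                               src = system.file("htmlwidgets", package = pkg)) {
  if (!nzchar(src) || !dir.exists(src)) {
    stop("No htmlwidgets folder found for package: ", pkg)
  }
  # Assumed destination inside the blog; pick whatever suits your layout
  dest <- file.path(blog_dir, "assets", pkg)
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  # Copies the whole folder, ending up at dest/<basename(src)>
  ok <- file.copy(src, dest, recursive = TRUE)
  invisible(ok)
}
```

For an installed widget package this might be called as copy_widget_assets("leaflet", "path/to/blog"), where both arguments are placeholders.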
Using Jekyll is easy. Although it’s written in Ruby, it’s a command-line programme – you don’t need to know any Ruby to use it. While knitr-jekyll handles the process of generating blog posts, you’ll probably want to customise the appearance of your blog before you start publishing. I recommend reading the official documentation, but here are the main things I did.
Changed the markdown engine to pandoc. I used the
Added Mathjax. If you’re using pandoc, you can do this with the
Added highlight.js, for slightly more flexible syntax highlighting.
Getting it on the internet
Using a file host (e.g. S3)
The option I went with. Amazon S3 is like a very cheap, flexible, and ugly version of Dropbox, commonly used to host static websites. Aside from the web interface, there’s awscli, a command-line interface which makes synchronising local and remote files easy. I have the following set up as a bash alias:
alias push_blog='aws s3 sync /home/br/projects/brendanrocks.com/_site \
  s3://brendanrocks.com --exclude "cache/*" --exclude "README.md" --delete'
Which synchronises the files in my local blog directory, and the remote directory (or ‘bucket’) on S3, which the URL brendanrocks.com points to. I use the same command in
Setting up S3 to host static websites takes an hour or so of fiddling around. There’s a short guide here.
Using GitHub Pages
If you like the sound of all this, but are put off by installing Ruby, arranging file hosting, buying a domain name, etc., you can use GitHub Pages.
git push markdown files to a configured repo, and the HTML will be built via Jekyll, and hosted on GitHub’s servers, for free. Doing this means that you have a little less flexibility and control over your blog, but there are work-arounds for the most common needs. The knitr-jekyll system works great for this.
The helper/wrapper functions I wrote are not especially clever (nor essential to use knitr-jekyll), and live in my personal R package. If they might be useful, do pull them apart for your own purposes. Here’s the source code.
As explained at the top of the post, this sets up RStudio (or your default applications) for writing a blog post. I use a slightly different directory structure from normal, with each blog post in its own folder, just to keep the files tidy. This assumes that you want that, too. It then runs…
Which is a tiny wrapper around Yihui’s servr::jekyll(serve = TRUE), accounting for my extra directory levels above.
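For a feel of what a new-post helper involves, here is a minimal sketch. It is not the brocks implementation: the function name, the _source folder, the slug rules, and the front-matter fields are all assumptions made for illustration.

```r
# Hypothetical stand-in for a new-post helper: makes a folder per post and
# writes an .Rmd file containing some YAML boilerplate.
new_post_sketch <- function(title, blog_dir = ".", date = Sys.Date()) {
  # Lower-case the title and squash anything non-alphanumeric into hyphens
  slug <- gsub("[^a-z0-9]+", "-", tolower(title))
  slug <- gsub("^-+|-+$", "", slug)
  name <- paste0(date, "-", slug)

  # Assumed layout: one subdirectory per post under _source
  post_dir <- file.path(blog_dir, "_source", name)
  dir.create(post_dir, recursive = TRUE, showWarnings = FALSE)

  rmd <- file.path(post_dir, paste0(name, ".Rmd"))
  # Note: a title containing double quotes would need escaping here
  writeLines(c(
    "---",
    "layout: post",
    paste0("title: \"", title, "\""),
    "---",
    ""
  ), rmd)
  rmd
}

# In an interactive session you might then open the file and start the preview:
# rmd <- new_post_sketch("My first blog post!")
# file.edit(rmd)
# servr::jekyll(dir = ".", serve = TRUE)
```

Returning the file path keeps the helper easy to chain with whatever opens the editor and starts the live preview.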
Generates all the static files (without running a local webserver). Wraps servr::jekyll(serve = FALSE).
blog_opts() is a set of knitr chunk options I find useful for blogging. In general I want to show plots, not the code that derived them, so adding this to my boilerplate saves me typing echo=FALSE, warning=FALSE, ... in every chunk.
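A minimal sketch of such a defaults function, using knitr's opts_chunk. The function name is hypothetical, and message = FALSE is my own addition to the two options mentioned above; the real blog_opts() may set a different list.

```r
# Hypothetical version of a chunk-defaults helper; not the brocks source.
blog_opts_sketch <- function(...) {
  # Defaults: show plots and results, hide code and chatter
  defaults <- list(echo = FALSE, warning = FALSE, message = FALSE)
  # Anything passed in by the caller overrides or extends the defaults
  opts <- modifyList(defaults, list(...))
  do.call(knitr::opts_chunk$set, opts)
  invisible(opts)
}

# Typically called once, in a setup chunk at the top of a post
blog_opts_sketch(fig.width = 8)
```

Individual chunks can still override any of these, for example with echo=TRUE on the one chunk whose code you do want to show.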
The laziest wrapper function of all, this just runs blog_gen() followed by an arbitrary system() command, which I use to push this site up on S3.
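The build-then-push pattern can be sketched in a few lines. The function and argument names are hypothetical; the build and run steps are passed in as arguments only so the sketch can be tried without Jekyll or AWS installed.

```r
# Hypothetical sketch of "rebuild, then run a shell command"; not the brocks source.
blog_push_sketch <- function(command,
                             blog_dir = ".",
                             build = function() servr::jekyll(dir = blog_dir, serve = FALSE),
                             run = system) {
  build()                 # regenerate the static files first
  status <- run(command)  # then hand over to the shell, e.g. an S3 sync
  invisible(status)
}

# Example call (not run here): the command is whatever uploads your site
# blog_push_sketch("aws s3 sync _site s3://your-bucket --delete")
```

The bucket name in the example call is a placeholder; any upload command would do, since the function just passes the string to the shell.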
While I’ve covered lots of details, my main message is that knitr-jekyll allows you to get a blog compiling on your machine in less than 10 minutes! If you’ve ever wanted to blog about R/data-science, but have been put off by the effort to get started, I hope you give it a go!
Here’s what you need to get going:
1. Install R dependencies
install.packages(c("knitr", "servr", "devtools")) # To process .Rmd files
devtools::install_github("hadley/lubridate")      # brocks reqs dev version
devtools::install_github("brendan-r/brocks")      # My lazy wrapper funs
4. Open up knitr-jekyll/knitr-jekyll.Rproj, and get blogging!
library(brocks)
new_post("My first blog post!")
Thanks of course to Yihui Xie for the fantastic knitr-jekyll repo, and to Carson Sievert for drawing my attention to it on Twitter. Graphs produced with GraphViz via the DiagrammeR package – hat-tip to Rich Iannone, and Benoit Thieurmel!
… Postscript: Alternatives
The method I’ve ended up using reflects my own computational biases, as well as my neurotic desire for everything to be automated, reproducible, and open-source. Happily, this personality defect is not a pre-requisite for writing about statistics & programming. Here are some alternatives.
Unfortunately none of the packages below are on CRAN, or under active development. At the time of writing, they lack enough bells and whistles to keep me using Jekyll. However, I’m keen to acknowledge the work that’s gone in to them, and hope they might develop further!
Poirot is an all-in-one R package for blogging, written by Ramnath Vaidyanathan of Slidify/rCharts/HTMLWidgets fame. It’s since been retired and merged back in to Slidify.
Blog-aware static-site-generators in other languages
Open source & markdown friendly! Used by Oliver Keyes.
The hosted, easier version of the open-source WordPress server-app. Wordpress in some form is used by many prominent data scientists including esteemed Hilarys Mason and Parker, Erin Shelman, and r-bloggers itself.