Blogging with Rmarkdown, knitr, and Jekyll
Blogs are some of the best data science resources available. However, the paths to an efficient data-blogging workflow are many and perilous! This post walks through how I got the process down to two commands, with the forked teachings of Yihui Xie (… and how you can, too).
Those two commands in full:
new_post("Post title!")
- Creates an Rmarkdown file with the necessary boilerplate, in a new subdirectory within the blog
- Opens the Rmarkdown file in the Rstudio editor
- Serves an interactive live-preview of the site in the Rstudio viewer window, which updates whenever you save changes
blog_push()
Rebuilds the site locally, then uploads any changes to an Amazon S3 bucket (which points to this domain name).
That’s it.
These seemingly mighty functions are tiny wrappers I wrote around the superb knitr-jekyll
system put together by R’s high-priest of dynamic documents, Yihui. Having found integrating data analysis with other blog systems a pain, I was surprised by how incredibly easy this was: perhaps easy enough to get more data scientists & statisticians writing about their work.
I was also surprised that it had never made it on to r-bloggers!
As the repo went up on GitHub without any fanfare, I thought I’d write up my experiences (with a gentle introduction to static-site-generation) in case it might motivate others to get blogging!
Why bother?
Before getting into the nuts and bolts, it seems like a good idea to establish some of the reasons a data scientist might want to write a blog in the first place. Here are a few:
They’re some of the best resources available
How many times has a blog post walked you through a complex subject with accessible writing, and code you can easily try out?
More times than you deserve, and fewer times than you’d like!
You know less than you thought, and your memory is horrible
The process of writing, especially publicly for your peers, is the best way to challenge and hone your understanding. At the same time, a blog post fills the effort/usefulness chasm between a research paper and crappy notes in your code.
Exploit the kindness/boredom of experts
Twitter, r-bloggers, Hacker News, and the rest are teeming with people much cleverer than you, often willing to offer advice and encouragement.
Bonus Reason: Because Hadley Wickham told you to
“If you want to give back, I think writing a blog is a great way. Many of the things that you struggle with will be common problems. Think about how to solve them well and describe your solution to others.”
From Hadley’s Reddit AMA a few weeks ago
So there. Switching to
snake_case_variables
was just the beginning.
What is a blog?
If you haven’t cobbled together a website before, it’s worth mentioning that blogs are usually static websites. Because everyone sees the same stuff, you don’t need to keep a server running in the background, reacting to specific users and their inputs (like a Shiny app, or Gmail, for example).
This means that running a blog is both easy and cheap. Blogs tend to just be a few HTML files thrown up on a file server such as Amazon S3, GitHub Pages, or even Dropbox. There’s no server-side code to write.
Blogging infrastructure just takes your ‘content’ (words, pictures, perhaps code and plots) from some format which is productive for you to write in, turns it into an HTML file, adds a few bells and whistles, and lets you preview the result in a browser. If you’re happy with it, you can upload it to a remote file hosting service, which a domain name points to.
Integrating Data Analysis into Your Workflow
With the process above, it’s quite possible to blog with R output, writing stats into the text as you go, saving plots as images and adding links to them. However, it’s also possible to blog using a reproducible workflow, automating all the copying between documents. This also means that, if you wanted to, you could press a button to regenerate all the analysis from scratch. This is increasingly advocated in scientific research to ease replication and collaboration.
A reproducible workflow is useful to me, even though my most common collaborator is myself-a-few-weeks-ago. If I have a code trail for my work, it’s always much easier and quicker for me to figure out what I was doing, and pick up where I left off.
At the same time, copying and keeping track of extra files is incredibly boring, and knowing that you have to faff-around whenever you change something can really put you off making iterative improvements.
But most importantly, I’m lazy as hell. My main motivation is to reduce the amount of work/thought required to write a blog post, so there’s more chance I’ll actually do it.
knitr-jekyll
makes this all very easy.
Here’s how all the parts fit together:
rmarkdown
lets you write in a productive format
Markdown is a joy to write in, and integrating R code gives you the benefits of reproducibility.
knitr
automatically generates and integrates R output
It simply converts the .Rmd file to .md, saving all the outputs from your R code to the right places.
jekyll
adds some blog-themed bells and whistles
These are the multitude of little things that make a blog a blog, and you probably expect quite a few of them without even realising. “Blog stuff” (a sensible directory structure, HTML templating, index pages, RSS feeds, etc.) sounds trivial, but having an established system ready to organise it all will make your life much easier. More established blog-generators tend to have better options for providing and organising these bells and whistles, and Jekyll is the most mature and widely used.
All of the above happens automatically, each time you save your .Rmd file. You just write and watch the results compiled live, in real-time, in the Rstudio viewer pane.
Yihui’s example blog post using this system has a more detailed description of what’s going on under the hood to make this all work.
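To make the knitr step a little more concrete, here is a minimal sketch of the .Rmd to .md conversion done by hand. It assumes the knitr package is installed; the file names and the toy chunk are made up for illustration, and the real knitr-jekyll build does more than this (such as arranging where figures are saved).

```r
# A minimal sketch: knit a toy .Rmd file to .md by hand.
# Assumes knitr is installed; all file names here are illustrative only.
library(knitr)

work_dir <- tempfile("knit-demo-")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "example.Rmd")
md_file  <- file.path(work_dir, "example.md")

# Three backticks, built in code so this listing stays easy to read
fence <- strrep("`", 3)

# A tiny R Markdown document: one line of prose plus one R chunk
writeLines(c(
  "The mean of the numbers 1 to 10 is:",
  "",
  paste0(fence, "{r}"),
  "mean(1:10)",
  fence
), rmd_file)

# knit() runs the chunk and writes the code and its output into the .md file
knit(rmd_file, output = md_file, quiet = TRUE)

md_lines <- readLines(md_file)
cat(md_lines, sep = "\n")
```

The resulting markdown file contains both the chunk's code and its printed result, which is what Jekyll then turns into HTML.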
The Snag: htmlwidgets
This system works flawlessly for static plots, but a few extra steps are required if you’d like to use one of the new htmlwidgets
packages.
This is because rmarkdown::render
(used behind the scenes by Rstudio), in addition to running knitr
, performs some other black-magic for injecting js/css dependencies.
This doesn’t mean that you can’t use these packages with knitr-jekyll
, but it does mean that you need to copy the required js/css files from the R package to your blog. This only takes a minute, but it is slightly annoying.
I have some ideas about how to solve this. I think the htmlwidgets
packages look great, but I’m yet to use them regularly myself, so this isn’t a big deal for me. But if it is for you, beware!
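For what it's worth, the copying step might look something like the sketch below. It is only a guess at a workable approach, not something taken from knitr-jekyll: copy_widget_assets() is a hypothetical helper, the destination folder is made up, and I'm assuming the widget package keeps its assets in an installed "htmlwidgets" directory. Your pages would still need to reference the copied files, and some dependencies may live in other packages.

```r
# A hedged sketch of the "copy the js/css files" step (hypothetical helper).
# Copies every .js and .css file found under `src_dir` into `dest_dir`,
# keeping the same sub-directory layout.
copy_widget_assets <- function(src_dir, dest_dir) {
  stopifnot(dir.exists(src_dir))

  # Paths are returned relative to src_dir, e.g. "lib/viz/viz.js"
  assets <- list.files(src_dir, pattern = "\\.(js|css)$",
                       recursive = TRUE, ignore.case = TRUE)

  for (asset in assets) {
    target <- file.path(dest_dir, asset)
    dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
    file.copy(file.path(src_dir, asset), target, overwrite = TRUE)
  }
  invisible(assets)
}

# Usage (assumptions: the package stores its assets under "htmlwidgets",
# and the blog serves static files from "assets/widgets"):
# copy_widget_assets(
#   system.file("htmlwidgets", package = "DiagrammeR"),
#   file.path("assets", "widgets", "DiagrammeR")
# )
```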
Customising Jekyll
Using Jekyll is easy. Although it’s written in Ruby, it’s a command-line programme – you don’t need to know any Ruby to use it. While knitr-jekyll
handles the process of generating blog posts, you’ll probably want to customise the appearance of your blog before you start publishing. I recommend reading the official documentation, but here are the main things I did.
The best thing about Jekyll is Liquid, its very powerful templating engine. I wrote my own template files with bootstrap, but there are plenty of ready-made ones available.
Changed the markdown engine to pandoc. I used the
jekyll-pandoc-multiple-formats
plugin.
Added MathJax. If you’re using pandoc, you can do this with the
--mathjax
flag.
Added highlight.js, for slightly more flexible syntax highlighting.
Getting it on the internet
Using a file host (e.g. S3)
The option I went with. Amazon S3 is like a very cheap, flexible, and ugly version of Dropbox, commonly used to host static websites. Aside from the web interface, there’s awscli
; a command-line interface which makes synchronising local and remote files easy. I have the following set-up as a bash alias:
alias push_blog='aws s3 sync /home/br/projects/brendanrocks.com/_site \
s3://brendanrocks.com --exclude "cache/*" --exclude "README.md" --delete'
Which synchronises the files in my local blog directory, and the remote directory (or ‘bucket’) on S3, which the URL brendanrocks.com points to. I use the same command in blog_push()
.
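If you'd rather assemble the same sort of command from R, a rough sketch follows. s3_sync_args() is a hypothetical helper (not part of any package), the directory and bucket names are placeholders, and it assumes awscli is installed and configured. Each exclusion pattern gets its own --exclude flag, and --dryrun asks awscli to list what it would do without touching the bucket.

```r
# A sketch of building `aws s3 sync` arguments from R (hypothetical helper).
s3_sync_args <- function(local_dir, bucket, exclude = character(0),
                         delete = TRUE, dry_run = FALSE) {
  args <- c("s3", "sync",
            shQuote(local_dir),
            shQuote(paste0("s3://", bucket)))

  # One --exclude flag per pattern
  for (pattern in exclude) {
    args <- c(args, "--exclude", shQuote(pattern))
  }

  if (delete)  args <- c(args, "--delete")  # remove remote files deleted locally
  if (dry_run) args <- c(args, "--dryrun")  # only list the planned operations
  args
}

# Placeholder directory and bucket names, for illustration only
args <- s3_sync_args("_site", "example-bucket",
                     exclude = c("cache/*", "README.md"),
                     dry_run = TRUE)
cat("aws", args, "\n")

# To actually run it (needs awscli installed and configured):
# system2("aws", args)
```

Trying a dry run first seems a sensible habit with --delete, since a wrong path could otherwise remove files from the bucket.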
Setting up S3 to host static websites takes an hour or so of fiddling around. There’s a short guide here.
If you like this system, but don’t want to give your money to Amazon, there are many equivalent services.
Using GitHub Pages
If you like the sound of all this, but are put off by installing Ruby, arranging file hosting, buying a domain name, etc., you can use GitHub Pages.
Simply git push
markdown files to a configured repo, and the HTML will be built via Jekyll, and hosted on GitHub’s servers, for free. Doing this means that you have a little less flexibility and control over your blog, but there are work-arounds for the most common needs. The knitr-jekyll
system works great for this.
Helper functions
The helper/wrapper functions I wrote are not especially clever (nor essential to use knitr-jekyll
), and live in my personal R package. If they might be useful, do pull them apart for your own purposes. Here’s the source code.
new_post("Post title!")
As explained at the top of the post, this sets up RStudio (or your default applications) for writing a blog post. I use a slightly different directory structure from normal, with each blog post in its own folder, just to keep the files tidy. This assumes that you want that, too. It then runs…
blog_serve()
Which is a tiny wrapper around Yihui’s
servr::jekyll(serve = TRUE)
, accounting for my extra directory levels above.
blog_gen()
Generates all the static files (without running a local webserver). Wraps
servr::jekyll(serve = FALSE)
.
blog_opts()
Is a set of knitr chunk options I find useful for blogging. In general I want to show plots, not the code that derived them, so adding this to my boilerplate saves me typing echo=FALSE, warning=FALSE, ... in every chunk.
blog_push()
The laziest wrapper function of all, this just runs
blog_gen()
followed by an arbitrary system()
command, which I use to push this site up on S3.
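To give a flavour of how little is involved, here are rough sketches of three of these helpers. They are not the functions from my package: every name ending in _sketch is hypothetical, the "_source" directory and the front-matter fields are assumptions about a typical Jekyll set-up, and titles containing double quotes would need escaping.

```r
# Hedged sketches of the helpers described above (hypothetical names).

# Create <posts_dir>/<date>-<slug>/<date>-<slug>.Rmd containing Jekyll
# front matter, and return the path to the new file.
new_post_sketch <- function(title, posts_dir = "_source", date = Sys.Date()) {
  date <- as.Date(date)

  # Turn the title into a lower-case, hyphen-separated slug
  slug <- gsub("[^a-z0-9]+", "-", tolower(title))
  slug <- gsub("^-+|-+$", "", slug)
  name <- paste0(format(date, "%Y-%m-%d"), "-", slug)

  post_dir <- file.path(posts_dir, name)
  dir.create(post_dir, recursive = TRUE, showWarnings = FALSE)

  rmd_file <- file.path(post_dir, paste0(name, ".Rmd"))
  writeLines(c(
    "---",
    "layout: post",
    paste0("title: \"", title, "\""),
    paste0("date: ", format(date, "%Y-%m-%d")),
    "---",
    ""
  ), rmd_file)
  rmd_file
}

# Chunk options along the lines described above: show results, hide code
blog_opts_sketch <- function() {
  knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
}

# Rebuild the site, then run an arbitrary shell command to upload it.
# `build` and `run` are arguments so either step can be swapped out.
blog_push_sketch <- function(command,
                             build = function() servr::jekyll(serve = FALSE),
                             run = system) {
  build()
  run(command)
}
```

Something like file.edit(new_post_sketch("Post title!")) would then create the file and open it in your editor.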
Get going!
While I’ve covered lots of details, my main message is that knitr-jekyll
allows you to get a blog compiling on your machine in less than 10 minutes! If you’ve ever wanted to blog about R/data-science, but have been put off by the effort to get started, I hope you give it a go!
Here’s what you need to get going:
1. Install R dependencies
install.packages(c("knitr", "servr", "devtools")) # To process .Rmd files
devtools::install_github("hadley/lubridate") # brocks reqs dev version
devtools::install_github("brendan-r/brocks") # My lazy wrapper funs
2. Install Ruby & Jekyll
3. Clone or download Yihui’s knitr-jekyll
repo
4. Open up knitr-jekyll/knitr-jekyll.Rproj
, and get blogging!
library(brocks)
new_post("My first blog post!")
Acknowledgements
Thanks of course to Yihui Xie for the fantastic knitr-jekyll
repo, and to Carson Sievert for drawing my attention to it on Twitter. Graphs produced with GraphViz via the DiagrammeR package – hat-tip to Rich Iannone, and Benoit Thieurmel!
… Postscript: Alternatives
The method I’ve ended up using reflects my own computational biases, as well as my neurotic desire for everything to be automated, reproducible, and open-source. Happily, this personality defect is not a pre-requisite for writing about statistics & programming. Here are some alternatives.
R Packages
Unfortunately none of the packages below are on CRAN, or under active development. At the time of writing, they lack enough bells and whistles to keep me using Jekyll. However, I’m keen to acknowledge the work that’s gone in to them, and hope they might develop further!
-
By David Springate. No development since 2013, and no longer used by the author for his own blog, but is in active use by @rmflight for his personal blog.
-
Hadley Wickham (with contributions from Gábor Csárdi). No development for about a year. I’m not aware of any websites that currently use it.
-
Poirot is an all-in-one R package for blogging, written by Ramnath Vaidyanathan of Slidify/rCharts/HTMLWidgets fame. It’s since been retired and merged back in to Slidify.
Blog-aware static-site-generators in other languages
Pythonistas are likely to find Pelican useful (see Alyssa Frazee’s blog for an example). Go has Hugo, used by ggplot2-fan and venture capitalist Tom Tunguz. JavaScript has Metalsmith, Wintersmith, and Hexo. The JavaScript SSGs were especially attractive to me, not just because I write a little JS, but also because build tools like Grunt can be handy for things like compressing images, and minifying source code.
Hosted GUIs
-
Open source & markdown friendly! Used by Oliver Keyes.
-
The hosted, easier version of the open-source WordPress server-app. WordPress in some form is used by many prominent data scientists, including the esteemed Hilarys Mason and Parker, Erin Shelman, and r-bloggers itself.
-
Beautiful, proprietary, blog-oriented social-network/content-farm, with a built-in distribution network. Not much activity from data scientists (so far). The terms of service and privacy policy are more social-network-like than the others (though they do respect Do Not Track).