How to generate a blog-wide word count in Jekyll

One of the “minor” tasks left on my to-do list since making the transition to Jekyll was to come up with a quick way to generate a blog-wide word count. This metric is just something I like to have handy (and I may end up putting it on the About page). (Some of you may remember that years ago I wrote a plugin for WordPress to do this very thing.)

Initially, I tried to tackle the problem from just the shell, and it is doable, but inaccurate. All of Jekyll’s blog posts exist in a single directory, and so the following does work:

wc -w * | tail -1 | cut -b -8

This hands every blog post to the wc command, keeps only the final “total” line, and trims that line to its first eight bytes, which is where the number sits. The problem, though, is that it doesn’t ignore the YAML front matter present in every post, so words that shouldn’t be included get added to the count. Over a very large site, those extra words can really skew your word count.
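To get a feel for how much the front matter inflates the number, here is a small Python sketch. The sample post and its front matter keys are made up for illustration; real posts will differ.

```python
# Hypothetical sample post; the front matter keys are invented for illustration.
sample_post = """---
layout: post
title: Hello World
---
This is the body of the post.
"""

# Naive count: every whitespace-separated token, front matter included.
naive_count = len(sample_post.split())

# Body-only count: skip everything up to and including the closing '---' line.
lines = sample_post.splitlines()
closing = lines.index('---', 1)  # position of the second delimiter line
body_count = len(' '.join(lines[closing + 1:]).split())

print(naive_count, body_count)
```

In this toy case half of the counted tokens come from the front matter. On real posts the share is smaller, but it adds up across thousands of files.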

After that idea crashed and burned, I thought I could just come up with a regex that would grab the YAML headers, use grep or egrep to do the matching, and then pipe the inverse of the result into the wc command. After coming up with a regex, though, I ran into a snag: grep’s inability to recognize modifiers. Specifically, I needed to specify “single-line” mode so that the “.” operator would match any character, including newlines.
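For reference, this is what single-line mode changes. The sketch below uses Python's re module, where the mode is the re.DOTALL flag (or the inline (?s) modifier); the sample text is made up for illustration.

```python
import re

# Hypothetical post text with a three-line front matter block.
text = "---\ntitle: Example\n---\nBody text here.\n"

# Without single-line mode, '.' stops at a newline, so a pattern that has
# to span several lines of front matter cannot match.
no_flag = re.match(r'---.*---', text)

# With re.DOTALL, '.' matches newlines too and the whole block is captured.
with_flag = re.match(r'---.*---', text, re.DOTALL)

print(no_flag)                    # None: the match failed
print(repr(with_flag.group(0)))   # the front matter block, newlines included
```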

After banging my head against the wall with that for a while, I just decided to tackle the problem in Python, and was able to whip up a solution pretty quickly, despite my inexperience with the language. The following is what I came up with:

import os
import re

path = '/path/to/jekyll/posts/'
wordCount = 0

# Regex to match YAML front matter and everything after it.
# (?s) turns on single-line mode, so "." also matches newlines.
regex = re.compile(r'(?s)(---.*---)(.*)')

# Iterate through all posts
for post in os.listdir(path):
    with open(os.path.join(path, post), 'r') as f:
        result = re.match(regex, f.read())
    # Count words in everything after YAML front matter
    wordCount += len(result.group(2).split())

print('{:,}'.format(wordCount) + ' words!')
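One caveat with the pattern (---.*---) is that it is greedy: if a post body contains its own --- line (a horizontal rule, say), everything before that line is treated as front matter and dropped from the count. The script also assumes every file in the directory is a post with front matter. Below is a slightly more defensive sketch, assuming Python 3 and UTF-8 posts. The function names count_body_words and count_site_words are made up for this example, and this is one possible variation rather than a replacement for the script above.

```python
import os
import re

# Anchored, non-greedy pattern: the front matter must open the file, and the
# match stops at the first closing '---' line rather than the last one.
FRONT_MATTER = re.compile(r'\A---\s*\n.*?\n---\s*\n?', re.DOTALL)


def count_body_words(text):
    """Count whitespace-separated words after any leading front matter."""
    body = FRONT_MATTER.sub('', text, count=1)
    return len(body.split())


def count_site_words(posts_dir):
    """Sum body word counts over regular, non-hidden files in posts_dir."""
    total = 0
    for name in sorted(os.listdir(posts_dir)):
        full_path = os.path.join(posts_dir, name)
        if name.startswith('.') or not os.path.isfile(full_path):
            continue  # skip hidden files such as .DS_Store, and subdirectories
        with open(full_path, encoding='utf-8') as f:
            total += count_body_words(f.read())
    return total


# A body that contains its own '---' line is still counted in full
# (the bare '---' counts as a word, just as it would with wc).
sample = "---\ntitle: T\n---\nOne two\n\n---\n\nthree\n"
print(count_body_words(sample))
```

Calling count_site_words('/path/to/jekyll/posts/') and formatting the result with '{:,}'.format(...) gives the same style of output as the original script. Files without front matter are counted whole instead of raising an error.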

It’s probably pretty self-explanatory, but if you have any questions (or have a way to maybe make it more efficient or elegant), please feel free to email me.

On my home machine—a mid-2010 MacBook Pro (2.66GHz Core i7)—this script takes about 0.15 seconds to jump through ~3000 posts and spit out the result. (For those curious, the result is 352,802 words.)