Full text search on a static website

I was recently thinking about implementing full text search for my blog. That sounds like a reasonable thought, apart from the fact that it is a static site. But surely, there must be a way to achieve it.

My blog is a static website, but it is generated by a static site generator (nanoc in my case), so I can start from that. The first thing I need to create is a list of the articles combined with their tags and keywords. As nanoc is a Ruby generator, this part will be in Ruby, but you can easily adapt it to your stack.

require 'json'
require 'nokogiri'

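class CreateFullTextIndex
  # Common English words that carry no meaning for search (list abridged here).
  COMMON_WORDS = %w{ a about above across ... } unless defined?(COMMON_WORDS)

  def initialize(articles)
    @articles = articles
  end

  # Returns an array of hashes, one per article, ready to be serialized to JSON.
  def call
    @articles.map do |item|
      words = item.raw_content.downcase.split(/\W+/)
      keywords = words.uniq - COMMON_WORDS
      {
        id: item.path,
        title: item[:title],
        # Array() guards against articles that have no tags.
        tags: Array(item[:tags]).join(","),
        body: keywords.join(" ")
      }
    end
  end
end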

In the code above I’m iterating through all the articles, collecting their unique words and adding them to a simple JSON-like data structure. The COMMON_WORDS array lists the most common English words, which are not significant for full text search. They are removed from the data structure, which makes it smaller and more relevant.
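If your generator is not Ruby based, the same step might look roughly like this in JavaScript. This is only a sketch: the article property names (path, title, tags, content) and the very short stop-word list are my assumptions for illustration, not something nanoc provides.

```javascript
// Hypothetical stop-word list; a real one would be much longer.
const COMMON_WORDS = new Set(["a", "about", "above", "across", "the", "is", "on"]);

// Turns article objects into the flat records that get indexed.
// The `path`, `title`, `tags` and `content` property names are assumptions.
function buildSearchDocuments(articles) {
  return articles.map(function (article) {
    const words = article.content.toLowerCase().split(/\W+/);
    // A Set drops duplicates while keeping first-seen order.
    const keywords = Array.from(new Set(words)).filter(function (word) {
      return word !== "" && !COMMON_WORDS.has(word);
    });
    return {
      id: article.path,
      title: article.title,
      tags: (article.tags || []).join(","),
      body: keywords.join(" ")
    };
  });
}

const docs = buildSearchDocuments([
  {
    path: "/search/",
    title: "Full text search",
    tags: ["javascript", "nanoc"],
    content: "The search is on the static site. Search is fast."
  }
]);
console.log(JSON.stringify(docs));
```

The output has the same shape as the Ruby version: one record per article with id, title, tags and body.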

Once I have the data, I need to create an index and search in it. There is an awesome JavaScript library, lunr.js, which brings simplified Apache Solr capabilities to the client side.

To work with lunr we have to initialize the index first.

var documents = <%=  CreateFullTextIndex.new(sorted_active_articles).call.to_json %>;
var index = lunr(function () {
  this.field('title', {boost: 10});
  this.field('tags', { boost: 5 });
  this.field('body');
  this.ref('id');
});

documents.forEach(function(i) { index.add(i); });

We took the data that nanoc generated earlier and converted it into JSON. As I mentioned earlier, you can use any other static site generator or even provide the JSON yourself, so you are not limited to any particular technology.

After that, we set up the fields and assign each of them a weight. The weight expresses how significant a match in that field is and therefore how it is prioritized. A match in the title of an article is more important than one in the body text, so it should appear higher in our search results, and that’s what the higher weight is for.
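To get a feel for what the weights do, here is a toy ranking function. It is deliberately not lunr’s real scoring algorithm, just an illustration of the idea: every field that contains the search term adds its boost to the document’s score. The body boost of 1 and the sample documents are assumptions made up for the example.

```javascript
// Toy scoring, NOT lunr's real algorithm: each field that contains the term
// contributes its boost to the document's score.
const BOOSTS = { title: 10, tags: 5, body: 1 };

function scoreDocument(doc, term) {
  return Object.keys(BOOSTS).reduce(function (score, field) {
    const words = String(doc[field] || "").toLowerCase().split(/\W+/);
    return words.indexOf(term) !== -1 ? score + BOOSTS[field] : score;
  }, 0);
}

// Scores every document, drops the ones without a match, best match first.
function rank(documents, term) {
  return documents
    .map(function (doc) { return { ref: doc.id, score: scoreDocument(doc, term) }; })
    .filter(function (hit) { return hit.score > 0; })
    .sort(function (a, b) { return b.score - a.score; });
}

// Made-up sample documents.
const sample = [
  { id: "/a/", title: "Cooking rice", tags: "food", body: "search rice water" },
  { id: "/b/", title: "Full text search", tags: "javascript", body: "index lunr" }
];
const hits = rank(sample, "search");
console.log(hits);
```

Both documents mention "search", but the one with the term in its title comes first because the title boost outweighs a match in the body.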

The last part of setting up the index is to iterate over our JSON documents and add each record to the index one by one.

Now that the index is built, we can implement a simple search on top of a simple form. For that, we are going to intercept its submit event and query the index.

form.addEventListener("submit", function(e) {
  e.preventDefault();
  search(input.value);
});

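// clearResults, findDocumentById and renderResults are small page-specific helpers not shown here.
var search = function(query) {
  clearResults();
  var results = index.search(query).map(function(i) { return findDocumentById(i.ref); });
  renderResults(results);
};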
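The search only returns references, so each hit has to be mapped back to its document. The findDocumentById helper is not shown in this article; one possible way to write it, assuming the documents array from earlier, is to build a lookup table once. The buildDocumentLookup name and the sample data are hypothetical.

```javascript
// Builds an id -> document map once, so each lookup is a plain property access.
function buildDocumentLookup(documents) {
  const byId = Object.create(null);
  documents.forEach(function (doc) { byId[doc.id] = doc; });
  return function findDocumentById(id) {
    return byId[id]; // undefined when no document has this id
  };
}

const findDocumentById = buildDocumentLookup([
  { id: "/a/", title: "First" },
  { id: "/b/", title: "Second" }
]);

// Stand-in for search results: objects with a `ref` property, as used above.
const fakeResults = [{ ref: "/b/" }, { ref: "/a/" }];
const titles = fakeResults.map(function (r) { return findDocumentById(r.ref).title; });
console.log(titles);
```

A plain loop over the documents would work too; the map just avoids scanning the whole array for every hit.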

Awesome, we created a simple working example of a JavaScript search on a static website. Let’s dig a bit deeper and improve the user experience by using permanent URLs.

The idea is that when you access https://chodounsky.com/search/?q=javascript it will query for the javascript keyword. That will make our requests idempotent: for the same URL you will get the same results.

To do so, we need to extract the parameters from the URL with the following code.

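var searchQueryFromUrl = function() {
  var parameters = location.search.replace("?", "").split("&");
  for (var i = 0; i < parameters.length; i++) {
    // Match the "q" parameter exactly, not every parameter that starts with "q".
    if (parameters[i].indexOf("q=") === 0) {
      // A GET form encodes spaces as "+", which decodeURIComponent leaves alone.
      return decodeURIComponent(parameters[i].substring(2).replace(/\+/g, " "));
    }
  }
};

// Run the search on page load when the URL already carries a query.
var initialQuery = searchQueryFromUrl();
if (initialQuery !== undefined) {
  input.value = initialQuery;
  search(initialQuery);
}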

Also, it is important to run searchQueryFromUrl and search every time the page loads, so that a bookmarked or shared search URL shows its results straight away.
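In browsers that provide the standard URLSearchParams class, the manual parsing can probably be replaced with something shorter. This is a sketch of an alternative, not the code running on this blog. The searchQueryFrom name is hypothetical, and it takes the query string as an argument so it can be tried outside a page.

```javascript
// Takes the query string as an argument so it can be tried outside a browser;
// in a page you would pass location.search.
function searchQueryFrom(queryString) {
  // get() returns null when the parameter is missing; it also decodes
  // percent-escapes and turns "+" into a space.
  const value = new URLSearchParams(queryString).get("q");
  return value === null ? undefined : value;
}

console.log(searchQueryFrom("?q=full+text"));
console.log(searchQueryFrom("?query=other"));
```

It returns undefined when there is no q parameter, so it can be checked the same way as the hand-written version.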

After that, we have to modify the URL when the user submits the form. If you try to add the parameter directly to the URL, the page will reload. But don’t despair, JavaScript and HTML5 give you the window.history object. As it is not yet supported in all major browser versions, we’ll use the excellent history.js, which provides a graceful fallback to hashes.

History.Adapter.bind(window, 'statechange', function() {
  var state = History.getState();
  input.value = state.data.q;
  search(state.data.q);
});

form.addEventListener("submit", function(e) {
  e.preventDefault();
  History.pushState({ q: input.value },
    "Search for " + input.value,
    "?q=" + encodeURIComponent(input.value)
  );
});

Firstly, we bound a handler to the statechange event on window, so whenever the URL changes, either because we added parameters or because the user clicked the back button, our search code runs.

Secondly, we changed the form submit handler to push a new state into the history. That changes the URL and fires the statechange event, which in turn triggers a new search. We do that only on the search page.
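If you only target browsers that support window.history natively, the submit step could be sketched without history.js as shown here. One difference to keep in mind: the native pushState does not fire an event, so the search has to be called explicitly. The pushSearchState name and the stub history object are hypothetical and exist only so the snippet can be run outside a browser.

```javascript
// `historyLike` is anything with a pushState(state, title, url) method:
// window.history in a browser, a stub elsewhere.
function pushSearchState(historyLike, query, runSearch) {
  historyLike.pushState({ q: query }, "", "?q=" + encodeURIComponent(query));
  // Native pushState does not fire popstate, so the search is run explicitly.
  runSearch(query);
}

// Stub used only to record the calls being made.
const calls = [];
const stubHistory = {
  pushState: function (state, title, url) {
    calls.push({ state: state, url: url });
  }
};
const searched = [];
pushSearchState(stubHistory, "static site", function (q) { searched.push(q); });
console.log(calls[0].url);
```

In a real page you would pass window.history and the search function, and also listen for the popstate event to handle the back button.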

An awesome thing about this solution is that you have a permalink for every search, and you can use a simple HTML form with a named input on every other page. No JavaScript is required there.

<form action="/search/" id="search-form" method="GET" >
  <input name="q" type="text" id="search-input" />
  <input type="submit" value="search" />
</form>

And that’s it. What seemed like a nearly impossible task was achieved with a few lines of JavaScript and some pre-generated JSON.

You might wonder how big the index is. For my entire blog it’s about 260 kB before gzipping and 90 kB after, which is less than one nice image. Also, once you are on the search page it is crazy fast, as there is no communication with the server, just client side code.
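If you want to check the numbers for your own site, a small Node.js script along these lines should do. It measures the JSON documents that get embedded in the page, and the sample data is made up, so treat it as a sketch. The indexSizes name is hypothetical; zlib is Node’s built-in compression module.

```javascript
const zlib = require("zlib");

// Returns the raw and gzipped byte sizes of a value serialized as JSON.
function indexSizes(documents) {
  const json = JSON.stringify(documents);
  return {
    raw: Buffer.byteLength(json, "utf8"),
    gzipped: zlib.gzipSync(json).length
  };
}

// Made-up, repetitive sample data; real numbers depend on your content.
const sampleDocs = [];
for (let i = 0; i < 200; i++) {
  sampleDocs.push({
    id: "/post-" + i + "/",
    title: "Post " + i,
    body: "static site search index"
  });
}
const sizes = indexSizes(sampleDocs);
console.log(sizes.raw + " bytes raw, " + sizes.gzipped + " bytes gzipped");
```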
