Create an e-book from a website with Ruby

I’m going to spend the next two weeks without internet and I want to catch up on some reading. There are a few websites with articles I’d like to read, so I decided to create an e-book out of them.

Those articles have one thing in common: an archive page that lists all of them. I could chuck them into Pocket, but that wouldn’t be much fun and it would involve a lot of clicking. Let’s use Ruby instead.

The first thing to do is to scrape the list of links we’re interested in. Why not do it for this blog?

require 'nokogiri'
require 'open-uri'

archive_url = "http://chodounsky.net/archive/"
link_selector = ".content .archive li a"
domain = "http://chodounsky.net"

# Parse the archive page and turn every matched link into an absolute URL.
# URI.open comes from open-uri (plain open() no longer opens URLs on Ruby 3).
archive = Nokogiri::HTML(URI.open(archive_url))
links = archive.css(link_selector).map { |a| domain + a["href"] }.reverse

We used Nokogiri to parse the archive page and selected all the links to articles with a simple CSS selector. You can be more creative depending on the page structure or your needs: the archive might be spread across multiple pages, or you might want a specific order or filtering. For this example we’ll keep it simple and only reverse the list to start with the oldest articles.

After that, we’ll create a simple data container for storing an article, which becomes a chapter of our new book. It can generate an id for itself and format itself as an HTML string, which we’ll save to a file later.

class Chapter
  attr_accessor :title, :content

  def initialize(title, content)
    @title = title
    @content = content
  end

  # A filesystem-friendly id derived from the title.
  def id
    title.downcase.gsub(" ", "_").gsub(/[^0-9a-z_]/i, '')
  end

  # Render the chapter as a standalone XHTML page.
  def to_s
    <<~EOS
      <?xml version='1.0' encoding='utf-8'?>
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>#{title}</title>
        <style>
          img { max-width: 95%; }
        </style>
      </head>
      <body>
        #{content}
      </body>
      </html>
    EOS
  end
end
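To see what the id generation does, here is the slug logic as a standalone sketch (the `slugify` name is made up; in the class it lives in `Chapter#id`):

```ruby
# Standalone version of the slug logic used in Chapter#id:
# lowercase, spaces to underscores, everything else non-alphanumeric dropped.
def slugify(title)
  title.downcase.gsub(" ", "_").gsub(/[^0-9a-z_]/i, '')
end

slugify("Create e-book from website with Ruby")  # => "create_ebook_from_website_with_ruby"
```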

The next step is to scrape the content we are interested in. It’s an action, so we’ll wrap it in a service object.

class DownloadChapter
  attr_reader :index
  attr_accessor :article_selector, :title_selector, :configuration

  def initialize
    @index = 0
    yield(self)
  end

  def call(url)
    html = raw_html(url)
    title = html.css(title_selector).text
    article = html.css(article_selector).first
    images = download_images_from(article)
    content = replace_images(article, images).to_s

    save(Chapter.new(title, content))
    @index += 1
  end

  private

  # Point every <img> at its local copy and unwrap links, which are
  # useless offline.
  def replace_images(article, images)
    article.css("img").each_with_index { |img, index| img["src"] = images[index] }
    article.css("a").each { |a| a.replace(a.children) }
    article
  end

  def domain
    @domain ||= configuration.fetch(:domain, "")
  end

  def raw_html(url)
    html = Nokogiri::HTML(URI.open(url))
    configuration[:normalize].call(html) if configuration[:normalize]
    html
  end

  # Download every image referenced by the article and return the list of
  # local filenames, prefixed with the chapter index to keep them unique.
  def download_images_from(html)
    html.css("img").map do |img|
      url = domain + img["src"]
      filename = filename_prefix + "_" + url.split("/").last
      File.open(filename, "wb") { |file| file << URI.open(url).read }
      filename
    end
  end

  def filename_prefix
    '%03d' % @index
  end

  def save(chapter)
    File.open("#{filename_prefix}-#{chapter.id}.html", "w") { |f| f.write(chapter.to_s) }
  end
end

This service is the most complex piece of this small Ruby script. We want to download the relevant content, meaning the text and the images. On the other hand, we want to skip comments, ads, sidebars and other distracting elements; that’s where the normalization method kicks in, but more about that later. The article content and title are located with CSS selectors, and we provide the URL to scrape from.

This service has multiple responsibilities, ranging from downloading the images to saving the output to a file, but we are going to keep it in one class for the sake of simplicity. If you intend to do some serious programming, I recommend splitting the responsibilities into separate classes.
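For instance, the image handling could be pulled out into its own small class. This is a hypothetical refactoring, not part of the script above; the `ImageDownloader` name and its methods are made up for illustration:

```ruby
require 'open-uri'

# Hypothetical refactoring: image handling extracted out of DownloadChapter,
# so the service only orchestrates.
class ImageDownloader
  def initialize(domain, prefix)
    @domain = domain
    @prefix = prefix
  end

  # The local filename an image will be saved under.
  def local_name(src)
    @prefix + "_" + (@domain + src).split("/").last
  end

  # Fetch one image and return the local filename.
  def download(src)
    filename = local_name(src)
    File.open(filename, "wb") { |file| file << URI.open(@domain + src).read }
    filename
  end
end
```

Each class now has one reason to change, and the filename logic becomes testable without touching the network.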

Let’s move to the final step and tie everything together.

download_chapter = DownloadChapter.new do |d|
  d.article_selector = ".post"
  d.title_selector = ".post h1"
  d.configuration = {
    domain: domain,
    normalize: -> (article) { article.css("footer").each { |node| node.remove } }
  }
end

links.each do |url|
  download_chapter.call(url)
end

First, we create the service and pass it a configuration. The configuration contains the domain name and the normalization method. This method is important for stripping out content we are not interested in; in this case it removes the footer elements holding the comment section, but you can remove anything by matching CSS selectors.

The last three lines call the service for each link we scraped from the archive page. The whole script generates HTML files, with properly linked images, inside the current folder.

You might wonder where we create the final product: an actual e-book. I have a Nook, so my preferred format is EPUB. It is a zipped archive of HTML pages plus a handful of metadata files under a certain hierarchy. There are a few Ruby gems that export content into it, but I didn’t find any of them convenient enough to produce nice results compatible with my reader.
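For the curious, the fixed part of that hierarchy is small. A minimal sketch of the two pieces every EPUB needs (per the EPUB container spec: a `mimetype` file and a `META-INF/container.xml` pointing at the package file) might look like this; the `book/` directory name is arbitrary, and the actual zipping (with `mimetype` stored first, uncompressed) is left to a zip tool or an editor like Sigil:

```ruby
require 'fileutils'

# Lay out the mandatory EPUB container skeleton on disk.
FileUtils.mkdir_p("book/META-INF")
FileUtils.mkdir_p("book/OEBPS")

# The mimetype file must contain exactly this string.
File.write("book/mimetype", "application/epub+zip")

# container.xml tells the reader where the package file (content.opf) lives.
File.write("book/META-INF/container.xml", <<~EOS)
  <?xml version="1.0" encoding="UTF-8"?>
  <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
      <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
    </rootfiles>
  </container>
EOS
```

The generated chapter HTML files would then go under `OEBPS/` alongside the `content.opf` manifest.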

But there is an excellent e-book editor called Sigil, with which you can produce beautiful e-books, complete with tables of contents and title pages, really easily. I highly recommend it, and it is available for all major operating systems.

Oh, and if you own a Kindle and you’re after the MOBI format, don’t despair. EPUB and MOBI are convertible to each other, and you can use Calibre for that job.


Did you like the article? Send me a comment!