Migrating from Drupal to Hexo

For an embarrassingly long time (since 2015!) I’ve been meaning to migrate my personal web site away from Drupal to a static site generator like Hexo. There were several reasons for wanting to do this:

  • Rather than having my content in a database managed by my hosting provider, I wanted to manage it the same way I manage source code: with version control.
  • I got fed up with having to upgrade the software that ran the site (Drupal in this case) or pay my hosting provider to support older versions of PHP.
  • A static site can’t be hacked and its content altered unless the hosting provider’s security is broken or I disclose my credentials.

I was running an old version of Drupal (see the earlier point about my laziness in updating the site) and there didn’t seem to be an automated migration path, so I decided to write some code to do it myself.

The first step was to write some code to extract the old content of the site and convert it to markdown. For this I decided to use Scrapy. Since the existing site was generating an XML sitemap, I decided to feed that to the spider to ensure that the entire contents of the site were processed. I also wanted to ensure that only pages with content were processed; fortunately, Scrapy’s sitemap spider makes this trivial using some basic rules.

Scrapy sitemap rules
sitemap_rules = [
    ('/blog/', 'parse_blog'),
    ('/review/', 'parse_review'),
    ('/project/', 'parse_project'),
    ('/category/', 'parse_category'),
    ('/content/', 'parse_blog'),
    ('/clojure-core-async', 'parse_blog'),
    ('/publications', 'parse_misc'),
    ('/cv', 'parse_misc'),
    ('/old-projects', 'parse_misc'),
    ('/about', 'parse_misc'),
]

My old site had several page types, including blog posts and reviews, so I needed some code to distinguish between them. I also needed to distinguish between pages I wanted converted to Hexo posts and those I didn’t (for example, the about page).

Here’s the source for the entire spider. I was originally going to use Scrapy’s image pipeline support to grab the images out of each post, but since all my images were in a single directory hierarchy it was easier to grab that directory manually and move it into my Hexo site (a sketch of what the pipeline approach would have looked like follows the listing).

Scrapy spider
# -*- coding: utf-8 -*-
# Usage: scrapy crawl davesnowdon.com-spider
import os
from datetime import datetime

import html2text
import scrapy
from scrapy.utils.project import get_project_settings

SETTINGS = get_project_settings()
BASE_DIR = SETTINGS.get('FILES_STORE')

parser = html2text.HTML2Text()
parser.unicode_snob = True


META_DATA_TEMPLATE = """---
title: {}
date: {}
oldUrl: {}
---
"""

META_DATA_TEMPLATE_POST = """---
title: {}
date: {}
oldUrl: {}
tags:
categories: {}
---
"""

# Old blog posts are under blog or content
# reviews have a single category path element
# projects have a double category element project/<project-name>

class DaveSnowdonSpider(scrapy.spiders.SitemapSpider):
    name = "davesnowdon.com-spider"
    sitemap_urls = ["http://www.davesnowdon.com/sitemap.xml"]
    sitemap_rules = [
        ('/blog/', 'parse_blog'),
        ('/review/', 'parse_review'),
        ('/project/', 'parse_project'),
        ('/category/', 'parse_category'),
        ('/content/', 'parse_blog'),
        ('/clojure-core-async', 'parse_blog'),
        ('/publications', 'parse_misc'),
        ('/cv', 'parse_misc'),
        ('/old-projects', 'parse_misc'),
        ('/about', 'parse_misc'),
    ]

    def parse_blog(self, response):
        self.save_page(response, True)

    def parse_review(self, response):
        names = response.url.split('/')[3:]
        if names[0] == 'review':
            self.save_page(response, True, categories=['review'])
        else:
            self.save_page(response, True)

    def parse_project(self, response):
        names = response.url.split('/')[3:]
        if names[0] == 'project':
            self.save_page(response, True, categories=['project', names[1]])
        else:
            self.save_page(response, True)

    def parse_misc(self, response):
        self.save_page(response, False)

    # ignore categories
    def parse_category(self, response):
        pass

    def parse(self, response):
        self.save_page(response, False)

    def save_page(self, response, is_post, categories=None):
        self.logger.debug('Page %s' % response.url)
        # get the path components after the hostname
        path_components = response.url.split('/')[3:]

        if is_post:
            post_name = path_components[-1]
            # some reviews contain review in the filename
            if post_name.startswith('review-'):
                post_name = post_name[7:]
            path = os.path.join(BASE_DIR, '_posts')
            filename = os.path.join(path, post_name + '.md')
        else:
            path = os.path.join(BASE_DIR, os.path.join(*path_components))
            filename = os.path.join(path, 'index.md')

        # create directory, don't care if it already exists
        try:
            os.makedirs(path)
        except:
            pass

        # get submitted date. Will look something like: 'Sun, 21/06/2009 - 14:53 \u2014 dave'
        submitted_text = response.css('span.submitted::text').extract_first()
        if submitted_text is not None:
            submitted_text = submitted_text.encode('ascii', 'ignore').replace('dave', '').replace('admin', '').strip()
            submitted = datetime.strptime(submitted_text, "%a, %d/%m/%Y - %H:%M")
        else:
            # some files have the date missing, will need to fill in manually from the sitemap data
            self.log('Missing date: %s' % filename)
            submitted = None

        title = response.css('title::text').extract_first()
        if title is not None:
            title = title.replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"').replace(u'\xa0', u' ').replace(u':', u'--').replace('| Dave Snowdon', '')

        # post content
        html = response.css('div.left-corner div.clear-block div.content').extract_first()
        html = html.replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"').replace(u'\xa0', u' ')
        markdown = parser.handle(html.encode('ascii', 'ignore'))

        if categories is None:
            categories_md = ''
        else:
            categories_md = "\n- [" + ", ".join(categories) + "]"

        template = META_DATA_TEMPLATE_POST if is_post else META_DATA_TEMPLATE
        with open(filename, 'w') as f:
            if submitted is not None:
                f.write(template.format(title, submitted.strftime('%Y-%m-%d %H:%M:00'), response.url, categories_md) + markdown)
            else:
                f.write(template.format(title, '', response.url, categories_md) + markdown)
        self.log('Saved file %s' % filename)
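
As an aside, if I had gone the image pipeline route instead of copying the image directory by hand, enabling it would have looked roughly like the sketch below. This is not code from the migration: the storage path and the CSS selector for images are assumptions.

Image pipeline sketch (not used)
# In settings.py: enable scrapy's built-in images pipeline and tell it
# where to store downloaded files (the path here is an assumption)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/hexo-site/source/images'

# Each parse method would then also need to yield an item with an
# 'image_urls' field listing absolute URLs for the pipeline to fetch,
# e.g. (the selector is an assumption about the old markup):
#   yield {
#       'image_urls': [response.urljoin(u) for u in
#                      response.css('div.content img::attr(src)').extract()],
#   }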

Migrating tags

The next issue was that Drupal does not put tags inside the content pages themselves but in the pages that link to them. I therefore wrote a separate spider to process only the link pages that the previous spider ignored (it seemed cleaner to give each spider a single job). This spider uses Scrapy’s CSS selectors to pull out the information of interest, which is then yielded so that Scrapy writes it to a JSON file when run with ‘scrapy crawl davesnowdon.com-tags -o tags.json’.

Tag extractor
class DaveSnowdonSpider(scrapy.spiders.SitemapSpider):
    name = "davesnowdon.com-tags"
    sitemap_urls = ["http://www.davesnowdon.com/sitemap.xml"]
    # we only want tags pages
    sitemap_rules = [
        ('/category/tags/', 'parse_tags'),
    ]

    def parse(self, response):
        pass

    def parse_tags(self, response):
        self.logger.debug('Page %s' % response.url)
        linked_pages = response.css('div.node')
        for page in linked_pages:
            page_link = page.css('h2 a::attr(href)').extract_first()
            if page_link:
                abs_link = response.urljoin(page_link)
                terms = page.css('div.terms ul.links li a::text')
                term_strings = [t.extract() for t in terms]
                quoted_strings = ['"'+q+'"' for q in term_strings]
                self.logger.debug('TAGS: {} -> [{}]'.format(abs_link, ", ".join(quoted_strings)))
                yield {
                    'url': abs_link,
                    'tags': term_strings
                }

Updating the content with tag information

Having obtained all the tag information for the site, the next step was to update the content pages with it. I wrote a simple standalone Python program for this.

Update content pages with tag information
# usage
# 1 - generate tags file using
#     scrapy crawl davesnowdon.com-tags -o tags.json
# 2 - run this file to update hexo posts
#     python process_tags.py tags.json <hexo _posts dir>

import os
import json
import re
import sys

# build a map of URL to tags
def read_tags(tags_filename):
    with open(tags_filename) as json_data:
        raw_tags = json.load(json_data)

    # we know that each raw tag for each page will be the same, so this
    # just removes duplicates, we don't need to merge entries
    url_to_tags = {}
    for tag in raw_tags:
        url_to_tags[tag['url']] = tag['tags']

    return url_to_tags


# read file and get value of oldUrl field
def get_old_url(filename):
    with open(filename) as f:
        for line in f.readlines():
            if line.startswith('oldUrl'):
                comps = line.split(' ')
                return comps[-1].strip()
    return None


# build a map of url to filename
def read_files(base_dir):
    file_map = {}
    for dirpath, dirname, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith('.md'):
                filename = os.path.join(dirpath, name)
                old_url = get_old_url(filename)
                if old_url:
                    file_map[old_url] = filename
    return file_map


def replace_tags(filename, tags):
    # read file contents
    with open(filename) as input:
        contents = input.read()

    # replace tags statement
    quoted_tags = ['"'+t+'"' for t in tags]
    new_contents = re.sub(r"^tags:$", "tags: [{}]".format(", ".join(quoted_tags)), contents, flags=re.MULTILINE)

    if new_contents != contents:
        # move existing file to backup
        os.rename(filename, filename + '.bak')

        # write file
        with open(filename, "w") as output:
            output.write(new_contents)


# Ideas for defining main() from http://www.artima.com/weblogs/viewpost.jsp?thread=4829
def main(argv=None):
    if argv is None:
        argv = sys.argv

    if len(argv) != 3:
        print("Usage python process_tags.py <tags.json> <_posts dir>")
        return 1

    tags = read_tags(argv[1])

    files = read_files(argv[2])

    # now for every url with tags, find the right file and update it
    for url, tags in tags.iteritems():
        try:
            filename = files[url]
            print("Processing: {}".format(filename))
            replace_tags(filename, tags)
        except KeyError:
            pass

    return 0


if __name__ == "__main__":
    sys.exit(main())
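
To make the substitution concrete, here is a minimal, standalone illustration of what replace_tags does to a post’s front matter. The title, URL and tags below are made-up examples:

Front matter substitution example (illustrative)
import re

front_matter = """---
title: Example post
date: 2009-06-21 14:53:00
oldUrl: http://www.davesnowdon.com/content/example-post
tags:
categories:
---
"""

quoted_tags = ['"python"', '"robots"']
print(re.sub(r"^tags:$", "tags: [{}]".format(", ".join(quoted_tags)),
             front_matter, flags=re.MULTILINE))
# the empty 'tags:' line becomes: tags: ["python", "robots"]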

Migrating Disqus comments

I now needed to handle the fact that the locations of the pages had changed because of the way I had decided to structure the new Hexo site. I therefore needed to generate two files:

  • a mapping file I could use to tell Disqus the new locations of the content pages so the comment threads could be updated
  • a .htaccess file to redirect from the old URLs to the new ones so that links to the old version of the site wouldn’t break.

I used Disqus to export a list of all the URLs with comments and then ran ‘hexo generate’ to build the new site, which also produced a new sitemap I could use to determine each page’s new location. The mapping relies on the fact that the old and new URLs end with essentially the same path component. The program prints warnings for URLs it is unable to map, allowing me to fix them up manually (just one in this case).

Disqus comment migrator
import os
import json
import re
import sys
import xml.etree.ElementTree as ET
from urlparse import urlparse

# Read a disqus generated list of URLs (see https://help.disqus.com/import-export-and-syncing/url-mapper)
# and a hexo generated sitemap and generate
# - URL mappings CSV file for disqus
# - redirects to add to .htaccess file for site
#
# python migrate-disqus.py ~/Documents/www/davesnowdon.com/backup/davesnowdoncom-2018-10-26T15_57_41.542136-links.csv ~/Documents/www/davesnowdon.com/davesnowdon.com/public/sitemap.xml new-disqus-mappings.csv new-htaccess

HT_ACCESS_HEADER = """
# Don't show directory listings for URLs which map to a directory.
Options -Indexes

# Follow symbolic links in this directory.
Options +FollowSymLinks

# Various rewrite rules.
<IfModule mod_rewrite.c>
RewriteEngine on

RewriteBase /

# map from main domain to www subdomain
RewriteCond %{HTTP_HOST} ^davesnowdon\.com$ [NC]
RewriteRule ^(.*)$ http://www.davesnowdon.com/$1 [R=301,L]
# end subdomain map

"""

HT_ACCESS_FOOTER = """
</IfModule>
"""

# read URLS which have comment threads
def read_comment_urls(filename):
    urls = []
    with open(filename) as f:
        for line in f.readlines():
            urls.append(line.strip())
    return urls


# read file and get value of oldUrl field
def get_old_url(filename):
    with open(filename) as f:
        for line in f.readlines():
            if line.startswith('oldUrl'):
                comps = line.split(' ')
                return comps[-1].strip()
    return None


# build a map of url to filename
def read_files(base_dir):
    file_map = {}
    for dirpath, dirname, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith('.md'):
                filename = os.path.join(dirpath, name)
                old_url = get_old_url(filename)
                if old_url:
                    file_map[old_url] = filename
    return file_map


# read sitemap file and get list of URLs
def read_sitemap(filename):
    new_urls = []
    tree = ET.parse(filename)
    root = tree.getroot()
    for url_elem in root:
        url = url_elem.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
        new_urls.append(url.strip())
    return new_urls


def get_path(url):
    path = urlparse(url).path
    if path.endswith('/'):
        return path[:-1]
    else:
        return path


def get_last_path_component(path):
    comps = path.split('/')
    comps = [s for s in comps if len(s) > 0]
    return comps[-1].strip()


def get_new_url(old_url, new_path_components):
    """
    Given an old URL, list of new URLs and a mapping of old URL to file determine what the new URL should be.
    This relies on the old and new URL ending with the same component which is true of this site but might not be
    true of others.
    """
    old_path = urlparse(old_url).path
    old_last = get_last_path_component(old_path).strip()
    if old_last in new_path_components:
        return new_path_components[old_last]
    elif old_last.startswith('review-'):
        return new_path_components[old_last[7:]]


# Ideas for defining main() from http://www.artima.com/weblogs/viewpost.jsp?thread=4829
def main(argv=None):
    if argv is None:
        argv = sys.argv

    if len(argv) != 5:
        print("Usage python migrate-disqus.py <disqus export> <new sitemap> <output CSV> <output redirects>")
        return 1

    # read list of pages that have disqus comment threads
    comment_urls = read_comment_urls(argv[1])

    # get mapping from old URL to filename
    new_urls = read_sitemap(argv[2])

    new_path_components = {get_last_path_component(get_path(u)): u for u in new_urls}

    ouput_csv_filename = argv[3]

    output_redirects_filename = argv[4]

    # generate mapping
    with open(ouput_csv_filename, "w") as ouput_csv, open(output_redirects_filename, "w") as output_redirects:
        output_redirects.write(HT_ACCESS_HEADER)
        for old_url in comment_urls:
            try:
                new_url = get_new_url(old_url, new_path_components)
                if new_url is not None:
                    print("{} -> {}".format(old_url, new_url))
                    if not old_url.endswith('/') and new_url.endswith('/'):
                        new_url = new_url[:-1]
                    ouput_csv.write("{},{}\n".format(old_url, new_url))
                    old_path = get_path(old_url)
                    new_path = get_path(new_url)
                    old_redirect_path = old_path[1:] if old_path.startswith('/') else old_path
                    # since hexo files are static HTML we don't care about query parameters or replicating the other parts of the old url
                    output_redirects.write("RewriteRule ^{}(.*)$ {}/ [R=301,L]\n".format(old_redirect_path, new_path))
                else:
                    print("WARN: Could not determine new URL for {}".format(old_url))
            except KeyError:
                print("WARN: Could not determine new URL for {}".format(old_url))
        output_redirects.write(HT_ACCESS_FOOTER)

    return 0


if __name__ == "__main__":
    sys.exit(main())
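
As a worked example, here is what the two output lines look like for a single page, using made-up URLs and assuming the new site uses Hexo’s date-based permalinks:

Example output (illustrative)
old_url = "http://www.davesnowdon.com/content/example-post"
new_url = "http://www.davesnowdon.com/2009/06/21/example-post/"

# Row written to the disqus URL mapper CSV (trailing '/' dropped to match the old style):
#   http://www.davesnowdon.com/content/example-post,http://www.davesnowdon.com/2009/06/21/example-post

# Redirect written to the .htaccess file:
#   RewriteRule ^content/example-post(.*)$ /2009/06/21/example-post/ [R=301,L]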

Finishing touches and manual fix up

I now had the basics of the new site and could generate it using Hexo. All that remained was some manual fix-up of the generated markdown and some tweaking of the Hexo config files.

jr0cket has posted about Hexo many times and I used his post “Deconstructing the Hexo theme” to work out how to change the banner and menu text for my site; the default theme has a 300px-high banner which takes up rather too much space IMHO.

That’s probably enough for now. This site still has plenty of rough edges and I’ll sort them out later.