For an embarrassingly long time (2015!) I’ve been meaning to migrate my personal web site away from drupal to a static site generator like Hexo. There are several reasons why I wanted to do this:
- Rather than having my content in a database managed by my service provider, I wanted to manage it the same way I manage source code using version control.
- I got fed up with having to upgrade the software that ran the site (drupal in this case) or pay my hosting provider to keep supporting older versions of PHP.
- A static site can't be hacked to alter its content unless the hosting provider's security is broken or I disclose my credentials.
I was running an old version of drupal (see the earlier point about my laziness in updating the site) and there didn’t seem to be any automatic migration path, so I decided to write some code to do it.
The first step was to write some code to extract the old content of the site and convert it to markdown. For this I decided to use scrapy. Since the existing site already generated an XML sitemap, I fed that to the spider to make sure the entire contents of the site were processed. I also wanted only pages with actual content to be processed, and fortunately the scrapy sitemap spider makes this trivial with some basic rules.
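The rules are just a list of (URL regex, callback) pairs on the spider, and any sitemap entry that matches none of them never reaches a parse callback. Something along these lines, where the URL patterns and callback name are only illustrative guesses at the old drupal paths:

```python
# Illustrative only: each rule maps a URL regex to the callback that handles it,
# so sitemap entries matching no rule are skipped entirely.
sitemap_rules = [
    (r'/blog/', 'parse_page'),    # assumed path for blog posts
    (r'/review/', 'parse_page'),  # assumed path for reviews
]
```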
My old site had several page types, including blog posts and reviews, so I needed some code to distinguish between them. I also needed to distinguish between pages I wanted converted to hexo posts and those I didn’t (the about page, for example).
Here’s the source for the entire spider. I was originally going to use scrapy’s image pipeline support to grab the images out of each post, but since all my images were in a single directory hierarchy it was easier to grab that directory manually and move it into my hexo site.
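In outline it looks something like the sketch below; the spider name, the CSS selectors, the file handling and the use of html2text for the HTML-to-markdown conversion are assumptions made to keep the sketch self-contained, not the exact original code.

```python
# -*- coding: utf-8 -*-
# Sketch only: the spider name, selectors, paths and the html2text dependency
# are assumptions.
import scrapy
import html2text


class DrupalContentSpider(scrapy.spiders.SitemapSpider):
    name = 'davesnowdon.com-content'
    sitemap_urls = ['https://www.davesnowdon.com/sitemap.xml']
    sitemap_rules = [
        (r'/blog/', 'parse_page'),
        (r'/review/', 'parse_page'),
    ]

    # Pages that should not become hexo posts.
    SKIP_PATHS = ('/about',)

    def parse_page(self, response):
        if any(path in response.url for path in self.SKIP_PATHS):
            return

        title = response.css('h1::text').get(default='').strip()
        body_html = response.css('div.content').get(default='')

        # Convert the drupal-rendered HTML body to markdown.
        converter = html2text.HTML2Text()
        converter.body_width = 0  # don't hard-wrap lines
        body_md = converter.handle(body_html)

        # One markdown file per page, with minimal hexo front matter.
        filename = response.url.rstrip('/').rsplit('/', 1)[-1] + '.md'
        with open(filename, 'w') as out:
            out.write('---\ntitle: "{}"\n---\n\n{}'.format(title, body_md))
```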
Migrating tags
The next issue was that drupal does not put tags inside the content pages themselves, but in the pages that link to them. I therefore wrote a separate spider to process only the link pages that the previous spider ignored (it seemed cleaner to give each spider a single job). This spider used scrapy’s CSS selectors to pull out the information of interest, which was then yielded so that scrapy would write it to a JSON file when run with `scrapy crawl davesnowdon.com-tags -o tags.json`.
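A sketch of that spider is below; the class name and spider name come from the original, while the tag-page URL pattern and the CSS selectors are assumptions about how the old drupal pages were laid out.

```python
import scrapy


class DaveSnowdonSpider(scrapy.spiders.SitemapSpider):
    name = 'davesnowdon.com-tags'
    sitemap_urls = ['https://www.davesnowdon.com/sitemap.xml']
    # Only process the tag listing pages that the content spider ignored
    # (the URL pattern is an assumption about the old drupal taxonomy paths).
    sitemap_rules = [
        (r'/tags?/', 'parse_tag_page'),
    ]

    def parse_tag_page(self, response):
        # The tag name comes from the listing page's heading...
        tag = response.css('h1::text').get(default='').strip()
        # ...and every article linked from the listing gets that tag.
        for href in response.css('div.content a::attr(href)').getall():
            yield {'tag': tag, 'url': response.urljoin(href)}
```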
Updating the content with tag information
Having obtained all the tags for the site, the next step was to add them to the content pages. I wrote a simple standalone python program for this.
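Roughly, it reads the tags.json produced by the tag spider and inserts a tags entry into the front matter of each matching post. The file layout, the slug matching and the front matter handling in this sketch are assumptions rather than the exact code.

```python
# Sketch, not the original: the slug matching and front matter handling are
# assumptions about how the posts and tags.json line up.
# usage: python add_tags.py tags.json source/_posts
import json
import os
import sys


def slug(url):
    # Assumed mapping from an old URL to a post's filename stem.
    return url.rstrip('/').rsplit('/', 1)[-1]


def main(tags_file, posts_dir):
    # Build a slug -> [tags] map from the tag spider's JSON output.
    with open(tags_file) as f:
        items = json.load(f)
    tags_by_slug = {}
    for item in items:
        tags_by_slug.setdefault(slug(item['url']), []).append(item['tag'])

    for name in os.listdir(posts_dir):
        stem, ext = os.path.splitext(name)
        if ext != '.md' or stem not in tags_by_slug:
            continue
        path = os.path.join(posts_dir, name)
        with open(path) as f:
            text = f.read()
        # Insert a tags: line immediately after the opening front matter marker.
        tag_line = 'tags: [{}]\n'.format(', '.join(tags_by_slug[stem]))
        with open(path, 'w') as f:
            f.write(text.replace('---\n', '---\n' + tag_line, 1))


if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
```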
Migrating Disqus comments
I now needed to handle the fact that the locations of the pages had changed because of the way I had decided to structure the new hexo site. I therefore needed to generate two files:
- a mapping file I could use to tell disqus the new locations of the content pages, so the comment threads could be updated
- a `.htaccess` file to redirect the old URLs to the new ones, so I wouldn’t break people who had linked to the old version of the site
I used disqus to export a list of all the URLs with comments and then ran `hexo generate` to build the new site, which also meant hexo produced a new sitemap I could use to determine each page’s new location. I relied on the fact that the end of the old and new URLs was basically the same. The program generates a warning for any URL it is unable to map, allowing me to fix it up manually (just one in this case).
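The program works roughly as sketched below. The file names and the heuristic of matching on the last URL segment are assumptions; the disqus URL mapper takes CSV lines of the form "old URL, new URL", and the `.htaccess` gets one Redirect directive per mapped page.

```python
# Sketch of the mapping step: file names and the "match on the last URL
# segment" heuristic are assumptions based on the description above.
import sys
import xml.etree.ElementTree as ET

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def last_segment(url):
    return url.rstrip('/').rsplit('/', 1)[-1]


def main(old_urls_file, new_sitemap, disqus_csv, htaccess):
    # Index the new URLs from the sitemap hexo just generated by their
    # final path segment, since that part stayed basically the same.
    root = ET.parse(new_sitemap).getroot()
    new_by_slug = {last_segment(loc.text): loc.text
                   for loc in root.iter(SITEMAP_NS + 'loc')}

    with open(old_urls_file) as old, \
         open(disqus_csv, 'w') as mapping, \
         open(htaccess, 'w') as redirects:
        for old_url in (line.strip() for line in old if line.strip()):
            new_url = new_by_slug.get(last_segment(old_url))
            if new_url is None:
                print('WARNING: no new URL found for {}'.format(old_url))
                continue
            # Disqus' URL mapper takes "old URL, new URL" CSV lines.
            mapping.write('{}, {}\n'.format(old_url, new_url))
            # Permanent redirect from the old path so existing links keep working.
            old_path = old_url.split('//', 1)[-1].split('/', 1)[-1]
            redirects.write('Redirect 301 /{} {}\n'.format(old_path, new_url))


if __name__ == '__main__':
    main(*sys.argv[1:5])
```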
Finishing touches and manual fix up
I now had the basics of the new site and could generate it using hexo. All that remained was some manual fix-up of the generated markdown and some tweaking of the hexo config files.
jr0cket has posted about hexo many times and I used his post “Deconstructing the Hexo theme” to work out how to change the banner and menu text for my site; the default theme has a 300px-high banner, which takes up rather too much space IMHO.
That’s probably enough for now. This site still has plenty of rough edges and I’ll sort them out later.