Using a katana to migrate my website

Photo by Robinson Greig / Unsplash

For years, my website was a hand-coded amalgam built on a static site generator. I started from VuePress as a base – which is great, by the way – but I also spent far too much time coding fun things by hand. While these things (like hidden keyboard shortcuts) are fun, I recently decided that I also wanted folks to be able to subscribe to this blog… the one you're reading right now.

And given the recent turmoil in various social networks with changes of ownership and rules left and right, I was very hesitant about entrusting this audience to “someone else.” At the same time, I wanted to avoid building an email-blog-subscription-from-scratch into my existing website. That was a bridge too far.

Enter Ghost - an open source blogging platform. I could host that myself (I use DigitalOcean and Cloudron to host numerous services already). And while that was in some ways just trading the complexity of coding the site myself for the complexity of hosting Ghost, at least it did not greatly increase the overall complexity. Ghost also has a hosted service option… but I wanted to rip the whole band-aid off at once and just host it myself. Feel free to put a reminder in your calendar for 5 years from now to ask me how that worked out for me… maybe I'll live to regret it.

📦 Moving the blog

So, it seemed simple enough – I spin up a Ghost container in my Cloudron dashboard, and I'm off to the races! And I can just specify the subdomain as blog.boleary.dev. That's the easy part.

But two big questions came next:

  1. How do I migrate my existing blog content?
  2. Once it is migrated, how can I make sure that all the links out there in the world end up at the right place?

It turned out both processes were relatively simple – the Markdown format of my existing blog allowed me to easily transfer each blog post over to Ghost. And then, with my blog hosted on a separate subdomain, I could just redirect the old address of each previous post to its new home on blog.boleary.dev.

But once I had seen the simplicity of Ghost, I started seriously considering replacing the whole website with Ghost. However, that presented a new challenge: how do I make sure that all the links out there in the world still work if I migrate the entire site? I could try to be as careful as possible, but how could I be certain I didn't miss something?

⚔️ Enter Katana

Katana is a modern web crawler written in Go by ProjectDiscovery, the folks behind such fantastic hacking tools as subfinder and nuclei. Katana has numerous options that make crawling a site for all of its content simple. You can use the -u example.com flag to crawl a single URL, or -list domains.txt to input a list of URLs to crawl. Katana, like all PD tools, also supports stdin/stdout so that it can be easily placed into a pipeline. You can pipe the results of httpx into Katana, or pipe the URLs Katana finds right into nuclei, or any other combination you can dream up.

But my use case for Katana was a little simpler - and less intent on finding and fixing vulnerabilities…except that I didn't want my site to be vulnerable (hehe) to dead links once I migrated it.

To get started, I had Katana crawl all of boleary.dev just to see what it found with:

🖥️
katana -u https://boleary.dev
Katana output for https://boleary.dev

That works great for a website that is “static” – that is, one that doesn't rely on client-side JavaScript to create and display more content, the way a single-page app (SPA) built with something like Vue or React does. But even if you're crawling a site with that kind of tech, you can use a headless browser to render the pages fully and process all the related JavaScript with:

🖥️
katana -u https://boleary.dev -headless

And there are a LOT of other great options that Katana has, like:

  • -d int to set the crawling depth (by default the depth is 2)
  • -automatic-form-fill to fill in forms automatically as part of the crawl
  • -show-browser to show the browser during a crawl
  • -f to extract basic fields like the url or directory; you can also write custom regex to extract exactly what you're looking for in the responses

And you can learn more about all the features on GitHub.

🧑‍💻 Putting it into action

Now I had a whole list of URLs linked from all around my website, from one-off pages I might have built for a particular demo to every blog post I ever wrote. Then I could start the job of migrating pages to Ghost – which was still at blog.boleary.dev for the moment – and compare that list to the output of:

🖥️
katana -u https://blog.boleary.dev -headless

Ignoring differences I knew would exist because of the technology each site was built with, I figured I could easily work out which actual “web pages” I was missing and still needed to convert.
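
Here is a minimal sketch of how that kind of filtering might look in the shell. It is not necessarily the exact approach used for this migration: the pages_only function name and the list of file extensions are assumptions for illustration, so adjust them for your own site.

```shell
# pages_only: read URLs on stdin and drop the ones that look like static
# assets (scripts, styles, images, fonts), keeping likely "web pages".
# NOTE: the function name and the extension list are assumptions for this
# sketch, not something Katana provides.
pages_only() {
  # -E extended regex, -i ignore case, -v keep only NON-matching lines.
  # An optional ?query and #fragment after the extension are tolerated.
  # "|| true" keeps the exit status clean when every line is filtered out.
  grep -Eiv '\.(js|css|map|png|jpe?g|gif|svg|ico|webp|woff2?|ttf)(\?[^#]*)?(#.*)?$' || true
}
```

Piping a crawl through it, for example katana -u https://boleary.dev -headless | pages_only, should leave just the URLs that look like pages rather than the files each framework ships.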

🕵️‍♂️ Comparing the output

As with other PD tools, Katana follows the Unix philosophy of many sharp tools. As such, manipulating the output and comparing it with tools like diff or even side by side in vim as a buffer was trivial.

Now, I'm not a Linux expert, but I play one on TV, so I liked being able to see the output of both sites with my own human eyes. That way I could sanity check that my greps and diffs were working and that I was comparing apples to apples. That is also simple with Katana, using a command like:

🖥️
katana -u https://boleary.dev -headless -o bolearydev.txt

The output file is plain text, without any of the formatting of the “pretty” terminal output, so I could use it to compare both sites and know that I hadn't missed anything.
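
One way to do that comparison, sketched under a few assumptions: the to_paths and missing_paths helper names are made up for this example, blog.txt is a hypothetical second output file written with -o from the blog.boleary.dev crawl, and a page is assumed to keep the same path on both sites.

```shell
# to_paths FILE: strip the scheme and host from each URL so that results
# from different hosts can be compared, drop a trailing slash, then sort
# and de-duplicate. (Hypothetical helper, not part of Katana.)
to_paths() {
  sed -E 's#^https?://[^/]+##; s#/$##; s#^$#/#' "$1" | LC_ALL=C sort -u
}

# missing_paths OLD NEW: print the paths that appear in OLD but not in NEW.
missing_paths() {
  old_paths="$(mktemp)"
  new_paths="$(mktemp)"
  to_paths "$1" > "$old_paths"
  to_paths "$2" > "$new_paths"
  # comm -23 suppresses lines unique to NEW and lines common to both,
  # leaving only what the old site has and the new one lacks.
  LC_ALL=C comm -23 "$old_paths" "$new_paths"
  rm -f "$old_paths" "$new_paths"
}
```

With both crawls saved, missing_paths bolearydev.txt blog.txt would print every path found on the old site that the new crawl never saw. Anything in that list is either a page still to migrate or a difference you already expected.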

🎁 Wrapping it all up

Once I was ready to transition, I simply changed my DNS records to point boleary.dev AND blog.boleary.dev to the Ghost site. I actually kept the old site around at legacy.boleary.dev… because, again, I don't trust myself to have done it all right. In fact, if you're looking to get started with Katana, may I suggest that you try to find something that I “forgot”? With the two commands below, you can get the output of Katana from both sites. Putting aside things like JS files that differ because the sites run different applications, is there anything I “missed” that you can only find on legacy.boleary.dev?

🖥️
katana -u https://boleary.dev -headless -o bolearydev.txt
katana -u https://legacy.boleary.dev -headless -o legacy.txt

Reach out to me on Twitter @olearycrew or on the ProjectDiscovery Discord if you find something… there could even be some bounty swag in it for you if I unintentionally missed something 😉