Find Broken Links On a Website

January 29, 2016

This website has zero broken links, including links to external websites, at least as of the last time I checked. I used to handle some of this on the back-end automatically, but now I pretty much rely on the following method because it catches all cases. It's actually surprising how many websites I link to that end up dropping off the face of the Internet. The Internet is a fickle beast.

I published a Perl script a while back, which I sometimes update as I find issues, that I use to crawl a website starting from a root URL. It crawls all internal links on the website, and optionally external ones, to make sure that they exist and can be properly resolved.

To do it, I execute the crawl.pl script and wait a long, long time for it to complete.

./crawl.pl -v -x http://www.digitalpeer.com/blog/ | tee results.txt

The resulting output looks something like this.

...
ROOT: http://www.digitalpeer.com/blog/parallel
GOOD 1 <img> http://www.digitalpeer.com/blog/file/483/268/full/circuit.jpg file/483/268/full/circuit.jpg
GOOD 1 <a> https://github.com/digitalpeer/plights https://github.com/digitalpeer/plights
GOOD 1 <img> http://www.digitalpeer.com/blog/file/472/268/full/DSC02163.JPG /blog/file/472/268/full/DSC02163.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/474/268/full/DSC02158.JPG /blog/file/474/268/full/DSC02158.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/475/268/full/DSC02161.JPG /blog/file/475/268/full/DSC02161.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/476/268/full/DSC02155.JPG /blog/file/476/268/full/DSC02155.JPG
GOOD 1 <a> http://www.digitalpeer.com/blog/trait /blog/trait
...

This shows the script grabbing a page, prefixed with ROOT:, and then checking all of the links it finds on that page. It checks images, CSS files, basically any link, not just hrefs.
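The real crawl.pl handles the recursion, the option flags, and a lot of edge cases, but the core idea is simple: fetch a page, pull out every link-like attribute, and confirm that each one resolves. Here is a minimal single-page sketch of that idea in Perl, assuming the LWP::UserAgent and HTML::LinkExtor modules are installed; the output loosely mimics the format above, but it is not the actual script.

#!/usr/bin/perl
# Sketch only: check every link on a single page (the real crawl.pl
# recurses into internal pages and takes command-line options).
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $root = shift or die "usage: $0 <url>\n";
my $ua   = LWP::UserAgent->new(timeout => 10);

# Collect href/src/etc. attributes from every tag HTML::LinkExtor knows
# about: <a>, <img>, <link>, <script>, and so on.
my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;
    push @links, [ $tag, $_ ] for values %attrs;
});

my $page = $ua->get($root);
die "ROOT fetch failed: " . $page->status_line . "\n" unless $page->is_success;
$extor->parse($page->decoded_content);

print "ROOT: $root\n";
for my $link (@links) {
    my ($tag, $url) = @$link;
    my $abs = URI->new_abs($url, $root);    # resolve relative URLs against the page
    next unless defined $abs->scheme && $abs->scheme =~ /^https?$/;
    my $res = $ua->head($abs);              # HEAD is enough to know the link resolves
    printf "%s <%s> %s %s\n", ($res->is_success ? 'GOOD' : 'BAD'), $tag, $abs, $url;
}

Keep in mind that some servers reject HEAD requests, so a more robust checker would fall back to a GET when the HEAD fails before declaring a link broken.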
