Find Broken Links On a Website
This website has zero broken links, including links to external websites - at least as of the last time I checked. I used to handle some of this automatically on the back-end, but now I rely on the following method because it catches all cases. It's surprising how many websites I link to end up dropping off the face of the Internet. The Internet is a fickle beast.
I published a Perl script a while back, which I update occasionally as I find issues, that crawls a website starting from a root URL. It follows every internal link on the site, and optionally external ones, to verify that each target exists and resolves properly.
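The core of the approach is simple. Here is a minimal sketch of the idea in Perl - not the actual crawl.pl script - that fetches a single page, extracts every link it contains, and checks each one. It assumes the LWP::UserAgent, HTML::LinkExtor, and URI modules are installed; the real script adds recursion over internal pages, duplicate tracking, and the internal/external distinction.

#!/usr/bin/env perl
# Sketch only: check every link found on one page.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $root = shift or die "usage: $0 <url>\n";
my $ua   = LWP::UserAgent->new(timeout => 10);

# HTML::LinkExtor reports every link-carrying attribute
# (href, src, etc.), not just <a href>.
my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, [$tag, $_] for values %attr;
});

my $page = $ua->get($root);
die "Cannot fetch $root: ", $page->status_line, "\n"
    unless $page->is_success;
$extor->parse($page->decoded_content);

for my $link (@links) {
    my ($tag, $url) = @$link;
    my $abs = URI->new_abs($url, $root);  # resolve relative URLs
    # A HEAD request is enough to test existence; some servers
    # reject HEAD, in which case a GET fallback would be needed.
    my $res = $ua->head($abs);
    printf "%s <%s> %s\n", $res->is_success ? 'GOOD' : 'BAD', $tag, $abs;
}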
To check the site, I run the crawl.pl script against the root URL and wait a long, long time for it to complete.
./crawl.pl -v -x http://www.digitalpeer.com/blog/ | tee results.txt
The resulting output looks something like this.
...
ROOT: http://www.digitalpeer.com/blog/parallel
GOOD 1 <img> http://www.digitalpeer.com/blog/file/483/268/full/circuit.jpg file/483/268/full/circuit.jpg
GOOD 1 <a> https://github.com/digitalpeer/plights https://github.com/digitalpeer/plights
GOOD 1 <img> http://www.digitalpeer.com/blog/file/472/268/full/DSC02163.JPG /blog/file/472/268/full/DSC02163.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/474/268/full/DSC02158.JPG /blog/file/474/268/full/DSC02158.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/475/268/full/DSC02161.JPG /blog/file/475/268/full/DSC02161.JPG
GOOD 1 <img> http://www.digitalpeer.com/blog/file/476/268/full/DSC02155.JPG /blog/file/476/268/full/DSC02155.JPG
GOOD 1 <a> http://www.digitalpeer.com/blog/trait /blog/trait
...
This shows the crawler grabbing a page, prefixed with ROOT:, and then checking every link it finds in that page. It checks images, CSS files, basically any link - not just hrefs.
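Since everything is piped to results.txt, pulling out the failures afterwards is a one-liner. Assuming broken links are reported with a status other than GOOD (check the script's output for the exact label), filtering out the good results leaves the failures along with the ROOT: lines that identify which page each one came from.

grep -v '^GOOD' results.txt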