LinkkURI

LinkkURI is a Perl script that can be used to find non-working links on a web site. You may download, use, distribute and modify the program under the terms of the MIT Licence.

Features

Here are some examples of how to use LinkkURI:

  1. Crawl your website and find all the files on it:
    linkkuri -c http://my.domain.example/ > mysite.txt
  2. Check all the links on your website:
    linkkuri -V < mysite.txt > links.txt
  3. Find all the non-working links on your website:
    grep -v '^2' links.txt
  4. Re-check the non-working links the next day:
    grep -v '^2' links.txt | linkkuri -r | grep -v '^2' > stillnotworking.txt
  5. Check the links on a single page:
    linkkuri -v http://my.domain.example/page.html
  6. Check the links in a single local file:
    linkkuri -v file:/directory/path/name.html

LinkkURI gave the following report on an early version of this web page (it was then on a different server):
200 http://www.pp.htv.fi/kpuolama/linkkuri.html
200 http://www.pp.htv.fi/kpuolama/linkkuri-licence.txt
200 http://www.perl.com/
200 http://www.perl.com/CPAN/
404 ftp://ftp.redhat.com/pub/redhat/powertools/CPAN/CPAN_rev.2/i386/
200 ftp://ftp.funet.fi/pub/mirrors/ftp.redhat.com/redhat/powertools/CPAN/CPAN_rev.2/i386/
200 http://www.pp.htv.fi/kpuolama/linkkuri.pl
200 http://www.pp.htv.fi/kpuolama/linkkuri.pl.gz
200 http://www.iki.fi/kaip/

The numbers at the beginning of the lines are HTTP response codes. Only the Red Hat archive was unavailable at the time of the test. The format of the report is purposefully simple so that it is easy to manipulate with standard Unix text tools such as grep, sed or perl.
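
For instance, a short Perl one-liner counts how many links fall under each response code in links.txt (the file produced in example 2 above):

    perl -ane '$n{$F[0]}++; END { print "$_ $n{$_}\n" for sort keys %n }' links.txt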

LinkkURI utilizes powerful Perl modules, which take care of all the technical details, including URL parsing and redirections.
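
For instance, resolving a relative link and checking it with the standard LWP and URI modules might look roughly like this (a sketch only; the module choice and details are an illustration, not the exact code in linkkuri):

    use strict;
    use LWP::UserAgent;
    use URI;

    # Resolve a possibly relative link against the page it was found on.
    my $base = URI->new('http://my.domain.example/page.html');
    my $url  = URI->new_abs('../other.html', $base);

    # A HEAD request is enough to get the response code;
    # LWP follows redirections automatically.
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->head($url);
    print $response->code, " $url\n";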

WARNING: Be careful with the website crawl (the -c and -V options). You don't want to crawl too much, or the BOFH might get nervous... Try to make sure that your crawler stays on one site (it should, but people are creative in these kinds of things...). Don't try to crawl (i.e. download) somebody else's site. To be on the safe side, the number of GET requests used to collect links is limited to 100. You can increase the limit by editing the source code if necessary. The number of URLs to be checked (HEAD requests) is unlimited.
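
If you ever need a larger limit, the change amounts to one constant near the top of the script; something along these lines (the variable name here is made up for illustration, check the source for the real one):

    # Hypothetical name: the cap on GET requests made while collecting links.
    my $MAX_GET_REQUESTS = 100;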

This program tries to be environmentally friendly, though: each page is fetched only once, and if only the header is needed, only the header is requested. Before downloading a page for its links, linkkuri checks the header for a text/html content type. The program obeys the Robot Exclusion Protocol by default. A web host is contacted at most once per minute, so be patient. Set your email address at the beginning of the source code, as required by the protocol.
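
This is the kind of behaviour the LWP::RobotUA module provides: it reads robots.txt and enforces a delay between requests to the same host. A sketch of how such an agent is set up (an illustration only, not necessarily the exact code in linkkuri; the agent name and address are placeholders):

    use strict;
    use LWP::RobotUA;

    # The agent name and contact address identify the robot to webmasters.
    my $ua = LWP::RobotUA->new('linkkuri/0.01', 'Your.Name@your.domain.example');
    $ua->delay(1);    # wait at least one minute between requests to a host

    # robots.txt is fetched and honoured automatically before the request.
    my $response = $ua->head('http://my.domain.example/');
    print $response->code, "\n";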

Download

To use LinkkURI you need to have Perl installed, plus a few additional Perl packages, all of which can be found on CPAN.
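
If you prefer to install straight from CPAN, the CPAN shell does the work; for example (Some::Module is just a placeholder for the package you need):

    perl -MCPAN -e 'install Some::Module'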

If you use Linux with the RPM package system, you can download the packages from ftp.redhat.com (mirrored at ftp.funet.fi). Install them with the command rpm -ivh package-name.rpm.

Download version 0.01.02 and save it as linkkuri. You must give it execute permissions: chmod a+x linkkuri. Check that the first line of the file points to your perl interpreter (it is /usr/bin/perl by default). Read the licence before downloading the file. Note that the program is at a very early stage of development (meaning: it is just a quick hack with a stupid name).
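
For example, on most systems the first line should simply read:

    #!/usr/bin/perl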


URL: http://www.iki.fi/kaip/linkkuri.html

Copyright © 1999 Kai Puolamäki (Kai.Puolamaki@iki.fi)