
How to find out which of your bookmarks are still valid

Marco Fioretti provides the script needed to automatically check your collected bookmarks for working links.
Search engines help us find what we need on the World Wide Web. Bookmarks help us organize and, above all, quickly reload, at any moment, any page we found worth remembering. If its URL is still valid, that is. Are you sure that every single one of the bookmarks you may have spent ten years collecting and classifying still works?

Today, I will show you how to end that uncertainty, at least on Linux/Unix systems, with one simple script that you can run manually or as a cron job. The code can also easily be adapted to check the links in generic HTML pages or, for that matter, in any plain text file.

The general method

The script must perform two distinct operations:

  • generate a list, optimized for automatic processing, of the titles and URLs of all your bookmarks
  • ask the WWW servers that host each page if it is still there or not
The first step depends on the format of your bookmarks, which normally corresponds to the browser you use because, as I have complained on TechRepublic here and here, Free Software programs love to be different even when there is no reason for it. The exception is if you have put your bookmarks online, in your personal cloud, with Free Software like Semantic Scuttle. The code that generates bookmark lists for Chrome/Chromium, Firefox, Konqueror, and Semantic Scuttle is at the end of the post. Before that, however, it is better to explain the second, common step. The script should look like this listing:
   1 #! /bin/bash
   2 BOOKMARK_LIST=/tmp/bookmarks_list
   3 # put bookmark extraction code here
   4
   5 IFS=$'\t'
   6 while read LINE; do
   7    TITLE=`echo "$LINE" | cut -f1`
   8    URL=`echo "$LINE" | cut -f2`
   9    wget -O/dev/null -q "$URL" && echo -n "STILL_ONLINE: " || echo -n 'DISAPPEARED : '
  10    echo "$TITLE"
  11    done < $BOOKMARK_LIST
  12 exit

Line 3 is the one to replace with the bookmark-specific code I'll show you in a moment. The rest of the code assumes that the bookmarks are already in the file defined by the $BOOKMARK_LIST variable. Each line of that file must contain the title and URL of one bookmark, separated by one TAB character. As far as we are concerned, line 5 tells BASH to only consider TAB as a column separator. The loop in line 6 reads that file one line at a time, loading titles and URLs into the variables with the same names (lines 7 and 8). In line 9, which is a bit of shell black magic copied straight from CommandLineFU, the wget utility tries to fetch the current URL, without saving it. If the operation succeeds, the script writes the first message; otherwise, the second. In both cases, the $TITLE of that URL is added to the current line of output. The result will be a bunch of lines like these, which would be easy to reformat as HTML or into other formats:

  STILL_ONLINE: For Khosla, clean tech is all about scale
  DISAPPEARED : Commerce Secretary Unveils Plan for Smart Grid Interoperability
  STILL_ONLINE: Improved Biomass Stoves
  DISAPPEARED : Darfur Cookstoves-Benefits
  STILL_ONLINE: Floating Hydro-Electric Barrel Generator
  STILL_ONLINE: Fixing the bioenergy accounting loophole
  DISAPPEARED : Biomethane from Toilets
  STILL_ONLINE: How to Wean a Town Off Fossil Fuels
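If you prefer curl to wget, here is one possible alternative to the test in line 9, wrapped in a small function. The function name and the 10-second timeout are my own additions, not part of the original script; instead of relying only on the exit code, this version prints the final HTTP status code and treats anything other than 200 as a dead link:

```shell
# check_url: print the same STILL_ONLINE/DISAPPEARED prefix as line 9,
# but with curl instead of wget (name and timeout are my own choices)
check_url() {
    local CODE
    # -s: silent, -L: follow redirects, -o /dev/null: discard the page,
    # -w '%{http_code}': print only the final HTTP status code
    CODE=$(curl -s -L -o /dev/null --max-time 10 -w '%{http_code}' "$1")
    if [ "$CODE" = "200" ]; then
        echo -n "STILL_ONLINE: "
    else
        echo -n "DISAPPEARED : "
    fi
}
```

Inside the loop, line 9 would then simply become `check_url "$URL"`, with line 10 unchanged.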

The bookmark-specific code

Below you'll find what you must substitute for line 3 of the listing above, according to which bookmark system you use. Semantic Scuttle stores the bookmarks in a MySQL database, so you have to query it with the proper user name and password. The snippets of code for desktop browsers, instead, all work in the same general way: they slice and rearrange the content of your bookmarks file (relax: without touching the file itself!) with a series of tr(anslate), cut, grep and Perl commands, until only lines with the format "TITLE[tab]URL" remain. The bookmark file definition for Firefox is more complex because this browser saves a plain text copy of all its bookmarks every day, in files named after the current date. Therefore, to process the most recent one, you have to "find" the file in the "bookmarkbackups" subfolder that is less than one day old.

Apart from this, the easiest way to understand how those commands work together is to run them at the prompt, one at a time, and see how they process and reformat the text (but of course don't hesitate to ask for more detailed explanations in the comments, if something isn't clear!).
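For instance, here is what the first two tr commands of the browser snippets actually do, fed with a miniature, made-up fragment in the same shape as a Firefox backup (the sample data below is invented purely for illustration):

```shell
# Two fake bookmark records on one line, as they would appear in the JSON file:
SAMPLE='{"title":"One","uri":"http://a.example"} {"title":"Two","uri":"http://b.example"}'

# First tr: turn every newline (\012) into a space, flattening the whole file.
# Second tr: turn every "{" into a newline, so each record gets its own line,
# ready for the grep and perl filters that follow.
echo "$SAMPLE" | tr "\012" " " | tr "{" "\012"
```

Each `"title":…` pair now sits on its own line, which is exactly the shape the grep filters and the Perl one-liner expect.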

Semantic Scuttle

  DB=the_database_name
  PW='thepassword'
  USR=the_database_user
  QUERY='SELECT bTitle, bAddress FROM sc_bookmarks;'
  echo $QUERY | mysql -u $USR -p$PW $DB --skip-column-names > $BOOKMARK_LIST

Firefox

  FIREFOX_BM=`find $HOME/.mozilla/firefox/YOUR_FIREFOX_PROFILE/bookmarkbackups/ -type f -name "*json" -mtime -1`
  cat $FIREFOX_BM | tr "\012" " "     | \
                    tr "{" "\012"     | \
          grep -v '"uri":"about'      | \
          grep -v '"uri":"file'       | \
          grep -v '"uri":"javascript' | \
          grep -v '"uri":"place'      | \
          perl -e 'while(<>) {$_ =~ m/"title":"([^"]*)"/ ; $T = $1; m/"uri":"([^"]*)"/ ; $U = $1; print "$T\t$U\n" if (($T) && ($U) && ($T ne $U )) }' > $BOOKMARK_LIST

Chrome

  CHROME_BM="$HOME/.config/google-chrome/Default/Bookmarks"
  cat "$CHROME_BM"                         | \
        tr "\012" " " | tr "{" "\012"          | \
        grep '"type": "url",'                  | \
        perl -e 'while(<>) {$_ =~ m/"name": "([^"]*)"/ ; $T = $1; m/"url": "([^"]*)"/ ; $U = $1; print "$T\t$U\n" if (($T) && ($U) && ($T ne $U )) }'  > $BOOKMARK_LIST

Konqueror

  KONQUEROR_BM="$HOME/.kde/share/apps/konqueror/bookmarks.xml"
  egrep '<bookmark href="|<title>' $KONQUEROR_BM | \
  perl -e 'while (<>) {if (($B) && ($_ =~ m/^\s*<title>([^<]*)</)) {$T = $1; print "$T\t$B\n"} ; if (m/^\s*<bookmark href="([^"]*)"/ ) {$B = $1} else {undef $B}}' > $BOOKMARK_LIST
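Finally, since the introduction mentioned running the script as a cron job: once the finished script is saved and made executable, a crontab entry like this one runs it unattended. The script path and report file are of course placeholders you should adapt to your own system:

```
# minute hour day-of-month month day-of-week  command
# run the bookmark checker every Sunday at 03:00, saving the report:
0 3 * * 0  $HOME/bin/check_bookmarks.sh > $HOME/bookmark_report.txt 2>&1
```

Redirecting both standard output and standard error (`2>&1`) into the report file keeps any wget warnings together with the STILL_ONLINE/DISAPPEARED lines.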

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
