How to save your Twitter timeline to a file with a simple script

Marco Fioretti shares a simple script, based on the Curl program, that you can run as a cron job from your own server to archive a Twitter timeline.

I use Twitter to get news, to let other people know what I'm doing or what I find interesting, and to find relevant information quickly.

A while ago, however, I realized that I also wanted to keep a local copy of my timeline on my computer, that is, the stream made of all my tweets plus those of all the people I follow. There are several online services that can do this for you, but I wanted a Linux-style solution: some simple script that I could run as a cron job from my own server.

Finding a suitable command line Twitter interface was harder than I thought. Several scripts you'll find online are unusable, because they were never updated after Twitter decided to support only OAuth for authentication. Others are designed to embed Twitter timelines in Web pages or to send status updates to Twitter: very worthy goals, but not what I needed. For similar reasons, I also discarded full-blown Twitter console clients like TTYtter, which are overkill for this job and not easy to run automatically anyway. Even the Python Twitter library was unnecessarily complicated for this scenario.

Eventually, I found online a very simple script that uses Curl to access the mobile version of Twitter.com. That script (which, as far as I can tell, is no longer online; if you find it, please add its URL in the comments) was exactly what I needed: read access to the last twenty tweets in one's Twitter timeline, with the smallest possible amount of code and software dependencies. I used its "login" and "see timeline" sections as the basis for the script that these days I run from a cron job to archive my own timeline.
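
To give an idea of what I mean, a crontab entry along these lines would run the script once an hour (the script name and paths are placeholders, so use whatever you prefer; note that the % signs must be escaped with backslashes, because cron treats bare % characters specially):

  # archive the timeline every hour, appending to one file per day
  0 * * * * /usr/local/bin/twitter_archive.sh >> $HOME/twitter_archive/timeline_$(date +\%Y\%m\%d).html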

Twitter timeline archival script

The heart of the script is the Curl program, which is available on practically every GNU/Linux distribution. Curl's job is to automate Web surfing, even when there are forms to fill in or cookies to exchange with a website.
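
In case you have never used Curl this way, here is a minimal illustration (the URL and form field names are made up): the first command submits a login form and saves the session cookies in a "jar" file; the second sends those cookies back on a later request, just like a browser would:

  # POST the login form fields, storing any cookies the site sets
  curl --cookie-jar /tmp/jar.txt --data 'user=me' --data 'pass=secret' https://www.example.com/login
  # reuse the saved cookies to fetch a page that requires the session
  curl --cookie /tmp/jar.txt https://www.example.com/private_page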

Thanks to Curl, the script is only 25 lines long, and could be even shorter if I didn't generate an intermediate HTML dump of the timeline, or if I squeezed all those sed invocations into two or three lines. As it is, however, it is very easy to understand, even for scripting beginners, or so I hope:

       1  #! /bin/bash
       2
       3  PASSWD=my_Twitter_password
       4  USER=mfioretti_it
       5  COOKIEFILE=/tmp/cookiesShtw.txt
       6  COOKIE_LINE="--cookie $COOKIEFILE --cookie-jar $COOKIEFILE --user-agent Mozilla/4.0"
       7  # twitter_dump.html is a temporary file, useful for debug purposes
       8  rm -f  /tmp/twitter_dump.html
       9  # log in to Twitter
      10  curl -s $COOKIE_LINE --data "username=$USER" --data "password=$PASSWD" --data 'commit=Sign In' https://mobile.twitter.com/session > /dev/null
      11
      12  #see timeline
      13
      14  curl -s $COOKIE_LINE http://mobile.twitter.com > /tmp/twitter_dump.html
      15
      16    egrep '^<strong|^<span class="status">'  /tmp/twitter_dump.html    | \
      17    sed 's/href="\//href="http:\/\/twitter.com\//g'     | \
      18    sed 's/ class="twitter-atreply"//g'                 | \
      19    sed 's/ class="twitter_external_link"//g'           | \
      20    sed 's/ class="twitter-hashtag"//g'                 | \
      21    sed 's/ rel="nofollow"//g'                          | \
      22    sed 's/^<strong>//g'                                | \
      23    sed 's/<\/span>$/<br>/g'                            | \
      24    sed -n '1h;1!H;${;g;s/<\/strong>\n<span class="status">/ : /g;p;}'
      25  exit

The first eight lines of the script are nothing special: they simply store in a few variables all the parameters that Curl needs to log in to Twitter when I call it in line 10. The cookie obtained in this step grants Curl access to the timeline when, in line 14, it asks for it and saves it into /tmp/twitter_dump.html. After that, the only thing that remains to be done is to clean up all the unnecessary HTML markup. Each tweet in twitter_dump.html will have a format like this (edited for clarity):

  <strong><a href="http://mobile.twitter.com/twitter_user">twitter_user</a></STRONG>
  <SPAN CLASS="STATUS">one tweet from that user, in HTML format</span>

Therefore, I extract from the temporary file only the lines that contain the actual tweets and their authors (line 16). Next, I rewrite the relative links as absolute ones (line 17), and remove all the Twitter CSS markup and the initial and final tags (lines 18-23). Line 24 looks like black magic, but it isn't, really. I want to get one complete tweet, author and text, on each line. To achieve this, I need to remove the HTML markup that you see in upper case in the snippet above, which spans two consecutive lines. Line 24 of the script, which I gratefully copied and adapted from Austin Matzko's blog, is the sed way to find and replace text patterns across consecutive lines. Running the script produces a series of lines with this format:

<a href="http://mobile.twitter.com/twitter_user">twitter_user</a> : one tweet from that user, in HTML format

This is still HTML, but much simpler than what you get from Twitter. I stop there because it is the quickest way to keep a copy not only of the tweets, but also of any links to Web pages or other Twitter users they may contain. Should I need a text-only version someday, I could simply feed the result of this script to a console browser like w3m.
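
A couple of quick notes before closing. The two-line trick of line 24 is worth knowing on its own; this toy example (the input text is made up) joins each "name" line with the "text" line that follows it, using the same hold-space technique:

  # 1h;1!H collects the whole input in sed's hold space; on the last
  # line ($), g copies it back into the pattern space, where s/// can
  # match patterns that span a newline, and p prints the result
  printf 'name: Alice\ntext: hello\nname: Bob\ntext: ciao\n' | \
  sed -n '1h;1!H;${;g;s/\ntext:/ :/g;p;}'

As for the text-only conversion, assuming the script is saved as twitter_archive.sh (the name, of course, is up to you) and w3m is installed, something like this would do:

  ./twitter_archive.sh | w3m -T text/html -dump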

Final thought

This script is pretty rough, but it does its job without the need for any special library. Sure, if Twitter changed its markup I might have to update the script, but that would be a trivial task, so I'm not worried about it. Its only real weakness, if you can call it a weakness, is that it needs Twitter to remain accessible, at least in read mode, through a standard Web login procedure. If and when that changes, I'll have no choice but to install a real Twitter library, with keys and whatnot. In the meantime, I'll stick to my simple script, and like Twitter all the more because it's so simple to deal with.

About Marco Fioretti

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
