File this one under “an itch that needed scratching”, as I wanted a quick dashboard that showed social sharing of articles in this blog. The networks I chose to do this for were Facebook, Twitter, LinkedIn, StumbleUpon and Google+.
On the side of networks with nice APIs are Facebook, Twitter and LinkedIn. To know the share counts on those networks, you need only hit these URLs where %s is the URL you are inquiring about:
http://graph.facebook.com/%s
http://urls.api.twitter.com/1/urls/count.json?url=%s
http://www.linkedin.com/cws/share-count?url=%s
Then there is StumbleUpon, which has no API but returns a page at this address if the page has been added to the service:
http://www.stumbleupon.com/url/%s
With Google+, we need to fetch the +1 button’s code at:
https://plusone.google.com/u/0/_/+1/fastbutton?count=true&url=%s
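If you want to eyeball what these endpoints return before writing any real code, a minimal sketch like the following will do. It assumes only that LWP::Simple is installed; the article URL is a placeholder, so swap in a post from your own site.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
# Placeholder article URL; replace with a real post from your own site.
my $article = "http://example.com/some-post";
# Dump each raw response so we can see what the services hand back.
print get("http://urls.api.twitter.com/1/urls/count.json?url=$article"), "\n";
print get("http://graph.facebook.com/$article"), "\n";
print get("http://www.linkedin.com/cws/share-count?url=$article"), "\n";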
For StumbleUpon and Google+, we will be stripping the number of shares out of the pages themselves, which means we will need regexes, and for that reason I chose to use Perl. I will freely admit that my Perl-fu is lacking and hope you’ll correct my naivety in the comments below; besides, Perl needs some love.
The first thing we need to do is grab the URLs that we will be calling.
For this task I used the WWW::Mechanize module to go to the Girt By Code index page and return all the article URLs on that index page.
We then pass each page to a subroutine for each network, which returns a string containing that network’s result. That code looks like this:
#!/usr/bin/perl
use strict;
use JSON;
use WWW::Mechanize;
use WWW::Mechanize::GZip;
#Dispatch table
my %social_networks = (
"facebook" => \&fetch_facebook_sql,
"twitter" => \&fetch_twitter_sql,
"stumbleupon" => \&fetch_stumbleupon_sql,
"google+" => \&fetch_googleplus_sql,
"linkedin" => \&fetch_linkedin_sql
);
sub spider {
my $url = shift || "http://www.techrepublic.com/blog/australia";
$url =~ /([^\/]+\/\/[^\/]+)/ or die "ABORTING: Cannot find domain\n";
my $domain = $1;
my $mech = WWW::Mechanize::GZip->new();
# About to spider
my $response = $mech->get( $url );
my @pages = $mech->find_all_links( url_regex => qr/^\/blog\/australia\/[\w-]+\/\d+$/ );
foreach (@pages){
my $page = $domain.$_->url();
while ( my ($network_name, $network_sub) = each(%social_networks) ) {
$network_sub->($page);
}
}
# Done spidering
}
spider;
sub fetch_facebook_sql {
# code to come
}
sub fetch_twitter_sql {
# code to come
}
sub fetch_googleplus_sql {
# code to come
}
sub fetch_linkedin_sql {
# code to come
}
sub fetch_stumbleupon_sql {
# code to come
}
The tricky bit in the above code is the %social_networks hash, which stores a subroutine reference for each network. This is how we will determine which function to call for each network, and it lets us keep the logic for fetching each share count reasonably separate. It is also a nice way to see, right at the top of the file, which networks we are using. We also make use of Perl’s $_ variable to save declaring a named variable for each item in the @pages array; if you are new to Perl, it is a variable that comes in very handy and can save a lot of verbosity. We call the url method on $_ because each item of @pages is a WWW::Mechanize::Link object.
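If dispatch tables are new to you, here is a toy example of the same idea, separate from the script itself: the hash values are code references, and appending ->(...) to one calls the stored subroutine with the given arguments.
#!/usr/bin/perl
use strict;
use warnings;
# A hash of code references: look up a name, call the stored sub.
my %greeters = (
"english" => sub { return "Hello, $_[0]"; },
"french" => sub { return "Bonjour, $_[0]"; },
);
while ( my ($lang, $sub) = each(%greeters) ) {
print "$lang: ", $sub->("Perl"), "\n";
}
This prints one greeting per language, in whatever order each() happens to walk the hash.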
At the start of the spider subroutine we pull the ‘http://[yourdomain.com]’ portion out of the URL and store it; for the default URL above, $domain ends up as http://www.techrepublic.com. This is because Mechanize returns relative links and we need the full URL for certain network calls.
Inside the while loop we make use of the dispatch table to fire off calls to the fetch_[network]_sql functions that are currently only stubs.
Now if you run the code above and use the Data::Dumper module to inspect which pages are returned, you will notice that some of the links are duplicated; we’ll need to ignore those. We keep track of the pages visited with a hash, simply testing whether the URL has already been added as a key. If the page has been seen, we skip it; if not, we ask each network about it. We will also build our output as a comma-delimited string, so the result can be opened in any spreadsheet program.
That changes the foreach loop to:
my %indexed_pages;
my $output_str = "URL, StumbleUpon, LinkedIn, Twitter, Facebook, Google\n";
foreach (@pages){
my $page = $domain.$_->url();
if( !defined $indexed_pages{$page} ) {
$indexed_pages{$page} = 1;
$output_str .= $page;
# Note: each() walks the hash in no particular order, so the columns will not necessarily match the header order above.
while ( my ($network_name, $network_sub) = each(%social_networks) ) {
$output_str .= $network_sub->($page);
}
$output_str .="\n";
}
}
# Done spidering
print $output_str;
Now let’s start fetching the share counts. We’ll start with Facebook and Twitter, since they are well behaved and return well-formed JSON. Before that, we need to import the LWP::Simple module so we can call its get function.
sub fetch_facebook_sql {
my $page_url = shift;
my $network_url = "http://graph.facebook.com/%s";
my $response = decode_json get( sprintf($network_url, $page_url) );
my $face_shares = defined ${$response}{'shares'} ? ${$response}{'shares'} : 0;
return ",".$face_shares;
}
sub fetch_twitter_sql {
my $page_url = shift;
my $network_url = "http://urls.api.twitter.com/1/urls/count.json?url=%s";
my $response = decode_json get( sprintf($network_url, $page_url) );
return ",".${$response}{'count'};
}
The above code simply forms the request URL from the passed-in $page_url variable, fetches it from the social network with LWP::Simple’s get function and decodes the resulting string with decode_json.
Facebook does not return a zero count for unshared URLs, so we test for the shares element in the JSON and set the count to zero if it does not exist.
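If you want to see that for yourself, a quick throwaway script that dumps the decoded structure makes the missing shares key easy to spot. The URL below is just a placeholder, and nothing is assumed beyond the modules already used in this post plus Data::Dumper from the core distribution.
#!/usr/bin/perl
use strict;
use warnings;
use JSON;
use LWP::Simple;
use Data::Dumper;
# Placeholder URL; swap in a real article to inspect its Graph API record.
my $raw = get("http://graph.facebook.com/http://example.com/some-post");
print Dumper( decode_json($raw) ); # no 'shares' key means the URL has not been shared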
LinkedIn returns JSONP: a JSON object wrapped in a call to the JavaScript function IN.Tags.Share.handleCount(). All we need to do in this case is strip the wrapping function call away and decode what is left.
sub fetch_linkedin_sql {
my $page_url = shift;
my $network_url = "http://www.linkedin.com/cws/share-count?url=%s";
my $response = get( sprintf($network_url, $page_url) );
# Strip the JSONP wrapper so only the JSON object remains.
$response =~ s/^IN\.Tags\.Share\.handleCount\(//;
$response =~ s/\);?\s*$//;
$response = decode_json $response;
return ",".${$response}{'count'};
}
For Google+, it is not possible to make an API call and get a nice, easy JSON result back. Instead we need to fetch the +1 button and strip the count from its HTML. Luckily, for this use case, it is simply a matter of finding the HTML element with an id of aggregateCount and capturing its contents. Perl makes this look almost too easy, and we now have four network results to work with.
sub fetch_googleplus_sql {
my $page_url = shift;
my $network_url = "https://plusone.google.com/u/0/_/+1/fastbutton?count=true&url=%s";
my $response = get( sprintf($network_url, $page_url) );
# Capture the count from the aggregateCount element, falling back to 0 if it is missing.
my $plus_ones = $response =~ /id="aggregateCount"[^>]*>(\d+)/ ? $1 : 0;
return ",".$plus_ones;
}
Now we come to StumbleUpon, where we will need to parse a web page to find the number we are after.
The key to finding the number of shares on StumbleUpon is to find a link whose href has the format http://www.stumbleupon.com/url/[your link without the http://]. That is why, after combining $page_url and $network_url into the URL we will call on StumbleUpon, we use a regex to replace /http:// with a single /. You can actually call StumbleUpon without stripping the http:// off the page URL, but doing it up front means we can later search for a single variable and keep the code cleaner.
Another thing to notice here is that we have to use WWW::Mechanize::GZip, as StumbleUpon returns compressed pages. We set the autocheck parameter to 0 to stop Mechanize from exiting our script with an error when the page cannot be found on StumbleUpon; the network does not return a zero share count, it returns a 404 page. Hence we test for a successful request with HTTP status code 200 and, from there, search for our share count buried in the returned HTML. Once we find the share count element, we strip off the trailing " views" (or " view" for a single share) and are left with the number of shares made on the network.
sub fetch_stumbleupon_sql {
my $page_url = shift;
my $network_url = "http://www.stumbleupon.com/url/%s";
my $stumble_page = sprintf($network_url, $page_url);
$stumble_page =~ s/\/http:\/\//\//g;
my $mech = WWW::Mechanize::GZip->new(autocheck => 0);
my $response = $mech->get( $stumble_page );
my $count = 0;
if($mech->status() == 200){
# \Q...\E makes the URL match literally despite its regex metacharacters.
my $link = $mech->find_link( url_regex => qr/\Q$stumble_page\E/ );
$count = $link->text() if defined $link;
$count =~ s/ views?//g;
}
return ",".$count;
}
Now we can run our script and it will output a comma-delimited list of URLs and share counts to STDOUT.
To get it into a file, we need only redirect the output, for example:
perl sc_article.pl > social_shares.csv
From there we can open the file with a spreadsheet program, or even better, take the code a step further and store it in a database so we can track historical data.
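That database step is outside the scope of this post, but here is a rough sketch of the idea, assuming DBD::SQLite is installed and we are happy with a simple, hypothetical shares table (neither is part of the script above). Each run would then append a timestamped row per URL and network instead of building up a CSV string, which would also mean the fetch_[network]_sql subs return a bare number rather than a comma-prefixed string.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
# Hypothetical schema: one row per URL, network and run, timestamped for history.
my $dbh = DBI->connect("dbi:SQLite:dbname=social_shares.db", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE IF NOT EXISTS shares (url TEXT, network TEXT, count INTEGER, fetched_at TEXT)");
my $insert = $dbh->prepare("INSERT INTO shares (url, network, count, fetched_at) VALUES (?, ?, ?, datetime('now'))");
# The spider loop would call this instead of appending to $output_str.
sub store_count {
my ($url, $network, $count) = @_;
$insert->execute($url, $network, $count);
}
store_count("http://example.com/some-post", "twitter", 42); # example row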
The full code is listed below:
#!/usr/bin/perl
use strict;
use JSON;
use LWP::Simple;
use WWW::Mechanize;
use WWW::Mechanize::GZip;
#Dispatch table
my %social_networks = (
"facebook" => \&fetch_facebook_sql,
"twitter" => \&fetch_twitter_sql,
"stumbleupon" => \&fetch_stumbleupon_sql,
"google+" => \&fetch_googleplus_sql,
"linkedin" => \&fetch_linkedin_sql
);
sub spider {
my $url = shift || "http://www.techrepublic.com/blog/australia";
$url =~ /([^\/]+\/\/[^\/]+)/ or die "ABORTING: Cannot find domain\n";
my $domain = $1;
my $mech = WWW::Mechanize::GZip->new();
# About to spider
my $response = $mech->get( $url );
my @pages = $mech->find_all_links( url_regex => qr/^\/blog\/australia\/[\w-]+\/\d+$/ );
my %indexed_pages;
my $output_str = "URL, StumbleUpon, LinkedIn, Twitter, Facebook, Google\n";
foreach (@pages){
my $page = $domain.$_->url();
if( !defined $indexed_pages{$page} ) {
$indexed_pages{$page} = 1;
$output_str .= $page;
# Note: each() walks the hash in no particular order, so the columns will not necessarily match the header order above.
while ( my ($network_name, $network_sub) = each(%social_networks) ) {
$output_str .= $network_sub->($page);
}
$output_str .="\n";
}
}
# Done spidering
print $output_str;
}
spider;
sub fetch_facebook_sql {
my $page_url = shift;
my $network_url = "http://graph.facebook.com/%s";
my $response = decode_json get( sprintf($network_url, $page_url) );
my $face_shares = defined ${$response}{'shares'} ? ${$response}{'shares'} : 0;
return ",".$face_shares;
}
sub fetch_twitter_sql {
my $page_url = shift;
my $network_url = "http://urls.api.twitter.com/1/urls/count.json?url=%s";
my $response = decode_json get( sprintf($network_url, $page_url) );
return ",".${$response}{'count'};
}
sub fetch_googleplus_sql {
my $page_url = shift;
my $network_url = "https://plusone.google.com/u/0/_/+1/fastbutton?count=true&url=%s";
my $response = get( sprintf($network_url, $page_url) );
# Capture the count from the aggregateCount element, falling back to 0 if it is missing.
my $plus_ones = $response =~ /id="aggregateCount"[^>]*>(\d+)/ ? $1 : 0;
return ",".$plus_ones;
}
sub fetch_linkedin_sql {
my $page_url = shift;
my $network_url = "http://www.linkedin.com/cws/share-count?url=%s";
my $response = get( sprintf($network_url, $page_url) );
# Strip the JSONP wrapper so only the JSON object remains.
$response =~ s/^IN\.Tags\.Share\.handleCount\(//;
$response =~ s/\);?\s*$//;
$response = decode_json $response;
return ",".${$response}{'count'};
}
sub fetch_stumbleupon_sql {
my $page_url = shift;
my $network_url = "http://www.stumbleupon.com/url/%s";
my $stumble_page = sprintf($network_url, $page_url);
$stumble_page =~ s/\/http:\/\//\//g;
my $mech = WWW::Mechanize::GZip->new(autocheck => 0);
my $response = $mech->get( $stumble_page );
my $count = 0;
if($mech->status() == 200){
# \Q...\E makes the URL match literally despite its regex metacharacters.
my $link = $mech->find_link( url_regex => qr/\Q$stumble_page\E/ );
$count = $link->text() if defined $link;
$count =~ s/ views?//g;
}
return ",".$count;
}