About This Site

About These Pages

These pages were born out of a madcap idea centred on my current favourite show – Game of Thrones. Knowing what was to come in the last three episodes of Season 4 (covering the period June 1 to June 21, 2014), I figured I'd collect tweets and analyse them. Social media analysis is always fun, right?

While this was a good idea, it unfortunately came to me a bit late in the day (just five days before the show resumed after the Memorial Day weekend). So I had less than a week to learn how to use the Twitter Streaming API to collect the data. Once that difficult part was done, I spent the better part of the following month preparing and presenting the data. The net result is this presentation, which I hope is interesting to fans and technology enthusiasts alike. For a more detailed look at the story behind the data visualizations, do take a look at my blog.

Technologies Used

The following technologies and data sources were used to generate and present this analysis:

  • GeoNames
    After attempts at reverse geocoding via online APIs proved cumbersome, the GeoNames database was used to build an offline reverse geocoder in SQL and Ruby.
  • MySQL
    All the collected Twitter data was loaded into a MySQL database to generate some simple count metrics. In addition, MySQL hosted the GeoNames database used for reverse geocoding.
  • Talend
    Talend was used as the ETL tool to load the flat files generated by the data collection process into the database. Talend also helped load the GeoNames database (about 9 million records) into MySQL.
  • Hortonworks
    The Hortonworks sandbox was used for simple textual analysis of tweets. In particular, HBase held the tweet data, stop-word lists, etc., and Pig was used to process the data.
  • Ruby
    Ruby was the default language for all the programming in this little experiment. Ruby, along with the Twitter gem, was used to collect data from Twitter via the Streaming API. Ruby scripts detected and fixed data quality issues after collection, and another Ruby script was used to build the offline reverse geocoder. Finally, Ruby scripts generated the various JSON files needed for the visualizations.
  • D3.js and NVD3.js
    Almost all the visualizations were built using the D3.js and NVD3.js libraries.
  • Google Maps API and Fusion Tables
    The Google Maps representation of tweets was built using the Maps JavaScript API and Fusion Tables.
  • Bootstrap and Font Awesome
    The overall theme is based on Bootstrap, and many of the icons were supplied by the popular CSS icon toolkit, Font Awesome.
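To give a flavour of the offline reverse geocoding step, a nearest-neighbour lookup over GeoNames rows can be sketched in Ruby like this. The place list and helper names below are hypothetical illustrations; the real geocoder worked against the full GeoNames table hosted in MySQL:

```ruby
# A minimal sketch of offline reverse geocoding, assuming GeoNames rows
# have been reduced to [name, latitude, longitude] tuples in memory.
RAD = Math::PI / 180.0

# Great-circle (haversine) distance between two points, in kilometres.
def haversine_km(lat1, lon1, lat2, lon2)
  dlat = (lat2 - lat1) * RAD
  dlon = (lon2 - lon1) * RAD
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * RAD) * Math.cos(lat2 * RAD) * Math.sin(dlon / 2)**2
  2 * 6371.0 * Math.asin(Math.sqrt(a))
end

# Return the place record nearest to the given tweet coordinates.
def reverse_geocode(places, lat, lon)
  places.min_by { |_name, plat, plon| haversine_km(lat, lon, plat, plon) }
end

# Hypothetical sample of GeoNames entries.
places = [
  ["London",   51.5074,  -0.1278],
  ["New York", 40.7128, -74.0060],
  ["Mumbai",   19.0760,  72.8777]
]

name, = reverse_geocode(places, 40.6892, -74.0445) # near the Statue of Liberty
puts name # => New York
```

A brute-force scan like this is fine for a small sample, though over the full 9-million-row dataset the real version leaned on SQL to narrow the candidate set first.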

Data collection ran on an Amazon AWS EC2 micro instance, and the analysis was done on a humble desktop running Ubuntu Precise Pangolin.

Credits

The visualizations used to present the data are based, wherever possible, on D3.js example code. Some pieces of code were also reverse engineered from what is available on the D3.js examples page. This listing is an attempt to give credit where it is due.

  • The bar graphs and stacked area graphs used are based on examples from NVD3.js
  • The "Choropleth" world map in the Overview section is based on the datamaps library by Mark DiMarco. Mark was quite helpful in helping me understand how JSON integrated with his code. Thanks a ton again Mark
  • The Day-Hour heatmap uses modified example code from the D3.js example gallery
  • All the word clouds use the Word Cloud example code by Jason Davies
  • The Character Popularity by Day visualization is inspired by Asif Rahman's Publications in Journals over Time example and uses code written for that page
  • The World Map on the locations page is just a Google Map using Fusion Tables, with the UI customized via style sheets
  • And finally, the theme used to present the analysis is a modified free Bootstrap theme called “SB Admin”