The methodology we used in this project combines computer-based tools for exploring and structuring the data, which help identify potentially interesting sets of tweets, with more established methods of media analysis. This let us make the most effective use of the human expertise that is essential to understanding the content. You can find more information on our research methods in this article by Rob.
One of the problems we had to solve was relating tweets to each other to identify tweet/retweet relationships. In principle, this would have meant comparing every tweet against every other tweet, an impossibly high number of comparisons to perform. Fortunately, we know that a parent tweet needs to be older than its retweets, and we also know who the source of each retweet was, so this narrowed the search space down to a manageable size.
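The idea can be sketched as follows. This is a minimal illustration, not our actual pipeline: the field names (`author`, `ts`, `source`) and the toy records are hypothetical, standing in for whatever schema the corpus uses. Original tweets are indexed by author, so each retweet is only compared against the credited author's older tweets rather than the whole corpus.

```python
from collections import defaultdict

# Hypothetical minimal tweet records: "source" is the author a retweet
# credits (None for original tweets), "ts" a timestamp.
tweets = [
    {"id": 1, "author": "alice", "ts": 100, "source": None},
    {"id": 2, "author": "bob",   "ts": 150, "source": "alice"},
    {"id": 3, "author": "carol", "ts": 200, "source": "alice"},
    {"id": 4, "author": "alice", "ts": 250, "source": None},
]

# Index original tweets by author, so a retweet is only compared
# against that one author's tweets instead of all 2.6 million.
by_author = defaultdict(list)
for t in tweets:
    if t["source"] is None:
        by_author[t["author"]].append(t)

def find_parent(retweet):
    # The parent must come from the credited source author and must be
    # older than the retweet; take the most recent such candidate.
    candidates = [t for t in by_author[retweet["source"]]
                  if t["ts"] < retweet["ts"]]
    return max(candidates, key=lambda t: t["ts"], default=None)
```

With the toy data above, both retweets resolve to alice's first tweet, since her second tweet is newer than either retweet.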
The corpus of 2.6 million tweets is not large in the sense that it would fill up a lot of hard drive space: it easily fits onto a USB stick. However, while each individual data item is quite small (less than a kilobyte), it can contain links to multiple other tweets through @mentions, so the complexity of the data is high. There are about 1.7 million mentions in the corpus.
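Those @mentions are what give the corpus its structure: each one is an edge from a tweet's author to the mentioned user, so the collection forms a directed graph despite the small per-item size. A rough sketch of extracting that graph, again with hypothetical records rather than the real corpus schema:

```python
import re
from collections import defaultdict

# Hypothetical minimal records; the real corpus schema is not shown here.
tweets = [
    {"id": 1, "author": "alice", "text": "Launching our study with @bob"},
    {"id": 2, "author": "bob",   "text": "Thanks @alice, cc @carol"},
    {"id": 3, "author": "carol", "text": "Interesting thread"},
]

# Treat each @mention as a directed edge author -> mentioned user.
mentions = defaultdict(list)
for t in tweets:
    for user in re.findall(r"@(\w+)", t["text"]):
        mentions[t["author"]].append(user)

# mentions now maps each author to the users they mentioned,
# e.g. alice -> [bob], bob -> [alice, carol].
```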
Here is a list of media mentions of our work.