Sixteen instances of the St Andrews CC cloud, together with the main server (36 CPU cores in total), spent a long working day crunching through the riots corpus data. In the end, we had identified about 1.1 million tweet/retweet relationships. Our method was to calculate the Levenshtein distance between each potential parent tweet and the retweet (after removing the strings that identified it as a retweet in the first place, e.g., "RT @alexvossuk"). In many cases the distance was zero, as the content had been retweeted exactly as originally posted. In another group the retweets had been altered slightly, giving a non-zero Levenshtein distance, so we had to choose a cut-off that balanced false positives against false negatives.
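To make the matching step concrete, here is a minimal sketch in Python of the idea described above. It is not the code we ran on the cloud. The function names, the regular expression for the retweet marker, and the `max_distance` cut-off of 10 are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern for the retweet marker, e.g. "RT @alexvossuk" or "RT @alexvossuk:".
RT_PREFIX = re.compile(r"^\s*RT\s+@\w+:?\s*", re.IGNORECASE)


def strip_rt_marker(text):
    """Remove a leading retweet marker so only the quoted content remains."""
    return RT_PREFIX.sub("", text).strip()


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_parent(retweet, candidates, max_distance=10):
    """Return (best_candidate, distance), or (None, None) if nothing is close enough.

    max_distance is an assumed cut-off: raising it admits more false
    positives, lowering it produces more false negatives.
    """
    body = strip_rt_marker(retweet)
    best, best_distance = None, max_distance + 1
    for candidate in candidates:
        distance = levenshtein(body, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    if best is None:
        return None, None
    return best, best_distance
```

An exact retweet comes back with a distance of zero, while a lightly edited one comes back with a small positive distance. Comparing every retweet against every potential parent in this way is what makes the job expensive, and because each retweet can be checked independently, the work splits naturally across worker nodes.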
Although we could have run this analysis on other resources, the work needed to be done quickly, so it was very convenient to have a cloud infrastructure ready to hand where we could reserve the necessary compute power with a few commands. It is easy to start up more worker nodes and just as easy to tear them down once they have finished.