I’m a little late to the game here, but the ever-popular floatingsheep blog smashed its previous daily page view high 10 days ago with a post entitled ‘Mapping Racist Tweets in Response to President Obama’s Re-election’. I don’t think I need to explain what it’s about; the title is pretty self-explanatory.
It drew a lot of comments, understandably. People questioned the small sample size (395 ‘hate’ tweets in total), the search terms (‘monkey’ OR ‘nigger’ AND the text ‘Obama’ OR ‘reelected’ OR ‘won’), the exclusion of racist tweets towards Romney (rectified by floatingsheep, here), the reliance on geolocated tweets (only 2-5% of all tweets), the possibility that particular search terms were used in a positive sense (for example, ‘nigger’), and the mapping of racist tweets as opposed to tweeters (the results could include multiple tweets from the same individual).
These objections are all now included in a FAQ section, here, in response to the many questions floatingsheep received about their choice of method. For me, it says a lot about people’s ability to pick holes in ‘scientific’ method. Although the comments started to get a little wild, they did at least open the door for a response from the floatingsheep team, clarifying the methods they used. It really isn’t a sample of the American population, let alone the American Twitter population; let’s get that clear. 395 tweets is so far from a representative sample that drawing any conclusions from it (‘the south is racist’) is at best naive and at worst dangerous. I think floatingsheep know that. Still, it says a lot about the pitfalls of mapping tweet data, because there are just so many removals from the population at large. In this case, the data is not the first group below but the last:
NOT people in USA
people in USA tweeting
people in USA tweeting racist comments about Obama
people in USA tweeting racist comments about Obama with geo-location activated
people in USA tweeting particular racist comments about Obama with geo-location activated during a 7-day period at a specific time as searched for in a built database
That’s 4x removed from the group that, I would argue, most people think of this data as representing: the people of the USA. Maybe it’s only 3x removed, because I’d like to think most people have the intelligence to see that this is, at most, only feasibly representative of those on Twitter (it doesn’t take much to realise there are more young people than old people on Twitter, whatever that may mean). There’s a danger in not making this patently clear to people.
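To make the ‘removals’ idea concrete, here is a rough sketch of that list as a chain of filters. Everything in it is made up for illustration: the `Person` fields, the six toy records and the `funnel` helper are my own assumptions, not floatingsheep’s data or method. The only point is that each filter narrows the previous group, and the number of filters after the first is the number of removals.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical flags, invented purely for this illustration.
    tweets: bool
    racist_obama_tweet: bool
    geotagged: bool
    matched_search: bool  # caught by the specific search terms and 7-day window


# Toy population (NOT real data): six imaginary people in the USA.
people = [
    Person(False, False, False, False),
    Person(True, False, False, False),
    Person(True, False, True, False),
    Person(True, True, False, False),
    Person(True, True, True, False),
    Person(True, True, True, True),
]

# Each step mirrors one line of the list above; filters are applied cumulatively.
filters = [
    ("people in USA", lambda p: True),
    ("... tweeting", lambda p: p.tweets),
    ("... racist comments about Obama", lambda p: p.racist_obama_tweet),
    ("... with geo-location activated", lambda p: p.geotagged),
    ("... matching the search terms and time window", lambda p: p.matched_search),
]


def funnel(population, steps):
    """Apply each filter to what survived the previous one and record the counts."""
    remaining = list(population)
    counts = []
    for label, keep in steps:
        remaining = [p for p in remaining if keep(p)]
        counts.append((label, len(remaining)))
    return counts


steps = funnel(people, filters)
removals = len(steps) - 1  # steps away from the baseline "people in USA"

for label, count in steps:
    print(f"{count:>2}  {label}")
print(f"{removals}x removed from the baseline population")
```

With real data the counts would obviously look nothing like this; the shape of the funnel is the thing to take away, not the numbers.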