This post explains the methodology behind Twittercensus.

Data is collected from Twitter’s API (api.twitter.com). For every account tested, we download the 75 most recent tweets. After hashtags, links and user names have been removed, these tweets are analyzed together as one text. The language is identified with the open source PEAR package Text_LanguageDetect, which recognizes the three-letter combinations (trigrams) that are characteristic of each language (Cavnar & Trenkle, 1994).
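To make the idea concrete, here is a minimal Python sketch of trigram-based language identification in the spirit of Cavnar & Trenkle. It is not the code used for the study, which relies on the PEAR library. The function names, the regular expression and the profile size of 300 are assumptions made for illustration, and the language profiles are expected to be built from sample texts you supply yourself.

```python
import re
from collections import Counter

# Links, @user names and #hashtags: the tokens removed before analysis.
# (This pattern is an illustrative assumption, not the study's exact rule.)
_NOISE = re.compile(r"https?://\S+|[@#]\w+")


def clean(tweets):
    """Join the tweets into one text and drop links, hashtags and user names."""
    text = _NOISE.sub(" ", " ".join(tweets))
    return " ".join(text.lower().split())


def trigram_profile(text, top=300):
    """Return the character trigrams of `text`, most frequent first."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Rank-order distance between two profiles; lower means more similar."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # cost of a trigram the language lacks
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(tweets, lang_profiles):
    """Pick the language whose profile is closest to the account's tweets."""
    doc = trigram_profile(clean(tweets))
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In practice the profiles would be built from large reference texts for each language. With only a sentence or two per language, the result is far less reliable.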

After an account has been identified as Finnish, all friends and followers of that account are added to a queue to be analyzed. This process is repeated until all accounts (and their followers and friends, and their followers and friends, etc.) have been scanned, and no more Finnish-speaking accounts are found.
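The queue-driven scan described above is essentially a breadth-first traversal of the follower graph. The sketch below shows the general shape under stated assumptions: `is_finnish` and `neighbours` are hypothetical callables standing in for the language check and for the API calls that list an account's friends and followers. Neither is the actual code behind the study.

```python
from collections import deque


def crawl(seeds, is_finnish, neighbours):
    """Scan outward from `seeds`, expanding only accounts judged Finnish.

    `is_finnish(account)` and `neighbours(account)` are hypothetical
    callables supplied by the caller. Returns the set of Finnish accounts
    and the set of scanned accounts that were not Finnish.
    """
    queue = deque(seeds)
    seen = set(seeds)      # every account that has ever been queued
    finnish = set()
    while queue:
        account = queue.popleft()
        if not is_finnish(account):
            continue       # not Finnish: its contacts are not queued
        finnish.add(account)
        for other in neighbours(account):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return finnish, seen - finnish
```

The `seen` set ensures each account is analyzed only once, even if it follows, or is followed by, many Finnish accounts. The loop ends on its own once the queue runs dry, which is the point where no new Finnish-speaking accounts turn up.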

Users not recognized as Finnish, accounts without tweets and protected accounts are excluded.

Schematic of Twitter Census methodology

Altogether 1,333,448 accounts have been scanned (plus the ones identified as Finnish). This is the total number of unique accounts following or being followed by a Finnish-speaking account. Of these, 222,080 have written zero tweets and can therefore not be analyzed (though many of them are “spam” accounts), 62,834 are protected (and cannot be analyzed), 1,016 are suspended (normally due to spamming activities), 474 cannot be found, and the rest (1,047,044) have been identified as writing in a language other than Finnish.
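As a quick sanity check, the categories above can be added up to confirm that they account for every scanned account. The snippet below simply restates the figures from this post; the dictionary keys are labels chosen for illustration.

```python
# Breakdown of the scanned, non-Finnish accounts, as reported in this post.
breakdown = {
    "no tweets": 222_080,
    "protected": 62_834,
    "suspended": 1_016,
    "not found": 474,
    "other language": 1_047_044,
}

total = sum(breakdown.values())

# Share of each category among all scanned accounts.
shares = {reason: count / total for reason, count in breakdown.items()}

for reason, share in shares.items():
    print(f"{reason:>15}: {breakdown[reason]:>9,} ({share:.1%})")
print(f"{'total':>15}: {total:>9,}")
```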

The data extracted about the Finnish-speaking Twitter accounts is the source for all the statistics and conclusions in Twitter Census. The relationships between the accounts are later used to create the network graph of the Twitter population.

5 thoughts on “How is Twitter Census made?”

  1. Pingback: Study on active Finnish Twitter users released tomorrow – Twittercensus) « Web & Social Media Strategist

  2. The way you calculate the Finnish-speaking accounts is misleading if it is presented as representing the Finnish-speaking Twitter sphere, as many important figures may tweet in English most of the time, for example by posting links that were originally in English. That is the case for me (@tar1na): I am not visible in the listings, and neither are many others, although I do post Finnish tweets sometimes, if very rarely. This makes the cloud smaller than it actually is.

    • The study, and the cloud, do not show Finns on Twitter, but Finnish-speaking users on Twitter. There is no reliable way to identify a user's nationality, and any attempt to do so will be less accurate than the method chosen. But it should be clearly stated that the graph and the study only include users writing in Finnish, and nothing else.
