This post explains the methodology behind Twittercensus.

Data is collected from Twitter’s API (api.twitter.com). For every account tested, we download the 75 most recent tweets. Hashtags, links and user names are removed, and the remaining tweets are analyzed together as a single text. The language is identified with the open-source PEAR library Text_LanguageDetect, which recognizes a language by the three-letter combinations (trigrams) that are characteristic of it (Cavnar & Trenkle, 1994).
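To make the idea concrete, here is a minimal Python sketch of the cleaning step and of trigram-based identification in the style of Cavnar & Trenkle. It is not the code of the library used for the census; the regular expressions, the function names, the profile size of 300 and the `language_profiles` dictionary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed patterns for the tokens that are removed before analysis.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweets(tweets):
    """Join tweets into one text and strip links, user names and hashtags."""
    text = " ".join(tweets)
    # Links go first, since a URL may itself contain '#' or '@'.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def trigram_profile(text, size=300):
    """Return the most frequent character trigrams, most frequent first."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(size)]


def out_of_place(sample_profile, language_profile):
    """Sum of rank differences between two profiles.

    Trigrams missing from the language profile get the maximum penalty.
    """
    rank = {gram: i for i, gram in enumerate(language_profile)}
    max_penalty = len(language_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(sample_profile)
    )


def detect(text, language_profiles):
    """Return the key of the reference profile closest to the text."""
    sample = trigram_profile(text)
    return min(
        language_profiles,
        key=lambda lang: out_of_place(sample, language_profiles[lang]),
    )
```

With reference profiles built from a large corpus per language, `detect(clean_tweets(tweets), language_profiles)` would return the language whose trigram ranking is closest to that of the account's tweets. Short texts give unreliable rankings, which is one reason to analyze all of an account's tweets as a single text.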

After an account has been identified as Finnish, all friends and followers of that account are added to a queue to be analyzed. This process is repeated until all accounts (and their followers and friends, and their followers and friends, etc.) have been scanned and no more Finnish-speaking accounts are found.

Users not recognized as Finnish, accounts without tweets and protected accounts are excluded.
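The scan described above is essentially a breadth-first traversal in which only Finnish accounts expand the queue. The sketch below shows one way it could look in Python. The three callables (`fetch_account`, `fetch_neighbors`, `is_finnish`), the shape of the account dictionary and the use of seed accounts as a starting point are hypothetical; this post does not describe how the first accounts were chosen.

```python
from collections import deque


def crawl(seed_ids, fetch_account, fetch_neighbors, is_finnish):
    """Breadth-first scan outward from a set of seed accounts.

    All three callables are hypothetical stand-ins:
      fetch_account(id)   -> dict with 'status' and 'tweets' keys
      fetch_neighbors(id) -> iterable of friend and follower ids
      is_finnish(tweets)  -> True if the tweets are identified as Finnish
    """
    queue = deque(seed_ids)
    seen = set(seed_ids)
    finnish = set()
    excluded = {}  # account id -> reason for exclusion

    while queue:
        account_id = queue.popleft()
        account = fetch_account(account_id)

        if account["status"] != "ok":
            # For example protected, suspended or not found.
            excluded[account_id] = account["status"]
        elif not account["tweets"]:
            excluded[account_id] = "no tweets"
        elif not is_finnish(account["tweets"]):
            excluded[account_id] = "other language"
        else:
            finnish.add(account_id)
            # Only Finnish accounts contribute new accounts to the queue.
            for neighbor in fetch_neighbors(account_id):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)

    return finnish, excluded
```

The `seen` set ensures each account is scanned only once, however many Finnish accounts it is connected to. Accounts reachable only through non-Finnish accounts are never visited, which is what lets the scan terminate.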

Schematic of Twitter Census methodology

In total, 1,333,448 accounts have been scanned in addition to those identified as Finnish. This is the number of unique accounts that follow or are followed by a Finnish-speaking account. Of these, 222,080 have written no tweets and therefore cannot be analyzed (many of them are spam accounts), 62,834 are protected (and cannot be analyzed), 1,016 are suspended (normally for spamming), 474 cannot be found, and the remaining 1,047,044 have been identified as writing in a language other than Finnish.
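As a quick consistency check, the categories above can be added up and expressed as shares of the total. The snippet below uses only the figures given in this post; the category labels are shorthand chosen for the example.

```python
# Non-Finnish accounts by reason for exclusion, as reported above.
excluded = {
    "no tweets": 222_080,
    "protected": 62_834,
    "suspended": 1_016,
    "not found": 474,
    "other language": 1_047_044,
}

total = sum(excluded.values())

# Share of each category in percent, rounded to one decimal.
shares = {
    reason: round(100 * count / total, 1)
    for reason, count in excluded.items()
}

print(total)
for reason, share in shares.items():
    print(f"{reason}: {share}%")
```

The five categories add up to the 1,333,448 scanned accounts, so every scanned account falls into exactly one of them.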

The data extracted about the Finnish-speaking Twitter accounts is the source for all the statistics and conclusions in Twitter Census. The relationships between the accounts are later used to create the network graph of the Twitter population.