Here is the raw MySQL data [35 MB] for all tweeps and their internal relations. The format is a MySQL database dump, and the easiest approach is to import it into a local MySQL database and then export it to a format of your choice (such as GEXF, GML or similar). The file, of course, is quite big. Have fun. If you use the data, please let me know!
-- phpMyAdmin SQL Dump
-- version 3.5.5
-- Host: localhost
-- Generation Time: Feb 25, 2013 at 10:27 AM
-- Server version: 5.1.66
-- PHP Version: 5.3.3
SET time_zone = "+00:00";
-- Database: `twittercensus2013_fi`
-- Table structure for table `rels_cleaned`
CREATE TABLE IF NOT EXISTS `rels_cleaned` (
`tid1` int(11) NOT NULL,
`tid2` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
-- Table structure for table `tweeps`
CREATE TABLE IF NOT EXISTS `tweeps` (
`tid` int(11) NOT NULL,
`screen_name` char(250) NOT NULL,
`real_name` char(250) NOT NULL,
`description` text NOT NULL,
`location` char(255) CHARACTER SET latin1 NOT NULL,
`lang_pear` tinyint(4) NOT NULL,
`friends_count` int(11) NOT NULL,
`statuses_count` int(11) NOT NULL,
`followers_count` int(11) NOT NULL,
`lang` char(10) CHARACTER SET latin1 DEFAULT NULL,
`changed` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`created_at` datetime NOT NULL,
`url` varchar(255) CHARACTER SET latin1 NOT NULL,
`geo_enabled` tinyint(1) NOT NULL,
`profile_image_url` char(255) CHARACTER SET latin1 NOT NULL,
`favourites_count` int(11) NOT NULL,
`tc_activetaliban` tinyint(4) NOT NULL,
`tc_active` tinyint(4) NOT NULL,
`tc_gender` tinyint(2) NOT NULL,
`is_doctor` tinyint(1) NOT NULL,
`is_journalist` tinyint(1) NOT NULL,
`modularity` tinyint(3) unsigned NOT NULL,
`tc_category` tinyint(4) NOT NULL,
`queue` tinyint(4) NOT NULL,
PRIMARY KEY (`tid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
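Once the dump is imported, the export step mentioned above could look roughly like the following Python sketch, which turns rows from `tweeps` and `rels_cleaned` into a GML string. The function name `to_gml` is hypothetical, and how the rows are fetched from MySQL is left out. The dump does not document which way a relation points, so treating `tid1` as the source and `tid2` as the target is an assumption.

```python
def to_gml(nodes, edges):
    """Render a directed graph as a GML string.

    nodes: dict mapping tid (int) -> screen_name (str), as in `tweeps`
    edges: iterable of (tid1, tid2) pairs, as in `rels_cleaned`
    """
    lines = ["graph [", "  directed 1"]
    for tid, name in sorted(nodes.items()):
        # GML strings are double-quoted, so avoid embedded double quotes
        label = name.replace('"', "'")
        lines.append(f'  node [ id {tid} label "{label}" ]')
    for tid1, tid2 in edges:
        # Assumption: tid1 is the source and tid2 the target.
        # Skip relations whose endpoints are missing from the node table.
        if tid1 in nodes and tid2 in nodes:
            lines.append(f"  edge [ source {tid1} target {tid2} ]")
    lines.append("]")
    return "\n".join(lines)
```

Writing the returned string to a `.gml` file should give something most graph tools can open. For a graph of this size, streaming the lines to disk instead of building one big string would probably be kinder to memory.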
This post will try to explain the methodology behind Twittercensus.
Data is collected from Twitter’s API (api.twitter.com). For every account tested, we download the 75 most recent tweets. The language of these tweets is analyzed (as one text) after removing hashtags, links and user names. The language is identified with the open source library PEAR Text_LanguageDetect, which recognizes the three-letter combinations (trigrams) that are characteristic of each language (Cavnar & Trenkle, 1994).
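To give a feel for the trigram approach, here is a minimal Python sketch of rank-order ("out-of-place") trigram matching in the spirit of Cavnar & Trenkle. It is not the PEAR library's code. The function names, the regular expressions used for cleaning, the profile size of 300 and the fixed penalty for unseen trigrams are all assumptions made for illustration.

```python
import re
from collections import Counter

def clean(text):
    """Strip links, @user names and #hashtags, then normalise case and spaces."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return " ".join(text.lower().split())

def profile(text, size=300):
    """Return the most frequent character trigrams, most frequent first."""
    padded = f" {clean(text)} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count; break ties alphabetically for stable ranks
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]

def distance(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; unseen trigrams cost a fixed penalty."""
    ranks = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - ranks[gram]) if gram in ranks else max_penalty
        for i, gram in enumerate(doc_profile)
    )

def detect(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: distance(doc, lang_profiles[lang]))
```

In use, `lang_profiles` would be a dict such as `{"fi": profile(finnish_sample), "en": profile(english_sample)}`, built from much larger reference texts than a handful of tweets, and `detect` would be called on the 75 tweets of an account joined into one string.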
After an account has been identified as Finnish, all friends and followers of that account are added to a queue to be analyzed. This process is repeated until all accounts (and their followers and friends, and their followers and friends, etc.) have been scanned and no more Finnish-speaking accounts are found.
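The queue-driven process described above is essentially a breadth-first crawl that only expands accounts classified as Finnish. A minimal Python sketch follows. The callbacks `get_neighbors` (friends plus followers of an account) and `is_finnish` (the language test) are hypothetical stand-ins for the real API calls and classifier.

```python
from collections import deque

def crawl(seeds, get_neighbors, is_finnish):
    """Snowball through friends and followers, expanding only Finnish accounts.

    Returns (finnish, seen): the accounts classified as Finnish and
    every account that was put on the queue at some point.
    """
    queue = deque(seeds)
    seen = set(seeds)
    finnish = set()
    while queue:
        account = queue.popleft()
        if not is_finnish(account):
            # Scanned but excluded; its neighbours are not queued
            continue
        finnish.add(account)
        for neighbor in get_neighbors(account):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return finnish, seen
```

The `seen` set is what makes the loop terminate: every account is queued at most once, so the crawl stops when the queue runs dry and no new Finnish-speaking accounts turn up.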
Users not recognized as Finnish, accounts without tweets and protected accounts are excluded.
Altogether 1,333,448 accounts have been scanned (plus the ones identified as Finnish). This is the total number of unique accounts following or being followed by a Finnish-speaking account. Of these, 222,080 have written zero tweets and can therefore not be analyzed (but MANY are “spam” accounts), 62,834 accounts are protected (and cannot be analyzed), 1,016 are suspended (normally due to spamming activities), 474 cannot be found and the rest (1,047,044) have been identified as writing in a language other than Finnish.
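As a quick sanity check, the categories above can be added up and turned into shares of the total. This is only a small Python sketch using the figures from this post; the dictionary keys are labels chosen for illustration.

```python
# Non-Finnish accounts scanned, by reason for exclusion (figures from the post)
scanned = {
    "no tweets": 222_080,
    "protected": 62_834,
    "suspended": 1_016,
    "not found": 474,
    "other language": 1_047_044,
}

total = sum(scanned.values())

# Share of each category in percent, rounded to one decimal
shares = {reason: round(100 * count / total, 1) for reason, count in scanned.items()}

for reason, count in scanned.items():
    print(f"{reason:>15}: {count:>9,} ({shares[reason]} %)")
print(f"{'total':>15}: {total:>9,}")
```

The five categories add up to exactly the 1,333,448 scanned accounts.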
The data extracted about the Finnish-speaking Twitter accounts is the source for all the statistics and conclusions in Twittercensus. The relationships between the accounts are later used to create the network graph of the Twitter population.
Twittercensus for Finland will be released on the 19th of February at 11.00 (GMT+2) on this web site.
The site will contain statistics about Twitter in Finland, as well as a graph of all active Finnish-speaking Twitter users in Finland. We will present the total number of users, active users, number of tweets sent and lots more statistics! Stay tuned for more information.
A live presentation with the results of the Finnish Twittercensus will be published on this site on February 19th.
And oh, a teaser! In the next post you can find a sample of the Finnish Twitter graph!