Building your Twitter replynet with SQLite

Data preparation is one of the most important and interesting parts of every research activity. Recently I’ve really enjoyed Axel’s posts about how to extract conversation networks out of Twapperkeeper archives. Axel uses awk, which is a very powerful and flexible tool for manipulating csv files. Obviously there are several ways to manipulate a csv file (or any kind of data) in order to extract a network file readable by SNA software such as Gephi. We at SIGSNA have always used a database approach. This is probably due to Matteo’s extensive experience with databases, but I must admit that it seems to have quite a few advantages.
With this post – the first of a more methodologically oriented series – I’ll show how to extract a conversation network (aka reply network) out of a Twapperkeeper archive using SQLite. Please note that this will be a non-technical post and it is intended for a non-technical audience.
SQLite is a cross-platform embedded relational database management system.
(On OS X you can start it simply by typing sqlite3 in your command shell.)
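For example, if you want your data to persist in a file, you can pass a database name as the first argument (replynet.db here is just a placeholder name); SQLite will open the database, creating it if needed, and drop you at the sqlite> prompt:
sqlite3 replynet.db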
First we have to create a table suitable for importing a Twapperkeeper archive:
create table tablename (text varchar,
to_user_id int,
from_user varchar,
id int,
from_user_id int,
iso_language_code varchar,
source varchar,
profile_image_url varchar,
geo_type float,
geo_coordinates_0 float,
geo_coordinates_1 float,
created_at timestamp,
time long);
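If you want to double check that everything went fine, SQLite offers a couple of handy dot commands to inspect the database (tablename being whatever name you picked above): .tables lists the tables and .schema prints the create statement we just ran.
.tables
.schema tablename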

Once we’ve created the table, we just have to import the .csv file into it:
.separator ","
.import filename.csv tablename
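A quick sanity check at this point is to count the imported rows and compare the result with the size of your archive (again, tablename is just the name we used before):
select count(*) from tablename;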

Once we’ve imported our data we have all the advantages of a powerful database at our disposal, and creating our conversation network is really simple.
Let’s create a csv file with a line for every @reply in our data, containing the user id of the sender and the user id of the receiver.
First we have to set a file as the output of our query:
.output nameofyournetwork.csv
Then we can run the query:
select from_user_id,to_user_id from yourtable where text like '@%' and to_user_id <> '';
After that we can set the output back to normal (on-screen) mode:
.output stdout
And that’s all: you’ll find a csv file of the conversation network in your working directory.
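As a side note, if you have to repeat the extraction often you can save the whole sequence of commands in a plain text file and feed it to SQLite non-interactively; something along these lines should work, where yourdatabase.db and replynet.sql are placeholder names and replynet.sql contains the .separator, .output and select commands shown above:
sqlite3 yourdatabase.db < replynet.sql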
Of course, once you’ve imported your data into a database you can perform much more complex queries, e.g. filtering the tweets according to their declared language:
select from_user_id,to_user_id from yourtable where text like '@%' and iso_language_code ='it';
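Another example, just as a sketch using the same table and column names as above, is a query that weights each edge by the number of replies exchanged between two users, so that Gephi can import it as a weighted edge list:
select from_user_id, to_user_id, count(*) as weight from yourtable where text like '@%' and to_user_id <> '' group by from_user_id, to_user_id;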
Obviously this approach will produce a reply network with user IDs instead of usernames; this can be fine during the research phase, but we might prefer to have readable usernames for our final visualisations.
We’ll dig into this problem and see how to solve it with some awk scripting in our next post. [article available in English only]