Language identification and source analysis

Even during the Christmas holidays the #SIGSNA team kept working on our dataset of conversations. After the positive reaction to our first public presentation, we decided to improve the dataset before taking any further steps. We worked on two main issues: 1) we tried to improve the performance of our language identification system, and 2) we tried to identify the original sources of all the Friendfeed entries we collected.
In detail:
1) We were pretty satisfied with the original performance of our language identification system (SLide). The version we used in our first descriptive analysis of the FF network worked fine, with a very low error rate. Nevertheless, the rate of unidentified posts was still too high. This is because language identification systems generally work better on long, “normal” texts than on very short pieces of text, often written in an informal style. It is much easier to identify the language of a book chapter than that of a couple of comments like “wow! great” or “cool!” on a single Friendfeed entry. Our team did a great job on that issue, and we are now running a new version of SLide (ver. 0.6) that works much better than the old one, especially on very short text entries.
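To give a feeling for why short comments end up "unidentified", here is a minimal sketch of a generic character-trigram identifier with a confidence threshold. This is not how SLide works (its internals are not described in this post); the function names, the threshold value and the two toy reference sentences are all assumptions made up for illustration.

```python
from collections import Counter

# Toy reference texts, invented for this sketch (not taken from our dataset).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then this is what we wanted to see",
    "it": "la volpe veloce salta sopra il cane pigro e questo è quello che volevamo vedere",
}


def trigram_profile(text):
    """Count character trigrams of the lower-cased text, padded with spaces."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify(text, profiles=PROFILES, threshold=0.1):
    """Return the best-matching language, or None (unidentified) below the threshold."""
    query = trigram_profile(text)
    scores = {lang: cosine(query, profile) for lang, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

A very short comment produces only a handful of trigrams, so it tends to overlap little with any reference profile and falls below the threshold more often than a long text does. Lowering the threshold reduces the number of unidentified posts but raises the risk of wrong guesses.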
2) During our first public presentation in Urbino we received a very good question (thank you again, Adriano!): how many entries in Friendfeed are posted directly as Friendfeed messages, and how many are imported into Friendfeed from external services? As soon as we finished the new version of SLide we went looking for external services in our dataset. We analyzed only the entries (since comments obviously come from Friendfeed itself) and we discovered something very interesting: the largest number of messages in Friendfeed comes from Twitter! The synchronization between the two services works just fine (as many of you probably know), so a large number of users seem to have a single microblogging feed (the Twitter one) that is imported into Friendfeed.
Friendfeed itself is the second-largest source of entries (not surprising), and Tumblr and Facebook are in third and fourth place.
Another interesting aspect is that half of the entries come from around a thousand different services: Google News, research query feeds, minor news services and many more.
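The kind of tally behind this ranking can be sketched in a few lines. This is only an illustration: it assumes each entry is a dict with a hypothetical "source" field naming the originating service, and the sample entries in the usage note are invented, not taken from our dataset.

```python
from collections import Counter


def rank_sources(entries, top=4):
    """Return the `top` sources as (name, count, share) plus the long-tail share.

    Assumes every entry is a dict with a "source" key (a hypothetical field name).
    """
    counts = Counter(entry["source"] for entry in entries)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    ranked = [(name, n, n / total) for name, n in counts.most_common(top)]
    # Whatever is not covered by the top sources belongs to the long tail.
    tail_share = 1.0 - sum(share for _, _, share in ranked)
    return ranked, tail_share
```

For example, with six made-up entries (three from "twitter", two from "friendfeed", one from "tumblr"), `rank_sources(entries, top=2)` puts "twitter" first with a share of 0.5 and leaves roughly 0.17 to the long tail of other services.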
Friendfeed is really a complex galaxy of information feeds coming from everywhere.