Friday 6 August 2010

Twitter Corpora

Applying IR techniques to Twitter is very much in fashion right now. I can see several reasons why:
  1. Twitter is a successful phenomenon of the Web (2.0). Millions of users write and read tweets. With so many messages and so many users, there is a need to manage the information on Twitter.
  2. Tweets are challenging for text IR. First of all, they are short, so the terms in a tweet are extremely sparse. Second, the messages are not as well formulated as classical documents. The scope of the medium, the length restrictions and the typically non-professional writers create a mixture of abbreviations, misspellings and slang expressions.
  3. Time plays an important role. Tweets typically report on something that is happening now. Older tweets are outdated to such an extent that one could consider them irrelevant.
For research purposes, there are a few Twitter corpora out on the web. One is the corpus collected by Munmun De Choudhury. It contains tweets, information about users and the social graph of following relations. Another well-known data collection from Twitter is the quite large social graph compiled by Haewoon Kwak. However, the latter does not contain any messages.

One thing I found missing was a corpus with messages of closely connected users. Munmun's collection contains several users and their tweets, but the connections between those users are not very dense: only a few users have more than ten followers listed.
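To make "not very dense" a bit more concrete, here is a minimal sketch of how one might check this on a follower graph. It assumes the graph is available as a list of (follower, followee) pairs; that format, the function name `follower_stats` and the threshold parameter are my own illustration, not the actual layout of either corpus.

```python
from collections import defaultdict


def follower_stats(edges, min_followers=10):
    """Return (density, users with more than `min_followers` listed followers).

    `edges` is an iterable of (follower, followee) pairs. This input
    format is an assumption for illustration, not the corpus format.
    """
    followers = defaultdict(set)
    users = set()
    for follower, followee in edges:
        if follower == followee:
            continue  # ignore self-loops
        users.add(follower)
        users.add(followee)
        followers[followee].add(follower)

    n = len(users)
    n_edges = sum(len(f) for f in followers.values())
    # Density of a directed graph: edges present / edges possible.
    density = n_edges / (n * (n - 1)) if n > 1 else 0.0
    well_connected = sorted(
        user for user, f in followers.items() if len(f) > min_followers
    )
    return density, well_connected
```

A density close to zero together with a short list of well-connected users would correspond to the situation described above: many users, but few listed connections among them.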

Analysing the graph and the individual messages is interesting. But the network of messages will probably be even more fascinating.
