Monday, 5 March 2012

Precision-at-1 and Reciprocal Rank

There are retrieval scenarios where the first relevant document is all that matters. For instance, when a user searches the Web for the home page of a particular company, when seeking a document to answer a factual question, or when trying to identify a semi-structured document that lists the features of a product. In these and many other cases an IR system needs to provide only one relevant result. There is no need to list more relevant documents, as the first relevant document completely satisfies the user's information need. Furthermore, you obviously want that first relevant document to be ranked very high -- preferably at rank one.

Coming across such a scenario, I had to consider how to evaluate and compare retrieval models in such a case. Browsing through established evaluation metrics, two solutions with slightly different angles emerged: Precision-at-1 and Reciprocal Rank.

Precision-at-1: one aspect of the evaluation is whether the first relevant document is listed at rank one. Or, more generally and across several queries: how often is the highest ranked result relevant? Precision-at-1 (P@1 for short) addresses exactly this question. Technically it takes the first entry in the result list and checks whether this document is relevant. Hence, P@1 has a value of either 0 (first document irrelevant) or 1 (first document relevant). Therefore, we can directly relate the metric to the relevance value of the first entry in the result list: e[1].
If we have a set Q of several queries (writing e_q[1] for the value observed for query q), we can average the P@1 values and get an idea of how often the highest ranked result is relevant:

MeanP@1 = 1/|Q| Σ_{q ∈ Q} e_q[1]

Reciprocal Rank: P@1 ignores all results that are not ranked at position one. Assuming that no system is perfect, you might also be interested in how close to the top the first relevant document is ranked. The Reciprocal Rank (RR for short) metric is designed for exactly this purpose. If r is the position of the first relevant result, then RR for this result list is 1/r.
Again, if we have an entire set Q of queries (with r_q denoting this position for query q), we can average over these values to obtain the Mean Reciprocal Rank (MRR):


MRR = 1/|Q| Σ_{q ∈ Q} 1/r_q


In combination, the two metrics P@1 and RR give a good impression of how often to expect the first entry in the result list to be relevant and at which position to expect the first relevant document on average.
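
To make this concrete, here is a minimal Python sketch of both metrics. It assumes each result list is given as a list of binary relevance values (1 = relevant, 0 = irrelevant); the function names and the convention of returning 0 when no relevant document is found are my own choices, not part of any standard toolkit.

    def p_at_1(relevances):
        """P@1: 1 if the top-ranked document is relevant, else 0."""
        return relevances[0] if relevances else 0

    def reciprocal_rank(relevances):
        """RR: 1/r for the position r of the first relevant document, 0 if none is relevant."""
        for rank, rel in enumerate(relevances, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    # Relevance judgments for three example queries
    runs = [
        [1, 0, 0],  # relevant document at rank 1
        [0, 0, 1],  # first relevant document at rank 3
        [0, 1, 0],  # first relevant document at rank 2
    ]

    mean_p_at_1 = sum(p_at_1(r) for r in runs) / len(runs)       # (1 + 0 + 0) / 3 = 0.33
    mrr = sum(reciprocal_rank(r) for r in runs) / len(runs)      # (1 + 1/3 + 1/2) / 3 = 0.61
    print(mean_p_at_1, mrr)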

Tuesday, 31 August 2010

Capturing Influence in Twitter

One of the most interesting questions in social networks is who influences whom. This question obviously applies to Twitter as well. Politicians, activists, companies, celebrities, religious leaders or advertisers might use Twitter to spread short messages to the world in order to influence people in one way or another. Hence, one starts to wonder who has the most influence in the world of Twitter. Twinfluence.com is one renowned example of an application trying to measure the influence of Twitter users. By now, researchers in IR and the social sciences are also looking at this field and have made some interesting observations.

But before going into the analysis of who influences whom on Twitter, it is necessary to address the question of what influence actually means. Cha et al. [1] provide a definition taken from the Merriam-Webster dictionary: "influence is the power or capacity of causing an effect in indirect or intangible ways". Leavitt et al. [2] define influence in the context of Twitter "as the potential of an action of a user to initiate a further action by another user". All these definitions are still quite vague, but [3] states that there actually is no crisp definition of the concept. Cosley et al. [3] consider the adoption of behaviour as an observable result of influence between users. However, they looked at Wikipedia as a social network of editors contributing to articles, not at Twitter. For Twitter it would be more difficult to define such behaviours.

So, what did these studies look at, and how do they eventually define and measure influence?

The concrete definition of influence of Leavitt et al. [2] was based on actions. For Twitter they defined actions based on the tweets and came up with four categories: retweets (of the form RT @username), replies (@username at the beginning), mentions (@username in the middle) and attributions (via @username in the middle). The authors also provide a further classification for the first two actions: retweets are more content-oriented actions, while replies are conversation-oriented. This classification follows a line of thought dividing users into content-oriented and conversation-oriented users. Content-oriented users spread their messages, while conversation-oriented users employ Twitter to communicate with their followers.
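
As a rough illustration, the following Python sketch assigns a tweet to one of these categories by simple pattern matching on the message text. The rules are my own simplification of the description above, not the classification procedure used in [2].

    import re

    def classify_action(tweet):
        """Roughly categorise a tweet as retweet, reply, attribution, mention or plain post."""
        text = tweet.strip()
        if re.match(r'(?i)^RT @\w+', text):
            return 'retweet'        # RT @username at the start
        if re.match(r'^@\w+', text):
            return 'reply'          # @username at the very beginning
        if re.search(r'(?i)\bvia @\w+', text):
            return 'attribution'    # via @username somewhere in the text
        if re.search(r'@\w+', text):
            return 'mention'        # @username in the middle of the text
        return 'post'

    print(classify_action('RT @alice great paper on influence'))  # retweet
    print(classify_action('@bob I do not agree'))                 # reply
    print(classify_action('interesting study via @carol'))        # attribution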

Using these types of actions, Leavitt et al. looked at which of them occur among the followers of twelve popular Twitter users. For instance, they looked at the ratio of retweets, replies and mentions among all reactions to a particular user, or at how often followers reacted with a content-centric response (retweet) or a conversational response (reply) to the tweets of the popular users. The numbers were put in relation to the total number of original tweets provided by the observed users. As a result, they observed that some Twitter users who are highly involved in social media create more conversation with their followers. Others, mainly celebrities, have more passive followers who rarely reply to or retweet their messages.

Cha et al. [1] instead compared retweets (of both forms: RT @username and via @username), mentions and the indegree of users. They state that the indegree (i.e. the number of followers) measures a user's popularity, the number of retweets captures the ability to generate content that gets passed along, and the number of mentions reflects the ability to engage others in discussion and conversation. An analysis with Spearman's rank correlation revealed that indegree is not related to the other two factors, while retweets and mentions are more strongly correlated with each other. Cha et al. also analysed the numbers of retweets and mentions on three well-chosen topics and found that influence differs depending on the topic.
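
As a side note, this kind of comparison is easy to reproduce on one's own data. The sketch below uses SciPy's Spearman rank correlation on made-up per-user counts, purely for illustration.

    from scipy.stats import spearmanr

    # Hypothetical per-user measures (same user order in all three lists)
    indegree = [1200, 45000, 300, 9800, 150]   # number of followers
    retweets = [40, 10, 55, 200, 5]            # how often the user's tweets were retweeted
    mentions = [35, 12, 60, 180, 3]            # how often the user was mentioned

    print(spearmanr(indegree, retweets))   # popularity vs. retweets
    print(spearmanr(retweets, mentions))   # retweets vs. mentions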

Romero et al. [4,5] incorporate the passivity of users into the calculation of influence. Passive users merely consume incoming messages but do not propagate information to the network. This means the influence of a user is determined not only by the number of followers, but also by the willingness of the followers to pass messages on.

In general, research on influence in Twitter focuses on users and their social networks. The tweets themselves are merely considered as types of actions that reflect the influence a user has over his followers. This seems suitable for grasping the overall influence of a user, but not the influence of individual tweets themselves.


[1] Measuring User Influence in Twitter: The Million Follower Fallacy
Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, Krishna P. Gummadi. AAAI Conference on Weblogs and Social Media (ICWSM), 2010

[2] The Influentials: New Approaches for Analyzing Influence on Twitter
Alex Leavitt, Evan Burchard, David Fisher, Sam Gilbert. Web Ecology Project, 2 September 2009

[3] Sequential Influence Models in Social Networks
Dan Cosley, Daniel Huttenlocher, Jon Kleinberg, Xiangyang Lan, Siddharth Suri. AAAI Conference on Weblogs and Social Media (ICWSM), 2010

[4] What makes a tweet influential? New HP Labs social media research may provide answers
Ethan Bauley. http://h30507.www3.hp.com/t5/Data-Central/What-makes-a-tweet-influential-New-HP-Labs-social-media-research/ba-p/81855 (accessed 20 August 2010)

[5] Influence and Passivity in Social Media
Daniel Romero, Wojciech Galuba, Sitaram Asur, Bernardo Huberman. 2010

Friday, 6 August 2010

Twitter Corpora

Applying IR techniques to Twitter is very much in fashion right now. I can see several reasons why:
  1. Twitter is a successful phenomenon of the Web (2.0). Millions of users write and read tweets. Hence, as there are lots of messages and lots of users, there is a need to manage the information in Twitter.
  2. Tweets are challenging for text IR. First of all, they are short, so the terms in a tweet are extremely sparse. Second, the messages are not formulated as carefully as classical documents. The scope of the medium, the length restrictions and the typically non-professional writers create a mixture of abbreviations, misspellings and slang expressions.
  3. Time plays an important role. Tweets typically report about something that happens now. Older tweets are outdated to an extent that one could consider them irrelevant.
For research purposes, there are a few Twitter corpora out on the web. One is the corpus collected by Munmun De Choudhury. It contains tweets, information about users and the social graph of following relations. Another well known data collection from Twitter is the quite large social graph compiled by Haewoon Kwak. However, the latter does not contain any messages.

One thing I found missing was a corpus with messages of closely connected users. De Choudhury's collection has several users and their tweets, but the connections between those users are not very dense. There are only a few users for which more than ten followers are also listed.

Analysing the graph and individual messages is interesting. But the network of messages will probably be even more fascinating.

Tuesday, 6 April 2010

IR Evaluation and the "IR-Onion"

"On Queries and other Messages", that was the title of Stephen Robertson's keynote at the CIRSE workshop during ECIR 2010 in Milton Keynes. The focus of the talk was on evaluation experiments in IR. One aspect mentioned during the talk was to look at the evaluation task from a system theoretic point of view.

An IR system can be seen as an open system which interacts with its environment. The system itself is the object being observed and measured during an experiment. Again, from a system-theoretic point of view, the environment corresponds to the rest of the universe. An interaction happens when "something" passes the boundary between the system and the environment. In classical theory this "something" is typically objects or energy. In the case of an IR system it can be seen as messages.

Hence, Robertson suggested as objectives for IR evaluation to identify the boundaries of the system and to analyse the messages crossing those boundaries. In this way, the environment can also be incorporated into the evaluation process, namely via the messages it sends into the system. As one important -- if not the most important -- feature living in the environment of an IR system, he mentioned relevance. After all, it is the user who eventually decides about relevance.

To illustrate the boundaries of an IR system, Robertson referred to the "IR-Onion", as shown in the figure to the right. In addition to the naive boundary of the IR system, i.e. the one between user and system, it shows several other layers. Just like the layers of an onion, the core of the IR system is wrapped up in the interface/interaction component. Typically the user does not see (nor needs to understand) the internal algorithms and mechanisms of the system, but deals only with the user interface. This is also the aforementioned naive boundary from a system-theoretic point of view. However, the user side, i.e. the environment, can also be decomposed into several layers. The closest layer is the user's model of the IR system. A user has certain ideas or expectations about how the system operates and reacts. This influences his interaction and behaviour. Next comes the user's model of the information seeking task. Depending on what the user knows or thinks about his information need, he might behave differently. Finally, the information need is embedded in a larger problem or task, of which the user might not even be completely aware (Robertson referred to Nick Belkin's ASKs: anomalous states of knowledge).

The "IR-Onion" reflects, that search is a rather difficult task. Traditionally, that is why there used to be an intermediary between the user and the system (e.g. a librarian). The intermediary helped the users to organize and maybe better understand their search, even if the user eventually had direct access to the search system.

For evaluation purposes, the intermediary between user and system was an important resource. She or he could conduct reference interviews with the user and thereby provide deeper insight into the user's side, i.e. the environment. Nowadays, the challenge in IR evaluation -- particularly for context-based systems -- is to get a better picture of what is going on in the user's mind. Interaction events are a good basis, but research still has to make sense of those events.

Wednesday, 24 March 2010

Finding Similar Text Fragments

Still dealing with plagiarism detection, one essential step is to identify similar fragments between two texts. That is, given two documents d1 and d2, the question is which parts of d1 can be found where in d2. To complicate things a bit further, the parts do not need to be identical. Due to obfuscation techniques, they might only be similar to some degree.

An approach that seems to deliver acceptable results from scratch is a two-step detection method. The aim of the first step is to find potentially similar parts via word n-grams. The second step looks at these candidates and computes a detailed local alignment.

To realise step one, it is necessary to split both documents into a sequence of n-grams. In order to cope with obfuscation based on modifying the word order, the n-grams are sorted. So, the n-gram "to be or not to be" becomes "be be not or to to". This adds some invariance to word reordering. To find potentially similar parts, we need to look up the n-grams of d1 and check if and where they appear in d2. Every "sighting" of an n-gram from d1 in d2 can already be considered a candidate for a plagiarised part. In practice it is useful to merge candidates that are near each other and do some outlier detection before going into the second step.
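
A minimal sketch of this first step could look as follows; the function names and the simple inverted index over sorted n-grams are my own illustration, not a reference implementation.

    from collections import defaultdict

    def sorted_ngrams(words, n=6):
        """Yield (position, sorted word n-gram) pairs for a sequence of words."""
        for i in range(len(words) - n + 1):
            yield i, tuple(sorted(words[i:i + n]))

    def candidate_matches(d1_words, d2_words, n=6):
        """Positions (i, j) where a sorted n-gram of d1 also occurs in d2."""
        index = defaultdict(list)                 # sorted n-gram -> positions in d2
        for j, gram in sorted_ngrams(d2_words, n):
            index[gram].append(j)
        return [(i, j)
                for i, gram in sorted_ngrams(d1_words, n)
                for j in index.get(gram, [])]

    d1 = "to be or not to be that is the question".split()
    d2 = "he wondered whether to be or not to be at all".split()
    print(candidate_matches(d1, d2))              # [(0, 3)]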

The second step does not look at the complete documents any more, but compares local parts in more detail. To do so, we again consider the original documents as sequences of words and sample a subsequence from a fixed window around the candidate parts. This leads to two (relatively) short subsequences of words from d1 and d2. Using the Smith-Waterman algorithm we compute an optimal local alignment of the subsequences. The actual result depends a lot on the costs for inserting, replacing or deleting words to match the subsequence from d1 to the subsequence from d2. However, even a simple cost measure with fixed values for the editing actions (independent of the actual words that are inserted, replaced or deleted) delivered acceptable results.
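
For completeness, here is a compact sketch of Smith-Waterman over word sequences with fixed costs; the particular score values (+2 for a match, -1 for a mismatch or gap) are arbitrary choices for illustration.

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Best local alignment score and its end position for two word sequences."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i][j] = max(0,
                              H[i - 1][j - 1] + s,   # match or substitution
                              H[i - 1][j] + gap,     # deletion
                              H[i][j - 1] + gap)     # insertion
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        return best, best_pos  # a traceback from best_pos would yield the aligned words

    score, end = smith_waterman("the quick brown fox".split(),
                                "a very quick brown dog".split())
    print(score, end)   # 4, (3, 4): "quick brown" aligns in both sequences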

There seems to be a lot of potential in this approach. It is sufficiently fast, and first changes to the parameters show that results can be improved considerably with the right settings.

Monday, 8 March 2010

Fuzzy Hashes

With the start of the two-week student project in IR at the University of Mainz, there was a lot of discussion about hashing and storing segments of texts. A particular question was dedicated to fuzzy hashes that can also match near-duplicates.

There is a lot of research on the topic -- based on different technologies and intended for different purposes. For the topic the students have to deal with (detection of plagiarism), it is particularly interesting to detect the similarity of text passages. As a plagiarised passage rarely covers the full text, it is necessary to discover which parts of a text match which parts of another text.

Participants in the PAN 2009 competition (dedicated to the detection of plagiarism) mainly used forms of word or character n-grams (i.e. contiguous subsequences of n words or characters) to describe texts. This seems a good method to find documents which have local similarities. A suspiciously large overlap of n-grams between two documents is a good hint for local passages with similar contents.

However, n-grams are not invariant under certain disguise techniques used by plagiarists. A common technique is to change the order of the words. An approach to overcome this flaw is to work with sorted n-grams. This means all the words (or characters) in an n-gram are sorted alphabetically. For instance, the 6-gram (to, be, or, not, to, be) becomes (be, be, not, or, to, to).

Another disguise technique is to substitute words with synonyms. Rendering the n-grams invariant under this disguise technique is more difficult. An idea could be to consider words and their synonyms as equivalence classes and use a (fixed) class representative rather than the actual word. This is problematic because synonymy is not a transitive relation: just because w1 is a synonym of w2 and w2 is a synonym of w3, one cannot conclude that w1 is a synonym of w3. However, the approach needs to be checked and could be a first solution to the problem.
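
A minimal sketch of this idea, assuming some synonym resource is available (the small SYNONYMS dictionary below is purely hypothetical), could map every word to the alphabetically smallest member of its synonym group before building the sorted n-grams:

    # Hypothetical synonym dictionary; in practice this could come from WordNet or a thesaurus.
    SYNONYMS = {
        "buy": {"purchase", "acquire"},
        "purchase": {"buy", "acquire"},
        "acquire": {"buy", "purchase"},
    }

    def class_representative(word):
        """Map a word to a fixed representative of its synonym group
        (here: the alphabetically smallest member)."""
        return min(SYNONYMS.get(word, set()) | {word})

    def normalised_ngram(words):
        """Sorted n-gram over class representatives: invariant to word order
        and, to some degree, to synonym substitution."""
        return tuple(sorted(class_representative(w) for w in words))

    print(normalised_ngram("they buy old books".split()))
    print(normalised_ngram("old books they purchase".split()))   # same normalised n-gram

Note that, because synonymy is not transitive, these groups are not proper equivalence classes, so this normalisation can only be an approximation.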

Links
PAN 2010 @ CLEF

Friday, 5 March 2010

Translation Tools

A main ingredient for cross-language information retrieval (CLIR) applications is translation. As I am currently dealing with a multi-language setup, I needed to look at possibilities to translate text documents.

The translation service of Google seems to do a good job in research projects. At least it has been mentioned repeatedly during the CLEF 2009 workshop in Corfu. An alternative is Microsoft's Translate web service. Both can be accessed via a RESTful web service API.

With a few lines of code I managed to include both interfaces in a Java project. I started a batch translation of documents; here are a few observations about the two services:

  • Speed: Google seems to respond slightly quicker, while Microsoft's translation takes longer. I did not track the response times precisely, but there is definitely a difference.
  • Text length: Google translates texts up to a length of 5000 characters. Everything beyond that results in an error message. Microsoft also puts a restriction on the length of the text. I looked briefly at the API reference but did not find a description of where the limit is. However, when trying to translate longer texts I got error messages.
  • Response format: Google delivers the translations in JSON format. Microsoft returns plain text. For both, I am not entirely sure whether there are options for configuring the results.
  • Registration: Microsoft requires an application ID with each request. The ID can be generated online. Google encourages the use of an application key, but does not require it.
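
To give an impression of what such a call looks like, here is a small Python sketch of the general request/response pattern. The endpoint URL and parameter names are placeholders, not the actual API specification; the real URLs and parameters are documented in the links at the end of this post.

    import json
    import urllib.parse
    import urllib.request

    def translate(text, source="de", target="en"):
        """Send a text to a (hypothetical) RESTful translation endpoint and parse the JSON reply."""
        # Placeholder endpoint and parameter names -- see the official API docs for the real ones.
        base_url = "https://api.example.com/translate"
        params = urllib.parse.urlencode({
            "q": text,
            "source": source,
            "target": target,
            "key": "YOUR_APPLICATION_KEY",   # hypothetical key/ID as required by the service
        })
        with urllib.request.urlopen(base_url + "?" + params) as response:
            return json.loads(response.read().decode("utf-8")).get("translatedText", "")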

Summary: so far I did not compare the quality of the translations, nor did I look at all the possibilities of the APIs. However, Google's faster response and more clearly stated limitation on text length seem to favour this solution.

Links:

Google Language API
Microsoft Translate API