Wednesday 24 March 2010

Finding Similar Text Fragments

Still dealing with plagiarism detection, one essential step is to identify similar fragments between two texts. That is, given two documents d1 and d2, the question is which parts of d1 can be found where in d2. To complicate things a bit further, the parts do not need to be identical: due to obfuscation techniques, they might only be similar to some degree.

An approach that seems to deliver acceptable results from scratch is a two-step detection method. The aim of the first step is to find potentially similar parts via word n-grams. The second step looks at these candidates and computes a detailed local alignment.

To realise step one, it is necessary to split both documents into sequences of n-grams. In order to cope with obfuscation based on modifying the word order, the n-grams are sorted. So, the n-gram "to be or not to be" becomes "be be not or to to". This adds some invariance to word reordering. To find potentially similar parts, we look up the n-grams of d1 and check if and where they appear in d2. Every "sighting" of an n-gram from d1 in d2 can already be considered a candidate for a plagiarised part. In practice it is useful to merge candidates that lie close to each other and to do some outlier detection before going into the second step.
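The first step can be sketched in Python (a sketch only; the function names are my own, and the merging and outlier detection are left out):

```python
from collections import defaultdict

def sorted_ngrams(words, n):
    """Yield (position, sorted word n-gram) pairs for a sequence of words."""
    for i in range(len(words) - n + 1):
        yield i, tuple(sorted(words[i:i + n]))

def find_sightings(d1_words, d2_words, n=6):
    """Return (pos_in_d1, pos_in_d2) pairs where a sorted n-gram
    of d1 also occurs in d2 -- the candidate 'sightings'."""
    index = defaultdict(list)               # sorted n-gram -> positions in d2
    for j, gram in sorted_ngrams(d2_words, n):
        index[gram].append(j)
    sightings = []
    for i, gram in sorted_ngrams(d1_words, n):
        for j in index.get(gram, ()):
            sightings.append((i, j))
    return sightings
```

Because the n-grams are sorted before the lookup, "not to be or to be" still matches "to be or not to be" despite the reordered words.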

The second step no longer looks at the complete documents, but compares local parts in more detail. To do so, we again consider the original documents as sequences of words and sample a subsequence from a fixed window around the candidate parts. This leads to two (relatively) short subsequences of words from d1 and d2. Using the Smith-Waterman algorithm, we compute an optimal local alignment of the two subsequences. The result depends a lot on the costs for inserting, replacing or deleting words to match the subsequence from d1 to the subsequence from d2. However, even a simple cost measure with fixed values for the editing operations (independent of the actual words that are inserted, replaced or deleted) delivered acceptable results.
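A minimal Smith-Waterman sketch on word sequences, with fixed scores for match, mismatch and gap (the concrete values here are placeholders, not the parameters I actually used):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between word sequences a and b,
    together with its end position, using fixed word-independent costs."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell is the best of: starting fresh, extending diagonally,
            # or inserting a gap in either sequence.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_end = h[i][j], (i, j)
    return best, best_end
```

Backtracking from best_end through h would recover the aligned words themselves; here only the score and end position are returned.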

There seems to be a lot of potential in this approach. It is sufficiently fast, and first parameter experiments show that the results improve considerably with the right settings.

Monday 8 March 2010

Fuzzy Hashes

With the start of the two-week student project on IR at the University of Mainz, there was a lot of discussion about hashing and storing segments of text. A particular question was dedicated to fuzzy hashes, which can also match near-duplicates.

There is a lot of research on the topic -- based on different technologies and intended for different purposes. For the topic the students have to deal with (detection of plagiarism), it is particularly interesting to detect similarity of text passages. As plagiarism rarely covers the full text, it is necessary to discover which parts of one text match which parts of another.

Participants in the PAN 2009 competition (dedicated to the detection of plagiarism) mainly used forms of word or character n-grams (i.e. contiguous subsequences of n words or characters) to describe texts. This seems a good method to find documents with local similarities: a suspiciously large overlap of n-grams between two documents is a good hint for local passages with similar contents.
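One simple way to quantify such an overlap is the containment of one document's n-gram set in the other's (a sketch of my own; the PAN participants used various measures):

```python
def ngram_set(words, n):
    """The set of word n-grams of a sequence of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(d1_words, d2_words, n=3):
    """Fraction of d1's word n-grams that also occur in d2 (containment)."""
    g1, g2 = ngram_set(d1_words, n), ngram_set(d2_words, n)
    return len(g1 & g2) / len(g1) if g1 else 0.0
```

A value close to 1 means almost every n-gram of d1 reappears in d2 -- exactly the kind of suspicious overlap worth inspecting more closely.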

However, n-grams are not invariant under certain disguise techniques used by plagiarists. A common technique is to change the order of the words. An approach to overcome this flaw is to work with sorted n-grams. This means all the words (or characters) in an n-gram are sorted alphabetically. For instance, the 6-gram (to, be, or, not, to, be) becomes (be, be, not, or, to, to).

Another disguise technique is to substitute words with synonyms. Rendering the n-grams invariant under this technique is more difficult. One idea could be to treat words and their synonyms as equivalence classes and to use a (fixed) class representative rather than the actual word. This is problematic because synonymy is not a transitive relation: just because w1 is a synonym of w2 and w2 is a synonym of w3, it does not follow that w1 is a synonym of w3. Still, the approach needs to be checked and could be a first solution to the problem.

Links
PAN 2010 @ CLEF

Friday 5 March 2010

Translation Tools

Translations are a main ingredient of cross-language information retrieval (CLIR) applications. As I am currently dealing with a multi-language setup, I needed to look at possibilities for translating text documents.

The translation service of Google seems to serve research projects well; at least it was mentioned repeatedly during the CLEF 2009 workshop in Corfu. An alternative is Microsoft's translation web service. Both can be accessed via a RESTful web-service API.

With a few lines of code I managed to include both interfaces in a Java project. I started a batch translation of documents and can summarize a few observations about the two:

  • Speed: Google seems to respond slightly quicker, while Microsoft's translation takes longer. I did not track the response times precisely, but there is definitely a difference.
  • Text length: Google translates texts up to a length of 5000 characters; everything beyond that results in an error message. Microsoft also puts some restriction on the length of the text. I looked briefly at the API reference but did not find a description of the limit; however, when trying to translate longer texts I got error messages.
  • Response format: Google delivers the translations in JSON format, Microsoft in plain text. For both, I am not entirely sure whether there are options for configuring the result format.
  • Registration: Microsoft requires an application ID with each request. The ID can be generated online. Google encourages the use of an application key, but does not require it.
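The length limits mean that longer documents have to be split before being sent. A minimal pre-chunking sketch (the 5000-character limit is Google's; the splitting strategy is my own):

```python
def chunk_text(text, limit=5000):
    """Split text into pieces of at most `limit` characters,
    preferring to break at sentence boundaries where possible."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(". ", 0, limit)
        if cut == -1:           # no sentence boundary found: hard cut
            cut = limit
        else:
            cut += 1            # keep the full stop with its chunk
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent as a separate request and the translated pieces concatenated afterwards.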

Summary: Well, so far I have not compared the quality of the translations, nor have I looked into all the possibilities of the APIs. However, Google's faster response and more clearly documented text-length limit seem to favour this solution.

Links:

Google Language API
Microsoft Translate API