Wednesday 24 March 2010

Finding Similar Textfragments

Still dealing with plagiarism detection, one essential step is to identify similar fragments between two texts. That is, given two documents d1 and d2, the question is which parts of d1 can be found where in d2. To complicate the things a bit further, the parts do not need to be identical. Due to obfuscation techniques, they might only be similar to some degree.

An approach that seems to deliver acceptable results from scratch is a two step detection method. The aim of the first step is to find potentially similar parts via word n-grams. The second step looks at these candidates and computes a detailed local alignment.

To realise step one, it is necessary to split both documents into a sequence of n-grams. In order to cope with obfuscation based on modifying the word order, the n-grams are sorted. So, the n-gram "to be or not to be" becomes "be be not or to to". This adds some invariance to word reordering. To find potentially similar parts, we need to look up the n-grams of d1 and check if and where they appear in d2. Every "sighting" of an n-gram from d1 in d2 can already be considered a candidate for plagiated part. In practice it is useful to combine some candidates that are near to each other and do some outlier detection before going into the second step.

The second step does not look at the complete documents any more, but compares local parts in more detail. To do so, we consider again the original documents as sequences of words and sample a subsequence from a fixed window around the candidate parts. This leads to two (relatively) short subsequences of words from d1 and d2. Using the Smith-Waterman algorithm we compute an optimal partial alignment of the subsequences. The actual result depends a lot on the costs for inserting, replacing or deleting words to match the subsequence from d1 to the subsequence from d2. However, already a simple cost measure with fixed values for the editing actions (independent of the actual words that are inserted, replaced or deleted) delivered acceptable results.

There seems a lot of potential in this approach. It is sufficiently fast and first changes to parameters show that results can be boosted a lot when using the right settings.

No comments:

Post a Comment