IR-ratio: Fuzzy Hashes

With the start of the two weeks student's project in IR at the University in Mainz, there was a lot of discussion about hashing and storing segments of texts. A particular question was dedicated to fuzzy hashes, that can also match near-duplicates.

There is a lot of research on the topic -- based on different technologies and intended for different purposes. For the topic the students have to deal with (detection of plagiarism) it is particularily interesting to detect similarity of text passages. As a plagiate rarely covers the full text, it is necessary to discover which parts of a text match which parts of another text.

Participants to the PAN 2009 competition (dedicated to the detection of plagiarism) mainly used forms of word or character n-grams (i.e. continious subsequences of n words or characters) to describe texts. This seems a good method to find documents which have local similarities. A suspiciously large overlap of n-grams between two documents is good hint for local passages with similar contents.

However, n-grams are not invariant under certain disguise techniques used by plagiators. A common technique is to change the order of the words. An approach to overcome this flaw is to work with sorted n-grams. This means, all the words (characters) in an n-gram are sorted alphabetically. For instance, the 6-gram (to, be, or, not, to, be) becomes (be, be, not, or, to, to).

Another disguise techniques is to substitute words with synonyms. To render the n-grams invariant under this disguise technique is more difficult. An idea could be to consider words and their synonyms as equivalence classes and rather use a (fixed) class representative than the actual word. It is problematic, because synonymy is not a transitive relation. This means, that just because w1 is a synonym of w2 and w2 is a synonym of w3 does not allow to draw the conclusion that w1 is a synonym of w3. However, the approach needs to be checked and could be a first solution to the problem.

Links
PAN 2010 @ CLEF

IR-ratio

Monday, 8 March 2010

Fuzzy Hashes

No comments:

Post a Comment

Search This Blog

Blog Archive

Labels