Authors: Anirban Chakraborty¹; Kripabandhu Ghosh¹ and Utpal Roy²
Affiliations: ¹Indian Statistical Institute, India; ²Visva-Bharati, India
Keyword(s): Erroneous Text, Cooccurrence, Pointwise Mutual Information.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Clustering and Classification Methods; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract: OCR errors hurt retrieval performance to a great extent. Research has been done on modelling and correcting OCR errors. However, most existing systems use language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance on the erroneous corpus. We present two versions of the algorithm: one based on word cooccurrence and the other based on Pointwise Mutual Information. Our algorithm does not use any training data or any language-specific resources such as a thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. We have tested our algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
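To make the two statistics named in the abstract concrete, the following is a minimal sketch of document-level cooccurrence counting and Pointwise Mutual Information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is not the authors' algorithm: the function name `pmi_table`, the choice of a whole document as the cooccurrence window, and the toy corpus are all assumptions made for illustration. The only language knowledge it uses is that words are separated by whitespace.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(word_a, word_b): PMI} for word pairs sharing at least one document.

    Assumption: the cooccurrence window is the whole document, and
    probabilities are estimated from document frequencies.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair

    for text in documents:
        vocab = sorted(set(text.split()))  # whitespace is the only delimiter
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs come out in sorted order

    table = {}
    for (a, b), joint in pair_df.items():
        # p(a, b) / (p(a) * p(b)) simplifies to joint * N / (df_a * df_b)
        table[(a, b)] = math.log2(joint * n_docs / (word_df[a] * word_df[b]))
    return table


# Hypothetical toy corpus: "retrieva1" stands in for an OCR-corrupted variant.
corpus = [
    "retrieval system",
    "retrieval system",
    "retrieva1 system",
    "other text",
]
scores = pmi_table(corpus)
```

In this toy corpus the correct form and its corrupted variant both cooccur with the same context word, which is the kind of signal a cooccurrence-based method could exploit without training data.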