Wednesday, May 30, 2007

ReCaptcha, digital libraries and OCR

Over 10 years ago, digital libraries were a hot research topic. Back then, I participated in a number of projects, which finally led to my dissertation. Today, there are many efforts to make books available in digital form. For those sources that are not available in digital format, the only way seems to be OCR. Unfortunately, OCR is subject to errors, which cannot always be corrected automatically.

Enter ReCaptcha, a collaborative approach that helps protect web sites against spam (comparable to Captcha, but with real words instead of just a bunch of characters). The idea is to present two words to the user: one was correctly identified via OCR, while the other one produced an OCR error. Assuming that someone who correctly identifies the first word will also identify the second one correctly, the set of words that OCR failed to recognize can be reduced dramatically, as a side effect of the original purpose of Captchas. Here's a demonstration of how ReCaptchas work.
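
To make the two-word idea concrete, here is a minimal Python sketch. It is not the actual ReCaptcha implementation: the function names, the in-memory vote store and the threshold of three matching answers are all assumptions made for illustration. The known word decides whether the user passes, and the answer for the unknown word is only counted as a vote when the known word was typed correctly.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers are required before a
# transcription is trusted. The real service may use a different rule.
VOTES_NEEDED = 3

# Hypothetical in-memory store: unknown word id -> votes per transcription.
votes = defaultdict(Counter)


def check_answer(control_word, control_answer, unknown_id, unknown_answer):
    """Return True if the user passes the challenge.

    Only the control word (the one OCR already identified) decides
    pass or fail. The answer for the unknown word is recorded as a vote,
    but only when the control word was typed correctly.
    """
    if control_answer.strip().lower() != control_word.lower():
        return False
    votes[unknown_id][unknown_answer.strip().lower()] += 1
    return True


def accepted_transcription(unknown_id):
    """Return the leading transcription once it has enough votes, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= VOTES_NEEDED else None
```

Requiring several matching answers, rather than trusting a single user, is one simple way to keep typos and deliberate nonsense out of the corrected text.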

And for those who would like to use ReCaptcha on their own sites, Google Code offers plugins and libraries for the reCAPTCHA API. Well done!