Mechanical Turk
Google embarked on the massively ambitious goal of digitizing all the world's books in 2004. When approaching such a monumental task, developing new, specialized tools is often a good investment. And invest Google did. They poured resources into optical character recognition (the software that converts images into text), scanning techniques, and an interface by which the books could be accessed. Google's scanners can "read" 1,000 pages an hour. No reasonable number of human workers could match that pace. As of 2010, the company estimated that it had digitized over 10% of the world's books.
Allow me one last piece of background. I swear it's worth it. In your travels over the Internets, you have certainly encountered a form requesting that you type the distorted letters that you see in a picture. These simple tests, called captchas, are designed to be easy for humans and difficult for computers. Using captchas prevents malicious software from signing up for millions of free email accounts (the ultimate triumph... just think about how much email you could send). Captchas are tests that computers can generate and know the answer to but cannot solve themselves.
And... the punch line. Even after all of the investment and research devoted to optical character recognition, the software still stumbles in difficult cases such as smudged or distorted words. Enter reCAPTCHA. Acquired by Google in 2009, reCAPTCHA is a system that uses captchas to digitize the parts of books that are too difficult for computers to read. The system is genius in its simplicity. Instead of displaying an image of one word distorted by a computer, the system shows an image of two distorted words side by side--one of them a regular captcha and one a difficult-to-read word from the expanse of the Google Books project. Because the two types of images are indistinguishable, the user types in both, unwittingly contributing to the digitization of the world's literature. The user's answer is sent back to the original source and inserted in place of the smudged word in the digital version of the book.
Example of a captcha
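For the programmers in the audience, the two-word trick can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Google's actual code: the names `Challenge`, `Digitizer`, and `submit`, and the word ID format, are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Challenge:
    control_word: str      # the regular captcha word; the system knows the answer
    unknown_word_id: str   # hypothetical ID of a scanned word the OCR could not read


@dataclass
class Digitizer:
    # Maps each unreadable word's ID to the list of answers users have typed for it.
    transcriptions: dict = field(default_factory=dict)

    def submit(self, challenge: Challenge, control_answer: str, unknown_answer: str) -> bool:
        """Return True if the user passes the captcha.

        Only the control word is checked. If it is right, the answer for the
        unreadable word is assumed to be right too and is recorded.
        """
        if control_answer.strip().lower() != challenge.control_word.lower():
            return False
        guesses = self.transcriptions.setdefault(challenge.unknown_word_id, [])
        guesses.append(unknown_answer.strip().lower())
        return True


digitizer = Digitizer()
challenge = Challenge(control_word="morning", unknown_word_id="book42-word7")
passed = digitizer.submit(challenge, "Morning", "upon")
```

The point to notice is that the user is graded on only one of the two words, and since they can't tell which one, the easiest way to pass is to type both honestly.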
reCAPTCHA is a runaway success. It digitizes over 200 million words per day. Most people are unaware that they are effortlessly accomplishing useful work while they go about their daily lives. I'm sure the conspiracy theorist in you is frantically searching for what other work you're being tricked into performing. Tell him or her to go back to contemplating the Kennedy assassination and the moon landing. That way, the innovator in you can start asking what other difficult problems could be solved with simple, elegant, powerful systems like reCAPTCHA.
How does the system perform the same purpose as a standard CAPTCHA if it doesn't know the correct answer ahead of time?
Good question. I should have addressed that. One of the words is known to the computer, and the other is from a scanned book. In theory, the user can't tell which one is which. If they get the known word right, the system assumes that they got the other one right too. In reality, if you know what is going on, you can sometimes tell which one came from a book scan. In that case, if you get the book word wrong, the system still lets you pass the captcha test (I have done this just to see if it works. It does.) This would introduce false information into the book digitization system, but each word is given to several users. The chances of users maliciously colluding to enter a false word into the system are vanishingly small.
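To make the "several users" safeguard concrete, here is a rough Python sketch of how agreement between users might be checked. The function name `accept_transcription` and both thresholds are my own assumptions for illustration; I don't know the rule reCAPTCHA actually uses.

```python
from collections import Counter


def accept_transcription(guesses, min_votes=3, min_share=0.75):
    """Return the agreed-upon word, or None if users have not agreed yet.

    min_votes and min_share are assumed thresholds, not reCAPTCHA's real ones.
    """
    if not guesses:
        return None
    # Find the most frequently typed answer and how many users typed it.
    word, votes = Counter(guesses).most_common(1)[0]
    if votes >= min_votes and votes / len(guesses) >= min_share:
        return word
    return None


agreed = accept_transcription(["upon", "upon", "upon", "up0n"])
```

One prankster typing nonsense just gets outvoted, and a word with too little agreement would presumably go back out to more users.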