Washington DC Message Boards

Defeating Spambots - The Never Ending Battle


DCpages has been fighting spambots for more than a decade. I am sure you see inappropriate posts on these message boards every so often. In the past we had a staff member working full time to remove spam. Then we implemented a standard CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): an image of random characters rendered by our server. Unfortunately, the spambots became so sophisticated that they were able to decipher the CAPTCHA text and make a guess in less than six seconds, on average. The posts got so bad last year that we had to block guests from posting in many forums. Now we hope to have a solution to our problem.


reCAPTCHA is a system developed at Carnegie Mellon University that uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas.


reCAPTCHA gives the user two words, the one for which the answer is not known and a second “control” word for which the answer is known. If users correctly type the control word, the system assumes they are human and gains confidence that they also typed the other word correctly.
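
To make the two-word idea concrete, here is a minimal sketch in Python. It is only an illustration of the logic described above, not reCAPTCHA's actual code; the function name `check_response` and the case-insensitive comparison are my own assumptions.

```python
def check_response(control_answer, typed_control, typed_unknown):
    """Return (is_human, vote_for_unknown_word).

    Hypothetical helper: the name and the normalization rule are
    assumptions for illustration, not part of reCAPTCHA itself.
    """
    def norm(word):
        # Assumed normalization: ignore surrounding spaces and letter case.
        return word.strip().lower()

    if norm(typed_control) == norm(control_answer):
        # Control word matched, so treat the user as human and keep
        # their reading of the unknown word as one vote.
        return True, norm(typed_unknown)
    # Control word wrong: reject the user and discard the other word.
    return False, None
```

For example, `check_response("following", "Following ", "finding")` passes the user and records "finding" as a guess, while a misspelled control word rejects the attempt and records nothing.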


reCAPTCHA displays words taken from scanned texts. The solutions entered by humans are used to improve the digitization process.

Guest Guest

From Wikipedia, the free encyclopedia (Redirected from Recaptcha)




An example of a reCAPTCHA challenge, containing the words "following finding".

reCAPTCHA is a system developed at Carnegie Mellon University that uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas. reCAPTCHA is currently digitizing text from the Internet Archive and the archives of the New York Times.


reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects. This provides the equivalent of about 160 books per day, or 12,000 man-hours per day of free labor (as of September 2008).


The system is reported to deliver 30 million images every day (as of December 2007), and counts such popular sites as Facebook, TicketMaster, Twitter and StumbleUpon among its subscribers. Craigslist began using reCAPTCHA in June 2008. The U.S. National Telecommunications and Information Administration also uses reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.




The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles."



Scanned text is analyzed by two different optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word that is already known. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered called. Words that are consistently given a single identity by human judges are recycled as control words.
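
The point scheme above can be sketched in a few lines of Python. The weights (0.5 per OCR program, 1 per human) and the 2.5 threshold come from the text; the function names and the exact-string matching are assumptions made for illustration, not the project's real implementation.

```python
from collections import defaultdict

OCR_WEIGHT = 0.5      # each OCR program's reading, per the text above
HUMAN_WEIGHT = 1.0    # each human reading, per the text above
CALL_THRESHOLD = 2.5  # score at which a word is "considered called"


def tally(ocr_guesses, human_guesses):
    """Sum the points earned by each candidate reading (hypothetical helper)."""
    scores = defaultdict(float)
    for guess in ocr_guesses:
        scores[guess] += OCR_WEIGHT
    for guess in human_guesses:
        scores[guess] += HUMAN_WEIGHT
    return dict(scores)


def called_word(ocr_guesses, human_guesses):
    """Return the reading that has reached the threshold, or None if none has."""
    scores = tally(ocr_guesses, human_guesses)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= CALL_THRESHOLD else None
```

With OCR readings "finding" and "fnding", two humans typing "finding" bring that reading to 0.5 + 1 + 1 = 2.5 points, so it is called; with only one human vote it sits at 1.5 and stays open.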




reCAPTCHA tests are served from the central site of the reCAPTCHA project, which supplies the undecipherable words. This is done through a JavaScript API, with the website's server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment).
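
As a rough picture of that server-side callback, here is a Python sketch of a form handler that only accepts a post once the CAPTCHA answer has been verified. Everything named here is hypothetical: the form field names, `handle_post`, and the `verify` callable (a stand-in for whatever call the official client library makes to the reCAPTCHA service). It does not reproduce the real API.

```python
from typing import Callable, Mapping


def handle_post(form: Mapping[str, str],
                verify: Callable[[str, str], bool]) -> str:
    """Accept a submission only if the CAPTCHA check passes.

    `verify(challenge, answer)` is an assumed stand-in for the server's
    callback to the reCAPTCHA service; the field names are made up.
    """
    challenge = form.get("captcha_challenge", "")
    answer = form.get("captcha_answer", "")
    if not challenge or not answer:
        return "rejected: captcha missing"
    if not verify(challenge, answer):
        return "rejected: captcha failed"
    return "accepted"
```

The point of the design is that the check happens on the server after the form is submitted, so a bot cannot skip it by bypassing the JavaScript in the page.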



The reCAPTCHA project has also created Mailhide, which protects email addresses from being harvested by spambots. The email address is converted into a format that does not allow a crawler to see the full address. For example, the email "noreply@example.com" would be converted to "nor...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address.
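
The masking step in that example can be sketched as follows. This is only a guess at the display rule based on the single example given: keeping the first three characters of the local part is an assumption, and `mask_email` is a made-up name, not Mailhide's own function.

```python
def mask_email(address, keep=3):
    """Hide most of the part before the @ sign.

    `keep` (how many leading characters stay visible) is an assumption
    inferred from the "nor...@example.com" example above.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % address)
    return f"{local[:keep]}...@{domain}"
```

Calling `mask_email("noreply@example.com")` gives "nor...@example.com". The full address would be held back on the server and shown only after the CAPTCHA is solved.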

Guest Google is Skynet

I guess you guys did not know Google now owns reCAPTCHA. They are using it to digitize all books.




Google has acquired reCAPTCHA, a company that provides CAPTCHAs to help protect more than 100,000 websites from spam and fraud.


Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to allow humans in but prevent malicious programs from scalping tickets or obtaining millions of email accounts for spamming. But there's a twist: the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.


In this way, reCAPTCHA's unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.
