Recently I was asked if captchas.net is still secure today. My guess was that it probably wasn’t very secure compared to the fuzzy text of reCAPTCHA, but I wasn’t sure by how much, so I decided to look into it a bit more.
As you may know, Google has deprecated their old reCAPTCHA V1 API in favor of their new reputation and image recognition based system. Apparently this may also in part be because advances in text recognition software are starting to make the fuzzy text challenge obsolete.
As a research experiment, I decided to try my hand at solving captchas.net captchas with only software, and see how well I could do.
A few things up-front:
- As much as I like open source, I’m not going to publish the research code, as it would only be abused by script kiddies.
- It’s important to keep in mind that whether a captcha system is good enough for your needs depends on whether you just want to reduce the amount of bot traffic or actually keep bots out (and remember that hackers can use cheap captcha-solving services which use people to solve them).
Alright, so the first step is to take a look at a sample captcha.
So what we’re dealing with is:
- Random noise
- Rotated characters
- Low-resolution text
On the plus side, we have these things going for us:
- A predictable length of 6 characters
- Case-insensitive matching
- Reloading the image over and over gives a different random sample
So, let’s see what we can do.
Utilizing ImageMagick, we can clean up the noise a bit and crop out the important area.
Cool, now we have something we can feed to Tesseract. If we tweak the settings a bit, we get a good result.
Case-insensitive means that’s close enough!
Let’s benchmark this.
After a few hundred tests, the average success rate on reading a random captcha is ~16%.
Yikes! 16% may not seem like a high percentage, but for most things a bot would have no problem submitting a form 6-7 times until it gets it right, and in less time than a person could do it.
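To put a number on the “6-7 times” estimate, here is a quick back-of-the-envelope calculation. It is only a sketch, and it assumes every attempt is independent and succeeds with the same ~16% probability; the function names are mine, made up for illustration.

```python
def expected_attempts(p: float) -> float:
    """Average number of tries until the first success (geometric distribution)."""
    return 1.0 / p


def chance_within(p: float, tries: int) -> float:
    """Probability of at least one success in `tries` independent attempts."""
    return 1.0 - (1.0 - p) ** tries


if __name__ == "__main__":
    p = 0.16  # measured single-read success rate from the benchmark above
    print(f"expected attempts: {expected_attempts(p):.2f}")
    for tries in (1, 7, 15, 30):
        print(f"chance of success within {tries:>2} tries: {chance_within(p, tries):.0%}")
```

Under those assumptions the average works out to 6.25 attempts, and roughly 70% of bots would be through within 7 tries.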
But it gets worse. Remember how reloading the image over and over again produces a different random image? Using this information, we can request the same captcha multiple times to improve the guess, at the expense of more resources.
Using a dynamic range of 5-25 reads (however many it takes to get 5 that produce the same result, or the best out of 25), I was able to boost the success rate to ~74%.
Ouch! That’s pretty bad.
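The “5 that agree, or best out of 25” rule is just a majority vote with an early exit. Below is a generic sketch of that voting logic only, not the research code: `read_once` is a hypothetical callable standing in for “fetch a fresh rendering and return one guess”, and the parameter names are my own.

```python
from collections import Counter
from typing import Callable


def consensus_read(read_once: Callable[[], str],
                   agree: int = 5,
                   max_reads: int = 25) -> str:
    """Repeat a noisy read until `agree` guesses match, or give up after
    `max_reads` and return the most common guess seen so far."""
    counts: Counter = Counter()
    for _ in range(max_reads):
        guess = read_once().lower()  # matching is case-insensitive
        counts[guess] += 1
        if counts[guess] >= agree:
            return guess  # early exit: enough identical reads
    # No guess reached the threshold, so fall back to the plurality winner.
    return counts.most_common(1)[0][0]
```

The early exit is what makes the range dynamic: an easy captcha costs only 5 requests, and only the hard ones use the full 25.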
And they actually charge money for this captcha service (if you want to remove the watermark, or self-host it).
In short, I think it’s safe to say that captcha systems which were once secure may no longer be.