You have probably encountered those annoying prompts on registration or feedback pages that read, “Enter the letters you see on the image,” or “Select the images with a…” These are known as captchas, and they are designed as gates that let humans in and keep automated programs out.
CAPTCHA stands for “Completely Automated Public Turing Test to Tell Computers and Humans Apart”.
Simply put, they are intended to differentiate between humans and automated users, such as bots. A text captcha is distorted so that a human can still read it without much difficulty, whereas a machine cannot.
In practice, however, this rarely holds for long, because almost every simple text captcha deployed on a site is cracked within a few months.
As we have mentioned, sites use CAPTCHAs to restrict bots. But why shouldn’t bots be allowed to access these sites? Here are some more specific uses.
Most websites have automatic captchas, which are triggered when the site detects unusual activity that resembles bot behavior, such as a burst of requests within a fraction of a second or links being clicked far faster than a human could manage.
Captchas can be a major impediment to web scraping, since scraping is carried out by automated bots. However, this should not worry you.
There are several ways to overcome captchas when scraping the web. One way is to handle them in Python, either by writing your own code from scratch or by adapting existing code. Alternatively, to save yourself the effort, you can opt for an automatic site unblocker that deals with captchas for you.
The most common captcha is the image captcha, which contains distorted letters that a computer program cannot easily recognize but a human can still make out. When web scraping, you can extract the letters from the image using Python. Here’s how.
After accessing the captcha in a useful format, you can employ the help of Optical Character Recognition, which comes in handy for extracting text from images.
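How the captcha image is obtained depends entirely on the target page. As one possibility, here is a minimal sketch that assumes the page embeds the captcha as a base64 data URI in an img tag. That layout, and the helper name extract_captcha_bytes, are assumptions made for illustration and are not taken from any particular site.

```python
import base64
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_captcha_bytes(html):
    """Return the raw bytes of the first base64-embedded image in the HTML.

    Hypothetical helper: assumes the captcha is the first image on the
    page that is embedded as a data URI.
    """
    parser = ImgSrcParser()
    parser.feed(html)
    for src in parser.sources:
        if src.startswith("data:image/") and ";base64," in src:
            encoded = src.split(";base64,", 1)[1]
            return base64.b64decode(encoded)
    raise ValueError("no base64-embedded image found in the page")
```

With Pillow installed, the returned bytes could then be opened as an image with Image.open(io.BytesIO(data)). If the captcha is served from a separate URL instead, it would have to be downloaded rather than decoded.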
You can use the open-source Tesseract OCR engine through pytesseract, its Python wrapper, to recognize and “read” the text embedded in the image. The wrapper can be installed using the pip command; the Tesseract engine itself has to be installed separately.
pip install pytesseract
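Because pytesseract is a wrapper that calls the tesseract executable, it can be worth checking that the executable is reachable before running any OCR. The sketch below uses only the standard library; the function name tesseract_available is made up for this example.

```python
import shutil


def tesseract_available(command="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found; install the engine first")
```

If the engine is installed somewhere that is not on the PATH, pytesseract can be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.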
The first step is to extend the original Python script that loaded the captcha. The extended script converts the captcha to black and white before reading it, as follows.
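import pytesseract
# get_captcha() and html come from the original script that loaded the captcha
img = get_captcha(html)
img.save('captcha_original.png')
# Convert the image to grayscale
gray = img.convert('L')
gray.save('captcha_gray.png')
# Threshold: only completely black pixels stay black, the rest become white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')
# The format is now easy and
# can be passed to tesseract as follows
print(pytesseract.image_to_string(bw))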
When run, the output of this final script is the text of the captcha on the form you are trying to access.
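To see what the thresholding step in the script does, the same rule can be applied to a few plain grayscale values (0 is black, 255 is white). This is only an illustration in pure Python; the cutoff parameter is an addition for the example, with its default of 1 matching the lambda in the script.

```python
def threshold(value, cutoff=1):
    """Map a grayscale value to black (0) if below the cutoff, else white (255)."""
    return 0 if value < cutoff else 255


def threshold_pixels(pixels, cutoff=1):
    """Apply the threshold rule to a sequence of grayscale values."""
    return [threshold(p, cutoff) for p in pixels]


sample = [0, 1, 40, 128, 255]
print(threshold_pixels(sample))       # [0, 255, 255, 255, 255]
print(threshold_pixels(sample, 100))  # [0, 0, 0, 255, 255]
```

With a cutoff of 1, only pure black survives, which suits captchas whose letters are solid black on a lighter, noisy background. Whether a higher cutoff reads better depends on the captcha in question.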
If you are new to web scraping, read frequently asked questions on web scraping.
As we mentioned earlier, sending requests in rapid succession and clicking links continuously are treated as bot behavior and can prompt a website to serve captchas or block access. To reduce this, rotate proxies so that consecutive requests reach the website from different IP addresses. Clean residential proxies help avoid captchas triggered during scraping, because the target site sees the proxy's address rather than your own IP address.
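Proxy rotation can be sketched with the standard library alone. The proxy addresses below are placeholders from a documentation-only address range, and the function names are made up for this example; a real setup would use the endpoints and credentials supplied by your proxy provider.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints (documentation-only addresses, not real proxies).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Endless round-robin iterator over the proxy pool.
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy URL in round-robin order."""
    return next(_proxy_cycle)


def build_opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch a URL, using a different proxy from the pool on each call."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Round-robin is the simplest policy. Picking a proxy at random, or dropping proxies that start returning captchas, are common variations.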
Merely changing the user agent once will not be enough to keep websites from restricting access when you send many requests in a short time. You will have to rotate user agents so that the target website sees the requests as coming from different devices.
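User agent rotation follows the same pattern. In this minimal sketch the user agent strings are illustrative examples and build_request is a hypothetical helper; in practice the list would be larger and kept up to date.

```python
import random
import urllib.request

# Illustrative user agent strings; extend and refresh this list in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url):
    """Create a request object carrying a randomly chosen User-Agent header."""
    user_agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("https://example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

The request object can then be passed to urllib.request.urlopen, or to a proxy-aware opener, so that both the IP address and the user agent vary between requests.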
That is all on how to solve captchas using Python. If you still cannot solve a captcha with your code, let’s discuss it in the comments.