In your web scraping journey, you have probably come across websites that are hard to scrape because they sit behind Cloudflare's protection. Many e-commerce and job websites are guarded this way.
In this tutorial, we will bypass that protection using Cloudscraper and discuss how to pass proxies and custom headers to it.
How Does Cloudscraper Work?
- Cloudscraper modifies HTTP headers (e.g., User-Agent, Accept-Language) to match real browsers, and it mimics browser TLS fingerprints.
- Cloudscraper runs a JavaScript interpreter (such as Node.js) to automatically solve the challenges Cloudflare throws.
- Cloudscraper extracts Cloudflare’s security tokens (cf_clearance, __cf_bm) from cookies and reuses them.
- It also supports the integration of captcha-solving services like 2Captcha.
- To avoid IP bans, Cloudscraper supports rotating proxies.
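To illustrate that last point, here is a minimal sketch of how a rotating proxy pool might be wired up. The proxy addresses below are placeholders, not real endpoints, and `next_proxy` is a helper name of my own, not part of the cloudscraper API:

```python
import itertools

# Hypothetical proxy pool -- replace with your own proxy endpoints.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return a requests-style proxies dict, advancing the rotation."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Each call to scraper.get(url, proxies=next_proxy()) then goes out
# through the next proxy in the pool.
```

Because a cloudscraper instance is a `requests` session under the hood, it accepts the same `proxies` keyword as a normal `requests.get` call.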
Requirements
Create a new folder and install cloudscraper, beautifulsoup4, and requests inside it.
mkdir scraper
cd scraper
pip install cloudscraper requests beautifulsoup4
Now, create a Python file with any name you like. I am naming the file cloud.py.
Scraping with Cloudscraper
Let’s first see how a normal GET request to YellowPages.com using the requests library responds.
import requests
url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = requests.get(url)
print(response.status_code)
print(response.text[:500])
Once you run this code you will get this on your console.
The request was blocked by Cloudflare, resulting in a 403 error code. Cloudflare redirected us to its verification page. This error is quite common when you are scraping websites.
Cloudflare judged the incoming request to be bot-like and blocked it. To bypass this verification page and scrape the target page, we have to make the request look more human. This is where cloudscraper can help us.
Now, let’s see how we can bypass this blockage with cloudscraper.
import cloudscraper

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)
url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = scraper.get(url)
# Print the response
print(response.status_code)
print(response.text[:500])
We have created a cloudscraper instance.
- interpreter="nodejs" uses Node.js for executing JavaScript challenges.
- delay=10 adds a 10-second delay between requests, which helps avoid detection.
- Then, to make the request look more authentic, we add browser emulation. This mimics a real browser; in our cloudscraper code, we are emulating a mobile Chrome browser on iOS.
Then finally we are making the GET request. Let’s run the code and see what happens.
We got a 200 status code. That means cloudscraper was able to scrape our target webpage.
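In practice, an individual request can still fail intermittently even when cloudscraper is working, so it helps to wrap the GET call in a small retry loop. Below is a minimal sketch; `fetch_with_retry` is a helper name of my own, not part of the cloudscraper API:

```python
import time

def fetch_with_retry(get_func, url, retries=3, backoff=5):
    """Call get_func(url) until it returns a 200 response, waiting
    `backoff` seconds between attempts. Returns the last response
    seen (or None if every attempt raised an exception)."""
    response = None
    for _ in range(retries):
        try:
            response = get_func(url)
        except Exception:
            response = None
        if response is not None and response.status_code == 200:
            return response
        time.sleep(backoff)
    return response

# Usage with the scraper from above:
# response = fetch_with_retry(scraper.get, url)
```

Passing `scraper.get` as a callable keeps the helper independent of cloudscraper itself, so the same loop works with plain `requests.get` too.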
Parsing with BeautifulSoup
We will parse the title of the business, its operating status, and its phone number.
The title is located inside the h1 tag with the class business-name.
The status is located inside the div tag with the class status-text.
The phone number is located inside the a tag with the class phone.
import cloudscraper
from bs4 import BeautifulSoup

o = {}

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)

url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = scraper.get(url)
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')

o['Title'] = soup.find('h1', {'class': 'business-name'}).text
o['Status'] = soup.find('div', {'class': 'status-text'}).text
o['Phone-Number'] = soup.find('a', {'class': 'phone'}).get('href').replace('tel:', '')

print(o)
Let’s run the code.
We were able to scrape and parse the data with the help of cloudscraper and BS4.
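If you want to test the parsing logic without hitting the network, you can run the same selectors against a static snippet that mirrors the page structure described above. The HTML below is a mock I wrote for illustration, not the real YellowPages markup:

```python
from bs4 import BeautifulSoup

# Mock HTML mirroring the tags and classes we target on the live page.
html = """
<h1 class="business-name">New York Life Insurance</h1>
<div class="status-text">Open</div>
<a class="phone" href="tel:8185551234">Call</a>
"""

soup = BeautifulSoup(html, "html.parser")
o = {
    "Title": soup.find("h1", {"class": "business-name"}).text,
    "Status": soup.find("div", {"class": "status-text"}).text,
    "Phone-Number": soup.find("a", {"class": "phone"}).get("href").replace("tel:", ""),
}
print(o)
```

Testing selectors offline like this makes it easy to tell parsing bugs apart from Cloudflare blocks when the full script misbehaves.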
There is a small issue with this approach: it will not work if you want to scrape thousands of pages from Cloudflare-protected websites. We have to tweak the way we are making the GET request. One way is to pass custom headers.
How to pass headers to Cloudscraper
Many websites block default Python requests. Cloudscraper allows setting a custom User-Agent to mimic real browsers.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}

response = scraper.get(url, headers=headers)
print(response.text)
This will drastically reduce detection chances while scraping. Cloudscraper also provides integration with third-party captcha-solving services.
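Going one step further, you can rotate the User-Agent on every request instead of reusing a single string. A simple sketch (the UA strings below are just examples; `random_headers` is my own helper name):

```python
import random

# A small pool of example User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Each request then carries a potentially different UA:
# response = scraper.get(url, headers=random_headers())
```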
How to solve a captcha with Cloudscraper
Some websites use CAPTCHAs, requiring a manual solution. Cloudscraper supports third-party captcha-solving services.
We just have to pass a captcha argument within the cloudscraper instance.
scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    },
    captcha={"provider": "2captcha", "api_key": "your_2captcha_api_key"}
)
Here I am using 2Captcha as a captcha-solving service.
Disadvantages of using Cloudscraper
So far, we’ve focused on what Cloudscraper can do, but now let’s shift our attention to its limitations.
- Cloudflare keeps updating its security measures, which makes scraping increasingly difficult.
- Cloudscraper is designed specifically for bypassing Cloudflare but does not work on other bot protection systems, such as Akamai, PerimeterX, and Datadome.
- Cloudscraper bypasses JavaScript challenges, but when Cloudflare triggers reCAPTCHA or hCaptcha, it fails.
Let’s understand this with an example by scraping this page from indeed.com using cloudscraper. When you visit this page normally in your browser, you will see this on your screen.
After waiting for a few seconds it will redirect us to the target page because it knows that the request came from a legit browser.
Now, let’s see whether cloudscraper can do the same thing or not.
scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    },
    captcha={"provider": "2captcha", "api_key": "your-api-key"}
)
url = "https://www.indeed.com/jobs?q=finance&l=San+Leandro%2C+CA&start=0"
response = scraper.get(url)
print(response.status_code)
Every request will return a 403 error because the newer versions of Cloudflare won’t let cloudscraper bypass them.
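One practical tip: you can detect this kind of block programmatically before deciding to retry or switch tools. The strings checked below commonly appear on Cloudflare challenge and block pages, but treat them as heuristics of my own choosing rather than a guaranteed signature:

```python
def looks_like_cloudflare_block(status_code, html):
    """Heuristic: does this response look like a Cloudflare
    challenge or block page?"""
    markers = (
        "Just a moment...",
        "cf-browser-verification",
        "Attention Required! | Cloudflare",
    )
    return status_code in (403, 503) or any(m in html for m in markers)

# Usage:
# if looks_like_cloudflare_block(response.status_code, response.text):
#     ...retry, rotate proxies, or fall back to another tool...
```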
Alternative to Cloudscraper
If you want to scrape millions of pages from such websites then Scrapingdog can help you with it. It is a web scraping API that handles headers, proxies, retries, and of course Captchas for you.
To start with you just need to sign up for the trial pack. Once you are on the dashboard you can just paste your target URL and copy the code from the right-hand side.
import requests
response = requests.get("https://api.scrapingdog.com/scrape", params={
'api_key': 'your-api-key',
'url': 'https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY',
'dynamic': 'false',
'super_proxy': 'true'
})
print(response.text)
Remember to use your own Scrapingdog API key before running the code.
Once you run this code in your working environment you will see that Scrapingdog has bypassed Cloudflare’s security wall and extracted the results. We got the raw HTML and now using any parsing library you can extract valuable details from it.
Conclusion
Cloudscraper is a useful tool for bypassing Cloudflare’s basic protections, but it comes with limitations such as frequent blocks, CAPTCHA challenges, and dependency on browser emulation. While it can work for small-scale scraping, it lacks stability and scalability for long-term projects.
For a more reliable and efficient alternative, Scrapingdog offers a fully managed web scraping API that automatically handles Cloudflare, CAPTCHAs, and anti-bot protections without the need for proxies or complex configurations. With its simple API integration, real-time rendering, and robust infrastructure, Scrapingdog provides a hassle-free scraping experience.
🚀 If you’re looking for a stable, scalable, and CAPTCHA-free scraping solution, Scrapingdog is the way to go!
Additional Resources
- Cloudflare Error 1015: What Is It & How To Bypass It
- Cloudflare Error 1020: What Is It & How To Bypass It
- 520 Status Code – How To Bypass It
- 429 Status Code – How To Avoid It
- How To Bypass 999 Error When Scraping LinkedIn Profiles
- Avoid Getting Ban & Bypass Captcha When Scraping Amazon
- 499 Status Code – How To Avoid It
- Tips To Avoid Getting Blocked While Web Scraping
