
How to Use Cloudscraper with Python


In your web scraping journey, you must have come across websites that are nearly impossible to scrape because of Cloudflare's captcha. Many ecommerce, job-board, and similar websites are protected by it.

In this tutorial, we will bypass the captcha by using Cloudscraper and discuss how to pass proxy and headers to Cloudscraper.


How Does Cloudscraper Work?

  • Cloudscraper modifies HTTP headers (e.g., User-Agent, Accept-Language) to match real browsers and mimics their TLS fingerprints.
  • Cloudscraper emulates a browser engine (using Node.js or V8) to automatically solve the challenges Cloudflare throws.
  • Cloudscraper extracts Cloudflare’s security tokens (cf_clearance, __cf_bm) from cookies and reuses them.
  • It also supports the integration of captcha-solving services like 2Captcha.
  • To avoid IP bans, Cloudscraper supports rotating proxies.
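The proxy rotation mentioned in the last bullet can be sketched independently of cloudscraper. This is a minimal illustration; the proxy URLs below are placeholders, not real endpoints:

```python
import itertools

# Hypothetical proxy pool; replace with your real proxy URLs.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# Cycle through the pool so consecutive requests use different exit IPs.
_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_pool)
    return {"http": proxy, "https": proxy}
```

Since cloudscraper sessions accept the same keyword arguments as requests, a call like `scraper.get(url, proxies=next_proxy())` would then route each request through a different proxy.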

Requirements

  • In this tutorial, we will scrape this page from yellowpages.com using cloudscraper. Before we begin, make sure that Python is installed on your computer. If it is not, you can download it from here.

First, create a folder for keeping Python scripts in it.

				
mkdir scraper

Install cloudscraper, requests, and beautifulsoup4 inside this folder.

				
cd scraper
pip install cloudscraper requests beautifulsoup4

Now, create a Python file with any name you like. I am naming the file cloud.py.

Scraping with Cloudscraper

Let’s first see how a normal GET request to YellowPages.com using the requests library responds.

				
import requests

url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = requests.get(url)

print(response.status_code)
print(response.text[:500])

Once you run this code you will get this on your console.

The request was blocked by Cloudflare, resulting in a 403 error code. Cloudflare redirected us to its verification page. This error is quite common when you are scraping websites.

Cloudflare thought that the incoming request was bot-like and ultimately blocked it. To bypass this verification page and scrape the target page we have to make the request look more humanized. This is where cloudscraper can help us.
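Before trying to bypass anything, it helps to recognize a blocked response programmatically. This is a rough heuristic of my own, not an official API; the marker strings are phrases that commonly appear on Cloudflare challenge pages:

```python
# Phrases that commonly appear on Cloudflare challenge pages.
CLOUDFLARE_MARKERS = (
    "Just a moment...",
    "Checking your browser",
    "cf-browser-verification",
)

def looks_blocked(status_code, body):
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    if status_code in (403, 503):
        return True
    return any(marker in body for marker in CLOUDFLARE_MARKERS)
```

A check like `looks_blocked(response.status_code, response.text)` lets a script retry or switch strategy instead of parsing a challenge page by mistake.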

Now, let’s see how we can bypass this blockage with cloudscraper.

				
import cloudscraper

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)

url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = scraper.get(url)

# Print the status code and a preview of the response body
print(response.status_code)
print(response.text[:500])

We have created a cloudscraper instance.

  • interpreter="nodejs" uses Node.js for executing JavaScript challenges.
  • delay=10 adds a 10-second delay between requests. This helps avoid detection.
  • Then to make the request look more authentic we are adding browser emulation. This will mimic a real browser. In our cloudscraper code, we are emulating a mobile Chrome browser on iOS.

Then finally we are making the GET request. Let’s run the code and see what happens.

We got a 200 status code. That means cloudscraper was able to scrape our target webpage.
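The delay option above pauses while cloudscraper solves a challenge; the same idea can be applied at the application level with a small retry loop. This sketch is not part of cloudscraper's API; `fetch` stands in for any function (such as `scraper.get`) that returns a response object:

```python
import time

def get_with_retries(fetch, url, retries=3, pause=5):
    """Call fetch(url) until it returns a 200, pausing between attempts."""
    last = None
    for attempt in range(retries):
        last = fetch(url)
        if last.status_code == 200:
            return last
        time.sleep(pause)  # back off before the next attempt
    return last  # give back the last (failed) response
```

Calling `get_with_retries(scraper.get, url)` would then retry transient blocks instead of failing on the first 403.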

Parsing with BeautifulSoup

We will parse the title of the business, its phone number, and its open/closed status.

The title is located inside the h1 tag with the class business-name.

The status is located inside the div tag with class status-text.

The phone number is located inside the a tag with class phone.
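These three selectors can be tried out on a static snippet before wiring them into the scraper. The HTML below is a simplified stand-in I wrote for the real page, using the same tag and class names:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the YellowPages listing markup.
html = """
<h1 class="business-name">New York Life Insurance</h1>
<div class="status-text">Open</div>
<a class="phone" href="tel:8185550123">Call</a>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", {"class": "business-name"}).text
status = soup.find("div", {"class": "status-text"}).text
phone = soup.find("a", {"class": "phone"}).get("href").replace("tel:", "")
print(title, status, phone)
```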

				
import cloudscraper
from bs4 import BeautifulSoup

o = {}

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)

url = "https://www.yellowpages.com/glendale-ca/mip/new-york-life-insurance-469699226?lid=469699226"
response = scraper.get(url)
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')

o['Title'] = soup.find('h1', {'class': 'business-name'}).text
o['Status'] = soup.find('div', {'class': 'status-text'}).text
o['Phone-Number'] = soup.find('a', {'class': 'phone'}).get('href').replace('tel:', '')

print(o)

Let’s run the code.

We were able to scrape and parse the data with the help of cloudscraper and BS4.

There is a small issue with this approach: it will not work if you want to scrape thousands of pages from websites protected by Cloudflare. We have to tweak the way we are making the GET request. One way is to pass custom headers.

How to pass headers to Cloudscraper

Many websites block default Python requests. Cloudscraper allows setting a custom User-Agent to mimic real browsers.

				
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}

response = scraper.get(url, headers=headers)
print(response.text)

This will drastically reduce the chances of detection while scraping. Cloudscraper also integrates with third-party captcha-solving services.
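Rotating the User-Agent per request reduces fingerprinting further. This helper picks one at random; the strings below are ordinary example User-Agents and should be kept current in real use:

```python
import random

# A few example desktop User-Agent strings; extend with current ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Each call like `scraper.get(url, headers=random_headers())` then presents a different browser identity.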

How to solve a captcha with Cloudscraper

Some websites use CAPTCHAs, requiring a manual solution. Cloudscraper supports third-party captcha-solving services.

We just have to pass a captcha argument within the cloudscraper instance.

				
scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    },
    captcha={"provider": "2captcha", "api_key": "your_2captcha_api_key"}
)

Here I am using 2Captcha as a captcha-solving service.
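Hardcoding the API key is risky if the script ever ends up in version control. One way around that is reading it from an environment variable; `TWOCAPTCHA_API_KEY` is just a name I chose for this sketch:

```python
import os

def captcha_config():
    """Build the captcha argument for create_scraper from the environment."""
    api_key = os.environ.get("TWOCAPTCHA_API_KEY")
    if not api_key:
        raise RuntimeError("Set the TWOCAPTCHA_API_KEY environment variable first")
    return {"provider": "2captcha", "api_key": api_key}
```

You would then pass `captcha=captcha_config()` to `cloudscraper.create_scraper(...)` instead of the literal dict.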

Disadvantages of using Cloudscraper

So far, we’ve focused on what CloudScraper can do, but now let’s shift our attention to its limitations.

  • Cloudflare keeps updating its security measures which makes scraping very difficult.
  • Cloudscraper is designed specifically for bypassing Cloudflare but does not work on other bot protection systems, such as Akamai, PerimeterX, and Datadome.
  • Cloudscraper bypasses JavaScript challenges, but when Cloudflare triggers reCAPTCHA or hCaptcha, it fails.

Let’s understand this with an example. Let’s scrape this page from indeed.com using cloudscraper. When you visit this page normally on your browser you will see this on your screen.

After waiting for a few seconds it will redirect us to the target page because it knows that the request came from a legit browser.


Now, let’s see whether cloudscraper can do the same thing or not.

				
import cloudscraper

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    delay=10,
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    },
    captcha={"provider": "2captcha", "api_key": "your-api-key"}
)

url = "https://www.indeed.com/jobs?q=finance&l=San+Leandro%2C+CA&start=0"
response = scraper.get(url)
print(response.status_code)

Every request will return a 403 error because the latest version of Cloudflare won’t let cloudscraper bypass it.

Alternative to Cloudscraper

If you want to scrape millions of pages from such websites then Scrapingdog can help you with it. It is a web scraping API that handles headers, proxies, retries, and of course Captchas for you.

To start with you just need to sign up for the trial pack. Once you are on the dashboard you can just paste your target URL and copy the code from the right-hand side.

				
import requests

response = requests.get("https://api.scrapingdog.com/scrape", params={
    'api_key': 'your-api-key',
    'url': 'https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY',
    'dynamic': 'false',
    'super_proxy': 'true'
})

print(response.text)

Remember to use your own Scrapingdog API key before running the code.
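The same request can also be assembled as a full URL by hand, which is handy for logging or caching. The parameter names follow the snippet above; the helper function name is my own:

```python
from urllib.parse import urlencode

def scrapingdog_url(api_key, target_url, dynamic=False, super_proxy=True):
    """Assemble the Scrapingdog scrape endpoint URL with its query string."""
    params = {
        "api_key": api_key,
        "url": target_url,
        "dynamic": str(dynamic).lower(),
        "super_proxy": str(super_proxy).lower(),
    }
    return "https://api.scrapingdog.com/scrape?" + urlencode(params)
```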

Once you run this code in your working environment, you will see that Scrapingdog has bypassed Cloudflare’s security wall and extracted the results. We got the raw HTML, and you can now extract valuable details from it using any parsing library.

Conclusion

Cloudscraper is a useful tool for bypassing Cloudflare’s basic protections, but it comes with limitations such as frequent blocks, CAPTCHA challenges, and dependency on browser emulation. While it can work for small-scale scraping, it lacks stability and scalability for long-term projects.

For a more reliable and efficient alternative, Scrapingdog offers a fully managed web scraping API that automatically handles Cloudflare, CAPTCHAs, and anti-bot protections without the need for proxies or complex configurations. With its simple API integration, real-time rendering, and robust infrastructure, Scrapingdog provides a hassle-free scraping experience.

🚀 If you’re looking for a stable, scalable, and CAPTCHA-free scraping solution, Scrapingdog is the way to go!


My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.
Manthan Koolwal
