A Complete Guide on User-Agents in Web Scraping (+Best User-Agent List)

Proxies come up often in discussions of web scraping, and proxy rotation matters a great deal when you are scraping millions of pages. Headers, however, play an equally important role.

Combined with other headers, the User-Agent can help you scrape large amounts of data without being blocked. In this article, we will discuss what a user agent is, how it is used in small web scraping projects, and how it can help you with advanced scraping.

What is a User Agent?

In the context of web scraping, the User-Agent is a request header that you set to mimic a real browser. It tells the server about the client making the request, such as the browser type, version, and sometimes the operating system. A browser-like value makes a request look more legitimate and influences how the host server responds.

But why are User Agents important?

Well, the User Agent is often one of the first things a host server checks before it responds with status code 200 (OK) and allows access to the requested resource. If the server finds the User Agent suspicious, for example because it is missing or names a scripting library, it can respond with a 4xx error such as 403 (Forbidden).

What does User Agent look like?

A User Agent looks like this: Mozilla/5.0 (X11; U; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5399.183 Safari/537.36

Let me break this string down and explain what each part means.

  1. Mozilla/5.0 – This is a legacy token that almost every web browser includes in its User-Agent string for historical reasons. "Mozilla" was the codename of the Netscape Navigator browser, and other browsers adopted the token so that servers checking for it would send them the same content.
  2. (X11; U; Linux x86_64) – This part carries the operating system and platform information. X11 is the windowing system, U is a legacy token that older browsers used to indicate encryption strength, and Linux x86_64 means the browser is running on Linux on a 64-bit x86 architecture.
  3. AppleWebKit/537.36 – This part names the rendering engine. WebKit is the engine used by Apple's Safari browser. Chrome's own engine, Blink, is a fork of WebKit, and Chrome keeps this token for compatibility. "537.36" is the WebKit version being reported.
  4. (KHTML, like Gecko) – This is an additional compatibility detail. "KHTML" is the open-source layout engine of the Konqueror web browser, from which WebKit was forked. "Gecko" is the layout engine used by Mozilla Firefox. This part helps the browser appear compatible with a wider range of web content.
  5. Chrome/108.0.5399.183 – This part indicates that the browser is Chrome, and "108.0.5399.183" is the version of Google Chrome. Websites can use it to detect the browser and version, for example to optimize content or work around compatibility issues.
  6. Safari/537.36 – The final part is another compatibility token. Chrome includes it so that websites looking for Safari serve it the same content. The "537.36" number repeats the WebKit version from part 3.
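
The breakdown above can also be reproduced in code. Here is a minimal sketch, assuming a simple regular expression is good enough for well-formed strings like this one. The `parse_user_agent` helper is illustrative and not part of any library.

```python
import re

# A token is either a parenthesised comment or a "Product/version" pair.
TOKEN_RE = re.compile(r"\(([^)]*)\)|([^\s/()]+)/(\S+)")


def parse_user_agent(ua):
    """Split a User-Agent string into (kind, name, detail) tuples, in order."""
    tokens = []
    for match in TOKEN_RE.finditer(ua):
        comment, product, version = match.groups()
        if comment is not None:
            # Comments hold semicolon-separated platform details.
            parts = [part.strip() for part in comment.split(";")]
            tokens.append(("comment", None, parts))
        else:
            tokens.append(("product", product, version))
    return tokens


ua = ("Mozilla/5.0 (X11; U; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/108.0.5399.183 Safari/537.36")

for kind, name, detail in parse_user_agent(ua):
    print(kind, name, detail)
```

Real-world strings can be messier (nested parentheses, missing versions), so treat this as a reading aid rather than a robust parser.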

If you want to break down and test more user agent strings then use this website.

How to use User Agents with Python

Python is where many new coders start learning web scraping. Along the way, you will come across websites that are quite sensitive to scrapers, and you may have to pass headers such as User-Agent. Let's see how to pass a User-Agent with a simple example.

				
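import requests

# httpbin.org/headers echoes back the request headers it received
target_url = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361675787112'}
# timeout stops the call from hanging if the server never responds
resp = requests.get(target_url, headers=headers, timeout=10)

print(resp.text)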

Here I am passing a custom User-Agent to the target URL https://httpbin.org/headers. Once you run this code you will get this as output.

				
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361675787112",
    "X-Amzn-Trace-Id": "Root=1-6548b6c9-4381f1cb1cb6dc915aa1268f"
  }
}

So, this way you can pass a user agent to websites that are sensitive to scraping with Python.
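
Rather than reading the output by eye, you can check in code that the server received the header you meant to send. The sketch below assumes the response body has the JSON shape shown above. The `echoed_user_agent` helper is a hypothetical name, and `sample_body` is a trimmed stand-in for `resp.text` so the example runs without a network call.

```python
import json


def echoed_user_agent(body):
    """Return the User-Agent that httpbin.org/headers reports having received."""
    return json.loads(body)["headers"].get("User-Agent")


sent_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")

# Stand-in for resp.text; a live response would carry more headers.
sample_body = json.dumps({"headers": {"Host": "httpbin.org", "User-Agent": sent_ua}})

print(echoed_user_agent(sample_body) == sent_ua)
```

In a live run you would pass `resp.text` in place of `sample_body`.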

How to avoid getting your scraper banned?

You might think that a rotating proxy alone will solve this problem. But that is not the case with many websites, such as Google and Amazon.

Along with proxy rotation, you also have to focus on header rotation (especially the User-Agent). In some cases, you might have to use up-to-date User-Agents so that the request looks like it comes from a current browser. Let's see how we can rotate user agents in Python.

User Agent rotation with Python

For this example, I am going to consider these five User Agents.

				
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361675787110',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5412.99 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5361.172 Safari/537.36',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5397.215 Safari/537.36',

We are going to use Python's random module. It is part of the standard library, so you don't have to install it separately. Also, if you need more recent User Agents then visit this link.

				
import requests
import random

# Pool of User-Agent strings to choose from
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361675787110',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5412.99 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5361.172 Safari/537.36',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5397.215 Safari/537.36',
]

target_url = 'https://httpbin.org/headers'
# Pick one User-Agent at random for this request
headers = {'User-Agent': random.choice(user_agents)}

# timeout stops the call from hanging if the server never responds
resp = requests.get(target_url, headers=headers, timeout=10)

print(resp.text)

In this code, each run picks a User Agent at random from the list, so successive requests will usually go out with different User Agents. You can combine this code with a rotating proxy to make the scraper more robust. Techniques like this will help you scrape Amazon, Google, etc. effectively.
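
When scraping many pages, you would typically choose a User Agent per request and try a different one if the server answers with a 4xx error. The sketch below illustrates one way to do that and is not a definitive implementation. The `fetch_with_rotation` function and its `send` parameter are hypothetical names introduced here; `send(url, headers)` is assumed to perform the request and return an object with a `status_code` attribute.

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5412.99 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5361.172 Safari/537.36",
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36",
]


def fetch_with_rotation(url, send, user_agents=USER_AGENTS, max_attempts=3):
    """Request `url`, switching to another User-Agent after each 4xx response.

    Returns the last response received, whether or not it succeeded.
    """
    attempts = min(max_attempts, len(user_agents))
    response = None
    # random.sample picks without replacement, so no string repeats within one call.
    for user_agent in random.sample(user_agents, attempts):
        response = send(url, {"User-Agent": user_agent})
        # Stop as soon as the server answers with anything other than a 4xx.
        if not 400 <= response.status_code < 500:
            break
    return response
```

With requests, `send` could be `lambda url, headers: requests.get(url, headers=headers, timeout=10)`. Retrying on every 4xx is deliberately crude. In practice you would probably retry only on codes that suggest blocking, such as 403 or 429, and add a delay between attempts.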

List of Best User-Agents for Web Scraping

Conclusion

Many websites now use a protective layer that blocks scrapers, so passing proper headers has become necessary. In this tutorial, I showed you how User Agents can help you get past that layer and extract the data.

If you have a small project, you can create random user agents using this free tool.

But of course, for mass scraping this will not be enough, and you should consider using a Web Scraping API. Such an API will handle all the headers, proxy rotation, and headless Chrome for you.

I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on social media. You can also follow us on Twitter.

My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.