We often discuss proxies in the context of web scraping, and we understand the importance of proxy rotation when scraping millions of pages. However, headers play an equally important role. Combined with other headers, a well-chosen User-Agent can help you scrape a tremendous amount of data from the web. In this article, we will discuss what a User-Agent is, how it is used in normal, small-scale web scraping projects, and how it can help you with advanced scraping.
What is a User Agent?
In the context of web scraping, the User-Agent is a request header that identifies the client making the request, such as the browser type, version, and often the operating system. By setting it to mimic a real browser, you make a request look more legitimate, which influences how the host server responds.
But why are User Agents important?
Well, the User-Agent is in many cases the deciding factor in whether the host server responds with status code 200 (OK) and allows access to the requested resource. A server can send a 4xx error if it identifies the User-Agent as suspicious.
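You can see why this matters by checking what the requests library sends when you don't set the header yourself. The short sketch below uses the public httpbin.org echo service to print the default User-Agent:

import requests

# httpbin.org/headers echoes back the headers it received.
resp = requests.get('https://httpbin.org/headers')

# With no explicit header, requests identifies itself as "python-requests/x.y.z",
# which many anti-bot layers flag as a non-browser client.
print(resp.json()['headers']['User-Agent'])

A default value like python-requests/2.28.1 is an easy signal for a server to block, which is exactly why we override it with a browser-like string.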
What does User Agent look like?
A User-Agent looks like this: Mozilla/5.0 (X11; U; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5399.183 Safari/537.36
Let me break down the above string and explain what every part means.
- Mozilla/5.0: This is a legacy token that most web browsers include in their User-Agent strings for historical reasons. It dates back to Netscape's "Mozilla" code name (itself a play on the original Mosaic browser) and is kept to ensure compatibility with websites.
- (X11; U; Linux x86_64): This part typically represents the operating system and platform information. In this case, it indicates that the browser is running on a Linux (X11) system using a 64-bit x86 architecture.
- AppleWebKit/537.36: The AppleWebKit part denotes the layout engine used by the browser. This engine is used to render web pages. Apple's Safari browser also uses the WebKit engine. The "537.36" number is the version of the WebKit engine.
- (KHTML, like Gecko): This is an additional detail to ensure compatibility with some websites. "KHTML" refers to the open-source layout engine used by the Konqueror web browser. "Gecko" is the layout engine used by Mozilla Firefox. This part helps the browser appear compatible with a wider range of web content.
- Chrome/108.0.5399.183: This part indicates that the browser is Chrome, and "108.0.5399.183" is the version of Google Chrome. This detail allows websites to detect the browser and version, which may be used to optimize content or detect compatibility issues.
- Safari/537.36: The final part specifies that the browser is compatible with Safari. The "537.36" version number is a reference to WebKit, indicating the version of the engine. Including "Safari" in the User-Agent helps with rendering content designed for Safari browsers.
If you want to break down and test more User-Agent strings, you can use this website.
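If you'd rather inspect these parts in code, here is a minimal sketch that pulls the platform details and the Chrome version out of a Chrome-style User-Agent string with a regular expression. The pattern is illustrative and only covers this common format, not every User-Agent in the wild:

import re

ua = ('Mozilla/5.0 (X11; U; Linux x86_64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/108.0.5399.183 Safari/537.36')

# Capture the platform details inside the first parentheses and the Chrome version token.
match = re.search(r'\((?P<platform>[^)]+)\).*Chrome/(?P<version>[\d.]+)', ua)
if match:
    print(match.group('platform'))   # X11; U; Linux x86_64
    print(match.group('version'))    # 108.0.5399.183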
How to use User Agents with Python
Web scraping with Python is the most common way for many new coders to learn web scraping. During this journey, you will come across certain websites that are quite sensitive to scrapers, and you might have to pass headers like the User-Agent. Let's understand how you can pass a User-Agent with a simple example.
import requests

target_url = 'https://httpbin.org/headers'

# Send a Chrome-like User-Agent instead of the default python-requests one.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

resp = requests.get(target_url, headers=headers)
print(resp.text)
Here I am passing a custom User-Agent to the target URL https://httpbin.org/headers. Once you run this code, you will get the following output.
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-6548b6c9-4381f1cb1cb6dc915aa1268f"
  }
}
So, this is how you can use Python to pass a User-Agent to websites that are sensitive to scraping.
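In practice, a User-Agent is rarely sent alone; real browsers send it together with a handful of other headers, and matching that profile makes the request look more consistent. Here is a sketch of such a header set; the exact values are illustrative and should be adapted to the browser you are imitating:

import requests

target_url = 'https://httpbin.org/headers'

# A more complete, browser-like header set; values here are examples only.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}

resp = requests.get(target_url, headers=headers)
print(resp.text)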
How to avoid getting your scraper banned?
You might be thinking that you can simply avoid this situation by using a rotating proxy. But a proxy alone is not enough for many websites like Google, Amazon, etc.
Along with proxy rotation, you also have to focus on header rotation (especially the User-Agent). In some cases, you might have to use the latest User-Agents to spoof the request. Let's see how we can rotate User-Agents in Python.
User Agent rotation with Python
For this example, I am going to consider these five User Agents.
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5412.99 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5361.172 Safari/537.36
- Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 11_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5397.215 Safari/537.36
import requests
import random

# Pool of User-Agents to rotate through.
userAgents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5412.99 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5361.172 Safari/537.36',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5397.215 Safari/537.36',
]

target_url = 'https://httpbin.org/headers'

# Pick a random User-Agent for every request.
for _ in range(3):
    headers = {'User-Agent': random.choice(userAgents)}
    resp = requests.get(target_url, headers=headers)
    print(resp.text)
In this code, every request picks a random User-Agent from the pool, so successive requests go out with different User-Agents. Now, you can combine this with a rotating proxy to give more strength to the scraper. Techniques like this will help you scrape Amazon, Google, etc. more effectively.
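Here is a sketch of what that combination can look like with the requests library's proxies parameter. The proxy endpoint and credentials below are placeholders, so substitute the address of whatever rotating proxy service you actually use:

import requests
import random

userAgents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.5388.177 Safari/537.36',
]

# Placeholder rotating-proxy endpoint; replace with your provider's host, port, and credentials.
proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}

target_url = 'https://httpbin.org/headers'

headers = {'User-Agent': random.choice(userAgents)}
resp = requests.get(target_url, headers=headers, proxies=proxies)
print(resp.text)

With a rotating proxy behind that endpoint, each request leaves from a different IP and with a different User-Agent, which is much harder to fingerprint than either technique on its own.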
List of Best User-Agents for Web Scraping
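For reference, User-Agent strings in the following formats are commonly seen for mainstream desktop browsers. Browser versions change every few weeks, so treat the version numbers below as examples and refresh them regularly rather than hard-coding them:

- Chrome on Windows: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
- Chrome on macOS: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
- Chrome on Linux: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
- Firefox on Windows: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0
- Safari on macOS: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15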
Conclusion
Many websites have started using a protective layer that prevents scraping. Passing proper headers has therefore become necessary, and in this tutorial, I showed you how User-Agents can help you get past that layer and extract the data.
If you have a small project, you can create random User-Agents using this free tool.
But of course, for mass scraping this will not be enough, and you will have to consider using a Web Scraping API. Such an API will handle the headers, proxy rotation, and headless Chrome for you.
I hope you liked this little tutorial. If you did, please do not forget to share it with your friends and on social media. You can also follow us on Twitter.