GET 10% OFF on all Annual Plans. Use Code - FETCH2024

How To Scrape Amazon Product Data using Python

Table of Contents

The e-commerce industry has grown in recent years, transforming from a mere convenience to an essential facet of our daily lives.

As digital storefronts multiply and consumers increasingly turn to online shopping, there’s an increasing demand for data that can drive decision-making, competitive strategies, and customer engagement in the digital marketplace.

If you are into an e-commerce niche, scraping Amazon can give you a lot of data points to understand the market.

In this guide, we will use Python to scrape Amazon, do price scraping from this platform, and demonstrate how to extract crucial information to help you make well-informed decisions in your business.

Setting up the prerequisites

I am assuming that you have already installed python 3.x on your machine. If not then you can download it from here. Apart from this, we will require two III-party libraries of Python.

  • Requests– We will use this library to connect HTTP with the Amazon page. This library will help us to extract the raw HTML from the target page.
  • BeautifulSoup– This is a powerful data parsing library. Using this we will extract necessary data out of the raw HTML we get using the requests library.

Before we install these libraries we will have to create a dedicated folder for our project.

				
					mkdir amazonscraper
				
			

Now, we will have to install the above two libraries in this folder. Here is how you can do it.

				
					pip install beautifulsoup4
pip install requests
				
			
Now, you can create a Python file by any name you wish. This will be the main file where we will keep our code. I am naming it amazon.py

Downloading raw data from amazon.com

Let’s make a normal GET request to our target page and see what happens. For GET request we are going to use the requests library.
				
					import requests
from bs4 import BeautifulSoup

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

resp = requests.get(target_url)

print(resp.text)
				
			

Once you run this code, you might see this.

This is a captcha from amazon.com and this happens once their architecture observes that the incoming request is from a bot/script and not from a real human being.

To bypass this on-site protection of Amazon we can send some headers like User-Agent. You can even check what headers are sent to amazon.com once you open the URL in your browser. You can check them from the network tab.

Once you pass this header to the request, your request will act like a request coming from a real browser. This can melt down the anti-bot wall of amazon.com. Let’s pass a few headers to our request.

				
					import requests
from bs4 import BeautifulSoup

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url, headers=headers)

print(resp.text)
				
			

Once you run this code you might be able to bypass the anti-scraping protection wall of Amazon.

Now let’s decide what exact information we want to scrape from the page.

What are we going to scrape from Amazon?

It is always great to decide in advance what are you going to extract from the target page. This way we can analyze in advance which element is placed where inside the DOM.

Product details we are going to scrape from Amazon

We are going to scrape five data elements from the page.

  • Name of the product
  • Images
  • Price (Most important)
  • Rating
  • Specs

First, we are going to make the GET request to the target page using the requests library and then using BS4 we are going to parse out this data. Of course, there are multiple other libraries like lxml that can be used in place of BS4, but BS4 has the most powerful and easy-to-use API.

Before making the request we are going to analyze the page and find the location of each element inside the DOM. One should always do this exercise to identify the location of each element.

We are going to do this by simply using the developer tool. This can be accessed by right-clicking on the target element and then clicking on the inspect. This is the most common method, you might already know this.

Identifying the location of each element

Location of the title tag

Identifying location of title tag in source code of amazon website

Once you inspect the title you will find that the title text is located inside the h1 tag with the id title.

Coming back to our amazon.py file, we will write the code to extract this information from Amazon.

				
					import requests
from bs4 import BeautifulSoup

l=[]
o={}


url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(url, headers=headers)
print(resp.status_code)

soup=BeautifulSoup(resp.text,'html.parser')


try:
    o["title"]=soup.find('h1',{'id':'title'}).text.strip()
except:
    o["title"]=None





print(o)
				
			

Here the line soup=BeautifulSoup(resp.text,’html.parser’) is using the BeautifulSoup library to create a BeautifulSoup object from an HTTP response text, with the specified HTML parser.

Then using soup.find() method will return the first occurrence of the tag h1 with id title. We are using .text method to get the text from that element. Then finally I used .strip() method to remove all the whitespaces from the text we receive.

Once you run this code you will get this.

				
					[{'title': 'Apple 2023 MacBook Pro Laptop M2 Pro chip with 12‑core CPU and 19‑core GPU: 16.2-inch Liquid Retina XDR Display, 16GB Unified Memory, 1TB SSD Storage. Works with iPhone/iPad; Space Gray'}]
				
			

If you have not read the above section where we talked about downloading HTML data from the target page then you won’t be able to understand the above code. So, please read the above section before moving ahead.

Location of the image tag

This might be the most tricky part of this complete tutorial. Let’s inspect and find out why it is a little tricky.

Inspecting image tag in the source code of amazon website
As you can see the img tag in which the image is hidden is stored inside div tag with class imgTagWrapper.
				
					allimages = soup.find_all("div",{"class":"imgTagWrapper"})
print(len(allimages))
				
			

Once you print this it will return 3. Now, there are 6 images and we are getting just 3. The reason behind this is JS rendering. Amazon loads its images through an AJAX request at the backend. That’s why we never receive these images when we make an HTTP connection to the page through requests library.

Finding high-resolution images is not as simple as finding the title tag. But I will explain to you step by step how you can find all the images of the product.

  1. Copy any product image URL from the page.
  2. Then click on the view page source to open the source page of the target webpage.
  3. Then search for this image.

You will find that all the images are stored as a value for hiRes key.

All this information is stored inside a script tag. Now, here we will use regular expressions to find this pattern of hiRes”:”image_url”

We can still use BS4 but it will make the process a little lengthy and it might slow down our scraper. For now, we will use (.+?) non-greedy matches for one or more characters. Let me explain what each character in this expression means.

  • The . matches any character except a newline
  • The + matches one or more occurrences of the preceding character.
  • The ? makes the match non-greedy, meaning that it will match the minimum number of characters needed to satisfy the pattern.

The regular expression will return all the matched sequences of characters from the HTML string we are going to pass.

				
					images = re.findall('"hiRes":"(.+?)"', resp.text)
o["images"]=images
				
			

This will return all the high-resolution images of the product in a list. In general, it is not advised to use regular expression in data parsing but it can do wonders sometimes.

Parsing the price tag

There are two price tags on the page, but we will only extract the one which is just below the rating.

We can see that the price tag is stored inside span tag with class a-price. Once you find this tag you can find the first child span tag to get the price. Here is how you can do it.
				
					try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except:
    o["price"]=None
				
			

Once you print object o, you will get to see the price.

				
					{'price': '$2,499.00'}
				
			

Extract rating

You can find the rating in the first i tag with class a-icon-star. Let’s see how to scrape this too.

				
					try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except:
    o["rating"]=None
				
			

It will return this.

				
					{'rating': '4.1 out of 5 stars'}
				
			

In the same manner, we can scrape the specs of the device.

Extract the specs of the device

These specs are stored inside these tr tags with class a-spacing-small. Once you find these you have to find both the span under it to find the text. You can see this in the above image. Here is how it can be done.

				
					specs_arr=[]
specs_obj={}

specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    specs_obj[spanTags[0].text]=spanTags[1].text


specs_arr.append(specs_obj)
o["specs"]=specs_arr
				
			

Using .find_all() we are finding all the tr tags with class a-spacing-small. Then we are running a for loop to iterate over all the tr tags. Then under for loop we find all the span tags. Then finally we are extracting the text from each span tag.

Once you print the object o it will look like this.

Throughout the tutorial, we have used try/except statements to avoid any run time error. We have not managed to scrape all the data we decided to scrape at the beginning of the tutorial.

Complete Code

You can of course make a few changes to the code to extract more data because the page is filled with large information. You can even use cron jobs to mail yourself an alert when the price drops. Or you can integrate this technique into your app, this feature can mail your users when the price of any item on Amazon drops.

But for now, the code will look like this.

				
					import requests
from bs4 import BeautifulSoup
import re

l=[]
o={}
specs_arr=[]
specs_obj={}

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url, headers=headers)
print(resp.status_code)
if(resp.status_code != 200):
    print(resp)
soup=BeautifulSoup(resp.text,'html.parser')


try:
    o["title"]=soup.find('h1',{'id':'title'}).text.lstrip().rstrip()
except:
    o["title"]=None


images = re.findall('"hiRes":"(.+?)"', resp.text)
o["images"]=images

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except:
    o["price"]=None

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except:
    o["rating"]=None


specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    specs_obj[spanTags[0].text]=spanTags[1].text


specs_arr.append(specs_obj)
o["specs"]=specs_arr
l.append(o)


print(l)
				
			

Changing Headers on every request

With the above code, your scraping journey will come to a halt, once Amazon recognizes a pattern in the request.

To avoid this you can keep changing your headers to keep the scraper running. You can rotate a bunch of headers to overcome this challenge. Here is how it can be done.

				
					import requests
from bs4 import BeautifulSoup
import re
import random

l=[]
o={}
specs_arr=[]
specs_obj={}

useragents=['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4894.117 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4855.118 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4892.86 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4854.191 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4859.153 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36/null',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36,gzip(gfe)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4895.86 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4860.89 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4885.173 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4864.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4877.207 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_2_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML%2C like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4872.118 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4876.128 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML%2C like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36']

target_url="https://www.amazon.com/dp/B0BSHF7WHW"

headers={"User-Agent":useragents[random.randint(0,31)],"accept-language": "en-US,en;q=0.9","accept-encoding": "gzip, deflate, br","accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"}

resp = requests.get(target_url,headers=headers)
print(resp.status_code)
if(resp.status_code != 200):
    print(resp)
soup=BeautifulSoup(resp.text,'html.parser')


try:
    o["title"]=soup.find('h1',{'id':'title'}).text.lstrip().rstrip()
except:
    o["title"]=None


images = re.findall('"hiRes":"(.+?)"', resp.text)
o["images"]=images

try:
    o["price"]=soup.find("span",{"class":"a-price"}).find("span").text
except:
    o["price"]=None

try:
    o["rating"]=soup.find("i",{"class":"a-icon-star"}).text
except:
    o["rating"]=None


specs = soup.find_all("tr",{"class":"a-spacing-small"})

for u in range(0,len(specs)):
    spanTags = specs[u].find_all("span")
    specs_obj[spanTags[0].text]=spanTags[1].text


specs_arr.append(specs_obj)
o["specs"]=specs_arr
l.append(o)


print(l)
				
			

We are using a random library here to generate random numbers between 0 and 31(31 is the length of the useragents list). These user agents are all latest so you can easily bypass the anti-scraping wall.

But again this technique is not enough to scrape Amazon at scale. What if you want to scrape millions of such pages? Then this technique is super inefficient because your IP will be blocked. So, for mass scraping one has to use a web scraping proxy API to avoid getting blocked while scraping.

Using Scrapingdog for scraping Amazon

The advantages of using Scrapingdog’s Amazon Scraper API are:

  • You won’t have to manage headers anymore.
  • Every request will go through a new IP. This keeps your IP anonymous.
  • Our API will automatically retry on its own if the first hit fails.
  • Scrapingdog will handle issues like changes in HTML tags. You won’t have to check every time for changes in tags. You can focus on data collection.

Let me show you how easy it is to scrape Amazon product pages using Scrapingdog with just an ASIN code. It would be great if you could read the documentation first before trying the API.

Before you try the API you have to signup for the free pack. The free pack comes with 1000 credits which is enough for testing Amazon scraper API.

				
					import requests

url = "https://api.scrapingdog.com/amazon/product"
params = {
    "api_key": "Your-API-Key",
    "domain": "com",
    "asin": "B0C22KCKVQ"
}

response = requests.get(url, params=params)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Request failed with status code {response.status_code}")
				
			

Once you run this code you will get this beautiful JSON response.

This JSON contains almost all the data you see on the Amazon product page. I have also created a video to guide you using Scrapingdog to scrape Amazon.

Conclusion

In this tutorial, we scraped various data elements from Amazon. First, we used requests library to download the raw HTML, and then using BS4 we parsed the data we wanted. You can also use lxml in place of BS4 to extract data. Python and its libraries make scraping very simple for even a beginner. Once you scale you can switch to web scraping APIs to scrape millions of such pages.

Combination of requests and Scrapingdog can help you scale your scraper. You will get more than a 99% success rate while scraping Amazon with Scrapingdog.

If you want to track the price of a product on Amazon, we have a comprehensive tutorial on tracking Amazon product prices using Python.

I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media.

Frequently Asked Questions

 

Amazon detects scraping by the anti-bot mechanism which can check your IP address and thus can block you if you continue to scrape it. However, using a proxy management system will help you to bypass this security measure.

Additional Resources

Web Scraping with Scrapingdog

Scrape the web without the hassle of getting blocked
My name is Manthan Koolwal and I am the founder of scrapingdog.com. I love creating scraper and seamless data pipelines.
Manthan Koolwal

Web Scraping with Scrapingdog

Scrape the web without the hassle of getting blocked

Recent Blogs

Building Make.com automation for linkedin profile scraping

Automating LinkedIn Profile Scraping using LinkedIn Scraper API & Make.com

In this read, we have used make.com, Scrapingdog's LinkedIn profile scraper API & Google sheets to extract data LinkedIn profiles. You can automate this process in Make.com by running a scheduler.

How to Scrape Google Local Results using Scrapingdog’s Google Local API

In this read, we have used Python & Scrapingdog's Google Local API to extract results from local results. Further, we have given a code to save the extracted data in CSV.