Scraping Walmart can have many use cases. It is the leading retailer in the USA & has enormous public product data available.
When web scraping prices for Walmart products you can have a close look at any price change.
If the price is lower then you can prefer to buy it. You create a Walmart price scraper that can show trends in price change over a year or so.
In this article, we will learn how we can web scrape Walmart data & create a Walmart price scraper, we will be using Python for this tutorial.
But, Why Scrape Walmart with Python?
Python is a widely used & simple language with built-in mathematical functions. It is also flexible and easy to understand even if you are a beginner. The Python community is too big and it helps when you face any error while coding.
Many forums like StackOverflow, GitHub, etc already have the answers to the errors that you might face while coding when you do Walmart scraping.
You can do countless things with Python but for the sake of this article we will be extracting product details from Walmart.
Let’s Begin Web Scraping Walmart with Python
To begin with, we will create a folder and install all the libraries we might need during the course of this tutorial.
For now, we will install two libraries
- Requests will help us to make an HTTP connection with Walmart.
- BeautifulSoup will help us to create an HTML tree for smooth data extraction.
>> mkdir walmart
>> pip install requests
>> pip install beautifulsoup4
Inside this folder, you can create a python file where we will write our code. We will scrape this Walmart page. Our data of interest will be:
- Name
- Price
- Rating
- Product Details
First of all, we will find the locations of these elements in the HTML code by inspecting them.
We can see the name is stored under tag h1 with attribute itemprop. Now, let’s see where the price is stored.
Price is stored under span tag with attribute itemprop whose value is price.
The rating is stored under span tag with class rating-number.
Product detail is stored inside div tag with class dangerous-html.
Let’s start with making a normal GET request to the target webpage and see what happens.
import requests
from bs4 import BeautifulSoup
target_url="https://www.walmart.com/ip/SAMSUNG-58-Class-4K-Crystal-UHD-2160P-LED-Smart-TV-with-HDR-UN58TU7000/820835173"
resp = requests.get(target_url).text
print(resp)
Oops!! We got a captcha.
Walmart loves throwing captchas when they think the request is coming from a script/crawler and not from a browser. To remove this barrier from our way we can simply send some metadata/headers which will make Walmart consider our request as a legit request from a legit browser.
Now, if you are new to web scraping then I would advise you to read more about headers and their importance in Python. For now, we will use seven different headers to bypass Walmart’s security wall.
- Accept
- Accept-Encoding
- Accept-Language
- User-Agent
- Referer
- Host
- Connection
import requests
from bs4 import BeautifulSoup
ac="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
target_url="https://www.walmart.com/ip/SAMSUNG-58-Class-4K-Crystal-UHD-2160P-LED-Smart-TV-with-HDR-UN58TU7000/820835173"
headers={"Referer":"https://www.google.com","Connection":"Keep-Alive","Accept-Language":"en-US,en;q=0.9","Accept-Encoding":"gzip, deflate, br","Accept":ac,"User-Agent":"Mozilla/5.0 (iPad; CPU OS 9_3_5 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G36 Safari/601.1"}
resp = requests.get(target_url, headers=headers)
print(resp.text)
One thing you might have noticed is Walmart sends a 200 status code even when it returns a captcha. So, to tackle this problem you can use the if/else statement.
if("Robot or human" in resp.text):
print(True)
else:
print(False)
If it is True then Walmart has thrown a captcha otherwise our request was successful. Nest step is to extract our data of interest.
Here we will use BS4 and will scrape every value step by step. We have already determined the exact location of each of these data elements.
soup = BeautifulSoup(resp.text,'html.parser')
l=[]
obj={}
try:
obj["price"] = soup.find("span",{"itemprop":"price"}).text.replace("Now ","")
except:
obj["price"]=None
Here we have extracted the price and then replace “Now ” (garbage string) with an empty string. You can use try/except statements to catch any errors.
Now similarly, let’s get the name and the rating of the product. We will come to the product description later.
try:
obj["name"] = soup.find("h1",{"itemprop":"name"}).text
except:
obj["name"]=None
try:
obj["rating"] = soup.find("span",{"class":"rating-number"}).text.replace("(","").replace(")","")
except:
obj["rating"]=None
l.append(obj)
print(l)
We got all the data except “product detail”. Once you scroll down to your HTML data returned by your python script you will nowhere find this dangerous-html class.
The reason behind this is the framework used by the Walmart website. It uses the Nextjs framework which sends JSON data once the page has been rendered completely. So, when the socket connection was broken product description part of the website was not loaded.
But the solution to this problem is very easy and it can be scraped in just two steps. Every Nextjs-backed website has a script tag with id as __NEXT_DATA__.
This script will return all the JSON data that we need. Since this is done through Javascript we could not have scraped it with a simple HTTP GET request. So, first of all, you have to find it using BS4 and then load it using the JSON library.
import json
nextTag = soup.find("script",{"id":"__NEXT_DATA__"})
jsonData = json.loads(nextTag.text)
print(jsonData)
This is a huge JSON data which might be a little intimidating. You can use tools like JSON viewer to figure out the exact location of your desired object.
try:
obj["detail"] = jsonData['props']['pageProps']['initialData']['data']['product']['shortDescription']
except:
obj["detail"]=None
We got all the data we were hunting for. By the way, this huge JSON data also contains the data we scraped earlier. You just have to figure out the exact object where it is stored. I leave that part to you.
If you want to learn more about headers, requests, and other libraries of Python then I would advise you to read this web scraping with Python tutorial.
Complete Code
import requests
from bs4 import BeautifulSoup
ac="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
target_url="https://www.walmart.com/ip/SAMSUNG-58-Class-4K-Crystal-UHD-2160P-LED-Smart-TV-with-HDR-UN58TU7000/820835173"
headers={"Referer":"https://www.google.com","Connection":"Keep-Alive","Accept-Language":"en-US,en;q=0.9","Accept-Encoding":"gzip, deflate, br","Accept":ac,"User-Agent":"Mozilla/5.0 (iPad; CPU OS 9_3_5 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13G36 Safari/601.1"}
resp = requests.get(target_url)
# print(resp.text)
# if("Robot or human" in resp.text):
# print(True)
# else:
# print(False)
soup = BeautifulSoup(resp.text,'html.parser')
l=[]
obj={}
try:
obj["price"] = soup.find("span",{"itemprop":"price"}).text.replace("Now ","")
except:
obj["price"]=None
try:
obj["name"] = soup.find("h1",{"itemprop":"name"}).text
except:
obj["name"]=None
try:
obj["rating"] = soup.find("span",{"class":"rating-number"}).text.replace("(","").replace(")","")
except:
obj["rating"]=None
import json
nextTag = soup.find("script",{"id":"__NEXT_DATA__"})
jsonData = json.loads(nextTag.text)
Detail = jsonData['props']['pageProps']['initialData']['data']['product']['shortDescription']
try:
obj["detail"] = Detail
except:
obj["detail"]=None
l.append(obj)
print(l)
How to Scrape Walmart without Code using Scrapingdog
Walmart is in business for a very long time and they know people use scraping techniques to crawl over their websites. They have an architecture that can determine if a request is coming from a bot or a real browser.
Along with that if you want to scrape millions of pages then your IP will be blocked by Walmart. To avoid this you need a rotation of IPs and headers. Scrapingdog can provide you with all of these features.
Scrapingdog provides an API for Web Scraping that can help you create a seamless data pipeline in no time. You can start by signing up and making a test call directly from your dashboard.
Let’s go step by step to understand how you can use Scrapingdog for Walmart web scraping without getting blocked.
Oh! I almost forgot to tell you that for new users 1000 calls are absolutely free.
First, you have to sign up!
Once you are on your dashboard you will have two options.
- Either just paste the target URL in the tool and press the “Scrape” button.
- Use API URL in POSTMAN or browser or a script to make a GET request.
The first one is the fastest one. So, let’s do that.
Once you press the Scrape button you will get the complete HTML data from Walmart. You can even set locations if you really want to change that. But in this case, that was not required.
The second option was through a script. You can use the below-provided code to scrape Walmart without being blocked. You won’t have to even pass any headers to scrape it.
import requests
from bs4 import BeautifulSoup
target_url="https://api.scrapingdog.com/scrape?dynamic=false&url=http://www.walmart.com/ip/SAMSUNG-58-Class-4K-Crystal-UHD-2160P-LED-Smart-TV-with-HDR-UN58TU7000/820835173&api_key=YOUR-API-KEY"
resp = requests.get(target_url)
# print(resp.text)
# if("Robot or human" in resp.text):
# print(True)
# else:
# print(False)
soup = BeautifulSoup(resp.text,'html.parser')
l=[]
obj={}
try:
obj["price"] = soup.find("span",{"itemprop":"price"}).text.replace("Now ","")
except:
obj["price"]=None
try:
obj["name"] = soup.find("h1",{"itemprop":"name"}).text
except:
obj["name"]=None
try:
obj["rating"] = soup.find("span",{"class":"rating-number"}).text.replace("(","").replace(")","")
except:
obj["rating"]=None
import json
nextTag = soup.find("script",{"id":"__NEXT_DATA__"})
jsonData = json.loads(nextTag.text)
Detail = jsonData['props']['pageProps']['initialData']['data']['product']['shortDescription']
try:
obj["detail"] = Detail
except:
obj["detail"]=None
l.append(obj)
print(l)
Do not forget to replace YOUR-API-KEY with your own API key. You can find your key on your dashboard. We have used &dynamic=false parameter to make a normal HTTP request rather than rendering the JS. This will just cost 1 API credit. You can read more about it here.
But if you want to load the JS part of the website as well then remove the &dynamic=false param because Scrapingdog by default renders JS through a real Chrome browser.
We just removed the headers and replaced the target URL with Scrapingdog’s API URL. Other than that rest of the code remains the same.
Conclusion
As we know Python is too great when it comes to web scraping. We just used two basic libraries to scrape Walmart product details and print results in JSON. But this process has certain limits and as we discussed above, Walmart will block you if you do not rotate proxies or change headers timely.
If you want to scrape thousands and millions of pages then using Scrapingdog will be the best approach.
I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on social media.
Additional Resources
- Amazon Price Scraping using Python
- Scraping Amazon Reviews using Python
- Building A Price Tracker for Amazon Products using Python
- How to Scrape Flipkart Indian e-commerce Brand using Python
- How To Scrape Google Shopping using Python
- Web Scraping eBay Product Details using Python
- Web Scraping Myntra with Selenium & Python
- Web Scraping Yelp Reviews using Python
- Web Scraping Yellow Pages using Python
- What is User-Agent in Web Scraping & How To Use Them