Web scraping is a useful tool when you want to gather information from the internet. For those in the hotel industry, knowing the prices of other hotels can be very helpful, because with more hotels and OTAs (online travel agencies) entering the market, competition is rising quickly.
So, how do you keep track of all these prices?
The answer is by scraping hotel prices. In this blog, we’ll show you how to scrape hotel prices from booking.com using Python.
You’ll learn how to get prices from any hotel on booking.com by just entering the check-in/out dates and the hotel’s ID.
Let’s get started!
Why use Python to Scrape booking.com
Python is a versatile language and is used extensively for web scraping. It also has dedicated libraries for scraping the web.
Its large community means you can usually find help when you run into trouble. If you are new to web scraping with Python, I would recommend going through this guide, which covers web scraping with Python comprehensively.
Requirements for scraping hotel data from booking.com
We need Python 3.x for this tutorial, and I am assuming that you have already installed it on your computer. Along with that, you need to install two libraries that will be used later in this tutorial for web scraping.
- Requests will help us to make an HTTP connection with Booking.com.
- BeautifulSoup will help us to create an HTML tree for smooth data extraction.
Setup
First, create a folder and then install the libraries mentioned above.
mkdir booking
pip install requests
pip install beautifulsoup4
Inside this folder, create a Python file where we will write the code. These are the data points that we are going to scrape from the target website.
- Address
- Name
- Pricing
- Rating
- Room Type
- Facilities
Let’s Scrape Booking.com
Since everything is set let’s make a GET request to the target website and see if it works.
import requests
from bs4 import BeautifulSoup
l=list()
o={}
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}
target_url = "https://www.booking.com/hotel/us/the-lenox.html?checkin=2022-12-28&checkout=2022-12-29&group_adults=2&group_children=0&no_rooms=1&selected_currency=USD"
resp = requests.get(target_url, headers=headers)
print(resp.status_code)
The code is pretty straightforward, but let me explain it briefly. First, we imported the two libraries that we installed earlier in this tutorial, then we declared the headers and the target URL.
Finally, we made a GET request to the target URL. Once you print the status code you should see 200; anything else means the request did not succeed, for example because the URL is wrong or the request was blocked.
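The target URL above hard-codes the hotel and the dates. Here is a minimal sketch of building the same URL from inputs, assuming the query parameters seen in the example URL keep working. The helper name build_booking_url and its arguments are hypothetical, not part of Requests or Booking.com.

```python
from urllib.parse import urlencode


def build_booking_url(hotel_slug, checkin, checkout, country="us", adults=2, currency="USD"):
    """Build a hotel page URL shaped like the example above (hypothetical helper)."""
    params = {
        "checkin": checkin,      # expected as YYYY-MM-DD
        "checkout": checkout,    # expected as YYYY-MM-DD
        "group_adults": adults,
        "group_children": 0,
        "no_rooms": 1,
        "selected_currency": currency,
    }
    return f"https://www.booking.com/hotel/{country}/{hotel_slug}.html?{urlencode(params)}"


target_url = build_booking_url("the-lenox", "2022-12-28", "2022-12-29")
print(target_url)
```

The resulting string can be passed to requests.get() exactly like the hard-coded URL.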
How to scrape the data points
Since we have already decided which data points we are going to scrape, let’s find their location in the HTML by inspecting the page in Chrome.
For this tutorial, we will use the find() and find_all() methods of BeautifulSoup to locate the target elements. The DOM structure decides which method is the better fit for each element.
Extracting hotel name and address
Let’s inspect the page in Chrome and find the DOM location of the name as well as the address.
As you can see, the hotel name can be found under the h2 tag with class pp-header__title. For the sake of simplicity, let’s first create a soup variable with the BeautifulSoup constructor; we will extract all the data points from it.
soup = BeautifulSoup(resp.text, 'html.parser')
Here BS4 uses an HTML parser to convert a complex HTML document into a tree of Python objects. Now, let’s use the soup variable to extract the name and address.
o["name"]=soup.find("h2",{"class":"pp-header__title"}).text
In a similar manner, we will extract the address.
The address of the property is stored under the span tag with the class name hp_address_subtitle.
o["address"]=soup.find("span",{"class":"hp_address_subtitle"}).text.strip("\n")
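If an element is missing from the page, soup.find() returns None and calling .text on it raises an AttributeError. The sketch below shows one way to guard against that. The helper safe_text is a hypothetical name, and the sample HTML is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup


def safe_text(soup, tag, class_name):
    """Return the stripped text of the first matching element, or None if it is absent."""
    element = soup.find(tag, {"class": class_name})
    return element.get_text(strip=True) if element is not None else None


# Made-up stand-in for the real page, for illustration only
sample_html = """
<h2 class="pp-header__title">Sample Hotel</h2>
<span class="hp_address_subtitle">
1 Example Street, Example City
</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

name = safe_text(sample_soup, "h2", "pp-header__title")
address = safe_text(sample_soup, "span", "hp_address_subtitle")
missing = safe_text(sample_soup, "div", "no-such-class")
print(name, address, missing)
```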
Extracting rating and facilities
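Once again, we will inspect the page and find the DOM location of the rating and facilities elements. The rating is stored under the div tag with class d10a6220b4. Class names like this one look auto-generated, so verify it in your browser before relying on it.
o["rating"]=soup.find("div",{"class":"d10a6220b4"}).text
The facilities can be extracted in two simple steps. First, collect all the facility elements.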
fac=soup.find_all("div",{"class":"important_facility"})
The fac variable holds all the facility elements. Now, let’s extract their text one by one into a list named fac_arr.
fac_arr=[]
for i in range(0,len(fac)):
    fac_arr.append(fac[i].text.strip("\n"))
Extract Price and Room Types
This is the trickiest part of the complete tutorial. The DOM structure of booking.com is a bit complex and needs thorough study before extracting price and room type information.
Here the tbody tag contains all the data. Just below tbody you will find a tr tag, which holds all the information from the first column.
First, let’s find all the tr tags.
ids= list()
targetId=list()
# find_all() returns an empty list when nothing matches, so no try/except is needed
tr = soup.find_all("tr")
One thing that you will notice is that these tr tags have a data-block-id attribute. Let’s collect all those ids in a list.
for y in range(0,len(tr)):
    # .get() returns None when the attribute is missing
    id = tr[y].get('data-block-id')
    if( id is not None):
        ids.append(id)
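The same collection step can be written more compactly. This is a sketch on made-up sample HTML; it relies on the BeautifulSoup behavior that passing True as an attribute value matches any tag carrying that attribute.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the pricing table, for illustration only
sample_html = """
<table><tbody>
<tr data-block-id="101_a"><td>Room A, rate 1</td></tr>
<tr data-block-id="101_b"><td>Room A, rate 2</td></tr>
<tr><td>Row without an id</td></tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# attrs={"data-block-id": True} keeps only the rows that carry the attribute
rows = sample_soup.find_all("tr", attrs={"data-block-id": True})
block_ids = [row["data-block-id"] for row in rows]
print(block_ids)
```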
Once you have all the ids, the rest of the job becomes a little easier. We will iterate over every data-block-id to extract room pricing and room types from their individual tr blocks.
for i in range(0,len(ids)):
    try:
        allData = soup.find("tr",{"data-block-id":ids[i]})
    except:
        k["room"]=None
        k["price"]=None
The allData variable stores all the HTML for a particular data-block-id, and k is a dictionary that holds one room and price pair.
Now, we can move to the td tags that can be found inside this tr tag. Let’s extract the room first.
try:
    rooms = allData.find("span",{"class":"hprt-roomtype-icon-link"})
except AttributeError:
    # allData is None when no matching tr was found
    rooms=None
Here comes the fun part: when a room type has more than one pricing option, you have to reuse the same room name for the next set of prices in the loop.
Suppose one room type has three prices. Only the first of those rows contains the room name, so on the next two iterations of the for loop the value of the rooms variable will be None. You can see this by printing it. So, we keep using the old room name until we receive a new one.
if(rooms is not None):
    last_room = rooms.text.replace("\n","")
try:
    k["room"]=rooms.text.replace("\n","")
except AttributeError:
    # rooms is None for this row, so reuse the previous room name
    k["room"]=last_room
Here last_room stores the last room name seen until we receive a new value.
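The carry-forward idea can be tried in isolation. The following is a small sketch on made-up HTML in which the second row has no room name, so it inherits the name from the row above; the room names and table contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up stand-in: the second row has a price option but no room name
sample_html = """
<table><tbody>
<tr data-block-id="1"><td><span class="hprt-roomtype-icon-link">Deluxe King</span></td><td>rate 1</td></tr>
<tr data-block-id="2"><td>rate 2</td></tr>
<tr data-block-id="3"><td><span class="hprt-roomtype-icon-link">Twin Room</span></td><td>rate 1</td></tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

room_per_row = []
last_room = None  # stays None until the first room name is seen
for row in sample_soup.find_all("tr", attrs={"data-block-id": True}):
    room_tag = row.find("span", {"class": "hprt-roomtype-icon-link"})
    if room_tag is not None:
        last_room = room_tag.get_text(strip=True)
    # Rows without a room name reuse the most recent one
    room_per_row.append(last_room)
print(room_per_row)
```

Initializing last_room to None before the loop also avoids a NameError if the very first row has no room name.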
Let’s extract the price now.
Price is stored under the div tag with class “bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading”. Let’s use allData variable to find it and extract the text.
price = allData.find("div",{"class":"bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading"})
k["price"]=price.text.replace("\n","")
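The price comes back as text, which is awkward for comparisons. Below is a hedged sketch of converting it to a number, assuming a comma is the thousands separator and a dot is the decimal separator (as in the USD prices requested here). The helper parse_price is a hypothetical name, and other currencies or locales may need different handling.

```python
import re


def parse_price(price_text):
    """Turn text such as 'US$1,234' into a float; return None if no number is found."""
    # Assumption: commas separate thousands and a dot marks decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("\nUS$1,234\n"))
print(parse_price("US$289.50"))
print(parse_price("n/a"))
```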
We have finally managed to scrape all the data elements that we were interested in.
Complete Code
You can extract other pieces of information like amenities, reviews, etc. You just have to make a few more changes and you will be able to extract them too. Along with this, you can extract other hotel details by just changing the unique name of the hotel in the URL.
The code will look like this.
import requests
from bs4 import BeautifulSoup

l=list()
g=list()
o={}
k={}
fac=[]
fac_arr=[]
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}
target_url = "https://www.booking.com/hotel/us/the-lenox.html?checkin=2022-12-28&checkout=2022-12-29&group_adults=2&group_children=0&no_rooms=1&selected_currency=USD"
resp = requests.get(target_url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')

# Hotel name, address, and rating
o["name"]=soup.find("h2",{"class":"pp-header__title"}).text
o["address"]=soup.find("span",{"class":"hp_address_subtitle"}).text.strip("\n")
o["rating"]=soup.find("div",{"class":"d10a6220b4"}).text

# Facilities
fac=soup.find_all("div",{"class":"important_facility"})
for i in range(0,len(fac)):
    fac_arr.append(fac[i].text.strip("\n"))

# Collect the data-block-id of every row that has one
ids= list()
targetId=list()
tr = soup.find_all("tr")  # returns an empty list when nothing matches
for y in range(0,len(tr)):
    id = tr[y].get('data-block-id')  # None when the attribute is missing
    if( id is not None):
        ids.append(id)
print("ids are ",len(ids))

# Extract room type and price for each block id
for i in range(0,len(ids)):
    try:
        allData = soup.find("tr",{"data-block-id":ids[i]})
        try:
            rooms = allData.find("span",{"class":"hprt-roomtype-icon-link"})
        except AttributeError:
            rooms=None
        if(rooms is not None):
            last_room = rooms.text.replace("\n","")
        try:
            k["room"]=rooms.text.replace("\n","")
        except AttributeError:
            # No room name in this row: reuse the previous one
            k["room"]=last_room
        price = allData.find("div",{"class":"bui-price-display__value prco-text-nowrap-helper prco-inline-block-maker-helper prco-f-font-heading"})
        k["price"]=price.text.replace("\n","")
        g.append(k)
        k={}
    except Exception:
        # A row that cannot be parsed is blanked and not appended
        k["room"]=None
        k["price"]=None

l.append(g)
l.append(o)
l.append(fac_arr)
print(l)
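The final list mixes three different shapes (room prices, hotel details, facilities), which can be awkward to consume later. One possible way to package them is a single dictionary serialized as JSON. This is only a sketch: to_record is a hypothetical helper, and the sample values are made-up stand-ins with the same shapes as o, g, and fac_arr.

```python
import json


def to_record(hotel, rooms, facilities):
    """Combine the three pieces into one dictionary (hypothetical helper)."""
    return {**hotel, "facilities": facilities, "rooms": rooms}


# Made-up stand-ins with the same shapes as o, g, and fac_arr
sample_o = {"name": "Sample Hotel", "address": "1 Example Street", "rating": "9.0"}
sample_g = [{"room": "Deluxe King", "price": "US$300"}]
sample_fac = ["Free WiFi", "Parking"]

record = to_record(sample_o, sample_g, sample_fac)
json_text = json.dumps(record, indent=2)
print(json_text)
```

In the scraper itself, the call would be to_record(o, g, fac_arr).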
Advantages of Scraping Booking.com
Lots of travel agencies collect a tremendous amount of data from their competitors’ websites. They know that to gain an edge in the market they must have access to competitors’ pricing strategies.
To gain an advantage over competitors in your niche, you have to scrape multiple websites and then aggregate the data. You can then adjust your prices after comparing them, offer discounts, or show on your platform how your prices compare with your competitors’ prices.
Since there are more than 200 OTAs in the market, scraping and comparing them all becomes much more difficult. I would advise you to use a service like a hotel search API to get hotel prices for any city around the globe.
Conclusion
Hotel data scraping goes well beyond this; this was just an example of how Python can be used to scrape Booking.com for price comparison. You can also use Python to scrape other websites like Expedia, Hotels.com, etc.
I have scraped Expedia using Python here. Do check it out too!
However, this process will not work at scale. After some time, booking.com will block your IP and your data pipeline will stop working. If you need to track and monitor hotel prices continuously, you will have to plan for this when scraping hotel data.
Additional Resources
Here are a few additional resources that you may find helpful during your web scraping journey: