What Is Client-Side Scraping?
Client-side scraping involves fetching and processing web content directly in the browser, executing JavaScript to access and modify the DOM. Typically, this is done with JavaScript, often using libraries like jQuery.
You can perform client-side scraping by either calling a public API or parsing the page’s HTML content. Since most websites don’t provide public APIs, you will often need to resort to the latter: downloading HTML documents and extracting data by parsing the content.
Now, let’s explore how to perform client-side scraping with jQuery!
Scraping with jQuery on the Client Side
First, you need to fetch the HTML content of the target page; we will retrieve it from Hacker News. To do this, you can use jQuery's get() method, which performs an HTTP GET request and hands the server's response to a callback function.
Here’s how to use it:
$.get("https://news.ycombinator.com/", function (html) {
  console.log(html);
});
Now, let’s write some HTML code and add a reference to the jQuery library using the <script> tag.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Web Scraping with jQuery</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  </head>
  <body>
    <!-- JavaScript code -->
  </body>
</html>
Add the following JavaScript code to fetch and log HTML content from a URL using jQuery. Paste this code inside the <body> tag of your HTML document:
<script>
  $(document).ready(function () {
    $.get("https://news.ycombinator.com/", function (html) {
      console.log(html);
    });
  });
</script>
Here's the complete HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Web Scraping with jQuery</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  </head>
  <body>
    <script>
      $(document).ready(function () {
        $.get("https://news.ycombinator.com/", function (html) {
          console.log(html);
        });
      });
    </script>
  </body>
</html>
Open the page in a browser and check the console: the request fails with a “No ‘Access-Control-Allow-Origin’ header is present on the requested resource” error, which is related to Cross-Origin Resource Sharing (CORS).
When your browser makes a request to a different domain, it includes an Origin HTTP header for security reasons. The server must respond with an Access-Control-Allow-Origin header to allow access. If this header is missing or incorrect, the browser will block the response to prevent unauthorized access.
Servers use these CORS headers to block unauthorized requests while allowing trusted ones. If the server doesn’t allow requests from your origin, you’ll encounter a CORS error. In practice, these restrictions mean you cannot use client-side JavaScript to scrape web content from a different domain.
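To make the header check concrete, here is an illustrative sketch (not part of the scraper, and a deliberate simplification of the full CORS algorithm) of the decision the browser makes: it compares the response's Access-Control-Allow-Origin header against the requesting page's origin.

```javascript
// Simplified sketch of the browser's CORS check: a cross-origin response
// is only exposed to the page if Access-Control-Allow-Origin is "*"
// or exactly matches the requesting origin.
function corsAllows(allowOriginHeader, requestOrigin) {
  if (allowOriginHeader === "*") return true; // wildcard: any origin allowed
  return allowOriginHeader === requestOrigin; // otherwise: exact match required
}

console.log(corsAllows("*", "https://example.com"));                    // true
console.log(corsAllows("https://example.com", "https://example.com"));  // true
console.log(corsAllows(null, "https://example.com"));                   // false: header missing
```

This is why the Hacker News request fails in the browser: the response carries no Access-Control-Allow-Origin header, so the browser discards it before your callback ever runs.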
Okay, so how do we scrape a website using jQuery? To bypass the CORS restriction, we can shift our scraping logic to the server side. This allows us to make requests from the server to any website without being blocked by the browser’s security policies. Let’s explore this in more detail in the next section.
Scraping with jQuery on the Server Side
Setting Up the Environment
Before we start working on the code, let’s make sure we have the right environment set up. You’ll need to have Node.js installed on your computer. If you haven’t installed it yet, you can download it from the official Node.js website.
Once Node.js is installed, follow these steps to set up your project:
1. Initialize a new Node.js project. Open the terminal, create a new directory for your project, navigate into it, and run the npm init -y command:
mkdir jquery-scraper
cd jquery-scraper
npm init -y
This will create a package.json file.
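For reference, npm init -y generates a package.json along these lines (field values such as the name and version will vary with your setup):

```json
{
  "name": "jquery-scraper",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  }
}
```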
2. Install jsdom and jQuery. Install the required packages, jsdom and jquery, using npm:
npm install jsdom jquery
Now you can use jQuery in Node.js for scraping! Let’s put it to use by building a scraper to extract data from the Hacker News website.
Our target is the Hacker News front page, which lists ranked news items.
Fetching HTML with jQuery get()
The first step in any web scraping process is to fetch the HTML of the target web page. The jQuery get() method loads data from the server using an HTTP GET request.
Let’s write a Node.js script that uses the jsdom library to create a simulated browser environment and the jquery library to perform an HTTP GET request that fetches the HTML content of Hacker News:
const { JSDOM } = require("jsdom");

// Initialize JSDOM on the "https://news.ycombinator.com/" page to avoid CORS problems
const { window } = new JSDOM("", {
  url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

$.get("https://news.ycombinator.com/", function (html) {
  console.log(html);
});
In the code, we start by importing the JSDOM class from the jsdom library, which lets us create a simulated browser environment within Node.js. Then, we create a jsdom instance, providing the URL we want to simulate, and store the result in a variable called window.
This window object represents the simulated browser window, similar to the window object you would have access to in a real browser.
Next, we import the jQuery library and pass the window object. This integrates jQuery into the simulated environment, allowing us to utilize its functionalities.
Finally, we use jQuery’s get() method to fetch the HTML content from the specified URL.
Run the code and you’ll see the HTML content of Hacker News logged to the console.
Extracting Data with jQuery find()
Given a jQuery object representing a set of DOM elements, the .find() method allows you to search for descendant elements within the DOM tree and create a new jQuery object containing the matching elements.
The .find() and .children() methods are similar, but .children() only selects elements that are direct children of the original set. On the Hacker News page, each news item is a tr element with the class athing, so the CSS selector for identifying news elements is tr.athing. You can use the .find() method to retrieve a list of these HTML elements.
The jQuery .find() method returns a set of DOM elements that match the CSS selector, jQuery object, or HTML element passed as the parameter.
In our case, we use jQuery to find all elements matching tr.athing within the retrieved HTML. These elements represent individual news items on the Hacker News website. Take a look at the code snippet below:
$.get("https://news.ycombinator.com/", function (html) {
  const newsItemElements = $(html).find("tr.athing");
});
Next, we iterate over each news item element to individually extract its rank, title, and URL. We use jQuery’s .each() method to process these elements one by one.
$.get("https://news.ycombinator.com/", function (html) {
  const newsHTMLElements = $(html).find("tr.athing");
  newsHTMLElements.each((i, newsItemElement) => {
    // Code to extract data from each news item will go here
  });
});
Inside the loop, extract each item’s rank from its span.rank element, stripping the trailing dot:

$.get("https://news.ycombinator.com/", function (html) {
  const newsHTMLElements = $(html).find("tr.athing");
  newsHTMLElements.each((i, newsItemElement) => {
    const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
  });
});
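As a standalone sketch of that cleanup step: the span.rank text comes back as "1.", "2.", and so on, so stripping the trailing dot looks like this (the scraper keeps the rank as a string, but you could also convert it to a number):

```javascript
// Standalone sketch of the rank cleanup: the span.rank text looks like
// "1.", "2.", …; strip the trailing dot. String.replace with a string
// argument only replaces the first occurrence, which is all we need here.
function cleanRank(text) {
  return text.replace(".", "");
}

console.log(cleanRank("1."));  // "1"
console.log(cleanRank("42.")); // "42"
```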
Next, extract the title and URL from the span.titleline link and collect the results into an array:

$.get("https://news.ycombinator.com/", function (html) {
  const newsHTMLElements = $(html).find("tr.athing");
  const newsItems = [];
  newsHTMLElements.each((i, newsItemElement) => {
    const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
    const titleElement = $(newsItemElement).find("span.titleline > a");
    const title = titleElement.text();
    const url = titleElement.attr("href");

    // Store the extracted data
    newsItems.push({ rank, title, url });
  });

  // Log the extracted data
  console.log(newsItems);
});
The additional details about each news item are not inside the tr.athing element itself. They live in the tr element immediately after the current tr.athing element; this subsequent tr element has a td with the class subtext where the data is present.
Here’s the code to extract additional details:

$.get("https://news.ycombinator.com/", function (html) {
  const newsHTMLElements = $(html).find("tr.athing");
  const newsItems = [];
  newsHTMLElements.each((i, newsItemElement) => {
    const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
    const titleElement = $(newsItemElement).find("span.titleline > a");
    const title = titleElement.text().trim();
    const url = titleElement.attr("href");

    // Extract additional details
    const subtext = $(newsItemElement).next().find(".subtext");
    const author = subtext.find("a.hnuser").text();
    const points = subtext.find("span.score").text();
    const postTime = subtext.find("span.age").text();

    // Store the extracted data
    newsItems.push({ rank, title, url, author, points, postTime });
  });

  console.log(newsItems);
});
To extract additional details about each news item, you need to locate the subtext element, which contains information such as the author’s name, points, and post time. The subtext class is located in the next tr element following the current tr.athing element. You can use next().find(".subtext"), which selects the tr element immediately after the current tr.athing element and then finds the element with the class subtext within it.
Once you have the subtext element, you can extract the desired data from it. To get the author’s name, use subtext.find("a.hnuser").text(). Similarly, you can read the number of points and the post time.
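Note that the score comes back as a text string like "123 points". If you want numeric points, a small hypothetical helper (not part of the original script) can parse it, falling back to 0 for items that show no score:

```javascript
// Hypothetical cleanup helper: parse "123 points" into the number 123.
// Items without a visible score yield 0.
function parsePoints(text) {
  const match = /(\d+)\s*point/.exec(text);
  return match ? parseInt(match[1], 10) : 0;
}

console.log(parsePoints("123 points")); // 123
console.log(parsePoints("1 point"));    // 1
console.log(parsePoints(""));           // 0
```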
Run the code and you will see the extracted news items logged in the console. Nice! We’ve successfully extracted all the desired data from the first page of Hacker News. However, this is not enough, as we need to fetch data from multiple pages. Let’s explore how to scrape multiple pages by handling pagination.
Handling Pagination to Scrape Multiple Pages
Hacker News paginates its results with the p query parameter (for example, https://news.ycombinator.com/?p=2), and p will increase by 1 until there are no more pages to display. Each page also contains a “More” link (a.morelink) pointing to the next page, so we can check for it to decide whether to continue:

// Check for next page link
const nextPageLink = $(html).find("a.morelink").attr("href");
if (nextPageLink) {
  currentPage += 1;
  fetchPage(currentPage);
} else {
  console.log(newsItems);
}
The code shows that, if the next-page link is present, the script increments the currentPage variable by 1. It then recursively calls the fetchPage function with the updated currentPage value to fetch and process the next page.
If no “next page” link is found, it means there are no more pages to fetch. At this point, the script stops fetching further pages and logs the collected news items to the console.
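The control flow is easier to see with the network call stubbed out. In this sketch, getPage() is a hypothetical stand-in for $.get() that returns null once the pages run out:

```javascript
// Sketch of the pagination loop with the HTTP request stubbed out.
// getPage() is a hypothetical stand-in for $.get(); it returns null
// when there is no next page, which ends the recursion.
function getPage(p) {
  const fakePages = { 1: ["item1", "item2"], 2: ["item3"] }; // stand-in data
  return fakePages[p] || null;
}

function fetchAll(page = 1, items = []) {
  const results = getPage(page);
  if (results) {
    items.push(...results);           // collect this page's items
    return fetchAll(page + 1, items); // "next page" exists: recurse
  }
  return items;                       // no more pages: stop
}

console.log(fetchAll()); // [ 'item1', 'item2', 'item3' ]
```

The real script follows the same shape, except the recursion happens inside the asynchronous $.get() callback rather than as a return value.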
Here’s the complete code:
const { JSDOM } = require("jsdom");

// Initialize JSDOM with the target URL
const { window } = new JSDOM("", {
  url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
  $.get(`${baseUrl}?p=${page}`, function (html) {
    const newsHTMLElements = $(html).find("tr.athing");
    newsHTMLElements.each((i, newsItemElement) => {
      const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
      const titleElement = $(newsItemElement).find("span.titleline > a");
      const title = titleElement.text().trim();
      const url = titleElement.attr("href");

      // Extract additional details
      const subtext = $(newsItemElement).next().find(".subtext");
      const author = subtext.find("a.hnuser").text();
      const points = subtext.find("span.score").text();
      const postTime = subtext.find("span.age").text();

      // Store the extracted data
      newsItems.push({ rank, title, url, author, points, postTime });
    });

    // Check for next page link
    const nextPageLink = $(html).find("a.morelink").attr("href");
    if (nextPageLink) {
      currentPage += 1;
      fetchPage(currentPage);
    } else {
      console.log(newsItems);
    }
  }).fail(function (error) {
    console.error(`Error fetching page ${page}:`, error);
  });
}

// Start fetching from the first page
fetchPage(currentPage);
When the code runs, it prints the news items collected from every page.
Exporting Scraped Data
Scraping data from the web is often the initial step in a larger workflow, and the scraped data can serve multiple purposes. To make the most of it, it is crucial to convert it into a format that is easy to read, analyze, and manipulate. Common formats include CSV and JSON.
Let’s start by discussing how to store the scraped data in a CSV file.
To export the scraped data to a CSV file in Node.js, you can use the fs (File System) module to write the data. Here’s how to modify your script:
1. Import the necessary fs module.
2. Convert the newsItems array into CSV format.
3. Write the CSV data to a file.
Here’s the full code:
const { JSDOM } = require("jsdom");
const fs = require("fs");

const { window } = new JSDOM("", {
  url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
  $.get(`${baseUrl}?p=${page}`, function (html) {
    const newsHTMLElements = $(html).find("tr.athing");
    newsHTMLElements.each((i, newsItemElement) => {
      const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
      const titleElement = $(newsItemElement).find("span.titleline > a");
      const title = titleElement.text().trim();
      const url = titleElement.attr("href");

      // Extract additional details
      const subtext = $(newsItemElement).next().find(".subtext");
      const author = subtext.find("a.hnuser").text();
      const points = subtext.find("span.score").text();
      const postTime = subtext.find("span.age").text();

      // Store the extracted data
      newsItems.push({ rank, title, url, author, points, postTime });
    });

    // Check for next page link
    const nextPageLink = $(html).find("a.morelink").attr("href");
    if (nextPageLink) {
      currentPage += 1;
      fetchPage(currentPage);
    } else {
      exportToCSV(newsItems);
    }
  }).fail(function (error) {
    console.error(`Error fetching page ${page}:`, error);
  });
}

// Function to export the data to a CSV file
function exportToCSV(data) {
  const csvHeaders = ['Rank', 'Title', 'URL', 'Author', 'Points', 'Post Time'];
  const csvRows = [
    csvHeaders.join(','),
    ...data.map(item =>
      [
        item.rank,
        `"${item.title}"`,
        item.url,
        item.author,
        item.points,
        item.postTime
      ].join(',')
    )
  ].join('\n'); // Join all rows with newline characters

  // Write the CSV string to a file
  fs.writeFile("newsItems.csv", csvRows, (err) => {
    if (err) {
      console.error("Error writing to CSV file:", err);
    } else {
      console.log("Data successfully exported to newsItems.csv");
    }
  });
}

// Start fetching from the first page
fetchPage(currentPage);
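One caveat with the hand-rolled CSV above: a title that itself contains a double quote or an unexpected comma can break the row. A small hypothetical helper (not in the script above) that applies RFC 4180-style quoting makes the export more robust:

```javascript
// Hypothetical helper: escape one CSV field by doubling embedded quotes
// and wrapping the field in quotes when it contains a comma, quote, or
// newline. Fields without special characters pass through unchanged.
function csvField(value) {
  const s = String(value);
  return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

console.log(csvField("Plain title"));         // Plain title
console.log(csvField('He said "hi", twice')); // "He said ""hi"", twice"
```

To use it, you would replace the `"${item.title}"` interpolation in exportToCSV with csvField(item.title), and ideally apply it to every field.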
You can also export the data to JSON. Write an exportToJSON function that converts the newsItems array into a JSON string using JSON.stringify, and then writes that string to a file named newsItems.json using the fs.writeFile method from Node.js’s fs module.
Here’s the full code:

const { JSDOM } = require("jsdom");
const fs = require("fs");

const { window } = new JSDOM("", {
  url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
  $.get(`${baseUrl}?p=${page}`, function (html) {
    const newsHTMLElements = $(html).find("tr.athing");
    newsHTMLElements.each((i, newsItemElement) => {
      const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
      const titleElement = $(newsItemElement).find("span.titleline > a");
      const title = titleElement.text().trim();
      const url = titleElement.attr("href");

      // Extract additional details
      const subtext = $(newsItemElement).next().find(".subtext");
      const author = subtext.find("a.hnuser").text();
      const points = subtext.find("span.score").text();
      const postTime = subtext.find("span.age").text();

      // Store the extracted data
      newsItems.push({ rank, title, url, author, points, postTime });
    });

    // Check for next page link
    const nextPageLink = $(html).find("a.morelink").attr("href");
    if (nextPageLink) {
      currentPage += 1;
      fetchPage(currentPage);
    } else {
      exportToJSON(newsItems);
    }
  }).fail(function (error) {
    console.error(`Error fetching page ${page}:`, error);
  });
}

// Function to export data to a JSON file
function exportToJSON(data) {
  const jsonData = JSON.stringify(data, null, 2);

  // Write the JSON string to a file
  fs.writeFile("newsItems.json", jsonData, (err) => {
    if (err) {
      console.error("Error writing to JSON file:", err);
    } else {
      console.log("Data successfully exported to newsItems.json");
    }
  });
}

// Start fetching from the first page
fetchPage(currentPage);