
Web Scraping with jQuery


In this tutorial, you’ll learn how to use jQuery for web scraping. While jQuery is traditionally used for client-side tasks, this guide will also demonstrate its use in server-side scraping, including a complete example of scraping data from the Hacker News website.

What Is Client-Side Scraping?

Client-side scraping means fetching and processing web content directly in the browser, by executing JavaScript code that accesses and manipulates the DOM. Typically, this is done with plain JavaScript or with libraries like jQuery.
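
For instance, on a page that has already loaded jQuery, you could run something like this in the browser's console to pull data out of the current page's DOM (the h2 selector is just an illustration; use whatever element you're interested in):

// Collect the text of every <h2> heading on the page currently open in the browser
$("h2").each(function () {
  console.log($(this).text().trim());
});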

Client-side scraping can be achieved by either using a public API or parsing the HTML content of the page. Since most websites do not offer public APIs, you will often need to opt for the latter: downloading HTML documents and extracting data by parsing the content.


Now, let’s explore how to perform client-side scraping with jQuery!

Scraping with jQuery on the Client Side

First, you need to fetch the HTML content of the target page; in this tutorial, that's Hacker News. To do this, you can use jQuery's get() method, which performs an HTTP GET request and passes the server's response to a callback function. Here's how to use it:
				
					$.get("https://news.ycombinator.com/", function(html) {
    console.log(html);
});
				
			

Now, let’s write some HTML code and add a reference to the jQuery library using the <script> tag.

				
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Web Scraping with jQuery</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  </head>

  <body>
    <!-- JavaScript code -->
  </body>
</html>
				
			

Add the following JavaScript code to fetch and log the HTML content of a URL using jQuery. Paste this code inside the <body> tag of your HTML document:

				
<script>
  $(document).ready(function () {
    $.get("https://news.ycombinator.com/", function (html) {
      console.log(html);
    });
  });
</script>
				
			
Now, your full HTML should look like this:
				
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Web Scraping with jQuery</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  </head>

  <body>
    <script>
      $(document).ready(function () {
        $.get("https://news.ycombinator.com/", function (html) {
          console.log(html);
        });
      });
    </script>
  </body>
</html>
				
			
To see the output, open the HTML file in your web browser and open the browser’s console:

The output indicates that our code won’t work due to a “No ‘Access-Control-Allow-Origin‘ header is present on the requested resource” error, which is related to Cross-Origin Resource Sharing (CORS).

Why does this error occur?

When your browser makes a request to a different domain, it includes an Origin HTTP header for security reasons. The server must respond with an Access-Control-Allow-Origin header to allow access. If this header is missing or incorrect, the browser will block the response to prevent unauthorized access.

Under CORS, the server decides which origins are allowed to read its responses: it blocks unauthorized requests and allows trusted ones. If the server doesn't allow requests from your domain, you'll get a CORS error. In practice, these restrictions mean you cannot use client-side JavaScript to scrape content from a different domain.
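
For example, if you open your HTML file from http://localhost:8080 (an illustrative origin), the browser attaches a header like this to the cross-origin request:

Origin: http://localhost:8080

and it will only hand the response body to your script if the server replies with something like:

Access-Control-Allow-Origin: http://localhost:8080

(or Access-Control-Allow-Origin: *). Since Hacker News doesn't send this header for arbitrary origins, the browser blocks the response, which is exactly the error we saw above.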

Okay, so how do we scrape a website using jQuery? To bypass the CORS restriction, we can shift our scraping logic to the server side. This allows us to make requests from the server to any website without being blocked by the browser’s security policies. Let’s explore this in more detail in the next section.

Why Choose jQuery for Web Scraping?

jQuery is one of the most widely adopted libraries for DOM manipulation. It provides various selectors to accurately target specific elements on the page, making it easy to extract text content, attributes, or other data from selected elements. For dynamic content, jQuery’s AJAX methods can retrieve additional data without having to reload the entire page.
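
As a quick, illustrative sketch (the selectors and the endpoint here are placeholders, not part of the scraper we build below), this is the kind of API jQuery gives you:

// Select elements with CSS-style selectors, then read text and attributes
$("tr.athing").first().text();           // text content of the first matching row
$("span.titleline > a").attr("href");    // value of an attribute
$("table").find(".subtext");             // search the descendants of an element

// AJAX helper: fetch data without reloading the page
$.get("/some-endpoint", function (data) {
  console.log(data);
});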

Scraping with jQuery on the Server Side

To use jQuery in a Node.js environment, you can leverage the jsdom NPM package. By combining jQuery with jsdom, you can effectively query and manipulate the DOM on the server side, just as you would do in a browser.

Setting Up the Environment

Before we start working on the code, let’s make sure we have the right environment set up. You’ll need to have Node.js installed on your computer. If you haven’t installed it yet, you can download it from the official Node.js website.

Once Node.js is installed, follow these steps to set up your project:

1. Initialize a New Node.js Project: Open the terminal, create a new directory for your project, navigate into it, and run the npm init -y command.

				
					 mkdir jquery-scraper
cd jquery-scraper
npm init -y
				
			

This will create a package.json file.
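
For reference, npm init -y generates a package.json that looks roughly like this (the exact fields depend on your npm version):

{
  "name": "jquery-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  }
}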

2. Install jsdom and jQuery: Install the required packages, jsdom and jquery, using npm:

				
					npm install jsdom jquery
				
			

Now you can use jQuery in Node.js for scraping! Let’s put it to use by building a scraper to extract data from the Hacker News website.

Here’s what our target page looks like:

Fetching HTML with jQuery get()

The first step in any web scraping process is to fetch the HTML of the target web page. The jQuery get() method loads data from the server using an HTTP GET request.

Let's write a Node.js script that uses the jsdom library to create a simulated browser environment and the jquery package to perform an HTTP GET request that fetches the HTML content of Hacker News:

				
					const { JSDOM } = require("jsdom");

// Initialize JSDOM on the "https://news.ycombinator.com/" page to avoid CORS problems
const { window } = new JSDOM("", {
    url: "https://news.ycombinator.com/",
});

const $ = require("jquery")(window);

$.get("https://news.ycombinator.com/", function (html) {
    console.log(html);
});
				
			

In the code, we start by importing the JSDOM class from the jsdom library. jsdom allows us to create a simulated browser environment within Node.js. Then, we create a jsdom instance, providing the URL we want to simulate, and destructure its window property into a variable called window.

This window object represents the simulated browser window, similar to the window object you would have access to in a real browser.

Next, we import the jQuery library and pass the window object. This integrates jQuery into the simulated environment, allowing us to utilize its functionalities.

Finally, we use jQuery’s get() method to fetch the HTML content from the specified URL.

Run the code and you’ll see the following output:

That's the HTML content of Hacker News!

Extracting Data with jQuery find()

Given a jQuery object representing a set of DOM elements, the .find() method allows you to search for descendant elements within the DOM tree and create a new jQuery object containing the matching elements.

The .find() and .children() methods are similar, but .children() only selects elements that are direct children of the original set. As shown in the image below, the CSS selector for identifying news rows is tr.athing. You can use the .find() method to retrieve a list of these HTML elements.

The jQuery .find() method returns a set of DOM elements that match the CSS selector, jQuery object, or HTML element passed as the parameter.
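
To make the difference between .find() and .children() concrete, here is a small, self-contained sketch (the markup is made up for the example):

const fragment = $('<div><p><span class="note">hello</span></p></div>');

// .children() looks only one level down, so the nested span is not matched
console.log(fragment.children("span.note").length); // 0

// .find() searches all descendants, so the span is found
console.log(fragment.find("span.note").length); // 1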

In our case, we use jQuery to find all tr elements with the class athing within the retrieved HTML. These elements represent individual news items on the Hacker News website. Take a look at the code snippet below:

				
					 $.get("https://news.ycombinator.com/", function(html) {
    const newsItemElements = $(html).find("tr.athing");
});
				
			

Next, we iterate over each news item element to individually extract its rank, title, and URL. We use jQuery’s .each() method to process these elements one by one.

				
					$.get("https://news.ycombinator.com/", function(html) {
    const newsHTMLElements = $(html).find("tr.athing");

    newsHTMLElements.each((i, newsItemElement) => {
        // Code to extract data from each news item will go here
    });
});
				
			

For each news item, we locate the element containing the rank. This is a span element with the class “rank”. The .find() method is used to locate the span, and the .text() method extracts its content.

				
					$.get("https://news.ycombinator.com/", function(html) {
    const newsHTMLElements = $(html).find("tr.athing");

    newsHTMLElements.each((i, newsItemElement) => {
        const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
    });
});
				
			

Finally, we find the element containing the title and URL of the news item. This is an anchor (<a>) element within a span with the class “titleline”. We extract the text of the title directly from the anchor and the URL from its href attribute.

				
					$.get("https://news.ycombinator.com/", function(html) {
    const newsHTMLElements = $(html).find("tr.athing");

    const newsItems = [];
    newsHTMLElements.each((i, newsItemElement) => {
        const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
        const titleElement = $(newsItemElement).find("span.titleline > a");
        const title = titleElement.text();
        const url = titleElement.attr("href");

        // Store the extracted data
        newsItems.push({
            rank,
            title,
            url
        });
    });

    // Log the extracted data
    console.log(newsItems);
});
				
			
After running the code, you should see the extracted news items logged in the console, like this:

You can extract additional details for each news item, such as the author name, points, and post time, from the next tr element after the current tr.athing element. This subsequent tr element has a td with the class “subtext” where the data is present.
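
To see why this works, here is a simplified sketch of how each story is laid out in Hacker News' markup (heavily trimmed; real rows carry extra attributes and cells):

<tr class="athing">
  <td class="title"><span class="rank">1.</span></td>
  <td class="title">
    <span class="titleline"><a href="https://example.com/story">Story title</a></span>
  </td>
</tr>
<tr>
  <td class="subtext">
    <span class="score">123 points</span> by <a class="hnuser">someuser</a>
    <span class="age">2 hours ago</span>
  </td>
</tr>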

Here’s the code to extract additional details:

				
					$.get("https://news.ycombinator.com/", function(html) {
    const newsHTMLElements = $(html).find("tr.athing");

    const newsItems = [];
    newsHTMLElements.each((i, newsItemElement) => {
        const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
        const titleElement = $(newsItemElement).find("span.titleline > a");
        const title = titleElement.text().trim();
        const url = titleElement.attr("href");

        // Extract additional details
        const subtext = $(newsItemElement).next().find(".subtext");
        const author = subtext.find("a.hnuser").text();
        const points = subtext.find("span.score").text();
        const postTime = subtext.find("span.age").text();

        // Store the extracted data
        newsItems.push({
            rank,
            title,
            url,
            author,
            points,
            postTime
        });
    });

    console.log(newsItems);
});
				
			

The additional details live in the td with the class subtext, which sits in the tr element immediately following each tr.athing row. That's why we call next().find(".subtext"): next() selects the tr element right after the current tr.athing element, and find(".subtext") then locates the element with the class subtext inside it.

Once you have the subtext element, you can extract the desired data from it. To get the author's name, use subtext.find("a.hnuser").text(); the points and the post time are read the same way from span.score and span.age.
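
Note that .text() always returns strings, so points comes back as something like "123 points". If you'd rather store a number, a small optional cleanup (not part of the original script) could be:

// parseInt stops at the first non-numeric character, so "123 points" becomes 123
const pointsValue = parseInt(points, 10) || 0;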

Run the code and you will see the extracted news items logged in the console, like this:

Nice! We’ve successfully extracted all the desired data from the first page of Hacker News. However, this is not enough, as we need to fetch data from multiple pages. Let’s explore how to scrape multiple pages by handling pagination.

Handling Pagination to Scrape Multiple Pages

Hacker News only displays a limited number of items on each page. Pagination allows you to navigate through multiple pages to collect all available news articles, not just those on the first page.

On the Hacker News website, if you scroll to the bottom of the page, you will see a “more” button. Clicking this button loads additional news items and updates the URL.

For example, if the initial URL is:

https://news.ycombinator.com

Clicking “more” would change the URL to:

https://news.ycombinator.com/?p=2

With each subsequent click, the page number (p) will increase by 1 until there are no more pages to display.

Take a look at the image below:

How will we handle pagination? By checking for a "next page" link: if one is found, we fetch the content of the next page. See the code snippet below:
				
					// Check for next page link
const nextPageLink = $(html).find("a.morelink").attr("href");
if (nextPageLink) {
    currentPage += 1;
    fetchPage(currentPage);
} else {
    console.log(newsItems);
}
				
			

If the next-page link is present, the script increments the currentPage variable by 1 and then recursively calls the fetchPage function with the updated value to fetch and process the next page.

If no “next page” link is found, it means there are no more pages to fetch. At this point, the script stops fetching further pages and logs the collected news items to the console.
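
Hacker News has many pages, so letting the recursion run until the morelink disappears can take a while and sends a lot of requests. One optional refinement, not part of the original script, is to cap the number of pages fetched:

const MAX_PAGES = 5; // hypothetical limit; adjust or remove as needed

if (nextPageLink && currentPage < MAX_PAGES) {
    currentPage += 1;
    fetchPage(currentPage);
} else {
    console.log(newsItems);
}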

Here’s the complete code:

				
					const { JSDOM } = require("jsdom");

// Initialize JSDOM with the target URL
const { window } = new JSDOM("", {
    url: "https://news.ycombinator.com/",
});

const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
    $.get(`${baseUrl}?p=${page}`, function (html) {
        const newsHTMLElements = $(html).find("tr.athing");

        newsHTMLElements.each((i, newsItemElement) => {
            const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
            const titleElement = $(newsItemElement).find("span.titleline > a");
            const title = titleElement.text().trim();
            const url = titleElement.attr("href");

            // Extract additional details
            const subtext = $(newsItemElement).next().find(".subtext");
            const author = subtext.find("a.hnuser").text();
            const points = subtext.find("span.score").text();
            const postTime = subtext.find("span.age").text();

            // Store the extracted data
            newsItems.push({
                rank,
                title,
                url,
                author,
                points,
                postTime
            });
        });

        // Check for next page link
        const nextPageLink = $(html).find("a.morelink").attr("href");
        if (nextPageLink) {
            currentPage += 1;
            fetchPage(currentPage);
        } else {
            console.log(newsItems);
        }
    }).fail(function (error) {
        console.error(`Error fetching page ${page}:`, error);
    });
}

// Start fetching from the first page
fetchPage(currentPage);
				
			

When the code runs, it prints:

Great! We’ve successfully extracted data from multiple pages using the pagination technique.

Exporting Scraped Data

Scraping data from the web is often the initial step in a larger workflow. The scraped data can be used for multiple purposes. To make the most of the gathered data, it is crucial to convert it into a format that is easy to read, analyze, and manipulate. Some of the common formats include CSV and JSON.

Let’s start by discussing how to store the scraped data in a CSV file.

To export the scraped data to a CSV file in Node.js, you can utilize the fs (File System) module to write the data. Here’s how to modify your script:

  1. Import the necessary fs module.
  2. Convert the newsItems array into CSV format.
  3. Write the CSV data to a file.

Here’s the full code:

				
					const { JSDOM } = require("jsdom");
const fs = require("fs");

const { window } = new JSDOM("", {
    url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
    $.get(`${baseUrl}?p=${page}`, function(html) {
        const newsHTMLElements = $(html).find("tr.athing");

        newsHTMLElements.each((i, newsItemElement) => {
            const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
            const titleElement = $(newsItemElement).find("span.titleline > a");
            const title = titleElement.text().trim();
            const url = titleElement.attr("href");

            // Extract additional details
            const subtext = $(newsItemElement).next().find(".subtext");
            const author = subtext.find("a.hnuser").text();
            const points = subtext.find("span.score").text();
            const postTime = subtext.find("span.age").text();

            // Store the extracted data
            newsItems.push({
                rank,
                title,
                url,
                author,
                points,
                postTime
            });
        });

        // Check for next page link
        const nextPageLink = $(html).find("a.morelink").attr("href");
        if (nextPageLink) {
            currentPage += 1;
            fetchPage(currentPage);
        } else {
            exportToCSV(newsItems);
        }
    }).fail(function(error) {
        console.error(`Error fetching page ${page}:`, error);
    });
}

// Function to export the data to a CSV file
function exportToCSV(data) {
    const csvHeaders = ['Rank', 'Title', 'URL', 'Author', 'Points', 'Post Time'];
    
    const csvRows = [
        csvHeaders.join(','),
        ...data.map(item => [
            item.rank,
            `"${item.title}"`,
            item.url,
            item.author,
            item.points,
            item.postTime
        ].join(','))
    ].join('\n');  // Join all rows with newline characters

    // Write the CSV string to a file
    fs.writeFile("newsItems.csv", csvRows, (err) => {
        if (err) {
            console.error("Error writing to CSV file:", err);
        } else {
            console.log("Data successfully exported to newsItems.csv");
        }
    });
}

fetchPage(currentPage);
				
			
Run the code to create a new CSV file in your directory.
Great! The data looks clean and readable in a CSV file.

Now let's explore storing the scraped data in a JSON file. To do this, create a function that converts the newsItems array into a JSON string using JSON.stringify and then writes that string to a file named newsItems.json with the fs.writeFile method from Node.js's fs module. Here's the full code:
				
					const { JSDOM } = require("jsdom");
const fs = require("fs");

const { window } = new JSDOM("", {
    url: "https://news.ycombinator.com/",
});
const $ = require("jquery")(window);

const baseUrl = "https://news.ycombinator.com/";
let currentPage = 1;
const newsItems = [];

// Function to fetch a page of news items
function fetchPage(page) {
    $.get(`${baseUrl}?p=${page}`, function (html) {
        const newsHTMLElements = $(html).find("tr.athing");

        newsHTMLElements.each((i, newsItemElement) => {
            const rank = $(newsItemElement).find("span.rank").text().replace('.', '');
            const titleElement = $(newsItemElement).find("span.titleline > a");
            const title = titleElement.text().trim();
            const url = titleElement.attr("href");

            // Extract additional details
            const subtext = $(newsItemElement).next().find(".subtext");
            const author = subtext.find("a.hnuser").text();
            const points = subtext.find("span.score").text();
            const postTime = subtext.find("span.age").text();

            // Store the extracted data
            newsItems.push({ rank, title, url, author, points, postTime });
        });

        // Check for next page link
        const nextPageLink = $(html).find("a.morelink").attr("href");
        if (nextPageLink) {
            currentPage += 1;
            fetchPage(currentPage);
        } else {
            exportToJSON(newsItems);
        }
    }).fail(function (error) {
        console.error(`Error fetching page ${page}:`, error);
    });
}

// Function to export data to a JSON file
function exportToJSON(data) {
    const jsonData = JSON.stringify(data, null, 2);

    // Write the JSON string to a file
    fs.writeFile("newsItems.json", jsonData, (err) => {
        if (err) {
            console.error("Error writing to JSON file:", err);
        } else {
            console.log("Data successfully exported to newsItems.json");
        }
    });
}

// Start fetching from the first page
fetchPage(currentPage);
				
			
Run the code to create a new JSON file in your directory.
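
The file contains an array of objects with the fields collected above; each entry is shaped roughly like this (the values shown are placeholders, not real output):

[
  {
    "rank": "1",
    "title": "Example story title",
    "url": "https://example.com/story",
    "author": "someuser",
    "points": "123 points",
    "postTime": "2 hours ago"
  }
]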

Conclusion

We tried to make this tutorial beginner-friendly. As you've seen, however, client-side scraping with jQuery runs into CORS limitations, which is why the real work happens on the server. To learn more, you can read this full tutorial on Web scraping with JavaScript & NodeJs.

Also, if you are looking for a scraping solution without these limitations, you can try the Scrapingdog web scraping API.
