Once a page has been scraped, parsing is the step that turns raw HTML into usable data. Whether you’re building a web crawler, collecting data, or just extracting a few elements from a page, PHP offers capable tools for HTML parsing.
In this detailed guide, we’ll explore everything you need to know about parsing HTML with PHP, from the basics to advanced examples.
Parsing Methods in PHP
Before we go deeper, let’s outline the primary ways HTML can be parsed using PHP:
- DOMDocument (Built-in)
- Simple HTML DOM Parser (External Library)
- Goutte (Symfony-based Web Scraper)
- cURL + Regex (Not recommended, but used)
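As a rough illustration of the first option, the sketch below parses a small, made-up HTML snippet with DOMDocument and DOMXPath, which belong to PHP's DOM extension. The sample markup and the $teams variable are assumptions for this example and are not taken from the page scraped later in this guide. On some distributions the DOM extension is packaged separately (for example, php-xml on Debian/Ubuntu).

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = <<<HTML
<table class="table">
  <tbody>
    <tr><td>Boston Celtics</td><td>2013</td></tr>
    <tr><td>Brooklyn Nets</td><td>2013</td></tr>
  </tbody>
</table>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// XPath query: every <tr> inside the <tbody> of a table whose class contains "table".
$xpath = new DOMXPath($doc);
$rows = $xpath->query('//table[contains(@class, "table")]/tbody/tr');

$teams = [];
foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');
    $teams[] = [
        'Team' => trim($cells->item(0)->textContent),
        'Year' => trim($cells->item(1)->textContent),
    ];
}

print_r($teams);
```

DOMDocument needs no Composer dependency, but it queries with XPath rather than CSS selectors, which is one reason external libraries are often preferred for quick scraping scripts.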
Installing PHP and Required Libraries
Before you begin scraping or parsing, ensure you have PHP installed on your system.
Install PHP
If you’re on macOS:
brew install php
For Ubuntu/Debian:
sudo apt update
sudo apt install php php-cli php-curl php-mbstring
For Windows:
- Download PHP from php.net.
- Extract and add the path to your system’s environment variables.
Install Composer (PHP Dependency Manager)
php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"
php composer-setup.php
php -r "unlink('composer-setup.php');"
Then, move the file so that composer is available globally (on macOS/Linux this may require sudo):
mv composer.phar /usr/local/bin/composer
Install paquettg/php-html-parser
This is one of the most popular PHP HTML parsing libraries.
composer require paquettg/php-html-parser
Scraping with PHP
For this tutorial, we are going to scrape and parse this site. Create a PHP file with any name you like; I am naming mine scraper.php.
The code is very simple, but let me explain it step by step.
- We define a variable $url containing the URL of the web page we want to scrape.
- Then we use PHP’s built-in file_get_contents() function to send a GET request to the URL.
- The HTML content of the page is then stored in the $html variable.
- It’s a simple way to fetch raw HTML from a web page.
- Next, the script checks whether the request failed (i.e., $html is false).
- If the page couldn’t be loaded, the script stops and prints “Failed to fetch page”.
- Otherwise, the fetched HTML content is written into a new file called raw.html.
- The file will be created in the same directory as the script (or overwritten if it exists).
- Finally, a success message is printed to confirm that the file has been saved.
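The steps above can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the original listing: the fetchAndSave() helper is a hypothetical name, file_put_contents() is assumed for the write step, and the URL is taken from the command line because the target URL is not reproduced here.

```php
<?php
// Hypothetical helper: fetches $url and writes the response body to $outFile.
// Returns true on success and false if fetching or writing fails.
function fetchAndSave(string $url, string $outFile): bool
{
    // file_get_contents() returns false on failure; the @ operator hides
    // the PHP warning so the failure can be handled explicitly below.
    $html = @file_get_contents($url);
    if ($html === false) {
        return false;
    }

    // file_put_contents() creates the file or overwrites an existing one.
    return file_put_contents($outFile, $html) !== false;
}

// Run only when a URL is passed on the command line, e.g.
//   php scraper.php <url-of-the-page-to-scrape>
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $url = $argv[1];

    if (!fetchAndSave($url, 'raw.html')) {
        die("Failed to fetch page\n");
    }

    echo "HTML saved to raw.html\n";
}
```

Keeping the fetch logic in a function makes it easy to reuse for several pages; for sites that require custom headers or cookies, cURL is usually a better fit than file_get_contents().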
Now, let’s parse it.
Parsing with PHP
Now, let’s parse the raw HTML and extract the team name, year, wins, and losses.
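<?php
// Load all Composer-installed libraries and import the Dom class
require 'vendor/autoload.php';
use PHPHtmlParser\Dom;

// Create a parser instance and load the saved HTML
$dom = new Dom;
$dom->loadFromFile('raw.html');

$data = [];
// Select every table row inside the <tbody> of the table with class "table"
$rows = $dom->find('.table tbody tr');
foreach ($rows as $row) {
    $teamName = $row->find('td', 0)->text;
    $year = $row->find('td', 1)->text;
    $wins = $row->find('td', 2)->text;
    $losses = $row->find('td', 3)->text;
    $data[] = [
        'Team' => trim($teamName),
        'Year' => trim($year),
        'Wins' => trim($wins),
        'Losses' => trim($losses)
    ];
}
print_r($data);
?>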
- First, we load all Composer-installed PHP libraries.
- Assumes you’ve installed paquettg/php-html-parser via Composer.
- Imports the Dom class from the library, allowing you to parse and interact with HTML DOM elements.
- Creates a new DOM parser instance.
- Loads the HTML from raw.html (an offline copy of a webpage with a table) for processing.
- Initializes an empty array called $data to hold the parsed results.
- Uses a CSS selector to find all <tr> (table row) elements inside <tbody> of a table with class .table.
- Iterates through each row of the table.
- Adds a new entry to the $data array.
- trim() removes any leading/trailing whitespace from the extracted text.
- Outputs the final structured array in a human-readable format.
Once you run this code, you will get a beautiful parsed response.
Array
(
    [0] => Array
        (
            [Team] => Boston Celtics
            [Year] => 2013
            [Wins] => 41
            [Losses] => 40
        )

    [1] => Array
        (
            [Team] => Brooklyn Nets
            [Year] => 2013
            [Wins] => 49
            [Losses] => 33
        )

    [2] => Array
        (
            [Team] => New York Knicks
            [Year] => 2013
            [Wins] => 37
            [Losses] => 45
        )

    [3] => Array
        (
            [Team] => Philadelphia 76ers
            [Year] => 2013
            [Wins] => 19
            [Losses] => 63
        )

    [4] => Array
        (
            [Team] => Toronto Raptors
            [Year] => 2013
            [Wins] => 48
            [Losses] => 34
        )

)
Storing the Data in a CSV File
Let’s export this parsed data into a CSV file.
- Opens a new file named teams.csv in write mode.
- If the file doesn’t exist, it will be created.
- $csvFile is now a file handle used for writing to the file.
- Writes the column headers (the first row) into the CSV file.
- fputcsv() automatically formats the array into a comma-separated line.
- Then, we iterate through the $data array, which is assumed to be an array of associative arrays.
- Finally, we close the file after writing is complete.
Complete Code
<?php
// Step 1: Load Composer's autoloader and import the Dom class
require 'vendor/autoload.php';
use PHPHtmlParser\Dom;

// Step 2: Load the saved HTML file
$dom = new Dom;
$dom->loadFromFile('raw.html');

// Step 3: Extract table rows
$data = [];
$rows = $dom->find('.table tbody tr');
foreach ($rows as $row) {
    $teamName = $row->find('td', 0)->text;
    $year = $row->find('td', 1)->text;
    $wins = $row->find('td', 2)->text;
    $losses = $row->find('td', 3)->text;
    $data[] = [
        'Team' => trim($teamName),
        'Year' => trim($year),
        'Wins' => trim($wins),
        'Losses' => trim($losses)
    ];
}

// Step 4: Write the data to a CSV file
$csvFile = fopen('teams.csv', 'w');
// Add header row
fputcsv($csvFile, ['Team', 'Year', 'Wins', 'Losses']);
// Write each data row
foreach ($data as $line) {
    fputcsv($csvFile, $line);
}
fclose($csvFile);
echo "Data written to teams.csv successfully.\n";
?>
Conclusion
You just learned how to scrape and parse data from a real-world HTML page using PHP. From installing dependencies to writing your first HTML parsing logic and exporting the results, this guide covers the end-to-end workflow.
Whether you’re scraping for personal projects or commercial tools, PHP offers powerful solutions when used with the right libraries.
