PHP Web Scraping

This document contains a list of libraries and resources for web scraping in PHP.

Libraries

Note: All selected libraries are either actively maintained or widely used.

Network

HTTP Clients

  • PHP cURL: A built-in PHP library based on libcurl for connecting to and communicating with many types of servers over many protocols
  • Guzzle: A PHP HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services [Guzzle proxy integration]
  • HttpClient: A Symfony component implementing a low-level HTTP client
  • Buzz: A lightweight (<1000 lines of code) PHP 7.1 library for issuing HTTP requests
  • amphp/http-client: An advanced async HTTP client library for PHP, enabling efficient, non-blocking, and concurrent requests and responses
  • Requests: A humble HTTP request library. It simplifies how you interact with other sites and takes away all your worries.
  • HTTPFul: A chainable, REST-friendly PHP HTTP client and a sane alternative to cURL, with support for both PHP stream wrappers and cURL

WebSockets

  • Sockets: A built-in PHP extension that implements a low-level interface to the socket communication functions based on the popular BSD sockets, providing the possibility to act as a socket server as well as a client
  • Ratchet: A PHP library for asynchronously serving WebSockets
  • Pawl: An asynchronous WebSocket client in PHP

Parsers

HTML and XML Parsers

  • Simple Html Dom Parser for PHP: A modern simple HTML DOM parser for PHP
  • HTML5-PHP: An HTML5 parser and serializer for PHP
  • DiDOM: A simple and fast HTML and XML parser
  • QueryPath: A PHP library for HTML(5)/XML querying (CSS 4 or XPath) and processing (like jQuery) with PHP 8.3 support
  • DomCrawler: A Symfony component that eases DOM navigation for HTML and XML documents
  • PHP Html Parser: An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery
  • DOM: A built-in PHP extension that allows operations on XML and HTML documents through the DOM API
  • XML Document Parser PHP: A framework-agnostic package that provides a simple way to parse XML into an array without writing complex logic

CSV Parsers

JSON Parsers

  • JSON Parser: A zero-dependencies lazy parser to read JSON of any dimension and from any source in a memory-efficient way
  • JsonMachine: An efficient, easy-to-use, and fast PHP JSON stream parser

PDF Parsers

  • PdfParser: A standalone PHP library that provides various tools to extract data from PDF files

Email Parsers

Markdown Parsers

  • CommonMark PHP: A highly extensible PHP Markdown parser which fully supports the CommonMark and GFM specs
  • PHP Markdown: A parser for Markdown and Markdown Extra derived from the original Markdown.pl by John Gruber
  • Parsedown: A better Markdown parser in PHP

YAML Parsers

SQL Parsers

  • PHP-SQL-Parser: A pure PHP, non-validating SQL parser with a focus on the MySQL dialect
  • SQL Parser: A validating SQL lexer and parser with a focus on the MySQL dialect

Office File Parsers

  • SimpleXLSX: A PHP library to parse and retrieve data from Excel XLSX files

Other

  • PHP Domain Parser: Public suffix list based domain parsing implemented in PHP
  • RSS & Atom Feeds for PHP: A small and easy-to-use library for consuming RSS and Atom feeds
  • PHP CSS Parser: A parser for CSS files written in PHP. It extracts CSS files into a data structure, lets you manipulate that structure, and outputs it as (optimized) CSS
  • SitemapParser: An XML sitemap parser class compliant with the Sitemaps.org protocol
  • robots-txt-parser: A PHP class for parsing all directives from robots.txt files according to the specifications

Web Scraping

Frameworks

  • Crawler: A library for rapid (web) crawler and scraper development
  • Roach: A complete web scraping toolkit for PHP
  • PHP-Spider: A configurable and extensible PHP web spider
  • Embed: A PHP library to get info from any web service or page
  • PHPScraper: A versatile web-utility for PHP

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
  • PHP Module for 2Captcha API: A PHP package for easy integration with the API of the 2Captcha CAPTCHA-solving service to bypass reCAPTCHA, hCaptcha, FunCaptcha, GeeTest, and other CAPTCHAs
  • captcha-solver-php: An easy PHP implementation by Metabypass for solving any type of CAPTCHA

Web Automation

Browser Automation Frameworks

  • Panther: A browser testing and web crawling library for PHP and Symfony
  • php-webdriver: A PHP client for Selenium/WebDriver protocol
  • chrome-php/chrome: A library to instrument headless Chrome/Chromium instances from PHP
  • Mink: A web browser emulator abstraction for PHP

Data Export

JSON

CSV

  • CSV: A library to ease parsing, writing and filtering CSV in PHP

Other

  • mPDF: A PHP library generating PDF files from UTF-8 encoded HTML
  • PhpSpreadsheet: A pure PHP library for reading and writing spreadsheet files
  • PHPWord: A pure PHP library for reading and writing word processing documents
  • PHPPowerPoint: A pure PHP library for reading and writing presentation documents

Data Processing

General

  • Stringy: A PHP string manipulation library with multibyte support

Character Encoding

Date and Time

Prices

  • Brick\Money: A money and currency library for PHP
  • PHP Prices: A simple PHP library for complex monetary prices management

Phone Numbers

Units of Measurement

  • PhpUnitsOfMeasure: A library for handling physical quantities and the units of measure in which they're represented

Slugs

  • Urlify: A fast PHP slug generator and transliteration library that converts non-ASCII characters for use in URLs
  • Slugify: A PHP library to convert a string to a slug. Includes integrations for Symfony, Silex, Laravel, Zend Framework 2, Twig, Nette and Latte

URLs and Network Addresses

  • Purl: A simple object-oriented URL manipulation library for PHP 7.2+
  • Uri: A PHP package that provides simple and intuitive classes to manage URIs in PHP
  • Url: A Swiss Army knife for URLs

Other

Multiprocessing

  • amphp/parallel: An advanced parallelization library for PHP, enabling efficient multitasking, optimizing resource use, and application responsiveness through multiple CPU threads

Event Handling

  • React: A library for event-driven, non-blocking I/O with PHP
  • Evenement: A very simple event dispatching library for PHP
  • Event: An event library with a focus on domain events

Software Automation Management

  • Salt: Software to automate the management and configuration of any infrastructure or application at scale
  • Puppet: A server automation framework and application

Task Scheduling

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: PHP cURL, Guzzle, or Symfony's HttpClient

  • HTML Parser: Simple Html Dom Parser for PHP, DomCrawler, HTML5-PHP, or DiDOM
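
As a rough illustration of how the HTTP client + HTML parser stack fits together, the sketch below pairs two of the built-in options listed in this document: the cURL extension for fetching and the DOM extension for parsing. The function names `fetchHtml` and `extractLinks`, the 10-second timeout, and the choice to collect links are assumptions made for this example; they are not part of any library named here.

```php
<?php
// Minimal sketch of the "HTTP client + HTML parser" stack using only
// built-in extensions (cURL and DOM). Helper names are hypothetical.

/**
 * Fetch the body of a page with cURL.
 * The timeout and redirect settings are illustrative defaults.
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    return $body;
}

/**
 * Parse an HTML string and return the text and href of every link
 * that carries an href attribute.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'text' => trim($anchor->textContent),
            'href' => $anchor->getAttribute('href'),
        ];
    }

    return $links;
}
```

A typical call would chain the two steps, for example `extractLinks(fetchHtml($url))` with a URL of your choice. Swapping in Guzzle or Symfony's HttpClient for the fetch step, or DomCrawler or DiDOM for the parse step, keeps the same two-stage shape.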

All-In-One Solution

  • Crawler, Roach, or PHP-Spider

Dynamic Web Pages

  • Panther or php-webdriver

Guides and Tutorials

General Guides

Proxies

HTTP Clients