Java Web Scraping

This document contains a list of libraries and resources for web scraping in Java.

Libraries

Note: All selected libraries are either actively maintained or widely used.

Network

HTTP and WebSocket Clients

  • HttpClient: A built-in Java 11 HTTP client to send requests and retrieve their responses
  • HttpURLConnection: A built-in Java class to make a single request to an HTTP server
  • HttpsURLConnection: A built-in Java class that extends HttpURLConnection with HTTPS-specific features
  • Apache HttpClient: A synchronous HTTP and WebSocket client library for Java
  • Apache AsyncHttpClient: An asynchronous HTTP and WebSocket client library for Java
  • OkHttp: A meticulous HTTP client for the JVM, Android, and GraalVM
  • Jetty HttpClient: An HTTP client that supports HTTP/2, HTTP/1.1, HTTP/1.0, WebSocket, servlets, and more
  • Google HTTP Client Library For Java: A flexible, efficient, and powerful Java library for accessing any resource on the web via HTTP
  • Retrofit: A type-safe HTTP client for Android and the JVM
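As a rough illustration of the built-in option at the top of this list, the sketch below builds and sends a GET request with `java.net.http.HttpClient` (Java 11+). The class name `HttpClientSketch`, the `example-scraper/0.1` User-Agent, the 10-second timeouts, and the `https://example.com` URL are all assumptions made for the example, not recommendations from this list.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class HttpClientSketch {

    // Builds a GET request. The timeout and User-Agent values are illustrative assumptions.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "example-scraper/0.1")
                .GET()
                .build();
    }

    // Sends the request synchronously and returns the response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Pass a URL as the first argument to perform a real request.
            System.out.println(fetch(args[0]));
        } else {
            // Without arguments, only show the request that would be sent (placeholder URL).
            System.out.println(buildRequest("https://example.com"));
        }
    }
}
```

The returned body can then be handed to any of the HTML parsers listed further down.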

WebSockets

  • WebSocket: A built-in Java WebSocket client
  • Java-WebSocket: A barebones WebSocket client and server implementation written in 100% Java

Low Level

  • pcap4j: A Java library for capturing, crafting, and sending packets

Parsers

General

  • Apache Tika: A toolkit to detect and extract metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)
  • Jackson Text Dataformats: An uber-project for (some) standard Jackson textual format backends: CSV, properties, YAML

HTML Parsers

  • jsoup: A Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety
  • HtmlCleaner: An open-source HTML parser written in Java
  • JFiveParse: A Java HTML 5 compliant parser

XML Parsers

  • JAXP: Built-in Java APIs for XML processing
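One common scraping use for JAXP is pulling URLs out of a sitemap-like XML document. The sketch below uses only JDK classes (`DocumentBuilderFactory` and `XPathFactory`). The class name `JaxpSketch`, the `extractLocs` helper, and the assumption that the input has `<loc>` elements with no XML namespace are all hypothetical choices for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class JaxpSketch {

    // Returns the trimmed text of every <loc> element in the given XML string.
    // Assumes the document declares no XML namespace.
    static List<String> extractLocs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations, a common precaution when parsing untrusted XML.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//loc", doc, XPathConstants.NODESET);
            List<String> locs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                locs.add(nodes.item(i).getTextContent().trim());
            }
            return locs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset><url><loc>https://example.com/a</loc></url></urlset>";
        System.out.println(extractLocs(xml));
    }
}
```

Real sitemaps usually declare a default namespace, in which case a namespace-aware factory and a matching XPath namespace context would be needed.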

URL Parsers

  • URI: A built-in Java API for normalizing, resolving, and relativizing URI instances
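The three operations named above can be sketched with `java.net.URI` alone. A typical scraping use is turning relative `href` values into absolute URLs. The class name `UriSketch`, its helper methods, and the `example.com` URLs are hypothetical and exist only for this example.

```java
import java.net.URI;

class UriSketch {

    // Resolves a possibly relative href against the page URL and removes "." and ".." segments.
    static String absolutize(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    // Expresses target relative to base when base's path is a prefix of target's path.
    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/catalog/page1.html";
        System.out.println(absolutize(page, "../img/./logo.png"));
        System.out.println(absolutize(page, "item?id=2"));
        System.out.println(relativize("https://example.com/catalog/",
                "https://example.com/catalog/books/1"));
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` on malformed input, so links scraped from real pages may need cleaning or a try/catch first.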

CSV Parsers

  • Apache Commons CSV: A library that provides a simple interface for reading and writing CSV files of various types

JSON Parsers

  • JSON Path: A Java DSL for reading JSON documents

Email Parsers

  • Angus Mail: A library that provides a platform-independent and protocol-independent framework to build mail and messaging applications
  • Apache Commons Email: A library that provides an API for sending email, simplifying the JavaMail API

Markdown Parsers

  • Flexmark: A CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules
  • CommonMark: A Java library for parsing and rendering CommonMark (Markdown)

YAML Parsers

  • SnakeYAML: A complete YAML 1.1 processor for the JVM

SQL Parsers

  • JSqlParser: A library to parse an SQL statement and translate it into a hierarchy of Java classes

Office File Parsers

  • Apache POI: A Java API to access Microsoft format files

Other

  • Rome: A Java library for RSS and Atom feeds
  • uap-java: The Java implementation of ua-parser
  • Yauaa: Yet another User-Agent analyzer

Web Scraping

Frameworks

  • Apache Nutch: An extensible and scalable web crawler
  • WebMagic: A scalable web crawler framework for Java
  • Jaunt: A Java library for web-scraping, web-automation and JSON querying
  • ACHE Focused Crawler: A web crawler for domain-specific search
  • Gecco: An easy-to-use lightweight web crawler
  • StormCrawler: A scalable, mature and versatile web crawler based on Apache Storm

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]

Web Automation

Browser Automation Frameworks

  • Selenium: A browser automation framework and ecosystem
  • Playwright: The Java version of the Playwright testing and automation library
  • HtmlUnit: A GUI-less browser for Java programs that models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc
  • Jauntium: A Java library that allows you to easily automate Chrome, Firefox, Safari, Edge, IE, and other modern web browsers

Other

  • WebDriverManager: A library that provides automated driver management and other helper features for Selenium WebDriver in Java

Data Processing

General

  • Jackson: A suite of data-processing tools for Java (and the JVM platform), including the flagship streaming JSON parser/generator library, a matching data-binding library (POJOs to and from JSON), additional data format modules to process data encoded in Avro, BSON, CBOR, CSV, Smile, (Java) Properties, Protobuf, TOML, XML, or YAML, and a large set of datatype modules that support widely used libraries such as Guava, Joda, PCollections, and many more

Character Encoding

  • Charset: A built-in Java class that defines charsets, decoders, and encoders, for translating between bytes and Unicode characters
  • Apache Commons Codec: A library that contains encoders and decoders for various formats such as Base16, Base32, Base64, digest, and Hexadecimal
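Scraped pages are not always UTF-8, so the raw response bytes often need to be decoded with the charset the server declares. The sketch below uses the built-in `Charset` class for that. The class name `CharsetSketch`, the `decode` helper, and the UTF-8 fallback policy are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {

    // Decodes a response body with the named charset (for example, taken from a
    // Content-Type header). Falls back to UTF-8 when the name is missing, illegal,
    // or unsupported; that fallback is an assumption of this sketch.
    static String decode(byte[] body, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalArgumentException e) {
                // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            }
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The bytes of "caf\u00e9" encoded as ISO-8859-1.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(latin1, "ISO-8859-1"));
    }
}
```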

JSON

  • JSON: A reference implementation of a JSON package in Java

Date and Time

  • LocalDate: A built-in, immutable Java class that represents a date without a time zone
  • LocalTime: A built-in, immutable Java class that represents a time without a time zone
  • Joda-Time: The widely used replacement for the Java date and time classes prior to Java SE 8
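Dates and times scraped from pages usually arrive as free-form text. The sketch below turns such strings into `LocalDate` and `LocalTime` values with `DateTimeFormatter`. The class name `DateTimeSketch` and the two input formats (`March 5, 2024` and `14:30`) are assumptions; real pages will need patterns matched to their own markup.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class DateTimeSketch {

    // Assumed input formats; adjust the patterns to the site being scraped.
    static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("MMMM d, uuuu", Locale.ENGLISH);
    static final DateTimeFormatter TIME =
            DateTimeFormatter.ofPattern("HH:mm", Locale.ENGLISH);

    // Parses text such as "March 5, 2024"; throws DateTimeParseException on a mismatch.
    static LocalDate parseDate(String text) {
        return LocalDate.parse(text.trim(), DATE);
    }

    // Parses 24-hour text such as "14:30"; throws DateTimeParseException on a mismatch.
    static LocalTime parseTime(String text) {
        return LocalTime.parse(text.trim(), TIME);
    }

    public static void main(String[] args) {
        System.out.println(parseDate(" March 5, 2024 "));
        System.out.println(parseTime("14:30"));
    }
}
```

Fixing the locale explicitly keeps month-name parsing independent of the machine's default locale.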

Prices

  • JSR 354: A library that provides an API for representing, transporting, and performing comprehensive calculations with money and currency

Phone Numbers

  • libphonenumber: A library for parsing, formatting, and validating international phone numbers

Slugs

  • Slugify: A small utility library for generating speaking URLs

Languages

  • Apache Commons Lang: A library that provides a host of helper utilities for the java.lang API, notably String manipulation methods, basic numerical methods, object reflection, concurrency, creation and serialization and System properties

Other

Task Scheduling

  • Wisp: A simple Java scheduler library with a minimal footprint and a straightforward API

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: HttpClient, Apache HttpClient, Apache AsyncHttpClient, OkHttp, Jetty HttpClient, Google HTTP Client Library For Java, or Retrofit

  • HTML Parser: jsoup or HtmlCleaner

All-In-One Solution

  • jsoup

Dynamic Web Pages

  • Selenium, Playwright or HtmlUnit

Guides and Tutorials

General Guides

Comparisons