This document contains a list of libraries and resources for web scraping in Java.
- Libraries
- Popular Web Scraping Stacks
- Guides and Tutorials
Note: All selected libraries are either actively maintained or widely used.
- HttpClient: A built-in Java 11 HTTP client to send requests and retrieve their responses
- HttpURLConnection: A built-in Java class to make a single request to an HTTP server
- HttpsURLConnection: A built-in Java class that extends HttpURLConnection with HTTPS-specific features
- Apache HttpClient: A synchronous HTTP and WebSocket client library for Java
- Apache AsyncHttpClient: An asynchronous HTTP and WebSocket client library for Java
- OkHttp: A meticulous HTTP client for the JVM, Android, and GraalVM
- Jetty HttpClient: An HTTP client that supports HTTP/2, HTTP/1.1, HTTP/1.0, WebSocket, servlets, and more
- Google HTTP Client Library For Java: A flexible, efficient, and powerful Java library for accessing any resource on the web via HTTP
- Retrofit: A type-safe HTTP client for Android and the JVM
- WebSocket: A built-in Java WebSocket client
- Java-WebSocket: A barebones WebSocket client and server implementation written in 100% Java
- pcap4j: A Java library for capturing, crafting, and sending packets
- Apache Tika: A toolkit to detect and extract metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)
- Jackson Text Dataformats: An uber-project for (some) standard Jackson textual format backends: CSV, properties, YAML
- jsoup: A Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety
- HtmlCleaner: An open-source HTML parser written in Java
- JFiveParse: A Java HTML 5 compliant parser
- JAXP: Built-in Java APIs for XML processing
- URI: A built-in Java API for normalizing, resolving, and relativizing URI instances
- Apache Commons CSV: A library that provides a simple interface for reading and writing CSV files of various types
- JSON Path: A Java DSL for reading JSON documents
- Angus Mail: A library that provides a platform-independent and protocol-independent framework to build mail and messaging applications
- Apache Commons Email: A library that provides an API for sending email, simplifying the JavaMail API
- Flexmark: A CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules
- CommonMark: A Java library for parsing and rendering CommonMark (Markdown)
- SnakeYAML: A complete YAML 1.1 processor for the JVM
- JSqlParser: A library to parse an SQL statement and translate it into a hierarchy of Java classes
- Apache POI: A Java API to access Microsoft format files
- Rome: A Java library for RSS and Atom feeds
- uap-java: The Java implementation of ua-parser
- Yauaa: Yet another User-Agent analyzer
- Apache Nutch: An extensible and scalable web crawler
- WebMagic: A scalable web crawler framework for Java
- Jaunt: A Java library for web-scraping, web-automation and JSON querying
- ACHE Focused Crawler: A web crawler for domain-specific search
- Gecco: An easy-to-use lightweight web crawler
- StormCrawler: A scalable, mature and versatile web crawler based on Apache Storm
- Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
- CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
- Selenium: A browser automation framework and ecosystem
- Playwright: The Java version of the Playwright testing and automation library
- HtmlUnit: A GUI-less browser for Java programs that models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc.
- Jauntium: A Java library that allows you to easily automate Chrome, Firefox, Safari, Edge, IE, and other modern web browsers
- WebDriverManager: A library that provides automated driver management and other helper features for Selenium WebDriver in Java
- Jackson: A suite of data-processing tools for Java (and the JVM platform), including the flagship streaming JSON parser/generator library, a matching data-binding library (POJOs to and from JSON), additional data format modules for Avro, BSON, CBOR, CSV, Smile, (Java) Properties, Protobuf, TOML, XML, and YAML, and a large set of modules that support widely used data types such as Guava, Joda, PCollections, and many more
- Charset: A built-in Java class that defines charsets, decoders, and encoders, for translating between bytes and Unicode characters
- Apache Commons Codec: A library that contains encoders and decoders for various formats such as Base16, Base32, Base64, digest, and Hexadecimal
- JSON: A reference implementation of a JSON package in Java
- LocalDate: A built-in Java class that represents a date as an immutable date-time object
- LocalTime: A built-in Java class that represents a time as an immutable date-time object
- Joda-Time: The widely used replacement for the Java date and time classes prior to Java SE 8
- JSR 354: A library that provides an API for representing, transporting, and performing comprehensive calculations with money and currency
- libphonenumber: A library for parsing, formatting, and validating international phone numbers
- Slugify: A small utility library for generating speaking URLs
- Apache Commons Lang: A library that provides a host of helper utilities for the java.lang API, notably String manipulation methods, basic numerical methods, object reflection, concurrency, creation and serialization and System properties
- Wisp: A simple Java scheduler library with a minimal footprint and a straightforward API
- HTTP Client: HttpClient, Apache HttpClient, Apache AsyncHttpClient, OkHttp, Jetty HttpClient, Google HTTP Client Library For Java, or Retrofit
- HTML Parser: jsoup or HtmlCleaner
- jsoup
- Selenium, Playwright or HtmlUnit
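
As an illustration of the HTTP client plus HTML parser pairing listed above, here is a minimal sketch of the HTTP client half using the built-in `HttpClient` (Java 11+) and `URI`. The class name `FetchSketch`, the helper names, the `User-Agent` string, and the 10-second timeouts are all assumptions made for this example, not recommendations from any of the listed libraries. HTML parsing is left out on purpose so the sketch needs no third-party dependency; the string returned by `fetch` is what would be handed to a parser such as jsoup or HtmlCleaner.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical example class; all names here are illustrative.
class FetchSketch {

    // Build a GET request with a timeout and an explicit User-Agent header.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))          // assumed timeout
                .header("User-Agent", "fetch-sketch/0.1") // assumed value
                .GET()
                .build();
    }

    // Resolve a possibly relative link against the URL of the page it was found on.
    public static URI resolveLink(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href);
    }

    // Send the request and return the response body as a string.
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Works offline: prints https://example.com/c.html
        System.out.println(resolveLink("https://example.com/a/b.html", "../c.html"));

        // Network access happens only when a URL is passed on the command line.
        if (args.length > 0) {
            String html = fetch(args[0]);
            System.out.println(html.length() + " characters fetched");
        }
    }
}
```

Keeping request construction separate from sending makes it possible to inspect or test the request without touching the network, and most of the other HTTP clients in the list could replace `fetch` without changing the link-resolution step.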