Releases: janreges/siteone-crawler
v1.0.8
This version adds redirect following for the initial URL (when it points to the same domain or a subdomain of the same 2nd-level domain), detection of large numbers of similar 404 URLs caused by a wrong relative path (discovered in the Svelte docs) together with URL-skipping behavior, further improvements to exporting/cloning websites built on modern JS frameworks, better handling of some edge cases, and a lot of minor improvements (see the changelog).
Changes
- reports: changed file name composition from report.mydomain.com.* to mydomain.com.report.*
#9
- crawler: solved a rare edge case where queue processing had already finished while the last outstanding coroutine still found a new URL
a85990d
- javascript processor: improved webpack JS processing so that paths from VueJS are correctly replaced during offline export (e.g. in the case of docs.netlify.com) .. without this, the HTML had correct paths in the left menu, but JS immediately broke them because they started with an absolute path with a leading slash
9bea99b
- offline export: detect and process fonts.googleapis.com/css* as CSS even if there is no .css extension
da33100
- js processor: removed the forgotten var_dump
5f2c36d
- offline export: improved detection of external JS in the case of webpack (URLs composed dynamically from an object with chunk definitions) - debugged on docs.netlify.com
a61e72e
- offline export: when a URL ends with a dot followed by a number (so it looks like it has an extension), it must not be treated as an extension in some cases
c382d95
- offline url converter: better support for SVG when the URL contains no extension at all but includes e.g. 'icon' (it's not perfect)
c9c01a6
- offline exporter: emit a warning instead of an exception for some edge cases, e.g. failing to save an SVG without an extension no longer stops the export
9d285f4
- cors: do not set the Origin request header for images (it otherwise caused 403 errors on cdn.sanity.io for SVGs, etc.)
2f3b7eb
- best practice analyzer: when checking for missing quotes, ignore values longer than 1000 characters (fixes e.g. the error 'Compilation failed: regular expression is too large at offset 90936' on skoda-auto.cz)
8a009df
- html report: added loading of extra headers to the visited URL list in the HTML report
781cf17
- Frontload the report names
62d2aae
- robots.txt: added the --ignore-robots-txt option (we often need to crawl internal or preview domains that otherwise disallow indexing by search engines)
9017c45
- http client: added an explicit 'Connection: close' header and an explicit call to $client->close(), even though Swoole did this automatically after the coroutine exited
86a7346
- javascript processor: parse JS module import URLs only in JS files (otherwise imports shown in HTML documentation, e.g. on svelte.dev or nextjs.org, were parsed by mistake)
592b618
- html processor: added extraction of URLs from HTML attributes that are not wrapped in quotes (the current regexps can still cause problems when values contain unescaped spaces)
f00abab
- offline url converter: swapped the woff2/woff order in the regex - their priority matters here, and woff2 previously didn't work properly
3f318d1
- non-200 url basename detection: URLs such as image generators that share a basename but carry the image URL in query parameters are no longer treated as having the same basename
bc15ef1
- supertable: automatic creation of active links now also works for the homepage '/'
c2e228e
- analysis and robots.txt: improved the display of URLs in SEO analysis for multi-domain websites, so that the same URL (e.g. '/') can no longer appear in the overview multiple times without the domain or scheme being distinguished + improved robots.txt handling in SEO detection and the display of URLs disallowed for indexing
47c7602
- offline website exporter: the '_' suffix is added to a folder name only when it has a typical static-file extension - we don't want this to happen to domain names as well
d16722a
- javascript processor: extract JS urls also from imports like import {xy} from "./path/foo.js"
aec6cab
- visited url: added 'txt' extension to looksLikeStaticFileByUrl()
460c645
- html processor: extract JS urls also from <link href="*.js">, typically with rel="modulepreload"
c4a92be
- html processor: extracting repeated calls to getFullUrl() into a variable
a5e1306
- analysis: do not include URLs that failed to load (timeout, skipping, etc.) in the content-type and source-domain analyses - prevents displaying the content type 'unknown'
b21ecfb
- cli options: improved quote removal, now also for options that can be arrays - fixes e.g. --extra-columns='Title'
97f2761
- url skipping: if many URLs share the same basename (the part after the last slash), a maximum of 5 requests with that basename are allowed - this prevents floods of 404s when every page contains an incorrect relative link such as relative/my-img.jpg (e.g. on the 404 page of v2.svelte.dev); see the sketch after this list
4fbb917
- analysis: perform most of the analysis only on URLs from domains for which we have crawling enabled
313adde
- audio & video: added audio/video file search in <audio> and <video> tags, if file crawling is not disabled
d72a5a5
- best practices: reworded the confusing warning '<h2> after <h0>' to '<h2> without previous heading'
041b383
- initial url redirect: when the entered URL redirects to another URL/domain within the same 2nd-level domain (typically http -> https or mydomain.tld -> www.mydomain.tld redirects), crawling continues with the new URL/domain, which is declared the new initial URL
166e617
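The basename-skipping logic above can be illustrated with a minimal sketch (the class and method names are hypothetical, not the crawler's actual implementation):

```php
<?php
// Hypothetical sketch: allow at most 5 requests per URL basename to avoid
// 404 floods caused by a broken relative link repeated on every page.
class BasenameRequestLimiter
{
    private const MAX_REQUESTS_PER_BASENAME = 5;

    /** @var array<string, int> request counts keyed by basename */
    private array $counts = [];

    public function shouldSkip(string $url): bool
    {
        $path = parse_url($url, PHP_URL_PATH);
        $basename = is_string($path) ? basename($path) : '';
        if ($basename === '' || $basename === '/') {
            return false; // e.g. the homepage '/'
        }
        $this->counts[$basename] = ($this->counts[$basename] ?? 0) + 1;
        return $this->counts[$basename] > self::MAX_REQUESTS_PER_BASENAME;
    }
}

// The 6th URL ending in 'my-img.jpg' is skipped, whatever folder it is in.
$limiter = new BasenameRequestLimiter();
for ($i = 1; $i <= 6; $i++) {
    var_dump($limiter->shouldSkip("https://example.com/page{$i}/relative/my-img.jpg"));
}
```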
v1.0.7
The primary changes are a new option to upload the HTML report online, improved sorting in generated sitemaps, detection and better display of SVG icon sets, and removal of inline JS from the HTML report (except for a few main static scripts, which can now be allowed via sha256 hashes in a strict Content-Security-Policy), plus various minor fixes and changes.
Changes
- html report template: updated logo link to crawler.siteone.io
9892cfe
- http headers analysis: renamed 'Headers' to 'HTTP headers'
436e6ea
- sitemap generator: added info about crawler to generated sitemap.xml
7cb7005
- html report: refactored all inline on* event handlers into data attributes with event listeners attached from static JS inside <script>, so that all inline JS in the online HTML report can be disabled and only our JS, allowed by hashes in the Content-Security-Policy, is executed
b576eef
- readme: removed HTTP auth from the roadmap (it's already done), improved the guide on implementing your own upload endpoint, and moved the note about SMTP under the mailer options
e1567ae
- utils: hide passwords/credentials specified in CLI parameters matching *auth=xyz (e.g. --http-auth=abc:xyz) in the HTML report
c8bb88f
- readme: fixed formatting of the upload and expert options
2d14bd5
- readme: added Upload Options
d8352c5
- upload exporter: added the --upload option to upload the HTML report to an online URL, by default crawler.siteone.io/html/*
2a027c3
- parsed-url: fixed warning in the case of url without host
284e844
- seo and opengraph: fixed false positives 'DENY (robots.txt)' in some cases
658b649
- best practices and inline-svgs: detection and display of the entire icon set in the HTML report in the case of an <svg> with multiple <symbol> or <g> elements
3b2772c
- sitemap generator: sort URLs primarily by number of dashes and secondarily alphabetically (thanks to this, URLs of the main levels appear at the beginning); see the sketch after this list
bbc47e6
- sitemap generator: only include URLs from the same domain as the initial URL
9969254
- changelog: updated by 'composer changelog'
0c67fd4
- package.json: used by auto-changelog generator
6ad8789
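The two-level sitemap sort above can be expressed as a short sketch (plain string URLs here, not the crawler's internal URL objects):

```php
<?php
// Hypothetical sketch of the sitemap ordering: primary key is the number
// of dashes in the URL, secondary key is plain alphabetical order.
$urls = [
    'https://example.com/docs/getting-started',
    'https://example.com/about-us',
    'https://example.com/docs',
    'https://example.com/',
];

usort($urls, function (string $a, string $b): int {
    return [substr_count($a, '-'), $a] <=> [substr_count($b, '-'), $b];
});

// Main-level URLs (fewer dashes) now come first.
print_r($urls);
```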
v1.0.6
The primary change is a fix for a bug that in some cases caused the asynchronous request queue to get stuck in the last stage of crawling.
Changes
- readme: removed bold links from the intro (it didn't look as good on github as it did in the IDE)
b675873
- readme: improved intro and gif animation with the real output
fd9e2d6
- http auth: for security reasons, auth data is only sent to the same 2nd-level domain (and its subdomains). With HTTP Basic auth, the username and password are only base64-encoded, so they would otherwise be sent to foreign domains referenced from the crawled website; see the sketch after this list
4bc8a7f
- html report: increased the specificity of the .header class for the page header, because the same class was also used by <td class='header'> in the security tab
9d270e8
- html report: improved readability of badge colors in light mode
76c5680
- crawler: moved the decrement of active workers until after URLs are parsed from the content, where the queue can still be refilled (previously, queue processing could sometimes get stuck in the final stages because of this)
f8f82ab
- analysis: do not parse/check empty HTML (it produced an unnecessary warning) - it is valid to have content-type: text/html with content-length: 0 (for example 'gtm.js?id=')
436d81b
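A minimal sketch of the same-2nd-level-domain rule from the 'http auth' item above (the function name is hypothetical, and real code would also need to handle multi-part public suffixes such as .co.uk):

```php
<?php
// Hypothetical sketch: send HTTP Basic auth only to the initial URL's
// 2nd-level domain and its subdomains, never to foreign domains.
function isSameSecondLevelDomain(string $initialHost, string $targetHost): bool
{
    $secondLevel = implode('.', array_slice(explode('.', $initialHost), -2));
    return $targetHost === $secondLevel
        || str_ends_with($targetHost, '.' . $secondLevel);
}

var_dump(isSameSecondLevelDomain('www.mydomain.tld', 'cdn.mydomain.tld')); // true
var_dump(isSameSecondLevelDomain('www.mydomain.tld', 'foreign.tld'));      // false
```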
v1.0.5
The first version that is already used by the Electron-based desktop application https://github.com/janreges/siteone-crawler-gui
Changes
- option: replace placeholders like '%domain%' also in the validateValue() method, because it also checks whether the path is writable by attempting mkdir
329143f
- swoole in cygwin: improved getBaseDir() to work better even with the version of Swoole that does not have SCRIPT_DIR
94cc5af
- html processor: pages with a redirect must also be processed, because the URL in the meta redirect tag needs to be replaced
9ce0eee
- sitemap: use the formatted output path (primarily for better output in the Cygwin environment, which needs the C:/foo <-> /cygdrive/c/foo conversion)
6297a7f
- file exporter: use the formatted output path (primarily for better output in the Cygwin environment, which needs the C:/foo <-> /cygdrive/c/foo conversion)
426cfb2
- options: in the case of dir/file validation, we want to work with absolute paths for more precise error messages
6df228b
- crawler.php: improved baseDir detection - we want to work with an absolute path in all scenarios
9d1b2ce
- utils: improved getAbsolutePath() for Cygwin and added getOutputFormattedPath() with the reverse logic for Cygwin (C:/foo/bar <-> /cygdrive/c/foo/bar); see the sketch after this list
161cfc5
- offline export: renamed --offline-export-directory to --offline-export-dir for consistency with --http-cache-dir or --result-storage-dir
26ef45d
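The two-way path conversion above can be sketched as follows (the function bodies are illustrative; only the behavior mirrors the changelog entries):

```php
<?php
// Hypothetical sketch of the Windows <-> Cygwin path conversion:
// C:/foo/bar <-> /cygdrive/c/foo/bar

function toCygwinPath(string $windowsPath): string
{
    if (preg_match('~^([A-Za-z]):[/\\\\](.*)$~', $windowsPath, $m)) {
        return '/cygdrive/' . strtolower($m[1]) . '/' . str_replace('\\', '/', $m[2]);
    }
    return $windowsPath; // already a POSIX-style path
}

function toWindowsPath(string $cygwinPath): string
{
    if (preg_match('~^/cygdrive/([a-z])/(.*)$~', $cygwinPath, $m)) {
        return strtoupper($m[1]) . ':/' . $m[2];
    }
    return $cygwinPath;
}

echo toCygwinPath('C:/foo/bar'), "\n";           // /cygdrive/c/foo/bar
echo toWindowsPath('/cygdrive/c/foo/bar'), "\n"; // C:/foo/bar
```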
Changes in 1.0.4 (skipped release)
- dom parsing: handle warnings when some DOM elements cannot be parsed correctly, fixes #3
#3
- version: 1.0.4.20231201 + changelog
8e15781
- options: ignore empty values for directives that can be defined repeatedly
5e30c2f
- http-cache: now the http cache is turned off using the 'off' value (it's more understandable)
9508409
- core options: added --console-width to enforce the definition of the console width and disable automatic detection via 'tput cols' on macOS/Linux or 'mode con' on Windows (used by Electron GUI)
8cf44b0
- gui support: added base-dir detection for Windows where the GUI crawler runs in Cygwin
5ce893a
- renaming: renamed 'siteone-website-crawler' to 'siteone-crawler' and 'SiteOne Website Crawler' to 'SiteOne Crawler'
64ddde4
- utils: fixed color-support detection
62dbac0
- core options: added the --force-color option to bypass TTY detection (used by the Electron GUI)
607b4ad
- best practice analysis: when checking an image (e.g. for the existence of WebP/AVIF), external images are also checked, because websites very often link images from external domains or from image modification/optimization services
6100187
- html report: set scaleDown as default object-fit for image gallery
91cd300
- offline exporter: added short -oed as alias to --offline-export-directory
22368d9
- image gallery: a list of all images on the website (except those from srcset, which would only be duplicates in other sizes or formats), including SVG, with rich filtering options (by image format, size and source tag/attribute) and a choice of small/medium/view and scale-down/contain/cover for the object-fit CSS property
43de0af
- core options: added a shortened version of each option name consisting of one hyphen and the first letters of the words of the full option (e.g. --memory-limit has the short version -ml), added getInitialScheme(); see the sketch at the end of this list
eb9a3cc
- visited url: added 'sourceAttr' with information about where the given URL was found and useful helper methods
6de4e39
- found urls: when one URL occurs in several places/attributes, the first occurrence is considered the main one (typically the same URL in src and then again in srcset)
660bb2b
- url parsing: added better recognition of which attribute a given URL was parsed from (in particular, src and srcset need to be distinguished for the ImageGallery)
802c3c6
- supertable and urls: when removing the redundant hostname for more compact URL output, the scheme (http:// or https://) of the initial URL is now also taken into account (otherwise some URLs looked like duplicates) + prevention of bash ANSI-color definitions leaking into the HTML output
915469e
- title/description/keywords parsing: added HTML entity decoding, because some websites use entities for characters such as í, –, etc.
920523d
- crawler: added 'sourceAttr' to the Swoole table queue and the visited URLs (it will be used in the Image Gallery for filtering, so that duplicate images from srcsets, differing only in resolution, are not displayed unnecessarily)
0345abc
- url parameter: the scheme can now be omitted and https:// or http:// is added automatically (http:// e.g. for localhost)
85e14e9
- disabled images: when image removal is requested, replace their body with a 1x1px transparent GIF and show a semi-transparent hatch pattern with the crawler logo as the background
c1418c3
- url regex filtering: added an option that limits the list of crawled pages according to the declared regexps, while still allowing assets (JS, CSS, images, fonts, documents, etc.) to be crawled and downloaded from any URL (with respect to allowed domains)
21e67e5
- img srcset parsing: a valid URL can itself contain a comma (various dynamic parametric image generators use them), while in srcset a comma followed by whitespace should separate multiple values - the srcset parsing now reflects this
0db578b
- websocket server: added the --websocket-server option, which starts a parallel process with a websocket server through which the crawler sends various information about crawling progress (this will also be used by the Electron UI applications)
649132f
- http client: handle scenario when content loaded from cache is not valid (is_bool)
1ddd099
- HTML report: updated logo with final look
2a3bb42
- mailer: shortening and simplifying email content
e797107
- robots.txt: added info about loaded robots.txt to summary (limited to 10 domains for case of huge multi domain crawling)
00f9365
- redirects analyzer: handled edge case with empty url
e9be1e3
- text output: added fancy banner with crawler logo (thanks to great Sit...
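The short-option derivation described in the 'core options' item above can be sketched like this (the function name is hypothetical):

```php
<?php
// Hypothetical sketch: derive the short option from the long one,
// e.g. --memory-limit -> -ml, --offline-export-directory -> -oed
function deriveShortOption(string $longOption): string
{
    $name = ltrim($longOption, '-');   // 'memory-limit'
    $initials = array_map(
        fn (string $word): string => $word[0],
        explode('-', $name)
    );
    return '-' . implode('', $initials);
}

echo deriveShortOption('--memory-limit'), "\n";             // -ml
echo deriveShortOption('--offline-export-directory'), "\n"; // -oed
```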
v1.0.3
Demo video: https://www.youtube.com/watch?v=qEiSTpb66nA
- cache/storage: better race-condition handling for the situation where several coroutines could create the same folder at the same time and mkdir then reported 'File exists'; see the sketch below
be543dc
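A minimal sketch of the usual race-safe pattern for this situation (an assumption about the approach, not necessarily the exact fix): re-check is_dir() after a failed mkdir() instead of treating 'File exists' as fatal.

```php
<?php
// Hypothetical sketch: race-safe directory creation. If two coroutines
// race on mkdir(), the loser re-checks is_dir() instead of failing.
function ensureDirectory(string $dir): void
{
    if (!is_dir($dir) && !@mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Unable to create directory: {$dir}");
    }
}
```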
v1.0.2
10 November 2023
- version: 1.0.2.20231110 + changelog
230b947
- html report: added aria labels to active/important elements
a329b9d
- version: 1.0.1.20231109 - changelog
50dc69c
v1.0.1
9 November 2023
- version: 1.0.1.20231109
e213cb3
- offline exporter: fixed the case where an https:// website links to the same path with the http:// protocol (it overrode the proper *.html file with just a meta redirect .. a real case from nextjs.org)
4a1be0b
- html processor: forcibly remove all anchor listeners when NextJS is detected (it is very hard to achieve a working NextJS site over the offline file:// protocol)
2b1d935
- file exporters: the crawler now generates HTML/JSON/TXT reports by default to 'tmp/[report|output].%domain%.%datetime%.[html|json|txt]' .. I assume most people will want to save/see them
7831c6b
- security analysis: removed multi-line console output for recommendations .. it was ugly
310af30
- json output: added JSON_UNESCAPED_UNICODE for unescaped unicode chars (e.g. czech chars will be readable)
cf1de9f
- mailer: do not send e-mails when the crawler is interrupted with Ctrl+C
19c94aa
- refactoring: manager stats logic extracted into ManagerStats and also implemented in the content-processor manager + stats added to the 'Crawler stats' tab in the HTML report
3754200
- refactoring: content-related logic extracted into content processors based on the ContentProcessor interface with the methods findUrls():?FoundUrls, applyContentChangesForOfflineVersion():void and isContentTypeRelevant():bool + better separation of web-framework-related logic (NextJS, Astro, Svelte, ...) + better URL handling and maximized usage of ParsedUrl
6d9f25c
- phpstan: ignore BASE_DIR warning
6e0370a
- offline website exporter: improved export of websites based on NextJS, but it's not perfect, because the latest NextJS versions do not have some JS/CSS paths in the code - they are generated dynamically from arrays/objects
c4993ef
- seo analyzer: fixed trim() warning when no <h1> found
f0c526f
- offline export: a lot of improvements when generating the offline version of the website on NextJS - chunk detection from the manifest, replacing paths, etc.
98c2e15
- seo and og: fixed division by zero when no og/twitter tags found
19e4259
- console output: lots of improvements for nice, consistent and minimal word-wrap output
596a5dc
- basic file/dir structure: created ./crawler (for Linux/macOS) and ./crawler.bat for Windows, init script moved to ./src, small related changes about file/dir path building
5ce41ee
- header status: ignore too dynamic Content-Disposition header
4e0c6fd
- offline website exporter: append the .html extension to files with typical dynamic-language extensions, because without it the browser would show them as source code
7130b9e
- html report: show tables with details, even if they are without data (it is good to know that the checks were carried out, but nothing was found)
da019e4
- tests: repaired tests after the last changes to file/URL building for the offline website .. merlot is great!
7c77c41
- utils: be more precise and do not replace attributes in SVG .. creative designers will not love you when they see broken SVGs in the HTML report
3fc81bb
- utils: be more precise in parsing phone numbers, otherwise people will 'love' you because of false positives .. wine is still great
51fd574
- html parser: better support for formatted html with tags/attributes on multiple lines
89a36d2
- utils: don't be hungry in stripJavaScript() because you ate half of my html :) wine is already in my head...
0e00957
- file result storage: changed cache directory structure for consistency with http client's cache, so it looks like my.domain.tld-443/04/046ec07c.cache
26bf428
- http client cache: for better consistency with the result storage cache, the directory structure now also contains the port, so it looks like my.domain.tld-443/b9/b989bdcf2b9389cf0c8e5edb435adc05.cache; see the sketch after this list
a0b2e09
- http client cache: improved directory structure for large-scale crawls and easier partial cache deletion .. current structure in the tmp dir: my.domain.tld/b9/b989bdcf2b9389cf0c8e5edb435adc05.cache
10e02c1
- offline website exporter: better srcset handling - urls can be defined with or without sizes
473c1ad
- html report: blue color for search term, looks better
cb47df9
- offline website exporter: handled the same-name folder/file situation where both the folder /foo/next.js/ and the file /foo/next.js existed on the website (a real case from vercel.com)
7c27d2c
- exporters: added exec times to summary messages
41c8873
- crawler: use the port from the URL if defined, otherwise derive it from the scheme .. the previous solution didn't work properly for localhost:port and for parsed URLs pointing to external websites
324ba04
- heading analysis: changed sorting to DESC by errors, renamed Headings structure -> Heading structure
dbc1a38
- security analysis: detection and ignoring of URLs that point to a non-existent static file but return 404 HTML, better description
193fb7d
- super table: added the escapeOutputHtml property to columns for better escaping control + updated the related supertables
bfb901c
- headings analysis: replaced the usage of DOMNode->textContent, because when headings contain other tags, including <script>, textContent also contains the JS code, but without the <script> tag
5c426c2
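The cache-path scheme described in the 'http client cache' items above (host and port as the top-level folder, the first two characters of the URL's md5 hash as a fan-out subfolder) can be sketched like this (the helper name is hypothetical):

```php
<?php
// Hypothetical sketch: build a cache file path like
// my.domain.tld-443/b9/b989bdcf2b9389cf0c8e5edb435adc05.cache
function buildCachePath(string $host, int $port, string $urlKey): string
{
    $hash = md5($urlKey);
    return sprintf('%s-%d/%s/%s.cache', $host, $port, substr($hash, 0, 2), $hash);
}

echo buildCachePath('my.domain.tld', 443, 'https://my.domain.tld/page'), "\n";
```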
v1.0.0
The first, but already very functional and quite decently tuned, version. However, we expect to need to fix some bugs related to specific use cases.
In the release packages, you can already find the crawler (or crawler.bat) executable binary, and you can run the crawler with ./crawler --url=https://...
We believe that our crawler will make you happy and bring benefits for quality websites ♥
Full Changelog: https://github.com/janreges/siteone-website-crawler/commits/v1.0.0