Improve ZnURL parsing of extreme URL #100

svenvc · 2022-09-09T15:03:01Z

The article https://daniel.haxx.se/blog/2022/09/08/http-http-http-http-http-http-http/ discusses the parsing of an extreme URL, namely http://http://http://@http://http://?http://#http:// which should result in the following recognised parts:

the scheme http
the user name http
the password //http://
the host http
the default port 80
the path //http://
the query key http://
the fragment http://

We could probably improve our current result.

svenvc · 2022-09-12T12:57:44Z

OK, I studied this in more detail and I believe the URL is not valid under RFC 3986, as already described in the section by that title in the original post. The password is not valid: its slashes have to be percent-encoded.

Given the following input (which is also what ZnUrl would render back), the example works:

'http://http:%2f%2fhttp:%2f%2f@http://http://?http://#http://' asUrl.

See https://www.rfc-editor.org/rfc/rfc3986#section-3.2 and the full syntax at the end.
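A minimal sketch for inspecting what ZnUrl makes of the percent-encoded input, meant to be run in a playground with Zinc loaded. It only uses ZnUrl's regular accessors and claims no results; inspect the resulting array and compare it against the list of parts at the top of this issue.

```smalltalk
| url |
url := 'http://http:%2f%2fhttp:%2f%2f@http://http://?http://#http://' asUrl.
"Collect each parsed part next to its name, for inspection."
{ #scheme -> url scheme.
  #username -> url username.
  #password -> url password.
  #host -> url host.
  #port -> url portOrDefault.
  #pathSegments -> url pathSegments.
  #query -> url query.
  #fragment -> url fragment }
```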

Rinzwind · 2022-09-28T20:30:14Z

As far as I understand the (updated) post: the example is a valid URI, but is parsed incorrectly by curl. Following the ABNF in RFC 3986:

uri := scheme , ':' , hier_part , '?' , query , '#' , fragment.
scheme := 'http'.
hier_part := '//' , authority , path_abempty.
authority := "userinfo , '@' , " host , ':' , port.
host := 'http'.
port := ''.
path_abempty := '/' , '/http:' , '/' , '/@http:' , '/' , '/http:' , '/' , '/'.
query := 'http://'.
fragment := 'http://'.

After executing the assignments in reverse, this is true:

uri = 'http://http://http://@http://http://?http://#http://'

The main confusion seems to be over the part http://http://@, as at first glance one might expect that to get parsed as the optional userinfo "@" part of authority, but that cannot be the case as the rule for userinfo does not allow a slash.
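The point about the authority can be checked by hand in a playground. The sketch below is not how ZnUrl or any grammar works; it only applies the rule that the authority runs from just after the first '//' up to the next '/', '?' or '#' (in this URI a '/' comes first), using plain String methods.

```smalltalk
| uri afterScheme authority |
uri := 'http://http://http://@http://http://?http://#http://'.
"Drop the leading 'http://' (scheme, colon and the two slashes)."
afterScheme := uri allButFirst: 'http://' size.
"The authority ends at the first slash; no '?' or '#' comes earlier here."
authority := afterScheme copyUpTo: $/.
{ authority.                  "'http:'"
  authority includes: $@.     "false, so there is no userinfo"
  authority copyUpTo: $:.     "'http', the host"
  authority copyAfter: $: }   "'', the empty port"
```

Everything after that first slash, including the @, therefore belongs to the path.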

Rinzwind · 2023-05-05T06:56:34Z

Given below: a PetitParser grammar for the ‘URI’ rule from RFC 3986 (written to learn how to write such grammars, beware it might contain mistakes). The following, using the example of this issue, is true:

PPParser uriParser end matches: 'http://http://http://@http://http://?http://#http://'

The grammar, as a PPParser instantiation method:

uriParser

	<sampleInstance>

	| uri hierPart scheme authority userinfo host port ipLiteral ipvFuture ipv6Address h16 ls32
		ipv4Address decOctet regName pathAbempty pathAbsolute pathRootless pathEmpty segment segmentNz
		pchar query fragment pctEncoded unreserved subDelims alpha digit hexdig |
	
	thisContext tempNames do: [ :tempName |
		"Transcript tab; show: tempName , ' := PPDelegateParser new.'; cr."
		thisContext tempNamed: tempName put: PPDelegateParser new ].
	
	uri setParser: scheme , $: asParser , hierPart , ($? asParser , query) optional , ($# asParser , fragment) optional.
	
	hierPart setParser: ('//' asParser , authority , pathAbempty) / pathAbsolute / pathRootless / pathEmpty.
	
	scheme setParser: alpha , (alpha / digit / $+ asParser / $- asParser / $. asParser) star.
	
	authority setParser: (userinfo , $@ asParser) optional , host , ($: asParser , port) optional.
	userinfo setParser: (unreserved / pctEncoded / subDelims / $: asParser) star.
	host setParser: ipLiteral / ipv4Address / regName.
	port setParser: digit star.
	
	ipLiteral setParser: $[ asParser , (ipv6Address / ipvFuture) , $] asParser.
	
	ipvFuture setParser: #($v $V) asParser , hexdig plus , $. asParser , (unreserved / subDelims / $: asParser) plus.
	
	ipv6Address setParser: (((h16 , $: asParser) times: 6) , ls32)
		/ ('::' asParser , ((h16 , $: asParser) times: 5) , ls32)
		/ (h16 optional , '::' asParser , ((h16 , $: asParser) times: 4) , ls32)
		/ ((((h16 , $: asParser) max: 1) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 3) , ls32)
		/ ((((h16 , $: asParser) max: 2) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 2) , ls32)
		/ ((((h16 , $: asParser) max: 3) , h16) optional , '::' asParser , h16 , $: asParser , ls32)
		/ ((((h16 , $: asParser) max: 4) , h16) optional , '::' asParser , ls32)
		/ ((((h16 , $: asParser) max: 5) , h16) optional , '::' asParser , h16)
		/ ((((h16 , $: asParser) max: 6) , h16) optional , '::' asParser).
	
	h16 setParser: (hexdig min: 1 max: 4).
	ls32 setParser: (h16 , $: asParser , h16) / ipv4Address.
	ipv4Address setParser: decOctet , $. asParser , decOctet , $. asParser , decOctet , $. asParser , decOctet.
	
	"Ordered choice takes the first alternative that matches, so the longest alternatives go first."
	decOctet setParser: ('25' asParser , ($0 to: $5) asParser)
		/ ($2 asParser , ($0 to: $4) asParser , digit)
		/ ($1 asParser , (digit times: 2))
		/ (($1 to: $9) asParser , digit)
		/ digit.
	
	regName setParser: (unreserved / pctEncoded / subDelims) star.
	
	pathAbempty setParser: ($/ asParser , segment) star.
	pathAbsolute setParser: $/ asParser , (segmentNz , ($/ asParser , segment) star) optional.
	pathRootless setParser: segmentNz , ($/ asParser , segment) star.
	pathEmpty setParser: (pchar times: 0).
	
	segment setParser: pchar star.
	segmentNz setParser: (pchar min: 1).
	
	pchar setParser: unreserved / pctEncoded / subDelims / $: asParser / $@ asParser.
	
	query setParser: (pchar / $/ asParser / $? asParser) star.
	
	fragment setParser: (pchar / $/ asParser / $? asParser) star.
	
	pctEncoded setParser: $% asParser , hexdig , hexdig.
	
	unreserved setParser: alpha / digit / $- asParser / $. asParser / $_ asParser / $~ asParser.
	subDelims setParser: $! asParser / $$ asParser / $& asParser / $' asParser / $( asParser / $) asParser
		/ $* asParser / $+ asParser / $, asParser / $; asParser / $= asParser.
	
	alpha setParser: ($A to: $Z) asParser / ($a to: $z) asParser.
	digit setParser: ($0 to: $9) asParser.
	hexdig setParser: digit / ($A to: $F) asParser / ($a to: $f) asParser.
	
	^ uri

svenvc · 2023-05-05T08:38:47Z

Hi,

That looks nice! Sadly, I am no PP user myself.

What I would suggest then is to start a new independent subproject, say Zinc-PPUriParser that depends on PP and holds your experiment. It would then offer an alternative way to parse a URI/URL into a ZnUrl. We can add tests and let people test it and see where that leads us.

But I would not make this part of the default group in the BaselineOf. There are many similar packages in Zinc's repository.

What do you think?

Sven

Rinzwind · 2023-05-06T21:37:51Z

I’ll have to see if I can get around to turning it into a more full-fledged project. For now, I did take it one step further by adding the two methods given below. The first adds an action to break the URI down into its components using a regular expression as described in appendix B in RFC 3986. The second adds a different action which additionally breaks the authority and path components into subcomponents: the userinfo, host and port of the authority, and the segments of the path. A basic example using both methods:

'scheme://userinfo@host:987/segmentA/segmentB/segmentC?query#fragment' in: [ :uri |
  self assert: (PPParser uriComponentsParser end parse: uri)
    = #('scheme' 'userinfo@host:987' '/segmentA/segmentB/segmentC' 'query' 'fragment').
  self assert: (PPParser uriSubcomponentsParser end parse: uri)
    = #('scheme' #('userinfo' 'host' '987') #('' 'segmentA' 'segmentB' 'segmentC') 'query' 'fragment') ].

The example of this issue:

'http://http://http://@http://http://?http://#http://' in: [ :uri |
  self assert: (PPParser uriComponentsParser end parse: uri)
    = #('http' 'http:' '//http://@http://http://' 'http://' 'http://').
  self assert: (PPParser uriSubcomponentsParser end parse: uri)
    = #('http' #(nil 'http' '') #('' '' 'http:' '' '@http:' '' 'http:' '' '') 'http://' 'http://') ].

I found RFC 3986 somewhat confusing when it comes to path segments: that “a path consists of a sequence of path segments separated by a slash character” as stated in section 3.3 is not reflected in the grammar. Consider the path /. The statement implies it has two segments, both of which are empty. The grammar rules would, on the other hand, seem to imply a different, but inconsistent, number of segments: a single segment which is empty per path-abempty and no segments per path-absolute. To clarify, those two rules could be rewritten with the inclusion of a rule segment-z which matches nothing (similar to path-empty):

path-abempty  = [ segment-z 1*( "/" segment ) ]
path-absolute = segment-z "/" ( segment-z / ( segment-nz *( "/" segment ) ) )
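To make the counting behind the section 3.3 statement concrete: if segments are what the slashes separate, a path with n slashes has n + 1 segments. A small sketch using only plain String methods; the empty path is left out on purpose, since the tests below treat it as having no segments at all.

```smalltalk
"Count segments as 'number of slashes + 1' for a few non-empty paths."
#('/' '/p1/p2' '/p1/p2/' 'p1/p2') collect: [ :path |
  path -> ((path occurrencesOf: $/) + 1) ]
"'/' -> 2, '/p1/p2' -> 3, '/p1/p2/' -> 4, 'p1/p2' -> 2"
```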

A few tests for #uriSubcomponentsParser:

{ 's:' -> #('s' nil #() nil nil).
  's:/' -> #('s' nil #('' '') nil nil).
  's://h' -> #('s' #(nil 'h' nil) #() nil nil).
  's://h/' -> #('s' #(nil 'h' nil) #('' '') nil nil).
  's:p1/p2' -> #('s' nil #('p1' 'p2') nil nil).
  's:p1/p2/' -> #('s' nil #('p1' 'p2' '') nil nil).
  's:/p1/p2' -> #('s' nil #('' 'p1' 'p2') nil nil).
  's:/p1/p2/' -> #('s' nil #('' 'p1' 'p2' '') nil nil).
  's://h/p1/p2' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2') nil nil).
  's://h/p1/p2/' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2' '') nil nil).
  's://%22@[::]/%3C?%3E#%5C' -> #('s' #('%22' '[::]' nil) #('' '%3C') '%3E' '%5C').
  's://[1:2:3:4:5:6:7:8]' -> #('s' #(nil '[1:2:3:4:5:6:7:8]' nil) #() nil nil).
} asOrderedDictionary keysAndValuesDo: [ :uri :expectedComponents |
  self assert: (PPParser uriSubcomponentsParser end parse: uri) = expectedComponents ].

A remaining problem that I know of is that the parsing of IPv6 literal addresses does not fully work. The example from section 1.1.2 in RFC 3986 that uses an IPv6 address fails to parse. See: moosetechnology/PetitParser#67.

The two added methods:

uriComponentsParser

  ^ self uriParser flatten ==> [ :wellFormedURI |
    (RxMatcher forString: '^(([^:/?#]+)\:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?') in: [ :componentsMatcher |
      self assert: (componentsMatcher matches: wellFormedURI).
      #(2 4 5 7 9) collect: [ :n | componentsMatcher subexpression: 1 + n ] ] ]

uriSubcomponentsParser

  ^ self uriParser ==> [ :nodes |
      | flattenBlock |
      flattenBlock := [ :nodeToFlatten |
        nodeToFlatten ifNil: [ '' ] ifNotNil: [
          nodeToFlatten isArray ifFalse: [ nodeToFlatten asString ] ifTrue: [
            '' join: (nodeToFlatten collect: flattenBlock) ] ] ].
      nodes third in: [ :hierPart |
        ((hierPart notEmpty and: [ hierPart first = '//' ]) ifTrue: [ hierPart second ]) in: [ :authority | {
          nodes first in: [ :scheme | flattenBlock value: scheme ].
          authority ifNotNil: [ {
            authority first ifNotNil: [ :userInfo | flattenBlock value: userInfo first ].
            authority second in: [ :host | flattenBlock value: host ].
            authority third ifNotNil: [ :port | flattenBlock value: port second ] } ].
          (authority ifNotNil: [ hierPart third ifNotEmpty: [ :segments | #(()) , (segments collect: #second) ] ] ifNil: [
            hierPart ifNotEmpty: [
              ((hierPart first = $/) ifTrue: [ #(()) ] ifFalse: [ #() ]) ,
              (((hierPart first = $/) ifTrue: [ hierPart second ifNil: [ #(() ()) ] ] ifFalse: [ hierPart ]) in: [ :segments |
                { segments first } , (segments second collect: #second) ]) ] ])
            collect: [ :segments | flattenBlock value: segments ].
          nodes fourth ifNotNil: [ :query | flattenBlock value: query second ].
          nodes fifth ifNotNil: [ :fragment | flattenBlock value: fragment second ] } ] ] ]

Rinzwind mentioned this issue May 5, 2023

#pathSegments returns segments that are either Character or ByteString #63

Open

Rinzwind mentioned this issue Sep 12, 2023

WAUrl>>#initializeFromString: error when query parameters include scheme SeasideSt/Seaside#1216

Closed
