Improve ZnURL parsing of extreme URL #100

Open
svenvc opened this issue Sep 9, 2022 · 5 comments

Comments

@svenvc
Owner

svenvc commented Sep 9, 2022

The article https://daniel.haxx.se/blog/2022/09/08/http-http-http-http-http-http-http/ discusses the parsing of an extreme URL, namely http://http://http://@http://http://?http://#http://, which should result in the following recognised parts:

  • the scheme http
  • the user name http
  • the password //http://
  • the host http
  • the default port 80
  • the path //http://
  • the query key http://
  • the fragment http://

We could probably improve our current result.

@svenvc
Owner Author

svenvc commented Sep 12, 2022

OK, I studied this in a bit more detail and I believe the URL is not valid per RFC 3986, as already described in the section by that title in the original post. The password is not valid: its slashes have to be percent-encoded.

Given the following input (which is also what ZnUrl would render back), the example works:

'http://http:%2f%2fhttp:%2f%2f@http://http://?http://#http://' asUrl.

See https://www.rfc-editor.org/rfc/rfc3986#section-3.2 and the full syntax at the end.
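
As a cross-check outside Pharo, here is a minimal sketch in Python (standard library only) that splits the percent-encoded input above with `urllib.parse.urlsplit`. It is not ZnUrl and its rules differ in places: for instance, it reports the empty port as `None` instead of applying the default port 80. Treat it only as an illustration of how the parts line up once the slashes in the password are encoded.

```python
from urllib.parse import urlsplit, unquote

# The percent-encoded form of the extreme URL, as given in the comment above.
url = "http://http:%2f%2fhttp:%2f%2f@http://http://?http://#http://"
parts = urlsplit(url)

print(parts.scheme)             # http
print(parts.username)           # http
print(unquote(parts.password))  # //http://  (percent-decoded password)
print(parts.hostname)           # http
print(parts.port)               # None (the port after ':' is empty)
print(parts.path)               # //http://
print(parts.query)              # http://
print(parts.fragment)           # http://
```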

@Rinzwind

As far as I understand the (updated) post: the example is a valid URI, but is parsed incorrectly by curl. Following the ABNF in RFC 3986:

uri := scheme , ':' , hier_part , '?' , query , '#' , fragment.
scheme := 'http'.
hier_part := '//' , authority , path_abempty.
authority := "userinfo , '@' , " host , ':' , port.
host := 'http'.
port := ''.
path_abempty := '/' , '/http:' , '/' , '/@http:' , '/' , '/http:' , '/' , '/'.
query := 'http://'.
fragment := 'http://'.

Reading these as Smalltalk assignments (the double-quoted userinfo part is a comment, so it is left out) and executing them in reverse order, this is true:

uri = 'http://http://http://@http://http://?http://#http://'

The main confusion seems to be over the part http://http://@, as at first glance one might expect that to get parsed as the optional userinfo "@" part of authority, but that cannot be the case as the rule for userinfo does not allow a slash.
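
The same split can be reproduced mechanically with the component regular expression from appendix B of RFC 3986. The sketch below is in Python rather than Pharo, and the helper name `split_uri` is made up for illustration. It shows that the authority ends at the first slash after `//`, so the `@` ends up in the path instead of delimiting a userinfo part.

```python
import re

# Component-splitting regular expression from RFC 3986, appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

uri = "http://http://http://@http://http://?http://#http://"
scheme, authority, path, query, fragment = split_uri(uri)

print(authority)  # http:   (host 'http' followed by an empty port)
print(path)       # //http://@http://http://
```

Note that the regular expression only splits a string into components; it does not check that each component is well formed.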

@Rinzwind

Rinzwind commented May 5, 2023

Given below: a PetitParser grammar for the ‘URI’ rule from RFC 3986 (written to learn how to write such grammars, beware it might contain mistakes). The following, using the example of this issue, is true:

PPParser uriParser end matches: 'http://http://http://@http://http://?http://#http://' 

The grammar, as a PPParser instantiation method:

uriParser

	<sampleInstance>

	| uri hierPart scheme authority userinfo host port ipLiteral ipvFuture ipv6Address h16 ls32
		ipv4Address decOctet regName pathAbempty pathAbsolute pathRootless pathEmpty segment segmentNz
		pchar query fragment pctEncoded unreserved subDelims alpha digit hexdig |
	
	thisContext tempNames do: [ :tempName |
		"Transcript tab; show: tempName , ' := PPDelegateParser new.'; cr."
		thisContext tempNamed: tempName put: PPDelegateParser new ].
	
	uri setParser: scheme , $: asParser , hierPart , ($? asParser , query) optional , ($# asParser , fragment) optional.
	
	hierPart setParser: ('//' asParser , authority , pathAbempty) / pathAbsolute / pathRootless / pathEmpty.
	
	scheme setParser: alpha , (alpha / digit / $+ asParser / $- asParser / $. asParser) star.
	
	authority setParser: (userinfo , $@ asParser) optional , host , ($: asParser , port) optional.
	userinfo setParser: (unreserved / pctEncoded / subDelims / $: asParser) star.
	host setParser: ipLiteral / ipv4Address / regName.
	port setParser: digit star.
	
	ipLiteral setParser: $[ asParser , (ipv6Address / ipvFuture) , $] asParser.
	
	ipvFuture setParser: #($v $V) asParser , hexdig plus , $. asParser , (unreserved / subDelims / $: asParser) plus.
	
	ipv6Address setParser: (((h16 , $: asParser) times: 6) , ls32)
		/ ('::' asParser , ((h16 , $: asParser) times: 5) , ls32)
		/ (h16 optional , '::' asParser , ((h16 , $: asParser) times: 4) , ls32)
		/ ((((h16 , $: asParser) max: 1) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 3) , ls32)
		/ ((((h16 , $: asParser) max: 2) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 2) , ls32)
		/ ((((h16 , $: asParser) max: 3) , h16) optional , '::' asParser , h16 , $: asParser , ls32)
		/ ((((h16 , $: asParser) max: 4) , h16) optional , '::' asParser , ls32)
		/ ((((h16 , $: asParser) max: 5) , h16) optional , '::' asParser , h16)
		/ ((((h16 , $: asParser) max: 6) , h16) optional , '::' asParser).
	
	h16 setParser: (hexdig min: 1 max: 4).
	ls32 setParser: (h16 , $: asParser , h16) / ipv4Address.
	ipv4Address setParser: decOctet , $. asParser , decOctet , $. asParser , decOctet , $. asParser , decOctet.
	
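	"Longest alternatives first: the choice is ordered, so a leading digit alternative would accept only the first digit of a longer octet."
	decOctet setParser: ('25' asParser , ($0 to: $5) asParser)
		/ ($2 asParser , ($0 to: $4) asParser , digit)
		/ ($1 asParser , (digit times: 2))
		/ (($1 to: $9) asParser , digit)
		/ digit.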
	
	regName setParser: (unreserved / pctEncoded / subDelims) star.
	
	pathAbempty setParser: ($/ asParser , segment) star.
	pathAbsolute setParser: $/ asParser , (segmentNz , ($/ asParser , segment) star) optional.
	pathRootless setParser: segmentNz , ($/ asParser , segment) star.
	pathEmpty setParser: (pchar times: 0).
	
	segment setParser: pchar star.
	segmentNz setParser: (pchar min: 1).
	
	pchar setParser: unreserved / pctEncoded / subDelims / $: asParser / $@ asParser.
	
	query setParser: (pchar / $/ asParser / $? asParser) star.
	
	fragment setParser: (pchar / $/ asParser / $? asParser) star.
	
	pctEncoded setParser: $% asParser , hexdig , hexdig.
	
	unreserved setParser: alpha / digit / $- asParser / $. asParser / $_ asParser / $~ asParser.
	subDelims setParser: $! asParser / $$ asParser / $& asParser / $' asParser / $( asParser / $) asParser
		/ $* asParser / $+ asParser / $, asParser / $; asParser / $= asParser.
	
	alpha setParser: ($A to: $Z) asParser / ($a to: $z) asParser.
	digit setParser: ($0 to: $9) asParser.
	hexdig setParser: digit / ($A to: $F) asParser / ($a to: $f) asParser.
	
	^ uri

@svenvc
Owner Author

svenvc commented May 5, 2023

Hi,

That looks nice! Sadly, I am not a PP user myself.

What I would suggest then is to start a new independent subproject, say Zinc-PPUriParser that depends on PP and holds your experiment. It would then offer an alternative way to parse a URI/URL into a ZnUrl. We can add tests and let people test it and see where that leads us.

But I would not make this part of the default group in the BaselineOf. There are many similar packages in Zinc's repository.

What do you think?

Sven

@Rinzwind

Rinzwind commented May 6, 2023

I’ll have to see if I can get around to turning it into a more full-fledged project. For now, I did take it one step further by adding the two methods given below. The first adds an action to break the URI down into its components using a regular expression as described in appendix B in RFC 3986. The second adds a different action which additionally breaks the authority and path components into subcomponents: the userinfo, host and port of the authority, and the segments of the path. A basic example using both methods:

'scheme://userinfo@host:987/segmentA/segmentB/segmentC?query#fragment' in: [ :uri |
  self assert: (PPParser uriComponentsParser end parse: uri)
    = #('scheme' 'userinfo@host:987' '/segmentA/segmentB/segmentC' 'query' 'fragment').
  self assert: (PPParser uriSubcomponentsParser end parse: uri)
    = #('scheme' #('userinfo' 'host' '987') #('' 'segmentA' 'segmentB' 'segmentC') 'query' 'fragment') ].

The example of this issue:

'http://http://http://@http://http://?http://#http://' in: [ :uri |
  self assert: (PPParser uriComponentsParser end parse: uri)
    = #('http' 'http:' '//http://@http://http://' 'http://' 'http://').
  self assert: (PPParser uriSubcomponentsParser end parse: uri)
    = #('http' #(nil 'http' '') #('' '' 'http:' '' '@http:' '' 'http:' '' '') 'http://' 'http://') ].
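
The authority breakdown in the examples above (userinfo, host, port) can be mimicked with a small regular expression. This is a sketch in Python, not the PetitParser action itself. The name `split_authority` and the pattern are assumptions made for illustration, and the pattern is deliberately lenient: it splits an authority but does not validate it against the RFC 3986 character rules.

```python
import re

# Hypothetical pattern: optional "userinfo@", then either a bracketed IP
# literal (which may contain ':') or a run without ':', then an optional ":port".
AUTHORITY_RE = re.compile(r"(?:([^@]*)@)?(\[[^\]]*\]|[^:]*)(?::(\d*))?")

def split_authority(authority):
    """Return (userinfo, host, port); absent userinfo or port is None."""
    m = AUTHORITY_RE.fullmatch(authority)
    if m is None:
        raise ValueError(f"cannot split authority: {authority!r}")
    return m.group(1), m.group(2), m.group(3)

print(split_authority("userinfo@host:987"))  # ('userinfo', 'host', '987')
print(split_authority("http:"))              # (None, 'http', '')
print(split_authority("%22@[::]"))           # ('%22', '[::]', None)
```

The distinction between an empty port (`''`, a colon with nothing after it) and an absent port (`None`) mirrors the `''` versus `nil` results in the assertions above.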

I found RFC 3986 somewhat confusing when it comes to path segments: that “a path consists of a sequence of path segments separated by a slash character” as stated in section 3.3 is not reflected in the grammar. Consider the path /. The statement implies it has two segments, both of which are empty. The grammar rules would, on the other hand, seem to imply a different, but inconsistent, number of segments: a single segment which is empty per path-abempty and no segments per path-absolute. To clarify, those two rules could be rewritten with the inclusion of a rule segment-z which matches nothing (similar to path-empty):

path-abempty  = [ segment-z 1*( "/" segment ) ]
path-absolute = segment-z "/" ( segment-z / ( segment-nz *( "/" segment ) ) )
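
Under the reading of section 3.3 used here (segments are whatever lies between slashes, and only the empty path has no segments at all), counting segments is a plain split on the slash character. A minimal sketch in Python, with `path_segments` as a hypothetical helper name:

```python
def path_segments(path):
    """Split a path on '/'; the empty path has no segments."""
    return path.split("/") if path else []

print(path_segments(""))          # []
print(path_segments("/"))         # ['', '']
print(path_segments("p1/p2"))     # ['p1', 'p2']
print(path_segments("/p1/p2/"))   # ['', 'p1', 'p2', '']
```

The special case for the empty path is needed because Python's `"".split("/")` returns `['']`, which would count one empty segment where this reading expects none.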

A few tests for #uriSubcomponentsParser:

{ 's:' -> #('s' nil #() nil nil).
  's:/' -> #('s' nil #('' '') nil nil).
  's://h' -> #('s' #(nil 'h' nil) #() nil nil).
  's://h/' -> #('s' #(nil 'h' nil) #('' '') nil nil).
  's:p1/p2' -> #('s' nil #('p1' 'p2') nil nil).
  's:p1/p2/' -> #('s' nil #('p1' 'p2' '') nil nil).
  's:/p1/p2' -> #('s' nil #('' 'p1' 'p2') nil nil).
  's:/p1/p2/' -> #('s' nil #('' 'p1' 'p2' '') nil nil).
  's://h/p1/p2' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2') nil nil).
  's://h/p1/p2/' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2' '') nil nil).
  's://%22@[::]/%3C?%3E#%5C' -> #('s' #('%22' '[::]' nil) #('' '%3C') '%3E' '%5C').
  's://[1:2:3:4:5:6:7:8]' -> #('s' #(nil '[1:2:3:4:5:6:7:8]' nil) #() nil nil).
} asOrderedDictionary keysAndValuesDo: [ :uri :expectedComponents |
  self assert: (PPParser uriSubcomponentsParser end parse: uri) = expectedComponents ].

A remaining problem that I know of is that the parsing of IPv6 literal addresses does not fully work. The example from section 1.1.2 in RFC 3986 that uses an IPv6 address fails to parse. See: moosetechnology/PetitParser#67.

The two added methods:

uriComponentsParser

  ^ self uriParser flatten ==> [ :wellFormedURI |
    (RxMatcher forString: '^(([^:/?#]+)\:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?') in: [ :componentsMatcher |
      self assert: (componentsMatcher matches: wellFormedURI).
      #(2 4 5 7 9) collect: [ :n | componentsMatcher subexpression: 1 + n ] ] ]

uriSubcomponentsParser

  ^ self uriParser ==> [ :nodes |
      | flattenBlock |
      flattenBlock := [ :nodeToFlatten |
        nodeToFlatten ifNil: [ '' ] ifNotNil: [
          nodeToFlatten isArray ifFalse: [ nodeToFlatten asString ] ifTrue: [
            '' join: (nodeToFlatten collect: flattenBlock) ] ] ].
      nodes third in: [ :hierPart |
        ((hierPart notEmpty and: [ hierPart first = '//' ]) ifTrue: [ hierPart second ]) in: [ :authority | {
          nodes first in: [ :scheme | flattenBlock value: scheme ].
          authority ifNotNil: [ {
            authority first ifNotNil: [ :userInfo | flattenBlock value: userInfo first ].
            authority second in: [ :host | flattenBlock value: host ].
            authority third ifNotNil: [ :port | flattenBlock value: port second ] } ].
          (authority ifNotNil: [ hierPart third ifNotEmpty: [ :segments | #(()) , (segments collect: #second) ] ] ifNil: [
            hierPart ifNotEmpty: [
              ((hierPart first = $/) ifTrue: [ #(()) ] ifFalse: [ #() ]) ,
              (((hierPart first = $/) ifTrue: [ hierPart second ifNil: [ #(() ()) ] ] ifFalse: [ hierPart ]) in: [ :segments |
                { segments first } , (segments second collect: #second) ]) ] ])
            collect: [ :segments | flattenBlock value: segments ].
          nodes fourth ifNotNil: [ :query | flattenBlock value: query second ].
          nodes fifth ifNotNil: [ :fragment | flattenBlock value: fragment second ] } ] ] ]
