Improve ZnURL parsing of extreme URL #100
Comments
OK, I studied this in a bit more detail, and I believe the URL is not valid per RFC 3986, as already described in the section by that title in the original post. The password is not valid: its slashes have to be percent-encoded. Given the following input (which is also what ZnUrl would render back), the example works:
See https://www.rfc-editor.org/rfc/rfc3986#section-3.2 and the full syntax at the end.
As far as I understand the (updated) post: the example is a valid URI, but is parsed incorrectly by curl. Following the ABNF in RFC 3986:
uri := scheme , ':' , hier_part , '?' , query , '#' , fragment.
scheme := 'http'.
hier_part := '//' , authority , path_abempty.
authority := "userinfo , '@' , " host , ':' , port.
host := 'http'.
port := ''.
path_abempty := '/' , '/http:' , '/' , '/@http:' , '/' , '/http:' , '/' , '/'.
query := 'http://'.
fragment := 'http://'.
After executing the assignments in reverse, this is true:
uri = 'http://http://http://@http://http://?http://#http://'
The main confusion seems to be over the part
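The assignments in this derivation are close to valid Pharo syntax, so the claim can be checked in a playground. A minimal sketch that runs them in reverse order (leaves first); `hierPart` and `pathAbempty` are camelCase renamings of `hier_part` and `path_abempty`, and userinfo is left out, as in the derivation:

```smalltalk
| scheme host port authority pathAbempty hierPart query fragment uri |
"Leaves first, so every name is assigned before it is used."
fragment := 'http://'.
query := 'http://'.
pathAbempty := '/' , '/http:' , '/' , '/@http:' , '/' , '/http:' , '/' , '/'.
port := ''.
host := 'http'.
"No userinfo: the '@' only occurs later, inside a path segment."
authority := host , ':' , port.
hierPart := '//' , authority , pathAbempty.
scheme := 'http'.
uri := scheme , ':' , hierPart , '?' , query , '#' , fragment.
self assert: uri = 'http://http://http://@http://http://?http://#http://'.
```

This only shows that the chosen component values concatenate back to the example; whether each value is allowed by its rule is what the grammar has to decide.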
Given below: a PetitParser grammar for the ‘URI’ rule from RFC 3986 (written to learn how to write such grammars, beware it might contain mistakes). The following, using the example of this issue, is true:
PPParser uriParser end matches: 'http://http://http://@http://http://?http://#http://'
The grammar, as a PPParser instantiation method:
uriParser
<sampleInstance>
| uri hierPart scheme authority userinfo host port ipLiteral ipvFuture ipv6Address h16 ls32
ipv4Address decOctet regName pathAbempty pathAbsolute pathRootless pathEmpty segment segmentNz
pchar query fragment pctEncoded unreserved subDelims alpha digit hexdig |
thisContext tempNames do: [ :tempName |
"Transcript tab; show: tempName , ' := PPDelegateParser new.'; cr."
thisContext tempNamed: tempName put: PPDelegateParser new ].
uri setParser: scheme , $: asParser , hierPart , ($? asParser , query) optional , ($# asParser , fragment) optional.
hierPart setParser: ('//' asParser , authority , pathAbempty) / pathAbsolute / pathRootless / pathEmpty.
scheme setParser: alpha , (alpha / digit / $+ asParser / $- asParser / $. asParser) star.
authority setParser: (userinfo , $@ asParser) optional , host , ($: asParser , port) optional.
userinfo setParser: (unreserved / pctEncoded / subDelims / $: asParser) star.
host setParser: ipLiteral / ipv4Address / regName.
port setParser: digit star.
ipLiteral setParser: $[ asParser , (ipv6Address / ipvFuture) , $] asParser.
ipvFuture setParser: #($v $V) asParser , hexdig plus , $. asParser , (unreserved / subDelims / $: asParser) plus.
ipv6Address setParser: (((h16 , $: asParser) times: 6) , ls32)
/ ('::' asParser , ((h16 , $: asParser) times: 5) , ls32)
/ (h16 optional , '::' asParser , ((h16 , $: asParser) times: 4) , ls32)
/ ((((h16 , $: asParser) max: 1) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 3) , ls32)
/ ((((h16 , $: asParser) max: 2) , h16) optional , '::' asParser , ((h16 , $: asParser) times: 2) , ls32)
/ ((((h16 , $: asParser) max: 3) , h16) optional , '::' asParser , h16 , $: asParser , ls32)
/ ((((h16 , $: asParser) max: 4) , h16) optional , '::' asParser , ls32)
/ ((((h16 , $: asParser) max: 5) , h16) optional , '::' asParser , h16)
/ ((((h16 , $: asParser) max: 6) , h16) optional , '::' asParser).
h16 setParser: (hexdig min: 1 max: 4).
ls32 setParser: (h16 , $: asParser , h16) / ipv4Address.
ipv4Address setParser: decOctet , $. asParser , decOctet , $. asParser , decOctet , $. asParser , decOctet.
decOctet setParser: ('25' asParser , ($0 to: $5) asParser) "ordered choice: longest alternatives first, otherwise a single digit always wins"
/ ($2 asParser , ($0 to: $4) asParser , digit)
/ ($1 asParser , (digit times: 2))
/ (($1 to: $9) asParser , digit)
/ digit.
regName setParser: (unreserved / pctEncoded / subDelims) star.
pathAbempty setParser: ($/ asParser , segment) star.
pathAbsolute setParser: $/ asParser , (segmentNz , ($/ asParser , segment) star) optional.
pathRootless setParser: segmentNz , ($/ asParser , segment) star.
pathEmpty setParser: (pchar times: 0).
segment setParser: pchar star.
segmentNz setParser: (pchar min: 1).
pchar setParser: unreserved / pctEncoded / subDelims / $: asParser / $@ asParser.
query setParser: (pchar / $/ asParser / $? asParser) star.
fragment setParser: (pchar / $/ asParser / $? asParser) star.
pctEncoded setParser: $% asParser , hexdig , hexdig.
unreserved setParser: alpha / digit / $- asParser / $. asParser / $_ asParser / $~ asParser.
subDelims setParser: $! asParser / $$ asParser / $& asParser / $' asParser / $( asParser / $) asParser
/ $* asParser / $+ asParser / $, asParser / $; asParser / $= asParser.
alpha setParser: ($A to: $Z) asParser / ($a to: $z) asParser.
digit setParser: ($0 to: $9) asParser.
hexdig setParser: digit / ($A to: $F) asParser / ($a to: $f) asParser.
^ uri
Hi,
That looks nice! Sadly, I am no PP user myself. What I would suggest then is to start a new independent subproject, say Zinc-PPUriParser, that depends on PP and holds your experiment. It would then offer an alternative way to parse a URI/URL into a ZnUrl. We can add tests and let people test it and see where that leads us. But I would not make this part of the default group in the BaselineOf. There are many similar packages in Zinc's repository. What do you think?
Sven
I’ll have to see if I can get around to turning it into a more full-fledged project. For now, I did take it one step further by adding the two methods given below. The first adds an action to break the URI down into its components using a regular expression as described in appendix B in RFC 3986. The second adds a different action which additionally breaks the authority and path components into subcomponents: the userinfo, host and port of the authority, and the segments of the path. A basic example using both methods:
'scheme://userinfo@host:987/segmentA/segmentB/segmentC?query#fragment' in: [ :uri |
self assert: (PPParser uriComponentsParser end parse: uri)
= #('scheme' 'userinfo@host:987' '/segmentA/segmentB/segmentC' 'query' 'fragment').
self assert: (PPParser uriSubcomponentsParser end parse: uri)
= #('scheme' #('userinfo' 'host' '987') #('' 'segmentA' 'segmentB' 'segmentC') 'query' 'fragment') ].
The example of this issue:
'http://http://http://@http://http://?http://#http://' in: [ :uri |
self assert: (PPParser uriComponentsParser end parse: uri)
= #('http' 'http:' '//http://@http://http://' 'http://' 'http://').
self assert: (PPParser uriSubcomponentsParser end parse: uri)
= #('http' #(nil 'http' '') #('' '' 'http:' '' '@http:' '' 'http:' '' '') 'http://' 'http://') ].
I found RFC 3986 somewhat confusing when it comes to path segments: that “a path consists of a sequence of path segments separated by a slash character” as stated in section 3.3 is not reflected in the grammar. Consider the path
A few tests for
{ 's:' -> #('s' nil #() nil nil).
's:/' -> #('s' nil #('' '') nil nil).
's://h' -> #('s' #(nil 'h' nil) #() nil nil).
's://h/' -> #('s' #(nil 'h' nil) #('' '') nil nil).
's:p1/p2' -> #('s' nil #('p1' 'p2') nil nil).
's:p1/p2/' -> #('s' nil #('p1' 'p2' '') nil nil).
's:/p1/p2' -> #('s' nil #('' 'p1' 'p2') nil nil).
's:/p1/p2/' -> #('s' nil #('' 'p1' 'p2' '') nil nil).
's://h/p1/p2' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2') nil nil).
's://h/p1/p2/' -> #('s' #(nil 'h' nil) #('' 'p1' 'p2' '') nil nil).
's://%22@[::]/%3C?%3E#%5C' -> #('s' #('%22' '[::]' nil) #('' '%3C') '%3E' '%5C').
's://[1:2:3:4:5:6:7:8]' -> #('s' #(nil '[1:2:3:4:5:6:7:8]' nil) #() nil nil).
} asOrderedDictionary keysAndValuesDo: [ :uri :expectedComponents |
self assert: (PPParser uriSubcomponentsParser end parse: uri) = expectedComponents ].
A remaining problem that I know of is that the parsing of IPv6 literal addresses does not fully work. The example from section 1.1.2 in RFC 3986 that uses an IPv6 address fails to parse. See: moosetechnology/PetitParser#67.
The two added methods:
uriComponentsParser
^ self uriParser flatten ==> [ :wellFormedURI |
(RxMatcher forString: '^(([^:/?#]+)\:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?') in: [ :componentsMatcher |
self assert: (componentsMatcher matches: wellFormedURI).
#(2 4 5 7 9) collect: [ :n | componentsMatcher subexpression: 1 + n ] ] ]
uriSubcomponentsParser
^ self uriParser ==> [ :nodes |
| flattenBlock |
flattenBlock := [ :nodeToFlatten |
nodeToFlatten ifNil: [ '' ] ifNotNil: [
nodeToFlatten isArray ifFalse: [ nodeToFlatten asString ] ifTrue: [
'' join: (nodeToFlatten collect: flattenBlock) ] ] ].
nodes third in: [ :hierPart |
((hierPart notEmpty and: [ hierPart first = '//' ]) ifTrue: [ hierPart second ]) in: [ :authority | {
nodes first in: [ :scheme | flattenBlock value: scheme ].
authority ifNotNil: [ {
authority first ifNotNil: [ :userInfo | flattenBlock value: userInfo first ].
authority second in: [ :host | flattenBlock value: host ].
authority third ifNotNil: [ :port | flattenBlock value: port second ] } ].
(authority ifNotNil: [ hierPart third ifNotEmpty: [ :segments | #(()) , (segments collect: #second) ] ] ifNil: [
hierPart ifNotEmpty: [
((hierPart first = $/) ifTrue: [ #(()) ] ifFalse: [ #() ]) ,
(((hierPart first = $/) ifTrue: [ hierPart second ifNil: [ #(() ()) ] ] ifFalse: [ hierPart ]) in: [ :segments |
{ segments first } , (segments second collect: #second) ]) ] ])
collect: [ :segments | flattenBlock value: segments ].
nodes fourth ifNotNil: [ :query | flattenBlock value: query second ].
nodes fifth ifNotNil: [ :fragment | flattenBlock value: fragment second ] } ] ] ]
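For comparison, the appendix B decomposition can also be tried on its own, without PetitParser. A minimal playground sketch that reuses only the RxMatcher calls already present in uriComponentsParser; the temporary names are illustrative, and unlike the parser-based method it does not first check that the input is a well-formed URI:

```smalltalk
| uri matcher components |
uri := 'http://http://http://@http://http://?http://#http://'.
"Regular expression from RFC 3986, appendix B, exactly as used in uriComponentsParser."
matcher := RxMatcher forString: '^(([^:/?#]+)\:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'.
self assert: (matcher matches: uri).
"Groups 2, 4, 5, 7 and 9 of the RFC regex hold scheme, authority, path, query and fragment.
 The offset of 1 follows uriComponentsParser, where subexpression 1 is the whole match."
components := #(2 4 5 7 9) collect: [ :n | matcher subexpression: 1 + n ].
self assert: components = #('http' 'http:' '//http://@http://http://' 'http://' 'http://').
```

The expected array is the one asserted for uriComponentsParser in the example of this issue. The regular expression accepts almost any string, which is why the method above runs it only on input the grammar has already accepted.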
The article https://daniel.haxx.se/blog/2022/09/08/http-http-http-http-http-http-http/ discusses the parsing of an extreme URL, namely http://http://http://@http://http://?http://#http:// which should result in the following recognised parts:
We could probably improve our current result.
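To look at the current result, the parts ZnUrl recognises can be listed in a playground. A sketch, assuming a Pharo image with Zinc loaded; the selectors are ZnUrl's usual accessors, and no expected values are given here because they depend on the Zinc version in the image:

```smalltalk
| url |
url := 'http://http://http://@http://http://?http://#http://' asZnUrl.
"Print each recognised part on its own line in the Transcript."
#(#scheme #username #password #host #port #segments #query #fragment)
	do: [ :selector |
		Transcript
			show: selector;
			show: ' -> ';
			show: (url perform: selector) printString;
			cr ].
"Also show how ZnUrl renders the URL back."
Transcript show: url printString; cr; flush.
```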