ericprud edited this page Mar 17, 2013 · 5 revisions
Lexing UTF-8
The Turtle production PN_CHARS_BASE (http://www.w3.org/TR/turtle/#grammar-production-PN_CHARS_BASE) is defined in terms of Unicode characters. A tool such as http://www.w3.org/2005/03/23-lex-U converts each character range into an equivalent byte-level pattern for a lexer that reads UTF-8:
characters | Unicode range | UTF-8 bytes (hex) | byte-level pattern
AZ | [A-Z] | A-Z | [A-Z]
az | [a-z] | a-z | [a-z]
ÀÖ | [#x00C0-#x00D6] | c380-c396 | \xC3[\x80-\x96]
Øö | [#x00D8-#x00F6] | c398-c3b6 | \xC3[\x98-\xB6]
ø˿ | [#x00F8-#x02FF] | c3b8-cbbf | \xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]
Ͱͽ | [#x0370-#x037D] | cdb0-cdbd | \xCD[\xB0-\xBD]
Ϳ | [#x037F-#x1FFF] | cdbf-e1bfbf | \xCD\xBF|[\xCE-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|\xE1[\x80-\xBF][\x80-\xBF]
 | [#x200C-#x200D] | e2808c-e2808d | \xE2\x80[\x8C-\x8D]
⁰ | [#x2070-#x218F] | e281b0-e2868f | \xE2(\x81[\xB0-\xBF]|[\x82-\x85][\x80-\xBF]|\x86[\x80-\x8F])
Ⰰ | [#x2C00-#x2FEF] | e2b080-e2bfaf | \xE2([\xB0-\xBE][\x80-\xBF]|\xBF[\x80-\xAF])
、 | [#x3001-#xD7FF] | e38081-ed9fbf | \xE3(\x80[\x81-\xBF]|[\x81-\xBF][\x80-\xBF])|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]
豈﷏ | [#xF900-#xFDCF] | efa480-efb78f | \xEF([\xA4-\xB6][\x80-\xBF]|\xB7[\x80-\x8F])
ﷰ� | [#xFDF0-#xFFFD] | efb7b0-efbfbd | \xEF(\xB7[\xB0-\xBF]|[\xB8-\xBE][\x80-\xBF]|\xBF[\x80-\xBD])
𐀀� | [#x10000-#xEFFFF] | f0908080-f3afbfbf | \xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]
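The conversion behind the table can be sketched in a few lines of Python. This is not the code of the tool linked above, only one possible way to derive byte-level patterns from a code point range; the function names `split_by_length`, `byte_ranges`, `render` and `utf8_pattern` are made up for this illustration. The idea is to split the range wherever the UTF-8 encoding changes length (and around the surrogates, which have no UTF-8 form), then peel off partial leading and trailing blocks so that every remaining alternative is a sequence of simple byte ranges.

```python
def split_by_length(lo, hi):
    """Split [lo, hi] at UTF-8 length boundaries, skipping the surrogates."""
    pieces = []
    for a, b in [(0x0, 0x7F), (0x80, 0x7FF), (0x800, 0xD7FF),
                 (0xE000, 0xFFFF), (0x10000, 0x10FFFF)]:
        if lo <= b and hi >= a:
            pieces.append((max(lo, a), min(hi, b)))
    return pieces


def byte_ranges(lo, hi):
    """lo and hi are equal-length UTF-8 encodings with lo <= hi.

    Returns a list of alternatives; each alternative is a list of
    (min_byte, max_byte) pairs, one pair per byte position.
    """
    if len(lo) == 1:
        return [[(lo[0], hi[0])]]
    if lo[0] == hi[0]:
        return [[(lo[0], lo[0])] + rest
                for rest in byte_ranges(lo[1:], hi[1:])]
    n = len(lo) - 1
    low_tail, high_tail = bytes([0x80]) * n, bytes([0xBF]) * n
    first, last = lo[0], hi[0]
    head, tail = [], []
    if lo[1:] != low_tail:   # partial block under the first lead byte
        head = [[(first, first)] + r for r in byte_ranges(lo[1:], high_tail)]
        first += 1
    if hi[1:] != high_tail:  # partial block under the last lead byte
        tail = [[(last, last)] + r for r in byte_ranges(low_tail, hi[1:])]
        last -= 1
    middle = [[(first, last)] + [(0x80, 0xBF)] * n] if first <= last else []
    return head + middle + tail


def render(alternative):
    """Write one alternative in the \\xNN notation used in the table."""
    return "".join(r"\x%02X" % a if a == b else r"[\x%02X-\x%02X]" % (a, b)
                   for a, b in alternative)


def utf8_pattern(lo, hi):
    """Byte-level pattern matching the UTF-8 form of code points lo..hi."""
    alternatives = []
    for a, b in split_by_length(lo, hi):
        alternatives += byte_ranges(chr(a).encode("utf-8"),
                                    chr(b).encode("utf-8"))
    return "|".join(render(alt) for alt in alternatives)


print(utf8_pattern(0xF8, 0x2FF))  # \xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]
```

The sketch does not factor out common lead bytes, so its output for a range such as [#x3001-#xD7FF] is written as flat alternatives rather than with grouping parentheses, but it should match the same byte sequences.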
A negated character class is handled the same way. A string body that excludes line feed, carriage return, the single quote and the backslash becomes the remaining ASCII bytes plus every well-formed multi-byte UTF-8 sequence:
[\x00-\x09\x0B-\x0C\x0E-\x26\x28-\x5B\x5D-\x7F]
|[\xC2-\xDF][\x80-\xBF]
|\xE0[\xA0-\xBF][\x80-\xBF]
|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]
|\xED[\x80-\x9F][\x80-\xBF]
|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]
|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]
|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|\xF4([\x80-\x8E][\x80-\xBF][\x80-\xBF]|\x8F([\x80-\xBE][\x80-\xBF]|\xBF[\x80-\xBF]))
|{ECHAR}|{UCHAR}
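A pattern like this can be sanity-checked by compiling it as a byte-level regular expression and probing it with well-formed and malformed sequences. The following is a minimal sketch using Python's `re` module; `BODY` and `accepts` are illustrative names, and the `{ECHAR}` and `{UCHAR}` alternatives are left out because their definitions are not given on this page.

```python
import re

# The byte-level alternatives from the pattern, without {ECHAR} and {UCHAR}.
BODY = re.compile(
    rb"[\x00-\x09\x0B-\x0C\x0E-\x26\x28-\x5B\x5D-\x7F]"
    rb"|[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF4([\x80-\x8E][\x80-\xBF][\x80-\xBF]"
    rb"|\x8F([\x80-\xBE][\x80-\xBF]|\xBF[\x80-\xBF]))"
)


def accepts(data: bytes) -> bool:
    """True if data is exactly one character allowed by the pattern."""
    return BODY.fullmatch(data) is not None


# Well-formed characters of each encoded length are accepted.
for ch in ["a", "\u00e9", "\u3001", "\U00010000", "\U0010ffff"]:
    assert accepts(ch.encode("utf-8"))

# The four excluded ASCII characters are rejected.
for raw in [b"\n", b"\r", b"'", b"\\"]:
    assert not accepts(raw)

# Malformed input is rejected: an overlong form of U+0000, an encoded
# surrogate (U+D800), and a value beyond U+10FFFF.
for raw in [b"\xC0\x80", b"\xED\xA0\x80", b"\xF4\x90\x80\x80"]:
    assert not accepts(raw)
```

Because the multi-byte alternatives only admit well-formed sequences, a lexer built on them rejects overlong encodings and surrogates without a separate validation pass.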
ASCII boundaries "\x00\x09!#[]" correspond to "\u0000\u0009\u000b\u000c\u000e\u0021\u0023\u005b\u005d\u007f".
UTF-8 boundaries "\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff\ue000\uffff\U00010000\U0003ffff\U00040000\U000fffff\U00100000\U0010ffff" correspond to "\u0080\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff\ue000\uffff\U00010000\U0003ffff\U00040000\U000fffff\U00100000\U0010ffff".
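To see why these code points make useful test input, it may help to print their UTF-8 encodings: each one is the first or last character reachable under a given lead byte range. The snippet below is a small sketch in Python; `BOUNDARIES` and `encoded_boundaries` are names chosen for this example.

```python
# The escaped UTF-8 boundary characters listed above.
BOUNDARIES = ("\u0080\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff\ue000\uffff"
              "\U00010000\U0003ffff\U00040000\U000fffff\U00100000\U0010ffff")


def encoded_boundaries(text=BOUNDARIES):
    """Pair each character's code point with its UTF-8 bytes in hex."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8").hex()) for ch in text]


for code_point, hex_bytes in encoded_boundaries():
    print(code_point, hex_bytes)
```

For example, U+07FF encodes as dfbf, the last two-byte sequence, and U+0800 as e0a080, the first three-byte sequence.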