ericprud edited this page Mar 17, 2013 · 5 revisions

Lexing UTF-8

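http://www.w3.org/TR/turtle/#grammar-production-PN_CHARS_BASE is defined in terms of Unicode characters. It can be converted mechanically into byte-level patterns for a UTF-8 lexer, e.g. with http://www.w3.org/2005/03/23-lex-U. Each row below gives the two boundary characters, the Unicode range, the UTF-8 byte range, and the resulting pattern: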

AZ [A-Z] A-Z [A-Z]
az [a-z] a-z |[a-z]
ÀÖ [#x00C0-#x00D6] c380-c396 |\xC3[\x80-\x96]
Øö [#x00D8-#x00F6] c398-c3b6 |\xC3[\x98-\xB6]
ø˿ [#x00F8-#x02FF] c3b8-cbbf |\xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]
Ͱͽ [#x0370-#x037D] cdb0-cdbd |\xCD[\xB0-\xBD]
Ϳ῿ [#x037F-#x1FFF] cdbf-e1bfbf |\xCD\xBF|[\xCE-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|\xE1[\x80-\xBF][\x80-\xBF]
‌‍ [#x200C-#x200D] e2808c-e2808d |\xE2\x80[\x8C-\x8D]
⁰↏ [#x2070-#x218F] e281b0-e2868f |\xE2(\x81[\xB0-\xBF]|[\x82-\x85][\x80-\xBF]|\x86[\x80-\x8F])
Ⰰ⿯ [#x2C00-#x2FEF] e2b080-e2bfaf |\xE2([\xB0-\xBE][\x80-\xBF]|\xBF[\x80-\xAF])
、퟿ [#x3001-#xD7FF] e38081-ed9fbf |\xE3(\x80[\x81-\xBF]|[\x81-\xBF][\x80-\xBF])|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]
豈﷏ [#xF900-#xFDCF] efa480-efb78f |\xEF([\xA4-\xB6][\x80-\xBF]|\xB7[\x80-\x8F])
ﷰ� [#xFDF0-#xFFFD] efb7b0-efbfbd |\xEF(\xB7[\xB0-\xBF]|[\xB8-\xBE][\x80-\xBF]|\xBF[\x80-\xBD])
𐀀� [#x10000-#xEFFFF] f0908080-f3afbfbf |\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]
|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]
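
One way to sanity-check the table is to compile its last column as a Python bytes regex and compare it with the Unicode ranges in the second column. The sketch below is a hand transcription, not output of the lex-U tool, and the names RANGES, PN_CHARS_BASE_UTF8, mismatches and boundary_points are made up for illustration. For the #x3001-#xD7FF row the middle lead bytes are taken as E4-EC, since E1 and E2 lead sequences that encode code points below #x3001.

```python
import re

# Unicode ranges of PN_CHARS_BASE, one (first, last) pair per table row.
RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Byte-level alternatives, one source line per table row.
PN_CHARS_BASE_UTF8 = re.compile(
    rb"[A-Z]"
    rb"|[a-z]"
    rb"|\xC3[\x80-\x96]"
    rb"|\xC3[\x98-\xB6]"
    rb"|\xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]"
    rb"|\xCD[\xB0-\xBD]"
    rb"|\xCD\xBF|[\xCE-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|\xE1[\x80-\xBF][\x80-\xBF]"
    rb"|\xE2\x80[\x8C-\x8D]"
    rb"|\xE2(\x81[\xB0-\xBF]|[\x82-\x85][\x80-\xBF]|\x86[\x80-\x8F])"
    rb"|\xE2([\xB0-\xBE][\x80-\xBF]|\xBF[\x80-\xAF])"
    rb"|\xE3(\x80[\x81-\xBF]|[\x81-\xBF][\x80-\xBF])|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xEF([\xA4-\xB6][\x80-\xBF]|\xB7[\x80-\x8F])"
    rb"|\xEF(\xB7[\xB0-\xBF]|[\xB8-\xBE][\x80-\xBF]|\xBF[\x80-\xBD])"
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]"
)


def in_ranges(cp):
    """True if the code point lies in one of the Unicode ranges."""
    return any(first <= cp <= last for first, last in RANGES)


def mismatches(code_points):
    """Return the code points on which the byte pattern and RANGES disagree."""
    bad = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        encoded = chr(cp).encode("utf-8")
        matched = PN_CHARS_BASE_UTF8.fullmatch(encoded) is not None
        if matched != in_ranges(cp):
            bad.append(cp)
    return bad


def boundary_points():
    """Both ends of every range plus the code points just outside them."""
    points = set()
    for first, last in RANGES:
        points.update((first - 1, first, last, last + 1))
    return sorted(points)


# An empty list means the pattern and the ranges agree on every boundary.
print(mismatches(boundary_points()))
```

Passing `range(0x110000)` instead of `boundary_points()` checks every code point; it is slower but leaves no gaps between the sampled boundaries.
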
[\x00-\x09\x0B-\x0C\x0E-\x26\x28-\x5B\x5D-\x7F]
|[\xC2-\xDF][\x80-\xBF]
|\xE0[\xA0-\xBF][\x80-\xBF]
|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]
|\xED[\x80-\x9F][\x80-\xBF]
|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]
|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]
|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|\xF4([\x80-\x8E][\x80-\xBF][\x80-\xBF]|\x8F([\x80-\xBE][\x80-\xBF]|\xBF[\x80-\xBF]))
|{ECHAR}|{UCHAR}
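
The pattern ending in {ECHAR}|{UCHAR} accepts ASCII other than LF, CR, ' and \, followed by alternatives that together appear to cover every well-formed multi-byte UTF-8 sequence. A minimal sketch for checking that claim compares the multi-byte alternatives with Python's strict UTF-8 decoder; NON_ASCII_UTF8, is_wellformed and agrees are illustrative names, and {ECHAR} and {UCHAR} are left out because they are defined elsewhere in the grammar.

```python
import re

# The multi-byte alternatives of the pattern, transcribed by hand.
NON_ASCII_UTF8 = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF4([\x80-\x8E][\x80-\xBF][\x80-\xBF]"
    rb"|\x8F([\x80-\xBE][\x80-\xBF]|\xBF[\x80-\xBF]))"
)


def is_wellformed(seq):
    """True if seq is the UTF-8 encoding of exactly one non-ASCII character."""
    try:
        return len(seq.decode("utf-8")) == 1 and seq[0] >= 0x80
    except UnicodeDecodeError:
        return False


def agrees(seq):
    """True if the pattern and the decoder give the same verdict on seq."""
    return (NON_ASCII_UTF8.fullmatch(seq) is not None) == is_wellformed(seq)


samples = [
    b"\xC2\x80",          # U+0080, shortest two-byte sequence
    b"\xC0\x80",          # overlong encoding of U+0000
    b"\xE0\x9F\xBF",      # overlong three-byte sequence
    b"\xED\xA0\x80",      # encoded surrogate U+D800
    b"\xF4\x8F\xBF\xBF",  # U+10FFFF, the last code point
    b"\xF4\x90\x80\x80",  # beyond U+10FFFF
]
print(all(agrees(s) for s in samples))
```

The rejected samples are the cases a naive `[\x80-\xFF]+` pattern would let through, which is the reason for spelling out each lead byte separately.
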

ASCII boundaries "\x00\x09!#[]" correspond to "\u0000\u0009\u000b\u000c\u000e\u0021\u0023\u005b\u005d\u007f".

UTF-8 boundaries "\u0080\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff\ue000\uffff\U00010000\U0003ffff\U00040000\U000fffff\U00100000\U0010ffff" correspond to "\u0080\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff\ue000\uffff\U00010000\U0003ffff\U00040000\U000fffff\U00100000\U0010ffff".
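
Read in pairs, the escaped UTF-8 boundaries mark the first and last code point handled by each multi-byte alternative. As a rough illustration, the sketch below encodes each pair and reports the lead bytes, which should line up with the lead-byte classes in the pattern; BOUNDARIES and lead_byte_ranges are names chosen here, not part of the original lexer.

```python
# The escaped boundary characters, grouped as (first, last) pairs.
BOUNDARIES = (
    "\u0080\u07ff\u0800\u0fff\u1000\ucfff\ud000\ud7ff"
    "\ue000\uffff\U00010000\U0003ffff\U00040000\U000fffff"
    "\U00100000\U0010ffff"
)


def lead_byte_ranges(chars):
    """Return the UTF-8 lead bytes of each (first, last) boundary pair."""
    pairs = []
    for first, last in zip(chars[0::2], chars[1::2]):
        pairs.append((first.encode("utf-8")[0], last.encode("utf-8")[0]))
    return pairs


for low, high in lead_byte_ranges(BOUNDARIES):
    print(f"\\x{low:02X}-\\x{high:02X}")
```

Pairs whose two lead bytes are equal (E0, ED, F0, F4) are the ones where the pattern has to restrict the second byte to exclude overlong forms, surrogates, or code points above U+10FFFF.
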
