This repo contains most of the Unicode Character Database (UCD) version 13.0.0 files converted to R data.frames, as well as contributory data files from other databases and the code used to download and convert the files. It is intended to serve as a companion site to this blog series on the Unicode standard.
The UCD provides machine-readable data files describing the character properties defined by the Unicode Standard and is documented in Unicode Standard Annex #44.
Additional information on this repo can be found here and in other entries of the series.
Most files consist of Unicode character integer code points (`cp`), code point ranges (`cp_lo` to `cp_hi`), or sequences, together with one or more corresponding property values or mappings to other code points. Most files also contain one or more additional informative fields. Since these may vary in length, meaning and format, they are usually stored in a single column named `comments` located at the end of the data.frame. Note that this is different from the line comments, which contain information about code points or code point ranges (see the Metadata section below).
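As an illustration of the range layout described above, here is a minimal sketch of looking up a property for individual code points in a table keyed by `cp_lo` and `cp_hi`. The toy table, the `prop` column, its values, and the `lookup_prop` helper are all made up for the example; none of them come from the repo.

```r
# Toy table in the range layout; `prop` and its values are illustrative only.
ranges <- data.frame(
  cp_lo = c(0x0041L, 0x0061L, 0x0391L),
  cp_hi = c(0x005AL, 0x007AL, 0x03A9L),
  prop  = c("Upper", "Lower", "Upper"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the property of each code point,
# or NA when no range contains it.
lookup_prop <- function(cp, tbl) {
  vapply(cp, function(x) {
    hit <- which(tbl$cp_lo <= x & x <= tbl$cp_hi)
    if (length(hit) == 0L) NA_character_ else tbl$prop[hit[1L]]
  }, character(1))
}

# utf8ToInt() converts a string to its integer code points.
res <- lookup_prop(utf8ToInt("aZ1"), ranges)
print(res)
```

A linear scan per code point is fine for a handful of lookups; for large inputs, `findInterval()` on sorted `cp_lo` values would likely be a better fit.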
In addition, some data.frames have the `variable.labels` attribute set with a short description of each column. Use `attr(<data.frame>, "variable.labels")` to see them.
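The following sketch shows how such an attribute can be set and read on a toy data.frame. It assumes the labels are stored as a named character vector, which is a common convention but is not confirmed for this repo; the column names and label text are invented.

```r
# Toy data.frame; column names and labels are illustrative only.
df <- data.frame(cp = c(65L, 66L), gc = c("Lu", "Lu"), stringsAsFactors = FALSE)

# Assumption: labels are a character vector named after the columns.
attr(df, "variable.labels") <- c(cp = "code point", gc = "general category")

labels <- attr(df, "variable.labels")
print(labels[["gc"]])

# attr() returns NULL when the attribute is not set, so check before using it.
has_labels <- !is.null(attr(data.frame(x = 1), "variable.labels"))
print(has_labels)
```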
Also, some R data.frame names are abbreviated to reduce typing, since each name carries an added prefix. The mapping between the original file names and the corresponding data.frame names is documented in the README file of the ucd sub–directory, along with a link to each file's description. A listing of each file's header and first six lines is also included.
Contributory data files for the Unicode Collation Algorithm are located in the uca sub–directory and Ideographic Variation Sequences files are located in the ivs sub–directory.
The Unicode Consortium also provides data files for Unicode Security Mechanisms, which are located in the security sub–directory of the repository:
File Name | data.frame Name |
---|---|
confusables | ucs.confusables |
IdentifierStatus | ucs.idStat |
IdentifierType | ucs.idType |
intentional | ucs.intentional |
confusablesSummary | ucs.confusablesSummary |
Please refer to UTS#39: Unicode Security Mechanisms for further details.
If you don't want to download the entire repository, you can download individual files from R like this:
load(co <- url('https://github.com/tsoubiran/UCD/blob/master/13.0.0/ucd/Rdata/UnicodeData.Rdata?raw=true'));close(co)
Note that in order to do this, you need to change the domain name.
Otherwise, using https://github.com/tsoubiran/UCD/blob/master/13.0.0/ucd/Rdata/UnicodeData.Rdata won't work, because in that case github.com redirects to raw.githubusercontent.com and `url()` does not seem to handle redirection.
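If a `url()` connection proves unreliable, one alternative is to download the file to a temporary location first and load it from disk. The sketch below is not the repo's `getUCDRdata` function: `load_remote_rdata` is a hypothetical helper, and the `raw.githubusercontent.com` address is an assumed equivalent of the URL above that has not been verified here.

```r
# Hypothetical helper: fetch one .Rdata file into a temp file, then load it.
load_remote_rdata <- function(u, envir = globalenv()) {
  tmp <- tempfile(fileext = ".Rdata")
  on.exit(unlink(tmp), add = TRUE)      # clean up the temp file afterwards
  # mode = "wb" keeps the binary file intact on Windows.
  download.file(u, destfile = tmp, mode = "wb", quiet = TRUE)
  load(tmp, envir = envir)              # returns the names of the loaded objects
}

# Assumed raw-content URL for the same file (not verified).
raw_url <- paste0(
  "https://raw.githubusercontent.com/tsoubiran/UCD/master/",
  "13.0.0/ucd/Rdata/UnicodeData.Rdata"
)

# Uncomment to actually download:
# loaded <- load_remote_rdata(raw_url)
```

Loading into a dedicated environment (for example `envir = new.env()`) avoids overwriting objects in the global workspace.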
You can also use the `getUCDRdata` function as demonstrated here.
The src sub–directory contains (very hackish) R code used to download the original files and convert them to R data.frames, along with some utilities for dealing with the files' metadata. To use this code, you need the stringi package installed for parsing the files, as well as the RCurl and XML packages for the download script.
Please refer to this blog entry for instructions on how to use this code.
Each data.frame also stores the original commented lines, if any, in an attribute named "htxt". For example, you can retrieve the header of an original UCD file, which describes the file's format and points to the relevant Technical Report, like this:
ucd.hdr(ucd.propValal)
or extract every comment block:
## w/o expansion
comments <- ucd.comments(ucd.propValal)
## with expansion
comments <- ucd.comments(ucd.propValal, xpd=TRUE)
`ucd.comments` returns a list of all comments. Using `xpd=TRUE` returns a list of the same length as the original data.frame, with comments located at their original position in the file. This can be useful for extracting relevant information and merging it back with the data.frame.
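As a rough sketch of the merging step, assume the expanded result is a list with one element per row, holding `NULL` where a row has no comment and a character vector otherwise. That structure is a guess based on the description above, not the documented return value of `ucd.comments`, and the toy data and the `line_comment` column name are invented.

```r
# Toy data.frame standing in for a UCD table.
df <- data.frame(cp = c(65L, 66L, 67L))

# Stand-in for an expanded comment list (assumed structure):
# one element per row, NULL where the row has no comment.
xpd_comments <- list("# block A", NULL, c("# note 1", "# note 2"))

# Collapse each element to a single string so it fits in one column.
df$line_comment <- vapply(xpd_comments, function(x) {
  if (is.null(x)) NA_character_ else paste(x, collapse = "\n")
}, character(1))

print(df)
```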
UCD files are distributed under the Unicode® Copyright and Terms of Use. Please refer to this page for more information.