-
Notifications
You must be signed in to change notification settings - Fork 24
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'master' of github.com:flipbit/tokenizer
- Loading branch information
Showing
1 changed file
with
154 additions
and
112 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,149 +1,191 @@ | ||
## .NET Tokenizer | ||
Tokenizer - Data Extraction Library | ||
=================================== | ||
|
||
.NET Tokenizer is a library written in C# that extracts values from text. The library creates structured objects by overlaying patterns onto blocks of text. | ||
[![GitHub Stars](https://img.shields.io/github/stars/flipbit/tokenizer.svg)](https://github.com/flipbit/tokenizer/stargazers) [![GitHub Issues](https://img.shields.io/github/issues/flipbit/tokenizer.svg)](https://github.com/flipbit/tokenizer/issues) [![NuGet Version](https://img.shields.io/nuget/v/tokenizer.svg)](https://www.nuget.org/packages/Tokenizer/) [![NuGet Downloads](https://img.shields.io/nuget/dt/tokenizer.svg)](https://www.nuget.org/packages/Tokenizer/) | ||
|
||
##Installation | ||
Tokenizer is a .NET Standard and .NET Framework library that allows you to extract information from text using predefined patterns. Tokens embedded within patterns are extracted, validated and transformed before being returned as a strongly typed object: | ||
|
||
Installation: | ||
```csharp | ||
var pattern = @"First Name: {FirstName}, Last Name: {LastName}, Enrolled: {Enrolled:ToDateTime('dd MMM yyyy')}"; | ||
var input = @"First Name: Alice, Last Name: Smith, Enrolled: 16 Jan 2018"; | ||
|
||
Enter the following into the Package Manager Console window in Visual Studio: | ||
var student = new Tokenizer().Parse<Student>(pattern, input); | ||
|
||
Install-Package Tokenizer | ||
Assert.AreEqual("Alice", student.FirstName); | ||
Assert.AreEqual("Smith", student.LastName); | ||
Assert.AreEqual(new DateTime(2018, 1, 16), student.Enrolled); | ||
``` | ||
|
||
##Basic Example | ||
Tokens work by matching the preceding text (preamble) in your input. When a match is found, the text after the preamble is taken and used to populate the token. Text is taken up to a terminator, or until the next token begins. | ||
|
||
The Tokenizer library was originally developed in order to parse information from WHOIS records. A typical WHOIS record will contain free-form text such as the example below: | ||
## In Order Processing | ||
|
||
``` | ||
Domain Name: LATIMES.COM | ||
Registry Domain ID: 510925_DOMAIN_COM-VRSN | ||
Registrar WHOIS Server: whois.godaddy.com | ||
Registrar URL: http://www.godaddy.com | ||
Update Date: 2013-12-02 10:38:01 | ||
Creation Date: 1990-12-12 00:00:00 | ||
Registrar Registration Expiration Date: 2015-12-11 00:00:00 | ||
Registrar: GoDaddy.com, LLC | ||
Registrar IANA ID: 146 | ||
Registrar Abuse Contact Email: [email protected] | ||
Registrar Abuse Contact Phone: +1.480-624-2505 | ||
Registrant Name: TRIBUNE COMPANY | ||
Registrant Organization: Tribune Technology LLC | ||
Registrant Street: 435 N. Michigan Ave | ||
Registrant City: Chicago | ||
Registrant State/Province: Illinois | ||
Registrant Postal Code: 60611 | ||
Registrant Country: United States | ||
Registrant Phone: +1.3122229100 | ||
Registrant Email: [email protected] | ||
``` | ||
Tokens can be processed either in the order they appear in the input pattern, or in any order. If processing in order, a token can be marked as optional with the `?` suffix to allow matching to continue if it is not present in the input. | ||
|
||
The Tokenizer will create an object with the information from the text extract and set onto it's properties. A JSON representation of the information extracted from the above would be: | ||
|
||
```json | ||
{ | ||
"WhoisRecord": { | ||
"DomainName": "LATIMES.COM", | ||
"RegistryDomainId": "510925_DOMAIN_COM-VRSN", | ||
"Registrar": { | ||
"WhoisHostName": "whois.godaddy.com", | ||
"URL": "http://www.godaddy.com", | ||
"Name": "GoDaddy.com, LLC", | ||
"IanaId": "146" | ||
}, | ||
"CreatedDate": "1990-12-12", | ||
"ModifiedDate": "2013-12-03", | ||
"ExpirationDate": "2015-12-11", | ||
"AbuseEmail": "[email protected]", | ||
"AbusePhoneNumber": "+1.480-624-2505", | ||
"Registrant": { | ||
"Name": "TRIBUNE COMPANY", | ||
"Organization": "Tribune Technology LLC", | ||
"Street": "435 N. Michigan Ave", | ||
"City": "Chicago", | ||
"State": "Illinois", | ||
"PostalCode": "60611", | ||
"Country": "United States", | ||
"Phone": "+1.3122229100", | ||
"Email": "[email protected]", | ||
}, | ||
} | ||
} | ||
``` | ||
```csharp | ||
var pattern = | ||
@"--- | ||
# Tokens must appear in defined order | ||
OutOfOrder: false | ||
--- | ||
First Name: {FirstName} | ||
Middle Name: {MiddleName?} | ||
Last Name: {LastName}"; | ||
|
||
The coding required to extract the information in the above example is simple. First create a simple class to hold all the information you'd like to extract from the source text: | ||
var input = | ||
@"First Name: Alice | ||
Last Name: Smith"; | ||
|
||
```c# | ||
public class WhoisRecord | ||
{ | ||
public string DomainName { get; set; } | ||
var student = new Tokenizer().Parse<Student>(pattern, input); | ||
|
||
... | ||
} | ||
Assert.AreEqual("Alice", student.FirstName); | ||
Assert.IsNull(student.MiddleName); | ||
Assert.AreEqual("Smith", student.LastName); | ||
``` | ||
|
||
In order to populate the class with values, create an instance of the Tokenizer and call the Parse() method. This instantiates a new object and parses the input text and reflects it's content onto the object. The TokenResult object contains a list of all tokens extracted, as well as a Value property with the new object assigned to. | ||
## Line Handling | ||
|
||
```c# | ||
public WhoisRecord Parse(string input) | ||
{ | ||
var tokenizer = new Tokenizer(); | ||
Multiple tokens can appear on the same line of text, or tokens can span multiple lines of text if desired. Windows and Unix line endings are automatically handled in patterns and input. | ||
|
||
var result = tokenizer.Parse<WhoisRecord>(pattern, input); | ||
```csharp | ||
var pattern = | ||
@"Comments: | ||
{Comment:Trim()} | ||
return result.Value; | ||
} | ||
``` | ||
Name: | ||
{Name}"; | ||
|
||
Before you can call the Tokenizer, you need to supply it with a pattern first. For the example above, the pattern would look something like: | ||
var input = | ||
@"Comments: | ||
10/10 | ||
Would parse text again. | ||
Name: | ||
Bob"; | ||
|
||
var review = new Tokenizer().Parse<Review>(pattern, input); | ||
|
||
Assert.AreEqual("10/10\nWould parse text again.", review.Comment); | ||
Assert.AreEqual("Bob", review.Name); | ||
``` | ||
Domain Name: #{WhoisRecord.DomainName} | ||
Registry Domain ID: #{WhoisRecord.DomainId} | ||
Registrar WHOIS Server: #{WhoisRecord.Registrar.WhoisHostName} | ||
Registrar URL: #{WhoisRecord.Registrar.Url} | ||
Updated Date: #{WhoisRecord.UpdatedDate} | ||
Creation Date: #{WhoisRecord.CreatedDate} | ||
Registrar Registration Expiration Date: #{WhoisRecord.ExpirationDate} | ||
Registrar: #{WhoisRecord.Registrar.Name} | ||
Registrar IANA ID: #{WhoisRecord.RegistrarIanaId} | ||
Registrar Abuse Contact Email: #{WhoisRecord.AbuseEmail} | ||
Registrar Abuse Contact Phone: #{WhoisRecord.AbusePhoneNumber} | ||
Registrant Name: #{WhoisRecord.Registrant.Name} | ||
Registrant Organization: #{WhoisRecord.Registrant.Organization} | ||
Registrant Street: #{WhoisRecord.Registrant.Street} | ||
Registrant City: #{WhoisRecord.Registrant.City} | ||
Registrant State/Province: #{WhoisRecord.Registrant.State} | ||
Registrant Postal Code: #{WhoisRecord.Registrant.PostalCode} | ||
Registrant Country: #{WhoisRecord.Registrant.Country} | ||
Registrant Phone: #{WhoisRecord.Registrant.PhoneNumber} | ||
Registrant Email: #{WhoisRecord.Registrant.Email} | ||
|
||
## New Line Termination | ||
|
||
When data is embedded in a single line, appending the `$` symbol to the end of the Token name will match to the end of the current line: | ||
|
||
```csharp | ||
var pattern = @"Name: {Name$} | ||
Age: {Age:IsNumeric()}"; | ||
|
||
var input = @"Name: Bob | ||
Surname: Jones | ||
Age: 31"; | ||
|
||
var person = new Tokenizer().Parse<Person>(pattern, input); | ||
|
||
Assert.AreEqual(person.Name, "Bob"); // Not: "Bob\nSurname: Jones" | ||
Assert.AreEqual(person.Age, 31); | ||
``` | ||
|
||
Each placeholder in the pattern refers to a property on the object. The library will walk the object graph, instantiating properties as it encounters them, to set the values specified in the placeholder. | ||
## Repeating | ||
|
||
## Transforming Input | ||
Lists and repeating data elements can be extracted multiple by appending the `*` suffix to the token. Tokenizer will populate an underlying `List<>` or `IList<>` on the target object. | ||
|
||
Sometimes the data you're processing requires preprocessing before it can be mapped onto your object. The Tokenizer library contains a number of built-in functions that enable this to save writing additional code. | ||
```csharp | ||
var pattern = | ||
@"Name: {Manager.Name} | ||
Employee: {Manager.Manages*} | ||
Number: {Manager.Number}"; | ||
|
||
### Transforming Dates | ||
var input = | ||
@"Name: Sue | ||
Employee: Alice | ||
Employee: Bob | ||
Employee: Charles | ||
Number: 1234"; | ||
|
||
Sometimes a dates in input text can't automatically be parsed by the .NET framework. In this case, you can add a ToDateTime() transform to tell the Tokenizer to parse the date in an exact format: | ||
var result = new Tokenizer().Parse<Manager>(pattern, input); | ||
|
||
Assert.AreEqual("Sue", result.Name); | ||
Assert.AreEqual(3, result.Manages.Count); | ||
Assert.AreEqual("Alice", result.Manages[0]); | ||
Assert.AreEqual("Bob", result.Manages[1]); | ||
Assert.AreEqual("Charles", result.Manages[2]); | ||
Assert.AreEqual(1234, result.Number); | ||
``` | ||
Creation Date: 4 Dec 1990 14:32 | ||
|
||
Repeating tokens are also treated as optional tokens. | ||
|
||
## Configuration | ||
|
||
Tokenizer configuration can be set either globally, per instance or per pattern. | ||
|
||
```csharp | ||
// Global configuration | ||
TokenizerOptions.Defaults.TrimTrailingWhiteSpace = false; | ||
|
||
// Instance configuration | ||
var tokenizer = new Tokenizer(); | ||
tokenizer.Options.TrimTrailingWhiteSpace = true; | ||
|
||
// Front matter configuration | ||
var pattern = @"--- | ||
# Trim Whitespace | ||
TrimTrailingWhitespace: true | ||
--- | ||
First Name: {FirstName} | ||
Last Name: {LastName} | ||
..."; | ||
|
||
``` | ||
|
||
Pattern with transform: | ||
### Configuration Front Matter | ||
|
||
Tokenizer templates are configurable via an embedded Front Matter section. The options set in the Front Matter will effect the parsing of that template only, and override both Global and instance settings. | ||
|
||
The Front Matter section is optional. It is processed between matching `---` sequences at the start of the template pattern. Within the Front Matter, lines starting with the hash sign (`#`) are treated as comments. | ||
|
||
```yaml | ||
--- | ||
# Treat missing properties on the target object as exceptions | ||
ThrowExceptionOnMissingProperty: true | ||
|
||
# Do a case insensitive compare when matching tokens to property names on the target | ||
CaseSensitive: false | ||
--- | ||
First Name: {FirstName} | ||
Middle Names: {MiddleNames*} | ||
Last Name: {LastName} | ||
|
||
``` | ||
Creation Date: #{WhoisRecord.CreationDate:ToDateTime('d MMM yyyy HH:mm')} | ||
Configuration directives and their effects are listed in the Wiki. | ||
## Data Transformations | ||
Extracted data can be transformed before being set on the target object. | ||
```csharp | ||
var pattern = "Name: {Name:Trim(),ToLower()}"; | ||
var input = "Name: Alice "; | ||
|
||
var person = new Tokenizer().Parse<Person>(pattern, input); | ||
|
||
Assert.AreEqual(person.Name, "alice"); | ||
``` | ||
Multiple transformations (and validators) can be chained together using the `,` symbol and are executed in the order they are specified. It is easy to implement and register your own token transformers by implementing the `ITokenTransformer` interface. See the Wiki for details how, and a list of built in transformers and their usage. | ||
|
||
## Data Validation | ||
|
||
## Limitations | ||
Token validation functions are run against extracted content before it's mapped to the target object. If a validation returns false, then the token is not mapped, and the input content is searched for another match. | ||
|
||
The Tokenizer currently works on line-by-line. You cannot currently write multi-line placeholders. | ||
```csharp | ||
var pattern = "Age: {Age:IsNumeric}"; | ||
var input = "Age: Ten, Age: 11"; | ||
|
||
## Extending Tokenizer | ||
var person = new Tokenizer().Parse<Person>(pattern, input); | ||
|
||
Assert.AreEqual(person.Age, 11); | ||
``` | ||
|
||
The Tokenizer is easily extensible by adding new Token Operators and Token Validators. | ||
It is easy to implement and register your own token validators by implementing the `ITokenValidator` interface. See the Wiki for details how, and a list of built in validators and their usage. |