OPENNLP-216: Add Detokenizer API section (#388)
* OPENNLP-216: Add Detokenizer API section

* OPENNLP-216: Add Detokenizer API section (correct)
Alanscut authored Dec 30, 2020
1 parent 52eb4cf commit af6a6e0
Showing 1 changed file with 66 additions and 7 deletions.
73 changes: 66 additions & 7 deletions opennlp-docs/src/docbkx/tokenizer.xml
@@ -396,19 +396,78 @@ test -> NO_OPERATION
<![CDATA[
He said "This is a test".]]>
</programlisting>
TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
</para>
<section id="tools.tokenizer.detokenizing.api">
<title>Detokenizing API</title>
<para>TODO: Write documentation about the detokenizer api. Any contributions
are very welcome. If you want to contribute please contact us on the mailing list
or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-216">OPENNLP-216</ulink>.</para>
<para>
The Detokenizer converts an array of tokens back into a String.
To instantiate the DictionaryDetokenizer (a rule-based detokenizer),
a DetokenizationDictionary (the rule dictionary) must be created first.
The following code sample shows how a rule dictionary can be loaded.
<programlisting language="java">
<![CDATA[
InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
</programlisting>
After the rule dictionary is loaded, the DictionaryDetokenizer can be instantiated.
<programlisting language="java">
<![CDATA[
Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
</programlisting>
The detokenizer offers two detokenize methods. The first detokenizes the input tokens into a String.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
String sentence = detokenizer.detokenize(tokens, null);
Assert.assertEquals("A co-worker helped.", sentence);]]>
</programlisting>
If a split marker is passed instead of null, it is inserted wherever two tokens are joined without a space in between.
<programlisting language="java">
<![CDATA[
String sentence = detokenizer.detokenize(tokens, "<SPLIT>");
Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]>
</programlisting>
The second method returns an array of operations, one for each token in the input array, without building the String.
<programlisting language="java">
<![CDATA[
DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
for (DetokenizationOperation operation : operations) {
    System.out.println(operation);
}]]>
</programlisting>
Output:
<programlisting>
<![CDATA[
NO_OPERATION
NO_OPERATION
MERGE_BOTH
NO_OPERATION
NO_OPERATION
MERGE_TO_LEFT]]>
</programlisting>
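To make the relation between the operations array and the detokenized String concrete, the following is a minimal, self-contained sketch of how such operations could be applied. It is a simplified stand-in and not the OpenNLP implementation: the class DetokenizeSketch, the enum Op, and the method join are hypothetical names introduced for illustration, and the "|" split marker is an arbitrary choice.

```java
// Simplified stand-in for illustration; not OpenNLP's implementation.
final class DetokenizeSketch {

    // Hypothetical enum mirroring the operation names printed in the output above.
    enum Op { MERGE_TO_RIGHT, MERGE_TO_LEFT, MERGE_BOTH, NO_OPERATION }

    // Joins tokens according to their operations. A null splitMarker
    // means merged tokens are simply concatenated.
    static String join(String[] tokens, Op[] ops, String splitMarker) {
        if (tokens.length != ops.length) {
            throw new IllegalArgumentException("tokens and ops must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i != tokens.length; i++) {
            sb.append(tokens[i]);
            if (i + 1 == tokens.length) {
                break; // nothing follows the last token
            }
            // The next token wants to attach to this one.
            boolean nextAttachesLeft =
                ops[i + 1] == Op.MERGE_TO_LEFT || ops[i + 1] == Op.MERGE_BOTH;
            // This token wants to attach to the next one.
            boolean thisAttachesRight =
                ops[i] == Op.MERGE_TO_RIGHT || ops[i] == Op.MERGE_BOTH;
            if (nextAttachesLeft || thisAttachesRight) {
                if (splitMarker != null) {
                    sb.append(splitMarker); // mark the merge point
                }
            } else {
                sb.append(' '); // ordinary token boundary
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "co", "-", "worker", "helped", "."};
        Op[] ops = {Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_BOTH,
                    Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_TO_LEFT};
        System.out.println(join(tokens, ops, null)); // A co-worker helped.
        System.out.println(join(tokens, ops, "|"));  // A co|-|worker helped|.
    }
}
```

With the six operations listed in the output above, the sketch reproduces the sentence from the first example, and the marker appears at the same three merge points as in the split marker example.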
</para>
</section>
<section id="tools.tokenizer.detokenizing.dict">
<title>Detokenizer Dictionary</title>
<para>TODO: Write documentation about the detokenizer dictionary. Any contributions
are very welcome. If you want to contribute please contact us on the mailing list
or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-217">OPENNLP-217</ulink>.</para>
<para>
The DetokenizationDictionary holds the rules used by the detokenizer.
It can be created from two arrays of equal length:
tokens - an array of tokens that should be detokenized according to an operation.
operations - an array of operations; the operation at a given index
is applied to the token at the same index.
The following code sample shows how a rule dictionary can be created.
<programlisting language="java">
<![CDATA[
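// Operation is the nested enum DetokenizationDictionary.Operation
String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
Operation[] operations = new Operation[]{
Operation.MOVE_LEFT,            // "." attaches to the token on its left
Operation.MOVE_LEFT,            // "!" attaches to the token on its left
Operation.MOVE_RIGHT,           // "(" attaches to the token on its right
Operation.MOVE_LEFT,            // ")" attaches to the token on its left
Operation.RIGHT_LEFT_MATCHING,  // quote: right on first occurrence, left on second
Operation.MOVE_BOTH};           // "-" attaches to both neighbours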
DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
</programlisting>
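The sketch below illustrates how dictionary rules of this kind could be turned into one merge operation per input token, including the alternating behaviour of RIGHT_LEFT_MATCHING. It is a self-contained approximation written for this section, not the library code: DictionaryRuleSketch, Rule, Merge, indexOf, and resolve are hypothetical names, and the alternation logic is an assumption based on the quote handling shown in the detokenized example earlier in this chapter.

```java
// Simplified stand-in for illustration; not OpenNLP's implementation.
final class DictionaryRuleSketch {

    // Hypothetical enum named after the Operation constants used above.
    enum Rule { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH, RIGHT_LEFT_MATCHING }

    // Hypothetical enum named after the DetokenizationOperation values.
    enum Merge { MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH, NO_OPERATION }

    // Returns the index of token in dictTokens, or -1 if it has no rule.
    static int indexOf(String[] dictTokens, String token) {
        for (int i = 0; i != dictTokens.length; i++) {
            if (dictTokens[i].equals(token)) {
                return i;
            }
        }
        return -1;
    }

    // Maps every input token to a merge operation using the parallel
    // dictTokens/rules arrays as the rule dictionary.
    static Merge[] resolve(String[] dictTokens, Rule[] rules, String[] input) {
        Merge[] result = new Merge[input.length];
        // open[k] is true while dictTokens[k] has been seen an odd number of times
        boolean[] open = new boolean[dictTokens.length];
        for (int i = 0; i != input.length; i++) {
            int k = indexOf(dictTokens, input[i]);
            if (k == -1) {
                result[i] = Merge.NO_OPERATION; // token has no rule
                continue;
            }
            switch (rules[k]) {
                case MOVE_LEFT:
                    result[i] = Merge.MERGE_TO_LEFT;
                    break;
                case MOVE_RIGHT:
                    result[i] = Merge.MERGE_TO_RIGHT;
                    break;
                case MOVE_BOTH:
                    result[i] = Merge.MERGE_BOTH;
                    break;
                default: // RIGHT_LEFT_MATCHING: opening attaches right, closing attaches left
                    result[i] = open[k] ? Merge.MERGE_TO_LEFT : Merge.MERGE_TO_RIGHT;
                    open[k] = !open[k];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] dictTokens = {".", "!", "(", ")", "\"", "-"};
        Rule[] rules = {Rule.MOVE_LEFT, Rule.MOVE_LEFT, Rule.MOVE_RIGHT,
                        Rule.MOVE_LEFT, Rule.RIGHT_LEFT_MATCHING, Rule.MOVE_BOTH};
        String[] input = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        for (Merge m : resolve(dictTokens, rules, input)) {
            System.out.println(m);
        }
    }
}
```

In this sketch the first quotation mark resolves to MERGE_TO_RIGHT and the second to MERGE_TO_LEFT, so the quotes end up attached to the words they enclose, while tokens without a rule resolve to NO_OPERATION.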
</para>
</section>
</section>
</chapter>
