OPENNLP-216: Add Detokenizer API section (#388)

* OPENNLP-216: Add Detokenizer API section
* OPENNLP-216: Add Detokenizer API section (correct)
apache · Dec 30, 2020 · af6a6e0
1 parent 52eb4cf
Showing 1 changed file with 66 additions and 7 deletions.
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -396,19 +396,78 @@ test -> NO_OPERATION
 			<![CDATA[
 He said "This is a test".]]>		
 		</programlisting>
-		TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
 		</para>
 		<section id="tools.tokenizer.detokenizing.api">
 			<title>Detokenizing API</title>
-			<para>TODO: Write documentation about the detokenizer api. Any contributions
-are very welcome. If you want to contribute please contact us on the mailing list
-or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-216">OPENNLP-216</ulink>.</para>
+			<para>
+				The Detokenizer converts an array of tokens back into a String.
+				To instantiate the DictionaryDetokenizer (a rule-based detokenizer),
+				a DetokenizationDictionary (the rule dictionary) must be created first.
+				The following code sample shows how a rule dictionary can be loaded.
+				<programlisting language="java">
+					<![CDATA[
+InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
+DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
+				</programlisting>
+				After the rule dictionary is loaded, the DictionaryDetokenizer can be instantiated.
+				<programlisting language="java">
+					<![CDATA[
+Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
+				</programlisting>
+				The detokenizer offers two detokenize methods. The first detokenizes the input tokens into a String.
+				<programlisting language="java">
+					<![CDATA[
+String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
+String sentence = detokenizer.detokenize(tokens, null);
+Assert.assertEquals("A co-worker helped.", sentence);]]>
+				</programlisting>
+				Tokens that are joined without a space in between can be separated by a split marker, passed as the second argument.
+				<programlisting language="java">
+					<![CDATA[
+String sentence = detokenizer.detokenize(tokens, "<SPLIT>");
+Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]>
+				</programlisting>
+				The API also offers a method which simply returns an array of operations, one for each token in the input array.
+				<programlisting language="java">
+					<![CDATA[
+DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
+for (DetokenizationOperation operation : operations) {
+  System.out.println(operation);
+}]]>
+				</programlisting>
+				Output:
+				<programlisting>
+					<![CDATA[
+NO_OPERATION
+NO_OPERATION
+MERGE_BOTH
+NO_OPERATION
+NO_OPERATION
+MERGE_TO_LEFT]]>
+				</programlisting>
+			</para>
 		</section>
 		<section id="tools.tokenizer.detokenizing.dict">
 			<title>Detokenizer Dictionary</title>
-			<para>TODO: Write documentation about the detokenizer dictionary. Any contributions
-are very welcome. If you want to contribute please contact us on the mailing list
-or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-217">OPENNLP-217</ulink>.</para>
+			<para>
+				The DetokenizationDictionary holds the rules used by the detokenizer. It is built from two arrays:
+				tokens - an array of tokens that should be detokenized according to an operation.
+				operations - an array of operations, where each entry specifies the operation
+				to apply to the token at the same index.
+				The following code sample shows how a rule dictionary can be created.
+				<programlisting language="java">
+					<![CDATA[
+String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
+Operation[] operations = new Operation[]{
+    Operation.MOVE_LEFT,
+    Operation.MOVE_LEFT,
+    Operation.MOVE_RIGHT,
+    Operation.MOVE_LEFT,
+    Operation.RIGHT_LEFT_MATCHING,
+    Operation.MOVE_BOTH};
+DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
+				</programlisting>
+			</para>
 		</section>
 	</section>
 </chapter>
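
Note on the semantics documented above: the following is a minimal plain-Java sketch of how dictionary rules could turn into per-token operations and then into the detokenized string. It is an illustration under stated assumptions, not OpenNLP's implementation: the class DetokenizeSketch, the enums Rule and Op, and the methods sampleRules, operationsFor and join are hypothetical names introduced here. Only the constant names are taken from the diff. The handling of RIGHT_LEFT_MATCHING (first occurrence attaches to the right, the matching occurrence attaches to the left) is an assumption based on the quoted-sentence example in the chapter.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: every name in this file is hypothetical, not an OpenNLP class.
final class DetokenizeSketch {

  // Dictionary-side rules, named after the Operation constants used in the diff.
  enum Rule { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH, RIGHT_LEFT_MATCHING }

  // Per-token results, named after the values shown in the diff's sample output.
  enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

  // The same token/rule pairs as the dictionary example in the diff.
  static Map<String, Rule> sampleRules() {
    Map<String, Rule> rules = new HashMap<>();
    rules.put(".", Rule.MOVE_LEFT);
    rules.put("!", Rule.MOVE_LEFT);
    rules.put("(", Rule.MOVE_RIGHT);
    rules.put(")", Rule.MOVE_LEFT);
    rules.put("\"", Rule.RIGHT_LEFT_MATCHING);
    rules.put("-", Rule.MOVE_BOTH);
    return rules;
  }

  // Assign one operation to every token by looking it up in the rule table.
  static Op[] operationsFor(String[] tokens, Map<String, Rule> rules) {
    Op[] ops = new Op[tokens.length];
    Set<String> open = new HashSet<>(); // matching tokens that are currently "opened"
    for (int i = 0; i < tokens.length; i++) {
      Rule rule = rules.get(tokens[i]);
      if (rule == null) {
        ops[i] = Op.NO_OPERATION; // token has no rule
      } else if (rule == Rule.MOVE_LEFT) {
        ops[i] = Op.MERGE_TO_LEFT;
      } else if (rule == Rule.MOVE_RIGHT) {
        ops[i] = Op.MERGE_TO_RIGHT;
      } else if (rule == Rule.MOVE_BOTH) {
        ops[i] = Op.MERGE_BOTH;
      } else if (open.remove(tokens[i])) {
        ops[i] = Op.MERGE_TO_LEFT; // assumed: closing occurrence attaches to the left
      } else {
        open.add(tokens[i]);
        ops[i] = Op.MERGE_TO_RIGHT; // assumed: opening occurrence attaches to the right
      }
    }
    return ops;
  }

  // Join tokens; a null splitMarker means merged tokens are simply concatenated.
  static String join(String[] tokens, Op[] ops, String splitMarker) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
      sb.append(tokens[i]);
      if (i + 1 == tokens.length) {
        break; // nothing follows the last token
      }
      boolean nextAttachesLeft = ops[i + 1] == Op.MERGE_TO_LEFT || ops[i + 1] == Op.MERGE_BOTH;
      boolean thisAttachesRight = ops[i] == Op.MERGE_TO_RIGHT || ops[i] == Op.MERGE_BOTH;
      if (nextAttachesLeft || thisAttachesRight) {
        if (splitMarker != null) {
          sb.append(splitMarker);
        }
      } else {
        sb.append(' ');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] tokens = {"A", "co", "-", "worker", "helped", "."};
    Op[] ops = operationsFor(tokens, sampleRules());
    System.out.println(join(tokens, ops, null));      // A co-worker helped.
    System.out.println(join(tokens, ops, "<SPLIT>")); // A co<SPLIT>-<SPLIT>worker helped<SPLIT>.
  }
}
```

With the sample rules, the sketch reproduces the operations and strings shown in the documentation examples above, which makes it a convenient way to reason about what a new dictionary entry would do before trying it against the real DictionaryDetokenizer.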