pattern.nl
The pattern.nl module contains a fast part-of-speech tagger for Dutch (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization.
It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.
The functions in this module take the same parameters and return the same values as their counterparts in pattern.en. Refer to the documentation there for more details.
For Dutch nouns there is singularize() and pluralize(). The implementation is slightly less robust than the English version (accuracy 91% for singularization and 80% for pluralization).
>>> from pattern.nl import singularize, pluralize
>>>
>>> print(singularize('katten'))
kat
>>> print(pluralize('kat'))
katten
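An accuracy figure like the ones quoted above is the share of test words for which the function returns the expected form. A minimal sketch of that measurement, assuming a hypothetical helper accuracy() and a tiny hand-made word list (neither is part of pattern.nl). A dictionary lookup stands in for singularize() so the snippet runs without Pattern installed:

```python
def accuracy(func, pairs):
    """Return the fraction of (input, expected) pairs that func gets right."""
    correct = sum(1 for word, expected in pairs if func(word) == expected)
    return correct / float(len(pairs))

# Hypothetical test data: (plural, expected singular).
pairs = [("katten", "kat"), ("boeken", "boek"), ("huizen", "huis")]

# Stand-in for pattern.nl's singularize(), for illustration only.
# With Pattern installed, pass the real function instead:
#     from pattern.nl import singularize
#     accuracy(singularize, pairs)
lookup = {"katten": "kat", "boeken": "boek", "huizen": "huizen"}  # last entry is deliberately wrong
stand_in = lookup.get

score = accuracy(stand_in, pairs)
print("%.2f" % score)  # 0.67 (two of three correct)
```

The same helper works for pluralize() by swapping the order of the words in each pair.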
For Dutch verbs there is conjugate(), lemma(), lexeme() and tenses(). The lexicon for verb conjugation contains about 4,000 common Dutch verbs. For unknown verbs it will fall back to a rule-based approach with an accuracy of about 81%.
>>> from pattern.nl import conjugate
>>> from pattern.nl import INFINITIVE, PRESENT, SG
>>>
>>> print(conjugate('ben', INFINITIVE))
zijn
>>> print(conjugate('ben', PRESENT, 2, SG))
bent
Dutch adjectives followed by a noun inflect with an -e suffix (e.g., braaf → brave kat). You can get the base form with the predicative() function, or vice versa with attributive(). Accuracy is 99%.
>>> from pattern.nl import attributive, predicative
>>>
>>> print(predicative('brave'))
braaf
>>> print(attributive('braaf'))
brave
For opinion mining there is sentiment(), which returns a (polarity, subjectivity)-tuple, based on a lexicon of adjectives. Polarity is a value between -1.0 and +1.0, subjectivity between 0.0 and 1.0.
The accuracy is around 82% (P 0.79, R 0.86) for book reviews:
>>> from pattern.nl import sentiment
>>> print(sentiment('Een onwijs spannend goed boek!'))
(0.69, 0.90)
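The tuple can be unpacked and the polarity thresholded to obtain a coarse label. A minimal sketch, assuming a hypothetical helper label() and a cutoff of 0.1 (both are illustrative choices, not part of pattern.nl). It reuses the values from the example above instead of calling sentiment(), so it runs without Pattern installed:

```python
def label(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, +1.0] to a coarse label."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"

# Values taken from the example above; with Pattern installed:
#     polarity, subjectivity = sentiment('Een onwijs spannend goed boek!')
polarity, subjectivity = (0.69, 0.90)

print(label(polarity))  # positive
```

The cutoff is worth tuning on your own data, since the quoted accuracy was measured on book reviews.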
For parsing there is parse(), parsetree() and split(). The parse() function annotates words in the given string with their part-of-speech tags (e.g., NN for nouns and VB for verbs). The parsetree() function takes a string and returns a tree of nested objects (Text → Sentence → Chunk → Word). The split() function takes the output of parse() and returns a Text. See the pattern.en documentation for how to manipulate Text objects.
>>> from pattern.nl import parse, split
>>>
>>> s = parse('De kat zit op de mat.')
>>> for sentence in split(s):
>>>     print(sentence)
Sentence('De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP'
'de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O')
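Each token in the tagged string is a slash-separated record. Judging from the output above, the fields are word, part-of-speech tag, chunk tag and PNP tag. A minimal sketch of pulling them apart in plain Python, assuming that four-field layout (the helper fields() is hypothetical, not part of pattern.nl):

```python
def fields(tagged):
    """Split 'word/POS/chunk/PNP' tokens into 4-tuples."""
    # Split from the right so only the last three slashes act as separators.
    return [tuple(token.rsplit("/", 3)) for token in tagged.split()]

# The tagged sentence from the example above, as a plain string.
tagged = ("De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP "
          "de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O")

for word, pos, chunk, pnp in fields(tagged):
    if pos.startswith("NN"):
        print(word)  # prints the nouns: kat, then mat
```

For anything beyond quick inspection, the Sentence and Word objects returned by split() are likely the more convenient route.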
The parser is built on Jeroen Geertzen's Dutch language model. The accuracy is around 91%. The original WOTAN tagset is mapped to Penn Treebank. If you need to work with the original tags you can also use parse() with an optional parameter tagset="WOTAN".
Reference: Geertzen, J. (2010). Brill-NL. Retrieved from: http://cosmion.net/jeroen/software/brill_pos/.