5 Feb '12
Lex-role:
#<J48 J48 pruned tree
------------------
compl <= 0.22314: en (150.0/23.0)
compl > 0.22314
| trans-subj <= 0.326121
| | cop-sub <= 0.011236
| | | pass-subj <= 0.201835
| | | | pass-subj <= 0.028571: en (38.0/13.0)
| | | | pass-subj > 0.028571: es (237.0/83.0)
| | | pass-subj > 0.201835: en (39.0/9.0)
| | cop-sub > 0.011236: es (127.0/16.0)
| trans-subj > 0.326121: en (51.0/11.0)
Number of Leaves : 6
Size of the tree : 11
>
=== Confusion Matrix ===
a b <-- classified as
242 79 | a = es
105 216 | b = en
=== Summary ===
Correctly Classified Instances 458 71.3396 %
Incorrectly Classified Instances 184 28.6604 %
Kappa statistic 0.4268
Mean absolute error 0.3617
Root mean squared error 0.4519
Relative absolute error 72.3462 %
Root relative squared error 90.3791 %
Total Number of Instances 642
Verb valency:
#<J48 J48 pruned tree
------------------
three <= 0.067669
| two <= 0.40625
| | three <= 0.030303
| | | three <= 0.011442: en (17.0/1.0)
| | | three > 0.011442
| | | | two <= 0.4: es (8.0/1.0)
| | | | two > 0.4: en (2.0)
| | three > 0.030303: en (12.0)
| two > 0.40625
| | one <= 0.384615: en (99.0/36.0)
| | one > 0.384615: es (455.0/183.0)
three > 0.067669: en (49.0/5.0)
Number of Leaves : 7
Size of the tree : 13
>
=== Confusion Matrix ===
a b <-- classified as
269 52 | a = es
189 132 | b = en
=== Summary ===
Correctly Classified Instances 401 62.4611 %
Incorrectly Classified Instances 241 37.5389 %
Kappa statistic 0.2492
Mean absolute error 0.4523
Root mean squared error 0.4857
Relative absolute error 90.4594 %
Root relative squared error 97.1331 %
Total Number of Instances 642
4 Feb '12
Ran the argument structure test in train_args.clj. Training a C4.5 decision tree and testing with 20-fold cross-validation gives the following (an interop sketch appears after the results):
#<J48 J48 pruned tree
------------------
ao <= 0.352657
| aoc <= 0.054054
| | sc <= 0.16129
| | | aoc <= 0.009259: en (29.0/2.0)
| | | aoc > 0.009259
| | | | sc <= 0.15
| | | | | pc <= 0.005
| | | | | | s <= 0.380531: en (5.0)
| | | | | | s > 0.380531
| | | | | | | s <= 0.504762: es (7.0/1.0)
| | | | | | | s > 0.504762: en (3.0/1.0)
| | | | | pc > 0.005: es (6.0)
| | | | sc > 0.15: en (5.0)
| | sc > 0.16129
| | | ao <= 0.275362
| | | | aoc <= 0.010526
| | | | | sc <= 0.272727
| | | | | | sc <= 0.25641
| | | | | | | p <= 0.042857: en (5.0/1.0)
| | | | | | | p > 0.042857: es (46.0/15.0)
| | | | | | sc > 0.25641: en (9.0)
| | | | | sc > 0.272727
| | | | | | s <= 0.285714
| | | | | | | sc <= 0.340426: en (5.0)
| | | | | | | sc > 0.340426: es (5.0/1.0)
| | | | | | s > 0.285714: es (46.0/5.0)
| | | | aoc > 0.010526
| | | | | p <= 0.171429: es (116.0/9.0)
| | | | | p > 0.171429
| | | | | | ao <= 0.243243
| | | | | | | aoc <= 0.014493
| | | | | | | | s <= 0.3125: en (2.0)
| | | | | | | | s > 0.3125: es (4.0/1.0)
| | | | | | | aoc > 0.014493: es (10.0)
| | | | | | ao > 0.243243: en (6.0/1.0)
| | | ao > 0.275362: es (149.0/67.0)
| aoc > 0.054054: en (36.0/5.0)
ao > 0.352657: en (148.0/21.0)
Number of Leaves : 20
Size of the tree : 39
>
=== Confusion Matrix ===
a b <-- classified as
264 57 | a = es
128 193 | b = en
=== Summary ===
Correctly Classified Instances 457 71.1838 %
Incorrectly Classified Instances 185 28.8162 %
Kappa statistic 0.4237
Mean absolute error 0.3602
Root mean squared error 0.4585
Relative absolute error 72.0461 %
Root relative squared error 91.6902 %
Total Number of Instances 642
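For reference, a minimal sketch of how a run like this can be reproduced with raw Weka interop from Clojure. The ARFF path is hypothetical; the real features are built in train_args.clj rather than loaded from disk.
(import '[weka.core.converters ConverterUtils$DataSource]
        '[weka.classifiers.trees J48]
        '[weka.classifiers Evaluation]
        '[java.util Random])
;; Train a C4.5 (J48) tree and report k-fold cross-validation results.
(defn eval-j48 [arff-path folds]
  (let [data (ConverterUtils$DataSource/read arff-path)]
    (.setClassIndex data (dec (.numAttributes data))) ; class = L1 (en/es)
    (let [tree (doto (J48.) (.buildClassifier data))
          ev   (doto (Evaluation. data)
                 (.crossValidateModel (J48.) data folds (Random. 1)))]
      (println tree)                    ; the pruned tree, as above
      (println (.toSummaryString ev))   ; accuracy, kappa, error figures
      (println (.toMatrixString ev))))) ; the confusion matrix
;; (eval-j48 "args.arff" 20)   ; hypothetical file name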
28 Jan '12
Incorporated ICLE and OANC. Now have 321 instances each of es and en, with 390,006 tokens of en and 384,885 tokens of es. Ran C4.5, giving this tree:
#<J48 J48 pruned tree
------------------
nn <= 0.038776
| poss <= 0.036585
| | quantmod <= 0.004098
| | | prt <= 0.004178: es (236.0/6.0)
| | | prt > 0.004178
| | | | auxpass <= 0.013208
| | | | | possessive <= 0.002128: es (29.0/3.0)
| | | | | possessive > 0.002128
| | | | | | predet <= 0.001289: es (6.0/1.0)
| | | | | | predet > 0.001289: en (6.0)
| | | | auxpass > 0.013208: es (27.0)
| | quantmod > 0.004098
| | | expl <= 0.002786: en (7.0/1.0)
| | | expl > 0.002786: es (9.0)
| poss > 0.036585
| | complm <= 0.008368: en (13.0)
| | complm > 0.008368: es (4.0/1.0)
nn > 0.038776
| cop <= 0.027211
| | poss <= 0.013575
| | | mwe <= 0.001429
| | | | npadvmod <= 0.00074
| | | | | parataxis <= 0.001107: en (5.0)
| | | | | parataxis > 0.001107
| | | | | | dep <= 0.040302: es (10.0/1.0)
| | | | | | dep > 0.040302: en (2.0)
| | | | npadvmod > 0.00074: en (8.0)
| | | mwe > 0.001429: en (33.0/1.0)
| | poss > 0.013575: en (222.0)
| cop > 0.027211
| | det <= 0.115263
| | | expl <= 0.002321: en (13.0)
| | | expl > 0.002321
| | | | possessive <= 0.007974: es (5.0)
| | | | possessive > 0.007974: en (2.0)
| | det > 0.115263: es (5.0)
Number of Leaves : 19
Size of the tree : 37
*********
20-fold cross-validation gives:
=== Confusion Matrix ===
a b <-- classified as
291 30 | a = es
36 285 | b = en
=== Summary ===
Correctly Classified Instances 576 89.7196 %
Incorrectly Classified Instances 66 10.2804 %
Kappa statistic 0.7944
Mean absolute error 0.1139
Root mean squared error 0.3116
Relative absolute error 22.7846 %
Root relative squared error 62.3168 %
Total Number of Instances 642
*********
100-tree random forest with 20-fold cross-validation gives (a sketch appears after the results):
=== Confusion Matrix ===
a b <-- classified as
306 15 | a = es
26 295 | b = en
=== Summary ===
Correctly Classified Instances 601 93.6137 %
Incorrectly Classified Instances 41 6.3863 %
Kappa statistic 0.8723
Mean absolute error 0.1686
Root mean squared error 0.2413
Relative absolute error 33.7132 %
Root relative squared error 48.2598 %
Total Number of Instances 642
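The forest run reduces to a sketch like this; setNumTrees is the Weka 3.6-era option name, and data is a class-labelled Instances object as in the J48 sketch above.
(import '[weka.classifiers.trees RandomForest]
        '[weka.classifiers Evaluation]
        '[java.util Random])
;; 100-tree random forest evaluated with 20-fold cross-validation.
(defn eval-forest [data folds n-trees]
  (let [rf (doto (RandomForest.) (.setNumTrees n-trees))
        ev (doto (Evaluation. data)
             (.crossValidateModel rf data folds (Random. 1)))]
    (println (.toSummaryString ev))
    (println (.toMatrixString ev))))
;; (eval-forest data 20 100)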
3 Nov '11
Reran the t-test on dependency relations. Results for relations significant at the 95% level are in log-addendum-006.txt (es in x, en in y). Results at the 99% level are in log-addendum-007.txt.
23 Oct '11
Note to self: Do L1-en use VP anaphora (i.e. do) readily in English? What types of VP anaphora does Spanish have? See Lopez and Winkler ("spanish-vp-anaphora.pdf")
9 Oct '11
commit: 9b9e80fe082e8e68e96c42db9eb247476781a67d
Finished an initial version of the arguments/lexical-NP-based classifier. Code is in arguments.clj and train_arguments.clj.
T-test results show a significant difference between the following argument configurations, with the listed language having the greater frequency (alpha = 0.05):
s --> ES
S --> EN
ao --> ES
aO --> ES
Ao --> ES
AO --> EN
aiO --> ES
aoC --> ES
aOC --> ES
AOC --> EN
p --> ES
P --> EN
pC --> ES
More detailed results in log-addendum-005.txt
7 Oct '11
Most recently added some en samples from ICE and merged some of the wricle samples to better balance the sample counts. Also got good results with a random forest with n=100. Will now begin the verb valency test (see Sep 11 note).
Useful terms: valency expansion, valency reduction, avalent, monovalent, divalent, trivalent
5 Oct '11
Note to self:
Consider phrasal verbs. Do Spanish speakers separate phrasal verbs? e.g. "He chewed the food up".
But also consider "That is something he won't put up with."
25 Sep '11
Added functions to the 'train-verbs' and 'verb' namespaces to allow
classification by modal relative frequency and also by high-frequency verb
relative frequency. The verbs considered are from Altenberg and Granger and are
'have go take do say look know see give think come find get make use'.
Using a Neural Net, classifying by modals gives about 80% accuracy. I need to run it again and record the results.
Classifying by high-frequency verbs gives the following (a feature-extraction sketch appears after the numbers):
Correctly Classified Instances 139 78.0899 %
Incorrectly Classified Instances 39 21.9101 %
Kappa statistic 0.4474
Mean absolute error 0.2173
Root mean squared error 0.4346
Relative absolute error 54.2876 %
Root relative squared error 97.2821 %
Total Number of Instances 178
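The per-instance feature vector for this test might be computed as sketched below; lemmas is a hypothetical seq of lemmatized tokens for one pooled instance, and normalizing by the summed high-frequency-verb count is an assumption (the real code may normalize by total verb usages instead).
(def hf-verbs ["have" "go" "take" "do" "say" "look" "know" "see"
               "give" "think" "come" "find" "get" "make" "use"])
;; Relative frequency of each Altenberg & Granger verb in one instance.
(defn hf-verb-features [lemmas]
  (let [counts (frequencies lemmas)
        total  (reduce + (map #(get counts % 0) hf-verbs))]
    (vec (for [v hf-verbs]
           (if (zero? total)
             0.0
             (double (/ (get counts v 0) total)))))))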
23 Sep '11
commit 8b2c71f4a645c96a86fd60557aaed877664c8144
Added function count-words-in-corpora to thesis.data. The following are the current results (a sketch of such a helper appears below):
:micusp :es 31057
:micusp :en 167984
:wricle :es 88977
:misc :es 496
:misc :en 2848
:sulec :es 36267
total es = 156,797
total en = 170,832
Note that ICLE will provide another 198,131 words of Spanish, more than doubling that body. I can probably get enough English from ICE to make up the difference on the English side.
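A helper of that sort can be as simple as the sketch below; the directory argument and whitespace tokenization are assumptions, not necessarily what thesis.data does.
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
;; Total whitespace-delimited word count over every file under dir.
(defn count-words [dir]
  (->> (file-seq (io/file dir))
       (filter #(.isFile %))
       (map #(count (str/split (slurp %) #"\s+")))
       (reduce +)))
;; e.g. (count-words "data/sulec") for the :sulec :es figure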
22 Sep '11
Finished verb.clj to the point where I was able to run statistics and classification tests (found in train_verb.clj).
Training on all corpora and doing 10-fold cross-validation, a decision tree gives:
Correctly Classified Instances 126 70.7865 %
Incorrectly Classified Instances 52 29.2135 %
Kappa statistic 0.2077
Mean absolute error 0.3059
Root mean squared error 0.5139
Relative absolute error 76.424 %
Root relative squared error 115.0477 %
Total Number of Instances 178
A Neural Net gives slightly better results at:
Correctly Classified Instances 136 76.4045 %
Incorrectly Classified Instances 42 23.5955 %
Kappa statistic 0.3853
Mean absolute error 0.2601
Root mean squared error 0.4596
Relative absolute error 64.9751 %
Root relative squared error 102.8927 %
Total Number of Instances 178
On the statistics test, at the 95% confidence level the following tense/aspect combinations were shown to have statistically different relative frequencies between L1-es and L1-en. The language with the higher frequency is written next to each.
present-perfect-passive en
present-perfect-progressive es
present-progressive es
past-passive en
present es
past en
Note that negative get passives are still not recognized, so that needs to be done.
See addendum-004 for more statistics
21 Sep '11
Used Incanter to run a t-test on dependency relations. The following dependencies show significant differences in relative frequency between L1 English and L1 Spanish at 95% confidence (a sketch of the test call follows the list):
prt
purpcl
poss
predet
possessive
rcmod
csubj
mark
cop
xcomp
advcl
pcomp
expl
aux
neg
npadvmod
infmod
complm
ccomp
nsubj
det
nn
See addendum-003 for more statistics.
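The test call itself reduces to something like this sketch; es-freqs and en-freqs are hypothetical vectors holding one relation's per-instance relative frequencies (es in x, en in y, matching the 3 Nov note above).
(require '[incanter.stats :as stats])
;; Two-sample t-test on one dependency relation's relative frequencies.
(defn signif-reln? [es-freqs en-freqs alpha]
  (< (:p-value (stats/t-test es-freqs :y en-freqs)) alpha))
;; (signif-reln? es-poss-freqs en-poss-freqs 0.05)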
20 Sep '11
Did more work on verb.clj. Also created train_verbs.clj, which will contain the code that performs classifying tasks using the tools in verbs.clj. So far the 'extract-verbs' function can recognize all twelve tense/aspect combinations ({past,present,modal}/perfect/progressive). It also recognizes passives in 'be' and 'get', with both 'got' and 'gotten'. It does not yet recognize negative 'get' passives that require a 'do' operator, e.g. "he didn't get beaten". Also, questions are not supported and, as of my thinking right now, probably never will be. A rough sketch of the labelling appears below.
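This is heavily simplified: the real extract-verbs works over parse trees, and the [word pos-tag] pairs and the modal list here are assumptions.
(require '[clojure.string :as str])
(def modals #{"can" "could" "may" "might" "shall" "should"
              "will" "would" "must"})
;; pairs: a seq of [word pos-tag] for one finite verb group,
;; e.g. [["has" "VBZ"] ["been" "VBN"] ["eating" "VBG"]].
(defn label-verb-group [pairs]
  (let [words (map (comp str/lower-case first) pairs)
        tags  (map second pairs)]
    {:tense       (cond (modals (first words)) :modal
                        (= "VBD" (first tags)) :past
                        :else                  :present)
     :perfect     (boolean (some #{"have" "has" "had"} words))
     :progressive (= "VBG" (last tags))
     :passive     (and (= "VBN" (last tags))
                       (boolean (some #{"be" "been" "being" "am" "is" "are"
                                        "was" "were" "get" "got" "gotten"}
                                      words)))}))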
Yesterday I added access methods for the sulec corpus to data.clj. I pooled the sulec essays by threes, creating a total of 41 "instances". I also generated a *.stats file for each instance (from all corpora). At the moment, this file is simply a text file that contains the number of stanford nlp parser tokens. I'm not sure if this will be useful. My thinking was that this number could provide a normalizing factor for the different attributes, but now that I think about it, my current approach might be better. For instance, in thesis.train-deps, I use the total number of dependencies of any type as a normalizing factor. I don't immediately see a reason to change this. One concern is that some instances are too small for certain attributes to have statistical relevance. This warrants some thought.
11 Sep '11
Note to self: a good attribute to analyze might involve looking at the arguments of a verb and seeing how many are lexical NPs and how many are referential forms. Recall that Du Bois says that no more than one lexical NP is preferred: in intransitives it goes in the S position (of course), and in transitives and ditransitives in the D.O. position. This could be presented as fourteen attributes:
1 argument -> {s, S}
2 arguments -> {ao,aO,Ao,AO}
3 arguments -> {aio,aiO,aIo,aIO,Aio,AiO,AIo,AIO}
The values for each could be represented thus, where the variables are the number of occurrences of that particular argument configuration (a small sketch follows the formulas):
Vs = sum(s)/(sum(s)+sum(S))
Vao = sum(ao)/(sum(ao)+sum(aO)+sum(Ao)+sum(AO))
and so forth
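A sketch of computing those values, where counts is a hypothetical map from configuration keyword to its occurrence count in one instance:
;; Value of configuration k within its argument-count family ks,
;; e.g. Vao = (config-value counts [:ao :aO :Ao :AO] :ao)
;;      Vs  = (config-value counts [:s :S] :s)
(defn config-value [counts ks k]
  (let [total (reduce + (map #(get counts % 0) ks))]
    (if (zero? total)
      0.0
      (double (/ (get counts k 0) total)))))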
Sources:
Du Bois, John W. "Discourse and Grammar" from ed. Tomasello, Michael "The New Psychology of Language" Ch. 2 pp 48-87
3 Sep '11
No commit
Up next, figure out how tenses are represented in the syntactic parse trees and turn this into attributes. Let each verb tense be an attribute with the value being the number of times it appears in the text divided by the total number of verb usages. Restrict to finite verbs.
10 Aug '11
commit 17849cefb00c0082d9d6e5a4885ae83edc0eb6b6
Previously (about 10 days ago) wrote a function
train-deps/run-root-attr-trim-experiment that performs the following
experiment (from docs): This function makes an unpruned J48 tree using
the data in dataset, evaluates it using cross validation, then removes
the attribute from dataset that is the basis of the initial decision
in the tree and calls itself again, decrementing depth.
The purpose of this is to discover which features are apparently most
important in classification. This can likely also be done (though
probably with different results) using a classifier that can report
the most informative features automatically.
See log-addendum-002 for specifics. A sketch of the loop appears below.
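Pulling the root attribute's name off the printed tree, as done here, is a simplification and not necessarily how the real function finds it.
(require '[clojure.string :as str])
(import '[weka.classifiers.trees J48]
        '[weka.classifiers Evaluation]
        '[weka.filters Filter]
        '[weka.filters.unsupervised.attribute Remove]
        '[java.util Random])
;; First attribute named by the printed tree (line 4 of J48's toString).
(defn root-attr-name [tree]
  (-> (str tree) str/split-lines (nth 3) str/trim (str/split #"\s+") first))
(defn trim-experiment [data depth]
  (when (pos? depth)
    (let [tree (doto (J48.) (.setUnpruned true) (.buildClassifier data))
          ev   (doto (Evaluation. data)
                 (.crossValidateModel (doto (J48.) (.setUnpruned true))
                                      data 10 (Random. 1)))
          root (root-attr-name tree)
          ;; drop the root attribute (Remove uses 1-based indices)
          rm   (doto (Remove.)
                 (.setAttributeIndices
                  (str (inc (.index (.attribute data root)))))
                 (.setInputFormat data))]
      (println root "->" (.pctCorrect ev))
      (recur (Filter/useFilter data rm) (dec depth)))))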
29 Jul '11
3:14pm
commit 51b394f7ce93d8bed552b5e07119a35026db2719
Ran an experiment hoping to discover if some corpus files should be dropped or otherwise treated specially due to small size. Initial results seem to indicate no. The table below shows the results from 20-fold cross-validation; see also log-addendum-001 for the list of files used. A multilayer perceptron was used as the classifier with no special options. Still need to investigate whether results can be improved by pooling smaller files.
MinDep   Average % correct over 10 runs
15 86.86131386861314
20 86.46616541353384
25 85.88709677419355
35 82.47422680412373
45 82.49999999999999
60 77.08333333333334
8:35am
commit 659bea22b18d4ff857aaa886b5797f1ce96a1aef
Training a decision tree classifier (:decision-tree :c45 / J48 pruned tree) on a dataset consisting of all L1-ES wricle texts of levels 5 and 6 (79), all L1-ES micusp texts (8), and a sampling of L1-EN wricle texts (48) yields a classifier which, when evaluated with 10-fold cross-validation, achieves 88.1481% accuracy. The decision tree is as follows:
nn <= 0.043394: es (78.0/1.0)
nn > 0.043394
| predet <= 0.001776
| | advmod <= 0.034846
| | | quantmod <= 0.001476
| | | | prt <= 0.00202: es (6.0)
| | | | prt > 0.00202: en (2.0)
| | | quantmod > 0.001476: en (6.0)
| | advmod > 0.034846: en (40.0/1.0)
| predet > 0.001776: es (3.0)
27 July '11 Code is now capable of doing deps relation-based classification. The efficacy of this has not yet been determined. Relevant code is in thesis.train-deps. Next step: choose an appropriate classifier algorithm and test.
25 July '11 Added file train_deps.clj to contain code for running the deps-based classification, using clj-ml to interface with Weka. Began an initial test by constructing a Weka data set with as many attributes as there are Stanford dependencies (50-some). I added a Weka Instance to the data set for each sample text. I set up each attribute to be numeric, counting the number of times a particular dependency relation is used in a text (i.e. just the basic relation such as "xsubj", ignoring the words that are its parameters), dividing that count by the total number of dependencies, and using that as the attribute value. The class attribute has the key "L1" with possible values "en" or "es". thesis.train-deps/make-reln-dataset-with-samples will construct the weka dataset from the supplied lists of English and Spanish L1 micusp files. Starting a run with 8 L1 Spanish and 9 L1 English texts; following the run I need to train a classifier. A sketch of the dataset construction appears below.
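In this sketch, reln-names (a seq of relation name strings) and the shape of samples are hypothetical stand-ins for the real thesis.train-deps code.
(require '[clj-ml.data :as mld])
;; Relative frequency of each relation, normalized by total dependencies.
;; dep-counts maps relation name (string) to its count in one text.
(defn reln-freqs [dep-counts reln-names]
  (let [total (max 1 (reduce + (vals dep-counts)))]
    (mapv #(double (/ (get dep-counts % 0) total)) reln-names)))
;; samples: a seq of [dep-counts l1] pairs, l1 being :en or :es.
(defn make-reln-dataset [reln-names samples]
  (let [attrs (conj (mapv keyword reln-names) {:l1 [:en :es]})
        rows  (for [[counts l1] samples]
                (conj (reln-freqs counts reln-names) l1))
        ds    (mld/make-dataset "deps" attrs rows)]
    (mld/dataset-set-class ds (count reln-names)))) ; class = last attribute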
20 July '11 10:05PM Renamed parser.clj to parse.clj along with the package thesis.parse. This package now contains a function parse-sentences that takes a list of preprocessed text and returns only the parses that are complete sentences (a sketch appears below). Also added functions in data.clj that return the preprocessed text. Next step is to write a function that will create dependencies for these parses, then use those dependencies as features for WEKA.
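A sketch of that filter, assuming the parses arrive as edu.stanford.nlp.trees.Tree objects and that checking the ROOT's first child against a set of clause labels is a fair proxy for "complete sentence":
(import '[edu.stanford.nlp.trees Tree])
(def sentential? #{"S" "SBAR" "SBARQ" "SINV" "SQ"})
;; Keep only parses whose top-level clause label is sentential.
(defn parse-sentences [trees]
  (filter (fn [^Tree t]
            (sentential? (.. t firstChild label value)))
          trees))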
2:29PM Finished a preliminary preprocessor for the micusp resource. Regexes need improvement, as some documents seem to be cut off. Also, there is some initial information that needs to be stripped.
18 July '11 10:41PM Have been working on a preprocessor for the micusp resource. Still in progress. There are slight differences between the files in terms of the page headers that need to be stripped; mainly a difference of newlines, but possibly more. New folder code-nu contains code that does batch conversion from PDFs to UTF-8.
17 July '11
1:41PM
NB: MICUSP: Michigan Corpus of Upper-Level Student Papers
http://search-micusp.elicorpora.info/simple/
1:24PM Downloaded ice-canada corpus. Zip file (in data/ice-canada/) is
password protected. etc/ICElicence.doc needs to be completed and
emailed after July '11 (researcher is out of town).
http://ICE-corpora.net/ice/download.htm
13 July '11 9:32PM Setup remote git repository for this project at
[email protected]:thesis
12 July '11 3:16PM
Initial entry. Up to this point, I have been working in Clojure with the Stanford Parser (1.6.4 and now 1.6.7). Using the factored and PCFG parsers, I have generated parse trees and Stanford dependencies.
Yesterday spoke to Drs. Biava
and Smith regarding this project. Basically just explained my
intentions. Talked about finding corpora, the surprising paucity of
Spanish linguistics literature, etc.
I should provide a basic overview of my goals here:
Using parse trees, Stanford Dependencies, possibly vocabulary, and a machine learning package (likely WEKA), I want to create a system that will look at an English text and attempt to determine whether it was written by a Spanish L1 speaker or a native English speaker. The problem here is finding the features in trees etc. that will allow this classification.
Immediate Plans:
Set up a Git repository? Today I emailed Dr. Smith regarding university server space for this.
Stanford dependencies consist of "approximately 52 grammatical relations" (SD manual). Try using these as features for a first attempt at a classification system. In other words, for a text, just count the number of each relation it has and see if that gives the classifier something to work with.
Relevant Works:
- For the PCFG parser: Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
- For the factored parser: Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.