5 Feb '12
Lex-role:
#<J48 J48 pruned tree
------------------
compl <= 0.22314: en (150.0/23.0)
compl > 0.22314
| trans-subj <= 0.326121
| | cop-sub <= 0.011236
| | | pass-subj <= 0.201835
| | | | pass-subj <= 0.028571: en (38.0/13.0)
| | | | pass-subj > 0.028571: es (237.0/83.0)
| | | pass-subj > 0.201835: en (39.0/9.0)
| | cop-sub > 0.011236: es (127.0/16.0)
| trans-subj > 0.326121: en (51.0/11.0)
Number of Leaves : 6
Size of the tree : 11
>
=== Confusion Matrix ===
a b <-- classified as
242 79 | a = es
105 216 | b = en
=== Summary ===
Correctly Classified Instances 458 71.3396 %
Incorrectly Classified Instances 184 28.6604 %
Kappa statistic 0.4268
Mean absolute error 0.3617
Root mean squared error 0.4519
Relative absolute error 72.3462 %
Root relative squared error 90.3791 %
Total Number of Instances 642
Verb valency:
#<J48 J48 pruned tree
------------------
three <= 0.067669
| two <= 0.40625
| | three <= 0.030303
| | | three <= 0.011442: en (17.0/1.0)
| | | three > 0.011442
| | | | two <= 0.4: es (8.0/1.0)
| | | | two > 0.4: en (2.0)
| | three > 0.030303: en (12.0)
| two > 0.40625
| | one <= 0.384615: en (99.0/36.0)
| | one > 0.384615: es (455.0/183.0)
three > 0.067669: en (49.0/5.0)
Number of Leaves : 7
Size of the tree : 13
>
=== Confusion Matrix ===
a b <-- classified as
269 52 | a = es
189 132 | b = en
=== Summary ===
Correctly Classified Instances 401 62.4611 %
Incorrectly Classified Instances 241 37.5389 %
Kappa statistic 0.2492
Mean absolute error 0.4523
Root mean squared error 0.4857
Relative absolute error 90.4594 %
Root relative squared error 97.1331 %
Total Number of Instances 642
4 Feb '12
Ran the argument structure test in train_args.clj. Training a C4.5 decision tree and testing with 20-fold cross-validation gives the following (an interop sketch appears after the results):
#<J48 J48 pruned tree
------------------
ao <= 0.352657
| aoc <= 0.054054
| | sc <= 0.16129
| | | aoc <= 0.009259: en (29.0/2.0)
| | | aoc > 0.009259
| | | | sc <= 0.15
| | | | | pc <= 0.005
| | | | | | s <= 0.380531: en (5.0)
| | | | | | s > 0.380531
| | | | | | | s <= 0.504762: es (7.0/1.0)
| | | | | | | s > 0.504762: en (3.0/1.0)
| | | | | pc > 0.005: es (6.0)
| | | | sc > 0.15: en (5.0)
| | sc > 0.16129
| | | ao <= 0.275362
| | | | aoc <= 0.010526
| | | | | sc <= 0.272727
| | | | | | sc <= 0.25641
| | | | | | | p <= 0.042857: en (5.0/1.0)
| | | | | | | p > 0.042857: es (46.0/15.0)
| | | | | | sc > 0.25641: en (9.0)
| | | | | sc > 0.272727
| | | | | | s <= 0.285714
| | | | | | | sc <= 0.340426: en (5.0)
| | | | | | | sc > 0.340426: es (5.0/1.0)
| | | | | | s > 0.285714: es (46.0/5.0)
| | | | aoc > 0.010526
| | | | | p <= 0.171429: es (116.0/9.0)
| | | | | p > 0.171429
| | | | | | ao <= 0.243243
| | | | | | | aoc <= 0.014493
| | | | | | | | s <= 0.3125: en (2.0)
| | | | | | | | s > 0.3125: es (4.0/1.0)
| | | | | | | aoc > 0.014493: es (10.0)
| | | | | | ao > 0.243243: en (6.0/1.0)
| | | ao > 0.275362: es (149.0/67.0)
| aoc > 0.054054: en (36.0/5.0)
ao > 0.352657: en (148.0/21.0)
Number of Leaves : 20
Size of the tree : 39
>
=== Confusion Matrix ===
a b <-- classified as
264 57 | a = es
128 193 | b = en
=== Summary ===
Correctly Classified Instances 457 71.1838 %
Incorrectly Classified Instances 185 28.8162 %
Kappa statistic 0.4237
Mean absolute error 0.3602
Root mean squared error 0.4585
Relative absolute error 72.0461 %
Root relative squared error 91.6902 %
Total Number of Instances 642
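For reference, a minimal sketch of how a run like this can be reproduced with raw Weka interop from Clojure. The ARFF path is hypothetical; the real features are built in train_args.clj rather than loaded from disk.
(import '[weka.core.converters ConverterUtils$DataSource]
        '[weka.classifiers.trees J48]
        '[weka.classifiers Evaluation]
        '[java.util Random])
;; Train a C4.5 (J48) tree and report k-fold cross-validation results.
(defn eval-j48 [arff-path folds]
  (let [data (ConverterUtils$DataSource/read arff-path)]
    (.setClassIndex data (dec (.numAttributes data))) ; class = L1 (en/es)
    (let [tree (doto (J48.) (.buildClassifier data))
          ev   (doto (Evaluation. data)
                 (.crossValidateModel (J48.) data folds (Random. 1)))]
      (println tree)                    ; the pruned tree, as above
      (println (.toSummaryString ev))   ; accuracy, kappa, error figures
      (println (.toMatrixString ev))))) ; the confusion matrix
;; (eval-j48 "args.arff" 20)   ; hypothetical file name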
28 Jan '12
Incorporated ICLE and OANC. Now have 321 instances each of es and en, with 390,006 tokens of en and 384,885 tokens of es. Ran C4.5, giving this tree:
#<J48 J48 pruned tree
------------------
nn <= 0.038776
| poss <= 0.036585
| | quantmod <= 0.004098
| | | prt <= 0.004178: es (236.0/6.0)
| | | prt > 0.004178
| | | | auxpass <= 0.013208
| | | | | possessive <= 0.002128: es (29.0/3.0)
| | | | | possessive > 0.002128
| | | | | | predet <= 0.001289: es (6.0/1.0)
| | | | | | predet > 0.001289: en (6.0)
| | | | auxpass > 0.013208: es (27.0)
| | quantmod > 0.004098
| | | expl <= 0.002786: en (7.0/1.0)
| | | expl > 0.002786: es (9.0)
| poss > 0.036585
| | complm <= 0.008368: en (13.0)
| | complm > 0.008368: es (4.0/1.0)
nn > 0.038776
| cop <= 0.027211
| | poss <= 0.013575
| | | mwe <= 0.001429
| | | | npadvmod <= 0.00074
| | | | | parataxis <= 0.001107: en (5.0)
| | | | | parataxis > 0.001107
| | | | | | dep <= 0.040302: es (10.0/1.0)
| | | | | | dep > 0.040302: en (2.0)
| | | | npadvmod > 0.00074: en (8.0)
| | | mwe > 0.001429: en (33.0/1.0)
| | poss > 0.013575: en (222.0)
| cop > 0.027211
| | det <= 0.115263
| | | expl <= 0.002321: en (13.0)
| | | expl > 0.002321
| | | | possessive <= 0.007974: es (5.0)
| | | | possessive > 0.007974: en (2.0)
| | det > 0.115263: es (5.0)
Number of Leaves : 19
Size of the tree : 37
*********
20-fold cross-validation gives:
=== Confusion Matrix ===
a b <-- classified as
291 30 | a = es
36 285 | b = en
=== Summary ===
Correctly Classified Instances 576 89.7196 %
Incorrectly Classified Instances 66 10.2804 %
Kappa statistic 0.7944
Mean absolute error 0.1139
Root mean squared error 0.3116
Relative absolute error 22.7846 %
Root relative squared error 62.3168 %
Total Number of Instances 642
*********
100-tree random forest with 20-fold cross-validation gives (a sketch appears after the results):
=== Confusion Matrix ===
a b <-- classified as
306 15 | a = es
26 295 | b = en
=== Summary ===
Correctly Classified Instances 601 93.6137 %
Incorrectly Classified Instances 41 6.3863 %
Kappa statistic 0.8723
Mean absolute error 0.1686
Root mean squared error 0.2413
Relative absolute error 33.7132 %
Root relative squared error 48.2598 %
Total Number of Instances 642
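The forest run reduces to a sketch like this; setNumTrees is the Weka 3.6-era option name, and data is a class-labelled Instances object as in the J48 sketch above.
(import '[weka.classifiers.trees RandomForest]
        '[weka.classifiers Evaluation]
        '[java.util Random])
;; 100-tree random forest evaluated with 20-fold cross-validation.
(defn eval-forest [data folds n-trees]
  (let [rf (doto (RandomForest.) (.setNumTrees n-trees))
        ev (doto (Evaluation. data)
             (.crossValidateModel rf data folds (Random. 1)))]
    (println (.toSummaryString ev))
    (println (.toMatrixString ev))))
;; (eval-forest data 20 100)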
3 Nov '11
Reran the t-test on dependency relations. Results for relations significant at the 95% level are in log-addendum-006.txt (es in x, en in y). Results at the 99% level are in log-addendum-007.txt.
23 Oct '11
Note to self: Do L1-en use VP anaphora (i.e. do) readily in English? What types of VP anaphora does Spanish have? See Lopez and Winkler ("spanish-vp-anaphora.pdf")
9 Oct '11
commit: 9b9e80fe082e8e68e96c42db9eb247476781a67d
Finished an initial version of the arguments/lexical-NP-based classifier. Code is in arguments.clj and train_arguments.clj.
T-test results show a significant difference between the following argument configurations, with the listed language having the greater frequency (alpha = 0.05):
s --> ES
S --> EN
ao --> ES
aO --> ES
Ao --> ES
AO --> EN
aiO --> ES
aoC --> ES
aOC --> ES
AOC --> EN
p --> ES
P --> EN
pC --> ES
More detailed results in log-addendum-005.txt
7 Oct '11
Most recently added some en samples from ICE and merged some of the wricle samples to better balance the sample counts. Also got good results with a random forest with n=100. Will now begin the verb valency test (see Sep 11 note).
Useful terms: valency expansion, valency reduction, avalent, monovalent, divalent, trivalent
5 Oct '11
Note to self:
Consider phrasal verbs. Do Spanish speakers separate phrasal verbs? e.g. "He chewed the food up".
But also consider "That is something he won't put up with."
25 Sep '11
Added functions to the 'train-verbs' and 'verb' namespaces to allow
classification by modal relative frequency and also by high-frequency verb
relative frequency. The verbs considered are from Altenberg and Granger and are
'have go take do say look know see give think come find get make use'.
Using a Neural Net, classifying by modals gives about 80% accuracy. I need to run it again and record the results.
Classifying by high-frequency verbs gives the following (a feature-extraction sketch appears after the numbers):
Correctly Classified Instances 139 78.0899 %
Incorrectly Classified Instances 39 21.9101 %
Kappa statistic 0.4474
Mean absolute error 0.2173
Root mean squared error 0.4346
Relative absolute error 54.2876 %
Root relative squared error 97.2821 %
Total Number of Instances 178
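The per-instance feature vector for this test might be computed as sketched below; lemmas is a hypothetical seq of lemmatized tokens for one pooled instance, and normalizing by the summed high-frequency-verb count is an assumption (the real code may normalize by total verb usages instead).
(def hf-verbs ["have" "go" "take" "do" "say" "look" "know" "see"
               "give" "think" "come" "find" "get" "make" "use"])
;; Relative frequency of each Altenberg & Granger verb in one instance.
(defn hf-verb-features [lemmas]
  (let [counts (frequencies lemmas)
        total  (reduce + (map #(get counts % 0) hf-verbs))]
    (vec (for [v hf-verbs]
           (if (zero? total)
             0.0
             (double (/ (get counts v 0) total)))))))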
23 Sep '11
commit 8b2c71f4a645c96a86fd60557aaed877664c8144
Added function count-words-in-corpora to thesis.data. The following are the current results (a sketch of such a helper appears below):
:micusp :es 31057
:micusp :en 167984
:wricle :es 88977
:misc :es 496
:misc :en 2848
:sulec :es 36267
total es = 156,797
total en = 170,832
Note that ICLE will provide another 198,131 words of Spanish, more than doubling that body. I can probably get enough English from ICE to make up the difference on the English side.
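A helper of that sort can be as simple as the sketch below; the directory argument and whitespace tokenization are assumptions, not necessarily what thesis.data does.
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
;; Total whitespace-delimited word count over every file under dir.
(defn count-words [dir]
  (->> (file-seq (io/file dir))
       (filter #(.isFile %))
       (map #(count (str/split (slurp %) #"\s+")))
       (reduce +)))
;; e.g. (count-words "data/sulec") for the :sulec :es figure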
22 Sep '11
Finished verb.clj to the point where I was able to run statistics and classification tests (found in train_verb.clj).
Training on all corpora and doing 10-fold cross-validation, a decision tree gives:
Correctly Classified Instances 126 70.7865 %
Incorrectly Classified Instances 52 29.2135 %
Kappa statistic 0.2077
Mean absolute error 0.3059
Root mean squared error 0.5139
Relative absolute error 76.424 %
Root relative squared error 115.0477 %
Total Number of Instances 178
A Neural Net gives slightly better results at:
Correctly Classified Instances 136 76.4045 %
Incorrectly Classified Instances 42 23.5955 %
Kappa statistic 0.3853
Mean absolute error 0.2601
Root mean squared error 0.4596
Relative absolute error 64.9751 %
Root relative squared error 102.8927 %
Total Number of Instances 178
On the statistics test, at the 95% confidence level the following tense/aspect combinations were shown to have statistically different relative frequencies between L1-es and L1-en. The language with the higher frequency is written next to each.
present-perfect-passive en
present-perfect-progressive es
present-progressive es
past-passive en
present es
past en
Note that negative get passives are still not recognized, so that needs to be done.
See addendum-004 for more statistics
21 Sep '11
Used Incanter to run a t-test on dependency relations. The following dependencies show significant differences in relative frequency between L1 English and L1 Spanish at 95% confidence (a sketch of the test call follows the list):
prt
purpcl
poss
predet
possessive
rcmod
csubj
mark
cop
xcomp
advcl
pcomp
expl
aux
neg
npadvmod
infmod
complm
ccomp
nsubj
det
nn
See addendum-003 for more statistics.
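The test call itself reduces to something like this sketch; es-freqs and en-freqs are hypothetical vectors holding one relation's per-instance relative frequencies (es in x, en in y, matching the 3 Nov note above).
(require '[incanter.stats :as stats])
;; Two-sample t-test on one dependency relation's relative frequencies.
(defn signif-reln? [es-freqs en-freqs alpha]
  (< (:p-value (stats/t-test es-freqs :y en-freqs)) alpha))
;; (signif-reln? es-poss-freqs en-poss-freqs 0.05)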
20 Sep '11
Did more work on verb.clj. Also created train_verbs.clj, which will contain the code that performs classifying tasks using the tools in verbs.clj. So far the 'extract-verbs' function can recognize all twelve tense/aspect combinations ({past,present,modal}/perfect/progressive). It also recognizes passives in 'be' and 'get', with both 'got' and 'gotten'. It does not yet recognize negative 'get' passives that require a 'do' operator, e.g. "he didn't get beaten". Also, questions are not supported and, as of my thinking right now, probably never will be. A rough sketch of the labelling appears below.
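This is heavily simplified: the real extract-verbs works over parse trees, and the [word pos-tag] pairs and the modal list here are assumptions.
(require '[clojure.string :as str])
(def modals #{"can" "could" "may" "might" "shall" "should"
              "will" "would" "must"})
;; pairs: a seq of [word pos-tag] for one finite verb group,
;; e.g. [["has" "VBZ"] ["been" "VBN"] ["eating" "VBG"]].
(defn label-verb-group [pairs]
  (let [words (map (comp str/lower-case first) pairs)
        tags  (map second pairs)]
    {:tense       (cond (modals (first words)) :modal
                        (= "VBD" (first tags)) :past
                        :else                  :present)
     :perfect     (boolean (some #{"have" "has" "had"} words))
     :progressive (= "VBG" (last tags))
     :passive     (and (= "VBN" (last tags))
                       (boolean (some #{"be" "been" "being" "am" "is" "are"
                                        "was" "were" "get" "got" "gotten"}
                                      words)))}))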
Yesterday I added access methods for the sulec corpus to data.clj. I pooled the sulec essays by threes, creating a total of 41 "instances". I also generated a *.stats file for each instance (from all corpora). At the moment, this file is simply a text file that contains the number of stanford nlp parser tokens. I'm not sure if this will be useful. My thinking was that this number could provide a normalizing factor for the different attributes, but now that I think about it, my current approach might be better. For instance, in thesis.train-deps, I use the total number of dependencies of any type as a normalizing factor. I don't immediately see a reason to change this. One concern is that some instances are too small for certain attributes to have statistical relevance. This warrants some thought.
11 Sep '11
Note to self: a good attribute to analyze might involve looking at the arguments of a verb and seeing how many are lexical NPs and how many are referential forms. Recall that Du Bois says that no more than one lexical NP is preferred: in intransitives it goes in the S position (of course), and in transitives and ditransitives in the D.O. position. This could be presented as fourteen attributes:
1 argument -> {s, S}
2 arguments -> {ao,aO,Ao,AO}
3 arguments -> {aio,aiO,aIo,aIO,Aio,AiO,AIo,AIO}
The values for each could be represented thus, where the variables are the number of occurrences of that particular argument configuration (a small sketch follows the formulas):
Vs = sum(s)/(sum(s)+sum(S))
Vao = sum(ao)/(sum(ao)+sum(aO)+sum(Ao)+sum(AO))
and so forth
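A sketch of computing those values, where counts is a hypothetical map from configuration keyword to its occurrence count in one instance:
;; Value of configuration k within its argument-count family ks,
;; e.g. Vao = (config-value counts [:ao :aO :Ao :AO] :ao)
;;      Vs  = (config-value counts [:s :S] :s)
(defn config-value [counts ks k]
  (let [total (reduce + (map #(get counts % 0) ks))]
    (if (zero? total)
      0.0
      (double (/ (get counts k 0) total)))))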
Sources:
Du Bois, John W. "Discourse and Grammar" from ed. Tomasello, Michael "The New Psychology of Language" Ch. 2 pp 48-87
3 Sep '11
No commit
Up next, figure out how tenses are represented in the syntactic parse trees and turn this into attributes. Let each verb tense be an attribute with the value being the number of times it appears in the text divided by the total number of verb usages. Restrict to finite verbs.
10 Aug '11
commit 17849cefb00c0082d9d6e5a4885ae83edc0eb6b6
Previously (about 10 days ago) wrote a function
train-deps/run-root-attr-trim-experiment that performs the following
experiment (from docs): This function makes an unpruned J48 tree using
the data in dataset, evaluates it using cross validation, then removes
the attribute from dataset that is the basis of the initial decision
in the tree and calls itself again, decrementing depth.
The purpose of this is to discover which features are apparently most
important in classification. This can likely also be done (though
probably with different results) using a classifier that can report
the most informative features automatically.
See log-addendum-002 for specifics. A sketch of the loop appears below.
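Pulling the root attribute's name off the printed tree, as done here, is a simplification and not necessarily how the real function finds it.
(require '[clojure.string :as str])
(import '[weka.classifiers.trees J48]
        '[weka.classifiers Evaluation]
        '[weka.filters Filter]
        '[weka.filters.unsupervised.attribute Remove]
        '[java.util Random])
;; First attribute named by the printed tree (line 4 of J48's toString).
(defn root-attr-name [tree]
  (-> (str tree) str/split-lines (nth 3) str/trim (str/split #"\s+") first))
(defn trim-experiment [data depth]
  (when (pos? depth)
    (let [tree (doto (J48.) (.setUnpruned true) (.buildClassifier data))
          ev   (doto (Evaluation. data)
                 (.crossValidateModel (doto (J48.) (.setUnpruned true))
                                      data 10 (Random. 1)))
          root (root-attr-name tree)
          ;; drop the root attribute (Remove uses 1-based indices)
          rm   (doto (Remove.)
                 (.setAttributeIndices
                  (str (inc (.index (.attribute data root)))))
                 (.setInputFormat data))]
      (println root "->" (.pctCorrect ev))
      (recur (Filter/useFilter data rm) (dec depth)))))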
29 Jul '11
3:14pm
commit 51b394f7ce93d8bed552b5e07119a35026db2719
Ran an experiment hoping to discover if some corpus files should be dropped or otherwise treated specially due to small size. Initial results seem to indicate no. The table below shows the results from 20-fold cross-validation; see also log-addendum-001 for the list of files used. A multilayer perceptron was used as the classifier with no special options. Still need to investigate whether results can be improved by pooling smaller files.
MinDep   Average % correct over 10 runs
15 86.86131386861314
20 86.46616541353384
25 85.88709677419355
35 82.47422680412373
45 82.49999999999999
60 77.08333333333334
8:35am
commit 659bea22b18d4ff857aaa886b5797f1ce96a1aef
Training a decision tree classifier (:decision-tree :c45 / J48 pruned tree) on a dataset consisting of all L1-ES wricle texts of levels 5 and 6 (79), all L1-ES micusp texts (8), and a sampling of L1-EN wricle texts (48) yields a classifier which, when evaluated with 10-fold cross-validation, achieves 88.1481% accuracy. The decision tree is as follows:
nn <= 0.043394: es (78.0/1.0)
nn > 0.043394
| predet <= 0.001776
| | advmod <= 0.034846
| | | quantmod <= 0.001476
| | | | prt <= 0.00202: es (6.0)
| | | | prt > 0.00202: en (2.0)
| | | quantmod > 0.001476: en (6.0)
| | advmod > 0.034846: en (40.0/1.0)
| predet > 0.001776: es (3.0)
27 July '11 Code is now capable of doing deps relation-based classification. The efficacy of this has not yet been determined. Relevant code is in thesis.train-deps. Next step: choose an appropriate classifier algorithm and test.
25 July '11 Added file train_deps.clj to contain code for running the deps-based classification, using clj-ml to interface with Weka. Began an initial test by constructing a Weka data set with as many attributes as there are Stanford dependencies (50-some). I added a Weka Instance to the data set for each sample text. I set up each attribute to be numeric, counting the number of times a particular dependency relation is used in a text (i.e. just the basic relation such as "xsubj", ignoring the words that are its parameters), dividing that count by the total number of dependencies, and using that as the attribute value. The class attribute has the key "L1" with possible values "en" or "es". thesis.train-deps/make-reln-dataset-with-samples will construct the weka dataset from the supplied lists of English and Spanish L1 micusp files. Starting a run with 8 L1 Spanish and 9 L1 English texts; following the run I need to train a classifier. A sketch of the dataset construction appears below.
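In this sketch, reln-names (a seq of relation name strings) and the shape of samples are hypothetical stand-ins for the real thesis.train-deps code.
(require '[clj-ml.data :as mld])
;; Relative frequency of each relation, normalized by total dependencies.
;; dep-counts maps relation name (string) to its count in one text.
(defn reln-freqs [dep-counts reln-names]
  (let [total (max 1 (reduce + (vals dep-counts)))]
    (mapv #(double (/ (get dep-counts % 0) total)) reln-names)))
;; samples: a seq of [dep-counts l1] pairs, l1 being :en or :es.
(defn make-reln-dataset [reln-names samples]
  (let [attrs (conj (mapv keyword reln-names) {:l1 [:en :es]})
        rows  (for [[counts l1] samples]
                (conj (reln-freqs counts reln-names) l1))
        ds    (mld/make-dataset "deps" attrs rows)]
    (mld/dataset-set-class ds (count reln-names)))) ; class = last attribute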
20 July '11 10:05PM Renamed parser.clj to parse.clj along with the package thesis.parse. This package now contains a function parse-sentences that takes a list of preprocessed text and returns only the parses that are complete sentences (a sketch appears below). Also added functions in data.clj that return the preprocessed text. Next step is to write a function that will create dependencies for these parses, then use those dependencies as features for WEKA.
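A sketch of that filter, assuming the parses arrive as edu.stanford.nlp.trees.Tree objects and that checking the ROOT's first child against a set of clause labels is a fair proxy for "complete sentence":
(import '[edu.stanford.nlp.trees Tree])
(def sentential? #{"S" "SBAR" "SBARQ" "SINV" "SQ"})
;; Keep only parses whose top-level clause label is sentential.
(defn parse-sentences [trees]
  (filter (fn [^Tree t]
            (sentential? (.. t firstChild label value)))
          trees))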
2:29PM Finished a preliminary preprocessor for the micusp resource. Regexes need improvement, as some documents seem to be cut off. Also, there is some initial information that needs to be stripped.
18 July '11 10:41PM Have been working on a preprocessor for the micusp resource. Still in progress. There are slight differences between the files in terms of the page headers that need to be stripped; mainly a difference of newlines, but possibly more. New folder code-nu contains code that does batch conversion from PDFs to UTF-8.
17 July '11
1:41PM
NB: MICUSP: Michigan Corpus of Upper-Level Student Papers
http://search-micusp.elicorpora.info/simple/
1:24PM Downloaded ice-canada corpus. Zip file (in data/ice-canada/) is
password protected. etc/ICElicence.doc needs to be completed and
emailed after July '11 (researcher is out of town).
http://ICE-corpora.net/ice/download.htm
13 July '11 9:32PM Setup remote git repository for this project at
[email protected]:thesis
12 July '11 3:16PM
Initial entry. Up to this point, I have been working in Clojure with the Stanford Parser (1.6.4 and now 1.6.7). Using the factored and PCFG parsers, I have generated parse trees and Stanford dependencies.
Yesterday spoke to Drs. Biava
and Smith regarding this project. Basically just explained my
intentions. Talked about finding corpora, the surprising paucity of
Spanish linguistics literature, etc.
I should provide a basic overview of my goals here:
Using parse trees, Stanford Dependencies, possibly vocabulary, and a machine learning package (likely WEKA), I want to create a system that will look at an English text and attempt to determine whether it was written by a Spanish L1 speaker or a native English speaker. The problem here is finding the features in trees etc. that will allow this classification.
Immediate Plans:
Set up a Git repository? Today I emailed Dr. Smith regarding university server space for this.
Stanford dependencies consist of "approximately 52 grammatical relations" (SD manual). Try using these as features for a first attempt at a classification system. In other words, for a text, just count the number of each relation it has and see if that gives the classifier something to work with.
Relevant Works:
- For the PCFG parser: Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
- For the factored parser: Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.