CHANGELOG
V0.5.1 - Jan 24, 2012
* N-best list module now normalizes extra space in hypotheses
* Single optimizer runs no longer cause a NaN in the output table -- instead we print a warning message
V0.5.0 - 11/29/2012
* Bug fix for BLEU implementation affecting only multiple reference translations (see below)
IMPORTANT: Scores from MultEval BLEU 0.5.0 are *NOT* comparable to previous versions.
Please score all of your experiments with a consistent version of all metrics.
NOTE: Jon rescored several results using the fixed version of BLEU and the differences
between systems remained virtually unchanged despite the magnitudes of the scores changing.
* Added Travis CI regression tests (see https://travis-ci.org/jhclark/multeval)
* Added ability to produce sentence-level scores via the --sentLevelDir option
* More verbose output for BLEU
Examples of the BLEU bug fix's effect on an Arabic-English test set with 4 references:
=============== V0.4.3 ======================= ||| =============== V0.5.0 ============== ||| === Comparison ==
Set | Baseline | Experimental | Improvement ||| Baseline | Experimental | Improvement ||| Improvement Delta
MT08nw | 47.8 | 47.8 | +/- ||| 48.3 | 48.4 | +/- ||| 0
MT08wb | 30.5 | 31.0 | +0.5 ||| 31.2 | 31.5 | +0.3 ||| 0.2
MT09nw | 51.6 | 51.5 | +/- ||| 53.2 | 53.1 | +/- ||| 0
MT09wb | 31.6 | 32.3 | +0.7 ||| 33.5 | 34.1 | +0.6 ||| 0.1
This same trend also held in several Chinese-English experiments with multiple
references -- absolute scores increased while relative differences remained nearly identical.
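For readers who have not implemented BLEU with multiple references, the sketch below is a hypothetical illustration only -- it is NOT MultEval's implementation and does not reproduce the specific bug fixed in V0.5.0. It shows the two places where multiple references enter the computation: n-gram counts are clipped against the maximum count in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length. The class and method names and the add-one smoothing are illustrative choices only.

    import java.util.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's implementation and
    // NOT a reproduction of the bug fixed in V0.5.0. It shows the two places where
    // multiple references enter BLEU: n-gram counts are clipped against the maximum
    // count in any single reference, and the brevity penalty uses the reference
    // length closest to the hypothesis length. Add-one smoothing is an arbitrary
    // choice to keep short sentences from scoring exactly zero.
    public class MultiRefBleuSketch {

      static Map<String, Integer> ngramCounts(List<String> toks, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= toks.size(); i++) {
          counts.merge(String.join(" ", toks.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
      }

      static double sentenceBleu(List<String> hyp, List<List<String>> refs) {
        double logPrecisionSum = 0.0;
        for (int n = 1; n <= 4; n++) {
          Map<String, Integer> hypCounts = ngramCounts(hyp, n);
          int matched = 0, total = 0;
          for (Map.Entry<String, Integer> e : hypCounts.entrySet()) {
            int maxRefCount = 0;
            for (List<String> ref : refs) {
              maxRefCount = Math.max(maxRefCount, ngramCounts(ref, n).getOrDefault(e.getKey(), 0));
            }
            matched += Math.min(e.getValue(), maxRefCount);  // clip by the best single reference
            total += e.getValue();
          }
          logPrecisionSum += Math.log((matched + 1.0) / (total + 1.0));  // add-one smoothing
        }
        // Brevity penalty against the reference length closest to the hypothesis length.
        int c = hyp.size();
        int r = refs.get(0).size();
        for (List<String> ref : refs) {
          if (Math.abs(ref.size() - c) < Math.abs(r - c)) r = ref.size();
        }
        double brevityPenalty = (c >= r) ? 1.0 : Math.exp(1.0 - (double) r / c);
        return brevityPenalty * Math.exp(logPrecisionSum / 4.0);
      }

      public static void main(String[] args) {
        List<String> hyp = Arrays.asList("the cat sat on the mat".split(" "));
        List<List<String>> refs = Arrays.asList(
            Arrays.asList("the cat is on the mat".split(" ")),
            Arrays.asList("there is a cat on the mat".split(" ")));
        System.out.printf("sentence BLEU = %.3f%n", sentenceBleu(hyp, refs));
      }
    }

Different defensible choices in exactly these two steps can shift absolute multi-reference scores without necessarily changing the relative differences between systems, which is consistent with the comparison above.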
V0.4.3 - 8/27/2012
* Upgraded to Meteor 1.4 (also released 8/27/2012)
- Note: This change affects only previously unsupported languages, for which it enables new stemmers
V0.4.2 - 12/3/2012
* Fixed a bug in n-best scoring that caused oracle *submetrics* such as precision and recall (not the main BLEU, TER, and METEOR scores) to be reported incorrectly for oracle hypotheses
* Multi-threaded bootstrap resampling and approximate randomization significance tests (large time savings for many systems with many optimization runs)
* Updated to Guava V11
Some timing results for 32 threads vs. 4 threads vs. 1 thread, using the recent multi-threading improvements, on the 3 example systems with 3 optimizer runs each (a rough sketch of the parallelization strategy follows the table):
32 / 4 / 1
Load METEOR 41.5s / 25.5s / 25.5s
Collect Sufficient Stats 14.8s / 32.7s / 86.7s
Bootstrap Resampling 35.2s / 70.9s / 189s
Approximate Randomization 19.4s / 122s / 336s
TOTAL 1m 54s / 4m 13s / 10m 39s
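As a rough illustration of why the bootstrap step parallelizes so well, here is a hypothetical sketch -- NOT MultEval's code; the Metric interface and the double[] sufficient-statistics layout are placeholders. Each bootstrap replicate resamples sentences with replacement, sums their sufficient statistics, and recomputes the corpus score, so replicates are independent tasks that can be submitted to a fixed thread pool.

    import java.util.*;
    import java.util.concurrent.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's code. The Metric
    // interface and the double[] sufficient-statistics layout are placeholders.
    // Each bootstrap replicate resamples sentences with replacement, sums their
    // sufficient statistics, and recomputes the corpus score; replicates are
    // independent, so they can be submitted as separate tasks to a thread pool.
    public class ParallelBootstrapSketch {

      interface Metric {
        double scoreFromSummedStats(double[] summedStats);
      }

      static double oneReplicate(double[][] sentStats, long seed, Metric metric) {
        Random rng = new Random(seed);  // per-replicate seed keeps runs reproducible
        double[] summed = new double[sentStats[0].length];
        for (int i = 0; i < sentStats.length; i++) {
          double[] s = sentStats[rng.nextInt(sentStats.length)];  // sample with replacement
          for (int j = 0; j < s.length; j++) summed[j] += s[j];
        }
        return metric.scoreFromSummedStats(summed);
      }

      static double[] bootstrap(double[][] sentStats, Metric metric, int replicates, int threads)
          throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Double>> futures = new ArrayList<>();
        for (int r = 0; r < replicates; r++) {
          final long seed = r;
          futures.add(pool.submit(() -> oneReplicate(sentStats, seed, metric)));
        }
        double[] scores = new double[replicates];
        for (int r = 0; r < replicates; r++) scores[r] = futures.get(r).get();
        pool.shutdown();
        Arrays.sort(scores);  // sorted scores give percentile confidence intervals
        return scores;
      }

      public static void main(String[] args) throws Exception {
        // Toy data: one statistic per sentence; the "metric" is just the corpus mean.
        double[][] stats = { {0.4}, {0.7}, {0.9}, {0.2}, {0.6} };
        Metric mean = summed -> summed[0] / stats.length;
        double[] scores = bootstrap(stats, mean, 1000, 4);
        System.out.printf("95%% interval: [%.3f, %.3f]%n",
            scores[(int) (0.025 * scores.length)], scores[(int) (0.975 * scores.length)]);
      }
    }

Approximate randomization parallelizes the same way, since each shuffle is likewise an independent task.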
V0.4.1 - 12/30/2011
* Fixed a sizing bug reported by John DeNero that caused MultEval to crash
* Removed "static" keyword from several places within the TER library to make it more amenable to multi-threading
V0.4 - 12/30/2011
* Multi-thread sufficient stats computation where possible (multi-threading of bootstrap resampling and approximate randomization is still on the to-do list)
* Include constants file in distribution
* A few other small optimizations
* Include example n-best list
V0.3 - 8/10/2011
* Use Meteor 1.3
* Better reporting of metric descriptions in the LaTeX table
* Ability to find oracle-worst hypotheses from n-best list
* Better incremental status when processing n-best lists
V0.2 - 6/29/2011
* Use a more aggressive shuffling algorithm for approximate randomization, giving better p-values in corner cases (a rough sketch of the test appears below)
* Dump prettier ASCII table upon completion
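For context on what the shuffling refers to, here is a hypothetical sketch of a standard paired approximate randomization test -- NOT MultEval's code; the Metric interface and the per-sentence stats layout are placeholders. Each shuffle swaps the two systems' per-sentence statistics with probability 0.5, recomputes the corpus score difference, and the p-value is the fraction of shuffles whose difference is at least as extreme as the observed one.

    import java.util.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's code. The Metric
    // interface and the per-sentence stats layout are placeholders. In a paired
    // approximate randomization test, each shuffle swaps the two systems'
    // per-sentence statistics with probability 0.5 and recomputes the corpus
    // score difference; the p-value is the fraction of shuffles whose difference
    // is at least as extreme as the observed one.
    public class ApproxRandSketch {

      interface Metric {
        double scoreFromSummedStats(double[] summedStats);
      }

      static double score(double[][] sentStats, Metric metric) {
        double[] summed = new double[sentStats[0].length];
        for (double[] s : sentStats) {
          for (int j = 0; j < s.length; j++) summed[j] += s[j];
        }
        return metric.scoreFromSummedStats(summed);
      }

      static double pValue(double[][] sysA, double[][] sysB, Metric metric, int shuffles, long seed) {
        Random rng = new Random(seed);
        double observedDiff = Math.abs(score(sysA, metric) - score(sysB, metric));
        int atLeastAsExtreme = 0;
        for (int s = 0; s < shuffles; s++) {
          double[][] shufA = sysA.clone();  // shallow copies: we only swap row references
          double[][] shufB = sysB.clone();
          for (int i = 0; i < sysA.length; i++) {
            if (rng.nextBoolean()) {        // swap this sentence's paired stats
              shufA[i] = sysB[i];
              shufB[i] = sysA[i];
            }
          }
          if (Math.abs(score(shufA, metric) - score(shufB, metric)) >= observedDiff) {
            atLeastAsExtreme++;
          }
        }
        return (atLeastAsExtreme + 1.0) / (shuffles + 1.0);  // add-one estimate of the p-value
      }

      public static void main(String[] args) {
        double[][] sysA = { {0.41}, {0.63}, {0.55}, {0.72}, {0.38} };
        double[][] sysB = { {0.39}, {0.61}, {0.54}, {0.70}, {0.35} };
        Metric mean = summed -> summed[0] / sysA.length;
        System.out.printf("approximate p-value = %.3f%n", pValue(sysA, sysB, mean, 10000, 42L));
      }
    }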
V0.1 - 6/21/2011
* Initial Release