CHANGELOG
V0.5.1 - Jan 24, 2012
* N-best list module now normalizes extra space in hypotheses
* Single optimizer runs no longer cause a NaN in the output table -- instead we print a warning message
V0.5.0 - 11/29/2012
* Bug fix for BLEU implementation affecting only multiple reference translations (see below)
IMPORTANT: Scores from MultEval BLEU 0.5.0 are *NOT* comparable to previous versions.
Please score all of your experiments with a consistent version of all metrics.
NOTE: Jon rescored several results using the fixed version of BLEU and the differences
between systems remained virtually unchanged despite the magnitudes of the scores changing.
* Added Travis CI regression tests (see https://travis-ci.org/jhclark/multeval)
* Added ability to produce sentence-level scores via the --sentLevelDir option
* More verbose output for BLEU
Examples of the BLEU bug fix's effect on an Arabic-English test set with 4 references:
=============== V0.4.3 ======================= ||| =============== V0.5.0 ============== ||| === Comparison ==
Set | Baseline | Experimental | Improvement ||| Baseline | Experimental | Improvement ||| Improvement Delta
MT08nw | 47.8 | 47.8 | +/- ||| 48.3 | 48.4 | +/- ||| 0
MT08wb | 30.5 | 31.0 | +0.5 ||| 31.2 | 31.5 | +0.3 ||| 0.2
MT09nw | 51.6 | 51.5 | +/- ||| 53.2 | 53.1 | +/- ||| 0
MT09wb | 31.6 | 32.3 | +0.7 ||| 33.5 | 34.1 | +0.6 ||| 0.1
This same trend also held in several Chinese-English experiments with multiple
references -- absolute scores increased while relative differences remained nearly identical.
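For readers who have not implemented BLEU with multiple references, the sketch below is a hypothetical illustration only -- it is NOT MultEval's implementation and does not reproduce the specific bug fixed in V0.5.0. It shows the two places where multiple references enter the computation: n-gram counts are clipped against the maximum count in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length. The class and method names and the add-one smoothing are illustrative choices only.

    import java.util.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's implementation and
    // NOT a reproduction of the bug fixed in V0.5.0. It shows the two places where
    // multiple references enter BLEU: n-gram counts are clipped against the maximum
    // count in any single reference, and the brevity penalty uses the reference
    // length closest to the hypothesis length. Add-one smoothing is an arbitrary
    // choice to keep short sentences from scoring exactly zero.
    public class MultiRefBleuSketch {

      static Map<String, Integer> ngramCounts(List<String> toks, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= toks.size(); i++) {
          counts.merge(String.join(" ", toks.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
      }

      static double sentenceBleu(List<String> hyp, List<List<String>> refs) {
        double logPrecisionSum = 0.0;
        for (int n = 1; n <= 4; n++) {
          Map<String, Integer> hypCounts = ngramCounts(hyp, n);
          int matched = 0, total = 0;
          for (Map.Entry<String, Integer> e : hypCounts.entrySet()) {
            int maxRefCount = 0;
            for (List<String> ref : refs) {
              maxRefCount = Math.max(maxRefCount, ngramCounts(ref, n).getOrDefault(e.getKey(), 0));
            }
            matched += Math.min(e.getValue(), maxRefCount);  // clip by the best single reference
            total += e.getValue();
          }
          logPrecisionSum += Math.log((matched + 1.0) / (total + 1.0));  // add-one smoothing
        }
        // Brevity penalty against the reference length closest to the hypothesis length.
        int c = hyp.size();
        int r = refs.get(0).size();
        for (List<String> ref : refs) {
          if (Math.abs(ref.size() - c) < Math.abs(r - c)) r = ref.size();
        }
        double brevityPenalty = (c >= r) ? 1.0 : Math.exp(1.0 - (double) r / c);
        return brevityPenalty * Math.exp(logPrecisionSum / 4.0);
      }

      public static void main(String[] args) {
        List<String> hyp = Arrays.asList("the cat sat on the mat".split(" "));
        List<List<String>> refs = Arrays.asList(
            Arrays.asList("the cat is on the mat".split(" ")),
            Arrays.asList("there is a cat on the mat".split(" ")));
        System.out.printf("sentence BLEU = %.3f%n", sentenceBleu(hyp, refs));
      }
    }

Different defensible choices in exactly these two steps can shift absolute multi-reference scores without necessarily changing the relative differences between systems, which is consistent with the comparison above.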
V0.4.3 - 8/27/2012
* Upgraded to Meteor 1.4 (also released 8/27/2012)
- Note: This change affects only previously unsupported languages, for which it enables new stemmers
V0.4.2 - 12/3/2012
* Fixed a bug in n-best scoring that caused oracle *submetrics* such as precision and recall (not the main BLEU, TER, and METEOR scores) to be reported incorrectly for oracle hypotheses
* Multi-threaded bootstrap resampling and approximate randomization significance tests (large time savings for many systems with many optimization runs)
* Updated to Guava V11
Some timing results for 32 threads vs. 4 threads vs. 1 thread, using the recent multi-threading improvements, on the 3 example systems with 3 optimizer runs each (a rough sketch of the parallelization strategy follows the table):
32 / 4 / 1
Load METEOR 41.5s / 25.5s / 25.5s
Collect Sufficient Stats 14.8s / 32.7s / 86.7s
Bootstrap Resampling 35.2s / 70.9s / 189s
Approximate Randomization 19.4s / 122s / 336s
TOTAL 1m 54s / 4m 13s / 10m 39s
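As a rough illustration of why the bootstrap step parallelizes so well, here is a hypothetical sketch -- NOT MultEval's code; the Metric interface and the double[] sufficient-statistics layout are placeholders. Each bootstrap replicate resamples sentences with replacement, sums their sufficient statistics, and recomputes the corpus score, so replicates are independent tasks that can be submitted to a fixed thread pool.

    import java.util.*;
    import java.util.concurrent.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's code. The Metric
    // interface and the double[] sufficient-statistics layout are placeholders.
    // Each bootstrap replicate resamples sentences with replacement, sums their
    // sufficient statistics, and recomputes the corpus score; replicates are
    // independent, so they can be submitted as separate tasks to a thread pool.
    public class ParallelBootstrapSketch {

      interface Metric {
        double scoreFromSummedStats(double[] summedStats);
      }

      static double oneReplicate(double[][] sentStats, long seed, Metric metric) {
        Random rng = new Random(seed);  // per-replicate seed keeps runs reproducible
        double[] summed = new double[sentStats[0].length];
        for (int i = 0; i < sentStats.length; i++) {
          double[] s = sentStats[rng.nextInt(sentStats.length)];  // sample with replacement
          for (int j = 0; j < s.length; j++) summed[j] += s[j];
        }
        return metric.scoreFromSummedStats(summed);
      }

      static double[] bootstrap(double[][] sentStats, Metric metric, int replicates, int threads)
          throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Double>> futures = new ArrayList<>();
        for (int r = 0; r < replicates; r++) {
          final long seed = r;
          futures.add(pool.submit(() -> oneReplicate(sentStats, seed, metric)));
        }
        double[] scores = new double[replicates];
        for (int r = 0; r < replicates; r++) scores[r] = futures.get(r).get();
        pool.shutdown();
        Arrays.sort(scores);  // sorted scores give percentile confidence intervals
        return scores;
      }

      public static void main(String[] args) throws Exception {
        // Toy data: one statistic per sentence; the "metric" is just the corpus mean.
        double[][] stats = { {0.4}, {0.7}, {0.9}, {0.2}, {0.6} };
        Metric mean = summed -> summed[0] / stats.length;
        double[] scores = bootstrap(stats, mean, 1000, 4);
        System.out.printf("95%% interval: [%.3f, %.3f]%n",
            scores[(int) (0.025 * scores.length)], scores[(int) (0.975 * scores.length)]);
      }
    }

Approximate randomization parallelizes the same way, since each shuffle is likewise an independent task.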
V0.4.1 - 12/30/2011
* Fixed a sizing bug reported by John DeNero that caused MultEval to crash
* Removed "static" keyword from several places within the TER library to make it more amenable to multi-threading
V0.4 - 12/30/2011
* Multi-thread sufficient stats computation where possible (multi-threading of bootstrap resampling and approximate randomization is still on the to-do list)
* Include constants file in distribution
* A few other small optimizations
* Include example n-best list
V0.3 - 8/10/2011
* Use Meteor 1.3
* Better reporting of metric descriptions in the LaTeX table
* Ability to find oracle-worst hypotheses from n-best list
* Better incremental status when processing n-best lists
V0.2 - 6/29/2011
* Use a more aggressive shuffling algorithm for approximate randomization, giving better p-values in corner cases (a rough sketch of the test appears below)
* Dump prettier ASCII table upon completion
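For context on what the shuffling refers to, here is a hypothetical sketch of a standard paired approximate randomization test -- NOT MultEval's code; the Metric interface and the per-sentence stats layout are placeholders. Each shuffle swaps the two systems' per-sentence statistics with probability 0.5, recomputes the corpus score difference, and the p-value is the fraction of shuffles whose difference is at least as extreme as the observed one.

    import java.util.*;

    // Hypothetical sketch for illustration only -- NOT MultEval's code. The Metric
    // interface and the per-sentence stats layout are placeholders. In a paired
    // approximate randomization test, each shuffle swaps the two systems'
    // per-sentence statistics with probability 0.5 and recomputes the corpus
    // score difference; the p-value is the fraction of shuffles whose difference
    // is at least as extreme as the observed one.
    public class ApproxRandSketch {

      interface Metric {
        double scoreFromSummedStats(double[] summedStats);
      }

      static double score(double[][] sentStats, Metric metric) {
        double[] summed = new double[sentStats[0].length];
        for (double[] s : sentStats) {
          for (int j = 0; j < s.length; j++) summed[j] += s[j];
        }
        return metric.scoreFromSummedStats(summed);
      }

      static double pValue(double[][] sysA, double[][] sysB, Metric metric, int shuffles, long seed) {
        Random rng = new Random(seed);
        double observedDiff = Math.abs(score(sysA, metric) - score(sysB, metric));
        int atLeastAsExtreme = 0;
        for (int s = 0; s < shuffles; s++) {
          double[][] shufA = sysA.clone();  // shallow copies: we only swap row references
          double[][] shufB = sysB.clone();
          for (int i = 0; i < sysA.length; i++) {
            if (rng.nextBoolean()) {        // swap this sentence's paired stats
              shufA[i] = sysB[i];
              shufB[i] = sysA[i];
            }
          }
          if (Math.abs(score(shufA, metric) - score(shufB, metric)) >= observedDiff) {
            atLeastAsExtreme++;
          }
        }
        return (atLeastAsExtreme + 1.0) / (shuffles + 1.0);  // add-one estimate of the p-value
      }

      public static void main(String[] args) {
        double[][] sysA = { {0.41}, {0.63}, {0.55}, {0.72}, {0.38} };
        double[][] sysB = { {0.39}, {0.61}, {0.54}, {0.70}, {0.35} };
        Metric mean = summed -> summed[0] / sysA.length;
        System.out.printf("approximate p-value = %.3f%n", pValue(sysA, sysB, mean, 10000, 42L));
      }
    }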
V0.1 - 6/21/2011
* Initial Release