Assembled by Dave Howcroft, June 2013. Available for download from http://www.ling.ohio-state.edu/~mwhite/data/enlg13/.
Companion materials to: Howcroft, Nakatsu, & White. 2013. "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus". ENLG2013.
For further inquiries, contact dave.howcroft@gmail.com or mwhite@ling.ohio-state.edu with "Enhancing Contrast in SPaRKy" in the subject line.
README
This file
sparky-openccg.csv
Combined data from sparky-openccg-no-mods.csv and sparky-openccg-with-mods.csv
sparky-openccg-no-mods.csv
Ratings and LM scores for the realizations without contrastive enhancements
sparky-openccg-no-mods.sents
Realizations (and their identifiers) without contrastive enhancements
sparky-openccg.sents
All the realizations from sparky-openccg-no-mods.sents and sparky-openccg-with-mods.sents
sparky-openccg-with-mods.csv
Ratings and LM scores for the realizations with contrastive enhancements
sparky-openccg-with-mods.sents
Realizations (and their identifiers) with contrastive enhancements
sparky-validation.csv
See "About the SPaRKy Restaurant Corpus" below.
xml-textplans/
Directory containing (in two sub-directories) all of the XML textplans both with and without the contrast enhancements used in this study.
See "XML Textplans" under "File Formats" for more information about the files contained in these directories.
Also contains the following two files:
textplans-for-enlg2013.withMods
*.tp-mod.xml files used in our study
textplans-for-enlg2013.noMods
*.tp.xml files used in our study

File Formats

*.csv
The CSV files contain the ratings assigned by human subjects. For sparky-validation.csv, this means there are two ratings fields, while in sparky-openccg*.csv there is a third rating field as well. sparky-openccg*.csv files also include language model scores associated with each sentence.
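As a rough illustration of the two-rating versus three-rating layout, the sketch below averages whatever rating fields a row has. It assumes each CSV starts with a header row whose column names match the field names used in this README (rating1, rating2, rating3, alt, and cp-id or cp_id); that assumption should be checked against the files themselves. The function name mean_ratings is hypothetical.

```python
import csv
from statistics import mean


def mean_ratings(csv_file):
    """Yield (content plan ID, alt, mean rating) for each row.

    Assumes a header row with the field names used in this README.
    rating3 is simply skipped when the column is absent or empty,
    as is expected for sparky-validation.csv.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        ratings = [
            float(row[name])
            for name in ("rating1", "rating2", "rating3")
            if row.get(name) not in (None, "")
        ]
        # The README writes the ID field both as cp-id and cp_id,
        # so accept either spelling.
        cp_id = row.get("cp-id", row.get("cp_id"))
        yield cp_id, row["alt"], mean(ratings)
```

It could be called as, for example, `with open("sparky-openccg.csv", newline="") as f: rows = list(mean_ratings(f))`.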
cp-id is the content plan ID shortcode
alt is the realization alternative identifier for the sentence
rating1, rating2, and rating3 are the ratings assigned to that sentence by the human judges
lm_sc is the probability of the sentence given by the language model built with semantic classes specific to the SRC domain
lm_gw is the log probability of the sentence given by the Gigaword n-gram language model

*.sents
SENTS files contain one realization per line with accompanying metadata.
There are three fields, separated by colons:
OpenCCG.
NB: The sentences in these files have not been post-processed for consumption by subjects. Before being presented to subjects in the survey, extra spaces around punctuation were removed, sentence-initial words were capitalized, and shorthand like 'otoh' was expanded to its full form (e.g. 'on the other hand').
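The post-processing described in the note above might look roughly like the following. This is only a sketch, not the script used for the study: the function name postprocess_realization is hypothetical, 'otoh' is the only shorthand this README names (so it is the only one in the table), and the punctuation rule only removes whitespace before a punctuation mark.

```python
import re

# Only the shorthand named in this README; any others would be guesses.
SHORTHAND = {"otoh": "on the other hand"}


def postprocess_realization(text):
    """Approximate the clean-up applied before sentences were shown to subjects."""
    # 1. Expand shorthand to its full form (whole words only).
    for short, full in SHORTHAND.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # 2. Remove extra whitespace before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # 3. Capitalize the first letter of the text and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text.strip()
```

The function takes only the realization text, so the metadata fields would need to be split off each line first.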

XML Textplans

The directory xml-textplans contains two directories:
xml-textplans-with-modifications/
xml-textplans-without-modifications/
Each file in these directories corresponds to a particular SRC alternative realization, based on its plan.out file and the TPLAN file for that content plan. Filenames are of the form:
TASK_USER_STRATEGY_alt#.tp(-mod).xml
where
TASK is one of the top-level directories in Walker et al.'s (2007) final_out directory (e.g. cheap, eastvillagejapanese, etc.);
USER is one of the users that completed tasks in the original MATCH experiment (Walker et al. 2004);
STRATEGY is one of compare2, k0.7_compare3, or k-0.7_recommend; and
alt# corresponds to one of the alt subdirectories for that TASK-USER-STRATEGY combination.
For example, midwest_OwenRambow_compare2_alt.tp-mod.xml corresponds to the modified version of the textplan located at:
final_out/midtownwest/OwenRambow_compare2/alt6/plan.out
where the nuclei of the plan (e.g. nucleus:<2>assert-com-price) have been resolved using the TPLAN file located at:
final_out/midtownwest/OwenRambow_compare2/OwenRambow_compare2.tplan
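The mapping from a textplan filename back to its source files can be sketched as below. This is an illustration under assumptions rather than a tool shipped with the data: it assumes TASK and USER never contain an underscore (STRATEGY may, as in k0.7_compare3), that the alt number is always present, and that final_out is laid out as in the example paths above. The function name source_paths is hypothetical.

```python
import re
from pathlib import PurePosixPath

# TASK_USER_STRATEGY_alt#.tp(-mod).xml
# Assumption: TASK and USER contain no underscores; STRATEGY may.
FILENAME_RE = re.compile(
    r"^(?P<task>[^_]+)_(?P<user>[^_]+)_(?P<strategy>.+)_alt(?P<alt>\d+)"
    r"\.tp(?P<mod>-mod)?\.xml$"
)


def source_paths(filename, root="final_out"):
    """Return the plan.out and TPLAN paths that a textplan file was built from."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"not a textplan filename: {filename!r}")
    # Content plan directory, e.g. final_out/TASK/USER_STRATEGY
    cp_dir = PurePosixPath(root) / m["task"] / f"{m['user']}_{m['strategy']}"
    return {
        "modified": m["mod"] is not None,  # True for *.tp-mod.xml
        "plan_out": str(cp_dir / f"alt{m['alt']}" / "plan.out"),
        "tplan": str(cp_dir / f"{cp_dir.name}.tplan"),
    }
```

For a filename that follows the pattern, such as the hypothetical midtownwest_OwenRambow_compare2_alt6.tp-mod.xml, this yields final_out/midtownwest/OwenRambow_compare2/alt6/plan.out and final_out/midtownwest/OwenRambow_compare2/OwenRambow_compare2.tplan.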

About the SPaRKy Restaurant Corpus

The data used for the Validation portion of our study are available from Marilyn A. Walker's website. The data we used were from "The textplans/utterances" in final_out.tar.gz.
The first level of the uncompressed final_out/ directory is as follows:
final_out/
cheap/
eastvillagejapanese/
french/
italianwestvillage/
midtownwest/
upperwestsideasian/
Each of these directories corresponds to a particular task in the original Walker et al. (2007) study.
Each one contains a set of directories corresponding to different users and different realization strategies (i.e. either comparing two restaurants, comparing three or more restaurants, or recommending a single restaurant).
Our cp_ids correspond to this level of organization. For example, eORc3 refers to the content plan (CP) in the East Village Japanese task as completed by Owen Rambow when the compare3 strategy was used.
In sparky-validation.csv, the alt number (second column) corresponds to one of the alternatives in these CP directories. For example, the content plan mORc2 and alt ID 6 correspond to the alternative located at:
final_out/midtownwest/OwenRambow_compare2/alt6