Assembled by Dave Howcroft, June 2013. Available for download from http://www.ling.ohio-state.edu/~mwhite/data/enlg13/.
Companion materials to: Howcroft, Nakatsu, & White. 2013. "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus". ENLG2013.
For further inquiries, contact dave.howcroft@gmail.com or mwhite@ling.ohio-state.edu with "Enhancing Contrast in SPaRKy" in the subject line.
README                          This file
sparky-openccg.csv              Combined data from sparky-openccg-no-mods.csv and sparky-openccg-with-mods.csv
sparky-openccg-no-mods.csv      Ratings and LM scores for the realizations without contrastive enhancements
sparky-openccg-no-mods.sents    Realizations (and their identifiers) without contrastive enhancements
sparky-openccg.sents            All the realizations from sparky-openccg-no-mods.sents and sparky-openccg-with-mods.sents
sparky-openccg-with-mods.csv    Ratings and LM scores for the realizations with contrastive enhancements
sparky-openccg-with-mods.sents  Realizations (and their identifiers) with contrastive enhancements
sparky-validation.csv           See "About the SPaRKy Restaurant Corpus" below.
xml-textplans/                  Directory containing (in two sub-directories) all of the XML textplans both with and without the contrast enhancements used in this study.
                                See "XML Textplans" under "File Formats" for more information about the files contained in these directories.
                                Also contains the following two files:
    textplans-for-enlg2013.withMods    *.tp-mod.xml files used in our study
    textplans-for-enlg2013.noMods      *.tp.xml files used in our study

File Formats

*.csv

The CSV files contain the ratings assigned by human subjects. For sparky-validation.csv, this means there are two ratings fields, while in sparky-openccg*.csv there is a third rating field as well. sparky-openccg*.csv files also include language model scores associated with each sentence.
cp-id is the content plan ID shortcode
alt is the realization alternative identifier for the sentence
rating1, rating2, and rating3 are the ratings assigned to that sentence by the human judges
lm_sc is the probability of the sentence given by the language model built with semantic classes specific to the SRC domain
lm_gw is the log probability of the sentence given by the Gigaword n-gram language model

*.sents

SENTS files contain one realization per line with accompanying metadata.
There are three fields, separated by colons:
OpenCCG.

NB: The sentences in these files have not been post-processed for consumption by subjects: extra spaces around punctuation were removed, sentence-initial words were capitalized, and shorthand like 'otoh' was expanded to its full form (e.g. 'on the other hand') before being presented to subjects in the survey.
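
The post-processing described in the note above can be approximated with a short Python sketch. This is not the script used in the study; the function name postprocess, the SHORTHAND table (only 'otoh' is named in this README), and the order of the steps are assumptions made for illustration.

```python
import re

# Hypothetical shorthand table: only 'otoh' is mentioned in this README.
SHORTHAND = {"otoh": "on the other hand"}


def postprocess(realization):
    """Approximate the clean-up applied before sentences were shown to subjects."""
    text = realization.strip()
    # Expand shorthand tokens to their full forms.
    for short, full in SHORTHAND.items():
        text = re.sub(r"\b%s\b" % re.escape(short), full, text)
    # Remove extra whitespace before punctuation marks.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Capitalize the first letter of the text and of each following sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


if __name__ == "__main__":
    # Made-up input, not a line from the corpus.
    print(postprocess("otoh , the food is good . it is expensive ."))
```

The sketch expects the realization text only, so the identifier fields of a SENTS line would need to be split off first.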
XML Textplans

The directory xml-textplans contains two directories:
    xml-textplans-with-modifications/
    xml-textplans-without-modifications/
Each file in these directories corresponds to a particular SRC alternative realization, based on its plan.out file and the TPLAN file for that content plan. Filenames are of the form:
TASK_USER_STRATEGY_alt#.tp(-mod).xml
where
    TASK is one of the top-level directories in Walker et al.'s (2007) final_out directory (e.g. cheap, eastvillagejapanese, etc.);
    USER is one of the users that completed tasks in the original MATCH experiment (Walker et al. 2004);
    STRATEGY is one of compare2, k0.7_compare3, or k-0.7_recommend; and
    alt# corresponds to one of the alt subdirectories for that TASK-USER-STRATEGY combination.

For example, midtownwest_OwenRambow_compare2_alt6.tp-mod.xml corresponds to the modified version of the textplan located at:
final_out/midtownwest/OwenRambow_compare2/alt6/plan.out
where the nuclei of the plan (e.g. nucleus:<2>assert-com-price) have been resolved using the TPLAN file located at:
final_out/midtownwest/OwenRambow_compare2/OwenRambow_compare2.tplan
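
The naming scheme above can be unpacked mechanically. The following Python sketch is illustrative only: the function names parse_textplan_name and source_paths are hypothetical, and it assumes that TASK and USER never contain underscores (two of the STRATEGY values do, so they are matched literally). The example filename is built from the pattern described above.

```python
import re

# The three strategy names listed in this README; two contain an underscore.
STRATEGIES = ("compare2", "k0.7_compare3", "k-0.7_recommend")

# Assumption: TASK and USER contain no underscores.
FILENAME_RE = re.compile(
    r"^(?P<task>[^_]+)_(?P<user>[^_]+)_(?P<strategy>%s)"
    r"_alt(?P<alt>\d+)\.tp(?P<mod>-mod)?\.xml$"
    % "|".join(re.escape(s) for s in STRATEGIES)
)


def parse_textplan_name(filename):
    """Split a textplan filename into its TASK, USER, STRATEGY and alt parts."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError("unrecognized textplan filename: %r" % filename)
    return {
        "task": m.group("task"),
        "user": m.group("user"),
        "strategy": m.group("strategy"),
        "alt": int(m.group("alt")),
        "modified": m.group("mod") is not None,
    }


def source_paths(info, root="final_out"):
    """Return the plan.out and TPLAN paths that a textplan was derived from."""
    cp_dir = "%s_%s" % (info["user"], info["strategy"])
    plan = "/".join([root, info["task"], cp_dir, "alt%d" % info["alt"], "plan.out"])
    tplan = "/".join([root, info["task"], cp_dir, cp_dir + ".tplan"])
    return plan, tplan


if __name__ == "__main__":
    info = parse_textplan_name("midtownwest_OwenRambow_compare2_alt6.tp-mod.xml")
    print(info)
    print(source_paths(info))
```

Matching the strategy names literally, rather than splitting on every underscore, keeps k0.7_compare3 and k-0.7_recommend intact.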
About the SPaRKy Restaurant Corpus

The data used for the Validation portion of our study are available from Marilyn A. Walker's website. The data we used were from "The textplans/utterances" in final_out.tar.gz.
The first level of the uncompressed final_out/ directory is as follows:
final_out/
    cheap/
    eastvillagejapanese/
    french/
    italianwestvillage/
    midtownwest/
    upperwestsideasian/
Each of these directories corresponds to a particular task in the original Walker et al. (2007) study.
Each task directory contains a set of subdirectories corresponding to different users and different realization strategies (i.e. either comparing two restaurants, comparing three or more restaurants, or recommending a single restaurant).
Our cp_ids correspond to this level of organization. For example, eORc3 refers to the content plan (CP) in the East Village Japanese task as completed by Owen Rambow when the compare3 strategy was used.
In sparky-validation.csv, the alt number (second column) corresponds to one of the alternatives in these CP directories. For example, the content plan mORc2 and alt ID 6 correspond to the alternative located at:
final_out/midtownwest/OwenRambow_compare2/alt6