Data for "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus"

Assembled by Dave Howcroft, June 2013. Available for download from http://www.ling.ohio-state.edu/~mwhite/data/enlg13/.

Companion materials to: Howcroft, Nakatsu, & White. 2013. "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus". ENLG2013.

For further inquiries, contact dave.howcroft@gmail.com or mwhite@ling.ohio-state.edu with "Enhancing Contrast in SPaRKy" in the subject line.

File Inventory

README

This file

sparky-openccg.csv

Combined data from sparky-openccg-no-mods.csv and sparky-openccg-with-mods.csv

sparky-openccg-no-mods.csv

Ratings and LM scores for the realizations without contrastive enhancements

sparky-openccg-no-mods.sents

Realizations (and their identifiers) without contrastive enhancements

sparky-openccg.sents

All the realizations from sparky-openccg-no-mods.sents and sparky-openccg-with-mods.sents

sparky-openccg-with-mods.csv

Ratings and LM scores for the realizations with contrastive enhancements

sparky-openccg-with-mods.sents

Realizations (and their identifiers) with contrastive enhancements

sparky-validation.csv

Ratings for realizations drawn from the SPaRKy Restaurant Corpus.

See "About the SPaRKy Restaurant Corpus" below.

xml-textplans/

Directory containing (in two sub-directories) all of the XML textplans both with and without the contrast enhancements used in this study.

See "XML Textplans" under "File Formats" for more information about the files contained in these directories.

Also contains the following two files:

textplans-for-enlg2013.withMods
- the list of 122 *.tp-mod.xml files used in our study
textplans-for-enlg2013.noMods
- the list of 122 *.tp.xml files used in our study

File Formats

`*.csv`

The CSV files contain the ratings assigned by human subjects. sparky-validation.csv has two rating fields; the sparky-openccg*.csv files have a third rating field and also include the language model scores associated with each sentence.

cp-id is the content plan ID shortcode
alt is the realization alternative identifier for the sentence
- In sparky-validation.csv these correspond to alt IDs in the SRC. See "About the SPaRKy Restaurant Corpus" for more information.
rating1, rating2, and rating3 are the ratings assigned to that sentence by the human judges
- These ratings range from very unnatural (1) to very natural (7).
lm_sc is the probability of the sentence given by the language model built with semantic classes specific to the SRC domain
lm_gw is the log probability of the sentence given by the Gigaword n-gram language model
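
As a rough illustration of how these columns might be consumed, the following Python sketch averages the human ratings in each row. It assumes the file has a header row whose names match the fields listed above (cp-id, alt, rating1, and so on), which should be checked against the actual files. The helper name mean_ratings is hypothetical, and the sample rows are invented. rating3 is treated as optional so that the same code also covers sparky-validation.csv.

```python
import csv
import io
from statistics import mean

RATING_FIELDS = ("rating1", "rating2", "rating3")

def mean_ratings(handle):
    """Yield (cp-id, alt, mean rating) for each row of an open CSV file.

    Assumes a header row using the field names described above; rating
    columns that are absent or empty are skipped.
    """
    for row in csv.DictReader(handle):
        ratings = [float(row[f]) for f in RATING_FIELDS if row.get(f)]
        yield row["cp-id"], int(row["alt"]), mean(ratings)

# Invented sample rows for illustration only, not taken from the data.
sample = io.StringIO(
    "cp-id,alt,rating1,rating2,rating3\n"
    "mORc2,6,5,6,7\n"
    "mORc2,106,4,4,7\n"
)
results = list(mean_ratings(sample))
```

With a real file, the same function can be given a handle opened with open("sparky-openccg.csv", newline="").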

`*.sents`

SENTS files contain one realization per line with accompanying metadata.

There are three fields, separated by colons:

a shortcode referring to the content plan which the sentence realizes;
an integer labelling this realization alternative (alt) for our experiment;
- These are arbitrary, except insofar as numbers 1-20 are used for the 20 realizations without the contrast enhancements and numbers 101-120 are used for the 20 realizations with the enhancements.
the sentence itself, as produced by OpenCCG.
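
A minimal Python sketch of reading this format follows. It assumes there is no padding around the two field separators, and it splits on the first two colons only, so that any colon inside the sentence is preserved. The function names and the example line are hypothetical.

```python
def parse_sents_line(line):
    """Split one SENTS line into (content plan shortcode, alt, sentence)."""
    # maxsplit=2 keeps any colons inside the sentence itself intact.
    cp_id, alt, sentence = line.rstrip("\n").split(":", 2)
    return cp_id, int(alt), sentence

def has_contrast_enhancements(alt):
    """Apply the numbering convention above: 101-120 mark the enhanced set."""
    return 101 <= alt <= 120

# Invented example line, not taken from the data.
record = parse_sents_line("mORc2:106:otoh , the price is 20 dollars .\n")
```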

NB: The sentences in these files have not been post-processed for consumption by subjects. Before the sentences were presented to subjects in the survey, extra spaces around punctuation were removed, sentence-initial words were capitalized, and shorthand like 'otoh' was expanded to its full form (e.g. 'on the other hand').
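
The clean-up applied for the survey can be approximated as in the Python sketch below. This is not the script used in the study: the exact rules are not documented here, so the punctuation set, the capitalization rule, and the shorthand table (which contains only the one documented entry, 'otoh') are assumptions. The function name postprocess is hypothetical.

```python
import re

# Only 'otoh' is documented above; other shorthand would need adding here.
SHORTHAND = {"otoh": "on the other hand"}

def postprocess(text):
    """Approximate the survey clean-up described above."""
    # Expand shorthand tokens (whole words only).
    for short, full in SHORTHAND.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # Remove spaces before common punctuation marks (assumed set).
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Capitalize the first letter of the line and of each later sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

# Invented input for illustration only.
cleaned = postprocess("otoh , the food is good . it has nice decor .")
```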

XML Textplans

The directory xml-textplans contains two directories:

xml-textplans-with-modifications/
xml-textplans-without-modifications/

Each file in these directories corresponds to a particular SRC alternative realization, based on its plan.out file and the TPLAN file for that content plan. Filenames are of the form:

TASK_USER_STRATEGY_alt#.tp(-mod).xml

where

TASK is one of the top-level directories in Walker et al.'s (2007) final_out directory (e.g. cheap, eastvillagejapanese);
USER is one of the users that completed tasks in the original MATCH experiment (Walker et al. 2004);
STRATEGY is one of compare2, k0.7_compare3, or k-0.7_recommend; and
alt# corresponds to one of the alt subdirectories for that TASK-USER-STRATEGY combination.
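
The naming scheme above can be taken apart with a regular expression, as in this Python sketch. It assumes that TASK never contains an underscore (true of the six task directories listed in this README) and that STRATEGY is exactly one of the three values given above. The function name and the user name in the example are hypothetical.

```python
import re

FILENAME_RE = re.compile(
    r"^(?P<task>[^_]+)_(?P<user>.+)"
    r"_(?P<strategy>compare2|k0\.7_compare3|k-0\.7_recommend)"
    r"_alt(?P<alt>\d+)\.tp(?P<mod>-mod)?\.xml$"
)

def parse_textplan_name(filename):
    """Return the components of a TASK_USER_STRATEGY_alt#.tp(-mod).xml name."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a textplan filename: {filename!r}")
    return {
        "task": match.group("task"),
        "user": match.group("user"),
        "strategy": match.group("strategy"),
        "alt": int(match.group("alt")),
        # True for *.tp-mod.xml (with contrast enhancements).
        "modified": match.group("mod") is not None,
    }

# Invented filename built from the pattern above; 'SomeUser' is a placeholder.
parts = parse_textplan_name("cheap_SomeUser_k0.7_compare3_alt12.tp.xml")
```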

For example, midwest_OwenRambow_compare2_alt.tp-mod.xml corresponds to the modified version of the textplan located at:

final_out/midtownwest/OwenRambow_compare2/alt6/plan.out

where the nuclei of the plan (e.g. nucleus:<2>assert-com-price) have been resolved using the TPLAN file located at:

final_out/midtownwest/OwenRambow_compare2/OwenRambow_compare2.tplan
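
The correspondence between a textplan's components and its two source files in the SRC can be sketched in Python as follows. The function name source_paths and the default root are hypothetical; the path layout is inferred from the example paths in this section.

```python
from pathlib import PurePosixPath

def source_paths(task, user, strategy, alt, root="final_out"):
    """Return the (plan.out, TPLAN) paths for one SRC alternative."""
    # Content plan directories are named USER_STRATEGY inside the task directory.
    cp_dir = PurePosixPath(root) / task / f"{user}_{strategy}"
    plan = cp_dir / f"alt{alt}" / "plan.out"
    tplan = cp_dir / f"{user}_{strategy}.tplan"
    return plan, tplan

plan, tplan = source_paths("midtownwest", "OwenRambow", "compare2", 6)
```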

About the SPaRKy Restaurant Corpus (SRC)

The data used for the Validation portion of our study are available from Marilyn A. Walker's website. The data we used were from "The textplans/utterances" in final_out.tar.gz.

The first level of the uncompressed final_out/ directory is as follows:

final_out/
  cheap/
  eastvillagejapanese/
  french/
  italianwestvillage/
  midtownwest/
  upperwestsideasian/

Each of these directories corresponds to a particular task in the original Walker et al. (2007) study.

Each of these contains a set of directories corresponding to different users and different realization strategies (i.e. either comparing two restaurants, comparing three or more restaurants, or recommending a single restaurant).

Our cp_ids correspond to this level of organization. For example, eORc3 refers to the content plan (CP) in the East Village Japanese task as completed by Owen Rambow when the compare3 strategy was used.

In sparky-validation.csv, the alt number (second column) corresponds to one of the alternatives in these CP directories. For example, the content plan mORc2 and alt ID 6 correspond to the alternative located at:

final_out/midtownwest/OwenRambow_compare2/alt6
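
The directory layout described in this section can be traversed with a short Python sketch like the one below. It assumes the three-level structure shown above (task, then content plan, then alt directories whose names begin with "alt"). The function name is hypothetical, and the temporary tree merely stands in for an unpacked final_out.tar.gz.

```python
import tempfile
from pathlib import Path

def list_alternatives(final_out):
    """Yield (task, content plan directory, alt directory) name triples."""
    for alt_dir in sorted(Path(final_out).glob("*/*/alt*")):
        if alt_dir.is_dir():
            cp_dir = alt_dir.parent
            yield cp_dir.parent.name, cp_dir.name, alt_dir.name

# Build a tiny stand-in tree to show the call.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "midtownwest" / "OwenRambow_compare2" / "alt6").mkdir(parents=True)
    found = list(list_alternatives(tmp))
```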

Edit History

June 2013
- First version.