Data for "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus"

Assembled by Dave Howcroft, June 2013. Available for download from http://www.ling.ohio-state.edu/~mwhite/data/enlg13/.

Companion materials to: Howcroft, Nakatsu, & White. 2013. "Enhancing the Expression of Contrast in the SPaRKy Restaurant Corpus". ENLG2013.

For further inquiries, contact dave.howcroft@gmail.com or mwhite@ling.ohio-state.edu with "Enhancing Contrast in SPaRKy" in the subject line.

File Inventory

README

This file

sparky-openccg.csv

Combined data from sparky-openccg-no-mods.csv and sparky-openccg-with-mods.csv

sparky-openccg-no-mods.csv

Ratings and LM scores for the realizations without contrastive enhancements

sparky-openccg-no-mods.sents

Realizations (and their identifiers) without contrastive enhancements

sparky-openccg.sents

All the realizations from sparky-openccg-no-mods.sents and sparky-openccg-with-mods.sents

sparky-openccg-with-mods.csv

Ratings and LM scores for the realizations with contrastive enhancements

sparky-openccg-with-mods.sents

Realizations (and their identifiers) with contrastive enhancements

sparky-validation.csv

Ratings for realizations drawn from the SPaRKy Restaurant Corpus.

See "About the SPaRKy Restaurant Corpus" below.

xml-textplans/

Directory containing (in two sub-directories) all of the XML textplans both with and without the contrast enhancements used in this study.

See "XML Textplans" under "File Formats" for more information about the files contained in these directories.

Also contains the following two files:

File Formats

*.csv

The CSV files contain the ratings assigned by human subjects. sparky-validation.csv has two rating fields, while the sparky-openccg*.csv files have a third rating field as well as the language model scores associated with each sentence. The fields are:

  1. cp-id is the content plan ID shortcode
  2. alt is the realization alternative identifier for the sentence
  3. rating1, rating2, and rating3 are the ratings assigned to that sentence by the human judges
  4. lm_sc is the probability of the sentence given by the language model built with semantic classes specific to the SRC domain
  5. lm_gw is the log probability of the sentence given by the Gigaword n-gram language model
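As a rough illustration of how these fields might be read, the Python sketch below averages the available ratings for each row. It assumes the file starts with a header row that uses the field names listed above, which this README does not confirm. The sample rows and the helper name mean_ratings are made up for illustration and are not taken from the data.

```python
import csv
import io

# Made-up sample in the assumed layout (header row is an assumption).
SAMPLE = """cp-id,alt,rating1,rating2,rating3,lm_sc,lm_gw
mORc2,6,4,3,5,2.5e-18,-57.9
eORc3,2,2,3,2,1.0e-21,-80.4
"""

def mean_ratings(handle):
    """Return {(cp_id, alt): mean rating} for each row of a ratings CSV."""
    means = {}
    for row in csv.DictReader(handle):
        # Use whichever rating fields are present (two or three per file).
        ratings = [
            float(row[key])
            for key in ("rating1", "rating2", "rating3")
            if row.get(key) not in (None, "")
        ]
        means[(row["cp-id"], int(row["alt"]))] = sum(ratings) / len(ratings)
    return means

result = mean_ratings(io.StringIO(SAMPLE))
```

To run this against a real file, pass an open file handle instead of the in-memory sample, and adjust the column names if the actual header differs.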

*.sents

SENTS files contain one realization per line with accompanying metadata.

There are three fields, separated by colons:

  1. a shortcode referring to the content plan which the sentence realizes;
  2. an integer labelling this realization alternative (alt) for our experiment;
  3. the sentence itself, as produced by OpenCCG.
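A line in this format could be split along the following lines. This is a minimal Python sketch: the function name parse_sents_line and the example line are hypothetical, and the sketch assumes the first two fields never contain a colon.

```python
def parse_sents_line(line):
    """Split one .sents line into (cp_id, alt, sentence)."""
    # Split on the first two colons only, so any colon inside the
    # sentence itself is left untouched.
    cp_id, alt, sentence = line.rstrip("\n").split(":", 2)
    return cp_id.strip(), int(alt), sentence.strip()

# Made-up example line, not taken from the data files.
parsed = parse_sents_line("mORc2:6:this restaurant has good food .\n")
```

Limiting the split to two separators matters because a realization may itself contain a colon.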

NB: The sentences in these files are not in the post-processed form shown to subjects. Before being presented to subjects in the survey, extra spaces around punctuation were removed, sentence-initial words were capitalized, and shorthand like 'otoh' was expanded to its full form (e.g. 'on the other hand').
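The clean-up steps described in this note could be approximated as in the Python sketch below. It is not the script used in the study: the function name postprocess, the set of punctuation marks, and the example sentence are assumptions, and only the 'otoh' expansion is documented here.

```python
import re

def postprocess(sentence):
    """Approximate the survey clean-up described above (details assumed)."""
    text = sentence.strip()
    # Expand shorthand; 'otoh' is the only form documented in this README.
    text = re.sub(r"\botoh\b", "on the other hand", text)
    # Remove spaces before common punctuation marks (assumed set).
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Capitalize the first letter of the text and of each sentence
    # that follows '.', '!' or '?'.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text

# Made-up example, not taken from the data files.
cleaned = postprocess("otoh , this restaurant has good food . it is expensive .")
```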

XML Textplans

The directory xml-textplans contains two directories:

xml-textplans-with-modifications/
xml-textplans-without-modifications/

Each file in these directories corresponds to a particular SRC alternative realization, based on its plan.out file and the TPLAN file for that content plan. Filenames are of the form:

TASK_USER_STRATEGY_alt#.tp(-mod).xml

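where TASK, USER, and STRATEGY identify the task, user, and realization strategy of the content plan, # is the number of the alternative, and the -mod suffix marks textplans with the contrast modifications.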

For example, midwest_OwenRambow_compare2_alt6.tp-mod.xml corresponds to the modified version of the textplan located at:

final_out/midtownwest/OwenRambow_compare2/alt6/plan.out

where the nuclei of the plan (e.g. nucleus:<2>assert-com-price) have been resolved using the TPLAN file located at:

final_out/midtownwest/OwenRambow_compare2/OwenRambow_compare2.tplan
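Filenames following the pattern above could be taken apart with a regular expression, as in this Python sketch. The function name parse_textplan_name is hypothetical, and the pattern assumes that the TASK, USER, and STRATEGY parts contain no underscores, which this README does not state.

```python
import re

# Pattern for TASK_USER_STRATEGY_alt#.tp(-mod).xml.
# Assumes no underscores inside TASK, USER, or STRATEGY.
NAME_RE = re.compile(
    r"^(?P<task>[^_]+)_(?P<user>[^_]+)_(?P<strategy>[^_]+)"
    r"_alt(?P<alt>\d+)\.tp(?P<mod>-mod)?\.xml$"
)

def parse_textplan_name(filename):
    """Return the components encoded in a textplan filename."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected textplan filename: {filename}")
    return {
        "task": match.group("task"),
        "user": match.group("user"),
        "strategy": match.group("strategy"),
        "alt": int(match.group("alt")),
        # True when the file carries the contrast modifications.
        "modified": match.group("mod") is not None,
    }

info = parse_textplan_name("midwest_OwenRambow_compare2_alt6.tp-mod.xml")
```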

About the SPaRKy Restaurant Corpus (SRC)

The data used for the Validation portion of our study are available from Marilyn A. Walker's website. The data we used were from "The textplans/utterances" in final_out.tar.gz.

The first level of the uncompressed final_out/ directory is as follows:

final_out/
  cheap/
  eastvillagejapanese/
  french/
  italianwestvillage/
  midtownwest/
  upperwestsideasian/

Each of these directories corresponds to a particular task in the original Walker et al. (2007) study.

Each task directory contains a set of directories corresponding to different users and different realization strategies (i.e. either comparing two restaurants, comparing three or more restaurants, or recommending a single restaurant).

Our cp_ids correspond to this level of organization. For example, eORc3 refers to the content plan (CP) in the East Village Japanese task as completed by Owen Rambow when the compare3 strategy was used.

In sparky-validation.csv, the alt number (second column) corresponds to one of the alternatives in these CP directories. For example, the content plan mORc2 and alt ID 6 correspond to the alternative located at:

final_out/midtownwest/OwenRambow_compare2/alt6
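The mapping from a cp_id and alt number to a directory in the SRC could be sketched in Python as follows. The lookup table holds only the two shortcodes explained in this README, since the full shortcode scheme is not spelled out here. The names KNOWN_CP_IDS and alt_directory are hypothetical, and the directory for eORc3 is inferred from the naming pattern rather than stated.

```python
# Only the two shortcodes documented in this README; extend as needed.
KNOWN_CP_IDS = {
    "eORc3": ("eastvillagejapanese", "OwenRambow", "compare3"),
    "mORc2": ("midtownwest", "OwenRambow", "compare2"),
}

def alt_directory(cp_id, alt, root="final_out"):
    """Build the path of one alternative from its cp_id and alt number."""
    task, user, strategy = KNOWN_CP_IDS[cp_id]
    return f"{root}/{task}/{user}_{strategy}/alt{alt}"

path = alt_directory("mORc2", 6)
```

An unknown shortcode raises a KeyError, which seems safer than guessing at a directory name.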

Edit History