Every weekday 10:00-11:50 Hagerty Hall 186
Instructor: Marie-Catherine de Marneffe
Office hours: Monday 1:00-2:00 or by appointment
Ohio Stadium East, room 118E
Current linguistic research often requires working with large amounts of data (corpus or experimental data). This course offers practical training in standard computational tools for handling different kinds of data for linguistic research. Students will learn computational techniques to access, search and format linguistic datasets, including text corpora, speech and audio, structured representations and experimental measurements. The course will also cover data exploration and visualization.
No programming background is required: the course will cover introductory scripting in Python, R and Praat. The course is designed to be hands-on, and students will have the opportunity to work on the problem sets during class sessions.
- Students will gain hands-on experience gathering, formatting, and manipulating data.
- Students will learn to use corpus, field, and experimental data, as well as to combine data from multiple sources.
- Students will learn to work with existing computational tools.
- At the end of the course, students will be able to process massive amounts of linguistic data.
The course is designed to stand alone, but also to provide an introduction to the graduate Computational Linguistics sequence. It is not a prerequisite for the Computational Linguistics courses, but is helpful for students who lack any prior experience with computational tools.
- Accessing and navigating corpora
- Linguistic data manipulation and visualization
- Automatic processing of structured linguistic representations
- R scripting
- Praat scripting
The schedule page and this one serve as the syllabus for this course.
Unit 1: Basic data manipulations
- Introduction and motivation
- Case study: Do women talk more than men?
- How to use dialogue corpora to test a hypothesis
- Basic Unix environment:
- How to access and navigate data directories
- A computer language to deal with human language: Introductory Python
- Basic file IO
- Decision-making: logic, comparatives, conditionals
Unit 2: Reading text and counting words
- Case study: Investigating Zipf's law
- Counting instances of a word in a file
- Counting all words/bigrams
Unit 3: R and Praat scripting
- Case study: Automatically extracting measurements to make vowel space plots, a glimpse into OhioSpeaks
- The R language:
- Variables, control statements and data structures in R
- Data exploration: the dative alternation, do children differ from adults?
- The Praat language:
- Variables, control statements and data structures in Praat
- Manipulation of audio data
Unit 4: Dealing with linguistic structured representations
- Case study: Which verbs allow the dative alternation?
- Field-structured: CSV, space-delimited
- Tree-structured and NLTK: Penn Treebank parses
- Internationalization, non-standard character sets: how to deal with Arabic, Chinese, Hindi or Cyrillic?
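To give a flavor of the scripts written in the unit on reading text and counting words, here is a minimal Python sketch that counts words and bigrams and ranks words by frequency, the starting point for looking at Zipf's law (word frequency falling off roughly in proportion to 1/rank). It is an illustration only, not course material: the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(tokens):
    """Map each word to the number of times it occurs."""
    return Counter(tokens)


def count_bigrams(tokens):
    """Map each pair of adjacent words to the number of times it occurs."""
    return Counter(zip(tokens, tokens[1:]))


def rank_frequency(counts):
    """Return (rank, word, frequency) triples, most frequent word first."""
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical sample text; in practice this would be read from a corpus file.
sample = "the cat saw the dog and the dog saw the cat"
tokens = tokenize(sample)
words = count_words(tokens)
bigrams = count_bigrams(tokens)

for rank, word, freq in rank_frequency(words):
    print(rank, word, freq)
```

On a real corpus, plotting the logarithm of frequency against the logarithm of rank from `rank_frequency` is one common way to check how closely the data follow Zipf's law.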
There will be three assignments to turn in. Each one will require students to write a short program to perform some analysis of a dataset (for instance, assignment 1 is to write a Python program measuring utterance lengths by men and women in a section of the Fisher corpus). Students will work on the assignments both in class and at home, and are encouraged to work collaboratively in small groups, but everyone must turn in their own assignment.
Periodically, short day-to-day assignments will be given; these are strongly recommended but do not have to be turned in. Participation in class is mandatory.
A tentative schedule for the month is posted on the schedule page. Readings and assignments may change! Deadlines will be announced in class too.
This is a 3-credit course, graded on a letter-grade (A, B, C, D, E) basis. Students are expected to attend class meetings, complete readings and assignments, and actively participate in class discussions.
Homework (75%): Three homework assignments will be due by the beginning of class. They will be turned in through Carmen. Late homework will not be accepted.
Participation (25%): Participation in class will
Grades will be assigned using the standard OSU scale.
Materials for each unit will be posted on the website, as will the slides presented in class. Datasets which cannot be made publicly available will be on Carmen. Assignments will need to be turned in through Carmen.
Note that email from Carmen is sent to your official email address (Name.Number@osu.edu). You should read email sent to your official OSU account on a daily basis.
If you know you won't be able to make a deadline, please see me before you miss the deadline!
As with any class at this university, students are required to follow the Ohio State Code of Student Conduct. In particular, note that students are not allowed to, among other things, submit plagiarized (copied but unacknowledged) work for credit. If any violation occurs, I am required to report the violation to the Committee on Academic Misconduct. See the Committee on Academic Misconduct's Frequently Asked Questions.
Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations. I rely on the Office for Disability Services for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted the Office for Disability Services are encouraged to do so (292-3307; http://www.ods.ohio-state.edu).