Computational Linguistics II: Statistical Natural Language Processing

Semester: Spring 2015 (Jan 13 - Apr 23)
Meeting time: 11:10-12:45 TR
Room: Biological Sciences 676
Instructor: Micha Elsner (melsner@ling.osu.edu)
Office hours: Wed 3:30pm, Oxley 221, or by appointment

This course covers the fundamentals of designing, building and evaluating data-driven programs for natural language processing. There will be six projects, all based on constructing a program and running it on a real dataset, then writing an analysis of the results.

The course is designed for graduate students and advanced undergraduates. It will involve a significant amount of programming (Python recommended), and also a reasonable amount of probability. Computational Linguistics 1 provides these prerequisites. If CL1 was your first programming experience, I hope to make CL2 useful and accessible to you, but be warned that it will be a substantial amount of work.

Various texts and supplemental readings are available on Carmen. These are not required, but they might be helpful. The most important of these is a PDF preprint of "Introduction to Computational Linguistics" by Charniak and Johnson. For general background, you might want to look up "Speech and Language Processing" by Jurafsky and Martin.

Some of the topics and due dates listed may change as the semester goes on. Lecture notes, assignments and materials uploaded to Carmen may also change before the dates on which they are scheduled to be presented. Reading ahead is fine, but be aware that you're looking at last year's versions.

Schedule

Collaboration policy

You are allowed to discuss assignments with classmates, but the code and written assignments you turn in must be entirely your own work; you may not incorporate notes or code snippets written by others in the class, nor should you allow your classmates to use your written notes or view your computer code.

You may use all standard Python libraries and all parts of the scipy/numpy package, as well as their accompanying documentation. In some cases, it may be appropriate to use parts of the NLTK Python NLP toolkit or other publicly available toolkits for NLP or machine learning. Please ask for instructor permission before doing so. You may use the Model and CondModel classes, or other libraries distributed as part of CL1, if you like.

See the COAM site http://oaa.osu.edu/coam.html for more details.

Assignments and grades

Grading will be based primarily on the 6 assignments, each of which has a programming component and an analytical writeup. The 6 assignments are worth 100 points each. You do not have to code in Python. However, all support code I distribute will be in Python, and if you need help with code that is not in Python, I may not be able to read it.

50 points will depend on your giving two mini-presentations. These will be short 5-minute summaries of some paper from the literature related to what we're studying in class.

60 points will depend on timely completion of the "trivia" pre-assignments. The remaining 100 points will be based on attendance and active participation in class activities and discussions.

Numerical grades will be mapped to letter grades using the standard OSU policy of: 93-100 (A), 90-92.9 (A-), 87-89.9 (B+), 83-86.9 (B), 80-82.9 (B-), 77-79.9 (C+), 73-76.9 (C), 70-72.9 (C-), 67-69.9 (D+), 60-66.9 (D), below 60 (E).
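As an illustration only, the mapping above can be sketched in Python. The names CUTOFFS, MAX_POINTS, letter_grade and course_percent are made up for this example. The 810-point total is an assumption obtained by adding up the point values listed in this section, and the sketch applies no rounding, since the policy above does not specify any.

```python
# Letter-grade cutoffs copied from the policy above: (minimum percentage, letter).
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (60, "D"),
]

# Assumption: the course total is the sum of the point values listed above
# (6 assignments at 100 each, 50 for mini-presentations, 60 for trivia,
# 100 for attendance and participation).
MAX_POINTS = 6 * 100 + 50 + 60 + 100


def course_percent(points_earned, max_points=MAX_POINTS):
    """Convert raw points into a percentage of the assumed course total."""
    return 100.0 * points_earned / max_points


def letter_grade(percent):
    """Return the letter for a percentage; anything below 60 is an E."""
    for minimum, letter in CUTOFFS:
        if percent >= minimum:
            return letter
    return "E"
```

Under these assumptions, letter_grade(course_percent(700)) gives "B", because 700 of 810 points is about 86.4%.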

The rubric for projects is as follows:

  • An A project shows evidence of a working implementation; the writeup comprehensively analyzes the system's strengths and weaknesses as well as answering all questions in the assignment; some optional or novel feature has been included, or there is interesting speculation about extensions to the basic algorithm.
  • A B project shows evidence of a working implementation. The writeup is adequate, but lacks a full analysis, and there are no suggestions for extensions or novel features.
  • A C project shows evidence of a mostly-working implementation, with some attempt to explain what went wrong and why.
  • A D project does not work at all, or shows a misunderstanding of basic concepts.
  • An E project is not turned in on time.

Projects must be turned in on time. If you submit a project on time and your grade is above 60% (that is, you made some effort to do all the parts of the project but couldn't get it to work), you may turn in further versions after the due date, which will be regraded for up to full credit. So turn in some attempt at the project by the due date. If you don't turn in anything by the due date, you won't get credit.

Trivia is graded on a 0/1 basis. If you turn it in, you get credit. Otherwise not. The point of trivia is to encourage you to start the assignments early and to make sure you've figured out the basic coding skills you're going to need.

The rubric for mini-presentations is as follows:

  • An A presentation explains what problem the paper is trying to solve, how (at a very high level) it applies the technology we are studying to address it, whether any novel features or extensions of the algorithm were necessary, and what the results were like.
  • A B presentation explains what problem the paper is trying to solve and takes a stab at the remaining issues.
  • A C presentation is deeply confused or far too long.
  • Not giving a presentation earns you an E.

Mini-presentations must be no more than 4 slides long (including the title) or no more than a single handout page, and should take no more than 5 minutes. The point of mini-presentations is to build your skill at skimming and summarizing a paper without focusing on all the minutiae.

Suggested papers are listed on Carmen, but feel free to present any paper.

Disability policy

Any student who feels they may need an accommodation based on the impact of a disability should contact me privately to discuss their specific needs, and contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall to coordinate reasonable accommodations.