ModelBlocks Readme
Occasionally, people ask me for step-by-step instructions on how to use ModelBlocks. Some day, I'll add these to the README, but in the meantime, here are step-by-step instructions for...
Obtaining incremental complexity statistics for a given corpus:
# Set your working directory
cd modelblocks-repository/wsjparse
# Do some preliminary setup that should be unnecessary in the new release
# It doesn't matter if you have the following corpora or not; Make just needs these files to exist
# If you do have the corpora, you can put in the correct locations if you'd like to use them
# In any case, ignore the warnings while these are being made
make user-bnc-location.txt user-mipacq-location.txt user-ccgbank-location.txt user-dundee-location.txt
# You will need the WSJ corpus for training purposes
# After making it, edit the following file to point to your WSJ directory
# You can use a different corpus for training, but you'll have to set that up yourself (see the Makefile for ideas)
make user-treebank-location.txt
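# As a sketch of the edit above: the location file appears to hold just the
# path to your corpus directory. The path below is a placeholder, not a real
# location -- substitute wherever your WSJ/Penn Treebank copy actually lives.

```shell
# Overwrite the generated placeholder with your local treebank path
# (/home/me/corpora/penn-treebank is illustrative only)
echo /home/me/corpora/penn-treebank > user-treebank-location.txt
# Confirm the file now points where you intend
cat user-treebank-location.txt
```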
# Now that all the prereqs have been satisfied, run the following to obtain complexity metrics
# You'll want to run the following in 'screen' or something similar because it takes a long time
time make MYCORPUS.wsj02to21-nodashtags-1671-5sm-bd.x-efabp.-c_-b2000.complextoks
# MYCORPUS refers to a pre-tokenized file with one sentence per line
## genmodel/MYCORPUS.sents
# wsj02to21-nodashtags refers to the training corpus
## Sections 02 to 21 of the WSJ with standard PTB tags
## genmodel/wsj02to21.nodashtags.linetrees
# 1671 is the standard portion of the training set used to tune the split-merge tagset (the last 1671 lines)
# 5sm refers to the number of split-merge iterations
## 5 was shown to be optimal without overfitting by Petrov et al. (2007)
# bd means the grammar is side- and depth-specific (parser juju)
# x-efabp is the parser
# -c is a parser flag that tells it to output complexity metrics
# -b2000 is a parser flag that tells it to run with a beam-width of 2000
## 2000 was shown to be optimal for accuracy with the 5sm PTB tagset by van Schijndel et al. (2013)
## Schuler et al. (in submission) have found that fit to reading times improves with larger beams despite no gain in parsing accuracy
## For reading-time fits, we recommend a beam-width of 5000, but note that this greatly slows the parser, so consider piloting at 2000
# You may also find it useful to check out the ModelBlocks FAQ
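# Putting the input side of the steps above together, a minimal sketch of
# preparing MYCORPUS looks like the following. The two sentences are
# illustrative; the only requirements stated above are that the file be
# pre-tokenized, one sentence per line, and live at genmodel/MYCORPUS.sents.

```shell
# Create the directory Make reads input corpora from
mkdir -p genmodel
# One pre-tokenized sentence per line (tokens separated by spaces,
# punctuation split off); these example sentences are placeholders
printf 'the cat sat on the mat .\nit purred .\n' > genmodel/MYCORPUS.sents
# Then kick off the long-running build, ideally inside screen/tmux:
# time make MYCORPUS.wsj02to21-nodashtags-1671-5sm-bd.x-efabp.-c_-b2000.complextoks
```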