Marie-Catherine de Marneffe & Micha Elsner
LING 5050 - Technical tools for linguists
Maysession 2016
DUE: Friday May 27, 2016 (no late homework accepted!)
We will now work with the original Fisher data rather than the data we pre-transformed for you ;-) We still did a bit of work for you: only the Fisher files that we will need are on Carmen (see the folder "OriginalFisher").
Let's take a look at the file we looked at previously in the first "Unix" class, to see how the file is structured in the original transcript.
$ more 065/fe_03_06596.txt
# fe_03_06596.sph
# Transcribed by BBN/WordWave
0.59 1.92 A: hello
1.96 2.97 B: (( hello ))
2.95 3.98 A: hello
3.71 5.43 B: my name is kevin gonzales
5.49 7.30 A: hi this is carol
6.95 8.35 B: carol okay carol
8.42 9.44 A: [laughter]
9.56 11.27 A: well that was fast
12.48 13.79 B: uh hello
13.47 14.85 A: yeah i'm here
14.58 15.53 B: okay
16.81 25.87 A: um so do you think that public or private school have the right
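Each utterance line in the transcript has the form "start end speaker: text". Here is a minimal sketch of how such a line could be split in Python. It assumes that header lines start with "#" and that blank lines may occur. The function name parse_utterance is hypothetical, and whether tokens such as [laughter] or (( )) should count as words is a decision left to you.

```python
def parse_utterance(line):
    """Split one transcript line into (start, end, speaker, words).

    Returns None for blank lines and for '#' header lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Format assumed from the sample above: "<start> <end> <speaker>: <text>"
    start, end, rest = line.split(None, 2)
    speaker, _, text = rest.partition(":")
    # Splitting on whitespace keeps tokens like "[laughter]" as words
    return float(start), float(end), speaker.strip(), text.split()


print(parse_utterance("3.71 5.43 B: my name is kevin gonzales"))
# (3.71, 5.43, 'B', ['my', 'name', 'is', 'kevin', 'gonzales'])
```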
The transcripts do not contain gender information, so we will need to extract it from another file that keeps track of it: "fe_03_p2_calldata.tbl". Let's look at a few lines of the file (at the beginning and at the end):
CALL_ID,DATE_TIME,TOPICID,SIG_GRADE,CNV_GRADE,APIN,ASX.DL,APHNUM,APHSET,APHTYP,BPIN,BSX.DL,BPHNUM,BPHSET,BPHTYP
05851,20030514_18:49:18,ENG10,4,4,96498,f.o,650428gei,4,3,86972,f.a,no_BPHNUM,4,2
05852,20030514_18:58:48,ENG10,4,2.5,26375,f.a,480948qqo,,3,39550,f.a,917757yjm,,1
05853,20030514_19:06:10,ENG10,4,3.5,45776,f.a,818312pbg,4,1,72959,f.a,no_BPHNUM,4,2
05854,20030514_19:38:43,ENG10,4,3.5,12903,f.a,218879qim,4,3,82880,f.a,718444ekj,4,2
05855,20030514_19:42:07,ENG10,4,4,86322,f.a,931906iqu,4,3,55384,f.a,931526jhk,4,2
05856,20030514_19:43:56,ENG10,4,3.5,95020,f.a,814322ojp,4,3,44763,f.a,no_BPHNUM,2,3
...
11692,20031118_19:47:30,ENG32,4,4,16321,m.a,585615upb,4,3,62607,f.a,206241jlv,4,2
11693,20031118_19:57:18,ENG32,4,4,31775,m.a,765452yfm,4,3,52954,f.o,no_BPHNUM,,
11694,20031118_20:02:23,ENG32,4,4,74447,m.a,718491eip,4,3,27475,m.a,301770vqn,4,3
11695,20031118_20:21:14,ENG32,4,4,17087,m.a,617755gqs,4,1,50757,m.a,727461knu,4,2
11696,20031118_20:31:06,ENG32,4,4,50278,m.a,612599elj,4,1,99630,m.a,512791rev,4,1
11697,20031118_20:51:04,ENG32,4,4,46881,f.a,614276ico,4,2,65668,m.a,203265lbe,4,1
11698,20031118_21:02:23,ENG32,4,4,23441,m.a,210349gbf,4,3,93991,f.a,818762njk,4,3
11699,20031118_21:18:22,ENG32,4,4,18625,m.a,818752hfj,4,3,14313,m.a,215349ppw,4,3
We see that there are different fields, separated by commas, and that the gender information for participants A and B is in the 7th and 12th fields, respectively (ASX.DL and BSX.DL). This format is commonly called "csv" (comma-separated values). The first field contains the conversation ID. The names of the transcript files contain the conversation ID too: it is the last five digits before the ".txt" extension (for example, "fe_03_06596.txt" is conversation 06596).
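As an illustration, here is one possible way to read the gender fields with Python's csv module and to recover the conversation ID from a transcript file name. This is a sketch under assumptions: the function names conversation_id and read_genders are hypothetical, and it assumes that the letter before the dot in ASX.DL and BSX.DL ("f" or "m") is the speaker's gender. The sample uses an in-memory copy of the header and one row shown above; for the real file you would pass open("fe_03_p2_calldata.tbl", newline="") instead.

```python
import csv
import io
import os


def conversation_id(path):
    """Return the digits after the last underscore, e.g. '06596'."""
    name = os.path.basename(path)         # 'fe_03_06596.txt'
    stem = os.path.splitext(name)[0]      # 'fe_03_06596'
    return stem.rsplit("_", 1)[-1]


def read_genders(table):
    """Map each CALL_ID to a dict such as {'A': 'f', 'B': 'm'}."""
    genders = {}
    for row in csv.DictReader(table):
        genders[row["CALL_ID"]] = {
            # Assumption: the part before the dot encodes the gender
            "A": row["ASX.DL"].split(".")[0],
            "B": row["BSX.DL"].split(".")[0],
        }
    return genders


# Header and one row copied from the listing above
sample = io.StringIO(
    "CALL_ID,DATE_TIME,TOPICID,SIG_GRADE,CNV_GRADE,APIN,ASX.DL,APHNUM,"
    "APHSET,APHTYP,BPIN,BSX.DL,BPHNUM,BPHSET,BPHTYP\n"
    "11697,20031118_20:51:04,ENG32,4,4,46881,f.a,614276ico,4,2,"
    "65668,m.a,203265lbe,4,1\n"
)
genders = read_genders(sample)
print(genders["11697"])                         # {'A': 'f', 'B': 'm'}
print(conversation_id("065/fe_03_06596.txt"))   # 06596
```

Keeping the IDs as strings preserves leading zeros, so "05851" in the table matches "05851" taken from a file name.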
With that information, write a python script that outputs:
- the raw total number of words spoken by women
- the raw total number of words spoken by men
- the total number of utterances spoken by women
- the total number of utterances spoken by men
- the average number of words per utterance spoken by women and by men
- the number of female speakers
- the number of male speakers
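To make the relation between the counts above concrete, here is a small sketch of how word and utterance totals could be accumulated per gender and turned into an average (total words divided by total utterances). The function name summarize and the toy data are invented for illustration; counting speakers is not covered here.

```python
from collections import defaultdict


def summarize(utterances):
    """Tally (gender, word_count) pairs, one pair per utterance.

    Returns {gender: (total_words, total_utterances, words_per_utterance)}.
    """
    words = defaultdict(int)
    count = defaultdict(int)
    for gender, n_words in utterances:
        words[gender] += n_words
        count[gender] += 1
    return {g: (words[g], count[g], words[g] / count[g]) for g in count}


# Toy data, not taken from the corpus
toy = [("f", 1), ("m", 5), ("f", 4), ("m", 3)]
print(summarize(toy))
# {'f': (5, 2, 2.5), 'm': (8, 2, 4.0)}
```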
You will submit your code on Carmen in the HW2 submissions folder. Make sure your code runs. Make sure to appropriately comment your code! Find the right balance in your comments: too few or too many isn't helpful.