| Corpora for biomedical natural language processing | |
|
A project of the Biomedical Text Mining Group at the Center for Computational Pharmacology
Lab: RC-1 S. Room L18-6400A
Phone: 303-916-2417
E-mail: Kevin.Cohen@gmail.com
|
Counting words in six biomedical corpora

The six corpora discussed in the paper are distributed in six different formats, so producing a word count for each one required a separate parser. The final version of the paper did not have room for the details of how we arrived at the reported word counts, so the following sections explain what we counted in each corpus and, where relevant, give the code that we used.

Word count for the PDG corpus

The PDG corpus is distributed as a single HTML file. We removed all HTML formatting from the file. This leaves a file in which comments, annotations, and text are all in the same format but on separate lines. We removed all comments and annotations from the file by hand, leaving just the text, and then used the Unix wc command to count whitespace-tokenized words. This gives a count of 10,291. See pdg_corpus.txt for the file with all HTML, comments, and annotations removed.

Word count for the Wisconsin corpus

Size for the U. Wisconsin corpus is based on the first line of all data files in the MIPS/all (1,080,265 words), OMIM/all (291,397 words), and YPD/all (158,069 words) directories. See cravenWordCount.pl for exactly what was counted as a word.

Word count for the GENIA corpus

Size for the GENIA corpus is based on the file GENIAcorpus3.02.pos.txt. This file has each token on a separate line, so the main concern here is to avoid counting tokens that are not words. The script geniaWordCount.pl shows exactly what was ignored and what was counted.

Word count for the Yapex corpus

Size for the Yapex corpus is based on the <ArticleTitle> and <AbstractText> elements in the yapex_ref_collection.txt (23,049 words) and yapex_test_collection.txt (22,094 words) files.
We extracted the plain text from the XML, whitespace-tokenized it, and counted the resulting tokens.

Word count for the GENETAG corpus

Size for the GENETAG corpus is based on the TAGGED_GENE_CORPUS files in the train (170,832 words), test (56,761 words), and round1 (114,981 words) directories. Round2 data is described in the paper, but is not included in the current distribution. Tanabe et al. give much higher word counts (204,195 for train, 68,043 for test, and 137,586 for round1), but I suspect that those counts are after tokenization of punctuation, and are counts of tokens, not words. |
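The difference between counting whitespace-delimited words and counting tokens after punctuation has been split off can be illustrated with a minimal sketch. This is not the code used for the counts above and it is not the GENETAG tokenizer: the function names, the regular expression, and the sample sentence are all made up for illustration, and the splitting rule (every punctuation character becomes its own token) is only an assumption about what such a tokenizer might do.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-delimited words, in the manner of `wc -w`."""
    return len(text.split())


# Hypothetical tokenizer: runs of word characters are one token, and every
# punctuation character is a token of its own. This is an assumption made
# for illustration, not the tokenization used by any of the corpora.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def count_tokens(text: str) -> int:
    """Count tokens after splitting punctuation away from words."""
    return len(TOKEN_PATTERN.findall(text))


# Invented example sentence, not taken from any corpus.
sample = "The IL-2 gene (on chromosome 4) is expressed in T-cells."

print("words: ", count_words(sample))
print("tokens:", count_tokens(sample))
```

On this sentence the token count is noticeably higher than the word count, because hyphens, parentheses, and the final period each become separate tokens. A gap of that kind would be consistent with the discrepancy suspected above, although it does not confirm it.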