Differences
This shows you the differences between two versions of the page.
users:indexing_a_corpus [2010/11/28 17:18] – hYUzRFqmdGJMqTgmehd 196.214.141.219 | users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | Nw4mVa | + | ====== Indexing |
+ | |||
+ | ===== Introduction ===== | ||
+ | |||
+ | Before you can use a corpus with CWB/CQP you have to encode it in a special binary format, write a registry file and create indexes for efficient access. But don't worry, we have two very handy command-line tools that take care of all that stuff for you: // | ||
+ | |||
+ | For an even more convenient encoding procedure, install the [[http:// | ||
+ | |||
+ | ===== Indexing your first corpus ===== | ||
+ | |||
+ | Consider a corpus in the following format: | ||
+ | |||
+ | < | ||
+ | < | ||
+ | <text id=" | ||
+ | <s> | ||
+ | volunteers | ||
+ | work VVB | ||
+ | as PRP as | ||
+ | part NN1 | ||
+ | of PRF of | ||
+ | a AT0 a | ||
+ | team NN1 | ||
+ | and | ||
+ | provide VVB | ||
+ | help NN1-VVB help | ||
+ | </s> | ||
+ | <s> | ||
+ | |||
+ | [...] | ||
+ | |||
+ | </ | ||
+ | </ | ||
+ | < | ||
+ | |||
+ | [...] | ||
+ | |||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | A corpus such as this has three // | ||
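Before encoding, it can help to confirm that every token line in the input really carries the same number of columns. The following is a minimal sketch, assuming the columns are separated by TAB characters and that XML tags sit on lines of their own, as in the sample above; the function name `check_columns` is hypothetical and not part of CWB.

```shell
#!/bin/sh
# Report how many token lines have 1, 2, 3, ... TAB-separated columns.
# Assumption: XML tags start at the beginning of a line and are not tokens.
check_columns() {
    awk -F '\t' '
        /^</ || /^[[:space:]]*$/ { next }   # skip XML tag lines and blank lines
        { count[NF]++ }                     # tally token lines by column count
        END { for (n in count) printf "%d column(s): %d line(s)\n", n, count[n] }
    ' "$1" | sort -n
}
```

Calling `check_columns filename.xml` on a clean three-column file should report a single group of lines; any additional group points to token lines with missing or extra columns.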
+ | |||
+ | In order to encode this corpus for CWB/CQP, follow this procedure: | ||
+ | |||
+ | Create a new directory for your encoded corpus, something like '' | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | If this directory already exists (because you want to re-encode the corpus), make sure to delete all CWB data files in the directory before proceeding. | ||
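The directory preparation described above could be sketched as follows. This is only an illustration under a strong assumption: the directory is dedicated to this one corpus, so every regular file in it is a stale CWB data file that is safe to delete. The function name `prepare_data_dir` is hypothetical, and `-maxdepth` is a widely supported `find` extension rather than strict POSIX.

```shell
#!/bin/sh
# Create the data directory if it is missing, then remove leftover files.
# Assumption: the directory holds nothing except data files of this corpus.
prepare_data_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    # Delete regular files directly inside the directory; subdirectories are kept.
    find "$dir" -maxdepth 1 -type f -exec rm -f {} +
}
```

Pass the data directory you intend to give to the encode tool as the only argument, and double-check the path first, since the deletion is not reversible.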
+ | |||
+ | Now run the encode tool: | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | In the above example: | ||
+ | |||
+ | * **-d ~/ | ||
+ | * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in '' | ||
+ | * **-R / | ||
+ | * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 (" | ||
+ | * **-P pos** tells cwb-encode that the corpus has the positional attribute '' | ||
+ | * **-P lemma** tells cwb-encode that the corpus has the positional attribute '' | ||
+ | * **-V text** tells cwb-encode that the corpus has the structural attribute ''< | ||
+ | * **-S s** tells cwb-encode that the corpus has the structural attribute ''< | ||
+ | |||
+ | **NB**: | ||
+ | |||
+ | * the positional attribute '' | ||
+ | * we ignore the ''< | ||
+ | |||
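To see how the options listed above fit together, here is a small sketch that assembles the command line and prints it instead of running it, so it works even where CWB is not installed. The function name `build_encode_command` and the three paths in the usage line are hypothetical placeholders, not the paths used in the original example.

```shell
#!/bin/sh
# Print (not run) a cwb-encode command line built from the options above.
# Assumption: the paths contain no spaces, since they are printed unquoted.
build_encode_command() {
    data_dir=$1       # -d: directory for the encoded corpus data
    input_file=$2     # -f: original text file containing the corpus
    registry_file=$3  # -R: registry file to be written
    printf '%s\n' "cwb-encode -d $data_dir -f $input_file -R $registry_file -c latin1 -P pos -P lemma -V text -S s"
}

# Usage with placeholder paths (hypothetical):
build_encode_command "$HOME/corpora/data/mycorpus" filename.xml "$HOME/corpora/registry/mycorpus"
```

Review the printed line, adjust the paths to your own setup, and then run it yourself once it looks right.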
+ | Finally, we generate a lexicon, frequency list and index for each attribute using // | ||
+ | |||
+ | <code bash> | ||
+ | cwb-makeall -V MYCORPUS | ||
+ | </code> | ||
+ | |||
+ | Congratulations, | ||
+ | |||
+ | ===== Test it ===== | ||
+ | |||
+ | OK, now the new corpus should be ready. Fire up CQP: | ||
+ | |||
+ | <code bash> | ||
+ | cqp -eC | ||
+ | </code> | ||
+ | |||
+ | and try to use your new corpus: | ||
+ | |||
+ | <code> | ||
+ | [no corpus]> MYCORPUS; | ||
+ | MYCORPUS> | ||
+ | </code> | ||
+ | |||
+ | You can also check the corpus size and some basic statistics from the command line with // | ||
+ | |||
+ | <code bash> | ||
+ | cwb-describe-corpus -s MYCORPUS | ||
+ | </code> | ||
+ | |||
+ | |||
+ | ===== Compressing the data ===== | ||
+ | |||
+ | For any sizable corpus, the CWB data files generated by // | ||
+ | |||
+ | A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to // | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | and you're ready to go! If you have already run // | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | NB: For large corpora (more than a few million words), use the '' |