Differences
This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
| users:indexing_a_corpus [2010/12/01 06:08] – XQovKhFbNlJtR 111.160.70.226 | users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | http://www.carsinsurance4u.com/ online auto insurance quotes qrxhf http://www.autosinsurance4u.com/ car insureance | + | ====== Indexing a corpus ====== |
| + | |||
| + | ===== Introduction ===== | ||
| + | |||
| + | Before you can use a corpus with CWB/CQP, you have to encode it in a special binary format, write a registry file, and create indexes for efficient access. Don't worry: two handy command-line tools take care of all of this for you: // | ||
| + | |||
| + | For an even more convenient encoding procedure, install the [[http://cwb.sourceforge.net/download.php# | ||
| + | |||
| + | ===== Indexing your first corpus ===== | ||
| + | |||
| + | Consider a corpus in the following format: | ||
| + | |||
| + | < | ||
| + | < | ||
| + | <text id="http://www.foo.org/index.html"> | ||
| + | <s> | ||
| + | volunteers | ||
| + | work VVB | ||
| + | as PRP as | ||
| + | part NN1 | ||
| + | of PRF of | ||
| + | a | ||
| + | team NN1 | ||
| + | and | ||
| + | provide VVB | ||
| + | help NN1-VVB help | ||
| + | </ | ||
| + | <s> | ||
| + | |||
| + | [...] | ||
| + | |||
| + | </ | ||
| + | </ | ||
| + | < | ||
| + | |||
| + | [...] | ||
| + | |||
| + | </ | ||
| + | </ | ||
| + | </ | ||
| + | |||
| + | A corpus such as this has three // | ||
| + | |||
| + | In order to encode this corpus for CWB/CQP, follow this procedure: | ||
| + | |||
| + | Create a new directory for your encoded corpus, something like '' | ||
| + | |||
| + | <code bash> | ||
| + | |||
| + | If this directory already exists (because you want to re-encode the corpus), make sure to delete all CWB data files in the directory before proceeding. | ||
| + | |||
| + | Now run the encode tool: | ||
| + | |||
| + | <code bash> | ||
| + | |||
| + | In the above example: | ||
| + | |||
| + | * **-d ~/ | ||
| + | * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in '' | ||
| + | * **-R / | ||
| + | * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 (" | ||
| + | * **-P pos** tells cwb-encode that the corpus has the positional attribute '' | ||
| + | * **-P lemma** tells cwb-encode that the corpus has the positional attribute '' | ||
| + | * **-V text** tells cwb-encode that the corpus has the structural attribute ''< | ||
| + | * **-S s** tells cwb-encode that the corpus has the structural attribute ''< | ||
| + | |||
| + | **NB**: | ||
| + | |||
| + | * the positional attribute '' | ||
| + | * we ignore the ''< | ||
| + | |||
| + | Finally, we generate a lexicon, frequency list and index for each attribute using // | ||
| + | |||
| + | <code bash> | ||
| + | cwb-makeall -V MYCORPUS | ||
| + | </ | ||
| + | |||
| + | Congratulations, | ||
| + | |||
| + | ===== Test it ===== | ||
| + | |||
| + | OK, now the new corpus should be ready. Fire up CQP: | ||
| + | |||
| + | <code bash> | ||
| + | cqp -eC | ||
| + | </ | ||
| + | |||
| + | and try to use your new corpus: | ||
| + | |||
| + | < | ||
| + | [no corpus]> MYCORPUS; | ||
| + | MYCORPUS> | ||
| + | </ | ||
| + | |||
| + | You can also check the corpus size and some basic statistics from the command line with // | ||
| + | |||
| + | <code bash> | ||
| + | cwb-describe-corpus -s MYCORPUS | ||
| + | </ | ||
| + | |||
| + | |||
| + | ===== Compressing the data ===== | ||
| + | |||
| + | For any sizable corpus, the CWB data files generated by // | ||
| + | |||
| + | A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to // | ||
| + | |||
| + | <code bash> | ||
| + | |||
| + | and you're ready to go! If you have already run // | ||
| + | |||
| + | <code bash> | ||
| + | |||
| + | NB: For large corpora (more than a few million words), use the '' | ||
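
The steps described on this page can be summarised as a dry-run sketch that only prints the commands in order rather than running them. The flags are the ones listed above (''-d'', ''-f'', ''-R'', ''-c'', ''-P'', ''-V'', ''-S''). The directory and registry paths are hypothetical placeholders, since the actual paths are not preserved in this revision view, and the ''mkdir -p'' step is an assumption about how the data directory gets created. The helper name ''print_pipeline'' is likewise made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the indexing commands instead of executing them.
# DATA_DIR and REGISTRY_FILE are hypothetical placeholders (assumptions);
# replace them with the locations used on your own system.
CORPUS_NAME="MYCORPUS"
DATA_DIR="${DATA_DIR:-/path/to/data/mycorpus}"
REGISTRY_FILE="${REGISTRY_FILE:-/path/to/registry/mycorpus}"
INPUT_FILE="${INPUT_FILE:-filename.xml}"

print_pipeline() {
    # 1. Create the data directory (assumed to be a plain mkdir).
    printf 'mkdir -p %s\n' "$DATA_DIR"
    # 2. Encode the corpus with the attributes declared on this page.
    printf 'cwb-encode -d %s -f %s -R %s -c latin1 -P pos -P lemma -V text -S s\n' \
        "$DATA_DIR" "$INPUT_FILE" "$REGISTRY_FILE"
    # 3. Build lexicon, frequency list and index for each attribute.
    printf 'cwb-makeall -V %s\n' "$CORPUS_NAME"
    # 4. Check corpus size and basic statistics.
    printf 'cwb-describe-corpus -s %s\n' "$CORPUS_NAME"
}

print_pipeline
```

Review the printed commands before running any of them by hand. The last two commands only find the corpus if its registry file sits in a registry directory that the CWB tools actually search, so adjust ''REGISTRY_FILE'' accordingly.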