Indexing a corpus
Introduction
Before you can use a corpus with CWB/CQP, you have to encode it in a special format and create a registry file. But don't worry, we have two very handy command-line tools that take care of all that for you: cwb-encode and cwb-makeall. You just pass them a few parameters and you're good to go.
Indexing your first corpus
Consider a corpus in the following format:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers	NN2	volunteer
work	VVB	work
as	PRP	as
part	NN1	part
of	PRF	of
a	AT0	a
team	NN1	team
and	CJC	and
provide	VVB	provide
help	NN1-VVB	help
</s>
<s>
[...]
</s>
</text>
<text>
[...]
</text>
</corpus>
A corpus such as this has three positional attributes (word, pos and lemma) and three structural attributes (namely <corpus>, <text> and <s>). In order to encode the corpus using cwb-encode, follow this procedure:
Create a new directory for your encoded corpus, something like mycorpus:
mkdir ~/mycorpus
Now run the encode tool:
cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -P pos -P lemma -V text -S s
In the above example:
- -d ~/mycorpus designates the directory where corpus data will be stored
- -f filename.xml is the filename of the original text file containing the corpus
- -R /corpora/c1/registry/mycorpus is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.)
- -P pos tells cwb-encode that the corpus has the positional attribute pos
- -P lemma tells cwb-encode that the corpus has the positional attribute lemma
- -V text tells cwb-encode that the corpus has the structural attribute <text> and that it has annotations
- -S s tells cwb-encode that the corpus has the structural attribute <s>
NB: the word positional attribute is implicit, and we ignore the <corpus> attribute because it's used to enclose the entire corpus.
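The -P options above only make sense if every token line in the input really carries three tab-separated columns (word, pos, lemma). The following is a minimal sketch of a check for that, assuming the input is laid out one token per line; check_columns is a hypothetical helper written for this page, not part of CWB, and it assumes no token itself starts with "<".

```shell
#!/bin/sh
# Sketch: report token lines that do not have exactly three
# tab-separated columns (word, pos, lemma).
# check_columns is a hypothetical helper, not a CWB tool.
check_columns() {
    # Lines starting with "<" are treated as XML tags and skipped,
    # blank lines are skipped too.
    awk -F '\t' '
        /^</ || /^$/ { next }
        NF != 3      { printf "line %d: %d column(s)\n", NR, NF; bad = 1 }
        END          { exit bad }
    ' "$1"
}

# Small demonstration file built from the sample on this page,
# with one deliberately broken line (the lemma of "work" is missing).
sample=$(mktemp)
printf '<s>\nvolunteers\tNN2\tvolunteer\nwork\tVVB\nas\tPRP\tas\n</s>\n' > "$sample"
check_columns "$sample" || echo "fix the lines listed above before encoding"
rm -f "$sample"
```

On your own data you would call it as check_columns filename.xml; a non-zero exit status means at least one line needs attention.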
Finally, we generate a lexicon and index using cwb-makeall (note that the corpus name is written in uppercase here):
cwb-makeall -V MYCORPUS
Congratulations, you've just indexed your first corpus!
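If you index corpora regularly, the steps above can be collected into one small script. What follows is only a sketch: the variable names, the run helper and the dry-run switch are made up for illustration, and the paths are the ones used on this page. It also passes -r to cwb-makeall so that the registry directory is stated explicitly; the plain command shown above relies on that directory being your default registry or being named in the CORPUS_REGISTRY environment variable.

```shell
#!/bin/sh
# Sketch: the whole indexing procedure in one script.
# All names below are illustrative; adjust the paths to your setup.
set -e

CORPUS="${CORPUS:-mycorpus}"                  # lowercase name, used for the registry file
DATA_DIR="${DATA_DIR:-$HOME/mycorpus}"        # directory for the encoded corpus data
REGISTRY="${REGISTRY:-/corpora/c1/registry}"  # registry directory
INPUT="${INPUT:-filename.xml}"                # original text file containing the corpus
DRY_RUN="${DRY_RUN:-1}"                       # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

index_corpus() {
    # On the command line and in CQP the corpus name is written in uppercase.
    corpus_id=$(printf '%s' "$CORPUS" | tr '[:lower:]' '[:upper:]')

    run mkdir -p "$DATA_DIR"
    run cwb-encode -d "$DATA_DIR" -f "$INPUT" -R "$REGISTRY/$CORPUS" \
        -P pos -P lemma -V text -S s
    run cwb-makeall -r "$REGISTRY" -V "$corpus_id"
}

index_corpus
```

By default the script only prints the three commands so you can check them first; run it with DRY_RUN=0 to actually create the directory and index the corpus.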
Test it
OK, now the new corpus should be ready. Fire up CQP:
cqp -eC
and try to use your new corpus:
[no corpus]> MYCORPUS;
[MYCORPUS]> "paraphernalia";