Indexing a corpus
Introduction
Before you can use a corpus with CWB/CQP you have to encode it in a special binary format, write a registry file and create indexes for efficient access. But don't worry, we have two very handy command-line tools that take care of all that stuff for you: cwb-encode and cwb-makeall. You just have to enter a few parameters, feed them to the encoding tool and you're good to go.
For an even more convenient encoding procedure, install the CWB/Perl interface and use the cwb-make tool instead of cwb-makeall (see below for more information).
Indexing your first corpus
Consider a corpus in the following format:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers	NN2	volunteer
work	VVB	work
as	PRP	as
part	NN1	part
of	PRF	of
a	AT0	a
team	NN1	team
and	CJC	and
provide	VVB	provide
help	NN1-VVB	help
</s>
<s>
[...]
</s>
</text>
<text>
[...]
</text>
</corpus>
A corpus such as this has three positional attributes (word, pos and lemma) and three structural attributes (namely <corpus>, <text> and <s>). While <s> simply marks sentence boundaries, the <text> attribute is also annotated with further information in the form of tag-value pairs (id="…"). The <corpus> tags are irrelevant wrappers and need not be included in the CWB corpus.
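Before encoding, it can be worth checking that every token line really has the three columns described above. The following is a minimal sketch, assuming the columns are separated by TAB characters (the usual CWB input convention) and that tags such as <s> stand on lines of their own. The function name check_vrt and the temporary sample file are hypothetical and serve only as illustration.

```shell
#!/bin/sh
# Hypothetical helper: report token lines that do not have exactly
# three tab-separated columns (word, pos, lemma).
check_vrt() {
    awk -F '\t' '
        /^<.*>$/ { next }    # skip XML tag lines such as <s> or </text>
        NF != 3  { printf "line %d: expected 3 columns, found %d\n", NR, NF; bad = 1 }
        END      { exit bad + 0 }
    ' "$1"
}

# Build a small sample file in the format shown above.
sample=$(mktemp)
printf '<text id="http://www.foo.org/index.html">\n<s>\nvolunteers\tNN2\tvolunteer\nwork\tVVB\twork\n</s>\n</text>\n' > "$sample"

if check_vrt "$sample"; then
    echo "sample OK"
fi
```

The function returns a non-zero exit status as soon as one token line deviates, so it can be used as a guard in front of the encoding step.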
In order to encode this corpus for CWB/CQP, follow this procedure:
Create a new directory for your encoded corpus, something like mycorpus:
mkdir ~/mycorpus
If this directory already exists (because you want to re-encode the corpus), make sure to delete all CWB data files in the directory before proceeding.
Now run the encode tool:
cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus
In the above example:
- -d ~/mycorpus designates the directory where the corpus data will be stored (here, a subdirectory of your home directory)
- -f filename.xml is the name of the original text file containing the corpus (it can also be compressed, with a filename ending in .gz)
- -R /corpora/c1/registry/mycorpus is the full path to the registry file containing information about the corpus (i.e. its location, structures, attributes etc.); this file is generated automatically by cwb-encode, so you don't have to worry about the format details
- -c latin1 indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in CWB release 3.2 and newer (currently in beta-testing)
- -P pos tells cwb-encode that the corpus has the positional attribute pos (parts of speech) in the second column
- -P lemma tells cwb-encode that the corpus has the positional attribute lemma in the third column
- -S text+id tells cwb-encode that the corpus has the structural attribute <text> and that it is annotated with id information (which will automatically be made available as a virtual attribute <text_id> in CQP)
- -S s tells cwb-encode that the corpus has the structural attribute <s> (sentence boundaries)
NB:
- the positional attribute word in the first column is implicit; while it can be renamed using the -p option, CQP does not work properly on corpora without a word attribute, so don't!
- we ignore the <corpus> tags because they just enclose the entire file and are irrelevant for corpus searches; the declaration -0 corpus (digit "zero") is required so that these tags aren't inserted as regular tokens into the encoded corpus
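Finally, we generate a lexicon, frequency list and index for each attribute using cwb-makeall. The corpus is addressed by its ID, which is the name of the registry file (mycorpus) in upper case: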
cwb-makeall -V MYCORPUS
Congratulations, you've just indexed your first corpus!
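The steps of this section can be collected in one small script. This is a sketch, not an official CWB tool: the function names corpus_id and index_corpus and the variable names are made up for illustration, and the paths are the ones used in the example. The -r option passed to cwb-makeall assumes that /corpora/c1/registry is not the default registry directory of your installation; drop it if it is.

```shell
#!/bin/sh
# Values taken from the walkthrough above; adjust them for your own corpus.
DATADIR="$HOME/mycorpus"
REGDIR="/corpora/c1/registry"
NAME="mycorpus"          # registry file name (lower case)
INPUT="filename.xml"

# Corpus ID used by cwb-makeall and CQP: the registry name in upper case.
corpus_id() {
    printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# Hypothetical wrapper around the two commands from this section.
index_corpus() {
    mkdir -p "$DATADIR" || return 1
    cwb-encode -d "$DATADIR" -f "$INPUT" -R "$REGDIR/$NAME" -c latin1 \
        -P pos -P lemma -S text+id -S s -0 corpus || return 1
    # Assumption: the registry directory has to be named explicitly with -r.
    cwb-makeall -r "$REGDIR" -V "$(corpus_id "$NAME")"
}

# Only run when the tools and the input file are actually present.
if command -v cwb-encode >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    index_corpus
else
    echo "skipping: cwb-encode or $INPUT not found"
fi
```

When re-encoding an existing corpus, remember to clear the old data files from the data directory first, as described above; the script deliberately does not delete anything.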
Test it
OK, now the new corpus should be ready. Fire up CQP:
cqp -eC
and try to use your new corpus:
[no corpus]> MYCORPUS;
MYCORPUS> "paraphernalia";
You can also check the corpus size and some basic statistics from the command line with cwb-describe-corpus:
cwb-describe-corpus -s MYCORPUS
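The same test query can also be run without an interactive session. The sketch below writes the two CQP commands to a temporary file and passes it to cqp with the -f option (read commands from a file). It assumes that your cqp build supports -f and that the registry is in a location cqp searches by default; the variable name QUERY_FILE is hypothetical.

```shell
#!/bin/sh
# Write the two CQP commands from the session above into a script file.
QUERY_FILE=$(mktemp)
cat > "$QUERY_FILE" <<'EOF'
MYCORPUS;
"paraphernalia";
EOF

# Run the script only when cqp is installed; otherwise show what would be sent.
if command -v cqp >/dev/null 2>&1; then
    cqp -f "$QUERY_FILE"
else
    cat "$QUERY_FILE"
fi
```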