Indexing a corpus
Introduction
Before you can use a corpus with CWB/CQP you have to encode it in a special binary format, write a registry file and create indexes for efficient access. But don't worry, we have two very handy command-line tools that take care of all that stuff for you: cwb-encode and cwb-makeall. You just have to enter a few parameters, feed them to the encoding tool and you're good to go.
For an even more convenient encoding procedure, install the CWB/Perl interface and use the cwb-make tool instead of cwb-makeall (see below for more information).
Indexing your first corpus
Consider a corpus in the following format:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers	NN2	volunteer
work	VVB	work
as	PRP	as
part	NN1	part
of	PRF	of
a	AT0	a
team	NN1	team
and	CJC	and
provide	VVB	provide
help	NN1-VVB	help
</s>
<s>
[...]
</s>
</text>
<text>
[...]
</text>
</corpus>
A corpus such as this has three positional attributes (word, pos and lemma) and three structural attributes (namely <corpus>, <text> and <s>). While <s> simply marks sentence boundaries, the <text> attribute is also annotated with further information in the form of tag-value pairs (id=“…”). The <corpus> tags are irrelevant wrappers and need not be included in the CWB corpus.
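Before encoding, it can help to confirm that every token line really has the three columns described above. The following is a minimal sketch, not part of CWB: check_columns is a hypothetical helper, and it assumes that columns are separated by TAB characters and that every line starting with "<" is an XML tag rather than a token.

```shell
#!/bin/sh
# check_columns FILE  (hypothetical helper, not a CWB tool)
# Reports token lines that do not have exactly three TAB-separated
# columns (word, pos, lemma). Assumption: lines starting with "<" are
# XML tags and blank lines carry no token, so both are skipped.
check_columns() {
    awk -F '\t' '
        /^</ || /^$/ { next }
        NF != 3 {
            printf "line %d: expected 3 columns, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Build a tiny sample file in the format shown above and check it.
sample=$(mktemp)
printf '<text id="http://www.foo.org/index.html">\n<s>\n' > "$sample"
printf 'volunteers\tNN2\tvolunteer\n' >> "$sample"
printf 'work\tVVB\twork\n' >> "$sample"
printf '</s>\n</text>\n' >> "$sample"

if check_columns "$sample"; then
    echo "all token lines have three columns"
fi
rm -f "$sample"
```

The function returns a non-zero exit status when it finds a malformed line, so it can be used as a guard in front of the encoding step.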
In order to encode this corpus for CWB/CQP, follow this procedure:
Create a new directory for your encoded corpus, something like mycorpus:
mkdir ~/mycorpus
If this directory already exists (because you want to re-encode the corpus), make sure to delete all CWB data files in the directory before proceeding.
Now run the encode tool:
cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus
In the above example:
- -d ~/mycorpus designates the directory where corpus data will be stored (here relative to your home directory)
- -f filename.xml is the filename of the original text file containing the corpus (it can also be compressed, with a filename ending in .gz)
- -R /usr/local/share/cwb/registry/mycorpus is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by cwb-encode, so you don't have to worry about the format details
- -c latin1 indicates that the input file is encoded in the ISO-8859-1 (“Latin-1”) character set; other character sets are available in CWB release 3.2 and newer (currently in beta-testing)
- -P pos tells cwb-encode that the corpus has the positional attribute pos (parts of speech) in the second column
- -P lemma tells cwb-encode that the corpus has the positional attribute lemma in the third column
- -S text+id tells cwb-encode that the corpus has the structural attribute <text> and that it is annotated with id information (which will automatically be made available as a virtual attribute <text_id> in CQP)
- -S s tells cwb-encode that the corpus has the structural attribute <s> (sentence boundaries)
NB:
- the positional attribute word in the first column is implicit; while it can be renamed using the -P option, CQP does not work properly on corpora without a word attribute, so don't!
- we ignore the <corpus> tags because they just enclose the entire file and are irrelevant for corpus searches; the declaration -0 corpus (digit “zero”) is required so that these tags aren't inserted as regular tokens into the encoded corpus
Finally, we generate a lexicon, frequency list and index for each attribute using cwb-makeall:
cwb-makeall -V MYCORPUS
Congratulations, you've just indexed your first corpus!
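The steps above can be collected into a small script. This is only a sketch under stated assumptions: DRY_RUN, run and encode_corpus are illustrative names that do not belong to CWB, and the paths, flags and corpus name are simply those of the example. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch of the encoding procedure described above.
# DRY_RUN, run() and encode_corpus() are illustrative names, not CWB features.
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# encode_corpus DATA_DIR INPUT_FILE REGISTRY_FILE CORPUS_ID
encode_corpus() {
    data_dir=$1 input=$2 registry_file=$3 corpus_id=$4
    # Create the data directory, encode the corpus, then build the indexes.
    run mkdir -p "$data_dir" &&
    run cwb-encode -d "$data_dir" -f "$input" -R "$registry_file" \
        -c latin1 -P pos -P lemma -S text+id -S s -0 corpus &&
    run cwb-makeall -V "$corpus_id"
}

encode_corpus "$HOME/mycorpus" filename.xml \
    /usr/local/share/cwb/registry/mycorpus MYCORPUS
```

Chaining the steps with && stops the procedure as soon as one command fails, so cwb-makeall never runs on a corpus that was not encoded successfully. When re-encoding, old CWB data files in the data directory still have to be deleted beforehand; the sketch does not do this.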
Test it
OK, now the new corpus should be ready. Fire up CQP:
cqp -eC
and try to use your new corpus:
[no corpus]> MYCORPUS;
MYCORPUS> "paraphernalia";
You can also check the corpus size and some basic statistics from the command line with cwb-describe-corpus:
cwb-describe-corpus -s MYCORPUS
Compressing the data
For any sizable corpus, the CWB data files generated by cwb-encode and cwb-makeall should be compressed in order to save disk space and make corpus searches more efficient (because less data have to be read from relatively slow hard disks). This can be done with the CWB utilities cwb-huffcode and cwb-compress-rdx, but the procedure is fairly tedious and error-prone.
A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to cwb-makeall called cwb-make. The cwb-make program performs the same indexing as cwb-makeall, but also compresses the data files automatically. Simply type
cwb-make MYCORPUS
and you're ready to go! If you have already run cwb-makeall, don't worry – cwb-make will detect this and only perform the compression step. You may now also want to use the cwb-regedit tool to add an informational note to the registry file showing that the language of the corpus is English:
cwb-regedit MYCORPUS :prop language en
NB: For large corpora (more than a few million words), use the -M option of cwb-make to indicate how much RAM (in MiB) can be used during the indexing procedure, e.g. -M 2000 if you can spare roughly 2 GiB.
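The compression step, the language note and the optional memory limit can be combined as follows. Again this is a sketch rather than an official recipe: DRY_RUN, run and finish_corpus are hypothetical names, and placing -M before the corpus name is an assumption about the cwb-make command line. In dry-run mode (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch: index and compress with cwb-make, then record the corpus language.
# DRY_RUN, run() and finish_corpus() are illustrative names, not CWB features.
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# finish_corpus CORPUS_ID LANGUAGE [MEMORY_IN_MIB]
finish_corpus() {
    corpus_id=$1 language=$2 mem_mib=${3:-}
    if [ -n "$mem_mib" ]; then
        # Large corpus: tell cwb-make how much RAM (in MiB) it may use.
        run cwb-make -M "$mem_mib" "$corpus_id"
    else
        run cwb-make "$corpus_id"
    fi &&
    run cwb-regedit "$corpus_id" :prop language "$language"
}

finish_corpus MYCORPUS en 2000
```

Omitting the third argument falls back to the plain cwb-make MYCORPUS call, which is sufficient for small corpora.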