Indexing a corpus

Before you can use a corpus with CWB/CQP, you have to encode it in a special format and create a registry file. But don't worry, two very handy command-line tools take care of all that for you: cwb-encode and cwb-makeall. You just have to pass them a few parameters and you're good to go.

Consider a corpus in the following format, with one token per line and TAB-separated columns:

<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers      NN2     volunteer
work    VVB     work
as      PRP     as
part    NN1     part
of      PRF     of
a       AT0     a
team    NN1     team
and     CJC     and
provide VVB     provide
help    NN1-VVB help
</s>
<s>

[...]

</s>
</text>
<text>

[...]

</text>
</corpus>
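A token line with a missing or extra column is easy to overlook in a large file. The following is a minimal sketch of a pre-flight check, not part of CWB: check_columns is a hypothetical helper name, and it assumes that columns are separated by TAB characters and that every line starting with < is an XML tag rather than a token.

```shell
# check_columns FILE N
# Hypothetical helper (not a CWB tool): reports token lines whose number of
# TAB-separated fields differs from N. Exit status is 0 when all lines match.
check_columns() {
    awk -F '\t' -v n="$2" '
        /^</ || /^[ \t]*$/ { next }   # assumption: skip XML tags and blank lines
        NF != n { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

For the example corpus, check_columns filename.xml 3 should print nothing if every token line carries a word, a pos and a lemma column.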

A corpus such as this has three positional attributes (word, pos and lemma) and three structural attributes (namely <corpus>, <text> and <s>). In order to encode the corpus using cwb-encode, follow this procedure:

Create a new directory for your encoded corpus, something like mycorpus:

mkdir ~/mycorpus

Now run the encode tool:

cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -P pos -P lemma -V text -S s

In the above example:

-d ~/mycorpus designates the directory where corpus data will be stored
-f filename.xml is the filename of the original text file containing the corpus
-R /corpora/c1/registry/mycorpus is the full path to the registry file that cwb-encode creates; it holds info about the corpus (i.e. its location, structures, attributes etc.)
-P pos tells cwb-encode that the corpus has the positional attribute pos
-P lemma tells cwb-encode that the corpus has the positional attribute lemma
-V text tells cwb-encode that the corpus has the structural attribute <text> and that its start tags carry annotations (such as the id in the example)
-S s tells cwb-encode that the corpus has the structural attribute <s>

NB: the word positional attribute is implicit, and we ignore the <corpus> attribute because it's used to enclose the entire corpus.

Finally, we generate a lexicon and index using cwb-makeall (note that the corpus name is written in upper case here):

cwb-makeall -V MYCORPUS
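The steps above can be collected into a small shell function. This is only a sketch under a few assumptions: run and index_corpus are hypothetical names, the attribute flags are hard-coded to match the example corpus, and the -r option is added to point cwb-makeall at the same registry directory that was given to cwb-encode (the command above relies on the default registry instead). Setting DRY_RUN=1 prints the commands without executing them.

```shell
# run CMD [ARGS...]
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# index_corpus NAME SOURCE REGISTRY_DIR
# Hypothetical wrapper around the three steps: create the data directory,
# encode the source file, then build the lexicon and index.
index_corpus() {
    name=$1
    src=$2
    registry=$3
    data_dir=$HOME/$name
    # the registry file name is lower case, the corpus ID upper case
    corpus_id=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')

    run mkdir -p "$data_dir" &&
    run cwb-encode -d "$data_dir" -f "$src" -R "$registry/$name" \
        -P pos -P lemma -V text -S s &&
    run cwb-makeall -r "$registry" -V "$corpus_id"
}
```

A dry run for the example would be: DRY_RUN=1 index_corpus mycorpus filename.xml /corpora/c1/registry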

Congratulations, you've just indexed your first corpus!

OK, now the new corpus should be ready. Fire up CQP:

cqp -eC

and try to use your new corpus:

[no corpus]> MYCORPUS;
[MYCORPUS]> "paraphernalia";
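The pos and lemma attributes encoded earlier can be queried as well. As a rough sketch, the function below writes a few CQP queries that could be run non-interactively; write_queries and the file name queries.cqp are made up for illustration, and the query values are taken from the sample corpus above.

```shell
# write_queries
# Hypothetical helper: prints a short CQP script that activates the corpus
# and searches the lemma and pos attributes.
write_queries() {
    cat <<'EOF'
MYCORPUS;
[lemma = "volunteer"];
[pos = "NN1"];
EOF
}
```

Assuming cqp finds the registry as in the interactive session, the script could be run with: write_queries > queries.cqp && cqp -f queries.cqp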
