users:indexing_a_corpus

Differences

This shows you the differences between two versions of the page.

users:indexing_a_corpus [2006/12/13 14:59] 131.173.35.40users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros
Line 3: Line 3:
 ===== Introduction ===== ===== Introduction =====
  
-Before you can use a corpus with CWB/CQP you have to encode it in a special format and create a registry file. But don't worry, we have two very handy command-line tool that take care of all that stuff for you: //cwb-encode// and //cwb-makeall//. You just have to enter a few parameters, feed them to the encoding tool and you're good to go.+Before you can use a corpus with CWB/CQP you have to encode it in a special binary format, write a registry file and create indexes for efficient access. But don't worry, we have two very handy command-line tools that take care of all that stuff for you: //cwb-encode// and //cwb-makeall//. You just have to enter a few parameters, feed them to the encoding tool and you're good to go.
 + 
 +For an even more convenient encoding procedure, install the [[http://cwb.sourceforge.net/download.php#perl|CWB/Perl]] interface and use the //cwb-make// tool instead of //cwb-makeall// (see below for more information).
  
 ===== Indexing your first corpus ===== ===== Indexing your first corpus =====
Line 11: Line 13:
 <file> <file>
 <corpus> <corpus>
-<text>+<text id="http://www.foo.org/index.html">
 <s> <s>
 volunteers      NN2     volunteer volunteers      NN2     volunteer
Line 38: Line 40:
 </file> </file>
  
-A corpus such as this has three positional attributes (word, pos and lemma) and three structural attributes (namely <corpus>, <text> and <s>). In order to encode the corpus using cwb-encode, follow this procedure:+A corpus such as this has three //positional attributes// (''word'', ''pos'' and ''lemma'') and three //structural attributes// (namely ''<corpus>'', ''<text>'' and ''<s>'').  While ''<s>'' simply marks sentence boundaries, the ''<text>'' attribute is also annotated with further information in the form of tag-value pairs (''id="..."'').  The ''<corpus>'' tags are irrelevant wrappers and need not be included in the CWB corpus.
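As a rough illustration of the input format described above, the following sketch writes a tiny one-token-per-line file and checks that every token line has three tab-separated columns (''word'', ''pos'', ''lemma''). The helper name ''check_columns'', the temporary file and the second token line are made up for this example; none of this is part of CWB, and the check naively treats every line starting with ''<'' as an XML tag.

```shell
#!/bin/sh
# Write a tiny sample file in one-token-per-line format.
# The content is illustrative only; the second token is invented.
sample=$(mktemp)
{
  printf '<text id="http://www.foo.org/index.html">\n'
  printf '<s>\n'
  printf 'volunteers\tNN2\tvolunteer\n'
  printf 'help\tVVB\thelp\n'
  printf '</s>\n'
  printf '</text>\n'
} > "$sample"

# check_columns FILE EXPECTED
# Prints the number of token lines in FILE. Returns non-zero if any
# token line does not have EXPECTED tab-separated columns.
check_columns() {
  awk -F '\t' -v n="$2" '
    /^</ { next }   # skip XML tag lines (assumes no token starts with "<")
    NF != n { printf "line %d: %d columns\n", NR, NF > "/dev/stderr"; bad = 1 }
    { tokens++ }
    END { print tokens + 0; exit bad + 0 }
  ' "$1"
}

check_columns "$sample" 3
```

Running such a check before encoding can catch lines with missing or extra columns early, which are otherwise easy to overlook in a large input file.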
  
-Create a new directory for your encoded corpus, something like mycorpus:+In order to encode this corpus for CWB/CQP, follow this procedure: 
 + 
 +Create a new directory for your encoded corpus, something like ''mycorpus'':
  
 <code bash>mkdir ~/mycorpus</code> <code bash>mkdir ~/mycorpus</code>
 +
 +If this directory already exists (because you want to re-encode the corpus), make sure to delete all CWB data files in the directory before proceeding.
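One possible way to start from a clean data directory is sketched below. It assumes that the directory contains nothing but CWB data files for this one corpus, because it deletes every regular file directly inside it; the function name ''reset_data_dir'' and the stand-in file names are hypothetical.

```shell
#!/bin/sh
# reset_data_dir DIR
# Create DIR if it does not exist, then delete every regular file
# directly inside it. Subdirectories are left untouched.
# Assumption: DIR holds only CWB data files for a single corpus.
reset_data_dir() {
  dir=$1
  [ -n "$dir" ] || { echo "usage: reset_data_dir DIR" >&2; return 1; }
  mkdir -p "$dir" || return 1
  find "$dir" -maxdepth 1 -type f -exec rm -f {} +
}

# Demonstration on a scratch directory with stand-in files.
demo=$(mktemp -d)
touch "$demo/stale1.dat" "$demo/stale2.dat"
reset_data_dir "$demo"
ls -A "$demo"   # lists whatever is left in the scratch directory
```

For the corpus in this tutorial the call would be ''reset_data_dir ~/mycorpus''. Double-check the path first, since the files are removed without confirmation.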
  
 Now run the encode tool: Now run the encode tool:
  
-<code>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -P pos -P lemma -S corpus -S text -S s</code>+<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
  
 In the above example: In the above example:
  
-  * **-d ~/mycorpus** designates the directory where corpus data will be stored +  * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory) 
-  * **-f filename.xml** is the filename of the original text file containing the corpus +  * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'') 
-  * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.) +  * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details 
-  * **-P pos** tells cwb-encode that the corpus has the positional attribute //pos/+  * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing) 
-  * **-P lemma** tells cwb-encode that the corpus has the positional attribute //lemma// +  * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column 
-  * **-S corpus** tells cwb-encode that the corpus has the structural attribute //<corpus>// +  * **-P lemma** tells cwb-encode that the corpus has the positional attribute ''lemma'' in the third column 
-  * **-text** tells cwb-encode that the corpus has the structural attribute //<text>// +  * **-S text+id** tells cwb-encode that the corpus has the structural attribute ''<text>'' and that it is annotated with ''id'' information (which will automatically be made available as a virtual attribute ''<text_id>'' in CQP) 
-  * **-S s** tells cwb-encode that the corpus has the structural attribute //<s>//+  * **-S s** tells cwb-encode that the corpus has the structural attribute ''<s>'' (sentence boundaries)
  
-Finally, we generate a lexicon and index using cwb-makeall+**NB**:
  
-<code>+  * the positional attribute ''word'' in the first column is implicit; while it can be renamed using the ''-P'' option, CQP does not work properly on corpora without a ''word'' attribute -- so don't! 
 +  * we ignore the ''<corpus>'' tags because they just enclose the entire file and are irrelevant for corpus searches; the declaration **-0 corpus** (digit "zero") is required so that these tags aren't inserted as regular tokens into the encoded corpus 
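The options listed above can be assembled in a small wrapper so that paths and the character set are set in one place. The following is only a sketch: the variable names and the function ''build_encode_cmd'' are hypothetical, and the function just prints the command line instead of running //cwb-encode//, so you can inspect it first. Paths containing spaces are not quoted in the printed output.

```shell
#!/bin/sh
# Hypothetical settings; adjust them to your own setup.
DATA_DIR="$HOME/mycorpus"
INPUT="filename.xml"
REGISTRY_FILE="/usr/local/share/cwb/registry/mycorpus"
CHARSET="latin1"

# build_encode_cmd
# Print the cwb-encode command line for the settings above
# without executing it (a dry run).
build_encode_cmd() {
  printf 'cwb-encode -d %s -f %s -R %s -c %s' \
    "$DATA_DIR" "$INPUT" "$REGISTRY_FILE" "$CHARSET"
  for p in pos lemma; do      # positional attributes in columns 2 and 3
    printf ' -P %s' "$p"
  done
  printf ' -S text+id -S s'   # structural attributes
  printf ' -0 corpus\n'       # tags to discard
}

build_encode_cmd
```

Once the printed command looks right, it can be copied to the shell prompt and run as shown earlier.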
 + 
 +Finally, we generate a lexicon, frequency list and index for each attribute using //cwb-makeall//:
 + 
 +<code bash>
 cwb-makeall -V MYCORPUS cwb-makeall -V MYCORPUS
 </code> </code>
Line 69: Line 80:
 ===== Test it ===== ===== Test it =====
  
-OK, now the new corpus should be ready, fire up cqp:+OK, now the new corpus should be ready. Fire up CQP:
  
-<code>+<code bash>
 cqp -eC cqp -eC
 </code> </code>
Line 79: Line 90:
 <code> <code>
 [no corpus]> MYCORPUS; [no corpus]> MYCORPUS;
-[MYCORPUS]> "paraphernalia";+MYCORPUS> "paraphernalia";
 </code> </code>
 +
 +You can also check the corpus size and some basic statistics from the command line with //cwb-describe-corpus//:
 +
 +<code bash>
 +cwb-describe-corpus -s MYCORPUS
 +</code>
 +
 +
 +===== Compressing the data =====
 +
 +For any sizable corpus, the CWB data files generated by //cwb-encode// and //cwb-makeall// should be compressed in order to save disk space and make corpus searches more efficient (because less data have to be read from relatively slow hard disks).  This can be done with the CWB utilities //cwb-huffcode// and //cwb-compress-rdx//, but the procedure is fairly tedious and error-prone.
 +
 +A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to //cwb-makeall// called //cwb-make//. The //cwb-make// program performs the same indexing as //cwb-makeall//, but also compresses the data files automatically.  Simply type
 +
 +<code bash>cwb-make MYCORPUS</code>
 +
 +and you're ready to go! If you have already run //cwb-makeall//, don't worry -- //cwb-make// will detect this and only perform the compression step.  You may now also want to use the //cwb-regedit// tool to add an informational note to the registry file showing that the language of the corpus is English:
 +
 +<code bash>cwb-regedit MYCORPUS :prop language en</code>
 +
 +NB: For large corpora (more than a few million words), use the ''-M'' option of //cwb-make// to indicate how much RAM (in MiB) can be used during the indexing procedure, e.g. **-M 2000** if you can spare roughly 2 GiB.
  • users/indexing_a_corpus.1166018397.txt.gz
  • Last modified: 2006/12/13 14:59
  • by 131.173.35.40