users:indexing_a_corpus

Differences

This shows you the differences between two versions of the page.

users:indexing_a_corpus [2011/03/06 13:53] – updated and revised basic instructions stefanusers:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros
Line 52: Line 52:
 Now run the encode tool: Now run the encode tool:
  
-<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
+<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
  
 In the above example: In the above example:
Line 58: Line 58:
   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory)   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory)
   * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'')   * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'')
-  * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
+  * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing)   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing)
   * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column   * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column
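To make the column layout concrete, here is a minimal sketch of what a matching input file might look like. The file name ''sample.vrt'', the tokens, the tags and the text id are all made up for illustration. The assumed layout is one token per line with tab-separated columns (word first, ''pos'' second, ''lemma'' third) and the XML tags for the ''text'' and ''s'' regions on lines of their own.

```bash
#!/usr/bin/env bash
# Hypothetical three-token sample in one-token-per-line format.
# Columns are tab-separated: word, pos (2nd column), lemma (3rd column).
sample=sample.vrt   # illustrative file name, not from the original page

{
  printf '<text id="t1">\n'
  printf '<s>\n'
  printf 'Dogs\tNNS\tdog\n'
  printf 'bark\tVBP\tbark\n'
  printf '.\tSENT\t.\n'
  printf '</s>\n'
  printf '</text>\n'
} > "$sample"

# Collect the second column (the pos attribute) of the token lines only,
# skipping the XML tag lines that mark the text and s regions.
pos_tags=$(grep -v '^<' "$sample" | cut -f2 | paste -sd ' ' -)
echo "pos column: $pos_tags"
```

Checking a column this way before encoding is a cheap way to spot lines where a tab is missing.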
Line 77: Line 77:
  
 Congratulations, you've just indexed your first corpus! Congratulations, you've just indexed your first corpus!
- 
  
 ===== Test it ===== ===== Test it =====
Line 99: Line 98:
 cwb-describe-corpus -s MYCORPUS cwb-describe-corpus -s MYCORPUS
 </code> </code>
 +
 +
 +===== Compressing the data =====
 +
 +For any sizable corpus, the CWB data files generated by //cwb-encode// and //cwb-makeall// should be compressed in order to save disk space and make corpus searches more efficient (because less data have to be read from relatively slow hard disks).  This can be done with the CWB utilities //cwb-huffcode// and //cwb-compress-rdx//, but the procedure is fairly tedious and error-prone.
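For orientation, the manual route might look roughly like the sketch below. It is a dry run that only prints the commands. The ''-A'' flag (process all positional attributes) is taken from the CWB encoding tutorial as I remember it, so check each tool's ''-h'' output before relying on it.

```bash
#!/usr/bin/env bash
# Dry run: only prints the commands, nothing is executed or compressed.
corpus=MYCORPUS   # corpus ID as used elsewhere on this page

manual_steps=(
  "cwb-huffcode -A $corpus"       # compress the token streams of all p-attributes
  "cwb-compress-rdx -A $corpus"   # compress the index files of all p-attributes
)

for step in "${manual_steps[@]}"; do
  echo "would run: $step"
done
```

Both tools should report which uncompressed files are no longer needed afterwards; deleting those by hand is part of what makes the manual route tedious.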
 +
 +A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to //cwb-makeall// called //cwb-make//. The //cwb-make// program performs the same indexing as //cwb-makeall//, but also compresses the data files automatically.  Simply type
 +
 +<code bash>cwb-make MYCORPUS</code>
 +
 +and you're ready to go! If you have already run //cwb-makeall//, don't worry -- //cwb-make// will detect this and only perform the compression step.  You may now also want to use the //cwb-regedit// tool to add an informational note to the registry file showing that the language of the corpus is English:
 +
 +<code bash>cwb-regedit MYCORPUS :prop language en</code>
 +
 +NB: For large corpora (more than a few million words), use the ''-M'' option of //cwb-make// to indicate how much RAM (in MiB) can be used during the indexing procedure, e.g. **-M 2000** if you can spare roughly 2 GiB.
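The steps in this section can be combined into a small wrapper. This is only a sketch: the variable names are illustrative, the ''-M'' value is the example figure from the note above, and the script falls back to printing the commands when //cwb-make// is not on the ''PATH''.

```bash
#!/usr/bin/env bash
# Sketch: index and compress a corpus with a memory limit, then tag its language.
corpus=MYCORPUS    # corpus ID from the examples above
mem_mib=2000       # RAM budget in MiB, i.e. roughly 2 GiB
lang_code=en       # language code for the registry note

# Build the commands as arrays so that arguments stay correctly quoted.
make_cmd=(cwb-make -M "$mem_mib" "$corpus")
regedit_cmd=(cwb-regedit "$corpus" :prop language "$lang_code")

if command -v cwb-make >/dev/null 2>&1; then
  # Only edit the registry if indexing and compression succeeded.
  "${make_cmd[@]}" && "${regedit_cmd[@]}"
else
  # CWB/Perl not installed: show what would be run instead of failing.
  echo "cwb-make not found; would run:"
  echo "  ${make_cmd[*]}"
  echo "  ${regedit_cmd[*]}"
fi
```

Adjust ''mem_mib'' to what the machine can actually spare; the value is passed to ''-M'' unchanged.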
  • Last modified: 2011/03/06 13:53
  • by stefan