Differences

--- users:indexing_a_corpus [2011/03/06 13:53] – updated and revised basic instructions stefan
+++ users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros
@@ Line 52: / Line 52: @@
 Now run the encode tool:
-<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
+<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
 In the above example:
@@ Line 58: / Line 58: @@
   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory)
   * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'')
-  * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
+  * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing)
   * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column
@@ Line 77: / Line 77: @@
 Congratulations, you've just indexed your first corpus!
 ===== Test it =====
@@ Line 99: / Line 98: @@
 cwb-describe-corpus -s MYCORPUS
 </code>
+===== Compressing the data =====
+For any sizable corpus, the CWB data files generated by //cwb-encode// and //cwb-makeall// should be compressed to save disk space and speed up corpus searches (less data has to be read from relatively slow hard disks). This can be done by hand with the CWB utilities //cwb-huffcode// and //cwb-compress-rdx//, but that procedure is fairly tedious and error-prone.
+A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to //cwb-makeall// called //cwb-make//.  The //cwb-make// program performs the same indexing as //cwb-makeall//, but also compresses the data files automatically.  Simply type
+<code bash>cwb-make MYCORPUS</code>
+and you're ready to go! If you have already run //cwb-makeall//, don't worry: //cwb-make// will detect this and perform only the compression step. You may also want to use the //cwb-regedit// tool to record in the registry file that the language of the corpus is English:
+<code bash>cwb-regedit MYCORPUS :prop language en</code>
+NB: For large corpora (more than a few million words), use the ''-M'' option of //cwb-make// to indicate how much RAM (in MiB) can be used during the indexing procedure, e.g. **-M 2000** if you can spare roughly 2 GiB.
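The two forms of the //cwb-make// call described above (with and without a memory limit) can be sketched as a small shell helper. This is only an illustration: ''make_cmd'' is a hypothetical function name, not part of CWB, and it merely prints the command so that you can check it before running anything.

```shell
#!/bin/sh
# Sketch only: make_cmd is a hypothetical helper, not part of CWB.
# It assembles the cwb-make call described above so it can be inspected first.
make_cmd() {
    corpus=$1     # CWB corpus ID in upper case, e.g. MYCORPUS
    mem_mib=$2    # optional RAM limit in MiB, passed to the -M option
    if [ -n "$mem_mib" ]; then
        printf 'cwb-make -M %s %s\n' "$mem_mib" "$corpus"
    else
        printf 'cwb-make %s\n' "$corpus"
    fi
}

# Print the command for a large corpus with roughly 2 GiB of RAM to spare.
make_cmd MYCORPUS 2000
```

Once the printed command looks right, it can be typed in directly or, assuming the CWB/Perl tools are installed and on your ''PATH'', executed with ''make_cmd MYCORPUS 2000 | sh''.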