users:indexing_a_corpus

Differences

This shows you the differences between two versions of the page.

users:indexing_a_corpus [2011/03/06 13:53] – updated and revised basic instructions stefan
users:indexing_a_corpus [2011/03/06 14:03] – notes on data compression and using cwb-make from CWB/Perl stefan
Line 99:
 cwb-describe-corpus -s MYCORPUS
 </code>
 +
 +
 +===== Compressing the data =====
 +
 +For any sizable corpus, the CWB data files generated by //cwb-encode// and //cwb-makeall// should be compressed in order to save disk space and make corpus searches more efficient (because less data have to be read from relatively slow hard disks).  This can be done with the CWB utilities //cwb-huffcode// and //cwb-compress-rdx//, but the procedure is fairly tedious and error-prone.
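
For reference, the manual route might look roughly like the sketch below. It assumes that both utilities are on the PATH and accept the ''-A'' flag to process all positional attributes of the corpus. The helper names ''run'' and ''manual_compress'' and the ''DRY_RUN'' variable are illustrative only and not part of CWB. By default the script just prints the commands it would execute.

```bash
#!/usr/bin/env bash
# Sketch of the manual compression route.
# run, manual_compress and DRY_RUN are hypothetical names, not part of CWB.
# Assumption: -A makes each tool process all positional attributes.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # dry run: print the command only
  else
    "$@"           # real run: execute the command
  fi
}

manual_compress() {
  # Compress the token streams first, then the index files;
  # stop if the first step fails.
  run cwb-huffcode -A "$1" && run cwb-compress-rdx -A "$1"
}

manual_compress MYCORPUS
```

Set ''DRY_RUN=0'' to execute the commands for real. The tools should then report which uncompressed files have become redundant; removing those by hand is what actually frees the disk space, and it is the part of the procedure where mistakes are easiest to make.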
 +
 +A much better solution is to install the CWB/Perl interface, which includes a convenient alternative to //cwb-makeall// called //cwb-make//.  The //cwb-make// program performs the same indexing as //cwb-makeall//, but also compresses the data files automatically.  Simply type
 +
 +<code bash>cwb-make MYCORPUS</code>
 +
 +and you're ready to go! If you have already run //cwb-makeall//, don't worry -- //cwb-make// will detect this and only perform the compression step.  You may now also want to use the //cwb-regedit// tool to add an informational note to the registry file showing that the language of the corpus is English:
 +
 +<code bash>cwb-regedit MYCORPUS :prop language en</code>
 +
 +NB: For large corpora (more than a few million words), use the ''-M'' option of //cwb-make// to indicate how much RAM (in MiB) can be used during the indexing procedure, e.g. ''-M 2000'' if you can spare roughly 2 GiB.
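
The steps of this section can be strung together in a small wrapper, sketched below under a few assumptions: the CWB/Perl tools are on the PATH, //cwb-make// takes its options before the corpus name, and the corpus really is English (otherwise adjust or drop the //cwb-regedit// call). The names ''run'', ''index_and_compress'' and ''DRY_RUN'' are made up for this example. As written, the script only prints the commands.

```bash
#!/usr/bin/env bash
# Illustrative wrapper around the commands from this section.
# run, index_and_compress and DRY_RUN are hypothetical names, not part of CWB.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # dry run: print the command only
  else
    "$@"           # real run: execute the command
  fi
}

index_and_compress() {
  corpus="$1"
  mem_mib="$2"     # optional RAM limit in MiB, passed to cwb-make -M

  if [ -n "$mem_mib" ]; then
    run cwb-make -M "$mem_mib" "$corpus" || return 1
  else
    run cwb-make "$corpus" || return 1
  fi

  # Record the corpus language in the registry file (English assumed here).
  run cwb-regedit "$corpus" :prop language en
}

index_and_compress MYCORPUS 2000
```

Leaving out the second argument runs //cwb-make// without a memory limit, which is fine for small corpora. Set ''DRY_RUN=0'' to execute the commands instead of printing them.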
  • users/indexing_a_corpus.txt
  • Last modified: 2014/09/02 11:23
  • by eros