Differences

This shows you the differences between two versions of the page.

--- users:indexing_a_corpus [2011/03/06 14:03] – notes on data compression and using cwb-make from CWB/Perl stefan
+++ users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros
@@ Line 52: / Line 52: @@
 Now run the encode tool:
-<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
+<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
 In the above example:
@@ Line 58: / Line 58: @@
   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory)
   * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'')
-  * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
+  * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing)
   * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column
@@ Line 77: / Line 77: @@
 Congratulations, you've just indexed your first corpus!
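To double-check the options before encoding, the call from the example can be assembled piece by piece and printed first. The following is only a dry-run sketch: it never executes cwb-encode, the variable names (''data_dir'', ''input_file'', ''registry_file'', ''charset'', ''cmd'') are made up for illustration, and the paths are the placeholder values used in the example above, so adjust them to your own installation.

```bash
#!/usr/bin/env bash
# Dry-run sketch: build the cwb-encode call from the example and print it.
# Nothing is executed; variable names are hypothetical, values are the
# placeholders from the example above.
set -e

data_dir="$HOME/mycorpus"                               # -d: directory for the corpus data
input_file="filename.xml"                               # -f: original text file
registry_file="/usr/local/share/cwb/registry/mycorpus"  # -R: registry file to be generated
charset="latin1"                                        # -c: character set of the input

cmd=(
  cwb-encode
  -d "$data_dir"
  -f "$input_file"
  -R "$registry_file"
  -c "$charset"
  -P pos -P lemma     # positional attributes, as in the example
  -S text+id -S s     # structural attributes, as in the example
  -0 corpus           # copied unchanged from the example
)

# Print the assembled command, shell-quoted, so it can be inspected
# and then pasted into a terminal once it looks right.
printf '%q ' "${cmd[@]}"
printf '\n'
```

Keeping the arguments in an array means each option stays a separate word even if a path contains spaces; once the printed line looks right, it can be run as is.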
 ===== Test it =====