users:indexing_a_corpus

Differences

This shows you the differences between two versions of the page.

users:indexing_a_corpus [2011/03/06 14:03] – notes on data compression and using cwb-make from CWB/Perl stefan
users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros
Line 52:
 Now run the encode tool:
  
-<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
+<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code>
  
 In the above example:
Line 58:
   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory)
   * **-f filename.xml** is the filename of the original text file containing the corpus (can also be compressed with filename ending in ''.gz'')
-  * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
+  * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details
   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing)
   * **-P pos** tells cwb-encode that the corpus has the positional attribute ''pos'' (parts of speech) in the second column
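
As a rough illustration of the options described in this hunk, the sketch below builds a tiny sample input file and runs the same command on it. Everything not shown in the page is an assumption: the sample tokens and tags are made up, the input is assumed to be one token per line with tab-separated columns (word, ''pos'' in the second column, ''lemma'' assumed to be in the third), and temporary directories stand in for ''~/mycorpus'' and the registry path. The encode step is skipped if ''cwb-encode'' is not installed.

```shell
#!/bin/sh
# Hypothetical sample corpus in a throwaway directory (assumption: the
# real data would live in ~/mycorpus and the real registry elsewhere).
workdir=$(mktemp -d)

# Structural tags go on lines of their own.
printf '<text id="t1">\n<s>\n' > "$workdir/filename.xml"

# One token per line: word <TAB> pos <TAB> lemma (column order assumed).
printf '%s\t%s\t%s\n' \
  The DT the \
  dog NN dog \
  barks VBZ bark \
  . SENT . >> "$workdir/filename.xml"

printf '</s>\n</text>\n' >> "$workdir/filename.xml"

# Sanity check: count token lines that do not have exactly three columns.
bad_lines=$(awk -F '\t' '!/^</ && NF != 3 {n++} END {print n+0}' "$workdir/filename.xml")
echo "token lines with wrong column count: $bad_lines"

# Run the encoder only if it is available; the flags are the ones from
# the command in this page, with paths swapped for the temp directory.
if command -v cwb-encode >/dev/null 2>&1; then
  mkdir -p "$workdir/data" "$workdir/registry"
  cwb-encode -d "$workdir/data" -f "$workdir/filename.xml" \
    -R "$workdir/registry/mycorpus" -c latin1 \
    -P pos -P lemma -S text+id -S s -0 corpus
fi
```

Checking the column count before encoding is only a convenience for catching malformed lines early; it is not part of the procedure described in the page.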
Line 77:
 
 Congratulations, you've just indexed your first corpus!
-
 
 ===== Test it =====
  • users/indexing_a_corpus.txt
  • Last modified: 2014/09/02 11:23
  • by eros