users:indexing_a_corpus [2011/03/06 14:03] – notes on data compression and using cwb-make from CWB/Perl stefan | users:indexing_a_corpus [2014/09/02 11:23] (current) – [Indexing your first corpus] eros |
---|
Now run the encode tool: | Now run the encode tool: |
| |
<code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code> | <code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code> |
| |
In the above example: | In the above example: |
* **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory) | * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory) |
* **-f filename.xml** names the input file containing the corpus text (it may be gzip-compressed, with a filename ending in ''.gz'') | * **-f filename.xml** names the input file containing the corpus text (it may be gzip-compressed, with a filename ending in ''.gz'') |
* **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details | * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details |
* **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing) | * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing) |
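* **-P pos** tells //cwb-encode// that the corpus has the positional attribute ''pos'' (part of speech) in the second column of the input file (the first column holds the word form by default) | * **-P pos** tells //cwb-encode// that the corpus has the positional attribute ''pos'' (part of speech) in the second column of the input file (the first column holds the word form by default) |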
| |
Congratulations, you've just indexed your first corpus! | Congratulations, you've just indexed your first corpus! |
| |
| |
===== Test it ===== | ===== Test it ===== |