===== Performance improvements =====

The underlying algorithms still do not scale up to corpora of a billion words or more, and are even pushed to their limits when extracting complex n-gram tables from the 100-million-word BNC.  Some improvements could be made with moderate effort and included in the official 3.2 release.  These are in particular:
  
-  * The frequency counting algorithms used by ''count'' and ''group'' in CQP are far from optimal (in different ways, so depending on the task at hand, one or the other might work reasonably well).  Both should be re-implemented with hash tables in order to ensure high performance in all situations. +  * The **frequency counting algorithms used by ''count'' and ''group'' in CQP** are far from optimal (in different ways, so depending on the task at hand, one or the other might work reasonably well).  Both should be re-implemented with hash tables in order to ensure high performance in all situations. 
  * Although ''cwb-scan-corpus'' is designed to be memory-efficient, it does not scale up to very large corpora because it collects frequency data in memory (for instance, n-gram scans of the BNC require a computer with more than 8 GiB RAM).  It would be desirable to implement a **split-and-merge strategy for ''cwb-scan-corpus''** in order to process arbitrary amounts of data with limited RAM (a merge sketch follows this list):
    * ''cwb-scan-corpus'' is run on suitably small slices of a large corpus or multiple individual corpora; for each slice/corpus, frequency data are stored in a well-defined sort order; finally, the individual data files are merged in linear time (because they are already sorted)
    * required changes: (i) option to save the frequency table in a well-defined sort order (using ''cl_strcmp''); (ii) "merge" mode to combine multiple component files in linear time and apply a frequency filter (multi-way merge, caching as much data from the input streams as possible in RAM in order to reduce disk thrashing)
    * ideally, the user can specify a memory limit; when the limit is reached during a corpus scan, ''cwb-scan-corpus'' saves partial frequency data to disk, then re-starts the scan as a new "slice"; it either notifies the user or automatically runs the multi-way merge at the end (using the allowed amount of RAM to cache input streams)
  * Neither ''cwb-scan-corpus'' nor the CQP commands are suitable for the extraction of surface collocations (i.e. cooccurrence within a window of //n// words).  It would be nice to implement a modified version of ''cwb-scan-corpus'' that carries out such **collocation scans** (a window-scan sketch follows this list).
    * internal frequency tables as well as the split-and-merge procedure will be very similar to ''cwb-scan-corpus'', so implementation of a basic collocation scan should be reasonably easy
    * further options for filtering nodes and collocates can be added at a later time
  * All improvements suggested here are based on specialised **high-performance hash tables**.  These should be implemented in the CL library and exported in the public API, so CQP and the command-line tools can share the same code base.  Required "modules" are listed below; a combined sketch of both follows this list.
    * an efficient memory pool without deallocation (for hash tables and copies of string keys)
    * improved generic string hash tables, using memory pools for keys and entries
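
As a rough illustration of the hash-based counting proposed for ''count'' and ''group'', the sketch below tallies token frequencies with a chained hash table.  It is plain C with illustrative names rather than the actual CL API, and it reads one token per line from standard input (e.g. piped from ''cwb-decode'').

<code c>
/* Sketch: hash-based frequency counting with a chained hash table.
 * Illustrative names -- not the actual CL API.  Reads one token per
 * line from stdin and prints "<freq> TAB <token>". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 65536

typedef struct entry {
    char *key;                 /* token (or tuple) being counted */
    long freq;                 /* its frequency so far */
    struct entry *next;        /* collision chain */
} entry;

static entry *table[BUCKETS];

static unsigned hash(const char *s) {   /* djb2 string hash */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h % BUCKETS;
}

/* increment the count for key, inserting it on first sight: O(1)
 * expected, regardless of how skewed the frequency distribution is */
static void count_token(const char *key) {
    unsigned h = hash(key);
    entry *e;
    for (e = table[h]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) { e->freq++; return; }
    e = malloc(sizeof *e);
    e->key = strdup(key);
    e->freq = 1;
    e->next = table[h];
    table[h] = e;
}

int main(void) {
    char buf[4096];
    while (fgets(buf, sizeof buf, stdin)) {
        buf[strcspn(buf, "\n")] = '\0';
        count_token(buf);
    }
    for (unsigned h = 0; h < BUCKETS; h++)
        for (entry *e = table[h]; e != NULL; e = e->next)
            printf("%ld\t%s\n", e->freq, e->key);
    return 0;
}
</code>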
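
The proposed "merge" mode could then combine sorted slice files roughly as follows.  The record format ''<freq>\t<key>'' and the command line (''merge <threshold> <file> ...'') are assumptions, byte-wise ''strcmp'' stands in for ''cl_strcmp'', and for a large number of slices the linear scan for the smallest key would be replaced by a heap.

<code c>
/* Sketch: "merge" mode for the split-and-merge strategy.  Combines
 * k frequency files, each sorted by key, sums counts of identical
 * keys and applies a frequency threshold in a single linear pass. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    FILE *fp;
    long freq;
    char key[4096];
    int live;                  /* 0 once the stream is exhausted */
} stream;

/* read the next "<freq>\t<key>" record from one input stream */
static void advance(stream *s) {
    char line[4200], *tab;
    if (fgets(line, sizeof line, s->fp) && (tab = strchr(line, '\t'))) {
        *tab = '\0';
        s->freq = atol(line);
        strncpy(s->key, tab + 1, sizeof s->key - 1);
        s->key[strcspn(s->key, "\n")] = '\0';
    }
    else
        s->live = 0;
}

int main(int argc, char **argv) {
    if (argc < 3) return 1;    /* usage: merge <threshold> <file> ... */
    long threshold = atol(argv[1]);
    int k = argc - 2;
    stream *in = calloc(k, sizeof *in);
    for (int i = 0; i < k; i++) {
        if (!(in[i].fp = fopen(argv[i + 2], "r"))) return 1;
        in[i].live = 1;
        advance(&in[i]);
    }
    for (;;) {
        int min = -1;          /* stream holding the smallest key */
        for (int i = 0; i < k; i++)
            if (in[i].live && (min < 0 || strcmp(in[i].key, in[min].key) < 0))
                min = i;
        if (min < 0) break;    /* all streams exhausted */
        char key[4096];
        long total = 0;
        strcpy(key, in[min].key);
        /* pool the counts of every stream that carries this key */
        for (int i = 0; i < k; i++)
            while (in[i].live && strcmp(in[i].key, key) == 0) {
                total += in[i].freq;
                advance(&in[i]);
            }
        if (total >= threshold)
            printf("%ld\t%s\n", total, key);
    }
    return 0;
}
</code>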
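
For collocation scans, the core is a windowing loop over the token stream.  The sketch below keeps a ring buffer of 2//N//+1 tokens; once the buffer is full, the token //N// positions back is the node, and every other buffered token is one of its collocates within the window.  The emitted pairs would feed the same counting and split-and-merge machinery sketched above; window size and input format are assumptions, and tokens within //N// positions of the corpus edges are skipped for simplicity.

<code c>
/* Sketch: surface collocation scan with a ring buffer of 2N+1
 * tokens.  Prints one "<node> TAB <collocate>" pair per line; the
 * pairs would normally go into a frequency hash table instead. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 5                    /* window: N tokens left and right */
#define W (2 * N + 1)          /* ring buffer size */

int main(void) {
    char *win[W] = { 0 };
    long pos = 0;              /* number of tokens read so far */
    char buf[4096];
    while (fgets(buf, sizeof buf, stdin)) {
        buf[strcspn(buf, "\n")] = '\0';
        free(win[pos % W]);    /* overwrite the oldest token */
        win[pos % W] = strdup(buf);
        pos++;
        if (pos >= W) {        /* buffer full: middle token is the node */
            long node = (pos - 1 - N) % W;
            for (long i = 0; i < W; i++)
                if (i != node)
                    printf("%s\t%s\n", win[node], win[i]);
        }
    }
    return 0;
}
</code>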
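
The two modules could fit together roughly as shown below: an append-only memory pool (no per-object deallocation, only release of the whole pool at once) and a string hash table that allocates its entries and key copies from that pool, avoiding per-string ''malloc'' overhead and fragmentation.  All names are illustrative, not the actual CL API.

<code c>
/* Sketch: append-only memory pool plus a string hash table that
 * draws its entries and key copies from the pool. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (256 * 1024)   /* assumes single allocations fit */

typedef struct block {
    struct block *next;
    size_t used;
    char data[BLOCK_SIZE];
} block;

typedef struct { block *head; } pool;

static void *pool_alloc(pool *p, size_t n) {
    n = (n + 7) & ~(size_t) 7;    /* keep allocations 8-byte aligned */
    if (!p->head || p->head->used + n > BLOCK_SIZE) {
        block *b = malloc(sizeof *b);   /* open a fresh block */
        b->next = p->head;
        b->used = 0;
        p->head = b;
    }
    void *mem = p->head->data + p->head->used;
    p->head->used += n;
    return mem;
}

static void pool_free(pool *p) {  /* release everything at once */
    while (p->head) { block *b = p->head; p->head = b->next; free(b); }
}

#define BUCKETS 65536

typedef struct hentry {
    const char *key;
    long value;
    struct hentry *next;
} hentry;

typedef struct {
    hentry *bucket[BUCKETS];
    pool mem;                     /* entries and key copies live here */
} strhash;

static unsigned shash(const char *s) {  /* djb2 string hash */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h % BUCKETS;
}

/* find the entry for key, inserting a zero-valued one if absent;
 * the key is copied into the pool, so the caller's buffer may be
 * reused immediately */
static hentry *strhash_lookup(strhash *t, const char *key) {
    unsigned h = shash(key);
    for (hentry *e = t->bucket[h]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return e;
    hentry *e = pool_alloc(&t->mem, sizeof *e);
    char *copy = pool_alloc(&t->mem, strlen(key) + 1);
    strcpy(copy, key);
    e->key = copy;
    e->value = 0;
    e->next = t->bucket[h];
    t->bucket[h] = e;
    return e;
}

int main(void) {
    static strhash t;             /* zero-initialised table and pool */
    strhash_lookup(&t, "corpus")->value++;
    strhash_lookup(&t, "corpus")->value++;
    printf("corpus: %ld\n", strhash_lookup(&t, "corpus")->value);
    pool_free(&t.mem);
    return 0;
}
</code>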