--- developers:cwb32_plans [2011/03/06 14:25] – created stefan
+++ developers:cwb32_plans [2011/03/06 23:04] (current) – [Performance improvements] stefan
@@ Line 9: / Line 9: @@
 Please read the [[http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/doc/unicode_roadmap.html|internal roadmap document]] for details on user-visible changes, implementation, and the reasons behind the technical decisions we made.
+==== kwic formatting in CQP ====
+//kwic// output in CQP (with the ''cat'' command) suffers from a range of known problems, e.g. buffer overruns resulting in a program crash if large contexts are displayed.  In addition, many users have asked for more flexible formatting options (e.g. to change the delimiters between different p-attributes of tokens) and some of the output modes (notably ''latex'' and ''sgml'') and options (''set PrintOptions'') are peculiar and of little practical use.
+Since these problems have been exacerbated by the new Unicode support, it seems to be a good time to tackle the long-overdue complete overhaul of the ''kwic'' formatting code.  CWB users are also more likely to accept fundamental changes (which will certainly break backward compatibility of CQP) in return for being able to work with Unicode data.
+== Stefan's proposal ==
+  * rewrite //kwic// formatting code from scratch, throwing out most traditional print modes and options
+  * only two modes will be supported: **ASCII** and **XML**
+  * **ASCII** mode is intended for interactive use of CQP in a terminal session
+    * should be as flexible as can be provided without complicating the implementation
+    * perhaps a single string option in CQP could be used to set the various kinds of delimiter characters
+    * supports fixed-character context, reimplemented to handle UTF-8 and highlighting correctly in all circumstances
+    * **Q:** //should highlighting and colours be kept?// (due to limitations and bugs in termcap/ncurses and various terminal programs, it is //very// difficult to get this right on all platforms)
+  * **XML** mode is intended for any further processing, including Web GUIs, frequency analysis, etc.
+    * standardised XML format (//to be agreed upon by community//) **without user options** (except for attributes to be included and context-size)
+    * no fixed-character context in this output mode
+    * simplified implementation should make XML output faster than ASCII mode
+    * efficient and light-weight (non-validating) XML parsers are available for most programming languages, so no other format is needed for data exchange
+    * GUIs (esp. Web interfaces) can use convenient XSLT stylesheets for flexible and customisable //kwic// display
+    * for statistical post-processing and frequency tables, ''tabulate'' command is often a better alternative
+  * standardised XML format should also be offered by //cwb-decode// and similar tools; perhaps formatting code should be moved to ''CL'' library (and exported in its API)
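The requirement that fixed-character context must handle UTF-8 correctly can be illustrated with a minimal sketch. It is written in Python for brevity, but works on raw bytes so that the loops map directly onto C code. The function names ''utf8_tail'' and ''utf8_head'' are hypothetical and not part of CWB; the sketch assumes valid UTF-8 input and counts code points, not display columns or combining sequences.

```python
def utf8_tail(data: bytes, n_chars: int) -> bytes:
    """Return the longest suffix of `data` holding at most n_chars code points.

    Hypothetical helper for a left context of fixed character width.
    Assumes `data` is valid UTF-8.
    """
    i = len(data)
    count = 0
    while i > 0 and count < n_chars:
        i -= 1
        # Step back over continuation bytes (bit pattern 10xxxxxx) so the
        # cut never lands in the middle of a multi-byte sequence.
        while i > 0 and (data[i] & 0xC0) == 0x80:
            i -= 1
        count += 1
    return data[i:]


def utf8_head(data: bytes, n_chars: int) -> bytes:
    """Return the longest prefix of `data` holding at most n_chars code points.

    Hypothetical helper for a right context of fixed character width.
    Assumes `data` is valid UTF-8.
    """
    i = 0
    count = 0
    while i < len(data) and count < n_chars:
        i += 1
        # Skip the continuation bytes that belong to the current character.
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        count += 1
    return data[:i]


if __name__ == "__main__":
    left = "Straße über".encode("utf-8")
    # A naive byte slice such as left[-4:] would start inside the two-byte
    # sequence for "ü"; the helper moves the cut to a character boundary.
    print(utf8_tail(left, 4).decode("utf-8"))
```

Cutting at byte offsets is only safe for single-byte encodings, which is one plausible reason why the old fixed-character code misbehaves on Unicode corpora.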
+==== Performance improvements ====
+While there is much room for improving the speed of (simple) CQP queries, this will require fundamental changes to the query evaluation mechanism and has to wait for a later release.  Several sophisticated optimisation strategies are implemented by the new [[http://maximos.aksis.uib.no/Aksis-wiki/Corpuscle|Corpuscle query engine]], which is currently under development at the University of Bergen.
+However, fast query execution is not the only requirement on a modern corpus search tool.  An increasingly important application is the extraction of frequency data for n-grams, collocations, meta-information, etc. from very large corpora.  The CWB already supports this task with CQP commands ''count'', ''group'' and ''tabulate'', as well as the memory-efficient command-line tool ''cwb-scan-corpus''.
+However, the underlying algorithms still do not scale up to corpora of a billion words or more, and are pushed to their limits even when extracting complex n-gram tables from the 100-million-word BNC.  Some improvements could be made with moderate effort and included in the official 3.2 release.  These are in particular:
+  * The **frequency counting algorithms used by ''count'' and ''group'' in CQP** are far from optimal (in different ways, so depending on the task at hand, one or the other might work reasonably well).  Both should be re-implemented with hash tables in order to ensure high performance in all situations.
+  * Although ''cwb-scan-corpus'' is designed to be memory-efficient, it does not scale up to very large corpora because it collects frequency data in memory (for instance, n-gram scans of the BNC require a computer with more than 8 GiB RAM).  It would be desirable to implement a **split-and-merge strategy for ''cwb-scan-corpus''** in order to process arbitrary amounts of data with limited RAM:
+    * ''cwb-scan-corpus'' is run on suitably small slices of a large corpus or multiple individual corpora; for each slice/corpus, frequency data are stored in a well-defined sort order; finally, the individual data files are merged in linear time (because they are already sorted)
+    * required changes: (i) option to save frequency table in well-defined sort order (using ''cl_strcmp''); (ii) "merge" mode to combine multiple component files in linear time and apply frequency filter (multi-way merge, caching as much data from the input streams as possible in RAM in order to reduce disk thrashing)
+    * ideally, the user can specify a memory limit; when the limit is reached during a corpus scan, ''cwb-scan-corpus'' saves the partial frequency data to disk, then restarts the scan as a new "slice"; at the end, it either notifies the user or automatically runs the multi-way merge (using the allowed amount of RAM to cache input streams)
+  * Neither ''cwb-scan-corpus'' nor the CQP commands are suitable for the extraction of surface collocations (i.e. cooccurrence within a window of //n// words).  It would be nice to implement a modified version of ''cwb-scan-corpus'' that carries out such **collocation scans**.
+    * internal frequency tables as well as the split-and-merge procedure will be very similar to ''cwb-scan-corpus'', so implementation of a basic collocation scan should be reasonably easy
+    * further options for filtering nodes and collocates can be added at a later time
+  * All improvements suggested here are based on specialised **high-performance hash tables**.  These should be implemented in the CL library and exported in the public API, so CQP and the command-line tools can share the same code base.  Required "modules" are listed below.
+    * an efficient memory pool without deallocation (for hash tables and copies of string keys)
+    * improved generic string hash tables, using memory pools for keys and entries
+    * optimised hash tables with lists of integers as keys (similar to the tables currently used by ''cwb-scan-corpus'')
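The split-and-merge strategy proposed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the planned implementation: the functions ''scan_with_limit'' and ''merge_tables'' are hypothetical names, in-memory lists stand in for the partial frequency files on disk, the memory limit is approximated by a maximum number of distinct keys, and Python's tuple ordering stands in for the ''cl_strcmp'' sort order.

```python
import heapq
from collections import Counter
from itertools import groupby


def scan_with_limit(tokens, n, max_entries):
    """Count n-grams, flushing a sorted partial table whenever the
    hash table reaches max_entries distinct keys (the "memory limit").

    Each yielded table plays the role of one slice file saved to disk.
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
        if len(counts) >= max_entries:
            yield sorted(counts.items())  # well-defined sort order
            counts = Counter()            # continue the scan as a new slice
    if counts:
        yield sorted(counts.items())


def merge_tables(tables, min_freq=1):
    """Multi-way merge of sorted (key, freq) tables.

    Frequencies of equal keys are summed, and the frequency filter is
    applied only to the final totals. Because every input is already
    sorted, the merge reads each table once, front to back.
    """
    merged = heapq.merge(*tables, key=lambda item: item[0])
    for key, group in groupby(merged, key=lambda item: item[0]):
        total = sum(freq for _, freq in group)
        if total >= min_freq:
            yield key, total


if __name__ == "__main__":
    corpus = "a b a b c a b".split()
    slices = list(scan_with_limit(corpus, n=2, max_entries=2))
    for ngram, freq in merge_tables(slices, min_freq=2):
        print(" ".join(ngram), freq)
```

Note that the frequency filter can only be applied during the merge: an n-gram that is rare in every individual slice may still pass the threshold once the partial counts are added up.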
+==== Other open questions ====
+  * How important is it to offer **binary packages** at least for the most common platforms?  Due to the use of [[http://library.gnome.org/devel/glib/stable/|GLib]] for Unicode support, it has become difficult to produce stand-alone binaries of CWB version 3.2, which has to be compiled from source code at the moment (after installing GLib and other external dependencies).
+    * For Ubuntu, we could try to offer ''.deb'' packages that list the necessary dependencies, and let the package manager take care of the rest.
+    * For Mac OS X, the usual strategy is to include "local" installations of the required shared libraries in the binary distribution, and link the main programs using relative paths (there are special tools in Mac OS X for this purpose).
 ===== Feature requests for future releases =====
 ===== Long-term goals =====