Discussion: Roadmap and goals for CWB 3.2 and beyond

The main new feature of CWB 3.2 is full support for Unicode (UTF-8) as well as most ISO-8859-* character sets. As a side effect, regular expressions have become considerably more powerful, now supporting the full PCRE syntax.

Please read the internal roadmap document for details on user-visible changes, implementation, and the reasons behind the technical decisions we made.

The kwic output in CQP (with the cat command) suffers from a range of known problems, e.g. buffer overruns that crash the program when large contexts are displayed. In addition, many users have asked for more flexible formatting options (e.g. to change the delimiters between different p-attributes of tokens). Some of the output modes (notably latex and sgml) and options (set PrintOptions) are also peculiar and of little practical use.

Since these problems have been exacerbated by the new Unicode support, it seems to be a good time to tackle the long-overdue complete overhaul of the kwic formatting code. CWB users are also more likely to accept fundamental changes (which will certainly break backward compatibility of CQP) in return for being able to work with Unicode data.

Stefan's proposal

- rewrite kwic formatting code from scratch, throwing out most traditional print modes and options
- only two modes will be supported: ASCII and XML
- ASCII mode is intended for interactive use of CQP in a terminal session
  - should be as flexible as can be provided without complicating the implementation
  - perhaps a single string option in CQP could be used to set the various kinds of delimiter characters
  - supports fixed-character context, reimplemented to handle UTF-8 and highlighting correctly in all circumstances
  - Q: should highlighting and colours be kept? (due to limitations and bugs in termcap/ncurses and various terminal programs, it is very difficult to get this right on all platforms)
- XML mode is intended for any further processing, including Web GUIs, frequency analysis, etc.
  - standardised XML format (to be agreed upon by community) without user options (except for attributes to be included and context-size)
  - no fixed-character context in this output mode
  - simplified implementation should make XML output faster than ASCII mode
  - efficient and light-weight (non-validating) XML parsers available for most programming languages → no other format needed for data exchange
  - GUIs (esp. Web interfaces) can use convenient XSLT stylesheets for flexible and customisable kwic display
  - for statistical post-processing and frequency tables, tabulate command is often a better alternative
- standardised XML format should also be offered by cwb-decode and similar tools; perhaps formatting code should be moved to CL library (and exported in its API)
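To illustrate what handling fixed-character context correctly in UTF-8 involves, here is a minimal sketch in Python. It is not CWB code: the function names left_context and right_context and the byte-offset interface are assumptions made for this example. The point is that context width has to be counted in characters rather than bytes, so that a multi-byte character is never cut in half. The sketch counts Unicode code points only; combining characters and double-width characters in a terminal would need additional handling.

```python
def left_context(data: bytes, match_start: int, width: int) -> str:
    """Return up to `width` characters that precede byte offset `match_start`.

    Hypothetical helper: assumes `data` is valid UTF-8 and that
    `match_start` lies on a character boundary.
    """
    pos = match_start
    chars = 0
    while pos > 0 and chars < width:
        pos -= 1
        # UTF-8 continuation bytes look like 10xxxxxx; skip back over them
        # until we reach the first byte of the character.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        chars += 1
    return data[pos:match_start].decode("utf-8")


def right_context(data: bytes, match_end: int, width: int) -> str:
    """Return up to `width` characters that follow byte offset `match_end`.

    Hypothetical helper: same assumptions as left_context.
    """
    pos = match_end
    chars = 0
    n = len(data)
    while pos < n and chars < width:
        pos += 1
        # Step forward past the continuation bytes of the current character.
        while pos < n and (data[pos] & 0xC0) == 0x80:
            pos += 1
        chars += 1
    return data[match_end:pos].decode("utf-8")


# Usage: 8 characters of left context and 4 of right context around "Bäume".
data = "Die Größe der Bäume überrascht".encode("utf-8")
start = data.index("Bäume".encode("utf-8"))
end = start + len("Bäume".encode("utf-8"))
print(left_context(data, start, 8))   # öße der 
print(right_context(data, end, 4))    #  übe
```

Slicing by a fixed number of bytes instead (e.g. data[start - 7:start] in this example) would start in the middle of the two-byte character "ß" and produce invalid UTF-8, which is the kind of problem a reimplementation has to rule out.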

While there is much room for improving the speed of (simple) CQP queries, this will require fundamental changes to the query evaluation mechanism and has to wait for a later release. Several sophisticated optimisation strategies are implemented by the new Corpuscle query engine, which is currently under development at the University of Bergen.

However, fast query execution is not the only requirement on a modern corpus search tool. An increasingly important application is the extraction of frequency data for n-grams, collocations, meta-information, etc. from very large corpora. The CWB already supports this task with CQP commands count, group and tabulate, as well as the memory-efficient command-line tool cwb-scan-corpus.

The underlying algorithms still do not scale up to corpora of a billion words or more, and are even pushed to their limits when extracting complex n-gram tables from the 100-million-word BNC. Some improvements could be made with moderate effort and included in the official 3.2 release. These are in particular: