The main new feature of CWB 3.2 is full support for Unicode (UTF-8) as well as most ISO-8859-* character sets. As a side effect, regular expressions have become considerably more powerful, now supporting the full PCRE syntax.
Please read the internal roadmap document for details on user-visible changes, implementation, and the reasons behind the technical decisions we made.
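For illustration, here is a minimal C sketch (our own, not CWB source code) of the kind of PCRE call that such Unicode-aware matching relies on; it assumes the classic PCRE library (pcre.h) built with UTF-8 and Unicode property support.

```c
/* Minimal sketch (not CWB code): Unicode-aware matching with PCRE.
 * Requires libpcre built with UTF-8 and Unicode property support. */
#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void) {
    const char *error;
    int erroffset;
    int ovector[30];

    /* \p{L}+ matches a run of Unicode letters -- PCRE syntax that
     * goes beyond what earlier CWB releases supported. */
    pcre *re = pcre_compile("\\p{L}+", PCRE_UTF8,
                            &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "regex error at offset %d: %s\n", erroffset, error);
        return 1;
    }

    const char *subject = "K\xC3\xA4se";  /* "Käse" in UTF-8 */
    int rc = pcre_exec(re, NULL, subject, (int) strlen(subject),
                       0, 0, ovector, 30);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");
    pcre_free(re);
    return 0;
}
```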
Kwic output in CQP (with the cat command) suffers from a range of known problems, e.g. buffer overruns resulting in a program crash if large contexts are displayed. In addition, many users have asked for more flexible formatting options (e.g. to change the delimiters between different p-attributes of tokens), and some of the output modes (notably latex and sgml) and options (set PrintOptions) are peculiar and of little practical use.

Since these problems have been exacerbated by the new Unicode support, it seems to be a good time to tackle the long-overdue complete overhaul of the kwic formatting code. CWB users are also more likely to accept fundamental changes (which will certainly break backward compatibility of CQP) in return for being able to work with Unicode data.
Two points to keep in mind for the redesign:

- for machine-readable data export, the tabulate command is often a better alternative
- the new kwic formatting code should be implemented in the CL library (and exported in its API), as illustrated by the sketch below
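To make this concrete, here is one possible shape for such an exported formatting hook. Every identifier in this sketch is hypothetical; it merely illustrates how user-configurable delimiters between p-attributes might be passed into a CL-level kwic formatter, not an actual CL API.

```c
/* Hypothetical sketch only -- none of these names exist in the CL
 * library. It illustrates a formatter driven by user-configurable
 * delimiters instead of hard-wired output modes. */
#include <stdio.h>

typedef struct kwic_format {
    const char *attr_separator;   /* between p-attributes, e.g. "/"     */
    const char *token_separator;  /* between tokens, e.g. " "           */
    const char *match_open;       /* printed before the match, e.g. "<" */
    const char *match_close;      /* printed after the match, e.g. ">"  */
} kwic_format;

/* Render one token at corpus position cpos: look up each requested
 * p-attribute via the caller-supplied accessor and join the values
 * with attr_separator. */
void kwic_print_token(FILE *out, const kwic_format *fmt,
                      const char *(*attr_value)(int attr, int cpos),
                      int n_attrs, int cpos) {
    for (int a = 0; a < n_attrs; a++) {
        if (a > 0)
            fputs(fmt->attr_separator, out);
        fputs(attr_value(a, cpos), out);
    }
}
```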
While there is much room for improving the speed of (simple) CQP queries, this will require fundamental changes to the query evaluation mechanism and has to wait for a later release. Several sophisticated optimisation strategies are implemented by the new Corpuscle query engine, which is currently under development at the University of Bergen.

However, fast query execution is not the only requirement for a modern corpus search tool. An increasingly important application is the extraction of frequency data for n-grams, collocations, meta-information, etc. from very large corpora. The CWB already supports this task with the CQP commands count, group and tabulate, as well as the memory-efficient command-line tool cwb-scan-corpus.
The underlying algorithms still do not scale up to corpora of a billion words or more, and are even pushed to their limits when extracting complex n-gram tables from the 100-million-word BNC. Some improvements could be made with moderate effort and included in the official 3.2 release. These are in particular:
- The implementations of count and group in CQP are far from optimal (in different ways, so depending on the task at hand, one or the other might work reasonably well). Both should be re-implemented with hash tables in order to ensure high performance in all situations (see the counting sketch after this list).
- Although cwb-scan-corpus is designed to be memory-efficient, it does not scale up to very large corpora because it collects frequency data in memory (for instance, n-gram scans of the BNC require a computer with more than 8 GiB RAM). It would be desirable to implement a split-and-merge strategy for cwb-scan-corpus in order to process arbitrary amounts of data with limited RAM (see the merge sketch after this list):
  - cwb-scan-corpus is run on suitably small slices of a large corpus, or on multiple individual corpora; for each slice/corpus, frequency data are stored in a well-defined sort order; finally, the individual data files are merged in linear time (because they are already sorted)
  - implementation: (i) an option to save frequency data in a canonical sort order (based on cl_strcmp); (ii) a “merge” mode to combine multiple component files in linear time and apply a frequency filter (multi-way merge, caching as much data from the input streams as possible in RAM in order to reduce disk thrashing)
  - automatic splitting: when available RAM is exhausted, cwb-scan-corpus saves partial frequency data to disk, then re-starts the scan as a new “slice”; at the end, it either notifies the user or automatically runs the multi-way merge (using the allowed amount of RAM to cache input streams)
- Neither cwb-scan-corpus nor the CQP commands are suitable for the extraction of surface collocations (i.e. cooccurrence within a window of n words). It would be nice to implement a modified version of cwb-scan-corpus that carries out such collocation scans. Most of the required machinery is already available in cwb-scan-corpus, so implementation of a basic collocation scan should be reasonably easy (see the window-scan sketch after this list).
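The following sketch (ours, not CQP code) shows the hash-based counting proposed for count and group in the first item: one chained hash table lookup per hit, instead of sorting all hits first.

```c
/* Illustrative hash-based frequency counting, as proposed for the
 * re-implementation of count and group (not actual CQP code). */
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 65536

typedef struct entry {
    char *key;
    long freq;
    struct entry *next;
} entry;

static entry *buckets[N_BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 5381;               /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % N_BUCKETS;
}

/* Increment the frequency of key, inserting it on first sight.
 * A single pass over the query results gives O(1) expected cost
 * per hit, regardless of the task at hand. */
void count_key(const char *key) {
    unsigned h = hash(key);
    for (entry *e = buckets[h]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->freq++;
            return;
        }
    }
    entry *e = malloc(sizeof *e);
    e->key = strdup(key);
    e->freq = 1;
    e->next = buckets[h];
    buckets[h] = e;
}
```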
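For the split-and-merge strategy in the second item, the crucial step is the linear-time merge of per-slice files that are already sorted by key. This sketch (ours) merges two such files, assuming a simplified "key TAB frequency" line format; a real merge mode in cwb-scan-corpus would generalise it to many inputs (e.g. with a heap) and apply the frequency filter on output.

```c
/* Illustrative two-way merge of sorted per-slice frequency files
 * (assumed line format "key\tfreq", sorted by key).  Runs in linear
 * time because both inputs are already sorted. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { FILE *fp; char key[4096]; long freq; int eof; } stream;

static void advance(stream *s) {
    char line[4200];
    if (fgets(line, sizeof line, s->fp) == NULL) { s->eof = 1; return; }
    char *tab = strchr(line, '\t');
    if (tab == NULL) { s->eof = 1; return; }   /* malformed line */
    *tab = '\0';
    strcpy(s->key, line);
    s->freq = atol(tab + 1);
}

int main(int argc, char **argv) {
    if (argc != 3) return 1;
    stream a = { fopen(argv[1], "r") };
    stream b = { fopen(argv[2], "r") };
    if (a.fp == NULL || b.fp == NULL) return 1;
    advance(&a);
    advance(&b);
    while (!a.eof || !b.eof) {
        int cmp = a.eof ? 1 : b.eof ? -1 : strcmp(a.key, b.key);
        if (cmp == 0) {            /* key present in both slices: sum */
            printf("%s\t%ld\n", a.key, a.freq + b.freq);
            advance(&a); advance(&b);
        } else if (cmp < 0) {
            printf("%s\t%ld\n", a.key, a.freq);
            advance(&a);
        } else {
            printf("%s\t%ld\n", b.key, b.freq);
            advance(&b);
        }
    }
    return 0;
}
```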
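Finally, the collocation scan proposed in the last item needs little more than a window loop over token IDs; the sketch below (ours) shows the core idea, with pair counting delegated to a frequency table such as the one above (pair_count is a hypothetical helper). In a modified cwb-scan-corpus, the token IDs would come from a p-attribute via the CL library.

```c
/* Illustrative surface-collocation scan: for each corpus position,
 * count all (node, collocate) pairs within a window of +/- n tokens. */

/* Assumed helper (hypothetical name): adds 1 to the frequency of the
 * given pair, e.g. in a hash table like the counting sketch above. */
extern void pair_count(int node_id, int collocate_id);

void collocation_scan(const int *token_ids, long corpus_size, int n) {
    for (long i = 0; i < corpus_size; i++) {
        long lo = (i - n < 0) ? 0 : i - n;
        long hi = (i + n >= corpus_size) ? corpus_size - 1 : i + n;
        for (long j = lo; j <= hi; j++) {
            if (j != i)                    /* skip the node itself */
                pair_count(token_ids[i], token_ids[j]);
        }
    }
}
```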
For Debian-based Linux distributions, binary releases should be provided as .deb packages that list the necessary dependencies, and let the package manager take care of the rest.