Differences
This shows you the differences between two versions of the page.
developers:cwb32_plans [2011/03/06 14:50] – [kwic-output in CQP] stefan | developers:cwb32_plans [2011/03/06 23:04] (current) – [Performance improvements] stefan | ||
---|---|---|---|
Line 28: | Line 28: | ||
* no fixed-character context in this output mode | * no fixed-character context in this output mode | ||
* simplified implementation should make XML output faster than ASCII mode | * simplified implementation should make XML output faster than ASCII mode | ||
- | * efficient and light-weight (non-validating) XML parsers available for most programming languages -> no other | + | * efficient and light-weight (non-validating) XML parsers available for most programming languages -> no other format needed for data exchange |
- | * | + | * GUIs (esp. Web interfaces) can use convenient XSLT stylesheets for flexible and customisable //kwic// display |
+ | * for statistical post-processing and frequency tables, '' | ||
+ | * standardised XML format should also be offered by // | ||
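To illustrate the point about light-weight, non-validating XML parsers, here is a minimal sketch of how a client script might read kwic lines with Python's standard library. The XML layout is purely an assumption made for this example: the element names ''kwic'', ''line'', ''left'', ''match'', ''right'' and the attribute ''cpos'' are hypothetical and do not describe any format defined by CWB.

```python
import xml.etree.ElementTree as ET

# Hypothetical kwic document: every element and attribute name is invented
# for this sketch and is not a format defined by CWB.
SAMPLE = """<kwic>
  <line cpos="1042"><left>the quick brown</left><match>fox</match><right>jumps over</right></line>
  <line cpos="2077"><left>a sly old</left><match>fox</match><right>ran away</right></line>
</kwic>"""


def read_kwic(xml_text):
    """Return (cpos, left, match, right) tuples from the hypothetical format."""
    root = ET.fromstring(xml_text)  # ElementTree does not validate against a DTD
    return [
        (
            int(line.get("cpos")),
            line.findtext("left", default=""),
            line.findtext("match", default=""),
            line.findtext("right", default=""),
        )
        for line in root.iter("line")
    ]


# Simple right-aligned display of the left context next to the match.
for cpos, left, match, right in read_kwic(SAMPLE):
    print(f"{cpos:>8}  {left:>20} [{match}] {right}")
```

Because there is no fixed-character context in such an output mode, any alignment of the display is left to the client, as in the final loop above.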
==== Performance improvements ==== | ==== Performance improvements ==== | ||
+ | While there is much room for improving the speed of (simple) CQP queries, this will require fundamental changes to the query evaluation mechanism and has to wait for a later release. | ||
+ | |||
+ | However, fast query execution is not the only requirement for a modern corpus search tool. An increasingly important application is the extraction of frequency data for n-grams, collocations, | ||
+ | |||
+ | The underlying algorithms still do not scale up to corpora of a billion words or more, and are pushed to their limits even when extracting complex n-gram tables from the 100-million-word BNC. Some improvements could be made with moderate effort and included in the official 3.2 release. | ||
+ | |||
+ | * The **frequency counting algorithms used by '' | ||
+ | * Although '' | ||
+ | * '' | ||
+ | * required changes: (i) option to save frequency table in well-defined sort order (using '' | ||
+ | * ideally, user can specify memory limit; when limit is reached during a corpus scan, '' | ||
+ | * Neither '' | ||
+ | * internal frequency tables as well as the split-and-merge procedure will be very similar to '' | ||
+ | * further options for filtering nodes and collocates can be added at a later time | ||
+ | * All improvements suggested here are based on specialised **high-performance hash tables**. | ||
+ | * an efficient memory pool without deallocation (for hash tables and copies of string keys) | ||
+ | * improved generic string hash tables, using memory pools for keys and entries | ||
+ | * optimised hash tables with lists of integers as keys (similar to the tables currently used by '' | ||
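The memory limit and split-and-merge idea mentioned in the list above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the planned C implementation: the function names are hypothetical, the memory limit is approximated by a maximum number of distinct keys rather than bytes, and the on-disk format (space-separated integer IDs, a tab, then the count) is invented for the example. Keys are tuples of integer IDs, loosely mirroring the hash tables with lists of integers as keys.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby
from operator import itemgetter


def _write_run(counts, tmpdir):
    """Write one partial frequency table to disk, sorted by key."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for key in sorted(counts):
            # key is a tuple of integer IDs, e.g. (17, 4) for a bigram
            fh.write(" ".join(map(str, key)) + "\t" + str(counts[key]) + "\n")
    return path


def _read_run(path):
    """Yield (key, count) pairs from a sorted run file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            key_field, count_field = line.rstrip("\n").split("\t")
            yield tuple(int(x) for x in key_field.split()), int(count_field)


def count_ngrams(ids, n, max_entries):
    """Count n-grams over a list of integer IDs, holding at most
    max_entries distinct keys in memory at any time (hypothetical helper)."""
    counts = Counter()
    with tempfile.TemporaryDirectory() as tmpdir:
        runs = []
        for i in range(len(ids) - n + 1):
            key = tuple(ids[i:i + n])
            if key not in counts and len(counts) >= max_entries:
                # Limit reached: save the partial table in sort order, start afresh.
                runs.append(_write_run(counts, tmpdir))
                counts = Counter()
            counts[key] += 1
        if not runs:
            return dict(counts)  # everything fitted in memory
        runs.append(_write_run(counts, tmpdir))
        # Merge the sorted runs and add up the counts of identical keys.
        merged = heapq.merge(*(_read_run(p) for p in runs), key=itemgetter(0))
        return {
            key: sum(count for _, count in group)
            for key, group in groupby(merged, key=itemgetter(0))
        }
```

The merge step only works because every partial table is written in the same well-defined sort order, so identical keys from different runs arrive next to each other and can be summed in a single pass.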
==== Other open questions ==== | ==== Other open questions ==== | ||
+ | * How important is it to offer **binary packages** at least for the most common platforms? | ||
+ | * For Ubuntu, we could try to offer '' | ||
+ | * For Mac OS X, the usual strategy is to include " | ||
===== Feature requests for future releases ===== | ===== Feature requests for future releases ===== | ||
===== Long-term goals ===== | ===== Long-term goals ===== |