Differences
This shows you the differences between two versions of the page.
| developers:cwb32_plans [2011/03/06 14:50] – [kwic-output in CQP] stefan | developers:cwb32_plans [2011/03/06 23:04] (current) – [Performance improvements] stefan | ||
|---|---|---|---|
| Line 28: | Line 28: | ||
| * no fixed-character context in this output mode | * no fixed-character context in this output mode | ||
| * simplified implementation should make XML output faster than ASCII mode | * simplified implementation should make XML output faster than ASCII mode | ||
| - | * efficient and light-weight (non-validating) XML parsers available for most programming languages -> no other | + | * efficient and light-weight (non-validating) XML parsers available for most programming languages -> no other format needed for data exchange | 
| - | * | + | * GUIs (esp. Web interfaces) can use convenient XSLT stylesheets for flexible and customisable //kwic// display | 
| + |     * for statistical post-processing and frequency tables, '' | ||
| + |   * standardised XML format should also be offered by // | ||
| ==== Performance improvements ==== | ==== Performance improvements ==== | ||
| + | While there is much room for improving the speed of (simple) CQP queries, this will require fundamental changes to the query evaluation mechanism and has to wait for a later release.  | ||
| + | |||
| + | However, fast query execution is not the only requirement for a modern corpus search tool.  An increasingly important application is the extraction of frequency data for n-grams, collocations, | ||
| + | |||
| + | The underlying algorithms still do not scale up to corpora of a billion words or more, and are pushed to their limits even when extracting complex n-gram tables from the 100-million-word BNC.  Some improvements could be made with moderate effort and included in the official 3.2 release.  | ||
| + | |||
| + |   * The **frequency counting algorithms used by '' | ||
| + |   * Although '' | ||
| + |     * '' | ||
| + |     * required changes: (i) option to save frequency table in well-defined sort order (using '' | ||
| + |     * ideally, user can specify memory limit; when limit is reached during a corpus scan, '' | ||
| + |   * Neither '' | ||
| + |     * internal frequency tables as well as the split-and-merge procedure will be very similar to '' | ||
| + |     * further options for filtering nodes and collocates can be added at a later time | ||
| + |   * All improvements suggested here are based on specialised **high-performance hash tables**.  | ||
| + |     * an efficient memory pool without deallocation (for hash tables and copies of string keys) | ||
| + |     * improved generic string hash tables, using memory pools for keys and entries | ||
| + |     * optimised hash tables with lists of integers as keys (similar to the tables currently used by '' | ||
| ==== Other open questions ==== | ==== Other open questions ==== | ||
| + |   * How important is it to offer **binary packages** at least for the most common platforms?  | ||
| + |     * For Ubuntu, we could try to offer '' | ||
| + |     * For Mac OS X, the usual strategy is to include " | ||
| ===== Feature requests for future releases ===== | ===== Feature requests for future releases ===== | ||
| ===== Long-term goals ===== | ===== Long-term goals ===== | ||
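
The "Performance improvements" section mentions saving frequency tables in a well-defined sort order, a user-specified memory limit, and a split-and-merge procedure. The details are cut off in this diff view, so the following is only a minimal sketch of the general technique, not the planned CWB implementation. It assumes that the limit is expressed as a maximum number of table entries (a stand-in for a real memory limit), that keys are strings without newlines, and that partial tables are written as tab-separated text. The names `count_with_limit`, `_write_run` and `_read_run` are hypothetical.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby


def _write_run(counts, tmpdir):
    """Write one partial frequency table to disk, sorted by key."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for key in sorted(counts):
            fh.write(f"{key}\t{counts[key]}\n")
    return path


def _read_run(path):
    """Yield (key, frequency) pairs from a sorted partial table."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            key, freq = line.rstrip("\n").rsplit("\t", 1)
            yield key, int(freq)


def count_with_limit(items, max_entries):
    """Count items, spilling the in-memory table to disk whenever it
    reaches max_entries distinct keys, then merge all partial tables.

    Returns a list of (key, total frequency) pairs sorted by key.
    """
    counts = Counter()
    with tempfile.TemporaryDirectory() as tmpdir:
        runs = []
        for item in items:
            counts[item] += 1
            if len(counts) >= max_entries:
                # Split: flush the full table as a sorted run and start over.
                runs.append(_write_run(counts, tmpdir))
                counts.clear()
        if counts:
            runs.append(_write_run(counts, tmpdir))
        # Merge: every run is sorted by key, so a k-way merge brings
        # equal keys together and their counts can simply be summed.
        merged = heapq.merge(*(_read_run(path) for path in runs))
        return [
            (key, sum(freq for _, freq in group))
            for key, group in groupby(merged, key=lambda pair: pair[0])
        ]
```

Because every partial table is written in the same sort order, the merge step only needs to hold one entry per run in memory at a time. This is why a well-defined sort order matters for this kind of procedure.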