libibex/TODO



Stability
---------
* ibex_open should never crash, and should never return NULL without
errno being set. It should check for errors when reading.
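
A minimal sketch of the "never NULL without errno" rule, in plain C.
Everything named sketch_* is hypothetical, and the 8-byte header is
made up for illustration; none of it is the real ibex file format or
the real ibex_open. It assumes a POSIX-style errno.h (EINVAL, EIO,
ENOMEM).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an index handle; not the libibex struct. */
typedef struct {
    FILE *fp;
} sketch_index;

/* Made-up 8-byte header, for illustration only. */
#define SKETCH_MAGIC "sketch1\n"

/* Returns a handle, or NULL with errno set on every failure path. */
sketch_index *sketch_open(const char *path)
{
    char magic[sizeof SKETCH_MAGIC - 1];
    sketch_index *ix;
    FILE *fp;
    size_t n;
    int err;

    if (path == NULL) {
        errno = EINVAL;         /* report bad arguments instead of crashing */
        return NULL;
    }

    errno = 0;
    fp = fopen(path, "rb");
    if (fp == NULL) {
        if (errno == 0)
            errno = EIO;        /* fopen is not required to set errno */
        return NULL;
    }

    errno = 0;
    n = fread(magic, 1, sizeof magic, fp);
    if (n != sizeof magic || memcmp(magic, SKETCH_MAGIC, sizeof magic) != 0) {
        /* Keep the I/O error if there was one, else call it a bad file. */
        err = (ferror(fp) && errno != 0) ? errno : EINVAL;
        fclose(fp);
        errno = err;            /* fclose may have changed errno */
        return NULL;
    }

    ix = malloc(sizeof *ix);
    if (ix == NULL) {
        fclose(fp);
        errno = ENOMEM;
        return NULL;
    }
    ix->fp = fp;
    return ix;
}

void sketch_close(sketch_index *ix)
{
    if (ix == NULL)
        return;
    assert(ix->fp != NULL);     /* invariant established by sketch_open */
    fclose(ix->fp);
    free(ix);
}
```

The point of saving err before fclose is that cleanup calls can
overwrite errno, which is one easy way to end up returning NULL with
a misleading or zero errno.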


Performance
-----------
* Profiling, keep thinking about data structures, etc.

* Check memory usage

* See if writing the "inverse image" of long ref streams helps
compression without hurting performance now (i.e., if a word appears in
more than half of the files, write out the list of files it _doesn't_
appear in). I tried this before and it wasn't working well, but the
file format and data structures have changed a lot since then.

* We could save a noticeable chunk of time if normalize_word computed
the hash of the word and then we could pass that into
g_hash_table_insert somehow.

* Make a copy of the buffer to be indexed (or provide an interface for
the caller to say ibex can munge the provided data) and then use that
rather than constantly copying things?
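
To make the "inverse image" item above concrete, here is a rough
sketch of the decision it describes. The function name, the use of
plain unsigned file numbers, and the sorted-array representation are
all assumptions for illustration; the real ref streams are stored
differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose what to write for one word.
 *
 * refs    sorted, duplicate-free file numbers in [0, nfiles)
 * out     caller-provided array with room for nfiles entries
 * out_len receives the number of entries written to out
 *
 * Returns 1 if out holds the inverse image (the files the word does
 * NOT appear in), or 0 if out is a plain copy of refs.
 */
int sketch_encode_refs(const unsigned *refs, size_t nrefs, size_t nfiles,
                       unsigned *out, size_t *out_len)
{
    size_t i, j = 0, n = 0;
    unsigned f;

    assert(out != NULL && out_len != NULL);
    assert(refs != NULL || nrefs == 0);
    assert(nrefs <= nfiles);

    /* Half or fewer of the files: the direct list is the short one. */
    if (nrefs <= nfiles / 2) {
        for (i = 0; i < nrefs; i++)
            out[i] = refs[i];
        *out_len = nrefs;
        return 0;
    }

    /* More than half: walk all file numbers and keep the missing ones. */
    for (f = 0; f < nfiles; f++) {
        if (j < nrefs && refs[j] == f)
            j++;                /* word appears in file f, skip it */
        else
            out[n++] = f;       /* word does not appear in file f */
    }
    *out_len = n;
    return 1;
}
```

The return value would have to be stored as a flag next to the list so
that the reader knows to take the complement again. Whether the extra
pass costs more than the smaller list saves is exactly what the item
says needs measuring.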


Functionality
-------------
* ibex file locking

* specify file mode in ibex_open

* ibex_find* need to normalize the search words... should this be done
by the caller or by ibex_find?

* There needs to be some way to do a secondary search after getting
results back from ibex_find* (e.g., for "foo near bar"). This either
has to be done by ibex, or requires us to export the normalize
interface.

* Does there need to be an ibex_find_any, or is that easy enough for the
caller to do?

* utf8_trans needs to cover at least two more code pages. This is
tricky because it's not clear whether some of the letters there should
be translated to ASCII or left as UTF8. This requires some
investigation.

* ibex_index_* need to ignore HTML tags.
  NAME = [A-Za-z][A-Za-z0-9.-]*
  </?{NAME}(\s+{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*'))?)*\s*>
  <!(--([^-]|-[^-])*--\s*)*>

  ugh. ok, simplifying, we get:
  <[^!]([^"'>]*("[^"]*"|'[^']*'))*[^"'>]*> or
  <!(--([^-]|-[^-])*--\s*)*>

  which is still not simple. sigh.

* ibex_index_* need to recognize and ignore "non-text". Particularly
BinHex and uuencoding.
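
For the HTML tag item above, a hand-written scanner may be easier to
get right than the regexes. The following is only a sketch under
simplifying assumptions: the name sketch_strip_tags is hypothetical,
a comment is taken to run from "<!--" to the next "-->" (looser than
the SGML pattern above), other "<!...>" declarations are treated like
ordinary tags, and each tag is replaced by one space so that the words
on either side do not run together.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Copy src to dst with tags and comments replaced by a single space.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the string written to dst.
 */
size_t sketch_strip_tags(const char *src, char *dst)
{
    const char *p = src;
    char *o = dst;

    assert(src != NULL && dst != NULL);

    while (*p != '\0') {
        /* A '<' only starts a tag if a letter, '/' or '!' follows. */
        if (*p != '<' ||
            !(p[1] == '/' || p[1] == '!' || isalpha((unsigned char)p[1]))) {
            *o++ = *p++;        /* ordinary text, including a bare '<' */
            continue;
        }

        if (strncmp(p, "<!--", 4) == 0) {
            /* Comment: skip to the closing "-->". */
            const char *end = strstr(p + 4, "-->");
            p = end != NULL ? end + 3 : p + strlen(p);
        } else {
            /* Tag: skip to '>', ignoring any '>' inside quotes. */
            char quote = '\0';
            for (p++; *p != '\0'; p++) {
                if (quote != '\0') {
                    if (*p == quote)
                        quote = '\0';
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                } else if (*p == '>') {
                    p++;
                    break;
                }
            }
        }
        *o++ = ' ';             /* keep neighbouring words apart */
    }
    *o = '\0';
    return (size_t)(o - dst);
}
```

An unterminated tag or comment swallows the rest of the buffer here,
which is probably acceptable for indexing but is a choice, not a
requirement. Entities such as &amp; are left untouched.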