| author | David Bremner <david@tethera.net> | 2017-03-22 08:23:00 -0300 |
|---|---|---|
| committer | David Bremner <david@tethera.net> | 2017-04-20 06:59:40 -0300 |
| commit | 77c9ec1fddcbe145facfc3d65eee55b11ad61fb9 (patch) | |
| tree | bd8adc589322454463db36b966a84501858fa4d2 /test/corpora/README | |
| parent | e56511817284afc14352f47a13fcf85b2fabd628 (diff) | |
test: add known broken test for indexing html
'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
The second test is a sanity check for any "improved" indexing of HTML.
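The term explosion described in this message can be illustrated with a rough sketch. It assumes a naive tokenizer that splits on non-alphanumeric characters, which is only a stand-in and not Xapian's actual term generator. The names `naive_terms` and `html_with_embedded_image` are hypothetical, and the payload is random bytes rather than a real image.

```python
import base64
import random
import re


def naive_terms(text):
    """Return the set of lowercase alphanumeric tokens in text.

    This is a simplified stand-in for an indexer's term generator,
    not the behaviour of Xapian itself.
    """
    return {tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)}


def html_with_embedded_image(n_bytes, seed=0):
    """Build an HTML fragment with a base64 data URI of n_bytes random bytes."""
    rng = random.Random(seed)
    payload = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    # encodebytes wraps the output at 76 characters per line.
    encoded = base64.encodebytes(payload).decode("ascii")
    return (
        "<p>See the attached chart.</p>\n"
        '<img src="data:image/png;base64,' + encoded + '">'
    )


prose_only = "<p>See the attached chart.</p>"
with_image = html_with_embedded_image(30_000)

# The prose contributes a handful of terms; the encoded image contributes
# a large number of tokens that nobody would ever search for.
print("prose only:", len(naive_terms(prose_only)))
print("with image:", len(naive_terms(with_image)))
```

Under these assumptions, every `+`, `/`, and line break in the base64 data ends a token, so a single embedded image yields far more unique terms than the surrounding prose.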
Diffstat (limited to 'test/corpora/README')
| -rw-r--r-- | test/corpora/README | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
