notmuch/lib/index.cc, branch 0.23.5

thread-based email index, search, and tagging
https://git.notmuchmail.org/git/notmuch/atom?h=0.23.5
2016-06-05T11:32:17Z

Use https instead of http where possible
2016-06-05T11:32:17Z Daniel Kahn Gillmor dkg@fifthhorseman.net 2016-06-02T16:26:14Z urn:sha1:6a833a6e83865f6999707cc30768d07e1351c2cb
Many of the external links found in the notmuch source can be resolved using https instead of http. This changeset addresses as many as I could find, without touching the e-mail corpus or expected outputs found in tests.

lib: whitespace cleanup
2016-06-05T11:23:28Z Tomi Ollila tomi.ollila@iki.fi 2016-05-28T17:45:31Z urn:sha1:cf09631a45d276826255d197c1d5c913a29c79f4
Cleaned the following whitespace in lib/* files:
lib/index.cc: 1 line: trailing whitespace
lib/database.cc: 5 lines: 8 spaces at the beginning of line
lib/notmuch-private.h: 4 lines: 8 spaces at the beginning of line
lib/message.cc: 1 line: trailing whitespace
lib/sha1.c: 1 line: empty lines at the end of file
lib/query.cc: 2 lines: 8 spaces at the beginning of line
lib/gen-version-script.sh: 1 line: trailing whitespace

lib: content disposition values are not case-sensitive
2015-11-19T11:47:29Z Jani Nikula jani@nikula.org 2015-09-26T09:35:21Z urn:sha1:506b81679a883d2a96bcd17e7c826a3166bdf82e
Per RFC 2183, the values of the Content-Disposition header are not case-sensitive. While at it, use the gmime function for getting at the disposition string instead of referencing the field directly. This fixes "attachment" tagging and filename term generation for attachments while indexing.

lib: replace almost all fprintfs in library with _n_d_log
2015-03-28T23:34:15Z David Bremner david@tethera.net 2014-12-26T16:25:35Z urn:sha1:736ac26407914425a9c94e86616225292cf716dd
This is not supposed to change any functionality from an end user point of view. Note that it will eliminate some output to stderr.
The query debugging output is left as is; it doesn't really fit with the current primitive logging model. The remaining "bad" fprintf will need an internal API change.

Add indexing for the mimetype term
2015-01-24T15:47:59Z Todd todd@electricoding.com 2015-01-22T23:43:38Z urn:sha1:b04bc967f9837e9d451ef88c276c744aa55accaa
This adds the indexing support for the "mimetype:" term and removes the broken test flag. The indexing is probabilistic in Xapian terms, which gives a better experience to end users. Standard content-types of the form "foo/bar" are automatically interpreted as phrases in Xapian due to the embedded slash. Assuming separate messages with application/pdf and application/x-pdf are indexed:
- mimetype:application/x-pdf will find only the application/x-pdf message
- mimetype:application/pdf will find only the application/pdf message
- mimetype:pdf will find both messages

lib: Index name and address of from/to headers as a phrase
2014-06-18T20:55:14Z Austin Clements amdragon@MIT.EDU 2014-06-16T02:40:32Z urn:sha1:44327ca86d8e3563490801f57a2d1ca455d9588e
Previously, we indexed the name and address parts of from/to headers with two calls to _notmuch_message_gen_terms. In general, this indicates that these parts are separate phrases. However, because of an implementation quirk, the two calls to _notmuch_message_gen_terms generated adjacent term positions for the prefixed terms, which happens to be the right thing to do in this case, but the wrong thing to do for all other calls. Furthermore, _notmuch_message_gen_terms produced potentially overlapping term positions for the un-prefixed copies of the terms, which is simply wrong.
This change indexes both the name and address in a single call to _notmuch_message_gen_terms, indicating that they should be part of a single phrase.
This masks the problem with the un-prefixed terms (fixing the two known-broken tests) and puts us in a position to fix the unintentional phrases generated by other calls to _notmuch_message_gen_terms.

lib: replace the header parser with gmime
2014-04-05T15:53:04Z Jani Nikula jani@nikula.org 2014-03-30T21:21:49Z urn:sha1:473930bb6fb167078a9428ad85f53accf7d4559f
The notmuch library includes a full blown message header parser. Yet the same message headers are parsed by gmime during indexing. Switch to gmime parsing completely. These are the main changes:
* Gmime stops header parsing at the first invalid header, and presumes the message body starts from there. The current parser is quite liberal in accepting broken headers. The change means we will be much pickier about accepting invalid messages.
* The current parser converts tabs used in header folding to spaces. Gmime preserves the tabs. Due to a broken python library used in mailman, there are plenty of mailing lists that produce headers with tabs in header folding, and we'll see plenty of tabs. (This change has been mitigated in preparatory patches.)
* For pure header parsing, the current parser is likely faster than gmime, which parses the whole message rather than just the headers. Since we parse the message and its headers using gmime for indexing anyway, this avoids an extra header parsing round when adding new messages. In case of duplicate messages, we'll end up parsing the full message although just headers would be sufficient. All in all this should still speed up 'notmuch new'.
* Calls to notmuch_message_get_header() may be slightly slower than previously for headers that are not indexed in the database, due to parsing of the whole message. Within the notmuch code base, notmuch reply is the only such user.

lib: drop support for single-message mbox files
2014-04-05T15:52:42Z Jani Nikula jani@nikula.org 2014-03-30T21:21:48Z urn:sha1:6812136bf576d894591606d9e10096719054d1f9
We've supported mbox files containing a single message for historical reasons, but the support has been deprecated, with a warning message while indexing, since Notmuch 0.15. Finally drop the support, and consider all mbox files non-email.

lib/cli: pass GMIME_ENABLE_RFC2047_WORKAROUNDS to g_mime_init()
2013-09-14T17:13:43Z Jani Nikula jani@nikula.org 2013-09-11T17:36:43Z urn:sha1:71521f06b00a01c5b0eaea5f5f624fe57ed7f426
As explained by Jeffrey Stedfast, the author of GMime, quoted in [1]:
> Passing the GMIME_ENABLE_RFC2047_WORKAROUNDS flag to g_mime_init()
> *should* solve the decoding problem mentioned in the thread. This
> flag should be safe to pass into g_mime_init() without any bad side
> effects and my unit tests do test that code-path.
The thread being referred to is [2].
[1] id:87bo56viyo.fsf@nikula.org
[2] id:08cb1dcd-c5db-4e33-8b09-7730cb3d59a2@gmail.com

_notmuch_message_index_file: unref (free) address lists from gmime.
2012-12-24T23:02:22Z David Bremner bremner@debian.org 2012-12-11T03:33:40Z urn:sha1:47693539a64b884cbd9bffc9c832162848ad98f2
Apparently as of GMime 2.4, you don't need to call internet_address_list_destroy anymore, but you still need to call g_object_unref (from the GMime Changelog). On the medium performance corpus, valgrind shows "possibly lost" leakage in "notmuch new" dropping from 7M to 300k.