path: root/lib/index.cc
Age  Commit message  (Author)
2023-04-02  lib: index attachments with mime types matching index.as_text  (David Bremner)
Instead of skipping indexing all attachments, we check for a (user configured) mime type that is indexable as text.
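A minimal sketch of the check described above, assuming the user-configured index.as_text values are regular expressions matched against the part's content type. The helper names compile_patterns and indexable_as_text are hypothetical and are not taken from lib/index.cc.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: compile the configured pattern strings once.
// Treating the patterns as case-insensitive extended regexes is an assumption.
static std::vector<std::regex>
compile_patterns (const std::vector<std::string> &sources)
{
    std::vector<std::regex> out;
    for (const std::string &s : sources)
        out.emplace_back (s, std::regex::extended | std::regex::icase);
    return out;
}

// Hypothetical helper: true when the content type (e.g. "text/x-diff")
// matches at least one configured pattern, so the attachment body
// would be indexed as text instead of being skipped.
static bool
indexable_as_text (const std::vector<std::regex> &patterns,
                   const std::string &content_type)
{
    for (const std::regex &re : patterns) {
        if (std::regex_search (content_type, re))
            return true;
    }
    return false;
}
```

With the patterns "^text/" and "^application/json$", a text/x-diff attachment would be indexed while an image/png attachment would still be skipped.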
2021-05-13  lib: make glib initialization thread-safe  (David Bremner)
In principle this could be done without depending on C++11 features, but these features should be available since gcc 4.8.1, and this localized usage is easy to replace if it turns out to be problematic for portability.
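For illustration, one C++11 facility that gives run-exactly-once initialization is the function-local static, whose initialization the language guarantees to be thread-safe. This is only a sketch of the idea; the commit may use a different C++11 feature, and do_global_init is a hypothetical stand-in for the real library setup.

```cpp
#include <atomic>

// Counts how often the (stand-in) setup routine actually runs.
static std::atomic<int> init_calls { 0 };

// Hypothetical stand-in for the one-time library initialization.
static bool
do_global_init ()
{
    ++init_calls;
    return true;
}

// Safe to call from any thread, any number of times: since C++11 the
// initializer of a function-local static runs exactly once, and
// concurrent callers wait until it has finished.
static void
ensure_initialized ()
{
    static const bool done = do_global_init ();

    (void) done;
}
```

Because the guard is confined to one small function, it is easy to swap for another mechanism if portability problems appear, which matches the reasoning in the commit message.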
2021-03-13  lib: run uncrustify  (uncrustify)
This is the result of running $ uncrustify --replace --config ../devel/uncrustify.cfg *.c *.h *.cc in the lib directory
2020-05-22  smime: Index cleartext of envelopedData when requested  (Daniel Kahn Gillmor)
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2020-05-22  crypto: Make _notmuch_crypto_decrypt take a GMimeObject  (Daniel Kahn Gillmor)
As we prepare to handle S/MIME-encrypted PKCS#7 EnvelopedData (which is not multipart), we don't want to be limited to passing only GMimeMultipartEncrypted MIME parts to _notmuch_crypto_decrypt. There is no functional change here, just a matter of adjusting how we pass arguments internally. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2020-05-22  smime: Identify encrypted S/MIME parts during indexing  (Daniel Kahn Gillmor)
We don't handle them correctly yet, but we can at least mark them as being encrypted. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2020-05-22  lib: index PKCS7 SignedData parts  (Daniel Kahn Gillmor)
When we are indexing, we should treat SignedData parts the same way that we treat a multipart object, indexing the wrapped part as a distinct MIME object. Unfortunately, this means doing some sort of cryptographic verification whose results we throw away, because GMime doesn't offer us any way to unwrap without doing signature verification. I've opened https://github.com/jstedfast/gmime/issues/67 to request the capability from GMime but for now, we'll just accept the additional performance hit. As we do this indexing, we also apply the "signed" tag, by analogy with how we handle multipart/signed messages. These days, that kind of change should probably be done with a property instead, but that's a different set of changes. This one is just for consistency. Note that we are currently *only* handling signedData parts, which are basically clearsigned messages. PKCS#7 parts can also be envelopedData and authEnvelopedData (which are effectively encryption layers), and compressedData (which afaict isn't implemented anywhere, i've never encountered it). We're laying the groundwork for indexing these other S/MIME types here, but we're only dealing with signedData for now. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-09-15  index: repair "Mixed Up" messages before indexing.  (Daniel Kahn Gillmor)
When encountering a message that has been mangled in the "mixed up" way by an intermediate MTA, notmuch should instead repair it and index the repaired form. When it does this, it also associates the index.repaired=mixedup property with the message. If a problem is found with this repair process, or an improved repair process is proposed later, this should make it easy for people to reindex the relevant message. The property will also hopefully make it easier to diagnose this particular problem in the future. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-09-01  index: avoid indexing legacy-display parts  (Daniel Kahn Gillmor)
When we notice a legacy-display part during indexing, it makes more sense to avoid indexing it as part of the message body. Given that the protected subject will already be indexed, there is no need to index this part at all, so we skip over it. If this happens during indexing, we set a property on the message: index.repaired=skip-protected-headers-legacy-display Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-09-01  util/crypto: _n_m_crypto_potential_payload returns whether part is the payload  (Daniel Kahn Gillmor)
Our _notmuch_message_crypto_potential_payload implementation could only return a failure if bad arguments were passed to it. It is an internal function, so if that happens it's an entirely internal bug for notmuch. It will be more useful for this function to return whether or not the part is in fact a cryptographic payload, so we dispense with the status return. If some future change suggests adding a status return back, there are only a handful of call sites, and no pressure to retain a stable API, so it could be changed easily. But for now, go with the simpler function. We will use this return value in future patches, to make different decisions based on whether a part is the cryptographic payload or not. But for now, we just leave the places where it gets invoked marked with (void) to show that the result is ignored. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-06-14  lib: run uncrustify  (uncrustify)
This is the result of running $ uncrustify --replace --config ../devel/uncrustify.cfg *.c *.h *.cc in the lib directory
2019-05-29  indexing: record protected subject when indexing cleartext  (Daniel Kahn Gillmor)
When indexing the cleartext of an encrypted message, record any protected subject in the database, which should make it findable and visible in search. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-05-25  lib/database: index user headers.  (David Bremner)
This essentially involves calling _notmuch_message_gen_terms once for each user defined header.
2019-05-03  gmime-cleanup: use GMime 3.0 function names  (Daniel Kahn Gillmor)
Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-05-03  gmime-cleanup: drop all arguments unused in GMime 3  (Daniel Kahn Gillmor)
This means dropping GMimeCryptoContext and notmuch_config arguments. All the argument changes are to internal functions, so this is not an API or ABI break. We also get to drop the #define for g_mime_3_unused. signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-05-03  gmime-cleanup: drop g_mime_2_6_unref  (Daniel Kahn Gillmor)
signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-05-03  gmime-cleanup: always support session keys  (Daniel Kahn Gillmor)
Our minimum version of GMime 3.0 always supports good session key handling. signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2019-05-03  gmime-cleanup: drop unused gmime 2.6 content_type from _index_encrypted_mime_part  (Daniel Kahn Gillmor)
In _index_mime_part, we don't need to extract the content-type from the part until just before we use it, so we also defer it lazily. Signed-off-by: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
2018-10-21  index: explicitly follow GObject conventions  (Daniel Kahn Gillmor)
Use explicit labels for GTypeInfo member initializers, rather than relying on comments and ordering. This is both easier to read, and harder to screw up.
This also makes it clear that we're mis-casting GObject class initializers for gcc. Without this patch, g++ 8.2.0-7 produces this warning:
    CXX -g -O2 lib/index.o
    lib/index.cc: In function ‘GMimeFilter* notmuch_filter_discard_non_term_new(GMimeContentType*)’:
    lib/index.cc:252:23: warning: cast between incompatible function types from ‘void (*)(NotmuchFilterDiscardNonTermClass*)’ {aka ‘void (*)(_NotmuchFilterDiscardNonTermClass*)’} to ‘GClassInitFunc’ {aka ‘void (*)(void*, void*)’} [-Wcast-function-type]
        (GClassInitFunc) notmuch_filter_discard_non_term_class_init,
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The definition of GClassInitFunc in /usr/include/glib-2.0/gobject/gtype.h suggests that this function will always be called with the class_data member of the GTypeInfo. We set that value to NULL in both GObject definitions in notmuch. So we mark it as explicitly unused.
There is no functional change here, just code cleanup.
2018-05-26  lib: expose notmuch_message_get_database()  (Daniel Kahn Gillmor)
We've had _notmuch_message_database() internally for a while, and it's useful. It turns out to be useful on the other side of the library interface as well (i'll use it later in this series for "notmuch show"), so we expose it publicly now.
2018-05-14  drop use of register keyword  (David Bremner)
The performance benefits are dubious, and it's deprecated in C++11.
2017-12-08  crypto: actually stash session keys when decrypt=true  (Daniel Kahn Gillmor)
If you're going to store the cleartext index of an encrypted message, in most situations you might just as well store the session key. Doing this storage has efficiency and recoverability advantages. Combined with a schedule of regular OpenPGP subkey rotation and destruction, this can also offer security benefits, like "deletable e-mail", which is the store-and-forward analog to "forward secrecy". But wait, i hear you saying, i have a special need to store cleartext indexes but it's really bad for me to store session keys! Maybe (let's imagine) i get lots of e-mails with incriminating photos attached, and i want to be able to search for them by the text in the e-mail, but i don't want someone with access to the index to be actually able to see the photos themselves. Fret not, the next patch in this series will support your wacky uncommon use case.
2017-12-08  crypto: record whether an actual decryption attempt happened  (Daniel Kahn Gillmor)
In our consolidation of _notmuch_crypto_decrypt, the callers lost track a little bit of whether any actual decryption was attempted. Now that we have the more-subtle "auto" policy, it's possible that _notmuch_crypto_decrypt could be called without having any actual decryption take place. This change lets the callers be a little bit smarter about whether or not any decryption was actually attempted.
2017-12-08  crypto: new decryption policy "auto"  (Daniel Kahn Gillmor)
This new automatic decryption policy should make it possible to decrypt messages that we have stashed session keys for, without incurring a call to the user's asymmetric keys.
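As a rough model of the policy described above, the sketch below decides whether any decryption attempt should be made. The enum and function names are hypothetical (Never and Always stand for the false and true settings), and the real enum in notmuch may carry other values.

```cpp
// Hypothetical model of the decryption policy. "Auto" is the policy
// introduced by this commit; Never/Always stand for false/true.
enum class DecryptPolicy { Never, Always, Auto };

// Returns true when a decryption attempt should be made at all.
// Under Auto we only proceed when a session key is already stashed,
// so the user's asymmetric secret key is never asked for.
static bool
should_attempt_decrypt (DecryptPolicy policy, bool have_stashed_session_key)
{
    switch (policy) {
    case DecryptPolicy::Never:
        return false;
    case DecryptPolicy::Always:
        return true;
    case DecryptPolicy::Auto:
        return have_stashed_session_key;
    }
    return false;
}
```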
2017-12-08  lib: convert notmuch decryption policy to an enum  (Daniel Kahn Gillmor)
Future patches in this series will introduce new policies; this merely readies the way for them. We also convert --try-decrypt to a keyword argument instead of a boolean.
2017-12-08  indexopts: change _try_decrypt to _decrypt_policy  (Daniel Kahn Gillmor)
This terminology makes it clearer what's going on at the API layer, and paves the way for future changesets that offer more nuanced decryption policy.
2017-12-04  crypto: use stashed session-key properties for decryption, if available  (Daniel Kahn Gillmor)
When doing any decryption, if the notmuch database knows of any session keys associated with the message in question, try them before falling back to the default asymmetric crypto. This changeset does the primary work in _notmuch_crypto_decrypt, which grows some new parameters to handle it.
The primary advantage this patch offers is a significant speedup when rendering large encrypted threads ("notmuch show") if session keys happen to be cached. Additionally, it permits message composition without access to asymmetric secret keys ("notmuch reply"); and it permits recovering a cleartext index when reindexing after a "notmuch restore" for those messages that already have a session key stored.
Note that we may try multiple decryptions here (e.g. if there are multiple session keys in the database), but we will ignore and throw away all the GMime errors except for those that come from the last decryption attempt. Since we don't necessarily know at the time of the decryption that this *is* the last decryption attempt, we'll ask for the errors each time anyway.
This does nothing if no session keys are stashed in the database, which is fine. Actually stashing session keys in the database will come as a subsequent patch.
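The control flow described above can be sketched roughly as follows. This is not the signature of _notmuch_crypto_decrypt: the DecryptResult type and the two callbacks are hypothetical stand-ins for the GMime calls, used only to show that every stashed key is tried first and that only the outcome of the last attempt is kept.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of one decryption attempt.
struct DecryptResult {
    bool ok = false;
    std::string cleartext;
    std::string error;
};

// Hypothetical callbacks: one attempt with a given session key, and the
// default attempt that needs the user's secret key.
using TryKey = std::function<DecryptResult (const std::string &)>;
using TryDefault = std::function<DecryptResult ()>;

static DecryptResult
decrypt_with_stashed_keys (const std::vector<std::string> &session_keys,
                           const TryKey &try_key,
                           const TryDefault &try_default)
{
    DecryptResult last;

    for (const std::string &key : session_keys) {
        // Errors are collected on every attempt, because we cannot know
        // in advance which attempt will turn out to be the last one.
        last = try_key (key);
        if (last.ok)
            return last;
    }
    // No stashed key worked (or none exist): use the default path.
    // Errors from earlier attempts are discarded by the overwrite.
    last = try_default ();
    return last;
}
```

When a stashed key succeeds, the default path is never entered, which is where the speedup for "notmuch show" on large encrypted threads comes from.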
2017-12-04  crypto: add _notmuch_crypto_decrypt wrapper function  (Daniel Kahn Gillmor)
We will use this centralized function to consolidate the awkward behavior around different gmime versions. It's only invoked from two places: mime-node.c's node_decrypt_and_verify() and lib/index.cc's _index_encrypted_mime_part(). However, those two places have some markedly distinct logic, so the interface for this _notmuch_crypto_decrypt function is going to get a little bit clunky. It's worthwhile, though, for the sake of keeping these #if directives reasonably well-contained.
2017-10-21  crypto: index encrypted parts when indexopts try_decrypt is set.  (Daniel Kahn Gillmor)
If we see index options that ask us to decrypt when indexing a message, and we encounter an encrypted part, we'll try to descend into it. If we can decrypt, we add the property index.decryption=success. If we can't decrypt (or recognize the encrypted type of mail), we add the property index.decryption=failure.
Note that a single message may have both values of the "index.decryption" property: "success" and "failure". For example, consider a message that includes multiple layers of encryption: we might manage to decrypt the outer layer ("index.decryption=success"), but fail on the inner layer ("index.decryption=failure").
Because of the property name, this will be automatically cleared (and possibly re-set) during re-indexing. This means it will subsequently correspond to the actual semantics of the stored index.
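A toy model of these property semantics, assuming that message properties form a multi-valued key/value set and that re-indexing clears every property whose key starts with "index.". Both points are assumptions read from the message above; the Properties type and helper names are hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory model: one key may carry several values.
using Properties = std::multimap<std::string, std::string>;

// Record the outcome of one decryption attempt on one encrypted layer.
static void
record_decryption (Properties &props, bool success)
{
    props.emplace ("index.decryption", success ? "success" : "failure");
}

// Assumed re-indexing behaviour: drop every property in the "index."
// namespace so that it can be re-set from the new indexing run.
static void
clear_index_properties (Properties &props)
{
    for (auto it = props.begin (); it != props.end ();) {
        if (it->first.compare (0, 6, "index.") == 0)
            it = props.erase (it);
        else
            ++it;
    }
}
```

A message whose outer layer decrypts and whose inner layer does not ends up with two index.decryption entries, one per outcome.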
2017-10-09  lib: convert notmuch_bool_t to stdbool internally  (Jani Nikula)
C99 stdbool turned 18 this year. There really is no reason to use our own, except in the library interface for backward compatibility. Convert the lib internally to stdbool.
2017-09-17  lib: index the content-type of the parts of encrypted messages  (Daniel Kahn Gillmor)
This is a logical followup to "lib: index the content type of signature parts", which will make it easier to record the message structure of all messages.
2017-09-17  lib: index the content type of signature parts  (Jani Nikula)
It's useful (*) to be able to easily find messages with certain types of signatures. Having the mimetype: prefix searches fail for some content types is also genuinely surprising (*). Index the content type of signature parts. While at it, switch to the gmime convenience constants for content and signature part indexes. *) At least for developers of email software!
2017-09-17  lib: abstract content type indexing  (Jani Nikula)
Make the follow-up change of indexing signature content types easier. No functional changes.
2017-09-04  lib&cli: use g_object_new instead of g_object_newv  (David Bremner)
'g_object_newv' is deprecated, and prints annoying warnings. The warnings suggest using 'g_object_new_with_properties', but that's only available since glib 2.55 (i.e. a month ago as of this writing). Since we don't actually pass any properties, it seems we can just call 'g_object_new'.
2017-07-14  lib: paper over allocation difference  (David Bremner)
In gmime 3.0 this function is "transfer none", so no deallocation is needed (or permitted)
2017-07-14  lib/cli: replace use of g_mime_message_get_sender  (David Bremner)
This function changes semantics in gmime-3.0 so make a new function that provides the same functionality in both
2017-07-01  lib/index: add simple html filter  (David Bremner)
The filter just drops all (HTML) tags. As an enabling change, pass the content type to the filter constructor so we can decide which scanner to use.
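The idea of dropping tags can be illustrated with a two-state scanner. This is a simplified stand-in, not the table-driven GMime filter in lib/index.cc; the function name drop_tags is hypothetical, and comments, quoted attribute values and character entities are deliberately ignored.

```cpp
#include <string>

// Hypothetical sketch: copy everything outside <...> and discard the rest.
static std::string
drop_tags (const std::string &in)
{
    enum class State { Text, Tag } state = State::Text;
    std::string out;

    for (char c : in) {
        switch (state) {
        case State::Text:
            if (c == '<')
                state = State::Tag;     // a tag starts: stop copying
            else
                out.push_back (c);
            break;
        case State::Tag:
            if (c == '>')
                state = State::Text;    // the tag ends: resume copying
            break;
        }
    }
    return out;
}
```

Fed the input "<p>Hello <b>world</b></p>", the sketch yields "Hello world", so only the visible words would reach the indexer.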
2017-07-01  lib/index.cc: generalize filter state machine  (David Bremner)
To match things more complicated than fixed strings, we need states with multiple out arrows.
2017-07-01  lib/index: separate state table definition from scanner.  (David Bremner)
We want to reuse the scanner definition with a different table. This is mainly code movement, and making the state table part of the filter struct/class.
2017-07-01  lib/index: generalize name of indexing filter  (David Bremner)
In followup commits we will generalize the functionality of this filter to deal with other types of non-indexable content.
2016-06-05  Use https instead of http where possible  (Daniel Kahn Gillmor)
Many of the external links found in the notmuch source can be resolved using https instead of http. This changeset addresses as many as i could find, without touching the e-mail corpus or expected outputs found in tests.
2016-06-05  lib: whitespace cleanup  (Tomi Ollila)
Cleaned the following whitespace in lib/* files:
lib/index.cc: 1 line: trailing whitespace
lib/database.cc: 5 lines: 8 spaces at the beginning of line
lib/notmuch-private.h: 4 lines: 8 spaces at the beginning of line
lib/message.cc: 1 line: trailing whitespace
lib/sha1.c: 1 line: empty lines at the end of file
lib/query.cc: 2 lines: 8 spaces at the beginning of line
lib/gen-version-script.sh: 1 line: trailing whitespace
2015-11-19  lib: content disposition values are not case-sensitive  (Jani Nikula)
Per RFC 2183, the values of Content-Disposition are not case-sensitive. While at it, use the gmime function for getting at the disposition string instead of referencing the field directly. This fixes "attachment" tagging and filename term generation for attachments while indexing.
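A minimal sketch of a case-insensitive disposition check. The helper names are hypothetical, and the real code obtains the disposition string through GMime rather than taking a std::string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// ASCII case-insensitive equality, enough for tokens such as "attachment".
static bool
iequals_ascii (const std::string &a, const std::string &b)
{
    if (a.size () != b.size ())
        return false;
    for (std::size_t i = 0; i < a.size (); i++) {
        // Cast to unsigned char first: std::tolower is undefined for
        // negative values other than EOF.
        if (std::tolower ((unsigned char) a[i]) !=
            std::tolower ((unsigned char) b[i]))
            return false;
    }
    return true;
}

// Hypothetical predicate: does this disposition value mark an attachment?
static bool
is_attachment (const std::string &disposition)
{
    return iequals_ascii (disposition, "attachment");
}
```

With such a comparison, "Attachment" and "ATTACHMENT" are treated the same as "attachment" when deciding whether to apply the tag.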
2015-03-29  lib: replace almost all fprintfs in library with _n_d_log  (David Bremner)
This is not supposed to change any functionality from an end user point of view. Note that it will eliminate some output to stderr. The query debugging output is left as is; it doesn't really fit with the current primitive logging model. The remaining "bad" fprintf will need an internal API change.
2015-01-24  Add indexing for the mimetype term  (Todd)
This adds the indexing support for the "mimetype:" term and removes the broken test flag. The indexing is probabilistic in Xapian terms, which gives a better experience to end users. Standard content-types of the form "foo/bar" are automatically interpreted as phrases in Xapian due to the embedded slash. Assuming separate messages with application/pdf and application/x-pdf are indexed:
- mimetype:application/x-pdf will find only the application/x-pdf
- mimetype:application/pdf will find only the application/pdf
- mimetype:pdf will find both of the messages
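The three search results listed above follow from phrase matching over word tokens. The toy model below is not Xapian: it assumes a content type is split into lowercase words at every non-alphanumeric character and that a query matches when its words occur consecutively. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split into lowercase alphanumeric words: "application/x-pdf"
// becomes {"application", "x", "pdf"}.
static std::vector<std::string>
tokenize (const std::string &s)
{
    std::vector<std::string> words;
    std::string cur;

    for (unsigned char c : s) {
        if (std::isalnum (c)) {
            cur.push_back ((char) std::tolower (c));
        } else if (! cur.empty ()) {
            words.push_back (cur);
            cur.clear ();
        }
    }
    if (! cur.empty ())
        words.push_back (cur);
    return words;
}

// True when the query words appear as a consecutive run in the
// indexed words (a simplified notion of a phrase match).
static bool
phrase_match (const std::string &indexed, const std::string &query)
{
    const std::vector<std::string> doc = tokenize (indexed);
    const std::vector<std::string> q = tokenize (query);

    if (q.empty () || q.size () > doc.size ())
        return false;
    for (std::size_t start = 0; start + q.size () <= doc.size (); start++) {
        if (std::equal (q.begin (), q.end (), doc.begin () + start))
            return true;
    }
    return false;
}
```

In this model the query application/pdf does not match application/x-pdf, because the word "x" sits between "application" and "pdf", while the single word pdf matches both types.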
2014-06-18  lib: Index name and address of from/to headers as a phrase  (Austin Clements)
Previously, we indexed the name and address parts of from/to headers with two calls to _notmuch_message_gen_terms. In general, this indicates that these parts are separate phrases. However, because of an implementation quirk, the two calls to _notmuch_message_gen_terms generated adjacent term positions for the prefixed terms, which happens to be the right thing to do in this case, but the wrong thing to do for all other calls. Furthermore, _notmuch_message_gen_terms produced potentially overlapping term positions for the un-prefixed copies of the terms, which is simply wrong. This change indexes both the name and address in a single call to _notmuch_message_gen_terms, indicating that they should be part of a single phrase. This masks the problem with the un-prefixed terms (fixing the two known-broken tests) and puts us in a position to fix the unintentional phrases generated by other calls to _notmuch_message_gen_terms.
2014-04-05  lib: replace the header parser with gmime  (Jani Nikula)
The notmuch library includes a full blown message header parser. Yet the same message headers are parsed by gmime during indexing. Switch to gmime parsing completely. These are the main changes:
* Gmime stops header parsing at the first invalid header, and presumes the message body starts from there. The current parser is quite liberal in accepting broken headers. The change means we will be much pickier about accepting invalid messages.
* The current parser converts tabs used in header folding to spaces. Gmime preserves the tabs. Due to a broken python library used in mailman, there are plenty of mailing lists that produce headers with tabs in header folding, and we'll see plenty of tabs. (This change has been mitigated in preparatory patches.)
* For pure header parsing, the current parser is likely faster than gmime, which parses the whole message rather than just the headers. Since we parse the message and its headers using gmime for indexing anyway, this avoids an extra header parsing round when adding new messages. In case of duplicate messages, we'll end up parsing the full message although just headers would be sufficient. All in all this should still speed up 'notmuch new'.
* Calls to notmuch_message_get_header() may be slightly slower than previously for headers that are not indexed in the database, due to parsing of the whole message. Within the notmuch code base, notmuch reply is the only such user.
2014-04-05  lib: drop support for single-message mbox files  (Jani Nikula)
We've supported mbox files containing a single message for historical reasons, but the support has been deprecated, with a warning message while indexing, since Notmuch 0.15. Finally drop the support, and consider all mbox files non-email.
2013-09-14  lib/cli: pass GMIME_ENABLE_RFC2047_WORKAROUNDS to g_mime_init()  (Jani Nikula)
As explained by Jeffrey Stedfast, the author of GMime, quoted in [1]:
> Passing the GMIME_ENABLE_RFC2047_WORKAROUNDS flag to g_mime_init()
> *should* solve the decoding problem mentioned in the thread. This
> flag should be safe to pass into g_mime_init() without any bad side
> effects and my unit tests do test that code-path.
The thread being referred to is [2].
[1] id:87bo56viyo.fsf@nikula.org
[2] id:08cb1dcd-c5db-4e33-8b09-7730cb3d59a2@gmail.com
2012-12-24  _notmuch_message_index_file: unref (free) address lists from gmime.  (David Bremner)
Apparently as of GMime 2.4, you don't need to call internet_address_list_destroy anymore, but you still need to call g_object_unref (from the GMime Changelog). On the medium performance corpus, valgrind shows "possibly lost" leakage in "notmuch new" dropping from 7M to 300k.