author: Carl Worth <cworth@cworth.org>, 2009-10-14 16:49:26 -0700
committer: Carl Worth <cworth@cworth.org>, 2009-10-14 16:49:26 -0700
commit: bae1ce09a37071cdf592048938319c72653e96e0
tree: a30f2e75f0774884b40ccb6a5e89d0e3a0d8605f
parent: 784779fb6759a0a3990604befeb5e5ee7e7d1145
Document the little details separating the sup and notmuch indexes.
As can be seen here, there are not a lot of differences. I've verified this by using sup-sync to import a month of mail from the sup mailing list, and comparing the database term-by-term, value-by-value, and data-by-data with that created by notmuch. There are no differences other than those documented here.
-rw-r--r-- notmuch-index-message.cc | 32
1 files changed, 32 insertions, 0 deletions
diff --git a/notmuch-index-message.cc b/notmuch-index-message.cc
index 79ec81e4..4585a3b9 100644
--- a/notmuch-index-message.cc
+++ b/notmuch-index-message.cc
@@ -17,6 +17,38 @@
* Author: Carl Worth <cworth@cworth.org>
*/
+/* This indexer creates a Xapian mail index that is remarkably similar
+ * to that created by sup. The big difference (and the thing that
+ * will keep a notmuch index from being used by sup directly) is that
+ * sup expects a serialized ruby data structure in the document's data
+ * field, but notmuch just puts the mail's filename there (trusting
+ * that the email client can get the data it needs from the filename).
+ *
+ * Note: One bug here is that sup actually merges together fields such
+ * as To, CC, Bcc etc. when finding multiple emails with the same
+ * message ID. To support something similar, notmuch should list
+ * multiple files in the data field.
+ *
+ * Other differences between sup and notmuch-index identified so far:
+ *
+ * o sup supports encrypted mime parts by prompting for a passphrase
+ * to decrypt the message. So far, notmuch doesn't support this,
+ * both because I'm too lazy to code it, and I also think doing so
+ * would present a security leak.
+ *
+ * o sup and notmuch have different heuristics for identifying (and
+ * thus ignoring) signatures. For example, sup considers a line
+ * consisting of two hyphens as a signature separator, while
+ * notmuch expects those two hyphens to be followed by a space
+ * character.
+ *
+ * o sup has been seen to split some numbers before indexing
+ * them. For example, the number 1754 in an email message was
+ * indexed by sup as separate terms 17 and 54. I couldn't find any
+ * explanation for this behavior and did not try to replicate it
+ * in notmuch.
+ */
+
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
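
The signature-separator difference described in the added comment can be illustrated with a small sketch. This is not code from notmuch or sup: the helper names are hypothetical, and the exact-match comparison is an assumption, since the comment does not say whether either program tolerates extra characters after the separator.

```cpp
#include <string>

// Hypothetical helper: the rule the comment attributes to sup,
// where a line consisting of two hyphens starts a signature.
// Assumption: the line has already had its newline stripped.
bool is_separator_two_hyphens(const std::string &line)
{
    return line == "--";
}

// Hypothetical helper: the rule the comment attributes to notmuch,
// where the two hyphens must be followed by a space character.
// Assumption: nothing else follows the space on that line.
bool is_separator_hyphens_space(const std::string &line)
{
    return line == "-- ";
}
```

Under these assumptions, a mail whose signature is introduced by a bare "--" line would have its signature skipped by the first rule but indexed by the second, which is one way the two indexes could end up with different terms for the same message.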