Age | Commit message | Author
2009-10-19Document which pieces of glib we're still using.Carl Worth
Looks like we can copy in a hash-table implementation, (from cairo, say), and then a few _ascii_ functions from glib, (we'll need to switch a few current uses of things like isspace, etc. to locale-independent versions as well). So not too hard to free ourselves of glib for now, (until we add GMime back in later, of course).
2009-10-19Hook up our fancy new notmuch_parse_date function.Carl Worth
With all the de-glib-ification out of the way, we can now use it to allow for date-based sorting of Xapian search results.
2009-10-19notmuch_parse_date: Handle a NULL date string gracefully.Carl Worth
The obvious thing to do is to treat a missing date as the beginning of time. Also, remove a useless cast from another return of 0.
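A minimal sketch of the NULL handling described above, not the actual date.c code. The names example_parse_date and example_parse_nonnull_date are hypothetical, and the inner parser is a stand-in that returns a fixed timestamp so the example stays self-contained.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the real RFC 822 parser. It ignores its
 * input and reports a fixed time (2009-10-19 00:00:00 UTC). */
static time_t
example_parse_nonnull_date (const char *str)
{
    (void) str;
    return (time_t) 1255910400;
}

/* Treat a missing date as the beginning of time (the epoch, 0). */
time_t
example_parse_date (const char *str)
{
    if (str == NULL)
        return 0;

    return example_parse_nonnull_date (str);
}
```

With this shape, a message that has no Date header simply sorts before every dated message instead of crashing the caller.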
2009-10-19date.c: Rename function to notmuch_parse_dateCarl Worth
Now completing the process of making this function "our own". The documentation is deleted here, because we already have the documentation we want in notmuch-private.h.
2009-10-19date.c: Add hard-coded definition of HAVE_TIMEZONECarl Worth
The original code expected this to be set by running configure. We'll just manually set it here for now. This isn't as portable as if we were doing some compile-time examination of the current system, but I don't need portability now. When someone comes along that wants to port notmuch to another system, they will already have all the #ifdefs in place and will simply need to add the appropriate machinery to set the defines.
2009-10-19date.c: Don't use glib's slice allocator.Carl Worth
This change is gratuitous. For now, notmuch is still linking against glib, so I don't have any requirement to remove this, (unlike the last few changes where good taste really did require the changes). The motivation here is two-fold: 1. I'm considering switching away from all glib-based allocation soon so that I can more easily verify that the memory management is solid. I want valgrind to say "no leaks are possible" not "there is tons of memory still allocated, but probably reachable so who knows if there are leaks or not?". And glib seems to make that impossible. 2. I don't think there's anything performance-sensitive about the allocation here. (In fact, if there is, then the right answer would be to do this parsing without any allocation whatsoever.)
2009-10-19date.c: Remove occurrences of gboolean.Carl Worth
While this is surely one of the most innocent typedefs, it still annoys me to have basic types like 'int' re-defined like this. It just makes it harder to copy the code between projects, with very little benefit in readability. For readability, predicate functions and variables should be obviously Boolean-natured by their actual *names*.
2009-10-19date.c: Remove all occurrences of g_return_val_if_failCarl Worth
That's got to be one of the hardest macro names to read, ever, (it's phrased with an implicit negative in the condition, rather than something simple like "assert"). Plus, it's evil, since it's a macro with a return in it. And finally, it's actually *longer* than just typing "if" and "return". So what's the point of this ugly idiom?
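For illustration, here is the kind of rewrite this commit describes, shown on a hypothetical function (example_token_length is not from date.c). The glib form appears only in a comment so the sketch compiles without glib. One behavioral difference worth knowing: g_return_val_if_fail also logs a critical warning when the check fails, while the plain form returns silently.

```c
#include <stddef.h>
#include <string.h>

/* Before, in glib style:
 *
 *     g_return_val_if_fail (str != NULL, 0);
 *
 * After: the same guard as a plain "if" and "return".
 */
size_t
example_token_length (const char *str)
{
    if (str == NULL)
        return 0;

    /* Length of the leading run of non-whitespace characters. */
    return strcspn (str, " \t");
}
```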
2009-10-19date.c: Keep the comments clean.Carl Worth
Never know when the children might be reading over my shoulder, for example. :-)
2009-10-19date.c: Change headers/defines to work within notmuch.Carl Worth
We can't rely on any gmime-internal headers, (and fortunately we don't need to). We also aren't burdened with any autoconf machinery so don't reference any of that.
2009-10-19date.c: Remove a bunch of undesired code.Carl Worth
We're only interested in the date-parsing code here.
2009-10-19date.c: Convert from LGPL-2+ to GPL-3+Carl Worth
As authorized by LGPL-2 term (3).
2009-10-19date.c: Add new file directly from gmime2.4-2.4.6/gmime/gmime-utils.cCarl Worth
We're sucking in one gmime implementation file just to get the piece that parses an RFC 822 date, because I don't want to go through the pain of replicating that.
2009-10-19notmuch: Switch from gmime to custom, ad-hoc parsing of headers.Carl Worth
Since we're currently just trying to stitch together In-Reply-To and References headers we don't need that much sophistication. It's when we later add full-text searching that GMime will be useful. So for now, even though my own code here is surely very buggy compared to GMime it's also a lot faster. And speed is what we're after for the initial index creation.
2009-10-19notmuch: Ignore .notmuch when counting files.Carl Worth
We were correctly ignoring this when adding files, but not when doing the initial count. Clearly we need better code sharing here.
2009-10-18notmuch: Start actually adding messages to the index.Carl Worth
This is the beginning of the notmuch library as well, with its interface in notmuch.h. So far we've got create, open, close, and add_message (all with a notmuch_database prefix). The current add_message function has already been whittled down from what we have in notmuch-index-message to add only references, message-id, and thread-id to the index, (that is---just enough to do thread-linkage but nothing for full-text searching). The concept here is to do something quickly so that the user can get some data into notmuch and start using it. (The most interesting stuff is then thread-linkage and labels like inbox and unread.) We can defer the full-text indexing of the body of the messages for later, (such as in the background while the user is reading mail). The initial thread-stitching step is still slower than I would like. We may have to stop using libgmime for this step as its overhead is not worth it for the simple case of just parsing the message-id, references, and in-reply-to headers.
2009-10-18xapian-dump: Rewrite to generate C code as output.Carl Worth
This was for some timing tests, (to see how fast xapian could be if we were strictly adding documents and not doing any other IO or computation). The answer is that xapian is quite fast, (on the order of 1000 documents per second).
2009-10-17Start a new top-level executable: notmuch.Carl Worth
Of course, there's not much that this program does yet. It's got some structure for some sub-commands that don't do anything. And it has a main command that prints some explanatory text and then counts all the regular files in your mail archive.
2009-10-16Fix more memory leaks.Carl Worth
These were more significant than the previous leak because these were in the loop and leaking memory for every message being parsed. It turns out that g_hash_table_new should probably be named g_hash_table_new_and_leak_memory_please. The actually useful function is g_hash_table_new_full which lets us pass a free function, (to free keys when inserting duplicates into the hash table). And after all, weeding out duplicates is the only reason we are using this hash table in the first place. It almost goes without saying, valgrind found these leaks.
2009-10-16Fix a one-time memory leak.Carl Worth
This was a single object in main outside any loops, so there was no impact on performance or anything, but obviously we still want to patch this. Of course, valgrind gets the credit for seeing this.
2009-10-16Avoid reading a byte just before our allocated buffer.Carl Worth
When looking for a trailing ':' to introduce a quotation we peek at the last character before a newline. But for blank lines, that's not where we want to look. And when the first line in our buffer is a blank line, we're underrunning our buffer. The fix is easy---just bail early on blank lines since they have no terms anyway. Thanks to valgrind for pointing out this error.
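A sketch of the guard described above, under assumed conventions: the line occupies buf[start..end) and end is the index of its newline. The function name and signature are hypothetical, not the indexer's own.

```c
#include <stddef.h>

/* Returns 1 if the line buf[start..end) ends with ':' (and so may
 * introduce a quotation), 0 otherwise. 'end' indexes the newline. */
int
example_line_introduces_quote (const char *buf, size_t start, size_t end)
{
    /* A blank line has no last character. Without this check,
     * buf[end - 1] would read the previous line's newline, or the
     * byte before the buffer when start == 0. */
    if (end == start)
        return 0;

    return buf[end - 1] == ':';
}
```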
2009-10-16Generate random thread IDs instead of using an arbitrary Message-ID.Carl Worth
Previously, we used as the thread-id the message-id of the first message in the thread that we happened to find. In fact, this is a totally arbitrary identifier, so it might as well be random. And an advantage of actually using a random identifier is that we now have fixed-length thread identifiers, (and the way is open to even allow abbreviated identifiers like git does---though we're less likely to show these identifiers to actual users).
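One way to produce a fixed-length random identifier, as a sketch only. The 16-hex-digit length, the use of /dev/urandom, and the name example_generate_thread_id are all assumptions for illustration; the commit message does not say how the IDs are actually generated.

```c
#include <stdio.h>
#include <stdlib.h>

#define EXAMPLE_THREAD_ID_HEX_LEN 16

/* Fill 'id' with 16 lowercase hex digits plus a terminating NUL.
 * Returns 0 on success, -1 if random bytes could not be read. */
int
example_generate_thread_id (char id[EXAMPLE_THREAD_ID_HEX_LEN + 1])
{
    unsigned char bytes[EXAMPLE_THREAD_ID_HEX_LEN / 2];
    FILE *urandom = fopen ("/dev/urandom", "rb");
    size_t i;

    if (urandom == NULL)
        return -1;

    if (fread (bytes, 1, sizeof (bytes), urandom) != sizeof (bytes)) {
        fclose (urandom);
        return -1;
    }
    fclose (urandom);

    /* Each byte becomes two hex digits; the final snprintf call
     * also writes the terminating NUL at id[16]. */
    for (i = 0; i < sizeof (bytes); i++)
        snprintf (id + 2 * i, 3, "%02x", bytes[i]);

    return 0;
}
```

Because every ID has the same length, a prefix of it can serve as an abbreviation in the way git abbreviates object names.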
2009-10-15Change progress report to show "instantaneous" rate. Also print total time.Carl Worth
Instead of always showing the overall rate, we wait until the end to show that. Then, on incremental updates we show the rate over the last increment. This makes it much easier to actually watch what's happening, (and it's easy to see the effect of xapian's internal 10,000 document flush).
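The difference between the two rates can be sketched as follows. The struct and function names are hypothetical, and the caller is assumed to supply the current time in seconds.

```c
#include <stddef.h>

/* State remembered between progress reports. */
typedef struct {
    size_t last_count;  /* messages processed at the previous report */
    double last_time;   /* time of the previous report, in seconds */
} example_progress_t;

/* Returns the rate (messages per second) over the interval since the
 * previous report, then records this report as the new baseline. */
double
example_incremental_rate (example_progress_t *progress,
                          size_t count, double now)
{
    double elapsed = now - progress->last_time;
    double rate = 0.0;

    if (elapsed > 0.0)
        rate = (double) (count - progress->last_count) / elapsed;

    progress->last_count = count;
    progress->last_time = now;

    return rate;
}
```

The overall rate printed at the end is just the total count divided by the total elapsed time. A pause such as a database flush shows up as a dip in the incremental figure, but is averaged away in the overall one.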
2009-10-14Protect against missing message id while indexing filesKeith Packard
2009-10-14Walk address groups and parse each address separatelyKeith Packard
Signed-off-by: Keith Packard <keithp@keithp.com>
2009-10-14Reduce the verbosity of the progress indicator.Carl Worth
It's fast enough that we can wait for 1000 messages before updating.
2009-10-14Add support for message-part mime parts.Carl Worth
We could (and probably should) reparse and index all the headers from the embedded message, but I'm not choosing to do that now---I'm just indexing the body of the embedded message.
2009-10-14Avoid segfault on message with no subject.Carl Worth
It's fun how turning a program loose on 500,000 messages will find lots of little corner cases.
2009-10-14Add some sort of progress indicator.Carl Worth
It's nice to let the user know that something is happening.
2009-10-14Avoid complaints about messages with empty mime parts.Carl Worth
2009-10-14Avoid complaints about empty address lists.Carl Worth
2009-10-14Document the little details separating the sup and notmuch indexes.Carl Worth
As can be seen here, there are not a lot of differences. I've verified this by using sup-sync to import a month of mail from the sup mailing list, and comparing the database term-by-term, value-by-value, and data-by-data with that created by notmuch. There are no differences other than those documented here.
2009-10-14Avoid trimming initial whitespace while looking for signatures.Carl Worth
I ran into a message with an indented stack trace that my indexer was mistaking for a signature.
2009-10-14Index an attachment's filename extension as well.Carl Worth
I hadn't realized that sup used a special term for this. But there you go.
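A small sketch of pulling the extension out of an attachment filename. The function name is hypothetical, and the edge-case choices (no extension for a leading or trailing dot) are assumptions, not a description of what sup or notmuch actually do.

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the text after the last '.' in 'filename',
 * or NULL if there is no usable extension. A dot in the first
 * position (".bashrc") or the last position ("file.") does not
 * count as introducing an extension here. */
const char *
example_filename_extension (const char *filename)
{
    const char *dot = strrchr (filename, '.');

    if (dot == NULL || dot == filename || dot[1] == '\0')
        return NULL;

    return dot + 1;
}
```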
2009-10-14Index the filename of any attachment.Carl Worth
2009-10-14[sup-compat] Don't index mime parts with content-disposition of attachmentCarl Worth
Here's another change which I'm making for sup compatibility against my better judgment. It seems that sup never indexes content from mime parts with content-disposition of attachment. But these attachments are often very indexable, (for example, the first one I encountered was a small shell script). So I'll have to think a bit about whether or not I want to revert this commit. To do this properly we would really want to distinguish between attachments that are indexable, (such as text), and those that aren't, (such as binaries). I know the mime-type alone isn't always sufficient here as even this little plaintext shell script was attached as octet-stream. And if we wanted to get really fancy we could run things like antiword to generate text from non-text attachments and index their output.
2009-10-14Add label "attachment" when an attachment is seen.Carl Worth
2009-10-14Split thread_id value on commas before inserting into hash.Carl Worth
One thread_id value may have multiple thread IDs in it so we need to separate them out before inserting into our hash.
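The splitting step might look like this sketch, which walks the value without modifying it and hands each piece to a callback. The callback type and function name are hypothetical. In the real code the callback's job would be the hash insertion, which is left out here.

```c
#include <stddef.h>
#include <string.h>

/* Called once per thread ID. 'id' is not NUL-terminated; use 'len'. */
typedef void (*example_id_func_t) (const char *id, size_t len,
                                   void *closure);

/* Invokes 'func' (if non-NULL) for every non-empty comma-separated
 * piece of 'value' and returns the number of pieces found. */
size_t
example_split_thread_ids (const char *value,
                          example_id_func_t func, void *closure)
{
    size_t count = 0;

    while (*value != '\0') {
        size_t len = strcspn (value, ",");

        if (len > 0) {
            if (func != NULL)
                func (value, len, closure);
            count++;
        }

        value += len;
        if (*value == ',')
            value++;
    }

    return count;
}
```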
2009-10-14Add missing null terminator before using byte-array contents as string.Carl Worth
Thanks to valgrind for spotting this one.
2009-10-14notmuch-index-message: Add explicit support for multipart mime.Carl Worth
Instead of using the recursive "foreach" method, we implement our own recursive function. This allows us to ignore the signature component of a multipart/signed message, (which we certainly don't need to index).
2009-10-14[sup-compat] Don't trim trailing whitespace on line introducing quotation.Carl Worth
Ignoring this whitespace seems like a good idea to me, but it's interfering with my comparisons with sup since sup doesn't do this. This might be a commit worth dropping in the future since it exists only for pedantic consistency with sup and not for any reason of its own.
2009-10-14notmuch-index-message: Fix handling of thread_id terms.Carl Worth
We now emit one term per thread_id, rather than the comma-separated super-term we were doing previously.
2009-10-14notmuch-index-message: Use local-part of email address in lieu of name.Carl Worth
If there's no name given, take the portion of the email address before the '@' sign. One step closer to matching sup's terms in the database.
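A sketch of the fallback, with a hypothetical function name. It assumes the address has already been extracted from the header, and it returns the whole string when no '@' is present.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of the portion of 'address' before
 * the first '@' (the whole string if there is no '@'), or NULL if
 * allocation fails. The caller frees the result. */
char *
example_local_part (const char *address)
{
    size_t len = strcspn (address, "@");
    char *name = malloc (len + 1);

    if (name == NULL)
        return NULL;

    memcpy (name, address, len);
    name[len] = '\0';

    return name;
}
```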
2009-10-14Use gmime's own reference-parsing code.Carl Worth
Here's another instance where I "knew" gmime must have support for some functionality, but not finding it, I rolled my own. Now that I found g_mime_references_decode I'm glad to drop my ugly code.
2009-10-14notmuch-index-message: Correctly parse and index encoded mime parts.Carl Worth
This cleans up some old code that was very ugly, (separately opening the mail file and seeking to the end of the headers to parse the body). I knew gmime must have had support for transparently decoding mime content, but I just couldn't find it previously. Note: Multipart and MultipartSigned parts are not handled yet. Things are quite happy now. The few differences I see with sup are: 1. sup forces email address domains to lowercase, (I don't think I care) 2. sup and notmuch disagree on ordering of multiple thread_id values (another thing that's of no concern) We are still doing one thing wrong when a message belongs to multiple threads. We've got a nice comma-separated thread-value just like sup, but then we're also putting in a comma-separated thread-term where sup does multiple thread terms. That should be an easy fix. Beyond that, sup and notmuch are still disagreeing on the term lists for some messages, (I think attachment vs. inline content-disposition is at least one piece of this). But there are likely still differences in the heuristics for which chunks of the message body to index. I'll be looking into this more.
2009-10-14notmuch-index-message: Lookup children for thread_id as well.Carl Worth
This provides the thread_id linkage for when a child message is indexed before the parent.
2009-10-14notmuch-index-message: Use more meaningful variable names.Carl Worth
The abuse of the generic "value" name was getting very hard to read.
2009-10-14notmuch-index-message: Start generating correct thread_id values.Carl Worth
Currently we're looking up all parents (based on In-reply-to and References headers) and using the list of all thread_id values from those as our thread_id value. We're missing one step which sup does which is to also look up any children in the database that reference our message ID. So we'll need to do that next.
2009-10-14Factor out parsing of reference-header values and pickup In-reply-to.Carl Worth
This is in preparation for doing a couple of passes over the references, (one to add terms to the database, and a second to find the thread_id). We also now parse the In-reply-to header which we were missing before. We treat it identically to the References header.
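Factoring the parsing out as an iterator is one way to allow several passes over the same header value. The sketch below is an assumption-laden illustration, not the commit's code: it treats anything between '<' and '>' as a message ID and ignores comments and quoting. The name example_next_reference is hypothetical. The same function serves both References and In-reply-to values.

```c
#include <stddef.h>
#include <string.h>

/* Finds the next "<...>" at or after *cursor. On success, stores the
 * ID (without brackets, not NUL-terminated) in *id and *len, advances
 * *cursor past the closing '>', and returns 1. Returns 0 when no
 * complete ID remains. */
int
example_next_reference (const char **cursor, const char **id, size_t *len)
{
    const char *start = strchr (*cursor, '<');
    const char *end;

    if (start == NULL)
        return 0;

    end = strchr (start + 1, '>');
    if (end == NULL)
        return 0;

    *id = start + 1;
    *len = (size_t) (end - start - 1);
    *cursor = end + 1;

    return 1;
}
```

A caller makes one pass by looping until the function returns 0, and starts a second pass by resetting the cursor to the beginning of the header value.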
2009-10-14notmuch-index-message: Ignore more signature patterns.Carl Worth
Getting more sup-compatible all the time.