
Data structures: Metadata index

Jack Dodds edited this page Jun 27, 2018 · 4 revisions

This page is based on reading code from commit 8511284 dated 2018-02-23 and files generated by it. See search.MailIndex.edit_msg_info() and ... .add_new_msg(). There may be errors or omissions!

Structure

The metadata index contains a contact index section and a message index section. The contact index lists email addresses and associates a unique integer Contact Index ID to each one. The message index lists messages. To each message it associates a unique integer Message Index ID, a pointer to the message location in external storage, and metadata from that message. The metadata for a message can include cross-references to other Message Index IDs and to Contact Index IDs.

External format

The metadata index is stored in the file mailpile.idx. The file is normally encrypted. An authenticated user can make a plaintext copy using the CLI, for example:

mailpile> cd ~/.local/share/Mailpile/default
mailpile> pipe >mailpile.idx.txt cat mailpile.idx

The file consists of lines terminated by <LF> = 0Ah.

Contact index lines start with '@' = 40h.

Each line consists of fields delimited by tabs <HT> = 09h.

Some fields may contain a list of items delimited by commas = 2Ch.
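The line, field and list delimiters described above can be illustrated with a minimal Python 3 sketch. It assumes the plaintext copy has already been made, and that the '@' marker directly precedes the first field of a contact index line. The function names are hypothetical and are not part of Mailpile.

```python
def split_index_line(line):
    """Split one plaintext mailpile.idx line into (kind, fields)."""
    line = line.rstrip("\n")           # lines are terminated by <LF>
    if line.startswith("@"):
        # Contact index line: drop the '@' marker, then split on tabs.
        return "contact", line[1:].split("\t")
    # Anything else is treated here as a message index line.
    return "message", line.split("\t")


def split_list_field(field):
    """Split a comma-delimited list field; an empty field gives []."""
    return field.split(",") if field else []


kind, fields = split_index_line("@1G\tuser%40example.com\n")
print(kind, fields)
print(split_list_field("1,2,3"))
```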

The complete metadata index is written to mailpile.idx by mailpile.search.MailIndex.save(), called by mailpile.search.MailIndex.save_changes(), mailpile.commands.Command._background_save() and mailpile.plugins.core.Optimize(). Previous versions of the file are renamed and retained as mailpile.idx.1, ....

Updates to the metadata index are appended to an existing mailpile.idx by mailpile.search.MailIndex.save_changes(), called by mailpile.commands.Command._background_save() and at shutdown by mailpile.app.Main(). When an updated mailpile.idx is later read into internal storage, only the last entry for any Contact Index ID or Message Index ID is retained.
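Because updates are appended, a reader of the plaintext file has to let later lines override earlier ones. The following Python 3 sketch shows that rule under the assumption that every non-empty line is either a contact index line or a message index line. The name load_latest is hypothetical, and this is not Mailpile's own loader.

```python
def load_latest(lines):
    """Keep only the last line seen for each contact or message ID."""
    contacts, messages = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            fields = line[1:].split("\t")
            contacts[int(fields[0], 36)] = fields   # later lines overwrite
        else:
            fields = line.split("\t")
            messages[int(fields[0], 36)] = fields   # later lines overwrite
    return contacts, messages


contacts, messages = load_latest([
    "A\told entry\n",
    "@1\tuser%40example.com\n",
    "A\tnew entry\n",
])
print(messages[10])
```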

In one Mailpile instance (which may not be typical) the mailpile.idx file contained about 50,000 entries and occupied about 24 MB (encrypted).

Fields - contact index lines

  1. Contact Index ID. Base36 - variable length - no leading zeros.
  2. Email address in a safe format with the display name at the end, for example:
    brmdamon%40hushmail.com%20%28Jack%20Dodds%29
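Judging from the example above, the safe format looks like percent-encoding. A minimal Python 3 sketch of decoding the two fields of a contact index line, on that assumption, is shown here. The ID value "1G" is made up for illustration, the function name is hypothetical, and the handling of non-ASCII display names is assumed to be UTF-8.

```python
from urllib.parse import unquote


def decode_contact_fields(id_field, email_field):
    """Return (Contact Index ID as int, decoded 'address (name)' string)."""
    contact_id = int(id_field, 36)    # Base36; int() accepts either case
    return contact_id, unquote(email_field)


print(decode_contact_fields(
    "1G", "brmdamon%40hushmail.com%20%28Jack%20Dodds%29"))
# → (52, 'brmdamon@hushmail.com (Jack Dodds)')
```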

Fields - message index lines

  1. Message Index ID. Base36 - variable length - no leading zeros.
  2. Message location pointer. List (but usually just one) of hex - 14 digits
    May be blank e.g. when snippet contains '(missing message)'.
    Digits 0-4: Mailbox.
    Refers to subdirectory paths listed in mailpile.cfg, section config/sys/mailbox.
    Digits 5-14: File name of message.
  3. MessageID Hash. Base64 - 27 digits representing 160 bits. This is the SHA1 hash of the MessageID if it exists.
    See base.BaseIndex._encode_msg_id().
  4. Date/time. Base36 - 6 digits. Seconds since the Unix epoch (UTC).
  5. From field as it appears in the message.
  6. To field as a list of Base36 Contact Index IDs.
  7. Cc field as a list of Base36 Contact Index IDs.
  8. Size. Base36 - variable length. Message size in units of 1024 octets.
  9. Subject field as it appears in the message.
  10. Snippet of the message body. Sometimes '(missing message)'.
    See search.MailIndex.add_new_ghost()
  11. Tags. List of Base36 (lower case) tag IDs. e.g. ID xxx refers to mailpile.cfg section config/tags/xxx.
    Contains {G} when Snippet contains '(missing message)'.
  12. Replies. List of Base36 Message Index IDs.
  13. Conversation ID. Base36, or two Base36 separated by slash / = 2Fh.

Constants (MSG_MID, MSG_PTRS, MSG_ID, MSG_DATE, ... ) that identify these fields are defined in mailpile.index.msginfo.MessageInfoConstants.
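The field list above can be turned into a small parser. The Python 3 sketch below assumes a well-formed line with all 13 fields present. The dictionary keys are illustrative names chosen for this page; they are not the Mailpile constants. Tag IDs and location pointers are left as strings because they refer to entries in mailpile.cfg.

```python
from datetime import datetime, timezone

# Illustrative names only, in the order the fields are listed above.
FIELD_NAMES = ["mid", "ptrs", "msg_id_hash", "date", "from", "to", "cc",
               "size_kb", "subject", "snippet", "tags", "replies",
               "conversation"]


def _strings(field):
    """Comma-delimited list of strings; empty field gives []."""
    return [item for item in field.split(",") if item]


def _ids(field):
    """Comma-delimited list of Base36 IDs, converted to ints."""
    return [int(item, 36) for item in _strings(field)]


def parse_message_line(line):
    raw = dict(zip(FIELD_NAMES, line.rstrip("\n").split("\t")))
    return {
        "mid": int(raw["mid"], 36),
        "ptrs": _strings(raw["ptrs"]),
        "msg_id_hash": raw["msg_id_hash"],
        "date": datetime.fromtimestamp(int(raw["date"], 36),
                                       tz=timezone.utc),
        "from": raw["from"],
        "to": _ids(raw["to"]),
        "cc": _ids(raw["cc"]),
        "size_kb": int(raw["size_kb"], 36),
        "subject": raw["subject"],
        "snippet": raw["snippet"],
        "tags": _strings(raw["tags"]),
        "replies": _ids(raw["replies"]),
        "conversation": raw["conversation"],
    }


# A made-up line; none of these values come from a real index.
sample = "\t".join(["A", "", "x" * 27, "P00000",
                    "Alice <alice@example.com>", "1,2", "", "3",
                    "Hello", "Hi there", "1,a", "A,B", "A"])
info = parse_message_line(sample)
print(info["mid"], info["to"], info["date"].isoformat())
```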

Internal format - contact index

The contact index is stored in three attributes of mailpile.search.MailIndex.

EMAILS is a list in which the items correspond structurally to the lines in the contact index part of the external format, without the safe encoding. The position in the list is the Contact Index ID. Example (print EMAILS):

[u'[email protected] (Bjarni Runar Einarsson)',
u'[email protected] (Jack Dodds)', ... ,
u'[email protected] (schneier)',
u'[email protected] (Sm\xe1ri McCarthy)',
u'[email protected] (Jack Dodds)', ... ,
u'[email protected] (Google)',
... ]

EMAIL_IDS is a dictionary which maps each bare email address (no display name) to its Contact Index ID.

{u'[email protected]': 36,
u'[email protected]': 32,
u'[email protected]': 30,
u'[email protected]': 11,
u'[email protected]': 12,
u'[email protected]': 31, }

EMAILS_SAVED is a count of the contact index entries that have been written to external storage, that is, EMAILS[:EMAILS_SAVED] have been written while EMAILS[EMAILS_SAVED:] have not.

Attributes EMAILS, EMAIL_IDS, and EMAILS_SAVED are loaded at startup by mailpile.search.load.process_lines().
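How the three attributes relate to each other can be sketched as follows. This is a simplified Python 3 model using example.com addresses, not the Mailpile code itself; add_contact and unsaved_contacts are hypothetical helpers.

```python
EMAILS = []        # position in the list is the Contact Index ID
EMAIL_IDS = {}     # bare address -> Contact Index ID
EMAILS_SAVED = 0   # EMAILS[:EMAILS_SAVED] are already in external storage


def add_contact(address, display_name):
    """Return the ID for an address, appending a new entry if needed."""
    if address in EMAIL_IDS:
        return EMAIL_IDS[address]
    EMAILS.append("%s (%s)" % (address, display_name))
    EMAIL_IDS[address] = len(EMAILS) - 1
    return EMAIL_IDS[address]


def unsaved_contacts():
    """Entries that still have to be written to external storage."""
    return EMAILS[EMAILS_SAVED:]


add_contact("alice@example.com", "Alice")
add_contact("bob@example.com", "Bob")
EMAILS_SAVED = 1                      # pretend the first entry was saved
print(unsaved_contacts())             # → ['bob@example.com (Bob)']
```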

Internal format - message index

The message index is stored in several attributes of mailpile.search.MailIndex.

INDEX and INDEX_THR are lists in which the items correspond structurally to the lines in the message index part of the external format. The position in the list is the Message Index ID. Each INDEX list item is a unicode object corresponding to one line of the external format, including the tab delimiters. Each INDEX_THR item is the conversation ID (the first field only when the ID has more than one field).

INDEX_SORT is a dictionary in which each key is a sort order (e.g. 'freshness' or 'date') and each value is a list. In each list, the index is the Message Index ID; the item is a number which defines the sort priority of that message according to the sort order.
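As a rough illustration of this layout, the Python 3 sketch below orders a set of Message Index IDs using one of the per-order lists. The numbers are made up, and the assumption that a smaller number sorts first is only for the sake of the example.

```python
# Made-up data: for each sort order, position == Message Index ID.
INDEX_SORT = {
    "date": [300, 100, 200],
}


def sort_results(mids, order):
    """Order Message Index IDs by their priority under a sort order."""
    priorities = INDEX_SORT[order]
    return sorted(mids, key=lambda mid: priorities[mid])


print(sort_results([0, 1, 2], "date"))   # → [1, 2, 0]
```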

MSGIDS is a dictionary which maps the MessageID Hash of each message to its Message Index ID.

PTRS is a dictionary which maps the Message location(s) for a message to the Message Index ID.

TAGS is a dictionary in which the keys are tag IDs and the values are sets of Message Index IDs.

At startup, the above index attributes are loaded by mailpile.search.MailIndex.set_msg_at_idx_pos() called by ... .load() called by ... .process_lines().
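The relationship between INDEX, INDEX_THR, MSGIDS, PTRS and TAGS can be sketched by building all five from a list of message index lines. This Python 3 sketch is a simplified model, not the logic of set_msg_at_idx_pos(): it keeps conversation IDs as Base36 strings, and it does not remove stale PTRS or TAGS entries when a line is replaced. The function name is hypothetical.

```python
from collections import defaultdict


def build_message_index(lines):
    index, index_thr = [], []
    msgids, ptrs, tags = {}, {}, defaultdict(set)
    for line in lines:
        fields = line.split("\t")
        mid = int(fields[0], 36)
        # Grow the lists so that position == Message Index ID.
        while len(index) <= mid:
            index.append(None)
            index_thr.append(None)
        index[mid] = line
        # Only the first part of a "xxx/yyy" conversation ID is kept.
        index_thr[mid] = fields[12].split("/")[0]
        msgids[fields[2]] = mid
        for ptr in fields[1].split(","):
            if ptr:
                ptrs[ptr] = mid
        for tag in fields[10].split(","):
            if tag:
                tags[tag].add(mid)
    return index, index_thr, msgids, ptrs, tags


# Two made-up lines with 13 tab-delimited fields each.
line0 = "\t".join(["0", "p0", "hash0", "P00000", "", "", "", "1",
                   "s", "b", "1,2", "", "0"])
line1 = "\t".join(["1", "p1", "hash1", "P00000", "", "", "", "1",
                   "s", "b", "2", "", "0/1"])
INDEX, INDEX_THR, MSGIDS, PTRS, TAGS = build_message_index([line0, line1])
print(INDEX_THR, MSGIDS, dict(TAGS))
```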

EMAILS_SAVED is the largest Contact Index ID in the externally stored index.

MODIFIED is a set of Message Index IDs of message index entries that have been modified but not written to external storage.

CACHE is a dictionary in which the keys are Message Index IDs and the values are structurally the same as items in INDEX.

Notes

Base64 - As used in the MessageID Hash, underscore = 5Fh represents value 63. (Differs from RFC2045 and RFC4880.)
Base36 - Digits (in order) "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
Base36 (lower case) - Digits (in order) "0123456789abcdefghijklmnopqrstuvwxyz"
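
The encodings in these notes can be reproduced with a short Python 3 sketch. The Base36 helpers follow the digit order given above. The hash helper assumes the MessageID is hashed as UTF-8 text and that the only departures from standard Base64 are dropping the '=' padding and writing '_' for value 63; any further normalization done by base.BaseIndex._encode_msg_id() is not covered. All function names are hypothetical.

```python
import base64
import hashlib

B36_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def b36_encode(number):
    """Non-negative int -> upper-case Base36 with no leading zeros."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(B36_DIGITS[remainder])
    return "".join(reversed(digits))


def b36_decode(text):
    """Base36 string (either case) -> int."""
    return int(text, 36)


def msg_id_hash(message_id):
    """SHA1 of the MessageID as 27 Base64 characters, '/' written as '_'."""
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()   # 160 bits
    text = base64.b64encode(digest).decode("ascii")              # 28 chars
    return text.rstrip("=").replace("/", "_")


print(b36_encode(52), b36_decode("1G"))   # → 1G 52
print(len(msg_id_hash("<example@example.com>")))   # → 27
```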
