Data structures: Metadata index
This page is based on reading code from commit 8511284 dated 2018-02-23 and files generated by it. See search.MailIndex.edit_msg_info() and ... .add_new_msg(). There may be errors or omissions!
The metadata index contains a contact index section and a message index section. The contact index lists email addresses and associates a unique integer Contact Index ID to each one. The message index lists messages. To each message it associates a unique integer Message Index ID, a pointer to the message location in external storage, and metadata from that message. The metadata for a message can include cross-references to other Message Index IDs and to Contact Index IDs.
The metadata index is stored in the file mailpile.idx. The file is normally encrypted. An authenticated user can make a plaintext copy using the CLI, for example:
mailpile> cd ~/.local/share/Mailpile/default
mailpile> pipe >mailpile.idx.txt cat mailpile.idx
The file consists of lines terminated by <LF> = 0Ah.
Contact index lines start with '@' = 40h.
Each line consists of fields delimited by tabs <HT> = 09h.
Some fields may contain a list of items delimited by commas = 2Ch.
The complete metadata index is written to mailpile.idx by mailpile.search.MailIndex.save(), called by mailpile.search.MailIndex.save_changes(), mailpile.commands.Command._background_save() and mailpile.plugins.core.Optimize(). Previous versions of the file are renamed and retained (mailpile.idx.1, ...).
Updates to the metadata index are appended to an existing mailpile.idx by mailpile.search.MailIndex.save_changes(), called by mailpile.commands.Command._background_save() and at shutdown by mailpile.app.Main(). When an updated mailpile.idx is later read into internal storage, only the last entry for any Contact Index ID or Message Index ID is retained.
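The "last entry wins" rule can be illustrated with a minimal Python 3 sketch that reads a plaintext copy of the file. This is not Mailpile's loader: the function name load_index_lines is hypothetical, and the sketch assumes that the '@' of a contact line is immediately followed by the Base36 Contact Index ID and that the file contains only the two kinds of lines described on this page.

```python
def load_index_lines(lines):
    """Split plaintext mailpile.idx lines into contact and message entries.

    A later line replaces an earlier one with the same ID, so appended
    updates override the entries they supersede.
    """
    contacts = {}  # Contact Index ID (int) -> remaining fields of the line
    messages = {}  # Message Index ID (int) -> all fields of the line
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Assumption: contact lines look like '@<Base36 ID><HT><address>'.
            contacts[int(fields[0][1:], 36)] = fields[1:]
        else:
            messages[int(fields[0], 36)] = fields
    return contacts, messages
```

Using dictionaries keyed by ID makes the override behaviour automatic: assigning to an existing key discards the earlier entry.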
In one Mailpile instance (which may not be typical) the mailpile.idx file contained about 50,000 entries and occupied about 24 MB (encrypted).
Each contact index line has the following fields, in order:
- Contact Index ID. Base36 - variable length - no leading zeros.
- Email address in a safe format with the display name at the end, for example:
brmdamon%40hushmail.com%20%28Jack%20Dodds%29
Each message index line has the following fields, in order:
- Message Index ID. Base36 - variable length - no leading zeros.
- Message location pointer. List (but usually just one) of hex - 14 digits.
May be blank, e.g. when the snippet contains '(missing message)'.
Digits 0-4: Mailbox.
Refers to subdirectory paths listed in mailpile.cfg, section config/sys/mailbox.
Digits 5-14: File name of message.
- MessageID Hash. Base64 - 27 digits representing 160 bits.
This is the SHA1 hash of the MessageID if it exists.
See base.BaseIndex._encode_msg_id().
- Date/time. Base36 - 6 digits. UTC seconds since epoch.
- From field as it appears in the message.
- To field as a list of Base36 Contact Index IDs.
- Cc field as a list of Base36 Contact Index IDs.
- Size. Base36 - variable length. Message size in units of 1024 octets.
- Subject field as it appears in the message.
- Snippet of the message body. Sometimes '(missing message)'.
See search.MailIndex.add_new_ghost().
- Tags. List of Base36 (lower case) tag IDs.
e.g. ID xxx refers to mailpile.cfg section config/tags/xxx.
Contains {G} when the snippet contains '(missing message)'.
- Replies. List of Base36 Message Index IDs.
- Conversation ID. Base36, or two Base36 separated by slash / = 2Fh.
Constants (MSG_MID, MSG_PTRS, MSG_ID, MSG_DATE, ... ) that identify these fields are defined in mailpile.index.msginfo.MessageInfoConstants.
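As an illustration of the field lists above, the following Python 3 sketch decodes one contact index line and one message index line. It is a sketch under stated assumptions, not Mailpile's own parser: the function names and the lower-case field names in MESSAGE_FIELDS are hypothetical, the "safe format" of the address is assumed to be percent-encoding, and the '@' of a contact line is assumed to be followed directly by the Contact Index ID.

```python
from datetime import datetime, timezone
from urllib.parse import unquote

# Hypothetical field names, in the order the fields are listed above.
MESSAGE_FIELDS = [
    "mid", "ptrs", "msg_id_hash", "date", "from", "to", "cc",
    "size_kb", "subject", "snippet", "tags", "replies", "conversation",
]


def split_list(field):
    """Split a comma-delimited field; a blank field gives an empty list."""
    return field.split(",") if field else []


def parse_contact_line(line):
    """Return (Contact Index ID, decoded address) for one '@' line."""
    id_field, address = line.rstrip("\n").split("\t")[:2]
    # Assumption: the address is percent-encoded.
    return int(id_field.lstrip("@"), 36), unquote(address)


def parse_message_line(line):
    """Return a dict of decoded fields for one message index line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MESSAGE_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(MESSAGE_FIELDS), len(fields)))
    info = dict(zip(MESSAGE_FIELDS, fields))
    info["mid"] = int(info["mid"], 36)
    info["ptrs"] = split_list(info["ptrs"])
    # Date/time is Base36 seconds since the epoch, in UTC.
    info["date"] = datetime.fromtimestamp(int(info["date"], 36), tz=timezone.utc)
    info["to"] = [int(cid, 36) for cid in split_list(info["to"])]
    info["cc"] = [int(cid, 36) for cid in split_list(info["cc"])]
    info["size_kb"] = int(info["size_kb"], 36)  # units of 1024 octets
    info["tags"] = split_list(info["tags"])
    info["replies"] = [int(mid, 36) for mid in split_list(info["replies"])]
    return info
```

Under the percent-encoding assumption, the example address brmdamon%40hushmail.com%20%28Jack%20Dodds%29 decodes to brmdamon@hushmail.com (Jack Dodds). The From, Subject, Snippet and Conversation ID fields are left as strings.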
The contact index is stored in two attributes of mailpile.search.MailIndex.
EMAILS is a list in which the items correspond structurally to the lines in the contact index part of the external format, without the safe encoding. The position in the list is the Contact Index ID. Example (print EMAILS):
[u'[email protected] (Bjarni Runar Einarsson)',
u'[email protected] (Jack Dodds)', ... ,
u'[email protected] (schneier)',
u'[email protected] (Sm\xe1ri McCarthy)',
u'[email protected] (Jack Dodds)', ... ,
u'[email protected] (Google)',
... ]
EMAIL_IDS is a dictionary which maps each bare email address (no display name) to its Contact Index ID.
{u'[email protected]': 36,
u'[email protected]': 32,
u'[email protected]': 30,
u'[email protected]': 11,
u'[email protected]': 12,
u'[email protected]': 31, }
EMAILS_SAVED is a count of the contact index entries that have been written to external storage, that is, EMAILS[:EMAILS_SAVED] have been written while EMAILS[EMAILS_SAVED:] have not.
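The relationship between these three attributes can be sketched in Python 3 as follows. The helper names are hypothetical, and the sketch assumes that the bare address is everything before the first space of an EMAILS item, as in the examples above.

```python
def build_email_ids(emails):
    """Map each bare address to its position (Contact Index ID) in EMAILS."""
    # Assumption: the bare address ends at the first space of the item.
    return {entry.split(" ", 1)[0]: cid for cid, entry in enumerate(emails)}


def unsaved_contacts(emails, emails_saved):
    """Return (Contact Index ID, entry) pairs not yet written to storage."""
    return [(cid, emails[cid]) for cid in range(emails_saved, len(emails))]
```

For a three-item EMAILS list with EMAILS_SAVED equal to 2, unsaved_contacts returns only the item at position 2.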
Attributes EMAILS, EMAIL_IDS, and EMAILS_SAVED are loaded at startup by mailpile.search.load.process_lines().
The message index is stored in several attributes of mailpile.search.MailIndex.
INDEX and INDEX_THR are lists in which the items correspond structurally to the lines in the message index part of the external format. The position in the list is the Message Index ID. Each INDEX list item is a unicode object corresponding to one line of the external format, including the tab delimiters. Each INDEX_THR item is the conversation ID (the first field only when the ID has more than one field).
INDEX_SORT is a dictionary in which each key is a sort order (e.g. 'freshness' or 'date') and each value is a list. In each list, the index is the Message Index ID; the item is a number which defines the sort priority of that message according to the sort order.
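A structure of this shape can be used to order a set of Message Index IDs, as in the short Python 3 sketch below. The function name is hypothetical, and the sketch assumes that a smaller priority number sorts first; the direction Mailpile actually uses is not stated here.

```python
def sort_message_ids(mids, index_sort, order):
    """Sort Message Index IDs by their priority under the given sort order."""
    priorities = index_sort[order]  # list indexed by Message Index ID
    # Assumption: a smaller priority number comes first.
    return sorted(mids, key=lambda mid: priorities[mid])
```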
MSGIDS is a dictionary which maps the MessageID Hash of each message to its Message Index ID.
PTRS is a dictionary which maps the Message location(s) for a message to the Message Index ID.
TAGS is a dictionary in which the keys are tag IDs and the values are sets of Message Index IDs.
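The three lookup dictionaries can be thought of as being derived from INDEX. The Python 3 sketch below builds equivalent structures from a list of tab-delimited lines; the function name is hypothetical, the field positions follow the order of the message index fields given earlier, and it is an assumption that each pointer in a multi-pointer field becomes its own PTRS key.

```python
def build_lookups(index):
    """Build MSGIDS-, PTRS- and TAGS-like dictionaries from INDEX-like lines."""
    msgids, ptrs, tags = {}, {}, {}
    for mid, line in enumerate(index):
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # empty or incomplete entry
        msgids[fields[2]] = mid  # MessageID Hash -> Message Index ID
        for ptr in filter(None, fields[1].split(",")):
            ptrs[ptr] = mid  # assumption: one key per pointer
        for tag in filter(None, fields[10].split(",")):
            tags.setdefault(tag, set()).add(mid)
    return msgids, ptrs, tags
```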
At startup, the above index attributes are loaded by mailpile.search.MailIndex.set_msg_at_idx_pos(), called by ... .load(), called by ... .process_lines().
EMAILS_SAVED is the largest Contact Index ID in the externally stored index.
MODIFIED is a set of Message Index IDs of message index entries that have been modified but not written to external storage.
CACHE is a dictionary in which the keys are Message Index IDs and the values are structurally the same as items in INDEX.
Base64 - As used in the MessageID Hash, underscore = 5Fh represents value 63.
(Differs from RFC2045 and RFC4880.)
Base36 - Digits (in order) "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
Base36 (lower case) - Digits (in order) "0123456789abcdefghijklmnopqrstuvwxyz"
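The encodings above can be sketched in Python 3 as follows. These helpers are illustrative only and are not Mailpile's functions: the names are hypothetical, and msg_id_hash makes two assumptions not confirmed on this page, namely that the MessageID string is hashed as UTF-8 without further normalization and that value 62 keeps the standard '+' digit.

```python
import base64
import hashlib

B36_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def b36(number, digits=B36_DIGITS):
    """Encode a non-negative integer in Base36 without leading zeros."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return digits[0]
    out = []
    while number:
        number, remainder = divmod(number, 36)
        out.append(digits[remainder])
    return "".join(reversed(out))


def b36_decode(text):
    """Decode Base36 text; int() accepts both upper and lower case digits."""
    return int(text, 36)


def msg_id_hash(msg_id):
    """Return a 27-digit Base64 form of the SHA1 hash of a MessageID."""
    digest = hashlib.sha1(msg_id.encode("utf-8")).digest()  # 160 bits
    text = base64.b64encode(digest).decode("ascii")  # 28 chars with one '='
    # Drop the padding and use underscore for value 63.
    return text.rstrip("=").replace("/", "_")
```

Passing B36_DIGITS.lower() as the digits argument gives the lower case variant used for tag IDs.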