diff --git a/README.md b/README.md
index 825f1e5..2c28b90 100644
--- a/README.md
+++ b/README.md
@@ -30,6 +30,8 @@ mail-parser supports Python 3.
 ## mail-parser on Web
 
  - [Splunk app](https://splunkbase.splunk.com/app/4129/)
+ - [FreeBSD port](https://www.freshports.org/mail/py-mail-parser/)
+ - [Arch User Repository](https://aur.archlinux.org/packages/mailparser/)
 
 ## Description
 
@@ -58,6 +60,16 @@ There are other properties to get:
  - to domains
  - timezone
 
+The `attachments` property is a list of objects. Every object has the following keys:
+ - binary: it's true if the attachment is a binary
+ - charset
+ - content_transfer_encoding
+ - content-disposition
+ - content-id
+ - filename
+ - mail_content_type
+ - payload: attachment payload in base64
+
 To get custom headers you should replace "-" with "\_".
 Example for header `X-MSMail-Priority`:
 
@@ -171,6 +183,7 @@ mail.received
 mail.subject
 mail.text_plain: only text plain mail parts in a list
 mail.text_html: only text html mail parts in a list
+mail.text_not_managed: all not managed text (check the warning logs to find content subtype)
 mail.to
 mail.to_domains
 mail.timezone: returns the timezone, offset from UTC
diff --git a/README.rst b/README.rst
deleted file mode 100644
index 07dc244..0000000
--- a/README.rst
+++ /dev/null
@@ -1,263 +0,0 @@
-`PyPI version `__ `Build
-Status `__ `Coverage
-Status `__
-`BCH compliance `__
-` `__
-
-.. figure:: https://raw.githubusercontent.com/SpamScope/spamscope/develop/docs/logo/spamscope.png
-   :alt: SpamScope
-
-   SpamScope
-
-mail-parser
-===========
-
-Overview
---------
-
-mail-parser is not only a wrapper for
-`email `__ Python
-Standard Library. It give you an easy way to pass from raw mail to
-Python object that you can use in your code. It’s the key module of
-`SpamScope `__.
-
-mail-parser can parse Outlook email format (.msg). To use this feature,
-you need to install ``libemail-outlook-message-perl`` package. For
-Debian based systems:
-
-::
-
-   $ apt-get install libemail-outlook-message-perl
-
-For more details:
-
-::
-
-   $ apt-cache show libemail-outlook-message-perl
-
-mail-parser supports Python 3.
-
-mail-parser on Web
-------------------
-
-- `Splunk app `__
-
-Description
------------
-
-mail-parser takes as input a raw email and generates a parsed object.
-The properties of this object are the same name of `RFC
-headers `__:
-
-- bcc
-- cc
-- date
-- delivered_to
-- from\_ (not ``from`` because is a keyword of Python)
-- message_id
-- received
-- reply_to
-- subject
-- to
-
-There are other properties to get: - body - body html - body plain -
-headers - attachments - sender IP address - to domains - timezone
-
-To get custom headers you should replace “-” with “\_”. Example for
-header ``X-MSMail-Priority``:
-
-::
-
-   $ mail.X_MSMail_Priority
-
-The ``received`` header is parsed and splitted in hop. The fields
-supported are: - by - date - date_utc - delay (between two hop) -
-envelope_from - envelope_sender - for - from - hop - with
-
-mail-parser can detect defect in mail: -
-`defects `__:
-mail with some not compliance RFC part
-
-All properties have a JSON and raw property that you can get with: -
-name_json - name_raw
-
-Example:
-
-::
-
-   $ mail.to (Python object)
-   $ mail.to_json (JSON)
-   $ mail.to_raw (raw header)
-
-The command line tool use the JSON format.
-
-Defects
-~~~~~~~
-
-These defects can be used to evade the antispam filter. An example are
-the mails with a malformed boundary that can hide a not legitimate
-epilogue (often malware). This library can take these epilogues.
-
-Apache 2 Open Source License
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-mail-parser can be downloaded, used, and modified free of charge. It is
-available under the Apache 2 license.
-
-If you want support the project:
-
-`Donate `__
-
-Authors
--------
-
-Main Author
-~~~~~~~~~~~
-
-**Fedele Mantuano**:
-`LinkedIn `__
-
-Installation
-------------
-
-Clone repository
-
-::
-
-   git clone https://github.com/SpamScope/mail-parser.git
-
-and install mail-parser with ``setup.py``:
-
-::
-
-   $ cd mail-parser
-
-   $ python setup.py install
-
-or use ``pip``:
-
-::
-
-   $ pip install mail-parser
-
-Usage in a project
-------------------
-
-Import ``mailparser`` module:
-
-::
-
-   import mailparser
-
-   mail = mailparser.parse_from_bytes(byte_mail)
-   mail = mailparser.parse_from_file(f)
-   mail = mailparser.parse_from_file_msg(outlook_mail)
-   mail = mailparser.parse_from_file_obj(fp)
-   mail = mailparser.parse_from_string(raw_mail)
-
-Then you can get all parts
-
-::
-
-   mail.attachments: list of all attachments
-   mail.body
-   mail.date: datetime object in UTC
-   mail.defects: defect RFC not compliance
-   mail.defects_categories: only defects categories
-   mail.delivered_to
-   mail.from_
-   mail.get_server_ipaddress(trust="my_server_mail_trust")
-   mail.headers
-   mail.mail: tokenized mail in a object
-   mail.message: email.message.Message object
-   mail.message_as_string: message as string
-   mail.message_id
-   mail.received
-   mail.subject
-   mail.text_plain: only text plain mail parts in a list
-   mail.text_html: only text html mail parts in a list
-   mail.to
-   mail.to_domains
-   mail.timezone: returns the timezone, offset from UTC
-   mail_partial: returns only the mains parts of emails
-
-Usage from command-line
------------------------
-
-If you installed mailparser with ``pip`` or ``setup.py`` you can use it
-with command-line.
-
-These are all swithes:
-
-::
-
-   usage: mailparser [-h] (-f FILE | -s STRING | -k)
-                      [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}] [-j] [-b]
-                      [-a] [-r] [-t] [-dt] [-m] [-u] [-c] [-d] [-o]
-                      [-i Trust mail server string] [-p] [-z] [-v]
-
-   Wrapper for email Python Standard Library
-
-   optional arguments:
-     -h, --help            show this help message and exit
-     -f FILE, --file FILE  Raw email file (default: None)
-     -s STRING, --string STRING
-                           Raw email string (default: None)
-     -k, --stdin           Enable parsing from stdin (default: False)
-     -l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}, --log-level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
-                           Set log level (default: WARNING)
-     -j, --json            Show the JSON of parsed mail (default: False)
-     -b, --body            Print the body of mail (default: False)
-     -a, --attachments     Print the attachments of mail (default: False)
-     -r, --headers         Print the headers of mail (default: False)
-     -t, --to              Print the to of mail (default: False)
-     -dt, --delivered-to   Print the delivered-to of mail (default: False)
-     -m, --from            Print the from of mail (default: False)
-     -u, --subject         Print the subject of mail (default: False)
-     -c, --receiveds       Print all receiveds of mail (default: False)
-     -d, --defects         Print the defects of mail (default: False)
-     -o, --outlook         Analyze Outlook msg (default: False)
-     -i Trust mail server string, --senderip Trust mail server string
-                           Extract a reliable sender IP address heuristically
-                           (default: None)
-     -p, --mail-hash       Print mail fingerprints without headers (default:
-                           False)
-     -z, --attachments-hash
-                           Print attachments with fingerprints (default: False)
-     -sa, --store-attachments
-                           Store attachments on disk (default: False)
-     -ap ATTACHMENTS_PATH, --attachments-path ATTACHMENTS_PATH
-                           Path where store attachments (default: /tmp)
-     -v, --version         show program's version number and exit
-
-   It takes as input a raw mail and generates a parsed object.
-
-Example:
-
-.. code:: shell
-
-   $ mailparser -f example_mail -j
-
-This example will show you the tokenized mail in a JSON pretty format.
-
-From `raw
-mail `__
-to `parsed
-mail `__.
-
-Exceptions
-----------
-
-Exceptions hierarchy of mail-parser:
-
-::
-
-   MailParserError: Base MailParser Exception
-     |
-     \── MailParserOutlookError: Raised with Outlook integration errors
-     |
-     \── MailParserEnvironmentError: Raised when the environment is not correct
-     |
-     \── MailParserOSError: Raised when there is an OS error
-     |
-     \── MailParserReceivedParsingError: Raised when a received header cannot be parsed
diff --git a/mailparser/mailparser.py b/mailparser/mailparser.py
index 10aa9aa..62da6bc 100644
--- a/mailparser/mailparser.py
+++ b/mailparser/mailparser.py
@@ -246,6 +246,7 @@ def _reset(self):
         self._attachments = []
         self._text_plain = []
         self._text_html = []
+        self._text_not_managed = []
         self._defects = []
         self._defects_categories = set()
         self._has_defects = False
@@ -352,6 +353,7 @@ def parse(self):
                 charset_raw = p.get_content_charset()
                 log.debug("Charset {!r} part {!r}".format(charset, i))
 
+                # this is an attachment
                 if filename:
                     log.debug("Email part {!r} is an attachment".format(i))
                     log.debug("Filename {!r} part {!r}".format(filename, i))
@@ -395,6 +397,8 @@ def parse(self):
                         "content-disposition": content_disposition,
                         "charset": charset_raw,
                         "content_transfer_encoding": transfer_encoding})
+
+                # this isn't an attachments
                 else:
                     log.debug("Email part {!r} is not an attachment".format(i))
                     payload = ported_string(
@@ -402,8 +406,13 @@ def parse(self):
                     if payload:
                         if p.get_content_subtype() == 'html':
                             self._text_html.append(payload)
-                        else:
+                        elif p.get_content_subtype() == 'plain':
                             self._text_plain.append(payload)
+                        else:
+                            log.warning(
+                                'Email content {!r} not handled'.format(
+                                    p.get_content_subtype()))
+                            self._text_not_managed.append(payload)
         else:
             # Parsed object mail with all parts
             self._mail = self._make_mail()
@@ -528,7 +537,7 @@ def body(self):
         "--- mail_boundary ---"
         """
         return "\n--- mail_boundary ---\n".join(
-            self.text_plain + self.text_html)
+            self.text_plain + self.text_html + self.text_not_managed)
 
     @property
     def headers(self):
@@ -561,6 +570,13 @@ def text_html(self):
         """
         return self._text_html
 
+    @property
+    def text_not_managed(self):
+        """
+        Return a list of all text not managed of email.
+        """
+        return self._text_not_managed
+
     @property
     def date(self):
         """
diff --git a/mailparser/version.py b/mailparser/version.py
index 01c84d2..563eb1a 100644
--- a/mailparser/version.py
+++ b/mailparser/version.py
@@ -17,7 +17,7 @@
 limitations under the License.
 """
 
-__version__ = "3.10.0"
+__version__ = "3.11.0"
 
 if __name__ == "__main__":
     print(__version__)
diff --git a/setup.py b/setup.py
index 0096d5e..b95b541 100644
--- a/setup.py
+++ b/setup.py
@@ -25,7 +25,7 @@
 current = os.path.realpath(os.path.dirname(__file__))
 
 
-with io.open(os.path.join(current, 'README.rst'), encoding="utf-8") as f:
+with io.open(os.path.join(current, 'README.md'), encoding="utf-8") as f:
     long_description = f.read()
 
 with open(os.path.join(current, 'requirements.txt')) as f:
diff --git a/tests/mails/mail_test_14 b/tests/mails/mail_test_14
new file mode 100644
index 0000000..f259804
--- /dev/null
+++ b/tests/mails/mail_test_14
@@ -0,0 +1,31 @@
+From: example@example.com
+Subject: Test
+Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
+Mime-Version: 1.0
+Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
+To: rcpt@example.com
+
+--===============8544575414772382491==
+Content-Type: text/html; charset=UTF-8
+Content-Transfer-Encoding: 7bit
+
+
+Foo
+
+
+HTML here
+
+--===============8544575414772382491==
+Content-Type: image/png
+Content-Transfer-Encoding: base64
+Content-Disposition: inline
+
+UE5HIGhlcmU=
+--===============8544575414772382491==
+Content-Type: text/plain; charset="us-ascii"
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Disposition: inline
+
+Plaintext here.
+--===============8544575414772382491==--
\ No newline at end of file
diff --git a/tests/test_mail_parser.py b/tests/test_mail_parser.py
index 84b2cc8..9ac6a53 100644
--- a/tests/test_mail_parser.py
+++ b/tests/test_mail_parser.py
@@ -59,6 +59,7 @@
 mail_test_11 = os.path.join(base_path, 'mails', 'mail_test_11')
 mail_test_12 = os.path.join(base_path, 'mails', 'mail_test_12')
 mail_test_13 = os.path.join(base_path, 'mails', 'mail_test_13')
+mail_test_14 = os.path.join(base_path, 'mails', 'mail_test_14')
 mail_malformed_1 = os.path.join(base_path, 'mails', 'mail_malformed_1')
 mail_malformed_2 = os.path.join(base_path, 'mails', 'mail_malformed_2')
 mail_malformed_3 = os.path.join(base_path, 'mails', 'mail_malformed_3')
@@ -92,6 +93,13 @@ def test_html_field(self):
         self.assertIsInstance(mail.text_html_json, six.text_type)
         self.assertEqual(len(mail.text_html), 1)
 
+    def test_text_not_managed(self):
+        mail = mailparser.parse_from_file(mail_test_14)
+        self.assertIsInstance(mail.text_not_managed, list)
+        self.assertIsInstance(mail.text_not_managed_json, six.text_type)
+        self.assertEqual(len(mail.text_not_managed), 1)
+        self.assertEqual("PNG here", mail.text_not_managed[0])
+
     def test_get_mail_keys(self):
         mail = mailparser.parse_from_file(mail_test_11)
         all_parts = get_mail_keys(mail.message)
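Reviewer note: the behavioural change in `parse()` and `body` can be illustrated with a small standalone sketch. It uses only the standard library `email` module rather than mailparser internals (the real code decodes payloads through `ported_string`), so it is an approximation of the new branching, not the library's implementation. The names `classify_text_parts`, `join_body`, and `SAMPLE` are hypothetical and exist only for this illustration; `SAMPLE` is a trimmed message modelled on `tests/mails/mail_test_14`.

```python
import email

MAIL_BOUNDARY = "\n--- mail_boundary ---\n"

# Hypothetical sample, modelled on tests/mails/mail_test_14.
SAMPLE = "\n".join([
    "From: example@example.com",
    "Subject: Test",
    "Mime-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="XYZ"',
    "",
    "--XYZ",
    "Content-Type: text/html; charset=UTF-8",
    "",
    "<p>HTML here</p>",
    "--XYZ",
    "Content-Type: image/png",
    "Content-Transfer-Encoding: base64",
    "Content-Disposition: inline",
    "",
    "UE5HIGhlcmU=",
    "--XYZ",
    'Content-Type: text/plain; charset="us-ascii"',
    "",
    "Plaintext here.",
    "--XYZ--",
])


def classify_text_parts(raw_mail):
    """Return (text_plain, text_html, text_not_managed) for a raw mail string."""
    message = email.message_from_string(raw_mail)
    text_plain, text_html, text_not_managed = [], [], []

    for part in message.walk():
        # Multipart containers carry no payload of their own, and parts
        # with a filename are attachments, which this sketch skips.
        if part.is_multipart() or part.get_filename():
            continue

        raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if not raw:
            continue

        charset = part.get_content_charset() or "utf-8"
        try:
            payload = raw.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            payload = raw.decode("utf-8", errors="replace")

        subtype = part.get_content_subtype()
        if subtype == "html":
            text_html.append(payload)
        elif subtype == "plain":
            text_plain.append(payload)
        else:
            # Before this change, such parts fell into the plain-text list.
            text_not_managed.append(payload)

    return text_plain, text_html, text_not_managed


def join_body(text_plain, text_html, text_not_managed):
    """Join all parts in the same order the patched body property uses."""
    return MAIL_BOUNDARY.join(text_plain + text_html + text_not_managed)
```

Calling `classify_text_parts(SAMPLE)` puts the inline `image/png` part, which has no filename, in the third list instead of mixing it into the plain-text parts. This is the case the new `test_text_not_managed` test exercises.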