Merge branch 'release/2.0.0'
fedelemantuano committed Oct 10, 2017
2 parents d5b98ed + 5ba7277 commit 9463fb9
Showing 10 changed files with 212 additions and 21 deletions.
12 changes: 12 additions & 0 deletions .travis.yml
@@ -1,3 +1,5 @@
sudo: required

language: python

python:
@@ -7,8 +9,18 @@ python:
- "3.5"
- "3.6"

before_install:
- sudo apt-get -qq update

# Install msgconvert
- sudo apt-get install -y libemail-outlook-message-perl

# command to install dependencies
install:
# Install msgconvert
- export PERL_MM_USE_DEFAULT=1
- sudo cpan -f -i Email::Outlook::Message

- pip install -r requirements.txt
- pip install coveralls

17 changes: 14 additions & 3 deletions README
@@ -9,15 +9,22 @@ Overview
mail-parser is a wrapper for `email`_ Python Standard Library. It’s the
key module of `SpamScope`_.

From version 1.0.0rc1 mail-parser supports Python 3.
mail-parser can parse Outlook email format (.msg). To use this feature, you need to install ``libemail-outlook-message-perl`` package. For Debian based systems:

::

$ apt-get install libemail-outlook-message-perl
$ apt-cache show libemail-outlook-message-perl

mail-parser supports Python 3.

Description
-----------

mail-parser takes as input a raw mail and generates a parsed object.
This object is a tokenized email with some indicator:
- body - headers - subject - from - to - attachments - message id - date
- charset mail - sender IP address
- charset mail - sender IP address - receiveds

We have also two types of indicator: - anomalies: mail without message id or date
- `defects`_: mail with some not compliance RFC part
@@ -101,6 +108,7 @@ Then you can get all parts
mail.anomalies
mail.has_anomalies
mail.get_server_ipaddress(trust="my_server_mail_trust")
mail.receiveds

.. _email: https://docs.python.org/2/library/email.message.html
.. _SpamScope: https://github.com/SpamScope/spamscope
@@ -117,7 +125,8 @@ These are all switches:
::

usage: mailparser [-h] (-f FILE | -s STRING) [-j] [-b] [-a] [-r] [-t] [-m]
[-u] [-d] [-n] [-i Trust mail server string] [-p] [-z] [-v]
[-u] [-c] [-d] [-n] [-i Trust mail server string] [-p] [-z]
[-v]

Wrapper for email Python Standard Library

@@ -134,8 +143,10 @@ These are all switches:
-t, --to Print the to of mail (default: False)
-m, --from Print the from of mail (default: False)
-u, --subject Print the subject of mail (default: False)
-c, --receiveds Print all receiveds of mail (default: False)
-d, --defects Print the defects of mail (default: False)
-n, --anomalies Print the anomalies of mail (default: False)
-o, --outlook Analyze Outlook msg (default: False)
-i Trust mail server string, --senderip Trust mail server string
Extract a reliable sender IP address heuristically
(default: None)
30 changes: 24 additions & 6 deletions README.md
@@ -7,9 +7,22 @@

## Overview

mail-parser is a wrapper for [email](https://docs.python.org/2/library/email.message.html) Python Standard Library. It's the key module of [SpamScope](https://github.com/SpamScope/spamscope).
mail-parser is a wrapper for [email](https://docs.python.org/2/library/email.message.html) Python Standard Library.
It's the key module of [SpamScope](https://github.com/SpamScope/spamscope).

From version 1.0.0rc1 mail-parser supports Python 3.
mail-parser can parse Outlook email format (.msg). To use this feature, you need to install `libemail-outlook-message-perl` package. For Debian based systems:

```
$ apt-get install libemail-outlook-message-perl
```

For more details:

```
$ apt-cache show libemail-outlook-message-perl
```
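
Before using the Outlook code path, it can help to confirm that the converter is actually installed. A minimal sketch, assuming the Debian package puts an executable named `msgconvert` on `PATH` (the name used in the `.travis.yml` comments of this commit). The helper `has_msgconvert` is hypothetical and not part of mail-parser, and `shutil.which` requires Python 3.3 or later:

```python
import shutil


def has_msgconvert(executable="msgconvert"):
    """Return True if `executable` can be found on PATH.

    The default name is an assumption based on the comments in this
    commit; pass a different name if your system installs it otherwise.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if has_msgconvert():
        print("msgconvert found, .msg parsing should be available")
    else:
        print("msgconvert not found, install libemail-outlook-message-perl")
```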

mail-parser supports Python 3.

## Description

@@ -24,6 +37,7 @@ mail-parser takes as input a raw email and generates a parsed object. This objec
- date
- charset mail
- sender IP address
- receiveds

We have also two types of indicator:
- anomalies: mail without message id or date
@@ -56,15 +70,15 @@ git clone https://github.com/SpamScope/mail-parser.git
and install mail-parser with `setup.py`:

```
cd mail-parser
$ cd mail-parser
python setup.py install
$ python setup.py install
```

or use `pip`:

```
pip install mail-parser
$ pip install mail-parser
```

## Usage in a project
@@ -99,6 +113,7 @@ mail.has_defects
mail.anomalies
mail.has_anomalies
mail.get_server_ipaddress(trust="my_server_mail_trust")
mail.receiveds
```

## Usage from command-line
@@ -109,7 +124,8 @@ These are all switches:

```
usage: mailparser.py [-h] (-f FILE | -s STRING) [-j] [-b] [-a] [-r] [-t] [-m]
[-u] [-d] [-n] [-i Trust mail server string] [-p] [-z] [-v]
[-u] [-c] [-d] [-n] [-i Trust mail server string] [-p] [-z]
[-v]
Wrapper for email Python Standard Library
@@ -125,8 +141,10 @@ optional arguments:
-t, --to Print the to of mail (default: False)
-m, --from Print the from of mail (default: False)
-u, --subject Print the subject of mail (default: False)
-c, --receiveds Print all receiveds of mail (default: False)
-d, --defects Print the defects of mail (default: False)
-n, --anomalies Print the anomalies of mail (default: False)
-o, --outlook Analyze Outlook msg (default: False)
-i Trust mail server string, --senderip Trust mail server string
Extract a reliable sender IP address heuristically
(default: None)
2 changes: 1 addition & 1 deletion mailparser/__init__.py
@@ -18,5 +18,5 @@
"""


from .mailparser import (MailParser, parse_from_file,
from .mailparser import (MailParser, parse_from_file, parse_from_file_msg,
parse_from_string, parse_from_bytes)
12 changes: 11 additions & 1 deletion mailparser/__main__.py
@@ -120,6 +120,13 @@ def get_args():
action="store_true",
help="Print the anomalies of mail")

parser.add_argument(
"-o",
"--outlook",
dest="outlook",
action="store_true",
help="Analyze Outlook msg")

parser.add_argument(
"-i",
"--senderip",
@@ -184,7 +191,10 @@ def main():
args = get_args().parse_args()

if args.file:
parser = mailparser.parse_from_file(args.file)
if args.outlook:
parser = mailparser.parse_from_file_msg(args.file)
else:
parser = mailparser.parse_from_file(args.file)
elif args.string:
parser = mailparser.parse_from_string(args.string)

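The dispatch added to `main()` can be sketched in isolation. This is a simplified illustration, not the project's CLI: it keeps only the `-f`, `-s`, and `-o` switches, and `choose_loader` is a hypothetical helper that returns the name of the function `main()` would call. It makes one property of the change visible: `-o` only has an effect together with `-f`.

```python
import argparse


def build_parser():
    """Reduced parser with only the switches relevant to the dispatch."""
    parser = argparse.ArgumentParser(description="dispatch sketch")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", dest="file")
    source.add_argument("-s", "--string", dest="string")
    parser.add_argument(
        "-o", "--outlook", dest="outlook", action="store_true",
        help="Analyze Outlook msg")
    return parser


def choose_loader(args):
    """Return the name of the loader the diff's main() would pick."""
    if args.file:
        if args.outlook:
            return "parse_from_file_msg"
        return "parse_from_file"
    # With -s the outlook flag is never consulted.
    return "parse_from_string"


if __name__ == "__main__":
    print(choose_loader(build_parser().parse_args()))
```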
76 changes: 70 additions & 6 deletions mailparser/mailparser.py
@@ -21,14 +21,16 @@
import datetime
import email
import logging
import os
import re

import ipaddress
import six
import simplejson as json

from .utils import (ported_string, decode_header_part,
ported_open, find_between)
from .utils import (
ported_string, decode_header_part, ported_open,
find_between, msgconvert)


log = logging.getLogger(__name__)
@@ -49,6 +51,18 @@ def parse_from_file(fp):
return MailParser.from_file(fp).parse()


def parse_from_file_msg(fp):
"""Parsing email from file Outlook msg.
Args:
fp (string): file path of raw Outlook email
Returns:
Instance of MailParser with raw email parsed
"""
return MailParser.from_file_msg(fp).parse()


def parse_from_string(s):
"""Parsing email from string.
@@ -74,7 +88,8 @@ def parse_from_bytes(bt):


class MailParser(object):
"""MailParser package provides a standard parser that understands
"""
MailParser package provides a standard parser that understands
most email document structures like official email package.
MailParser handles the encoding of email and splits the raw email for you.
"""
@@ -84,20 +99,40 @@ def __init__(self, message=None):
self._message = message

@classmethod
def from_file(cls, fp):
def from_file(cls, fp, is_outlook=False):
"""Init a new object from a file path.
Args:
fp (string): file path of raw email
is_outlook (boolean): if True is an Outlook email
Returns:
Instance of MailParser
"""

with ported_open(fp) as f:
message = email.message_from_file(f)

if is_outlook:
os.remove(fp)

return cls(message)

@classmethod
def from_file_msg(cls, fp):
"""
Init a new object from an Outlook message file,
mime type: application/vnd.ms-outlook
Args:
fp (string): file path of raw Outlook email
Returns:
Instance of MailParser
"""
f, _ = msgconvert(fp)
return cls.from_file(f, True)

@classmethod
def from_string(cls, s):
"""Init a new object from a string.
@@ -143,6 +178,21 @@ def parse_from_file(self, fp):
self._message = email.message_from_file(f)
return self.parse()

def parse_from_file_msg(self, fp):
"""Parse the raw email from a file Outlook.
Args:
fp (string): file path of raw email
Returns:
Instance of MailParser
"""
t, _ = msgconvert(fp)
with ported_open(t) as f:
self._message = email.message_from_file(f)
os.remove(t)
return self.parse()
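
Both Outlook code paths follow the same pattern: convert the `.msg` file to a temporary file, parse that file, then delete it. One possible hardening, sketched here under stated assumptions, is to put the delete in a `finally` clause so the temporary file is removed even if parsing raises. The names `parse_and_cleanup` and `fake_convert` are hypothetical; `fake_convert` merely stands in for the commit's `msgconvert` helper, whose implementation is not shown in this diff.

```python
import email
import os
import tempfile


def parse_and_cleanup(path, convert):
    """Convert `path`, parse the result, and always delete the temp file.

    `convert` is any callable that takes the source path and returns
    the path of a temporary file containing the converted message.
    """
    temp_path = convert(path)
    try:
        with open(temp_path) as handle:
            return email.message_from_file(handle)
    finally:
        # Runs whether parsing succeeded or raised.
        os.remove(temp_path)


def fake_convert(path):
    """Stand-in converter: writes a tiny message to a temp file."""
    fd, temp_path = tempfile.mkstemp(suffix=".eml")
    with os.fdopen(fd, "w") as handle:
        handle.write("Subject: converted from %s\n\nbody\n" % path)
    return temp_path


if __name__ == "__main__":
    message = parse_and_cleanup("sample.msg", fake_convert)
    print(message["Subject"])
```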

def parse_from_string(self, s):
"""Parse the raw email from a string.
@@ -215,6 +265,7 @@ def _make_mail(self):
"message_id": self.message_id,
"subject": self.subject,
"to": self.to_,
"receiveds": self.receiveds_obj,
"has_defects": self.has_defects,
"has_anomalies": self.has_anomalies}

@@ -270,21 +321,34 @@ def parse(self):
if not p.is_multipart():
filename = ported_string(p.get_filename())
charset = p.get_content_charset('utf-8')
binary = False

if filename:
mail_content_type = ported_string(p.get_content_type())
transfer_encoding = ported_string(
p.get('content-transfer-encoding', '')).lower()

if transfer_encoding == "base64":
if transfer_encoding in ("base64",):
payload = p.get_payload(decode=False)
binary = True
elif transfer_encoding in ("quoted-printable",):
d = p.get_payload(decode=True)
e = p.get_payload(decode=False)

# In this case maybe is a binary with malformed base64
if d == e:
payload = e
binary = True
else:
payload = ported_string(d, encoding=charset)
else:
payload = ported_string(
p.get_payload(decode=True), encoding=charset)

self._attachments.append({
"filename": filename,
"payload": payload,
"binary": binary,
"mail_content_type": mail_content_type,
"content_transfer_encoding": transfer_encoding})
else:
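
A note on membership tests like the ones in this hunk: in Python, parentheses around a single string do not create a tuple, so `in` against such a value is a substring test rather than an exact match. An empty string passes that substring test, which matters whenever the transfer encoding header is missing and the value is empty. The sketch below contrasts the two forms; both function names are hypothetical and exist only for illustration.

```python
def is_base64(transfer_encoding):
    """Exact match against a one-element tuple (note the trailing comma)."""
    return transfer_encoding in ("base64",)


def is_base64_substring(transfer_encoding):
    """Substring test: ("base64") is just the string "base64"."""
    return transfer_encoding in ("base64")


if __name__ == "__main__":
    for value in ("base64", "base", ""):
        print(repr(value), is_base64(value), is_base64_substring(value))
```

With a single candidate value, a plain `==` comparison expresses the same intent as the one-element tuple.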
@@ -390,7 +454,7 @@ def headers(self):
"""Return only the headers. """
s = ""
for k, v in self.message.items():
v_u = decode_header_part(v)
v_u = re.sub(" +", " ", decode_header_part(v))
s += k + ": " + v_u + "\n"
return s

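The change to `headers` collapses runs of spaces in each decoded header value. A small sketch of what that regular expression does and does not touch; `collapse_spaces` is a hypothetical name used only here. The pattern `" +"` matches the space character alone, so tabs and newlines pass through unchanged.

```python
import re


def collapse_spaces(value):
    """Replace every run of one or more spaces with a single space."""
    return re.sub(" +", " ", value)


if __name__ == "__main__":
    print(collapse_spaces("from  mail.example.com    by mx.example.org"))
```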