enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <[email protected]>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <[email protected]>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <[email protected]>
MthwRobinson and MthwRobinson authored Jun 28, 2023
1 parent 3845777 commit 44411ec
Showing 13 changed files with 982 additions and 126 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,11 @@
## 0.7.10-dev4
## 0.7.10

### Enhancements

* Adds a `max_partition` parameter to `partition_text`, `partition_pdf`, `partition_email`,
`partition_msg` and `partition_xml` that sets a limit for the size (in characters) of individual
document elements. Defaults to `1500` for everything except `partition_xml`, which has
a default value of `None`.
* DRY connector refactor

### Features
14 changes: 7 additions & 7 deletions README.md
@@ -34,8 +34,8 @@
</h3>

<p>While access to the hosted Unstructured API will remain free, API Keys will soon be required to make requests. To prevent any disruption, get yours <a href="https://www.unstructured.io/api-key/">here</a> now and start using it today!</p>
<p>Check out the <a href="https://github.com/Unstructured-IO/unstructured-api#--">readme</a> here to get started making API calls.
We'd love to hear your feedback, so let us know how it goes in our
community Slack. And stay tuned for improvements to both quality and performance!</p>

@@ -98,17 +98,17 @@ about the library.
| Document Type | Partition Function | Strategies | Table Support | Options |
| --- | --- | --- | --- | --- |
| CSV Files (`.csv`) | `partition_csv` | N/A | Yes | None |
| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding |
| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding |
| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding; Max Partition |
| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding; Max Partition |
| EPubs (`.epub`) | `partition_epub` | N/A | Yes | Include Page Breaks |
| Excel Documents (`.xlsx`/`.xls`) | `partition_xlsx` | N/A | Yes | None |
| HTML Pages (`.html`) | `partition_html` | N/A | No | Encoding; Include Page Breaks |
| Images (`.png`/`.jpg`) | `partition_image` | `"auto"`, `"hi_res"`, `"ocr_only"` | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy |
| Markdown (`.md`) | `partition_md` | N/A | Yes | Include Page Breaks |
| Org Mode (`.org`) | `partition_org` | N/A | Yes | Include Page Breaks |
| Open Office Documents (`.odt`) | `partition_odt` | N/A | Yes | None |
| PDFs (`.pdf`) | `partition_pdf` | `"auto"`, `"fast"`, `"hi_res"`, `"ocr_only"` | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy |
| Plain Text (`.txt`) | `partition_text` | N/A | No | Encoding, Paragraph Grouper |
| PDFs (`.pdf`) | `partition_pdf` | `"auto"`, `"fast"`, `"hi_res"`, `"ocr_only"` | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy |
| Plain Text (`.txt`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper |
| Power Points (`.ppt`) | `partition_ppt` | N/A | Yes | Include Page Breaks |
| Power Points (`.pptx`) | `partition_pptx` | N/A | Yes | Include Page Breaks |
| ReStructured Text (`.rst`) | `partition_rst` | N/A | Yes | Include Page Breaks |
@@ -118,7 +118,7 @@ about the library.
| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | None |
| Word Documents (`.doc`) | `partition_doc` | N/A | Yes | Include Page Breaks |
| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | Include Page Breaks |
| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; XML Keep Tags |
| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; Max Partition; XML Keep Tags |



39 changes: 37 additions & 2 deletions docs/source/bricks.rst
@@ -163,7 +163,7 @@ Examples:
from unstructured.partition.tsv import partition_tsv
elements = partition_tsv(filename="example-docs/stanley-cups.tsv")
print(elements[0].metadata.text_as_html)
``partition_doc``
@@ -265,6 +265,14 @@ Examples:
elements = partition_email(text=text, include_headers=True)
``partition_email`` includes a ``max_partition`` parameter that indicates the maximum character
length for a document element.
This parameter only applies if ``"text/plain"`` is selected as the ``content_source``.
The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
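
To make the ``"text/plain"`` condition concrete, here is a minimal sketch that uses only Python's standard library. It is not the ``unstructured`` implementation: the helper name ``plain_text_chunks`` and its whitespace-based wrapping are assumptions made for illustration. The limit is enforced only when the plain-text body is selected, and ``None`` turns it off.

```python
import textwrap
from email.message import EmailMessage
from typing import List, Optional


def plain_text_chunks(
    msg: EmailMessage,
    content_source: str = "text/plain",
    max_partition: Optional[int] = 1500,
) -> List[str]:
    """Hypothetical helper: return body text, capped per chunk for text/plain."""
    subtype = content_source.split("/")[-1]  # "plain" or "html"
    body = msg.get_body(preferencelist=(subtype,))
    if body is None:
        return []
    text = body.get_content()
    # The limit is skipped for non-plain sources or when disabled with None.
    if content_source != "text/plain" or max_partition is None:
        return [text]
    chunks: List[str] = []
    for paragraph in text.split("\n\n"):
        normalized = " ".join(paragraph.split())
        if normalized:
            # textwrap.wrap keeps every piece at or below `width` characters.
            chunks.extend(textwrap.wrap(normalized, width=max_partition))
    return chunks


message = EmailMessage()
message.set_content("First paragraph of the email body.\n\nSecond paragraph.")
message.add_alternative("<p>HTML version</p>", subtype="html")

for chunk in plain_text_chunks(message, max_partition=20):
    print(len(chunk), chunk)
```

With ``content_source="text/html"`` the same helper returns the HTML body as a single untouched string, mirroring the statement that the parameter has no effect outside ``"text/plain"``.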


``partition_epub``
---------------------

@@ -423,6 +431,13 @@ Examples:
elements = partition_msg(filename="example-docs/fake-email.msg")
``partition_msg`` includes a ``max_partition`` parameter that indicates the maximum character
length for a document element.
This parameter only applies if ``"text/plain"`` is selected as the ``content_source``.
The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.


``partition_multiple_via_api``
------------------------------
@@ -531,7 +546,7 @@ If the PDF text is not extractable, ``partition_pdf`` will fall back to ``"ocr_o
``"fast"`` strategy in most cases where the PDF has extractable text.

If a PDF is copy protected, ``partition_pdf`` can process the document with the ``"hi_res"`` strategy (which
will treat it like an image), but cannot process the document with the ``"fast"`` strategy.
If the user chooses ``"fast"`` on a copy protected PDF, ``partition_pdf`` will fall back to the ``"hi_res"``
strategy. If ``detectron2`` is not installed, ``partition_pdf`` will fail for copy protected
PDFs because the document will not be processable by any of the available methods.
@@ -549,6 +564,14 @@ Examples:
elements = partition_pdf("example-docs/copy-protected.pdf", strategy="fast")
``partition_pdf`` includes a ``max_partition`` parameter that indicates the maximum character
length for a document element.
This parameter only applies if the ``"ocr_only"`` strategy is used for partitioning.
The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.


``partition_ppt``
---------------------

@@ -685,6 +708,12 @@ Examples:
partition_text(text=text, paragraph_grouper=group_broken_paragraphs)
``partition_text`` includes a ``max_partition`` parameter that indicates the maximum character
length for a document element.
The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
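
As a rough illustration of what a character limit on elements implies, the sketch below splits long text at sentence boundaries and falls back to hard cuts for a single oversized sentence. The function ``split_to_max_length`` is hypothetical and written for this example only. The library's own splitting logic may differ in how it chooses boundaries.

```python
import re
from typing import List, Optional


def split_to_max_length(text: str, max_partition: Optional[int] = 1500) -> List[str]:
    """Hypothetical helper: split text into chunks of at most max_partition characters."""
    # None disables the limit, matching the documented behaviour of the parameter.
    if max_partition is None or len(text) <= max_partition:
        return [text]

    chunks: List[str] = []
    current = ""
    # Split after sentence-ending punctuation, keeping the punctuation itself.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # A single sentence longer than the limit is cut at hard boundaries.
        while len(sentence) > max_partition:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_partition])
            sentence = sentence[max_partition:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_partition:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["This is a sentence."] * 200)
pieces = split_to_max_length(long_text, max_partition=100)
print(len(pieces), max(len(piece) for piece in pieces))
```

Packing whole sentences greedily keeps each chunk readable. The hard-cut fallback only triggers when no sentence boundary exists within the limit.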


``partition_via_api``
---------------------
@@ -752,6 +781,12 @@ If ``xml_keep_tags=True``, the function returns tag information in addition to t
elements = partition_xml(filename="example-docs/factbook.xml", xml_keep_tags=False)
``partition_xml`` includes a ``max_partition`` parameter that indicates the maximum character length for a document element.
The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
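
For XML, the effect of a per-element limit can be sketched with the standard library's ``xml.etree.ElementTree``. The helper ``xml_text_chunks`` below is an assumption for illustration and is not how ``partition_xml`` is implemented. It takes the limit as an explicit argument, reads only each node's leading text, and ignores tail text.

```python
import textwrap
import xml.etree.ElementTree as ET
from typing import List, Optional


def xml_text_chunks(xml_string: str, max_partition: Optional[int]) -> List[str]:
    """Hypothetical helper: collect node text, capped at max_partition characters."""
    root = ET.fromstring(xml_string)
    chunks: List[str] = []
    for node in root.iter():
        # Normalise whitespace; nodes without leading text are skipped.
        text = " ".join((node.text or "").split())
        if not text:
            continue
        if max_partition is None:
            chunks.append(text)  # limit disabled
        else:
            chunks.extend(textwrap.wrap(text, width=max_partition))
    return chunks


document = "<factbook><country>" + "Long description. " * 20 + "</country></factbook>"
for chunk in xml_text_chunks(document, max_partition=60):
    print(len(chunk), chunk)
```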



########
Cleaning