Releases: Unstructured-IO/unstructured
Releases · Unstructured-IO/unstructured
0.7.10
0.7.10
Enhancements
- Adds a
max_partition
parameter topartition_text
,partition_pdf
,partition_email
,
partition_msg
andpartition_xml
that sets a limit for the size of an individual
document elements. Defaults to1500
for everything exceptpartition_xml
, which has
a default value ofNone
. - DRY connector refactor
Features
hi_res
model for pdfs and images is selectable via environment variable.
Fixes
- CSV check now ignores escaped commas.
- Fix for filetype exploration util when file content does not have a comma.
- Adds negative lookahead to bullet pattern to avoid detecting plain text line
breaks like-------
as list items. - Fix pre tag parsing for
partition_html
- Fix lookup error for annotated Arabic and Hebrew encodings
0.7.9
0.7.8
0.7.8
Enhancements
Features
- Adds Google Cloud Service connector
Fixes
- Updates the
parse_email
forpartition_eml
so thatunstructured-api
passes the smoke tests partition_email
now works if there is no message content- Updates the
"fast"
strategy forpartition_pdf
so that it's able to recursively - Adds recursive functionality to all fsspec connectors
- Adds generic --recursive ingest flag
0.7.7
0.7.7
Enhancements
- Adds functionality to replace the
MIME
encodings foreml
files with one of the common encodings if aunicode
error occurs - Adds missed file-like object handling in
detect_file_encoding
- Adds functionality to extract charset info from
eml
files
Features
- Added coordinate system class to track coordinate types and convert to different coordinate
Fixes
- Adds an
html_assemble_articles
kwarg topartition_html
to enable users to capture
control whether content outside of<article>
tags is captured when
<article>
tags are present. - Check for the
xml
attribute onelement
before looking for pagebreaks inpartition_docx
.
0.7.6
0.7.6
Enhancements
- Convert fast startegy to ocr_only for images
- Adds support for page numbers in
.docx
and.doc
when user or renderer
created page breaks are present. - Adds retry logic for the unstructured-ingest Biomed connector
Features
- Provides users with the ability to extract additional metadata via regex.
- Updates
partition_docx
to include headers and footers in the output. - Create
partition_tsv
and associated tests. Make additional changes todetect_filetype
.
Fixes
- Remove fake api key in test
partition_via_api
since we now require valid/empty api keys - Page number defaults to
None
instead of1
when page number is not present in the metadata.
A page number ofNone
indicates that page numbers are not being tracked for the document
or that page numbers do not apply to the element in question.. - Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide
in case the shape.top and shape.left attributes areNone
.
0.7.5
0.7.5
Enhancements
- Adds functionality to sort elements in
partition_pdf
forfast
strategy - Adds ingest tests with
--fast
strategy on PDF documents - Adds --api-key to unstructured-ingest
Features
- Adds
partition_rst
for processed ReStructured Text documents.
Fixes
- Adds handling for emails that do not have a datetime to extract.
- Adds pdf2image package as core requirement of unstructured (with no extras)
0.7.4
0.7.4
Enhancements
- Allows passing kwargs to request data field for
partition_via_api
andpartition_multiple_via_api
- Enable MIME type detection if libmagic is not available
- Adds handling for empty files in
detect_filetype
andpartition
.
Features
Fixes
- Reslove
grpcio
import issue onweaviate.schema.validate_schema
for python 3.9 and 3.10 - Remove building
detectron2
from source in Dockerfile
0.7.3
0.7.3
Enhancements
- Update IngestDoc abstractions and add data source metadata in ElementMetadata
Features
Fixes
- Pass
strategy
parameter down frompartition
forpartition_image
- Filetype detection if a CSV has a
text/plain
MIME type convert_office_doc
no longers prints file conversion info messages to stdout.partition_via_api
reflects the actual filetype for the file processed in the API.
0.7.2
0.7.2
Enhancements
- Adds an optional encoding kwarg to
elements_to_json
andelements_from_json
- Bump version of base image to use new stable version of tesseract
Features
Fixes
- Update the
read_txt_file
utility function to keep usingspooled_to_bytes_io_if_needed
for xml - Add functionality to the
read_txt_file
utility function to handle file-like object from URL - Remove the unused parameter
encoding
frompartition_pdf
- Change auto.py to have a
None
default for encoding - Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
- Adds benchmark test with test docs in example-docs
- Re-enable test_upload_label_studio_data_with_sdk
- File detection now detects code files as plain text
- Adds
tabulate
explicitly to dependencies - Fixes an issue in
metadata.page_number
of pptx files - Adds showing help if no parameters passed
0.7.1
0.7.1
Enhancements
Features
- Add
stage_for_weaviate
to stageunstructured
outputs for upload to Weaviate, along with
a helper function for defining a class to use in Weaviate schemas. - Builds from Unstructured base image, built off of Rocky Linux 8.7, this resolves almost all CVE's in the image.