Releases: Unstructured-IO/unstructured
0.6.0
Enhancements
- Adds an `ssl_verify` kwarg to `partition` and `partition_html` to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
- Allows users to pass in OCR language to `partition_pdf` and `partition_image` through the `ocr_language` kwarg. `ocr_language` corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.
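
To illustrate what turning off SSL verification means at the HTTP layer, the sketch below builds an `ssl.SSLContext` from a boolean flag using only the Python standard library. The helper name `build_ssl_context` is hypothetical. This is not how `partition` or `partition_html` handle `ssl_verify` internally; it only mirrors the on-by-default behavior described above.

```python
import ssl


def build_ssl_context(ssl_verify: bool = True) -> ssl.SSLContext:
    """Hypothetical helper: return a TLS context that honors an ssl_verify flag."""
    context = ssl.create_default_context()  # certificate verification is on by default
    if not ssl_verify:
        # check_hostname must be disabled before verify_mode can be set to CERT_NONE
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

A context built this way could be passed to `urllib.request.urlopen(url, context=...)`. Disabling verification is generally only appropriate for trusted internal hosts.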
Features
- Table extraction is now possible for PDFs from `partition` and `partition_pdf`.
- Adds support for extracting attachments from `.msg` files.
Fixes
0.5.13
Enhancements
- Allow headers to be passed into `partition` when `url` is used.
Features
- `bytes_string_to_string` cleaning brick for bytes string output.
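
A "bytes string" here is text whose characters are really raw byte values, for example UTF-8 bytes that were read as one character each. The following is a minimal standard-library sketch of that kind of repair. The function name `bytes_string_to_string_sketch` is hypothetical and is not presented as the library's implementation.

```python
def bytes_string_to_string_sketch(text: str, encoding: str = "utf-8") -> str:
    """Hypothetical stand-in: treat each character's code point as one byte, then decode."""
    # bytes() raises ValueError if any code point is above 255
    raw = bytes(ord(char) for char in text)
    return raw.decode(encoding)


# The four characters below are the UTF-8 bytes of a single emoji, mis-read one byte per character.
repaired = bytes_string_to_string_sketch("Hello \xf0\x9f\x98\x80")
```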
Fixes
- Fixed typo in call to `exactly_one` in `partition_json`
- unstructured-documents encode xml string if document_tree is `None` in `_read_xml`.
- Update to `_read_xml` so that Markdown files with embedded HTML process correctly.
- Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
- unstructured-partition-text_type exceeds_cap_ratio fix returns and how capitalization ratios are calculated
- `partition_pdf` and `partition_text` group broken paragraphs to avoid fragmented `NarrativeText` elements.
- .json files resolved as "application/json" on centos7 (or other installs with older libmagic libs)
0.5.12
Enhancements
- Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
- Use the image registry as a cache when building Docker images.
- Adds the ability for `partition_text` to group together broken paragraphs.
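
"Broken paragraphs" are paragraphs whose lines were hard-wrapped with single newlines, so each visual line would otherwise become its own element. The sketch below shows one simple way to regroup them, assuming blank lines separate paragraphs. The function name and that splitting rule are illustrative assumptions, not the logic `partition_text` actually uses.

```python
import re


def group_broken_paragraphs_sketch(text: str) -> str:
    """Hypothetical helper: join hard-wrapped lines, keeping blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)  # a blank line marks a paragraph boundary
    grouped = []
    for paragraph in paragraphs:
        # Collapse the single newlines inside a paragraph into spaces
        joined = " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        if joined:
            grouped.append(joined)
    return "\n\n".join(grouped)


wrapped = "The big brown fox\nwas walking down the lane.\n\nAt the end of the lane,\nthe fox met a bear."
regrouped = group_broken_paragraphs_sketch(wrapped)
```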
Features
- Add --partition-by-api parameter to unstructured-ingest
- Added `partition_rtf` for processing rich text files.
- `partition` now accepts a `url` kwarg in addition to `file` and `filename`.
Fixes
- Allow encoding to be passed into `replace_mime_encodings`.
- unstructured-ingest connector-specific dependencies are imported on demand.
- unstructured-ingest --flatten-metadata supported for local connector.
- unstructured-ingest fix runtime error when using --metadata-include.
0.5.11
Enhancements
Features
Fixes
- Guard against null style attribute in docx document elements
- Update HTML encoding to better support foreign language characters
0.5.10
Enhancements
- Updated inference package
- Add sender, recipient, date, and subject to element metadata for emails
Features
- Added `--download-only` parameter to `unstructured-ingest`
Fixes
- FileNotFound error when filename is provided but file is not on disk
0.5.9
0.5.8
Enhancements
- Update `elements_to_json` to return string when filename is not specified
- `elements_from_json` may take a string instead of a filename with the `text` kwarg
- `detect_filetype` now does a final fallback to file extension.
- Empty tags are now skipped during the depth check for HTML processing.
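
The file-extension fallback for `detect_filetype` can be pictured as a last-resort lookup that is used only when content sniffing yields nothing. The sketch below is a simplified illustration. The function name, the `sniffed_mime` parameter, and the small extension table are all assumptions and do not reflect the library's actual detection code.

```python
import os
from typing import Optional

# Illustrative subset only; a real table would cover many more extensions.
EXTENSION_TO_MIME = {
    ".json": "application/json",
    ".txt": "text/plain",
    ".html": "text/html",
}


def detect_mime_sketch(filename: str, sniffed_mime: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: prefer a sniffed MIME type, else fall back to the extension."""
    if sniffed_mime:
        return sniffed_mime
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MIME.get(extension.lower())  # None when the extension is unknown
```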
Features
- Add local file system to `unstructured-ingest`
- Add `--max-docs` parameter to `unstructured-ingest`
- Added `partition_msg` for processing MSFT Outlook .msg files.
Fixes
- `convert_file_to_text` now passes through the `source_format` and `target_format` kwargs. Previously they were hard coded.
- Partitioning functions that accept a `text` kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).
- `partition_json` no longer fails if the input is an empty list.
- Fixed bug in `chunk_by_attention_window` that caused the last word in segments to be cut off in some cases.
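
The property behind the `chunk_by_attention_window` fix is that splitting text into windows must not lose any words. The sketch below demonstrates that property with a plain word-count window. It is a deliberate simplification: the function name is hypothetical, and counting whitespace-separated words stands in for whatever tokenizer-based limit the real brick applies.

```python
from typing import List


def chunk_by_word_window_sketch(text: str, max_words: int) -> List[str]:
    """Hypothetical helper: split text into segments of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list one window at a time; the final slice keeps the tail
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


segments = chunk_by_word_window_sketch("the quick brown fox jumps", max_words=2)
```

Rejoining the segments and splitting again should reproduce the original word list, which makes a convenient regression check for dropped words.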
BREAKING CHANGES
- `stage_for_transformers` now returns a list of elements, making it consistent with other staging bricks
0.5.7
Enhancements
- Refactored codebase using `exactly_one`
- Adds ability to pass headers when passing a url in partition_html()
- Added optional `content_type` and `file_filename` parameters to `partition()` to bypass file detection
Features
- Add `--flatten-metadata` parameter to `unstructured-ingest`
- Add `--fields-include` parameter to `unstructured-ingest`
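
As a rough picture of what flattening metadata could look like, the sketch below lifts the keys of a nested `metadata` dictionary to the top level of an element dictionary. The element shape, the function name, and the resulting layout are assumptions for illustration. The exact output that `--flatten-metadata` produces is not described in these notes.

```python
from typing import Any, Dict


def flatten_metadata_sketch(element: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge a nested 'metadata' dict into the element's top level."""
    flat = {key: value for key, value in element.items() if key != "metadata"}
    # On a key collision the metadata value overwrites the top-level one
    flat.update(element.get("metadata") or {})
    return flat


element = {
    "type": "Title",
    "text": "Introduction",
    "metadata": {"filename": "report.pdf", "page_number": 1},
}
flattened = flatten_metadata_sketch(element)
```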
Fixes
0.5.6
- Fix problem with PDF partition (duplicated test)
Enhancements
- `contains_english_word()`, used heavily in text processing, is 10x faster.
Features
- Add `--metadata-include` and `--metadata-exclude` parameters to `unstructured-ingest`
- Add `clean_non_ascii_chars` to remove non-ascii characters from unicode string
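
Removing non-ASCII characters from a unicode string can be done with an encode and decode round trip that ignores unencodable characters. The sketch below shows that approach under a hypothetical name. It is one common way to get this behavior and is not presented as the body of `clean_non_ascii_chars`.

```python
def clean_non_ascii_chars_sketch(text: str) -> str:
    """Hypothetical stand-in: drop every character outside the ASCII range."""
    # errors="ignore" silently discards characters that ASCII cannot represent
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = clean_non_ascii_chars_sketch("\x88This text contains non-ascii characters!\x88")
```

Note that this drops accented letters entirely instead of transliterating them, so "café" becomes "caf".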
Fixes
- Fixes duplicated elements issue with `partition_pdf(..., strategy="fast")`
0.5.4
Enhancements
- Added Biomedical literature connector for ingest cli.
- Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
- Rename `s3_connector.py` to `s3.py` for readability and consistency with the rest of the connectors.
- Now `S3Connector` relies on `s3fs` instead of on `boto3`, and it inherits from `FsspecConnector`.
- Adds an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not language specific checks like vocabulary and POS tagging are applied. Set to `"true"` for higher resolution partitioning and `"false"` for faster processing.
- Improves `detect_filetype` warning to include filename when provided.
- Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
- Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`.
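
A boolean switch such as `UNSTRUCTURED_LANGUAGE_CHECKS` is typically read from the environment and compared against the string `"true"`. The sketch below shows that pattern with a hypothetical helper. The default used when the variable is unset and the case-insensitive comparison are assumptions, since these notes do not state either.

```python
import os


def language_checks_enabled(default: bool = False) -> bool:
    """Hypothetical helper: read the UNSTRUCTURED_LANGUAGE_CHECKS switch from the environment."""
    value = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS")
    if value is None:
        return default  # assumed default; the notes do not specify one
    return value.strip().lower() == "true"
```

The variable would be set before the process starts, for example `UNSTRUCTURED_LANGUAGE_CHECKS=true` in the shell environment.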
Features
- Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector`
- Add `partition_epub` for partitioning e-books in EPUB3 format.
Fixes
- Fixes processing for text files with `message/rfc822` MIME type.
- Open xml files in read-only mode when reading contents to construct an XMLDocument.