Releases: Unstructured-IO/unstructured

0.7.10

28 Jun 19:27
44411ec

Enhancements

  • Adds a max_partition parameter to partition_text, partition_pdf, partition_email,
    partition_msg and partition_xml that sets a limit for the size of an individual
    document element. Defaults to 1500 for everything except partition_xml, which has
    a default value of None.
  • DRY connector refactor
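
As a rough illustration of what a size limit such as max_partition implies, the sketch below splits a long string into pieces no longer than a given number of characters, breaking on whitespace where possible. The helper split_by_max_chars is hypothetical and is not the library's implementation; measuring size in characters and treating None as "no limit" are both assumptions.

```python
from typing import List, Optional


def split_by_max_chars(text: str, max_partition: Optional[int] = 1500) -> List[str]:
    """Hypothetical helper: split text into chunks of at most max_partition characters."""
    # Assumption: None means "do not split at all".
    if max_partition is None or len(text) <= max_partition:
        return [text]

    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split into fixed-size slices.
        while len(word) > max_partition:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_partition])
            word = word[max_partition:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition:
            current = candidate
        else:
            # Adding this word would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Text that already fits within the limit is returned unchanged as a single chunk.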

Features

  • hi_res model for pdfs and images is selectable via environment variable.
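
A minimal sketch of selecting a model name through the environment, for readers unfamiliar with the pattern. The variable name HI_RES_MODEL_NAME and the default value are placeholders chosen for illustration; this note does not state the real variable name.

```python
import os

# Assumption: both the variable name and the default are illustrative placeholders.
ENV_VAR = "HI_RES_MODEL_NAME"
DEFAULT_MODEL = "default-model"


def selected_model_name() -> str:
    """Return the model name from the environment, or the default if unset or empty."""
    return os.environ.get(ENV_VAR) or DEFAULT_MODEL
```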

Fixes

  • CSV check now ignores escaped commas.
  • Fix for filetype exploration util when file content does not have a comma.
  • Adds negative lookahead to bullet pattern to avoid detecting plain text line
    breaks like ------- as list items.
  • Fix pre tag parsing for partition_html
  • Fix lookup error for annotated Arabic and Hebrew encodings
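
The negative lookahead fix in the bullet pattern can be sketched as follows. The pattern below is a simplified, hypothetical one written for this note, not the pattern the library uses; it only shows how (?!-) keeps a run of dashes from being read as a list item.

```python
import re

# Hypothetical pattern: a leading bullet character (-, * or •) that is NOT
# immediately followed by another dash, then optional spaces and some content.
BULLET_RE = re.compile(r"^\s*[-*\u2022](?!-)\s*\S")


def looks_like_list_item(line: str) -> bool:
    """Return True if the line starts with a bullet followed by content."""
    return BULLET_RE.match(line) is not None
```

With this pattern, "- item" and "• item" match, while "-------" does not, because the first dash is followed by another dash.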

0.7.9

26 Jun 21:54
95f02f2

Enhancements

  • Improvements to string check for leaves in partition_xml.
  • Adds --partition-ocr-languages to unstructured-ingest.

Features

  • Adds partition_org for processed Org Mode documents.

Fixes

0.7.8

23 Jun 02:23
5f5da65

Enhancements

Features

  • Adds Google Cloud Service connector

Fixes

  • Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
  • partition_email now works if there is no message content
  • Updates the "fast" strategy for partition_pdf so that it's able to recursively
  • Adds recursive functionality to all fsspec connectors
  • Adds generic --recursive ingest flag
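
To show the difference a recursive flag makes when listing a source, here is a sketch using only the standard library. The helper list_files is hypothetical and operates on the local filesystem; it is not the connector code.

```python
from pathlib import Path
from typing import List


def list_files(root: str, recursive: bool = False) -> List[Path]:
    """Hypothetical helper: list files directly under root, or in the whole tree."""
    base = Path(root)
    # rglob("*") walks every subdirectory; glob("*") only looks at the top level.
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(p for p in candidates if p.is_file())
```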

0.7.7

20 Jun 19:13
c53ce11

Enhancements

  • Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
  • Adds missed file-like object handling in detect_file_encoding
  • Adds functionality to extract charset info from eml files
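
Extracting charset information from an eml file can be sketched with Python's standard email package. The helper charset_from_eml is hypothetical and may differ from how the library does it.

```python
from email import message_from_bytes
from typing import Optional


def charset_from_eml(raw: bytes) -> Optional[str]:
    """Return the charset of the first text part that declares one, else None."""
    message = message_from_bytes(raw)
    for part in message.walk():
        if part.get_content_maintype() == "text":
            # get_content_charset() reads the charset parameter of Content-Type
            # and returns it lowercased, or None if it is absent.
            charset = part.get_content_charset()
            if charset:
                return charset
    return None
```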

Features

  • Added coordinate system class to track coordinate types and convert to different coordinate
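
One way to picture converting points between coordinate systems is shown below. The CoordinateSystem class here is a hypothetical sketch written for this note (a width, a height, and a y-axis direction) and is not the class added in this release.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoordinateSystem:
    """Hypothetical coordinate system: extent plus the direction of the y axis."""

    width: float
    height: float
    y_points_down: bool  # True when the origin is top-left, False when bottom-left

    def convert_to(self, other: "CoordinateSystem", x: float, y: float) -> Tuple[float, float]:
        # Normalise to a unit square with y pointing up, then scale into the target.
        rel_x = x / self.width
        rel_y = y / self.height
        if self.y_points_down:
            rel_y = 1.0 - rel_y
        if other.y_points_down:
            rel_y = 1.0 - rel_y
        return rel_x * other.width, rel_y * other.height
```

For example, a point near the top of a top-left-origin image maps to a large y value in a bottom-left-origin page space.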

Fixes

  • Adds an html_assemble_articles kwarg to partition_html to enable users to
    control whether content outside of <article> tags is captured when
    <article> tags are present.
  • Check for the xml attribute on element before looking for pagebreaks in partition_docx.

0.7.6

16 Jun 15:09
a611532

Enhancements

  • Convert fast strategy to ocr_only for images
  • Adds support for page numbers in .docx and .doc when user- or renderer-created
    page breaks are present.
  • Adds retry logic for the unstructured-ingest Biomed connector
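
Retry logic of the kind mentioned for the Biomed connector is commonly written as a loop with exponential backoff. The sketch below assumes that general pattern; call_with_retries, its defaults, and the choice of retryable errors are hypothetical, not the connector's code.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Hypothetical helper: call func, retrying with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```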

Features

  • Provides users with the ability to extract additional metadata via regex.
  • Updates partition_docx to include headers and footers in the output.
  • Create partition_tsv and associated tests. Make additional changes to detect_filetype.

Fixes

  • Remove fake api key in test partition_via_api since we now require valid/empty api keys
  • Page number defaults to None instead of 1 when page number is not present in the metadata.
    A page number of None indicates that page numbers are not being tracked for the document
    or that page numbers do not apply to the element in question.
  • Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide
    in case the shape.top and shape.left attributes are None.

0.7.5

14 Jun 06:06
3fe7e1b

Enhancements

  • Adds functionality to sort elements in partition_pdf for fast strategy
  • Adds ingest tests with --fast strategy on PDF documents
  • Adds --api-key to unstructured-ingest
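
Sorting extracted elements into reading order can be sketched as a sort on page and position. The Box class and the top-to-bottom, left-to-right ordering are assumptions made for illustration; the notes above do not describe the actual sort used by partition_pdf.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for an extracted element with a page position."""

    text: str
    page: int
    top: float   # distance from the top of the page
    left: float  # distance from the left edge of the page


def sort_reading_order(boxes: List[Box]) -> List[Box]:
    # Page first, then top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b.page, b.top, b.left))
```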

Features

  • Adds partition_rst for processed reStructuredText documents.

Fixes

  • Adds handling for emails that do not have a datetime to extract.
  • Adds pdf2image package as core requirement of unstructured (with no extras)

0.7.4

12 Jun 18:41
b354e8e

Enhancements

  • Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
  • Enable MIME type detection if libmagic is not available
  • Adds handling for empty files in detect_filetype and partition.
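
When libmagic is unavailable, one common fallback is to guess the MIME type from the file extension with the standard library. The sketch below shows that approach only; whether the library uses mimetypes for this is an assumption.

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone, without reading the file."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

Because only the name is inspected, a file with no extension or a misleading one cannot be identified this way.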

Features

Fixes

  • Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
  • Remove building detectron2 from source in Dockerfile

0.7.3

09 Jun 18:16

Enhancements

  • Update IngestDoc abstractions and add data source metadata in ElementMetadata

Features

Fixes

  • Pass strategy parameter down from partition for partition_image
  • Filetype detection if a CSV has a text/plain MIME type
  • convert_office_doc no longer prints file conversion info messages to stdout.
  • partition_via_api reflects the actual filetype for the file processed in the API.

0.7.2

07 Jun 17:22
6bc1168

Enhancements

  • Adds an optional encoding kwarg to elements_to_json and elements_from_json
  • Bump version of base image to use new stable version of tesseract

Features

Fixes

  • Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
  • Add functionality to the read_txt_file utility function to handle file-like object from URL
  • Remove the unused parameter encoding from partition_pdf
  • Change auto.py to have a None default for encoding
  • Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds benchmark test with test docs in example-docs
  • Re-enable test_upload_label_studio_data_with_sdk
  • File detection now detects code files as plain text
  • Adds tabulate explicitly to dependencies
  • Fixes an issue in metadata.page_number of pptx files
  • Shows help if no parameters are passed
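
The encoding fallback for html and xml files described in the fixes above can be sketched as follows. The helper decode_with_fallback and its candidate list are hypothetical; the list of encodings the library actually tries is not given in these notes.

```python
from typing import Iterable, Optional, Tuple

# Assumption: this candidate list is illustrative. Order matters, because
# latin-1 can decode any byte sequence and so must come last.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    candidates: Iterable[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding_used)."""
    if encoding is not None:
        # The caller chose an encoding explicitly, so do not second-guess it.
        return data.decode(encoding), encoding
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with any candidate encoding")
```

The fallback only runs when no encoding is specified, matching the condition stated in the fix.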

0.7.1

01 Jun 20:52
c35fff2

Enhancements

Features

  • Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with
    a helper function for defining a class to use in Weaviate schemas.
  • Builds from the Unstructured base image, which is built off of Rocky Linux 8.7. This resolves almost all CVEs in the image.

Fixes