Update unstructured requirement from <0.15 to <0.17 in /topic/machine-learning/llm-langchain #680

dependabot bot (Contributor) commented on behalf of github Oct 18, 2024

Updates the requirements on unstructured to permit the latest version.
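
To make the effect of the new upper bound concrete, here is a minimal sketch of how a `<0.17` ceiling admits 0.16.0 while the old `<0.15` ceiling did not. The helper names `parse_version` and `allowed` are hypothetical and handle only plain dotted numeric releases; real installers follow the full PEP 440 rules, which this sketch does not attempt to reproduce.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as '0.16.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def allowed(version: str, upper_bound: str) -> bool:
    """Return True when `version` is strictly below `upper_bound`."""
    v, u = parse_version(version), parse_version(upper_bound)
    # Pad the shorter tuple with zeros so that '0.17' and '0.17.0' compare equal.
    width = max(len(v), len(u))
    v += (0,) * (width - len(v))
    u += (0,) * (width - len(u))
    return v < u


old_bound_ok = allowed("0.16.0", "0.15")  # excluded by the previous requirement
new_bound_ok = allowed("0.16.0", "0.17")  # permitted after this update
```

With the old requirement the 0.16.0 release is rejected; with the new one it is accepted, and a future 0.17.0 would still be excluded.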

Release notes

Sourced from unstructured's releases.

0.16.0

Enhancements

  • Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate unstructured-ingest repository.
    • Replace extras in requirements/ingest directory with a new ingest.txt extra for installing the unstructured-ingest library.
    • Remove the unstructured.ingest submodule.
    • Delete all shell scripts previously used for destination ingest tests.
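
Code that used to import from `unstructured.ingest` may want to detect which layout is installed before choosing an import path. The following is a rough sketch using only the standard library. The helper names are hypothetical, and the import name `unstructured_ingest` for the separate package is an assumption, not something stated in these release notes.

```python
from importlib.util import find_spec


def module_available(dotted_name: str) -> bool:
    """Report whether `dotted_name` can be located by the import system."""
    try:
        # Note: for dotted names, find_spec imports the parent package first.
        return find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError (an ImportError) is raised when a parent is missing.
        return False


def describe_ingest() -> str:
    """Classify the ingest layout visible in the current environment."""
    if module_available("unstructured.ingest"):
        return "legacy submodule"
    if module_available("unstructured_ingest"):  # assumed import name
        return "standalone package"
    return "not installed"
```

Calling `describe_ingest()` in an environment pinned below 0.16 would be expected to report the legacy submodule, while a 0.16 environment would report one of the other two outcomes.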

Features

Fixes

  • Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.
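
One plausible reading of this fix is that a shared factory forwards a language argument to every agent constructor, so an agent whose constructor does not accept it fails at creation time. The sketch below illustrates that general pattern only. Every class and function name in it is hypothetical, and it does not reproduce the library's actual code.

```python
from typing import Optional


class AgentWithoutLanguage:
    """Hypothetical agent whose constructor takes no language argument."""

    def __init__(self) -> None:
        self.language = None


class AgentWithLanguage:
    """Hypothetical agent that accepts an optional language hint."""

    def __init__(self, language: Optional[str] = None) -> None:
        self.language = language


def get_instance_sketch(agent_cls, language: str):
    """Hypothetical factory that always forwards `language` as a keyword."""
    return agent_cls(language=language)


try:
    get_instance_sketch(AgentWithoutLanguage, "eng")
    failed = False
except TypeError:
    # The constructor rejects the unexpected keyword argument.
    failed = True

agent = get_instance_sketch(AgentWithLanguage, "eng")
```

Making the parameter optional keeps direct construction without arguments working while also satisfying a factory that always passes the keyword.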
Changelog

Sourced from unstructured's changelog.

0.15.14

Enhancements

Features

  • Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.
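
As an illustration of what a post-partitioning metadata decorator can look like in general, here is a small self-contained sketch. The names `Element`, `apply_metadata_sketch`, and `partition_lines` are hypothetical stand-ins; the library's own decorator and its signature are not shown in this changelog and are not reproduced here.

```python
import functools
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str
    metadata: dict = field(default_factory=dict)


def apply_metadata_sketch(filetype: str):
    """Build a decorator that stamps shared metadata on every returned element."""

    def decorator(partitioner):
        @functools.wraps(partitioner)
        def wrapper(filename: str, **kwargs):
            elements = partitioner(filename, **kwargs)
            for element in elements:
                # Metadata common to all file types is added in one place.
                element.metadata["filename"] = filename
                element.metadata["filetype"] = filetype
            return elements

        return wrapper

    return decorator


@apply_metadata_sketch("text/plain")
def partition_lines(filename: str, *, lines: list):
    """Toy partitioner: one element per non-blank line."""
    return [Element(text=line) for line in lines if line.strip()]


elements = partition_lines("notes.txt", lines=["first", "", "second"])
```

Because the stamping happens after partitioning, the partitioner itself stays free of metadata bookkeeping, which is the appeal of handling it in a single decorator.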

Fixes

  • Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
  • Remove "unused" date_from_file_object parameter. As part of simplifying partitioning parameter set, remove date_from_file_object parameter. A file object does not have a last-modified date attribute so can never give a useful value. When a file-object is used as the document source (such as in Unstructured API) the last-modified date must come from the metadata_last_modified argument.
  • Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which leads to an error when mapping parent ids.
  • Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
  • Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
  • Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
  • Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
  • Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
  • Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
  • Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
  • Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
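
The removal of date_from_file_object rests on the point that an in-memory file object carries no last-modified date, so the caller has to capture it while the path is still known and pass it along through metadata_last_modified. A minimal sketch of that capture step follows. The helper `last_modified_iso` is hypothetical, and the ISO 8601 string format is an assumption about what a caller might pass, not something this changelog specifies.

```python
import datetime as dt
import io
import os
import tempfile


def last_modified_iso(path: str) -> str:
    """Read the filesystem mtime of `path` and format it as ISO 8601 in UTC."""
    mtime = os.path.getmtime(path)
    stamp = dt.datetime.fromtimestamp(mtime, tz=dt.timezone.utc)
    return stamp.isoformat(timespec="seconds")


# Create a real file so there is an mtime to read.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    path = handle.name

stamp = last_modified_iso(path)      # captured while the path is available
stream = io.BytesIO(b"hello")        # in-memory stream: no path, no mtime
has_path = hasattr(stream, "name")   # BytesIO exposes no file name
os.remove(path)
```

A caller holding only `stream` would then supply `stamp` as the metadata_last_modified argument when partitioning, since nothing on the stream itself can provide that value.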

0.15.13

BREAKING CHANGES

  • Remove dead experimental code. Unused code in file_utils.experimental and file_utils.metadata was removed. These functions were never published in the documentation, but if a client dug these out and used them this removal could break client code.

Enhancements

  • Improve pdfminer image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances overall processing speed of PDF documents.

Features

Fixes

... (truncated)

Commits

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Updates the requirements on [unstructured](https://github.com/Unstructured-IO/unstructured) to permit the latest version.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.2.0...0.16.0)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies and python labels Oct 18, 2024
@cla-bot cla-bot bot added the cla-signed label Oct 18, 2024
amotl (Member) commented Oct 24, 2024

@dependabot rebase

@amotl amotl merged commit fc49187 into main Oct 24, 2024
3 checks passed
@amotl amotl deleted the dependabot/pip/topic/machine-learning/llm-langchain/unstructured-lt-0.17 branch October 24, 2024 11:50