Add PDF to supported formats; summarize content and extract tags using LLM #90

Open
jqnatividad opened this issue May 19, 2023 · 2 comments
Labels: 1.x (will be done in DP+ 1.x - DP+ running as CKAN extension), enhancement (New feature or request)

Comments

@jqnatividad
Contributor

The legacy Datapusher used to support PDFs, as messytables supported extracting tables from PDFs using pdftables.

That functionality has been removed, as well as Excel support.

We re-enabled Excel support in DP+ using qsv.

We should re-enable PDF support as well. The goal for now is not to extract tables (though there is tabula-rs), but to summarize the content for the Description field and suggest tags.
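
To make the intended output concrete, here is a minimal sketch of the shape such a result could take. It is not a proposed implementation: it assumes the PDF text has already been extracted by some other tool, and it swaps in a naive word-frequency heuristic where the LLM would go. The function names (`suggest_tags`, `draft_description`, `build_patch`) and the stopword list are hypothetical; the only CKAN-specific detail relied on is that a dataset's description lives in `notes` and its tags are a list of `{"name": ...}` dicts.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real one would be far larger.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "are",
    "on", "by", "with", "this", "that", "as", "it", "from", "at", "be",
    "was", "were",
}


def suggest_tags(text, max_tags=5):
    """Return the most frequent non-stopword terms as candidate tags.

    Stand-in for the LLM tag suggestion; purely frequency based.
    """
    words = re.findall(r"[a-z][a-z-]{2,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_tags)]


def draft_description(text, max_sentences=2):
    """Return the first few sentences as a placeholder description.

    Stand-in for the LLM summary; it only truncates, it does not summarize.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def build_patch(dataset_id, text):
    """Build a dict shaped like a CKAN dataset patch (description + tags)."""
    return {
        "id": dataset_id,
        "notes": draft_description(text),
        "tags": [{"name": tag} for tag in suggest_tags(text)],
    }


if __name__ == "__main__":
    sample = (
        "Water quality report. Water samples were collected from rivers. "
        "Rivers showed elevated nitrate levels."
    )
    print(build_patch("example-dataset", sample))
```

In a real pipeline the two heuristic functions would be replaced by calls to whichever LLM is chosen, while the patch structure could stay the same.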

@jqnatividad jqnatividad added 1.x will be done in DP+ 1.x - DP+ running as CKAN extension enhancement New feature or request labels Jun 16, 2023
@jqnatividad
Contributor Author

This will be done when the qsv describegpt command is done. Though qsv is primarily focused on tabular data, describegpt will have a mode in a later version to summarize PDFs and get the description and tags for CKAN, which we can use in DP+.

dathere/qsv#1036

cc @rzmk @samibaig

@jqnatividad
Contributor Author

Thinking about it more, PDF summarization is outside the scope of qsv, so we should not add that functionality to qsv.

It is still in scope for DP+, though.
