Added lxml_html_clean and updated pdfplumber dependencies #143

koddas · 2024-06-11T13:38:38Z

The LXML project has separated its HTML cleaner into its own project, breaking the current build. This change updates the required packages list in setup.py to reflect that change. Also, for some reason, the build didn't finish as expected unless I updated the pdfplumber dependency to 0.11.
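For readers unfamiliar with the split: the HTML cleaner that used to live in lxml.html.clean is now distributed as the separate lxml_html_clean package, so it has to be listed as its own requirement. A minimal sketch of what such a requirements list might look like follows; the variable name REQUIRED, the bare "lxml" entry, and the ">=" specifier on pdfplumber are assumptions for illustration, not a copy of the actual setup.py.

```python
# Hypothetical excerpt of a setup.py requirements list (names and
# version specifiers are illustrative assumptions, not the real file).
REQUIRED = [
    "lxml",              # core parser; no longer ships the HTML cleaner
    "lxml_html_clean",   # HTML cleaner, now a separate project
    "pdfplumber>=0.11",  # assumed specifier; the PR only says "updated to 0.11"
]


def requirement_names(requirements):
    """Return the bare package names, dropping any version specifier."""
    names = []
    for req in requirements:
        for sep in (">=", "==", "<=", "~=", ">", "<"):
            if sep in req:
                req = req.split(sep, 1)[0]
                break
        names.append(req.strip())
    return names
```

In a real setup.py a list like this would typically be passed to setuptools.setup via install_requires.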

GjjvdBurg · 2024-06-13T21:43:30Z

Thanks for this @koddas! Looks like some tests are failing for unrelated reasons; I'll take a look at those soon.

koddas · 2024-06-16T23:39:45Z

@GjjvdBurg No worries. Is there a way to run the tests locally? The readme isn't that informative on the matter :)

The commit includes three tests and an update to the readme file.

koddas · 2024-06-19T21:00:08Z

Well, this sucks... I managed to commit some changes I made to the wrong branch. It's totally fine if you reject the latest commit; it was supposed to be a new PR after the current PR had been approved.

GjjvdBurg · 2024-07-22T22:26:01Z

Thanks for the changes @koddas! I'm happy to merge this with the addition of the DiVA provider, but the tests are currently failing because of formatting issues, and the test_diva_2 test fails as well (see comment). Formatting configuration can be found in the .pre-commit-config.yaml file. Don't worry about the other tests, I can fix those later after merging this in. Thanks for your help!

GjjvdBurg · 2024-07-22T22:23:44Z

paper2remarkable/providers/diva.py

+        soup = bs4.BeautifulSoup(page, "html.parser")
+
+        pdf_url = soup.find("meta", {"name": "citation_pdf_url"})
+        print(pdf_url)


Would you mind removing this print statement?
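If the value is still useful while debugging, one option is to send it to a logger at debug level instead of printing it. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies, and the names MetaContentParser and find_pdf_url are hypothetical, not part of the provider.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger(__name__)


class MetaContentParser(HTMLParser):
    """Record the content attribute of the first <meta> tag with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <meta name="..."> tag.
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == self.meta_name:
            self.content = attr_map.get("content")


def find_pdf_url(page):
    """Return the citation_pdf_url meta content, or None if it is absent."""
    parser = MetaContentParser("citation_pdf_url")
    parser.feed(page)
    # Debug-level logging stays silent unless the caller enables it.
    logger.debug("citation_pdf_url: %s", parser.content)
    return parser.content
```

With this shape the lookup result is still inspectable by raising the log level, but nothing is written to stdout in normal use.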

GjjvdBurg · 2024-07-22T22:24:24Z

tests/test_providers.py

+        # Testing absolute URLs and sanitization of filenames
+        prov = DiVA(upload=False, verbose=VERBOSE)
+        url = "https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1480467"
+        exp = "Alhussein_-_Privacy_by_Design_Internet_of_Things_managing_privacy_2018.pdf"


When running the test I get Alhussein_-_Privacy_by_Design_Amp_Internet_of_Things_Managing_Privacy_2018.pdf, which causes it to fail.
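One possible explanation for the stray "Amp" (a guess, not confirmed in this thread) is that the title still contains the HTML entity &amp; when the filename is sanitized, so the entity name survives as a word. The sketch below shows that effect with a hypothetical sanitize helper and an assumed raw title; it is not the project's actual sanitization code and does not address the difference in capitalization.

```python
import html
import re


def sanitize(title, unescape=True):
    """Join the alphanumeric words of a title with underscores.

    When unescape is True, HTML entities such as &amp; are decoded first,
    so the ampersand is dropped rather than leaving "amp" behind as a word.
    """
    if unescape:
        title = html.unescape(title)
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "_".join(words)


# Assumed raw title, for illustration only.
raw_title = "Privacy by Design &amp; Internet of Things"

with_entity = sanitize(raw_title, unescape=False)
decoded = sanitize(raw_title)
```

Here with_entity keeps an "amp" word between "Design" and "Internet", while decoded does not, which matches the shape of the mismatch reported above.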

Oh, thanks!

GjjvdBurg · 2024-08-08T21:15:23Z

Thanks @koddas, merged!

Added lxml_html_clean and updated pdfplumber dependencies

6e8643a

Add DiVA provider for fetching papers from DiVA repository

1f87f0f

The commit includes three tests and an update to the readme file.

GjjvdBurg reviewed Jul 22, 2024


Fixed a broken test and removed a print statement

a5a81bf

GjjvdBurg merged commit c645528 into GjjvdBurg:master Aug 8, 2024
0 of 2 checks passed
