Pdf2doi

Latest version: v1.7

1.7

Main changes
- Changed the URL used for dx.doi.org validation (https://github.com/MicheleCotrufo/pdf2doi/issues/35)
- Added the 'r' (raw string) prefix to strings to suppress warnings in recent Python versions (https://github.com/MicheleCotrufo/pdf2doi/pull/36)
- Changed pymupdf dependency to pymupdf>=1.21.0 (https://github.com/MicheleCotrufo/pdf2doi/issues/32 https://github.com/MicheleCotrufo/pdf2doi/issues/28 https://github.com/MicheleCotrufo/pdf2doi/issues/37)
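
The 'r' prefix change above can be illustrated with a minimal sketch. The pattern below is a hypothetical DOI-like regex written for this example; it is not taken from pdf2doi's own code.

```python
import re

# Hypothetical DOI-like pattern, for illustration only (not pdf2doi's actual regex).
# Without the r prefix, "\d" and "\S" in a normal string literal are invalid escape
# sequences, which recent Python versions report with a warning.
DOI_LIKE = re.compile(r"10\.\d{4,9}/\S+")

match = DOI_LIKE.search("see doi:10.1234/abcd.5678 for details")
found = match.group(0) if match else None
print(found)
```

With a raw string, the backslashes reach the regex engine unchanged, so the pattern behaves the same as before and only the warning disappears.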

1.6

Main changes
- The library pypdf is now used (instead of PyPDF2) to add new metadata to the pdf files (see also the fix below). Since PyPDF2 is now deprecated, in the next version of pdf2doi we will progressively replace PyPDF2 with pypdf for all the tasks it currently performs.

Added
- Make sure that the input variable target is converted to a string before processing https://github.com/MicheleCotrufo/pdf2doi/pull/27
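
A minimal sketch of what such a conversion might look like, assuming the caller passes a pathlib.Path rather than a plain string. The helper name normalize_target is hypothetical and is not part of pdf2doi.

```python
from pathlib import Path


def normalize_target(target):
    """Hypothetical helper: coerce a path-like target to str before processing."""
    return str(target)


# A Path object has no string methods such as .lower() or .endswith(),
# so converting first lets later string-based checks work unchanged.
target = normalize_target(Path("papers") / "Article.PDF")
is_pdf = target.lower().endswith(".pdf")
print(target, is_pdf)
```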

Fixed
- Fixed a bug related to storing the DOI in the metadata of the pdf files. Due to some quirks of the library PyPDF2, the size of the pdf file would double after adding the metadata. Adding metadata to a pdf file is now performed via the library pypdf (thanks to Ole Steuernagel for pointing out this issue).

1.5

Main changes
- The library textract has been removed from the required dependencies because it often causes problems during installation (due to conflicts between library versions), and because it generally requires installing many other dependencies which are not needed by pdf2doi. The user can still decide to install textract==1.6.4 if desired. pdf2doi will use textract only if it is installed.
- pdf2doi now stores any found identifier in a tag called /pdf2doi_identifier (previously /identifier).

Added
- The library pdfminer is now directly used by pdf2doi to extract the text from a pdf file (instead of doing it indirectly via textract)
- An additional method to find the title of a pdf file, based on the library pymupdf, has been added.
- [Issue https://github.com/MicheleCotrufo/pdf2doi/issues/21]: When an arXiv ID is found, a corresponding DOI is also returned when available. This could be either the standard arXiv DOI (see also [here](https://blog.arxiv.org/2022/02/17/new-arxiv-articles-are-now-automatically-assigned-dois/)), or the DOI of the corresponding journal publication. This behavior can be disabled by adding the optional command -no_arxiv2doi to the pdf2doi invocation.
- [Issue https://github.com/MicheleCotrufo/pdf2doi/issues/22]: The function get_pdf_text (finders.py) has been modified to allow the library PyPDF2 to also extract the text of any annotations/comments present in the pdf file.
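
For readers unfamiliar with the standard arXiv DOI mentioned above, the sketch below builds one from an arXiv ID using the 10.48550/arXiv. prefix that arXiv uses for its automatically assigned DOIs. The helper arxiv_id_to_doi is hypothetical and only covers new-style IDs; it is not necessarily how pdf2doi obtains the DOI, and it cannot find the DOI of a journal publication.

```python
import re

# Prefix of the DOIs that arXiv assigns automatically to its articles.
ARXIV_DOI_PREFIX = "10.48550/arXiv."


def arxiv_id_to_doi(arxiv_id):
    """Hypothetical helper: map a new-style arXiv ID to its standard arXiv DOI."""
    # Drop an optional "arXiv:" label in front of the identifier.
    bare = re.sub(r"^arxiv:", "", arxiv_id.strip(), flags=re.IGNORECASE)
    # Drop a trailing version suffix such as "v2"; the DOI is versionless.
    bare = re.sub(r"v\d+$", "", bare)
    return ARXIV_DOI_PREFIX + bare


doi = arxiv_id_to_doi("arXiv:2202.01037v2")
print(doi)
```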

Fixed
- Potential titles of the papers were often not correctly found, because the function find_possible_titles() (finders.py) would mistakenly disregard all the results if one of the three methods (pdftitle, PyPDF2, filename) generated an error.
- Fixed bug in the function add_metadata() (finders.py). In previous versions, some of the pre-existing metadata were not preserved when a new one was added ([commit](https://github.com/MicheleCotrufo/pdf2doi/commit/0804439f2d31191e476ea56369d1257d293d92dd)).

1.4

Main improvements (see also merge from https://github.com/MicheleCotrufo/pdf2doi/pull/20)
- Check for server error status codes when validating on dx.doi.org, as 504 errors can occur.
- When performing Google searches, pdf2doi now also looks for DOIs in the result URLs.
- Support any URL with a matching DOI and the doi keyword in the URL.
- Attempt to strip extensions from filenames. Without this, a name such as doi10.111/1111.pdf fails to yield the DOI, because 10.111/1111.pdf is itself a valid, if uncommon, DOI.
- "Standardise" DOIs to handle loose matches, e.g. case variations or trailing punctuation.
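
A minimal sketch of what standardising a loosely matched DOI could involve, assuming that lower-casing and stripping trailing punctuation are enough. The function standardise_doi and the set of stripped characters are hypothetical; the actual rules in pdf2doi may differ.

```python
def standardise_doi(raw):
    """Hypothetical helper: normalise a loosely matched DOI string."""
    # DOIs are case-insensitive, so lower-casing gives one canonical spelling.
    doi = raw.strip().lower()
    # Remove punctuation that often trails a DOI at the end of a sentence.
    return doi.rstrip(".,;:")


cleaned = standardise_doi(" 10.1234/ABC.Def. ")
print(cleaned)
```

Two candidates that differ only in case or in a trailing period then compare equal, which avoids validating the same DOI twice.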

Minor code changes (see also merge from https://github.com/MicheleCotrufo/pdf2doi/pull/20)
- Moved regex patterns to patterns.py and added pytest tests for common DOI patterns.
- Updated to use logger.exception, which provides tracebacks on errors.
- Moved code to add the '/identifier' tag to a general function add_metadata() in finders.py

1.3

Fixed
- File objects were not closed after being opened (issue https://github.com/MicheleCotrufo/pdf2doi/issues/17).
- Make sure that version 2.0.0 of `pypdf2` is used, since the text extracted with newer versions occasionally corrupts some DOIs.
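
The unclosed-file problem above is commonly avoided with a with block. The sketch below is a generic illustration using a temporary file; it does not reproduce the actual fix in pdf2doi.

```python
import os
import tempfile

# Create a small placeholder file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    path = tmp.name

# The with block closes the handle on exit, even if reading raises an error.
with open(path, "rb") as handle:
    header = handle.read(8)

closed_after_block = handle.closed
os.remove(path)
print(header, closed_after_block)
```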

1.2

Added
- Print an explicit error when the target path is not a valid file or directory (when used via CLI).

Fixed
- Fixed a bug caused by some functions returning None instead of an empty list (issue https://github.com/MicheleCotrufo/pdf2doi/issues/15).
- Fixed typo at line 134 of main.py ('/identfier' -> '/identifier')
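
The None versus empty list issue above can be sketched as follows. The function find_candidates is a hypothetical stand-in for a finder function and is not pdf2doi's code.

```python
def find_candidates(text):
    """Hypothetical finder: always returns a list, possibly empty."""
    # Returning [] rather than None when nothing matches means callers can
    # iterate over the result or take its length without a special case.
    return [word for word in text.split() if word.startswith("10.")]


results = find_candidates("no identifiers here")
count = len(results)  # would raise TypeError if the function returned None
for candidate in results:
    print(candidate)
print(count)
```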
