Main changes
- The library textract has been removed from the required dependencies because it often causes installation problems (due to conflicts between library versions)
and because it generally pulls in many other dependencies that pdf2doi does not need. Users can still install textract==1.6.4 if desired;
pdf2doi will use textract only if it is installed.
- pdf2doi now stores any identifier it finds in a tag called /pdf2doi_identifier (previously /identifier).
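
The optional-dependency behavior described in the textract item above can be sketched roughly as follows. This is only an illustration of the general pattern, not pdf2doi's actual code: the helper names `is_installed` and `choose_extractors` are hypothetical, and the module names passed as candidates are assumptions.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_extractors(candidates=("pdfminer", "textract")):
    """Keep only the candidate libraries that are installed (names are assumed)."""
    return [name for name in candidates if is_installed(name)]
```

With this pattern a missing optional library is silently skipped instead of causing an import error at start-up.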
Added
- The library pdfminer is now used directly by pdf2doi to extract the text from a pdf file (instead of indirectly via textract).
- An additional method to find the title of a pdf file, based on the library pymupdf, has been added.
- [Issue https://github.com/MicheleCotrufo/pdf2doi/issues/21]: When an arXiv ID is found, a corresponding DOI is also returned when available. This could be either the standard arXiv DOI (see also [here](https://blog.arxiv.org/2022/02/17/new-arxiv-articles-are-now-automatically-assigned-dois/)),
or the DOI of the corresponding journal publication. This behavior can be disabled by adding the optional argument -no_arxiv2doi to the pdf2doi invocation.
- [Issue https://github.com/MicheleCotrufo/pdf2doi/issues/22]: The function get_pdf_text (finders.py) has been modified to allow the library PyPDF2 to also extract the text of any annotations/comments present in the pdf file.
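
For the arXiv item above, the standard arXiv DOI is built from the arXiv ID with the prefix 10.48550/arXiv. The sketch below shows only that construction, assuming the DOI is versionless; the function name `arxiv_id_to_doi` is hypothetical and is not necessarily how pdf2doi implements it. Finding the DOI of a journal publication requires a separate lookup, which is not shown.

```python
import re

# Prefix used for DOIs that arXiv assigns to its own articles.
ARXIV_DOI_PREFIX = "10.48550/arXiv."


def arxiv_id_to_doi(arxiv_id):
    """Build the standard arXiv DOI for a given arXiv ID (hypothetical helper)."""
    # Drop an optional "arXiv:" prefix, then a trailing version suffix such as "v2".
    bare_id = re.sub(r"^arxiv:", "", arxiv_id.strip(), flags=re.IGNORECASE)
    bare_id = re.sub(r"v\d+$", "", bare_id)
    return ARXIV_DOI_PREFIX + bare_id
```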
Fixed
- Potential titles of the papers were often not found correctly, because the function find_possible_titles() (finders.py) would mistakenly discard all the results if any one of the three methods (pdftitle, PyPDF2, filename) raised an error.
- Fixed a bug in the function add_metadata() (finders.py). In previous versions, some of the pre-existing metadata entries were not preserved when a new one was added ([commit](https://github.com/MicheleCotrufo/pdf2doi/commit/0804439f2d31191e476ea56369d1257d293d92dd)).
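
The kind of fix described for find_possible_titles() can be illustrated with a minimal sketch in which each title-finding method is wrapped in its own error handling, so that one failure does not discard the other results. This is not the actual code in finders.py; `collect_titles` and the two stand-in finders are hypothetical.

```python
import os


def collect_titles(methods, path):
    """Run every title finder and keep the results of those that succeed."""
    titles = []
    for method in methods:
        try:
            result = method(path)
        except Exception:
            # A failure in one method must not discard the others' results.
            continue
        if result:
            titles.append(result)
    return titles


def failing_finder(path):
    """Stand-in for a method that fails on a given file."""
    raise ValueError("simulated parser failure")


def filename_finder(path):
    """Stand-in that uses the file name without its extension as a title."""
    return os.path.splitext(os.path.basename(path))[0]
```

For example, `collect_titles([failing_finder, filename_finder], "papers/My Paper.pdf")` still returns the title derived from the file name even though the first finder raised an error.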