Textract-py3

Latest version: v2.1.1


Page 2 of 4

1.6.1

-------------------

* several bug fixes, including:

  * fixing the readthedocs build (`150`_)

1.6.0

-------------------

* Let the user provide file extension as an argument when the file name has no
  extension (`148`_ by `motazsaad`_)

* Added ability to parse audio with ``pocketsphinx`` (`122`_ by `barrust`_)

* Added ability to parse ``.psv`` and ``.tsv`` files (`141`_)

* several bug fixes, including:

  * checking for the importability of a parser rather than the presence of the
    file (`136`_ by `AusIV`_)

  * manage versions with `bumpversion <https://pypi.python.org/pypi/bumpversion>`_
    (`146`_)

  * properly reporting on missing external dependencies (`139`_ by `AusIV`_)

  * pin ``chardet`` to version 2.1.1 to avoid decode errors (`107`_)

  * avoid unicode decode error with html parser (`147`_ by `suned`_)

  * enabling autocomplete and improving error handling (`149`_)
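
The difference between checking that a parser file exists and checking that it can be imported can be sketched as follows. This is a minimal illustration using only the standard library, not textract's actual code; the helper name ``parser_is_importable`` is hypothetical. A file can be present on disk and still fail to import, for example when one of its own dependencies is missing.

```python
import importlib


def parser_is_importable(module_name):
    """Return True only if the module can actually be imported.

    Hypothetical helper: a parser module may exist on disk and still
    raise ImportError because a package it depends on is not installed.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

For instance, ``parser_is_importable("json")`` succeeds for a standard library module, while a module name that cannot be resolved returns ``False`` instead of raising.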

1.5.0

-----

* Added Python 3 support, including pdfminer (`104`_ by `sirex`_ via `126`_)

* Python 3 support for ``pdfminer`` using ``pdfminer.six`` (`116`_ by
  `jaraco`_ via `126`_)

* fixed security vulnerability by properly using ``subprocess.call`` (`114`_ by
  `pierre-ernst`_)

* updating to ``tesseract`` 3.03 (`127`_)

* adding a ``.tif`` synonym for ``.tiff`` files (`113`_ by `onionradish`_)

* improved ``.docx`` support using ``docx2txt`` (`100`_ by `ankushshah89`_)

* several bug fixes, including:

  * including all requirements for ``Pillow`` (`119`_ by `akoumjian`_)
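
The changelog does not describe the ``subprocess.call`` fix in detail. A common safe pattern, shown here as a sketch under that assumption, is to pass the command as a list of arguments so that no shell interprets the file name. The helper ``run_tool`` and the sample file name are hypothetical, and the child process is the Python interpreter itself so the example runs without any external tool.

```python
import subprocess
import sys


def run_tool(args):
    """Run an external command given as an argument list (no shell involved)."""
    return subprocess.call(args)


# A file name containing shell metacharacters. With a list of arguments it
# reaches the child process as one literal argument and is never executed.
hostile_name = "report.pdf; echo injected"

# The child exits with 0 only if it received the name unchanged as argv[1].
child = "import sys; sys.exit(0 if sys.argv[1] == 'report.pdf; echo injected' else 1)"
exit_code = run_tool([sys.executable, "-c", child, hostile_name])
```

Building a single command string and running it with ``shell=True`` would instead let the shell treat the ``;`` as a command separator.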

1.4.0

-----

* added layout preservation option for ``pdftotext`` PDF extractor (`93`_ by
  `ankushshah89`_)

* added simple support for extensionless filenames, treating them as plain
  ``.txt`` files (`85`_)

* several bug fixes, including:

  * now extracting the text in tables from docx files at the end of the text
    extraction (`92`_ by `jsmith-mploir`_)

  * faster testing framework by only rebuilding test data when needed (`90`_)

  * fixed ``.html`` and ``.epub`` parsers to deal with beautifulsoup4
    upgrades

  * using official ``msg-extractor`` now that it has a native ``setup.py``

  * updated tests for ``.html``, ``.ogg``, ``.wav``, and ``.mp3`` file types to
    be consistent with more recent versions of the underlying packages.
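
Treating extensionless filenames as plain ``.txt`` files can be sketched as a fallback on the result of ``os.path.splitext``. This is an illustration only, not textract's implementation; the function name ``effective_extension`` and its ``default`` parameter are hypothetical.

```python
import os


def effective_extension(filename, default=".txt"):
    """Return the lower-cased extension, or the default if there is none."""
    _, ext = os.path.splitext(filename)
    return ext.lower() if ext else default
```

With this sketch, ``effective_extension("README")`` falls back to ``".txt"``, while a name such as ``"scan.PDF"`` keeps its own extension, normalized to lower case.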

1.3.0

-----

* support for ``.rtf`` files (`84`_)

* support for ``.msg`` files (`87`_ and `17`_ by `anthonygarvan`_)

1.2.0

-----

* support for ``.tiff`` files (`81`_)

* added support for other languages for tesseract (`76`_ by `anderser`_)

* added ``--option/-O`` flag to pass arbitrary arguments for things like
  languages into textract

* several bug fixes, including:

  * fix bug with doing OCR on multi-page pdfs and removing temporary directory
    (`82`_ by `pudo`_)

  * correctly accounting for whitespace in ``.odt`` documents (`79`_
    by `evfredericksen`_)

  * standardizing testing environment to be compatible with different versions
    of third-party command line tools (`78`_)
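
The multi-page fix above involves cleaning up a temporary directory after per-page work. The changelog gives no implementation details, so the following is only a sketch of the general pattern: create the directory, do the per-page work inside ``try``, and remove the directory in ``finally`` so it is deleted even if a page fails. The function ``process_pages`` is hypothetical, and writing and reading text files stands in for the real OCR step.

```python
import os
import shutil
import tempfile


def process_pages(page_texts):
    """Write each page to a scratch directory, combine them, then clean up.

    Returns the combined text and the path of the (already removed)
    scratch directory so that callers can verify the cleanup.
    """
    temp_dir = tempfile.mkdtemp()
    try:
        paths = []
        for index, text in enumerate(page_texts):
            path = os.path.join(temp_dir, "page-%d.txt" % index)
            with open(path, "w") as handle:
                handle.write(text)
            paths.append(path)

        combined = []
        for path in paths:
            with open(path) as handle:
                combined.append(handle.read())
        return "\n".join(combined), temp_dir
    finally:
        # Runs on success and on error, so no scratch files are left behind.
        shutil.rmtree(temp_dir)
```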
