- Fix bug where hyphens didn't show up at the end of lines
- Improve wrapping for hyphens: join words across hyphens before newline (disable by passing `keep_hyphens`)
- Restructure output to avoid redundant info in json blob: keep track of text spans with similar font info instead of individual characters
- Update model to predict blocks more accurately
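As a rough illustration of the hyphen joining mentioned above, the sketch below merges a word split by a hyphen at a line break. It is not pdftext's actual implementation: the function name `join_hyphenated` is hypothetical, and only the `keep_hyphens` flag name comes from these notes.

```python
import re


def join_hyphenated(text: str, keep_hyphens: bool = False) -> str:
    """Join words that were split by a hyphen at the end of a line.

    Hypothetical helper for illustration only; not part of pdftext.
    """
    if keep_hyphens:
        # Leave the text untouched when the caller opts out.
        return text
    # Match a word character, a hyphen, a newline, then another word
    # character, and drop the hyphen and the newline between them.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


if __name__ == "__main__":
    sample = "Fast text extrac-\ntion from PDFs"
    print(join_hyphenated(sample))
    print(join_hyphenated(sample, keep_hyphens=True))
```

A simple rule like this also merges genuine compound words that happen to break at a hyphen, which is one reason an opt-out flag is useful.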
0.2.1
- Switch the character box to a `loose` box, to get the full character range
0.2.0
- Rotate bboxes if pdf is rotated
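A minimal sketch of what rotating a bounding box can look like, under stated assumptions: the origin is at the top-left, y grows downward, the rotation is clockwise in multiples of 90 degrees, and `page_width` / `page_height` describe the unrotated page. The function `rotate_bbox` is hypothetical and may not match the conventions pdftext uses internally.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rotate_bbox(bbox: BBox, rotation: int,
                page_width: float, page_height: float) -> BBox:
    """Rotate a bbox clockwise by 0, 90, 180 or 270 degrees.

    Hypothetical helper for illustration only; assumes a top-left
    origin with y increasing downward.
    """
    x0, y0, x1, y1 = bbox
    rotation %= 360
    if rotation == 0:
        return (x0, y0, x1, y1)
    if rotation == 90:
        # A point (x, y) maps to (page_height - y, x).
        return (page_height - y1, x0, page_height - y0, x1)
    if rotation == 180:
        # A point (x, y) maps to (page_width - x, page_height - y).
        return (page_width - x1, page_height - y1,
                page_width - x0, page_height - y0)
    if rotation == 270:
        # A point (x, y) maps to (y, page_width - x).
        return (y0, page_width - x1, y1, page_width - x0)
    raise ValueError("rotation must be a multiple of 90 degrees")


if __name__ == "__main__":
    print(rotate_bbox((10, 20, 30, 40), 90, page_width=100, page_height=200))
```

Each branch rotates both corners and reorders them so that the result still satisfies `x0 <= x1` and `y0 <= y1`.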
0.1.2
- Optimize some internal routines
- Improve the model further
0.1.1
- Added a few extra line-related features
- Improved accuracy of the model
0.1.0
Initial version of pdftext. Fast text extraction based on pypdfium2.
- Extract plain text, sorted into reading order or in pdf order
- Extract structured blocks and lines with font and other information per-character