Kraken 5.x is a major release introducing trainable reading order, a cleaner API, and changes resulting in a ~50% performance improvement of recognition inference, in addition to a large number of smaller bug fixes and stability improvements.
What's Changed
* Trainable reading order based on a neural order relation operator adapted from [this](https://ieeexplore.ieee.org/document/9413256) method (https://github.com/mittagessen/kraken/pull/492)
* Updates to the ALTO/PageXML templates and the serializer which correct serialization of region and line taxonomies, use UUIDs, and reuse identifiers from input XML files in output.
* Requirements are now mostly pinned to avoid the pytorch/lightning accuracy and speed regressions that popped up semi-regularly under looser version constraints.
* Threadpool limits are now set in all CLI drivers to prevent slowdown from unreasonably large numbers of threads in libraries like OpenCV. As a result, the `--threads` option of all commands has been split into `--workers` and `--threads`.
* `kraken.repo` methods have been adapted to the new Zenodo API. They also correctly handle versioned records now.
* A small fix enabling recognition inference with AMP.
* Support for `--fixed-splits` in `ketos test` (PonteIneptique)
* Performance increase for polygon extraction by Evarin in https://github.com/mittagessen/kraken/pull/555
* Speed up legacy polygon extraction by anutkk in https://github.com/mittagessen/kraken/pull/586
* New container classes in `kraken.containers` replace the previous dicts produced and expected by `segment/rpred/serialize`.
* `kraken.serialize.serialize_segmentation()` has been removed as part of the container class rework.
* `train/rotrain/segtrain/pretrain` cosine annealing scheduling now allows setting the final learning rate with `--cos-min-lr`.
* Lots of PEP8/whitespace/spelling mistake fixes from stweil
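The `--cos-min-lr` option in the list above sets the floor that the cosine schedule decays towards. As a rough illustration, the textbook cosine annealing formula can be sketched as follows. The function and parameter names are assumptions made for this example, and the code is not taken from kraken's training loop:

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, base_lr: float, min_lr: float) -> float:
    """Textbook cosine annealing from base_lr (step 0) down to min_lr (last step).

    All names are hypothetical; this only illustrates what a final
    learning rate means for a cosine schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule stays at min_lr once it is finished.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rates at the start, the midpoint and the end of a 100-step schedule.
schedule = [cosine_annealed_lr(s, 100, base_lr=1e-3, min_lr=1e-5) for s in (0, 50, 100)]
```

With a final rate of zero this reduces to plain cosine decay; a non-zero floor keeps the optimizer making small updates at the end of training.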
New features
Reading order training
Reading order can now be learned with `ketos rotrain` and reading order models can be added to segmentation model files. The training process is documented [here](https://kraken.re/5.2/ketos.html#reading-order-training).
Upgrade guide
Command line
Polygon extractor
The polygon extractor is responsible for taking a page image, baselines, and their bounding polygons and dewarping + masking out the line. Here is an example:
![kraken_faster](https://github.com/mittagessen/kraken/assets/4519091/fec6900a-fe23-40c3-89c0-50ecfc73f320)
The new polygon extractor reduces line extraction time 30x, roughly halving inference time and significantly speeding up training from XML files and compilation of datasets. Note that polygon extraction *does not* affect data in the legacy bounding box format, nor does it touch the segmentation process: it is only a preprocessing step in the recognizer, applied to an already existing segmentation.
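To make the masking half of that job concrete, here is a toy sketch in pure Python. It crops an image (a nested list of pixel values) to a polygon's bounding box and blanks every pixel whose centre lies outside the polygon, using an even-odd ray casting test. It leaves out dewarping entirely, and the helper names `point_in_polygon` and `mask_line` are invented for this example; kraken's actual extractor is implemented differently.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def mask_line(image: List[List[int]], polygon: Sequence[Point], fill: int = 0) -> List[List[int]]:
    """Crop to the polygon's bounding box and replace outside pixels with `fill`."""
    height, width = len(image), len(image[0])
    x_min = max(0, math.floor(min(p[0] for p in polygon)))
    x_max = min(width, math.ceil(max(p[0] for p in polygon)))
    y_min = max(0, math.floor(min(p[1] for p in polygon)))
    y_max = min(height, math.ceil(max(p[1] for p in polygon)))
    return [
        # Pixel (x, y) is tested at its centre (x + 0.5, y + 0.5).
        [image[y][x] if point_in_polygon(x + 0.5, y + 0.5, polygon) else fill
         for x in range(x_min, x_max)]
        for y in range(y_min, y_max)
    ]


page = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
toy_polygon = [(0.0, 0.0), (3.5, 0.0), (0.0, 3.5)]
masked = mask_line(page, toy_polygon)
```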
Not all improvements in the polygon extractor are backward compatible, causing models trained with data extracted with the old implementation to suffer from a slight reduction in accuracy (usually <0.25 percentage points). Therefore models now contain a flag in their metadata indicating which implementation has been used to train them. This flag can be overridden, e.g.:
$ kraken --no-legacy-polygons -i ... ... ocr ...
to enable all speedups for a slight increase in character error rate.
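The interaction between the metadata flag and the command line override can be summarised in a few lines. This is a sketch of the decision as described above, with hypothetical names (`use_legacy_extractor`, `model_trained_with_legacy`), not kraken's internal code:

```python
def use_legacy_extractor(model_trained_with_legacy: bool,
                         no_legacy_polygons: bool = False) -> bool:
    """Decide which polygon extractor to run at inference time.

    model_trained_with_legacy: stand-in for the flag stored in the model metadata.
    no_legacy_polygons: stand-in for the `--no-legacy-polygons` override.
    """
    if no_legacy_polygons:
        # The user accepts a slightly higher error rate in exchange for speed.
        return False
    # Otherwise follow whatever the model was trained with.
    return model_trained_with_legacy
```

Without the override, a model trained on data from the old extractor keeps using it, so accuracy is preserved by default and the speedup is opt-in.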
For training, the new extractor is enabled by default, i.e. models trained with kraken 5.x will perform slightly worse on earlier kraken versions but will still work. It is possible to force the use of only backwards-compatible speedups:
$ ketos compile --legacy-polygons ...
$ ketos train --legacy-polygons ...
$ ketos pretrain --legacy-polygons ...
Threads and Multiprocessing
The command line tools now handle multiprocessing and thread pools more completely and configurably. `--workers` has been split into `--threads` and `--workers`, the former option limiting the size of thread pools (as much as possible) for intra-op parallelization, the latter setting the number of worker processes, usually for the purpose of data loading in training and dataset compilation.
API changes
While 5.x preserves the general OCR functional blocks, the existing dictionary-based data structures have been replaced with [container classes](https://kraken.re/5.2/api_docs.html#kraken-containers-module) and the XML parser has been reworked.
Container classes
For straightforward processing little has changed. Most keys of the dictionaries have been converted into attributes of their respective classes.
The segmentation methods now return a [Segmentation](https://kraken.re/5.2/api_docs.html#kraken.containers.Segmentation) object containing [Region](https://kraken.re/5.2/api_docs.html#kraken.containers.Region) and [BaselineLine](https://kraken.re/5.2/api_docs.html#kraken.containers.BaselineLine)/[BBoxLine](https://kraken.re/5.2/api_docs.html#kraken.containers.BBoxLine) objects:
>>> pageseg.segment(im)
{'text_direction': 'horizontal-lr',
'boxes': [(x1, y1, x2, y2),...],
'script_detection': False
}
>>> blla.segment(im)
{'text_direction': '$dir',
'type': 'baseline',
'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0], [x1, y1], ..., [x_m, y_m]]}, ...
{'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}],
'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'}, ...
{'region': [[x0, ...]], 'type': 'text'}]
}
becomes:
>>> pageseg.segment(im)
Segmentation(type='bbox',
imagename=None,
text_direction='horizontal-lr',
script_detection=False,
lines=[BBoxLine(id='f1d5b1e2-030c-41d5-b299-8a114eb0996e',
bbox=[34, 198, 279, 251],
text=None,
base_dir=None,
type='bbox',
imagename=None,
tags=None,
split=None,
regions=None,
text_direction='horizontal-lr'),
BBoxLine(...)],
line_orders=[])
>>> blla.segment(im)
Segmentation(type='baseline',
imagename=im,
text_direction='horizontal-lr',
script_detection=False,
lines=[BaselineLine(id='50ab1a29-c3b6-4659-9713-ff246b21d2dc',
baseline=[[183, 284], [272, 282]],
boundary=[[183, 284], ... ,[183, 284]],
text=None,
base_dir=None,
type='baselines',
tags={'type': 'default'},
split=None,
regions=['e28ccb6b-2874-4be0-8e0d-38948f0fdf09']), ...],
regions={'text': [Region(id='e28ccb6b-2874-4be0-8e0d-38948f0fdf09',
boundary=[[123, 218], ..., [123, 218]],
tags={'type': 'text'}), ...],
'foo': [Region(...), ...]},
line_orders=[])
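For code that still passes the old dictionaries around, the migration mostly means turning key lookups into attribute access. The sketch below uses stand-in dataclasses (`LineStub`, `SegmentationStub`) and a hypothetical converter `from_legacy_dict`; they only mirror a few of the fields shown above and are not the classes from `kraken.containers`:

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LineStub:
    """Stand-in for a baseline line container (illustrative subset of fields)."""
    id: str
    baseline: List[List[int]]
    boundary: List[List[int]]
    text: Optional[str] = None
    tags: Optional[Dict[str, str]] = None


@dataclass
class SegmentationStub:
    """Stand-in for a segmentation container (illustrative subset of fields)."""
    type: str
    text_direction: str
    lines: List[LineStub] = field(default_factory=list)


def from_legacy_dict(legacy: dict) -> SegmentationStub:
    """Convert an old-style segmentation dict into the stand-in containers."""
    lines = [LineStub(id=str(uuid.uuid4()),  # the new containers carry UUIDs
                      baseline=line['baseline'],
                      boundary=line['boundary'],
                      tags=line.get('tags'))
             for line in legacy['lines']]
    return SegmentationStub(type=legacy['type'],
                            text_direction=legacy['text_direction'],
                            lines=lines)


legacy = {'text_direction': 'horizontal-lr',
          'type': 'baseline',
          'lines': [{'baseline': [[183, 284], [272, 282]],
                     'boundary': [[183, 290], [272, 288], [272, 270], [183, 272]]}]}
seg = from_legacy_dict(legacy)

old_style = legacy['lines'][0]['baseline']   # dictionary keys (4.x)
new_style = seg.lines[0].baseline            # attributes (5.x style)
```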
The recognizer now yields
[`BaselineOCRRecords`](https://kraken.re/5.2/api_docs.html#kraken.containers.BaselineOCRRecord)/[`BBoxOCRRecords`](https://kraken.re/5.2/api_docs.html#kraken.containers.BBoxOCRRecord)
which both inherit from the `BaselineLine`/`BBoxLine` classes:
>>> records = rpred(network=model,
im=im,
segmentation=baseline_seg)
>>> record = next(records)
>>> record
BaselineOCRRecord pred: 'predicted text' baseline: ...
>>> record.type
'baselines'
>>> record.line
BaselineLine(...)
>>> record.prediction
'predicted text'
One complication is the new serialization function which now accepts a
`Segmentation` object instead of a list of `ocr_records` and ancillary metadata:
>>> records = list(x for x in rpred(...))
>>> serialize(records,
image_name=im.filename,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
regions=[{...}],
template='alto',
template_source='native',
processing_steps=proc_steps)
becomes:
>>> import dataclasses
>>> baseline_seg
Segmentation(...)
>>> records = list(x for x in rpred(..., segmentation=baseline_seg))
>>> results = dataclasses.replace(baseline_seg, lines=records)
>>> serialize(results,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
template='alto',
template_source='native',
processing_steps=proc_steps)
This requires the construction of a new `Segmentation` object that contains the
records produced by the text predictor. The most straightforward way to create
this new `Segmentation` is through the `dataclasses.replace` function as our
container classes are immutable.
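The pattern can be tried without kraken installed. The sketch below uses two frozen stand-in dataclasses (`LineStub` and `SegmentationStub`, both hypothetical and much smaller than the real containers) to show that `dataclasses.replace` returns a new object and leaves the original untouched:

```python
import dataclasses
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LineStub:
    """Stand-in for a line container; frozen to mimic immutability."""
    id: str
    text: Optional[str] = None


@dataclass(frozen=True)
class SegmentationStub:
    """Stand-in for a segmentation container."""
    text_direction: str
    lines: List[LineStub]


seg = SegmentationStub(text_direction='horizontal-lr',
                       lines=[LineStub(id='l0'), LineStub(id='l1')])

# Pretend these are the records produced by the recognizer.
records = [dataclasses.replace(line, text=f'text for {line.id}') for line in seg.lines]

# Build a new container that carries the records; every other field is copied.
results = dataclasses.replace(seg, lines=records)
```

Attempting `seg.lines = records` on a frozen dataclass raises `dataclasses.FrozenInstanceError`, which is why the copy-with-changes route is the natural one here.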
Lastly, `serialize_segmentation` has been removed. The `serialize` function now
accepts `Segmentation` objects which do not contain text predictions:
>>> serialize_segmentation(segresult={'text_direction': '$dir',
'type': 'baseline',
'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0], [x1, y1], ..., [x_m, y_m]]}, ...
{'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}],
'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'}, ...
{'region': [[x0, ...]], 'type': 'text'}]
},
image_name=im.filename,
image_size=im.size,
template='alto',
template_source='native',
processing_steps=proc_steps)
is replaced by:
>>> baseline_seg
Segmentation(...)
>>> serialize(baseline_seg,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
template='alto',
template_source='native',
processing_steps=proc_steps)
XML parsing
The `kraken.lib.xml.parse_{xml,alto,page}` methods have been replaced by a single [`kraken.lib.xml.XMLPage`](https://kraken.re/5.2/api.html#xml-parsing) class.
>>> parse_xml('xyz.xml')
{'image': impath,
'lines': [{'boundary': [[x0, y0], ...],
'baseline': [[x0, y0], ...],
'text': 'apdjfqpf',
'tags': {'type': 'default', ...}},
...
{...}],
'regions': {'region_type_0': [[[x0, y0], ...], ...], ...}}
becomes
>>> XMLPage('xyz.xml')
XMLPage xyz.xml (format: alto, image: impath)
As the parser is now aware of reading order, the `XMLPage.lines` attribute is an
unordered dict of `BaselineLine`/`BBoxLine` container classes. As ALTO/PageXML
files can generally contain multiple different reading orders, the
`XMLPage.get_sorted_lines()`/`XMLPage.get_sorted_regions()` methods on the object
provide an ordered view of lines or regions. The default order,
`line_implicit`/`region_implicit`, corresponds to the order produced by the
previous parsers, i.e. the order formed by the sequence of elements in the XML
tree.
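The idea of an unordered mapping plus several named reading orders can be pictured with plain dictionaries. The data layout and the helper `sorted_lines` below are assumptions made for illustration and do not reflect how `XMLPage` stores its data internally:

```python
from typing import Dict, List

# Lines keyed by identifier; the dict itself implies no reading order.
lines: Dict[str, str] = {
    'line_b': 'second element in the XML tree',
    'line_a': 'first element in the XML tree',
    'line_c': 'third element in the XML tree',
}

# Each named reading order is a sequence of line identifiers.
reading_orders: Dict[str, List[str]] = {
    'line_implicit': ['line_a', 'line_b', 'line_c'],  # document order
    'custom_ro': ['line_c', 'line_a', 'line_b'],      # hypothetical explicit order
}


def sorted_lines(lines: Dict[str, str],
                 reading_orders: Dict[str, List[str]],
                 ro: str = 'line_implicit') -> List[str]:
    """Return the lines as an ordered list according to the chosen reading order."""
    return [lines[line_id] for line_id in reading_orders[ro]]


implicit_view = sorted_lines(lines, reading_orders)
custom_view = sorted_lines(lines, reading_orders, ro='custom_ro')
```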
`XMLPage` objects can be converted into a `Segmentation` container using the
`XMLPage.to_container()` method:
>>> XMLPage('xyz.xml').to_container()
Segmentation(...)
**Full Changelog**: https://github.com/mittagessen/kraken/compare/4.3.13...5.2