- page.search_text() did not find certain substrings present in page.free_form_text(). Found two reasons for this behavior.
* The list of OCR data passed to `find_first_word_coord` was page.ocr('words'), whose entries are sorted by word_id. First, this made the sort flag of the function obsolete. Second, it meant that the coordinates of some sub-strings from page.free_form_text() could not be found, because the order of the words in page.free_form_text() is unrelated to word_id.
For example, a sub-string of page.free_form_text() might be “brown fox”, where “brown” has word_id 3 and “fox” has word_id 8. In this case find_first_word_coords would not find the coordinates: after matching “brown” it would break out of the for-loop, because the word with word_id 4 is not “fox”.
This behavior was fixed by passing the OCR data in the same order as in page.free_form_text(), while keeping the option to sort it by word_id using the sort-flag.
* Inside the find_first_word_coord function, the words of the sub-string were always put through a cleanup regex before being compared to ocr_text, whereas ocr_text itself was not cleaned up when the clean-flag was set to false. As a result, a sub-string such as “Phone: 12345” would not be found, because “Phone:” was cleaned up to “Phone” and no longer matched the uncleaned OCR text.
This was fixed by putting either both the words of the sub-string and the values of ocr_text through the cleanup regex, or neither of them, depending on the clean-flag.