What's Changed
- Added SemanticProcessing, which is now the recommended processing pipeline.
- Added optional annotations to the PDF draw functions.
- Fixed a reading order bug.
Breaking Changes
1. Renaming
- `Node.aggregate_position` renamed to `Node.reading_order`.
- `RemoveStubs` renamed to `RemoveNodesBelowNTokens`.
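For codebases that still use the old names, the two renames above can be applied mechanically. The following is a minimal sketch, not part of open-parse: `RENAMES` and `migrate_source` are hypothetical helper names, and the sample input strings are purely illustrative. It only rewrites whole identifiers, so review the result before committing it.

```python
import re

# Old name -> new name, taken from the renames listed above.
RENAMES = {
    "aggregate_position": "reading_order",
    "RemoveStubs": "RemoveNodesBelowNTokens",
}

# Match whole identifiers only, so a longer name such as
# "RemoveStubsExtra" is left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(text: str) -> str:
    """Return `text` with the old identifiers replaced by the new ones."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)


# Illustrative input only; not taken from the open-parse code base.
old_snippet = "order = node.aggregate_position\nsteps = [RemoveStubs]"
print(migrate_source(old_snippet))
```

Running this over each source file (for example via `pathlib.Path.read_text` and `write_text`) would cover a typical migration, assuming the old names are not also used for unrelated identifiers in the project.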
2. Refactored processing pipelines to use a class to promote ease of reuse
Previously
```python
import openparse
from openparse import ProcessingStep, default_pipeline, Node
from typing import List


class CustomCombineTables(ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        # custom table-combining logic goes here
        return nodes


# copy the default pipeline (or create a new one)
custom_pipeline = default_pipeline.copy()
custom_pipeline.append(CustomCombineTables())

parser = openparse.DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}, processing_pipeline=custom_pipeline
)
# meta10k_path is the path to the PDF being parsed (defined elsewhere)
custom_10k = parser.parse(meta10k_path)
```
Now becomes
```python
import openparse
from openparse import processing, Node
from typing import List


class CustomCombineTables(processing.ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        # custom table-combining logic goes here
        return nodes


# start from the basic ingestion pipeline (or create a new one)
custom_pipeline = processing.BasicIngestionPipeline()
custom_pipeline.append_transform(CustomCombineTables())

parser = openparse.DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}, processing_pipeline=custom_pipeline
)
# meta10k_path is the path to the PDF being parsed (defined elsewhere)
custom_10k = parser.parse(meta10k_path)
```
3. `openai` and `numpy` are now required dependencies. This will likely be split out in the future.
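Because both packages are now required, it can help to confirm they are present before upgrading. The check below is a small standard-library sketch; `REQUIRED` and `missing_packages` are hypothetical names introduced here for illustration and are not part of open-parse.

```python
from importlib.util import find_spec
from typing import Iterable, List

# The two packages named above as newly required.
REQUIRED = ("openai", "numpy")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing required packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

`find_spec` only locates the package without importing it, so the check stays fast and has no side effects.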
**Full Changelog**: https://github.com/Filimoa/open-parse/compare/v0.4.1...v0.5.0