feat: new splitter: structure-aware markdown
Features:
- `ParsingConfig.splitter` has a new option, `Splitter.Markdown`, which is now the default and works for both plain text (which is by definition
  valid markdown) and markdown text. It implements "structure-aware" chunking, which means it:
  - tries to keep entire sections as chunks if they are not too big (relative to the chunking configs)
  - recursively splits large sections, avoiding breaking paragraphs where possible; if that's not feasible, it avoids breaking sentences,
    and breaks sentences only as a last resort
  - enriches chunks with the headers from enclosing sections to improve the match surface during retrieval
- `DocChatAgent` by default now uses this splitter
- Crawlers in `URLLoader`:
  - `TrafilaturaCrawlerConfig.format` can be set to one of 3 possible values:
    - `"markdown"` (default) - extracts content from the page in markdown format
    - `"txt"` - extracts content as plain text
    - `"xml"` - extracts text with HTML tags, and the output is converted to markdown using the `markdownify` lib
  - `ExaCrawler` now extracts content in HTML format, which is then converted to markdown using `markdownify`
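
For reviewers, the "structure-aware" chunking described above can be illustrated with a minimal, stdlib-only sketch. This is not the code added in this change: `chunk_markdown`, `split_sections`, `split_text`, and `pack` are hypothetical names, size is measured in whitespace-separated words (the real splitter presumably uses the chunking configs), and header lines inside fenced code blocks are not special-cased.

```python
import re
from typing import List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def n_words(text: str) -> int:
    return len(text.split())


def split_sections(text: str) -> List[Tuple[List[str], str]]:
    """Return (enclosing_headers, body) for each section with a non-empty body."""
    sections: List[Tuple[List[str], str]] = []
    stack: List[Tuple[int, str]] = []  # (level, title) of enclosing headers
    body: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append(([title for _, title in stack], content))
        body.clear()

    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # A new header closes every open header at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


def pack(pieces: List[str], max_words: int, sep: str) -> List[str]:
    """Greedily merge consecutive pieces while the result stays within max_words."""
    chunks: List[str] = []
    cur: List[str] = []
    for piece in pieces:
        if cur and n_words(sep.join(cur + [piece])) > max_words:
            chunks.append(sep.join(cur))
            cur = []
        cur.append(piece)
    if cur:
        chunks.append(sep.join(cur))
    return chunks


def split_text(text: str, max_words: int, level: int = 0) -> List[str]:
    """Split at paragraphs (level 0), then sentences (1), then words (2)."""
    if n_words(text) <= max_words:
        return [text]
    if level == 0:
        parts, sep = re.split(r"\n\s*\n", text), "\n\n"
    elif level == 1:
        parts, sep = re.split(r"(?<=[.!?])\s+", text), " "
    else:
        # Last resort: break inside a sentence at word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]
    pieces: List[str] = []
    for part in parts:
        if part.strip():
            pieces.extend(split_text(part.strip(), max_words, level + 1))
    return pack(pieces, max_words, sep)


def chunk_markdown(text: str, max_words: int = 50) -> List[str]:
    """Chunk markdown by section, prefixing each chunk with its header path."""
    chunks: List[str] = []
    for headers, body in split_sections(text):
        prefix = " > ".join(headers)  # header words are not counted in the limit
        for piece in split_text(body, max_words):
            chunks.append(f"{prefix}\n\n{piece}" if prefix else piece)
    return chunks
```

In this sketch a section that fits within `max_words` is emitted as a single chunk, and every chunk carries its header path (for example `Guide > Setup`) so that a query matching only the section title can still retrieve the chunk.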