Crawl4ai

Latest version: v0.5.0.post8

0.3.746

Major Features
1. Enhanced Docker Support (Nov 29, 2024)
- Improved GPU support in Docker images.
- Dockerfile refactored for better platform-specific installations.
- Introduced new Docker commands for different platforms:
  - `basic-amd64`, `all-amd64`, `gpu-amd64` for AMD64.
  - `basic-arm64`, `all-arm64`, `gpu-arm64` for ARM64.
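
As a rough illustration of how the six names above relate to variant and CPU architecture, the sketch below builds the name from the two parts. The helper `docker_target` and the alias table are hypothetical and not part of Crawl4ai; only the resulting names come from the release notes.

```python
import platform
from typing import Optional

# Variant and architecture names are taken from the release notes above.
VARIANTS = ("basic", "all", "gpu")

# Common platform.machine() values mapped to the suffix used in the names.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def docker_target(variant: str = "basic", machine: Optional[str] = None) -> str:
    """Return a name such as 'basic-amd64' (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # Fall back to the architecture of the current machine.
    machine = (machine or platform.machine()).lower()
    try:
        arch = ARCH_ALIASES[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine!r}") from None
    return f"{variant}-{arch}"
```

Calling `docker_target("gpu", "x86_64")` yields `gpu-amd64`; with no `machine` argument the helper inspects the host it runs on.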

Infrastructure & Documentation
- Enhanced README.md to improve user guidance and installation instructions.
- Added installation instructions for Playwright setup in README.
- Created and updated examples in `docs/examples/quickstart_async.py` to be more useful and user-friendly.
- Updated `requirements.txt` with a new `pydantic` dependency.
- Bumped version number in `crawl4ai/__version__.py` to 0.3.746.

Breaking Changes
- Streamlined application structure:
  - Removed static pages and related code from `main.py`, which may affect existing deployments that rely on static content.

Development Updates
- Developed `post_install` method in `crawl4ai/install.py` to streamline post-installation setup tasks.
- Refined migration processes in `crawl4ai/migrations.py` with enhanced logging for better error visibility.
- Updated `docker-compose.yml` to support local and hub services for different architectures, enhancing build and deploy capabilities.
- Refactored example test cases in `docs/examples/docker_example.py` to facilitate comprehensive testing.

README.md
Updated README with new docker commands and setup instructions.
Enhanced installation instructions and guidance.

crawl4ai/install.py
Added post-install script functionality.
Introduced `post_install` method for automation of post-installation tasks.

crawl4ai/migrations.py
Improved migration logging.
Refined migration processes and added better logging.

docker-compose.yml
Refactored docker-compose for better service management.
Updated to define services for different platforms and versions.

requirements.txt
Updated dependencies.
Added `pydantic` to requirements file.

crawl4ai/__version__.py
Updated version number.
Bumped version number to 0.3.746.

docs/examples/quickstart_async.py
Enhanced example scripts.
Uncommented the example usage in the async quickstart so users can run it directly.

main.py
Refactored code to improve maintainability.
Streamlined app structure by removing static pages code.

0.3.743

Enhance features and documentation
- Updated version to 0.3.743
- Improved ManagedBrowser configuration with dynamic host/port
- Implemented fast HTML formatting in web crawler
- Enhanced markdown generation with a new generator class
- Improved sanitization and utility functions
- Added contributor details and pull request acknowledgments
- Updated documentation for clearer usage scenarios
- Adjusted tests to reflect class name changes

CONTRIBUTORS.md
Added new contributors and pull request details.
Updated community contributions and acknowledged pull requests.

crawl4ai/__version__.py
Version update.
Bumped version to 0.3.743.

crawl4ai/async_crawler_strategy.py
Improved ManagedBrowser configuration.
Enhanced browser initialization with configurable host and debugging port; improved hook execution.
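
A minimal sketch of what a configurable host and debugging port might look like. The class `DebugEndpoint` and its defaults are assumptions for illustration and do not reflect the actual `ManagedBrowser` signature; `--remote-debugging-port` is a standard Chromium switch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DebugEndpoint:
    """Hypothetical holder for the host and port a managed browser listens on."""

    host: str = "localhost"  # assumed default
    port: int = 9222         # assumed default

    @property
    def cdp_url(self) -> str:
        # Address a DevTools-protocol client would connect to.
        return f"http://{self.host}:{self.port}"

    def chromium_flags(self) -> List[str]:
        # Launch flag that makes Chromium open the debugging port.
        return [f"--remote-debugging-port={self.port}"]
```

Keeping host and port in one value means the launch flags and the connection URL cannot drift apart when either is changed.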

crawl4ai/async_webcrawler.py
Optimized HTML processing.
Implemented 'fast_format_html' for optimized HTML formatting; applied it when 'prettiify' is enabled.
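
To give a sense of what a lightweight HTML formatter can do without a full parser, here is a small regex-based indenter. It is a sketch under simplifying assumptions (no `>` inside attributes or comments) and is not the library's `fast_format_html`; the name `format_html_sketch` is hypothetical.

```python
import re

# Split markup into tags and the text between them.
_TOKEN = re.compile(r"(<[^>]+>)")
# A few elements that never take a closing tag.
_VOID = {"br", "hr", "img", "input", "link", "meta"}


def format_html_sketch(html: str, indent: str = "  ") -> str:
    """Put each tag and text run on its own line, indented by nesting depth."""
    depth = 0
    lines = []
    for token in _TOKEN.split(html):
        token = token.strip()
        if not token:
            continue
        if token.startswith("</"):
            # Closing tag: step out before printing.
            depth = max(depth - 1, 0)
            lines.append(indent * depth + token)
        elif token.startswith("<"):
            lines.append(indent * depth + token)
            name = re.match(r"<\s*([a-zA-Z0-9]+)", token)
            is_void = token.endswith("/>") or (
                name is not None and name.group(1).lower() in _VOID
            )
            # Declarations and comments (<!...>) do not open a new level.
            if not is_void and not token.startswith("<!"):
                depth += 1
        else:
            lines.append(indent * depth + token)
    return "\n".join(lines)
```

For example, `format_html_sketch("<div><p>Hi</p><br></div>")` places `<p>` one level inside `<div>` and `Hi` one level inside `<p>`, while `<br>` does not open a new level.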

crawl4ai/content_scraping_strategy.py
Enhanced markdown generation strategy.
Updated to use DefaultMarkdownGenerator and improved markdown generation with filters option.

crawl4ai/markdown_generation_strategy.py
Refactored markdown generation class.
Renamed DefaultMarkdownGenerationStrategy to DefaultMarkdownGenerator; added content filter handling.

crawl4ai/utils.py
Enhanced utility functions.
Improved input sanitization and enhanced HTML formatting method.

docs/md_v2/advanced/hooks-auth.md
Improved documentation for hooks.
Updated code examples to include cookies in crawler strategy initialization.

tests/async/test_markdown_genertor.py
Refactored tests to match class renaming.
Updated tests to use renamed DefaultMarkdownGenerator class.

0.3.731

- Fixed: Browser context unexpectedly closing in Docker environment during crawl operations.
- Removed: __del__ method from AsyncPlaywrightCrawlerStrategy to prevent unreliable asynchronous cleanup, ensuring the browser context is closed explicitly within context managers.
- Added: Monitoring for ManagedBrowser subprocess to detect and log unexpected terminations.
- Updated: Dockerfile configurations to expose debugging port (9222) and allocate additional shared memory for improved browser stability.
- Improved: Error handling and resource cleanup processes for browser lifecycle management within the Docker environment.
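
The two ideas above (explicit cleanup in a context manager instead of `__del__`, and a watcher that logs unexpected subprocess exits) can be sketched with the standard library alone. `MonitoredProcess` is a hypothetical class for illustration, not the `ManagedBrowser` implementation.

```python
import asyncio
import logging
import sys
from typing import List, Optional

logger = logging.getLogger("browser_monitor_sketch")


class MonitoredProcess:
    """Hypothetical wrapper: start a child process, watch it, close it explicitly."""

    def __init__(self, argv: List[str]) -> None:
        self.argv = argv
        self.process: Optional[asyncio.subprocess.Process] = None
        self.unexpected_exit_code: Optional[int] = None
        self._monitor: Optional[asyncio.Task] = None
        self._closing = False

    async def __aenter__(self) -> "MonitoredProcess":
        self.process = await asyncio.create_subprocess_exec(*self.argv)
        # Background task that notices when the child goes away.
        self._monitor = asyncio.create_task(self._watch())
        return self

    async def _watch(self) -> None:
        # wait() returns once the child has exited, whatever the cause.
        code = await self.process.wait()
        if not self._closing:
            self.unexpected_exit_code = code
            logger.error("child exited unexpectedly with code %s", code)

    async def __aexit__(self, *exc_info) -> None:
        # Cleanup runs here, at a known point, rather than in __del__.
        self._closing = True
        if self.process.returncode is None:
            self.process.terminate()
        await self._monitor
```

Used as `async with MonitoredProcess([sys.executable, "-c", "..."]) as proc:`, the child is always terminated when the block ends, and an exit that happens before then is recorded in `unexpected_exit_code` and logged.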

0.3.75

PruningContentFilter

1. Introduced PruningContentFilter (Dec 01, 2024)
A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.

**Affected Files:**
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
  Implemented effective pruning algorithm with comprehensive scoring.

- `README.md`: Improved documentation regarding new features.
  Updated to include usage and explanation for the PruningContentFilter.

- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
  Added detailed section explaining the PruningContentFilter.
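
The idea of pruning by text and link density can be sketched on a toy tree. Everything below (the `Node` type, the weights, the 100-characters-per-node scale, the threshold) is an assumption chosen for illustration and is not the scoring used by `PruningContentFilter`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Minimal stand-in for a DOM node (hypothetical, not a library type)."""

    tag: str
    text: str = ""        # text directly inside this node
    link_text: str = ""   # the part of that text that sits inside <a> elements
    children: List["Node"] = field(default_factory=list)

    def total_text(self) -> int:
        return len(self.text) + sum(c.total_text() for c in self.children)

    def total_link_text(self) -> int:
        return len(self.link_text) + sum(c.total_link_text() for c in self.children)

    def node_count(self) -> int:
        return 1 + sum(c.node_count() for c in self.children)


def score(node: Node) -> float:
    """Combine text density and link density into one value in [0, 1]."""
    text = node.total_text()
    if text == 0:
        return 0.0
    # Characters per node, saturating at 100 characters per node (assumed scale).
    text_density = min(text / (node.node_count() * 100.0), 1.0)
    link_density = node.total_link_text() / text
    # Illustrative weights: reward dense text, penalise link-heavy blocks.
    return 0.6 * text_density + 0.4 * (1.0 - link_density)


def prune(node: Node, threshold: float = 0.5) -> Optional[Node]:
    """Return a copy of the tree without subtrees that score below threshold."""
    if score(node) < threshold:
        return None
    kept = [p for p in (prune(c, threshold) for c in node.children) if p is not None]
    return Node(node.tag, node.text, node.link_text, kept)
```

With a long paragraph and a short navigation block made entirely of link text under one parent, `prune` keeps the paragraph and drops the navigation block, which is the behaviour the entry describes.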

2. Added Unit Tests for PruningContentFilter (Dec 01, 2024)
Comprehensive tests added to ensure correct functionality of PruningContentFilter.

**Affected Files:**
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
  Created test cases for various scenarios using the PruningContentFilter.

Development Updates

3. Enhanced BM25ContentFilter tests (Dec 01, 2024)
Extended testing to cover additional edge cases and performance metrics.

**Affected Files:**
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
  Added tests for new extraction scenarios including malformed HTML.

Infrastructure & Documentation

4. Updated Examples (Dec 01, 2024)
Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.

**Affected Files:**
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
  Revised example to illustrate usage of PruningContentFilter.

0.3.74

1. **File Download Processing** (Nov 14, 2024)
- Added capability for users to specify download folders
- Implemented file download tracking in the crawl result object
- Created new file: `tests/async/test_async_doanloader.py`

2. **Content Filtering Improvements** (Nov 14, 2024)
- Introduced Relevance Content Filter as an improvement over Fit Markdown
- Implemented BM25 algorithm for content relevance matching
- Added new file: `crawl4ai/content_filter_strategy.py`
- Removed deprecated: `crawl4ai/content_cleaning_strategy.py`
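
For readers unfamiliar with BM25, the sketch below scores text chunks against a query using the standard Okapi BM25 formula. It is a self-contained illustration with commonly used default parameters (`k1=1.5`, `b=0.75`), not the code in `crawl4ai/content_filter_strategy.py`; the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercase alphanumeric tokens; a real filter may also stem or drop stop words.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, documents: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 relevance score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents need more hits to score as high.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(total)
    return scores
```

A content filter built on this idea would keep the chunks whose score exceeds some threshold and discard the rest; chunks that share no term with the query score exactly zero.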

3. **Local File and Raw HTML Support** (Nov 13, 2024)
- Added support for processing local files
- Implemented raw HTML input handling in AsyncWebCrawler
- Enhanced `crawl4ai/async_webcrawler.py` with significant performance improvements
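
One way to picture local-file and raw-HTML input is a loader that dispatches on the form of its argument. The `raw:` marker and the function `load_html` below are assumptions for illustration; the release notes do not specify how AsyncWebCrawler distinguishes these inputs.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

RAW_PREFIX = "raw:"  # assumed marker for inline HTML, for illustration only


def load_html(source: str) -> str:
    """Return HTML from a file:// URL or an inline raw HTML string."""
    if source.startswith(RAW_PREFIX):
        # Inline HTML: everything after the marker is the document.
        return source[len(RAW_PREFIX):]
    parsed = urlparse(source)
    if parsed.scheme == "file":
        # Convert the URL path (possibly percent-encoded) to a local path.
        return Path(url2pathname(parsed.path)).read_text(encoding="utf-8")
    raise ValueError("this sketch handles only local files and raw HTML")
```

A `file://` URL can be produced from any absolute path with `Path.as_uri()`, which keeps the example portable across operating systems.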

4. **Browser Management Enhancements** (Nov 12, 2024)
- Implemented new async crawler strategy using Playwright
- Introduced ManagedBrowser for better browser session handling
- Added support for persistent browser sessions
- Switched from playwright_stealth to tf-playwright-stealth

5. **API Server Component**
- Added CORS support
- Implemented static file serving
- Enhanced root redirect functionality

0.3.73

Added
- preserve_tags: Added support for preserving specific HTML tags during markdown conversion.
- Smart overlay removal system in AsyncPlaywrightCrawlerStrategy:
  - Automatic removal of popups, modals, and cookie notices
  - Detection and removal of fixed/sticky position elements
  - Cleaning of empty block elements
  - Configurable via `remove_overlay_elements` parameter
- Enhanced screenshot capabilities:
  - Added `screenshot_wait_for` parameter to control timing
  - Improved screenshot handling with existing page context
  - Better error handling with fallback error images
- New URL normalization utilities:
  - `normalize_url` function for consistent URL formatting
  - `is_external_url` function for better link classification
- Custom base directory support for cache storage:
  - New `base_directory` parameter in AsyncWebCrawler
  - Allows specifying alternative locations for `.crawl4ai` folder
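
The two URL utilities named above could behave roughly as follows. The function names come from the release notes, but the signatures and exact rules here are assumptions built on `urllib.parse`, not the library's code.

```python
from urllib.parse import urljoin, urlparse, urlunparse

# Schemes that do not point at a crawlable page.
NON_HTTP_SCHEMES = ("mailto:", "tel:", "javascript:")


def normalize_url(href: str, base_url: str) -> str:
    """Resolve href against base_url and lowercase the host (assumed behaviour)."""
    href = href.strip()
    if href.lower().startswith(NON_HTTP_SCHEMES):
        return href  # leave special protocols untouched
    # urljoin resolves relative paths and protocol-relative URLs (//host/path).
    parts = urlparse(urljoin(base_url, href))
    return urlunparse(parts._replace(netloc=parts.netloc.lower()))


def is_external_url(url: str, base_url: str) -> bool:
    """True when url is an http(s) link to a different host than base_url."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    return parts.netloc.lower() != urlparse(base_url).netloc.lower()
```

Under these assumptions a protocol-relative link such as `//cdn.example.com/a.js` inherits the scheme of the page it appears on, and is then classified as external because its host differs from the page's host.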

Enhanced
- Link handling improvements:
  - Better duplicate link detection
  - Enhanced internal/external link classification
  - Improved handling of special URL protocols
  - Support for anchor links and protocol-relative URLs
- Configuration refinements:
  - Streamlined social media domain list
  - More focused external content filtering
- LLM extraction strategy:
  - Added support for separate API base URL via `api_base` parameter
  - Better handling of base URLs in configuration

Fixed
- Screenshot functionality:
  - Resolved issues with screenshot timing and context
  - Improved error handling and recovery
- Link processing:
  - Fixed URL normalization edge cases
  - Better handling of invalid URLs
  - Improved error messages for link processing failures

Developer Notes
- The overlay removal system uses advanced JavaScript injection for better compatibility
- URL normalization handles special cases like mailto:, tel:, and protocol-relative URLs
- Screenshot system now reuses existing page context for better performance
- Link processing maintains separate dictionaries for internal and external links to ensure uniqueness
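
The last note, keeping internal and external links in separate dictionaries so each URL appears once, can be illustrated as below. `collect_links` and the shape of the stored entries are hypothetical; only the two-dictionary idea is taken from the notes.

```python
from typing import Dict, Iterable, Tuple
from urllib.parse import urljoin, urlparse


def collect_links(base_url: str, anchors: Iterable[Tuple[str, str]]):
    """Split (href, text) pairs into internal and external dicts keyed by URL."""
    internal: Dict[str, dict] = {}
    external: Dict[str, dict] = {}
    base_host = urlparse(base_url).netloc.lower()
    for href, text in anchors:
        url = urljoin(base_url, href.strip())
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, tel: and similar
        bucket = internal if parts.netloc.lower() == base_host else external
        # setdefault keeps the first occurrence, so duplicates collapse.
        bucket.setdefault(url, {"href": url, "text": text})
    return internal, external
```

Keying each dictionary by the resolved URL gives duplicate detection for free: a second anchor pointing at the same address does not create a second entry.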
