# Scrapling

## 0.2.5

### What's changed

#### Bugs Squashed
- Handled an error that occurred with the `wait_selector` argument when it resolved to more than one element. This affects the `StealthyFetcher` and `PlayWrightFetcher` classes (see the sketch after this list).
- Fixed the encoding type in cases where the `content_type` header has a value with parameters such as `charset` (thanks to andyfcx for [#12](https://github.com/D4Vinci/Scrapling/issues/12)).
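
For context, a minimal sketch of the kind of call the `wait_selector` fix concerns, assuming the instance-based `fetch` usage shown later in these notes and a hypothetical page where `.item` matches several elements:

```python
from scrapling import StealthyFetcher

# wait_selector waits for this CSS selector before returning the page.
# A selector resolving to more than one element previously raised an
# error; this release handles it. ('.item' is a hypothetical selector.)
page = StealthyFetcher().fetch('https://example.com', wait_selector='.item')
print(page.status)
```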

#### Quality of life
- Added more tests to cover new parts of the code, and made the tests run in threads.
- Updated the docstrings so they render correctly with Sphinx's apidoc and similar tools.

#### Contributions
- andyfcx made their first contribution in [#13](https://github.com/D4Vinci/Scrapling/pull/13)

---

> [!NOTE]
> A friendly reminder that maintaining and improving `Scrapling` takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like `Scrapling` and want it to keep improving, you can help by supporting me through the [Sponsor button](https://github.com/sponsors/D4Vinci).

## 0.2.4

### What's changed

#### Bugs Squashed
- Fixed a bug when retrieving response bytes after using the `network_idle` argument in both the `StealthyFetcher` and `PlayWrightFetcher` classes, which caused the following error message:

```
Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found
```

- The PlayWright API sometimes returns an empty status text with responses, so `Scrapling` now calculates it manually when that happens. This affects both the `StealthyFetcher` and `PlayWrightFetcher` classes (see the sketch below).
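
A minimal sketch of a fetch using `network_idle`, assuming the instance-based `fetch` call shown elsewhere in these notes:

```python
from scrapling import PlayWrightFetcher

# network_idle waits for network activity to settle before returning;
# retrieving the response bytes in this mode previously raised the
# protocol error quoted above.
page = PlayWrightFetcher().fetch('https://example.com', network_idle=True)
# reason is now computed manually if PlayWright returns an empty status text
print(page.status, page.reason)
```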

---

> [!NOTE]
> A friendly reminder that maintaining and improving `Scrapling` takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like `Scrapling` and want it to keep improving, you can help by supporting me through the [Sponsor button](https://github.com/sponsors/D4Vinci).

## 0.2.3

### What's changed

#### Bugs Squashed
- Fixed a bug with the pip installation that completely prevented the stealth mode of the `PlayWrightFetcher` class from working (see the sketch below).
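
A minimal sketch of the stealth mode this fix restores; treat the `stealth` flag name as an assumption here:

```python
from scrapling import PlayWrightFetcher

# stealth=True enables the PlayWrightFetcher stealth mode, which the
# pip packaging bug previously broke entirely.
page = PlayWrightFetcher().fetch('https://example.com', stealth=True)
print(page.status)
```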


---

> [!NOTE]
> A friendly reminder that maintaining and improving `Scrapling` takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like `Scrapling` and want it to keep improving, you can help by supporting me through the [Sponsor button](https://github.com/sponsors/D4Vinci).

## 0.2.2

### What's changed

#### New features
- Now if you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:

```python
>> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
>> page = Fetcher.get('https://example.com', stealthy_headers=True)
```

Otherwise:

```python
>> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
>> page = Fetcher(auto_match=False).get('https://example.com', stealthy_headers=True)
```



#### Bugs Squashed
1. Fixed a bug with the `Response` object, introduced in patch v0.2.1 yesterday, that occurred in some cases of nested selecting/parsing.

---

> [!NOTE]
> A friendly reminder that maintaining and improving `Scrapling` takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like `Scrapling` and want it to keep improving, you can help by supporting me through the [Sponsor button](https://github.com/sponsors/D4Vinci).

## 0.2.1

### What's changed

#### New features
1. Now the `Response` object returned from all fetchers is the same as the `Adaptor` object, except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All of `cookies`, `headers`, and `request_headers` are always of type `dictionary`. <br/>So your code can now look like:

```python
>> from scrapling import Fetcher
>> page = Fetcher().get('https://example.com', stealthy_headers=True)
>> print(page.status)
200
>> products = page.css('.product')
```

Instead of before:

```python
>> from scrapling import Fetcher
>> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
>> print(fetcher.status)
200
>> page = fetcher.adaptor
>> products = page.css('.product')
```

The `.adaptor` property has been left working for backward compatibility.
2. Now both the `StealthyFetcher` and `PlayWrightFetcher` classes can take a `proxy` argument with the `fetch` method, which accepts a string or a dictionary (see the sketch after this list).
3. Now the `StealthyFetcher` class has an `os_randomize` argument on the `fetch` method. If enabled, Scrapling randomizes the OS fingerprints used; by default, Scrapling matches the fingerprints to the current OS.
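
A minimal sketch combining both new arguments, assuming the instance-based `fetch` call shown above (the proxy URL is hypothetical):

```python
from scrapling import StealthyFetcher

page = StealthyFetcher().fetch(
    'https://example.com',
    # proxy accepts a string or a dictionary; this URL is hypothetical
    proxy='http://user:pass@proxy.example.com:8080',
    # randomize the OS fingerprints instead of matching the current OS
    os_randomize=True,
)
print(page.status)
```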


#### Bugs Squashed
1. Fixed a bug that happened while passing headers with the `Fetcher` class (see the sketch below).
2. Fixed a bug with parsing JSON responses passed from the fetcher-type classes.
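
A minimal sketch of the kind of call the headers fix concerns; the header values are hypothetical:

```python
from scrapling import Fetcher

# Passing custom headers through the Fetcher class previously hit a bug.
page = Fetcher().get(
    'https://example.com',
    headers={'Accept-Language': 'en-US', 'Referer': 'https://example.com'},
)
print(page.status)
```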

#### Quality of life changes
1. The text functionality used to try to remove HTML comments before returning the text, but that caused errors in some cases and made the code more complicated than needed. It has now reverted to the default lxml behavior, **so you will notice a slight speed increase in all operations that depend on elements' text, like selectors**. If you want Scrapling to remove HTML comments from elements before returning the text, to avoid the weird text-splitting behavior found in lxml/parsel/scrapy, just keep the `keep_comments` argument set to True as it is by default.

---

> [!NOTE]
> A friendly reminder that maintaining and improving `Scrapling` takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like `Scrapling` and want it to keep improving, you can help by supporting me through the [Sponsor button](https://github.com/sponsors/D4Vinci).

## 0.2

### What's changed

#### New features
1. Introducing the `Fetchers` feature with three new main types to make Scrapling fetch pages for you with a LOT of options!
   - The `Fetcher` class for basic HTTP requests.
   - The `StealthyFetcher` class, a completely stealthy fetcher that uses a [stealthy modified version of Firefox](https://github.com/daijro/camoufox).
   - The `PlayWrightFetcher` class, which allows browser-based requests with vanilla PlayWright, PlayWright with a stealth mode made by me, real browsers through CDP, and [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless)!
2. Added the completely new `find_all`/`find` methods to find elements easily on the page with dark magic! (See the sketch after this list.)
3. Added the `filter` and `search` methods to the `Adaptors` class for easier bulk operations on groups of `Adaptor` objects.
4. Added the `css_first` and `xpath_first` methods for easier usage.
5. Added the new class type `TextHandlers`, which is used for bulk operations on `TextHandler` objects, like the `Adaptors` class.
6. Added the `generate_full_css_selector` and `generate_full_xpath_selector` methods.
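
A minimal sketch of the new selection helpers, assuming `find_all` accepts a tag name and that this runs against a hypothetical page:

```python
from scrapling import Fetcher

page = Fetcher().get('https://example.com')

# find_all locates elements with friendly arguments (a tag name here);
# css_first returns the first match directly instead of a list.
links = page.find_all('a')
title = page.css_first('h1::text')
print(len(links), title)
```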

#### Bugs Squashed
1. Now the `Adaptors` class version of `re_first` returns the first result that matches across all the `Adaptor` objects inside, instead of the faulty logic of returning the `re_first` results of all `Adaptor` objects.
2. Now if the user selects text-type content to be returned from selected elements (like the css `::text` function) with any method like `.css` or `.xpath`, the `Adaptor` object returns the `TextHandlers` class instead of a list of strings as before. So now you can do `page.css('something::text').re_first(r'regex_pattern').json()` instead of `page.css('something::text')[0].re_first(r'regex_pattern').json()` (see the sketch after this list).
3. Now the `Adaptor`/`Adaptors` `re`/`re_first` arguments are consistent with the `TextHandler` ones, so you now have the `clean_match` and `case_sensitive` arguments.
4. Now the `auto_match` argument is enabled by default in the initialization of `Adaptor`, but you still have to enable it while selecting elements if you want to use it. (Not a bug but a design decision.)
5. A lot of type-annotation corrections here and there for a better auto-completion experience while you are coding with Scrapling.
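
A runnable sketch of the `TextHandlers` chaining from item 2, assuming `Adaptor` accepts the HTML text as its first argument (the snippet is hypothetical):

```python
from scrapling import Adaptor

html = '<span class="price">{"value": 19.99}</span>'
page = Adaptor(html, auto_match=False)

# Text selections now return TextHandlers, so the regex and JSON helpers
# chain directly on the selection result.
print(page.css('.price::text').re_first(r'\{.+\}').json())
```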

#### Quality of life changes
1. Renamed both the `css_selector` and `xpath_selector` methods to `generate_css_selector` and `generate_xpath_selector` for clarity, and to not interrupt auto-completion while coding (see the sketch after this list).
2. Restructured most of the old code into a `core` subpackage, among other design decisions, for cleaner and easier maintenance in the future.
3. Restructured the tests folder into a cleaner layout and added tests for the new features. Also, tox environments are now cached on GitHub for faster automated tests with each commit.
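
A minimal sketch of the renamed selector generators; whether they are invoked as methods or accessed as properties is an assumption here, shown as property access:

```python
from scrapling import Adaptor

html = '<div id="main"><p class="lead">Hi</p></div>'
page = Adaptor(html, auto_match=False)

# Formerly css_selector/xpath_selector; the new names no longer collide
# with the css/xpath methods in auto-completion.
element = page.css_first('.lead')
print(element.generate_css_selector)
```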
