Advertools

Latest version: v0.16.4

0.12.0,1,2

-----------------------

* Added
- New function ``logs_to_df``: Convert a log file of any non-JSON format
into a pandas DataFrame and save it to a `parquet` file. This also
compresses the file to a much smaller size.
- Crawler extracts all available ``img`` attributes: 'alt', 'crossorigin',
'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes',
'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes
like ``style`` and ``draggable``).
- New parameter ``skip_url_params`` for the ``crawl`` function: Defaults to
False, consistent with previous behavior. When set to True, links
containing any URL parameters are not followed/crawled.
- New column for ``url_to_df`` "last_dir": Extract the value in the last
directory for each of the URLs.
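
The two URL-related additions above can be illustrated with the standard library alone. This is a minimal sketch, not the advertools implementation: the helper names ``has_url_params`` and ``last_dir`` are hypothetical, and it assumes that "last directory" means the last non-empty path segment.

```python
from urllib.parse import urlsplit


def has_url_params(url):
    """Return True if the URL carries a query string (e.g. ?page=2)."""
    return bool(urlsplit(url).query)


def last_dir(url):
    """Return the last non-empty path segment, or '' if there is none.

    Assumption: this mirrors the idea of a "last_dir" column, not the
    library's exact rules.
    """
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return segments[-1] if segments else ""


links = [
    "https://example.com/shop/shoes/",
    "https://example.com/shop/shoes/?color=red",
    "https://example.com/",
]

# With parameter skipping enabled, links that have a query string are dropped.
links_to_follow = [link for link in links if not has_url_params(link)]
last_dirs = [last_dir(link) for link in links]
```

In this sketch the second link would be skipped because it contains ``?color=red``, while all three links still get a ``last_dir`` value (an empty string for the home page).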

* Changed
- Query parameter columns in ``url_to_df`` DataFrame are now sorted by how
full the columns are (the percentage of values that are not `NA`)
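
The ordering rule above can be sketched in plain Python. This is only an illustration under stated assumptions: ``None`` stands in for `NA`, the sample rows and the ``fullness`` helper are made up, and the fullest-first direction is an assumption, since the note does not state the sort direction.

```python
# Each dict is one URL's query parameters; None stands in for NA.
rows = [
    {"utm_source": "news", "page": "1", "ref": None},
    {"utm_source": None, "page": "2", "ref": None},
    {"utm_source": None, "page": "3", "ref": "home"},
    {"utm_source": None, "page": None, "ref": "home"},
]


def fullness(column, rows):
    """Share of rows in which the column has a non-missing value."""
    return sum(row[column] is not None for row in rows) / len(rows)


columns = list(rows[0])
# Fullest columns first (assumed direction); ties keep their original order.
sorted_columns = sorted(columns, key=lambda col: fullness(col, rows), reverse=True)
```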

0.12.3

-------------------

* Fixed
- Crawler stops when provided with bad URLs in list mode.

0.11.1

-------------------

* Added
- The `nofollow` attribute for nav, header, and footer links.

* Fixed
- Timeout error while downloading robots.txt files.
- Make extracting nav, header, and footer links consistent with all links.

0.11.0

-------------------

* Added
- New parameter `recursive` for ``sitemap_to_df`` to control whether or not
to get all sub sitemaps (default), or to only get the current
(sitemapindex) one.
- New columns for ``sitemap_to_df``: ``sitemap_size_mb``
(1 MB = 1,024x1,024 bytes), and ``sitemap_last_modified`` and ``etag``
(if available).
- Option to request multiple robots.txt files with ``robotstxt_to_df``.
- Option to save downloaded robots DataFrame(s) to a file with
``robotstxt_to_df`` using the new parameter ``output_file``.
- Two new columns for ``robotstxt_to_df``: ``robotstxt_last_modified`` and
``etag`` (if available).
- Raise `ValueError` in ``crawl`` if ``css_selectors`` or
``xpath_selectors`` contain any of the default crawl column headers
- New XPath code recipes for custom extraction.
- New function ``crawllogs_to_df`` which converts crawl logs to a DataFrame
provided they were saved while using the ``crawl`` function.
- New columns in ``crawl``: `viewport`, `charset`, all `h` headings
(whichever is available), nav, header and footer links and text, if
available.
- Crawl errors don't stop crawling anymore, and the error message is
included in the output file under a new `errors` and/or `jsonld_errors`
column(s).
- In case of having JSON-LD errors, errors are reported in their respective
column, and the remainder of the page is scraped.
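
The unit used for ``sitemap_size_mb`` above (1 MB = 1,024x1,024 bytes) works out as follows. This is a minimal sketch of the arithmetic only; the ``size_in_mb`` helper is hypothetical and does not show how the library measures the downloaded sitemap.

```python
BYTES_PER_MB = 1024 * 1024  # 1,048,576 bytes, as stated in the note above


def size_in_mb(num_bytes):
    """Convert a size in bytes to megabytes (1 MB = 1,024 x 1,024 bytes)."""
    return num_bytes / BYTES_PER_MB


# A made-up sitemap body of exactly 5 MB.
sitemap_body = b"x" * (5 * BYTES_PER_MB)
sitemap_size_mb = size_in_mb(len(sitemap_body))
```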

* Changed
- Removed column prefix `resp_meta_` from columns containing it
- Redirect URLs and reasons are separated by '@@' for consistency with
other multiple-value columns
- Links extracted while crawling are not unique any more (all links are
extracted).
- Emoji data updated with v13.1.
- Heading tags are scraped even if they are empty, e.g. <h2></h2>.
- Default user agent for crawling is now advertools/VERSION.
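
Multiple-value columns can be split back into lists after loading a crawl file. The sketch below assumes the delimiter is '@@' and uses a made-up row; the column names ``redirect_urls`` and ``redirect_reasons`` and the sample values are illustrative assumptions, not taken from this changelog.

```python
SEPARATOR = "@@"  # assumed delimiter for multiple-value columns

# A made-up row with two redirect hops.
row = {
    "redirect_urls": "http://example.com@@https://example.com",
    "redirect_reasons": "Moved Permanently@@Found",
}

redirect_urls = row["redirect_urls"].split(SEPARATOR)
redirect_reasons = row["redirect_reasons"].split(SEPARATOR)

# Pair each redirect URL with the reason at the same position.
hops = list(zip(redirect_urls, redirect_reasons))
```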

* Fixed
- Handle sitemap index files that contain links to themselves, with an
error message included in the final DataFrame
- Error in robots.txt files caused by comments preceded by whitespace
- Zipped robots.txt files causing a parsing issue
- Crawl issues on some Linux systems when providing a long list of URLs

* Removed
- Columns from the ``crawl`` output: `url_redirected_to`, `links_fragment`

0.10.7

-------------------

* Added
- New function ``knowledge_graph`` for querying Google's API
- Faster ``sitemap_to_df`` with threads
- New parameter `max_workers` for ``sitemap_to_df`` to set the maximum
number of threads used, which determines how fast it could go
- New parameter `capitalize_adgroups` for ``kw_generate`` to determine
whether or not to keep ad groups as is, or set them to title case (the
default)
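
How a worker limit bounds threaded downloads can be sketched with the standard library. This is not the ``sitemap_to_df`` implementation: ``fetch_sitemap`` is a hypothetical stand-in that performs no network request, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_sitemap(url):
    """Stand-in for a network request; returns a fake record for the URL."""
    return {"sitemap": url, "url_length": len(url)}


sitemap_urls = [f"https://example.com/sitemap-{i}.xml" for i in range(1, 6)]

# max_workers caps how many fetches run at the same time.
with ThreadPoolExecutor(max_workers=3) as executor:
    # executor.map yields results in the same order as the input URLs.
    results = list(executor.map(fetch_sitemap, sitemap_urls))
```

A higher worker count allows more simultaneous requests, at the cost of more load on the server being queried.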

* Fixed
- Remove restrictions on the number of URLs provided to ``crawl``,
assuming `follow_links` is set to `False` (list mode)
- JSON-LD issue breaking crawls when it's invalid (now skipped)

* Removed
- Deprecate the ``youtube.guide_categories_list`` (no longer supported by
the API)

0.10.6

-------------------

* Added
- JSON-LD support in crawling. If available on a page, JSON-LD items will
have special columns, and multiple JSON-LD snippets will be numbered for
easy filtering

* Changed
- Stricter parsing for rel attributes, making sure they are in link
elements as well
- Date column names for ``robotstxt_to_df`` and ``sitemap_to_df`` unified
as "download_date"
- Numbering of OG, Twitter, and JSON-LD columns, where multiple elements are
present on the same page, follows a unified approach: no numbering for the
first element, and numbers start with "1" from the second element on:
"element", "element_1", "element_2", etc.
