Wayback

Latest version: v0.4.5

0.4.0

Breaking Changes

This release includes a significant overhaul of parameters for `WaybackClient.search`.

- Removed parameters that did nothing, could break search, or that were for internal use only: `gzip`, `showResumeKey`, `resumeKey`, `page`, `pageSize`, `previous_result`.

- Removed support for extra, arbitrary keyword parameters that could be added to each request to the search API.

- All parameters now use snake_case. (Previously, parameters that were passed unchanged to the HTTP API used camelCase, while others used snake_case.) The old, non-snake-case names are deprecated, but still work. They’ll be completely removed in v0.5.0.

- `matchType` → `match_type`
- `fastLatest` → `fast_latest`
- `resolveRevisits` → `resolve_revisits`

- The `limit` parameter now has a default value. There are very few cases where you should not set a `limit` (not doing so will typically break pagination), and there is now a default value to help prevent mistakes. We’ve also added documentation to explain how and when to adjust this value, since it is pretty complex. (65)

- Expanded the method documentation to explain things in more depth and link to more external references.

While we were at it, we also renamed the `datetime` parameter of `WaybackClient.get_memento` to `timestamp` for consistency with the `CdxRecord` and `Memento` classes. The old name still works for now, but it will be fully removed in v0.5.0.
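
To make the deprecation concrete, here is a minimal sketch of how old camelCase keyword names could be translated to their snake_case replacements while warning the caller. The `normalize_search_kwargs` helper and the `DEPRECATED_NAMES` table are hypothetical names invented for this illustration; they are not part of the package's API, and the library's real handling may differ.

```python
import warnings

# Hypothetical table of deprecated names and their replacements,
# taken from the renames listed in these notes.
DEPRECATED_NAMES = {
    'matchType': 'match_type',
    'fastLatest': 'fast_latest',
    'resolveRevisits': 'resolve_revisits',
}


def normalize_search_kwargs(**kwargs):
    """Return a copy of kwargs with deprecated names mapped to snake_case."""
    normalized = {}
    for name, value in kwargs.items():
        new_name = DEPRECATED_NAMES.get(name, name)
        if new_name != name:
            warnings.warn(
                f'The "{name}" parameter is deprecated; use "{new_name}" instead.',
                DeprecationWarning,
                stacklevel=2,
            )
        if new_name in normalized:
            # The same option arrived under both its old and new names.
            raise TypeError(f'"{new_name}" was given more than once.')
        normalized[new_name] = value
    return normalized
```

Calling `normalize_search_kwargs(matchType='prefix')` in this sketch returns `{'match_type': 'prefix'}` and emits one `DeprecationWarning`.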


Features

- `Memento.headers` is now case-insensitive. The keys of the `headers` dict are returned with their original case when iterating, but lookups are performed case-insensitively. For example:

```py
list(memento.headers) == ['Content-Type', 'Date']
memento.headers['Content-Type'] == memento.headers['content-type']
```


(98)

- There are now built-in, adjustable rate limits for calls to both `search()` and `get_memento()`. The default values should keep you from getting temporarily blocked by the Wayback Machine servers, but you can also adjust them when instantiating `WaybackSession`:

```py
# Limit get_memento() calls to 2 per second (or one every 0.5 seconds):
client = WaybackClient(WaybackSession(memento_calls_per_second=2))

# These now take a minimum of 0.5 seconds, even if the Wayback Machine
# responds instantly (there's no delay on the first call):
client.get_memento('http://www.noaa.gov/', timestamp='20180816111911')
client.get_memento('http://www.noaa.gov/', timestamp='20180829092926')
```


A huge thanks to LionSzl for implementing this. (12)
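
For readers curious how a limit like this behaves, the following is a minimal sketch of a minimum-interval rate limiter. It only illustrates the idea of "N calls per second with no delay on the first call"; the `RateLimit` class and its `clock` and `sleep` parameters are assumptions made for this example, not the package's actual implementation.

```python
import threading
import time


class RateLimit:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.minimum_wait = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last_call = None  # None means no call has happened yet.

    def wait(self):
        """Block until `minimum_wait` seconds have passed since the last call."""
        with self._lock:
            if self._last_call is not None:
                elapsed = self._clock() - self._last_call
                remaining = self.minimum_wait - elapsed
                if remaining > 0:
                    self._sleep(remaining)
            self._last_call = self._clock()
```

With `RateLimit(2)`, the first `wait()` returns immediately and each later call is held back until at least 0.5 seconds have passed since the previous one.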


Fixes & Maintenance

- All API requests to archive.org now use HTTPS instead of HTTP. Thanks to sundhaug92 for calling this out. (81)

- Headers from the original archived response are again included in `Memento.headers`. As part of this, the `headers` attribute is now case-insensitive (see new features above), since the Internet Archive servers now return headers with different cases depending on how the request was made. (98)
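
As an illustration of the behavior described for `Memento.headers`, here is a small sketch of a read-only mapping that keeps the original case of its keys when iterating but ignores case on lookup. The `CaseInsensitiveHeaders` class is a hypothetical stand-in written for this example and is not claimed to be the type the package uses.

```python
from collections.abc import Mapping


class CaseInsensitiveHeaders(Mapping):
    """Read-only mapping that preserves key case but ignores it on lookup."""

    def __init__(self, headers):
        # Map each lowercased name to its (original name, value) pair.
        self._store = {
            name.lower(): (name, value)
            for name, value in dict(headers).items()
        }

    def __getitem__(self, name):
        return self._store[name.lower()][1]

    def __iter__(self):
        # Yield names with the case they were originally given in.
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)
```

Because it subclasses `Mapping`, membership tests such as `'DATE' in headers` also work case-insensitively in this sketch.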

0.3.3

This release extends the timestamp parsing fix from version 0.3.2 to handle a similar problem in the month portion of timestamps, in addition to the day. It also includes a small performance improvement in timestamp parsing. Thanks to edsu for discovering and addressing this issue. (88)

**Full Changelog**: https://github.com/edgi-govdata-archiving/wayback/compare/v0.3.2...v0.3.3

0.3.2

Some Wayback CDX records have invalid timestamps with `"00"` for the day-of-month portion. `wayback.WaybackClient.search` previously raised an exception when parsing CDX records with this issue, but now handles them safely. Thanks to 8W9aG for discovering this issue and addressing it. (85)
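
These notes do not say how such timestamps are repaired. One plausible approach, sketched below, is to treat a `"00"` day (or month, as covered by 0.3.3) as `"01"` before parsing. The `parse_cdx_timestamp` function and the choice of `"01"` as the replacement are assumptions made for illustration only.

```python
from datetime import datetime, timezone


def parse_cdx_timestamp(text):
    """Parse a 14-digit timestamp, treating a "00" month or day as "01"."""
    month = '01' if text[4:6] == '00' else text[4:6]
    day = '01' if text[6:8] == '00' else text[6:8]
    repaired = text[:4] + month + day + text[8:]
    parsed = datetime.strptime(repaired, '%Y%m%d%H%M%S')
    # Wayback timestamps are expressed in UTC.
    return parsed.replace(tzinfo=timezone.utc)
```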

0.3.1

Some Wayback CDX records have no `length` information, and previously caused `WaybackClient.search` to raise an exception. These records now have their `length` property set to `None` instead of a number. Thanks to 8W9aG for discovering this issue and addressing it! (83)

0.3.0

This release marks a *major* update we’re really excited about: `WaybackClient.get_memento` no longer returns a `Response` object from the [Requests package](https://requests.readthedocs.io/) that takes a lot of extra work to interpret correctly. Instead, it returns a new `Memento` object. It’s really similar to the `Response` we used to return, but doesn’t mix up current and historical data — it represents the historical, archived HTTP response that is stored in the Wayback Machine. This is a big change to the API, so we’ve bumped the version number to `0.3.x`.


Notable Changes

- **Breaking change:** `WaybackClient.get_memento` takes new parameters and has a new return type. More details below.

- **Breaking change:** `memento_url_data` now returns 3 values instead of 2. The last value is a string representing the playback mode (see the description of the new `mode` parameter on `WaybackClient.get_memento` below for more about playback modes).

- Requests to the Wayback Machine now have a default timeout of 60 seconds. This was important because we’ve seen many recent issues where the Wayback Machine servers don’t always close connections.

If needed, you can disable this by explicitly setting `timeout=None` when creating a `WaybackSession`. Please note this is *not* a timeout on how long a whole request takes, but on the time between bytes received.

- `WaybackClient.get_memento` now raises `NoMementoError` when the requested URL has never been archived by the Wayback Machine. It no longer raises `requests.exceptions.HTTPError` under any circumstances.

You may notice that removing APIs from the [Requests package](https://requests.readthedocs.io/) is a theme here. Under the hood, *Wayback* still uses *Requests* for HTTP requests, but we expect to change that in order to ensure this package is thread-safe. We will bump the version to v0.4.x when doing so.
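
To show what "3 values" from a memento URL might look like, here is a rough sketch that splits a Wayback URL into the original URL, the capture time, and the playback mode. It is not the package's `memento_url_data`; the `split_memento_url` name, the regular expression, the order of the first two values, and the use of an empty string when no mode suffix is present are all assumptions.

```python
import re
from datetime import datetime, timezone

# Matches http(s)://web.archive.org/web/<14 digits><mode>/<original URL>.
MEMENTO_URL_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})([a-z]*_)?/(.+)$'
)


def split_memento_url(memento_url):
    """Return (original_url, timestamp, mode) for a Wayback memento URL."""
    match = MEMENTO_URL_PATTERN.match(memento_url)
    if match is None:
        raise ValueError(f'Not a memento URL: {memento_url!r}')
    timestamp = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    timestamp = timestamp.replace(tzinfo=timezone.utc)
    mode = match.group(2) or ''  # Assumption: no suffix means an empty mode.
    return match.group(3), timestamp, mode
```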


get_memento() Parameters

The parameters in `WaybackClient.get_memento` have been re-organized. The method signature is now:

```py
def get_memento(self,
                url,                         # Accepts new types of values.
                datetime=None,               # New parameter.
                mode=Mode.original,          # New parameter.
                *,                           # Everything below is keyword-only.
                exact=True,
                exact_redirects=None,
                target_window=24 * 60 * 60,
                follow_redirects=True)       # New parameter.
```


- All parameters except `url` (the first parameter) from v0.2.x must now be specified with keywords, and cannot be specified positionally.

If you previously used keywords, your code will be fine and no changes are necessary:

```py
# This still works great!
client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/',
                   exact=False,
                   exact_redirects=False,
                   target_window=3600)
```


However, positional parameters like the following will now cause problems, and you should switch to the above keyword form:

```py
# This will now cause you some trouble :(
client.get_memento('http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/',
                   False,
                   False,
                   3600)
```


- The `url` parameter can now be a normal, non-Wayback URL or a `CdxRecord`, and new `datetime` and `mode` parameters have been added.

Previously, if you wanted to get a memento of what `http://www.noaa.gov/` looked like on August 1, 2018, you would have had to construct a complex string to pass to `get_memento()`:

```py
client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov/')
```


Now you can pass the URL and time you want as separate parameters:

```py
client.get_memento('http://www.noaa.gov/', datetime.datetime(2018, 8, 1))
```


If the `datetime` parameter does not specify a timezone, it will be treated as UTC (*not* local time).

You can also pass a `CdxRecord` that you received from `WaybackClient.search` instead of a URL and time:

```py
for record in client.search('http://www.noaa.gov/'):
    client.get_memento(record)
```


Finally, you can now specify the *playback mode* of a memento using the `mode` parameter:

```py
client.get_memento('http://www.noaa.gov/',
                   datetime=datetime.datetime(2018, 8, 1),
                   mode=wayback.Mode.view)
```


The default mode is `Mode.original`, which returns the exact HTTP response body as was originally archived. Other modes reformat the response body so it’s more friendly for browsing by changing the URLs of links, images, etc. and by adding informational content to the page about the memento you are viewing. They are the modes typically used when you view the Wayback Machine in a web browser.

Don’t worry, though — complete Wayback URLs are still supported. This code still works fine:

```py
client.get_memento('http://web.archive.org/web/20180801000000id_/http://www.noaa.gov/')
```


- A new `follow_redirects` parameter specifies whether to follow *historical* redirects (i.e. redirects that happened when the requested memento was captured). It defaults to `True`, which matches the old behavior of this method.
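
The relationship between the separate URL and time parameters and a complete Wayback URL, including the rule that a `datetime` without a timezone is treated as UTC, can be sketched as follows. The `build_memento_url` helper is hypothetical and exists only for this illustration; the `'id_'` default is simply the mode suffix that appears in the example URLs above.

```python
from datetime import datetime, timedelta, timezone


def build_memento_url(url, when, mode='id_'):
    """Build a Wayback URL from an original URL, a time, and a mode suffix."""
    if when.tzinfo is None:
        # No timezone given: interpret the value as UTC, not local time.
        when = when.replace(tzinfo=timezone.utc)
    else:
        # Convert timezone-aware values to UTC before formatting.
        when = when.astimezone(timezone.utc)
    return f'http://web.archive.org/web/{when:%Y%m%d%H%M%S}{mode}/{url}'


# A naive datetime and an equivalent aware datetime give the same URL.
naive = build_memento_url('http://www.noaa.gov/', datetime(2018, 8, 1))
aware = build_memento_url(
    'http://www.noaa.gov/',
    datetime(2018, 8, 1, 2, 0, tzinfo=timezone(timedelta(hours=2))),
)
```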


get_memento() Returns a Memento Object

`get_memento()` no longer returns a response object from the [Requests package](https://requests.readthedocs.io/). Instead it returns a specialized `Memento` object, which is similar, but provides more useful information about the Memento than just the HTTP response from Wayback. For example, `memento.url` is the original URL the memento is a capture of (e.g. `http://www.noaa.gov/`) rather than the Wayback URL (e.g. `http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/`). You can still get the full Wayback URL from `memento.memento_url`.

You can check out the full API documentation for `Memento`, but here’s a quick guide to what’s available:

```py
from datetime import datetime, timezone

memento = client.get_memento('http://www.noaa.gov/home',
                             datetime(2018, 8, 16, 11, 19, 11),
                             exact=False)

# These values were previously not available except by parsing
# `memento.url`. The old `memento.url` is now `memento.memento_url`.
memento.url == 'http://www.noaa.gov/'
memento.timestamp == datetime(2018, 8, 29, 8, 8, 49, tzinfo=timezone.utc)
memento.mode == 'id_'

# Used to be `memento.url`:
memento.memento_url == 'http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/'

# Used to be a list of `Response` objects, now a *tuple* of Mementos. It
# lists only the redirects that are actual Mementos and not part of
# Wayback's internal machinery:
memento.history == (Memento<url='http://noaa.gov/home'>,)

# Used to be a list of `Response` objects, now a *tuple* of URL strings:
memento.debug_history == ('http://web.archive.org/web/20180816111911id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov/home',
                          'http://web.archive.org/web/20180829092926id_/http://noaa.gov/')

# Headers now only lists headers from the original archived response, not
# additional headers from the Wayback Machine itself. (If there's
# important information you needed in the headers, file an issue and let
# us know! We'd like to surface that kind of information as attributes on
# the Memento now.)
memento.headers == {'header_name': 'header_value',
                    'another_header': 'another_value',
                    'and': 'so on'}

# Same as before:
memento.status_code
memento.ok
memento.is_redirect
memento.encoding
memento.content
memento.text
```

0.3.0b1

`wayback.WaybackClient.get_memento` now raises `wayback.exceptions.NoMementoError` when the requested URL has never been archived. It also now raises `wayback.exceptions.MementoPlaybackError` in all other cases where an error was returned by the Wayback Machine (so you should never see a `requests.exceptions.HTTPError`). However, you may still see other *network-level* errors (e.g. `ConnectionError`).
