Code
- Default throttling for downloaders set to max 300 requests per second.
- `Downloader` now takes a client for downloading. There are currently two clients:
* s3 -> Directly queries the Common Crawl buckets
* api -> Queries the CommonCrawl API Gateway
- Retry system has been updated to leverage tenacity. Additionally, we now use random exponential backoff instead of linear random backoff.
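
To illustrate the difference between the two backoff strategies, here is a minimal stdlib-only sketch. It is not the project's retry code and not tenacity's implementation (tenacity ships its own `wait_random_exponential` strategy). The function names, `step`, `base`, and `cap` are all assumptions chosen for illustration.

```python
import random


def linear_random_backoff(attempt: int, step: float = 1.0) -> float:
    """Random delay whose upper bound grows linearly with the attempt number."""
    return random.uniform(0, step * attempt)


def random_exponential_backoff(
    attempt: int, base: float = 1.0, cap: float = 60.0
) -> float:
    """Random delay whose upper bound doubles per attempt, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2**attempt))


if __name__ == "__main__":
    # Print one sampled delay per attempt for each strategy.
    for attempt in range(1, 6):
        linear = linear_random_backoff(attempt)
        exponential = random_exponential_backoff(attempt)
        print(f"attempt {attempt}: linear={linear:.2f}s exponential={exponential:.2f}s")
```

With exponential growth the wait window widens much faster after repeated failures, which tends to spread retries from many concurrent downloaders further apart than a linear window does.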
CLI
- New global parameter `--aws_profile` for setting the AWS profile to use
- New parameter `--download_method` which can be set for
* `extract...records --download_method`
* `download...html --download_method`
In both cases the argument can be set to either `s3` or `api`, which defines how Common Crawl is accessed when downloading WARC files.
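
As a rough sketch of how options like these can be restricted to the two allowed values, the following uses `argparse` from the standard library. It is not the project's actual CLI definition: the program name `example-cli`, the default of `api`, and the help strings are assumptions, and the real subcommands are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser mirroring the two options described above."""
    parser = argparse.ArgumentParser(prog="example-cli")  # hypothetical program name
    parser.add_argument(
        "--aws_profile",
        default=None,
        help="AWS profile to use (assumed to be optional)",
    )
    parser.add_argument(
        "--download_method",
        choices=["s3", "api"],  # the two values named in the notes
        default="api",  # assumed default, not confirmed by the notes
        help="how WARC files are fetched",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--aws_profile", "dev", "--download_method", "s3"])
print(args.aws_profile, args.download_method)
```

Using `choices` makes the parser reject any value other than `s3` or `api` before any download starts.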