This release features significant improvements to downloading large websites
that have about 10 million URLs. Projects open and close faster. The UI is faster.
Downloads are faster. Progress bars are shown for all slow operations.
Estimated time remaining is shown when downloading groups.
* Large project improvements (with 3,000,000 - 11,000,000 URLs)
* Open projects containing many URLs in about 50% as much time as before:
* Approximate the URL count when loading a project in O(1) time
rather than getting an exact URL count in O(r) time,
where r = the number of URLs in the project
* Decrease the time to load groups from O(r·g) to about O(r + g·log(r)),
where r = the number of URLs in the project and
g = the number of groups in the project
* Defer creation of Entity Tree nodes corresponding to group members
until the group is actually expanded
* Close projects with very many queued tasks (such as download tasks)
in O(1) time rather than O(t) time, where t = the number of queued tasks
* Speed up interacting with the Entity Tree and Task Tree when
there are very many URLs in a project:
* Entity Tree: Speed up expanding URL nodes when large groups exist,
now in O(k) time rather than O(r·k) time,
where k = the number of links originating from the URL node and
r = the number of URLs in the project.
* Entity Tree: Load only the first 100 members of each group, on demand
* Task Tree: Show only up to 100 children when downloading a group
* Speed up interacting with the Add Group dialog when
there are very many URLs in a project:
* When typing each character of a new URL pattern and no wildcard
has yet been typed, perform an O(1) search for matching URLs
in the preview pane.
* When typing each character of a new URL pattern and at least one
wildcard has been typed, perform an O(log(r)) search for matching URLs
in the preview pane, where r = the number of URLs in the project.
* Previously an O(r) search was performed in both of the above cases.
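
The lookup costs described for the Add Group dialog can be sketched as follows. This is only one way to reach those bounds, assuming the URLs are available both as a set and as a sorted list; the notes do not say how Crystal implements the search, and it may rely on a database index instead. The function name `preview_matches` and the use of `*` as the only wildcard are assumptions made for this example.

```python
from bisect import bisect_left


def preview_matches(sorted_urls, url_set, pattern, limit=100):
    """Return up to `limit` URLs that are candidates for `pattern`."""
    if "*" not in pattern:
        # No wildcard typed yet: exact match via an average-case O(1) set lookup.
        return [pattern] if pattern in url_set else []

    # Wildcard present: binary-search for the literal prefix in O(log r).
    prefix = pattern.split("*", 1)[0]
    start = bisect_left(sorted_urls, prefix)

    matches = []
    # Look at no more than `limit` URLs after the insertion point.
    for url in sorted_urls[start:start + limit]:
        if not url.startswith(prefix):
            break
        matches.append(url)
    return matches
```

The sketch narrows candidates by prefix only. A complete implementation would still need to check each candidate against the full pattern.
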
* Show progress while upgrading a project with many URLs
* Show progress dialog when starting to download a large group
* Show elapsed time in all progress dialogs
* Prevent system idle sleep while tasks are running (on macOS and Windows)
* Print large numbers with the thousands separator appropriate for the
current locale (such as commas)
* Minimize memory use when there are very many URLs in a project
by defining explicit `__slots__` on in-memory Resource, Task, TaskTreeNode,
and NodeView objects, which shrinks each instance
* Minimize memory growth while downloading URLs in a project for
multiple hours or days
* If free disk space drops too low then refuse to download further resources
* Quit immediately even when a project with many resources was open recently
* Open preferences dialog significantly faster for projects containing many URLs
* Significantly speed up creation of tasks that have many children,
such as tasks that download groups with very many members
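
The `__slots__` change mentioned above can be illustrated with a minimal sketch. The class names `TaskWithDict` and `TaskWithSlots` are hypothetical and are not Crystal's actual classes. The point is only that a slotted class stores its attributes without a per-instance `__dict__`, which matters when millions of instances are alive at once.

```python
class TaskWithDict:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, title, parent):
        self.title = title
        self.parent = parent


class TaskWithSlots:
    """Slotted class: attributes live in fixed slots, with no per-instance __dict__."""

    __slots__ = ("title", "parent")

    def __init__(self, title, parent):
        self.title = title
        self.parent = parent


plain = TaskWithDict("Downloading group", None)
slotted = TaskWithSlots("Downloading group", None)

print(hasattr(plain, "__dict__"))    # the ordinary instance has a __dict__
print(hasattr(slotted, "__dict__"))  # the slotted instance does not
```

One trade-off of this design is that a slotted instance rejects attributes not listed in `__slots__`, so every field has to be declared up front.
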
* First-time-run experience improvements
* Improve defaults
* New/Open Project Dialog: Default to creating a new project rather
than opening an existing one.
* New Group Dialog: Expand "Preview Members" by default.
* Polish user interface
* Use consistent words to refer to common concepts
* {Create, Add} -> New
* {URL, Root URL} -> Root URL
* Add menus
* macOS: Add proxy icon to the project window, making it easier to navigate
to the project in the Finder.
* Add app name to version label in lower-left corner of project window.
* Add keyboard shortcuts everywhere
* Groups without a source can now be downloaded, as one would expect.
* Task Tree: Periodically remove top-level tasks that have completed,
rather than waiting for all of them to complete first
* Critical fixes
* Linux: Fix dialog that appears on app launch to be sized correctly.
* Linux: Fix View button to open browser even if Crystal is run from a read-only volume.
* Linux: Fix most other dialogs to be sized correctly.
* macOS: Fix issue where dialogs could appear at unusual locations,
including offscreen.
* Crawling improvements
* Don't recurse infinitely if a resource identifies an alias of itself as an
embedded resource.
* Downloading improvements
* Show estimated time remaining and speed when downloading groups and URLs
* Download faster
* Reinstate the ASSUME_RESOURCES_DOWNLOADED_IN_SESSION_WILL_ALWAYS_REMAIN_FRESH
optimization that was disabled in v1.4.0b, which significantly speeds up
downloading groups of HTML pages that link to similar URLs
* Support immediate early completion of download tasks for URLs
that were downloaded in the current session or a recent session
* Record links while downloading faster by writing all of them to the
project in bulk rather than one by one
* Open the project's underlying SQLite database
in [Write-Ahead Logging (WAL) mode](https://www.sqlite.org/wal.html)
which is faster than the default mode
* Change delay between downloads to be inserted after each HTML page downloads
(with its embedded resources), rather than after each single resource downloads.
This new behavior simulates user browsing more closely and results in
much faster downloading of HTML pages with many images
(or other embedded resources).
* Parallelize download of URLs from origin server with writes to local
database where possible.
* Avoid querying the database for revisions of a URL if it is already
known that there are no revisions because of other information
cached in memory
* Precompile XPath selectors used to parse links from HTML
* Use an optimized version of [shutil.copyfileobj] that avoids
repeatedly allocating intermediate buffers
* Maximum download speed increased from 1 item/sec to 2 items/sec
* Autopopulate an HTTP Date header when downloading if none is provided
by origin server, as per RFC 7231 §7.1.1.2.
* Load HTTPS CA certificates from certifi on Windows,
in addition to from the system CA store.
* Load HTTPS CA certificates from `$SSL_CERT_FILE` if specified.
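
As a rough illustration of the WAL and bulk-write items above, here is a minimal sketch using Python's `sqlite3` module. The `link` table, its columns, and both function names are assumptions made for this example and do not reflect Crystal's actual schema.

```python
import sqlite3


def open_project_db(path):
    """Open a database file and switch it to Write-Ahead Logging mode."""
    db = sqlite3.connect(path)
    # The pragma returns the journal mode that is actually in effect.
    mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Hypothetical table used only for this sketch.
    db.execute(
        "CREATE TABLE IF NOT EXISTS link (source_url TEXT, target_url TEXT)"
    )
    return db, mode


def record_links_in_bulk(db, source_url, target_urls):
    """Write all links from one page in a single transaction."""
    rows = [(source_url, target_url) for target_url in target_urls]
    with db:  # commits once at the end rather than once per link
        db.executemany(
            "INSERT INTO link (source_url, target_url) VALUES (?, ?)", rows
        )
```

Batching the inserts into one transaction avoids paying the commit cost for every link, which is the general idea behind writing links in bulk rather than one by one.
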
* Parsing improvements
* Links are parsed in about 18% as much time as before.
* Can identify URL references inside `<img srcset="...">`.
* Skip parsing links in downloaded files known to be binary files.
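
To show what identifying URL references inside `srcset` involves, here is a simplified sketch. The function name `urls_in_srcset` is hypothetical, and the comma-splitting approach is an assumption: it mishandles URLs that themselves contain commas, which the full parsing algorithm in the HTML specification accounts for.

```python
def urls_in_srcset(srcset):
    """Extract the URL from each image candidate in a srcset attribute value."""
    urls = []
    for candidate in srcset.split(","):
        # Each candidate looks like "<url> <descriptor>", e.g. "large.jpg 2x".
        parts = candidate.split()
        if parts:
            urls.append(parts[0])
    return urls
```
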
* Serving improvements
* Server logs are now displayed in a UI drawer.
* Links to anchors on the same page are no longer rewritten,
for better compatibility with JavaScript libraries that
treat such links specially.
* Archived pages are read from disk about 45% faster by avoiding an
unnecessary `os.stat` call.
* Archived pages are served faster and more efficiently by using
the [os.sendfile] primitive when supported by the operating system.
* Don't warn about unknown X- HTTP headers.
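
The `os.sendfile` item above can be approximated with `socket.sendfile()`, a standard-library wrapper that uses `os.sendfile` where available and falls back to plain `send()` otherwise. This is a swapped-in convenience API for illustration, not necessarily what Crystal's server does, and `send_archived_body` is a hypothetical name.

```python
import socket


def send_archived_body(conn: socket.socket, body_path: str) -> int:
    """Send a file's bytes over a connected stream socket.

    Returns the number of bytes sent.
    """
    with open(body_path, "rb") as body_file:
        # Where os.sendfile is supported, the kernel copies data directly
        # from the file to the socket without passing through Python.
        return conn.sendfile(body_file)
```
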
* CLI improvements
* Profiling warnings:
* Several foreground tasks are optimized so that they
no longer print slow foreground task warnings
* Slow garbage collection operations now print a profiling warning
* Slow "Recording links" operations now print a profiling warning
* Include [guppy] module for manual [memory leak profiling].
* A `$PYTHONSTARTUP` file can be defined that is run automatically
at the beginning of a shell session.
* Error handling improvements
* When attempting to download a previously-downloaded revision that is
missing a body file on disk, delete the old revision and redownload it.
* Testing improvements
* An entire test module can now be run with `--test`, in addition to
individual test functions.
* Minor fixes
* Clear completed root tasks in all cases, even in the rare case where
all tasks except the first one are complete
* When deleting a ResourceRevision, don't delete the revision body if the
project is read-only, and properly mark the related Resource as no longer
being downloaded this session
* When querying a ResourceRevision's size, don't crash with a traceback
* When running as a macOS .app, log stdout and stderr to files correctly
once more
* Backward-incompatible API changes
* `Resource.revisions()` now returns `Iterable[ResourceRevision]` instead
of `List[ResourceRevision]` to support streaming results.
* If the old behavior is desired, wrap calls to `Resource.revisions()`
inside of a `list(...)` expression.
* `MainWindow.frame` is no longer public.
* `ResourceRevision.load()` has been renamed to
`ResourceRevision._load_from_data()` and made private.
* A replacement `ResourceRevision.load()` method now exists that loads
an existing revision given an ID.
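
The effect of the `Resource.revisions()` change on calling code can be sketched with a stand-in class. `FakeResource` is hypothetical and exists only for this example, and plain integers stand in for `ResourceRevision` objects. Code that indexes the result or takes its length needs the `list(...)` wrapper, while code that only iterates can keep streaming.

```python
from typing import Iterable, List


class FakeResource:
    """Hypothetical stand-in for Resource, for illustration only."""

    def __init__(self, revision_ids: List[int]) -> None:
        self._revision_ids = revision_ids

    def revisions(self) -> Iterable[int]:
        # Yield results lazily instead of building a full list up front.
        for revision_id in self._revision_ids:
            yield revision_id


resource = FakeResource([1, 2, 3])

# Streaming use: works directly on the iterable.
for revision in resource.revisions():
    pass

# List-style use (len, indexing): materialize the iterable first.
all_revisions = list(resource.revisions())
latest = all_revisions[-1]
```
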
[guppy]: https://pypi.org/project/guppy3/
[memory leak profiling]: https://github.com/davidfstr/Crystal-Web-Archiver/wiki/Testing-for-Memory-Leaks
[os.sendfile]: https://docs.python.org/3/library/os.html#os.sendfile
[shutil.copyfileobj]: https://docs.python.org/3/library/shutil.html#shutil.copyfileobj