Crystal-web

Latest version: v1.9.0b0



1.9.0b

This release contains error-handling improvements and bug fixes
in preparation for Crystal to exit beta status.

* Parsing improvements
* Can identify `'https://'` inside JavaScript as a URL reference,
which helps download/serve sites using Disqus.
* Can identify URL references inside `<style>` elements.
* Can identify URL references inside `<* style="...">`,
which helps download/serve sites based on phpBB.
* Can identify `data:` URL references inside `<* srcset="...">`.
* Can rewrite URL references that use [Subresource Integrity].
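
As a rough illustration of the first item above, one way to spot URL references in script text is to look for quoted string literals that begin with `http://` or `https://`, including a bare `'https://'` prefix that is later concatenated with a host name. This is only a sketch under that assumption, not Crystal's actual parser; `find_js_url_references` is a hypothetical name.

```python
import re

# Matches a single- or double-quoted literal that starts with http:// or https://.
# The part after the scheme may be empty, so a bare 'https://' also matches.
_JS_URL_LITERAL = re.compile(r"""(['"])(https?://[^'"\\\s]*)\1""")


def find_js_url_references(script_text: str) -> list[str]:
    """Return the URL-like string literals found in a piece of JavaScript."""
    return [m.group(2) for m in _JS_URL_LITERAL.finditer(script_text)]


script = "s.src = 'https://' + name + '.example.com/embed.js';"
print(find_js_url_references(script))
```

A regex like this will miss URLs built in other ways (template literals, escaped quotes), so it should be read as a starting point only.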

* Error handling improvements
* Crashes now provide tracebacks with more context,
back to the start of the thread in which the crash occurred.

* Major fixes
* Fix crash when dynamically downloading a served URL which is a member of
a group being actively downloaded.

* Minor fixes
* Do not show a progress dialog if the related operation completes quickly,
fixing a flickering effect especially noticeable on Windows.
* Can now save projects named with url-unsafe characters like ``.
* Fix closing a project to no longer have a race condition that
could cause use-after-free of wxPython objects and corrupt memory,
potentially crashing Crystal later.

1.8.0b

This release contains many workflow improvements, error-handling improvements,
and bug fixes in preparation for Crystal to exit beta status.

It is faster than ever before to define the structure of a site using the UI,
with support for loose browser-style URL entry, better guesses for names and
sources of entities, and the ability to rename entities after creation.

* First-time-run experience improvements
* New Root URL Dialog:
* Accept URLs in loose format, similar to what regular web browsers accept.
* Better error message when trying to create a duplicate root URL.
* Disallow creating an empty root URL.
* New Group Dialog:
* Improve suggested source when creating a new group.
* New Root URL and New Group Dialogs:
* Improve suggested name when creating a new root URL or group.
* Make it optional to provide a name.
* Rearrange fields to deemphasize the name field.
* Update the selected node in the Entity Tree intelligently after
creating or forgetting a root URL or a group.
* Allow resizing.
* Ignore leading and trailing whitespace in URLs and URL patterns.
* Main Window
* Prevent resizing the window to be too small.
* Use ⚓️ and 📁 icons consistently in the UI to refer to
Root URLs and Groups respectively.
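
The loose URL entry and whitespace handling described above could look roughly like the following. This is a minimal sketch, not Crystal's implementation: the function name `normalize_loose_url` and the choice of `https` as the scheme to assume when none is typed are both assumptions made for illustration.

```python
def normalize_loose_url(text: str, default_scheme: str = "https") -> str:
    """Turn browser-style input such as ' example.com/blog ' into a full URL."""
    url = text.strip()  # ignore leading and trailing whitespace
    if not url:
        # Mirrors the idea of disallowing an empty root URL
        raise ValueError("URL must not be empty")
    if "://" not in url:
        # Assumption: input without a scheme is treated as an https URL
        url = f"{default_scheme}://{url}"
    return url


print(normalize_loose_url("  example.com/blog "))
```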

* Workflow improvements
* Can now edit the name and source of Root URLs and Groups after creation.
* Can mark a resource group as "do not download" to prevent its members
from being downloaded when in an embedded context.
* By default the Default Domain will be set to match the first Root URL
created, enabling reliable serving of more-modern websites with
client-side URL routing.

* Improved support for Default URL Prefixes
* Can now set a Default Domain when serving a downloaded project.
Previously only a Default Directory could be set.
* The Default Domain/Directory can be set to match a Root URL that
is being created or edited.
* A top-level menuitem can now be used to set the Default Domain/Directory
to match an existing Root URL. Previously it was necessary to use a
right-click menuitem instead.
* Hovering over a URL or Group in the Entity Tree always shows the
full URL or URL Pattern for the entity, even if a Default Domain
or Default Directory is set.

* Downloading improvements
* Remaining time is now reported while downloading groups whose members
are slow to download, taking >7 seconds each.
* Don't crash when downloading a group that contains some member
URLs that were already downloaded.
* This crash bug was introduced in v1.7.0b with the new strategy of
creating member download tasks on demand rather than upfront.
* Don't crash when trying to download a URL that is already downloading.
* Don't crash when trying to download a group that has no member URLs.

* Crawling improvements
* Download the implicit favicon referenced by the root page of any domain.

* Parsing improvements
* Can identify URL references to images inside `<source srcset="...">`.
* Gracefully handle references to invalid URLs like `"//*[id='"`
rather than crashing.
* Parse links from RSS and Atom feeds advertised with a specialized XML MIME type
like `application/rss+xml` or `application/atom+xml`.

* Serving improvements
* XML files like Atom feeds and RSS feeds are now served correctly,
without introducing an invalid `<script>` tag.
* The log showing HTTP requests made to the served project now always
displays inside the main window, rather than attempting to appear as an
attached drawer on macOS and Windows.
* Drawers are a concept not natively supported by any OS except
macOS, and even there they are deprecated.
* Drawers have never worked properly on Linux, due to Wayland not
providing APIs to position windows precisely relative to each other.
* The old drawer mode didn't stay attached to the main window properly
when using Mission Control on macOS.

* Error handling improvements
* If a task crashes, show it as crashed in the UI and allow it to be dismissed.
* If the scheduler thread crashes, show it as crashed in the UI and allow it to be restarted.
* If an update to the entity tree crashes,
show the crash in the UI and allow the entity tree to be refreshed.

* Testing improvements
* Waits now use a soft timeout in addition to a hard timeout,
which makes it easier to tune/bump timeout durations as needed.
* Triggering a soft timeout causes a warning to be logged.
* Warnings logged during a test run are collected and reported
at the end of the test run.
* Warnings logged during a test run are reported to GitHub Actions
as warning annotations.
* A screenshot is taken automatically whenever a timeout error occurs
and whenever a rich assertion method (from `asserts.py`) fails.
* A terminal bell sound is played automatically when tests finish running.
* When an abort() or SIGABRT occurs while running tests during continuous integration,
print a stack trace using faulthandler.
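
The soft/hard timeout idea from the first item above can be sketched as a polling wait that logs a warning once the soft limit passes and only fails at the hard limit. The names `wait_for` and `HardTimeoutExceeded` are hypothetical and are not taken from Crystal's test utilities.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


class HardTimeoutExceeded(Exception):
    """Raised when the condition is still false after the hard timeout."""


def wait_for(
    condition: Callable[[], bool],
    soft_timeout: float,
    hard_timeout: float,
    poll_interval: float = 0.01,
) -> None:
    """Poll condition() until it is true, warning at soft_timeout and failing at hard_timeout."""
    if soft_timeout > hard_timeout:
        raise ValueError("soft_timeout must not exceed hard_timeout")
    start = time.monotonic()
    warned = False
    while True:
        if condition():
            return
        elapsed = time.monotonic() - start
        if elapsed >= hard_timeout:
            raise HardTimeoutExceeded(f"gave up after {elapsed:.2f}s")
        if elapsed >= soft_timeout and not warned:
            # Soft timeout: keep waiting, but leave a trace that the wait was slow
            logger.warning("soft timeout of %.2fs exceeded", soft_timeout)
            warned = True
        time.sleep(poll_interval)
```

With this shape, a wait that regularly trips the soft timeout shows up as a warning well before it starts failing, which is a signal that the hard timeout may need to be raised.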

* Major fixes
* Fixed multiple cases where code updating the task tree accessed
the task hierarchy without synchronizing with the scheduler thread,
which could cause crashes when downloading groups containing members
that were already downloaded in the same session.
* Access to the task hierarchy is now protected with scheduler_affinity
and explicit is_synced_with_scheduler_thread() checks.
* This issue was first introduced in v1.7.0b and is now fixed.
* For projects on a non-SSD drive, fix issue where newly created groups
did not find any member URLs that were discovered since the project was opened.
It was previously necessary to reopen a project to reliably find all
members of a recently created group.
* This issue was first introduced in v1.7.0b and is now fixed.

* Minor fixes
* Prevent system idle sleep while tasks are running,
in more situations on macOS.
* Fix disappearance of error nodes when a new root URL or group is added.
* When trying to create a group with an empty URL pattern, show an error dialog
rather than silently failing.

1.7.0b

This release features further improvements to downloading large websites
(up to 10 million URLs). Projects open in constant time. Memory usage
while downloading large groups remains constant.

Additionally .crystalproj documents now have an appropriate icon and can be
easily opened by double-clicking that icon on all supported operating systems.

There has also been a major change to the .crystalproj format: Revisions are
now stored in a hierarchy of directories rather than as a single flat directory.
Crystal continues to be able to read and write projects of all major versions.

* First-time-run experience improvements
* App name, logo, and icon fixes
* macOS: Fix application menu title and title of its menuitems
* Windows: Add app icon and Windows-friendly title to main window
* Linux: Fix app title and icon in dock to be correct
* .crystalproj package changes
* .crystalproj packages now contain a README so that users on computers
without Crystal are informed about what a .crystalproj package is
and how to open it
* Windows/Linux: .crystalproj packages now have an icon
* Windows: .crystalproj packages that are double-clicked open in Crystal
* Windows/Linux: .crystalproj packages now contain a .crystalopen file
so that it is easy to open a .crystalproj from a file browser
* macOS: Hide .crystalproj and .crystalopen file extensions

* Large project improvements (with 3,000,000 - 11,000,000 URLs)
* Large projects now open immediately because URLs and group members
are now loaded on demand rather than upfront.
* Large groups now start downloading faster because member
download tasks are now created on demand rather than upfront.
* The .crystalproj format has a new major change:
* .crystalproj format now stores revisions in a hierarchy of nested
directories rather than all of them inside a single directory,
to provide faster performance on filesystems which behave poorly
when a single directory has very many files.
* .crystalproj format now stores revisions in lexicographic order
in the filesystem so that when a project is copied to a new location,
the order of revisions on disk is preserved in the new copy.
* Projects using this new format have a `major_version` of `2`
in the `project_property` table.
* Projects whose database is on a solid state drive (SSD)
will use less memory because such projects will now prefer to
load group members via a database query rather than loading
*all* project URLs into memory.
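
To make the nested-directory idea above concrete, here is one possible way to map a numeric revision ID to a path so that no single directory holds too many files and so that lexicographic path order matches numeric order. The layout shown (12 hexadecimal digits in groups of 3 under a `revisions` directory) is purely hypothetical and is not claimed to be the actual .crystalproj layout.

```python
from pathlib import PurePosixPath


def revision_relpath(revision_id: int, digits: int = 12, group: int = 3) -> PurePosixPath:
    """Map a revision ID to a nested relative path (hypothetical layout)."""
    if not 0 <= revision_id < 16 ** digits:
        raise ValueError("revision_id out of range for this layout")
    # Fixed-width, zero-padded hex keeps lexicographic order equal to numeric order
    name = format(revision_id, f"0{digits}x")
    parts = [name[i:i + group] for i in range(0, digits, group)]
    return PurePosixPath("revisions", *parts)


print(revision_relpath(1))
print(revision_relpath(0x1ABC))
```

With groups of 3 hex digits, each directory in this sketch holds at most 4,096 entries.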

* Support changes
* Add Kubuntu as a supported Linux distribution
* Add KDE as a supported desktop environment, in addition to GNOME
* Drop support for macOS 10.14. macOS 12+ remains supported.

* Parsing improvements
* Can identify `<link rel="preload">` references as embedded.
* Can identify URLs inside `<script>` blocks with a trailing `?...` query.
* Improved reporting of unknown types of `<link rel="...">`.

* Serving improvements
* Multiple projects can be open and serving URLs at the same time.

* Error handling improvements
* When expanding a URL in the Entity Tree that downloaded with an error,
display an error node appropriately.
* When expanding an undownloaded URL in the Entity Tree that could not be
downloaded because the disk is full or the project has too many revisions,
display an error node appropriately.
* When expanding a URL in the Entity Tree whose revision body has been
deleted, try to redownload it automatically.

* Critical fixes
* Fix continuous integration to regularly run UI tests on macOS once again.
* Fix continuous integration to reliably fail if UI tests fail.

* Minor fixes
* Eliminated race condition where scheduler thread could try to read from
the root task's children list concurrently with a different thread
adding a new child to it.
* Hide the .crystalproj extension in the main window title if the extension is
hidden in the file browser.

* Backward-incompatible API changes
* `fg_call_later`, `fg_call_and_wait`, `bg_call_later`:
* Keyword arguments are now required for all optional parameters.
* Arguments passed in the format `*_call_*(callable, ...)`,
must now be passed as `*_call_*(callable, args=(...))`.
* `no_profile=` is replaced with `profile=`.
* `force=` is renamed to `force_later=`.
* The `OpenProjectProgressListener` interface has substantially changed
to reflect the new strategy for opening projects.
* Additionally, a new `LoadUrlsProgressListener` interface is
introduced to allow monitoring of when a project decides to
load its URLs. It can be provided to `Project.__init__`.
* `Project.title` has been removed.
Calculate a reasonable title from `Project.path` instead.
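
The calling-convention change for the `*_call_*` functions can be illustrated with a stand-in. The stub below only mimics the keyword-only `args=` shape described above; it runs the callable immediately, whereas the real functions schedule work on other threads, and it does not reproduce their full signatures.

```python
from typing import Any, Callable, Sequence


def call_later_stub(func: Callable[..., Any], *, args: Sequence[Any] = ()) -> Any:
    """Stand-in with the new shape: positional callable, keyword-only args."""
    return func(*args)


def add(a: int, b: int) -> int:
    return a + b


# Old style, no longer accepted:  call_later_stub(add, 1, 2)  -> TypeError
# New style:
result = call_later_stub(add, args=(1, 2))
print(result)
```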

1.6.0b

This release features significant improvements to downloading large websites
that have about 10 million URLs. Projects open and close faster. The UI is faster.
Downloads are faster. Progress bars are shown for all slow operations.
Estimated time remaining is shown when downloading groups.

* Large project improvements (with 3,000,000 - 11,000,000 URLs)
* Open projects containing many URLs in about 50% as much time as before:
* Approximate the URL count when loading a project in O(1) time
rather than getting an exact URL count in O(r) time,
where r = the number of URLs in the project
* Decrease the time to load groups from O(r·g) to about O(r + g·log(r)),
where r = the number of URLs in the project and
g = the number of groups in the project
* Defer creation of Entity Tree nodes corresponding to group members
until the group is actually expanded
* Close projects with very many queued tasks (such as download tasks)
in O(1) time rather than O(t) time, where t = the number of queued tasks
* Speed up interacting with the Entity Tree and Task Tree when
there are very many URLs in a project:
* Entity Tree: Speed up expanding URL nodes when large groups exist,
now in O(k) time rather than O(r·k) time,
where k = the number of links originating from the URL node and
r = the number of URLs in the project.
* Entity Tree: Load only the first 100 members of each group, on demand
* Task Tree: Show only up to 100 children when downloading a group
* Speed up interacting with the Add Group dialog when
there are very many URLs in a project:
* When typing each character of a new URL pattern and no wildcard
has yet been typed, perform an O(1) search for matching URLs
in the preview pane.
* When typing each character of a new URL pattern and at least one
wildcard has been typed, perform an O(log(r)) search for matching URLs
in the preview pane, where r = the number of URLs in the project.
* Previously an O(r) search was performed in both of the above cases.
* Show progress while upgrading project with many URLs
* Show progress dialog when starting to download a large group
* Show elapsed time in all progress dialogs
* Prevent system idle sleep while tasks are running (on macOS and Windows)
* Print large numbers with comma separators or whatever the appropriate
separator is for the current locale
* Minimize memory use when there are very many URLs in a project
by shrinking in-memory Resource, Task, TaskTreeNode, and NodeView objects
by defining explicit `__slots__`
* Minimize memory growth while downloading URLs in a project for
multiple hours or days
* If free disk space drops too low then refuse to download further resources
* Quit immediately even when a project with many resources was open recently
* Open preferences dialog significantly faster for projects containing many URLs
* Significantly speed up creation of tasks that have many children,
such as tasks that download groups with very many members
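
For readers unfamiliar with the `__slots__` technique mentioned in the list above: declaring `__slots__` tells Python to store a fixed set of attributes without a per-instance `__dict__`, which reduces the memory used by each object. The classes below are generic illustrations, not Crystal's actual `Resource` or `Task` classes.

```python
class NodeWithDict:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, url: str, parent: object) -> None:
        self.url = url
        self.parent = parent


class NodeWithSlots:
    """Slotted class: attributes live in fixed slots, with no per-instance __dict__."""

    __slots__ = ("url", "parent")

    def __init__(self, url: str, parent: object) -> None:
        self.url = url
        self.parent = parent


plain = NodeWithDict("https://example.com/", None)
slotted = NodeWithSlots("https://example.com/", None)
print(hasattr(plain, "__dict__"), hasattr(slotted, "__dict__"))
```

The trade-off is that a slotted instance cannot gain attributes that are not listed in `__slots__`; assigning one raises `AttributeError`.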

* First-time-run experience improvements
* Improve defaults
* New/Open Project Dialog: Default to creating a new project rather
than opening an existing one.
* New Group Dialog: Expand "Preview Members" by default.
* Polish user interface
* Use consistent words to refer to common concepts
* {Create, Add} -> New
* {URL, Root URL} -> Root URL
* Add menus
* macOS: Add proxy icon to the project window, making it easier to navigate
to the project in the Finder.
* Add app name to version label in lower-left corner of project window.
* Add keyboard shortcuts everywhere
* Groups without a source can now be downloaded, as one would expect.
* Task Tree: Remove top-level tasks that complete periodically,
rather than waiting for all of them to complete first

* Critical fixes
* Linux: Fix dialog that appears on app launch to be sized correctly.
* Linux: Fix View button to open browser even if Crystal is run from a read-only volume.
* Linux: Fix most other dialogs to be sized correctly.
* macOS: Fix issue where dialogs could appear at unusual locations,
including offscreen.

* Crawling improvements
* Don't recurse infinitely if a resource identifies an alias of itself as an
embedded resource.

* Downloading improvements
* Show estimated time remaining and speed when downloading groups and URLs
* Download faster
* Reinstate the ASSUME_RESOURCES_DOWNLOADED_IN_SESSION_WILL_ALWAYS_REMAIN_FRESH
optimization that was disabled in v1.4.0b, which significantly speeds up
downloading groups of HTML pages that link to similar URLs
* Support immediate early completion of download tasks for URLs
that were downloaded in the current session or a recent session
* Record links while downloading faster by writing all of them to the
project in bulk rather than one by one
* Open the project's underlying SQLite database
in [Write-Ahead Logging (WAL) mode](https://www.sqlite.org/wal.html)
which is faster than the default mode
* Change delay between downloads to be inserted after each HTML page downloads
(with its embedded resources), rather than after each single resource downloads.
This new behavior simulates user browsing more closely and results in
much faster downloading of HTML pages with many images
(or other embedded resources).
* Parallelize download of URLs from origin server with writes to local
database where possible.
* Avoid querying the database for revisions of a URL if it is already
known that there are no revisions because of other information
cached in memory
* Precompile XPath selectors used to parse links from HTML
* Use an optimized version of [shutil.copyfileobj] that avoids
repeatedly allocating intermediate buffers
* Maximum download speed increased from 1 item/sec to 2 items/sec
* Autopopulate an HTTP Date header when downloading if none provided
by origin server, as per RFC 7231 §7.1.1.2.
* Load HTTPS CA certificates from certifi on Windows,
in addition to from the system CA store.
* Load HTTPS CA certificates from `$SSL_CERT_FILE` if specified.
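
Switching a SQLite database to Write-Ahead Logging mode, as mentioned in the list above, is done with a pragma. The sketch below uses Python's standard `sqlite3` module; the helper name `open_in_wal_mode` is made up for this example and says nothing about how Crystal structures its own database code.

```python
import sqlite3


def open_in_wal_mode(db_path: str) -> sqlite3.Connection:
    """Open a SQLite database file and switch it to WAL journaling."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode that is actually in effect
    (mode,) = conn.execute("PRAGMA journal_mode=WAL").fetchone()
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL mode (got {mode!r})")
    return conn
```

WAL mode is persistent: once set on a database file, later connections to the same file use it too. In-memory databases cannot use WAL, so the helper raises for `":memory:"`.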

* Parsing improvements
* Links are parsed in about 18% as much time as before.
* Can identify URL references inside `<img srcset="...">`.
* Skip parsing links in downloaded files known to be binary files.

* Serving improvements
* Server logs are now displayed in a UI drawer.
* Links to anchors on the same page are no longer rewritten,
for better compatibility with JavaScript libraries that
treat such links specially.
* Archived pages are read from disk about 45% faster by avoiding an
unnecessary `os.stat` call.
* Archived pages are served faster and more efficiently by using
the [os.sendfile] primitive when supported by the operating system.
* Don't warn about unknown X- HTTP headers.
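
As an illustration of the `os.sendfile` item above, the sketch below sends a file's bytes over a socket using `socket.sendfile()`, a standard-library wrapper that uses `os.sendfile` when the platform supports it and falls back to plain `send()` otherwise. It is a simplified stand-in rather than Crystal's server code, and `send_file_body` is a hypothetical name.

```python
import socket


def send_file_body(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over `sock`; return the number of bytes sent."""
    with open(path, "rb") as f:
        # Uses os.sendfile where available, so the kernel copies the data
        # directly without passing it through a Python-level buffer
        return sock.sendfile(f)
```

Note that `socket.sendfile()` requires a blocking stream socket and a file opened in binary mode.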

* CLI improvements
* Profiling warnings:
* Several foreground tasks are optimized so that they
no longer print slow foreground task warnings
* Slow garbage collection operations now print a profiling warning
* Slow "Recording links" operations now print a profiling warning
* Include [guppy] module for manual [memory leak profiling].
* A `$PYTHONSTARTUP` file can be defined that is run automatically
at the beginning of a shell session.

* Error handling improvements
* When attempting to download a previously-downloaded revision that is
missing a body file on disk, delete & redownload the old revision.

* Testing improvements
* An entire test module can now be run with `--test`, in addition to
individual test functions.

* Minor fixes
* Clear completed root tasks in all cases, even in the rare case where
all tasks except the first one are complete
* When deleting a ResourceRevision, don't delete revision body if project
is read-only and also properly mark related Resource as no longer being
downloaded this session
* When querying a ResourceRevision's size, don't crash with a traceback
* When running as a macOS .app, log stdout and stderr to files correctly
once more

* Backward-incompatible API changes
* `Resource.revisions()` now returns `Iterable[ResourceRevision]` instead
of `List[ResourceRevision]` to support streaming results.
* If the old behavior is desired, wrap calls to `Resource.revisions()`
inside of a `list(...)` expression.
* `MainWindow.frame` is no longer public.
* `ResourceRevision.load()` has been renamed to
`ResourceRevision._load_from_data()` and privatized.
* A replacement `ResourceRevision.load()` method now exists that loads
an existing revision given an ID.
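
The practical effect of the `Iterable` return type can be shown with a stand-in generator. `revisions_stub` below is hypothetical and merely plays the role of `Resource.revisions()`: a streamed result can be looped over, but needs `list(...)` before it can be indexed or measured with `len()`.

```python
from typing import Iterable


def revisions_stub() -> Iterable[str]:
    """Stand-in for a method that streams results instead of building a list."""
    yield "revision-1"
    yield "revision-2"


# Streaming use: process items one at a time
for rev in revisions_stub():
    print(rev)

# Old list-like behavior: materialize the results first
revs = list(revisions_stub())
print(len(revs), revs[0])
```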

[guppy]: https://pypi.org/project/guppy3/
[memory leak profiling]: https://github.com/davidfstr/Crystal-Web-Archiver/wiki/Testing-for-Memory-Leaks
[os.sendfile]: https://docs.python.org/3/library/os.html#os.sendfile
[shutil.copyfileobj]: https://docs.python.org/3/library/shutil.html#shutil.copyfileobj

1.5.0b

This release focuses on making it easy to install Crystal from PyPI,
adds support for running on Linux from source (but not from a binary),
and fixes many bugs with the built-in CLI shell.

Additionally items in the main window are easier to understand
because icons and tooltips have been added for all tree nodes.

* Distribution improvements
* Can install Crystal using pipx and pip, from PyPI:
* `pipx install crystal-web`
* `crystal`
* Can run Crystal using `crystal` binary:
* `poetry run crystal`
* Can run Crystal using `python -m crystal`:
* `poetry run python -m crystal`
* Add support for Linux platform (Ubuntu 22.04, Fedora 37)

* CLI improvements
* Fixed shell to not hang if exited before UI exited, under certain circumstances.
* Fixed {help, exit, quit} functions to be available when Crystal runs as an .app or .exe.
* Altered the message shown when exiting while windows are open to be more accurate.
* Pinned the public API of `Project` and `MainWindow`.

* Testing improvements
* Tests are much faster now that download delays are minimized while running tests.
* Failure messages are improved whenever a WaitTimedOut occurs.
* A screenshot is taken whenever a test fails.
* Several race conditions related to accessing the foreground thread are fixed.

* UI Improvements
* Icons and tooltips added to all tree nodes in the main window,
clarifying the different types of entities, links, and tasks that exist.
* Easy to distinguish between URLs and groups.
* Easy to see whether a URL was downloaded,
and whether it was downloaded successfully.
* URL clusters now show in their title how many members they contain.
* Fixed "Offsite" cluster nodes to update children appropriately whenever
the Default URL Prefix is changed.
* Fixed right-click on non-URL node to no longer print a traceback.
* Fixed attempt to download a group with no source to no longer print a traceback.

1.4.0b

This release adds early support for [incrementally redownloading sites
with new page versions](https://github.com/davidfstr/Crystal-Web-Archiver/issues/80).

It is also now possible to download sites requiring login from the UI
and a tutorial has been added showing how to do that.

There are also many stability improvements, with fewer wxPython-related
Segmentation Faults and dramatically improved automated test coverage.

For more information see the [release notes](https://github.com/davidfstr/Crystal-Web-Archiver#release-notes-).
