Archivebox

Latest version: v0.7.2



<pre><code>version: '3.7'

services:
    archivebox:
        image: nikisweeting/archivebox:latest
        command: server 0.0.0.0:8000
        stdin_open: true
        tty: true
        ports:
            - 8000:8000
        environment:
            - USE_COLOR=True
        volumes:
            - ./data:/data
</code></pre>


Screenshots

<img width="500px" alt="Screen Shot 2020-07-28 at 6 19 48 AM" src="https://user-images.githubusercontent.com/511499/88663507-bac8e580-d0a9-11ea-9c3f-25a8d12d3db4.png">
<img width="500px" src="https://user-images.githubusercontent.com/511499/88663619-e51aa300-d0a9-11ea-9a6b-b8f3851471a4.png">
<img width="500px" src="https://user-images.githubusercontent.com/511499/88663793-21e69a00-d0aa-11ea-9166-ca7a265af43a.png">
<img width="500px" src="https://user-images.githubusercontent.com/511499/88663848-31fe7980-d0aa-11ea-8f97-a60aed49f684.png">

New Features

A bunch of big changes:
- `pip install archivebox` is now available
- full transition to Django SQLite DB with migrations (making upgrades between versions much safer now)
- maintains an intuitive and helpful CLI that's backwards-compatible with all previous archivebox data versions
- uses argparse instead of hand-written CLI system: see `archivebox/cli/archivebox.py`
- new subcommands-based CLI for `archivebox` (see below)
- new Web UI with pagination, better search, filtering, permissions, and more
- 30+ assorted bugfixes, new features, and tickets closed

For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap
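
To illustrate what a subcommand-based `argparse` CLI looks like, here is a minimal sketch. It is not the actual code in `archivebox/cli/archivebox.py`: the `build_parser` helper, the `--depth` option, and the small subset of subcommands shown are assumptions chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a few subcommands (hypothetical, not ArchiveBox's real CLI)."""
    parser = argparse.ArgumentParser(prog="archivebox")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    # Each subcommand gets its own parser with its own arguments.
    add_cmd = subparsers.add_parser("add", help="add URLs to the archive")
    add_cmd.add_argument("urls", nargs="+", help="one or more URLs to archive")
    add_cmd.add_argument("--depth", type=int, default=0, help="illustrative option")

    subparsers.add_parser("init", help="initialize a collection")
    subparsers.add_parser("version", help="print the version")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["add", "https://example.com"])
print(args.subcommand, args.urls)
```

With `argparse`, the help text and error messages for each subcommand are generated automatically, which a hand-written CLI would have to implement itself.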

Released in this version:

Install Methods:
- ✅ [`pip/pipenv install archivebox [--dev]`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-pip-install-archivebox)
- ✅ [`docker run nikisweeting/archivebox` / `docker-compose up`](https://github.com/pirate/ArchiveBox/wiki/Docker)
- ❌ [`apt/brew/pkg/yum/nix/etc install archivebox`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-pip-install-archivebox) (maybe later)

Command Line Interface:
- ✅ [`archivebox`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-help-h--help)
- ✅ [`archivebox version`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-version--version)
- ✅ [`archivebox help`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-help-h--help)
- ✅ [`archivebox init`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-init)
- ✅ [`archivebox status`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-status)
- ✅ [`archivebox add`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-add)
- ✅ [`archivebox remove`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-remove)
- ✅ [`archivebox update`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-update)
- ✅ [`archivebox list`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-list)
- ✅ [`archivebox schedule`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-schedule)
- ✅ [`archivebox config`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-config)
- ✅ [`archivebox server`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-server)
- ✅ [`archivebox shell`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-shell)
- ✅ [`archivebox manage`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-manage)
- ❌ [`archivebox oneshot`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-oneshot)
- ❌ [`archivebox export`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-export)
- ❌ [`archivebox proxy`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-proxy)

Web UI:
- ✅ `/` Main index
- ✅ `/add` Page to add new links to the archive (but needs improvement)
- ✅ `/archive/<timestamp>/` Snapshot details page
- ✅ `/archive/<timestamp>/<url>` live wget archive of page
- ✅ `/archive/<timestamp>/<extractor>` get a specific extractor output for a given snapshot
- ✅ `/archive/<url>` shortcut to view most recent snapshot of given url
- ✅ `/archive/<url_hash>` shortcut to view most recent snapshot of given url
- ✅ `/admin` Admin interface to view and edit archive data
- ✅ `/old.html` Backwards-compatible static HTML index for the previous version


Python API:
- ✅ [`from archivebox.main import add, remove, info, config, etc...`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#api-for-normal-archivebox-usage)
- ✅ [`from archivebox.core.models import Snapshot, User, etc...`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#api-for-all-useful-subcomponents)
- ✅ [`from archivebox.extractors import media, wget, screenshot, etc...`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#api-for-all-useful-subcomponents)
- ✅ [`from archivebox.index import json, sql, html, etc...`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#api-for-all-useful-subcomponents)
- ✅ [`from archivebox.parsers import pinboard_rss, pocket_html, generic_json, etc...`](https://github.com/pirate/ArchiveBox/wiki/Roadmap#api-for-all-useful-subcomponents)

(Red ❌ features are still unfinished and will be released in later versions)

0.8.5rc

> [!WARNING]
> *This is a *BETA pre-release* that improves upon the previous v0.8.4-rc ALPHA pre-release*. **The next stable release will be v0.9.0.** The `v0.8.x-rc` series of releases are for collecting feedback while we make [big architectural improvements](https://github.com/ArchiveBox/ArchiveBox/issues/1526) to support a new public plugin marketplace + ecosystem (powered by [`pluggy`](https://pluggy.readthedocs.io/en/stable/index.html) + [`huey`](https://huey.readthedocs.io/) + [`pydantic`](https://pydantic-docs.helpmanual.io/)). We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)

<details>
<summary>⬇️ BETA Instructions: 1. backup your collection 2. install the <code>:dev</code> branch with <code>docker</code>/<code>pip</code> <i>(expand for details)</i></summary>
<br/>

1. 🗜️ Always make a full backup before installing new BETA releases!
Remember, this is an unstable *sneak-preview* <a href="https://github.com/ArchiveBox/ArchiveBox/issues/1526">in the middle of a rewrite</a>, so it MAY DAMAGE DATA.
<pre><code>gzip -k ./data/index.sqlite3    # do this at least 🙏
zip -r data.bak.zip data        # OR even better: backup the entire data dir
</code></pre>


2. 📦 Then install the latest nightly build from source with Docker or Pip:
<pre><code>docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
OR
pip install 'git+https://github.com/ArchiveBox/ArchiveBox@dev'
</code></pre>

3. ↗️ Then run <code>archivebox init</code> to upgrade your collection:
<i><b>This may take several hours to migrate existing data from v0.7.x</b> on slower HDDs (up to ~1min/1000 URLs).</i>

<pre><code>archivebox install    # make sure all package and runtime dependencies are installed & available
archivebox init       # run data migrations (slow, theoretically safe to Ctrl+C and resume, but try not to)
archivebox version    # check that everything updated properly and dependencies are installed
archivebox status     # see a health report on the collection index & snapshot directories
</code></pre>

4. 💬 Let us know if you find a bug or have suggestions by [opening a new issue](https://github.com/ArchiveBox/ArchiveBox/issues)! In particular we want to hear:
- was the upgrade/migration process smooth?
- can you find any areas of the UI/CLI that are slow?
- how do you like the new plugin system? (see `archivebox/plugins_extractor/*`) Would you contribute a new plugin?

</details>
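
For anyone who prefers to script the backup step from the BETA instructions, the following is a rough Python equivalent of the `gzip -k` and `zip -r` commands. It is only a sketch: the `backup_collection` helper is hypothetical, and the demo at the bottom runs against a throwaway directory with a placeholder file rather than a real collection.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def backup_collection(data_dir: Path, archive_base: Path) -> Tuple[Path, Path]:
    """Gzip index.sqlite3 next to the original, then zip the whole data dir."""
    index = data_dir / "index.sqlite3"
    index_gz = index.with_name(index.name + ".gz")

    # Like `gzip -k`: write a compressed copy and keep the original file.
    with index.open("rb") as src, gzip.open(index_gz, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Like `zip -r data.bak.zip data`: archive the entire data directory.
    zip_path = shutil.make_archive(
        str(archive_base), "zip", root_dir=data_dir.parent, base_dir=data_dir.name
    )
    return index_gz, Path(zip_path)


# Demo on a temporary directory with a fake index file (not a real collection).
workdir = Path(tempfile.mkdtemp())
demo_data = workdir / "data"
demo_data.mkdir()
(demo_data / "index.sqlite3").write_bytes(b"placeholder index contents")
index_gz, zip_path = backup_collection(demo_data, workdir / "data.bak")
print(index_gz.name, zip_path.name)
```

Stop any running ArchiveBox process before copying the SQLite file so the backup is not taken in the middle of a write.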

---

Highlights


<img width="24%" align="top" alt="Screenshot 2024-10-03 at 4 34 08 AM ArchiveBox shell" src="https://github.com/user-attachments/assets/4b663f9d-b8b4-4b2c-a2cf-c753b791136c"><img width="24%" align="top" alt="Screenshot 2024-10-03 at 4 33 52 AM ArchiveBox help" src="https://github.com/user-attachments/assets/a07cd6dd-2b87-4d72-8727-907ea3810e83"><img width="24%" align="top" src="https://github.com/user-attachments/assets/ddb76dee-3f0e-4e39-9d08-2ec6e089552f"/><img width="24%" align="top" src="https://github.com/user-attachments/assets/733e3848-e8f8-4f97-955c-aa5fb2520d12"/>


What's Changed
* 📦 Deprecated `apt` and `brew` install methods in favor of `pip` + new `archivebox install` cmd
* 🌈 Much improved `archivebox help`, `archivebox version`, and `archivebox shell` CLI interfaces
* ⚡️ Massive speedups to binary detection and loading at startup time
* ✍️ New `Machine`, `NetworkInterface`, and `InstalledBinary` models keep an audit log of host environment changes and health
* Many other bugfixes, speedups, and internal architecture improvements
* Move novnc web-ui to 8081 by agowa in https://github.com/ArchiveBox/ArchiveBox/pull/1522
* Add OpenContainer Image Format Annotations as Labels to Docker Image by mpgirro in https://github.com/ArchiveBox/ArchiveBox/pull/1525

New Contributors
* agowa made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1522
* mpgirro made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1525

**Full Changelog**: https://github.com/ArchiveBox/ArchiveBox/compare/v0.8.4-rc...v0.8.5-rc

0.8.4rc

> [!WARNING]
> *This is an *ALPHA pre-release* that improves upon the previous v0.8.3-rc ALPHA pre-release*. **The next stable release will be v0.9.0.** The `v0.8.x-rc` series of releases are for collecting feedback while we make [big architectural improvements](https://github.com/ArchiveBox/ArchiveBox/issues/1526) to support a new public plugin marketplace + ecosystem (powered by [`pluggy`](https://pluggy.readthedocs.io/en/stable/index.html) + [`huey`](https://huey.readthedocs.io/) + [`pydantic`](https://pydantic-docs.helpmanual.io/)). We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)

---

Highlights

<img src="https://github.com/user-attachments/assets/522ddf0e-6074-428a-a07a-8ce92a57e6b2" width="29%" align="top"/><img src="https://github.com/user-attachments/assets/3f12b1c7-3876-4d71-92c4-291ee7d152a6" width="30%" align="top"/><img src="https://github.com/user-attachments/assets/f9d99c71-f9e8-4eb9-b385-ccd27d5f9425" width="28%" align="top"/>

- 🪵 moved to proper event-driven task system [huey](https://huey.readthedocs.io/) + [`django-huey-monitor`](https://github.com/boxine/django-huey-monitor)
- 🦸‍♂️ integrated [supervisord](http://supervisord.org) to manage bg workers
- 📦 integrated ansible/[pyinfra](https://github.com/pyinfra-dev/pyinfra) (an ansible alternative) to install subdependency packages at runtime
- ⚡️ continued switching from `runserver` to proper [Channels + Daphne ASGI](https://github.com/django/daphne/)
- 🧩 lots more plugins!

<img width="403" alt="Screenshot 2024-09-12 at 3 13 58 AM examplecom archivebox add" src="https://github.com/user-attachments/assets/725d0208-c4fa-4d96-9449-8f486bf1e8f1">

**Full Changelog**: https://github.com/ArchiveBox/ArchiveBox/compare/v0.8.3-rc...v0.8.4-rc

0.8.3rc

> [!WARNING]
> *This is an *ALPHA pre-release* that improves upon the previous v0.8.2-rc ALPHA pre-release*. **The next stable release will be v0.9.0.** The `v0.8.x-rc` series of releases are for collecting feedback while we make [big architectural improvements](https://github.com/ArchiveBox/ArchiveBox/issues/1526) to support a new public plugin marketplace + ecosystem. We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)

---

Highlights

![Screenshot 2024-09-06 at 3 22 01 AM Get Title Get Missing Archive again](https://github.com/user-attachments/assets/2f96bffe-262c-488c-b09f-90a2b9f19c28)

- New Admin action buttons text should make it clearer what the buttons do
- Adding new URLs / clicking action buttons now runs task in a BG thread instead of running synchronously (and often timing out)
- Added ability to click "View on site" from any object in admin to go directly to viewing the content
- Switched `archivebox server` from using `runserver` to a proper `daphne` ASGI server
- Added HTTP byte range request support (allows you to seek to the middle of a big .mp4 without downloading the whole thing)
- Added ability to regenerate ABIDs on objects that have gone out of sync
- New plugin system architecture is coming along, standard API for hooks now available in `plugantic/base_hook.py`
- improved CLI logging output using `rich` for pretty colors and nicer tracebacks
- improved HTTP request logging to filter out noisy 404/304/200 lines
- renamed `.created` -> `.created_at`, `.modified` -> `.modified_at`, `.added` -> `.bookmarked_at`, `.updated` -> `.downloaded_at`
- allow accessing admin change pages, API records, and archive contents by both ABID and ID (UUID)
- add ruff linting and lots of type hint improvements with pydantic
- improve auth and CSRF security for the new REST API (cookies no longer work for API auth, a token is appended to URLs instead)
- bump default `USER_AGENT` settings to chrome v128, bump `yt-dlp`, `singlefile`, etc. versions
- lots of other small fixes, speedups, and improvements!
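
As background on the HTTP byte range support listed above, here is a minimal sketch of how a server might interpret a single-range `Range` header. This is not ArchiveBox's implementation: the `parse_byte_range` helper is hypothetical, and multi-range requests are deliberately left out.

```python
import re
from typing import Optional, Tuple

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_byte_range(header: str, size: int) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) offsets for a single range, or None if invalid."""
    match = _RANGE_RE.match(header.strip())
    if not match or size <= 0:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form "bytes=-N": the final N bytes of the file.
        length = int(last)
        if length == 0:
            return None
        return max(size - length, 0), size - 1
    start = int(first)
    if start >= size:
        return None  # range starts past the end of the file
    # An open-ended or oversized range is clamped to the last byte.
    end = min(int(last), size - 1) if last else size - 1
    if end < start:
        return None
    return start, end


print(parse_byte_range("bytes=500-", 1000))
```

A video player seeking into a large `.mp4` sends a header like `Range: bytes=500-`, and the server answers with only that slice instead of the whole file.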


![Screenshot 2024-09-06 at 3 01 47 AM API Identifiers](https://github.com/user-attachments/assets/b95fbf16-5da4-4447-8ec4-a385831ef481)
![Screenshot 2024-09-06 at 3 01 44 AM USER SQUASH](https://github.com/user-attachments/assets/af7b0feb-723a-4dff-8712-bf4294b2eae0)

---

**Full Changelog**: https://github.com/ArchiveBox/ArchiveBox/compare/v0.8.2-rc...v0.8.3-rc

0.8.2rc

> [!WARNING]
> *This was a *BETA pre-release* that improved upon the previous v0.8.0-rc ALPHA pre-release*. This one brings us closer to a final v0.8 release and contains several core architectural improvements around how we key things with unique IDs, as well as a ✨ new Snapshot Detail UI ✨.

<img src="https://github.com/user-attachments/assets/4cea8156-f588-481a-8673-61d3f7e43703" height="400px"/>

![image](https://github.com/user-attachments/assets/3f03902d-2965-4434-a96d-b0a7197bfa8f)

![image](https://github.com/user-attachments/assets/8ccae5d8-61c0-4392-b5c4-eb4f6a7ab795)


**Changelog**: https://github.com/ArchiveBox/ArchiveBox/compare/v0.8.0-rc...v0.8.2-rc

0.8.0rc

WIP ALPHA pre-release for the upcoming ArchiveBox `v0.8` release.

> [!CAUTION]
> *This was an ALPHA pre-release*. We were promoting it a little earlier than usual because it contains ✨ lots of big new features ✨ and we want brave early adopters to help us test it!

<img height="300" alt="New ArchiveBox REST API" src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/341761e0-6dd3-4d17-bf0f-eaea46b3b0ea" align="top"><img height="300" alt="ArchiveBox Admin Webhooks UI" src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/d985e51c-f6c6-4a99-8f27-6c4c24997474" align="top"><img height="300" alt="ArchiveBox Configuration Admin UI" src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/3af7e7a1-5dc6-4513-9c2e-846eb9a57560" align="top"><img height="300" alt="S3/B2/SMB/NFS/GDrive Remote Storage Setup" src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/c84a0d01-43c9-4afc-9ff8-ed76623fde4c" align="top">


Highlights

- **[New REST API](https://github.com/ArchiveBox/ArchiveBox/pull/1397) built with `django-ninja` (thanks Brandl!)**
- **[New ability to send outgoing webhooks](https://github.com/ArchiveBox/ArchiveBox/pull/1418) triggered by archiving events**
- **[new support for S3/B2/Google Drive/etc. remote storage](https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage)** using Docker + `rclone`
- **[new ability to manage ArchiveBox config in Admin UI](https://github.com/ArchiveBox/ArchiveBox/pull/1420)** (read-only for now, ability to edit coming soon...)
- **[new noVNC remote viewing support for ArchiveBox browser](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile)** (grab the updated [`docker-compose.yml`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml) first!)
- **[upgraded to Django 5.0 internally](https://github.com/ArchiveBox/ArchiveBox/pull/1388) (thanks jimwins!)**
- [add new `*_EXTRA_ARGS` options](https://github.com/ArchiveBox/ArchiveBox/pull/1360) (thanks benmuth!) and new unified [`USER_AGENT` option](https://github.com/ArchiveBox/ArchiveBox/pull/1311/commits/1fc5d7c5c8aa9075ee05d7f7a7e2c8dc1d23fcd0)
- [add new `generic_jsonl` parser](https://github.com/ArchiveBox/ArchiveBox/pull/1370) (thanks jimwins!)
- [switch to `feedparser`](https://github.com/ArchiveBox/ArchiveBox/pull/1362) for RSS parsing (thanks jimwins!)
- remember `Snapshot` detail page header expanded/collapsed state
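
For readers unfamiliar with JSONL, the format handled by the new `generic_jsonl` parser is one JSON object per line. The sketch below shows the general idea only. It assumes each record stores its link under a `url` key, and the `parse_jsonl_urls` helper is hypothetical rather than the parser's real interface.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield the `url` field of every valid JSON object, one object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # Assumption: the link lives under a "url" key.
        if isinstance(record, dict) and isinstance(record.get("url"), str):
            yield record["url"]


sample = [
    '{"url": "https://example.com", "title": "Example"}',
    "",
    '{"url": "https://example.org"}',
]
print(list(parse_jsonl_urls(sample)))
```

Because every line is parsed independently, a large export can be streamed without loading the whole file, and one malformed line does not invalidate the rest.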

<details>
<summary>Expand to see more...</summary>

- add gitea and other domains to default GIT_DOMAINS list to run git archiving on
- check `/`, `/data`, and `/data/archive` in Docker and warn if running low on disk space
- Add COOKIES_FILE support for singlefile extractor by naoph in https://github.com/ArchiveBox/ArchiveBox/pull/1372
- Use `COOKIES_FILE` to fetch page titles by benmuth in https://github.com/ArchiveBox/ArchiveBox/pull/1364
- Fallback to not `chown`'ing `./data/archive` dir if it's a network mount that prevents ownership changes by gnattu in https://github.com/ArchiveBox/ArchiveBox/pull/1312
- Show the upgrade notification only in specific views by benmuth in https://github.com/ArchiveBox/ArchiveBox/pull/1314
- ability to populate is_staff and is_superuser flags at LDAP authentication by vladimirdulov in https://github.com/ArchiveBox/ArchiveBox/pull/1335
- Make it a little easier to run specific tests by jimwins in https://github.com/ArchiveBox/ArchiveBox/pull/1371
- disable chrome automatic self-updating when running headless
- Add ability to populate `is_staff` and `is_superuser` flags during LDAP first auth
- allow more restrictive NFS permission coercion on `./data/archive`
- bump `yt-dlp`, `singlefile`, `wget`, `curl`, and `chrome` versions
- fix `RESOLUTION` being ignored when using Chrome headless in Docker
- fix sorting by Size / Files in the Admin Snapshots list page UI
- fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
- fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
- fix wget extractor not finding output when `:80` or `:443` port is present in the original URL
- fix `/var/spool/cron/crontabs` permissions when mounting it via Docker
- fix `/browsers` chown on Docker `armv7` entrypoint failing

</details>
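
The low-disk-space warning mentioned in the list above can be approximated with the standard library. This is a sketch, not the check ArchiveBox ships: the `low_space_paths` helper and the 1 GiB threshold are assumptions, and only the three paths come from the release notes.

```python
import shutil
from typing import Iterable, List


def low_space_paths(paths: Iterable[str], min_free_bytes: int) -> List[str]:
    """Return the paths whose filesystem has fewer than min_free_bytes free."""
    low = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue  # path is missing or unreadable, so there is nothing to report
        if usage.free < min_free_bytes:
            low.append(path)
    return low


# Hypothetical threshold of 1 GiB; the real warning level may differ.
for path in low_space_paths(["/", "/data", "/data/archive"], min_free_bytes=1024**3):
    print(f"[!] Warning: low disk space on {path}")
```

Checking all three paths separately matters in Docker because `/data` and `/data/archive` are often mounted from different volumes than the container root.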

*COMING SOON: [new `sci-dl` scientific paper downloader](https://github.com/ArchiveBox/ArchiveBox/issues/720) being worked on by benmuth*

New Contributors
* Brandl made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1397
* tqobqbq made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1396
* gnattu made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1312
* speerer made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1323
* neel-suthar made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1330
* jimwins made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1365
* naoph made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1372
* rdela made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1374
* n-hebert made their first contribution in https://github.com/ArchiveBox/ArchiveBox/pull/1382

**Full Changelog**: https://github.com/ArchiveBox/ArchiveBox/compare/v0.7.2...v0.8.0-rc
