HfFileSystem: interact with the Hub through the Filesystem API
We introduce [HfFileSystem](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_file_system#huggingface_hub.HfFileSystem), a pythonic filesystem interface compatible with [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/). Built on top of `HfApi`, it offers typical filesystem operations like `cp`, `mv`, `ls`, `du`, `glob`, `get_file` and `put_file`.
```py
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

>>> # List all files in a directory
>>> fs.ls("datasets/myself/my-dataset/data", detail=False)
['datasets/myself/my-dataset/data/train.csv', 'datasets/myself/my-dataset/data/test.csv']

>>> # Read the content of a remote file
>>> train_data = fs.read_text("datasets/myself/my-dataset/data/train.csv")
```
Its biggest advantage is that it provides ready-to-use integrations with popular libraries like Pandas, DuckDB and Zarr.
```py
import pandas as pd

# Read a remote CSV file into a dataframe
df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

# Write a dataframe to a remote CSV file
df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
```
For a more detailed overview, please have a look at [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system).
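To make the path layout used in these examples more concrete, here is a minimal sketch of how a dataset path could be split into its parts. The `split_hf_path` helper and the `HubPath` tuple are hypothetical names used for illustration only and are not part of `huggingface_hub`. The sketch assumes a `datasets/` prefix followed by a `<namespace>/<name>` repo id, and it ignores revisions and other repo types.

```python
from typing import NamedTuple


class HubPath(NamedTuple):
    """Hypothetical container for the parts of a Hub path."""

    repo_type: str
    repo_id: str
    path_in_repo: str


def split_hf_path(path: str) -> HubPath:
    """Split a dataset path such as 'hf://datasets/<namespace>/<name>/<file>'."""
    # The 'hf://' protocol prefix is optional.
    if path.startswith("hf://"):
        path = path[len("hf://"):]
    parts = path.split("/")
    # Assumption: only the 'datasets/' prefix seen in the examples is handled.
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Unsupported path: {path!r}")
    repo_id = "/".join(parts[1:3])
    path_in_repo = "/".join(parts[3:])
    return HubPath("dataset", repo_id, path_in_repo)


print(split_hf_path("hf://datasets/my-username/my-dataset-repo/train.csv"))
```

The real `HfFileSystem` resolves paths itself, so a helper like this is only useful for reasoning about what each path segment means.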
* Transfer the `hffs` code to `hfh` by mariosasko in 1420
* Hffs misc improvements by mariosasko in 1433
Webhook Server
`WebhooksServer` lets you implement, debug and deploy webhook endpoints on the Hub without any overhead. Creating a new endpoint is as easy as decorating a Python function.
```python
# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
    if payload.repo.type == "dataset" and payload.event.action == "update":
        # Trigger a training job if a dataset is updated
        ...
```
For more details, check out this [twitter thread](https://twitter.com/Wauplin/status/1646893678500392960) or the [documentation guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks_server).
Note that this feature is experimental, which means the API or behavior might change without prior notice. A warning is displayed to the user when using it. As it is experimental, we would love to get feedback!
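Since the endpoint above runs inside a server, it can help to keep the decision logic in a plain function that can be tested offline. The following is a minimal sketch under that assumption. `Repo`, `Event` and `Payload` are hypothetical stand-ins that only mirror the two attributes used in the example (`repo.type` and `event.action`), not the real `WebhookPayload` class, and `should_trigger_training` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical stand-in for the `repo` part of a payload."""

    type: str


@dataclass
class Event:
    """Hypothetical stand-in for the `event` part of a payload."""

    action: str


@dataclass
class Payload:
    """Hypothetical stand-in for `WebhookPayload`."""

    repo: Repo
    event: Event


def should_trigger_training(payload: Payload) -> bool:
    """Return True only when a dataset repo has been updated."""
    return payload.repo.type == "dataset" and payload.event.action == "update"


print(should_trigger_training(Payload(Repo("dataset"), Event("update"))))
```

The endpoint itself could then call such a function and start the training job only when it returns `True`.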
* [Feat] Webhook server by Wauplin in 1410
Some upload QOL improvements
Faster upload with `hf_transfer`
`huggingface_hub` now integrates with `hf_transfer`, a Rust-based library that uploads large files in chunks and concurrently. Expect a 3x speed-up if your bandwidth allows it!
* feat: add `hf_transfer` upload by McPatate in 1395
Upload in multiple commits
Uploading large folders at once might be annoying if any error happens while committing (e.g. a connection error occurs). It is now possible to upload a folder in multiple (smaller) commits. If a commit fails, you can re-run the script and resume the upload. Commits are pushed to a dedicated PR. Once completed, the PR is merged to the `main` branch resulting in a single commit in your git history.
```py
from huggingface_hub import upload_folder

upload_folder(
    folder_path="local/checkpoints",
    repo_id="username/my-dataset",
    repo_type="dataset",
    multi_commits=True,  # resumable multi-upload
    multi_commits_verbose=True,
)
```
Note that this feature is also experimental, meaning its behavior might be updated in the future.
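The idea of splitting an upload into smaller, resumable commits can be illustrated with a short conceptual sketch. This is not how `upload_folder` is implemented. `plan_commits`, `already_uploaded` and `batch_size` are hypothetical names, and the sketch simply assumes that a resumed run knows which files were already committed.

```python
from typing import Iterable, List, Set


def plan_commits(
    files: Iterable[str], already_uploaded: Set[str], batch_size: int
) -> List[List[str]]:
    """Group the files that still need uploading into batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Skip files committed by a previous (interrupted) run, then sort for a stable plan.
    pending = sorted(f for f in files if f not in already_uploaded)
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]


all_files = ["a.bin", "b.bin", "c.bin", "d.bin", "e.bin"]
print(plan_commits(all_files, already_uploaded={"a.bin", "b.bin"}, batch_size=2))
```

With two files already uploaded and a batch size of 2, the remaining three files are planned as two commits: `["c.bin", "d.bin"]` and `["e.bin"]`.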
* New endpoint: `create_commits_on_pr` by Wauplin in 1375
Upload validation
More pre-validation is now done before committing files to the Hub: `upload_folder` ignores the `.git` folder (if any), and commits fail early on invalid paths.
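As a rough illustration of this kind of early check, the sketch below rejects a few obviously problematic paths before any upload would start. The `is_valid_path_in_repo` helper and its exact rules are assumptions made for this example. The actual validation performed by `huggingface_hub` may differ.

```python
from pathlib import PurePosixPath


def is_valid_path_in_repo(path_in_repo: str) -> bool:
    """Return False for paths that are empty, absolute, escape the repo, or touch `.git`."""
    if not path_in_repo or path_in_repo.startswith("/"):
        return False
    parts = PurePosixPath(path_in_repo).parts
    if not parts:  # e.g. "." collapses to an empty path
        return False
    # Reject parent-directory traversal and anything inside a `.git` folder.
    return ".." not in parts and ".git" not in parts


for candidate in ["data/train.csv", "../secret", ".git/config", ".gitignore"]:
    print(candidate, is_valid_path_in_repo(candidate))
```

Note that `.gitignore` is accepted here because only path components that are exactly `.git` are rejected.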
* Fix `path_in_repo` validation when committing files by Wauplin in 1382
* Raise issue if trying to upload `.git/` folder + ignore `.git/` folder in `upload_folder` by Wauplin in 1408
Keep-alive connections between requests
Internal update to reuse the same HTTP session across `huggingface_hub`. The goal is to keep the connection open across multiple calls to the Hub, which ultimately saves a lot of time. For instance, updating metadata in a README is now 40% faster, and listing all models from the Hub is 60% faster. This has no impact on atomic calls (e.g. a single standalone GET call).
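The effect of a keep-alive connection can be shown with the standard library alone. The sketch below does not use `huggingface_hub` or its session handling. It starts a throwaway local HTTP/1.1 server, sends several requests over a single `http.client.HTTPConnection`, and counts how many distinct client sockets the server saw. `_Handler` and `count_connections` are hypothetical names for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


class _Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive on the server side
    seen_client_ports: List[int] = []

    def do_GET(self) -> None:
        # Record which client socket (identified by its port) carried this request.
        self.seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # silence the default request logging


def count_connections(n_requests: int) -> int:
    """Send `n_requests` GETs over one connection; return the number of distinct client sockets seen."""
    _Handler.seen_client_ports = []
    server = HTTPServer(("127.0.0.1", 0), _Handler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        for _ in range(n_requests):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return len(set(_Handler.seen_client_ports))


print(count_connections(3))
```

All requests share one TCP connection, so the connection setup cost is paid only once. A single standalone request gains nothing from this, which matches the note about atomic calls.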
* Keep-alive connection between requests by Wauplin in 1394
* Accept backend_factory to configure Sessions by Wauplin in 1442
Custom sleep time for Spaces
It is now possible to programmatically set a custom sleep time on your upgraded Space. After X seconds of inactivity, your Space will go to sleep to save you some $$$.
```py
from huggingface_hub import set_space_sleep_time

# Put your Space to sleep after 1h of inactivity
set_space_sleep_time(repo_id=repo_id, sleep_time=3600)
```
* [Feat] Add `sleep_time` for Spaces by Wauplin in 1438
Breaking change
- `fsspec` has been added as a main dependency. It's a lightweight Python library required for `HfFileSystem`.
No other breaking change expected in this release.
Bugfixes & small improvements
File-related
A lot of effort has been invested in making `huggingface_hub`'s cache system more robust, especially when working with symlinks on Windows. Hopefully everything is fixed by now.
* Fix relative symlinks in cache by Wauplin in 1390
* Hotfix - use relative symlinks whenever possible by Wauplin in 1399
* [hot-fix] Malicious repo can overwrite any file on disk by Wauplin in 1429
* Fix symlinks on different volumes on Windows by Wauplin in 1437
* [FIX] bug "Invalid cross-device link" error when using snapshot_download to local_dir with no symlink by thaiminhpv in 1439
* Raise after download if file size is not consistent by Wauplin in 1403
ETag-related
Following a server-side configuration issue, we made `huggingface_hub` more robust and future-proof when retrieving ETags from the Hub.
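One of the changes in this section normalizes ETag variants. As a minimal sketch of what such normalization could look like, assuming the two variants are a quoted strong ETag (`"abc"`) and a weak ETag carrying the standard HTTP `W/` prefix (`W/"abc"`), the hypothetical `normalize_etag` helper below reduces both to the same bare value. It is an illustration, not the library's actual code.

```python
def normalize_etag(etag: str) -> str:
    """Strip the weak-validator prefix and surrounding quotes from an ETag header value."""
    etag = etag.strip()
    # A weak ETag is marked with a leading `W/` before the quoted value.
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


for raw in ['W/"abc123"', '"abc123"']:
    print(normalize_etag(raw))
```

Removing the `/` and quote characters matters when the ETag is later used as part of a file path in a local cache.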
* Update file_download.py by Wauplin in 1406
* 🧹 Use `HUGGINGFACE_HEADER_X_LINKED_ETAG` const by julien-c in 1405
* Normalize both possible variants of the Etag to remove potentially invalid path elements by dwforbes in 1428
Documentation-related
* Docs about how to hide progress bars by Wauplin in 1416
* [docs] Update docstring for repo_id in push_to_hub by tomaarsen in 1436
Misc
* Prepare for 0.14 by Wauplin in 1381
* Add force_download to snapshot_download by Wauplin in 1391
* Model card template: Move model usage instructions out of Bias section by NimaBoscarino in 1400
* typo by Wauplin (direct commit on main)
* Log as warning when waiting for ongoing commands by Wauplin in 1415
* Fix: notebook_login() does not update UI on Databricks by fwetdb in 1414
* Passing the headers to hf_transfer download. by Narsil in 1444
Internal stuff
* Fix CI by Wauplin in 1392
* PR should not fail if codecov is bad by Wauplin (direct commit on main)
* remove cov check in PR by Wauplin (direct commit on main)
* Fix restart space test by Wauplin (direct commit on main)
* fix move repo test by Wauplin (direct commit on main)