Huggingface-text-data-analyzer

Latest version: v1.1.0

1.1.0

New features and fixes:

- Better caching: cached results are now granular to each field and analysis type.
- Prompts before reusing cached data.
- More arguments for control, such as skipping basic analysis or always using cached results.
- Bug fixes.
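
One way to picture caching that is granular to field and analysis type is a cache file keyed on both. The sketch below is illustrative only and is not taken from the package: `cache_path`, `should_recompute`, the file layout, and every parameter name are hypothetical.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, dataset, split, field, analysis_type):
    """Return a cache file path unique to one (dataset, split, field, analysis) combination.

    Hypothetical helper for illustration; not part of the package's API.
    """
    # Join the identifying parts with a separator unlikely to appear in names.
    key = "\x1f".join([dataset, split, field, analysis_type])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{analysis_type}_{digest}.json"


def should_recompute(path, always_use_cache=False, ask=input):
    """Decide whether to recompute a result or reuse the cached file."""
    if not path.exists():
        return True  # nothing cached yet
    if always_use_cache:
        return False  # skip the prompt entirely
    answer = ask(f"Cached result found at {path}. Use it? [Y/n] ")
    return answer.strip().lower() in {"n", "no"}
```

Because the field and the analysis type are both part of the key, changing one of them invalidates only that entry and leaves the other cached results usable.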

1.0.0

Fixed many bugs to make the tool release-ready and added important features:

- Supports graph visualization of results.
- Removed the dependency on fast_text in favor of Huggingface models.
- Added more args.
- Fixed tons of bugs.
- Cleaned up files.

The repository also has an image now :)

v0.1.0-alpha

0.1.0

Initial release of a comprehensive tool for analyzing text datasets from HuggingFace's datasets library. This release provides both command-line and programmatic interfaces for performing detailed analysis of text datasets.

Installation

```bash
pip install huggingface-text-data-analyzer
```


Key Features

Basic Analysis
- Text length statistics with field-specific analysis
- Word distribution analysis and visualization
- Junk text detection (HTML tags, special characters)
- Batch-processed tokenizer analysis
- Chat template support for conversational datasets
- Configurable field analysis
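
To make the first few items concrete, here is a minimal sketch of text length statistics and junk detection in plain Python. It is not the package's implementation: the function names, the regular expressions, and the choice of which characters count as "special" are assumptions made for illustration.

```python
import re
import statistics

# Matches anything that looks like an HTML tag, e.g. <p> or </div>.
HTML_TAG = re.compile(r"<[^>]+>")
# Anything that is not a word character, whitespace, or common punctuation (assumed definition).
SPECIAL_CHAR = re.compile(r"[^\w\s.,;:!?'\"()\-]")


def length_stats(texts):
    """Summarise character and word counts for a non-empty list of strings."""
    char_lens = [len(t) for t in texts]
    word_lens = [len(t.split()) for t in texts]
    return {
        "count": len(texts),
        "mean_chars": statistics.mean(char_lens),
        "max_chars": max(char_lens),
        "mean_words": statistics.mean(word_lens),
    }


def junk_report(text):
    """Count HTML tags and unusual characters in one string."""
    tags = HTML_TAG.findall(text)
    # Remove the tags first so their angle brackets are not counted twice.
    stripped = HTML_TAG.sub("", text)
    return {
        "html_tags": len(tags),
        "special_chars": len(SPECIAL_CHAR.findall(stripped)),
    }
```

Running `length_stats` once per field (for example `instruction` and `response`) would give the kind of field-specific breakdown listed above.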

Advanced Analysis
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
- Language detection
- Sentiment analysis

Performance Optimizations
- Batch processing for tokenization
- Progress tracking with rich console output
- Tokenizer parallelism
- Caching support for tokenized texts
- Memory-efficient large dataset processing
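
Batch processing and memory-efficient iteration usually come down to feeding the tokenizer fixed-size chunks instead of the whole dataset. The following is a rough sketch under that assumption, not code from the package; `batched`, `count_tokens`, and the `tokenize` callable are hypothetical names.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without loading everything at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_tokens(texts, tokenize, batch_size=1000):
    """Sum token counts over batches.

    `tokenize` is any callable mapping a list of strings to a list of token lists.
    """
    total = 0
    for batch in batched(texts, batch_size):
        total += sum(len(tokens) for tokens in tokenize(batch))
    return total
```

A whitespace splitter such as `lambda batch: [t.split() for t in batch]` is enough to try this out; a real tokenizer would take its place in practice.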

Usage

Basic analysis:
```bash
analyze-dataset "dataset_name" --split "train" --output-dir "results"
```


Full analysis with all features:
```bash
analyze-dataset "dataset_name" \
  --advanced \
  --use-pos \
  --use-ner \
  --use-lang \
  --use-sentiment \
  --tokenizer "bert-base-uncased" \
  --fields instruction response
```
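
The same commands can be driven from a Python script by assembling the argument list and handing it to `subprocess`. This sketch uses only the flags shown above. The helper names `build_command` and `run_analysis` are hypothetical, and pairing `--advanced` with every `--use-*` flag is an assumption based on the example rather than documented behavior.

```python
import subprocess


def build_command(dataset, split=None, output_dir=None, tokenizer=None,
                  fields=(), advanced_flags=()):
    """Assemble an analyze-dataset argument list from the flags shown above."""
    cmd = ["analyze-dataset", dataset]
    if split:
        cmd += ["--split", split]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if advanced_flags:
        # Assumption: --advanced accompanies the individual --use-* switches.
        cmd.append("--advanced")
        cmd += [f"--use-{name}" for name in advanced_flags]
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if fields:
        cmd += ["--fields", *fields]
    return cmd


def run_analysis(**options):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_command(**options), check=True)
```

Passing a list rather than a single string avoids shell quoting problems when dataset or field names contain spaces.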


Requirements
- Python 3.8+
- Key dependencies: transformers, datasets, spacy, rich, torch, pandas, numpy, scikit-learn

Documentation
Full documentation and usage examples are available in the [README](README.md).

Notes
- First public release
- Apache License 2.0
- Contributions welcome
