Speakerbox

1.2.0

Speakerbox is a library that enables:

1. the creation of audio-based speaker identification datasets
2. training an audio-based speaker identification transformer model
3. applying a pre-trained audio-based speaker identification model to a new audio file to predict which portions of the audio belong to each known speaker
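
Taken together, these three steps form a single pipeline. The sketch below shows its shape; the function names (`preprocess.diarize_and_split_audio`, `train`, `apply`) are assumptions based on the project documentation, so consult the [docs](https://CouncilDataProject.github.io/speakerbox) for the exact API.

```python
# Sketch of the three-step Speakerbox pipeline; the function names
# here are assumptions -- see the project docs for the actual API.
from speakerbox import preprocess, train, apply

# 1. Create a dataset: diarize a recording into per-speaker chunks
#    that a human can quickly verify and label.
chunks_dir = preprocess.diarize_and_split_audio("0.wav")

# 2. Fine-tune a speaker identification transformer on the labeled chunks.
model_dir = train(chunks_dir)

# 3. Apply the trained model to unseen audio to get speaker-labeled
#    time spans.
annotation = apply("new-recording.wav", model_dir)
```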

This release completes the work for our [Journal of Open Source Software (JOSS)](https://joss.theoj.org/) paper.

The changes from v1.0.0 include:

* An example video added to the README demonstrating how to use this library on a toy example -- [YouTube Video Link](https://www.youtube.com/watch?v=SK2oVqSKPTE).
* A more thorough workflow diagram added to the README explaining how all the components of this library fit together.
* The example data used for model reproduction is now available for download directly from a Python command (see the sketch after this list).
* Upgraded to newer dependency versions.
* The JOSS paper content: [paper.md](https://github.com/CouncilDataProject/speakerbox/blob/main/paper/paper.md).
* Upgraded linting with [ruff](https://github.com/charliermarsh/ruff).
* Minor improvements to logging.
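
As an illustration of the example-data bullet above, the download could look like the following; the helper name is hypothetical, so check the documentation for the real entry point.

```python
# Hypothetical sketch: this helper name is an assumption, not the
# confirmed Speakerbox API -- consult the docs for the real one.
from speakerbox.examples import download_preprocessed_example_data

# Fetch the small annotated dataset used to reproduce the example
# model from the paper, returning the local path it was saved to.
dataset_path = download_preprocessed_example_data()
print(dataset_path)
```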

1.0.0

Speakerbox is a library for few-shot fine-tuning of a Transformer for speaker identification. This initial release has all the functionality needed to quickly generate a training set and fine-tune a model for use in downstream analysis tasks.

Given a set of multi-speaker recordings:


```
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
```

Where each recording contains some or all of a set of speakers, for example:

- 0.wav -- contains speakers: A, B, C, D, E
- 1.wav -- contains speakers: B, D, E
- 2.wav -- contains speakers: A, B, C
- 3.wav -- contains speakers: A, B, C, D, E
- 4.wav -- contains speakers: A, C, D
- 5.wav -- contains speakers: A, B, C, D, E

You want to train a model to classify portions of audio as one of the N known speakers
in future recordings not included in your original training set.

`f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]`

i.e. `f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]`
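
In plain Python, that output is just a list of timed, speaker-labeled segments. Below is a small self-contained sketch of representing and aggregating such segments; the `Segment` type is illustrative, not the library's actual return type.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_time: float  # seconds from the start of the recording
    end_time: float
    speaker: str

# The example prediction from above.
segments = [
    Segment(2.4, 10.5, "A"),
    Segment(10.8, 14.1, "D"),
    Segment(14.8, 22.7, "B"),
]

# Total speaking time per speaker.
totals: dict[str, float] = {}
for seg in segments:
    totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_time - seg.start_time)

print(totals)  # approximately {'A': 8.1, 'D': 3.3, 'B': 7.9}
```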

The `speakerbox` library contains methods for both generating datasets for annotation
and for utilizing multiple audio annotation schemes to train such a model.
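
On the annotation-scheme side, one such scheme is annotations exported from a transcription tool such as Gecko. A minimal sketch of seeding a dataset from existing annotations, assuming a hypothetical expansion helper and input format:

```python
# Sketch of building a dataset from existing annotation files; the
# helper name and input format are assumptions -- see the docs.
from speakerbox import preprocess

# Expand annotation files (e.g. Gecko-style JSON) plus their source
# audio into speaker-labeled training examples.
dataset = preprocess.expand_gecko_annotations_to_dataset(
    annotations_and_audios=[
        ("annotations/0.json", "example/0.wav"),
        ("annotations/1.json", "example/1.wav"),
    ],
)
```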

![Speakerbox example workflow](https://raw.githubusercontent.com/CouncilDataProject/speakerbox/main/docs/_static/images/workflow.png)

The following table shows model performance results as the dataset size increases:

| dataset_size | mean_accuracy | mean_precision | mean_recall | mean_training_duration_seconds |
|:-------------|--------------:|---------------:|------------:|-------------------------------:|
| 15-minutes   | 0.874 ± 0.029 | 0.881 ± 0.037  | 0.874 ± 0.029 | 101 ± 1 |
| 30-minutes   | 0.929 ± 0.006 | 0.940 ± 0.007  | 0.929 ± 0.006 | 186 ± 3 |
| 60-minutes   | 0.937 ± 0.020 | 0.940 ± 0.017  | 0.937 ± 0.020 | 453 ± 7 |

Please see our [documentation](https://CouncilDataProject.github.io/speakerbox) for more details.
