Speakerbox is a library for few-shot fine-tuning of a Transformer for speaker identification. This initial release has all the functionality needed to quickly generate a training set and fine-tune a model for use in downstream analysis tasks.
Given a set of multi-speaker recordings:
```
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
```
Where each recording contains some or all of a set of speakers, for example:
- 0.wav -- contains speakers: A, B, C, D, E
- 1.wav -- contains speakers: B, D, E
- 2.wav -- contains speakers: A, B, C
- 3.wav -- contains speakers: A, B, C, D, E
- 4.wav -- contains speakers: A, C, D
- 5.wav -- contains speakers: A, B, C, D, E
You want to train a model to classify portions of audio as one of the N known speakers
in future recordings that were not part of your original training set:
`f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]`
i.e. `f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]`
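As a rough illustration of how such a function could work (a sketch, not the speakerbox API), one approach is to classify fixed-length chunks of audio with a fine-tuned transformer checkpoint and merge consecutive chunks that receive the same label. The checkpoint path `trained-speakerbox`, the helper name `identify_speakers`, and the two-second chunk length below are all illustrative assumptions; the sketch uses the Hugging Face `transformers` audio-classification pipeline and `torchaudio`:

```python
# Illustrative sketch only -- not the speakerbox API.
# Assumes a fine-tuned audio-classification checkpoint saved at the
# hypothetical local path "trained-speakerbox".
import torchaudio
from transformers import pipeline

classifier = pipeline("audio-classification", model="trained-speakerbox")

def identify_speakers(audio_path, chunk_seconds=2.0):
    """Return [(start_time, end_time, speaker), ...] for one recording."""
    waveform, sample_rate = torchaudio.load(audio_path)
    mono = waveform.mean(dim=0).numpy()  # collapse channels to mono
    step = int(chunk_seconds * sample_rate)
    segments = []
    for start in range(0, len(mono), step):
        chunk = mono[start : start + step]
        # Top prediction for this chunk; the pipeline resamples if needed
        top = classifier({"raw": chunk, "sampling_rate": sample_rate})[0]
        begin = start / sample_rate
        end = min(start + step, len(mono)) / sample_rate
        # Merge consecutive chunks predicted as the same speaker
        if segments and segments[-1][2] == top["label"]:
            segments[-1] = (segments[-1][0], end, top["label"])
        else:
            segments.append((begin, end, top["label"]))
    return segments

# e.g. identify_speakers("6.wav") -> [(0.0, 10.0, "A"), (10.0, 14.0, "D"), ...]
```

A real pipeline would likely use diarization-informed segment boundaries rather than fixed-length chunks, but the shape of the output is the same.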
The `speakerbox` library contains methods both for generating datasets for annotation
and for using multiple audio annotation schemes to train such a model.
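To give a sense of what the fine-tuning step involves under the hood (again a sketch under assumptions, not the speakerbox API), the following fine-tunes a pretrained speaker-identification transformer on labeled audio chunks using the Hugging Face `datasets` and `transformers` libraries. The `examples` list, chunk paths, output directory, and hyperparameters are all hypothetical:

```python
# Illustrative sketch only -- not the speakerbox API. Assumes a hypothetical
# `examples` list of (audio_chunk_path, speaker_label) pairs produced by your
# annotation workflow.
from datasets import Audio, Dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

examples = [("chunks/a-0.wav", "A"), ("chunks/b-0.wav", "B")]  # hypothetical

ds = Dataset.from_dict(
    {"audio": [path for path, _ in examples], "label": [s for _, s in examples]}
)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.class_encode_column("label")  # map speaker strings to integer ids
num_speakers = ds.features["label"].num_classes

# Start from a pretrained speaker-identification checkpoint and replace its
# classification head with one sized for our set of speakers.
checkpoint = "superb/wav2vec2-base-superb-sid"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(
    checkpoint, num_labels=num_speakers, ignore_mismatched_sizes=True
)

def featurize(example):
    audio = example["audio"]
    inputs = extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=16_000 * 2,  # pad/truncate every chunk to two seconds
        padding="max_length",
        truncation=True,
    )
    example["input_values"] = inputs["input_values"][0]
    return example

ds = ds.map(featurize, remove_columns=["audio"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trained-speakerbox",  # hypothetical output path
        num_train_epochs=5,
        per_device_train_batch_size=4,
    ),
    train_dataset=ds,
)
trainer.train()
trainer.save_model("trained-speakerbox")
```

Speakerbox wraps this kind of workflow, including dataset preparation from diarized and annotated audio, behind its own functions; see the documentation linked below for the actual API.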
![Speakerbox example workflow](https://raw.githubusercontent.com/CouncilDataProject/speakerbox/main/docs/_static/images/workflow.png)
The following table shows how model performance changes as the training dataset size increases:
| Dataset size | Mean accuracy | Mean precision | Mean recall | Mean training duration (s) |
|:-------------|--------------:|---------------:|------------:|---------------------------:|
| 15 minutes   | 0.874 ± 0.029 | 0.881 ± 0.037  | 0.874 ± 0.029 | 101 ± 1 |
| 30 minutes   | 0.929 ± 0.006 | 0.940 ± 0.007  | 0.929 ± 0.006 | 186 ± 3 |
| 60 minutes   | 0.937 ± 0.020 | 0.940 ± 0.017  | 0.937 ± 0.020 | 453 ± 7 |
Please see our [documentation](https://CouncilDataProject.github.io/speakerbox) for more details.