Fairness dimension
Synthetic data can be generated for various reasons. Privacy is usually part of the motivation, but augmentation is also a key factor.
Synthetic data that accurately mimics a real-world dataset also reproduces its biases and imbalances unless extra steps are taken. Whether to mitigate those biases, at the cost of resemblance to the training data, remains a dilemma, but there are good arguments for tackling the problem early in the pipeline rather than downstream.
Treating fairness metrics separately from utility and privacy metrics in SynthEval will hopefully encourage more fairness dimensions to be investigated, and make it easier to monitor how utility and privacy are affected as fairness improves or degrades.
Doctests
Tests for SynthEval have long been outdated to the point of being practically non-existent. In this update, all metrics and key files were reviewed, and additional documentation and doctests were added to all metrics and main scripts. The doctests serve as examples of how each method works and are verified through pytest. A new yml workflow file on GitHub runs all doctests every time the main branch is updated, and the status of the latest run is now displayed at the top of the front page.
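As an illustration of the pattern only (the function name and behaviour below are made up, not copied from the SynthEval source), a metric helper with a doctest might look like this; pytest can collect and run such Examples blocks with the `--doctest-modules` flag, which is presumably what the new workflow invokes:

```python
# Hypothetical example of the doctest pattern used throughout the update;
# mean_absolute_difference is an illustrative stand-in, not a SynthEval metric.

import numpy as np

def mean_absolute_difference(real, synt):
    """Average absolute difference between two numeric columns.

    The Example section doubles as documentation and as a test that
    pytest picks up when run with --doctest-modules.

    Example:
        >>> mean_absolute_difference([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
        0.5
    """
    real = np.asarray(real, dtype=float)
    synt = np.asarray(synt, dtype=float)
    return float(np.mean(np.abs(real - synt)))
```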
Other updates and fixes
- Statistical Parity Difference has been added as the first fairness metric (see the sketch after this list).
- A helper function has been added for formatting the console output (existing print code can gradually be updated to use it).
- Minor fix for the MIA metric; its activation key was also renamed from "mia_risk" to "mia" to avoid misinterpretation.
- The MIA metric's saved outputs were changed back from F1 to recall and precision (illustrated below).
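For context, Statistical Parity Difference compares the rate of favourable outcomes across groups of a sensitive attribute; zero means parity. The sketch below shows the idea only; the column names and the max-minus-min convention for more than two groups are assumptions, not SynthEval's interface:

```python
# Minimal sketch of Statistical Parity Difference on a pandas DataFrame.
# "sensitive" and "outcome" are hypothetical column names.

import pandas as pd

def statistical_parity_difference(df, sensitive="sensitive", outcome="outcome", positive=1):
    """Difference in positive-outcome rates between groups (0 = parity)."""
    rates = df.groupby(sensitive)[outcome].apply(lambda s: (s == positive).mean())
    return float(rates.max() - rates.min())

synt = pd.DataFrame({
    "sensitive": ["a", "a", "a", "b", "b", "b"],
    "outcome":   [1,   1,   0,   1,   0,   0],
})
print(statistical_parity_difference(synt))  # 2/3 - 1/3 ≈ 0.33
```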
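Regarding the last point, recall and precision are arguably easier to interpret as attack risk than a single F1 score: recall is the share of true training members the attack recovers, precision is how often a claimed member really is one. A toy illustration of the reported quantities (the membership labels and guesses below are fabricated, not output from SynthEval's MIA procedure):

```python
# Toy illustration of the two reported quantities only.

from sklearn.metrics import precision_score, recall_score

is_member   = [1, 1, 1, 0, 0, 0]   # ground truth: was the record in the training data?
attack_says = [1, 0, 1, 1, 1, 0]   # membership guesses from some attack model

print("precision:", precision_score(is_member, attack_says))  # 2 TP / 4 flagged = 0.5
print("recall:   ", recall_score(is_member, attack_says))     # 2 TP / 3 members ≈ 0.67
```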