With this release we restructured the ML models to improve their precision. Moreover, the new models are integrated directly into the project, removing the painful download and linking steps the former ones required.
All the changes are transparent to the end user (i.e., no API or function definition changed), so there was no need for a major upgrade to v5.
Path Model
We decided to deprecate the fasttext approach and switched to a regex to filter out false positive file paths. According to our tests, we can keep good precision while decreasing the overhead.
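To give an idea of what regex-based path filtering looks like, here is a minimal sketch. The pattern, the function names, and the `file_name` key are assumptions made for illustration; this is not the regex shipped with Credential Digger.

```python
import re

# Hypothetical pattern, NOT the one used by Credential Digger:
# it flags paths that usually hold tests, docs, or vendored dependencies.
FALSE_POSITIVE_PATH = re.compile(
    r"(^|/)(tests?|docs?|examples?|node_modules|vendor)(/|$)"
    r"|\.(md|rst|lock)$",
    re.IGNORECASE,
)


def is_false_positive_path(file_path: str) -> bool:
    """Return True when the path matches the illustrative pattern."""
    return FALSE_POSITIVE_PATH.search(file_path) is not None


def filter_discoveries(discoveries):
    """Keep only the discoveries whose (assumed) 'file_name' key is not filtered out."""
    return [d for d in discoveries if not is_false_positive_path(d["file_name"])]
```

A single compiled regex runs in a fraction of the time a model inference would take, which matches the lower overhead mentioned above.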
~SnippetModel~ PasswordModel
We decided to deprecate the old fasttext double-model (extractor+classifier) approach and shift to an NLP approach based on CodeBERT. Overall, it's slower but far more precise, although it only works for passwords. Hence the change of name from *SnippetModel* to *PasswordModel*.
Moreover, since the PasswordModel only works for passwords, we added a check in the Client to only run this model over password discoveries.
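The category check can be pictured with the sketch below. The function name, the `category`, `snippet`, and `id` keys, and the `classify` callable are all hypothetical stand-ins, not the actual Client code.

```python
def run_password_model(discoveries, classify):
    """Apply `classify` only to discoveries in the password category.

    `classify` is a stand-in for the model: it takes a snippet and returns
    True when the snippet looks like a false positive. The ids of the
    flagged discoveries are returned.
    """
    flagged = []
    for discovery in discoveries:
        if discovery["category"] != "password":
            continue  # the model is never run on other categories
        if classify(discovery["snippet"]):
            flagged.append(discovery["id"])
    return flagged
```

Skipping the other categories up front also avoids paying the cost of the slower model on discoveries it cannot judge.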
AoB
- The `download` function has been deprecated, and models are now managed automatically by Credential Digger
- The generator was strongly linked to the SnippetModel, so it has been deprecated
- The documentation has been updated, both in the README and in the wiki
- We added a `categories` enum in the postgres db to steer users toward 4 main rule categories. However, this enum is only enforced in new postgres installations, to make the transition smoother
- The UI has been updated to use the new models
- We ported the incremental `scan_snapshot` from v4.3.1
- Minor bug fixes
- The UI now refreshes every 8s (was 5s)
---
Credits also go to melisande1 for their wonderful work