Now 17× faster
The IMPC data contains 1,332,579 lines, or 1.82 GB of uncompressed JSON. Running in parallel on a 12-core machine with 64 GB RAM:
* Old validator runs in **13 minutes 46 seconds** (826 seconds);
* New validator runs in **48 seconds**.
This is achieved by switching from `jsonschema` to `fastjsonschema` with compiled validator objects, and by processing evidence strings in blocks to reduce multiprocessing overhead.
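The block-wise processing can be sketched roughly as follows. This is a minimal illustration rather than the validator's actual code: it uses the standard library `multiprocessing` module instead of `pathos`, the names `blocks`, `check_record`, `validate_block` and `validate_in_parallel` are made up for the example, and `check_record` is a hypothetical stand-in for a compiled schema validator. The point is that each worker task covers a whole block of lines, so the per-task scheduling and pickling cost is paid once per block instead of once per line.

```python
import json
from itertools import islice
from multiprocessing import Pool


def blocks(lines, block_size):
    """Yield (first_line_number, list_of_lines) blocks of at most block_size lines."""
    iterator = iter(lines)
    line_number = 1
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield line_number, block
        line_number += len(block)


def check_record(record):
    """Hypothetical stand-in for a compiled schema validator; raises ValueError on failure."""
    if not isinstance(record, dict) or "id" not in record:
        raise ValueError("data must contain ['id'] properties")


def validate_block(numbered_block):
    """Validate one block inside a worker and return a list of (line_number, message)."""
    first_line_number, block = numbered_block
    errors = []
    for offset, line in enumerate(block):
        try:
            check_record(json.loads(line))
        except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
            errors.append((first_line_number + offset, str(error)))
    return errors


def validate_in_parallel(lines, block_size=1000, processes=None):
    """Send whole blocks to worker processes: one task per block, not one per line."""
    with Pool(processes) as pool:
        for errors in pool.imap(validate_block, blocks(lines, block_size)):
            yield from errors
```

With a file opened as `handle`, `list(validate_in_parallel(handle))` would collect the errors for every line. The default block size of 1000 is an arbitrary choice for the sketch.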
Does not hang
The `pypeln` library, which used to give us so many issues with the validator randomly hanging, is no longer used.
Helpful error messages
The new error message format immediately draws attention to the problem and doesn't leave the user guessing what went wrong. It also prints the original evidence string that caused the error:
2023-09-07 14:08:03,454 - opentargets_validator.validator - ERROR -
Line 82 is a valid JSON object, but it does not match the schema:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃data.effects[0] must contain ['direction'] properties┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
{"id":"ENSG00000006071","event":"abnormal pulse","eventId":"HP_0031860","datasource":"Lynch et al. (2017)","effects":[{"dosing":"acute"}],"literature":"28216264","biosamples":[{"tissueLabel":"pancreas","tissueId":"UBERON_0001264"},{"tissueLabel":"renal","tissueId":"UBERON_0001008"},{"tissueLabel":"cardiovascular","tissueId":"UBERON_0004535"}]}
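A frame like the one in the message above can be drawn with a few lines of plain Python. The helper below is only a sketch of the idea, and the name `boxed` is hypothetical; it is not taken from the validator's code.

```python
def boxed(message):
    """Frame a (possibly multi-line) message with heavy box-drawing characters."""
    lines = message.splitlines() or [""]
    width = max(len(line) for line in lines)
    top = "┏" + "━" * width + "┓"
    bottom = "┗" + "━" * width + "┛"
    # Pad shorter lines so the right-hand border stays aligned.
    body = ["┃" + line.ljust(width) + "┃" for line in lines]
    return "\n".join([top, *body, bottom])


print(boxed("data.effects[0] must contain ['direction'] properties"))
```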
No legacy dependencies
There used to be six dependencies: `requests`, `jsonschema`, `rfc3987`, `simplejson`, `pypeln`, `opentargets-urlzsource`. The last one, in particular, was moved to the Open Targets archive repository long ago and hasn't been supported in ages.
Now there are just two shiny new dependencies: `pathos` and `fastjsonschema`.
Other improvements
* Updated README
* Updated CLI help message (the actual usage syntax is unaffected)
* Added `--version` argument to print version and exit
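For reference, a `--version` flag of this kind is commonly wired up with the built-in `version` action of `argparse`. The sketch below shows that pattern under stated assumptions: the program name `validator-sketch` and the version string are placeholders, and this is not necessarily how the validator implements it.

```python
import argparse

__version__ = "0.0.0"  # placeholder, not the validator's real version


def build_parser():
    """Build a parser whose --version flag prints the version and exits."""
    parser = argparse.ArgumentParser(prog="validator-sketch")  # hypothetical name
    parser.add_argument(
        "--version",
        action="version",  # argparse prints the string below and exits with status 0
        version=f"%(prog)s {__version__}",
    )
    return parser
```

Calling `build_parser().parse_args(["--version"])` prints the program name and version, then raises `SystemExit`.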
Technical changes
* Minimum Python version raised from 3.7 to 3.8 (required by dependencies)
* Configured Black formatter and formatted the entire code base
* Updated `.gitignore`
* Removed `Dockerfile`