- greatly improved hdf file handling for large datasets:
1. Results are no longer stored in memory but loaded from hdf when needed
2. When checking for already performed downloads, only a fraction of the data (namely the process IDs) is retrieved from the hdf store. This greatly increases performance. Benchmarked with a 75k ESV dataset, which now takes just one second to load and check for failed downloads / already downloaded additional data
3. Completely reworked the top hit selection, so that it no longer needs all data in memory and restores the input data order while adding the additional data.
4. Top hits are now collected in chunks of 10k sequences, so dataset size no longer limits the analysis. Any number of sequences can now be identified with BOLDigger, with run time increasing linearly
- added another search mode 4, rapid search for short sequences:
1. Mode 4 combines modes 1 and 3. Because BOLD only allows a search depth of 94% similarity for sequences shorter than 225 bp, mode 4 pairs a batch size of 1000 sequences (and therefore the speed of mode 1) with the retrieval of 100 top hits at a search depth of 94%. This greatly improves analysis speed for short sequences.
- implemented database no. 8 from the id engine
- some minor bug fixes
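
The chunked top hit collection described above can be illustrated with a minimal sketch. This is not BOLDigger's actual code: the names `iter_chunks`, `top_hits_in_input_order` and `load_hits` are hypothetical, an in-memory list stands in for the hdf store, and "top hit" is assumed here to mean the row with the highest similarity. The point is only that each chunk of IDs is resolved on its own and written out in input order, so memory use depends on the chunk size rather than on the dataset size.

```python
from itertools import islice


def iter_chunks(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def top_hits_in_input_order(input_ids, load_hits, chunk_size=10_000):
    """Yield (sequence_id, top_hit) pairs in the order of `input_ids`.

    `load_hits` is a hypothetical callable that returns only the hit rows
    belonging to the given set of IDs (for example, a filtered store query).
    """
    for chunk in iter_chunks(input_ids, chunk_size):
        best = {}
        for hit in load_hits(set(chunk)):
            seq_id = hit["id"]
            # Assumption: the top hit is the one with the highest similarity.
            if seq_id not in best or hit["similarity"] > best[seq_id]["similarity"]:
                best[seq_id] = hit
        # Emit results in input order; IDs without any hit map to None.
        for seq_id in chunk:
            yield seq_id, best.get(seq_id)


# Illustrative stand-in for the on-disk store (made-up rows).
store = [
    {"id": "seq2", "similarity": 97.1, "taxon": "taxon A"},
    {"id": "seq1", "similarity": 99.0, "taxon": "taxon B"},
    {"id": "seq2", "similarity": 99.5, "taxon": "taxon C"},
    {"id": "seq1", "similarity": 94.2, "taxon": "taxon D"},
]


def load_hits(ids):
    """Return only the rows whose ID is in `ids`."""
    return (row for row in store if row["id"] in ids)


result = list(top_hits_in_input_order(["seq1", "seq2", "seq3"], load_hits, chunk_size=2))
for seq_id, hit in result:
    print(seq_id, hit["taxon"] if hit else "no hit")
```

Because only one chunk of hits is held at a time, the total work grows linearly with the number of input sequences, which matches the behaviour described in the notes above.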