`kmerdb graph` introduced, producing a new file format, `.kdbg`, which is an edge list. The new format also has a new metadata schema. `kmerdb view` and `kmerdb header` are compatible with the new format.
The goal is to create a weighted graph. Support for assembly and graph visualization is planned for the future.
After 0.7.6 the `.kdb` spec will be *loosely* deprecated. While the `.kdb` format may remain unchanged (not yet decided), the goal is to produce an adjacency list structure from only the k-mer counts and the 'neighbor' k-mer ids. After the format revision (mostly to the `--all-metadata` option), a new command, `kmerdb graph`, will generate an on-disk representation of an adjacency list.
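To make the 'neighbor' k-mer ids concrete, here is a minimal sketch of how the left and right neighbors of a k-mer could be derived from its id. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3, first base most significant); the encoding and the helper names `kmer_to_id`, `id_to_kmer`, and `neighbor_ids` are illustrative assumptions, not necessarily how kmerdb computes ids.

```python
# Assumption: ids use a 2-bit encoding with A=0, C=1, G=2, T=3 and the
# first base as the most significant digit. kmerdb's scheme may differ.
BASES = "ACGT"


def kmer_to_id(kmer: str) -> int:
    """Encode a k-mer string as a base-4 integer id."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASES.index(base)
    return idx


def id_to_kmer(idx: int, k: int) -> str:
    """Decode a base-4 integer id back into a k-mer string of length k."""
    chars = []
    for _ in range(k):
        chars.append(BASES[idx % 4])
        idx //= 4
    return "".join(reversed(chars))


def neighbor_ids(idx: int, k: int) -> dict:
    """Return the ids of the k-mers that overlap this one by k-1 bases."""
    suffix = idx % 4 ** (k - 1)  # drop the first base
    prefix = idx // 4            # drop the last base
    right = [suffix * 4 + b for b in range(4)]            # append a base
    left = [b * 4 ** (k - 1) + prefix for b in range(4)]  # prepend a base
    return {"left": left, "right": right}
```

Under this assumed encoding, every k-mer has four candidate right neighbors and four candidate left neighbors; only those actually observed in the data would become edges.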
* What does this mean?
At this point, the new feature is in the planning stage, and it is not known if backwards compatibility (< 0.7.7) will be supported. One goal is to create an adjacency list structure on disk from the `--all-metadata` augmented `.kdb` format. It is not clear yet if cycles will be permitted in the graph structure, or if a distinct "offset" flag will be used. An example follows.
- 0.7.6 `.kdb` format
col1 is the row number, col2 is the sort order, col3 is the k-mer id, and col4 is the k-mer count. col5 (`--all-metadata`) featured a loosely specified and poorly implemented 'neighbor' JSON field: a dictionary with "A", "C", "T", "G", etc. keys. Basically, the neighboring (left-side and right-side) k-mer ids were provided.
1 1 1 123
- 0.7.7+ `.kdbg`
col1 is a unique row number, col2 is the k-mer id (may be repeated), and col3 is a comma-separated field of possible adjacent *row-ids*, corresponding to the neighbors in k-mer space of the k-mer id in col2. col4 represents a possible solution to a graph traversal: a Hamiltonian (or similar) walk through the graph that recapitulates either the exact (`.fasta`) assembly solution *or* a potential assembly solution from the available data. A feasible solution might come from `networkx` or from a custom graph traversal algorithm that minimizes the penalty for omitting rows/k-mers, guided by the shortest path that visits each k-mer once, and that possibly also maximizes the number of rows visited. It is not yet clear how this will be implemented; the `.kdbg` format is the first step.
1 1234 2345,3456,... 3
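As a rough illustration of how rows in the proposed `.kdbg` layout could be read back into an in-memory adjacency list, consider the sketch below. It assumes whitespace-delimited columns in the order described above; the names `KdbgRow`, `parse_kdbg_line`, and `build_adjacency` are hypothetical and are not part of kmerdb, and the format itself is still in planning.

```python
from typing import Dict, Iterable, List, NamedTuple


class KdbgRow(NamedTuple):
    row_id: int           # col1: unique row number
    kmer_id: int          # col2: k-mer id (may repeat across rows)
    neighbors: List[int]  # col3: comma-separated adjacent row ids
    walk_order: int       # col4: position in a candidate traversal


def parse_kdbg_line(line: str) -> KdbgRow:
    """Parse one row. Assumes four whitespace-delimited columns."""
    row_id, kmer_id, neighbors, walk_order = line.split()
    return KdbgRow(
        int(row_id),
        int(kmer_id),
        [int(n) for n in neighbors.split(",") if n],
        int(walk_order),
    )


def build_adjacency(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Map each row id to the list of row ids it is adjacent to."""
    adjacency = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_kdbg_line(line)
        adjacency[row.row_id] = row.neighbors
    return adjacency
```

For example, three made-up rows `"1 1234 2,3 3"`, `"2 2345 3 1"`, and `"3 3456 1 2"` would yield the mapping `{1: [2, 3], 2: [3], 3: [1]}`. A dictionary like this could then be handed to a graph library for traversal experiments.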