Hash Table / Seed Extensions
Because of repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a mechanism called seed extension to map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, producing a longer seed that matches fewer reference locations.
For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. If the 21-base primary seed matches 100 places in the reference, the 35-base extensions of these 100 seed positions may divide into 40 groups of 1–3 identical 35-base seeds, so each extended seed matches at most 3 locations. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
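The grouping in this example can be illustrated with a small sketch. This is not DRAGEN's implementation; it only shows the idea on a toy reference, with a 7-base seed standing in for the 21-base primary seed. The function names `find_seed_hits` and `group_by_extension`, and the toy sequences, are assumptions made for illustration.

```python
from collections import defaultdict


def find_seed_hits(reference, seed):
    """Return every start position where seed occurs in reference (overlaps allowed)."""
    hits = []
    pos = reference.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(seed, pos + 1)
    return hits


def group_by_extension(reference, hits, seed_len, ext_each_side):
    """Group hit positions by the extended seed formed by adding
    ext_each_side bases at both ends of the primary seed.

    Assumption: hits too close to either end of the reference to be
    extended are skipped in this sketch.
    """
    groups = defaultdict(list)
    for pos in hits:
        start = pos - ext_each_side
        end = pos + seed_len + ext_each_side
        if start < 0 or end > len(reference):
            continue
        groups[reference[start:end]].append(pos)
    return dict(groups)


# Toy reference: the same 7-base seed appears three times,
# twice in an identical context and once in a different one.
seed = "GATTACA"
reference = (
    "TTAAC" + seed + "TTGCC"
    + "CCCCC" + seed + "GGAGG"
    + "TTAAC" + seed + "TTGCC"
)

hits = find_seed_hits(reference, seed)
groups = group_by_extension(reference, hits, len(seed), ext_each_side=3)

print(hits)    # [5, 22, 39]
print(groups)  # {'AACGATTACATTG': [5, 39], 'CCCGATTACAGGA': [22]}
```

The primary seed matches three positions, but the 13-base extended seeds fall into two groups, one with two positions and one with a single position. A longer extension could separate positions further only where their flanking sequence differs.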
The maximum extended seed length, which by default equals the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For short reads, it is advisable to set the maximum extended seed length shorter than the read length, because extensions longer than the whole read can never match.
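The arithmetic behind this advice can be sketched as follows. The helper names are hypothetical, and capping at one base below the read length is only one reading of "shorter than the read length"; the helper simply computes a candidate value for --ht-max-ext-seed-len and does not call DRAGEN.

```python
# From the text: the default maximum is the primary seed length plus 128.
DEFAULT_EXTENSION_MARGIN = 128


def default_max_ext_seed_len(primary_seed_len):
    """Default maximum extended seed length as described in the text."""
    return primary_seed_len + DEFAULT_EXTENSION_MARGIN


def suggest_max_ext_seed_len(primary_seed_len, read_len):
    """Hypothetical helper: keep the default unless it reaches the read length.

    Assumption: the cap is read_len - 1, and the result never drops
    below the primary seed length itself.
    """
    default = default_max_ext_seed_len(primary_seed_len)
    cap = read_len - 1
    return max(primary_seed_len, min(default, cap))


print(default_max_ext_seed_len(21))       # 149
print(suggest_max_ext_seed_len(21, 100))  # 99
print(suggest_max_ext_seed_len(21, 250))  # 149
```

With a 21-base primary seed, the default maximum of 149 exceeds a 100-base read, so the sketch suggests 99. For 250-base reads the default is already shorter than the read and is left unchanged.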
It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):
• --ht-cost-coeff-seed-len
• --ht-cost-coeff-seed-freq
• --ht-cost-penalty
• --ht-cost-penalty-incr
There is a tradeoff between extension length and hit frequency. Longer seed extensions reduce seed hit frequencies and give faster mapping. Avoiding seed extensions, or keeping them short, gives more accurate mapping at the cost of tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs and by finding more candidate mapping locations at which to score alignments. The default extension settings, together with the default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.
The defaults for the seed frequency options are as follows:
Option | Default
---|---
--ht-cost-coeff-seed-len | 1
--ht-cost-coeff-seed-freq | 0.5
--ht-cost-penalty | 0
--ht-cost-penalty-incr | 0.7
--ht-max-seed-freq | 16
--ht-target-seed-freq | 4
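When experimenting with these settings, it can help to keep the documented defaults in one place and record only the values being changed. The sketch below does that in Python. The helper `build_ht_args` is hypothetical and not part of DRAGEN, and the "option followed by value" argument form it produces is an assumption that should be checked against the DRAGEN command-line documentation before use.

```python
# Defaults copied from the table above, kept as strings so they
# print exactly as documented.
HT_SEED_DEFAULTS = {
    "--ht-cost-coeff-seed-len": "1",
    "--ht-cost-coeff-seed-freq": "0.5",
    "--ht-cost-penalty": "0",
    "--ht-cost-penalty-incr": "0.7",
    "--ht-max-seed-freq": "16",
    "--ht-target-seed-freq": "4",
}


def build_ht_args(overrides=None):
    """Hypothetical helper: merge overrides into the documented defaults
    and flatten the result into an argument list.

    Unknown option names raise ValueError so that typos are caught early.
    """
    settings = dict(HT_SEED_DEFAULTS)
    for option, value in (overrides or {}).items():
        if option not in settings:
            raise ValueError(f"unknown option: {option}")
        settings[option] = str(value)
    args = []
    for option, value in settings.items():
        args.extend([option, value])
    return args


# Example: lower the maximum seed frequency and leave everything else at default.
args = build_ht_args({"--ht-max-seed-freq": 8})
print(" ".join(args))
```

Rejecting unknown option names is a design choice of this sketch; it guards against silently passing a misspelled flag.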