Smith-Waterman Alignment Scoring Settings

The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference adds a bonus to the score, and a mismatch or indel imposes a penalty. The highest-scoring path through the matrix is the alignment chosen.
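The matrix fill described above can be sketched in a few lines of Python. This is an illustration of the textbook algorithm, not DRAGEN's implementation: the function name and the scoring values (match=1, mismatch=-4, gap=-6) are assumptions chosen for the example, and a single linear gap penalty is used for brevity.

```python
def smith_waterman_score(read, ref, match=1, mismatch=-4, gap=-6):
    """Return (best_score, read_end, ref_end) of the best local alignment.

    Rows index the read and columns index the reference, so a horizontal
    move consumes only a reference base (deletion) and a vertical move
    consumes only a read base (insertion).
    """
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            diagonal = H[i - 1][j - 1] + step   # match or mismatch
            horizontal = H[i][j - 1] + gap      # deletion
            vertical = H[i - 1][j] + gap        # insertion
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diagonal, horizontal, vertical)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


# The read matches reference positions 3 to 6 exactly.
print(smith_waterman_score("ACGT", "TTACGTTT"))
```

The returned end coordinates mark the cell where a traceback would start to recover the full path.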

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.

The following options control Smith-Waterman alignment:

Command-Line Option Name      Configuration File Option Name
--Aligner.global              global
--Aligner.match-score         match-score
--Aligner.match-n-score       match-n-score
--Aligner.mismatch-pen        mismatch-pen
--Aligner.gap-open-pen        gap-open-pen
--Aligner.gap-ext-pen         gap-ext-pen
--Aligner.unclip-score        unclip-score
--Aligner.no-unclip-score     no-unclip-score
--Aligner.aln-min-score       aln-min-score
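
The naming pattern in the option names listed above is regular: each command-line name is the configuration file name prefixed with --Aligner. The short helper below captures that pattern. The function name is hypothetical, and the helper only builds the option name; how a value is attached to it is not covered here.

```python
# Configuration file option names, as listed above.
ALIGNER_OPTIONS = (
    "global",
    "match-score",
    "match-n-score",
    "mismatch-pen",
    "gap-open-pen",
    "gap-ext-pen",
    "unclip-score",
    "no-unclip-score",
    "aln-min-score",
)


def command_line_name(config_name):
    """Map a configuration file option name to its command-line name."""
    if config_name not in ALIGNER_OPTIONS:
        raise ValueError(f"not a listed aligner option: {config_name!r}")
    return "--Aligner." + config_name


print(command_line_name("gap-open-pen"))
```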

global

The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative.

Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions.

Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, are unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end.

Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.
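
The difference between the two modes can be illustrated with a small scoring sketch. It is not DRAGEN's implementation: the function name and the scoring values are assumptions, and a linear gap penalty is used for brevity. In global mode the whole read must be consumed while the start and end in the reference stay free; in local mode cell scores are floored at zero, which is what permits clipping.

```python
def alignment_score(read, ref, global_mode, match=1, mismatch=-4, gap=-6):
    """Best score for a read against a reference segment in either mode."""
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 is zero: free start in ref
    if global_mode:
        for i in range(1, rows):
            H[i][0] = i * gap  # leading read bases must be aligned, not clipped
    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            cell = max(H[i - 1][j - 1] + step, H[i][j - 1] + gap, H[i - 1][j] + gap)
            if not global_mode:
                cell = max(0, cell)  # clipping: restart the alignment here
                best_local = max(best_local, cell)
            H[i][j] = cell
    if global_mode:
        return max(H[-1])  # end-to-end in the read, free end in the reference
    return best_local


read, ref = "ACGTACGTGG", "ACGTACGTCC"
print(alignment_score(read, ref, global_mode=False))  # last two bases clipped
print(alignment_score(read, ref, global_mode=True))   # last two bases mismatch
```

With these values the local score stays nonnegative, while a read that matches nowhere, such as "GGGG" against "AAAA", scores below zero in global mode.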

match-score

The match-score option is the score for a read nucleotide matching a reference nucleotide (A, C, G, or T). Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments and fewer long insertions.

match-2-score

The match-2-score option is the score for a read nucleotide matching a 2-base IUPAC-IUB code in the reference (K, M, R, S, W, or Y). This option is a signed integer, ranging from -16 to 15.

match-3-score

The match-3-score option is the score for a read nucleotide matching a 3-base IUPAC-IUB code in the reference (B, D, H, or V). This option is a signed integer, from -16 to 15.

match-n-score

The match-n-score option is the score for a read nucleotide matching an N code in the reference. This option is a signed integer, from -16 to 15.

mismatch-pen

The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code (except ‘N’, which cannot mismatch). This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
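
The per-base scoring rules in the preceding sections can be summarized in one lookup. The sketch below is an interpretation of those rules, not DRAGEN code: the function name is hypothetical, the read base is assumed to be A, C, G, or T, and the values in the example settings are illustrative rather than documented defaults.

```python
# Bases represented by each IUPAC-IUB reference code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def base_score(read_base, ref_code, settings):
    """Score one read base (A, C, G, or T) against one reference code."""
    if ref_code == "N":
        return settings["match-n-score"]      # N cannot mismatch
    allowed = IUPAC[ref_code]
    if read_base not in allowed:
        return -settings["mismatch-pen"]      # penalty is a negative score
    by_width = {1: "match-score", 2: "match-2-score", 3: "match-3-score"}
    return settings[by_width[len(allowed)]]


# Illustrative values only.
example = {
    "match-score": 1,
    "match-2-score": 0,
    "match-3-score": -1,
    "match-n-score": -1,
    "mismatch-pen": 4,
}
print(base_score("A", "R", example))  # A is one of the two bases coded by R
print(base_score("A", "B", example))  # B excludes A, so this is a mismatch
```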

gap-open-pen

The gap-open-pen option is the penalty (negative score) for opening a gap (that is, an insertion or deletion). This value is the penalty for a 0-base gap only. The total penalty for a gap is gap-open-pen plus the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.

gap-ext-pen

The gap-ext-pen option is the penalty (negative score) for extending a gap (that is, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
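
The two gap options combine into a single penalty per gap, as the following sketch shows. The function name and the example values (gap-open-pen=6, gap-ext-pen=1) are assumptions for illustration.

```python
def gap_penalty(gap_length, gap_open_pen, gap_ext_pen):
    """Total penalty for one insertion or deletion of gap_length bases."""
    if gap_length < 1:
        raise ValueError("a gap contains at least one base")
    # The open penalty covers a 0-base gap; every base adds one extension.
    return gap_open_pen + gap_length * gap_ext_pen


for length in (1, 3, 10):
    print(length, gap_penalty(length, gap_open_pen=6, gap_ext_pen=1))
```

Because the open penalty is paid once per gap, one long indel costs less than several short indels of the same combined length.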

unclip-score

The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels.

A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score=1). It is recommended to use the default for unclip-score when global=1, because some internal heuristics consider how local alignments would have been clipped.

Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being utilized as pseudoclipping, as happens with global=1.
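
A worked example shows how the bonus creates a soft preference. The sketch assumes a hypothetical 50-base read whose only mismatch is 3 bases from the end, with illustrative values match-score=1, mismatch-pen=4, and unclip-score=5. The function name is also an assumption.

```python
def score_with_unclip_bonus(base_score, reaches_start, reaches_end, unclip_score):
    """Add one unclip bonus for each read end the alignment reaches."""
    ends_reached = int(reaches_start) + int(reaches_end)
    return base_score + ends_reached * unclip_score


MATCH, MISMATCH_PEN, UNCLIP = 1, 4, 5        # illustrative values

clipped_base = 47 * MATCH                    # clip the last 3 bases
unclipped_base = 49 * MATCH - MISMATCH_PEN   # align through the mismatch

clipped = score_with_unclip_bonus(clipped_base, True, False, UNCLIP)
unclipped = score_with_unclip_bonus(unclipped_base, True, True, UNCLIP)

# Without the bonus the clipped alignment wins (47 vs 45);
# with it, the end-to-end alignment wins.
print(clipped, unclipped)
```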

no-unclip-score

The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments.

When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1.

The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment.

When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
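
The interaction between no-unclip-score and aln-min-score can be sketched as follows. The function names, the example scores, and the aln-min-score value of 50 are assumptions for illustration; the sketch models only the bonus removal and the threshold comparison described above.

```python
def reported_score(score_with_bonus, ends_reached, unclip_score, no_unclip_score):
    """Score used for filtering, comparison, and reporting."""
    if no_unclip_score == 1:
        # Remove the bonus that helped select this alignment.
        return score_with_bonus - ends_reached * unclip_score
    return score_with_bonus


def passes_min_score(score, aln_min_score):
    """Alignments scoring lower than aln-min-score are discarded."""
    return score >= aln_min_score


# Hypothetical end-to-end alignment: raw score 45 plus 2 x unclip-score (5).
for flag in (1, 0):
    score = reported_score(55, ends_reached=2, unclip_score=5, no_unclip_score=flag)
    print(flag, score, passes_min_score(score, aln_min_score=50))
```

In this example the same alignment is discarded with no-unclip-score=1 and kept with no-unclip-score=0, which is why the threshold may need adjusting when the flag changes.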

aln-min-score

The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing aln-min-score can reduce the percentage of reads mapped, and decreasing it can increase that percentage. This option is a signed integer (negative alignment scores are possible with global=1).

aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores, and aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments.
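
The effect on MAPQ can be seen by computing only the score difference that feeds the estimate. The sketch below does not reproduce DRAGEN's MAPQ formula, which is not given here. The function name and the example scores are assumptions.

```python
def score_gap_for_mapq(best_score, second_best_score, aln_min_score):
    """Difference between the best score and the suboptimal score.

    second_best_score is None when no other alignment was found.
    """
    if second_best_score is None or second_best_score < aln_min_score:
        suboptimal = aln_min_score   # the floor stands in for a second best
    else:
        suboptimal = second_best_score
    return best_score - suboptimal


# A unique alignment scoring 50: raising aln-min-score narrows the gap.
print(score_gap_for_mapq(50, None, aln_min_score=22))
print(score_gap_for_mapq(50, None, aln_min_score=40))
```

A smaller difference corresponds to a lower MAPQ, so a low-scoring unique alignment loses confidence as aln-min-score approaches its score.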