Repeat Expansion Detection with ExpansionHunter
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range, producing mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph-based realignment of reads that originate inside and around each target repeat. It then genotypes the length of the repeat in each allele based on these graph alignments. More information and analysis are available in the following ExpansionHunter papers:
• | ExpansionHunter (http://www.genome.org/cgi/doi/10.1101/gr.225672.117) |
• | Graph ExpansionHunter (https://doi.org/10.1101/572545) |
These methods work only for human whole-genome samples prepared with PCR-free libraries. A repeat is genotyped only if the coverage at its locus is at least 10x.

To enable DRAGEN repeat expansion detection, the following command-line options are required.
• | --repeat-genotype-enable=true |
• | --repeat-genotype-specs=<path to spec file> |
In addition, the sex of the sample should be set using the --sample-sex option.
The following options are optional.
• | --repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000bp) |
• | --repeat-genotype-min-baseq=<Minimum base quality for ‘high confidence’ bases> (default 20) |
For more information on the spec file specified by the --repeat-genotype-specs option, see Repeat Expansion Specification Files.
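The options listed above can be assembled programmatically. The following is a minimal sketch rather than an official interface: the helper name `repeat_genotype_args` is hypothetical, the `option=value` form for `--sample-sex` and the example value `female` are assumptions, and the spec path is a placeholder. The helper returns only the repeat-genotyping fragment of a command line, so the rest of the DRAGEN invocation (reference, inputs, output options) still has to be supplied separately.

```python
from typing import List, Optional


def repeat_genotype_args(
    spec_path: str,
    sample_sex: str,
    region_extension: Optional[int] = None,
    min_baseq: Optional[int] = None,
) -> List[str]:
    """Build the repeat-genotyping portion of a DRAGEN command line.

    Hypothetical helper: only the option names come from the text above.
    """
    args = [
        "--repeat-genotype-enable=true",
        f"--repeat-genotype-specs={spec_path}",
        # Assumption: --sample-sex accepts the option=value form.
        f"--sample-sex={sample_sex}",
    ]
    # Optional settings are emitted only when overridden, so DRAGEN's
    # documented defaults (1000 bp and base quality 20) apply otherwise.
    if region_extension is not None:
        args.append(
            f"--repeat-genotype-region-extension-length={region_extension}"
        )
    if min_baseq is not None:
        args.append(f"--repeat-genotype-min-baseq={min_baseq}")
    return args


if __name__ == "__main__":
    # Placeholder path; substitute a real specification file.
    print(" ".join(repeat_genotype_args("/path/to/spec.json", "female")))
```

Leaving the optional arguments unset keeps the command line short and avoids restating defaults that may change between releases.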
The main output of repeat expansion detection is a VCF file, containing the variants found via this analysis.

The repeat-specification (also called variant catalog) JSON file defines the repeat regions for ExpansionHunter to analyze. Default repeat-specification files for some pathogenic repeats are in the /opt/edico/repeat-specs/ directory (based on the reference genome used with DRAGEN).
You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.
`--repeat-genotype-specs` is required for ExpansionHunter. If the option is not provided, DRAGEN will attempt to auto-detect the applicable catalog file from /opt/edico/repeat-specs/ based on the reference provided.
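As a rough illustration of working with a catalog file, the sketch below loads a catalog from JSON text and summarizes each locus. The field names `LocusId`, `LocusStructure`, `ReferenceRegion`, and `VariantType` are assumptions based on the ExpansionHunter variant catalog format and may differ between versions, so check them against the provided specification files. The entry itself, including its locus name and coordinates, is illustrative only, and `summarize_catalog` is a hypothetical helper.

```python
import json

# Illustrative catalog with a single entry. Field names are assumed from the
# ExpansionHunter variant catalog format; the values are placeholders.
CATALOG_TEXT = """
[
  {
    "LocusId": "EXAMPLE_LOCUS",
    "LocusStructure": "(GGCCCC)*",
    "ReferenceRegion": "chr9:27573526-27573544",
    "VariantType": "Repeat"
  }
]
"""


def summarize_catalog(text):
    """Return (locus id, structure, chromosome, region length) per entry.

    Handles only entries whose ReferenceRegion is a single "chrom:start-end"
    string; multi-repeat loci would need extra handling.
    """
    summary = []
    for entry in json.loads(text):
        chrom, span = entry["ReferenceRegion"].split(":")
        start, end = (int(x) for x in span.split("-"))
        summary.append(
            (entry["LocusId"], entry["LocusStructure"], chrom, end - start)
        )
    return summary


if __name__ == "__main__":
    for row in summarize_catalog(CATALOG_TEXT):
        print(row)
```

A check like this can catch malformed regions in a hand-edited catalog before a run is started.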


The results of repeat genotyping are output as a separate VCF file, giving the length of each allele at each callable repeat defined in the repeat-specification catalog file. The name is <outputPrefix>.repeats.vcf (.gz).
The VCF output file begins with the following fields.
Field | Description |
---|---|
CHROM | Chromosome identifier |
POS | Position of the first base before the repeat region in the reference |
ID | Always . |
REF | The reference base at position POS |
ALT | List of repeat alleles in format <STRn> where n is the number of repeat units |
QUAL | Always . |
FILTER | LowDepth filter is applied when the overall locus depth is below 10x or the number of reads that span one or both breakends is below 5. |
The INFO column contains the following fields.
Field | Description |
---|---|
SVTYPE | Always STR |
END | Position of the last base of the repeat region in the reference |
REF | Number of repeat units spanned by the repeat in the reference |
RL | Reference length in bp |
RU | Repeat unit in the reference orientation |
REPID | Repeat ID from the repeat-specification file |
The per-sample FORMAT fields are as follows.
Field | Description |
---|---|
GT | Genotype |
SO | Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT, meaning that the reads span, flank, or are fully contained in the repeat |
CI | Confidence interval for the called repeat length of each allele |
AD_SP | Number of spanning reads consistent with the allele |
AD_FL | Number of flanking reads consistent with the allele |
AD_IR | Number of in-repeat reads consistent with the allele |
For example, the following VCF entry describes the state of C9orf72 repeat in a sample with ID LP6005616-DNA_A03.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LP6005616-DNA_A03
chr9 27573526 . C <STR2>,<STR349> . PASS SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS GT:SO:CN:CI:AD_SP:AD_FL:AD_IR 1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459
In this example, the first allele spans 2 repeat units while the second allele spans 349 repeat units. The repeat unit is GGCCCC (RU INFO field), so the sequence of the first allele is GGCCCCGGCCCC and the sequence of the second allele is GGCCCC x 349. The repeat spans three repeat units in the reference (REF INFO field).
The length of the short allele was estimated from spanning reads (SPANNING), while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is 323-376 repeat units. There are 19 spanning and 3 flanking reads consistent with the repeat allele of size 2 (that is, 19 reads fully contain the repeat of size 2 and 3 flanking reads overlap at most 2 repeat units). Also, there are 6 flanking and 459 in-repeat reads consistent with the repeat allele of size 349.
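The interpretation above can be reproduced with a short script. This is a minimal sketch, not part of DRAGEN: `parse_repeat_record` is a hypothetical helper, it assumes tab-separated columns and a called genotype (no `.` alleles), and it reads only the fields described in this section.

```python
# The example record from above, with columns joined by tabs.
RECORD = (
    "chr9\t27573526\t.\tC\t<STR2>,<STR349>\t.\tPASS\t"
    "SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS\t"
    "GT:SO:CN:CI:AD_SP:AD_FL:AD_IR\t"
    "1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459"
)


def parse_repeat_record(line):
    """Split one repeats VCF record into per-allele summaries."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, _ref, alt, _qual, filt, info, fmt, sample = cols[:10]
    info_map = dict(item.split("=", 1) for item in info.split(";"))
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))

    ref_units = int(info_map["REF"])
    # ALT alleles look like <STRn>; strip "<STR" and ">" to get n.
    alt_units = [int(a[4:-1]) for a in alt.split(",") if a.startswith("<STR")]

    alleles = []
    for i, gt in enumerate(sample_map["GT"].split("/")):
        # GT index 0 is the reference allele; 1, 2, ... index into ALT.
        units = ref_units if gt == "0" else alt_units[int(gt) - 1]
        low, high = (int(x) for x in sample_map["CI"].split("/")[i].split("-"))
        alleles.append({
            "units": units,
            "support": sample_map["SO"].split("/")[i],
            "ci": (low, high),
            "spanning": int(sample_map["AD_SP"].split("/")[i]),
            "flanking": int(sample_map["AD_FL"].split("/")[i]),
            "in_repeat": int(sample_map["AD_IR"].split("/")[i]),
        })
    return {
        "locus": info_map["REPID"],
        "chrom": chrom,
        "pos": int(pos),
        "repeat_unit": info_map["RU"],
        "filter": filt,
        "alleles": alleles,
    }


if __name__ == "__main__":
    result = parse_repeat_record(RECORD)
    for allele in result["alleles"]:
        print(result["locus"], allele)
```

Keeping the per-allele values together in one dictionary makes it straightforward to flag alleles whose size exceeds a locus-specific threshold, which is left to the caller.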

The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use the specialized GraphAlignmentViewer tool (github.com/Illumina/GraphAlignmentViewer/) to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.
The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.
• | LocusName: A locus identifier that matches the corresponding entry in the repeat expansion specification file. |
• | StartPosition: The starting alignment position of a read on the first graph node. |
• | GraphCIGAR: The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARs describing the alignment of the read to each node. |
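The tag layout described above can be split into its three parts with a few lines of code. This sketch makes an assumption the text does not confirm: that each GraphCIGAR element is written as a node identifier followed by a linear CIGAR in square brackets, for example `0[10M]1[6M]`. The helper name `parse_xg_tag` and the example tag value are hypothetical.

```python
import re

# Assumed element layout: <node id>[<linear CIGAR>], repeated once per node.
NODE_RE = re.compile(r"(\d+)\[([^\]]*)\]")
# A linear CIGAR operation: a length followed by a one-character op code.
OP_RE = re.compile(r"(\d+)([A-Z=])")


def parse_xg_tag(value):
    """Split an XG tag value into locus name, start position, and node alignments.

    Returns (locus, start, nodes), where nodes is a list of
    (node_id, [(length, op), ...]) tuples in alignment order.
    """
    locus, start, graph_cigar = value.split(",", 2)
    nodes = []
    for node_id, cigar in NODE_RE.findall(graph_cigar):
        ops = [(int(length), op) for length, op in OP_RE.findall(cigar)]
        nodes.append((int(node_id), ops))
    return locus, int(start), nodes


if __name__ == "__main__":
    # Hypothetical tag value: a read that enters the repeat node twice.
    print(parse_xg_tag("ALS,12,0[10M]1[6M]1[6M]2[4M]"))
```

Under this assumed layout, counting how often the repeat node appears gives the number of repeat units a read traverses.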
Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.