Virtual Long Read Detection

DRAGEN Virtual Long Read Detection (VLRD) is an alternate and more accurate variant caller focused on processing homologous/similar regions of the genome. A conventional variant caller relies on the mapper/aligner to determine which reads likely originated from a given location. It also detects the underlying sequence at that location independently of other regions not immediately adjacent to it. Conventional variant calling works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing).

However, a significant fraction of the human genome does not meet this criterion. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a typical variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (that is, the primary alignment is not the true source of the read), it can result in detection errors. Short-read sequencing technologies are especially susceptible to these problems. Long-read sequencing can mitigate these problems, but it typically has much higher cost, higher error rates, or other shortcomings.

DRAGEN VLRD attempts to tackle the complexities presented by the genome's redundancy from a perspective driven by the short-read data. Instead of considering each region in isolation, VLRD considers all locations from which a group of reads may have originated and attempts to detect the underlying sequences jointly using all available information.

VLRD Settings

The following options are specific to VLRD in the DRAGEN host software.

•

--enable-vlrd

If set to true, VLRD is enabled for the DRAGEN pipeline.

•

--vc-target-bed

Specifies the input bed file. DRAGEN requires an input target bed file specifying the homologous regions to be processed by VLRD. The input bed file must be formatted correctly for the homologous regions to be processed. The maximum region span processed by VLRD is 900 bp.

For example:

chr1 161497562 161498362 0 0

chr1 161579204 161580004 0 0

chr1 21750837 21751637 1 0

chr1 21809355 21810155 1 1

•

The first three columns follow the traditional bed format: column 1 is the chromosome, column 2 is the region start, and column 3 is the region end.

•

Column 4 is the homologous region group ID. Regions that share a group ID are treated as homologous to each other.

•

In the example, rows 1 and 2 have the same value in column 4, indicating that they should be processed as a set of homologous regions, independent of the next group in rows 3 and 4. If the group IDs are not set correctly, the software might group regions that are not homologous to each other, leading to incorrect variant calls.

•

Column 5 indicates whether a region is reverse complemented with respect to the other homologous region. A value of 1 denotes that the region is reverse complemented with respect to other regions in the same group.

•

In the example, column 5 of row 4 is set to 1. This indicates that the region is homologous to the region in row 3 only if it is reverse complemented.

The DRAGEN installation package contains two VLRD bed files for the hg19 and hs37d5 reference genomes under /opt/edico/examples/VLRD. You can use these bed files as is for running VLRD, or as an example to generate a custom bed file.
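When generating a custom bed file, it can help to check it against the rules described above before running DRAGEN. The following Python sketch is not part of DRAGEN. It assumes whitespace-separated columns, and the function names (parse_vlrd_bed, singleton_groups) are hypothetical. It enforces the five-column layout, the 900 bp maximum span, and a 0/1 value in column 5, and it flags groups with only one region, on the assumption that a homologous group should contain at least two regions.

```python
from collections import defaultdict

MAX_SPAN = 900  # maximum region span processed by VLRD, per the text above


def parse_vlrd_bed(text):
    """Parse five-column VLRD bed text into {group_id: [(chrom, start, end, revcomp)]}.

    Assumption: columns are separated by whitespace (tabs or spaces).
    """
    groups = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"line {lineno}: expected 5 columns, got {len(fields)}")
        chrom, start, end, group_id, revcomp = fields
        start, end = int(start), int(end)
        if not 0 <= start < end:
            raise ValueError(f"line {lineno}: invalid interval {start}-{end}")
        if end - start > MAX_SPAN:
            raise ValueError(f"line {lineno}: span {end - start} exceeds {MAX_SPAN} bp")
        if revcomp not in ("0", "1"):
            raise ValueError(f"line {lineno}: column 5 must be 0 or 1")
        groups[group_id].append((chrom, start, end, revcomp == "1"))
    return dict(groups)


def singleton_groups(groups):
    """Return group IDs with fewer than two regions (assumed to be a mistake)."""
    return [gid for gid, regions in groups.items() if len(regions) < 2]


# The example rows from this section.
EXAMPLE = """\
chr1 161497562 161498362 0 0
chr1 161579204 161580004 0 0
chr1 21750837 21751637 1 0
chr1 21809355 21810155 1 1
"""

groups = parse_vlrd_bed(EXAMPLE)
for gid, regions in groups.items():
    print(gid, regions)
print("singleton groups:", singleton_groups(groups))
```

The check cannot confirm that regions in a group are truly homologous. It only catches formatting problems that would otherwise surface at run time.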

•

--enable-vlrd-map-align-output

If set to true, VLRD outputs a remapped BAM/SAM file that only contains reads mapped to the regions that were processed by VLRD.
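As an illustration of how the three options fit together, the following Python sketch assembles the VLRD-specific part of a DRAGEN command line. The helper name vlrd_args and the bed file path are hypothetical, and passing each value as a separate "true" token is an assumption about the option syntax. The reference, input, and output options that a real run also needs are intentionally left out.

```python
import shlex


def vlrd_args(target_bed, map_align_output=False):
    """Build the VLRD-specific options described in this section.

    Assumption: each option takes its value as a separate token.
    """
    args = ["--enable-vlrd", "true", "--vc-target-bed", target_bed]
    if map_align_output:
        # Also request the remapped BAM/SAM restricted to VLRD regions.
        args += ["--enable-vlrd-map-align-output", "true"]
    return args


# Hypothetical bed file path; other required DRAGEN options are omitted.
cmd = ["dragen"] + vlrd_args("my_vlrd_regions.bed", map_align_output=True)
print(shlex.join(cmd))
```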