SEQdata-BEACON: a comprehensive database of sequencing performance and statistical tools for performance evaluation and yield simulation in BGISEQ-500

By Yanqiu Zhou, Chen Liu, Rongfang Zhou, Anzhi Lu, Biao Huang, Liling Liu, Ling Chen, Bei Luo, Jin Huang, Zhijian Tian

Posted 30 May 2019
bioRxiv DOI: 10.1101/652347 (published DOI: 10.1186/s13040-019-0209-9)

Background BGISEQ-500 is based on DNBSEQ™ technology and offers high output at low cost. This sequencer has been widely used in various areas of scientific and clinical research. A better understanding of the sequencing process and of sequencer performance is essential for stabilizing the sequencing process, accurately interpreting sequencing results and efficiently troubleshooting sequencing problems. To address these needs, a comprehensive database, SEQdata-BEACON, was constructed to accumulate sequencing performance data from BGISEQ-500.
Methods A total of 60 BGISEQ-500 sequencers in the BGI-Wuhan lab were used to collect the sequencing performance data. Lanes from paired-end 100 sequencing runs using 10 bp barcodes were selected, and each lane, described by 66 metrics, was assigned a unique entry number as its ID. The database was constructed in MySQL server 8.0 and the website was built on Apache (2.4.33 win64 VC15 server). The statistical analysis and linear regression models were generated in R based on data collected from November 2018 to April 2019.
Results A total of 2236 entries were recorded in the database, including sample ID, yield, quality, machine state and supplies information. According to the correlation matrix, the 52 numerical metrics clustered into three groups representing yield-quality, machine state and sequencing calibration. The metric distributions also revealed patterns that offer clues for further explanation or analysis of the sequencing process. Using data from all 200 cycles, the linear regression model closely simulated the final output. Moreover, the final yield could be predicted as early as the 15th cycle of sequencing; the coefficients of determination (R²) of the 200th- and 15th-cycle models were 0.97 and 0.81, respectively. The data source, statistical findings and application tools are all available on our website <http://seqBEACON.genomics.cn:443/home.html>.
Conclusions These resources can be used as a continuously updated reference for BGISEQ-500 users to comprehensively understand DNBSEQ™ technology, solve sequencing problems and optimize the sequencing process.
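
The yield-simulation idea described in the Results (a linear regression from per-lane metrics to final yield, judged by R²) can be illustrated with a minimal sketch. This is not the authors' model, which was fitted in R on the database entries. The example below uses Python with NumPy, purely synthetic data, three unnamed placeholder metrics, and hypothetical helper names (`fit_linear_model`, `predict`, `r_squared`); none of the numbers correspond to the reported results.

```python
import numpy as np


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept term.

    Returns the coefficient vector [intercept, b1, ..., bk].
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply a fitted coefficient vector to a metric matrix."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (assumption): 200 lanes, each with three
# placeholder metrics observed at some early cycle, and a final yield
# generated from a known linear rule plus noise.
rng = np.random.default_rng(0)
n_lanes = 200
X = rng.normal(size=(n_lanes, 3))
true_coef = np.array([50.0, 4.0, -2.0, 1.5])  # intercept + 3 slopes
y = true_coef[0] + X @ true_coef[1:] + rng.normal(scale=1.0, size=n_lanes)

coef = fit_linear_model(X, y)
r2 = r_squared(y, predict(coef, X))
print("fitted coefficients:", coef)
print("in-sample R^2:", r2)
```

With real data, one such model would be fitted per cycle of interest (for example on metrics available at cycle 15 versus cycle 200), and comparing their R² values shows how much predictive power is lost by predicting early. An honest comparison would evaluate R² on held-out lanes rather than in-sample as done here.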

