Improved Gossypium raimondii genome using a Hi-C-based proximity-guided assembly

Genome sequence plays an important role in both basic and applied studies. Gossypium raimondii, the putative contributor of the D subgenome of upland cotton (G. hirsutum), highlights the need to improve the genome quality rapidly and efficiently. We performed Hi-C sequencing of G. raimondii and reassembled its genome based on a set of new Hi-C data and previously published scaffolds. We also compared the reassembled genome sequence with the previously published G. raimondii genomes for gene and genome sequence collinearity. A total of 98.42% of scaffold sequences were clustered successfully, among which 99.72% of the clustered sequences were ordered and 99.92% of the ordered sequences were oriented with high-quality. Further evaluation of results by heat-map and collinearity analysis revealed that the current reassembled genome is significantly improved than the previous one (Nat Genet 44:98–1103, 2012). This improvement in G. raimondii genome not only provides a better reference to increase study efficiency but also offers a new way to assemble cotton genomes. Furthermore, Hi-C data of G. raimondii may be used for 3D structure research or regulating analysis.


Introduction
Over the last decade, next-generation sequencing (NGS) technologies have brought immense improvements in plant genome sequencing throughput and reduced the cost, and many plant genomes have been sequenced using this technology, such as F. vesca (Shulaev et al. 2011), C. cajan (Varshney et al. 2012), G. raimondii (Wang et al. 2012;Paterson et al. 2012), G. arboreum (Li et al. 2014) and G. hirsutum (Li et al. 2015;Zhang et al. 2018). However, de novo assembly of large eukaryotic genomes has remained a great challenge with the NGS platform because of a significant amount of repeat contents. As a result, de novo assembly of chromosomescale scaffolds has become a major constraint to the completion of a high-quality genome sequence. Compared with traditional methods, like genetic map and physical map, high-throughput chromosome conformation capture (Hi-C) technology (Belton et al. 2012) has assisted in the assembly of long scaffolds to produce chromosome-scale genome assemblies (Lightfoot et al. 2017). The Hi-C technique is based on restriction enzyme cutting sites, which are more evenly distributed and have much higher density than genetic map.
The Hi-C-based proximity-guided assembly was initially developed to study the three-dimensional (3D) conformation of chromosomes of yeast gene expression (Berkum et al. 2010). The working principle based on two regular models with Hi-C data: the first one is that the rate of Hi-C interaction is inversely proportional to the genomic distance between the pairs of loci; the second one is that the rate of Hi-C interaction of pairs of loci within a chromosome is significantly higher than that in different chromosomes (Xie et al. 2015). Based on these two models, Hi-C-based proximity-guided assembly was applied for de novo assembling of human beings, and subsequently for the assembling of mouse and Drosophila genomes which reported good results or improvement (Burton et al. 2013). With the success of testing and verifying this method in Arabidopsis thaliana (Xie et al. 2015), Hi-C based proximity-guide assembly has been reported as an effective and efficient method which subsequently has been used in many other plants.
Besides its commercial value, cotton also serves as a perfect model system for studying cell wall biosynthesis (Zhang et al. 2018), cell elongation (Guo et al. 2017), and polyploidization (Yuan et al. 2016). The Gossypium genus comprises more than 50 species including at least 5 tetraploid species and 45 diploid species. Diploid cotton species are divided into 8 subgenomes, denoted A-G and K based on chromosome pairing relationships (Wendel 1992). Tetraploid cotton species, such as cultivated G. hirsutum (AD 1 ) and G. barbadense (AD 2 ), had formed by an allopolyploidy event about 1-2 million years ago (Paterson et al. 2012). These tetraploid cotton species share common ancestors with the modern New World species G. raimondii (D 5 ) and the Old World A-genome species G. herbaceum (A 1 ) or G. arboreum (A 2 ). Previously, the genomes of different cotton species were sequenced and assembled including G. raimondii (Wang et al. 2012), G. arboreum (Li et al. 2014), and G. hirsutum (Li et al. 2015), respectively.
Among these cotton species, the genome of G. raimondii has the lowest complexity which has been sequenced and assembled using the next-generation Illumina paired-end sequencing strategy (Wang et al. 2012). Approximately 73% (281 scaffolds) of the assembled sequences were anchored to 13 chromosomes, covering 88.1% of the genome, while only 52.4% (228 scaffolds) of the total sequences were both ordered and oriented. The completeness and accuracy of the previous sequenced and assembled genome of G. raimondii (Wang et al. 2012) were relatively low due to the large number of repeat elements and the small number of genetic markers. In the present study, we conducted a de novo Hi-C sequencing of G. raimondii, and incorporated the new Hi-C data with the existing G. raimondii scaffolds (Wang et al. 2012) to improve the quality of the Dgenome.

Tissue collection and Hi-C sequencing Plant materials
The seeds of G. raimondii D 5-1 were planted in an incubator at constant environmental conditions having 27°C temperature, 60% relative humidity, 16/8 h light/dark photoperiod, and 100% fluorescent light. When the sixth euphylla came out, these seedlings were transplanted into big pots. Approximately 3 g young leaves from G. raimondii plants were collected and immediately treated with formaldehyde.

Hi-C pipeline
During this study, we have used the same Hi-C pipeline as in Arabidopsis thaliana (Xie et al. 2015). Before starting this experiment, we have tested the integrity of DNA from the formaldehyde-treated tissue, and then the DNA samples were isolated and digested by MboI instead of HindIII because of the shorter recognition site (only four bases of MboI). The resulting sticky ends were filled with nucleotides in which cytosine is biotinylated, and ligated the adjacent blunt ends to a chimeric circle under extremely dilute conditions. Subsequently, DNA was purified and broken into 300-500 base pairs using ultrasonic, pull-down the biotin-labeled DNA and performed the PCR reaction (10 cycles). After DNA purification, the finished Hi-C library was sequenced with an Illumina Hiseq (PE150). A total of 570 412 361 readpairs were obtained. We named the reassemble chromosome by its length order. The length of chromosome contained the 100 bp N between neighboring scaffolds Genome assembly based on Hi-C data Assembling of G. raimondii genome involved three steps. First, valid Hi-C paired-end reads and contact matrix with a resolution of 100 kb were generated by HiC-Pro (Servant et al. 2015). The raw sequence data with low quality, unmapped and invalid mapped paired reads were filtered out by HiC-Pro and contact-matrix based on interaction frequency was created. HiC-Pro results showed that 95.6% of sequences were clean Q30 bases, showing a good quality of sequence data. After filtering out the Hi-C data, Length of scaffold N50 1 600 000 60 016 944 All of the statistics got rid of the short scaffolds (scaffold length < 1 000 bp) Fig. 1 The heat-map of the reassembled result at a 100-kb resolution. Hi-C interactions were detected in the reassembled G.raimondii genome sequences, and most interactions maps of chromosome sequences suggested that the genome sequence was assembled in the correct order. The darker red dot is representing the higher interaction between bins 81.95% of uniquely mapped sequences were valid pairedend reads. Thus, the valid paired-end reads (223304666) were used for further genome assembly. At the second step, Errors in scaffolds of the initial draft assembly were identified and corrected following the Aedes aegypti's de novo assembly procedure (Dudchenko et al. 2017). Briefly, the errors were corrected by identifying the bins where a scaffold's long-range contact pattern changes abruptly, which was unlikely a correct scaffold. We cut out the error bins as a new scaffold. There are 259 errors within scaffolds. At the third step, the G. raimondii genome was assembled with the Hi-C data by Lachesis (Burton et al. 2013), which contained clustering, ordering, and orienting. Finally, the assembled G. raimondii genome was assessed by heat-map and collinear analysis.

Assembling results
The genome of G. raimondii was reassembled using Hi-C data. About 98.42% of total sequence length were clustered successfully, among which 99.72% and 97.37% of the clustered sequences were ordered and high-quality ordered, respectively. And approximately 99.92% of the ordered sequences were oriented with high-quality. The statistics of pseudo-chromosome length are shown in Table 1, while the indicator statistics of the initial draft genome and reassembly results are shown in Table 2. From the parameters like scaffolds number, N50 of previously draft genome and reassemble genome, we found that the G. raimondii genome using a Hi-C-based proximity-guided assembly is much better than the reported draft genome (Wang et al. 2012).

Genome assessment results
The quality of the reassembled genome can be verified in four aspects. First, BUSCO analysis reported that 97.2% (> 90%) completeness scores for the reassembled genome, indicating that it was complete. Compared with~32 000 genes in the old draft genome (besides scaffolds), 38 975 genes were found in the annotation of the reassembled genome. Second, The heatmap directly proved the validity of the processing methods. Based on the two regular models as we mentioned above, the heat-map of the reassembled results revealed that the diagonal interaction was much higher. The boundaries of each pseudo-chromosome are relatively clear and it is under low background noise that shows good reassembling results (Fig. 1). Third, we compared the draft genome and the reassembled genome with two previous genome assemblies of G. raimondii (Paterson et al. 2012;Udall et al. 2019) (Fig. 2). The reassemble genome had a general agreement in their alignments with two previous genome assemblies than the draft genome. Fourth, we compared homologous genes in the draft genome and reassembled genome with which in two previous genome assemblies of G. Fig. 3 The collinearity analysis of homologous genes between the reassemble genome, the draft genome (Wang et al. 2012) and reassembled genome (new) with two previous genome assemblies of G. raimondii (Paterson et al. 2012;Udall et al. 2019) raimondii (Paterson et al. 2012;Udall et al. 2019) (Fig. 3). Both the draft genome and reassembled genome showed a good agreement with other two versions, but the sizes of chromosomes (like chromosome 4 and 8) in the reassembled genome were more reliable than the draft genome.

Discussion
In the present study, we applied de novo Hi-C sequencing of the G. raimondii genome to improve the quality and accuracy of its previously reported draft genome from Wang et al. (2012). Results from the comparative analysis of different parameters between the draft genome (Wang et al. 2012) and the current reassembled genome showed significantly improved quality as compared with the previous one. The reassembled genome is not good enough, because it is based on the next generation sequencing in 2012. The G. raimondii genome (Udall et al. 2019) was assembled to a chromosome level using PacBio long-read technology, but Hi-C and Bionano optical mapping may have a better assembly. But diffirent genome versions can provide different information, like some genes can only be annotated in a certain version of annotation file.
Abbreviations NGS: Next generation sequencing; Hi-C: High-throughput chromosome conformation capture; CAAS: Chinese Academy of Agricultural Science