Tissue collection and Hi-C sequencing
Seeds of G. raimondii D5–1 were planted in an incubator under constant conditions: 27 °C, 60% relative humidity, a 16/8 h light/dark photoperiod, and 100% fluorescent light. When the sixth true leaf emerged, the seedlings were transplanted into large pots. Approximately 3 g of young leaves were collected from the G. raimondii plants and immediately treated with formaldehyde.
In this study, we used the same Hi-C pipeline as applied to Arabidopsis thaliana (Xie et al. 2015). Before the experiment, we tested the integrity of DNA from the formaldehyde-treated tissue. The DNA samples were then isolated and digested with MboI instead of HindIII, because MboI has a shorter recognition site (only four bases). The resulting sticky ends were filled in with nucleotides, among which cytosine was biotinylated, and the adjacent blunt ends were ligated into chimeric circles under extremely dilute conditions. Subsequently, the DNA was purified and sheared into 300–500 bp fragments by sonication, the biotin-labeled DNA was pulled down, and PCR was performed (10 cycles). After DNA purification, the finished Hi-C library was sequenced on an Illumina HiSeq platform (PE150). A total of 570,412,361 read pairs were obtained.
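The preference for a four-base cutter can be illustrated with simple arithmetic. The sketch below assumes equal and independent base frequencies, which real genomes do not follow exactly, so the spacings are rough expectations rather than measured values for G. raimondii. The recognition sequences GATC (MboI) and AAGCTT (HindIII) are standard; the function names are hypothetical.

```python
import re

# Recognition sequences: MboI cuts at GATC, HindIII at AAGCTT.
SITES = {"MboI": "GATC", "HindIII": "AAGCTT"}

def expected_spacing(site: str) -> int:
    """Expected distance (bp) between sites, assuming uniform random bases."""
    return 4 ** len(site)

def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site, including overlapping ones.

    Both sites used here are palindromic, so scanning one strand is enough.
    """
    return len(re.findall(f"(?={site})", sequence.upper()))

for enzyme, site in SITES.items():
    print(f"{enzyme} ({site}): about one site every {expected_spacing(site)} bp")
```

Under this assumption a four-base site is expected about 16 times more often than a six-base site, which gives finer restriction fragments and therefore a higher potential resolution of the contact map.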
Genome assembly based on Hi-C data
Assembly of the G. raimondii genome involved three steps. First, valid Hi-C paired-end reads and a contact matrix with a resolution of 100 kb were generated by HiC-Pro (Servant et al. 2015). HiC-Pro filtered out low-quality raw sequences, unmapped reads, and invalid mapped read pairs, and created a contact matrix based on interaction frequency. The HiC-Pro results showed that 95.6% of the sequence data were clean Q30 bases, indicating good sequence quality. After filtering, 81.95% of the uniquely mapped sequences were valid paired-end reads. These valid paired-end reads (223,304,666) were used for further genome assembly. Second, errors in the scaffolds of the initial draft assembly were identified and corrected following the procedure used for the de novo assembly of Aedes aegypti (Dudchenko et al. 2017). Briefly, errors were identified as bins where a scaffold's long-range contact pattern changes abruptly, which is unlikely in a correctly assembled scaffold. Each error bin was cut out as a new scaffold. In total, 259 errors were found within scaffolds. Third, the G. raimondii genome was assembled with the Hi-C data by Lachesis (Burton et al. 2013), which comprises clustering, ordering, and orienting steps. Finally, the assembled G. raimondii genome was assessed by heat map and collinearity analysis.
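The idea of a contact matrix at 100 kb resolution can be sketched as follows. This is a minimal illustration, not HiC-Pro's implementation: the input tuple layout, the function name, and the toy coordinates are all hypothetical. Each valid read pair is assigned to a pair of 100 kb bins, and the count per bin pair is the interaction frequency.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb resolution, as stated in the text

def build_contact_matrix(pairs, bin_size=BIN_SIZE):
    """Count read pairs per pair of genomic bins (sparse, symmetric).

    `pairs` is an iterable of (scaffold1, pos1, scaffold2, pos2) tuples.
    This layout is an assumption for illustration, not a HiC-Pro file format.
    """
    contacts = Counter()
    for scaf1, pos1, scaf2, pos2 in pairs:
        bin1 = (scaf1, pos1 // bin_size)
        bin2 = (scaf2, pos2 // bin_size)
        # Sort the two bins so (a, b) and (b, a) share one matrix entry.
        key = tuple(sorted((bin1, bin2)))
        contacts[key] += 1
    return contacts

# Toy read pairs (made-up coordinates, for illustration only).
toy_pairs = [
    ("scaffold_1", 12_000, "scaffold_1", 150_000),
    ("scaffold_1", 160_000, "scaffold_1", 40_000),
    ("scaffold_1", 20_000, "scaffold_2", 5_000),
]
matrix = build_contact_matrix(toy_pairs)
```

In a matrix of this kind, a bin whose contacts with distant bins of the same scaffold drop or shift abruptly is the sort of signal that the error-correction step looks for.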
Repeats and gene annotation
Repeat sequences of the G. raimondii genome were masked by RepeatMasker (v4.0.8) with a custom library generated by RepeatModeler v2.0.1 (Flynn et al. 2020). The repeat-masked sequences served as the draft genome for gene prediction. De novo prediction, homology-based prediction, and transcriptome-based prediction were combined to annotate the draft genome of G. raimondii. GlimmerHMM v3.0.4 (Chrysanthou et al. 2011) and AUGUSTUS v3.3.2 (Stanke et al. 2006) were run for de novo prediction. Homologous proteins from Arabidopsis, maize, and rice were input into GenomeThreader (v1.6.1) to train models for homology-based prediction. For transcriptome-based prediction, we used nine transcriptomes of G. raimondii from different organs and growth stages (SRR389181, SRR203240, SRR203250, SRR8878792, SRR8878720, SRR8878562, SRR8878553, SRR8878556, and SRR8878559). The transcriptome reads were aligned to the draft genome by HISAT2 v2.1.0 (Kim et al. 2019), and transcripts were assembled by StringTie v1.3.6 (Pertea et al. 2015). All resulting files in general feature format were integrated into a final genome annotation file by EvidenceModeler v1.1.1 (Haas et al. 2008).
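Before evidence from several predictors is integrated, it can help to summarize what each general feature format file contains. The sketch below is an assumed sanity check, not part of the pipeline described above. It relies only on the standard nine tab-separated GFF3 columns, where column 2 is the source and column 3 the feature type. The function name and the example records are hypothetical.

```python
from collections import Counter

def count_features(gff_lines):
    """Count features per (source, type) from an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip lines that do not have the nine GFF3 columns
        source, feature_type = fields[1], fields[2]
        counts[(source, feature_type)] += 1
    return counts

# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "Chr01\tAUGUSTUS\tgene\t1000\t5000\t.\t+\t.\tID=g1",
    "Chr01\tAUGUSTUS\tCDS\t1000\t1500\t.\t+\t0\tID=g1.cds1;Parent=g1",
    "Chr01\tStringTie\ttranscript\t900\t5200\t.\t+\t.\tID=t1",
]
summary = count_features(example)
```

With real data, the same function could be called on an open file handle, since iterating over a text file yields its lines.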
The reassembled genome was evaluated in three respects: completeness, consistency, and continuity. Continuity was evaluated by the scaffold N50 length, calculated with our own Python script. Completeness was checked with BUSCO (v4.1.4; Simão et al. 2015), and consistency was evaluated by comparing our reassembled genome with other published genomes of G. raimondii. Genome alignment was performed with Minimap2 v2.1 (Li, 2018), and dot plots were generated with dotPlotly (https://github.com/tpoorten/dotPlotly/). Orthologous genes among the G. raimondii genomes were detected with OrthoFinder v2.5.1 (Emms and Kelly, 2015) and visualized with jcvi (https://github.com/tanghaibao/jcvi). All software used in this article was run with default parameters.
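The scaffold N50 used as the continuity measure can be computed in a few lines. The following is a minimal sketch of the standard N50 definition, not the authors' script. The function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")

# Made-up scaffold lengths for illustration only.
scaffold_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(scaffold_lengths))
```

In practice, the lengths would be read from the assembly FASTA file, one value per scaffold.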