Genome-wide characterization of the Rho family in cotton provides insights into fiber development

Cotton is the source of natural fibers globally, fulfilling 90% of the textile industry's requirements. However, fiber development is a complex biological process comprising four stages. Fiber develops from a single cell, and cell elongation is a vital process in fiber development. Therefore, it is pertinent to understand and exploit mechanisms underlying cell elongation during fiber development. A previous report about cell division control protein 42 (CDC-42) with its key role in cell elongation in eukaryotes inspired us to explore its homologs Rho GTPases for understanding of cell elongation during cotton fiber development. We classified 2 066 Rho proteins from 8 Gossypium species into 5 and 8 groups within A and D sub-genomes, respectively. Asymmetric evolution of Rho members was observed among five tetraploids. Population fixation statistics between two short and long fiber genotypes identified highly diverged regions encompassing 34 Rho genes in G. hirustum, and 31 of them were retained through further validation by genome wide association analysis (GWAS). Moreover, a weighted gene co-expression network characterized genome-wide expression patteren of Rho genes based on previously published transcriptome data. Twenty Rho genes from five modules were identified as hub genes which were potentially related to fiber development. Interaction networks of 5 Rho genes based on transcriptional abundance and gene ontology (GO) enrichment emphasized the involvement of Rho in cell wall biosynthesis, fatty acid elongation, and other biological processes. Our study characterized the Rho proteins in cotton, provided insights into the cell elongation of cotton fiber and potential application in cotton fiber improvement.

. Several protein-coding genes involved in various metabolic processes were systematically identified for their potential roles in fiber development, such as ethylene biosynthesis, long-chain fatty acids biosynthesis, and auxin transport. However, detailed regulatory mechanisms remain to be elucidated, such as the synthesis of long fatty acid and cell wall components during fiber elongation.
We got inspiration from researches on single-cell elongation in eukaryotes to explore and exploit critical factors influencing cell elongation/fiber elongation. CDC-42, a highly conserved plasma membrane-associated small Rho-family GTPase, harboring an active GTPbound state and an inactive GDP-bound state to promote the cell elongation in eukaryotes (Etienne 2004;Farhan and Hsu 2016;Mack and Georgiou 2014). CDC-42 regulates the extension and maintenance of filopodia in neurons of Mus musculus (Sakabe et al. 2012). CDC-42 also mediates the filopodia initiation in human cells (Gauthier et al. 2004). Moreover, several reports have verified the active role of CDC-42 in cell elongation and cell shape in different species (Murakoshi et al. 2011;Galic et al. 2014). For instance, Rho GTPases, homologs of CDC-42 in Arabidopsis thaliana, are involved in vesicle trafficking and protein transport, implying the role of GTPase in cell membrane trafficking (Anai et al. 1994;Koh et al. 2009). Based on the above-mentioned reports, it can be inferred that CDC-42 is a highly conserved functional protein among different species. Therefore, we hypothesize that the homologs of CDC-42 might play a significant role in cotton fiber development. Delmer et al. (1995) cloned several Rho genes encoding small GTPase and explored their roles in fiber development. Recent advances in omics provide data platform for rapid identification and further characterization of complex regulatory mechanisms underlying fiber development (Huang et al. 2020;Chen et al. 2020;Wang et al. 2019). In this study, We adapted a systematic approach and utilized the previously published genomic data (Ma et al. 2018) and transcriptome data (Qin et al. 2019) to identify and characterize the potential functions of Rho genes in cotton.

Rho family members in cotton
We identified Rho proteins among eight Gossypium species, including three diploids and five tetraploids, after obtained Rho HMM (Hiden Markov Model) file from Pfam platform (Table 1). We grouped Rho protein from tetraploid cotton according to the chromosome distribution on A sub-genome or D sub-genome. These Rho proteins were annotated using Uniprot database (Additional file 1: Table S1). Annotation information suggested that the number of Rho proteins in Gossypium species ranged from 145 to 332. The longest protein sequence (1 101 aa) was identified in G. arboreum, while the shortest Rho proteins (644 aa or 645 aa) were identified in 4 tetraploids, i.e., G. hirsutum, G. tometosum, G. mustelinum, and G. darwini (Additional file 2: Table S2). We further compared the difference between short and long Rho proteins, whereas 13 were selected as the small Rho proteins with length ranged from 644 aa to 679 aa. Gar05G18660, encoding the longest protein of 1 101 aa, had three extra domains in an unaligned region, i.e., Glyco_transf_8, PPR (pentatricopeptide repeat), and PPR_2. Glyco_transf_8 was found in G. arboreum, G. herbaceum, and all analyzed tetraploid species except G. raimondii. Interestingly, all of the Rho proteins containing Glyco_transf_8 domain in all analyzed tetraploids were in A t sub-genomes, indicating Glyco_transf_8 in allotetraploids was inherited from diploids of A genome and had not been duplicated in D t sub-genome during the evolution of tetraploids. According to Pfam database, Glyco_transf_8 (PF01501)is involved in glucose transferring. Considering that both tetraploids and diploids of A genome had longer fiber than G. raimondii, we speculated that Rho proteins with Glyco_transf_8 may contribute to the fiber elongation. Moreover, PPR and PPR_2 were only identified in Gar05G18660. According to the annotation in Pfam (PF01535), the function of PPR domain is still unclear. Other Rho proteins had no domain in unaligned regions. Variations in protein length and domain distribution implied high sequence variation among Rho family members.

Phylogeny analysis of Rho members in cotton
We performed phylogenetic analysis to understand the diversity among Rho proteins. Two phylogenetic trees were built according to A t and D t sub-genomes, respectively ( Fig. 1a, b). Rho proteins from two species of A genomes (G. herbaceum and G. arboreum) and five allotetraploids of A sub-genomes (A t1 -A t5 ) were classified into five groups, while members from diploid species of D genome (G. raimondii) and 5 allotetraploids of D sub-genomes (D t1 -D t5 ) were classified into eight groups. We evaluated sequence conservation of Rho genes in eight cotton species by identical site rate (ISR %) (Fig. 2). By comparing ISR of Rho proteins among different cotton species, we noticed that Rho members in G. herbaceum (A 1 -A 1 ) had a lower ISR compared with G. arboreum (A 2 -A 2 ) ( Fig. 2a, b).
Furthermore, we found that inter-genomic conservation (A 1 -A t1-t5 ) was higher than intra-genomic conservation (A 1 -A 1 ) among Rho genes in species of A 1 -genome (higher ISR), indicating higher similarity between A 1 and A sub-genomes from 5 allotetraploids (Fig. 2a, b; Additional file 3: Table S3). These results indicated that Rho members in G. herbaceum remained conserved with less sequence diversity and were consistent with the previous conclusion that G. herbaceum was likely a donor of A genome in allotetraploid (Huang et al. 2020). Moreover, Rho protein sequences in G. raimondii were more conserved with those from D sub-genomes in G. hirustum (D 5 -D t1 ; P = 1e-6) and G. tomentosum (D 5 -D t3 ; P = 0.03) ( Fig. 2c; Additional file 3: Table S3). Phylogenetic analysis suggested more similarity between Rho genes from the D 5 genome and those from D t1 and D t3 , indicating Rho genes had a potential contribution to the independent evolution of five tetraploids.

Fiber length-related association analysis of Rho members
To check up the potential role of Rho GTPases in fiber development, we performed association analysis of fiber length based on the previous sequenced data (Ma et al. 2018). Thirty cultivars with the longest fiber and 30 cultivars with the shortest fiber were selected among the 419 resequenced dataset to identify the candidate loci associated with fiber length. By t-test, a significant divergence of fiber length was observed between 2 groups (P = 1.618e-30) (Fig. 3a). To detect candidate genomic regions, genomic sequences were divided into windows (windowsize = 50 000 bp, stepsize = 5 000 bp) and Fst of these windows were calculated (Fig. 3b). Finally, the top 5% Fst from 44 626 windows were selected to detect potential genes associated with fiber length. A total of 6 885 genes overlapped with the selected windows were extracted as candidate genes (Additional file 4: Table S4 and Additional file 5: Table S5). The functions of 6 885 genes were initially checked by GO enrichment analysis. We found that genes involved in basic biological processes such as RNA binding (GO:0003723), mRNA splicing (GO:0000398), and several processes related to glucose metabolism, for instance, galactose metabolic process (GO:0006012), UDP-glucose 4-epimerase activity (GO:0003978) were enriched. Interestingly, we noticed that GTPase activity (GO:0003924) was also enriched, indicating that the activities of GTPase could influence fiber length (Fig. 3c). All 34 Rho genes among the 6 885 candidate genes were selected for further analysis (Additional file 6: Table S6). Furthermore, we performed GWAS analysis on all 419 cultivars for fiber length and found that 31 Rho genes among above-mentioned 34 Rho genes had significant SNPs (Additional file 6: Table S6).

Expression patterns of Rho members in transcriptome data
To evaluate the transcription abundance of 31 Rho genes, we analyzed the published transcriptome data. Raw transcriptome data of 8 samples were downloaded (Table 2) to perform transcriptome analysis. As a result, 8 678 and 298 differentially expressed genes (DEGs) from fiber and ovule groups, long fiber and short fiber groups were identified, respectively ( Fig. 4a,b; Additional file 7: Table S7). We defined genes with the maximum TPM score larger than 5 as expressed genes. For 34 Rho candidate genes, 20 of them were found to be expressed. Among 20 expressed Rho genes, 10 of them were found in DEGs among the fiber, ovule groups (Fig. 4c, d and Additional file 8: Table S8). These 10 Rho genes with diverge expression patterns between fiber and ovule epidermis cells may participate in cell elongation.

WGCNA analysis of transcriptome data
Rho genes may regulate cell polarity with the involvement of other genes (Etienne 2004). To investigate the potential interactions of Rho proteins during fiber development, we conducted a weighted gene co-expression network analysis (WGCNA) (Fig. 5a). Genes with the transcription score smaller than 5 among all samples were trimmed, and finally 34 181 genes were retained for WGCNA analysis. The soft threshold of the network was set as 26, while R square was above 0.8, and mean connectivity was lower than 2 000, indicating that the network was scale-free ( Fig. 5b). After network construction, genes were classified into 13 modules according to the expression profile (Fig. 5c).
Since the modules were generated according to expression patterns of genes, the association analysis between gene expression and phenotype data among each classified modules revealed potential functions of corresponding modules. The phenotype data of 8 samples were classified as fiber, ovule and long fiber, short fiber groups, respectively. We set 0.5 as the threshold of the R value of Pearson correlation to identify the module associated with fiber length phenotype (Fig. 5d). Among modules, pink, magenta, and tan modules were related to the long fiber trait, while the purple module was related to the short fiber trait. As for the fiber, ovule trait, we found blue and brown modules related to fiber development. On the other hand, pink, turquoise, purple, and tan modules were related to ovule tissue.
We checked the interaction networks within Rho genes and found that Rho genes in turquoise, blue, brown, green, and yellow modules showed potential interactions (Fig. 5e). To investigate the role of these modules, the genes with the top 10% membership were selected for GO enrichment analysis (Additional file 9: Table S9). Results of GO enrichment of blue module showed that fatty acid biosynthetic process (GO:0006633), microtubule (GO:0005874), and microtubule-based movement (GO:0007018) were enriched (Additional file 12: Fig.  S1a). The other ovule-related brown module contained different GO terms, such as glucose metabolic process (GO:0006006), cellulose microfibril organization (GO:0010215), and cell growth (GO:0016049) (Additional file 12: Fig. S1b). Although various biological processes were enriched in blue and brown modules, most of the enriched GO terms were metabolic pathways.

Fig. 2
Identical site rate (ISR) in Rho proteins across 8 cotton species. a. Identical site rate of Rho proteins from G. herbaceum and 5 tetraploids. A 1 -A 1 means all Rho proteins from G. herbaceum. The A 1 -A t indicates all mixed Rho proteins from both G. herbaceum and the corresponding A t sub-genome from tetraploid. b Identical site rate of Rho proteins from G. arboreum and 5 tetraploids. A 2 -A 2 means all Rho proteins from G. arboreum. The A 2 -A t indicates all mixed Rho proteins from both G. arboreum and the corresponding A t sub-genome from tetraploid. c Identical site rate of Rho proteins from G. raimondii and 5 tetraploids. D 5 -D 5 means all Rho proteins from G.raimondii. The D 5 -D t indicates all mixed Rho proteins from both G. raimondii and the corresponding D t sub-genome from tetraploid Meanwhile, in the turquoise module more basic biological processes such as RNA binding (GO:0003723), DNA binding (GO:0003677), and chromatin remodeling (GO:0006338) were enriched to these fiber-related modules (Additional file 12: Fig. S1c). The green module, another fiber-related module, had microtubule-based movement (GO:0007018), microtubule motor activity (GO:0003777), and small GTPase mediated signal transduction (GO:0007264) (Additional file 12: Fig. S1d). The yellow module contained biological processes related to ATP synthase activity (Additional file 12: Fig. S1e). Rho genes among the above mentioned modules indicate that complicated mechanisms underlying the cell elongation process.

Interaction networks of Rho family members
The Rho genes from 5 modules were further screened, and genes with the highest membership were selected (Additional file 10: Table S10). Finally, Ghir_ A12G010690, Ghir_D08G007370, Ghir_D08G018200, Ghir_A11G010670, and Ghir_D03G002970 were identified as core genes from blue, brown, magenta, turquoise, and green modules, respectively (Fig. 6a, b, c, e, f ). Their expression patterns during fiber development were validated by quantitative real-time polymerase chain reaction (qRT-PCR) (Additional file 13: Fig. S2; Additional file 14: Fig. S3). Among 5 core genes, 2 of them, Ghir_A12G010690 and Ghir_D08G007370, were DEGs between fiber and ovule tissues. Genes possessing high transcriptional correlation (pearson) with the 5 core genes were recorded as interacted genes, and were further used to construct interaction networks. In the interaction network of Ghir_A12G010690, genes related to the fatty acid biosynthetic process (GO:0006633), microtubule-based process (GO:0007017), and small GTPase mediated signal transduction (GO:0007264) were identified (Fig. 6d). It was known that the fatty acid was essential for fiber elongation (Qin et al. 2011). Meanwhile, small GTPase mediated signal transduction was also enriched in this network, implying that Ghir_A12G010690 may participate in signal transduction activities to regulate cell polarity (Etienne 2004). What's more, genes related in fatty acid biosynthesis which participated in the development of cotton fiber were also clustered with Ghir_A12G010690 (Qin et al. 2007). We inferred that Rho protein as a GTPase that might related to fatty acid biosynthesis to influence fiber development ( Fig. 6c and d). In other four networks, there were also GO terms involved in fiber development, such as cell wall biosynthesis (GO:0042546), and calcium ion binding (GO:0005509) (Additional file 11: Table S11). The diverged GO terms enriched among these five Rho networks implied that Rho may involved in different biological processes.

Discussion
In this study, we identified 2 066 Rho proteins in 8 cotton species. ISR of Rho family members in 2 diploids of A genome showed that Rho members in G. herbaceum (A 1 ) went through less selection which means A 1 is closely related to A donor in tetraploids. This conclusion is consistent with the previous findings (Huang et al. 2020). Interestingly, for D sub-genomes, we also noticed that Rho proteins in G. hirustum and G. tomentosum were more conserved than those in other three tetraploids, indicating a specific selection happend in G. tomentosum and G. hirustum. In the previous study, G. hirustum and G. tomentosum are two cotton species with the closest relativeness (Chen et al. 2020). Therefore, Rho genes in D sub-genomes may undergo a strong selection before the divergence between G. hirustum and G. tomentosum (Chen et al. 2020). This selection may result in long fiber length of G. hirustum.
In a previous study, association analysis between genotype and phenotype data could dig out functional genes with high confidentiality (Ma et al. 2018). In this research, we performed an association analysis on 60 cultivars to verify potential functions of Rho genes. Combined with transcriptomic analysis and association analysis, 22 Rho genes were selected with high confidence. GO enrichment on WGCNA identified modules of 22 candidate Rho genes implied a complicated mechanisms underlying Rho functions. Rho protein, as GTPase, participates in many essential pathways and has many interactors as reported in other species (Etienne 2004;Farhan and Hsu 2016;Goryachev and Leda 2017;Moran et al 2019). The inferred interaction networks shed lights on how Rho control the fiber elongation. GO enrichment analysis on genes from a  . 4 Characterization of DEGs in transcriptome data. a relative transcription abundance of DEGs among long fiber and short fiber cultivars. b relative transcription abundance of DEGs among fiber and ovule tissues. c Venn plot of 22 Rho members which detected by the combination of transcriptome and association analysis (in magenta color) and DEGs in fiber/ovule tissues (in orange colors). d differentially expressed Rho genes between fibers and ovules (overlapped genes in c) HE et al. Journal of Cotton Research (2022) 5:21 network constructed towards Ghir_A12G010690 showed that fatty acid biosynthesis process and small GTPase mediated signal transduction are enriched. We speculate that Rho genes may activate a signal transduction pathway and regulate fatty acid biosynthesis which has been proved to activate fiber elongation. What's more, we noticed that cell wall biosynthesis, calcium ion binding also emerged in other Rho-based interaction networks. This result suggests that apart from influencing fatty acid biosynthesis, Rho genes may regulate fiber development through multiple machinisms.

Conclusions
In this study we identified Rho proteins in 5 allotetraploid cottons and their corresponding sub-genome donors. Rho GTPase, as a key factor involved in basic biological processes, had undergone asymmetric evolution during the divergence of allotetraploids. Five Rho genes potentially involved in fiber development were revealed by the combination of association analysis and transcriptome profiling. Further, potential interaction networks for Rho genes were built. We believed that findings in this study could be utilized in cotton molecular breeding for fiber improvement.

Calculation for identical site rate among Rho proteins
Identical site rates (ISRs) among Rho proteins from 8 Gossypium species were used to evaluate protein conservation during cotton evolution. The calculation for identical rates was based on the results of multi-alignment. Here, we took the calculation of identical rate in A 1 -A 1 as example. All 158 Rho proteins identified in G. herbaceum were collected and analyzed in muscle (v.3.8.1551) for multi-alignment. The result of multi-alignment was presented as a fasta file. For each site in the fasta file, the number of protein sequences which were not aligned to other sequences at this site was counted to calculate ISR. The mean identical rates calculated from each site among Rho proteins were used to demonstrate the intra-genomic conservation. As for inter-genomic conservation (taking A 1 -A t as example), Rho proteins from G. herbaceum and A sub-genomes of tetraploids were collected. All Rho protein sequences were aligned to each other by muscle (v.3.8.1551) and ISR for each site was calculated in the same way that applied in evaluation of intra-genomic conservation. In the aligned fasta file, the length of each sequence was equal, and sequences with less than 5% unaligned sites were abandoned for the subsequent identical rate calculation. t-test for comparsion of identical rates between 2 groups were performed through python package stats.

Phylogeny analysis on Rho members
Since tetraploid cotton species have two sub-genomes, we performed phylogeny analysis separately on A genomes (A sub-genomes) and D genomes (D sub-genomes). The protein sequences were aligned by muscle and put into fasttree (v.2.1.10) to construct phylogenetic trees by default parameters (Price et al. 2009). The phylogenetic tree was decorated by iTOL ( https:// itol. embl. de/).

Association analysis pipeline
Previous resequenced data was fetched from NCBI (https:// www. ncbi. nlm. nih. gov/). The sequence read archive (SRA) accession of raw data in fastq format was SRP115740 at PRJNA399050 (Ma et al. 2018). The phenotype data was downloaded from http:// cotton. hebau. edu. cn/. Phenotype data of fiber length was utilized in this study. The best linear unbiased prediction (BLUP) was applied to deal with fiber length from 419 cultivars in 12 environments. Thirty cultivars with the longest fiber length and 30 cultivars with the shortest fiber length were selected for association analysis. Raw fastq data were aligned to the reference genome TM-1_HAU (https:// cotto nfgd. org/ about/ downl oad. html, G. hirustum, HAU) by bwa (v0.7.17) (Wang et al. 2019). Samtools (v1.9) transformed the results of bwa into a binary alignment file (bam file), and the bcftools (v1.9) was used to call SNPs from bam files (Li et al. 2009;Danecek et al. 2011). The fixation index (Fst) of 2 groups was calculated by vcftools (v0.1.16) with 50 000 window size and 5 000 step size. For the GWAS analysis, the vcf files of 419 cultivars were merged and transformed into tped file by vcftools (v0.1.16). EMMAX (https:// genome. sph. umich. edu/ wiki/ EMMAX) was used to perform GWAS based on tped file and phenotype data of 419 upland cotton cultivars (Hyun et al. 2010). The SNPs with the P value smaller than 0.05 was defined as significant SNPs.

Transcriptome analysis pipeline
Raw transcriptome data in fastq format were downloaded from SRA database on NCBI (https:// www. ncbi. nlm. nih. gov/). Transcriptome datasets (SRR5992414 and SRR6379574-SRR6379580) from PRJNA400837 were obtained by sratoolkit (v.2.9.6). Quality control and the data trimming were performed by fastp (v.0.23.1) with the default parameters in pair-end mode, and the corresponding clean data were generated (Chen et al. 2018). We used the genome of TM-1 as the reference in the subsequent analysis. Index of the reference genome was implemented by hisat2 (v.2.2.1). The clean data of 8 samples were aligned to the reference genome in a pair-end mode (Kim et al. 2019). The results of hisat2 alignment were transformed into bam, then the bam files were sorted. Transformed and sorted reads were generated by samtools (v.1.9) (Li et al. 2009). Stringtie (2.1.7) was used to identify the transcription abundance of each gene, guided with gene anotation file (-G) and transcription abundance of 8 samples were merged into one file (Pertea et al. 2015).
For the detection of DEGs in short and long fiber cottons, transcriptional abundance of genes between all short fiber samples (69-6025-12, mixture of fiber and ovule at different developmental stages) and long fiber samples (601 long stapled cotton mixture from fiber and ovule at different developmental stages) was compared by t-test. Genes with significant effect in t-test (P < = 0.05) were identified as differentially expressed genes (DEGs) between short and long fiber samples. Detection for DEGs between fiber and ovule tissues was based on the same method applied in long and short fiber group. The relative transcription abundance of DEGs was used to plot the heatmap. The relative transcription abundance (RTA) of a gene was calculated by normalizing TPM values among different samples. Assuming that there are n samples, RTA of a gene in sample1 was calculated by TPM sample1 /(TPM sample1 + ……TPM sample n ). The calculation of a gene's RTA could present the divergence of this gene's transcription abundance among samples without the disturbation of mean TPM values of different genes.

WGCNA analysis pipeline
Since transcription abundance of all genes in 8 samples were gained, a weighted gene co-expression network analysis (WGCNA) could be performed (Langfelder and Horvath 2012;Langfelder and Horvath 2008). Before network construction, genes with the maximum TPM among eight samples smaller than five were filtered. Apart from criteria of transcription abundance, genes absent in more than two samples and samples had more than 10% genes absent were removed. After data filtering, the co-expression network was constructed by R package WGCNA based on the remained genes. Different soft thresholds were applied, and finally, the network was evaluated by mean connectivity and R square. To construct a scale-free network, deepSplit was set as two, and mergeCutHeight was set as 0.35. The minimum genes in a block were 30, while the maximum gene number in a block was 35 000. All remained genes were divided into several modules by WGCNA, and the eigenvalue of the association of each module and phenotype through Pearson correlations were calculated. Membership of each gene and their potential interacted genes were recorded in the Cytoscape-like file for further visualization.

GO enrichment analysis
To perform GO enrichment analysis, genes collected from WGCNA were handed up to the CottonFGD (https:// cotto nfgd. org/ analy ze/) (Zhu et al. 2017). The significant level was set as 0.05, and the minimum gene number for enrichment analysis was set as 3. The result of significant GO enrichment was recorded as a table, and the q value of each GO term was used for visualization.

QRT-PCR analysis
We used TM-1 as material for qRT-PCR analysis. Ovule at 0 DPA and 3DPA, fiber at 5 DPA, 10 DPA, and 15 DPA were selected for the qRT-PCR validation of 5 Rho genes.