Identification of the cotton LHC proteins
One hundred and nine proteins translated by the LHC genes were recognized in the three sequenced cotton genomes, with 55, 27, and 27 proteins in G. hirsutum (AD), G. raimondii (D) and G. arboreum (A), respectively (Table S2). The total number of the proteins found related to the LHC genes in the two diploid cotton species, G. raimondii and G. arboreum, were one less than the number of LHC proteins in G. hirsutum, may be due to AD emerged in the whole genome duplications between A and D genomes.
The protein lengths for the G. hirsutum proteins stretched from 62 aa to 644 aa, molecular weights ranged from 6.88 kDa to 72.66 kDa, respectively, in Gh_Sca017783G01 and Gh_A02G1068, and the molecular charge ranged from − 8.5 (Gh_A01G0519) to 7(Gh_A02G1068), the isoelectric point (pI) ranged from 4.701 (Gh_D06G2350) to 10.228 (Gh_D04G1505) and finally the grand average of hydropathy (GRAVY) ranged from − 0.529 (Gh_A01G0519) to 0.233 (Gh_D06G2350) (Table S1).
In the two diploid cotton species, the G. arboreum and G. raimondii, the physiochemical properties of the LHC proteins exhibited slight differences in molecular weights, protein lengths, pI, molecular charge, and GRAVY values. The protein length stretched from 114 aa to 610 aa, and 151 aa to 349 aa, respectively, and the molecular weights ranged from 12.823 to 68.741 kDa, and 16.55 to 38.267 kDa by a charge range of − 6 to 9 and − 4.5 to 7.5 in G. arboreum and G. raimondii, respectively (Table S1).
On the other hand, the values for the pI and GRAVY were almost the same, the pI ranges from 4.87 to 9.897, and 4.701 to 9.296, the GRAVY ranges from − 0.377 to 0.167 and − 0.249 to 0.244 in order of G. arboreum and G. raimondii, respectively. In all cotton species, the GRAVY values were low (positive and negative), which indicates the likelihood of enhanced relations with water that leads to hydrophilic nature.
Phylogenetic tree and Synteny block analysis of the cotton LHC proteins
The phylogenetic analysis grouped the LHC proteins together with other plants into five clades. Numerous homolog gene pairs were formed among the several proteins encrypted by the cotton Light-Harvesting Chloro a/b binding genes (Fig. 1a).
The collinearity analysis among the three cotton species was done, in which Circle gene viewer was applied to distinguish the collinear gene pairs with TBtools software (Chen et al. 2018). The collinearity analysis between the physical map of At and Dt subgenomes of G. hirsutum, G. arboreum and G. raimondii for their A Vs D; A vs At, and between D Vs Dt subgenome relationships were observed. We found a good collinearity between A vs D with 23 genes, A vs At with 20 genes, and between D vs Dt with 23 genes in the subgenome (Fig. 1b).
Gene ontology analysis
According to the gene ontology analysis, in G. hirsutum, the functions in biological processes (GO: 0008150), were cellular and metabolic processes, various cellular (GO: 0005575) functions were noted in the cell and cell part. Similarly, in G. arboreum, the biological functions (GO: 0008150) were responsible for stimuli, cellular and metabolic processes. Whereas in cellular component (GO: 00055750), the functions focused on cell, macromolecular complex (protein), and membrane related issues, whereas molecular functions (GO: 0003674) were related with binding function. In G. raimondii the biological process (GO: 0008150) was coined with cellular and metabolic processes, which is similar to G. hirsutum, whereas in cellular component (GO: 0005575), the function is related to membrane. In both G. hirsutum and G. raimondii, there was no significant GO term for molecular functions (Fig. 2).
Gene structure and motif identification of LHC proteins
Gene structural study is observed as a likely sign of the evolution of multigene families (Nei and Rooney 2005). To obtain additional evidence into the structural diversity of cotton Light-Harvesting Chloro a/b-bind genes, the exon/intron association in the representative transcripts were investigated in contrast with their equivalent genomic DNA sequences of distinct genes in G. hirsutum, and it was found that a higher proportion of the Light-Harvesting Chloro a/b-bind genes and their exons were extremely conserved inside the group.
Gene structures of some Light-Harvesting Chloro a/b-bind genes possessed introns. The maximum number of introns observed for the Light-Harvesting Chloro a/b-bind gene structures were eleven (Gh_A02G1068), eleven (Ga02G0756), and five (Gorai.003G092700) for G. hirsutum, G. arboreum and G. raimondii, respectively. The highest number of exons and introns were found in Gh_A02G1068 (twelve exons, eleven introns) and Gh_A01G0519 (ten exons, nine introns). Remarkably, exons and introns for diverse Light-Harvesting Chloro a/b-bind genes were observed to be dissimilar based on their lengths. For example, eighteen genes have two exons and one intron and seven genes have three exons and two introns, whereas, seven genes with one exon but no intron (Fig. 3).
On the other hand, in the diploid species, the maximum number of exon/intron were twelve exons, eleven introns (Ga02G0756), eleven exons, and ten introns (Ga01G0731) in G. arboreum, whereas G. raimondii six exons, five introns (Gorai.003G092700) and six exons, five introns (Gorai.009G262000), respectively. Similarly, there are seven G. arboreum genes that have two exons and one intron while ten genes in G. raimondii have two exons and one intron. Genes with three exons and two introns as well as a single exon with no intron were five and three, respectively, in both species.
To explore the structural evolution of LHC proteins, the patterns of motifs were analyzed. A total of 20 different motifs were detected by the MEME analysis (http://meme-suite.org) in the three Gossypium species (Fig. 4). Based on the identified motifs, motif 3, motif 4 and motif 12 were the conserved motifs in G. hirsutum, whereas motif 2 and 8 in G. arboreum and motif 11 and 4 in G. raimondii, respectively, were conserved, too.
Chromosomal mapping of the LHC genes
The LHC genes were unevenly distributed across various chromosomes of A2, D5, and (AD)1 cotton genomes. In the tetraploid (AD)1 genome with At Subgenome, the highest number of genes were found on chromosome At01, At05, and At10 with three genes, while At03, At08, and At09 chromosomes harbored none. Similarly, in the (AD1), Dt Subgenome, the highest number of genes were found on Dt07, Dt01, and Dt05 with five, four, and four genes, respectively, whereas At03, At08, and At09 had zero genes. The rest of the chromosomes harbored one to three genes (Fig. 5a and b). In the two diploid cotton species, A2 and D5 genomes, the gene distribution arrangement was different. In G. arboreum, the highest genes were observed on chromosomes, A2(05), and A2(07), with the four genes while in G. raimondii, chromosomes D5(01), D5(09), and D5(10) possessed the highest gene number with four genes, respectively, while chromosomes A2(04) and D5(06) harbored none. Some LHC genes have tandem duplications, although most are singletons dispersed along the genome (Fig. 5c and d).
Identification of cis-regulatory elements
Cis-acting regulatory elements are important molecular switches involved in the transcriptional regulation of a dynamic network of gene activities controlling various biological processes, including abiotic stress responses, hormone responses, and developmental processes. These genes encode genomic blueprints for coordinating spatiotemporal gene expression programs underlying highly specialized cell functions (Mao et al. 2020). Analysis of cis-regulatory elements revealed that ABA-responsive element (ABRE), Antioxidant response elements (ARE), Metal response elements (MRE), Myeloblastosis (MYB), AT-rich elements (ATREs), Dehydration-responsive element (DRE), MBS, Box-4, and Angiotensin-Converting Enzyme (ACE) cis-regulatory elements were found related to drought stress in the three cotton species (Fig. 6). The ABA-responsive element (ABRE) and the dehydration-responsive element/C-repeat (DRE/CRT) are two major cis-acting elements involved in ABA-dependent and ABA-independent gene expression in osmotic and cold stress responses (Yamaguchi-Shinozaki and Shinozaki 2005).
Evolution of LHC genes in Gossypium species
During the evolution, the Ks value of a gene is not affected by natural selection generally, but Ka value is affected. The Ka/Ks value show positive, neutral, and negative selection when, Ka/Ks > 1, Ka/Ks = 1, or Ka/Ks < 1, respectively (Zhao et al. 2020). The distributions of Ka, Ks, and Ka/Ks among homologous pairs of Gossypium species revealed similar results (Fig. 7, Table S3). The Ka/Ks for GhAt-Ga ranged from 0 to 0.949034416, while for GhDt-Gr Ka/Ks ranged from 0 to 0.838286204. The Ka/Ks of GhAt-GhDt ranged from 0 to 0.523637063, whereas the Ka/Ks value of Ga-Gr ranged from 0 to 0.755930549. In all the pairs, the Ka/Ks value was < 1 which indicated that the gene family was subjected to negative selection. The results suggested that the LHC genes of G. hirsutum derived from G. raimondii and G. arboreum experienced negative selection throughout the process of evolution.
Prediction of subcellular localization for LHC proteins in Gossypium species
The results from the WOLF PSORT (https://wolfpsort.hgc.jp) showed that the LHC proteins were localized in various cell parts including chloroplast, cytoplasm, endoplasmic reticulum, mitochondria, nucleus and vacuole (Fig. 8). Based on the online analysis of the three Gossypium species, the LHC proteins were mainly localized in the chloroplast with 472.5 (72.1%), 251.5 (73.1%), 232 (73.3%) in G. hirsutum, G. arboreum and G. raimondii, respectively (Table S5).
RT-qPCR validation of LHC genes under water deficit conditions
Twenty-seven LHC genes expression profiles were carried out in different tissues and varying time intervals under PEG-6000 treatment. All genes showed differential expression patterns in the analyzed tissues (Table S6). Gh_D10G2385, Gh_A13G0222, Gh_A05G0725, Gh_D05G0860, Gh_D07G0661, Gh_D01G1508, Gh_D12G1495, Gh_A07G2182, and Gh_A10G2108 were found to be highly upregulated in the roots, whereas, Gh_A07G2184, Gh_D10G2385, Gh_D05G0860, Gh_D02G1996, Gh_A13G0222, and Gh_A05G0725 showed higher upregulation after 12 h of stress exposure in leaves. Similarly, Gh_A13G0222, Gh_D06G1791, and Gh_A06G1447 genes were upregulated in stem tissues starting from 6 h up to 24 h (Fig. 9).
Changes in genes expression at different time intervals and in different plant tissues was observed in the results. Most genes were downregulated mainly in leaf tissue followed by stem. Genes like Gh_A10G0361, Gh_D10G0369, Gh_A03G2154, and Gh_D03G0610 were downregulated in the three tissues of cotton at almost all time points. Generally, more genes were upregulated in the root tissues. Expression of gene Gh_A13G0222 was significantly higher in root and stem tissues at 12 h and 24 h after treatment as compared with other time points, whereas in root tissues high expression was observed in almost all time points except at 12 h. Gh_D10G2385, Gh_D05G0860, and Gh_A05G0725 was also upregulated in leaf and root tissues under drought stress. A detailed exploration of these genes will offer efficient information on considerate LHC genes in cotton (Gossypium) and its part in drought stress tolerance. Drought effect comes first at the root zone, and the higher upregulation of various genes in the root tissues is in line with earlier results in which most of the LEA genes were upregulated in the root tissues in relative to leaf and stem tissues during drought stress situation (Magwanga et al. 2018).