Identification of ZAT proteins
To identify the ZAT genes in four cotton species, the protein sequences and genome sequences of G. arboreum, G. raimondii, G. hirsutum, and G. barbadense were compared using HMMER 3.0 software by using PF13912 as a query condition. Redundant sequences were detected and deleted by manual, 115, 115, 59, and 56 genes were identified in G. hirsutum, G. barbadense, G. arboretum, and G. raimondii, respectively. Statistical analysis of the number of genes in four Gossypium species showed that the number of ZAT genes in two tetraploid cotton species (G. hirsutum and G. barbadense) were almost double of their progenitors (G. arboretum and G. raimondii) (Fig. 1), and their numbers were also much higher than many of other plant species like V. vinifera (17), T. cacao (24) and O. sativa (46), indicating that the ZAT gene family in cotton had exhibited large scale expansion during the evolution.
To understand the ZAT gene family more conveniently, the ZAT genes were renamed according to the position of the gene on the cotton chromosome. For example, we assigned the name to ZAT genes as GhZAT1-GhZAT115 for G. hirsutum, GbZAT1-GbZAT115 for G. barbadense, GaZAT1-GaZAT59 for G. arboretum, GrZAT1-GrZAT56 for G. raimondii (Supplementary Table S1). Then we predicted the physical and chemical properties of GhZAT genes, including protein length, protein molecular weight (MWs), isoelectric point (pI), grand average of hydropathy analysis, and subcellular location (Supplementary Table S1). For Gossypium hirsutum, all of the 115 genes encoded proteins ranging from 134 (GhZAT72) to 577 (GhZAT105) amino acids, with an average of 283.6 amino acids. The isoelectric point (pI) varied from 4.936 (GhZAT85) to 10.171 (GhZAT37) with a mean of 7.98, and MWs varied from 15.105 (GhZAT72) kDa to 63.324 (GhZAT105) kDa. The values of the grand average of hydropathy were all negative, which proved that the proteins of ZAT family genes were hydrophilic. Subcellular location prediction showed that there were 110 genes located in the nuclear, 3 in the extracellular, one in the mitochondrial, and one in the plasma membrane. Most GhZAT genes located in the nuclear, which may be consistent with their interaction with DNA (Takatsuji et al. 1992).
Phylogenetic analysis of ZAT genes
To understand the evolutionary relationship of ZAT genes among four cotton species and other plant species, phylogenetic analysis of 662 ZAT protein sequences (Fig. 1) was performed to construct the tree based on multiple sequence alignment using the neighbor-joining (NJ) method in MEGA 7 with 1 000 bootstrap replicates, including 115 in G. hirsutum, 115 in G. barbadense, 59 in G. arboretum, 56 in G. raimondii, 42 in A. thaliana, 83 in G. max, 46 in O. sativa, 46 in P. trichocarpa, 24 in T. cacao, 17 in V. vinifera and 59 in Z. mays (Fig. 2). According to the number of domains in the genes, we divided the ZAT ZAT genes into 5 clades, and each clade can be divided into different classes (Fig. 2B). Among them, there were 2 classes in clade I, 6 in clade II, 2 in clade III, 1 in clade IV, and one in clade V (Fig. 2B). Among them, clade II had the most subgroups. As shown in the phylogenetic tree, we found that the gene distribution in subfamily Ia was the most extensive. There were 25 GhZAT genes in subfamily Ia, accounting for 21.7% of all GhZAT genes. The sequence similarity of the proteins between cotton and T. cacao was higher than that of any other plant species, which was consistent with a previous study that cotton and T. cacao were originated from the same ancestor. All species had gene pairs derived from the same node, demonstrating that the ZAT genes in all species had experienced gene duplication events that causing the expansion of the ZAT gene family. However, gene duplication among different groups varies from each other in different species. These results indicated that gene duplication was the main contributor toward ZAT gene family expansion in all plant species.
In addition, in order to study the relationship between the common ancestors of diploid cotton (G. arboretum and G. raimondii) and allotetraploid cotton (G. hirsutum and G. barbadense), we constructed NJ tree of four cotton species (Fig. 2A). In general, 345 ZAT genes in diploid cotton and allotetraploid cotton can be divided into four clades according to their number of domains. We can see that each branch contained genes from diploid and allotetraploid cotton species (Fig. 2A).
Analysis of chromosomal localization
To further study the genetic divergence and duplication events of the ZAT gene family, we constructed chromosomal maps of 345 ZAT genes from G. hirsutum, G. barbadense, G. arboretum and G. raimondii (Fig. 3). 334 out of 345 ZAT genes were distributed to their specific chromosomes, while the remaining only 11 ZATs were allocated to the unmapped scaffold. The distribution of ZAT genes on 13 chromosomes of different cotton species was uneven, and the number of each chromosome did not show a significant positive correlation with its length.
Among 115 identified ZAT genes of G. hirsutum, 56 genes were distributed on 12 chromosomes of At-subgenome (GhAt) while two genes were found at the scaffold, 53 genes were positioned on 12 chromosomes of Dt-subgenome (GhDt) while four genes at the scaffold (Fig. 3A, B). In G. hirsutum, Chr5, Chr7, Chr8, Chr10, Chr12 in GhAt had relatively more ZAT genes, including 7, 7, 6, 6, and 7 genes, respectively; while Chr5, Chr7, Chr8, Chr9, Chr10 in GhDt had relatively more ZAT genes, including 5, 6, 5, 5, 5 and 7 genes, respectively (Fig. 4). Furthermore, the number of genes located on the chromosomes of GhAt was not the same as that on its homologous chromosome of GhDt except GhA01-GhD01, GhA11-GhD11, GhA12-GhD12, GhA13-GhD13. Interestingly, we found that there was no GhZAT gene on Chr6, which may be related to the chromosome deletion of G. hirsutum or the translocation of large fragments during the evolution. Similarly, the number of genes located on the chromosomes of GhAt/GhDt subgenomes as compared with their homologous chromosome of A/D genome (G. arboreum/G. raimondii) was almost the same except GhAt01-A01, GhAt02-A02, GhAt03-A03, GhAt07-A07, GhAt08-A08, GhAt11-A11, GhDt03-D03, implied that some ZAT genes had lost or added during the evolution. At the same time, the uneven and random distribution of genes indicated that gene loss may occur during the evolutionary process, but it may also be the result of incomplete genome assembly.
Among 115 identified ZAT genes of G. barbadense, 57 genes were distributed on 13 chromosomes of the At-subgenome (GbAt) while only one gene was found at the scaffold, 55 ZAT genes were positioned on 13 chromosomes of Dt-subgenome (GbDt) while two genes were found at the scaffold (Fig. 3C, D). In G. barbadense, Chr5, Chr7, Chr10, and Chr12 in GbAt had relatively more ZAT genes, including 7, 7, 7, and 6 genes, respectively; while Chr7, Chr10, Chr12 in GbDt had relatively more ZAT genes, including 7, 7, 6 genes, respectively (Fig. 4). In addition, the number of genes located on the chromosomes of GbAt was not the same as that on its homologous chromosome of GbDt except GbA01-GbD01, GbA02-GbD02, GbA06-GbD06, GbA07-GbD07, GbA09-GbD09, GbA10-GbD10. We also found the number of genes on the GbAt/GbDt subgenome with the homologous chromosome A/D genome (G. arboreum/G. raimondii) was not the same, except for GbAt01-A01, GbAt02-A02, GbAt03-A03, Gbt06-A06, Gbt07-A07, GbAt09-A09, GhAt12-A12, and GhDt03-D03, which was similar to the situation of G. hirsutum.
Among 59 identified ZAT genes of G. arboreum, 57 genes were distributed on 13 chromosomes while two genes were found at the scaffold (Fig. 3E). Among 56 identified ZAT genes of G. raimondii, 56 genes were distributed on 13 chromosomes while no gene was found at the scaffold (Fig. 3F). In G. arboreum, Chr7, Chr8, Chr12 had relatively more ZAT genes, including 7, 6, 6 genes, respectively; while Chr1, Chr8, Chr9 in G. raimondii had relatively more ZAT genes, including 7, 7, 6 genes, respectively (Fig. 4).
Analysis of conserved protein motifs and gene structure
Diversity of gene structure and differentiation of conserved motifs were possible mechanisms for the evolution of polygenic families (Hu et al. 2010). We identified the conserved motifs of ZAT protein using the online website MEME and found 10 conserved motifs (named motif 1 to motif 10), the results were represented in schematic diagrams (Fig. 5B). ZAT genes in the same clade had similar motif composition, and the different composition of motifs indicating the diversity of gene function, which was particularly obvious in the class IIa and class IIb subfamily. Similar motif arrangements in the same subgroup revealed that protein structure was conserved in a specific subgroup, and the functions of most conserved motifs remain to be clarified. There were 1 ~ 11 motifs in each GhZAT gene, and all GhZAT genes had a conservative protein motif distribution. Except for the genes of subclasses Ia and Ib, all others had a conserved motif 3. Interestingly, in addition to the GhZAT93 gene, the genes of subclass Ia contained a special motif 8, while the other subclasses did not have, which may be the reason that subclass Ia had obtained the special motif 8 through evolution or other subfamilies had lacked specific conserved motifs during evolution.
In general, genes in the same subclass had similar intron-exon arrangements in terms of intron number and exon length (Malik et al. 2020). As can be seen from Fig. 5C, most genes had no introns except for a few genes. The exon length of the sequence encoded by the gene had little difference, indicating that the genes were somewhat conserved.
In addition, 8 genes had only one intron, one gene had three introns, and many genes (106 of 115) had no introns. Fewer introns in the ZAT gene family indicated that GhZAT genes were less likely to undergo alternative splicing. Exon/intron analysis along with phylogenetic tree results displayed their random distribution among GhZAT genes of different clades and showed that similar genes were clustered close to one another in the same group of phylogeny trees. It was worth noting that closely related genes were more similar in gene structure, but not necessarily the same in exon/intron length.
Gene duplication and collinearity analysis
Previous studies had shown that gene duplication was one of the major drivers of the evolution of genomes and genetic systems (Moore and Purugganan 2003). Whole-genome duplication, segmental duplication, and tandem duplication were considered to be the main causes of plant gene family expansion (Xu et al. 2012). Segmental duplication produced multiple genes by polyploidy and chromosomal rearrangement (Yu et al. 2005), and it occurred in plants because most plants were diploid or polyploid and retained many duplicate chromosome blocks in their genomes (Cannon et al. 2004). Tandem duplication was characterized by multiple members of a family in the same or adjacent intergenic region (Ramamoorthy et al. 2008). A gene cluster usually contained two or more genes of chromosome regions in a 200 kb region (Holub 2001). Gene duplication can be large-scale whole-genome duplication (WGD), and can also be a small series of segmental duplication and tandem duplication (Ramsey and Schemske 1998). In this study, we performed to analyze the gene duplication and syntenic relationship from G. hirsutum, G. barbadense, G. arboreum, and G. raimondii using MCScanX (Fig. 6). We used the MCScanX software to analyze the gene pairs of tandem duplication, segmental duplication, and whole-genome duplication. The genes which lied on the same chromosomal block (e-value <1e− 5) were categorized as tandem duplicated while the remaining from the same genomes were considered as segmental and else of all from different genomes and sub-genomes of the four Gossypium species were allocated in the whole-genome duplication (Malik et al. 2020). A total of 1 629 gene pairs in the four cotton species were identified as duplicated genes pairs, of which 19 pairs were tandem duplication, 386 pairs were segmental duplication, and 1 224 pairs were whole-genome duplication (Supplementary Table S2). A total of 19 tandem repeat gene pairs were identified in the four cotton species, of which 2 were Ga-Ga (GaZAT21/22, GaZAT36/37), six were Gh-Gh (GhZAT20/21, GhZAT34/35, GhZAT41/42, GhZAT77/78, GhZAT90/91, GhZAT95/96), 8 were Gb-Gb (GbZAT22/23, GbZAT36/37, GbZAT42/43, GbZAT44/45, GbZAT77/78, GbZAT91/92, GbZAT97/98, GbZAT99/100), three were Gr-Gr (GrZAT1/2, GrZAT24/25, GrZAT48/49). Therefore, tandem duplication played an indispensable role in the evolution of the ZAT gene family.
In addition to tandem duplication, other replication mechanisms may also contribute to the extension of genes. It was suggested that parallel homologous genes were usually produced by segmental duplication and the whole-genome duplication before polyploidy, especially segmental duplication was the most significant factor in evolution (Malik et al. 2020). Segmental duplication of the ZAT genes in four cotton species was analyzed and 386 gene pairs were identified. There were 49, 150, 152, and 35 gene pairs among Ga-Ga, Gh-Gh, Gb-Gb, and Gr-Gr, respectively. In addition, there were 185, 241, 250, 254, 190, and 104 gene pairs between Gh-Ga, Gh-Gr, Gh-Gb, Gb-Gr, Gb-Ga, and Ga-Gr, respectively. From these results, we presumed that tandem duplication, segmental duplication, and whole-genome duplication had an impartment effect on the evolution of the ZAT gene family.
Calculation of non-synonymous (Ka) to synonymous (Ks) substitution rates
To investigate the differentiation mechanism of ZAT genes after polyploid duplication in cotton, Ka, Ks, and Ka/Ks of 1 332 homologous gene pairs from 10 combinations of four Gossypium species were determined, including Ga-Ga, Ga-Gb, Ga-Gr, Gb-Gb, Gb-Gr, Gh-Gh, Gh-Ga, Gh-Gb, Gh-Gr and Gr-Gr (Supplementary Table S3). Ka/Ks ratio was used to infer the selection pressure of duplicated gene pairs. Ka/Ks < 1 was considered as a purification selection, indicating that natural selection eliminated harmful mutations and kept the protein unchanged, Ka/Ks > 1 was considered as positive selection, indicating natural selection changed the protein, the mutation site was quickly fixed in the population, and the evolution of the gene was accelerated, Ka/Ks = 1 was considered as neutral selection, indicating that natural selection did not affect mutation (Hurst 2002).
Over all, we found 1 332 duplicated gene pairs in this analysis from four Gossypium species (Ga, Gr, Gh, and Gb). Among them, 1 288 (96.7%) duplicated gene pairs with Ka/Ks ratio < 1 including 1 045 gene pairs with Ka/Ks < 0.5 and 243 gene pairs between 0.5 and 0.99, exhibiting pure selection. Only 44 (3.3%) duplicated gene pairs with Ka/Ks > 1 and these gene pairs might have experienced relatively rapid evolution following duplication and underwent positive selection pressure. Studies had shown that sub-functional duplicated genes generally differentiated at the transcriptional level first and tended to be subject to strong purification (Roulin et al. 2013). In general, gene duplication will produce two copies, on the one hand, under low selective pressure, the gene will evolve, on the other hand, it will produce some new functions, which can contribute to the evolution of the species.
Analysis of promoters and differentially expressed genes
To further understand the response of GhZAT genes to abiotic stress, we analyzed their expression under cold, heat, salt, and PEG stress using RNA-seq data and cis-acting elements of promoters under abiotic stress. Considering the importance of cis-acting elements in abiotic stress, we selected the region of 2 kb upstream of the start codon of the ZAT genes, then we used PlantCARE to study the cis-acting elements that may be involved in gene expression, and extracted the GhZAT gene promoter region which was associated with stress and plant hormone. In this study, most GhZAT genes in G. hirsutum contained cis-acting elements related to plant hormones (ABA, MeJA, GA, IAA) and various stresses (low temperature, light stress, and wound stress) (Fig. 7B), and genes in the same subfamily were functionally diverse despite the similar motifs. In terms of plant hormone response, we found the cis-acting elements including GARE-motif (gibberellin responsive elements), ERE (ethylene response), ABRE (involved in abscisic acid responsiveness), AuxRR-core (cis-acting regulatory elements involved in auxin responsiveness), and CGTCA-motif (cis-acting regulatory element involved in MeJA responsiveness). Almost all the promoters were found to contain several hormones responsive elements except GhZAT25, GhZAT23, GhZAT100, but we did not find any close relationship of hormone-responsive elements with their subfamily. Therefore, GhZAT genes may be involved in regulating the growth and defense responses of cotton through specific hormonal signaling pathways.
Gene expression patterns can provide important references for functional analysis of genes, and gene expression was related to biological functions controlled by cis-acting elements. We detected the expression pattern of GhZAT genes in different times (1 h, 3 h, 6 h, 12 h) under different stresses (cold, hot, salt, and PEG) by analyzing RNA-seq data (Fig. 7C). The results showed that the genes of the IIb and IIe subfamily all responded to cold, heat, salt, and PEG stress, while the expression levels of other subfamily genes also changed, but the expression levels were not as high as these two types. The gene expression patterns of the same subfamily were almost the same, but the gene expression patterns of the same subfamily were slightly different. These results proved that GhZAT genes were involved in plant stress response.
Cotton plants with GhZAT gene silenced by VIGS were sensitive to alkaline stress
We also explored whether ZAT genes responded to alkaline stress. We selected a high-expressed gene GhD03G1247.1 (GhZAT67) from transcriptome data treated with alkaline stress (125 mmol·L–1 NaHCO3 treated for 12 h) for further study. VIGS vector pYL156: GhZAT67 was constructed using In-Fusion technology to study their functions under NaHCO3 stress. Afterward, the cotton carrying pYL156: PDS appeared albino, indicating that VIGS silencing was successful (Fig. 8A). qRT-PCR was used to detect the gene silencing effect, and the results showed that the GhZAT67 expression level of pYL156: GhZAT67 was significantly lower than that of the control pYL156 (Fig. 8B). After treatment with 125 mmol·L–1 NaHCO3, we found that the pYL156: GhZAT67 cotton seedlings wilted more severely than the pYL156. Therefore, GhZAT67 was also involved in the adaptability of cotton to alkaline stress.