Association mapping and domestication analysis to dissect genetic improvement process of upland cotton yield-related traits in China

Cotton fiber yield is a complex trait, which can be influenced by multiple agronomic traits. Unravelling the genetic basis of cotton fiber yield-related traits contributes to genetic improvement of cotton. In this study, 503 upland cotton varieties covering the four breeding stages (BS1–BS4, 1911–2011) in China were used for association mapping and domestication analysis. One hundred and forty SSR markers significantly associated with ten fiber yield-related traits were identified, among which, 29 markers showed an increasing trend contribution to cotton yield-related traits from BS1 to BS4, and 26 markers showed decreased trend effect. Four favorable alleles of 9 major loci (R2 ≥ 3) were strongly selected during the breeding stages, and the candidate genes of the four strongly selected alleles were predicated according to the gene function annotation and tissue expression data. The study not only uncovers the genetic basis of 10 cotton yield-related traits but also provides genetic evidence for cotton improvement during the cotton breeding process in China.


Background
Cotton is one of the most important industrial crop in the word, which has been cultivated for over 7 000 years, providing the important raw material for the textile industry (Fang et al. 2017a;Maik et al. 2015). Among the four cultivated cotton species, Gossypium hirsutum (upland cotton) is the most widespread species due to the high adaptability and yield, which takes up approximately 95% of cotton production in the word (Chen et al. 2007). Hence, the genetic improvement of upland cotton to increase cotton fiber yield is one of the most important goals of cotton breeding.
Fiber yield is a complex trait, which can be influenced by multiple traits, including seed cotton weight (SCW), lint weight (LW), lint percentage (LP), effective boll number (EBN), plant height (PH), first fruit spur height (FFSH), fruit spur branch number (FSBN), first fruit branch position (FFBP), flowering period (FP), and whole growth period (WGP) (Li et al. 2018b;Sun et al. 2018). Unravelling the genetic basis of cotton fiber yield-related traits contributes to cotton fiber production increase. Based on linkage analysis, many QTLs for cotton yield traits have been identified (An et al. 2010;Deng et al. 2019;Gore et al. 2014;Liu et al. 2012). However, the construction of linkage population used for fine mapping, such as recombinant inbred lines (RIL), and near-isogenic line (NIL), is a time-consuming process (Ma et al. 2020;Li et al. 2018a;Zhang et al. 2020). In recent years, genomewide association studies (GWAS), which is based on linkage disequilibrium (LD), has been widely used in plants to identify various traits related QTLs by using the natural populations Mengistu et al. 2016;Yang et al. 2014). In cotton, GWAS has been conducted for the genetic dissection of yield-related traits. Two hundred and fifty-one significant loci were detected to be associated with lint yield yields by using 651 simple sequence repeats (SSRs) and 323 accessions of Gossypium hirsutum L. (Jia et al. 2014). Thirteen cotton yield-related QTLs and 44 cotton fiber quality related QTLs were detected, respectively, by using 198 SSR markers and 302 elite upland cotton accessions (Ademe et al. 2017). Forty-three marker loci were detected to be associated with cotton yield traits by using 201 pleomorphic markers and 403 accessions (Dong et al. 2018). The identification of cotton yieldrelated QTLs lays the foundation for gene cloning and marker assisted selective (MAS) breeding.
The natural population resources possess widely genetic variation, which are suitable to study species domestication, such as maize (Hufford et al. 2012;Xue et al. 2016), rice (Zheng et al. 2019), soybean Zhou et al. 2015), tomato (Soltis et al. 2019), watermelon , and cotton (Du et al. 2018;Nie et al. 2020). During crop breeding process, the favorable alleles were constantly enriched, so the varieties collected from different breeding stages could help us uncover the artificially selected loci. In this study, 503 upland cotton varieties covering main breeding stages in China were used to construct a natural population. Ten cotton yield-related traits were investigated in multiple environments. An association analysis was performed based on best linear unbiased predictions (BLUP) data from multiple environments, and 176 polymorphic SSRs, after which, the identified loci were used for domestication analysis and favorable alleles selection. Three major loci for LP and one loci for FS and WGP were strongly selected during these breeding stages. This study aimed to explore the genetic architecture of 10 cotton yieldrelated traits, and to uncover the genetic improvement during the cotton breeding process in China.

Field experiments and phenotype data collection
Ten cotton yield-related traits of the 503 cotton inbred cultivars were measured under multiple environments. Five phenotypic data, including SCW, LW, LP, EBN, and PH were collected from 4 locations in China (Shihezi (SHZ), Xinjiang; Kuerle (KEL), Xinjiang; Yuanyang (YY), Henan; Huanggang (HG), Hubei) in 2012 and 2013. The phenotypic data of FFSH, and FSBN were collected from these locations in China (SHZ, KEL and YY), too, in 2012 and 2013. The phenotypic data of FFBP, FP, and WGP were collected from 2 locations (SHZ and YY) in 2012 and 2013. The experiment followed a randomized complete block design with single row plot and two replications.

Phenotypic data analysis
The variance, correlation, and repeatability analysis of the phenotypic data in multiple environments were conducted by R programming language.
Best linear unbiased predictions (BLUP) were used to estimate phenotypic traits across multiple environments based on a linear model by R software (http://www.rproject.org).
The basic statistical analysis of phenotypic data, including minimum (Min), maximum (Max), mean, standard deviation (SD), and coefficient of variation (CV), were conducted by IBM SPSS Statistics ver. 21.

Association analysis
One hundred and seventy-nine polymorphic SSR markers covering the whole genome were taken from the previous study, where the linkage disequilibrium and population structure analysis were also done (Nie et al. 2016). TASSEL V2.1 software was used to detect the association relationship between BLUP phenotypic data and genotypic data in three models, including GLM (P + G) + Q, GLM (P + G) + PCA and MLM (G + P + Q + K).
The P values of markers associated with QTLs were regulated by the method of multiple testing correction by controlling the false discovery rate (Benjamini and Hochberg 1995).
Effect and evolution analysis of the associated markers in four breeding stages, and the selection of favorable alleles for major loci The effect of the associated markers were evaluated in the four breeding stages based on the ten traits. Nine major loci (R 2 ≥ 3) were used for accumulation of favorable alleles analysis. The effects of different alleles were evaluated by using the phenotypic data, after that, the favorable alleles were used for frequency analysis in the four breeding stages, as well as the favorable alleles carrier materials selection.

Candidate gene annotation and prediction
The four strongly selected alleles associated to cotton yield-related traits were mapped to the TM-1 genome (Gossypium hirsutum, Huazhong Agricultural University (HAU) assembly) ). The candidate regions for marker-trait loci were set around the LD decay distance as 400 kb. The genes in the candidate regions were used for key candidate genes selection by using annotation.

Gene expression patterns
The tissue expression levels of the candidate genes were obtained from Huazhong Agricultural University (unpublished).

Results
Performances of cotton yield-related traits of the 503 upland cotton germplasm resources in multiple environments BLUP was used to determine the phenotypic data of cotton yield-related traits in multiple environments (Table  S3), and the BLUP data of ten yield-related traits was used for association analysis. The average phenotypic coefficient of variation (CV) for 10 yield traits ranged from 2.85% (WGP) to 12.31% (FFSH) ( Table 1). The highest CV was observed in FFSH (12.09%), the lowest in WGP (2.28%). The highest heritability was in LP (0.93), and the lowest in FSBN (0.46), ranging from 0.75 to 0.87 in the other seven traits ( Table 1).
The phenotypic trends of yield-related traits in different environments were shown in The correlations between two environments were obtained for the ten cotton yield-related traits (Fig. 2). Among the 10 yield-related traits, the average correlations between environments were ranked as LP (0.62) > FP (0.57) > WGP (0.51) > FFBP (0.45) > LW (0.43) > FFSH (0.38) > SCW (0.32) > PH (0.31) > FSBN (0.16) > EBN (0.15). It was with far more interest to further analyze one trait between environments. For example, the correlations of FFSH ranged from 0.16 (FFSH_ 12KEL and FUHML_12YY) to 0.75 (FFSH_12SHZ and FUHML_13 SHZ). In addition, the average correlation for FFSH in the same site in different years was 0.57, while the average correlation for FFSH in the same year from different sites was 0.34 ( Fig. 2).

Molecular genetic diversity and population structure
A total of 179 polymorphic markers obtained from the previous study (Nie et al. 2016) were used for genetic analysis, which contained 426 allele loci. The average genetic similarity coefficient variation among the 503 cultivars was 0.552 (Nie et al. 2016). The linkage disequilibrium (LD) of this population was analyzed using 179 SSR markers. Based on r 2 estimates, only 2.09% (r 2 ≥ 0.05) and 1.30% (r 2 ≥ 0.1) of the markers showed significant LD, which is suitable for association mapping (Nie et al. 2016).
Population structure was determined by three methods. e.g., PCA (Principal component analysis) plots, Nei's genetic distance, and STRUCTURE software, in the previous study, and the population were divided into 7 subgroups (Nie et al. 2016).

Association mapping of cotton yield-related traits
The association analysis was carried out based on the genotypic data, PCA matrix, genetic relationship matrix and BLUP data of ten cotton yield-related traits (Bradbury et al. 2007;Jakobsson and Rosenberg 2007). Through association analysis, the effects of GLM (P + G) + Q, GLM (P + G) + PCA and MLM (G + P + Q + K) models in association analysis were compared (Fig. S1). For LW and EBN, the above three models had similar effect in controlling population structure; For PH, FFBP, FP and WGP, GLM-Q and MLM-Q-K models were better than GLM-PCA. For SCW, FFSH and FSBN, GLM-PCA model was better than GLM-Q and MLM-Q-K. For LP, GLM-Q model was better than GLM-PCA and MLM-Q-K model. According to the results, the effect of the MLM-Q-K model in the association analysis is more stable than other two models, and the effect of the 10 yield-related traits in controlling population structure is smaller. Therefore, the MLM-Q-K model was suitable for controlling the population structure of ten yieldrelated traits. After filtering the minimum alleles (≤5%), 179 polymorphic SSR markers (426 alleles) were used for association analysis based on MLM-Q-K model. At the P ≤ 0.05 and P ≤ 0.01 levels, 140 (78.21%) and 45 (25.14%) markers were associated with yield-related traits, respectively (Table S4). An average of 5.4 associated markers were detected on each chromosome, ranging from 1 to 11, with the maximum of 11 markers on chromosomes D05 (Fig. 3). At P ≤ 0.05 level, the phenotypic variation interpretation rates (PVE) of the associated markers ranged from 0.48% (NAU4884a and NAU3468a) to 3.89% (NAU3377db) with an average of 1.30%, while at P ≤ 0.01 level, the PVE ranged from 1.03% (MON-CGR5866a) to 3.89% (NAU3377db), with an average of 2.14% (Table S4). In addition, one marker was generally associated with multiple traits (Fig. 3). For example, NAU2095 (on Chromosome A01) was associated with FFSH, PH, FSBN, WGP, LP, and EBN; MON_ (on Chromosome D02) was associated with PH, FFSH, FFBP, WGP, and FP at an extremely significant level (P < 0.01); HAU1355 (on Chromosome D13) was simultaneously associated with PH, FSBN, and LW; NAU3084 (on Chromosome D12) was associated with PH, FFSH, SCW, LW, and EBN (Fig. 3, Table S4).

Effect value analysis of associated markers in four breeding stages
According to the history of cotton breeding stages in China, the 503 upland cotton germplasm resources were divided into 4 groups (Table S2), from breeding stage1 (BS1) to breeding stage 4 (BS4), which represented abroad variation in each experiment site. The ten cotton Fig. 3 The distribution of the SSR markers associated with the yield-related traits on 26 chromosomes of upland cotton yield-related traits were compared among the four breeding stages. According to the results, the BS4 almost represented the highest level of 9 traits, except WGP, which showed a decreased trend during the breeding process; LW, LP, PH, and FFSH showed an increasing trend from BS1 to BS4, while the others showed little changes or random fluctuations during the four breeding stages (Fig. 4).
To further uncover the effect of the associated markers on the ten cotton yield-related traits, the phenotypic effect value of each locus in each breeding stage was assessed (Fig. 5). For 48 associated markers of SCW, the average phenotypic effect value of the allele loci in each stage (BS1-BS4) was − 0.11, − 0.003, 0.01, − 0.003, respectively (Table S5). The phenotypic effect range of each locus in each stage was − 4.96 to 0.31, − 0.44 to 0.38, − 0.37 to 0.24, and − 0.21 to 0.27, respectively (Table S5). As shown in Fig. 5a, the phenotypic effect showed an upward trend during the four breeding stages in the marker loci of HAU0590a (− 0.09 to 0.02), NAU6966ba (− 0.03 to 0.04), and NAU3607a (− 4.96 to 0.27), while NAU2858a (− 0.08 to 0.05) showed a downward trend (Fig. 5a, Table S5).
Among the associated markers of PH, the phenotypic effect of DPL0475a showed an upward trend, while MON_CGR5167b and BNL3790c showed a downward trend (Fig. 5e). Among the associated markers of FFSH, the phenotypic effect of HAU0211c showed an upward trend, while MON_CGR5423e showed a downward trend (Fig. 5f). Among the associated markers of FSBN, the phenotypic effect of BNL3347a and MON_ CGR5867c showed an upward trend, while HAU2835b, HAU1355a, MON_CGR5866b, HAU0197a, HAU1434b and DPL0461c showed downward trend (Fig. 5g). Fig. 4 The statistic analysis and comparison of the ten cotton yield-related traits in the four breeding stages  Among the associated markers of FFBP, the phenotypic effect of MON_DPL0544e showed an upward trend, while MON_CGR5866a and CK98221_PR_SSc showed a downward trend (Fig. 5h). Among the associated markers of FP, the phenotypic effect of MON_ DPL0544e, MON_CGR5113e, HAU4022b, MON_ CGR5399c, and DPL0475b increased from BS1 to BS4, while NAU3468a and NAU3966b decreased (Fig. 5i). Among the associated markers of WGP, the phenotypic effect of MON_DPL0544e increased from BS1 to BS4, while phenotypic effect of MON_CGR5866a, NAU2858a and NAU3966b decreased.
Accumulation of favorable alleles for major loci in four cotton breeding stages and typical carrier materials Nine major loci (R 2 ≥ 3) were used for accumulation of favorable alleles analysis, including three loci for LP (NAU2671, NAU3774, and MON-CGR6356), three loci for PH (NAU3377, MON-CGR5113, and NAU2095), one locus for FFSH (HAU4022), one locus for FP (HAU1434), and one locus for WGP (HAU1434). NAU2671 has three alleles, i.e., NAU2671a, NAU2671b, and NAU2671e, among which, NAU2671a was the favorable allele (Fig. 6a), showing the highest FP. The frequency of NAU2671a showed an increasing trend during the four breeding stages (0.62 to 0.79) (Fig. 6a). NAU3774 has three alleles, too, i.e., NAU3774a, NAU3774b, and NAU3774d, among which, NAU3774a was the favorable allele. The frequency of NAU3774a showed an increasing trend during the four breeding stages (0.19 to 0.70) (Fig. 6b). MON-CGR6356 also had three alleles, i.e., MON-CGR6356a, MON-CGR6356b, and MON-CGR6356c, among which, MON-CGR6356a was the favorable allele. The frequency of MON-CGR6356a showed an increasing trend during the four breeding stages (0.71 to 0.92) (Fig. 6c). NAU3377c/d are favorable alleles of PH, which could increase the plant height. The frequency of NAU3377c/d showed a certain fluctuation, and the frequencies of BS3 and BS4 were slightly decreased (Fig. 6d). MON-CGR5113a/d were favorable alleles of PH. The frequency of MON-CGR5113a/d showed an increasing trend from BS1 to BS3, but a little decrease in BS4 (Fig. 6e). NAU2095a is a favorable allele of PH, and the frequency of which showed an irregular fluctuation among the four breeding stages, while the frequencies of BS3 and BS4 were slightly increased. HAU4022a/c were favorable alleles of FFSH, the frequency of which was relatively stable among the four breeding stages. HAU1434a/c were favorable alleles of both FP and WGP, which showed an increasing trend from BS1 to BS4 (0.80 to 0.95) (Fig. 6i).
Additionally, the typical carrier materials possessing the favorable allele were shown in Table S6. For LP, the umber carrier materials with three favorable alleles (NAU2671a, NAU3774a, and MON-CGR6356a) was 136, such as ZY1, ZY27, ZY28, and ZY44. For PH, there were 19 materials carried three favorable alleles (NAU3377c/d, MON-CGR5113a/d, and NAU2095a), such as ZY13, ZY55 and ZY71. For FFSH, there were 473 materials carrying the favorable alleles HAU4022a/c. For FP and WGP, there were 429 materials carrying the favorable alleles HAU1434a/c. However, there were only three materials carrying 8 favorable alleles, including ZY92, ZY398, and ZY87 (Table S6). ZY92 was derived from the BS3, while ZY398 and ZY87 were derived from BS4.

Candidate gene annotation and prediction
Four artificially selected major loci during the breeding process, including three loci of LP (NAU2671, NAU3774, and MON-CGR6356), one locus of FP and WGP (HAU1434), were used for candidate gene annotation and prediction. Candidate genes were listed within the candidate regions of loci, which were set around the LD decay distance as 400 kb. There were 34, 42, 20 and 22 candidate genes for 4 loci (NAU2671, NAU3774, MON-CGR6356, and HAU1434), respectively (Table  S7). One key candidate gene for each locus was predicted according to the functional annotation. For NAU2671 of LP, Ghir_A12G024420 encoded a homeobox protein BEL1, which was expressed in all tested tissues (Fig.  7a); in Arabidopsis, BEL1-LIKE HOMEODOMAIN6 was involved in secondary cell wall formation of interfascicular fiber (Liu et al. 2014). For NAU3774 of LP, Ghir_D12G026070 encoded a Cytochrome B5 protein, which was highly expressed in the developing fibers (Fig. 7b); in Arabidopsis, Cytochrome b5 worked as an obligate electron shuttle for syringyl lignin biosynthesis, which also played an important role in the cotton fiber development (Gou et al. 2019). For MON-CGR6356 of LP, Ghir_ A01G016840 encoded auxin-induced 5NG4 protein, which was expressed specially in the developing fibers (Fig. 7c); auxin-induced protein 5NG4 was associated with culm cellulose content in bread wheat (Kaur et al. 2017). For HAU1434 of FP and WGP, Ghir_ A12G013050 encoded a SBP domain protein, which was highly expressed specially in the floral organ (Fig.  7d); SBP domain proteins were widely reported to be involved in flowering initiation and flower development (Cardon et al. 1997;Hou et al. 2017;Yamasaki et al. 2006).

Discussion
Extensive genetic variation and multiple environments contribute to QTL mapping Genome-wide association study using natural population is a powerful strategy to effectively fine map QTL due to a great number of historical recombination events that lead to the rapid decay of linkage disequilibrium (LD) (Flint-Garcia et al. 2003;Li et al. 2013). The population size and diversity of germplasm resources could affect the LD and the resolution of QTL. In our study, 503 germplasms derived from the United States, the former Soviet Union, and China were divided into 7 subgroups by using the SSR markers, which represent extensive genetic variations and are suitable for GWAS. In addition, quantitative traits are generally susceptible to environments, so the QTLs stable in multiple environments were more reliable, which can be used for gene cloning and marker assistance selection (MAS) (Raihan et al. 2016;Wang et al. 2015). In this study, the data of 5 traits were collected from 8 environments (4 typical cotton growing areas in China in 2 years), and BLUP phenotypic data were used to provide comprehensive multi-environmental accurate phenotypic values for association analysis. According to the results, the extensive genetic variation and multiple environments of the study make the QTLs more reliable.
In order to verify the reliability of the association mapping results, the associated markers were compared with the previous studies. As a result, some of the markers were consistent with previous studies. For example, an LP-and SCW-related loci, BNL3590 (Chr A02), was reported in two studies, where it was associated with SCW (An et al. 2010;Mei et al. 2013). MON_CGR5399 (Chr A10) was reported to be associated with LW in previous studies, while, in the study, it was associated with multiple traits, including LW, FP, PH, FFBP and SCW . NAU3377 (Chr D11) was reported to be associated with LP in previous studies, while in the study (Wang et al. 2007), NAU3377 (Chr D11) was not only associated with LP, but also with EBN, LW, PH, SCW. NAU3774 (Chr D12) was reported to be associated with LW, while in the study, NAU3774 was associated with both LW and LP . Additionally, there were more new loci identified in this study and could be used for further genetic analysis of cotton yield-related traits.
The artificial selection during the breeding process contributes to the increased cotton production As a natural fiber resource, Gossypium was cultivated about 7 000 years ago (Fang et al. 2017a). During the process of cultivation, its fiber productivity has been increased duo to natural genome polyploidization and artificial selection (Jiang et al. 1998;Yuan et al. 2015). Artificial selection could influence the allele frequency through the selection of preferred traits in the traditional breeding process (Luikart et al. 2003). Mutation and recombination are two important factors in determining the efficiency of selection (Nachman and Payseur 2012; Noor and Bennett 2009). The comparison of the genome sequence of Gossypium hirsutum CRI-12 family (including CRI-12, its parental cultivars, and progeny cultivars) revealed that 1 029 haplotype blocks might be recombined under artificial selection (Lu et al. 2019). Strong artificial selection during domestication has resulted in reduced genetic diversity but stronger linkage disequilibrium and higher extents of selective sweeps (Ma et al. 2019). In our study, LW, LP, PH, and FFSH showed significant increasing trends during the cotton breeding process, while WGP showed a significant decreased trend, which was just consistent with the evolution of cotton varieties in Xinjiang, China, and resulted in higher yield but shorter growing period cotton varieties. The effect and evolution analysis of markers associated with yield traits in the four breeding stages have been assessed. Twenty-nine markers showed an increasing trend contribution to cotton yield-related traits from BS1 to BS4, which could be caused by increasing favorable allele frequency during artificial selection breeding process. However, 26 associated markers showed decreased trend effect from BS1 to BS4, which showed that there was a great potential to gain cotton yield by increasing the favorable alleles of those loci. Additionally, the favorable allele frequency of major loci were detected, and three favorable alleles (NAU2671a, NAU3774a, and MON-CGR6356a) associated with LP were strongly selected during the cotton breeding process, which contributed to the increasing LP of cotton; one favorable allele associated with WGP was strongly selected, which contributed the decreasing WGP of cotton. According to the results, cotton breeding process is a genetic improvement to increase frequency and number of favorable alleles. However, there were still some favorable alleles that were not affected or negative selected during the evolution, which could be as a potential genetic resource for cotton genetic improvement in the future.
The loci identified lay the foundation for gene cloning and molecular breeding With the developing of genome sequencing technology, the reference genome sequences of Gossypium hirsutum have been well developed , which provide lots of useful information for candidate gene identification from the results of GWAS Nie et al. 2020). According to the reference genome, the SSR markers significantly associated with cotton yieldrelated traits could be anchored on the physical maps, which provided candidate regions based on LD decay distance ). The annotated genes in these regions could be used for further verification by functional annotation or using reverse genetics methods (Fang et al. 2017b;Li et al. 2017;Xu et al. 2020). In our study, at the level P ≤ 0.01, 45 markers were identified to be associated with cotton yield-related traits, among which, 8 major loci were used for favorable allele identification. Additionally, the materials which carried the favorable alleles were identified, which could be used as donors to improve the cotton yield, such as ZY92, ZY398, and ZY87, which carried 8 favorable alleles. Four strongly selected alleles were used for candidate predication according to gene functional annotation and tissue expression. As a result, three candidate genes were identified for LP, which were associated with fiber development according to the function of homologous in Arabidopsis and wheat (Gou et al. 2019;Kaur et al. 2017;Liu et al. 2014). One candidate gene was identified for FP and WGP, which was specifically expressed in flower organs, and the homologous in other plants were involved in flowering initiation and flower development (Cardon et al. 1997;Yamasaki et al. 2006).

Conclusion
In this study, 140 markers significantly associated with ten fiber yield-related traits were identified by using 503 upland cotton varieties covering the four breeding stages in China, among which, 29 loci showed an increasing trend contribution to cotton yield-related traits from BS1 to BS4, and 26 loci showed a decreased trend effect. Four favorable alleles of 9 major loci (R 2 ≥ 3) were strongly selected during the breeding stages, and the candidate genes of the four strongly selected alleles were predicated according to the gene function annotation and tissue expression data. The study not only uncovers the genetic basis of 10 cotton yield-related traits but also provides genetic evidence for cotton improvement during the cotton breeding process in China.

Funding
This work was supported by the National Natural Science Foundation of China (31760402), Young and Middle-aged Science and Technology Leading Talents of Xinjiang Production and Construction Corps (2019CB027).