
Genetic diversity and population structure of Gossypium arboreum L. collected in China



Gossypium arboreum is a diploid species cultivated in the Old World. It possesses favorable characters that are valuable for developing superior cotton cultivars.


A set of 197 Gossypium arboreum accessions was genotyped using 80 genome-wide SSR markers to establish patterns of genetic diversity and population structure. These accessions were collected from three major G. arboreum growing areas in China. A total of 255 alleles across the 80 markers were identified in the genetic diversity analysis.


Three subgroups were found by population structure analysis, corresponding to the Yangtze River Valley, North China, and Southwest China zones of G. arboreum cultivation in China. The average genetic distance and polymorphic information content of the G. arboreum population were 0.34 and 0.47, respectively, indicating high genetic diversity in the G. arboreum germplasm pool. The phylogenetic analysis concurred with the subgroups identified by Structure analysis, with a few exceptions. Variation among and within the three groups accounted for 13.61% and 86.39% of the total, respectively.


The information regarding genetic diversity and population structure from this study is useful for genetic and genomic analysis and systematic utilization of economically important traits in G. arboreum.


Cotton is the most important natural fiber crop in the world. The genus includes approximately 45 diploid (2n = 2× = 26) and 5 allotetraploid (2n = 4× = 52) species, distributed mostly in tropical and subtropical regions throughout the world (Fryxell 1992; Wendel and Albert 1992; Wendel and Cronn 2003). Tetraploid species, including Gossypium hirsutum and Gossypium barbadense, arose in the New World from inter-specific hybridization between an A-genome and a D-genome diploid species, which are believed to have originated from ancestors similar to modern G. herbaceum race africanum and G. raimondii, respectively (Stephens 1944; Seelanan et al. 1997; Brubaker et al. 1999; Liu et al. 2001; Chen et al. 2007; Sunilkumar et al. 2006). Diploid species (2n = 26) are classified into eight genomic groups (A-G and K), occurring naturally in Africa, Asia, America, and Australia (Wendel and Cronn 2003). Worldwide, four species are cultivated: two are diploids (2n = 2× = 26) and two are allotetraploids (2n = 4× = 52).

Gossypium arboreum is a diploid species cultivated in the Old World. It was first domesticated near the Indus Valley before 6000 BC (Hutchinson 1954; Fryxell 1979, 1992; Moulherat et al. 2002). The primitive G. arboreum was perennial, and was once considered to have evolved from the wild G. herbaceum in Africa (Hutchinson 1954). More recently, Wendel et al. have presented evidence that G. arboreum was domesticated independently, from a wild ancestor different from the one that gave rise to G. herbaceum (Wendel et al. 1989; Renny-Byfield et al. 2016). G. arboreum lost photoperiod sensitivity when it spread from western India to northern and eastern India (Hutchinson 1954). The annual types of G. arboreum could be grown over larger areas and evolved tolerance to diseases, pests, and frost (Silow 1944). Furthermore, environmental conditions associated with geographic distribution and domestication produced considerable variation, which has been classified into six races in different regions: soudanense, indicum, burmanicum, cernuum, bengalense, and sinense (Silow 1944; Brubaker et al. 1999).

Gossypium arboreum was introduced into China by various routes, and was domesticated as a local crop between the 7th and the 13th centuries (Watt 1907; Guo et al. 2006). Two primary routes of importation have been proposed: overland from Bengal-Assam to the Yellow River, and by sea from Indo-China to the Yangtze River Valley (Silow 1944). In the area south of the Five Ridges, on Hainan Island, and in Yunnan, G. arboreum was grown only as a garden plant until an extremely early-fruiting type was developed from Indian and Indo-Chinese varieties. After new weaving technology was brought to the Yangtze River Valley in the thirteenth century, various landraces were developed and widely cultivated in the middle and lower Yangtze River Valley, and then spread to Northern China, encouraged by Imperial edict, in the fifteenth century (Watt 1907; Silow 1944; Guo et al. 2006). The three major growing regions of G. arboreum, namely the Southwest region, the Yangtze River Valley, and the Northern region, formed gradually with the breeding of local varieties (Guo et al. 2006). The most important type, race sinense, was then developed in China and grown until it was completely replaced by Upland cotton (Gossypium hirsutum L.) in the 1950s (Huang 1996; Guo et al. 2006).

As the cultivated ‘Old World’ diploid cotton, Gossypium arboreum underwent natural and artificial selection under environmental stress, and evolved favorable characters that the tetraploid cultivars lack. These include drought tolerance, disease resistance, and insect pest resistance, which make it well adapted to biotic and abiotic stresses (Kantartzi et al. 2009; Mehetre et al. 2003), and spinnable fiber with various colors and high strength that is good for weaving (Park et al. 2005; Mehetre et al. 2003). These G. arboreum landraces with adaptive features are important genetic resources for the improvement of tetraploid cotton, and can help to develop cultivars with invaluable genes for early maturity, stress tolerance, and high fiber strength in cotton-breeding programs (Xiang 1988; Rahman et al. 2002; Mehetre et al. 2003; Liu et al. 2006). Understanding the genetic relationships among the landraces of G. arboreum would facilitate their efficient use in developing superior cotton cultivars with favorable agronomic traits.


Plant material

One hundred and ninety-seven accessions of Gossypium arboreum were collected from 19 provinces in China and preserved in the Gene Bank of the Institute of Cotton Research of the Chinese Academy of Agricultural Sciences. These accessions were cultivated in the main cotton growing areas of China, including the North region, the Yangtze River Valley and the Southwest region. Their accession numbers and passport data are listed in Additional file 1: Table S1. A panel of 24 accessions was selected to screen for polymorphic microsatellites for the analysis of diversity and structure of the natural population.

Genotyping with SSR markers

DNA was extracted from young, fully expanded leaves of each accession as described by Paterson and Smith (1999). SSR primer information was obtained from Cottongen. PCR was conducted in 10 μL volumes, which included 1.0 μL 10× Buffer (consisting of 20 mmol·L−1 MgSO4, 100 mmol·L−1 KCl, 80 mmol·L−1 (NH4)2SO4, 100 mmol·L−1 Tris-HCl, pH 9.0, 0.5% NP-40), 50 ng template DNA, 0.5 mmol·L−1 dNTP, 0.4 units of Taq DNA polymerase, and 0.5 μmol·L−1 forward and reverse primers. The PCR amplification program consisted of a 3 min pre-denaturation step at 95 °C; 30 cycles of 94 °C for 45 s, 57 °C for 45 s, and 72 °C for 1 min; and a 7 min extension at 72 °C. All reactions were run on a PTC-100™ thermocycler. The PCR products were stored at 4 °C before being run on an 8% non-denaturing PAGE gel (Sambrook et al. 1989). The gel was stained using the method of Zhang et al. (2000) and photographed using a SYNGENE gel system.

Data collection and analysis

The most intensely amplified band for each SSR locus was scored using a standard 50 base pair (bp) DNA marker (Takara Biotech, Dalian, China) as reference. The presence of an amplified fragment was scored as 1 and its absence as 0 for each SSR locus. Missing data were coded as “-9”. Diversity was calculated from the genotype data for 80 polymorphic SSRs in 197 individuals. SpaGeDi software was used to calculate allele frequencies (Hardy and Vekemans 2002). The polymorphic information content (PIC) and the genetic distance (GD) were estimated using the Powermarker software package version 3.25 (Liu and Muse 2005). Principal coordinate analysis (PCA) was done with NTSYS-pc software version 2.1 using the Dcenter and Eigen functions (Rohlf 2000). Analysis of molecular variance (AMOVA) among and within groups was performed using Arlequin ver 3.5 software (Excoffier and Lischer 2010).
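As an illustration of the per-locus statistics reported later (PIC and the number of effective alleles), the following minimal Python sketch computes both from a list of allele frequencies. It assumes the commonly used PIC formula, 1 − Σp_i² − ΣΣ 2p_i²p_j², and the usual definition of effective alleles, 1/Σp_i²; it is not taken from Powermarker or SpaGeDi, and the example frequencies are invented.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphic information content of one locus.

    freqs: allele frequencies of the locus, summing to 1.
    """
    homozygosity = sum(p * p for p in freqs)
    # Correction term over all unordered pairs of distinct alleles.
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross


def effective_alleles(freqs):
    """Effective number of alleles: reciprocal of expected homozygosity."""
    return 1.0 / sum(p * p for p in freqs)


# Hypothetical locus with three alleles (frequencies are illustrative only).
example = [0.5, 0.3, 0.2]
print(round(pic(example), 4), round(effective_alleles(example), 4))
```

A locus with a few evenly distributed alleles scores higher on both measures than a locus dominated by one allele, which is why the two statistics are reported alongside the raw allele count.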

Population structure analysis

Population structure was estimated using Structure version 2.3.4 (Pritchard et al. 2000) based on co-dominant genotypic data. The number of populations (K) tested ranged from 1 to 10. Structure was run under the admixture model, with the run length set to 100 000 and the number of replications after burn-in set to 10 000. Plots of the log-likelihood and ΔK (delta K) were then built to choose an appropriate value of K using the method of Evanno et al. (2005).
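The ΔK criterion can be sketched as follows. This is a minimal Python illustration, assuming that several replicate runs were made for each K (the number of replicates is not stated here) and using one common formulation: the absolute second-order difference of the mean log-likelihood, divided by the standard deviation of the log-likelihood across replicates at that K. The function name and all log-likelihood values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    lnp_by_k: dict mapping K to a list of log-likelihood values,
    one value per replicate run (at least two replicates per K).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # delta K is undefined at the ends of the tested range
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / spread
    return result


# Invented log-likelihoods for K = 1..5, three replicates each.
runs = {
    1: [-9000.0, -9002.0, -8998.0],
    2: [-8500.0, -8503.0, -8497.0],
    3: [-8000.0, -8001.0, -7999.0],
    4: [-7950.0, -7960.0, -7940.0],
    5: [-7900.0, -7915.0, -7885.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(best_k, dk)
```

In this toy data the log-likelihood keeps rising with K, but its rate of increase changes sharply at K = 3, so ΔK peaks there. ΔK cannot be computed for the smallest and largest K tested.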


SSR marker analysis

A total of 116 SSR primer pairs were selected to genotype all accessions. Of these, 24 were monomorphic and 12 could not be scored clearly. These 36 SSRs were discarded, leaving 80 SSR primer pairs for analysis. Accessions with more than 5% missing SSR data were also removed. In the end, 197 accessions and 80 SSR primer pairs were used for further analysis. Across these accessions, a total of 255 SSR alleles were detected, with an average of 3.2 alleles per SSR marker (range 2 to 6) (Table 1). The number of effective alleles varied from 1.1 to 4.8, with an average of 2.3 effective alleles per locus. A summary of marker statistics for all the accessions is presented in Additional file 2: Table S2.

Table 1 Summary of SSR polymorphisms

Population structure

The population structure of the 197 accessions was analyzed with the software Structure version 2.3.4. The log-likelihood increased with the value of K, and the plot of probability against K gave no clear indication of the number of subpopulations (Fig. 1a). The plot of ΔK was therefore built using the method described by Evanno et al. (2005) (Fig. 1b). The ΔK values gave a strong signal for three clusters. Of all accessions, 128 could be assigned to one of three subgroups based on a 60% membership threshold, and the remaining 69 accessions were considered to have admixed parentage. The subgroups are shown as three differently colored bar plots, each reflecting a single ancestral genetic background (Fig. 2). Detailed membership probabilities of all accessions are given in Additional file 1: Table S1.
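The assignment rule can be illustrated with a short Python sketch. It assumes that each accession has one membership coefficient per cluster (a row of the Q-matrix) and that an accession joins a group when its largest coefficient is at least 0.60; whether the threshold was inclusive is an assumption, and the Q-matrix rows below are invented.

```python
def assign_groups(q_matrix, threshold=0.60):
    """Label each accession with its dominant group, or 'mixed'.

    q_matrix: list of rows, each row holding the membership
    coefficients of one accession across the K clusters.
    """
    labels = []
    for row in q_matrix:
        best = max(range(len(row)), key=lambda i: row[i])
        if row[best] >= threshold:
            labels.append(f"Group {best + 1}")
        else:
            labels.append("mixed")
    return labels


# Hypothetical membership coefficients for three accessions (K = 3).
q_example = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
]
labels = assign_groups(q_example)
print(labels)
```

Raising the threshold moves more accessions into the mixed class, so the 128 versus 69 split reported above is specific to the 60% cut-off.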

Fig. 1

Analysis of the population structure. The number of groups was calculated using STRUCTURE software. a Graph for the log-likelihood. b Graph for ΔK

Fig. 2

Population structure. Bar plot of Q-matrix estimates for the accessions. Groups are represented in different colors (red for Group 1, blue for Group 2, green for Group 3)

The G. arboreum accessions assigned by Structure were separated into three subgroups. These three subgroups consisted of 42, 40 and 46 accessions, and were labeled Group 1, 2 and 3, respectively. They corresponded to the three traditional G. arboreum growing zones in China, namely the Yangtze River Valley, North China and Southwest China. Most of the accessions in Group 1 (Additional file 1: Table S1) were from the Yangtze River Valley, except seven accessions from Southwest China and one from North China, which suggests that they were selected and evolved to adapt to the local environment. Accessions in Group 2 (Additional file 1: Table S1) were mainly from North China, except nine accessions from the Yangtze River Valley and one from Southwest China. Almost all the accessions in Group 3 (Additional file 1: Table S1) were from Southwest China, except three accessions from the Yangtze River Valley and one from North China. Accessions with admixed parentage, and thus a mixed genetic background, were found in all three zones.

Genetic diversity

Genetic diversity for all accessions was analyzed using the Powermarker software package. The PIC value of the SSRs ranged from 0.17 to 0.79, with an average of 0.47 (Table 1). The genetic distance ranged from 0.02 to 0.55, with an average of 0.32 (Table 1). The highest genetic distance, 0.55, was between accession No.1, Guichi Xiaozimian Baizi, from the Yangtze River Valley and accession No.151, Donglan Changjing Zhongmian 1, from Southwest China. The lowest, 0.02, was between accession No.155, Changrong Zhongmian, from the Yangtze River Valley and accession No.157, Wangmo Sanglang Da Mianhua, from Southwest China. Among the three groups, Group 2 and Group 3 had the highest genetic distance (0.205), indicating that the accessions from North and Southwest China are genetically the furthest apart (Table 2), whereas Groups 1 and 3 had the lowest (0.196), reflecting closer descent between these two groups (Table 2). The mixed group showed similarly low genetic distances to all three groups, consistent with parentage from multiple regions. Within groups, Group 3 had the highest average genetic distance (0.308) and Group 1 the lowest (0.253).
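To make the distance measure concrete, the sketch below computes a genetic distance between two allele-frequency profiles in Python. It assumes that the distance used is the D_A distance of Nei et al., D_A = 1 − (1/m) Σ_loci Σ_alleles √(x·y), where m is the number of loci; this is an assumption based on the description of Table 2, and the function name and example profiles are hypothetical.

```python
from math import sqrt


def nei_da(profile_x, profile_y):
    """D_A distance between two profiles.

    Each profile is a list with one dict per locus, mapping an
    allele label to its frequency at that locus.
    """
    total = 0.0
    for freq_x, freq_y in zip(profile_x, profile_y):
        alleles = set(freq_x) | set(freq_y)
        # Alleles absent from one profile contribute zero to the sum.
        total += sum(
            sqrt(freq_x.get(a, 0.0) * freq_y.get(a, 0.0)) for a in alleles
        )
    return 1.0 - total / len(profile_x)


# Two hypothetical accessions scored at two SSR loci.
acc_a = [{"150bp": 1.0}, {"200bp": 0.5, "210bp": 0.5}]
acc_b = [{"150bp": 1.0}, {"200bp": 1.0}]
print(round(nei_da(acc_a, acc_b), 4))
```

The distance is 0 for identical profiles and 1 when no allele is shared at any locus, so the reported range of 0.02 to 0.55 means that even the most divergent pair of accessions still shares alleles at many loci.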

Table 2 Genetic distance estimates calculated using Nei et al. distance matrix within and between Gossypium arboreum groups identified by STRUCTURE analysis

A phylogenetic tree was constructed from the distance matrix using the Neighbor Joining (N-J) algorithm. Three major clusters were found in the dendrogram (Fig. 3). The dendrogram was also compared with the results of the Structure analysis. The three groups identified by Structure agreed with the clusters observed in the phylogenetic tree, which are shown with different colored lines (Fig. 3). Most of the lines grouped in one cluster of the phylogenetic tree came from the same Structure group. Although a few lines were placed differently by the clustering and by Structure, most accessions were grouped close to their pedigree parents.

Fig. 3

Neighbor-joining tree of 197 Gossypium arboreum accessions. Colors in the tree correspond to subgroups identified in Structure analysis (Red for Group 1, Blue for Group 2, Green for Group 3, Purple for mixed group)

The genetic relationships between the accessions were studied further using principal coordinate analysis (PCA) (Fig. 4). Accessions from each subgroup identified by Structure were spread out in three-dimensional space, with some overlap between accessions from different subgroups. The first two axes of the PCA accounted for 68.1% of the variation, which indicates that the level of genetic diversity between the subgroups is high enough for the identification of possible genes in G. arboreum germplasm.
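The two steps behind a principal coordinate analysis, double-centering the distance matrix and extracting eigenvalues, can be sketched in plain Python. This is an illustrative reimplementation under the usual assumption that squared distances are double-centered; it is not the NTSYS-pc code. Only the leading eigenvalue is found (by power iteration), and the three-point distance matrix is invented.

```python
def double_center(dist):
    """Gower double-centering of a symmetric distance matrix."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [
        [a[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
        for i in range(n)
    ]


def leading_eigenvalue(b, iterations=200):
    """Largest-magnitude eigenvalue of a symmetric matrix (power iteration)."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # non-constant start vector
    value = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
    return value


# Three hypothetical accessions lying on a line at positions 0, 1 and 2.
d = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
b = double_center(d)
trace = sum(b[i][i] for i in range(len(b)))
explained = leading_eigenvalue(b) / trace
print(round(explained, 4))
```

The share of variation carried by an axis is its eigenvalue divided by the sum of all eigenvalues, which equals the trace of the centered matrix; a figure such as the 68.1% quoted above would be the sum of the first two such shares.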

Fig. 4

Three-dimensional principal coordinate analysis of G. arboreum accessions

Analysis of molecular variance

Analysis of molecular variance (AMOVA) was performed with Arlequin ver 3.5 software. Variation among the groups was significant and contributed 13.61% of the total variation (P < 0.0001) (Table 3). Variation within the groups was larger, at 86.39% (P < 0.0001) (Table 3). The differentiation estimate (FST) from the genetic structure analyses was 0.137 and highly significant (P < 0.0001), which concurred with the analysis of molecular variance. Based on pairwise FST values, the highest genetic differentiation was observed between Group 1 and Group 2, showing that accessions of Group 1 are distinct from accessions of Group 2 (P < 0.0001) (Table 4). The lowest genetic differentiation was between Group 2 and Group 3, suggesting that accessions of these two groups are closer to each other.
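The link between the AMOVA percentages and the fixation index can be checked with a few lines of Python. The sketch assumes the standard two-level relation, in which the index is the among-group variance component divided by the total variance; the function name is hypothetical and only the two percentages come from the text.

```python
def fst_from_components(among, within):
    """Fixation index from among-group and within-group variance components."""
    return among / (among + within)


# Percentages of variation reported for the three Structure groups.
fst = fst_from_components(13.61, 86.39)
print(f"{fst:.4f}")
```

Under this assumption the value is about 0.136, close to the reported FST of 0.137; the small difference may reflect rounding or a slightly different estimator.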

Table 3 Analysis of molecular variance for Gossypium arboreum accessions among and within three groups corresponding to three major regions of Gossypium arboreum growing in China as identified by STRUCTURE
Table 4 Pairwise FST estimates for the three groups corresponding to three major regions of cotton production as identified by STRUCTURE


In the present study, 80 SSR primer pairs were used to evaluate the diversity of Gossypium arboreum accessions. These primer pairs produced 255 alleles in the population of 197 accessions, with an average of 3.2 alleles per marker. PIC values ranged from 0.17 to 0.79, with an average of 0.47. Kantartzi et al. (2009) used 115 SSR primers to characterize 96 G. arboreum accessions and found 2.40 alleles per locus. The average PIC value in Kantartzi et al. (2009) was 0.42, which is in agreement with the present study. Guo et al. (2006) observed a higher number of alleles per locus and a higher polymorphism information content with 60 SSR markers and 108 accessions. Liu et al. (2006) reported genetic similarity coefficients ranging from 0.58 to 0.87 among 39 G. arboreum accessions. The diversity of each population and the number of alleles observed per marker depend strongly on the markers, the germplasm and the platform used to resolve the amplified products (Lacape et al. 2007). Lower diversity has been found in the tetraploid G. hirsutum, even though it had more alleles per locus, with an average of 3.9–4 alleles and an average PIC of 0.13–0.17 in different G. hirsutum populations (Abdurakhmonov et al. 2008; Tyagi et al. 2014; Cai et al. 2014).

The three differentiated subgroups of G. arboreum accessions identified by the population structure analysis were congruent with the major G. arboreum growing regions in China: the Yangtze River Valley, North China and Southwest China. The mixed group comprised 69 accessions spread across the three regions. A few accessions classified into one subgroup by the structure analysis came from outside the geographical region that the subgroup represents. For example, Group 1 was considered to represent the Yangtze River Valley area, although it included seven accessions from Southwest China and one accession from North China. Group 2 included nine accessions from the Yangtze River Valley and one from Southwest China. Group 3 had three accessions from the Yangtze River Valley and one accession from North China. This might be the result of germplasm migration or of gene flow mediated by local people among different regions.

The phylogenetic tree based on the estimates of genetic distance revealed the relationships among the Gossypium arboreum accessions. The three subgroups identified by the Structure software appeared as clusters, shown as three colored lines. Accessions in the mixed group were found within the main clusters in the phylogenetic tree. This result was mostly in accordance with their pedigree, although a few discrepancies were observed between pedigree information and genetic relationships for some accessions. The highest genetic distance, between Group 2 and Group 3, is consistent with the fact that North China and Southwest China are the farthest apart geographically. The lowest genetic distance, between Group 1 and Group 3, corroborates the historical account that G. arboreum was imported to the Yangtze River Valley from South China and then spread over the cotton growing regions of the country. Group 3 was far from Group 1 and Group 2 in both genetics and geography, suggesting that G. arboreum grew in Southwest China for a long time before it was introduced to the Yangtze River Valley across the Yunnan-Guizhou Plateau and the Five Ridges Mountains. This result corroborates previous reports that G. arboreum spread from Southern to Northern China (Guo et al. 2006). However, Silow (1939) found that G. arboreum types lacking lintless modifiers were common in both the Yellow River Valley (part of North China) and the Yangtze River Valley, and pointed to two primary routes of importation: overland from Bengal-Assam to the Yellow River basin, and by sea from Indo-China to the Yangtze River Valley. Watt (1907) also thought that cotton was introduced into China several times from various sources and was domesticated into diverse types, including the early-fruiting type, each time. In the present study, accessions in the mixed group were scattered across the three regions. Accessions from Group 3 overlapped with accessions from Group 1 and Group 2 in the phylogenetic tree and in the principal coordinate analysis (PCA), but there was little overlap between Group 1 and Group 2, which probably means that accessions from Group 3 (from Southwest China) are the common ancestors of accessions from Group 1 (from the Yangtze River Valley) and Group 2 (from North China). Moreover, the highest genetic differentiation was observed between Group 1 and Group 2, which corroborates the genetic relationships and supports the various routes of introduction. The lowest genetic differentiation was observed between Group 2 and Group 3, further supporting an introduction route of G. arboreum from Bengal-Assam to the Yellow River Valley. The difference between these two kinds of analyses may be caused by recombination and selection of G. arboreum under the environment or by the choice of Gossypium arboreum samples.

The differentiation among groups obtained from Structure analysis was also validated by analysis of molecular variance. Marker variation among groups was 13.61%, whereas variation within groups was 86.39%. This agrees with the FST of 0.137 obtained from the genetic structure analyses. FST values observed in this study ranged from 0.18 to 0.23, which differ from the results in upland cotton (0.29 to 0.42) (Tyagi et al. 2014). Deep population differentiation could reduce the efficiency of genome-wide association studies (GWAS) (Tyagi et al. 2014). Moreover, large variations would decrease the power of structure-based association studies to detect the effects of single genes (Flint-Garcia et al. 2005). Because of its low differentiation, the Gossypium arboreum population may be more suitable than Gossypium hirsutum for GWAS analysis to screen important genes associated with traits. Moreover, G. arboreum has a smaller genome than G. hirsutum, which is beneficial for characterization at the molecular level and facilitates the use of this resource in developing superior cotton cultivars with favorable agronomic traits. Therefore, evaluation of genetic diversity and population differentiation is desirable for the efficient utilization of valuable genes of G. arboreum.


Genetic structure was studied within the Gossypium arboreum accessions collected in China. The three subgroups identified by Structure analysis agreed with the main G. arboreum growing regions in China. The genetic diversity of the panel corroborated the result of the genetic structure analysis; however, a few discrepancies were observed between pedigree information and genetic relationships. The AMOVA and genetic structure results showed genetic differentiation both among and within the groups. Recombination and selection driven by the environment and by farmers may have produced valuable traits that can be useful for cotton breeding programs. These traits will benefit cotton breeding through the generation of inter-specific hybrids.



AMOVA: Analysis of molecular variance

Cotton marker database

GD: Genetic distance

GWAS: Genome-wide association study

PAGE: Polyacrylamide gel electrophoresis

PCA: Principal coordinate analysis

PCR: Polymerase chain reaction

PIC: Polymorphic information content

SSR: Simple sequence repeat




This research was supported by the National Natural Science Foundation of China Agriculture (Grant No. 2015NWB039).

Availability of data and materials

Please contact author for data requests.

Author information

Authors and Affiliations



Jia YH carried out the molecular genetic studies, and drafted the manuscript. Pan ZE participated in the sequence alignment. He SP participated in the sequence alignment. Gong WF participated in the design of the study and performed the statistical analysis. Geng XL conceived of the study, and helped to draft the manuscript. Pang BY participated in its design and coordination. Wang LR helped to draft the manuscript. Du XM conceived the study, and helped to draft the manuscript. All the authors read and approved the final manuscript.

Corresponding authors

Correspondence to JIA Yinhua or DU Xiongming.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional files

Additional file 1:

Table S1. Description of accessions used in this research. (XLS 36 kb)

Additional file 2:

Table S2. A summary of marker statistics. (XLSX 18 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

JIA, Y., PAN, Z., HE, S. et al. Genetic diversity and population structure of Gossypium arboreum L. collected in China. J Cotton Res 1, 11 (2018).
