Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 93,037 bioRxiv papers from 397,248 authors.
Most downloaded bioRxiv papers, all time
in category evolutionary biology
5,545 results found.
3,108 downloads evolutionary biology
Population-scale genomic datasets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g. only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods, and offer a new path forward in cases where no likelihood approach exists.
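As a rough illustration of what "representing DNA sequence alignments as images" can mean, the sketch below turns a small alignment into a 0/1 matrix with one row per sequence and one column per site. The encoding (major allele = 0, anything else = 1) and the function name `alignment_to_image` are assumptions made for this example; the abstract does not specify the authors' encoding.

```python
from collections import Counter


def alignment_to_image(sequences):
    """Encode aligned sequences as a 0/1 matrix (rows = sequences, columns = sites).

    Hypothetical encoding for illustration: at each site the most common
    base is written as 0 and every other base as 1.
    """
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("all sequences in an alignment must have equal length")

    # Transpose the alignment so each tuple holds the bases at one site.
    columns = list(zip(*sequences))
    # Major allele per site; ties resolve to the base seen first.
    major = [Counter(col).most_common(1)[0][0] for col in columns]

    return [
        [0 if base == m else 1 for base, m in zip(seq, major)]
        for seq in sequences
    ]


if __name__ == "__main__":
    for row in alignment_to_image(["ACGT", "ACGA", "TCGT"]):
        print(row)
```

A matrix like this could then be passed to any image-style model; choices such as row ordering and padding to a fixed width are left open here.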
3,105 downloads evolutionary biology
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning. We review the fundamentals of machine learning, discuss recent applications of supervised machine learning to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised machine learning is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
3,103 downloads evolutionary biology
Neanderthals and modern humans came in contact with each other and interbred at least twice in the past 100,000 years. Such contact and interbreeding likely led both to the transmission of viruses novel to either species and to the exchange of adaptive alleles that provided resistance against the same viruses. Here, we show that viruses were responsible for dozens of adaptive introgressions between Neanderthals and modern humans. We identify RNA viruses, specifically lentiviruses and orthomyxoviruses, as likely drivers of introgressions from Neanderthals to Europeans. Our results imply that many introgressions between Neanderthals and modern humans were adaptive, and that host genetic variation can be used to understand ancient viral epidemics, potentially providing important insights regarding current and future epidemics.
3,063 downloads evolutionary biology
The hundreds of cichlid fish species in Lake Malawi constitute the most extensive recent vertebrate adaptive radiation. Here we characterize its genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times. Common signatures of selection on visual and oxygen transport genes shared by distantly related deep water species point to both adaptive introgression and independent selection. These findings enhance our understanding of genomic processes underlying rapid species diversification, and provide a platform for future genetic analysis of the Malawi radiation.
3,048 downloads evolutionary biology
The vast majority of human mutations have minor allele frequencies (MAF) under 1%, with the plurality observed only once (i.e., “singletons”). While Mendelian diseases are predominantly caused by rare alleles, their cumulative contribution to complex phenotypes remains largely unknown. We develop and rigorously validate an approach to jointly estimate the contribution of all alleles, including singletons, to phenotypic variation. We apply our approach to transcriptional regulation, an intermediate between genetic variation and complex disease. Using whole genome DNA and lymphoblastoid cell line RNA sequencing data from 360 European individuals, we conservatively estimate that singletons contribute ~25% of cis-heritability across genes (dwarfing the contributions of other frequencies). Strikingly, the majority (~76%) of singleton heritability derives from ultra-rare variants absent from thousands of additional samples. We develop a novel inference procedure to demonstrate that our results are consistent with rampant purifying selection shaping the regulatory architecture of most human genes.
3,048 downloads evolutionary biology
Pyricularia oryzae is a species complex that causes blast disease on more than 50 species of poaceous plants. Pyricularia oryzae has a worldwide distribution as a rice (Oryza) pathogen and in the last century emerged as an important wheat (Triticum) pathogen in southern Brazil. Presently, P. oryzae pathotype Oryza is considered the rice blast pathogen, whereas P. oryzae pathotype Triticum is the wheat blast pathogen. In this study we investigated whether the Oryza and Triticum pathotypes of P. oryzae were distinct at the species level. We also describe a new Pyricularia species causing blast on several other poaceous hosts in Brazil, including wheat. We conducted phylogenetic analyses using 10 housekeeping loci from an extensive sample (N = 128) of sympatric populations of P. oryzae adapted to rice, wheat and other poaceous hosts found in or near wheat fields. The Bayesian phylogenetic analysis grouped the isolates into two major monophyletic clusters (I and II) with high Bayesian probabilities (P = 0.99). Cluster I contained isolates obtained from wheat as well as other Poaceae hosts (P = 0.98). Cluster II was divided into three host-associated clades (Clades 1, 2 and 3; P > 0.75). Clade 1 contained isolates obtained from wheat and other poaceous hosts, Clade 2 contained exclusively wheat-derived isolates, and Clade 3 comprised isolates associated only with rice. Our interpretation was that cluster I and cluster II correspond to two distinct species: Pyricularia graminis-tritici sp. nov. (Pgt), newly described in this study, and Pyricularia oryzae (Po). The host-associated clades found in P. oryzae Cluster II correspond to P. oryzae pathotype Triticum (PoT; Clades 1 and 2), and P. oryzae pathotype Oryza (PoO; Clade 3). No morphological or cultural differences were observed among these species, but a distinctive pathogenicity spectrum was observed. 
Pgt and PoT were pathogenic and highly aggressive on Triticum aestivum (wheat), Hordeum vulgare (barley), Urochloa brizantha (signal grass) and Avena sativa (oats). PoO was highly virulent on the original rice host (Oryza sativa), and also on wheat, barley, and oats, but not on signal grass. We concluded that blast disease on wheat and its associated Poaceae hosts in Brazil is caused by multiple Pyricularia species: the newly described Pyricularia graminis-tritici sp. nov., and the known P. oryzae pathotypes Triticum and Oryza. To our knowledge, P. graminis-tritici sp. nov. is still restricted to Brazil, but obviously represents a serious threat to wheat cultivation globally.
3,039 downloads evolutionary biology
A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an "evolutionary encoding" of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.
3,021 downloads evolutionary biology
Species delimitation is problematic in many taxa due to the difficulty of evaluating predictions from species delimitation hypotheses, which chiefly rely on subjective interpretations of morphological observations and/or DNA sequence data. This problem is exacerbated in recalcitrant taxa for which genetic resources are scarce and inadequate to resolve questions regarding evolutionary relationships and uniqueness. In this case study we demonstrate the empirical utility of restriction site associated DNA sequencing (RAD-seq) by unambiguously resolving phylogenetic relationships among recalcitrant octocoral taxa with divergences greater than 80 million years. We objectively infer robust species boundaries in the genus Paragorgia, which contains some of the most important ecosystem engineers in the deep sea, by testing alternative taxonomy-guided or unguided species delimitation hypotheses using the Bayes factors delimitation method (BFD*) with genome-wide single nucleotide polymorphism data. We present conclusive evidence rejecting the current morphological species delimitation model for the genus Paragorgia and indicating the presence of cryptic species boundaries associated with environmental variables. We argue that the suitability limits of RAD-seq for phylogenetic inferences in divergent taxa cannot be assessed in terms of absolute time, but depend on taxon-specific factors such as mutation rate, generation time and effective population size. We show that classic morphological taxonomy can greatly benefit from integrative approaches that provide objective tests of species delimitation hypotheses. Our results pave the way for addressing further questions in biogeography, species ranges, community ecology, population dynamics, conservation, and evolution in octocorals and other marine taxa.
2,978 downloads evolutionary biology
Whole-genome studies have documented that most Native American ancestry stems from a single population that diversified within the continent more than twelve thousand years ago. However, this shared ancestry hides a more complex history whereby at least four distinct streams of Eurasian migration have contributed to present-day and prehistoric Native American populations. Whole genome studies enhanced by technological breakthroughs in ancient DNA now provide evidence of a sequence of events involving initial migration from a structured Northeast Asian source population, followed by a divergence into northern and southern Native American lineages. During the Holocene, new migrations from Asia introduced the Saqqaq/Dorset Paleoeskimo population to the North American Arctic ~4,500 years ago, ancestry that is potentially connected with ancestry found in Athabaskan-speakers today. This was then followed by a major new population turnover in the high Arctic involving Thule-related peoples who are the ancestors of present-day Inuit. We highlight several open questions that could be addressed through future genomic research.
2,936 downloads evolutionary biology
Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce here a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. In contrast to Approximate Bayesian Computation, another likelihood-free approach widely used in population genetics and other fields, deep learning does not require a distance function on summary statistics or a rejection step, and it is robust to the addition of uninformative statistics. To demonstrate that deep learning can be effectively employed to estimate population genetic parameters and learn informative features of data, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by Drosophila, where pervasive selection confounds demographic analysis. We apply our method to 197 African Drosophila melanogaster genomes from Zambia to infer both their overall demography, and regions of their genome under selection. We find many regions of the genome that have experienced hard sweeps, and fewer under selection on standing variation (soft sweep) or balancing selection. Interestingly, we find that soft sweeps and balancing selection occur more frequently closer to the centromere of each chromosome. In addition, our demographic inference suggests that previously estimated bottlenecks for African Drosophila melanogaster are too extreme, likely due in part to the unaccounted impact of selection.
2,933 downloads evolutionary biology
Morphological and archaeological studies suggest that the Americas were first occupied by non-Mongoloids with Australo-Melanesian traits (the Paleoamerican model), which was subsequently followed by Southwest Europeans coming in along the pack ice of the North Atlantic Ocean (the Solutrean model) and by East Asians and Siberians arriving by way of the Bering Strait. Past DNA studies, however, have produced different accounts. With a better understanding of genetic diversity, we have now reinterpreted public DNA data. Consistent with our recent finding of a close relationship between South Pacific populations and Denisovans or Neanderthals who were archaic Africans with Eurasian admixtures, the ~9500 year old Kennewick Man skeleton with Australo-Melanesian affinity from North America was about equally related to Europeans and Africans, least related to East Asians among present-day people, and most related to the ~42000 year old Neanderthal Mezmaiskaya-2 from Adygea Russia among ancient Eurasian DNAs. The ~12700 year old Anzick-1 of the Clovis culture was most related to the ~18720 year old El Miron of the Magdalenian culture in Spain among ancient DNAs. Amerindian mtDNA haplotypes, unlike their Eurasian sister haplotypes, share informative SNPs with Australo-Melanesians, Africans, or Neanderthals. These results suggest a unifying account of informative findings on the settlement of the Americas.
2,876 downloads evolutionary biology
Human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is most closely related, by average genetic distance, to two coronaviruses isolated from bats, RaTG13 and RmYN02. However, there is a segment of high amino acid similarity between human SARS-CoV-2 and a pangolin-isolated strain, GD410721, in the receptor binding domain (RBD) of the spike protein, a pattern that can be caused by either recombination or by convergent amino acid evolution driven by natural selection. We perform a detailed analysis of the synonymous divergence, which is less likely to be affected by selection than amino acid divergence, between human SARS-CoV-2 and related strains. We show that the synonymous divergence between the bat-derived viruses and SARS-CoV-2 is larger than between GD410721 and SARS-CoV-2 in the RBD, providing strong additional support for the recombination hypothesis. However, the synonymous divergence between the pangolin strain and SARS-CoV-2 is also relatively high, which is not consistent with a recent recombination between them; instead, it suggests a recombination into RaTG13. We also find a 14-fold increase in the dN/dS ratio from the lineage leading to SARS-CoV-2 to the strains of the current pandemic, suggesting that the vast majority of non-synonymous mutations currently segregating within the human strains have a negative impact on viral fitness. Finally, we estimate that the time to the most recent common ancestor of SARS-CoV-2 and RaTG13 or RmYN02, based on synonymous divergence, is 51.71 years (95% C.I., 28.11-75.31) and 37.02 years (95% C.I., 18.19-55.85), respectively.
2,855 downloads evolutionary biology
Although homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. We here introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment. We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of frequencies with which its lineages have recombined with those of other strains, and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.
2,838 downloads evolutionary biology
Brown rats (Rattus norvegicus) thrive in urban environments by navigating the anthropocentric environment and taking advantage of human resources and by-products. From the human perspective, rats are a chronic problem that causes billions of dollars in damage to agriculture, health and infrastructure. Did genetic adaptation play a role in the spread of rats in cities? To approach this question, we collected whole-genome samples from 29 brown rats from New York City (NYC) and scanned for genetic signatures of adaptation. We applied multiple methods, testing for (i) high-frequency, extended haplotypes that could indicate selective sweeps and (ii) loci of extreme genetic divergence between the NYC sample and a sample from the presumed ancestral range of brown rats in rural northeast China. We found candidate selective sweeps near or inside genes associated with metabolism, diet, organ morphogenesis and locomotory behavior. The divergence between NYC and rural Chinese rats at putative sweep loci suggests that many sweeps began after the split from the ancestral population. Together, our results suggest several hypotheses for a genetic component behind the adaptation of rats in response to human activity.
2,837 downloads evolutionary biology
There has been much interest in analyzing genome-scale DNA sequence data to infer population histories, but inference methods developed hitherto are limited in model complexity and computational scalability. Here we present an efficient, flexible statistical method, diCal2, that can utilize whole-genome sequence data from multiple populations to infer complex demographic models involving population size changes, population splits, admixture, and migration. Applying our method to data from Australian, East Asian, European, and Papuan populations, we find that the population ancestral to Australians and Papuans started separating from East Asians and Europeans about 100,000 years ago, and that the separation of East Asians and Europeans started about 50,000 years ago, with pervasive gene flow between all pairs of populations.
2,817 downloads evolutionary biology
With next-generation sequencing (NGS) data coming of age and being routinely used, evolutionary biology is transforming into a data-driven science. As a consequence, researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in our field have grown considerably, in terms of the number of features as well as lines of code. In addition, analysis pipelines now include substantially more components than 5-10 years ago. A topic that has received little attention in this context is the code quality of widely used software. Unfortunately, the majority of users tend to blindly trust software and the results it produces. To this end, we assessed the code quality of 15 highly cited tools (e.g., MrBayes, MAFFT, SweepFinder etc.) from the broader area of evolutionary biology that are used in current data analysis pipelines. We also discuss widely unknown problems associated with floating-point arithmetic for representing real numbers on computer systems. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, but also list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we also discuss journal and science policy as well as funding issues that need to be addressed for improving software quality as well as ensuring support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software quality issues and to emphasize the substantial lack of funding for scientific software development.
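The floating-point problems mentioned in this abstract can be seen in a few lines. The sketch below is an illustration written for this listing, not code from the paper: it shows that floating-point addition is not associative, and that multiplying many small per-site probabilities underflows to zero while summing their logarithms does not. The function names and the per-site probability value are assumptions chosen for the example.

```python
import math


def naive_likelihood(probs):
    """Multiply per-site probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_likelihood(probs):
    """Sum log-probabilities instead; fsum limits accumulated rounding error."""
    return math.fsum(math.log(p) for p in probs)


if __name__ == "__main__":
    # Addition is not associative in binary floating point.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    # Hypothetical data: 1000 sites, each with probability 1e-5.
    site_probs = [1e-5] * 1000
    print(naive_likelihood(site_probs))  # the product underflows
    print(log_likelihood(site_probs))    # the log-scale value stays finite
```

Working on the log scale is a common way to avoid this kind of underflow in likelihood code; whether a given tool does so is something to check in its source.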
2,791 downloads evolutionary biology
As a highly infectious, high-risk disease, coronavirus disease 2019 (COVID-19) poses unprecedented challenges to global health. Up to March 3, 2020, SARS-CoV-2 had infected more than 89,000 people in China and 66 other countries across six continents. In this study, we used 10 newly sequenced genomes of SARS-CoV-2, combined with 136 genomes from the GISAID database, to investigate genetic variation and population demography through different analytical approaches (e.g. network, EBSP, mismatch, and neutrality tests). The results showed that 80 haplotypes had 183 substitution sites, including 27 parsimony-informative sites and 156 singletons. Sliding window analyses of genetic diversity suggested a certain abundance of mutations in the genomes of SARS-CoV-2, which may help explain its current wide spread. Phylogenetic analysis showed that, compared with the coronavirus carried by pangolins (Pangolin-CoV), the virus carried by bats (bat-RaTG13-CoV) has a closer relationship with SARS-CoV-2. The network results showed that SARS-CoV-2 had diverse haplotypes around the world by February 11. Additionally, 16 genomes collected from the Huanan seafood market were assigned to 10 haplotypes, indicating circulating infection within the market over a short period. The EBSP results showed that the first estimated expansion of SARS-CoV-2 began on 7 December 2019.
2,785 downloads evolutionary biology
Explaining the origin and evolutionary dynamics of the genetic architecture of adaptation is a major research goal of evolutionary genetics. Despite controversy surrounding success of the attempts to accomplish this goal, a full understanding of adaptive genetic variation necessitates knowledge about the genomic location and patterns of dispersion for the genetic components affecting fitness-related phenotypic traits. Even with advances in next generation sequencing technologies, the production of full genome sequences for non-model species is often cost prohibitive, especially for tree species such as pines where genome size often exceeds 20 to 30 Gbp. We address this need by constructing a dense linkage map for foxtail pine (Pinus balfouriana Grev. & Balf.), with the ultimate goal of uncovering and explaining the origin and evolutionary dynamics of adaptive genetic variation in natural populations of this forest tree species. We utilized megagametophyte arrays (n = 76-95 megagametophytes/tree) from four maternal trees in combination with double-digestion restriction site associated DNA sequencing (ddRADseq) to produce a consensus linkage map covering 98.58% of the foxtail pine genome, which was estimated to be 1276 cM in length (95% CI: 1174 cM to 1378 cM). A novel bioinformatic approach using iterative rounds of marker ordering and imputation was employed to produce single-tree linkage maps (507-17066 contigs/map; lengths: 1037.40-1572.80 cM). These linkage maps were collinear across maternal trees, with highly correlated marker orderings (Spearman's ρ > 0.95). A consensus linkage map derived from these single-tree linkage maps contained 12 linkage groups along which 20 655 contigs were non-randomly distributed across 901 unique positions (n = 23 contigs/position), with an average spacing of 1.34 cM between adjacent positions.
Of the 20 655 contigs positioned on the consensus linkage map, 5627 had enough sequence similarity to contigs contained within the most recent build of the loblolly pine (P. taeda L.) genome to identify them as putative homologs containing both genic and non-genic loci. Importantly, all 901 unique positions on the consensus linkage map had at least one contig with putative homology to loblolly pine. When combined with the other biological signals that predominate in our data (e.g., correlations of recombination fractions across single trees), we show that dense linkage maps for non-model forest tree species can be efficiently constructed using next generation sequencing technologies. We subsequently discuss the usefulness of these maps as community-wide resources and as tools with which to test hypotheses about the genetic architecture of adaptation.
2,764 downloads evolutionary biology
Mashaal Sohail, Robert M. Maier, Andrea Ganna, Alex Bloemendal, Alicia R. Martin, Michael C. Turchin, Charleston W K Chiang, Joel N. Hirschhorn, Mark J Daly, Nick Patterson, Benjamin M. Neale, Iain Mathieson, David Reich, Shamil R. Sunyaev
Genetic predictions of height differ among human populations and these differences are too large to be explained by genetic drift. This observation has been interpreted as evidence of polygenic adaptation. Differences across populations were detected using SNPs genome-wide significantly associated with height, and many studies also found that the signals grew stronger when large numbers of sub-significant SNPs were analyzed. This has led to excitement about the prospect of analyzing large fractions of the genome to detect subtle signals of selection and claims of polygenic adaptation for multiple traits. Polygenic adaptation studies of height have been based on SNP effect size measurements in the GIANT Consortium meta-analysis. Here we repeat the height analyses in the UK Biobank, a much more homogeneously designed study. Our results show that polygenic adaptation signals based on large numbers of SNPs below genome-wide significance are extremely sensitive to biases due to uncorrected population structure.
2,762 downloads evolutionary biology
Learning how complex traits like eyes originate is fundamental for understanding evolution. Here, we first sketch historical perspectives on trait origins and argue that new technologies offer key new insights. Next, we articulate four open questions about trait origins. To address them, we define a research program to break complex traits into components and study the individual evolutionary histories of those parts. By doing so, we can learn when the parts came together and perhaps understand why they stayed together. We apply the approach to five structural innovations critical for complex eyes, reviewing the history of the parts of each of those innovations. Photoreceptors evolved within animals by bricolage, recombining genes that originated far earlier. Multiple genes used in eyes today had ancestral roles in stress responses. We hypothesize that photo-stress could have increased the chance those genes were expressed together in places on animals where light was abundant.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!