How can I generate a phylogenetic tree in R starting from a FASTA file?

field_epidemiologist · April 28, 2022, 7:32am

Dear Applied Epi Team,
I have sequencing results of some COVID-19 positive specimens from a local nursing home in FASTA format which I would like to further investigate, to better understand if the samples are related (linked) or if they represent distinct introductions into the facility, a possible outbreak with identical genotypes within the nursing home. or distinct introductions. Could the Applied Epi genomics team guide me on how to approach this and which tools or packages in R could help me build a phylogeny with the cases from the nursing home? I have struggled to find tools in R that support managing FASTA files, running multiple sequence alignment (MSA) and building phylogenetic trees within an R environment.
Thanks in advance for your support

BryanTegomoh · April 28, 2022, 5:19pm

Dear @Field_Epidemiologist. Thanks for your question

Phylogenetic trees are indeed a useful way of showing evolutionary relationship of a sequenced agent of interest. They can show (in the case of an infectious agent) how samples are related and how transmission may have happened from a primary case to a secondary or tertiary case. However, it is important to note that direct linkage on a phylogenetic tree does not necessarily infer an epidemiological association. The epidemiological context on which the genomic data is based is important for directionality inferences.

Yes, it is possible to create phylogenetic trees from a FASTA file using R, although there are tools outside R that may be better suited for phylogenetic analyses. I have attached some code below to help you do this in R using the packages: msa, seqinr and ape.

R is amazing for working with already generated trees stored in Newick or Nexus format (as described in Chapter 38 of The Epidemiologist R Handbook), using packages such as ggtree, ape to annotate the trees with epidemiological data. Some of the common tools used to generate trees include the Nextstrain pipeline (https://github.com/nextstrain/ncov), which contain scripts to run the entire workflow including the sequence alignment, phylogenetic tree inference, tree dating, ancestral state construction, etc.

These steps could be done in R but could take longer. I experimented with 05 sample specimens and the multiple sequence alignment step took over 30minutes for the 05 SARS-CoV-2 genomes + the reference genome, which might take shorter time with Nextstrain. (Happy to guide with Nextstrain if needed).

The basic steps to generating a phylogenetic tree:

You need your consensus genome (in FASTA format) then align the sequence possibly to the reference depending on the research question. Including the reference could be useful for estimating the genetic distance of your sample cases from the tree root.

Here’s some sample code on how I would approach this in R

#Loading required packages for the phylogenetic analysis
library(seqinr)
library(ape)
library(msa) # this package is available through the Bioconductor packages. If you do not have the BiocManager package installed, you could install this using the following lines of code... then load it using library(msa)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("msa")
library(msa)

## Read sequences from FASTA file of the cases

sequence <- readAAStringSet("my_sample_fasta_sequences.fasta")

## Perform multiple sequence alignment

my_alignment <- msa(sequence)

## Compute distance matrix

my_alignment_sequence <- msaConvert(my_alignment, type="seqinr::alignment")

distance_alignment <- dist.alignment(my_alignment_sequence)

## compute phylogenetic tree using neighbor joining

Tree <- bionj(distance_alignment)

## display phylogenetic tree

plot(Tree)