The microbiome is a complex ecosystem of microorganisms:
Scanning electron image of a complex microbial community. Colors are artificial. Source: https://www.newscientist.com
So how can we look at which microbes make up a microbiota?
The methodology we are discussing today falls under the umbrella of Amplicon Sequencing.
In amplicon sequencing, one gene shared by all microorganisms of interest is amplified and sequenced to characterize who is in the microbiota?
Amplicon sequencing involves:
This is most often done for bacterial communities with sections of the DNA coding region of 16S Ribosomal RNA.
This is a useful gene because:
Nine variable regions are each separated by highly conserved regions. Source: biology.stackexchange.com
16S rDNA is not the only marker gene used for amplicon sequencing, just the most common.
Keep in mind before you try to design your own amplicons…
Therefore, don’t try to design your own primers unless you have a really good reason… and lots of time to validate.
Key advantages of Amplicon Sequencing
Key disadvantages of Amplicon Sequencing
Aligning short reads means there may be very few differences to distinguish related taxa. Source: nature.com
Most bacteria have two or more copies of rDNA in their genomes, which must be corrected for when counting totals. Source: researchgate.net
Seeing DNA in a sample does not confirm that there are live microbes, just that their DNA was present.
Key Takeaways
There are caveats and biases in every method
Plan and optimize well, then stick to one protocol!
How do I start a microbiome amplicon study? Plan ahead!
It is particularly important to consider the details of your study before you start
Source: Aaron Bacall Source: www.art.com
Start with a question and hypothesis.
Always step back and ask: “Will a whole community profile answer my question?”
How will you collect your samples?
Generally consider in collection:
Source: www.dnagenotek.com
Key controls to keep in mind:
If working with human subjects:
Source: Charis Tsevis
Source: www.dnagenotek.com
DNA extraction techniques vary significantly in community bias
PCR amplification of the marker gene of interest
Earth Microbiome Project V4 Primers
515F FWD: GTGYCAGCMGCCGCGGTAA
806R REV: GGACTACNVGGGTWTCTAAT
Earth Microbiome Project V4-V5 Primers
515F FWD: GTGYCAGCMGCCGCGGTAA
926R REV: CCGYCAATTYMTTTRAGTTT
The primer sequences in EMP protocols are always listed in the 5′ -> 3′ orientation. This is the orientation that should be used for ordering.
Components of full reaction:
_____5′ Illumina adapter__________Golay barcodePad____Linker_Forward primer
FWD: AATGATACGGCGACCACCGAGATCTACACGCT XXXXXXXXXXXX TATGGTAATT GT GTGYCAGCMGCCGCGGTAA
Illumina sequencing molecular biology Source: Jaroslaw Grzadziel (Research Gate)
Illumina MiSeq
Illumina HiSeq/NextSeq/NovaSeq
PacBio and Nanopore
How do you store all of this information about samples in a useful way?
Source: Example mapping file in excel.
For more detailed information check out the full QIIME2 Metadata Guide!
Best Practices
Leading and trailing white spaces are ignored
Once you have your carefully collected data, how do you decide what to do with it?
There are four major steps in amplicon sequence analysis:
Depending on the specifics of your data, you may also need to:
What options are available in QIIME2 for characterizing communities?
We will use DADA2 later today. To learn more about Deblur, take a look at this QIIME2 tutorial
Ok, but how do I make decisions about my data?
QIIME2 has extensive documentation, including guides on what to do with specific data types: QIIME2 documentation. However! It is difficult to use a normal search engine to find things in the QIIME documentation because they update the documentation with each release. Instead, directly searching the documentation is best.
Also, the QIIME developers put together a forum where users can ask questions: QIIME Forum It’s a great place to go when you get stuck or have a non-standard analysis to do. You’ll get answers both from other researchers (who may have encountered your same roadblock before!), but the QIIME developers also monitor it.
Data Source: https://github.com/vmimicrobiome/vmimicrobiome.github.io
What to do when you get back your sequence files?
What do sequence file formats look like?
They can be displayed in a couple ways.
The first is as two separate files for the sequences and the quality scores
Fasta and quality score file formats
Fasta files - The genetic sequences of your amplicons
Quality score files
Phred score translation to base call accuracy
A second way you may find your files is a fastq file
Fastq file format
Some sequencing centers perform some quality processing, others don’t. Make sure you ask what has been done!
How do you make QIIME2 aware of your sequences and mapping file?
Source: Martha Park - Institute for Quantitative & Computational Biosciences Workshop
Source: QIIME2 - https://docs.qiime2.org/2021.4/tutorials/overview/#demultiplexing
Generally importing sequences will follow this format:
qiime tools import \
--type XXX \ # The format of your sequence files
--input-path XXX \ # The folder or manifest file where sequences are located
--input-format XXX \ # Tell QIIME2 how to import the sequences properly
--output-path XXX # Where the processed sequences should be saved
Fastq Manifest Format
This is the most flexible approach if the sequencing center demultiplexed your samples
Make a manifest telling QIIME2:
Manifest Format
sample-id,absolute-filepath,direction #header line
sample-1,$PWD/some/filepath/sample1_R1.fastq,forward
sample-1,$PWD/some/filepath/sample1_R2.fastq,reverse
Phred 33 vs 64
Phred score encoding for 64 and 33. Source: https://www.drive5.com/usearch/manual/quality_score.html
Command options:
-input-format
- SingleEndFastqManifestPhred33
- SingleEndFastqManifestPhred64
- PairedEndFastqManifestPhred33
- PairedEndFastqManifestPhred64
-type
- 'SampleData[SequencesWithQuality]'
- 'SampleData[PairedEndSequencesWithQuality]'
-input-path
- Path to your manifest file or folder of files
-output-path
- Where QIIME2 will save the processed sequences in .qza format
Depending on your formats it will look something like:
Example for Paired end with Phred 64
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \ # Change for paired/single end sequences
--input-path my_manifest.txt \
--output-path 1_0_input_seqs.qza \
--input-format PairedEndFastqManifestPhred64 # Change for paired/single and phred format
Earth Microbiome Project format
This is a more specific format if you follow the EMP protocols and the sequencing center does as well
A good approach if your sequences are not demultiplexed (all samples sequences together)
Place these files in a folder (my_seqs/), and use that folder name as the –input-path argument:
Single End
qiime tools import \
--type EMPSingleEndSequences \
--input-path my_seqs \
--output-path 1_0_input_seqs.qza
Paired End
qiime tools import \
--type EMPPairedEndSequences \
--input-path my_seqs \
--output-path 1_0_input_seqs.qza
There are other options for inputting sequences : check the QIIME2 documentation if these don’t fit your data
Importing and Demultiplexing
First lets take a look at the mapping file to understand how QIIME2 creates visuals
qiime metadata tabulate \
--m-input-file paired_end/metadata.tsv \
--o-visualization paired_end/1_0_metadata_stats
This will output a file 1_0_metadata_stats.qzv
Let’s see what we made
qiime tools view paired_end/1_0_metadata_stats.qzv
To import sequences and demultiplex…
See the files
ls -lsh paired_end/raw_seqs/
Import the files into QIIME2 format
qiime tools import \
--type EMPPairedEndSequences \
--input-path paired_end/raw_seqs/ \
--output-path paired_end/1_0_input_seqs.qza
See the new output file
ls -lsh paired_end
Demultiplex the sequences based on barcodes in mapping file
qiime demux emp-paired \
--m-barcodes-file paired_end/metadata.tsv \
--m-barcodes-column BarcodeSequence \
--i-seqs paired_end/1_0_input_seqs.qza \
--o-per-sample-sequences paired_end/1_1_demultiplexed_seqs \
--p-rev-comp-mapping-barcodes \
--output-dir paired_end/log_files
See the new output file
ls -lsh paired_end
Summarize the sequences per sample
qiime demux summarize \
--i-data paired_end/1_1_demultiplexed_seqs.qza \
--o-visualization paired_end/1_2_demultiplexed_seqs_summary.qzv
Open the summary
qiime tools view paired_end/1_2_demultiplexed_seqs_summary.qzv
Click the ‘Interactive Quality Plot’ tab to view Phred scores
QIIME2 has our sequences… now what?
If they are paired end then we need to join them
Paired end quality scores across sequence. Source: Kwon, S. Lee, B. Yoon, S. doi: 10.1186/1471-2105-15-S9-S10
We also need to do some trimming and filtering to remove basepairs and sequences with low quality score
And, we need to characterize the community
Note: Input and output files are named incrementally!
Keep in mind there are often more options than just the ones I am showing here…
To view all of the parameters available in a QIIME2 script follow just the function call with –help
LIST ALL PARAMETERS FOR quality-filter q-score-joined SCRIPT
qiime quality-filter q-score --help
For example here are all of the useful options for filtering by quality scores
--i-demux ARTIFACT_PATH The demultiplexed sequence data to be
quality filtered. [required]
--p-min-quality INTEGER The minimum acceptable PHRED score. All
PHRED scores less that this value are
considered to be low PHRED scores.
[default: 4]
--p-quality-window INTEGER The maximum number of low PHRED scores that
can be observed in direct succession before
truncating a sequence read. [default: 3]
--p-min-length-fraction FLOAT The minimum length that a sequence read can
be following truncation and still be
retained. This length should be provided as
a fraction of the input sequence length.
[default: 0.75]
--p-max-ambiguous INTEGER The maximum number of ambiguous (i.e., N)
base calls. This is applied after trimming
sequences based on `min_length_fraction`.
[default: 0]
--o-filtered-sequences ARTIFACT_PATH SampleData[JoinedSequencesWithQuality]
The resulting quality-filtered sequences.
[required if not passing --output-dir]
--o-filter-stats ARTIFACT_PATH Summary statistics of the filtering process. [required if not passing --output-dir]
--output-dir DIRECTORY Output unspecified results to a directory
Traditional Operational Taxonomic Unit (OTU) clustering
To get DADA2 ‘features’ we will:
Let’s go back to our summary plot to decide where to trim our seuqences… click the ‘Interactive Quality Plot’ tab
##### INVESTIGATE THE QUALITY SCORE PLOT TO PICK TRIM LENGTHS BELOW #####
qiime tools view paired_end/1_2_demultiplexed_seqs_summary_qc_summary.qzv
Note at the ends, the quality starts to dip slightly. However, the Phred score never drops below 20, so we will keep the full length read.
Where you trim on the ends is somewhat subjective, but generally pick a point where the quality drops but you still have enough overlap and make sure you change in your script!
Let’s set the right/reverse truncation to 150bp (our sequence length) for both sequences
Where you trim at the beginning is based entirely on your primer length. The primers in this study were both 13bp long
Let’s set the left/forward trim to 13bp (our primer length) for both sequences
There are many options for the next step, think about your situation
DADA2! Lets denoise our sequences into a feature table
DADA2 DENOISE DUPLICATE SEQUENCES AND CREATE OBSERVATIONS
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired_end/1_1_demultiplexed_seqs.qza \
--p-trunc-len-f 150 \
--p-trunc-len-r 150 \
--p-trim-left-f 13 \
--p-trim-left-r 13 \
--o-representative-sequences paired_end/2_0_dada2_representative_seqs.qza \
--o-table paired_end/2_1_dada2_table.qza \
--o-denoising-stats paired_end/2_2_dada2_stats.qza
# See output files
ls -lsh paired_end
DADA2 can take a while to run, so take a minute to stretch your legs!
Now we have generated:
Let’s look at the stats…
##### DADA2 VISUALIZE STATISTICS #####
qiime metadata tabulate \
--m-input-file paired_end/2_2_dada2_stats.qza \
--o-visualization paired_end/2_3_dada2_stats_summary.qzv
qiime tools view paired_end/2_3_dada2_stats_summary.qzv
And summarize the table…
##### DADA2 SUMMARIZE TABLE #####
qiime feature-table summarize \
--i-table paired_end/2_1_dada2_table.qza \
--o-visualization paired_end/2_4_dada2_table_summary.qzv \
--m-sample-metadata-file paired_end/metadata.tsv
qiime tools view paired_end/2_4_dada2_table_summary.qzv
And look at the representative sequences…
##### DADA2 SUMMARIZE SEQUENCES #####
qiime feature-table tabulate-seqs \
--i-data paired_end/2_0_dada2_representative_seqs.qza \
--o-visualization paired_end/2_5_dada2_representative_seqs_summary.qzv
qiime tools view paired_end/2_5_dada2_representative_seqs_summary.qzv
Neat! Well done!
Example phylogeny. Source: Rob Onyenwoke et. al. doi: 10.1007/s00203-004-0696-y
A phylogeny depicts the genetic relatedness between each of our amplicons
To construct a phylogeny of our representative sequences we will..
QIIME2 has individual commands that perform each of these steps, but also has a helpful builtin pipeline that runs all of them with a single command.
##### ALIGN REPRESENTATIVE SEQUENCES, MASK NOISY POSITIONS, CREATE PHYLOGENY, ROOT PHYLOGENY AT MIDPOINT #####
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences paired_end/2_0_dada2_representative_seqs.qza \
--o-alignment paired_end/3_0_dada2_aligned_seqs.qza \
--o-masked-alignment paired_end/3_1_dada2_aligned_masked.qza \
--o-tree paired_end/3_2_dada2_tree_unrooted.qza \
--o-rooted-tree paired_end/3_3_dada2_tree.qza
# look at files
ls -lsh paired_end
Finally before we can start analysis we want to assign taxonomy to each of our representative sequences
We will use the GreenGenes 94% reference database for a minimal example, but I would suggest another database for your actual project
First import the GreenGenes Database
### IMPORT TAXONOMY REFERENCE SEQUENCES ###
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path greengenes/94_otus.fasta \
--output-path paired_end/greengenes_94_otus.qza
### IMPORT TAXONOMY REFERENCE TABLE ###
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path greengenes/94_otu_taxonomy.txt \
--output-path paired_end/greengenes_94_taxonomy.qza
Next we will:
### EXTRACT REFERENCE READS BASED ON PRIMERS AND TRIM LENGTH ###
qiime feature-classifier extract-reads \
--i-sequences paired_end/greengenes_94_otus.qza \
--p-f-primer AGAGTTTGATCCTGGCTCAG \
--p-r-primer GCTGCCTCCCGTAGGAGT \
--p-trunc-len 320 \
--o-reads paired_end/greengenes_94_seqs_extracted.qza
### TRAIN NAIVE BAYES CLASSIFIER FOR TAXONOMY ASSIGNMENT ###
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads paired_end/greengenes_94_seqs_extracted.qza \
--i-reference-taxonomy paired_end/greengenes_94_taxonomy.qza \
--o-classifier paired_end/greengenes_94_classifier.qza
### ASSIGN TAXONOMY WITH TRAINED CLASSIFIER ###
qiime feature-classifier classify-sklearn \
--i-classifier paired_end/greengenes_94_classifier.qza \
--i-reads paired_end/2_0_dada2_representative_seqs.qza \
--o-classification paired_end/3_4_assigned_greengenes_94_taxonomy.qza
Finally let’s visualize our taxonomic assignment
### MAKE TAXONOMY OUTPUT VISUALIZATION ###
qiime metadata tabulate \
--m-input-file paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
--o-visualization paired_end/3_5_taxonomy_visualization.qzv
qiime tools view paired_end/3_5_taxonomy_visualization.qzv
We did it!!!!
….but why are some confidence scores greater than 1?!? Did we break statistics? It turns out these are rounding errors and can safely be considered to be ~1 or 100% confidence. Want to know more? Read this
In order to proceed with downstream analysis in MicrobiomeAnalyst, we will need to export our feature table, taxonomic assignments, and phylogenetic tree. We do this using the QIIME2 tools command.
Feature tables and taxonomy are exported in BIOM format.
qiime tools export --input-path paired_end/2_1_dada2_table.qza \
--output-path paired_end/exported
qiime tools export --input-path paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
--output-path paired_end/exported
We then need to quickly manually edit the header of the taxonomy file to match the BIOM software format. We will make a copy and then edit the copy (just in case something goes awry)
cp paired_end/exported/taxonomy.tsv paired_end/exported/biom-taxonomy.tsv
Change the first line of the header in the biom-taxonomy.tsv file to read:
#OTUID taxonomy confidence
Note: this header is case sensitive.
Now we can add the taxonomy to the .biom file .
biom add-metadata -i paired_end/exported/feature-table.biom -o paired_end/table-with-taxonomy.biom --observation-metadata-fp paired_end/exported/biom-taxonomy.tsv --sc-separated taxonomy
Finally, we need to export the phylogenetic tree for downstream phylogenetic analyses.
qiime tools export --input-path paired_end/3_3_dada2_tree.qza \
--output-path paired_end/rooted_tree.tre
In order to proceed with downstream analysis in MicrobiomeAnalyst, we will need to export our feature table, taxonomic assignments, and phylogenetic tree. We do this using the QIIME2 tools command.
Feature tables and taxonomy are exported in BIOM format.
qiime tools export --input-path paired_end/2_1_dada2_table.qza \
--output-path paired_end/exported
qiime tools export --input-path paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
--output-path paired_end/exported
We then need to quickly manually edit the header of the taxonomy file to match the BIOM software format. We will make a copy and then edit the copy (just in case something goes awry)
cp paired_end/exported/taxonomy.tsv paired_end/exported/biom-taxonomy.tsv
Change the first line of the header in the biom-taxonomy.tsv file to read:
#OTUID taxonomy confidence
Note: this header is case sensitive.
Now we can add the taxonomy to the .biom file .
biom add-metadata -i paired_end/exported/feature-table.biom -o paired_end/table-with-taxonomy.biom --observation-metadata-fp paired_end/exported/biom-taxonomy.tsv --sc-separated taxonomy
Finally, we need to export the phylogenetic tree for downstream phylogenetic analyses.
qiime tools export --input-path paired_end/3_3_dada2_tree.qza \
--output-path paired_end/rooted_tree.tre
How can we examine microbial communities?
Let’s switch datasets and look at: The American Gut Project (AGP)
Why?
We are moving from QIIME2 into an web-based graphic user interface (GUI) for data analysis. We love QIIME2 and QIIME2 has move flexibility in some ways, but MicrobiomeAnalyst has several advantages:
Note: MicrobiomeAnalyst has different requirements for metadata files than QIIME2. The first column with your sample names should have #NAMES as the header.
Now click the yellow Submit button!
Double check the info provided as part of the Data Integrity Check. Does everything look like you expect? If yes, click “Proceed” at the bottom of the screen.
While we have already filtered our raw sequences to remove low-quality sequences in QIIME2, there is some additional data filtering we can do to improve the biological validity of our results. We can:
Note: there is another table called “Sample Editor” here that allows you to exclude specific samples, in case you want to remove outliers, samples that didn’t sequence well, or control samples prior to downstream analysis.
We often need to account for uneven sequencing prior to downstream analysis. This will give us comparable measures of diversity between and among samples. We also want to either scale OR transform data (but usually not both) to normalize it prior to analysis.
We will choose to rarefy to the minimum library size to evenly sample across our dataset and use total sum scaling (equivalent to relative abundance). Select the appropriate radio buttons and click Submit. This one will take a minute…
Once you see this:
Click Proceed.
Note: There is a big debate in the microbiome community about what steps are best here. But it basically boils down to all rarefaction and normalization methods are just good attempts at getting microbiome data to behave in downstream models. You can read more here: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0237-y https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13115 https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531
Great! We are ready for downstream analyses!! :tada: :tada:
Let’s examine how our community is taxonomically distributed
Let’s change our graph type to “Stacked Bar [Percentage Abundance]” (because we are working relative abundance data)
We have too many samples to look at individually, so we will need to select a metadata category next to “Merge samples to groups” to average relative abundances within groups in that category. As an example, let’s choose sex.
Click Submit (it might be hiding behind the command history bar on the right side of your screen - you can scroll over to find it).
After it has finished processing, scroll down to see a barplot.
This is a really useful tool for examining our microbial communities
Rob already gave you a great overview of alpha and beta diversity prior to the workshop. You can rewatch those videos here. Briefly, alpha diversity measures within-sample diversity, while beta diversity measure between-sample diversity. There are many metrics (ways to measure) alpha and beta diversity. Some take phylogeny into account, while others do not. Some are based on presence/absence of microbes, and others account for abundance.
Tada!
Note: Over on the right in the side panel, you can download the plots that are shown in the GUI AND the result table with the calculated alpha diversity values for each sample.
Ok, but what does this plot mean?
How do we ask the questions we are actually interested in?
First, an example: A study of the gut microbiome in melanoma patients.
The authors looked at how patients responded to treatment and how that related to survival.
Survival of melanoma patients with response to anti-PD-1 immunotherapy. Source: Gopalakrishnan, V. et. al. Science. 2017. doi: 10.1126/science.aan4236
They also looked at the gut microbiome using an ordination plot.
Microbiome ordination highlighting response to anti-PD-1 immunotherapy. Source: Gopalakrishnan, V. et. al. Science. 2017. doi: 10.1126/science.aan4236
What if your variable of interest in continuous?
What if we are interested in features or taxa that are shifting in abundance?
Let’s statistically examine abundances across metadata categories using linear discriminant analysis (implemented in LEfSe). This allows us to identify taxa that are over or underrepresented in a given group.
The default view is a dot plot, but you can also ask it for a Bar Plot (which sometimes is easier to parse if the metadata category has lots of levels).
There are a few other ways to look at differences in taxonomic abundance - feel free to check out Classical univariate analysis, metagenomeSeq, and RNA-seq methods during the breakout sessions.
Supervised learning classifiers perform a series of analytical steps to see if you can predict the metadata group of a sample from it’s microbiome profile.
We’ll show a quick example of a Random Forest model.
This plot is showing us that it can correctly classify a sample from an individual in their 30s or 40s ~45% of the time. But that it’s not so easy to classify samples from other metadata categories. Our overal classification rate is not so hot. This analysis is super sensitive to even samples sizes, however.
Which brings us to some important notes:
You should do a fair amount of background reading before throwing advanced statistics around!
We want you to be comfortable analyzing your data (or data a collaborator sent you) in MicrobiomeAnalyst at the end of this workshop. So! We are going to break into groups and you will go through the MicrobiomeAnalyst workflow together with the Atacama Soils dataset. This dataset can be found in the practice_dataset folder. We borrowed this dataset from a classic QIIME tutorial.
MicrobiomeAnalyst tasks
Questions to discuss
Please take a few minutes to complete a post-workshop survey here when you have time. This link will be activate for the next 7 days (and we will e-mail you a reminder as well). Your feedback will help us guage how well we presented the materials to you today and update our workshops for future offerings.