The VMIC Microbiome Workshop

View the Project on GitHub

Welcome to the VMIC Microbiome Workshop!

VMI

Introducing the Microbiome

The microbiome is a complex ecosystem of microorganisms:

Human Microbiome2

Scanning electron image of a complex microbial community. Colors are artificial. Source: https://www.newscientist.com

16S Ribosomal DNA and Phylogenetic Markers

So how can we look at which microbes make up a microbiota?

The methodology we are discussing today falls under the umbrella of Amplicon Sequencing.

In amplicon sequencing, one gene shared by all microorganisms of interest is amplified and sequenced to characterize who is in the microbiota?

Amplicon sequencing involves:

  1. Extracting all DNA from a microbial community.
  2. PCR amplifying a specific marker gene shared by the microbes of interest.
  3. Sequencing all of the PCR amplicons.
  4. Assigning the genetic sequences back to reference databases of known microbes.
  5. Compiling counts of the number of sequences representing each taxon.

This is most often done for bacterial communities with sections of the DNA coding region of 16S Ribosomal RNA.

This is a useful gene because:

Human Microbiome8

Nine variable regions are each separated by highly conserved regions. Source: biology.stackexchange.com

16S rDNA is not the only marker gene used for amplicon sequencing, just the most common.

Keep in mind before you try to design your own amplicons…

Therefore, don’t try to design your own primers unless you have a really good reason… and lots of time to validate.

Advantages and Disadvantages

Key advantages of Amplicon Sequencing

Key disadvantages of Amplicon Sequencing

Human Microbiome9

Aligning short reads means there may be very few differences to distinguish related taxa. Source: nature.com

Human Microbiome10

Most bacteria have two or more copies of rDNA in their genomes, which must be corrected for when counting totals. Source: researchgate.net

Seeing DNA in a sample does not confirm that there are live microbes, just that their DNA was present.

Experimental Design and Sample Preparation

Key Takeaways

There are caveats and biases in every method

Plan and optimize well, then stick to one protocol!

Experimental Design

How do I start a microbiome amplicon study? Plan ahead!

It is particularly important to consider the details of your study before you start

Human Microbiome11

Source: Aaron Bacall Source: www.art.com

Start with a question and hypothesis.

Always step back and ask: “Will a whole community profile answer my question?”

Sampling and Controls

How will you collect your samples?

Generally consider in collection:

Human Microbiome12

Source: www.dnagenotek.com

Key controls to keep in mind:

If working with human subjects:

animated

Source: Charis Tsevis

Human Microbiome13

Source: www.dnagenotek.com

DNA Extraction

DNA extraction techniques vary significantly in community bias

Amplification

PCR amplification of the marker gene of interest

Earth Microbiome Project V4 Primers

515F FWD: GTGYCAGCMGCCGCGGTAA

806R REV: GGACTACNVGGGTWTCTAAT

Earth Microbiome Project V4-V5 Primers

515F FWD: GTGYCAGCMGCCGCGGTAA

926R REV: CCGYCAATTYMTTTRAGTTT

The primer sequences in EMP protocols are always listed in the 5′ -> 3′ orientation. This is the orientation that should be used for ordering.

Components of full reaction:

_____5′ Illumina adapter__________Golay barcodePad____Linker_Forward primer

FWD: AATGATACGGCGACCACCGAGATCTACACGCT XXXXXXXXXXXX TATGGTAATT GT GTGYCAGCMGCCGCGGTAA

Sequencing Platforms

Human Microbiome14

Illumina sequencing molecular biology Source: Jaroslaw Grzadziel (Research Gate)

Illumina MiSeq

Illumina HiSeq/NextSeq/NovaSeq

PacBio and Nanopore

Metadata File

How do you store all of this information about samples in a useful way?

Human Microbiome16 Source: Example mapping file in excel.

For more detailed information check out the full QIIME2 Metadata Guide!

Best Practices

An Overview of Analysis in QIIME2

Once you have your carefully collected data, how do you decide what to do with it?

There are four major steps in amplicon sequence analysis:

  1. Filter low quality sequences and trim low quality stretches of DNA.
  2. Lump similar sequences together.
  3. Assign taxonomy to each of the sequence clusters (either Operational Taxonomic Units - OTUs - or Amplicon Sequence Variants - ASVs).
  4. Analyze the taxonomic composition of the community.

Depending on the specifics of your data, you may also need to:

  1. Pair (join) paired end sequences.
  2. Demultiplex (sort) sequences into samples based on their barcodes/indexes.

What options are available in QIIME2 for characterizing communities?

  1. Clustering methods (the OTU approach)
    • These are methods of clustering sequences by percent similarity (usually 97% or 99%) to identify likely species
      • Closed reference OTU picking aligns sequences to their closest match in a known reference database
      • De novo OTU picking clusters all sequences in your dataset to themselves
      • Open reference OTU picking first uses closed reference algorithms, and then takes unaligned sequences and clusters then de novo
    • OTU methods are not recommended in most cases (exceptions are to match older analyses OR if your sequencing method has a high error rate)
    • You can learn more about OTU clustering in QIIME 2 here
  2. Denoising methods (the ASV approach)
    • These are methods of removing or correcting sequence errors to identify likely strains
      • Deblur - Identical sequences based on likelihood of being biological versus artifactual noise
      • DADA2 - Similar sequences based on likelihood of being biological versus artifactual noise
    • There are advantages and caveats to each, but they get very similar results for the most part
      • Deblur removes sequences that are not likely biologically valid
      • DADA2 corrects sequences that are not likely biologically valid
      • Deblur requires separate commands to pair and quality-filter sequences prior to denoising
      • DADA2 pairs, trims, quality-filters, and denoises sequences all in one QIIME command
      • DADA2 can be used for older pyrosequencing data

We will use DADA2 later today. To learn more about Deblur, take a look at this QIIME2 tutorial

Ok, but how do I make decisions about my data?

QIIME2 has extensive documentation, including guides on what to do with specific data types: QIIME2 documentation. However! It is difficult to use a normal search engine to find things in the QIIME documentation because they update the documentation with each release. Instead, directly searching the documentation is best.

QIIME2Doc1

Also, the QIIME developers put together a forum where users can ask questions: QIIME Forum It’s a great place to go when you get stuck or have a non-standard analysis to do. You’ll get answers both from other researchers (who may have encountered your same roadblock before!), but the QIIME developers also monitor it.

QIIME2Doc2

Invited VMIC Speaker: Metagenomics with Dr. Megan Behringer

VMI

From Sequences to Community Composition

Data Source: https://github.com/vmimicrobiome/vmimicrobiome.github.io

What to do when you get back your sequence files?

Sequence File Formats

What do sequence file formats look like?

They can be displayed in a couple ways.

The first is as two separate files for the sequences and the quality scores

Human Microbiome17

Fasta and quality score file formats

Human Microbiome18

Phred score translation to base call accuracy

A second way you may find your files is a fastq file

Human Microbiome19

Fastq file format

Some sequencing centers perform some quality processing, others don’t. Make sure you ask what has been done!

Types of Sequences in QIIME2

How do you make QIIME2 aware of your sequences and mapping file?

Human Microbiome20

Source: Martha Park - Institute for Quantitative & Computational Biosciences Workshop

Human Microbiome21

Source: QIIME2 - https://docs.qiime2.org/2021.4/tutorials/overview/#demultiplexing

Generally importing sequences will follow this format:

   qiime tools import \
      --type XXX \ # The format of your sequence files
      --input-path XXX \ # The folder or manifest file where sequences are located
      --input-format XXX \ # Tell QIIME2 how to import the sequences properly
      --output-path XXX # Where the processed sequences should be saved

Fastq Manifest Format

This is the most flexible approach if the sequencing center demultiplexed your samples

Make a manifest telling QIIME2:

Manifest Format

sample-id,absolute-filepath,direction #header line

sample-1,$PWD/some/filepath/sample1_R1.fastq,forward

sample-1,$PWD/some/filepath/sample1_R2.fastq,reverse

Phred 33 vs 64

Human Microbiome22

Phred score encoding for 64 and 33. Source: https://www.drive5.com/usearch/manual/quality_score.html

Command options:

   -input-format
      - SingleEndFastqManifestPhred33
      - SingleEndFastqManifestPhred64
      - PairedEndFastqManifestPhred33
      - PairedEndFastqManifestPhred64

   -type
      - 'SampleData[SequencesWithQuality]'
      - 'SampleData[PairedEndSequencesWithQuality]'

   -input-path
      - Path to your manifest file or folder of files

   -output-path
      - Where QIIME2 will save the processed sequences in .qza format

Depending on your formats it will look something like:

Example for Paired end with Phred 64

   qiime tools import \
     --type 'SampleData[PairedEndSequencesWithQuality]' \ # Change for paired/single end sequences
     --input-path my_manifest.txt \
     --output-path 1_0_input_seqs.qza \
     --input-format PairedEndFastqManifestPhred64 # Change for paired/single and phred format

Earth Microbiome Project format

This is a more specific format if you follow the EMP protocols and the sequencing center does as well

A good approach if your sequences are not demultiplexed (all samples sequences together)

Place these files in a folder (my_seqs/), and use that folder name as the –input-path argument:

Single End

   qiime tools import \
     --type EMPSingleEndSequences \
     --input-path my_seqs \
     --output-path 1_0_input_seqs.qza

Paired End

   qiime tools import \
     --type EMPPairedEndSequences \
     --input-path my_seqs \
     --output-path 1_0_input_seqs.qza

There are other options for inputting sequences : check the QIIME2 documentation if these don’t fit your data

Now let’s get to work and try it on some example data!!!

Importing and Demultiplexing

animated

First lets take a look at the mapping file to understand how QIIME2 creates visuals

   qiime metadata tabulate \
     --m-input-file paired_end/metadata.tsv \
     --o-visualization paired_end/1_0_metadata_stats

This will output a file 1_0_metadata_stats.qzv

Let’s see what we made

   qiime tools view paired_end/1_0_metadata_stats.qzv

To import sequences and demultiplex…

  1. You should navigate to the folder called data/
  2. We will use the folder paired_end: This contains multiplexed sequence files and a metadata file
    1. Forward sequences - reverse.fastq.gz
    2. Reverse sequences - forward.fastq.gz
    3. Barcode sequences - barcodes.fastq.gz
    4. Metadata file - metadata.tsv
  3. We will import the sequences into QIIME2 format
  4. We will demultiplex the sequences
  5. We will examine the distribution of sequences across each sample

See the files

   ls -lsh paired_end/raw_seqs/

Import the files into QIIME2 format

   qiime tools import \
      --type EMPPairedEndSequences \
      --input-path paired_end/raw_seqs/ \
      --output-path paired_end/1_0_input_seqs.qza

See the new output file

   ls -lsh paired_end

Demultiplex the sequences based on barcodes in mapping file

   qiime demux emp-paired \
      --m-barcodes-file paired_end/metadata.tsv \
      --m-barcodes-column BarcodeSequence \
      --i-seqs paired_end/1_0_input_seqs.qza \
      --o-per-sample-sequences paired_end/1_1_demultiplexed_seqs \
      --p-rev-comp-mapping-barcodes \
      --output-dir paired_end/log_files

See the new output file

   ls -lsh paired_end

Summarize the sequences per sample

   qiime demux summarize \
      --i-data paired_end/1_1_demultiplexed_seqs.qza \
      --o-visualization paired_end/1_2_demultiplexed_seqs_summary.qzv

Open the summary

   qiime tools view paired_end/1_2_demultiplexed_seqs_summary.qzv

Click the ‘Interactive Quality Plot’ tab to view Phred scores

QIIME2 has our sequences… now what?

If they are paired end then we need to join them

Human Microbiome24

Paired end quality scores across sequence. Source: Kwon, S. Lee, B. Yoon, S. doi: 10.1186/1471-2105-15-S9-S10

We also need to do some trimming and filtering to remove basepairs and sequences with low quality score

And, we need to characterize the community

Note: Input and output files are named incrementally!

Keep in mind there are often more options than just the ones I am showing here…

To view all of the parameters available in a QIIME2 script follow just the function call with –help

LIST ALL PARAMETERS FOR quality-filter q-score-joined SCRIPT

   qiime quality-filter q-score --help

For example here are all of the useful options for filtering by quality scores


  --i-demux ARTIFACT_PATH         The demultiplexed sequence data to be
                                  quality filtered.  [required]
  --p-min-quality INTEGER         The minimum acceptable PHRED score. All
                                  PHRED scores less that this value are
                                  considered to be low PHRED scores.
                                  [default: 4]
  --p-quality-window INTEGER      The maximum number of low PHRED scores that
                                  can be observed in direct succession before
                                  truncating a sequence read.  [default: 3]
  --p-min-length-fraction FLOAT   The minimum length that a sequence read can
                                  be following truncation and still be
                                  retained. This length should be provided as
                                  a fraction of the input sequence length.
                                  [default: 0.75]
  --p-max-ambiguous INTEGER       The maximum number of ambiguous (i.e., N)
                                  base calls. This is applied after trimming
                                  sequences based on `min_length_fraction`.
                                  [default: 0]
  --o-filtered-sequences ARTIFACT_PATH     SampleData[JoinedSequencesWithQuality]
                                  The resulting quality-filtered sequences.
                                  [required if not passing --output-dir]
  --o-filter-stats ARTIFACT_PATH       Summary statistics of the filtering process. [required if not passing --output-dir]
  --output-dir DIRECTORY          Output unspecified results to a directory

DADA2 - The New OTU

Human Microbiome25

Traditional Operational Taxonomic Unit (OTU) clustering

To get DADA2 ‘features’ we will:

  1. Pair the sequences
  2. Evaluate the quality (Phred) scores along the sequences
  3. Trim the sequences to the same length and to remove primers
  4. Use DADA2 to ‘denoise’ - correct likely sequencing errors
  5. Use DADA2 to count the remaining sequences for each feature

Let’s go back to our summary plot to decide where to trim our seuqences… click the ‘Interactive Quality Plot’ tab

##### INVESTIGATE THE QUALITY SCORE PLOT TO PICK TRIM LENGTHS BELOW #####
   qiime tools view paired_end/1_2_demultiplexed_seqs_summary_qc_summary.qzv

Note at the ends, the quality starts to dip slightly. However, the Phred score never drops below 20, so we will keep the full length read.

Where you trim on the ends is somewhat subjective, but generally pick a point where the quality drops but you still have enough overlap and make sure you change in your script!

Let’s set the right/reverse truncation to 150bp (our sequence length) for both sequences

Where you trim at the beginning is based entirely on your primer length. The primers in this study were both 13bp long

Let’s set the left/forward trim to 13bp (our primer length) for both sequences

There are many options for the next step, think about your situation

DADA2! Lets denoise our sequences into a feature table

DADA2 DENOISE DUPLICATE SEQUENCES AND CREATE OBSERVATIONS

  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired_end/1_1_demultiplexed_seqs.qza \
    --p-trunc-len-f 150 \
    --p-trunc-len-r 150 \
    --p-trim-left-f 13 \
    --p-trim-left-r 13 \
    --o-representative-sequences paired_end/2_0_dada2_representative_seqs.qza \
    --o-table paired_end/2_1_dada2_table.qza \
    --o-denoising-stats paired_end/2_2_dada2_stats.qza

# See output files
  ls -lsh paired_end

DADA2 can take a while to run, so take a minute to stretch your legs!

Now we have generated:

Let’s look at the stats…

##### DADA2 VISUALIZE STATISTICS #####
  qiime metadata tabulate \
    --m-input-file paired_end/2_2_dada2_stats.qza \
    --o-visualization paired_end/2_3_dada2_stats_summary.qzv

  qiime tools view paired_end/2_3_dada2_stats_summary.qzv

And summarize the table…

##### DADA2 SUMMARIZE TABLE #####
   qiime feature-table summarize \
      --i-table paired_end/2_1_dada2_table.qza \
      --o-visualization paired_end/2_4_dada2_table_summary.qzv \
      --m-sample-metadata-file paired_end/metadata.tsv

   qiime tools view paired_end/2_4_dada2_table_summary.qzv

And look at the representative sequences…

##### DADA2 SUMMARIZE SEQUENCES #####
   qiime feature-table tabulate-seqs \
      --i-data paired_end/2_0_dada2_representative_seqs.qza \
      --o-visualization paired_end/2_5_dada2_representative_seqs_summary.qzv

   qiime tools view paired_end/2_5_dada2_representative_seqs_summary.qzv

Neat! Well done!

Alignment into Amplicon Phylogeny

Human Microbiome26

Example phylogeny. Source: Rob Onyenwoke et. al. doi: 10.1007/s00203-004-0696-y

A phylogeny depicts the genetic relatedness between each of our amplicons

To construct a phylogeny of our representative sequences we will..

  1. Align the sequences using Mafft
  2. Mask highly variable positions in the alignment (removes uninformative noise)
  3. Construct an unrooted phylogeny with FastTree
  4. Root the phylogeny at the tree’s midpoint

QIIME2 has individual commands that perform each of these steps, but also has a helpful builtin pipeline that runs all of them with a single command.

##### ALIGN REPRESENTATIVE SEQUENCES, MASK NOISY POSITIONS, CREATE PHYLOGENY, ROOT PHYLOGENY AT MIDPOINT #####
   qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences paired_end/2_0_dada2_representative_seqs.qza \
    --o-alignment paired_end/3_0_dada2_aligned_seqs.qza \
    --o-masked-alignment paired_end/3_1_dada2_aligned_masked.qza \
    --o-tree paired_end/3_2_dada2_tree_unrooted.qza \
    --o-rooted-tree paired_end/3_3_dada2_tree.qza

# look at files
   ls -lsh paired_end

Taxonomy Assignment

Finally before we can start analysis we want to assign taxonomy to each of our representative sequences

We will use the GreenGenes 94% reference database for a minimal example, but I would suggest another database for your actual project

First import the GreenGenes Database

### IMPORT TAXONOMY REFERENCE SEQUENCES ###
   qiime tools import \
      --type 'FeatureData[Sequence]' \
      --input-path greengenes/94_otus.fasta \
      --output-path paired_end/greengenes_94_otus.qza

### IMPORT TAXONOMY REFERENCE TABLE ###
   qiime tools import \
      --type 'FeatureData[Taxonomy]' \
      --input-format HeaderlessTSVTaxonomyFormat \
      --input-path greengenes/94_otu_taxonomy.txt \
      --output-path paired_end/greengenes_94_taxonomy.qza

Next we will:

### EXTRACT REFERENCE READS BASED ON PRIMERS AND TRIM LENGTH ###
   qiime feature-classifier extract-reads \
      --i-sequences paired_end/greengenes_94_otus.qza \
      --p-f-primer AGAGTTTGATCCTGGCTCAG \
      --p-r-primer GCTGCCTCCCGTAGGAGT \
      --p-trunc-len 320 \
      --o-reads paired_end/greengenes_94_seqs_extracted.qza

### TRAIN NAIVE BAYES CLASSIFIER FOR TAXONOMY ASSIGNMENT ###
   qiime feature-classifier fit-classifier-naive-bayes \
      --i-reference-reads paired_end/greengenes_94_seqs_extracted.qza \
      --i-reference-taxonomy paired_end/greengenes_94_taxonomy.qza \
      --o-classifier paired_end/greengenes_94_classifier.qza

### ASSIGN TAXONOMY WITH TRAINED CLASSIFIER ###
   qiime feature-classifier classify-sklearn \
      --i-classifier paired_end/greengenes_94_classifier.qza \
      --i-reads paired_end/2_0_dada2_representative_seqs.qza \
      --o-classification paired_end/3_4_assigned_greengenes_94_taxonomy.qza

Finally let’s visualize our taxonomic assignment

### MAKE TAXONOMY OUTPUT VISUALIZATION ###
   qiime metadata tabulate \
      --m-input-file paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
      --o-visualization paired_end/3_5_taxonomy_visualization.qzv

   qiime tools view paired_end/3_5_taxonomy_visualization.qzv

We did it!!!!

….but why are some confidence scores greater than 1?!? Did we break statistics? It turns out these are rounding errors and can safely be considered to be ~1 or 100% confidence. Want to know more? Read this

We will pause here for today to take questions and meet back tomorrow at 12PM CST to take our newly created BIOM table to Microbiome Analyst. Thanks!

animated

Welcome to Day 2 of the Workshop

Exporting QIIME Feature Tables

In order to proceed with downstream analysis in MicrobiomeAnalyst, we will need to export our feature table, taxonomic assignments, and phylogenetic tree. We do this using the QIIME2 tools command.

Feature tables and taxonomy are exported in BIOM format.

  qiime tools export --input-path paired_end/2_1_dada2_table.qza \
    --output-path paired_end/exported

  qiime tools export --input-path paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
    --output-path paired_end/exported

We then need to quickly manually edit the header of the taxonomy file to match the BIOM software format. We will make a copy and then edit the copy (just in case something goes awry)

  cp paired_end/exported/taxonomy.tsv paired_end/exported/biom-taxonomy.tsv

Change the first line of the header in the biom-taxonomy.tsv file to read:

#OTUID  taxonomy  confidence

Note: this header is case sensitive.

Now we can add the taxonomy to the .biom file .

  biom add-metadata -i paired_end/exported/feature-table.biom -o paired_end/table-with-taxonomy.biom --observation-metadata-fp paired_end/exported/biom-taxonomy.tsv --sc-separated taxonomy

Finally, we need to export the phylogenetic tree for downstream phylogenetic analyses.

  qiime tools export --input-path paired_end/3_3_dada2_tree.qza \
    --output-path paired_end/rooted_tree.tre

Exporting QIIME Feature Tables

In order to proceed with downstream analysis in MicrobiomeAnalyst, we will need to export our feature table, taxonomic assignments, and phylogenetic tree. We do this using the QIIME2 tools command.

Feature tables and taxonomy are exported in BIOM format.

  qiime tools export --input-path paired_end/2_1_dada2_table.qza \
    --output-path paired_end/exported

  qiime tools export --input-path paired_end/3_4_assigned_greengenes_94_taxonomy.qza \
    --output-path paired_end/exported

We then need to quickly manually edit the header of the taxonomy file to match the BIOM software format. We will make a copy and then edit the copy (just in case something goes awry)

  cp paired_end/exported/taxonomy.tsv paired_end/exported/biom-taxonomy.tsv

Change the first line of the header in the biom-taxonomy.tsv file to read:

#OTUID  taxonomy  confidence

Note: this header is case sensitive.

Now we can add the taxonomy to the .biom file .

  biom add-metadata -i paired_end/exported/feature-table.biom -o paired_end/table-with-taxonomy.biom --observation-metadata-fp paired_end/exported/biom-taxonomy.tsv --sc-separated taxonomy

Finally, we need to export the phylogenetic tree for downstream phylogenetic analyses.

  qiime tools export --input-path paired_end/3_3_dada2_tree.qza \
    --output-path paired_end/rooted_tree.tre

Concepts of Community Analysis

How can we examine microbial communities?

Let’s switch datasets and look at: The American Gut Project (AGP)

AGP1

Why?

MicrobiomeAnalyst

MicrobiomeAnalystBanner

We are moving from QIIME2 into an web-based graphic user interface (GUI) for data analysis. We love QIIME2 and QIIME2 has move flexibility in some ways, but MicrobiomeAnalyst has several advantages:

Importing data into MicrobiomeAnalyst

  1. Go to https://www.microbiomeanalyst.ca/ in your favorite web browser. On the home page, click on the red “Marker Data Profiling (MDP)” bubble.

MA homepage

  1. We are going to upload the files from the american_gut folder you downloaded prior to the workshop. We have provided the feature table in BIOM format, so click on the “BIOM format” dropdown to access the correct upload selections. Select “Choose File” and navigate to the american_gut folder on your computer and upload the .biom, .txt, and .tre files. Select QIIME for the taxonomy label type. Your screen should look like this:

MA import

Note: MicrobiomeAnalyst has different requirements for metadata files than QIIME2. The first column with your sample names should have #NAMES as the header.

  1. Now click the yellow Submit button!

  2. Double check the info provided as part of the Data Integrity Check. Does everything look like you expect? If yes, click “Proceed” at the bottom of the screen.

Filtering data

While we have already filtered our raw sequences to remove low-quality sequences in QIIME2, there is some additional data filtering we can do to improve the biological validity of our results. We can:

  1. Filter out low count reads that are likely to be spurious sequences and sequencing artefacts
    • We can require a minimum count in order to remove features that appear rarely in our data (filtering by abundance). We want to make sure this minimum value isn’t too high, or we will remove rare but biologically important sequences. Let’s set it to 2 here.
    • We can also require a feature be present in a minimum number of samples to keep it in the dataset (filtering by ubiquity or prevalence). This allows us to remove infrequently occurring species that may skew our results. Again, this threshold shouldn’t be too high or we will remove biologically important sequences. Let’s use 10% here.
  2. Filter out features with low variance that are likely to be PCR errors or contamination
    • We can remove features with the lowest X percentage of variance based on various metrics. This isn’t commonly done, so let’s set the variance filter to 0%.

MA data filtering

  1. Click Submit and when you get this message in the upper right:

MA ok

  1. Click Proceed.

Note: there is another table called “Sample Editor” here that allows you to exclude specific samples, in case you want to remove outliers, samples that didn’t sequence well, or control samples prior to downstream analysis.

Data normalization

We often need to account for uneven sequencing prior to downstream analysis. This will give us comparable measures of diversity between and among samples. We also want to either scale OR transform data (but usually not both) to normalize it prior to analysis.

We will choose to rarefy to the minimum library size to evenly sample across our dataset and use total sum scaling (equivalent to relative abundance). Select the appropriate radio buttons and click Submit. This one will take a minute…

Once you see this:

MA data norm

Click Proceed.

Note: There is a big debate in the microbiome community about what steps are best here. But it basically boils down to all rarefaction and normalization methods are just good attempts at getting microbiome data to behave in downstream models. You can read more here: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0237-y https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13115 https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531

Great! We are ready for downstream analyses!! :tada: :tada:

Community transformation and examination

Let’s examine how our community is taxonomically distributed

  1. In the Analysis Overview screen, click on Stacked bar/area plot under the Visual exploration section.:

MA analysis screen

  1. Let’s change our graph type to “Stacked Bar [Percentage Abundance]” (because we are working relative abundance data)

  2. We have too many samples to look at individually, so we will need to select a metadata category next to “Merge samples to groups” to average relative abundances within groups in that category. As an example, let’s choose sex.

  3. Click Submit (it might be hiding behind the command history bar on the right side of your screen - you can scroll over to find it).

  4. After it has finished processing, scroll down to see a barplot.

MA barplot

  1. Take a minute to select other metadata categories and see if there are any interesting differences! Note: if you accidentally select something that is constant for all samples (ie, body_site - theses are all fecal samples), MicrobiomeAnalyst will get cranky and throw an error.

This is a really useful tool for examining our microbial communities

Alpha and beta biodiversity

Rob already gave you a great overview of alpha and beta diversity prior to the workshop. You can rewatch those videos here. Briefly, alpha diversity measures within-sample diversity, while beta diversity measure between-sample diversity. There are many metrics (ways to measure) alpha and beta diversity. Some take phylogeny into account, while others do not. Some are based on presence/absence of microbes, and others account for abundance.

Alpha diversity

  1. Navigate back to the “Analysis Overview” screen. Under the “Community Profiling” section, select “Alpha-diversity analysis”.
  2. Change the Experimental factor to “age_cat”. Let’s keep Chao1 (species richness accounting for abundance) as our metric for now.
  3. Click Submit.
  4. Tada! MA grouped age Chao1

  5. Let’s try switching up the alpha diversity metric (Shannon - species richness and evenness accounting for abundance) and colors (Viridis). MA grouped age Shannon

Note: Over on the right in the side panel, you can download the plots that are shown in the GUI AND the result table with the calculated alpha diversity values for each sample.

Beta diversity

  1. Navigate back to the “Analysis Overview” screen. Under the “Community Profiling” section, select “Beta-diversity analysis”.
  2. Keep everything as the default (the Experimental factor should already be “age_cat”, because that is what we used for alpha diversity). with 1800+ samples, building a new ordination plot will take a lot of time.
  3. Scroll down to see the ordination plot!

Bray-Curtis PCOA

Ok, but what does this plot mean?

Microbiome analysis and hypothesis testing

How do we ask the questions we are actually interested in?

Ordination and composition analysis

First, an example: A study of the gut microbiome in melanoma patients.

The authors looked at how patients responded to treatment and how that related to survival.

Survival curve

Survival of melanoma patients with response to anti-PD-1 immunotherapy. Source: Gopalakrishnan, V. et. al. Science. 2017. doi: 10.1126/science.aan4236

They also looked at the gut microbiome using an ordination plot.

Ordi plot

Microbiome ordination highlighting response to anti-PD-1 immunotherapy. Source: Gopalakrishnan, V. et. al. Science. 2017. doi: 10.1126/science.aan4236

What does this ordination show?

Ok… but what is an ordination?

Here is a great example of one way things can go wrong

Diagonal

Diversity analysis in more detail

Before we assessed alpha diversity between categorical groups

What if your variable of interest in continuous?

  1. Navigate back to “Alpha-diversity analysis”.
  2. Select a continuous variable, like “age_years”.
  3. Click Submit
  4. Realize that MicrobiomeAnalyst has some limitations.
    • If you notice, it’s still treating each numerical age as a categorical variable
    • Unfortunately, MicrobiomeAnalyst does not have the ability to do correlations between continuous variables and alpha diversity.
    • However, this is when exporting that result table becomes useful! You can take that .csv file into your stats program of choice and run a non-parametric Spearman correlation

Testing beta diversity across groups

  1. Navigate back to “Beta-diversity analysis”.
  2. Examine the dropdown menu next to “Statistical method”.

Community and taxonomic differential abundance analysis

What if we are interested in features or taxa that are shifting in abundance?

Let’s statistically examine abundances across metadata categories using linear discriminant analysis (implemented in LEfSe). This allows us to identify taxa that are over or underrepresented in a given group.

  1. Navigate to the “Analysis overview” screen. Under the “Comparison & classification” section, click on “LEfSe”.
  2. Feature table numbers are not very informative, so change the taxonomy level to Family.
  3. We’ll continue to use “age_cat” as our Experimental factor. Leave all other default settings as is.
  4. Click Submit.

The default view is a dot plot, but you can also ask it for a Bar Plot (which sometimes is easier to parse if the metadata category has lots of levels).

MA lefse results

There are a few other ways to look at differences in taxonomic abundance - feel free to check out Classical univariate analysis, metagenomeSeq, and RNA-seq methods during the breakout sessions.

Random forest/machine learning methods

Supervised learning classifiers perform a series of analytical steps to see if you can predict the metadata group of a sample from it’s microbiome profile.

We’ll show a quick example of a Random Forest model.

  1. Navigate to the “Analysis overview” screen. Under the “Comparison & classification” section, click on “Random Forest”.
  2. We’ll continue to use “age_cat” as our Experimental factor. Leave all other default settings as is.
  3. Click Submit.

MA random forest

This plot is showing us that it can correctly classify a sample from an individual in their 30s or 40s ~45% of the time. But that it’s not so easy to classify samples from other metadata categories. Our overal classification rate is not so hot. This analysis is super sensitive to even samples sizes, however.

Which brings us to some important notes:

You should do a fair amount of background reading before throwing advanced statistics around!

Practice time!

We want you to be comfortable analyzing your data (or data a collaborator sent you) in MicrobiomeAnalyst at the end of this workshop. So! We are going to break into groups and you will go through the MicrobiomeAnalyst workflow together with the Atacama Soils dataset. This dataset can be found in the practice_dataset folder. We borrowed this dataset from a classic QIIME tutorial.

MicrobiomeAnalyst tasks

  1. Upload the data, using the BIOM format option. A couple of quick notes: 1) This dataset’s taxonomy is in QIIME format (like the American Gut dataset earlier); 2) You might get an error about sample names not matching in the OTU table and metadata. You can ignore this error (most of the sample names do match, which is good enough for our purposes today).
  2. Filter low prevalence and variance features. Note: you may want to use the mean abundance value filter for this very small and sparse dataset.
  3. Rarefy and scale data.
  4. Look at taxonomic differences (Visual exploration).
  5. Investigate differences in alpha and beta diversity (Community profiling).
  6. Test some hypotheses (Clustering & correlation/Comparison & classification).

Questions to discuss

  1. How do different count and variance filters influence how many features are retained?
  2. Can you visually identify any taxonomic differences between transects or based on other metadata categories?
  3. How does vegetation presence (or other metadata categories) affect alpha and beta diversity of the soil microbes?
  4. Are any features differentially abundant based on any metadata categories? Do the different methods agree?

We’ve reached the end of today’s workshop. Go forth and do the microbiome science!

animated

Please take a few minutes to complete a post-workshop survey here when you have time. This link will be activate for the next 7 days (and we will e-mail you a reminder as well). Your feedback will help us guage how well we presented the materials to you today and update our workshops for future offerings.