Metagenomic Analysis 1: From QC to Gene Abundances

3 minute read

Published:

The standard pipeline that we use in the Brito Lab for quality control, alignments, and gene abundance calculations for metagenomic data.

Paths are relative to our in-house work stations, CBSUBrito and CBSUBrito2.

Quality Control (QC)

Step 1

Removing Human Reads

Input: Use the output from the dereplication step
Program: BMTOOLS > BMTAGGER (/programs/bmtools/bmtagger)
Script: /workdir/scripts/QC/run_bmtagger.qsub

Notes: This step is only needed for metagenomic samples from humans. This step removes human reads that are considered contamination in your samples.

Step 2

Removing Adapter Sequence and Quality Trimming

Input: Use the output from the dereplication step (or the bmtagger step if you performed that)
Program: Trimmomatic (/programs/trimmomatic/trimmomatic-0.36.jar)
Scripts: /workdir/scripts/QC/run_trimmomatic.qsub

Notes: This script removes adapter sequence, uses a sliding window to trim poor quality bases, and removes leading and trailing poor quality bases. It also restricts a minimum read length of 50 (this can be changed depending on your needs). The reference file of adapter sequences is located: /home/britolab/refdbs/nextera_truseq_adapters.fasta. The script outputs a FQ file for forward paired, forward singletons, reverse paired, and reverse singletons.

Step 3

Removing Duplicates

Input: Raw sequencing reads
Program: Prinseq (/programs/prinseq-lite-0.20.2)
Script: /workdir/scripts/QC/run_dereplication.qsub

Notes: Files must be uncompressed for this program. Use -no_qual_header to reduce file size and prevent prinseq from adding redundant qual header. This has been modified in original script.

Alignments

This process will start with quality controlled reads and go through alignments, filtering, and normalization, all the way to gene abundances.

Step 1

Align reads to reference

Program: BWA MEM - run_bwa_mem.qsub
Input: Fastq (R1 and R2), assemblies (IDBA), reference (IGC, indexed for BWA)
Note: Input reference can be whatever is relevant to your study. It must be indexed for BWA.
Output: SAM files

Step 2

SAM -> BAM

Program: Samtools

2a) Filtering step - Removes alignments with less than 90% identity.

Script: run_samfilter.qsub
Input: SAM files
Output: $NAME_filter.sam

2b) SAM2BAM - Removes unaligned reads, keeps only primary alignments, converts to BAM.

Script: run_sam2bam_twins.qsub
Input: filtered SAM files
Output: BAM (sorted, indexed, and default)

Note: at this point, if all BAMs successfully created, you can delete SAM files

Step 3

Samtools idxstats

Script: Run_idxstats.qsub
Input: sorted BAM files (indexed BAM files must be in same directory as the sorted files)
Output: idxstats files (these contain alignment statistics used in the RPKM script)

Step 4

Samtools depth - This is where coverage comes from

Script: Run_samtools_depth.qsub
Input: sorted BAM files (indexed BAM files must be in same directory as the sorted files). Output: depth files (depth files needed for RPKM)

Step 5

RPKM - This is a modified version of RPKM that normalizes to the total number of reads sequenced per sample, not the total number of reads mapped. This is because in microbiome research, there is no such concept as a complete reference genome for mapping.

Script: Run_rpkm.qsub –> Rpkm_calculate.py
Input:

  • Text file containing fastq lengths (not divided by 4, the full length of the file). It should look like: Sample name | FQ1 length | FQ2 length
  • Idxstats file
  • Depth file Output: Logs and gene abundance files (one per sample)

Step 6

Combine and filter the final RPKM table

6a) Filter RPKM files - keep genes with at least 80% coverage across the gene

Script: Run_filterCountsFiles.qsub
Input: RPKM files
Output: filtered RPKM files (at this point, the counts are from alignments filtered at 90% identity and 80% coverage)

6b) Combine the individual RPKM tables into one big table

Script: Run_combineGeneAbundances.qsub. Input: filtered RPKM tables. Output: The end result is a table of normalized gene abundances (rows = samples, columns = genes).

Download

You can see the snakemake file for this pipeline here.