Ocean Metagenomics Functional Profiling¶
Note
If you are starting out with NGLess for metagenomics profiling, consider using the predefined pipeline collection, NG-meta-profiler. This tutorial is based on deconstructing a pipeline very similar to those.
In this tutorial, we will analyse a small dataset of oceanic microbial metagenomes.
Note
This tutorial uses the full Ocean Microbial Reference Gene Catalog presented in Structure and function of the global ocean microbiome Sunagawa, Coelho, Chaffron, et al., Science, 2015
This catalog contains ca. 40 million genes and requires 56GiB of RAM
- Download the toy dataset
First download all the tutorial data:
ngless --download-demo ocean-short
This will download and
expand the data to a directory called ocean-short
.
This is a toy dataset. It is based on real data, but the samples were trimmed so that they contains only 250k paired-end reads.
The dataset is organized in so that each directory contains a sample with multiple fastq files). NGLess does not require this structure, but it is convenient:
$ find
./SAMEA2621229.sampled
./SAMEA2621229.sampled/ERR594355_2.short.fq.gz
./SAMEA2621229.sampled/ERR594355_1.short.fq.gz
./SAMEA2621155.sampled
./SAMEA2621155.sampled/ERR599133_1.short.fq.gz
./SAMEA2621155.sampled/ERR599133_2.short.fq.gz
./SAMEA2621033.sampled
./SAMEA2621033.sampled/ERR594391_2.short.fq.gz
./SAMEA2621033.sampled/ERR594391_1.short.fq.gz
./tara.demo.short
./process.ngl
The whole script we will be using is there as well (process.ngl
), so you
can immediately run it with:
ngless process.ngl
The rest of this tutorial is an explanation of the steps in this script.
- Preliminary imports
To run ngless, we need write a script. We start with a few imports:
ngless "1.4"
import "parallel" version "1.0"
import "omrgc" version "1.0"
These will all be used in the tutorial.
- Parallelization
We are going to process each sample separately. For this, we use the lock1
function from the parallel module (which we
imported before):
samples = readlines('tara.demo.short')
sample = lock1(samples)
The readlines
function reads a file and returns all lines. In this case, we
are reading the tara.demo.short
file, which contains the three samples
(SAMEA2621229.sampled
, SAMEA2621155.sampled
, and
SAMEA2621033.sampled
).
lock1()
is a slightly more complex function. It takes a list and locks one
of the elements and returns it. It always chooses an element which has not
been locked before, so you each time you run _ngless_, you will get a different
sample.
- Preprocessing
First, we load the data (the FastQ files):
input = load_fastq_directory(sample)
And, now, we preprocess the data:
input = preprocess(input, keep_singles=False) using |read|:
read = substrim(read, min_quality=25)
if len(read) < 45:
discard
- Profiling using the OM-RGC
After preprocessing, we map the reads to the ocean microbial reference gene catalog:
mapped = map(input, reference='omrgc', mode_all=True)
The line above is the reason we needed to import the omrgc
module: it made
the omrgc
reference available.
mapped = select(mapped, keep_if=[{mapped}, {unique}])
Now, we need to count
the results. This function takes the result of the
above and aggregates it different ways. In this case, we want to aggregate by
KEGG KOs, and eggNOG OGs:
counts = count(mapped,
features=['KEGG_ko', 'eggNOG_OG'],
normalization={scaled})
- Aggregate the results
We have done all this computation, now we need to save it somewhere. We will
use the collect()
function to aggregate across all the samples processed:
collect(counts
current=sample,
allneeded=samples,
ofile='omgc.profiles.txt')
- Run it!
This is our script. We save it to a file (process.ngl
in this example) and
run it from the command line:
$ ngless process.ngl
Note that we need a large amount (ca. 64GB) of RAM memory to be able to use the OM-RGC. You also need to run it once for each sample. However, this can be done in parallel, taking advantage of high performance computing clusters.
Full script¶
Here is the full script:
ngless "1.4"
import "parallel" version "0.0"
import "omrgc" version "1.0"
samples = readlines('tara.demo.short')
sample = lock1(samples)
input = load_fastq_directory(sample)
input = preprocess(input, keep_singles=False) using |read|:
read = substrim(read, min_quality=25)
if len(read) < 45:
discard
mapped = map(input, reference='omrgc', mode_all=True)
mapped = select(mapped, keep_if=[{mapped}, {unique}])
collect(
count(mapped,
features=['KEGG_ko', 'eggNOG_OG'],
normalization={scaled}),
current=sample,
allneeded=samples,
ofile='omgc.profile.txt')