Nextflow

Some of the examples below are from: https://carpentries-incubator.github.io/workflows-nextflow. Please activate the conda env nf_metag before running any code.

conda activate nf_metag
cd /mnt

Nextflow/Groovy Language Basics

This guide provides an overview of the basic elements of the Nextflow language.

Printing

In Nextflow, you can print messages to the console using the println statement. Please create and open a file named ch0.nf in the /mnt directory and write the following code in it

println("Hello, Nextflow!")

Save the file and run it with: nextflow run ch0.nf

Methods

Methods in Nextflow are defined using the Groovy syntax. Here’s a simple method definition:

def greet(name) {
    return "Hello, ${name}!"
}

println(greet("Nextflow"))

Replace the content of file ch0.nf with above code save and run it.

Comments

Comments in Nextflow are identical to Groovy/Java:

Single-line comments start with //.
Multi-line comments are wrapped in /* ... */.

// This is a single-line comment
/* This is a
   multi-line comment */

Variables

Variables are declared using the def keyword or directly with an assignment.

def age = 25
name = "Nextflow"

Data Types

Nextflow supports various data types, including integers, floats, strings, and booleans.

def age = 25  // Integer
def pi = 3.14  // Float
def isActive = true  // Boolean

Strings

Strings can be declared using double quotes ". String interpolation is supported with ${}.

def name = "Nextflow"
println("Hello, ${name}!")

Lists

Lists in Nextflow can be defined using square brackets [].

def tools = ["Nextflow", "Docker", "Singularity"]
println(tools[0])  // Prints "Nextflow"

Maps

Maps are key-value pairs and can be defined using the syntax [:].

def config = [memory: "10 GB", cpus: 4]
println(config.memory)  // Prints "10 GB"

Closures

Closures are code blocks that can be assigned to variables or passed as arguments.

square = { it * it }
println(square(3))

Processes

The basic structure of a process is:

process < NAME > {
  [ directives ]
  input:
  < process inputs >
  output:
  < process outputs >
  when:
  < condition >
  [script|shell|exec]:
  < user script to be executed >
}

Please create a file named ch1.nf and write the following code in it:

 // nextflow.config file to specify using DSL2
 nextflow.enable.dsl=2

 // Define a process
 process seqStats {
    output:
    stdout

    """
    seqkit stats /mnt/WGS-data/read1.fq
    """
 }

// Define a workflow that calls the process
 workflow {
    seqStats().view()
 }

Channels

There are different types of channels in nextflow:

Value channel

A value channel is bound to a single value and can be created with Channel.value factory method.

Queue channel

Queue (consumable) channels can be created using the following channel factory methods.

Channel.of
Channel.fromList
Channel.fromPath
Channel.fromFilePairs
Channel.fromSRA

bases = ['A', 'C', 'G', 'T']

Channel.value(bases)
  .view()

Channel.of('A', 'C', 'G', 'T')
  .view()

Channel.fromList(bases)
  .view()

Channel.fromPath("${projectDir}/WGS-data/*.fq")
  .view()

Channel.fromFilePairs("${projectDir}/WGS-data/*{1,2}.fq")
  .view()


Channel.fromSRA('SRP043510')
  .view()

Write the above code in ch2.nf, and run it.

Workflows

We can connect different processes with channels to make a complete workflow. We have already seen a minimal example of workflow in Processes section with only one process. We can create a workflow consists of two process in ch3.nf:

#!/usr/bin/env nextflow

// nextflow.config file to specify using DSL2
nextflow.enable.dsl=2

// Define parameters
params.reads = "/mnt/WGS-data/read{1,2}.fq" // Default pattern for paired-end reads
params.outdir = "./output_nf" // Default output directory
params.threads = 8

// QC the reads
process seqQC {
   tag "${sample_id}"

   // Define output dir
   publishDir params.outdir
   // Input file
   input:
   tuple val(sample_id), path(reads)

   // Output file
   output:
   tuple val(sample_id), path("*.fastp.{1,2}.fq.gz")

   script:
   def (r1, r2) = reads
     """
     fastp -i $r1 -I $r2 \
        -o ${sample_id}.fastp.1.fq.gz -O ${sample_id}.fastp.2.fq.gz \
        -5 -3 -q 20 --cut_mean_quality 20 -l 80 -w ${params.threads}
     """
}

// Stats on the QCed reads
process seqStats {
   tag "${sample_id}"
   publishDir params.outdir, mode: 'move'

   input:
   tuple val(sample_id), path(reads)

   output:
   tuple val(sample_id), path("*.fastp.stats.txt")

   script:
   def seqstats_out = "${sample_id}.fastp.stats.txt"
   """
   seqkit stats -T -a $reads -o $seqstats_out -j $params.threads
   """

}

// Define a workflow that calls the process
workflow {
    // Create a channel for paired-end input files
    read_pairs_ch = Channel
      .fromFilePairs(params.reads, size: 2, checkIfExists: true)
    seqQC(read_pairs_ch)
    seqStats(seqQC.out)
}

After the workflow excuted, we should be able to find the final stats output file. We can view it with:

csvtk pretty -t output_nf/read.fastp.stats.txt

Operators

Nextflow provides a powerful set of operators that allow manipulation and control of the data flow. These operators can be categorized into several types based on their functionality: filtering, transforming, splitting, combining, forking, and performing arithmetic operations. This document outlines examples of each category. You can try different operators in ch4.nf file.

Filtering

The filter operator allows you to get only the items emitted by a channel that satisfy a condition and discarding all the others.

Channel
 .of( 'a', 'b', 'aa', 'bc', 3, 4.5 )
 .filter( ~/^a.*/ )
 .view()

Transforming

Transforming operators modify the value or data contained in the channel elements. The

// Example: Transform filenames to uppercase
Channel
 .fromPath('WGS-data/*.fq')
 .map { file -> file.name.toUpperCase() }
 .view { "Transformed filename: $it" }

// Converting a list into multiple items
Channel
 .of([1, 2, 3, 4])
 .flatten()
 .view()

// The reverse of the flatten operator is collect.
// The collect operator collects all the items emitted by a
//channel to a list and return the resulting object as a sole emission.
Channel
 .of( 1, 2, 3, 4 )
 .collect()
 .view()

// Grouping contents of a channel by a key.
// The first element of tuple is the default key.
Channel.fromPath('output_nf/*.fastp.{1,2}.fq.gz')
  .map{file -> [file.name.split('\\.')[0], file]}
  .groupTuple()
  .view()

Splitting

Sometimes, it’s necessary to split the content of an individual item in a channel, such as a file or string, into smaller chunks for downstream processing. This could include items stored in a CSV file, entries in FASTA or FASTQ formats, or multi-line strings/text files.

Nextflow provides several splitting operators to facilitate this: splitCsv, splitFasta, splitFastq, splitText. Each of these operators enables precise control over the handling and preprocessing of data streams, enhancing the flexibility and efficiency of Nextflow pipelines.

Channel.of("val1\tval2\tval3\nval4\tval5\tval6\n")
 .splitCsv(sep: "\t")
 .view()

Combining

Combining operators are used to join two or more channels: mix, join

// Example: Combine three channels
ch1 = Channel.of( 1,2,3 )
ch2 = Channel.of( 'X','Y' )
ch3 = Channel.of( 'mt' )

ch4 = ch1.mix(ch2,ch3).view()

// Joins together the items emitted by two channels for which exists a matching key.
// The key is defined, by default, as the first element in each item emitted
reads1_ch = Channel
  .of(['wt', 'wt_1.fq'], ['mut','mut_1.fq'])
reads2_ch= Channel
  .of(['wt', 'wt_2.fq'], ['mut','mut_2.fq'])
reads_ch = reads1_ch
  .join(reads2_ch)
  .view()

Forking

Forking operators split a single channel into multiple channels.

Channel
 .of(1, 2, 3, 40, 50)
 .branch {
     small: it < 10
     large: it > 10
 }
 .set { result }

 result.small.view { "$it is small" }
 result.large.view { "$it is large" }

Maths

The maths operators allows you to apply simple math function on channels.

The maths operators are: count, min, max, sum, toInteger

Channel
 .of(1..22,'X','Y')
 .count()
 .view()

nf-core workflows for metagenomics

List the nf-core workflows in a specified catagory and sort by stars

nf-core list metagenomics -s stars

Pipeline Name	Stars	Latest Release	Released	Last Pulled	Have latest release?
mag	167	2.5.2	2 days ago	2 days ago	No (v2.5.1)
ampliseq	146	2.8.0	3 weeks ago
eager	117	2.5.0	3 months ago
viralrecon	105	2.6.0	11 months ago
taxprofiler	79	1.1.4	1 week ago	2 days ago	No (v1.1.4)
funcscan	49	1.1.4	3 months ago