Genomic Analyses
Basic tools for processing and analyzing genomic sequencing (next-generation sequencing, NGS) data come preinstalled in a container environment.
The path to the container on Komondor is: /opt/software/packages/containers/datascience_genomics.sif.
The definition file (datascience_genomics.def) is also available in the same path next to the .sif file.
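A quick way to confirm that the container is usable is to inspect its metadata (a minimal check; singularity inspect prints the image's labels and build information):
$ module load singularity
$ singularity inspect /opt/software/packages/containers/datascience_genomics.sif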
Available tools
Open-source tools that are currently available in the container:
Quality control of sequencing data: FastQC, MultiQC
Trimming and cleaning sequences: Cutadapt, Trimmomatic, Trim Galore
Sequence read mapping tools: BWA, Bowtie, Bowtie 2, STAR, BBMap, HISAT2, Bismark
Manipulation of sequences and alignment data: Samtools, Picard
Peak calling: MACS2
Motif analysis: HOMER
Genome arithmetic: Bedtools
Gene expression analysis: Salmon, DESeq2
Genetic variant identification and analysis: BCFtools, VCFtools, freebayes
Chromosome conformation capture analysis: HiCUP, HiCExplorer, HiC-Pro
Genome assembly: ABySS, MEGAHIT, Velvet, SPAdes, Canu, Tigmint
Genome analysis toolkit: GATK4
Data analysis and scripting: R, Python, Julia
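To see which versions are installed, the tools can be queried directly through the container (a sketch; most, though not all, of the tools above accept a --version flag):
$ module load singularity
$ SIF=/opt/software/packages/containers/datascience_genomics.sif
$ for tool in fastqc samtools bcftools bowtie2; do singularity exec ${SIF} ${tool} --version; done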
Important
Note that authors usually request a citation in the publication if their software tool was used in the published work. Please always check the terms of use and the citation request on the tool’s website.
Scheduled analyses (sbatch)
Using the container, bioinformatics pipeline scripts can be run as scheduled Slurm jobs.
Example: Bowtie2 alignment
In this example, suppose we have paired-end NGS data from an experiment that we want to align to the human reference genome.
We have the reference genome prepared (indexed) in the reference directory (on the PROJECT filesystem). We have the two NGS fastq (read1 and read2) files in the data directory and we created an output directory for the alignment files (on the faster SCRATCH filesystem).
$ ls -1 /project/my_project/reference
hs_grch38.fa.gz
hs_grch38.fa.gz.1.bt2
hs_grch38.fa.gz.2.bt2
hs_grch38.fa.gz.3.bt2
hs_grch38.fa.gz.4.bt2
hs_grch38.fa.gz.rev.1.bt2
hs_grch38.fa.gz.rev.2.bt2
$ ls -1 /scratch/my_project/fastq
my_sample_R1.fastq.gz
my_sample_R2.fastq.gz
$ ls -1 /scratch/my_project/sam
$
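For completeness: if the Bowtie 2 index did not exist yet, it could be built from the FASTA file with bowtie2-build inside the container (a sketch; building a full human index is resource-intensive and should itself be run as a Slurm job):
$ module load singularity
$ singularity exec -B /project:/project /opt/software/packages/containers/datascience_genomics.sif \
      bowtie2-build --threads 8 /project/my_project/reference/hs_grch38.fa.gz /project/my_project/reference/hs_grch38.fa.gz
The second argument is the index basename, which is why the .bt2 files in the listing above all share the hs_grch38.fa.gz prefix.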
Now we prepare an sbatch script that performs the alignment with bowtie2, which is already installed in the genomics container (/opt/software/packages/containers/datascience_genomics.sif). We allocate 16 processor cores for the alignment and set bowtie2 to use 16 threads accordingly. Note that we mount both the PROJECT and SCRATCH file systems for the container, since we have input data in both locations. As configured in the script, bowtie2’s log will be written to the “slurm-%j__bowtie2.out” file (%j is replaced by the Slurm job ID).
$ cat ./alignment.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=alignment
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j__bowtie2.out
# input data (fastq files)
READ1=/scratch/my_project/fastq/my_sample_R1.fastq.gz
READ2=/scratch/my_project/fastq/my_sample_R2.fastq.gz
# output data (sam file)
SAM_OUT=/scratch/my_project/sam/my_sample.sam
# reference genome index location
REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
# loading Singularity
module load singularity
# Singularity options
# the genomic analysis tools are installed in the datascience_genomics.sif container
# -B option specifies the file systems (with our data) we want to mount inside the container
CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
# running bowtie2 alignment (on 16 processor cores, as set in the SBATCH options above) using the container
${CONTAINER_CALL} bowtie2 --threads ${SLURM_CPUS_PER_TASK} -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
We can submit the sbatch script to Slurm using the following command:
$ sbatch ./alignment.sbatch
Submitted batch job 1234567
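While the job is pending or running, its state can be checked with the standard Slurm commands:
$ squeue -u $USER
$ sacct -j 1234567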
If the job completes successfully, the alignment (sam) file has been created and the bowtie2 log shows the alignment stats:
$ ls -1 /scratch/my_project/sam
my_sample.sam
$ cat ./slurm-1234567__bowtie2.out
50000 reads; of these:
50000 (100.00%) were paired; of these:
10646 (21.29%) aligned concordantly 0 times
33390 (66.78%) aligned concordantly exactly 1 time
5964 (11.93%) aligned concordantly >1 times
----
10646 pairs aligned concordantly 0 times; of these:
5243 (49.25%) aligned discordantly 1 time
----
5403 pairs aligned 0 times concordantly or discordantly; of these:
10806 mates make up the pairs; of these:
6544 (60.56%) aligned 0 times
1890 (17.49%) aligned exactly 1 time
2372 (21.95%) aligned >1 times
93.46% overall alignment rate
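A typical next step (not covered by the script above) is converting the SAM output to a sorted, indexed BAM file with samtools from the same container; a sketch, run here interactively for brevity:
$ module load singularity
$ CONTAINER_CALL="singularity exec -B /scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ ${CONTAINER_CALL} samtools sort -@ 8 -o /scratch/my_project/sam/my_sample.sorted.bam /scratch/my_project/sam/my_sample.sam
$ ${CONTAINER_CALL} samtools index /scratch/my_project/sam/my_sample.sorted.bam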
Example: Multiple alignments in parallel (job array)
In this example, the task is similar to the previous one, but now we have three samples instead of one, and we want to analyse them in parallel.
We have the reference (indexed), the input files (three paired-end NGS data files) and the output directory:
$ ls -1 /project/my_project/reference
hs_grch38.fa.gz
hs_grch38.fa.gz.1.bt2
hs_grch38.fa.gz.2.bt2
hs_grch38.fa.gz.3.bt2
hs_grch38.fa.gz.4.bt2
hs_grch38.fa.gz.rev.1.bt2
hs_grch38.fa.gz.rev.2.bt2
$ ls -1 /scratch/my_project/fastq
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
sample3_R1.fastq.gz
sample3_R2.fastq.gz
$ ls -1 /scratch/my_project/sam
$
The simplest method to run the alignments in parallel is a job array. With the following settings, the sbatch script will be run in parallel for all three samples as separate jobs. The variable $SLURM_ARRAY_TASK_ID can be used to distinguish the runs and to select the sample name individually for each run. The range of the array task IDs is set with the --array sbatch option. (See also the “Job Array” section in “Basic Usage / Advanced Job Definitions”.)
$ cat ./alignment_array.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=alignment
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
#SBATCH --array=0-2
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j_%a__bowtie2.out
# loading Singularity
module load singularity
# Singularity options
# the genomic analysis tools are installed in the datascience_genomics.sif container
# -B option specifies the file systems (with our data) we want to mount inside the container
CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
# we want to run the alignment for three samples in parallel
samples=(sample1 sample2 sample3)
# current run (sample)
# we use the index of the current task within the job array to select the sample name
run=${samples[$SLURM_ARRAY_TASK_ID]}
# input data (fastq files)
READ1=/scratch/my_project/fastq/${run}_R1.fastq.gz
READ2=/scratch/my_project/fastq/${run}_R2.fastq.gz
# output data (sam file)
SAM_OUT=/scratch/my_project/sam/${run}.sam
# reference genome index location
REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
# running bowtie2 alignment (on 16 processor cores, as set in the SBATCH options above) using the container
srun ${CONTAINER_CALL} bowtie2 --threads ${SLURM_CPUS_PER_TASK} -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
Submitting the sbatch script to Slurm:
$ sbatch ./alignment_array.sbatch
Submitted batch job 1234567
On a successful run, we will have the output (sam) file for each sample, and a separate output log for each alignment:
$ ls -1 /scratch/my_project/sam
sample1.sam
sample2.sam
sample3.sam
$ ls -1 ./slurm-*__bowtie2.out
slurm-1234567_2__bowtie2.out
slurm-1234568_0__bowtie2.out
slurm-1234569_1__bowtie2.out
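For larger cohorts, hard-coding the sample names in the script becomes impractical. A common variant (a sketch, assuming a hypothetical samples.txt file with one sample name per line) selects the name by array index instead:
# replace the samples=(...) and run=... lines of the script with:
# task IDs start at 0, sed line numbers at 1, hence the +1
run=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" /scratch/my_project/samples.txt)
The --array range can then be sized to the file, e.g. --array=0-$(( $(wc -l < /scratch/my_project/samples.txt) - 1 )) on the sbatch command line.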
Useful configurations
Links to Slurm configurations that can help organize and efficiently run bioinformatics pipelines that consist of many steps and/or parallel tasks:
Job arrays (mentioned above): Submitting and managing collections of similar jobs quickly and easily - https://slurm.schedmd.com/job_array.html
Packed jobs: Running multiple job steps concurrently within a single job allocation - https://slurm.schedmd.com/srun.html#OPT_exclusive
Job dependencies: Organizing sequential compute jobs that have to run in order (see the sketch after this list) - https://slurm.schedmd.com/sbatch.html#OPT_dependency
Heterogeneous jobs: Submitting and managing jobs with different resources for different pieces of the job - https://slurm.schedmd.com/heterogeneous_jobs.html
Multiple programs configuration: Run a job with different programs and different arguments for each task - https://slurm.schedmd.com/srun.html#OPT_multi-prog
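As a brief illustration of job dependencies (a sketch; postprocess.sbatch is a hypothetical follow-up script), the alignment job's ID can be captured with --parsable and a downstream step chained to its successful completion:
$ JOB_ID=$(sbatch --parsable ./alignment.sbatch)
$ sbatch --dependency=afterok:${JOB_ID} ./postprocess.sbatch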
Interactive analyses
The genomics container can also be launched interactively on the command line or on JupyterHub (https://jupyter.hpc.kifu.hu).
Analysis setup
Let’s say we want to run just a short test alignment using the same setup (reference, data files, task) described in the “Example: Bowtie2 alignment” section above:
$ REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
$ READ1=/scratch/my_project/fastq/my_sample_R1.fastq.gz
$ READ2=/scratch/my_project/fastq/my_sample_R2.fastq.gz
$ SAM_OUT=/scratch/my_project/sam/my_sample.sam
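Before launching anything, it is worth checking that the variables resolve to existing files:
$ ls -l ${READ1} ${READ2}
$ ls -1 ${REF_IDX}*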
Execute the analysis interactively
Instead of preparing and submitting an sbatch script, we can also do the alignment interactively on the command line using the container (provided that the task is really short and does not use too many resources):
$ module load singularity
$ CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ srun --ntasks=1 --cpus-per-task=16 --mem-per-cpu=2000 --pty ${CONTAINER_CALL} bowtie2 --threads 16 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
Execute the analysis in the container shell
We can also load the genomics container for full interactive use and issue commands directly in the container’s shell:
$ module load singularity
$ CONTAINER_SHELL="singularity shell -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ srun --ntasks=1 --cpus-per-task=16 --mem-per-cpu=2000 --pty $CONTAINER_SHELL
singularity> bowtie2 --version
/opt/conda/bin/bowtie2-align-s version 2.5.4
...
singularity> realpath $READ1
/scratch/my_project/fastq/my_sample_R1.fastq.gz
singularity> bowtie2 --threads 16 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
50000 reads; of these:
50000 (100.00%) were paired; of these:
...
singularity> ls /scratch/my_project/sam
my_sample.sam
singularity> exit
$
Interactive use on the JupyterHub
Log in to JupyterHub and select datascience-genomics from the container list when starting a new server. You can use the bash kernel to create a notebook for the analysis, or use the bash terminal to run the tools directly.
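In a terminal or a bash-kernel notebook started this way, the container's tools are expected to be directly on the PATH, so the singularity prefix is not needed (a sketch, reusing the variables from the “Analysis setup” section above):
$ bowtie2 --version
$ bowtie2 --threads 4 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}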