References

Options

class ClickFeatureCountsOptions(feature_type='gene', attribute='ID', options=None, strandness=None, caller=None)[source]
group_name = 'Feature Counts'
metadata = {'name': 'Feature Counts', 'options': ['--feature-counts-strandness', '--feature-counts-attribute', '--feature-counts-extra-attributes', '--feature-counts-feature-type', '--feature-counts-options']}
class ClickGeneralOptions(caller=None)[source]
static deps_callback(ctx, param, value)[source]
static from_project_callback(ctx, param, value)[source]
group_name = 'General'
metadata = {'name': 'General', 'options': ['--deps', '--from-project', '--help', '--level', '--version']}
static version_callback(ctx, param, value)[source]
class ClickInputOptions(input_directory='.', input_pattern='*fastq.gz', add_input_readtag=True, caller=None)[source]
group_name = 'Data'
metadata = {'name': 'Data', 'options': ['--input-directory', '--input-pattern', '--input-readtag']}
class ClickKrakenOptions(caller=None)[source]
group_name = 'Kraken'
metadata = {'name': 'Kraken', 'options': ['--kraken-databases', '--skip-kraken']}
class ClickSlurmOptions(memory='4G', queue='common', profile=None, caller=None)[source]
group_name = 'Slurm'
metadata = {'name': 'Slurm', 'options': ['--profile', '--slurm-queue', '--slurm-memory']}
class ClickSnakemakeOptions(working_directory='analysis', caller=None)[source]
group_name = 'Snakemake'
metadata = {'name': 'Snakemake', 'options': ['--apptainer-prefix', '--apptainer-args', '--force', '--jobs', '--use-apptainer', '--working-directory']}
class ClickTrimmingOptions(software=['cutadapt', 'atropos', 'fastp'], caller=None)[source]

This section is dedicated to read trimming and filtering, including adapter trimming. We currently support the Cutadapt/Atropos and FastP tools.

This section unifies the options for these tools.

group_name = 'Trimming'
metadata = {'name': 'Trimming', 'options': ['--software-choice', '--trimming-minimum-length', '--trimming-adapter-read1', '--trimming-adapter-read2', '--disable-trimming', '--trimming-cutadapt-mode', '--trimming-cutadapt-options', '--trimming-quality']}
class OptionEatAll(*args, **kwargs)[source]
add_to_parser(parser, ctx)[source]
include_options_from(cls, *args, **kwargs)[source]
init_click(NAME, groups={})[source]

This function populates the click variables and groups.

NAME is added to the rich_context so that the ClickXXOptions classes may reuse it. The function also sets HEADER_TEXT and initialises the OPTION_GROUPS used by rich_click.

In a sequana pipeline, you can use this code:

CTX = init_click(
    NAME,
    groups={
        "Pipeline Specific": ["--method-example"],
    },
)
@click.command(context_settings=CONTEXT_SETTINGS)
@include_options_from(ClickSnakemakeOptions, working_directory=NAME)
@click.option("--method-example")
def main(**kwargs):
    pass
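
As a rough illustration of the OPTION_GROUPS structure that rich_click consumes, the sketch below assembles one from metadata dictionaries shaped like those of the ClickXXOptions classes above. The helper build_option_groups and the command name sequana_example are hypothetical and not part of sequana_pipetools; how init_click builds the mapping internally may differ.

```python
def build_option_groups(name, pipeline_groups, metadata_list):
    """Return {command name: [{"name": ..., "options": [...]}, ...]}.

    Hypothetical helper: pipeline-specific groups come first, followed by
    the groups described by the metadata dictionaries.
    """
    groups = [
        {"name": title, "options": list(options)}
        for title, options in pipeline_groups.items()
    ]
    groups += [
        {"name": meta["name"], "options": list(meta["options"])}
        for meta in metadata_list
    ]
    return {name: groups}


# Metadata copied from ClickSlurmOptions above.
slurm = {"name": "Slurm", "options": ["--profile", "--slurm-queue", "--slurm-memory"]}

OPTION_GROUPS = build_option_groups(
    "sequana_example",  # hypothetical command name
    {"Pipeline Specific": ["--method-example"]},
    [slurm],
)
```

Each entry pairs a group title with the option flags to display under it, which is why every options class carries a group_name and a metadata dictionary.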

Sequana Manager

class SequanaManager(options, name='undefined')[source]
Parameters:
  • options -- an instance of Options

  • name -- name of the pipeline. Must be a Sequana pipeline already installed.

options must be an object exposing at least the following attributes:

class Options:
    level = 'INFO'
    version = False
    workdir = "fastqc"
    job = 1
    force = True
    use_apptainer = False
    apptainer_prefix = ""
    def __init__(self):
        pass

from sequana_pipetools import SequanaManager
o = Options()
pm = SequanaManager(o, "fastqc")

The pipeline is copied into the working directory.
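
For experimentation, the same set of attributes can be written more compactly as a dataclass. This is only a sketch: the attribute names and defaults are those listed above, and whether a given version of SequanaManager requires further attributes is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Options:
    # Attribute names and defaults taken from the list above.
    level: str = "INFO"
    version: bool = False
    workdir: str = "fastqc"
    job: int = 1
    force: bool = True
    use_apptainer: bool = False
    apptainer_prefix: str = ""


# Override only what differs from the defaults.
options = Options(workdir="analysis", job=4)

# With sequana_pipetools and the fastqc pipeline installed, one would then call:
# manager = SequanaManager(options, "fastqc")
```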

check_input_files(stop_on_error=True)[source]
exists(filename, exit_on_error=True, warning_only=False)[source]

Convenience function that checks whether a file or directory exists.

Used in the main.py of all pipelines when setting the working directory.
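
The two flags in the signature suggest three outcomes for a missing path: warn, log an error, or log an error and exit. The function below is a hypothetical stand-in that follows one plausible reading of those flags; it is not the actual implementation of exists().

```python
import logging
import os
import sys

logger = logging.getLogger(__name__)


def path_exists(filename, exit_on_error=True, warning_only=False):
    """Hypothetical stand-in mirroring the signature of exists()."""
    if os.path.exists(filename):
        return True
    if warning_only:
        # Assumption: a warning never stops the program.
        logger.warning("%s not found", filename)
    else:
        logger.error("%s not found", filename)
        if exit_on_error:
            sys.exit(1)
    return False
```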

fill_data_options()[source]
setup()[source]

Initialise the pipeline.

  • Create a directory (usually named after the pipeline name)

  • Copy the pipeline and associated files (e.g. config file)

  • Create a script in the directory ready to use

teardown(check_schema=True, check_input_files=True)[source]

Save all files required to run the pipeline and perform sanity checks

We copy the following files into the working directory:

  • the config file (config.yaml)

  • a NAME.sh that contains the snakemake command

  • the Snakefile (NAME.rules)

For book-keeping and for some parts of the pipelines, we also copy the config file and its Snakefile into the .sequana directory, together with the logo.png file if present.

If present, we also copy:

  • the multiqc_config file for MultiQC reports

  • the schema.yaml file used to check the content of the config.yaml file

multiple_downloads(files_to_download, timeout=3600)[source]

FileFactory and FastQFactory

class FastQFactory(pattern, extension=['fq.gz', 'fastq.gz'], read_tag='_R[12]_', extra_prefixes_to_strip=[], sample_pattern=None, **kwargs)[source]

FastQ Factory tool

In NGS experiments, reads are stored in a so-called FastQ file. The file is named:

PREFIX_R1_SUFFIX.fastq.gz

where the _R1_ tag is always present. This is the single-end case. In the paired-end case, a second file is expected:

PREFIX_R2_SUFFIX.fastq.gz

The PREFIX indicates the sample name. The SUFFIX does not convey any information per se. The default read tag ("_R[12]_") handles this case.

The following commands expect to find data with a readtag _R1_ or _R2_:

FastQFactory("*fastq.gz")

This behaviour can be changed if the data use a different read tag (e.g. "[12].fastq.gz"):

FastQFactory("*fastq.gz", read_tag="_[12].")

Sometimes, for instance in long-read experiments, the naming convention is different and does not follow the single/paired-end convention. If so, set the read tag to None:

FastQFactory("*ccs.fastq.gz", read_tag=None)

In that case, paired is set to False.
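
To make the role of the read tag concrete, here is a minimal sketch that groups file names by the text preceding the read tag. The helper group_by_read_tag is hypothetical and uses only the standard library; FastQFactory may proceed differently internally.

```python
import re
from collections import defaultdict


def group_by_read_tag(filenames, read_tag="_R[12]_"):
    """Map each sample tag to its sorted list of files (hypothetical helper)."""
    groups = defaultdict(list)
    for name in filenames:
        if read_tag is None:
            # No read tag: the sample is whatever precedes the first dot.
            tag = name.split(".", 1)[0]
        else:
            match = re.search(read_tag, name)
            if match is None:
                raise ValueError(f"read tag {read_tag!r} not found in {name}")
            # The sample name is the part before the read tag.
            tag = name[: match.start()]
        groups[tag].append(name)
    return {tag: sorted(files) for tag, files in groups.items()}


files = [
    "A_R1_001.fastq.gz",
    "A_R2_001.fastq.gz",
    "B_R1_001.fastq.gz",
    "B_R2_001.fastq.gz",
]
groups = group_by_read_tag(files)
# The data are treated as paired when every sample has exactly two files.
paired = all(len(found) == 2 for found in groups.values())
```

Because the read tag is a regular expression, a pattern such as "_[12]." also matches inside a name like sample_replicate_1_1.fastq.gz, so ambiguous names call for a more specific tag.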

A directory (searched recursively or not) may contain many samples. This class collects all the sample prefixes in the tags attribute.

Given a tag, one can get the corresponding file(s):

ff = FastQFactory("*fastq.gz")
ff.tags
ff.get_file1(ff.tags[0])
len(ff)

Constructor

Parameters:
  • pattern (str) -- a glob pattern (e.g., H*fastq.gz)

  • extension (list) -- not used

  • read_tag (str) -- regex tag used to join paired-end files. Some characters need to be escaped with a backslash to be interpreted literally (e.g. '_R[12]_\.fastq\.gz')

  • extra_prefixes_to_strip -- common prefixes are removed automatically. However, you may have extra prefixes, not common to all samples, that need to be removed. Provide them as a list, with or without the trailing dot.

  • sample_pattern -- if a sample pattern is provided, prefixes are not removed automatically. The sample_pattern must include the string {sample} to define the expected sample name. For instance, given a filename A_sorted.fastq.gz where sorted appears in all samples but is not wanted, use sample_pattern='{sample}_sorted.fastq.gz' and your sample name will be just 'A'.

get_file1(tag=None)[source]
get_file2(tag=None)[source]
property paired

guess whether data is paired or not

class FileFactory(pattern, extra_prefixes_to_strip=[], sample_pattern=None, **kwargs)[source]

Factory to handle a set of files

from sequana.snaketools import FileFactory
ff = FileFactory("H*.gz")
ff.filenames

A set of useful methods is available, based on this convention:

>>> fullpath = "/home/user/test/A.fastq.gz"
>>> dirname(fullpath)
'/home/user/test'
>>> basename(fullpath)
'A.fastq.gz'
>>> realpath(fullpath)  # relative components such as ".." are expanded

>>> all_extensions
"fastq.gz"
>>> extensions
".gz"

A basename is the last component of a Unix pathname, i.e. what follows the last slash.

dirname returns everything but the final basename in a pathname.
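
The convention above maps closely onto the standard library. The sketch below uses only os.path and string splitting, not FileFactory itself, so the way the real properties are computed is an assumption.

```python
import os

fullpath = "/home/user/test/A.fastq.gz"

dirname = os.path.dirname(fullpath)    # everything but the final component
basename = os.path.basename(fullpath)  # the final component

# Everything after the first dot of the basename.
all_extensions = basename.split(".", 1)[1]
# Only the final extension, including its leading dot.
extension = os.path.splitext(basename)[1]
# The basename stripped of all extensions.
filename = basename.split(".", 1)[0]
```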

Changed in version 0.8.7: attributes used to be recomputed at each access. For small projects this is transparent, but on novogene or large sets of samples it takes too much time. The recomputation was there in case FileFactory attributes such as the input directory or pattern were changed. In practice this does not happen, so the basenames and other attributes are now computed once and stored.

Some files may be prefixed with a common name separated by a dot. For example, with pacbio data you may have:

demultiplex.A.fastq.gz
demultiplex.B.fastq.gz

We remove the common prefixes automatically so that you end up with A and B as sample names.

Constructor

Parameters:
  • pattern -- can be a filename, a list of filenames, or a glob pattern (a Unix pattern with wildcards). For instance, */*fastq.gz

  • extra_prefixes_to_strip -- common prefixes are removed automatically. However, you may have extra prefixes, not common to all samples, that need to be removed. Provide them as a list, with or without the trailing dot.

  • sample_pattern -- if a sample pattern is provided, prefixes are not removed automatically. The sample_pattern must include the string {sample} to define the expected sample name. For instance, given a filename A_sorted.fastq.gz where sorted appears in all samples but is not wanted, use sample_pattern='{sample}_sorted.fastq.gz' and your sample name will be just 'A'.

property all_extensions

get all trailing extensions

property basenames

the basename without the path (e.g. readme.txt)

property extensions

get final extension

property filenames

get basename without extension (e.g., readme)

property realpaths

the full path (e.g., /home/user/readme.txt)

Pipeline manager

class PipelineManager(name: str, config: str, schema=None, sample_func=None, extra_prefixes_to_strip=[], sample_pattern=None, **kwargs)[source]

Utility to easily manage a Snakemake pipeline, including its input files.

Inside a snakefile, use it as follows:

from sequana import PipelineManager
manager = PipelineManager("pipeline_name", "config.yaml")

This will expect some specific fields in the config file:

- input_directory: path_to_find_input_files
- input_readtag: "_R[12]_"
- input_pattern: "*.fastq.gz"

You may omit input_readtag, which is not required for non-paired data. For instance, PacBio and Nanopore files are not paired, so no read tag is needed. If instead you are dealing with Illumina/MGI data sets, you must provide this field if and only if you want your data to be processed as paired data (or single end). See later for more details.

You may omit the input_directory but then the input_pattern must match files to be found locally.

Behind the scenes, FileFactory or FastQFactory provides the sample names and their tags in samples, where the tags are the sample names with the read tags removed (if required).

The manager tells you whether the samples are paired, assuming that all samples are homogeneous (either all paired or all single-ended) and that the user-provided read_tag can discriminate the sample names unambiguously.

In sequencing data, the sequences are stored in one file (single-end data) or in two files (paired-end data). In both cases, most common sequencers append a so-called read tag to identify the first and second file. Traditionally, e.g. with Illumina sequencers, the read tags are _R1_ and _R2_, or a trailing _1 and _2. Note that sample names sometimes include such a tag. Consider e.g. sample_replicate_1_R1_.fastq.gz or sample_replicate_1_1.fastq.gz: such names are tricky to handle.

The sample names are extracted by cutting filenames at the first dot that is encountered (presumably before the extension). For instance the sample name for the file:

A.fastq.gz

will be A. Sometimes you may have ambiguous names. For instance, they may start with a common prefix. Consider these two files:

demultiplex.A.fastq.gz
demultiplex.B.fastq.gz

They would both have the same sample name. So, we remove all prefixes that are common to all files. Therefore, our sample names are as expected:

A
B

If extra prefixes not common to all samples are present and you want to remove them, you can do so with a field in the config file called extra_prefixes_to_strip. For instance, with:

demultiplex.A.fastq.gz
demultiplex.mess.B.fastq.gz

your sample names will be A and mess. You can set this key:value pair in the config file:

extra_prefixes_to_strip: ["mess"]

and get A and B as sample names.
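
The prefix rules described above can be sketched in a few lines. The function sample_names is hypothetical and only mirrors the documented behaviour (common dot-separated prefixes removed, optional extra prefixes removed, name cut at the first remaining dot); the actual implementation may differ.

```python
def sample_names(filenames, extra_prefixes_to_strip=()):
    """Derive sample names from basenames (hypothetical sketch)."""
    tokens = [name.split(".") for name in filenames]

    # Count the leading dot-separated tokens shared by every file.
    n_common = 0
    if len(tokens) > 1:
        for column in zip(*tokens):
            if len(set(column)) != 1:
                break
            n_common += 1
    # Always keep at least one token per file.
    n_common = min(n_common, min(len(parts) for parts in tokens) - 1)

    names = []
    for parts in tokens:
        parts = parts[n_common:]
        # Drop extra prefixes, given with or without a trailing dot.
        for prefix in extra_prefixes_to_strip:
            if len(parts) > 1 and parts[0] == prefix.rstrip("."):
                parts = parts[1:]
        # The sample name is what precedes the first remaining dot.
        names.append(parts[0])
    return names


files = ["demultiplex.A.fastq.gz", "demultiplex.mess.B.fastq.gz"]
without_extra = sample_names(files)
with_extra = sample_names(files, extra_prefixes_to_strip=["mess"])
```

Here without_extra holds A and mess, while with_extra holds A and B, matching the two cases described above.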

If you need a custom rule to derive sample names from the filenames, you may provide a function with the sample_func parameter. If so, you must provide the input_directory and input_pattern to identify the files to process. For instance, in the sequana_fastqc pipeline, you set the input_directory and input_pattern and use this function to extract the sample names:

from sequana_pipetools import PipelineManager
def func(filename):
    return filename.split("/")[-1].split('.', 1)[0]
pm = PipelineManager("fastqc", "config.yaml", sample_func=func)
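
Before handing such a function to the manager, it can be checked directly on a few paths. The paths below are made up for illustration.

```python
def func(filename):
    # Keep the basename, then cut at the first dot.
    return filename.split("/")[-1].split(".", 1)[0]


paths = ["/data/run1/A.fastq.gz", "/data/run1/B_R1_.fastq.gz"]
names = [func(path) for path in paths]
```

Note that this particular function keeps any read tag in the name (the second path yields B_R1_), so it suits data where each sample is a single file.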

The manager then gives easy access to the data through a FastQFactory instance:

manager.ff.filenames

Finally, you can define a sample pattern using a simple syntax such as demultiplex.{sample}.fastq.gz. Set it in your config file using:

sample_pattern: "demultiplex.{sample}.fastq.gz"

in which case, no prefixes are removed.
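
One way to picture how a sample pattern resolves to a sample name is to turn it into a regular expression. The helper sample_from_pattern is hypothetical; it illustrates the {sample} syntax and is not the code used by the manager.

```python
import re


def sample_from_pattern(filename, sample_pattern):
    """Return the {sample} part of filename, or None if it does not match."""
    if "{sample}" not in sample_pattern:
        raise ValueError("sample_pattern must contain '{sample}'")
    before, after = sample_pattern.split("{sample}", 1)
    # Escape the literal parts so that dots are not treated as wildcards.
    regex = re.escape(before) + r"(?P<sample>.+)" + re.escape(after)
    match = re.fullmatch(regex, filename)
    return match.group("sample") if match else None


first = sample_from_pattern("demultiplex.A.fastq.gz", "demultiplex.{sample}.fastq.gz")
second = sample_from_pattern("A_sorted.fastq.gz", "{sample}_sorted.fastq.gz")
```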

Constructor

Parameters:
  • name -- name of the pipeline

  • config -- name of a configuration file

  • schema (str) -- YAML file to validate the config file

  • sample_func -- a user-defined function that extracts sample names from filenames

  • extra_prefixes_to_strip -- common prefixes are removed automatically. However, you may have extra prefixes, not common to all samples, that need to be removed. Provide them as a list, with or without the trailing dot.

  • sample_pattern -- if a sample pattern is provided, prefixes are not removed automatically. The sample_pattern must include the string {sample} to define the expected sample name. For instance, given a filename A_sorted.fastq.gz where sorted appears in all samples but is not wanted, use sample_pattern='{sample}_sorted.fastq.gz' and your sample name will be just 'A'.

getrawdata()[source]

Return list of raw data

This function contains a wildcard for each of the samples found by the manager.

property paired
class PipelineManagerBase(name, config, schema=None, matplotlib_backend='Agg')[source]

For all files except FastQ, please use this class instead of PipelineManager.

clean_multiqc(filename)[source]
error(msg)[source]
get_html_summary(float='left', width=30)[source]
getmetadata()[source]
getrawdata()[source]

Return list of raw data

A list of files is returned (one or two files); otherwise, a function compatible with Snakemake is returned. This function contains a wildcard for each of the samples found by the manager.
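
A "function compatible with Snakemake" is an input function: it receives the wildcards object and returns the matching files. The sketch below shows the general shape with a hypothetical factory make_rawdata_function and a made-up sample table; it does not reproduce what getrawdata() actually returns.

```python
from types import SimpleNamespace


def make_rawdata_function(files_by_sample):
    """Return an input function in the style Snakemake expects."""

    def rawdata(wildcards):
        # Snakemake passes an object whose attributes are the wildcard values.
        return files_by_sample[wildcards.sample]

    return rawdata


# Made-up sample table for illustration.
files_by_sample = {
    "A": ["A_R1_.fastq.gz", "A_R2_.fastq.gz"],
    "B": ["B_R1_.fastq.gz", "B_R2_.fastq.gz"],
}
rawdata = make_rawdata_function(files_by_sample)

# Stand-in for the wildcards object that Snakemake would supply.
files_for_a = rawdata(SimpleNamespace(sample="A"))
```

In a Snakefile, such a function would be given as a rule's input, and Snakemake would call it once per value of the sample wildcard.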

onerror()[source]

Try to report error from slurm

setup(namespace=None, mode='error')[source]

About 90% of the errors arise because users did not set a matplotlib backend properly. In the setup() function, we therefore set the backend to Agg on purpose. Set the matplotlib_backend parameter to None to avoid this behaviour.

property snakefile
teardown(extra_dirs_to_remove=[], extra_files_to_remove=[], outdir='.')[source]
class PipelineManagerDirectory(name, config, schema=None)[source]

Most generic pipeline manager

Only checks for valid config file and its schema

For all files except FastQ, please use this class instead of PipelineManager.