===============
Getting Started
===============

Execution
=========

The input files of SPID
=======================

SPID is usually run after Illumina's **CASAVA/bcl2fastq** or an equivalent tool. For example, for
the run 150820_D00257_0193_BC7NLTANXX::

    configureBclToFastq.pl \
      --use-bases-mask y*,y* \
      --input-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX/Data/Intensities/BaseCalls \
      --output-dir /path/to/CASAVA/output/dir/

For bcl2fastq version 2 or above::

    bcl2fastq --use-bases-mask y*,i* \
        --create-fastq-for-index-reads \
        --runfolder-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX \
        --output-dir /path/to/CASAVA/output/dir/

Here, i* in **--use-bases-mask** marks the read(s) that are defined as index reads on the sequencing machine. You can find
this information in the file RunInfo.xml within the directory /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX.

In bcl2fastq version 1 only, the configureBclToFastq.pl step is then followed by::

    make -C /path/to/CASAVA/output/dir/ -j <num-jobs>

The parameter **--use-bases-mask y*,y*** guides bcl2fastq to only perform base calling, leaving demultiplexing for SPID.

*Note:* If the flow cell is paired-end, use **--use-bases-mask y*,y*,y*** instead, or y*,i*,y* for
bcl2fastq version 2 or above.
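
The mask rules above can be sketched in Python. This is only an illustration and not part of SPID: the function name
``bases_mask`` and the sample XML are hypothetical, and the sketch assumes that RunInfo.xml lists each read as a ``Read``
element with ``Number`` and ``IsIndexedRead`` attributes.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def bases_mask(run_info_xml, bcl2fastq_v2=False):
        """Build a --use-bases-mask value from the Read elements of RunInfo.xml.

        Assumption: each read is described by a <Read> element with the
        attributes Number and IsIndexedRead ("Y" or "N").
        """
        root = ET.fromstring(run_info_xml)
        reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
        parts = []
        for read in reads:
            is_index = read.get("IsIndexedRead") == "Y"
            # Version 1: every read is y*, so demultiplexing is left to SPID.
            # Version 2: index reads are written as i*.
            parts.append("i*" if (is_index and bcl2fastq_v2) else "y*")
        return ",".join(parts)


    # Hypothetical paired-end run with one index read.
    SAMPLE_RUN_INFO = """<RunInfo><Run><Reads>
    <Read Number="1" NumCycles="101" IsIndexedRead="N"/>
    <Read Number="2" NumCycles="7" IsIndexedRead="Y"/>
    <Read Number="3" NumCycles="101" IsIndexedRead="N"/>
    </Reads></Run></RunInfo>"""

    print(bases_mask(SAMPLE_RUN_INFO))                     # y*,y*,y*
    print(bases_mask(SAMPLE_RUN_INFO, bcl2fastq_v2=True))  # y*,i*,y*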

bcl2fastq's output will then serve as input to SPID.
SPID accepts uncompressed files or compressed files in gz format, as produced by bcl2fastq version 1. It does not yet
support the bgzf format produced by bcl2fastq version 2, so you need to either run bcl2fastq with the --no-bgzf-compression
parameter to get the output in gz format or, more efficiently, uncompress the bcl2fastq output files before using them with SPID.
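
One possible way to uncompress the bcl2fastq output before running SPID is sketched below. The helper name
``gunzip_fastq`` is hypothetical and not part of SPID. The sketch relies on the fact that a bgzf file is a series of
concatenated gzip members, which Python's ``gzip`` module reads as one stream.

.. code-block:: python

    import gzip
    import shutil
    from pathlib import Path


    def gunzip_fastq(gz_path):
        """Write an uncompressed copy of gz_path next to it and return its path."""
        gz_path = Path(gz_path)
        out_path = gz_path.with_suffix("")  # sample.fastq.gz -> sample.fastq
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            # Stream the data in chunks so large FASTQ files are not held in memory.
            shutil.copyfileobj(src, dst)
        return out_path


    # Possible usage on a whole output directory (the path is a placeholder):
    # for path in Path("/path/to/bcl2fastq/output/dir/").rglob("*.fastq.gz"):
    #     gunzip_fastq(path)

The compressed originals are left in place, so remove them yourself if disk space is a concern.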

Following SPID's installation, two command-line scripts are installed named **spid-demultiplex.py** and **spid-prepare-barcodesheet.py**.

To see all command-line parameters, run::

    spid-prepare-barcodesheet.py -h
    spid-demultiplex.py -h

BarcodeSheet
============

BarcodeSheet.csv is SPID's equivalent of bcl2fastq's SampleSheet.csv, but it is more elaborate so that more complex indexes
can be described. As in bcl2fastq, the barcode sheet tells SPID how to assign reads to samples, and samples to projects.
The script **spid-demultiplex.py** takes the BarcodeSheet.csv file as input. You can prepare the BarcodeSheet.csv file manually, or use
the script **spid-prepare-barcodesheet.py** to prepare it by automatically detecting the barcodes'
features and their locations on the reads.

The fields of BarcodeSheet.csv file:

=========================== ===========
Column                      Description
=========================== ===========
lane                        Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
project_name                The project the sample belongs to, as in bcl2fastq. Required
sample_name                 Sample name, as in bcl2fastq. Required
sub_sample_name             Unique name for samples with identical sample name.
                            The output files of these samples will be grouped into sub folders under the same folder with the
                            sample_name. Optional.
tag1_sequence               The first barcode sequence. If no barcode exists, write *NoIndex*. Required
tag2_sequence               The second barcode sequence. Optional
tag3_sequence               The third barcode sequence. Optional
tag1_name                   Tag name that will be used in the output file name. We usually use tag1_sequence as tag1_name. Required
tag2_name                   Same as tag1_name for the second tag. Required when tag2_sequence is provided
tag3_name                   Same as tag1_name for the third tag. Required when tag3_sequence is provided
master_tag                  Integer (1-3). When using the same barcode for all sub samples under same sample_name,
                            you can declare the common barcode as a master tag, so that if one of the other barcodes was
                            not identified but the master barcode was identified, the reads will not belong to the
                            general undetermined reads but to the local undetermined under the sample_name folder
cut_tag1                    Whether to cut the barcode sequence from the sequences of read #1 (yes/no). Required
cut_tag2                    Same as cut_tag1 for read #2 (yes/no). Required when tag2_sequence is provided
cut_tag3                    Same as cut_tag1 for read #3 (yes/no). Required when tag3_sequence is provided
maximal_mismatches_tag1     Maximal number of mismatches in tag #1. Integer. Required
maximal_mismatches_tag2     Maximal number of mismatches in tag #2. Integer. Required when tag2_sequence is provided
maximal_mismatches_tag3     Maximal number of mismatches in tag #3. Integer. Required when tag3_sequence is provided
maximal_offset_tagged_read  Allowed offset (to the left or to the right) of the first barcode on the read
                            from its planned location. The format is (int)l(int)r. For example, 3l5r allows the barcode
                            to be offset by up to 3 bases toward the left side of the read (3') and 5 bases toward the right side (5').
                            No offset is written as **0l0r**. Required
=========================== ===========
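
The (int)l(int)r format of maximal_offset_tagged_read can be checked before running SPID. The following is a minimal
sketch of such a check. The names ``OFFSET_RE`` and ``parse_offset`` are hypothetical, and the sketch does not claim to
reproduce SPID's own parsing.

.. code-block:: python

    import re

    # One or more digits, the letter l, one or more digits, the letter r.
    OFFSET_RE = re.compile(r"^(\d+)l(\d+)r$")


    def parse_offset(value):
        """Return (left, right) offsets from a string such as '3l5r'."""
        match = OFFSET_RE.match(value.strip())
        if match is None:
            raise ValueError(f"expected <int>l<int>r, got {value!r}")
        return int(match.group(1)), int(match.group(2))


    print(parse_offset("3l5r"))  # (3, 5)
    print(parse_offset("0l0r"))  # (0, 0)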


If you choose to create the BarcodeSheet.csv file automatically with the **spid-prepare-barcodesheet.py** script, you need
to create a BarcodeList.csv file that contains only the following fields:

=========================== ===========
Column                      Description
=========================== ===========
lane                        Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
sample_name                 Sample name, as in bcl2fastq. Required
tag1_sequence               The first barcode sequence. If no barcode exists, write *NoIndex*. Required
tag2_sequence               The second barcode sequence. Optional
tag3_sequence               The third barcode sequence. Optional
project_name                The project the sample belongs to, as in bcl2fastq. Required
=========================== ===========
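
A BarcodeList.csv file could be generated with Python's ``csv`` module, as sketched below. The sketch assumes that the
file starts with a header row holding the column names in the order listed above, so check it against the downloadable
example before relying on it. The function name ``write_barcode_list``, the sample names and the barcode sequences are
hypothetical.

.. code-block:: python

    import csv
    import io

    # Column order as listed in the table above (assumed to be the file order).
    FIELDS = ["lane", "sample_name", "tag1_sequence",
              "tag2_sequence", "tag3_sequence", "project_name"]
    REQUIRED = ("lane", "sample_name", "tag1_sequence", "project_name")


    def write_barcode_list(rows, handle):
        """Write rows (dicts keyed by column name) as CSV to an open text handle."""
        writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
        writer.writeheader()
        for row in rows:
            missing = [name for name in REQUIRED if not row.get(name)]
            if missing:
                raise ValueError(f"missing required fields: {missing}")
            writer.writerow(row)


    # Hypothetical samples, each represented by 2 barcodes.
    rows = [
        {"lane": 1, "sample_name": "sampleA", "tag1_sequence": "ACGTAC",
         "tag2_sequence": "TTGCA", "project_name": "projectX"},
        {"lane": 1, "sample_name": "sampleB", "tag1_sequence": "GATTCA",
         "tag2_sequence": "CCATG", "project_name": "projectX"},
    ]
    buffer = io.StringIO()
    write_barcode_list(rows, buffer)
    print(buffer.getvalue())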

BarcodeList example
~~~~~~~~~~~~~~~~~~~~~

Here is a BarcodeList example for samples that are represented by 2 barcodes:

:download:`BarcodeList.csv for samples which are represented by 2 barcodes <BarcodeList.txt>`


BarcodeSheet examples
~~~~~~~~~~~~~~~~~~~~~

Here are a couple of barcode sheet examples - one for single-read and one for paired-end:

:download:`BarcodeSheet.csv for Single-read with read lengths 51, 7 <BarcodeSheet_singleRead_51_7.txt>`

:download:`BarcodeSheet.csv for Paired-end with read lengths 101, 7, 101 <BarcodeSheet_pairedEnd_101_7_101.txt>`


Usage examples
==============

Here is an example execution of spid-prepare-barcodesheet.py::

    spid-prepare-barcodesheet.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
        --barcode-list /path/to/BarcodeList.csv \
        --barcode-sheet-output /path/to/BarcodeSheet.csv \
        --lanes 1

Here is an example execution of spid-demultiplex.py::

    spid-demultiplex.py --casava-output-dir  /path/to/bcl2fastq/output/dir/ \
        --barcode-sheet /path/to/BarcodeSheet.csv \
        --output-dir /path/to/SPID/output/dir/ \
        --lanes 1

Concurrency
~~~~~~~~~~~

Since running SPID on a single lane may take up a lot of memory, it is possible to use it in the following manner:

#. Run 8 processes of CASAVA/bcl2fastq simultaneously, one per lane. Use the **--tiles** parameter to run each process
   on a different lane, for example **s_3_*** to run bcl2fastq on lane 3. Give each process a different output directory::

    configureBclToFastq.pl --use-bases-mask y*,y*,y* \
        --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
        --output-dir /path/to/CASAVA/output/dir/lane_1/ \
        --tiles s_1_*
    configureBclToFastq.pl --use-bases-mask y*,y*,y* \
        --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
        --output-dir /path/to/CASAVA/output/dir/lane_2/ \
        --tiles s_2_*
    [...]
    configureBclToFastq.pl --use-bases-mask y*,y*,y* \
        --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
        --output-dir /path/to/CASAVA/output/dir/lane_8/ \
        --tiles s_8_*

#. Run 8 processes of SPID simultaneously, each using a different bcl2fastq output directory as its input directory.

*Note:* These processes can share an output directory, as shown below.
Collisions are avoided since the created FASTQ files contain the lane number::

    spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
        --casava-output-dir /path/to/CASAVA/output/dir/lane_1/ \
        --output-dir /path/to/SPID/output/dir/ \
        --lanes 1
    spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
        --casava-output-dir /path/to/CASAVA/output/dir/lane_2/ \
        --output-dir /path/to/SPID/output/dir/ \
        --lanes 2
    [...]
    spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
        --casava-output-dir /path/to/CASAVA/output/dir/lane_8/ \
        --output-dir /path/to/SPID/output/dir/ \
        --lanes 8
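
The per-lane SPID runs can also be launched from a small Python script instead of being typed one by one. The
following is a minimal sketch under the directory layout used above (``lane_1/`` to ``lane_8/``). The function names
``demultiplex_command`` and ``run_all_lanes`` are hypothetical, and only the command-line parameters shown in the
examples above are used.

.. code-block:: python

    import subprocess
    from concurrent.futures import ThreadPoolExecutor


    def demultiplex_command(lane, casava_root, barcode_sheet, output_dir):
        """Build the spid-demultiplex.py command line for a single lane."""
        return [
            "spid-demultiplex.py",
            "--barcode-sheet", barcode_sheet,
            "--casava-output-dir", f"{casava_root}/lane_{lane}/",
            "--output-dir", output_dir,
            "--lanes", str(lane),
        ]


    def run_all_lanes(casava_root, barcode_sheet, output_dir, lanes=range(1, 9)):
        """Run one SPID process per lane in parallel; return {lane: exit code}."""
        lanes = list(lanes)
        commands = [demultiplex_command(lane, casava_root, barcode_sheet, output_dir)
                    for lane in lanes]
        # Each thread only waits for its own child process to finish.
        with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
            codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
        return dict(zip(lanes, codes))


    # Possible usage (the paths are placeholders):
    # exit_codes = run_all_lanes("/path/to/CASAVA/output/dir",
    #                            "/path/to/BarcodeSheet.csv",
    #                            "/path/to/SPID/output/dir/")

Lower the number of lanes passed in one call if the combined memory use of the processes is too high for the machine.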