===============
Getting Started
===============

Execution
=========

The input files of SPID
=======================

SPID's execution is usually preceded by a run of Illumina's **CASAVA/bcl2fastq** or an equivalent tool. For example, for the run 150820_D00257_0193_BC7NLTANXX::

    configureBclToFastq.pl \
        --use-bases-mask y*,y* \
        --input-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX/Data/Intensities/BaseCalls \
        --output-dir /path/to/CASAVA/output/dir/

For bcl2fastq version 2 or above::

    bcl2fastq --use-bases-mask y*,i* \
        --create-fastq-for-index-reads \
        --runfolder-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX \
        --output-dir /path/to/CASAVA/output/dir/

Here, i* in **--use-bases-mask** marks the read(s) defined as index reads on the sequencing machine. You can find this information in the file RunInfo.xml within the directory /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX.

For bcl2fastq version 1 only, this is then followed by::

    make -C /path/to/CASAVA/output/dir/ -j

The parameter **--use-bases-mask y*,y*** guides bcl2fastq to only perform base calling, leaving demultiplexing to SPID.

*Note:* If the flow cell is paired-end, the parameter **--use-bases-mask y*,y*,y*** should be used instead, or y*,i*,y* for bcl2fastq version 2 or above.

bcl2fastq's output then serves as input to SPID. SPID accepts uncompressed files, or compressed files in gz format as produced by bcl2fastq version 1. It does not currently support the bgzf format produced by bcl2fastq version 2, so you will need to either run bcl2fastq with the **--no-bgzf-compression** parameter to get output in gz format or, more efficiently, uncompress the bcl2fastq output files before using them with SPID.

Following SPID's installation, two command-line scripts are available: **spid-demultiplex.py** and **spid-prepare-barcodesheet.py**.
To see all command-line parameters, run::

    spid-prepare-barcodesheet.py -h
    spid-demultiplex.py -h

BarcodeSheet
============

BarcodeSheet.csv is SPID's equivalent to bcl2fastq's SampleSheet.csv, but it is more elaborate so that more complex indexes can be described. As in bcl2fastq, the barcode sheet directs SPID how to assign reads to samples, and samples to projects.

The script **spid-demultiplex.py** takes the BarcodeSheet.csv file as input. You can prepare the BarcodeSheet.csv file manually, or use the script **spid-prepare-barcodesheet.py** to generate it by automatically detecting the barcode features and their locations on the reads.

The fields of the BarcodeSheet.csv file:

===========================  ===========
Column                       Description
===========================  ===========
lane                         Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
project_name                 The project the sample belongs to, as in bcl2fastq. Required
sample_name                  Sample name, as in bcl2fastq. Required
sub_sample_name              Unique name for samples with an identical sample name. The output files of these samples will be grouped into sub folders under the same folder, named after the sample_name. Optional
tag1_sequence                The first barcode sequence. If no barcode exists, write *NoIndex*. Required
tag2_sequence                The second barcode sequence. Optional
tag3_sequence                The third barcode sequence. Optional
tag1_name                    Tag name which will be used for the output file name. We usually use tag1_sequence as tag1_name. Required
tag2_name                    Same as tag1_name for the second tag. Required when tag2_sequence is provided
tag3_name                    Same as tag1_name for the third tag. Required when tag3_sequence is provided
master_tag                   Integer (1-3). When the same barcode is used for all sub samples under the same sample_name, you can declare the common barcode as a master tag. If one of the other barcodes is not identified but the master barcode is, the reads are then assigned to the local undetermined reads under the sample_name folder rather than to the general undetermined reads
cut_tag1                     Whether to cut the barcode sequence from the sequences for read 1 (yes/no). Required
cut_tag2                     Same as cut_tag1 for read #2 (yes/no). Required when tag2_sequence is provided
cut_tag3                     Same as cut_tag1 for read #3 (yes/no). Required when tag3_sequence is provided
maximal_mismatches_tag1      Maximal number of mismatches in tag #1. Integer. Required
maximal_mismatches_tag2      Maximal number of mismatches in tag #2. Integer. Required when tag2_sequence is provided
maximal_mismatches_tag3      Maximal number of mismatches in tag #3. Integer. Required when tag3_sequence is provided
maximal_offset_tagged_read   Allowed offset (to the left or to the right) of the first barcode's location on the read relative to its planned location. The format is (int)l(int)r. For example, 3l5r allows the barcode to be offset by up to 3 bases toward the left side of the read (3') and up to 5 bases toward the right side (5'). No offset is written as **0l0r**. Required
===========================  ===========

If you choose to create the BarcodeSheet.csv file automatically with the **spid-prepare-barcodesheet.py** script, you need to create a BarcodeList.csv file that contains only the following fields:

===========================  ===========
Column                       Description
===========================  ===========
lane                         Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
sample_name                  Sample name, as in bcl2fastq. Required
tag1_sequence                The first barcode sequence. If no barcode exists, write *NoIndex*. Required
tag2_sequence                The second barcode sequence. Optional
tag3_sequence                The third barcode sequence. Optional
project_name                 The project the sample belongs to, as in bcl2fastq. Required
===========================  ===========

BarcodeList example
~~~~~~~~~~~~~~~~~~~~~

Here is a BarcodeList example for samples that are represented by 2 barcodes:

:download:`BarcodeList.csv for samples which are represented by 2 barcodes `

BarcodeSheet examples
~~~~~~~~~~~~~~~~~~~~~

Here are a couple of barcode sheet examples, one for single-read and one for paired-end:

:download:`BarcodeSheet.csv for Single-read with read lengths 51, 7 `

:download:`BarcodeSheet.csv for Paired-end with read lengths 101, 7, 101 `

Usage examples
==============

Here is an example execution of spid-prepare-barcodesheet.py::

    spid-prepare-barcodesheet.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
        --barcode-list /path/to/BarcodeList.csv \
        --barcode-sheet-output /path/to/BarcodeSheet.csv \
        --lanes 1

Here is an example execution of spid-demultiplex.py::

    spid-demultiplex.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
        --barcode-sheet /path/to/BarcodeSheet.csv \
        --output-dir /path/to/SPID/output/dir/ \
        --lanes 1

Concurrency
~~~~~~~~~~~

Since running SPID on a single lane may take up a lot of memory, it is possible to process each lane separately, in the following manner:

#. Run 8 processes of CASAVA/bcl2fastq simultaneously, one per lane. Use the **--tiles** parameter to run each process on a different lane, for example: **s_3_*** to run bcl2fastq on lane 3. Give each such process a different output directory::

       configureBclToFastq.pl --use-bases-mask y*,y*,y* \
           --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
           --output-dir /path/to/CASAVA/output/dir/lane_1/ \
           --tiles s_1_*

       configureBclToFastq.pl --use-bases-mask y*,y*,y* \
           --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
           --output-dir /path/to/CASAVA/output/dir/lane_2/ \
           --tiles s_2_*

       [...]

       configureBclToFastq.pl --use-bases-mask y*,y*,y* \
           --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
           --output-dir /path/to/CASAVA/output/dir/lane_8/ \
           --tiles s_8_*

#. Run 8 processes of SPID simultaneously, each using a different bcl2fastq output directory as its input directory. *Note:* These processes can share an output directory, as shown below. Collisions are avoided since the created FASTQ files contain the lane number::

       spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
           --casava-output-dir /path/to/CASAVA/output/dir/lane_1/ \
           --output-dir /path/to/SPID/output/dir/ \
           --lanes 1

       spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
           --casava-output-dir /path/to/CASAVA/output/dir/lane_2/ \
           --output-dir /path/to/SPID/output/dir/ \
           --lanes 2

       [...]

       spid-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
           --casava-output-dir /path/to/CASAVA/output/dir/lane_8/ \
           --output-dir /path/to/SPID/output/dir/ \
           --lanes 8
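
The per-lane SPID invocations in the Concurrency section follow a regular pattern, so they can be generated instead of typed by hand. The following is a minimal Python sketch and is not part of SPID. The helper names **build_spid_commands** and **run_concurrently** are hypothetical, and the **lane_N** directory layout is an assumption taken from the example paths in this section. Only the command-line parameters shown in this section are used.

.. code-block:: python

    import shlex
    import subprocess


    def build_spid_commands(barcode_sheet, casava_root, output_dir, lanes=range(1, 9)):
        """Return one spid-demultiplex.py argument list per lane."""
        commands = []
        for lane in lanes:
            commands.append([
                "spid-demultiplex.py",
                "--barcode-sheet", barcode_sheet,
                # Assumed layout: one bcl2fastq output directory per lane.
                "--casava-output-dir", f"{casava_root}/lane_{lane}/",
                # All lanes share one output directory.
                "--output-dir", output_dir,
                "--lanes", str(lane),
            ])
        return commands


    def run_concurrently(commands):
        """Start every command at once and return the exit codes in order."""
        processes = [subprocess.Popen(command) for command in commands]
        return [process.wait() for process in processes]


    if __name__ == "__main__":
        # Print the commands for review; nothing is executed here.
        for command in build_spid_commands(
            "/path/to/BarcodeSheet.csv",
            "/path/to/CASAVA/output/dir",
            "/path/to/SPID/output/dir/",
        ):
            print(shlex.join(command))

Running the script only prints the eight commands. To launch them, pass the returned list to **run_concurrently** on a machine with enough memory for all lanes, or pass a shorter **lanes** sequence to run fewer processes at a time.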