Code Structure

The following describes part of SPID’s implementation and is meant as a starting point for anyone who wishes to dive into the code.

FlowcellDemultiplexer

The flow cell demultiplexing starting point.

Initiates a LaneDemultiplexerProcess for each lane that requires demultiplexing.

All lane demultiplexing processes run in parallel. The processes are independent, but their output is written to the same directories. Collisions are avoided because the name of every created file contains the lane number.
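As a minimal sketch of this layout (not SPID's actual code), the following starts one process per lane and embeds the lane number in every output file name. The function names and the file name pattern are assumptions made for illustration only.

```python
import multiprocessing
import os
import tempfile


def output_file_name(sample, lane, read_number):
    """Build a lane-specific name so parallel lanes never share a file (hypothetical pattern)."""
    return f"{sample}_L{lane:03d}_R{read_number}.fastq"


def demultiplex_lane(lane, output_dir, samples):
    """Stand-in for a lane process: creates one empty output file per sample."""
    for sample in samples:
        path = os.path.join(output_dir, output_file_name(sample, lane, 1))
        with open(path, "w"):
            pass


def demultiplex_flowcell(lanes, output_dir, samples):
    """Start one process per lane, wait for all of them, and return their exit codes."""
    processes = [
        multiprocessing.Process(target=demultiplex_lane, args=(lane, output_dir, samples))
        for lane in lanes
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return [process.exitcode for process in processes]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as output_dir:
        demultiplex_flowcell([1, 2, 3], output_dir, ["sampleA", "sampleB"])
        # All lanes wrote into the same directory without clashing.
        print(sorted(os.listdir(output_dir)))
```

Because the lane number is part of each name, the lane processes need no locking between them even though they share output directories.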

LaneDemultiplexerProcess

Initiates a LaneDemultiplexer

LaneDemultiplexer

Demultiplexes a single lane.

Initiates multiple BatchDemultiplexerProcess processes and FlushProcess processes according to given parameters.

Coordinates between the different processes using a manager process which manages several shared objects:

input_files_queue

Shared between demultiplexers so that each demultiplexer works on a different input.

output_buffers_queue

Shared between demultiplexers and flushers: demultiplexers fill the queue, flushers read from it and write to the output files. This is the most memory-consuming shared object.

output_locks

Shared between flushers to avoid simultaneous writing to the same files

num_sequences_per_sample_dict

Shared between flushers to make sure the number of reads written to each output file does not exceed the maximum threshold.
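The four shared objects above could be created roughly as follows, assuming the manager process is a Python multiprocessing.Manager. Only the object names come from the description; the create_shared_objects helper, the batch contents, and the sample names are hypothetical.

```python
import multiprocessing


def create_shared_objects(manager, input_batches, sample_names):
    """Create the four shared objects on `manager` and return them in a dict."""
    input_files_queue = manager.Queue()
    for batch in input_batches:
        input_files_queue.put(batch)
    return {
        # Work items for the demultiplexers: one entry per input file set.
        "input_files_queue": input_files_queue,
        # Full output buffers, filled by demultiplexers and drained by flushers.
        "output_buffers_queue": manager.Queue(),
        # One lock per sample, so only one flusher writes a sample's files at a time.
        "output_locks": {name: manager.Lock() for name in sample_names},
        # Running count of sequences written per sample.
        "num_sequences_per_sample_dict": manager.dict({name: 0 for name in sample_names}),
    }


if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        shared = create_shared_objects(
            manager,
            [["batch1_R1.fastq.gz", "batch1_R2.fastq.gz"]],
            ["sample1", "sample2"],
        )
        print(shared["input_files_queue"].qsize())
```

Objects created through a manager are proxies, so they can be passed as arguments to the demultiplexer and flusher processes.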

BatchDemultiplexerProcess

Polls input_files_queue until it is empty.

Creates an InputBatchDemultiplexer for an input files batch and demultiplexes it.
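The polling loop can be sketched as below. This is an illustration under assumptions, not the actual class: run_batch_demultiplexer is a hypothetical name, and the demultiplex_batch callable stands in for creating and running an InputBatchDemultiplexer.

```python
import queue


def run_batch_demultiplexer(input_files_queue, demultiplex_batch):
    """Take batches off the queue until none remain; return how many were handled."""
    handled = 0
    while True:
        try:
            # Non-blocking get: an empty queue means there is no work left.
            batch = input_files_queue.get_nowait()
        except queue.Empty:
            return handled
        demultiplex_batch(batch)
        handled += 1
```

Since every worker takes items from the same queue, each input batch is processed by exactly one demultiplexer.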

InputBatchDemultiplexer

Holds an output buffer for each sample.

Uses SingleFastqFileSetReader to read fragments from input FASTQ files.

Uses an instance of FragmentDemultiplexer to demultiplex each fragment.

Whenever an output buffer is full, it is placed in the shared output_buffers_queue.
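A minimal sketch of the per-sample buffering, with assumed names throughout: the SampleOutputBuffers class, the max_buffer_size parameter, and the (sample, fragments) tuple placed on the queue are illustrative. Handing off partially filled buffers at the end of a batch (flush_remaining) is also an assumption; the description above only covers full buffers.

```python
import queue
from collections import defaultdict


class SampleOutputBuffers:
    """Keeps one fragment list per sample and hands full lists to a shared queue."""

    def __init__(self, output_buffers_queue, max_buffer_size):
        self._queue = output_buffers_queue
        self._max_buffer_size = max_buffer_size
        self._buffers = defaultdict(list)

    def add(self, sample, fragment):
        """Append a fragment; hand the buffer off once it reaches the size limit."""
        buffer = self._buffers[sample]
        buffer.append(fragment)
        if len(buffer) >= self._max_buffer_size:
            self._hand_off(sample)

    def flush_remaining(self):
        """Hand off whatever is still buffered (assumed end-of-batch behaviour)."""
        for sample in list(self._buffers):
            self._hand_off(sample)

    def _hand_off(self, sample):
        # Removing the list means the next add() for this sample starts a fresh buffer.
        self._queue.put((sample, self._buffers.pop(sample)))
```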

SingleFastqFileSetReader

Reads sequences from a single set of FASTQ files.

For example, in a paired-end run such a set can be:

lane2_NoIndex_L002_R1_001.fastq.gz
lane2_NoIndex_L002_R2_001.fastq.gz
lane2_NoIndex_L002_R3_001.fastq.gz

In this case, a sequence (or fragment) consists of 3 reads and is therefore read from all 3 files.

Uses threading to read all FASTQ files in parallel.
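One way to read a file set in parallel is sketched below, using a thread pool with one thread per file. It is not the actual SingleFastqFileSetReader: the function names are hypothetical, records are assumed to span exactly four lines, and each file is loaded fully into memory, which a real reader would likely avoid.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def read_fastq_records(path):
    """Return the (header, sequence, quality) records of one gzipped FASTQ file."""
    records = []
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return records
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line carries no data we need
            quality = handle.readline().rstrip("\n")
            records.append((header, sequence, quality))


def read_fragments(paths):
    """Read every file of the set in its own thread, then group records by position."""
    with ThreadPoolExecutor(max_workers=len(paths)) as executor:
        per_file_records = list(executor.map(read_fastq_records, paths))
    # The n-th record of each file belongs to the same fragment.
    return list(zip(*per_file_records))
```

For the three-file set above, every returned fragment is a tuple of three records, one per read file.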

FragmentDemultiplexer

This class tries to determine to which of the samples a certain fragment belongs.

It uses pre-built tag trees which will be described in a different document.

FlushProcess

Polls output_buffers_queue until it is empty.

Retrieves an output buffer from queue

Obtains a shared lock for the buffer’s sample to ensure a single flush per sample at a time.

Flushes the output
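The flusher steps above can be sketched as follows. The run_flusher name, the (sample, fragments) queue items, and the write_buffer callable are assumptions for illustration. The sketch updates the per-sample counter but does not model what happens when the maximum threshold is reached, nor how a flusher tells a temporarily empty queue from finished work.

```python
import queue


def run_flusher(output_buffers_queue, output_locks, num_sequences_per_sample, write_buffer):
    """Flush buffers from the queue until it is empty; return how many were flushed."""
    flushed = 0
    while True:
        try:
            sample, buffer = output_buffers_queue.get_nowait()
        except queue.Empty:
            return flushed
        # Hold the sample's lock so no other flusher writes to the same files.
        with output_locks[sample]:
            write_buffer(sample, buffer)
            num_sequences_per_sample[sample] += len(buffer)
        flushed += 1
```

Locks are held per sample, so flushers working on different samples can write concurrently.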