===========================
IGV ALIGNMENT PREPROCESSOR
===========================

This package contains source and binaries for the IGV alignment preprocessor
module.  The module takes an alignment file as input and outputs
a binary read count density file in h5 format.  Supported input formats include
SAM, BAM, and .aligned (described below).  The results can be
visualized in IGV.

===========================
RUNNING
===========================


The bin directory contains a prebuilt standalone package for Windows,
Mac OS X, and Linux 64-bit platforms.  Java J2SE 5.0 or greater is required.

The preprocessor can be run from the bin directory with one of the included
scripts.

run_win.bat [args]
run_linux-64.sh [args]	(for Linux 64-bit systems)
run_mac-powerpc.sh [args]
run_mac-intel.sh  [args]

Required arguments:

[inputDirectory] input directory containing one or more ".aligned" files
[outputDirectory]  output directory
[genome]  IGV recognized genome, one of mm7, mm8, mm9, hg16, hg17, hg18, hg19,
        zebrafish, sc, spombe, nspora.  If your genome is not on this list contact us
        at igv-help@broad.mit.edu

Optional arguments:

-m [maskDirectory] directory containing binary "mask" files (see below)
-w [windowSize] defaults to 25 bp
-ext [extensionFactor] defaults to 0
-z [maxZoomLevel]  maximum zoom level to compute.  Recommended value is 8
-normalize normalize the read count in each window to counts per million.
    This option multiplies each count by  (1,000,000 / total # of reads)

Note:  the extension factor should be set to the average shear size of your
fragment library (if applicable).
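
As a rough illustration of the -normalize arithmetic described above, the
following Python sketch scales raw per-window counts to counts per million.
The function name and the list-based representation of the counts are
assumptions made for this example; it is not the preprocessor's own code.

```python
def normalize_counts(window_counts, total_reads):
    """Scale raw per-window read counts to counts per million reads.

    Hypothetical helper: multiplies each count by (1,000,000 / total reads),
    as described for the -normalize option.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    factor = 1_000_000 / total_reads
    return [count * factor for count in window_counts]


# Made-up example: four windows from a library of 2,000,000 reads.
print(normalize_counts([0, 4, 10, 2], 2_000_000))
```

Normalized values make tracks from libraries of different sizes easier to
compare side by side.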


===========================
BUILDING
===========================

Instructions for building from source files are below.  Note that this will
replace the contents of the bin directory.

Prerequisites:

Ant 1.7.0 or greater (http://ant.apache.org/)

Java J2SE 5.0 or greater (http://java.sun.com/javase/download)

1.  Download and unzip the source distribution file to a directory of your choice.

2.  Run the provided ant script

       ant -buildfile  build_preprocessor.xml


============================
FILE / DIRECTORY FORMATS
============================

Supported input file formats include SAM, BAM, and ".aligned".  SAM and BAM formats
are described at http://samtools.sourceforge.net/.  A ".aligned" file is a
tab-delimited, 4-column text file containing the chromosome, start position,
end position, and orientation (+/-) of each sequence read.  For example:


chr10   91346072    91346098    +
chr15   9879235 9870261 -
etc

The filename must end with ".aligned"; other files in the input directory
are ignored.  There is no header row.
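
To make the ".aligned" layout concrete, here is a minimal Python sketch of a
reader for it.  The function name is hypothetical, and skipping blank lines and
rejecting malformed rows are assumptions of this example; the preprocessor
itself may treat such input differently.

```python
def parse_aligned(lines):
    """Yield (chromosome, start, end, strand) tuples from ".aligned" lines.

    Each line is expected to hold four tab-separated columns and the file
    has no header row.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are skipped
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(fields)}"
            )
        chrom, start, end, strand = fields
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_number}: bad orientation {strand!r}")
        yield chrom, int(start), int(end), strand


# Usage with an in-memory record; a real caller would pass an open file.
for record in parse_aligned(["chr10\t91346072\t91346098\t+\n"]):
    print(record)
```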

The mask directory is a directory of binary files, one per chromosome, that
specify regions of the genome that should be ignored (are non-alignable).  Each
byte position in a file corresponds to the "window" at the same position in the
data.  For example, assuming a window size of 25, position 10 in the file
corresponds to the region 250-275 bp.  A "masked" region is indicated by a
byte value of zero.
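
The relationship between mask positions and genomic windows can be sketched in
Python as follows.  The helper names are hypothetical, and the zero-based
position numbering is inferred from the example above (position 10 maps to
250-275 bp with a 25 bp window).

```python
def window_region(position, window_size=25):
    """Return the (start, end) bp range covered by a mask position."""
    start = position * window_size
    return start, start + window_size


def is_masked(mask_bytes, position):
    """Return True when the mask byte at this position is zero (masked)."""
    return mask_bytes[position] == 0


# Position 10 with the default 25 bp window, as in the text above.
print(window_region(10))

# A made-up three-window mask: only the middle window is masked.
mask = bytes([1, 0, 1])
print([is_masked(mask, i) for i in range(len(mask))])
```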

The mask file names should begin with the chromosome name followed by a period, and
end with ".bin".  The rest of the filename is ignored.  For example,
chr10.mask.mouse.n27.d2.bin contains the mask data for chromosome 10.
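
The naming rule above can be expressed as a small Python sketch.  The function
name is hypothetical, and returning None for names that do not match is a
choice made for this example only.

```python
def chromosome_from_mask_filename(filename):
    """Return the chromosome name encoded in a mask file name, or None.

    The name must start with the chromosome name followed by a period and
    end with ".bin"; everything in between is ignored.
    """
    if not filename.endswith(".bin"):
        return None
    chrom, separator, _rest = filename.partition(".")
    if not separator or not chrom:
        return None
    return chrom


print(chromosome_from_mask_filename("chr10.mask.mouse.n27.d2.bin"))
```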

Mask directories for human (hg18) and mouse (mm8 and mm9) genomes for a window
size of 25 bp are available for download from the IGV web site.

   http://www.broad.mit.edu/igv/downloads/downloads/html

============================
MEMORY AND DISK REQUIREMENTS
============================

The current implementation processes all alignments from a single file at once
and is memory intensive.  The default settings allocate 2 GB for Mac and Linux, and
1 GB for Windows.  If you receive an out-of-memory exception, you can increase
these by changing the "-Xmx" value in the scripts.  Conversely, if the JVM will
not start, you might need to reduce these settings.  Note that 32-bit Windows
machines have a maximum practical upper limit of approximately 1.3 GB.

The output file contains an entry for every data point that has not been masked,
even if there were no alignments at that point.  As a consequence, the files
can be quite large compared to the input file.  For human and mouse genomes,
the output files are on the order of 1 GB in size with a window size of 25 bp.