===========================
IGV ALIGNMENT PREPROCESSOR
===========================

This package contains source and binaries for the IGV alignment processor
module. The module takes an alignment file as input and outputs a binary
read count density file in h5 format. Supported input formats include SAM,
BAM, and .aligned (described below). The results can be visualized in IGV.

===========================
RUNNING
===========================

The bin directory contains a prebuilt standalone package for Windows,
Mac OS X, and Linux 64-bit platforms. Java J2SE 5.0 or greater is required.

The preprocessor can be run from the bin directory with one of the included
scripts.

  run_win.bat [args]
  run_linux-64.sh [args]       (for Linux 64-bit systems)
  run_mac-powerpc.sh [args]
  run_mac-intel.sh [args]

Required arguments:

  [inputDirectory]    input directory containing one or more ".aligned" files
  [outputDirectory]   output directory
  [genome]            IGV recognized genome, one of mm7, mm8, mm9, hg16,
                      hg17, hg18, hg19, zebrafish, sc, spombe, nspora.
                      If your genome is not on this list, contact us at
                      igv-help@broad.mit.edu

Optional arguments:

  -m [maskDirectory]        directory containing binary "mask" files (see below)
  -w [windowSize]           defaults to 25 bp
  -ext [extensionFactor]    defaults to 0
  -z [maxZoomLevel]         maximum zoom level to compute. Recommended value is 8
  -normalize                normalize the read count in each window to counts
                            per million. This option multiplies each count by
                            (1,000,000 / total # of reads)

Note: the extension factor should be set to the average shear size of your
fragment library (if applicable).

===========================
BUILDING
===========================

Instructions for building from source files are below. Note that this will
replace the contents of the bin directory.

Prerequisites:

  Ant 1.7.0 or greater (http://ant.apache.org/)
  Java J2SE 5.0 or greater (http://java.sun.com/javase/download)

1. Download and unzip the source distribution file to a directory of your
   choice.

2. Run the provided ant script

     ant -buildfile build_preprocessor.xml

============================
FILE / DIRECTORY FORMATS
============================

Supported input file formats include SAM, BAM, and ".aligned". SAM and BAM
formats are described at http://samtools.sourceforge.net/.

A ".aligned" file is a tab delimited 4-column text file containing the
chromosome, start position, end position, and orientation (+/-) of each
sequence read. For example:

  chr10   91346072   91346098   +
  chr15   9879235    9870261    -
  etc

The filename must end with ".aligned"; other files in the input directory
are ignored. There is no header row.

The mask directory is a directory of binary files, one per chromosome, that
specify regions of the genome that should be ignored (are non-alignable).
Each position in the file corresponds to the matching "window" position in
the data. For example, assuming a window size of 25, position 10 in the file
would correspond to the region 250-275 bp. A "masked" region is indicated by
a byte value of "zero". The mask files should begin with the chromosome name
followed by a period, and end with ".bin". The rest of the filename is
ignored. For example

  chr10.mask.mouse.n27.d2.bin

contains the mask data for chromosome 10.

Mask directories for Human (hg18) and Mouse (mm8 and mm9) genomes for a
window size of 25bp are available for download from the IGV web site.

  http://www.broad.mit.edu/igv/downloads/downloads/html

===========================
MEMORY AND DISK REQUIREMENTS
===========================

The current implementation processes all alignments from a single file at
once and is memory intensive. The default settings allocate 2 GB for Mac and
Linux, and 1 GB for Windows. If you receive an out of memory exception, you
can increase these by changing the "-Xmx" value in the scripts. Conversely,
if the JVM will not start, you might need to reduce these settings. Note that
32-bit Windows machines have a maximum practical upper limit of
approximately 1.3GB.

The output file contains an entry for every data point (window) that has not
been masked, even if there were no alignments at that point. As a
consequence, the output files can be quite large compared to the input file.
For human and mouse genomes, the output files are on the order of 1 GB in
size with a window size of 25bp.
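
The following Python sketch illustrates why the output is dense. It is not
the module's code: the function names (parse_aligned, window_densities) are
hypothetical, and it assumes that each read is counted once in the window
containing its start position, with the extension factor ignored. The actual
preprocessor may assign reads to windows differently. The mask convention
(byte value zero means masked, position i covers window i) and the
-normalize scaling follow the descriptions in this file.

```python
# Illustrative sketch only. The counting rule (one count per read, in the
# window holding the read's start position) is an assumption, not taken
# from the preprocessor's source.

WINDOW_SIZE = 25  # default window size in bp (the -w option)


def parse_aligned(lines):
    """Parse tab-delimited ".aligned" records: chromosome, start, end, strand."""
    reads = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, start, end, strand = line.split("\t")
        reads.append((chrom, int(start), int(end), strand))
    return reads


def window_densities(reads, chrom, chrom_length, mask=None,
                     window_size=WINDOW_SIZE, normalize=False):
    """Return {window_index: value} for every unmasked window on chrom.

    Windows with no reads still get an entry (value 0), which is why the
    output grows with genome size rather than with the number of reads.
    """
    n_windows = (chrom_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for read_chrom, start, _end, _strand in reads:
        if read_chrom == chrom and 0 <= start < chrom_length:
            counts[start // window_size] += 1  # assumed counting rule

    # -normalize: multiply each count by (1,000,000 / total # of reads)
    scale = 1_000_000 / len(reads) if normalize and reads else 1.0

    densities = {}
    for index, count in enumerate(counts):
        if mask is not None and index < len(mask) and mask[index] == 0:
            continue  # a byte value of zero marks a masked window
        densities[index] = count * scale
    return densities


if __name__ == "__main__":
    demo_reads = parse_aligned([
        "chr10\t260\t286\t+",
        "chr10\t270\t296\t-",
        "chr15\t30\t56\t+",
    ])
    demo_mask = bytes([1, 0] + [1] * 14)  # window 1 (25-50 bp) is masked
    print(window_densities(demo_reads, "chr10", 400, mask=demo_mask))
```

In this small example a 400 bp chromosome yields 16 windows, of which 15
are written because one is masked. Both chr10 reads start inside window 10
(250-275 bp); the other 14 unmasked windows are present with a count of
zero.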