NOTE: PEGASUS currently produces a lot of debugging output when it runs, so it's advisable to redirect stdout to a file.

The first step in the pipeline is creating a diploid genome. The input files that you'll need will be a reference genome sequence in FASTA format, a corresponding FASTA index file (which can be generated with samtools), and one or more VCF files specifying the variant calls.
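
The FASTA index mentioned above can be built with "samtools faidx". The following is a minimal sketch, assuming the reference is a single file named hg19.fa (a placeholder name) in the current directory. If samtools or the file is missing, the command is only echoed.

```shell
# Placeholder reference filename; replace with your own FASTA file.
REF_FASTA=hg19.fa
# samtools faidx writes the index next to the FASTA, with ".fai" appended.
REF_INDEX="$REF_FASTA.fai"

if command -v samtools >/dev/null 2>&1 && [ -f "$REF_FASTA" ]; then
    samtools faidx "$REF_FASTA"
    echo "Wrote index: $REF_INDEX"
else
    echo "Would run: samtools faidx $REF_FASTA"
fi
```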

The output will include:
A pair of output VCF files corresponding to each input VCF file indicating the variants that were assigned to haplotypes and the variants that were omitted.
A "Master" VCF file output by the Haplotype Assignment component of the Personal Genome Creator, and a second "Master" VCF file output by the Personal Genome Creator itself. In both Master VCF files, any variants that overlap are merged into a single VCF entry whose alleles represent the combined alleles of the overlapping variants. The genotype indicated in each file reflects how the alleles have been assigned to haplotypes in the diploid genome. The Master VCF output by the Personal Genome Creator contains the start coordinates of the variant in each haplotype.
A pair of FASTA files for each chromosome created by altering the reference sequence to reflect how the alleles of the variants have been assigned to the haplotypes.
A set of UCSC liftover chain files specifying how the coordinates of the reference and the two haplotypes map to each other.

The command line you'll use to run the personal genome creator will be:
java -classpath PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.genome.PersonalGenomeCreator [required parameters] [optional parameters]

The required parameters are:
-refName <text>				Name of reference sequence, e.g. "-refName hg19"

-refDir <path>				Directory where reference sequence is located,
					e.g. "-refDir ."

-refFiles <filename> ...		Name(s) of reference FASTA file(s), 
					e.g. "-refFiles hg19.fa"
OR 
-refPrefix <filename_prefix>		Prefix of reference FASTA filenames, 
					e.g. "-refPrefix hg19_chr" can be used for 
					hg19_chr1.fa, hg19_chr2.fa, etc. 

-sampleName <text>			Name of individual/cell line/sample used as 
					column label in VCF file, 
					e.g. "-sampleName NA12878"

-inputVCFs <file1> ... <file n>		Filenames, including paths, of VCF files
OR
-inputMasterVCF	<filename>		Generate the diploid genome based on a single
					master VCF file already generated by running
					the personal genome creator

-outputDir <path>			Directory where output should be written
					e.g. "-outputDir ."

-outputMasterVCF <filename>		Filename for output Master VCF file,
					e.g. "-outputMasterVCF master.NA12878.vcf"

The optional parameters are:

-noUseQual				Ignore quality scores, even if they're 
					present, instead just maximize the number
					of variants that are included

-contigs <chr name 1> ... <chr name n>	Names of chromosomes for which to generate
					a personal genome,
					e.g., "-contigs chr1 chr2"

-noUseFilePriority			Treat all the input VCF files equally. By
					default, when this parameter is NOT specified
					the order in which the VCF files are supplied
					as parameters determines their priority. 
					Variants from higher priority VCFs will 
					always be included over conflicting variants
					in lower priority VCF files

-seed <int>				seed the random number generator with the
					specified integer
OR 
-seedTime				seed the random number generator based on the
					current time

-useMatPat				Use "Maternal" and "Paternal" for haplotype names
					instead of "HapOne" and "HapTwo"
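
Putting the parameters above together, a run of the Personal Genome Creator might look like the sketch below. The input VCF names, the log filename, and the seed value are hypothetical placeholders, and the reference flag is spelled "-refFiles" as in the parameter heading. The command is only echoed unless the jar is found at PEGASUS_JAR. Stdout is redirected to a log file because of the debugging output noted at the top.

```shell
# Hypothetical locations and names; adjust to your own files.
PEGASUS_JAR=PATH_TO_JAR/Pegasus.jar
LOG=personal_genome.NA12878.log

# Assemble the command as positional parameters so it can be echoed or run.
# With file priority in effect (the default), the first VCF listed wins conflicts.
set -- java -classpath "$PEGASUS_JAR" \
    edu.mit.compbio.pegasus.genome.PersonalGenomeCreator \
    -refName hg19 -refDir . -refFiles hg19.fa \
    -sampleName NA12878 \
    -inputVCFs snps.NA12878.vcf indels.NA12878.vcf \
    -outputDir . -outputMasterVCF master.NA12878.vcf \
    -seed 1

if [ -f "$PEGASUS_JAR" ]; then
    "$@" > "$LOG"          # keep the debugging output out of the terminal
else
    echo "Would run: $* > $LOG"
fi
```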


==============================================================================
The second, optional, step in the pipeline is removal of PCR duplicates. 

The input files are a VCF file and sorted BAM files (including BAM indexes), which are processed individually.

The output will be a pair of BAM files: one containing only the duplicate reads, and a "kept" file in which the duplicate reads are either marked or omitted, as specified by the command line options.

The command line you'll use to run the variant aware duplicate marker will be:
java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.sam.VariantAwareDuplicateMarker [required parameters] [optional parameters]

The required parameters are:
-vcfFile <filename>		filename, including path, of VCF file. This 
				should be the Master VCF file produced by the 
				Personal Genome Creator
				e.g. "-vcfFile master.NA12878.vcf"

-sampleName <text>		Name of individual/cell line/sample used as 
				column label in VCF file.
				e.g. "-sampleName NA12878"

-inputBAM <filename>		filename, including path, of BAM file to process

-inputHap <haplotype>		Either "HAP_ONE" or "HAP_TWO", or "Paternal", or
				"Maternal", to specify the haplotype to which 
				the reads were aligned to produce the input 
				BAM file

The optional parameters are:
-keepDuplicates			Mark reads as duplicates and output them to 
				the "kept" file along with the non-duplicate 
				reads. By default they are omitted from the 
				kept file.

-ignorePreexisting		Ignore whether or not the duplicate flag is 
				already set for any reads. 

-outputDir <path>		Path to directory where output should be 
				written. By default output is written to the 
				same directory as the input BAM file.

-duplicatesFilename <filename>  Filename to which duplicate reads should be
				written. By default this is determined by
				modifying the input filename.

-keptFilename <filename>	Filename to which non-duplicate reads should be
				written. By default this is determined by
				modifying the input filename.
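
Because each BAM file is processed individually, the duplicate marker is typically run once per haplotype. The loop below is a sketch of that, assuming hypothetical BAM filenames of the form NA12878.<haplotype>.sorted.bam and the Master VCF name used in the examples above. Each command is only echoed unless both the jar and the BAM file exist.

```shell
# Hypothetical locations and names; adjust to your own files.
PEGASUS_JAR=PATH_TO_JAR/Pegasus.jar

for hap in HAP_ONE HAP_TWO; do
    bam="NA12878.$hap.sorted.bam"

    # One invocation per haplotype-specific BAM file.
    set -- java -cp "$PEGASUS_JAR" \
        edu.mit.compbio.pegasus.sam.VariantAwareDuplicateMarker \
        -vcfFile master.NA12878.vcf -sampleName NA12878 \
        -inputBAM "$bam" -inputHap "$hap"

    if [ -f "$PEGASUS_JAR" ] && [ -f "$bam" ]; then
        "$@" > "dupmark.$hap.log"
    else
        echo "Would run: $*"
    fi
done
```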

==============================================================================

 
The third step in the pipeline is to split the BAM files to isolate the reads that overlap variants.

The input files are a VCF file and sorted BAM files.

The output will be a pair of BAM files: one with only the reads that overlap variants, and another with the reads that did not overlap variants.

The command line you'll use to run SplitReadsByVariantOverlap will be:
java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.sam.SplitReadsByVariantOverlap [required parameters] [optional parameters]

The required parameters are:
-vcfFile <filename>		filename, including path, of VCF file. This 
				should be the Master VCF file produced by the 
				Personal Genome Creator
				e.g. "-vcfFile master.NA12878.vcf"

-sampleName <text>		Name of individual/cell line/sample used as 
				column label in VCF file.
				e.g. "-sampleName NA12878"

-inputHapOneBAM <filename>	filename, including path, of BAM file for reads
				aligned to haplotype one
OR
-inputMaternalBAM <filename>	same as above, for maternal haplotype

-inputHapTwoBAM <filename>	filename, including path, of BAM file for reads
				aligned to haplotype two (or paternal)
OR
-inputPaternalBAM <filename>	same as above, for paternal haplotype

The optional parameters are:
-outputDir <path>		Path to directory where output should be 
				written. By default output is written to the
				same directory as the input files
				
-ignoreHomozygous 		Treat reads that only overlap homozygous
				variants as if they don't overlap variants
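
As an illustration of the parameters above, the following sketch splits a pair of haplotype-aligned BAM files. The BAM filenames and output directory are hypothetical placeholders, and the command is only echoed unless the jar is found at PEGASUS_JAR.

```shell
# Hypothetical locations and names; adjust to your own files.
PEGASUS_JAR=PATH_TO_JAR/Pegasus.jar
OUT_DIR=split_reads

set -- java -cp "$PEGASUS_JAR" \
    edu.mit.compbio.pegasus.sam.SplitReadsByVariantOverlap \
    -vcfFile master.NA12878.vcf -sampleName NA12878 \
    -inputHapOneBAM NA12878.HAP_ONE.sorted.bam \
    -inputHapTwoBAM NA12878.HAP_TWO.sorted.bam \
    -outputDir "$OUT_DIR" -ignoreHomozygous

if [ -f "$PEGASUS_JAR" ]; then
    mkdir -p "$OUT_DIR"
    "$@" > split_reads.log
else
    echo "Would run: $*"
fi
```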

==============================================================================

The fourth step in the pipeline is to count how many reads overlap each variant.

The input files are a VCF file and a pair of sorted BAM files (including BAM
indexes)

The output will be a "VarCov" file indicating the number of reads overlapping
each allele of each variant.

The command line you'll use to run the allelic count generator will be:
java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.allelic.AllelicCountGenerator [required parameters] [optional parameters]

The required parameters are:
-vcfFile <filename>		filename, including path, of VCF file. This 
				should be the Master VCF file produced by the 
				Personal Genome Creator
				e.g. "-vcfFile master.NA12878.vcf"

-sampleName <text>		Name of individual/cell line/sample used as 
				column label in VCF file.
				e.g. "-sampleName NA12878"

-inputHapOneBAM <filename>	filename, including path, of BAM file for reads
				aligned to haplotype one
OR
-inputMaternalBAM <filename>	same as above, for maternal haplotype

-inputHapTwoBAM <filename>	filename, including path, of BAM file for reads
				aligned to haplotype two (or paternal)
OR
-inputPaternalBAM <filename>	same as above, for paternal haplotype

-outputDir <path>		Path to directory where output should be 
				written. By default output is written to the
				same directory as the input files
				

The optional parameters are:
-ignoreHomozygous 		Treat reads that only overlap homozygous
				variants as if they don't overlap variants

-outputZeroCounts		Include variants in the output file even if
				the read counts are zero for both alleles

-snpQualThresh <int>		The minimum base quality score of the base
				within a read that overlaps a SNP in order for
				the read to be counted for the allele it
				overlaps. Values range from 1-60. Default is 
				14, which corresponds to a ~4% chance that the
				base is incorrect.
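
To help choose a value for -snpQualThresh, a base quality score Q on the standard Phred scale corresponds to an error probability of 10^(-Q/10). The sketch below computes that probability with awk and then assembles an allelic count command using the chosen threshold. The helper name phred_to_error, the threshold of 20, and the BAM filenames are illustrative assumptions, and the command is only echoed unless the jar is found at PEGASUS_JAR.

```shell
# Hypothetical locations and names; adjust to your own files.
PEGASUS_JAR=PATH_TO_JAR/Pegasus.jar
SNP_QUAL_THRESH=20

# Phred scale: error probability p = 10^(-Q/10), printed to four decimals.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

echo "Error probability at Q$SNP_QUAL_THRESH: $(phred_to_error "$SNP_QUAL_THRESH")"

set -- java -cp "$PEGASUS_JAR" \
    edu.mit.compbio.pegasus.allelic.AllelicCountGenerator \
    -vcfFile master.NA12878.vcf -sampleName NA12878 \
    -inputHapOneBAM NA12878.HAP_ONE.sorted.bam \
    -inputHapTwoBAM NA12878.HAP_TWO.sorted.bam \
    -outputDir . -snpQualThresh "$SNP_QUAL_THRESH"

if [ -f "$PEGASUS_JAR" ]; then
    "$@" > allelic_counts.log
else
    echo "Would run: $*"
fi
```

With this formula, the default threshold of 14 works out to an error probability of about 0.0398, and a threshold of 20 to 0.01.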