NOTE: PEGASUS currently produces a lot of debugging output when it runs, so
it's advisable to redirect stdout to a file when you run it.

The first step in the pipeline is creating a diploid genome.

The input files that you'll need are a reference genome sequence in FASTA
format, a corresponding FASTA index file (which can be generated with
samtools), and one or more VCF files specifying the variant calls.

The output will include:

  A pair of output VCF files corresponding to each input VCF file, one
  indicating the variants that were assigned to haplotypes and one
  indicating the variants that were omitted.

  A "Master" VCF file output by the Haplotype Assignment component of the
  Personal Genome Creator, and a second "Master" VCF file output by the
  Personal Genome Creator itself. In both Master VCF files, any variants
  that overlap are merged into a single VCF entry whose alleles represent
  the combined alleles of the overlapping variants. The genotype indicated
  in the file reflects how the alleles have been assigned to haplotypes in
  the diploid genome. The Master VCF output by the Personal Genome Creator
  also contains the start coordinates of the variant in each haplotype.

  A pair of FASTA files for each chromosome, created by altering the
  reference sequence to reflect how the alleles of the variants have been
  assigned to the haplotypes.

  A set of UCSC liftover chain files specifying how the coordinates of the
  reference and the two haplotypes map to each other.

The command line you'll use to run the personal genome creator will be:

  java -classpath PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.genome.PersonalGenomeCreator [required parameters] [optional parameters]

The required parameters are:

  -refName             Name of reference sequence, e.g. "-refName hg19"
  -refDir              Directory where reference sequence is located,
                       e.g. "-refDir ."
  -refFiles ...        Name of reference FASTA file, e.g.
                       "-refFiles hg19.fa"
    OR
  -refPrefix           Prefix of reference FASTA filenames, e.g.
                       "-refPrefix hg19_chr" can be used for
                       hg19_chr1.fa, hg19_chr2.fa, etc.
  -sampleName          Name of individual/cell line/sample used as
                       column label in VCF file, e.g.
                       "-sampleName NA12878"
  -inputVCFs ...       Filenames, including paths, of VCF files
    OR
  -inputMasterVCF      Generate the diploid genome based on a single
                       Master VCF file already generated by running
                       the Personal Genome Creator
  -outputDir           Directory where output should be written, e.g.
                       "-outputDir ."
  -outputMasterVCF     Filename for output Master VCF file, e.g.
                       "-outputMasterVCF master.NA12878.vcf"

The optional parameters are:

  -noUseQual           Ignore quality scores, even if they're present,
                       and instead just maximize the number of
                       variants that are included
  -contigs ...         Names of chromosomes to process and generate a
                       personal genome for, e.g. "-contigs chr1 chr2"
  -noUseFilePriority   Treat all the input VCF files equally. By
                       default, when this parameter is NOT specified,
                       the order in which the VCF files are supplied
                       as parameters determines their priority:
                       variants from higher priority VCF files will
                       always be included over conflicting variants
                       in lower priority VCF files
  -seed                Seed the random number generator with the
                       specified integer
    OR
  -seedTime            Seed the random number generator based on the
                       current time
  -useMatPat           Use "Maternal" and "Paternal" for haplotype
                       names instead of "HapOne" and "HapTwo"

==============================================================================

The second, optional, step in the pipeline is removal of PCR duplicates.

The input files are a VCF file and sorted BAM files (including BAM indexes),
which are processed individually.

The output will be a pair of BAM files: one with only the duplicate reads,
and a duplicate-marked file in which the duplicate reads are either marked
or omitted, as specified by the command line options.

The command line you'll use to run the variant aware duplicate marker will be:

  java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.sam.VariantAwareDuplicateMarker [required parameters] [optional parameters]

The required parameters are:

  -vcfFile             Filename, including path, of VCF file. This
                       should be the Master VCF file produced by the
                       Personal Genome Creator, e.g.
                       "-vcfFile master.NA12878.vcf"
  -sampleName          Name of individual/cell line/sample used as
                       column label in VCF file, e.g.
                       "-sampleName NA12878"
  -inputBAM            Filename, including path, of BAM file to process
  -inputHap            One of "HAP_ONE", "HAP_TWO", "Paternal", or
                       "Maternal", specifying the haplotype to which
                       the reads were aligned to produce the input
                       BAM file

The optional parameters are:

  -keepDuplicates      Mark reads as duplicates and output them to the
                       "kept" file along with the non-duplicate reads.
                       By default they are omitted from the kept file.
  -ignorePreexisting   Ignore whether or not the duplicate flag is
                       already set for any reads.
  -outputDir           Path to directory where output should be
                       written. By default output is written to the
                       same directory as the input BAM file.
  -duplicatesFilename  Filename to which duplicate reads should be
                       written. By default this is determined by
                       modifying the input filename.
  -keptFilename        Filename to which non-duplicate reads should be
                       written. By default this is determined by
                       modifying the input filename.

==============================================================================

The third step in the pipeline is to split the BAM files to isolate the
reads that overlap variants.

The input files are a VCF file and sorted BAM files.

The output will be a pair of BAM files: one with only the reads that overlap
variants and another with the reads that do not overlap variants.

The command line you'll use to split the reads by variant overlap will be:

  java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.sam.SplitReadsByVariantOverlap [required parameters] [optional parameters]

The required parameters are:

  -vcfFile             Filename, including path, of VCF file. This
                       should be the Master VCF file produced by the
                       Personal Genome Creator, e.g.
                       "-vcfFile master.NA12878.vcf"
  -sampleName          Name of individual/cell line/sample used as
                       column label in VCF file, e.g.
                       "-sampleName NA12878"
  -inputHapOneBAM      Filename, including path, of BAM file for reads
                       aligned to haplotype one
    OR
  -inputMaternalBAM    Same as above, for maternal haplotype
  -inputHapTwoBAM      Filename, including path, of BAM file for reads
                       aligned to haplotype two
    OR
  -inputPaternalBAM    Same as above, for paternal haplotype

The optional parameters are:

  -outputDir           Path to directory where output should be
                       written. By default output is written to the
                       same directory as the input files.
  -ignoreHomozygous    Treat reads that only overlap homozygous
                       variants as if they don't overlap variants

==============================================================================

The fourth step in the pipeline is to count how many reads overlap each
variant.

The input files are a VCF file and a pair of sorted BAM files (including BAM
indexes).

The output will be a "VarCov" file indicating the number of reads
overlapping each allele of each variant.

The command line you'll use to run the allelic count generator will be:

  java -cp PATH_TO_JAR/Pegasus.jar edu.mit.compbio.pegasus.allelic.AllelicCountGenerator [required parameters] [optional parameters]

The required parameters are:

  -vcfFile             Filename, including path, of VCF file. This
                       should be the Master VCF file produced by the
                       Personal Genome Creator, e.g.
                       "-vcfFile master.NA12878.vcf"
  -sampleName          Name of individual/cell line/sample used as
                       column label in VCF file, e.g.
                       "-sampleName NA12878"
  -inputHapOneBAM      Filename, including path, of BAM file for reads
                       aligned to haplotype one
    OR
  -inputMaternalBAM    Same as above, for maternal haplotype
  -inputHapTwoBAM      Filename, including path, of BAM file for reads
                       aligned to haplotype two
    OR
  -inputPaternalBAM    Same as above, for paternal haplotype
  -outputDir           Path to directory where output should be
                       written. By default output is written to the
                       same directory as the input files.

The optional parameters are:

  -ignoreHomozygous    Treat reads that only overlap homozygous
                       variants as if they don't overlap variants
  -outputZeroCounts    Include variants in the output file even if the
                       read counts are zero for both alleles
  -snpQualThresh       The minimum base quality score that the base
                       overlapping a SNP must have for its read to be
                       counted for the allele it overlaps. Values
                       range from 1-60. Default is 14, which
                       corresponds to a ~4% chance that the base is
                       incorrect.
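
To put the default -snpQualThresh of 14 in context, the error chance for a
given base quality can be worked out from the standard Phred relationship
P = 10^(-Q/10). The sketch below assumes the threshold is an ordinary
Phred-scaled base quality, which the description above implies but does not
state outright. phred_to_error is a hypothetical helper name used only for
illustration; it is not part of PEGASUS.

```shell
#!/bin/sh
# Convert a Phred-scaled base quality Q into an error probability,
# P = 10^(-Q/10), computed as exp(-Q/10 * ln(10)) for awk portability.
# phred_to_error is a hypothetical helper, not a PEGASUS command.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

phred_to_error 14   # prints 0.0398 (the default threshold)
phred_to_error 20   # prints 0.0100
phred_to_error 30   # prints 0.0010
```

Under that assumption, raising the threshold from 14 to 20 lowers the
tolerated per-base error chance from about 4% to 1%, at the cost of counting
fewer reads per allele.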