Notation: the same as above.
Genome assembly: hg38.
Sets and chunks: the 3.5M CREs, sorted by genomic position, are split into 1000 chunks grouped into 100 sets. Each chunk has an ID of the form set_id.chunk_id; for example, 9.10 is the 10th chunk of the 9th set. Each chunk contains about 3.5k CREs. We provide both input and output for each chunk.
Genomic bins: we extract TF affinity scores and predict epigenomic profiles at a resolution of 128-bp bins.
Receptive field: the input covers the genomic bin containing the center of a CRE, plus 1000 bins upstream and 1000 bins downstream, i.e. N=2001 bins in total, spanning 256,128 bp.
Dimension of input: we currently include K=360 TFs (those with both motif and protein interaction data). Each genomic bin therefore has K features, and each CRE has K*N=720,360 features.
Output field: instead of predicting signals across the whole receptive field, we predict signals only for the M=201 bins in the center, i.e. the center bin plus 100 bins upstream and 100 bins downstream, spanning 25,728 bp.

Data here: https://data.broadinstitute.org/compbio1/DL.epiImpute/

./b_CRE.bin.TFSignal_updn128kb_bin128bp/
Input: TF binding affinity matrix B (nrow = #CREs in the chunk, ncol = #features = K*N), stored as a CSR sparse matrix in HDF5 format. Refer to the readHDF5.py script to restore the sparse matrix. Each row of K*N features is the column-wise vectorization of an N x K matrix, i.e. the first N entries are the scores of the N bins for the first TF, the next N entries are those for the second TF, and so on. [Only 10 of the 1000 chunks are included here.]

TFScores.SparseMarix.colnames.txt
The order of the K TFs.

motifTF.ppiDist.tsv
P matrix: protein interaction network of the selected K TFs. Entries are the shortest distances between TFs, calculated over the whole PPI network, with (s_max-s)/s_max as the direct distance between each pair of nodes.

./b2_CRE.bin_outputSignal_10k/C17M16/
Output: Y matrix (nrow = #CREs in the chunk, ncol = M), in TSV format. [Only 10 of the 1000 chunks are included here.]
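
The index arithmetic described above can be sketched in plain Python. This is only an illustration, not part of the released scripts: the function names (`parse_chunk_id`, `feature_index`, `unflatten_row`, `output_bin_range`) are hypothetical, indices are 0-based, and the sketch assumes that bins within the receptive field are ordered from upstream to downstream, so the center bin sits at index 1000. It does not read the HDF5 files; use readHDF5.py for that.

```python
# Constants taken from the description above.
BIN_SIZE = 128     # bp per genomic bin
FLANK_IN = 1000    # bins on each side of the center bin (input)
FLANK_OUT = 100    # bins on each side of the center bin (output)
K = 360            # number of TFs

N = 2 * FLANK_IN + 1     # 2001 input bins per CRE
M = 2 * FLANK_OUT + 1    # 201 output bins per CRE


def parse_chunk_id(chunk_id):
    """Split an ID such as "9.10" into (set_id, chunk_id_within_set).

    The ID is handled as a string: parsing it as a float would turn
    "9.10" into 9.1 and lose the trailing zero.
    """
    set_id, chunk = chunk_id.split(".")
    return int(set_id), int(chunk)


def feature_index(tf, bin_):
    """Column of B holding the score of TF `tf` at bin `bin_` (both 0-based).

    Rows are column-wise vectorizations of an N x K matrix, so all N bins
    of one TF are stored contiguously before the next TF starts.
    """
    if not (0 <= tf < K and 0 <= bin_ < N):
        raise IndexError("tf or bin out of range")
    return tf * N + bin_


def unflatten_row(row):
    """Split one dense row of length K*N into K lists of N per-bin scores."""
    if len(row) != K * N:
        raise ValueError("expected a row of length K*N")
    return [row[tf * N:(tf + 1) * N] for tf in range(K)]


def output_bin_range():
    """Half-open range of input-bin indices covered by the M output bins.

    Assumes bins are ordered upstream to downstream (center at FLANK_IN).
    """
    center = FLANK_IN
    return center - FLANK_OUT, center + FLANK_OUT + 1
```

For example, under these assumptions `feature_index(1, 0)` is the first bin of the second TF, and `output_bin_range()` gives the slice of each TF's N-bin vector that lines up with the M columns of Y.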