Notation: the same as above.
Genome assembly: Hg38.

100 sets of 1000 chunks: the 3.5M CREs, sorted by genomic position, are split into 1000 chunks forming 100 sets, with IDs of the form set_id.chunk_id. For example, 9.10 is the 10th chunk of the 9th set. Each chunk contains 3.5k CREs. We provide both input and output for each chunk.
Genomic bins: we extract TF affinity scores and predict epigenomic profiles at a resolution of 128-bp bins.
Receptive field: the input covers the genomic bin containing the center of a CRE, plus the 1000 bins upstream and the 1000 bins downstream, i.e. N=2001 bins in total, spanning 256,128 bp.
Dimension of input: we currently include K=360 TFs (those with both motif and protein interaction data). Each genomic bin thus has K features, and each CRE has K*N=720,360 features.
Output field: instead of predicting signals across the whole receptive field, we predict signals only for the M=201 central bins, i.e. the center bin plus the 100 bins upstream and the 100 bins downstream, spanning 25,728 bp.
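
The window sizes above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the pipeline's own code: the function names are hypothetical, bins are assumed to be aligned to coordinate 0 of each chromosome, and chromosome ends are not handled.

```python
BIN_SIZE = 128    # bp per genomic bin
FLANK_IN = 1000   # bins on each side of the center bin (receptive field)
FLANK_OUT = 100   # bins on each side of the center bin (output field)
K = 360           # number of TFs

N = 2 * FLANK_IN + 1    # bins in the receptive field
M = 2 * FLANK_OUT + 1   # bins in the output field


def receptive_bins(cre_center_bp):
    """Return (first_bin, last_bin), inclusive, of the input window.

    Assumption: bin i covers positions [i * BIN_SIZE, (i + 1) * BIN_SIZE).
    """
    center_bin = cre_center_bp // BIN_SIZE
    return center_bin - FLANK_IN, center_bin + FLANK_IN


def output_slice():
    """Slice selecting the M output bins inside the N-bin receptive field."""
    start = FLANK_IN - FLANK_OUT
    return slice(start, start + M)


print(N, N * BIN_SIZE)   # bins and bp in the receptive field
print(K * N)             # features per CRE
print(M, M * BIN_SIZE)   # bins and bp in the output field
```

With 0-based indexing, the center bin sits at position 1000 of the receptive field, so the output field corresponds to positions 900 through 1100.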
|
Data are available here:
https://data.broadinstitute.org/compbio1/DL.epiImpute/
./b_CRE.bin.TFSignal_updn128kb_bin128bp/
Input: TF binding affinity matrix B (nrow = number of CREs in the chunk, ncol = number of features = K*N), stored as a CSR sparse matrix in HDF5 format. See the readHDF5.py script to restore the sparse matrix. Each row of K*N features is the column-wise vectorization of an N x K matrix, i.e. the first N entries are the scores of the N bins for the first TF, the next N entries are the scores for the second TF, and so on. [Only 10 of the 1000 chunks are included here.]
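
To make the column layout concrete, here is a small sketch of the mapping between a (TF, bin) pair and a column of B, using 0-based indices. The helper names are hypothetical and are not part of readHDF5.py.

```python
K = 360    # number of TFs
N = 2001   # bins in the receptive field


def feature_index(tf_idx, bin_idx, n_bins=N):
    """Column of B holding the score of TF `tf_idx` in bin `bin_idx`.

    Column-wise vectorization of an N x K matrix: all N bins of TF 0
    come first, then all N bins of TF 1, and so on.
    """
    return tf_idx * n_bins + bin_idx


def tf_and_bin(col, n_bins=N):
    """Inverse mapping: column of B -> (tf_idx, bin_idx)."""
    return divmod(col, n_bins)


print(feature_index(1, 0))   # first bin of the second TF
print(tf_and_bin(1000))      # center bin of the first TF
```

If a row is held as a dense NumPy array of length K*N, `row.reshape(K, N).T` gives the same N x K view.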
|
TFScores.SparseMarix.colnames.txt
The order of the K TFs.
|
motifTF.ppiDist.tsv
P matrix: protein interaction network of the selected K TFs. Entries are shortest-path distances between TFs, computed over the whole PPI network, with (s_max-s)/s_max as the direct distance between each pair of nodes.
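
The distance definition can be illustrated on a toy network. This is a sketch under assumptions, not the script used to build the file: s is taken to be an interaction score on an edge and s_max its maximum possible value (1000 here is an arbitrary choice), the protein names A, B, C are made up, and Dijkstra's algorithm is used as one standard way to compute shortest paths.

```python
import heapq


def edge_distance(s, s_max):
    """Direct distance between two interacting proteins: (s_max - s) / s_max."""
    return (s_max - s) / s_max


def shortest_distances(scores, source, s_max):
    """Dijkstra shortest-path distances from `source` over an undirected network.

    scores: dict mapping (protein_a, protein_b) -> interaction score s.
    """
    adjacency = {}
    for (a, b), s in scores.items():
        d = edge_distance(s, s_max)
        adjacency.setdefault(a, []).append((b, d))
        adjacency.setdefault(b, []).append((a, d))

    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy example with hypothetical proteins and scores.
toy_scores = {("A", "B"): 900, ("B", "C"): 800, ("A", "C"): 500}
toy_dist = shortest_distances(toy_scores, "A", s_max=1000)
print(toy_dist)
```

In the toy example, the path A-B-C (0.1 + 0.2) is shorter than the direct A-C edge (0.5), so high-scoring interactions translate into short distances.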
|
./b2_CRE.bin_outputSignal_10k/C17M16/
Output: Y matrix (nrow = number of CREs in the chunk, ncol = M), in TSV format. [Only 10 of the 1000 chunks are included here.]