Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

> @mtxellrb A script for creating a track database from bigWig TF ChIP-seq data is now added :create_cistarget_track_databases.py #52

Open
MatthewTCManion opened this issue Aug 30, 2024 · 13 comments

Comments

@MatthewTCManion
Copy link

          > @mtxellrb A script for creating a track database from bigWig TF ChIP-seq data is now added :`create_cistarget_track_databases.py`

https://github.com/aertslab/create_cisTarget_databases#create_cistarget_track_databasespy

Hello , I am running into an issue using this script where the .bed file with regions to score is not recognized correctly, and I have tried a few different formats with no success. For reference, here is a screenshot of my most recent attempt to run the script, as well as the format of my .bed:

REGION_BED="/data/PetrosLab/Matt/scenicplus/chipseq/tracks/fwf_gene_assignments.bed"
DATABASE_PREFIX="CellType_750bp_with_binding"
SCRIPT_DIR="/data/PetrosLab/Matt/scenicplus/create_cisTarget_databases"
TRACKS_DIR="/data/PetrosLab/Matt/scenicplus/chipseq/tracks"
TRACK_LIST="track_names.txt"


"${SCRIPT_DIR}/create_cistarget_track_databases.py" \
	-b "${REGION_BED}" \
    -T "${TRACKS_DIR}" \
    -d "${TRACK_LIST}" \
    -o "${DATABASE_PREFIX}" \
    -t 20

image

I assume the issue is with the format of the .bed or the genes/regions data, but I can't find what the proper format should be.

Thanks,
Matt

Originally posted by @MatthewTCManion in #17 (comment)

@ghuls
Copy link
Member

ghuls commented Sep 3, 2024

Your BED file is not in BED format: chr start end name:

This will fix it:

tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

@MatthewTCManion
Copy link
Author

Your BED file is not in BED format: chr start end name:

This will fix it:

tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

Thank you for your reply! I tried running it with this suggestion, and the BED file now seems to be in the correct format, but the same error came up:

image

Please let me know if you have any additional suggestions!

Best,
Matt

@MatthewTCManion
Copy link
Author

MatthewTCManion commented Sep 3, 2024

Okay so I was able to get the BED file into proper format,

image

and ran create_cistarget_track_databases.py,

image

but I am getting an issue with scoring the tracks for each of my 4 input .bw files:

image

@ghuls
Copy link
Member

ghuls commented Sep 4, 2024

What is the output of for example the first bigWigAverageOverBed command?:

/data/PetrosLab/Conda/envs/create_cistarget_databases/bin/bigAveragedOverBed -minMax /data/PetrosLab/Matt/scenicplus/chipseq/tracks/MGE2-60_S13.cpm.norm.bw /data/PetrosLab/Matt/scenicplus/chipseq/tracks/consensus_tabbed.bed /dev/stdout

(best copy it from your slurm output as I might have made typos)

You probably should make the BED files like I suggested:

tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

with only 4 columns, as bigWigAverageOverBed probably does not like your 5th column.

Creating a track database also only would make some sense when you have several hundreds of tracks (even better thousands), instead of only 4.

@MatthewTCManion
Copy link
Author

Thanks for your help,I dropped it to 4 columns and it seems to have ran and created the .feather rankings and scores.

@MatthewTCManion
Copy link
Author

Creating a track database also only would make some sense when you have several hundreds of tracks (even better thousands), instead of only 4.

Can you clarify what you mean by this to ensure I am using this pipeline correctly?

I have run SCENIC+ using motifs and the motif database, which I understand has thousands of motifs.

My understanding is that I am creating the track database from my ChIP-seq data (one track per sample), and scoring binding in these tracks against the region BED created by running pycisTopic on my ATAC-seq data to create a database that can be used in the SCENIC+ pipeline as way to target regions where I have TF binding in my samples. Am I understanding that incorrectly?

If so, do you have any suggestions on the proper way to incorporate my TF-binding data into identification of GRNs?

Thanks,
Matt

@SeppeDeWinter
Copy link

Hi @MatthewTCManion

Do you want to create a single database containing both motif and ChIP-seq scores?
Or a database containing only ChIP-seq data?

The latter probably does not make a lot of sense, given that you only have few tracks calculating the enrichment values will be impossible.

The former might be a good idea, however in the context of SCENIC+ we have not tried this yet.

All the best,

Seppe

@MatthewTCManion
Copy link
Author

@SeppeDeWinter Ideally, a database containing both motif and ChIP-seq scores. Thank you for the reply!

@MatthewTCManion
Copy link
Author

@SeppeDeWinter Do you think it would be possible to use the outputs of a motif-discovery tool (we have successfully identified motifs in our ChIP-seq data using the MEME suite of tools) to create a motif database for SCENIC+ targeted to motifs in our ChIP-seq set?

@ghuls
Copy link
Member

ghuls commented Sep 9, 2024

@MatthewTCManion That should be possible. Convert your motifs (from Homer/MEME/...) to ClusterBuster format (you can use BioPython for this if you want. Make sure that your PWM contains counts and not frequencies (else multiply by 100))) and use that together with our provide motif collection to make your own database. Later you will have to add your motif to the motif2tf table so if your motif is found, it will actually be used by the SCENIC+ analysis.

Against which TFs did you do ChIP-seq and are the motifs you obtain not in our motif collection?

@MatthewTCManion
Copy link
Author

I will try that!

We're using Nkx2.1 for our ChIP-seq

@ghuls
Copy link
Member

ghuls commented Sep 20, 2024

Looks like a motif for Nkx2-1 is at least in JASPAR, so it should be detected with our default motif collection already:
https://jaspar.elixir.no/matrix/MA1994.2/

@MatthewTCManion
Copy link
Author

MatthewTCManion commented Sep 20, 2024

We have seen some inconsistency between the Nkx2-1 motif between different databases, so one thing we' have done is to generate the motif from our own binding data in multiple Nkx2-1 expressing tissues to hopefully capture a more consistent motif, but I agree it should at least partially resemble the JASPAR Nkx2-1 motif

HOCOMOCO
image

JASPAR
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants