Commit

Update README.md
initial edit
mitchgill16 authored Oct 12, 2021
1 parent ef56734 commit b26c561
Showing 1 changed file with 16 additions and 0 deletions.
16 changes: 16 additions & 0 deletions README.md
@@ -1,2 +1,18 @@
# lncRNA_Prediction_Interpretation
Notebooks, Data and Scripts for lncRNA Prediction & Interpretation

Each folder holds the files for one crop and is named by a two-letter species abbreviation:
at = Arabidopsis thaliana
bn = Brassica napus
bo = Brassica oleracea
br = Brassica rapa
gm = Glycine max
os = Oryza sativa
zm = Zea mays

Each folder contains a notebook that finetunes a pretrained BERT model. The model must first be installed by following the instructions for dnalong models at the DNABERT repository (https://github.com/jerryji1993/DNABERT). A couple of changes were made for interpretation, so if you want to use the transformers interpret section of the notebook, replace the modeling_bert.py file in your installation with the modeling_bert.py file in the scripts folder of this repository.
Each folder also contains two FASTA files. One holds lncRNA sequences specific to the given crop, acquired from the CANTATAdb 2.0 database (http://cantata.amu.edu.pl/download.php), and the other holds random sequences drawn from the same genome. The pipeline to generate this data was as follows: download the relevant GTF file --> run the add_flanks.py script (to add 500 bp on either side of each lncRNA) --> run bedtools getfasta with the reference genome on CANTATAdb 2.0 and the flanked lncRNA GTF file to generate the crop-specific flanked lncRNA FASTA --> run bedtools shuffle on the flanked GTF file to randomise the same intervals throughout the genome --> run bedtools getfasta again to generate the random FASTA sequences.
You can follow the same pipeline if you'd like to finetune a DNABERT model for a different crop.
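
The flanking and bedtools steps of the pipeline above can be sketched roughly as follows. This is not the repository's add_flanks.py. It is a minimal illustration that assumes a tab-separated GTF with 1-based inclusive coordinates and a chromosome sizes file for bedtools shuffle. The function names, file names and output prefix are hypothetical, and the sketch only builds the bedtools command lines rather than running them.

```python
from typing import List, Optional, Tuple

FLANK_BP = 500  # flank size stated in the pipeline description


def add_flanks(gtf_line: str, flank: int = FLANK_BP) -> str:
    """Widen one GTF record by `flank` bp on each side (hypothetical helper)."""
    if gtf_line.startswith("#"):
        return gtf_line.rstrip("\n")  # header/comment lines pass through unchanged
    fields = gtf_line.rstrip("\n").split("\t")
    start, end = int(fields[3]), int(fields[4])  # GTF columns 4 and 5
    fields[3] = str(max(1, start - flank))  # clamp at the start of the chromosome
    fields[4] = str(end + flank)  # assumption: not clamped to the chromosome length
    return "\t".join(fields)


# (argv, file to redirect stdout into, or None if the command writes its own output)
Step = Tuple[List[str], Optional[str]]


def bedtools_steps(genome_fa: str, genome_sizes: str,
                   flanked_gtf: str, out_prefix: str) -> List[Step]:
    """Build the three bedtools calls in pipeline order without executing them."""
    shuffled = f"{out_prefix}_shuffled.gtf"
    return [
        # 1. Extract the flanked lncRNA sequences.
        (["bedtools", "getfasta", "-fi", genome_fa, "-bed", flanked_gtf,
          "-fo", f"{out_prefix}_lncrna.fa"], None),
        # 2. Move the same intervals to random genome positions (prints to stdout).
        (["bedtools", "shuffle", "-i", flanked_gtf, "-g", genome_sizes], shuffled),
        # 3. Extract the sequences at the shuffled positions.
        (["bedtools", "getfasta", "-fi", genome_fa, "-bed", shuffled,
          "-fo", f"{out_prefix}_random.fa"], None),
    ]
```

Each step could then be executed with `subprocess.run(argv, stdout=...)`, redirecting the shuffle output into the intermediate file. Because the end coordinate is not clamped here, intervals near the end of a chromosome may need extra handling.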

After finetuning, you can interpret the models by installing the transformers interpret package (https://github.com/cdpierse/transformers-interpret). For my finetuning of DNABERT I set the max sequence length to 2048, which isn't compatible with the standard transformers max sequence length of 512. If you have changed the max sequence length (in multiples of 512, as dictated by the DNABERT repo for dnalong models), replace attributions.py and explainer.py in the installation folder of transformers interpret with the respective files in the scripts folder of this repo. I recommend setting your max sequence length to 2048, as the changes I've made are specific to 2048. If you use a different max length, you will have to change my code that reshapes PyTorch tensors to (4, 512) etc.
There is also a do_chi_motifs.sh script, which uses the chi_script.R script to return a list of motifs and their significance values. Run it after the motifs have been generated in the relevant Jupyter notebook. The general script is set up for at (Arabidopsis thaliana).
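
To see why a different max sequence length needs a different reshape than (4, 512), the arithmetic can be sketched in plain Python. This is only an illustration of the chunking idea, not the code in attributions.py or explainer.py, which works on PyTorch tensors. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

CHUNK = 512  # standard transformers max sequence length mentioned above


def chunk_shape(max_seq_length: int, chunk: int = CHUNK) -> Tuple[int, int]:
    """Return the (rows, chunk) shape that a sequence of this length splits into."""
    if max_seq_length <= 0 or max_seq_length % chunk != 0:
        raise ValueError(f"max_seq_length must be a positive multiple of {chunk}")
    return (max_seq_length // chunk, chunk)


def split_into_chunks(token_ids: Sequence[int], chunk: int = CHUNK) -> List[List[int]]:
    """Split a flat token list into consecutive rows of `chunk` tokens each."""
    rows, cols = chunk_shape(len(token_ids), chunk)
    return [list(token_ids[r * cols:(r + 1) * cols]) for r in range(rows)]
```

With a max sequence length of 2048, `chunk_shape(2048)` gives `(4, 512)`, while 1024 would give `(2, 512)`, so any hard-coded `(4, 512)` reshape would need to change accordingly.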
