heispv/bioinformatics

Bioinformatics

To clone this repository over HTTPS, run the following command in your terminal:

git clone https://github.com/heispv/bioinformatics.git
  • You can retrieve all the functions within this Python file. 😉

In this collection, you will find all the notebooks along with their file descriptions and publish dates. Each notebook contains examples that showcase the results of various functions. 🙂

| Notebook Name | Comment | Publish Date |
| --- | --- | --- |
| Counting Words | To compute Count(Text, Pattern), we "slide a window" down Text and check whether each k-mer substring of Text matches Pattern. We refer to the k-mer starting at position i of Text as Text(i, k). | 21 July 2023 |
| Best Common Patterns | To find the most frequent patterns in a given sequence (such as the ori region), we first create a function that builds a frequency table of all k-mers for a given k. A second function then selects the k-mers with the highest count in that table. | 21 July 2023 |
| Reverse Complement | By convention, each DNA strand is read in the 5' to 3' direction, and the two complementary strands run in opposite directions. Therefore, to generate the reverse complement, we need a function that reverses the original strand and then replaces each nucleotide of the reversed strand with its complement. | 22 July 2023 |
| Pattern Index Finder V1 - Pattern Index Finder V2 (For Real Data) | A function that returns a list of the starting indices of a pattern in a sequence. | 22 July 2023 |
| Clump Finding | We define a k-mer as a "clump" if it appears many times within a short interval of the genome. More formally, given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (longer) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times. The function `find_clumps(seq, k, L, t)` finds these clumps and uses `freq_table(seq, k)` under the hood. | 22 July 2023 |
| Clump Finding (for Real Data) | In this notebook, the `find_clump(seq, k, L, t)` function is optimized to detect clumps in real-world data, because the previous implementation was too slow for genome-sized input. A new function, `freq_index(seq, k)`, contributes to the speedup by providing a list of starting indices for each pattern. | 23 July 2023 |
| Skew Diagram | The goal is a function `skew_diagram(seq)` for analyzing a DNA sequence. The difference between the number of G and C nucleotides (the GC skew) increases along the forward half-strand, from the origin (ori) to the terminus (ter), and decreases along the reverse half-strand. To visualize this, each nucleotide is scored as follows: C counts as -1, G as +1, and T and A as 0. The running sum of these scores is the skew diagram. The ori is located where the diagram reaches its minimum, which pinpoints the position where DNA replication begins. | 25 July 2023 |
| Hamming Distance | We say that position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two mismatches. The number of mismatches between strings p and q is called the Hamming distance between these strings and is denoted `hamming_distance(p, q)`. | 26 July 2023 |
| Approximate Pattern Matching Problem | We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern' of Text having d or fewer mismatches with Pattern, i.e., hamming_distance(Pattern, Pattern') ≤ d. Because a DnaA box may appear with slight variations, the Pattern Matching Problem is generalized to allow mismatches. The function `approximate_pattern_matching(pattern, seq, d)` addresses this problem. | 26 July 2023 |
| Generating the Neighborhood of a String | In this notebook, we define a function named `neighborhood(pattern, d)`. It uses the existing Hamming distance function to generate the set of all sequences whose Hamming distance from `pattern` is at most `d`. | 28 July 2023 |
| Frequent Word With Mismatch Problem | In this notebook, we implement a function named `freq_word_with_mismatch(seq, k, d)`. It takes a sequence, the k-mer length k, and the maximum allowed Hamming distance d, and finds the most frequent k-mers in the sequence while allowing up to d mismatches. This identifies frequently occurring patterns even when the sequence contains slight variations or errors. | 28 July 2023 |
| Frequent Word With Mismatch and Reverse Complement | In this notebook, we implement a function named `freq_word_with_mismatch_reverse(seq, k, d)`. It differs from the previous function in that it also takes the reverse complement of the pattern into account. | 4 August 2023 |
| Finding the DnaA Boxes of Salmonella enterica ✅ | In this notebook we combine everything from the previous notebooks to locate the DnaA boxes of Salmonella enterica. | 6 August 2023 |
| A Brute Force Algorithm For Motif Finding | Brute force (also known as exhaustive search) is a general problem-solving technique that explores all possible solution candidates and checks whether each candidate solves the problem. Such algorithms require little effort to design and are guaranteed to produce a correct solution, but they may take an enormous amount of time, and the number of candidates may be too large to check. A brute force approach to the Implanted Motif Problem is based on the observation that any (k, d)-motif must be at most d mismatches apart from some k-mer appearing in the first string in Dna. Therefore, we can generate all such k-mers and then check which of them are (k, d)-motifs. The function `motif_enumeration(dna_list, k, d)` addresses this problem. | 7 August 2023 |
| Distance Between Patterns and Strings | One step in solving the Median String problem is to compute, for each string in a DNA list, the minimum Hamming distance between a pattern and any k-mer of that string, and then sum these minima. The function `patterns_strings_distance(pattern, dna_list)` does this. | 25 August 2023 |
| Median String Problem | Runtime matters when real-world inputs contain millions of nucleotides, and the brute force motif enumeration above is too slow for them. This notebook defines `median_string(dna_list, k)`, which finds a k-mer pattern that minimizes the total Hamming distance between the pattern and the DNA sequences in `dna_list`. | 25 August 2023 |
| Greedy Motif Search | Given a profile matrix Profile, we can evaluate the probability of every k-mer in a string Text and find a Profile-most probable k-mer in Text, i.e., a k-mer that was most likely to have been generated by Profile among all k-mers in Text. For example, ACGGGGATTACC is the Profile-most probable 12-mer in GGTACGGGGATTACCT. Indeed, every other 12-mer in this string has probability 0. In general, if there are multiple Profile-most probable k-mers in Text, we select the first such k-mer occurring in Text. The function `most_probable_kmer(seq, k, profile_matrix)` addresses this problem. | 27 August 2023 |
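
The sliding-window counting and frequency-table ideas from the Counting Words and Best Common Patterns notebooks can be sketched as follows. This is a minimal illustration, not the repository's own code: `freq_table` follows the name used above, while `pattern_count` and `most_frequent_kmers` are hypothetical names chosen for this example.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    # Slide a window of length k over text and compare each k-mer to pattern.
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def freq_table(seq, k):
    """Map every k-mer in seq to the number of times it occurs."""
    table = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        table[kmer] = table.get(kmer, 0) + 1
    return table


def most_frequent_kmers(seq, k):
    """Return the k-mers with the highest count, sorted alphabetically."""
    table = freq_table(seq, k)
    if not table:
        return []
    best = max(table.values())
    return sorted(kmer for kmer, count in table.items() if count == best)


print(pattern_count("GCGCG", "GCG"))
print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
```

Building the table once and reading the maximum from it avoids rescanning the sequence for every candidate k-mer.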
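
For the Reverse Complement and Pattern Index Finder notebooks, a short sketch might look like this. The function names `reverse_complement` and `pattern_indices` are assumptions made for illustration, and the code assumes uppercase sequences containing only A, C, G, and T.

```python
# Complement of each nucleotide (assumes uppercase A/C/G/T only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the strand, then complement each nucleotide."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def pattern_indices(pattern, seq):
    """Return the 0-based starting indices of pattern in seq (overlaps allowed)."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern]


print(reverse_complement("AAAACCCGGT"))
print(pattern_indices("ATAT", "GATATATGCATATACTT"))
```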
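
The (L, t)-clump definition can be checked in two ways, sketched below under stated assumptions. The first function rebuilds a frequency table for every window, which is simple but slow. The second, `find_clumps_indexed`, is a hypothetical name for one possible index-based speedup in the spirit of `freq_index(seq, k)`; the notebooks' actual optimization may differ.

```python
def freq_table(seq, k):
    """Map every k-mer in seq to its number of occurrences."""
    table = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        table[kmer] = table.get(kmer, 0) + 1
    return table


def find_clumps(seq, k, L, t):
    """Naive version: build a frequency table for every window of length L."""
    clumps = set()
    for start in range(len(seq) - L + 1):
        window = seq[start:start + L]
        for kmer, count in freq_table(window, k).items():
            if count >= t:
                clumps.add(kmer)
    return clumps


def freq_index(seq, k):
    """Map every k-mer in seq to the sorted list of its starting indices."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def find_clumps_indexed(seq, k, L, t):
    """Index-based version (an assumed optimization, not the repository's code)."""
    if len(seq) < L:
        return set()  # no window of length L exists
    clumps = set()
    for kmer, positions in freq_index(seq, k).items():
        # t consecutive occurrences fit in one window if the span from the
        # first start to the end of the t-th occurrence is at most L.
        for i in range(len(positions) - t + 1):
            if positions[i + t - 1] + k - positions[i] <= L:
                clumps.add(kmer)
                break
    return clumps


genome = "GATCAGCATAAGGGTCCCTGCAATGCATGACAAGCCTGCAGTTGTTTTAC"
print("TGCA" in find_clumps(genome, 4, 25, 3))
```

The indexed version scans the sequence once instead of once per window, which is what makes it practical for genome-sized input.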
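
The skew scoring described for the Skew Diagram notebook (G as +1, C as -1, A and T as 0) can be sketched as a running sum. The names `skew_values` and `min_skew_positions` are hypothetical, and plotting is left out; this only computes the numbers a diagram would show.

```python
def skew_values(seq):
    """Running G-minus-C skew; values[i] is the skew of the first i nucleotides."""
    values = [0]
    for base in seq:
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0  # A and T leave the skew unchanged
        values.append(values[-1] + step)
    return values


def min_skew_positions(seq):
    """Positions where the skew is minimal; candidates for the ori location."""
    values = skew_values(seq)
    lowest = min(values)
    return [i for i, value in enumerate(values) if value == lowest]


print(skew_values("CATGGGCATCGGCCATACGCC"))
print(min_skew_positions("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"))
```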
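
The Hamming distance, approximate pattern matching, and neighborhood functions build on one another. The sketch below uses the function names from the table, but the bodies are assumptions; in particular, `neighborhood` uses a common recursive formulation that relies on `hamming_distance`, which may not match the notebook line for line. It assumes a non-empty pattern over A, C, G, and T.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_matching(pattern, seq, d):
    """Starting indices where pattern occurs in seq with at most d mismatches."""
    k = len(pattern)
    return [
        i for i in range(len(seq) - k + 1)
        if hamming_distance(pattern, seq[i:i + k]) <= d
    ]


def neighborhood(pattern, d):
    """All strings within Hamming distance d of a non-empty pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    neighbors = set()
    suffix = pattern[1:]
    for text in neighborhood(suffix, d):
        if hamming_distance(suffix, text) < d:
            # A mismatch is still available, so any first base is allowed.
            neighbors.update(base + text for base in "ACGT")
        else:
            # The mismatch budget is spent; the first base must match.
            neighbors.add(pattern[0] + text)
    return neighbors


print(hamming_distance("CGAAT", "CGGAC"))
print(approximate_pattern_matching("AAAAA", "AACAAGCTGATAAACATTTAAAGAG", 2))
print(sorted(neighborhood("ACG", 1)))
```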
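
One way to approach the Frequent Word With Mismatch problem is to let every k-mer in the sequence vote for each string in its d-neighborhood. The sketch below takes that approach; it reuses the name `freq_word_with_mismatch` from the table, but the body is an assumption and the reverse-complement variant is not covered.

```python
from collections import Counter


def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def neighborhood(pattern, d):
    """All strings within Hamming distance d of a non-empty pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    neighbors = set()
    suffix = pattern[1:]
    for text in neighborhood(suffix, d):
        if hamming_distance(suffix, text) < d:
            neighbors.update(base + text for base in "ACGT")
        else:
            neighbors.add(pattern[0] + text)
    return neighbors


def freq_word_with_mismatch(seq, k, d):
    """Most frequent k-mers in seq, counting occurrences with up to d mismatches."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        # Each window adds one to the count of every k-mer it approximately matches.
        for neighbor in neighborhood(seq[i:i + k], d):
            counts[neighbor] += 1
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)


print(freq_word_with_mismatch("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
```

Note that the winning k-mers do not have to occur exactly in the sequence; they only need many approximate occurrences.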
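
The brute force idea for the Implanted Motif Problem, as described above, is to take every k-mer of the first string, expand it to its d-neighborhood, and keep the candidates that appear with at most d mismatches in every string. A minimal sketch, assuming this reading of `motif_enumeration(dna_list, k, d)`; `appears_with_mismatches` is a hypothetical helper:

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def neighborhood(pattern, d):
    """All strings within Hamming distance d of a non-empty pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    neighbors = set()
    suffix = pattern[1:]
    for text in neighborhood(suffix, d):
        if hamming_distance(suffix, text) < d:
            neighbors.update(base + text for base in "ACGT")
        else:
            neighbors.add(pattern[0] + text)
    return neighbors


def appears_with_mismatches(pattern, text, d):
    """True if some k-mer of text is within d mismatches of pattern."""
    k = len(pattern)
    return any(
        hamming_distance(pattern, text[i:i + k]) <= d
        for i in range(len(text) - k + 1)
    )


def motif_enumeration(dna_list, k, d):
    """All (k, d)-motifs: k-mers appearing in every string with <= d mismatches."""
    motifs = set()
    first = dna_list[0]
    for i in range(len(first) - k + 1):
        # Any (k, d)-motif lies within d mismatches of some k-mer of the first string.
        for candidate in neighborhood(first[i:i + k], d):
            if all(appears_with_mismatches(candidate, s, d) for s in dna_list):
                motifs.add(candidate)
    return motifs


dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(sorted(motif_enumeration(dna, 3, 1)))
```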
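
The distance between a pattern and a list of strings, and the median string search built on it, can be sketched as below. The function names come from the table; the bodies are assumptions. The search shown here simply tries all 4^k candidate k-mers, and it assumes every string in `dna_list` has length at least k.

```python
from itertools import product


def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def patterns_strings_distance(pattern, dna_list):
    """Sum over all strings of the minimum distance from pattern to any k-mer."""
    k = len(pattern)
    total = 0
    for text in dna_list:
        # Assumes len(text) >= k, otherwise there is no k-mer to compare with.
        total += min(
            hamming_distance(pattern, text[i:i + k])
            for i in range(len(text) - k + 1)
        )
    return total


def median_string(dna_list, k):
    """A k-mer minimizing patterns_strings_distance; first one found wins ties."""
    best_pattern, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        distance = patterns_strings_distance(pattern, dna_list)
        if distance < best_distance:
            best_pattern, best_distance = pattern, distance
    return best_pattern


dna = ["TTACCTTAAC", "GATATCTGTC", "ACGGCGTTCG", "CCCTAAAGAG", "CGTCAGAGGT"]
print(patterns_strings_distance("AAA", dna))
```

The cost grows as 4^k times the total input length, so this is only practical for short k; unlike motif enumeration, it needs no mismatch threshold d.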
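
Finding a Profile-most probable k-mer amounts to multiplying one profile entry per position and keeping the first k-mer with the highest product. The sketch below assumes the profile is a dict mapping each nucleotide to a list of k column probabilities; the actual layout of `profile_matrix` in the notebook is not stated here and may differ.

```python
def kmer_probability(kmer, profile_matrix):
    """Probability of kmer under the profile (product of per-column entries)."""
    probability = 1.0
    for column, base in enumerate(kmer):
        probability *= profile_matrix[base][column]
    return probability


def most_probable_kmer(seq, k, profile_matrix):
    """First k-mer in seq with the highest probability under profile_matrix."""
    best_kmer, best_probability = None, -1.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        probability = kmer_probability(kmer, profile_matrix)
        # Strict comparison keeps the earliest k-mer when probabilities tie.
        if probability > best_probability:
            best_kmer, best_probability = kmer, probability
    return best_kmer


# Toy 2-column profile (made-up numbers, for illustration only).
profile = {
    "A": [0.7, 0.1],
    "C": [0.1, 0.1],
    "G": [0.1, 0.7],
    "T": [0.1, 0.1],
}
print(most_probable_kmer("TAGC", 2, profile))
```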
