BU-BF550-Final-Project

In "Correlation between Either Cupriavidus or Porphyromonas and Primary Pulmonary Tuberculosis Found by Analysing the Microbiota in Patients' Bronchoalveolar Lavage Fluid", Zhou et al. set out to find links between the diversity (or lack thereof) of lung microbiota and chronic pulmonary tuberculosis (TB). Drawing on concepts from systems biology, they suggest that chronic infectious diseases often reflect a kind of distorted equilibrium between the host microbiota and the pathogen, which allows the pathogen to persist. Zhou et al. wished to apply nucleotide sequencing to characterize the composition of the diseased lung microbiome, as has been done previously for other body sites (e.g., linking obesity to differing diversities in the intestinal microbiome). A sample of 32 patients presenting with chronic TB was enrolled, with chest radiography being the most important selection factor: each patient had to present with one normal lung and one diseased lung. Bronchoalveolar lavage fluid was obtained from both sides as a marker for microbiota, with the sample from the normal lung denoted Group A and the sample from the diseased lung denoted Group B. A further 24 healthy patients were chosen as the control group, denoted Group H. Zhou et al. went on to perform parallel pyrosequencing of bacterial 16S rDNA amplicons in the V3 region. The resulting sequencing data was binned into FASTA files, aligned, and then used to calculate the evenness and Shannon entropy indices. This is my attempt to reproduce and visualize their results.
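
For reference, the two indices named above are usually written as follows. This is the standard textbook form, given here as an assumption: the log base and the exact evenness definition used by Zhou et al. are not stated in this README, so Pielou's evenness and the natural log are assumed. Here S is the number of distinct types in a sample, n_i is the count of type i, and N is the total count.

```latex
H' = -\sum_{i=1}^{S} p_i \ln p_i, \qquad p_i = \frac{n_i}{N}, \qquad J' = \frac{H'}{\ln S}
```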

Methods and Analysis:

Raw data was obtained in FASTA format from the Dryad data repository. The data was analyzed as follows:

• Files of Group A, Group B and Group H were divided into three lists: filesA, filesB, filesC.
• Each file in each list was read line by line, searching for any occurrences of amplicons and adding each detected amplicon to a running count. A dictionary was then created with each amplicon as the key and its count as the value. Several assumptions were made in this analysis:
    o As stated in Zhou et al., detection of amplicons only occurred in sample sequences around 200 bp long. In turn, my algorithm only searched for amplicons in sequences > 185 bp long.
    o As stated in Zhou et al., amplicon length was found to be around 200-220 bp. Using this information, the algorithm would iterate along the length of the line, with the candidate substring gradually growing in size across the range 200-220 bp.
        ▪ For example: search the line for line[0:200] → line[0:220], line[1:200] → line[1:220], etc.
        ▪ The range was eventually changed to range(4,1000). This decision is explained in more depth in the next section.
    o What exactly are amplicons? The Wikipedia description notes that they most commonly take the form of direct repeats and inverted repeats. With this in mind, the algorithm searched each line for occurrences of both.
    o The final count is given one additional occurrence to account for the first amplicon detected.
• This analysis was repeated for each list of files, with the count values for each line of each file group being added to the lists platoA, platoB and platoC respectively.
• Using pandas and scipy, entropy (Shannon-Wiener Diversity Index) was calculated for each list of amplicon counts per line of each file. Each entropy value was appended to the lists dataA, dataB and dataC respectively.
• Finally, each entropy data set was plotted as a probability density curve.
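
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from the notebook: the function names, the fixed repeat length k, and the choice to read whole FASTA records rather than raw lines are all assumptions made for the example. It only counts direct repeats (overlapping occurrences included); inverted repeats are left out. Because every occurrence is counted, no extra "+1" for the first hit is needed here.

```python
import math
from collections import Counter


def read_fasta(lines):
    """Yield one sequence string per FASTA record (header lines start with '>')."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def repeat_counts(seq, k):
    """Return {substring: occurrences} for length-k substrings seen more than once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {sub: n for sub, n in counts.items() if n > 1}


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a collection of non-negative counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts)


def entropy_per_sequence(lines, k=4, min_len=185):
    """One entropy value per sequence longer than min_len that contains repeats.

    k=4 (the lower end of range(4,1000)) and min_len=185 mirror the numbers in
    the text; treating k as a single fixed value is an assumption of this sketch.
    """
    values = []
    for seq in read_fasta(lines):
        if len(seq) > min_len:
            counts = repeat_counts(seq, k)
            if counts:
                values.append(shannon_entropy(counts.values()))
    return values
```

With scipy installed, `scipy.stats.entropy` applied to the same counts should give the same value as `shannon_entropy`, since it also normalizes the counts and uses the natural log by default.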

Results and Discussion:

I was not able to accurately reproduce Zhou et al.'s graph. As I moved further through my analysis, it became clear that what appeared to be a fairly simple graph would be quite difficult to reproduce. I believe this is due to several reasons:

• Zhou et al. did not use Python (at least not directly) to analyze the FASTA data, as far as I could tell. They performed several different forms of analysis:
    o "Individual sequences were aligned using Aligner tools, and aligned sequence files for each sample were processed using complete-linkage clustering with distance criteria".
    o "We used the Uclust algorithm to cluster all of the sequences with a cut-off value of 97%; after clustering, we used the representative sequence of each type as the operational taxonomic unit (OTU) and recorded each OTU sequence representing the number of sequences and the classification information."
    o "Shannon diversity was estimated using Estimate S Win 8.20 software."
• It seems that Zhou et al. were able to determine multiple entropy values per line, not per group, as shown in their graph. This points to calculating entropy for every line in every file. How they were able to do this I'm not sure, as when I searched for amplicons I was often only able to locate 1 or 2 examples per line.
    o This may have been different if I had also searched lines shorter than 185 bp.
    o Further, I had to shift my range(200, 220) to range(4,1000), as I was detecting no amplicons in my earlier range. Through tinkering I found that there were seemingly no amplicons above 25 bp. This most likely has something to do with my analysis and the differences in the software we used to infer information from the FASTA files.
• Besides this, not much information was given on the steps of the analysis. More detail on the software tools and how they parsed the data would have been appreciated.
• Further, if the complete Dryad data was available, I could not locate it. The .zip file featuring the data contained all the samples for both Group A and Group B, but seemingly none for Group H. I used the FASTA files denoted by an "N" in their file name, which I'm assuming stands for "Normal". That being said, there were only 8 of these files, compared with 32 each for Group A and Group B.
• Lastly, while Zhou et al.'s graph is pretty, it is noticeably lacking x and y axis labels. The graph is described as a collection of "Shannon-Weaver index curves", but through my research I could not find any other examples of this type of graph, at least by the name given.
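
To make the contrast with the paper's OTU-based approach concrete, here is a rough sketch of how clustering reads at a 97% cut-off leads to one Shannon value per sample. It is not UCLUST and not what Zhou et al. ran: the greedy first-match clustering and the use of `difflib.SequenceMatcher.ratio()` as a stand-in for sequence identity are assumptions chosen only to keep the example self-contained, and the function names are hypothetical.

```python
import math
from difflib import SequenceMatcher


def identity(a, b):
    """Similarity in [0, 1]; a crude stand-in for alignment identity (assumption)."""
    # autojunk=False: the default heuristic would ignore very common
    # characters in long strings, which for DNA means every base.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold,
    otherwise start a new cluster with that read as centroid."""
    centroids, sizes = [], []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                sizes[i] += 1
                break
        else:
            centroids.append(s)
            sizes.append(1)
    return centroids, sizes


def shannon(sizes):
    """Shannon index (natural log) from cluster sizes, i.e. OTU abundances."""
    total = sum(sizes)
    return -sum((n / total) * math.log(n / total) for n in sizes if n > 0)
```

Under this scheme each sample file yields one list of cluster sizes and therefore one diversity value, rather than one value per line. On real data, a dedicated clustering tool would be far faster and more accurate than this pairwise loop.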

In the end I believe I bit off more than I could chew. Nevertheless, I believe some similar conclusions about microbial diversity in TB patients can be drawn from both graphs. As quoted from the paper: "According to our research, significant differences can be observed in the respiratory tract microbiota of healthy people when compared with TB patients." Looking at my graph, the Group B curve shows a greater degree of variation in amplicon diversity than the Group A curve, suggesting a greater presence of exotic microbes and less homogeneity in the TB-affected lung.
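
For readers who want to see what sits behind curves like these, a probability density curve can be approximated with a Gaussian kernel density estimate. The sketch below is a plain-Python illustration under assumptions: the function name and the fixed bandwidth are choices made for the example, and it is not the plotting code used in the notebook.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x
    by averaging Gaussian kernels of width `bandwidth` centred on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Placeholder values for illustration only; these are not results from this project.
example_entropies = [0.2, 0.4, 0.5, 0.9]
curve = gaussian_kde(example_entropies, bandwidth=0.2)
grid = [i / 100 for i in range(-100, 201)]   # x values from -1.0 to 2.0
heights = [curve(x) for x in grid]           # y values of the density curve
```

Evaluating one such curve per group (dataA, dataB, dataC) on a shared grid gives curves that can be overlaid and compared. A wider curve means the entropy values in that group are more spread out.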

Overall, this project was a learning experience. I am intensely interested in the study of our microbiome, and after learning more about data analysis I hope to return to this study and attempt to recreate the figures.

