Lister is a class whose instances give access to all of the initial data used in the project.
Lister uses pandas to manipulate the data so that other processes can use it more easily.
The data to be initialized includes the following:
- Master_Accession_File.csv (include project name)
- common_names.csv (PAML) (rename)
- taxon_ids.txt
Add some examples of these files as template
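As a starting point for such a template, the sketch below shows one plausible layout for the accession file. The layout is an assumption inferred from the header list shown in this README (`Tier`, `Gene`, then one column per organism). The sample row reuses the HTR1A accessions quoted in the examples here. The standard-library `csv` module stands in for pandas so the snippet runs on its own.

```python
import csv
import io

# Hypothetical template for Master_Accession_File.csv. The column layout is an
# assumption based on the header list in this README; the sample row reuses
# the HTR1A accessions quoted in the README examples.
TEMPLATE = """\
Tier,Gene,Homo_sapiens,Macaca_mulatta,Mus_musculus
1,HTR1A,NM_000524.3,NM_001198700.1,NM_008308.4
"""

reader = csv.DictReader(io.StringIO(TEMPLATE))
header = reader.fieldnames  # column names in file order
rows = list(reader)         # one dict per gene row

print(header)
print(rows[0]["Homo_sapiens"])
```

The real file may carry more columns (one per organism) and additional rows per tier; only the overall shape is suggested here.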
Using docstrings for help:
from manager.lister import Lister
help(Lister)
Using this class is simple:
from lister import Lister
from pprint import pprint
x = Lister(acc_file='Master_Accession_File.csv', paml_file='common_names.csv', taxon_file='taxon_ids.txt')
The full call signature and its parameters:
x = Lister(acc_file, paml_file, taxon_file, go_list=None, hgnc_file=False)
- acc_file ('Master_Accession_File.csv') - Contains accession numbers for a group of genes ranked by tier. Each gene has a group of orthologs used in our phylogenetic analysis.
- paml_file ('common_names.csv') - Contains a list of shortened organism names used in the MSA files. This is done to comply with PAML.
- taxon_file ('taxon_ids.txt') - Contains an ordered list of taxon IDs.
- go_list ([[gene.1, org.1], ... , [gene.n, org.n]]) - A nested list that can be used to get information about specific gene/org pairs.
- hgnc_file - For future implementation. Used as a file handle to parse an HGNC .csv file.
Attributes available on the object include:
- x.gene_count
- x.org_count
- x.paml_org_list
- x.taxon_ids
List that contains header info (x.header):
pprint(x.header)
['Tier',
'Gene',
'Homo_sapiens',
'Macaca_mulatta',
'Mus_musculus',
'Rattus_norvegicus',
...
'Trichechus_manatus_latirostris',
'Tupaia_chinensis',
'Tursiops_truncatus']
List of Accessions (x.acc_list):
pprint(x.acc_list)
['NM_000680.3',
'NM_000679.3',
'NM_000678.3',
...
'xm_004368425.2',
'XM_006155397.2',
'XM_004317686.1']
List of Genes (x.gene_list):
pprint(x.gene_list)
['ADRA1A',
'ADRA1B',
'CHRM2',
...
'SSTR2',
'TSHR',
'VIPR1']
List of Organisms (x.org_list):
pprint(x.org_list)
['Homo_sapiens',
'Macaca_mulatta',
'Mus_musculus',
'Rattus_norveg',
...
'Trichechus_manatus_latirostris',
'Tupaia_chinensis',
'Tursiops_truncatus']
Dictionary of Accessions (x.acc_dict):
pprint(x.acc_dict)
{'NM_000115.3': ['EDNRB', 'Homo_sapiens'],
'NM_000145.3': ['FSHR', 'Homo_sapiens'],
'NM_000164.3': ['GIPR', 'Homo_sapiens'],
...
'NM_001001620.1': ['CCR3', 'Sus_scrofa'],
'NM_001002911.3': ['GPR139', 'Homo_sapiens'],
'NM_001002944.1': ['ADORA2B', 'Canis_lupus_familiaris']}
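A mapping of this shape can be derived from the accession table by walking each gene row and recording every accession with its gene and organism. The following is a minimal sketch of that idea, not Lister's actual implementation: `build_acc_dict` is a hypothetical helper, and the row layout (one dict per gene, keyed by column name) is an assumption.

```python
# Sample rows in the assumed layout: one dict per gene, keyed by column name.
rows = [
    {"Tier": "1", "Gene": "HTR1A",
     "Homo_sapiens": "NM_000524.3", "Macaca_mulatta": "NM_001198700.1"},
]


def build_acc_dict(rows):
    """Map each accession to its [gene, organism] pair (hypothetical helper)."""
    acc_dict = {}
    for row in rows:
        gene = row["Gene"]
        for column, accession in row.items():
            if column in ("Tier", "Gene") or not accession:
                continue  # skip metadata columns and empty cells
            acc_dict[accession] = [gene, column]
    return acc_dict


acc_dict = build_acc_dict(rows)
print(acc_dict["NM_000524.3"])
```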
Dictionary of Genes is a nested dictionary (x.gene_dict, x.tier_dict):
pprint(x.gene_dict['HTR1A'])
{'Ailuropoda_melanoleuca': 'XM_002926305.1',
'Bos_taurus': 'XM_600535.5',
'Callithrix_jacchus': 'XM_008992005.2',
...
'Tier': '1',
'Trichechus_manatus_latirostris': 'xm_004374552.2',
'Tupaia_chinensis': 'xm_006156821.1',
'Tursiops_truncatus': 'xm_004325159.1'}
pprint(x.tier_dict['HTR1A'])
'1'
Dictionary of Organisms is a nested dictionary (x.org_dict):
homosapiens_query = x.org_dict['Homo_sapiens'].values()
homosapiens_gene_list = x.org_dict['Homo_sapiens'].keys()
pprint(list(homosapiens_query))
['NM_000680.3',
'NM_000679.3',
'NM_000678.3',
...
'XM_011517263.2',
'NM_000369.2',
'NM_004624.3']
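The organism dictionary can be thought of as the gene dictionary turned inside out: organisms become the outer keys and genes the inner keys. The sketch below illustrates that relationship under the assumption that both dictionaries have the nested shapes shown in this README. `invert_gene_dict` is a hypothetical helper, not part of Lister.

```python
gene_dict = {
    "HTR1A": {"Tier": "1",
              "Homo_sapiens": "NM_000524.3",
              "Macaca_mulatta": "NM_001198700.1"},
    # The tier for EDNRB is a placeholder; only its Homo_sapiens accession
    # is quoted in this README.
    "EDNRB": {"Tier": "1", "Homo_sapiens": "NM_000115.3"},
}


def invert_gene_dict(gene_dict):
    """Build {organism: {gene: accession}} from {gene: {organism: accession}}."""
    org_dict = {}
    for gene, record in gene_dict.items():
        for key, accession in record.items():
            if key == "Tier":
                continue  # 'Tier' is metadata, not an organism
            org_dict.setdefault(key, {})[gene] = accession
    return org_dict


org_dict = invert_gene_dict(gene_dict)
print(list(org_dict["Homo_sapiens"].keys()))
print(list(org_dict["Homo_sapiens"].values()))
```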
Dataframe that uses Gene as an index (x.df):
pprint(x.df.T.HTR1A)
Tier 1
Homo_sapiens NM_000524.3
Macaca_mulatta NM_001198700.1
Mus_musculus NM_008308.4
...
Tupaia_chinensis xm_006156821.1
Tursiops_truncatus xm_004325159.1
Name: HTR1A, dtype: object
Pivot Table MultiIndexed with pandas (x.pt):
# #### Format the main pivot table #### #
self.pt = pd.pivot_table(pd.read_csv(self.__filename_path), index=['Tier', 'Gene'], aggfunc='first')
array = self.pt.axes[1].tolist() # Organism list
self.pt.columns = pd.Index(array, name='Organism')
If your data has tiers or is divided into groups (x.get_tier_frame, x.tier_frame_dict):
Tiers = x.get_tier_frame('1')
Tiers.keys()
dict_keys(['1'])
Tiers = x.tier_frame_dict()
Tiers.keys()
dict_keys(['1', '2', '3', 'None'])
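The idea of splitting the data by tier can be sketched without pandas as a simple grouping step. This is only an illustration of the concept: the real methods return data keyed by tier, whereas the hypothetical `group_genes_by_tier` below returns plain lists of gene names, and most of the tier assignments are placeholders.

```python
from collections import defaultdict

# Placeholder tier assignments; only HTR1A's tier ('1') comes from this README.
tier_dict = {"HTR1A": "1", "ADRA1A": "1", "CHRM2": "2", "TSHR": "None"}


def group_genes_by_tier(tier_dict):
    """Return {tier: [genes]} with genes kept in their original order."""
    groups = defaultdict(list)
    for gene, tier in tier_dict.items():
        groups[tier].append(gene)
    return dict(groups)


tiers = group_genes_by_tier(tier_dict)
print(tiers.keys())
print(tiers["1"])
```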
Look up Accessions (x.get_accession(gene, org), x.get_accessions(go_list=None)):
x.get_accession('HTR1A', 'Homo_sapiens')
'NM_000524.3'
go_list = [['HTR1A', 'Homo_sapiens'], ['HTR1A', 'Macaca_mulatta']]
x.get_accessions(go_list = go_list)
['NM_000524.3', 'NM_001198700.1']
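Conceptually, both lookups are nested-dictionary accesses: one for a single gene/organism pair, and one repeated over a go_list. The sketch below shows that idea with standalone functions that take the dictionary explicitly. It assumes the gene dictionary shape shown in this README and is not the class's actual code.

```python
gene_dict = {
    "HTR1A": {"Tier": "1",
              "Homo_sapiens": "NM_000524.3",
              "Macaca_mulatta": "NM_001198700.1"},
}


def get_accession(gene_dict, gene, org):
    """Look up one accession for a gene/organism pair."""
    return gene_dict[gene][org]


def get_accessions(gene_dict, go_list):
    """Look up accessions for a list of [gene, organism] pairs, in order."""
    return [get_accession(gene_dict, gene, org) for gene, org in go_list]


go_list = [["HTR1A", "Homo_sapiens"], ["HTR1A", "Macaca_mulatta"]]
print(get_accession(gene_dict, "HTR1A", "Homo_sapiens"))
print(get_accessions(gene_dict, go_list))
```

An unknown gene or organism raises `KeyError` in this sketch; how Lister itself handles missing pairs is not specified here.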
Look up a list of Accessions for alignment (x.get_accession_alignment(gene)):
pprint(x.get_accession_alignment('HTR1A'))
['NM_000524.3',
'NM_001198700.1',
...
'xm_006156821.1',
'xm_004325159.1']
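One way to picture this call is as collecting a gene's accessions in organism order, so that the list lines up with the organism list. The sketch below assumes that ordering, which this README does not state explicitly. `accession_alignment` is a hypothetical stand-in for the method.

```python
org_list = ["Homo_sapiens", "Macaca_mulatta", "Mus_musculus"]
gene_dict = {
    "HTR1A": {"Tier": "1",
              "Mus_musculus": "NM_008308.4",
              "Homo_sapiens": "NM_000524.3",
              "Macaca_mulatta": "NM_001198700.1"},
}


def accession_alignment(gene_dict, org_list, gene):
    """Return the gene's accessions ordered by org_list (assumed ordering)."""
    record = gene_dict[gene]
    # Organisms with no entry for this gene are skipped.
    return [record[org] for org in org_list if org in record]


print(accession_alignment(gene_dict, org_list, "HTR1A"))
```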
Get the master lists from a new CSV file (y.get_master_lists(csv_file)):
from Manager.lister import Lister
import os
from pathlib import Path
y = Lister()
y.get_master_lists(csv_file='MAFV3.1.csv')