None of ['query'] are in the columns #32

haoqing12 · 2024-02-27T07:24:28Z

hi, your tool is wonderful and your tutorials are very detailed. However, I encountered this error when running the T antigen workflow.

In [14]: snaf.JunctionCountMatrixQuery.generate_results(path='./result/after_prediction.p',outdir='./result')
...:

adding gene symbol

KeyError Traceback (most recent call last)
in
----> 1 snaf.JunctionCountMatrixQuery.generate_results(path='./result/after_prediction.p',outdir='./result')

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in generate_results(path, outdir, criterion)
454 # add additional attributes
455 df = pd.read_csv(os.path.join(outdir,'frequency_stage{}_verbosity1_uid.txt'.format(stage)),sep='\t',index_col=0)
--> 456 enhance_frequency_table(df,True,True,outdir,'frequency_stage{}_verbosity1_uid_gene_symbol_coord_mean_mle.txt'.format(stage))
457 # report candidates
458 if stage == 3:

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in enhance_frequency_table(df, remove_quote, save, outdir, name)
1401 '''
1402 print('adding gene symbol')
-> 1403 df = add_gene_symbol_frequency_table(df=df,remove_quote=remove_quote)
1404 print('adding chromosome coordinates') [0/1885]
1405 df = add_coord_frequency_table(df,remove_quote=False)

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py in add_gene_symbol_frequency_table(df, remove_quote)
892 df['samples'] = [literal_eval(item) for item in df['samples']]
893 ensg_list = [item.split(',')[1].split(':')[0] for item in df.index]
--> 894 symbol_list = ensemblgene_to_symbol(ensg_list,'human')
895 df['symbol'] = symbol_list
896 return df

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py in ensemblgene_to_symbol(query, species)
920 import mygene
921 mg = mygene.MyGeneInfo()
--> 922 out = mg.querymany(query,scopes='ensemblgene',fileds='symbol',species=species,returnall=True,as_dataframe=True,df_index=True)
923
924 df = out['out']

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py in _querymany(self, qterms, scopes, **kwargs)
597
598 if dataframe:
--> 599 out = self._dataframe(out, dataframe, df_index=df_index)
600 li_dup_df = DataFrame.from_records(li_dup, columns=["query", "duplicate hits"])
601 li_missing_df = DataFrame(li_missing, columns=["query"])

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py in _dataframe(obj, dataframe, df_index)
170 df = DataFrame.from_dict(obj)
171 if df_index:
--> 172 df = df.set_index("query")
173 return df
174

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper

~/anaconda3/envs/SNAF/lib/python3.7/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
5449
5450 if missing:
-> 5451 raise KeyError(f"None of {missing} are in the columns")
5452
5453 if inplace:

KeyError: "None of ['query'] are in the columns"

================================
now, the result dir have these files:

after_prediction.p
burden_stage0.txt
burden_stage3.txt
frequency_stage0.txt
frequency_stage3.txt
frequency_stage3_verbosity1_uid.txt
NeoJunction_statistics_maxmin.txt
x_neoantigen_frequency_stage3.pdf
x_occurence_frequency_stage3.pdf

Any help you can provide would be greatly appreciated.

frankligy · 2024-02-27T19:39:36Z

Hi @haoqing12,

Thanks for reaching out! This issue has been brought up before (#25), and as I illustrated in that thread, this error basically says the whole neoantigen prediction step fails, and all the stage3 files are actually empty dataframe, that's why there's a pandas error being throwing out.

In terms of why the prediction would fail, currently two things I observed are:

[1] make sure netMHCpan path is set correctly
[2] make sure HLA allele format is correct

I have more detailed explanation in issue 25 (the link I shared above).

Would you mind checking if either of these apply to your case? And let me know if not, then we will dig further into that.

Best,
Frank

haoqing12 · 2024-02-28T03:49:48Z

Thank you very much, I ignored the ‘HLA-’ prefix. Now, it‘s working.

But, I met a new error.

Traceback (most recent call last):
File "", line 1, in
File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 431, in run
self.parallelize_run(kind=3,hlas=hlas)
File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 499, in parallelize_run
result = collect.get()
File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
Exception: immunogenic prediction error: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

Have you experienced this problem?

frankligy · 2024-02-28T14:34:56Z

Hi @haoqing12,

I never ran SNAF or DeepImmuno on GPU node, because DeepImmuno is quite lightweight and can finish on CPU in seconds. I think the issue here is either since SNAF is memory-intensive (I am not proud of that), the GPU you were running doesn't have that much cpu memory, or you run DeepImmuno on GPU and it exceeds the GPU memory (I never tested how much GPU memory It will need if ran on GPU).

I guess my suggestion is to run on a CPU node and see if the issue will go away, in terms of CPU memory, I did a test for another issue brought up (#27) using netMHCpan. This can serve as a guidance for memory setup.

Let me know if the issue persists!

Best,
Frank

haoqing12 · 2024-03-01T14:31:31Z

I am immensely grateful for your invaluable guidance. I've run through the whole pipeline, except for the B-cell antigen prediction.

Following your suggestion, I solved the memory problem by running the pipeline on a CPU node and adjusting the parameter t_min to 200.

However, I still have a few minor questions:

1, Interpretation of result/T_candidates/T_antigen_candidates_all.txt:
I am curious about the 'in_db' column. Unfortunately, I couldn't find an explanation in the Troubleshooting section. Is it advisable to select only peptides where 'in_db' is True?

2, If I want to directly open a job that has already been run, what file do I need to load.
when I ran visualizing(), I got -- NameError: name 'dict_exonCoords' is not defined.
when I ran gtex_visual_combine_plotly(), I got -- NameError: name ' adata' is not defined

3, The detailed sequences about specific uid.
I am eager to obtain novel amino acid/nucleotide sequences based on the UID for more detailed plotting and other predictions. Could you kindly direct me on the steps or files required for this task?

Thank you very, very much for your time and continuous support. I truly appreciate your dedication to assisting others.

Best regards,
haoq

frankligy · 2024-03-01T15:53:59Z

Hi @haoqing12,

Glad you are able to get it to run and these are all good questions, I'll add the explanations to the doc as well.

[1] in_db means whether this splicing junction is present in one of the documented isoforms in the large database. We do not recommend filter by that, because a lot of isoforms documented so far are actually cancer or disease specific isoform. What we would suggest for those labelled as "in_db", is to go to the GTEx portal, look for exon/junction/isoform expression to see whether the transcript carrying this junction is actually expressed in normal tissue or not. But since SNAF has already filtered the junction based on their expression level in GTEx, most of the case, these are not highly detected in normal tissue.

[2] Sorry for not explaining that clearly, the issue is we still need to run the snaf.initialize() step to load in all the required global variables, so append the visualization code after, and because initiation takes some time, so I would recommend to write a list for all the Neojunction you'd like visualize and use a for loop, so you just need to initialize once and plot all:

# read in the splicing junction matrix
df = pd.read_csv('/user/ligk2e/altanalyze_output/ExpressionInput/counts.original.pruned.txt',index_col=0,sep='\t')

# database directory (where you extract the reference tarball file) and netMHCpan folder
db_dir = '/user/ligk2e/download'
netMHCpan_path = '/user/ligk2e/netMHCpan-4.1/netMHCpan'

# demonstrate how to add additional control database, see below note for more
tcga_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','tcga_matched_control_junction_count.h5ad'))
gtex_skin_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','gtex_skin_count.h5ad'))
add_control = {'tcga_control':tcga_ctrl_db,'gtex_skin':gtex_skin_ctrl_db}

# initiate
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)

# visualize
jcmq = snaf.JunctionCountMatrixQuery.deserialize('result/after_prediction.p')
jcmq.visualize(uid='ENSG00000167291:E38.6-E39.1',sample='TCGA-DA-A1I1-06A-12R-A18U-07.bed',outdir='./result')

# interactive
snaf.gtex_visual_combine_plotly(uid=uid,outdir='result_new/common',norm=False,tumor=df)
# static
dff = snaf.gtex_visual_combine(uid=uid,outdir='Frank_inspection',norm=False,tumor=df,group_by_tissue=False)

[3] Let me know if what I am suggesting is not you were asking for, but if you are interesting in the flanking sequence, you can take a look at this issue (#31), if you are interested in obtaining the whole isoform sequence associated with Junction, you can refer to this issue (#22). Or if you have more customized demand, you can always take the chromosome coordinates reported in the result and query it in UCSC genome browser. If none of above is what you asked, let me know and I can further clarify.

Thank you,
Frank

haoqing12 · 2024-03-04T13:59:56Z

Thank you again. You have given me two issues that are all I want. However, I am having some problems/concerns in the process:

for example "ENSG00000151632:E1.3-E5.1"
From my perspective, the alternative splicing of AKR1C2 jumps directly from exon 1 unconventional splice site (E1.3) to exon 5. The junction locus on the chromosome is 5010653 (left) and 5017900 (right). I don't know if my understanding is correct.

issue#31, In your opinion, if the flanking sequence contains *, we will drop it.

But I found that the "FLHDVGRVL" presented in the result/T_candidates/T_antigen_candidates_all.txt belongs to the translation for phase = 2, and if I set MAX_PEP_LEN to 15, I find that this translation also contains *. MAX_PEP_LEN = 9 hides this codon.

issue#22, I got sr_ffl_str3_report_None_False.txt file. But I didn't find consistent UIDs in T_candidates/T_antigen_candidates_all.txt and sr_ffl_str3_report_None_False.txt. Is it because sr_ffl_str3_report_None_False can only get the B cell surface antigen's full length, not T-cell's?

Best regards,
haoq

frankligy · 2024-03-04T16:19:19Z

Hi @haoqing12,

Both of your observations are correct and I can clarify:

[1] SNAF in silico translation is slightly different from the code I provided to (#31), that was meant to be a quick solution to check the flanking sequence. SNAF in silico translation rules are illustrated below using the specific example you brought up with:

That explains why the peptide FLHDVGRVL got predicted, as illustrated below, because E1.3 happens to not have stop codon:

But of course you will wonder, then how do you explain the preceding stop codon associated with this translation phase? My rationale (feel free to disagree) to include these in the output bases on the fact that a lot of tumor antigens are deriving from non-canonical short ORF with non-ATG start codon, I want to point you to a recently published paper (https://www.nature.com/articles/s41467-024-46240-9 ) and if you navigate to the supplemental figure 8, they showed a few concrete examples where such cryptic ORF exist from RiboSeq data. Similar evidence can be found in the nuORF paper (https://pubmed.ncbi.nlm.nih.gov/34663921/). Another evidence for that type of translation is this paper (https://pubmed.ncbi.nlm.nih.gov/18292810/), see figure6 below (this peptide is preceded by stop codon in that translation phase) and noting this peptide has gone to phase I clinical trial. (https://ascopubs.org/doi/10.1200/JCO.2024.42.4_suppl.435):

One thing I think we have to know is, an antigen doesn't need to be a functional stable protein to be able to presented by HLA, and the fact is it may be exact opposite, unstable degradable short peptides will rapidly be lysed by proteosome, which is exactly the mechanisms for presenting HLA-I presented peptides. So these peptides won't be detected using proteomes because they don't exist but will be detected in HLA-I pull-down immunopeptidome (https://aacrjournals.org/cancerimmunolres/article/8/8/1018/470266/Identification-of-the-Cryptic-HLA-I).

[2] For this issue, I think the reason is only subset of NeoJunction whose full-length isoform can be inferred from the present database and pass certain criteria (no NMD, transcript is annotated as protein-coding etc). For instance, the very example you showed, probably won't be able to have a precise full-length isoform prediction by SNAF because I can not see overlapping documented transcripts with that junction on UCSC genome browser. It doesn't necessarily mean no full-length isoform exist but just the current database is limited, full-length long-read sequencing may provide more rich dataset for us to better annotate such junctions.

Best,
Frank

haoqing12 · 2024-03-05T07:20:42Z

Thank you very much for your explanation, especially the fact that stabilizing proteins are not necessary for antigen presentation, which has enlightened me. Regarding indel, I usually only chose NMD-escaped mutations as a source of neoantigens previously. Now I need to revisit this.

For SNAF, I have no more questions, and this issue can be closed.

Best regards,
haoq

haoqing12 closed this as completed Mar 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

None of ['query'] are in the columns #32

None of ['query'] are in the columns #32

haoqing12 commented Feb 27, 2024

frankligy commented Feb 27, 2024

haoqing12 commented Feb 28, 2024 •

edited

Loading

frankligy commented Feb 28, 2024 •

edited

Loading

haoqing12 commented Mar 1, 2024

frankligy commented Mar 1, 2024 •

edited

Loading

haoqing12 commented Mar 4, 2024 •

edited

Loading

frankligy commented Mar 4, 2024 •

edited

Loading

haoqing12 commented Mar 5, 2024

None of ['query'] are in the columns #32

None of ['query'] are in the columns #32

Comments

haoqing12 commented Feb 27, 2024

adding gene symbol

frankligy commented Feb 27, 2024

haoqing12 commented Feb 28, 2024 • edited Loading

frankligy commented Feb 28, 2024 • edited Loading

haoqing12 commented Mar 1, 2024

frankligy commented Mar 1, 2024 • edited Loading

haoqing12 commented Mar 4, 2024 • edited Loading

frankligy commented Mar 4, 2024 • edited Loading

haoqing12 commented Mar 5, 2024

haoqing12 commented Feb 28, 2024 •

edited

Loading

frankligy commented Feb 28, 2024 •

edited

Loading

frankligy commented Mar 1, 2024 •

edited

Loading

haoqing12 commented Mar 4, 2024 •

edited

Loading

frankligy commented Mar 4, 2024 •

edited

Loading