NCI60 - Several incorrect values in chem_name #236

Open
jjacobson95 opened this issue Oct 21, 2024 · 7 comments
Labels: data update, invalid (This doesn't seem right)

Comments

@jjacobson95
Collaborator

The validation script is reporting several errors in the chem_name column of the NCI60 drugs file.
(These values currently end up in broad_sanger_drugs.tsv.)

[ERROR] [/tmp/nci60_drugs.tsv/2089] 300000000.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/189407] 200000000.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/196988] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/272367] 0.0 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/519798] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/nci60_drugs.tsv/520016] 138061 is not of type 'string', 'null' in /chem_name
@jjacobson95 added the invalid and data update labels on Oct 21, 2024
@sgosline
Member

Do we know what drugs are causing this?
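One way to approach this question is to list every row whose chem_name parses as a number, along with its drug identifier. The following is only a minimal sketch: it assumes a tab-separated drugs file with a chem_name column, and the improve_drug_id column name and the sample rows are hypothetical, made up for illustration.

```python
import csv
import io


def numeric_chem_names(tsv_text, name_col="chem_name"):
    """Return (line_number, row) pairs whose name_col value parses as a number."""
    hits = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        value = (row.get(name_col) or "").strip()
        if not value:
            continue
        try:
            # Note: float() also accepts strings such as "nan" and "inf".
            float(value)
        except ValueError:
            continue  # a normal, non-numeric name
        hits.append((line_no, row))
    return hits


# Hypothetical sample rows; identifiers and column names are assumptions.
sample = (
    "improve_drug_id\tchem_name\n"
    "SMI_1\taspirin\n"
    "SMI_2\t4e+96\n"
    "SMI_3\t5052\n"
)

for line_no, row in numeric_chem_names(sample):
    print(line_no, row["improve_drug_id"], row["chem_name"])
```

Reading every cell as text, as csv does here, avoids any type inference, so the values are reported exactly as they are stored in the file.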

@jjacobson95
Collaborator Author

jjacobson95 commented Oct 21, 2024

Of this short list (the six errors above are all that the validation script reported), only two values could be found through this simple search.

Both mapped to the same drug: SMI_54937, PubChem CID 581, which can be found on PubChem. The values are not listed as identifiers there.

[Screenshot 2024-10-21 at 9:22:12 AM]

@sgosline
Member

Here is the REST call for that compound (according to the pubchem_retrieval.py script): https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/581/synonyms/JSON
I don't see those coming up as chem_names.

I think these are somehow missed by the pubchem call and instead get added as NSC identifiers, but without the 'NSC' prefix.

msmi = smiles.filter(pl.col('NSC').is_in(missing))
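The hypothesis above could be checked offline against a saved synonyms response. This is a rough sketch, not the logic in pubchem_retrieval.py: the helper names are hypothetical, the prefix variants tried are a guess, and the payload below is fabricated to mimic the shape of a PUG REST synonyms response (it is not the real response for CID 581).

```python
def extract_synonyms(payload):
    """Flatten the Synonym lists from a parsed PUG REST synonyms response."""
    entries = payload.get("InformationList", {}).get("Information", [])
    return [name for entry in entries for name in entry.get("Synonym", [])]


def match_values(synonyms, values):
    """For each value, list the bare or NSC-prefixed forms found among the synonyms."""
    lowered = {s.lower() for s in synonyms}
    result = {}
    for value in values:
        # Assumed spellings of an NSC identifier; adjust as needed.
        candidates = [value, f"NSC{value}", f"NSC {value}", f"NSC-{value}"]
        result[value] = [c for c in candidates if c.lower() in lowered]
    return result


# Fabricated payload for illustration only.
payload = {
    "InformationList": {
        "Information": [{"CID": 0, "Synonym": ["example-name", "NSC 5052"]}]
    }
}

matches = match_values(extract_synonyms(payload), ["5052", "138061"])
print(matches)
```

A value that matches only in a prefixed form would be consistent with the prefix being dropped somewhere in the build, while a value with no match at all would point elsewhere.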

@jjacobson95
Collaborator Author

jjacobson95 commented Oct 29, 2024

I have been working on the schema checker, and the previous errors were truncated. This issue also appears in the other datasets that use these drugs. This may help with tracking down the issue.

[ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50270] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50305] 138061 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/gdscv1_drugs.tsv/5103] 4e+96 is not of type 'string', 'null' in /chem_name
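To see which values recur across datasets, the validator output can be grouped by value. A small sketch, assuming the log line format matches the lines quoted in this thread; the function name and regular expression are illustrative and not part of the schema checker.

```python
import re
from collections import defaultdict

# Matches lines like:
# [ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
LINE_RE = re.compile(
    r"\[ERROR\]\s*\[(?P<path>.+)/(?P<row>\d+)\]\s+(?P<value>\S+) is not of type"
)


def files_by_value(log_text):
    """Map each offending value to the sorted list of files it appears in."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            filename = match.group("path").rsplit("/", 1)[-1]
            seen[match.group("value")].add(filename)
    return {value: sorted(files) for value, files in seen.items()}


# Lines copied from the errors quoted in this thread.
log = """\
[ERROR] [/tmp/nci60_drugs.tsv/196988] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/19740] 4e+96 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/prism_drugs.tsv/50270] 5052 is not of type 'string', 'null' in /chem_name
[ERROR] [/tmp/gdscv1_drugs.tsv/5103] 4e+96 is not of type 'string', 'null' in /chem_name
"""

for value, files in files_by_value(log).items():
    print(value, files)
```

Values that show up in several files likely trace back to a shared drug record rather than to a dataset-specific step.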

@sgosline
Member

Was this resolved in #237?

@jjacobson95
Collaborator Author

jjacobson95 commented Nov 12, 2024 via email

@jjacobson95 self-assigned this on Nov 13, 2024
@jjacobson95
Collaborator Author

I will look into where this is occurring during the pubchem / build process.

@sgosline moved this to In progress in CoderData on Nov 22, 2024