feat(openchallenges): add EDAM Extract and Transform Processes #2564

mdsage1 · 2024-03-13T21:09:47Z

Description

EDAM ETL processes need to be developed to incorporate the EDAM ontology into MariaDB, linking the ontology to existing data. This PR addresses the extract and transform portions.

Related Issue

Contribute to #2524
Contribute to #2548

Fixes #2547
Fixes #2563

Changelog

Add

- Download a specified version of the EDAM ontology from https://github.com/Sage-Bionetworks/edamontology
- Transform the raw data into a Pandas dataframe that matches the content of this file
  - Start id values from 1 to mimic the behavior of SQL AUTO_INCREMENT.
- Print info and statistics about the data to stdout
  - Version of EDAM processed
  - Number of concepts transformed (overall, operation, data, etc.)
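
For readers who want to see the id convention concretely, here is a minimal sketch of starting id values at 1. The helper name `assign_ids`, the `preferred_label` column, and the sample rows are hypothetical and only for illustration; they are not taken from the PR's `main.py`.

```python
import pandas as pd


def assign_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a leading `id` column that starts at 1."""
    out = df.reset_index(drop=True).copy()
    # SQL AUTO_INCREMENT starts at 1 by default, so mirror that here
    # instead of relying on the 0-based pandas index.
    out.insert(0, "id", list(range(1, len(out) + 1)))
    return out


# Hypothetical rows standing in for the transformed EDAM concepts.
concepts = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/format_1915",
            "http://edamontology.org/operation_0004",
        ],
        "preferred_label": ["concept A", "concept B", "concept C"],
    }
)

concepts_with_ids = assign_ids(concepts)
print(concepts_with_ids["id"].tolist())  # [1, 2, 3]
```

Keeping the ids in an explicit column, rather than in the index, means they survive a later export to CSV or a database load unchanged.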

Preview

mdsage1 · 2024-03-14T18:32:18Z

@tschaffter Quality Gate doesn't seem to be performing checks for this PR.

mdsage1 · 2024-03-15T18:03:28Z

@tschaffter This is ready for review. Version is now an environment variable and the Description includes a preview.

apps/openchallenges/edam-etl/Dockerfile

apps/openchallenges/edam-etl/src/main.py

vpchung · 2024-03-15T19:31:54Z

@mdsage1 thanks for working on this!! You didn't ask me to, but I added some comments to the PR. Feel free to use them or ignore 😄

sonarcloud · 2024-03-15T21:05:27Z

Quality Gate passed for 'openchallenges-edam-etl'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

tschaffter · 2024-03-18T22:16:29Z

Fixes #2546

This PR is part of #2546, so this PR should not be configured to close this ticket. I will remove it from the list.

tschaffter

@mdsage1 ~~Why do you show the std, mean and other metrics for the id column?~~

Update: here is the information that the script should print:

Version of EDAM processed
Number of concepts that will be added to the table
- Total number of concepts
- Number of concepts for the following categories
  - Data concepts
  - Operation concepts
  - Format concepts
  - Topic concepts
  - Other concepts

tschaffter

The config parameters should be validated before using them, otherwise the following behavior will occur:

$ nx serve-detach openchallenges-edam-etl

> nx run openchallenges-edam-etl:serve-detach

 Container openchallenges-mariadb  Recreate
 Container openchallenges-mariadb  Recreated
 Container openchallenges-edam-etl  Recreate
 Container openchallenges-edam-etl  Recreated
 Container openchallenges-mariadb  Starting
 Container openchallenges-mariadb  Started
 Container openchallenges-mariadb  Waiting
 Container openchallenges-mariadb  Healthy
 Container openchallenges-edam-etl  Starting
 Container openchallenges-edam-etl  Started

 ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 >  NX   Successfully ran target serve-detach for project openchallenges-edam-etl (33s)
 
   View logs and investigate cache misses at https://cloud.nx.app/runs/Ivfcx2Zd35

vscode@dee30b82cf44:/workspaces/sage-monorepo$ docker logs openchallenges-edam-etl
EDAM Version: None
OC DB URL: jdbc:mysql://openchallenges-mariadb:3306/challenge_service
Downloading the EDAM concepts from GitHub (CSV file)...
Error downloading EDAM concepts: 404 Client Error: Not Found for url: https://github.com/edamontology/edamontology/raw/main/releases/EDAM_None.csv
Processing the EDAM concepts...
File EDAM_None.csv not found.
No data available.
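
One way to avoid the behavior shown in this log is to validate the version before building the download URL. The sketch below is an assumption-laden illustration, not the PR's implementation: the helper names, the `MAJOR.MINOR` version pattern, and the environment variable name in the usage note are hypothetical, and the URL template is copied from the 404 message above.

```python
import re
from typing import Optional

# URL template as it appears in the 404 error in the log above.
EDAM_URL_TEMPLATE = (
    "https://github.com/edamontology/edamontology/raw/main/releases/"
    "EDAM_{version}.csv"
)

# Assumption: versions look like "1.25" or "1.25.1"; adjust if needed.
VERSION_PATTERN = re.compile(r"\d+\.\d+(?:\.\d+)?")


def validate_version(raw: Optional[str]) -> str:
    """Return the cleaned version string or raise ValueError."""
    if raw is None or not raw.strip():
        raise ValueError("EDAM version is not set")
    version = raw.strip()
    # This also rejects the literal string "None" seen in the log.
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"EDAM version {version!r} is not a valid version")
    return version


def build_download_url(raw_version: Optional[str]) -> str:
    """Validate first, then build the URL, so bad input fails early."""
    return EDAM_URL_TEMPLATE.format(version=validate_version(raw_version))


print(build_download_url("1.25"))
```

In the script, the input would typically come from something like `os.getenv("VERSION")` (the variable name is a guess), and a `ValueError` would stop the run before any download is attempted.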

apps/openchallenges/edam-etl/src/main.py

mdsage1 · 2024-03-20T21:45:29Z

closed in error

tschaffter · 2024-03-21T16:23:54Z

You could get this information from the column class_id and a regex.

See the suggestion I made above.

vpchung · 2024-03-21T16:36:56Z

You could get this information from the column class_id and a regex.

@mdsage1 alternatively, you can also do a replace() to remove the substring you don't need from class_id, if you're not comfortable with regex.

EDIT: Since you're interested in the number of concepts per category, you can actually use pandas' contains to get you closer to the count 🙂 e.g.

>>> df["class_id"].str.contains("data")
0        True
1        True
2        True
3        True
4        True
        ...  
3468    False
3469    False
3470    False
3471    False
3472    False

tschaffter · 2024-03-21T16:52:03Z

Prefer exact match to using contains (more future proof): contains would not work if the ontology were to have the concept Data and DataFormat, for example.
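
To illustrate the difference, here is a small sketch comparing a substring match with an exact match on the category part of class_id. The sample values, including the `dataformat_` entry, are hypothetical, and the regex assumes class_id values end in `/<category>_<digits>`.

```python
import pandas as pd

# Hypothetical class_id values; "dataformat" is an invented future category.
class_ids = pd.Series(
    [
        "http://edamontology.org/data_0005",
        "http://edamontology.org/data_0006",
        "http://edamontology.org/dataformat_0001",
        "http://edamontology.org/operation_0004",
    ]
)

# Substring match: also counts the "dataformat" row.
substring_count = class_ids.str.contains("data").sum()

# Exact match: extract the category between the last "/" and the "_",
# then compare the whole category name.
categories = class_ids.str.extract(r"/([A-Za-z]+)_\d+$", expand=False)
exact_count = (categories == "data").sum()

print(substring_count)  # 3
print(exact_count)      # 2
```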

vpchung · 2024-03-21T16:57:34Z

if the ontology were to have the concept Data and DataFormat, for example.

Good point. Just shooting my shot here, but this can be overcome by using data_ (assuming they use "dataformat_"). Also, you can use regex with contains().

mdsage1 · 2024-03-21T18:17:32Z

@tschaffter I've updated the concept counts to use the class_id column and a regex. The match is case-insensitive to avoid future issues. I didn't use contains; instead I used the search() function from the regex module. Following @vpchung's suggestion, I added the underscore to the regex so that data, or any other concept name, is not counted as a match when another word follows the word of interest.
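
A rough sketch of the approach described here, using Python's `re.search` with a case-insensitive pattern that ends in an underscore. The helper name, the leading `/` in the pattern, and the sample values are assumptions for illustration and may differ from what `main.py` does.

```python
import re


def matches_category(class_id: str, category: str) -> bool:
    """True if class_id contains '/<category>_', ignoring case."""
    # The trailing underscore keeps "data" from matching "dataformat_...";
    # the leading slash keeps it from matching a longer name ending in "data".
    pattern = rf"/{re.escape(category)}_"
    return re.search(pattern, class_id, flags=re.IGNORECASE) is not None


# Hypothetical class_id values, including a case variant.
class_ids = [
    "http://edamontology.org/data_0005",
    "http://edamontology.org/Data_0006",
    "http://edamontology.org/dataformat_0001",
    "http://edamontology.org/topic_0003",
]

data_count = sum(matches_category(c, "data") for c in class_ids)
print(data_count)  # 2
```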

vpchung

I have one final suggestion, but otherwise, the script looks good on my end!

apps/openchallenges/edam-etl/src/main.py

mdsage1

tested the link and updated it

apps/openchallenges/edam-etl/src/main.py

vpchung · 2024-03-22T18:52:10Z

apps/openchallenges/edam-etl/src/main.py

+        return None
+
+
+def count_occurrences(identifier_pattern: str, df) -> int:


Suggested change

-def count_occurrences(identifier_pattern: str, df) -> int:
+def count_occurrences(identifier_pattern: str, df: pd.DataFrame) -> np.int64:

^^ IIRC. May need to double-check whether it is a numpy type being returned.

mdsage1 · 2024-03-22

A NumPy int is being returned, and I have made the changes.

vpchung · 2024-03-22

Cool cool. May need to add import numpy as np in this case!

tschaffter · 2024-03-23

Nice catch! I need to submit a patch for this project in #2594 and will fix this issue at the same time.
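
For reference, a sketch of what the annotated function could look like with both imports in place. Only the signature comes from the suggestion above; the function body and the sample data are assumptions, since the actual body is not shown in this thread.

```python
import numpy as np
import pandas as pd


def count_occurrences(identifier_pattern: str, df: pd.DataFrame) -> np.int64:
    """Count rows whose class_id matches identifier_pattern (hypothetical body)."""
    matches = df["class_id"].str.contains(
        identifier_pattern, case=False, regex=True
    )
    # Summing a boolean Series yields a NumPy integer rather than a Python
    # int, which is why the annotation needs `import numpy as np`.
    return matches.sum()


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/operation_0004",
            "http://edamontology.org/operation_2945",
            "http://edamontology.org/data_0005",
        ]
    }
)

operation_count = count_occurrences(r"/operation_", df)
print(operation_count, isinstance(operation_count, np.integer))
```

If a plain `int` is preferred in the signature, wrapping the result in `int(...)` avoids the NumPy dependency in the annotation.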

apps/openchallenges/edam-etl/src/main.py

mdsage1 added the sonar-scan-approved-deprecated Ready for Sonar code analysis label Mar 13, 2024

mdsage1 changed the title ~~feat(edam): Add EDAM ETL~~ feat(openchallenges): Add EDAM ETL Mar 13, 2024

mdsage1 changed the title ~~feat(openchallenges): Add EDAM ETL~~ feat(openchallenges): add EDAM ETL Mar 13, 2024

mdsage1 self-assigned this Mar 13, 2024

mdsage1 marked this pull request as ready for review March 15, 2024 15:32

mdsage1 requested review from rrchai, tschaffter and vpchung as code owners March 15, 2024 15:32

mdsage1 changed the title ~~feat(openchallenges): add EDAM ETL~~ feat(openchallenges): add EDAM Extract and Transform Processes Mar 15, 2024

vpchung reviewed Mar 15, 2024

apps/openchallenges/edam-etl/Dockerfile Outdated

vpchung reviewed Mar 15, 2024

mdsage1 requested a review from vpchung March 15, 2024 21:09

tschaffter reviewed Mar 18, 2024

tschaffter self-requested a review March 18, 2024 22:25

tschaffter requested changes Mar 18, 2024

vpchung reviewed Mar 18, 2024

apps/openchallenges/edam-etl/src/main.py Outdated

vpchung reviewed Mar 18, 2024

apps/openchallenges/edam-etl/src/main.py Outdated

tschaffter mentioned this pull request Mar 19, 2024

[Task] Start EDAM concept ID from 1 instead of 0 #2563

Closed

1 task

mdsage1 marked this pull request as draft March 20, 2024 14:43

mdsage1 closed this Mar 20, 2024

mdsage1 force-pushed the edam-etl branch from c66bf40 to 122c79d March 20, 2024 15:01

update main

21a2906

mdsage1 reopened this Mar 20, 2024

mdsage1 marked this pull request as ready for review March 21, 2024 16:14

mdsage1 marked this pull request as draft March 21, 2024 17:48

update search using regex

ddfebbc

mdsage1 marked this pull request as ready for review March 21, 2024 18:17

mdsage1 requested review from tschaffter and vpchung March 21, 2024 18:20

vpchung approved these changes Mar 21, 2024

apps/openchallenges/edam-etl/src/main.py Outdated

apps/openchallenges/edam-etl/src/main.py

mdsage1 commented Mar 21, 2024

tschaffter requested changes Mar 21, 2024

apps/openchallenges/edam-etl/src/main.py Outdated

apps/openchallenges/edam-etl/src/main.py Outdated

apply requested changes

2380f11

mdsage1 requested a review from tschaffter March 21, 2024 23:05

mdsage1 marked this pull request as draft March 22, 2024 15:54

mdsage1 added 2 commits March 22, 2024 15:56

add regex true

0be846e

add /

935bc5e

mdsage1 marked this pull request as ready for review March 22, 2024 16:02

Merge branch 'Sage-Bionetworks:main' into edam-etl

0b3de80

vpchung reviewed Mar 22, 2024

apps/openchallenges/edam-etl/src/main.py Outdated

apps/openchallenges/edam-etl/src/main.py Outdated

mdsage1 added 2 commits March 22, 2024 18:45

update requested changes

bc5dc82

remove unused module

1de7f28

vpchung reviewed Mar 22, 2024

mdsage1 added 2 commits March 22, 2024 19:23

update changes

76e837f

remove type check

a0cd52e

tschaffter approved these changes Mar 22, 2024

tschaffter merged commit 3c7933e into Sage-Bionetworks:main Mar 22, 2024
9 checks passed

mdsage1 deleted the edam-etl branch March 22, 2024 19:45
