Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bacpop-202 Assign fallback to refs to db #49

Merged
merged 29 commits into from
Dec 6, 2024

Conversation

absternator
Copy link
Contributor

@absternator absternator commented Nov 21, 2024

This PR makes updates so the assign step is initially run on the ref database and if external clustering can not be found, then the full database is used to assign clusters. The network and microreact visualise step is now run on the full database as well to provide more information.

The following is done to achieve results:

  • when fails to find samples cluster in external_clusters.csv , the run assign step in a temp directory with full database
  • copy over data from full database run onto original run in main output directory. (query_subset, include_files, external_clusters.csv)
  • delete temp directory created for assign
  • run network and microreact on full database

Testing:
I have deployed a dev site at https://beebop-dev.dide.ic.ac.uk/. Thus, testing can be done there. I have samples that fulfil the criteria that don't assign to ref database but do for full database. Please message me, and I will send those through via Slack.

You can test with normal samples first and then add these problematic samples to project and run... you will notice it takes longer for assign step, but it will complete.. Alos check microreact to ensure populated.

Note: network is still not working, this will be in an subsequent PR....

@absternator absternator marked this pull request as draft November 21, 2024 15:06
@absternator absternator marked this pull request as ready for review November 27, 2024 09:51
@absternator absternator changed the title feat: get 'working' version Bacpop-202 Assign fallback to refs to db Nov 27, 2024
Copy link
Contributor

@EmmaLRussell EmmaLRussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the refactoring in assignClusters to break things up into sensible sized methods!

beebop/app.py Outdated
args,
name_mapping,
species,
redis_host,
queue_kwargs),
depends_on=job_network, **queue_kwargs)
depends_on=Dependency([job_assign, job_network], allow_failure=True), **queue_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised you need to specify dependency on job_assign here, when job_network already depends on job_assign..? What difference does allow_failure make?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason we specify job assign as well is because we need the clusters info which is a dependency... thus if network fails then microreact will run and can still get cluster info as dependency,,,, allow_failiure means we still run if the previous jobs fails

Copy link
Contributor

@EmmaLRussell EmmaLRussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, looks great but quite confusing. But I don't think that's your fault!

I really like the refactoring in assignClusters to break things up into sensible sized methods!

I think part of my confusion is the use of the word "query" meaning sample. Particularly when combined with "previous"! But I think that's too embedded to change now!

) -> None:
"""
[Updates the external clusters with the external clusters found
in the new previous query clustering from assigning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the new previous query clustering" is quite a confusing concept! 😆
Is the path to the new clustering which should have been done from the full db?
And external_clusters is the mapping to external clusters which was done in the first pass with the reference db?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup correct.. haha yup i removed the word new and tried to make more sense

beebop/utils.py Outdated
Comment on lines 194 to 196
df, samples_mask = get_df_sample_mask(
previous_query_clustering_file, not_found_q_names
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is going to get a dataframe of the whole file, plus a mask or view of the rows which match the not_found samples so you can easily update those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yip

Comment on lines 326 to 331
tmp_assign_subset_file = fs.partial_query_graph_tmp(p_hash)
main_subset_file = fs.partial_query_graph(p_hash)
with open(tmp_assign_subset_file, "r") as f:
failed_lines = set(f.read().splitlines())
with open(main_subset_file, "r") as f:
main_lines = set(f.read().splitlines())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, maybe do some renaming here, as "failed_lines" shouldn't be failed anymore.
Maybe "from_full_db" and "from_refs_db" or something..

Comment on lines 363 to 368
[Filter out queries that were not found in the
initial external clusters file.
This function filters out the queries that were not found
in the initial external clusters file,
deletes include files for these queries,
and returns the filtered queries.]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like a little bit of an odd hybrid for this method to do filtering on lists for use by the caller, and also to delete some files. Maybe split these parts up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@@ -227,11 +239,42 @@ def tmp(self, p_hash) -> str:
os.makedirs(tmp_path, exist_ok=True)
return str(tmp_path)

def output_tmp(self, p_hash) -> str:
"""
Generates the path to the full assign output folder.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Generates the path to the full assign output folder.
Generates the path to the full assign output folder when using full db.

..this is just used for full db assign isn't it?

"""
return str(PurePath(self.output_tmp(p_hash), f"{p_hash}_query.subset"))

def external_previous_query_clustering_tmp(self, p_hash) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "previous" query clustering here? The first pass with ref db? Or is that a poppunk term?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup poppunk term.. the function param is called previous_query_clustering

previous_mst=None,
previous_distances=None,
network_file=self.fs.network_file(self.p_hash),
network_file=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry yes it is i reverted... i was just testing with a dev poppunk branch where ncik is removing

Copy link
Contributor

@EmmaLRussell EmmaLRussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got some failing tests and lint...? Were those tests failing anyway - network issue to be fixed in next PR?

@absternator
Copy link
Contributor Author

Got some failing tests and lint...? Were those tests failing anyway - network issue to be fixed in next PR?

Have fixed lint.. tests are failing due to network issue and are also failing in destination branch.. Yup next PR will fix network issue.

also have created an diagram would be great to get a review on this too please

Copy link
Contributor

@EmmaLRussell EmmaLRussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagram is great, really helpful and clear - thanks for that! A couple of comments and suggestions on it:

  • Beebop and beebop_py are labelled "Bebeop"!
  • The text overhangs the shapes in quite a few places which makes it a little difficult to read - I think it would be nice to expand the shapes in cases where it would be easy / wouldn't mean having to rearrange the whole thing
  • For "assign", I think it's worth saying that means "assign clusters"
  • "has DB external_clusters.csv" => "DB has external_clusters.csv?"
  • "add query & ref labels to component graph nodes" - but that doesn't get saved to a file?
  • draw.io is showing a couple of collapsible areas at the bottom of the diagram. for LH and RH of microreact - I don't think you need these!
  • "output_folder, {hash}._query.subset
    DB_external_clusters.csv,
    include{clusetN}" - "clusetN" should be "clusterN"? Also applies to the shape below.
  • I think the one thing that's missing overall is a description of what outputs are sent back to the front end. I guess it's mostly some portion of the files which are output, but I think it would be nice to include details. Also, how labels are applies to the graph response, modifying the file contents (and any other cases where this applies). This could perhaps be shown on the RHS where the files are listed rather than LHS where Beebop is e.g. by highlighting the files which are returned in responses - if that works better!

Lint does still seem to be grumbling! 😢

README.md Outdated
Comment on lines 111 to 114
### Miscellaneous

- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes
all the files created anf how they are used in each job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Miscellaneous
- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes
all the files created anf how they are used in each job
### Diagrams
- There is a .drawio graph in the `diagrams` folder illustrating the process of running a analysis. This includes
all the files created and how they are used in each job. You can open and view the diagram at [draw.io](https://draw.io).

@absternator
Copy link
Contributor Author

updated diagram as per requests

Copy link
Contributor

@EmmaLRussell EmmaLRussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you! Couple of tiny things to change if you want to:

There still seems to be a couple of weird boxes around the bits of microreact in the diagram, though they're no longer collapsible.

"Link to visit microreact graph" - you get data as well as just a link though don't you?

@absternator
Copy link
Contributor Author

Great, thank you! Couple of tiny things to change if you want to:

There still seems to be a couple of weird boxes around the bits of microreact in the diagram, though they're no longer collapsible.

"Link to visit microreact graph" - you get data as well as just a link though don't you?

  1. ohh yes the box is to show that whole part is done for each cluster. (I made note bigger attached to box).
  2. OHH the zip files yes I've updated that too

@absternator absternator merged commit af7603d into bacpop-186-v9-db-support Dec 6, 2024
3 of 4 checks passed
@absternator absternator deleted the bacpop-202-fallback-to-refs branch December 6, 2024 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants