-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bacpop-202 Assign fallback to refs to db #49
Bacpop-202 Assign fallback to refs to db #49
Conversation
…rs and PoppunkWrapper classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the refactoring in assignClusters to break things up into sensible sized methods!
beebop/app.py
Outdated
args, | ||
name_mapping, | ||
species, | ||
redis_host, | ||
queue_kwargs), | ||
depends_on=job_network, **queue_kwargs) | ||
depends_on=Dependency([job_assign, job_network], allow_failure=True), **queue_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised you need to specify dependency on job_assign here, when job_network already depends on job_assign..? What difference does allow_failure make?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason we specify job assign as well is because we need the clusters info which is a dependency... thus if network fails then microreact will run and can still get cluster info as dependency,,,, allow_failiure means we still run if the previous jobs fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, looks great but quite confusing. But I don't think that's your fault!
I really like the refactoring in assignClusters to break things up into sensible sized methods!
I think part of my confusion is the use of the word "query" meaning sample. Particularly when combined with "previous"! But I think that's too embedded to change now!
beebop/assignClusters.py
Outdated
) -> None: | ||
""" | ||
[Updates the external clusters with the external clusters found | ||
in the new previous query clustering from assigning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the new previous query clustering" is quite a confusing concept! 😆
Is the path to the new clustering which should have been done from the full db?
And external_clusters
is the mapping to external clusters which was done in the first pass with the reference db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup correct.. haha yup i removed the word new and tried to make more sense
beebop/utils.py
Outdated
df, samples_mask = get_df_sample_mask( | ||
previous_query_clustering_file, not_found_q_names | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is going to get a dataframe of the whole file, plus a mask or view of the rows which match the not_found samples so you can easily update those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yip
beebop/assignClusters.py
Outdated
tmp_assign_subset_file = fs.partial_query_graph_tmp(p_hash) | ||
main_subset_file = fs.partial_query_graph(p_hash) | ||
with open(tmp_assign_subset_file, "r") as f: | ||
failed_lines = set(f.read().splitlines()) | ||
with open(main_subset_file, "r") as f: | ||
main_lines = set(f.read().splitlines()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, maybe do some renaming here, as "failed_lines" shouldn't be failed anymore.
Maybe "from_full_db" and "from_refs_db" or something..
beebop/assignClusters.py
Outdated
[Filter out queries that were not found in the | ||
initial external clusters file. | ||
This function filters out the queries that were not found | ||
in the initial external clusters file, | ||
deletes include files for these queries, | ||
and returns the filtered queries.] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels like a little bit of an odd hybrid for this method to do filtering on lists for use by the caller, and also to delete some files. Maybe split these parts up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
beebop/filestore.py
Outdated
@@ -227,11 +239,42 @@ def tmp(self, p_hash) -> str: | |||
os.makedirs(tmp_path, exist_ok=True) | |||
return str(tmp_path) | |||
|
|||
def output_tmp(self, p_hash) -> str: | |||
""" | |||
Generates the path to the full assign output folder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generates the path to the full assign output folder. | |
Generates the path to the full assign output folder when using full db. |
..this is just used for full db assign isn't it?
""" | ||
return str(PurePath(self.output_tmp(p_hash), f"{p_hash}_query.subset")) | ||
|
||
def external_previous_query_clustering_tmp(self, p_hash) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "previous" query clustering here? The first pass with ref db? Or is that a poppunk term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup poppunk term.. the function param is called previous_query_clustering
beebop/poppunkWrapper.py
Outdated
previous_mst=None, | ||
previous_distances=None, | ||
network_file=self.fs.network_file(self.p_hash), | ||
network_file=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry yes it is i reverted... i was just testing with a dev poppunk branch where ncik is removing
Co-authored-by: Emma Russell <[email protected]>
Co-authored-by: Emma Russell <[email protected]>
Co-authored-by: Emma Russell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got some failing tests and lint...? Were those tests failing anyway - network issue to be fixed in next PR?
Have fixed lint.. tests are failing due to network issue and are also failing in destination branch.. Yup next PR will fix network issue. also have created an diagram would be great to get a review on this too please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diagram is great, really helpful and clear - thanks for that! A couple of comments and suggestions on it:
- Beebop and beebop_py are labelled "Bebeop"!
- The text overhangs the shapes in quite a few places which makes it a little difficult to read - I think it would be nice to expand the shapes in cases where it would be easy / wouldn't mean having to rearrange the whole thing
- For "assign", I think it's worth saying that means "assign clusters"
- "has DB external_clusters.csv" => "DB has external_clusters.csv?"
- "add query & ref labels to component graph nodes" - but that doesn't get saved to a file?
- draw.io is showing a couple of collapsible areas at the bottom of the diagram. for LH and RH of microreact - I don't think you need these!
- "output_folder, {hash}._query.subset
DB_external_clusters.csv,
include{clusetN}" - "clusetN" should be "clusterN"? Also applies to the shape below. - I think the one thing that's missing overall is a description of what outputs are sent back to the front end. I guess it's mostly some portion of the files which are output, but I think it would be nice to include details. Also, how labels are applies to the graph response, modifying the file contents (and any other cases where this applies). This could perhaps be shown on the RHS where the files are listed rather than LHS where Beebop is e.g. by highlighting the files which are returned in responses - if that works better!
Lint does still seem to be grumbling! 😢
README.md
Outdated
### Miscellaneous | ||
|
||
- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes | ||
all the files created anf how they are used in each job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Miscellaneous | |
- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes | |
all the files created anf how they are used in each job | |
### Diagrams | |
- There is a .drawio graph in the `diagrams` folder illustrating the process of running a analysis. This includes | |
all the files created and how they are used in each job. You can open and view the diagram at [draw.io](https://draw.io). | |
…r_queries function
…beebop_py into bacpop-202-fallback-to-refs
updated diagram as per requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you! Couple of tiny things to change if you want to:
There still seems to be a couple of weird boxes around the bits of microreact in the diagram, though they're no longer collapsible.
"Link to visit microreact graph" - you get data as well as just a link though don't you?
|
This PR makes updates so the assign step is initially run on the ref database and if external clustering can not be found, then the full database is used to assign clusters. The network and microreact visualise step is now run on the full database as well to provide more information.
The following is done to achieve results:
Testing:
I have deployed a dev site at https://beebop-dev.dide.ic.ac.uk/. Thus, testing can be done there. I have samples that fulfil the criteria that don't assign to ref database but do for full database. Please message me, and I will send those through via Slack.
You can test with normal samples first and then add these problematic samples to project and run... you will notice it takes longer for assign step, but it will complete.. Alos check microreact to ensure populated.
Note: network is still not working, this will be in an subsequent PR....