Bacpop-202 Assign fallback to refs to db #49

absternator · 2024-11-21T15:04:47Z

This PR updates the assign step so that it first runs against the ref database; if an external cluster cannot be found for a sample, the full database is then used to assign clusters. The network and microreact visualisation steps now also run on the full database to provide more information.

The following is done to achieve this:

- When a sample's cluster cannot be found in external_clusters.csv, run the assign step in a temp directory with the full database.
- Copy the data from the full database run over to the original run in the main output directory (query_subset, include_files, external_clusters.csv).
- Delete the temp directory created for assign.
- Run network and microreact on the full database.
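
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `assign_with_fallback` and the `assign(samples, db, out_dir)` callable are hypothetical names, and the plain file copy stands in for the more careful merging of query_subset, include files and external_clusters.csv that the PR performs.

```python
import shutil
import tempfile
from pathlib import Path


def assign_with_fallback(samples, assign, ref_db, full_db, out_dir):
    """Assign with the ref db first, then retry unassigned samples with the full db.

    `assign` is a hypothetical callable: assign(samples, db, out_dir) returns a
    dict mapping each sample to its cluster, or None when no cluster was found.
    """
    out_dir = Path(out_dir)
    clusters = assign(samples, ref_db, out_dir)
    not_found = [s for s in samples if clusters.get(s) is None]
    if not not_found:
        return clusters

    # Run the fallback in a temp directory so the main output is left alone.
    tmp_dir = Path(tempfile.mkdtemp(prefix="full_assign_"))
    try:
        clusters.update(assign(not_found, full_db, tmp_dir))
        # Copy the files written by the full db run into the main output dir
        # (assumption: a plain copy; the real code merges some files instead).
        for path in tmp_dir.iterdir():
            if path.is_file():
                shutil.copy(path, out_dir / path.name)
    finally:
        # Always delete the temp directory created for the fallback run.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return clusters
```

Passing `assign` in as a callable keeps the sketch independent of PopPUNK, so the control flow can be exercised with a stub.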

Testing:
I have deployed a dev site at https://beebop-dev.dide.ic.ac.uk/, so testing can be done there. I have samples that fit the criteria: they are not assigned using the ref database but are assigned using the full database. Please message me and I will send them over via Slack.

You can test with normal samples first, then add these problematic samples to the project and run it. You will notice that the assign step takes longer, but it will complete. Also check microreact to ensure it is populated.

Note: network is still not working; this will be fixed in a subsequent PR.

EmmaLRussell · 2024-11-28T10:25:11Z

beebop/app.py

                                     args,
                                     name_mapping,
                                     species,
                                     redis_host,
                                     queue_kwargs),
-                               depends_on=job_network, **queue_kwargs)
+                               depends_on=Dependency([job_assign, job_network], allow_failure=True), **queue_kwargs)


I'm surprised you need to specify dependency on job_assign here, when job_network already depends on job_assign..? What difference does allow_failure make?

The reason we specify job_assign as well is that microreact needs the cluster info, which is a dependency. So if network fails, microreact will still run and can still get the cluster info as a dependency. allow_failure means the job still runs if a previous job fails.
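
To make the effect of `allow_failure` concrete, here is a toy model of the behaviour described in the reply. It is not RQ's scheduling code: the status names and the `can_run` helper are hypothetical, and real job queues track more states than the three used here.

```python
FINISHED, FAILED, QUEUED = "finished", "failed", "queued"
TERMINAL = {FINISHED, FAILED}


def can_run(dependency_statuses, allow_failure=False):
    """Return True when a dependent job may start.

    Without allow_failure, every dependency must have finished successfully.
    With allow_failure, every dependency only needs to have ended.
    """
    statuses = list(dependency_statuses)
    if any(status not in TERMINAL for status in statuses):
        return False  # still waiting on at least one dependency
    if allow_failure:
        return True  # all dependencies have ended, successfully or not
    return all(status == FINISHED for status in statuses)


# microreact depends on [job_assign, job_network]; network has failed.
print(can_run([FINISHED, FAILED], allow_failure=True))   # microreact still runs
print(can_run([FINISHED, FAILED], allow_failure=False))  # microreact is blocked
```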

EmmaLRussell

As discussed, looks great but quite confusing. But I don't think that's your fault!

I really like the refactoring in assignClusters to break things up into sensible sized methods!

I think part of my confusion is the use of the word "query" meaning sample. Particularly when combined with "previous"! But I think that's too embedded to change now!

beebop/assignClusters.py

EmmaLRussell · 2024-11-28T10:44:39Z

beebop/assignClusters.py

+) -> None:
+    """
+    [Updates the external clusters with the external clusters found
+    in the new previous query clustering from assigning


"the new previous query clustering" is quite a confusing concept! 😆
Is the path to the new clustering which should have been done from the full db?
And external_clusters is the mapping to external clusters which was done in the first pass with the reference db?

Yup, correct. I removed the word "new" and tried to make it make more sense.

beebop/assignClusters.py

EmmaLRussell · 2024-11-28T11:00:09Z

beebop/utils.py

+    df, samples_mask = get_df_sample_mask(
+        previous_query_clustering_file, not_found_q_names
+    )


So this is going to get a dataframe of the whole file, plus a mask or view of the rows which match the not_found samples so you can easily update those?
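
The "whole file plus a mask" idea in this question can be illustrated without pandas. The sketch below is a dependency-free analogue, not the PR's `get_df_sample_mask`: the function names and the `Taxon` and `Cluster` column names are assumptions. In pandas the mask would be a boolean Series used with `df.loc[mask, column]`.

```python
import csv
import io


def get_rows_and_mask(csv_text, sample_names):
    """Parse the whole CSV and flag the rows whose sample is in sample_names."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    wanted = set(sample_names)
    # One boolean per row, aligned with `rows` by position.
    mask = [row["Taxon"] in wanted for row in rows]
    return rows, mask


def update_masked(rows, mask, column, new_values):
    """Overwrite `column` on the masked rows only, keyed by sample name."""
    for row, selected in zip(rows, mask):
        if selected:
            row[column] = new_values[row["Taxon"]]


csv_text = "Taxon,Cluster\ns1,3\ns2,NA\ns3,NA\n"
rows, mask = get_rows_and_mask(csv_text, ["s2", "s3"])
update_masked(rows, mask, "Cluster", {"s2": "7", "s3": "9"})
```

Only the rows selected by the mask are touched, so samples that were already assigned in the first pass keep their clusters.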

EmmaLRussell · 2024-11-28T11:23:01Z

beebop/assignClusters.py

+    tmp_assign_subset_file = fs.partial_query_graph_tmp(p_hash)
+    main_subset_file = fs.partial_query_graph(p_hash)
+    with open(tmp_assign_subset_file, "r") as f:
+        failed_lines = set(f.read().splitlines())
+    with open(main_subset_file, "r") as f:
+        main_lines = set(f.read().splitlines())


As discussed, maybe do some renaming here, as "failed_lines" shouldn't be failed anymore.
Maybe "from_full_db" and "from_refs_db" or something..
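
A sketch of how the renaming might read, using the names suggested above. The diff only shows the two files being read into sets; writing their union back to the main file is an assumption made here for illustration, and `read_lines` and `merge_subset_files` are hypothetical helpers.

```python
from pathlib import Path


def read_lines(path):
    """Return the set of non-empty lines in a text file."""
    return {line for line in Path(path).read_text().splitlines() if line}


def merge_subset_files(from_full_db_file, from_refs_db_file):
    """Write the union of both files' lines back to the refs db (main) file."""
    from_full_db = read_lines(from_full_db_file)
    from_refs_db = read_lines(from_refs_db_file)
    # Sorting makes the output deterministic; sets have no stable order.
    merged = sorted(from_refs_db | from_full_db)
    Path(from_refs_db_file).write_text("\n".join(merged) + "\n")
    return merged
```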

EmmaLRussell · 2024-11-28T11:24:45Z

beebop/assignClusters.py

+    [Filter out queries that were not found in the
+        initial external clusters file.
+    This function filters out the queries that were not found
+        in the initial external clusters file,
+    deletes include files for these queries,
+        and returns the filtered queries.]


It feels like a bit of an odd hybrid for this method to do filtering on lists for use by the caller and also to delete some files. Maybe split these parts up?
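
One way the suggested split could look, as a sketch only. Both function names and their parameters are hypothetical; the point is that the filtering becomes a pure function and the file deletion becomes a separate step with its own side effect.

```python
import os


def filter_queries(query_names, not_found):
    """Pure filtering: return the queries that were found, in their original order."""
    missing = set(not_found)
    return [name for name in query_names if name not in missing]


def delete_include_files(include_paths):
    """Side effect only: delete the given include files if they exist."""
    deleted = []
    for path in include_paths:
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted
```

With this shape the caller decides when files are removed, and `filter_queries` can be unit tested without touching the file system.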

EmmaLRussell · 2024-11-28T11:26:25Z

beebop/filestore.py

@@ -227,11 +239,42 @@ def tmp(self, p_hash) -> str:
        os.makedirs(tmp_path, exist_ok=True)
        return str(tmp_path)

+    def output_tmp(self, p_hash) -> str:
+        """
+        Generates the path to the full assign output folder.


Suggested change

-        Generates the path to the full assign output folder.
+        Generates the path to the full assign output folder when using full db.

..this is just used for full db assign isn't it?
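
As a rough illustration of how a temp output path for the full db run can sit alongside the main output path, consider the sketch below. The class is a hypothetical stand-in, not beebop's file store: the `tmp` subfolder name and the `output` helper are assumptions, while the `{p_hash}_query.subset` file name comes from the diff.

```python
from pathlib import PurePath


class OutputPaths:
    """Hypothetical stand-in for the file store's path helpers."""

    def __init__(self, output_root):
        self.output_root = output_root

    def output(self, p_hash):
        # Main output folder for a project hash.
        return str(PurePath(self.output_root, p_hash))

    def output_tmp(self, p_hash):
        # Assumed layout: the full db assign writes to a "tmp" subfolder.
        return str(PurePath(self.output(p_hash), "tmp"))

    def partial_query_graph(self, p_hash):
        # Subset file produced by the main (ref db) run.
        return str(PurePath(self.output(p_hash), f"{p_hash}_query.subset"))

    def partial_query_graph_tmp(self, p_hash):
        # Same file name as the main run, but inside the temp folder.
        return str(PurePath(self.output_tmp(p_hash), f"{p_hash}_query.subset"))
```

Keeping the file names identical in both folders means copying results back is a matter of swapping the parent directory.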

EmmaLRussell · 2024-11-28T11:27:17Z

beebop/filestore.py

+        """
+        return str(PurePath(self.output_tmp(p_hash), f"{p_hash}_query.subset"))
+
+    def external_previous_query_clustering_tmp(self, p_hash) -> str:


What is "previous" query clustering here? The first pass with ref db? Or is that a poppunk term?

Yup, it's a PopPUNK term. The function param is called previous_query_clustering.

EmmaLRussell · 2024-11-28T11:28:21Z

beebop/poppunkWrapper.py

            previous_mst=None,
            previous_distances=None,
-            network_file=self.fs.network_file(self.p_hash),
+            network_file=None,


Is this not needed?

Sorry, yes it is; I have reverted this. I was just testing with a dev PopPUNK branch where Nick is removing it.

EmmaLRussell

Got some failing tests and lint...? Were those tests failing anyway - network issue to be fixed in next PR?

absternator · 2024-12-02T20:11:47Z

> Got some failing tests and lint...? Were those tests failing anyway - network issue to be fixed in next PR?

I have fixed lint. The tests are failing due to the network issue and are also failing in the destination branch. Yup, the next PR will fix the network issue.

I have also created a diagram; it would be great to get a review on this too, please.

EmmaLRussell

The diagram is great, really helpful and clear - thanks for that! A couple of comments and suggestions on it:

- Beebop and beebop_py are labelled "Bebeop"!
- The text overhangs the shapes in quite a few places, which makes it a little difficult to read. I think it would be nice to expand the shapes in cases where it would be easy and wouldn't mean having to rearrange the whole thing.
- For "assign", I think it's worth saying that means "assign clusters".
- "has DB external_clusters.csv" => "DB has external_clusters.csv?"
- "add query & ref labels to component graph nodes" - but that doesn't get saved to a file?
- draw.io is showing a couple of collapsible areas at the bottom of the diagram, for the LH and RH of microreact - I don't think you need these!
- "output_folder, {hash}._query.subset DB_external_clusters.csv, include{clusetN}" - "clusetN" should be "clusterN"? This also applies to the shape below.
- I think the one thing that's missing overall is a description of what outputs are sent back to the front end. I guess it's mostly some portion of the files which are output, but I think it would be nice to include details. Also, how labels are applied to the graph response, modifying the file contents (and any other cases where this applies). This could perhaps be shown on the RHS where the files are listed rather than the LHS where Beebop is, e.g. by highlighting the files which are returned in responses - if that works better!

Lint does still seem to be grumbling! 😢

EmmaLRussell · 2024-12-03T10:56:10Z

README.md

+### Miscellaneous
+
+- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes
+all the files created anf how they are used in each job


Suggested change

-### Miscellaneous
-
-- There is a .drawio graph in graphs folder illustrating the proccess of running a analysis. This includes
-all the files created anf how they are used in each job
+### Diagrams
+
+- There is a .drawio graph in the `diagrams` folder illustrating the process of running a analysis. This includes
+all the files created and how they are used in each job. You can open and view the diagram at [draw.io](https://draw.io).

absternator · 2024-12-05T13:42:31Z

updated diagram as per requests

EmmaLRussell

Great, thank you! Couple of tiny things to change if you want to:

There still seems to be a couple of weird boxes around the bits of microreact in the diagram, though they're no longer collapsible.

"Link to visit microreact graph" - you get data as well as just a link though don't you?

absternator · 2024-12-06T15:03:52Z

> Great, thank you! Couple of tiny things to change if you want to:
>
> There still seems to be a couple of weird boxes around the bits of microreact in the diagram, though they're no longer collapsible.
>
> "Link to visit microreact graph" - you get data as well as just a link though don't you?

Ohh yes, the box is there to show that the whole part is done for each cluster (I made the note attached to the box bigger).
Ohh, the zip files: yes, I've updated that too.

absternator added 3 commits November 21, 2024 10:27

feat: get 'working' version

014ec0b

feat: get working version cleaned

6dc261a

fix: remove redundant deletion of full_assign directory in get_clusters function

4396446

absternator marked this pull request as draft November 21, 2024 15:06

absternator added 7 commits November 22, 2024 09:00

feat: get refactored version working

569c289

feat: use pandas for data manipulation

d764af7

feat: remove bad include files.txt

7b1ae88

feat: add dataclass config to reduce params

5a4a806

feat: enhance documentation and type hints in file handling functions

ec2dbce

pycode style fixes

d0ea736

tests: get all unit tests up for ref fallback

3fcf23f

absternator marked this pull request as ready for review November 27, 2024 09:51

absternator changed the title ~~feat: get 'working' version~~ Bacpop-202 Assign fallback to refs to db Nov 27, 2024

absternator requested a review from EmmaLRussell November 27, 2024 10:01

absternator added 8 commits November 27, 2024 10:08

fix doc blocs

53f3e32

fix: doc blocs again

38cde8a

fix: improve docstring formatting in update_external_clusters_csv function

4b4dea4

fix: correct docstring formatting in update_external_clusters_csv function

ed29fd3

fix: correct parameter docstring formatting in update_external_clusters_csv function

8398fec

fix: correct parameter and return docstring formatting across multiple functions

a78e25b

fix: correct parameter docstring formatting in update_external_clusters and PoppunkWrapper classes

d458f8b

feat: updates for new netowrk code

2b237f3

EmmaLRussell reviewed Nov 28, 2024

absternator and others added 5 commits November 29, 2024 14:09

update conflict logic for include files

5eec95a

Update beebop/assignClusters.py

59117ca

Co-authored-by: Emma Russell <[email protected]>

Update beebop/assignClusters.py

be0e070

Co-authored-by: Emma Russell <[email protected]>

Update beebop/assignClusters.py

2b11e2d

Co-authored-by: Emma Russell <[email protected]>

pr changes for cleanup

0480f18

absternator requested a review from EmmaLRussell November 29, 2024 15:22

EmmaLRussell reviewed Dec 2, 2024

absternator added 2 commits December 2, 2024 19:55

fix lint

b6e8555

Add miscellaneous section to README with analysis process diagram

1214b41

EmmaLRussell reviewed Dec 3, 2024

absternator added 3 commits December 5, 2024 10:47

Update README diagrams section and remove unused parameter from filter_queries function

972c302

Merge branch 'bacpop-186-v9-db-support' of https://github.com/bacpop/beebop_py into bacpop-202-fallback-to-refs

2dd4194

update diagram as per pr comments

9c596de

absternator requested a review from EmmaLRussell December 5, 2024 13:42

EmmaLRussell approved these changes Dec 6, 2024

update drawio

3e9d0d9

absternator merged commit af7603d into bacpop-186-v9-db-support Dec 6, 2024
3 of 4 checks passed

absternator deleted the bacpop-202-fallback-to-refs branch December 6, 2024 15:04
