Gremlin crowdsource #121
base: master
Conversation
sara-02 commented Nov 13, 2017 (edited by pkajaba)
- Depends on Apollo module #118 to be merged first.
- Depends on Python3 compatible code #45 to be merged first.
For user story openshiftio/openshift.io#1286
@sara-02 Your image is available in the registry:
1 similar comment
@sara-02 Your image is available in the registry:
Some minor nitpicks, but LGTM otherwise.
input_package_topic_data_store,
output_package_topic_data_store,
additional_path)
untagged_pakcage_data = TagListPruner.clean_file(package_file_name,
Typo. You may want to fix this and all subsequent occurrences.
result_package_topic_json = []
untagged_pakcage_data = {}
typo again
# TODO: use singleton object, with updated package_topic_list
if ecosystem in untagged_pakcage_data.keys():
    current_untagged_set = set(untagged_pakcage_data[ecosystem])
    new_untagged_set = current_untagged_list.union(
Did you intend to use current_untagged_set instead of the list?
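For illustration, the fix the reviewer is pointing at would look roughly like this (a sketch only; incoming_packages is a placeholder, since the argument to union is truncated in the diff):

current_untagged_set = set(untagged_pakcage_data[ecosystem])
# union from the set built on the previous line, not from a stale list
new_untagged_set = current_untagged_set.union(incoming_packages)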
@sara-02 Your image is available in the registry:
LGTM. Will wait for @pkajaba approval.
@sara-02 Your image is available in the registry:
1 similar comment
@sara-02 Your image is available in the registry:
@sara-02 Please rebase instead of a merge commit.
@@ -12,6 +12,8 @@ AWS_BUCKET_NAME = os.environ.get("AWS_BUCKET_NAME","dev-stack-analysis-clean-dat
KRONOS_SCORING_REGION = os.environ.get("KRONOS_SCORING_REGION", "")
KRONOS_MODEL_PATH = os.environ.get("KRONOS_MODEL_PATH", KRONOS_SCORING_REGION + "/github/")
DEPLOYMENT_PREFIX = os.environ.get("DEPLOYMENT_PREFIX", "")

GREMLIN_REST_URL = "http://{host}:{port}".format(
I have one question not strictly related to this PR, but why do you have a template instead of just having config.py?
So that the credentials don't get committed by mistake if someone changes their config.py. config.py is in .gitignore.
Oh, I can see your point now, but you are sourcing all those values from environment variables.
Not always; sometimes I write them directly in the config. It is easier that way, as they don't change over the testing period.
I would go with an approach where every developer has a script in which those secrets are stored. This script would not be in the repo (it could be in .gitignore) and would basically export the stored secrets:
#!/bin/bash
export TOP_SECRET1="foo_bar"
export TOP_SECRET2="foo_bar"
.
.
.
export TOP_SECRET_N="foo_bar"
./run_actual_code.py
It's another extra script, but I find it clearer than copying configs every time.
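As a sketch of the workflow described above (the variable names follow the env vars already visible in the diff; the file itself is illustrative, not the repository's actual config.py): the gitignored script exports the secrets, and config.py only reads environment variables, so nothing sensitive is committed.

# config.py -- illustrative sketch only
import os

# Populated by the developer's gitignored export script, the Dockerfile,
# or the deployment environment.
AWS_BUCKET_NAME = os.environ.get("AWS_BUCKET_NAME", "")
DEPLOYMENT_PREFIX = os.environ.get("DEPLOYMENT_PREFIX", "")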
@sara-02 Can you elaborate more?
@sara-02 ^^
How will it get the configs inside the Dockerfile?
https://github.com/fabric8-analytics/fabric8-analytics-stack-analysis/blob/master/Dockerfile#L24
through environment variables.
OK, will add a separate PR for that.
"""Generate the clean aggregated package_topic list as required by Gnosis. | ||
|
||
:param input_package_topic_data_store: The Data store to pick the package_topic files from. | ||
:param output_package_topic_data_store: The Data store to save the clean package_topic to. | ||
:param additional_path: The directory to pick the package_topic files from.""" | ||
|
||
if mode == "test": |
Why do you need this mode in the first place? You should rename it to data_path; then you don't need this if.
Because the value of data_path is different when running the test cases; that is why mode is needed.
And my point here is that you don't need to solve this through conditions. You can set this value once. This will work because you are not running tests and real code in the same instance, are you?
So basically pass the APOLLO_PATH instead of the mode value?
Pretty much, or you can have it as a class variable, for example.
@pkajaba Ack, thanks. I have updated to use a temp_path variable instead of the mode setting.
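As an illustration of the agreed change (the function name, signature, and default value are assumed, not copied from the final code): the caller passes the data path directly and the test-only branch disappears.

APOLLO_TEMP_PATH = "/tmp/apollo/"  # assumed default, not the repository's actual value

def prune_tag_list(input_package_topic_data_store,
                   output_package_topic_data_store,
                   additional_path,
                   temp_path=APOLLO_TEMP_PATH):
    # Tests pass their own temp_path; production code relies on the default,
    # so no `if mode == "test"` branch is needed.
    ...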
@sara-02 Your image is available in the registry:
1 similar comment
@sara-02 Your image is available in the registry:
def __init__(self, src_dir):
    self.src_dir = src_dir
    # ensure path ends with a forward slash
    self.src_dir = self.src_dir if self.src_dir.endswith("/") else self.src_dir + "/"
    self.src_dir = self.src_dir if self.src_dir.endswith(
You don't need this if you use os.path.join everywhere src_dir is used.
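For illustration, the os.path.join approach removes the need for the trailing-slash normalization. A minimal sketch (the class name follows the LocalFileSystem class mentioned elsewhere in this PR, but the body here is only illustrative):

import json
import os

class LocalFileSystem:
    def __init__(self, src_dir):
        # no trailing-slash handling required
        self.src_dir = src_dir

    def read_json_file(self, file_name):
        # os.path.join inserts exactly one separator regardless of how src_dir was written
        with open(os.path.join(self.src_dir, file_name)) as fp:
            return json.load(fp)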
Currently, removing this will cause other test cases to fail; it requires a module-wide fix to use os.path.join. I will create an issue and work on a PR to fix this for all files.
I already fixed it for all files that were present in the source code when I did my Python3 changes.
The "apollo" module should be the only problem, ergo not too hard to fix.
def test_gremlin_updater_generate_payload(self):
    expected_pay_load = {
        'gremlin':
            "g.V().has('ecosystem', 'ruby')." +
You don't need the + sign here if you use Python's implicit concatenation of adjacent string literals.
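For example (the second fragment is a placeholder, not the PR's actual query):

query = ("g.V().has('ecosystem', 'ruby')."
         "valueMap()")
# equivalent to joining with "+", but the literals are concatenated at parse time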
@sara-02 would you kindly rebase? :-)
@sara-02 Your image is available in the registry:
@pkajaba PTAL again.
@sara-02 would you kindly take a look?
file_list = local_data_obj.list_files()
for file_name in file_list:
    data = local_data_obj.read_json_file(file_name)
    # TODO: use a singleton object with updated datafile.
Can you implement this as a singleton? Anyway, I don't really like that you are initializing an instance of the object inside the same class as the initialized object. Is it some design pattern, or what is the reason behind it?
Ack
for each_ecosystem in self.untagged_data:
    package_list = self.untagged_data[each_ecosystem]
    pck_len = len(package_list)
    if pck_len == 0:
I would delete this if because it is not required in the context of this method.
Ack.
        # If pck_len =0 then, no package of that ecosystem requires
        # tags. Hence, do nothing.
        continue
    for index in range(0, pck_len, 100):
Is 100 really the value you want to have here? It means that only every 100th element of the range will be used.
Yes, this could be passed as a parameter, but that is what is intended, as we want to break the list into chunks of 100 packages each.
ack
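For reference, the chunking idiom under discussion: range() with a step of 100, combined with slicing, yields consecutive sub-lists of at most 100 packages. A standalone illustration with toy data:

package_list = ["pkg{}".format(i) for i in range(250)]  # toy data
chunks = [package_list[i:i + 100] for i in range(0, len(package_list), 100)]
assert [len(c) for c in chunks] == [100, 100, 50]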
        continue
    for index in range(0, pck_len, 100):
        sub_package_list = package_list[index:index + 100]
        pay_load = self.generate_payload(
Why do you have an underscore here?
I don't see any dangling underscore :/
a payload is the correct word, right?
Ack
            each_ecosystem, sub_package_list)
        self.execute_gremlin_dsl(pay_load)

def generate_payload(self, ecosystem, package_list):
This method can be a classmethod or a staticmethod.
Ack
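A sketch of what the static-method version might look like (the class name and query template are assumed, not the PR's exact code):

class GraphUpdater:
    @staticmethod
    def generate_payload(ecosystem, package_list):
        # no instance state is needed, so `self` can be dropped
        return {
            'gremlin': "g.V().has('ecosystem', eco).has('name', within(str_packages))",
            'bindings': {'eco': ecosystem, 'str_packages': package_list},
        }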
    self.untagged_data = untagged_data

@classmethod
def generate_and_update_packages(cls, apollo_temp_path):
You have some tests but you are not testing this method. Any specific reason?
@pkajaba This method talks to the graph, so every time we test it we need a working instance of gremlin-http up and running.
So I would advise mocking the HTTP response here to emulate the behavior of the graph DB.
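A minimal sketch of that suggestion, assuming the updater calls requests.post directly (class names, patch target, and the response shape are assumptions, not the repository's actual code):

from unittest import TestCase
from unittest.mock import patch

class TestGraphUpdaterMocked(TestCase):
    @patch("requests.post")
    def test_update_graph_without_live_gremlin(self, mock_post):
        # emulate a successful Gremlin HTTP response so no graph DB is needed
        mock_post.return_value.status_code = 200
        mock_post.return_value.json.return_value = {"result": {"data": []}}
        # the real test would invoke the updater here, e.g.
        #   GraphUpdater(untagged_data).update_graph()
        # and then assert on mock_post.call_args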
    graph_obj.update_graph()
    local_data_obj.remove_json_file(file_name)

def update_graph(self):
The same comment about testing applies here.
Same reason.
    'str_packages': ['service_identity']}}

unknown_data_obj = LocalFileSystem(APOLLO_TEMP_TEST_DATA)
self.assertTrue(unknown_data_obj is not None)
assert unknown_data_obj would be enough here.
Ack
# IMPORTANT: TestGraphUpdater needs to run after TestTagListPruner

# Test class TestGraphUpdater(TestCase):
def test_gremlin_updater_generate_payload(self):
What is this method supposed to test? You don't really have to read the package list from the file system; just create a fixture of the package list and test whether the query is created correctly.
Ack, I don't need to load the list, but generation of the list needs to be checked, so I am adding it to the previous test instead of this one.
Yeah, you can split it, but if generation of the list really has to be tested, it should be extracted into a function.
@sara-02 ^^
@sara-02 ^^ :-)
I am testing extraction here itself and then deleting the files. If not, it will create a dependency between tests, as we would have to make sure that the extraction always gets tested after generation if they are two separate tests.
This function only checks for payload now.
class TestPruneAndUpdate(TestCase):

    # Test Class TagListPruner
    def test_generate_and_save_pruned_list_local(self):
I am struggling to see which method this test is testing. Can you elaborate?
The input list contains more than 4 tags; the prune method generates a tag list of up to 4 tags based on frequency. So this test checks whether the desired 4 tags are generated or not.
But which method does it test in the code? I can't find a test_generate_and_save_pruned_list_local method in the repository. Unit tests should test methods and their behavior on various inputs.
@sara-02 ^^
@sara-02 Your image is available in the registry:
@pkajaba PTAL, I think all major concerns have been addressed. Two things that need a separate PR include the