Skip to content

Commit

Permalink
Mike's code review comments
Browse files Browse the repository at this point in the history
  • Loading branch information
mernst committed Sep 12, 2023
1 parent 62a90de commit 61cc17f
Show file tree
Hide file tree
Showing 14 changed files with 67 additions and 3 deletions.
21 changes: 21 additions & 0 deletions src/python/cache_utils.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,11 @@
#!/usr/bin/env python3
""" Contains all the functions related to the caches.
TODO: Elsewhere, the code speaks of "the cache" (example: get_cache_lock), but here it speaks of 4 caches. I'm confused about the relationship.
TODO: Are there 4 caches overall, or 4 caches per repository, or something else? Please clarify.
TODO: "after running the script": Which script?
There will be 4 caches which are stored on disk after running the script:
TODO: Is the cache directory literally named "cache"? Where is it on disk?
TODO: Are all the caches JSON files?
1) cache/sha_cache_entry: A cache that maps the commit hash to a sha256 hash of the repository.
2) cache/test_cache: A cache that maps a sha256 to test results.
3) cache/merge_results:A cache that maps a merge to the result
Expand All @@ -17,6 +22,9 @@
CACHE_BACKOFF_TIME = 2 * 60 # 2 minutes, in seconds
TIMEOUT = 90 * 60 # 90 minutes, in seconds

# TODO: In this file, please place the externally-visible or exported functions
# first, and then a comment marker of some kind, and then the "private" ones.


def slug_repo_name(repo_slug: str) -> str:
"""Given a GitHub repository slug (owner/reponame), returns the reponame.
Expand All @@ -28,6 +36,9 @@ def slug_repo_name(repo_slug: str) -> str:
return repo_slug.split("/")[1]


# TODO: Throughout, please use "_directory" instead of "_prefix" in
# variable names, when the variable actually holds a directory. I
# feel this will be clearer.
def get_cache_lock(repo_slug: str, cache_prefix: Path):
"""Returns a lock for the cache of a repository.
Initially the lock is unlocked; the caller must explictly
Expand Down Expand Up @@ -74,6 +85,10 @@ def is_in_cache(


def load_cache(repo_slug: str, cache_prefix: Path) -> dict:
# TODO: "the cache": which one? Also, does this load a cache or a cache
# entry? Please define those terms to help the reader. This method returns
# a cache, which is a dict. `cache_lookup` returns a cache entry, which is
# also a dict. What are they dicts from and to?
"""Loads the cache.
Args:
repo_slug (str): The slug of the repository, which is "owner/reponame".
Expand Down Expand Up @@ -104,6 +119,10 @@ def cache_lookup(
return cache[cache_key]


# TODO: The distinction between write_to_cache and set_in_cache is not clear to
# me. If it is only locking, then please give them the same name, but with one
# suffixed by "_without_lock". Also please indicate under what circumstances
# each should be called.
def write_to_cache(
cache_key: Union[Tuple, str],
cache_value: Union[str, dict, None],
Expand Down Expand Up @@ -156,6 +175,8 @@ def set_in_cache(
lock.release()


# TODO: I'm not clear what is the difference between `cache_lookup` and
# `lookup_in_cache`. Please clarify, and indicate when each should be called.
def lookup_in_cache(
cache_key: Union[Tuple, str],
repo_slug: str,
Expand Down
4 changes: 4 additions & 0 deletions src/python/clean_cache_placeholders.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
#!/usr/bin/env python3
# TODO: This file uses "clean", "remove", and "delete". Please choose one term
# and stick to it, in both the code and in the file name.
"""
Deletes all placeholders from the cache. Placeholders are created when a
a process starts; it indicates that is has started and is still running.
Expand All @@ -9,6 +11,8 @@
python clean_cache_placeholders.py --cache_path <path_to_cache>
Args:
TODO: Throughout, please don't use "_path" as a variable suffix, because it is
# not sufficiently precise. Instead, use either "_file" or "_directory".
cache_path (str): Path to cache directory
"""

Expand Down
4 changes: 4 additions & 0 deletions src/python/latex_output.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
- repos_head_passes_csv: csv file containing the list of valid repositories
- tested_merges_path: path to the folder containing the merge results
- merges_path: path to the folder containing all found merges.
TODO: Throughout, be consistent about "directory" vs "folder".
- output_dir: path to the folder where the LaTeX files will be saved
"""

Expand Down Expand Up @@ -92,6 +93,9 @@ def latex_def(name, value) -> str:
],
}

# TODO: I suggest that these should be declared in repo.py, or move the enums to
# a different file where this can also be defined. Or, define predicates for
# clients to use rather than exporting these lists.
MERGE_CORRECT_NAMES = [
TEST_STATE.Tests_passed.name,
]
Expand Down
11 changes: 9 additions & 2 deletions src/python/merge_differ.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@
--merges_path <merges_path>
--cache_dir <cache_dir>
This script compares the results of two merge tools on the same merge.
It outputs the diff any two merge results if their results differ.
# TODO: Where is the output? Standard out?
It outputs the diff of the two merge results if their results differ.
# TODO: Does "ignores" mean there is no output, or is it something deeper?
It ignores merges that have the same result or merges that are not successful.
"""

Expand Down Expand Up @@ -78,18 +80,21 @@ def get_merge_fingerprint(


def merge_differ(args: Tuple[pd.Series, Path]) -> None:
"""Diffs the results of any two merge tools on the same merge.
# TODO: What does this funtion do with the diff? I think it writes to a file? Or does it populate a cache too?
"""Diffs the results of every two merge tools on the same merge.
Does not diff merges that have the same result or merges that are not successful.
Args:
args (Tuple[pd.Series,Path]): A tuple containing the merge data and the cache prefix.
Returns:
TODO: How is "The result of the test." related to the diff of two merge tools? I don't see this function returning anything.
dict: The result of the test.
"""
merge_data, cache_prefix = args

if not merge_data["parents pass"]:
return

# TODO: See other comment about "_prefix" in variable names; change throughout.
diff_file_prefix = cache_prefix / "merge_diffs"
diff_file_prefix.mkdir(parents=True, exist_ok=True)

Expand Down Expand Up @@ -131,6 +136,7 @@ def diff_file_name(sha1: str, sha2: str) -> Path:
Path: The name of the diff file.
"""
if sha1 < sha2:
# TODO: Why does this use ".txt" rather than ".diff" as the file extension?
return Path(sha1 + "_" + sha2 + ".txt")
return Path(sha2 + "_" + sha1 + ".txt")

Expand All @@ -143,6 +149,7 @@ def diff_file_name(sha1: str, sha2: str) -> Path:
parser.add_argument("--cache_dir", type=Path, default="cache/")
args = parser.parse_args()
cache_dir = Path(args.cache_dir)
# TODO: See comment elsewhere about "_path" in variable names.
cache_diffs_path = cache_dir / "merge_diffs"
cache_diffs_path.mkdir(parents=True, exist_ok=True)

Expand Down
4 changes: 4 additions & 0 deletions src/python/merge_tester.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@ def merge_tester(args: Tuple[str, pd.Series, Path]) -> pd.Series:
print("merge_tester: Started ", repo_slug, merge_data["left"], merge_data["right"])

merge_data["parents pass"] = False
# TODO: It's not clear to me whether "left" and "right" are keys for branches, or shas, or something else. How about improving the key name? (We can change the find-merge-conflicts program if it is the source of the problem.)
for branch in ["left", "right"]:
repo = Repository(repo_slug, cache_prefix=cache_prefix)
repo.checkout(merge_data[branch])
Expand All @@ -73,12 +74,14 @@ def merge_tester(args: Tuple[str, pd.Series, Path]) -> pd.Series:
merge_data[f"{branch} test result"] = test_result.name
if test_result != TEST_STATE.Tests_passed:
return merge_data
# TODO: It looks suspicious to me that the repo is only sometimes, not always, deleted.
del repo

merge_data["parents pass"] = True

for merge_tool in MERGE_TOOL:
if is_merge_sucess(merge_data[merge_tool.name]):
# TODO: I suggest abstracting the body of this loop into a separate function, since it stands on its own logically.
repo = Repository(repo_slug, cache_prefix=cache_prefix)
(
result,
Expand Down Expand Up @@ -126,6 +129,7 @@ def main(): # pylint: disable=too-many-locals,too-many-statements

repos = pd.read_csv(args.repos_head_passes_csv, index_col="idx")

# TODO: I suggest abstracting this loop into a separate function (through printing "number of merges to test").
print("merge_tester: Started listing merges to test")
merge_tester_arguments = []
for _, repository_data in tqdm(repos.iterrows(), total=len(repos)):
Expand Down
5 changes: 5 additions & 0 deletions src/python/merge_tools_comparator.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ def merger( # pylint: disable=too-many-locals
"""
repo_slug, merge_data, cache_prefix = args

# TODO: I think this is a key, not an entry. Please clarify.
# TODO: Why are these two SHAs (if they are SHAs) not sorted as is done elsewhere?
cache_entry = merge_data["left"] + "_" + merge_data["right"]
merge_cache_prefix = cache_prefix / "merge_results"

Expand All @@ -73,6 +75,7 @@ def merger( # pylint: disable=too-many-locals
)
cache_data[merge_tool.name] = {"results": [], "log_files": [], "run_time": []}
for i in range(N_MERGES):
# TODO: I suggest abstracting out the body of this loop as a function.
repo = Repository(repo_slug, cache_prefix=cache_prefix)
(
merge_status,
Expand Down Expand Up @@ -152,6 +155,7 @@ def merger( # pylint: disable=too-many-locals
print("merge_tools_comparator: Constructing Inputs")
merger_arguments = []
for _, repository_data in tqdm(repos.iterrows(), total=len(repos)):
# TODO: I suggest abstracting out the body of this loop as a function.
merges_repo = []
repo_slug = repository_data["repository"]
merge_list_file = Path(
Expand Down Expand Up @@ -225,6 +229,7 @@ def merger( # pylint: disable=too-many-locals
merge_data = merger_arguments[i][1]
cache_data = merger_results[i]
two_merge_tools_differ = False
# TODO: I suggest abstracting this loop into a function. That will also make it easier to return early, which can be done just with "return". (Currently it makes more iterations even after discovering that two merge tools do differ.)
for merge_tool1 in MERGE_TOOL:
for merge_tool2 in MERGE_TOOL:
if is_merge_sucess(
Expand Down
11 changes: 11 additions & 0 deletions src/python/repo.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,9 +78,11 @@ def compute_explanation(

def repo_test(wcopy_dir: Path, timeout: int) -> Tuple[TEST_STATE, str]:
"""Returns the result of run_repo_tests.sh on the given working copy.
TODO: What is the definition of "the entire test"? How is it different than the test process?
If the test process passes then the entire test is marked as passed.
If the test process timeouts then the entire test is marked as timeout.
Args:
TODO: Minor: Throughout, you don't need "The path of" at the beginning of comments. Or change it to "The directory of".
wcopy_dir (Path): The path of the working copy (the clone).
timeout (int): Test timeout limit, in seconds.
Returns:
Expand Down Expand Up @@ -190,6 +192,10 @@ def _merge_and_test( # pylint: disable=too-many-arguments
str: The tree fingerprint of the merge result.
str: The left fingerprint.
str: The right fingerprint.
TODO: Is this time to merge, time to test, or time to merge and test?
TODO: This is a ssingle number, but n_tests tells how many times to
TODO: run the test. Is this an average, median, overall, or something
TODO: else? (I have the same question elsewhere.)
float: The time it took to run the merge, in seconds.
"""
(
Expand Down Expand Up @@ -415,12 +421,14 @@ def compute_tree_fingerprint(self) -> str:
def get_sha_cache_entry(
self, commit: str, start_merge: bool = False
) -> Union[None, dict]:
# TODO: is a SHA cache a cache from a SHA to something, or a cache from something to a SHA?
"""Gets a SHA cache entry.
Args:
commit (str): The commit to check.
start_merge (bool, optional) = False: Whether to indicate that the merge starts if the
commit is not in the cache.
Returns:
# TODO: What is the structure of the dict this returns?
Union[None,dict]: The cache entry if the commit is in the cache, None otherwise.
"""
cache = lookup_in_cache(
Expand Down Expand Up @@ -501,7 +509,10 @@ def checkout_and_test(
def test(self, timeout: int, n_tests: int) -> TEST_STATE:
"""Tests the repository. The test results of multiple runs is combined into one result.
If one of the runs passes then the entire test is marked as passed.
# TODO: Should this line be preceded by "otherwise"? Is it possible that one run passes and another times out?
# TODO: I would think the test should be marked as timeout only if every run is timeout.
If one of the runs timeouts then the entire test is marked as timeout.
# TODO: Otherwise the test is marked as failure?
Args:
timeout (int): The timeout limit, in seconds.
n_tests (int): The number of times to run the test suite.
Expand Down
1 change: 1 addition & 0 deletions src/python/test_repo_head.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# TODO: What does it mean to "validate it"?
"""Tests the HEAD of a repo and validates it if the test passes.
usage: python3 test_repo_head.py --repos_csv_with_hashes <repos_csv_with_hashes.csv>
Expand Down
1 change: 1 addition & 0 deletions src/python/variables.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
CACHE_BACKOFF_TIME = 2 * 60 # 2 minutes, in seconds
DELETE_WORKDIRS = True
REPOS_PATH = Path("repos")
# TODO: What is this used for? Please give it a more descriptive name.
WORKDIR_PREFIX = Path(".workdir")

TIMEOUT_MERGING = 60 * 15 # 15 minutes, in seconds
Expand Down
1 change: 1 addition & 0 deletions src/scripts/merge_tools/intellimerge.sh
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ intellimerge_absolutepath="${ROOT_DIR}/${intellimerge_relativepath}"
clone_dir=$1
branch1=$2
branch2=$3
# TODO: Will this be in the repository? Would it be safer to put it under /tmp or /scratch?
temp_dir=".workdir/intelli_temp_$$/"
mkdir $temp_dir

Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#! /usr/bin/env python
# TODO: Benedikt, could you review and improve this file? I'm sure it is not perfect Python style.

"""Edits a file in place to remove conflict markers related to Java imports."""

Expand Down
3 changes: 2 additions & 1 deletion src/scripts/merge_tools/spork.sh
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ if [ "$#" -ne 3 ]; then
exit 1
fi

# Kill all java processes that are running for over an hour (to avoid memory leaks)
# TODO: This script is an odd place to put this `killall` command. Should it be elsewhere in the experimental infrastructure?
# Kill all Java processes that are running for over an hour (to avoid memory leaks).
killall -9 java --older-than 1h

SCRIPT_PATH="$(dirname "$0")"; SCRIPT_PATH="$(eval "cd \"$SCRIPT_PATH\" && pwd")"
Expand Down
2 changes: 2 additions & 0 deletions src/scripts/run_repo_tests.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@

set -o nounset

# TODO: This is too agressive and will interfere with the user's other work.
# TODO: Instead, create a directory under /tmp (or, better, under /tmp/$USER) for the AST merging experiments, and clean it.
# Test side effects can be seen in the /tmp directory.
# We delete all the files older than 2h and owned by the current user.
find /tmp -maxdepth 1 -user "$(whoami)" -mmin +120 -exec rm -rf {} \;
Expand Down
1 change: 1 addition & 0 deletions src/scripts/utils/run_multiple_machines.sh
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#!/bin/bash
# TODO: A branch of what? This repository? (Same question in `run_remotely.sh`.)
# Run the code for a certain branch on multiple remote machines.
# Usage: ./run_multiple_machines.sh <branch> <machine_address_path> <root_path_on_machine>
# <branch> is the branch to run
Expand Down

0 comments on commit 61cc17f

Please sign in to comment.