PATH WALK I: The path-walk API #1818

Open
wants to merge 6 commits into
base: master

Commits on Nov 8, 2024

  1. path-walk: introduce an object walk by path

    In anticipation of a few planned applications, introduce the most basic form
    of a path-walk API. It currently assumes that there are no UNINTERESTING
    objects, and does not include any complicated filters. It calls a function
    pointer on groups of tree and blob objects as grouped by path. This only
    includes objects the first time they are discovered, so an object that
    appears at multiple paths will not be included in two batches.
    
    These batches are collected in 'struct type_and_oid_list' objects, which
    store an object type and an oid_array of objects.
    
    The data structures are documented in 'struct path_walk_context', but in
    summary the most important are:
    
      * 'paths_to_lists' is a strmap that connects a path to a
        type_and_oid_list for that path. To avoid conflicts in path names,
        we make sure that tree paths end in "/" (except the root path, which
        is an empty string) and blob paths do not end in "/".
    
      * 'path_stack' is a string list that is added to in an append-only
        way. This stores the stack of our depth-first search on the heap
        instead of using recursion.
    
      * 'path_stack_pushed' is a strmap that stores path names that were
        already added to 'path_stack', to avoid repeating paths in the
        stack. Mostly, this saves us from quadratic lookups from doing
        unsorted checks into the string_list.
    
    The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
    push_to_stack() method. Call this instead of inserting into these
    structures directly.
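
    The coupling described above can be sketched in isolation. This is a
    minimal illustration, not the code from this series: 'struct demo_walk',
    demo_push_to_stack() and demo_pop() are hypothetical names, and a small
    chained hash set stands in for git's strmap so the example compiles on
    its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUCKETS 64

struct demo_node {
	char *key;
	struct demo_node *next;
};

/* Hypothetical stand-in for the 'path_stack' / 'path_stack_pushed' pair. */
struct demo_walk {
	char **stack;                           /* plays the role of 'path_stack' */
	size_t nr, alloc;
	struct demo_node *pushed[DEMO_BUCKETS]; /* plays the role of 'path_stack_pushed' */
};

char *demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	if (!copy)
		abort();
	memcpy(copy, s, len);
	return copy;
}

unsigned demo_hash(const char *s)
{
	unsigned h = 5381;
	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % DEMO_BUCKETS;
}

/*
 * The only function that touches both structures.
 * Returns 1 if the path was pushed, 0 if it had been pushed before.
 */
int demo_push_to_stack(struct demo_walk *w, const char *path)
{
	unsigned b = demo_hash(path);
	struct demo_node *n;

	/* Constant-time membership test instead of scanning the stack. */
	for (n = w->pushed[b]; n; n = n->next)
		if (!strcmp(n->key, path))
			return 0;

	n = malloc(sizeof(*n));
	if (!n)
		abort();
	n->key = demo_strdup(path);
	n->next = w->pushed[b];
	w->pushed[b] = n;

	if (w->nr == w->alloc) {
		w->alloc = w->alloc ? 2 * w->alloc : 8;
		w->stack = realloc(w->stack, w->alloc * sizeof(*w->stack));
		if (!w->stack)
			abort();
	}
	w->stack[w->nr++] = n->key; /* the set owns the string */
	return 1;
}

/* Pops the most recently pushed path; the set still remembers it. */
const char *demo_pop(struct demo_walk *w)
{
	assert(w->nr > 0);
	return w->stack[--w->nr];
}
```

    Because a popped path stays in the set, pushing it a second time is a
    no-op, which is the property that keeps paths from repeating in the
    stack.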
    
    The walk_objects_by_path() method initializes these structures and
    starts walking commits from the given rev_info struct. The commits are
    used to find the list of root trees which populate the start of our
    depth-first search.
    
    The core of our depth-first search is in a while loop that continues
    while we have not indicated an early exit and our 'path_stack' still has
    entries in it. The loop body pops a path off of the stack and "visits"
    the path via the walk_path() method.
    
    The walk_path() method gets the list of OIDs from the 'paths_to_lists'
    strmap and executes the callback method on that list with the given path
    and type. If the OIDs correspond to tree objects, then iterate over all
    trees in the list and run add_children() to add the child objects to
    their own lists, adding new entries to the stack if necessary.
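
    The loop and the per-path visit can be sketched on a toy object model.
    Everything here is an assumption made for illustration: objects are
    integer ids into an array rather than OIDs, the path lookup is a linear
    scan rather than a strmap, and all 'toy_' names are hypothetical. Only
    the shape (one batch per path, objects reported the first time they are
    seen, trees expanded after their batch is reported) follows the
    description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATHS 32
#define MAX_IDS 16
#define TOY_OUT_SIZE 256

/* Toy object model: ids index into an array; NOT git's object store. */
struct toy_entry { const char *name; int id; };
struct toy_object {
	int is_tree, seen, nr;
	struct toy_entry entries[4];
};

/* Stand-in for 'struct type_and_oid_list', stored with its path. */
struct toy_list {
	char path[64];
	int is_tree, nr;
	int ids[MAX_IDS];
};

typedef void (*toy_path_fn)(const char *path, int is_tree,
			    const int *ids, int nr, void *data);

struct toy_ctx {
	struct toy_object *objects;
	struct toy_list lists[MAX_PATHS]; /* stand-in for 'paths_to_lists' */
	int nr_lists;
	int stack[MAX_PATHS];             /* stand-in for 'path_stack' */
	int nr_stack;
};

/* Find the list for a path, creating it and pushing it on first use. */
struct toy_list *toy_get_list(struct toy_ctx *ctx, const char *path, int is_tree)
{
	int i;
	for (i = 0; i < ctx->nr_lists; i++)
		if (!strcmp(ctx->lists[i].path, path))
			return &ctx->lists[i];
	if (ctx->nr_lists == MAX_PATHS)
		return NULL;
	snprintf(ctx->lists[i].path, sizeof(ctx->lists[i].path), "%s", path);
	ctx->lists[i].is_tree = is_tree;
	ctx->lists[i].nr = 0;
	ctx->stack[ctx->nr_stack++] = i;
	ctx->nr_lists++;
	return &ctx->lists[i];
}

/* Append an object to the list for its path unless it was seen before. */
void toy_add(struct toy_ctx *ctx, const char *path, int id)
{
	struct toy_object *obj = &ctx->objects[id];
	struct toy_list *list;
	if (obj->seen)
		return;
	obj->seen = 1;
	list = toy_get_list(ctx, path, obj->is_tree);
	if (list && list->nr < MAX_IDS)
		list->ids[list->nr++] = id;
}

/* Tree paths get a trailing "/", blob paths do not. */
void toy_add_children(struct toy_ctx *ctx, const char *base, int tree_id)
{
	struct toy_object *tree = &ctx->objects[tree_id];
	char path[64];
	int i;
	for (i = 0; i < tree->nr; i++) {
		struct toy_object *child = &ctx->objects[tree->entries[i].id];
		snprintf(path, sizeof(path), "%s%s%s", base,
			 tree->entries[i].name, child->is_tree ? "/" : "");
		toy_add(ctx, path, tree->entries[i].id);
	}
}

void toy_walk(struct toy_ctx *ctx, const int *roots, int nr_roots,
	      toy_path_fn fn, void *data)
{
	int i;
	assert(fn);
	for (i = 0; i < nr_roots; i++)
		toy_add(ctx, "", roots[i]); /* root trees share the empty path */

	while (ctx->nr_stack > 0) {
		struct toy_list *list = &ctx->lists[ctx->stack[--ctx->nr_stack]];
		fn(list->path, list->is_tree, list->ids, list->nr, data);
		if (!list->is_tree)
			continue;
		for (i = 0; i < list->nr; i++)
			toy_add_children(ctx, list->path, list->ids[i]);
	}
}

/* Example callback: appends "<path>:<count>\n" to a TOY_OUT_SIZE buffer. */
void toy_record(const char *path, int is_tree, const int *ids, int nr, void *data)
{
	char *out = data;
	size_t len = strlen(out);
	(void)is_tree;
	(void)ids;
	snprintf(out + len, TOY_OUT_SIZE - len, "%s:%d\n", path, nr);
}
```

    In this sketch a blob that was already reported at one path is skipped
    when it shows up again under another path, so that second path may never
    produce a batch at all.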
    
    In testing, this depth-first search approach was the one that used the
    least memory while iterating over the object lists. There is still a
    chance that repositories with too-wide path patterns could cause memory
    pressure issues. Limiting the stack size could be done in the future by
    limiting how many objects are being considered in-progress, or by
    visiting blob paths earlier than trees.
    
    There are many future adaptations that could be made, but they are left for
    future updates when consumers are ready to take advantage of those features.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    b7e9b81
  2. test-lib-functions: add test_cmp_sorted

    This test helper will reduce repeated logic in t6601-path-walk.sh, and
    may be useful elsewhere, too.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    cf2ed61
  3. t6601: add helper for testing path-walk API

    Add some tests based on the current behavior, doing interesting checks
    for different sets of branches, ranges, and the --boundary option. This
    sets a baseline for the behavior and we can extend it as new options are
    introduced.
    
    Store and output a 'batch_nr' value so we can demonstrate that the paths are
    grouped together in a batch and not following some other ordering. This
    allows us to test the depth-first behavior of the path-walk API. However, we
    purposefully do not test the order of the objects in the batch, so the
    output is compared to the expected output through a sort.
    
    It is important to mention that the behavior of the API will change soon as
    we start to handle UNINTERESTING objects differently, but these tests will
    demonstrate the change in behavior.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    a3c754d
  4. path-walk: allow consumer to specify object types

    We add the ability to filter the object types in the path-walk API so
    the callback function is called fewer times.
    
    This adds the ability to ask for the commits in a list, as well. We
    re-use the empty string as the path for this set of objects because these are passed
    directly to the callback function instead of being part of the
    'path_stack'.
    
    Future changes will add the ability to visit annotated tags.
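
    A minimal sketch of type filtering, assuming one boolean per object type
    on an options struct. 'struct filter_info', its field names and the
    'filter_' functions are hypothetical and chosen for this example only;
    they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

enum filter_type { FILTER_COMMIT, FILTER_TREE, FILTER_BLOB };

typedef int (*filter_path_fn)(const char *path, enum filter_type type,
			      size_t nr, void *data);

/* Hypothetical options struct: nonzero means "report this type". */
struct filter_info {
	filter_path_fn fn;
	void *data;
	int commits, trees, blobs;
};

int filter_wants(const struct filter_info *info, enum filter_type type)
{
	switch (type) {
	case FILTER_COMMIT: return info->commits;
	case FILTER_TREE:   return info->trees;
	case FILTER_BLOB:   return info->blobs;
	}
	return 0;
}

/*
 * Report one batch if its type was requested. Returns the callback's
 * result, or 0 when the batch is skipped. A caller would still expand
 * skipped trees, since blobs can only be found through them.
 */
int filter_report(const struct filter_info *info, const char *path,
		  enum filter_type type, size_t nr)
{
	assert(info->fn);
	if (!filter_wants(info, type))
		return 0;
	return info->fn(path, type, nr, info->data);
}

/* Example callback: counts how many batches were reported. */
int filter_count(const char *path, enum filter_type type, size_t nr, void *data)
{
	(void)path;
	(void)type;
	(void)nr;
	++*(int *)data;
	return 0;
}
```

    With commits and blobs requested but trees not, a commit batch at the
    empty path and a blob batch each reach the callback, while the root tree
    batch (also at the empty path) does not.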
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    83b746f
  5. path-walk: visit tags and cached objects

    The rev_info that is specified for a path-walk traversal may specify
    visiting tag refs (both lightweight and annotated) and also may specify
    indexed objects (blobs and trees). Update the path-walk API to walk
    these objects as well.
    
    When walking tags, we need to peel the annotated objects until reaching
    a non-tag object. If we reach a commit, then we can add it to the
    pending objects to make sure we visit it in the commit walk portion. If we
    reach a tree, then we will assume that it is a root tree. If we reach a
    blob, then we have no good path name and so add it to a new list of
    "tagged blobs".
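
    The peeling rule above can be sketched as a small classifier. The
    'peel_' types and functions are hypothetical stand-ins written for this
    example; they do not use git's object API, and a tag with no target is
    simply treated as broken here.

```c
#include <assert.h>
#include <stddef.h>

enum peel_type { PEEL_COMMIT, PEEL_TREE, PEEL_BLOB, PEEL_TAG };

struct peel_object {
	enum peel_type type;
	const struct peel_object *tagged; /* target of a tag, else NULL */
};

/* Where a peeled object goes, following the three cases in the text. */
enum peel_dest {
	PEEL_PENDING_COMMIT, /* visited later by the commit walk */
	PEEL_ROOT_TREE,      /* assumed to be a root tree */
	PEEL_TAGGED_BLOB,    /* no path name, kept in a separate list */
	PEEL_BROKEN          /* tag chain ended without a target */
};

/* Follow tag targets until reaching a non-tag object (or NULL). */
const struct peel_object *peel_tag(const struct peel_object *obj)
{
	while (obj && obj->type == PEEL_TAG)
		obj = obj->tagged;
	return obj;
}

enum peel_dest peel_classify(const struct peel_object *obj)
{
	obj = peel_tag(obj);
	assert(!obj || obj->type != PEEL_TAG);
	if (!obj)
		return PEEL_BROKEN;
	switch (obj->type) {
	case PEEL_COMMIT: return PEEL_PENDING_COMMIT;
	case PEEL_TREE:   return PEEL_ROOT_TREE;
	case PEEL_BLOB:   return PEEL_TAGGED_BLOB;
	default:          return PEEL_BROKEN;
	}
}
```

    A tag that points at another tag which points at a blob peels through
    both layers and lands in the tagged-blob case.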
    
    When the rev_info includes the "--indexed-objects" flag, then the
    pending set includes blobs and trees found in the cache entries and
    cache-tree. The cache entries are usually blobs, though they could be
    trees in the case of a sparse index. The cache-tree stores
    previously-hashed tree objects but these are cleared out when staging
    objects below those paths. We add tests that demonstrate this.
    
    The indexed objects come with a non-NULL 'path' value in the pending
    item. This allows us to prepopulate the 'paths_to_lists' strmap with
    lists for these paths.
    
    The tricky thing about this walk is that we will want to combine the
    indexed objects walk with the commit walk, especially in the future case
    of walking objects during a command like 'git repack'.
    
    Whenever possible, we want the objects from the index to be grouped with
    similar objects in history. We don't want to miss any paths that appear
    only in the index and not in the commit history.
    
    Thus, we need to be careful to let the path stack be populated initially
    with only the root tree path (and possibly tags and tagged blobs) and go
    through the normal depth-first search. Afterwards, if there are other
    paths that are remaining in the paths_to_lists strmap, we should then
    iterate through the stack and visit those objects recursively.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    97765aa
  6. path-walk: mark trees and blobs as UNINTERESTING

    When the input rev_info has UNINTERESTING starting points, we want to be
    sure that the UNINTERESTING flag is passed appropriately through the
    objects. To match how this is done in places such as 'git pack-objects', we
    use the mark_edges_uninteresting() method.
    
    This method has an option for using the "sparse" walk, which is similar in
    spirit to the path-walk API's walk. To be sure to keep it independent, add a
    new 'prune_all_uninteresting' option to the path_walk_info struct.
    
    To check how the UNINTERESTING flag is spread through our objects, extend the
    'test-tool path-walk' command to output whether or not an object has that
    flag. This changes our tests significantly, including the removal of some
    objects that were previously visited due to the incomplete implementation.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    a4aaa3b