A simple Git implemented in Python, for my personal learning experience.
To get started, run the following:
$ nix develop
$ poetry install
$ poetry run pit <command>
pit init
- git init
pit cat-file <type> <object-id>
- git cat-file <object-id>
pit hash-object [-w] [-t TYPE] FILE
pit log <commit-id>
pit ls-tree [-r] <tree-id>
pit checkout
- git checkout
pit tag
- git tag
pit tag <name> <object-id>
pit tag -a <name> <object-id>
- create a new tag object <name>, pointing at HEAD (default) or <object-id>
pit checkout <commit>
pit checkout <branch>
pit cat-file <type> <id>
pit rev-parse --pit-type <type> <name>
pit ls-files
pit check-ignore <paths>
pit status
pit rm <paths>
pit add <paths>
pit commit -m <message>
- work tree
.git/config
- bare repository <project>.git
  - without working tree
  - for multiple users to push/pull histories/commits to and from
.git/description
- name of the repository
.git/HEAD
- reference to the current commit, using either:
  - the commit's ID, or
  - a symbolic reference to the current branch
- usually contains a symref to the main/master branch out of the box, e.g. ref: refs/heads/main
.git/info/exclude vs. .gitignore
- .gitignore is part of your source tree
- .git/info/exclude is not part of the commit database
.git/hooks
- contains scripts executed by Git as part of certain core commands
- remove the .sample from the script name to activate it
.git/objects
- Git's database
.git/objects/pack
- optimised format
.git/objects/info
- metadata
.git/refs
- various kinds of pointers into the .git/objects DB
.git/refs/heads
- stores the latest commit on each local branch
- the pointers are usually just files that contain the ID of a commit
- git init generates the .git directory
- root commit: a commit with no parents
.git/COMMIT_EDITMSG
- commit messages
  - Git saves --message to this file, or
  - Git opens this file for you to enter your commit message if --message is omitted
.git/index
- binary data
- cache
- stores info about each file in the current commit & which version of each file should be present
- updated when git add is run
- used to build the next commit
.git/logs
- text files
- used by reflog
- contains a log of every time a ref changes
  - i.e. a reference pointing to a commit, like HEAD or a branch name
  - e.g. when you:
    - make a commit
    - check out a different branch
    - perform a merge/rebase
cat-file
- prints raw contents of an object to stdout
  - uncompressed
  - without git header
git cat-file -p <commit-id>
- shows an object from the .git/objects/ database
- shows a tree ID
  - representing your whole tree of files as they were when this commit was made
  - Git does NOT store timestamp/author info for trees
git cat-file -p <tree-id>
- Git's representation of a tree
- one tree for every directory in the project, including root
- each entry in a tree is either:
  - another tree, or
  - blob
    <filemode> blob <blob-id> <blob-name>
git cat-file -p <blob-id>
git cat-file -p HEAD^{tree}
- asks Git for the tree of the commit referenced by .git/HEAD
- the Deflate Algorithm
- Format of a Git object?
  blob <size><null_byte><content>
  commit <size><null_byte><content>
  tag <size><null_byte><content>
  tree <size><null_byte><content>
- how does Git store a blob?
  blob <length><null_byte><content>
  - compressed with the DEFLATE algorithm
    - zlib
- how trees are stored
  - object ID: 40 hex characters, representing a 20-byte hash
  - inside a tree, the ID is packed into its binary representation (20 raw bytes)
- Git hashes objects before compressing them
  - using SHA-1
  - it hashes the header and content together: blob <size><null_byte><content>
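A minimal Python sketch of the "hash first, then compress" order described above. The function names (encode_object, object_id, compress_object) are made up for this note and are not pit's actual API.

```python
import hashlib
import zlib


def encode_object(data: bytes, obj_type: str = "blob") -> bytes:
    """Prepend the '<type> <size><null_byte>' header to the raw content."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return header + data


def object_id(data: bytes, obj_type: str = "blob") -> str:
    """SHA-1 of header + content, computed on the uncompressed bytes."""
    return hashlib.sha1(encode_object(data, obj_type)).hexdigest()


def compress_object(data: bytes, obj_type: str = "blob") -> bytes:
    """zlib (DEFLATE) compression of header + content."""
    return zlib.compress(encode_object(data, obj_type))


if __name__ == "__main__":
    # Same ID that `git hash-object` reports for an empty file.
    print(object_id(b""))
```

The ID depends only on the type, the size and the content, so the same file always gets the same ID regardless of compression settings.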
hash-object
- reads a file
- computes its hash as an object
- stores the object in the repository (if -w is specified), or just prints the hash
- defaults to blob if no type is specified
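A rough sketch of what hash-object with and without -w could look like, plus the matching read side used by cat-file. The names hash_object and read_object and the git_dir parameter are assumptions for this note; the objects/<first 2 hex chars>/<remaining 38> path is how Git lays out loose objects.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Optional, Tuple


def hash_object(data: bytes, obj_type: str = "blob",
                git_dir: Optional[Path] = None) -> str:
    """Return the object ID; also write a loose object if git_dir is given (-w)."""
    store = f"{obj_type} {len(data)}".encode("ascii") + b"\x00" + data
    sha = hashlib.sha1(store).hexdigest()
    if git_dir is not None:
        path = git_dir / "objects" / sha[:2] / sha[2:]
        if not path.exists():  # identical content is only stored once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(zlib.compress(store))
    return sha


def read_object(git_dir: Path, sha: str) -> Tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError(f"malformed object {sha}: bad length")
    return obj_type, body
```

Packfiles are not handled here; this only covers loose objects.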
- Git has a second object storage mechanism: packfiles
  - more efficient, but more complex, than loose objects
  - stored in .git/objects/pack/
- uncompressed commit object:
  tree <hash>
  parent <hash>
  author <>
  committer <>
  <PGP>
  <commit_message>
- Git has two strong rules about object identity:
  - the same name will always refer to the same object
    - an object's name is always the hash of its content
  - the same object will always be referred to by the same name
    - there shouldn't be two equivalent objects under different names
- What does a commit consist of?
  - a tree object
  - zero/one/more parents
  - author identity (name, email) and timestamp
  - committer identity
  - optional PGP signature
  - message
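To make the list above concrete, here is a small sketch that serializes those fields into the body of a commit object. build_commit is a hypothetical helper, the optional PGP signature is left out, and the author/committer strings are assumed to already be in Git's "Name <email> <unix-timestamp> <timezone>" form.

```python
from typing import Sequence


def build_commit(tree: str, parents: Sequence[str], author: str,
                 committer: str, message: str) -> bytes:
    """Serialize a commit body (without the 'commit <size><null_byte>' header)."""
    lines = [f"tree {tree}"]
    # Zero parents for a root commit, one normally, two or more for merges.
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    if not message.endswith("\n"):
        message += "\n"
    # A blank line separates the headers from the message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


if __name__ == "__main__":
    who = "A U Thor <author@example.com> 0 +0000"
    body = build_commit("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "initial")
    print(body.decode("utf-8"))
```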
- tree describes content of the work tree
  - associates blobs to paths
  - Array of 3-element tuples:
    - file mode
    - path
    - SHA-1
      - referring to either a blob or another tree object
  - tree object format:
    <filemode> <path><null_byte><sha>
  - file mode is stored in Git in octal format
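The tree format above can be parsed with a simple loop. This is a sketch under the assumption that the input is the tree body only (header already stripped); parse_tree is a name invented for this note.

```python
from typing import List, Tuple


def parse_tree(raw: bytes) -> List[Tuple[str, str, str]]:
    """Split a tree body into (mode, path, sha-hex) tuples."""
    entries = []
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)        # end of the octal file mode
        null = raw.index(b"\x00", space)    # end of the path
        mode = raw[pos:space].decode("ascii")
        path = raw[space + 1:null].decode("utf-8")
        sha = raw[null + 1:null + 21].hex()  # 20 raw bytes -> 40 hex chars
        entries.append((mode, path, sha))
        pos = null + 21
    return entries
```

Note that the SHA is stored as 20 raw bytes here, not as 40 hex characters, which is why the loop advances by a fixed 21 bytes after the path's null byte.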
git ls-tree [-r] <tree-id>
- prints contents of a tree, recursively if -r is set
git checkout
- instantiates a commit in the worktree
- Git references
  - in .git/refs
  - are text files containing the hex representation of an object's hash, encoded in ASCII
  - refs can refer to another ref: ref: refs/remotes/origin/master
    - indirect reference
  - direct reference: a ref with a SHA-1 object ID
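Resolving a ref then comes down to following "ref: ..." lines until a hash shows up. A minimal sketch, assuming loose ref files only (packed refs are ignored); resolve_ref is a name chosen for this note.

```python
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, ref: str) -> Optional[str]:
    """Follow indirect refs until a direct one (a hex object ID) is found."""
    path = git_dir / ref
    if not path.is_file():
        # e.g. HEAD pointing at a branch that has no commits yet
        return None
    data = path.read_text(encoding="ascii").strip()
    if data.startswith("ref: "):
        return resolve_ref(git_dir, data[len("ref: "):])
    return data
```

For example, resolve_ref(Path(".git"), "HEAD") would read .git/HEAD, then the branch file it points to.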
- tags: the simplest use of refs
  - a user-defined name for an object
    - often a commit
  - a very common use: identifying software releases
    git tag v1.2.34 6071c08
  - like aliasing
    - a new way to refer to an existing object
  - git checkout <tag-name> is equivalent to git checkout <commit-id>
  - tags are actually refs, living under .git/refs/tags/
- Tags have two flavors:
  - lightweight tags
    - regular refs to a commit/tree/blob
  - tag objects
    - regular refs pointing to an object of type tag
    - have author/date/PGP
    - format same as commit object
git tag
- creates a new tag, or
- lists existing tags (default behavior)
- a branch is a reference to a commit
- branches are refs living in .git/refs/heads/
- Branch vs. Tag:
  - branches are references to a commit; tags can refer to any object
  - the branch ref is updated at each commit
- Every time you commit, Git does the following:
  - a new commit object is created, with the current branch's (commit) ID as its parent
  - the commit object is hashed & stored
  - the branch ref is updated to refer to the new commit's hash
- the current branch lives outside of .git/refs/
  - in .git/HEAD
  - an indirect ref, like ref: path/to/other/ref, not some hash
- Detached HEAD
  - when you check out a random commit
  - you are not on any branch any more
  - .git/HEAD is a direct reference, containing a SHA-1
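Telling "on a branch" apart from "detached HEAD" only needs a look at the first line of .git/HEAD. A small sketch; current_branch is a hypothetical helper, not necessarily how pit does it.

```python
from pathlib import Path
from typing import Optional


def current_branch(git_dir: Path) -> Optional[str]:
    """Return the branch name HEAD points to, or None if HEAD is detached."""
    head = (git_dir / "HEAD").read_text(encoding="ascii").strip()
    prefix = "ref: refs/heads/"
    if head.startswith(prefix):
        return head[len(prefix):]
    # Anything else is treated as a direct reference (a raw SHA-1).
    return None
```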
- staging area
  - the intermediate stage between the last commit and the next commit
- To commit some changes
  - first, stage the changes with git add / git rm
  - then git commit
- index file
  - a binary file
  - holds extra info about files in the worktree, like creation/modification time
    - so git status does not often need to actually compare files
    - it just checks that the modification time is the same as the one stored in the index file
  - when the repo is clean, the index file holds the exact same contents as the HEAD commit, plus metadata about the corresponding filesystem entries
  - when you git add / git rm, the index file is modified accordingly
    - updated with new blob ID; various metadata updated as well
    - git status knows when not to compare file contents
  - when you git commit, a new tree is produced from the index file
    - new commit object generated with that tree
    - branches are updated
- index file vs. tree
  - index file can represent inconsistent states (like merge conflict)
  - tree is always a complete, unambiguous representation
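The "check the metadata before comparing contents" idea from the index notes above can be sketched like this. CachedStat, snapshot and maybe_modified are invented names; also comparing the file size is an extra assumption beyond the modification time mentioned in the notes.

```python
import os
from dataclasses import dataclass


@dataclass
class CachedStat:
    """Filesystem metadata remembered for one file (stand-in for an index entry)."""
    mtime_ns: int
    size: int


def snapshot(path: str) -> CachedStat:
    """Record the metadata an index entry would keep for this file."""
    st = os.stat(path)
    return CachedStat(mtime_ns=st.st_mtime_ns, size=st.st_size)


def maybe_modified(path: str, cached: CachedStat) -> bool:
    """Cheap check: only if this returns True do contents need re-hashing."""
    st = os.stat(path)
    return st.st_mtime_ns != cached.mtime_ns or st.st_size != cached.size
```

If maybe_modified returns False, the stored blob ID can be trusted without reading the file.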
git ls-files
- displays names of files in the staging area
git check-ignore
- check/list the ignore rules
  - in various .gitignore files
- Two kinds of ignore files
  - live in the index
    - various .gitignore files
  - outside the index (absolute)
    - global (~/.config/git/ignore)
    - repository-specific .git/info/exclude
git status
- compares:
  - the HEAD with the staging area
  - the staging area with the worktree
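The two comparisons can be sketched over plain dictionaries. Here each of HEAD, the staging area and the worktree is assumed to be already flattened into a path -> blob ID mapping (the Snapshot type and both function names are hypothetical); paths that exist only in the worktree are reported as untracked.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # path -> blob ID (hypothetical representation)


def changes(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Classify paths as added, deleted or modified going from old to new."""
    return {
        "added": sorted(p for p in new if p not in old),
        "deleted": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in old if p in new and old[p] != new[p]),
    }


def status(head: Snapshot, index: Snapshot, worktree: Snapshot) -> dict:
    """HEAD vs. staging area gives staged changes; staging area vs. worktree the rest."""
    staged = changes(head, index)
    unstaged = changes(index, worktree)
    untracked = unstaged.pop("added")  # in the worktree but not in the index
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}
```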
git rm
- removes files from the worktree as well as from the index
git add
- begin by removing the existing index entry
  - if there is one, without removing the file itself (git rm)
- hash the file into a blob object
- create its entry
- write the modified index back
git add vs. git rm
- git add needs to create an index entry
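The git add steps can be sketched against an in-memory index. This is a simplification: the index is a plain dict rather than Git's binary format, the blob is hashed but not written to the object store, the file mode is hard-coded, and add_to_index is a made-up name.

```python
import hashlib
from pathlib import Path
from typing import Dict


def add_to_index(index: Dict[str, dict], worktree: Path, rel_path: str) -> str:
    """Replace the index entry for rel_path with one describing the current file."""
    # 1. Drop the existing entry, if any; the file itself is left alone.
    index.pop(rel_path, None)
    # 2. Hash the file as a blob object.
    full = worktree / rel_path
    data = full.read_bytes()
    store = f"blob {len(data)}".encode("ascii") + b"\x00" + data
    sha = hashlib.sha1(store).hexdigest()
    # 3. Create the new entry, with the metadata git status relies on.
    st = full.stat()
    index[rel_path] = {
        "sha": sha,
        "mode": "100644",  # assumption: regular, non-executable file
        "mtime_ns": st.st_mtime_ns,
        "size": st.st_size,
    }
    # 4. Writing the modified index back to disk is omitted in this sketch.
    return sha
```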
- How to build git commit
  - build a dict (hashmap) of directories
    - keys are full paths from worktree root (like pit/libgit.py)
    - values are lists of GitIndexEntry
      - files in the directory
  - traverse this list, bottom up
    - at each directory, build a tree with its contents
    - write the new tree to the repo
    - add the tree to the directory's parent
    - iterate over the next directory
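A compact sketch of the bottom-up tree building described above. It is not pit's actual code: the index is simplified to a path -> (mode, blob ID) dict instead of GitIndexEntry objects, the dict keys are directory paths ("" for the root), and trees are written into an in-memory store dict instead of the repository.

```python
import hashlib
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, sha hex)


def write_trees(index: Dict[str, Tuple[str, str]], store: Dict[str, bytes]) -> str:
    """Build one tree per directory, deepest first; return the root tree ID."""
    dirs: Dict[str, List[Entry]] = {"": []}
    for path, (mode, sha) in index.items():
        dirname, _, name = path.rpartition("/")
        d = dirname
        while d not in dirs:  # register the directory and all its ancestors
            dirs[d] = []
            d = d.rpartition("/")[0]
        dirs[dirname].append((mode, name, sha))

    root_sha = ""
    # A child path is always longer than its parent, so this goes bottom up.
    for dirname in sorted(dirs, key=len, reverse=True):
        body = b""
        # Git sorts entries by name, treating directories as if they ended in "/".
        ordered = sorted(dirs[dirname],
                         key=lambda e: e[1] + "/" if e[0] == "40000" else e[1])
        for mode, name, sha in ordered:
            body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(sha)
        raw = f"tree {len(body)}".encode("ascii") + b"\x00" + body
        tree_sha = hashlib.sha1(raw).hexdigest()
        store[tree_sha] = raw  # "write new tree to the repo"
        if dirname:
            parent, _, base = dirname.rpartition("/")
            dirs[parent].append(("40000", base, tree_sha))  # add to the parent
        else:
            root_sha = tree_sha
    return root_sha
```

The returned root tree ID is what the new commit object would reference in its tree line.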