June 2017

Volume 32 Number 6

[DevOps]

Git Internals for Visual Studio Developers

By Jonathan Waldman | June 2017

In my Commit to Git DevOps article (msdn.com/magazine/mt767697), I explained how the Git version control system (VCS) differs from centralized VCSes with which you might already be familiar. I then demonstrated how to accomplish certain Git tasks using the Git tooling in Visual Studio. In this article, I’ll summarize relevant changes to how Git works within the newly released Visual Studio 2017 IDE, discuss how the Git repo is implemented in the file system, and examine the topology of its data store and the structure and content of its various storage objects. I’ll conclude with a low-level explanation of Git branches, providing a perspective that I hope will prepare you for more advanced Git operations I’ll present in upcoming articles.

Note: I use no servers or remotes in this article—I’m exploring a purely local scenario that you can follow using any Windows machine with Visual Studio 2017 and Git for Windows (G4W) installed (with or without an Internet or network connection). This article is an introduction to Git internals and assumes you’re familiar with Visual Studio Git tooling and basic Git operations and concepts.

Visual Studio, Git and You

Git refers not only to the repository (“repo”) that contains the version-control data store, but also to the engine that processes commands to manage it: Plumbing commands carry out low-level operations; porcelain commands bundle up plumbing commands in macro-like fashion, making them easier if less granular to invoke. As you master Git, you’ll discover that some tasks require the use of these commands (some of which I’ll use in this article), and that invoking them requires a command-line interface (CLI). Unfortunately, Visual Studio 2017 no longer installs a Git CLI because it uses a new Git engine called MinGit that doesn’t provide one. MinGit (“minimal Git”), introduced with G4W 2.10, is a portable, reduced-feature-set API designed for Windows applications that need to interact with Git repositories. G4W and, by extension, MinGit, are forks of the official Git open source project. This means they both inherit the official Git fixes and updates as soon as they’re available—and it ensures that Visual Studio can do the same.

To access a Git CLI (and to follow along with me), I recommend installing the full G4W package. While other Git CLI/GUI tooling options are available, G4W (as MinGit’s official parent) is a wise choice—especially because it shares its configuration files with MinGit. To obtain the latest G4W setup, visit the Downloads section at the site’s official source: git-scm.com. Run the setup program and select the Git Bash Here checkbox (creates a Git command-prompt window) and the Git GUI Here checkbox (creates a Git GUI window)—which makes it easy to right-click a folder in Windows Explorer—and select one of those two options for the current folder (the “Bash” in Git Bash refers to Bourne Again Shell, which presents a Git CLI in a Unix shell for G4W). Next, select Use Git from the Windows Command Prompt, which configures your environment so that you can conveniently run Git commands from either a Visual Studio package manager console (Windows PowerShell) or a command-prompt window.

If G4W is installed using the options I’ve recommended here, Figure 1 depicts the communication pathways that will be in effect when communicating with a Git repo: Visual Studio 2017 uses the MinGit API, while PowerShell and command-prompt sessions use the G4W CLI, a very different communication pathway en route to the Git repo. Although they serve as different communication endpoints, MinGit and G4W are both derived from the official Git source code, and they share their configuration files. Notice that when you issue porcelain commands, they’re translated into plumbing commands before being processed by the CLI. The point here is that Git experts often resort to (and some thrive on) issuing bare-metal Git plumbing commands to a CLI because doing so is the most direct, lowest-level way to manage, query and update a Git repo. In contrast to low-level plumbing commands, higher-level porcelain commands and Git operations exposed by the Visual Studio IDE can also update the Git repo, yet it’s not always clear how, especially because porcelain commands often accept options that change what they do when they’re invoked. I’ve concluded that familiarity with Git plumbing commands is essential to wielding the power of Git, and that’s why I strongly recommend installing G4W alongside Visual Studio 2017. (Read details about Git plumbing and porcelain commands at git-scm.com/docs.)

Figure 1 Communication Paths to and from the MinGit API and the Git for Windows Command-Line Interface

Low-Level Git

It’s natural for a Visual Studio developer to try to leverage existing knowledge of a VCS, such as Team Foundation Server (TFS), when transitioning to Git. Indeed, there’s an overlap in the terms and concepts used to describe operations in both systems—such as checking out/checking in code, merging, branching and so on. However, the assumption that a similar vocabulary implies similar underlying operations is downright wrong and dangerous. That’s because the decentralized Git VCS is fundamentally different in how it stores and tracks files and in the way it implements familiar version-control features. In short, when transitioning to Git, it’s probably best to just forget everything you know about centralized VCSes and start afresh.

When you’re working on a Visual Studio project that’s under Git source control, the typical edit/stage/commit workflow works something like this: You add, edit and delete (collectively hereafter, “change”) files in your project as needed. When ready, you stage some or all of those changes before committing them to the repo. Once committed, those changes become part of the repo’s complete and transparent history. Now, let’s see how Git manages all that internally.

The Directed Acyclic Graph

Behind the scenes, each commit ends up as a vertex (node) on a Git-managed directed acyclic graph (“DAG,” in graph-theory parlance). The DAG represents the Git repo and each vertex represents a data element known as a commit object (see Figure 2). Vertices in a DAG are connected with a line called an edge; it’s customary for DAG edges to be drawn as arrows so they can express a parent/child relationship (the head points to the parent vertex; the tail points to the child vertex). The origin vertex represents the repo’s first commit; a terminal vertex has no child. DAG edges express the exact parent-child relationship between each of its vertices. Because Git commit objects (“commits”) are vertices, Git can leverage the DAG structure to model the parent-child relationship between every commit, giving Git its ability to generate a history of changes from any commit back to the repo’s initial commit. Furthermore, unlike linear graphs, a DAG supports branches (a parent with more than one child), as well as merges (a child with more than one parent). A Git branch is spawned whenever a commit object produces a new child and a merge occurs when commit objects are combined to form a child.

Figure 2 A Directed Acyclic Graph Showing Vertex, Edge, Head, Tail, Origin Vertex and Terminal Vertices; Three Branches (A, B and C); Two Branch Events (at A4); and One Merge Event (B3 and A5 Are Merged at A6)

I’ve explored the DAG and its associated terminology in great detail because such knowledge is a prerequisite to understanding advanced Git operations, which tend to work by manipulating vertices on the Git DAG. Furthermore, DAGs help to visualize the Git repo, and they’re widely leveraged in teaching materials, during presentations and by Git GUI tools.
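To make the parent/child terminology concrete, here is a minimal Python sketch of the graph in Figure 2. It assumes the edges described in the figure caption (branches B and C leave A4; B3 and A5 merge at A6); the labels are illustrative stand-ins for commit IDs, and the function names are my own rather than anything Git exposes.

```python
# Each commit lists its parents. Labels follow Figure 2 and are illustrative only.
PARENTS = {
    "A1": [],            # origin vertex: no parent
    "A2": ["A1"],
    "A3": ["A2"],
    "A4": ["A3"],
    "A5": ["A4"],
    "B1": ["A4"],        # branch event at A4
    "B2": ["B1"],
    "B3": ["B2"],
    "C1": ["A4"],        # second branch event at A4
    "C2": ["C1"],
    "A6": ["A5", "B3"],  # merge: a child with two parents
}


def history(commit, parents=PARENTS):
    """Return every commit reachable from `commit` by following parent edges."""
    seen = []
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(parents[current])
    return seen


def children(commit, parents=PARENTS):
    """Return the commits that name `commit` as a parent."""
    return sorted(c for c, ps in parents.items() if commit in ps)


print(history("C2"))   # walks back to the origin vertex A1
print(children("A4"))  # more than one child means a branch point
print(PARENTS["A6"])   # more than one parent means a merge
```

Note that only the child-to-parent direction is stored; children have to be derived by scanning, which mirrors how a commit object records its parent but never its children.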

Git Objects Overview

So far I’ve mentioned only the Git commit object. However, Git actually stores four different object types in its repo: commit, tree, blob and tag. To investigate each of these, launch Visual Studio (I’m using Visual Studio 2017, but earlier versions that include Git support will work similarly) and create a new console application using File | New Project. Give your project a name, check the Create new Git repository checkbox and click OK. (If you haven’t configured Git within Visual Studio before, you’ll see a Git User Information dialog box appear. If you do, specify your name and e-mail address, because this information is written to the Git repo for each of your commits, and check the “Set in global .gitconfig” checkbox if you want to use this information for every Git repo on your machine.)

When done, open a Solution Explorer window (see Figure 3, Marker 1). You’ll see light-blue lock icons next to files that are checked in, even though I haven’t yet issued a commit! (This is an example showing that Visual Studio sometimes carries out actions against your repo that you might not anticipate.) To see exactly what Visual Studio did, look at the history of changes for the current branch.

Figure 3 A New Visual Studio Project with Its Git-Repo History Report

Git names the default branch master and makes it the current branch. Visual Studio displays the current branch name at the right edge of its status bar (Marker 2). The current branch identifies the commit object on the DAG that will be the parent of the next commit (more on branches later). To view the commit history for the current branch, click the master label (Marker 2) then select View History (Marker 3) from the menu.

The History - master window displays several columns of information. At the left (Marker 4) are two vertices on the DAG; each graphically represents a commit on the Git DAG. The ID, Author, Date and Message columns (Marker 5) show details about the commit. The HEAD for the master branch is indicated by a dark red pointer (Marker 6)—I’ll fully explain the meaning of this toward the end of this article. This HEAD marks the location of the next edge arrow’s head after a commit adds a new vertex to the DAG.

The report shows that Visual Studio issued two commits, each with its own Commit ID (Marker 7). The first (earliest) is uniquely identified with ID a759f283; the second with bfeb0957. These values are truncated from the full 40-character hexadecimal Secure Hash Algorithm 1 (SHA-1) value. SHA-1 is a cryptographic hash function engineered to detect corruption by taking a message, such as the commit data, and creating a message digest (the full SHA-1 hash value), such as the commit ID. In short, the SHA-1 hash acts not only like a checksum, but also like a GUID because it provides roughly 1.46 x 10^48 unique combinations. Like many other Git tools, Visual Studio expresses only the first eight characters of the full value because these provide about 4.3 billion unique values, enough to avoid collisions in your daily work. If you want to see the full SHA-1 value, hover the mouse over a line in the History report (Marker 8).
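The two magnitudes quoted for the full and abbreviated IDs can be checked with a few lines of Python. This is plain arithmetic on the number of hexadecimal digits (16 possible values per character), not anything specific to Git; the variable names are my own.

```python
# A SHA-1 value is 160 bits, written as 40 hexadecimal characters.
full_space = 16 ** 40    # number of distinct 40-character hex strings
short_space = 16 ** 8    # number of distinct 8-character abbreviations

print(f"{full_space:.2e}")  # scientific notation for the full ID space
print(f"{short_space:,}")   # the "4.3 billion" figure for eight characters
```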

While the View History report’s message column indicates the stated purpose of each commit (provided by the committer during a commit), it is, after all, just a comment. To examine a commit’s actual changes, right-click a row in the list and select View Commit Details (see Figure 4).

Figure 4 Commit Details for the Repo’s First Two Commits

The first commit (Marker 1) shows two changes: .gitignore and .gitattributes (I discussed these files in my previous article). The [add] next to each indicates files added to the repo. The second commit (Marker 2) shows five files added and also displays its parent-commit object’s ID as a clickable link. To copy the entire SHA-1 value to the clipboard, simply click the Actions menu and select Copy Commit ID.

File-System Implementation of a Git Repo

To see how Git stores these files in the repo, right-click the solution (not the project) in Solution Explorer and select Open Folder in File Explorer. In the solution’s root you’ll see a hidden folder called .git (if you don’t see .git, click Hidden items in the File Explorer View menu). The .git folder is the Git repo for your project. Its objects folder defines the DAG: All DAG vertices and all parent-child relationships between each vertex are encoded by files that represent every commit in the repo starting with the origin vertex (refer back to Figure 2). The .git folder’s HEAD file and refs folder define branches. Let’s look at these .git items in detail.

Exploring Git Objects

The .git\objects folder stores all Git object types: commit (for commits), tree (for folders), blob (for file contents, whether text or binary) and tag (a friendly commit-object alias).

Commit Object

Now’s the time to launch a Git CLI. You can use whichever tool you prefer (Git Bash, PowerShell or command window); I’ll use PowerShell. To begin, navigate to the solution root’s .git\objects folder, then list its contents (Figure 5, Marker 1). You’ll see that it contains a number of folders named using two-character hex values. To avoid exceeding the number of files permitted in a folder by the OS, Git removes the first two characters from each 40-character SHA-1 value in order to create a folder name, then it uses the remaining 38 characters as the file name for the object to store. To illustrate, my project’s first commit has ID a759f283, so that object will appear in a folder called a7 (the first two characters of the ID). As expected, when I open that folder, I see a file whose name starts with 59f283. Remember that all of the files stored in these hex-named folders are Git objects. To save space, Git zlib-compresses the files in the object store. Because this kind of compression produces binary files, you won’t be able to view these files using a text editor. Instead, you’ll need to invoke Git commands that can properly extract Git-object data and present it using a format you can digest.

Figure 5 Exploring Git Objects Using a Git Command-Line Interface
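The folder-and-file naming scheme and the zlib compression can be imitated in a short Python sketch. It writes to a throwaway temporary folder rather than a real .git\objects folder. One detail goes beyond what this article covers and should be read as my assumption about the on-disk format: the compressed payload starts with a small header of the form "type size" followed by a NUL byte. The helper names are hypothetical.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, obj_type, content):
    """Store content zlib-compressed under <first 2 hex chars>/<remaining 38>."""
    # Assumed header format: b"<type> <size>\0" precedes the content.
    store = f"{obj_type} {len(content)}".encode() + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    folder = os.path.join(objects_dir, sha[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return sha


def read_loose_object(objects_dir, sha):
    """Decompress an object file and split it into (type, content)."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


objects_dir = tempfile.mkdtemp()  # stand-in for .git\objects
sha = write_loose_object(objects_dir, "blob", b"Console.WriteLine();\n")
print(sha[:2], sha[2:])           # folder name, then 38-character file name
print(read_loose_object(objects_dir, sha))
```

Reading the type back out of the header is, in spirit, what the cat-file command introduced next does for you.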

I already know that file 59f283 contains a commit object because it’s a commit ID. But sometimes you’ll see a file in the objects folder without knowing what it is. Git provides the cat-file plumbing command to report the type of an object, as well as its contents (Marker 3). To obtain the type, specify the -t (type) option when invoking the command, along with the first few characters of the Git object’s ID (enough to identify it uniquely):

git cat-file -t a759f2

On my system, this reports the value “commit”—indicating that the file starting with a759f2 contains a commit object. Specifying only the first five characters of the SHA-1 hash value is usually enough, but you can provide as many as you want (don’t forget to add the two characters from the folder name). When you issue the same command with the -p (pretty print) option, Git extracts information from the commit object and presents it in a format that’s human-readable (Marker 4).
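The idea of identifying an object by a short, unique prefix can be sketched as follows. This is not Git’s lookup code, just an illustration of the rule; the three object IDs are made up (a real prefix padded with a repeated digit), and the four-character minimum reflects my understanding of Git’s own lower limit for abbreviations.

```python
def resolve_prefix(prefix, object_ids):
    """Return the single full ID that starts with `prefix`, or raise."""
    if len(prefix) < 4:
        raise ValueError("prefix too short; at least four characters expected")
    matches = [oid for oid in object_ids if oid.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object starts with {prefix}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix} is ambiguous ({len(matches)} matches)")
    return matches[0]


# Hypothetical 40-character IDs, invented for this example.
OBJECT_IDS = [
    "4b825dc6" + "0" * 32,
    "4b82aa17" + "1" * 32,
    "9fceb02d" + "2" * 32,
]

print(resolve_prefix("4b825d", OBJECT_IDS))  # unique, so it resolves
print(resolve_prefix("9fce", OBJECT_IDS))    # four characters suffice here
try:
    resolve_prefix("4b82", OBJECT_IDS)       # shared by two IDs
except ValueError as error:
    print(error)
```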

A commit object is composed of the following properties: Parent Commit ID, Tree ID, Author Name, Author Email Address, Author Commit Timestamp, Committer Name, Committer Email Address, Committer Commit Timestamp and Commit Message (the parent commit ID isn’t displayed for the first commit in the repo). The SHA-1 for each commit object is computed from all of the values contained in these commit-object properties, virtually guaranteeing that each commit object will have a unique commit ID.
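To show why the commit ID depends on every one of these properties, the sketch below assembles a commit-like text from them and hashes it. The layout (one "tree" line, optional "parent" lines, "author" and "committer" lines, a blank line, then the message, all preceded by a "commit size" header) is my assumption about the stored format, and the tree ID, names and timestamps are invented for illustration.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Hash the commit properties into a 40-character ID (assumed layout)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # absent for the first commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values; none of these come from the article's repo.
TREE = "1" * 40
WHO = "Jane Dev <jane@example.com> 1496275200 +0000"

first = commit_id(TREE, [], WHO, WHO, "Add .gitignore and .gitattributes.")
second = commit_id(TREE, [first], WHO, WHO, "Add project files.")
reworded = commit_id(TREE, [first], WHO, WHO, "Add project files!")

print(first)
print(second)
print(second != reworded)  # changing any property changes the ID
```

Because the parent ID is itself part of the hashed text, altering an early commit would change the IDs of all its descendants.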

Tree and Blob Objects

Notice that although the commit object contains information about the commit, it doesn’t contain any files or folders. Instead, it has a Tree ID (also an SHA-1 value) that points to a Git tree object. Tree objects are stored in the .git\objects folder along with all other Git objects.

Figure 6 depicts the root tree object that’s part of each commit object. The root tree object maps, in turn, to blob objects (covered next) and other tree objects, as needed.

Figure 6 Visualization of Git Objects That Express a Commit

Because my project’s second commit (Commit ID bfeb09) includes files, as well as folders (see the earlier Figure 4), I’ll use it to illustrate how the tree object works. Figure 7, Marker 1 shows the cat-file -p bfeb09 output. This time, notice that it includes a parent property, which correctly references the SHA-1 value for the first commit object. (Remember that it’s a commit object’s parent reference that enables Git to construct and maintain its DAG of commits.)

Figure 7 Using the Git CLI to Explore Tree Object Details

The root tree object maps, in turn, to blob objects (zlib-compressed files) and other tree objects, as needed.

Commit bfeb09 contains a tree property with ID ca853d. Figure 7, Marker 2 shows the cat-file -p ca853d output. Each entry in a tree object contains a permissions property corresponding to the POSIX permissions mask of the object (040000 = Directory, 100644 = Regular non-executable file, 100664 = Regular non-executable group-writeable file, 100755 = Regular executable file, 120000 = Symbolic link, and 160000 = Gitlink); type (tree or blob); SHA-1 (for the tree or blob); and name. The name is the folder name (for tree objects) or the file name (for blob objects). Observe that this tree object is composed of three blob objects and another tree object. You can see that the three blobs refer to files .gitattributes, .gitignore and DemoConsole.sln, and that the tree refers to folder DemoConsoleApp (Figure 7, Marker 3). Although tree object ca853d is associated with the project’s second commit, its first two blobs represent files .gitattributes and .gitignore, files added during the first commit (see Figure 4, Marker 1)! The reason these files appear in the tree for the second commit is that each commit’s tree represents the complete state of the project at that commit: the content carried over from the previous commit along with the changes captured by the current commit. To “walk the tree” one level deeper, Figure 7, Marker 3 shows the cat-file -p a763da output, showing three more blobs (App.config, DemoConsoleApp.csproj and Program.cs) and another tree (folder Properties).
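The four properties of a tree entry (permissions, type, SHA-1 and name) can be illustrated with a small Python round trip. The binary layout used here (mode, a space, the name, a NUL byte, then the 20 raw SHA-1 bytes, with the directory mode stored without its leading zero) is my assumption about how tree objects are encoded. The file names come from the text above, but the IDs are placeholders derived from the names, not the real ones in my repo.

```python
import hashlib


def build_tree(entries):
    """Encode (mode, name, sha_hex) tuples into a raw tree body (assumed layout)."""
    body = b""
    for mode, name, sha_hex in entries:
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha_hex)
    return body


def parse_tree(body):
    """Decode a raw tree body into (mode, type, sha_hex, name) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        sha_hex = body[nul + 1:nul + 21].hex()  # 20 raw bytes = 40 hex chars
        kind = {"40000": "tree", "160000": "commit"}.get(mode, "blob")
        entries.append((mode.zfill(6), kind, sha_hex, name))
        pos = nul + 21
    return entries


def placeholder_id(name):
    """Hypothetical stand-in for a real object ID."""
    return hashlib.sha1(name.encode()).hexdigest()


root = build_tree([
    ("100644", ".gitattributes", placeholder_id(".gitattributes")),
    ("100644", ".gitignore", placeholder_id(".gitignore")),
    ("100644", "DemoConsole.sln", placeholder_id("DemoConsole.sln")),
    ("40000", "DemoConsoleApp", placeholder_id("DemoConsoleApp")),
])

for mode, kind, sha_hex, name in parse_tree(root):
    print(mode, kind, sha_hex, name)  # same columns as cat-file -p on a tree
```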

Blob objects are again just zlib-compressed files. If the uncompressed file contains text, you can extract a blob’s entire content using the same cat-file command along with the blob ID (Figure 7, Marker 5). Because blob objects represent files, Git uses the SHA-1 blob ID to determine if a file changed from the previous commit; it also uses SHA-1 values when diffing any two commits in the repo.
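Change detection by ID comparison can be sketched in a few lines. The function below hashes a "blob size" header plus the file content, which is my assumption about how a blob ID is derived; the file contents are invented for the example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute a blob ID from file content (assumed header: b"blob <size>\\0")."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


before = b"static void Main() { }\n"
unchanged = b"static void Main() { }\n"
edited = b"static void Main() { Console.WriteLine(); }\n"

print(blob_id(before) == blob_id(unchanged))  # identical content, identical ID
print(blob_id(before) == blob_id(edited))     # any edit yields a different ID
```

The file name plays no part in the blob ID; names live in the tree entry that points to the blob, so two identical files with different names share one blob.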

Tag Object

The cryptic alphanumeric nature of SHA-1 values can be a bit unwieldy to communicate. The tag object lets you assign a friendly name to any commit, tree or blob object, although it’s most common to tag only commit objects. There are two types of tag: lightweight and annotated. Both types appear as files in the .git\refs\tags folder, where the file name is the tag name. The content of a lightweight tag file is the SHA-1 of an existing commit, tree or blob object. The content of an annotated tag file is the SHA-1 of a tag object, which is stored in the .git\objects folder along with all other Git objects. To view the content of a tag object, leverage the same cat-file -p command. You’ll see the SHA-1 value of the object that was tagged, along with the object type, tag author, date-time and tag message. There are a number of ways to tag commits in Visual Studio. One way is to click the Create Tag link in the Commit Details window (Figure 4). Tag names appear in the Commit Details window (Figure 4, Marker 3) and in View History reports (see the earlier Figure 3, Marker 9).
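The difference between the two kinds of tag can be modeled with two dictionaries standing in for the .git\objects and .git\refs\tags folders. This is a simplified, in-memory sketch: the tag-object text layout ("object", "type", "tag" and "tagger" lines followed by the message) is my assumption, and the commit body, tagger and tag names are invented.

```python
import hashlib

objects = {}  # stand-in for .git\objects: ID -> (type, body)
tags = {}     # stand-in for .git\refs\tags: tag name -> ID held in the tag file


def store(obj_type, body):
    """Add an object to the in-memory store and return its ID."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    oid = hashlib.sha1(header + body).hexdigest()
    objects[oid] = (obj_type, body)
    return oid


def lightweight_tag(name, target_id):
    """A lightweight tag is only a file holding the target's ID."""
    tags[name] = target_id


def annotated_tag(name, target_id, tagger, message):
    """An annotated tag file holds the ID of a separate tag object."""
    target_type = objects[target_id][0]
    body = (f"object {target_id}\ntype {target_type}\ntag {name}\n"
            f"tagger {tagger}\n\n{message}\n").encode()
    tags[name] = store("tag", body)


def resolve_tag(name):
    """Follow a tag (through a tag object, if any) to what it names."""
    oid = tags[name]
    obj_type, body = objects[oid]
    while obj_type == "tag":
        oid = body.decode().split("\n", 1)[0].split(" ", 1)[1]
        obj_type, body = objects[oid]
    return obj_type, oid


commit = store("commit", b"placeholder commit body\n")  # hypothetical commit
lightweight_tag("v1.0-light", commit)
annotated_tag("v1.0", commit, "Jane Dev <jane@example.com> 1496275200 +0000",
              "First release")

print(tags["v1.0-light"] == commit)  # the tag file holds the commit ID itself
print(objects[tags["v1.0"]][0])      # the tag file holds a tag object's ID
print(resolve_tag("v1.0") == resolve_tag("v1.0-light"))
```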

Git populates the info and pack folders in the .git\objects folder when it applies storage optimizations to objects in the repo. I’ll discuss these folders and the Git file-storage optimizations more fully in an upcoming article.

Armed with knowledge about the four Git object types, realize that Git is referred to as a content-addressable file system because any kind of content across any number of files and folders can be reduced to a single SHA-1 value. That SHA-1 value can later be used to accurately and reliably recreate the same content. Put another way, the SHA-1 is the key and the content is the value in an exalted implementation of the usually prosaic key-index-driven lookup table. Additionally, Git can economize when file content hasn’t changed between commits because an unchanged file produces the same SHA-1 value. This means that the commit object can reference the same SHA-1 blob or tree ID value used by a previous commit without having to create any new objects—this means no new copies of files!
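The key/value idea behind a content-addressable store fits in a few lines of Python. This sketch is deliberately simpler than Git: it hashes the raw content only, keeps everything in a dictionary, and the class and method names are my own.

```python
import hashlib


class ContentStore:
    """A toy content-addressable store: the SHA-1 of the content is its key."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects.setdefault(key, content)  # storing twice adds nothing new
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


store = ContentStore()
key_v1 = store.put(b"Program.cs, first version\n")
key_again = store.put(b"Program.cs, first version\n")  # unchanged file
key_v2 = store.put(b"Program.cs, second version\n")

print(key_v1 == key_again)  # same content, same key
print(len(store))           # only two distinct contents are kept
print(store.get(key_v2))
```

Because the key is derived from the content, an unchanged file never produces a second copy, which is the economy described above.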

Branching

Before truly understanding what a Git branch is, you must master how Git internally defines a branch. Ultimately, this boils down to grasping the purpose of two key terms: head and HEAD.

The first, head (all lowercase), is a reference Git maintains for every new commit object. To illustrate how this works, Figure 8 shows several commits and branch operations. For Commit 01, Git creates the first head reference for the repo and names it master by default (master is an arbitrary name with no special meaning other than it’s a default name, and Git teams often rename this reference). When Git creates a new head reference, it creates a text file in its refs\heads folder and places the full SHA-1 for the new commit object into that file. For Commit 01, this means that Git creates a file called master and places the SHA-1 for commit object A1 into that file. For Commit 02, Git updates the master head file in the heads folder by removing the old SHA-1 value and replacing it with the full SHA-1 commit ID for A2. Git does the same thing for Commit 03: It updates the head file called master in the heads folder so that it holds the full commit ID for A3.

Figure 8 Two Heads Are Better Than One: Git Maintains Various Files in Its Heads Folder Along with a Single HEAD File

You might have guessed correctly that the file called master in the heads folder is the branch name for the commit object to which it points. Oddly, perhaps, at first, a branch name points to a single commit object rather than to a sequence of commits (more on this specific concept in a moment).

Observe the Create Branch & Checkout Files section in Figure 8. Here, the user created a new branch for a print-preview feature in Visual Studio. The user named the branch feat_print_preview, based it on master and checked the Checkout branch checkbox in Team Explorer’s New Local Branch From pane. Checking the checkbox tells Git that you want the new branch to become the current branch (I’ll explain this in a moment). Behind the scenes, Git creates a new head file in the heads folder called feat_print_preview and places the SHA-1 value for commit object A3 into it. This now means that two files exist in the heads folder: master and feat_print_preview, both of which point to A3.

At Commit 04, Git is faced with a decision: Normally, it would update the SHA-1 value for the file reference in the heads folder, but now it’s got two file references in that folder, so which one should it update? That’s what HEAD does. HEAD (all uppercase) is a single file in the root of the .git folder that points to a “head” (all lowercase) file in the heads folder. (Note that “HEAD” is actually a file that’s always named HEAD, whereas “head” files have no particular name.) The head file that HEAD points to contains the commit ID that will be assigned as the parent ID for the next commit object. In a practical sense, HEAD marks Git’s current location on the DAG, and there can be many heads but there’s always only one HEAD.

Going back to Figure 8, Commit 01 shows that HEAD points to the head file called master, which, in turn, points to A1 (that is, the master head file contains the SHA-1 for commit object A1). At Commit 02, Git doesn’t need to do anything with the HEAD file because HEAD already points to the file master. Ditto for Commit 03. However, in the Create and Check-Out New Branch step, the user created a branch and checked out the files for the branch by marking the Checkout branch checkbox. In response, Git updates HEAD so that it points to the head file called feat_print_preview rather than to master. (If the user hadn’t checked the Checkout branch checkbox, HEAD would continue to point to master.)

Armed with knowledge about HEAD, you now can see that Commit 04 no longer requires Git to make any decision: Git simply inspects the value of HEAD and sees that it points to the head file called feat_print_preview. It then knows that it must update the SHA-1 in the feat_print_preview head file so that it contains the commit ID for B1.

In the Checkout Branch step, the user accessed the Team Explorer branches pane, right-clicked the master branch and chose Checkout. In response, Git checks out the files for Commit A3 and updates the HEAD file so that it points to the head file called master.
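The whole sequence just described (three commits on master, a new checked-out branch, one commit on it, then a checkout of master) can be replayed with ordinary text files. The sketch below works in a temporary folder, uses the labels A1 through B1 in place of real 40-character commit IDs, and ignores the working directory entirely. The "ref: refs/heads/name" text written to HEAD is my assumption about that file’s format, and the class is hypothetical.

```python
import os
import tempfile


class TinyRefs:
    """Toy model of the HEAD file and the refs/heads folder."""

    def __init__(self, git_dir):
        self.git_dir = git_dir
        os.makedirs(os.path.join(git_dir, "refs", "heads"), exist_ok=True)
        self._write("HEAD", "ref: refs/heads/master\n")  # assumed HEAD format

    def _path(self, rel_path):
        return os.path.join(self.git_dir, *rel_path.split("/"))

    def _write(self, rel_path, text):
        with open(self._path(rel_path), "w") as f:
            f.write(text)

    def _read(self, rel_path):
        with open(self._path(rel_path)) as f:
            return f.read().strip()

    def current_head(self):
        """Return the head file that HEAD points to."""
        return self._read("HEAD")[len("ref: "):]

    def commit(self, commit_id):
        """Only the head file named by HEAD is updated; HEAD is untouched."""
        self._write(self.current_head(), commit_id + "\n")

    def create_branch(self, name, checkout=False):
        """Create one new head file holding the current commit ID."""
        self._write(f"refs/heads/{name}", self._read(self.current_head()) + "\n")
        if checkout:
            self.checkout(name)

    def checkout(self, name):
        """Switching branches rewrites the single HEAD file."""
        self._write("HEAD", f"ref: refs/heads/{name}\n")

    def branch_tip(self, name):
        return self._read(f"refs/heads/{name}")


refs = TinyRefs(tempfile.mkdtemp())
for label in ("A1", "A2", "A3"):      # Commits 01 through 03
    refs.commit(label)
refs.create_branch("feat_print_preview", checkout=True)
refs.commit("B1")                     # Commit 04
refs.checkout("master")               # the Checkout Branch step

print(refs.branch_tip("master"), refs.branch_tip("feat_print_preview"))
print(refs.current_head())
```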

At this point, it should be clear why branch operations in Git are so efficient and fast: Creating a new branch boils down to creating one text file (head) and updating another (HEAD). Switching branches involves updating only a single text file (HEAD) and then usually a small performance hit as files in the working directory are updated from the repo.

Notice that commit objects contain no branch information! In fact, branches are maintained by only the HEAD file and the various files in the heads folder that serve as references. Yet when developers using Git talk about being on a branch or they refer to a branch, they often are colloquially referring to the sequence of commit objects that originates with master or from a newly formed branch. The earlier Figure 2 shows what many developers would identify as three branches: A, B and C. Branch A follows the sequence A1 through A6. Branch activity at A4 produces two new branches: B1 and C1. So the sequence of commits that begins with B1 and continues to B3 can be referred to as Branch B while the sequence from C1 to C2 can be referred to as Branch C.

The takeaway here is not to forget the formal definition of a Git branch: It’s simply a pointer to a commit object. Moreover, Git maintains branch pointers for all branches (called heads) and a single branch pointer for the current branch (called HEAD).

In my next article I’ll explore details about checking out and checking in files from and to the repo, and the role of the index and how it constructs tree objects during each commit. I’ll also explore how Git optimizes storage and how merges and diffs work.


Jonathan Waldman is a Microsoft Certified Professional who has worked with Microsoft technologies since their inception and who specializes in software ergonomics. Waldman is a member of the Pluralsight technical team and he currently leads institutional and private-sector software-development projects. He can be reached at jonathan.waldman@live.com.

Thanks to the following Microsoft technical experts for reviewing this article: Kraig Brockschmidt and Ralph Squillace

