Understanding Git: basic concepts of Git and recommendations for use

Git has become an essential part of software development. For many developers, a strong GitHub profile is as useful as a good resume, or even more so. But Git is not GitHub. While GitHub is an excellent Git hosting service, Git itself is a wonderful piece of engineering whose usefulness goes well beyond the hype.

Git is a version control system used to manage source code and similar files. Wikipedia describes Git as:

Git is a distributed, version-control system for tracking changes in source code during software development.

Why version control?

A popular belief about Git and other version control systems (VCS) is that they are needed mainly by large teams and open source projects to enable collaboration. While collaboration is indeed an important use, a VCS has a more fundamental one: keeping track of how the code evolves.

Why keep track of the evolution of code? Medium to large software projects are written over a long period. Over time, even the developers themselves tend to forget the reasons for decisions taken earlier. Code comments help, but they are seldom effective at explaining how the code changed over time. Also, a developer often needs to roll back to, or at least inspect, the state of the code at an earlier point in time.

A version control system allows a developer to check in code files, often on reaching a milestone, so that the check-in acts as a checkpoint for later reference or rollback. Check-ins are also called commits, and each usually carries a comment that helps to reconstruct, later on, what the developer was thinking at the time. Commits also double as a code backup and as a reference for other people while collaborating. Many VCS solutions also allow different commits to be compared.

What makes Git so popular?

Git is definitely not the only VCS solution out there, nor was it the first. There are others such as CVS and SVN. However, Git has become the most popular one so far because of its distinctive way of handling versioning. The primary distinguishing factor, in my opinion, is that Git is a distributed version control system. Other factors such as performance, integrity, flexibility and security have also contributed to the success of Git. We will get into the details of some of these while explaining the Git terminology.

Git terminology

[ The 4 stages in Git. Image source: GitLab docs ]

Remote and local repositories

Git is a distributed version control system. The code repository is often stored at a central location, and a copy of the repository (or a smaller part of it) is stored on the machine of each collaborator. Here, the repository at the central location is called a remote repository and the one on the individual developer's computer is called a local repository.

A remote repository can be cloned (Git's term for what other systems call a checkout) onto several computers, where the code is updated and pushed back to the server. Similarly, a local repository can point to multiple remote repositories. This capability is one of the things that makes Git effective in a number of scenarios.

Working-copy

Git stores a copy of the entire repository (or of a few branches) on the developer's computer. Such a repository holds the current state of the code, the changes made to it over time, commit comments and many other things. This data is kept in a hidden directory, in a compressed form that is not meant to be read directly.

However, the current state of the code that the developer is working on is stored as ordinary, readable files, and this is called the working copy. The working copy is usually the directory into which the code is checked out. The rest of the data that Git maintains is stored in a .git directory inside that directory.
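To make the relationship between the working copy and the .git directory concrete, here is a minimal Python sketch of how a tool could locate the repository that a file belongs to: it walks up from a starting directory until it finds one that contains .git. The function name find_repo_root is my own, and the sketch ignores details that real Git handles (for example, .git can also be a plain file in some setups).

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `.git`."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        # `.git` is normally a directory; existence is enough for this sketch.
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    # Prints the working copy's top-level directory, or None outside a repository.
    print(find_repo_root(Path.cwd()))
```

Everything outside .git in the directory returned is the working copy; everything inside it is Git's own bookkeeping.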

Commit

Conceptually, Git records a repository as a series of snapshots rather than as a chain of deltas. Each commit captures the state of every tracked file at that point, and files that have not changed are simply reused from the previous snapshot. (Git may store objects as deltas internally when it packs them to save space, but that is a storage optimisation and not the model the developer works with.) When a commit is checked out, Git restores the files from that commit's snapshot. A commit is the mechanism by which the developer records changes to the code in the repository.

A commit also carries a commit message, the date and time, author information (such as name and email) and so on. In Git, every commit is identified by a hash (SHA-1 by default) computed from its contents. Commits can later be viewed chronologically and filtered. A commit is immutable: changing anything in it produces a new commit with a different hash. Git relies on this property to maintain the integrity of the history, so commits that have already been shared should not be rewritten.
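The idea of identifying content by its hash can be tried out in a few lines of Python. The sketch below computes the ID that Git assigns to a file's content (a "blob" object) in a default SHA-1 repository: the SHA-1 of a short header followed by the bytes of the file. Commits are hashed in the same manner over their own content (the snapshot, parent commits, author and message); I use a blob here only because its format is the simplest. The function name git_blob_id is my own.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a file with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    first = git_blob_id(b"hello world\n")
    second = git_blob_id(b"hello world\n")
    changed = git_blob_id(b"hello world!\n")
    print(first == second)   # True: identical content gives an identical ID
    print(first == changed)  # False: any change gives a different ID
```

Because the ID is derived from the content, an object cannot be altered without its ID changing, which is what lets Git detect corruption or tampering.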

Staging area

Git maintains an intermediate level between the working copy and the local repository, called the staging area or the index. Often the developer is working on several files at the same time, and not all of them are ready to be committed. As mentioned earlier, a commit usually represents a state or a milestone and carries a commit message such as "Updated feature A". If that update consists of changes to two files, it doesn't make sense to commit them separately. At the same time, the developer cannot simply discard the changes in a third file that is not yet ready for a commit.

This is where the staging area, or index, comes in handy. The developer can add the two files involved in the update to the staging area, and then make a commit from the staging area with an appropriate commit message. The staging area also makes it possible to review the changes before a commit is made.
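The scenario above can be mimicked with a toy model in Python. This is not how Git is implemented; the class ToyRepo and the file names are made up purely to show that a commit records what is in the staging area, not whatever happens to be in the working copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyRepo:
    """Toy model of the working copy, staging area and commit history."""

    working_copy: Dict[str, str] = field(default_factory=dict)
    index: Dict[str, str] = field(default_factory=dict)
    commits: List[dict] = field(default_factory=list)

    def add(self, path: str) -> None:
        # Staging copies the file's current content into the index.
        self.index[path] = self.working_copy[path]

    def commit(self, message: str) -> dict:
        # A commit records the content of the index, not the working copy.
        snapshot = {"message": message, "files": dict(self.index)}
        self.commits.append(snapshot)
        return snapshot


repo = ToyRepo()
repo.working_copy = {
    "feature_a.py": "v2",
    "feature_a_test.py": "v2",
    "experiment.py": "half-finished",
}
repo.add("feature_a.py")
repo.add("feature_a_test.py")
latest = repo.commit("Updated feature A")
print(sorted(latest["files"]))  # ['feature_a.py', 'feature_a_test.py']
```

The half-finished third file stays in the working copy, untouched and uncommitted, until the developer decides to stage it.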

Branches

Think of branches like the branches of a family tree. Code evolves over time and may need to diverge at many points in this evolution. Take the example of software that has to be supported continuously after being delivered to a customer, but that also has to evolve further to gain new capabilities. Both activities require changes to the code files, but they cannot be carried out on the same line of development at the same time. In such cases a new branch is spawned, and active development continues in both branches until a point at which they can be merged.

[ Git branches overview. Image source: GitLab docs ]

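Git creates a default branch when a repository is initialised. It has traditionally been named master (newer versions of Git let you configure the name, and many hosting services now default to main), and by convention it is used as the main branch. Branching is exceptionally efficient in Git and is a significant reason for Git's popularity.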

A branch in Git is just a lightweight pointer to a commit, so a developer can create branches and switch between them rapidly and easily. Switching branches does not require the entire code of the branch to be checked out afresh; Git moves a pointer and updates only the files in the working copy that differ. The latest commit on a branch is called the head (or tip) of that branch, and the special reference HEAD identifies the branch or commit that is currently checked out.
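The following Python sketch is a toy model of why branching is cheap. It assumes nothing about Git's internals beyond the idea that a branch is a name pointing at a commit; the class ToyHistory and the commit IDs c1, c2 and c3 are invented for illustration.

```python
from typing import Dict, Optional


class ToyHistory:
    """Toy model: branches are names that point at commits."""

    def __init__(self) -> None:
        # commit id -> parent commit id (None for the first commit)
        self.parents: Dict[str, Optional[str]] = {"c1": None, "c2": "c1"}
        # branch name -> commit id at the tip of that branch
        self.branches: Dict[str, str] = {"master": "c2"}
        # HEAD names the branch that is currently checked out.
        self.head = "master"

    def create_branch(self, name: str) -> None:
        # A new branch is one more pointer to the current commit.
        self.branches[name] = self.branches[self.head]

    def switch(self, name: str) -> None:
        if name not in self.branches:
            raise KeyError(f"unknown branch: {name}")
        self.head = name

    def commit(self, commit_id: str) -> None:
        # A new commit advances only the branch that HEAD refers to.
        self.parents[commit_id] = self.branches[self.head]
        self.branches[self.head] = commit_id


history = ToyHistory()
history.create_branch("feature")
history.switch("feature")
history.commit("c3")
print(history.branches)  # {'master': 'c2', 'feature': 'c3'}
```

Creating the feature branch copied no files and no commits. Only a new name was added, and later only that name moved forward while master stayed where it was.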

Push, Pull and Fetch

Once changes are committed to a branch, they still exist only in the local repository. They are not reflected in the remote repository until they are pushed to it. Similarly, the changes that others make and push to the remote repository are not received locally until they are pulled.

Often it is necessary to download the changes from the remote repository without applying them to the local code. This usually happens when the current code needs to be compared with that of the remote repository, or when a copy is wanted for comparing or merging later without remote connectivity. In such scenarios, the changes from the remote are fetched into the local repository, but they are not applied to the working copy until the developer chooses to do so.
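The difference between fetch, pull and push can be shown with another toy model in Python. It is a deliberately simplified sketch: histories are plain lists of made-up commit IDs, ToyClone is a hypothetical class, and the only kind of merge it knows is a fast-forward. The point it demonstrates is real, though: a pull is a fetch followed by integrating the fetched commits into the local branch.

```python
class ToyClone:
    """Toy model of a local repository tracking one remote branch."""

    def __init__(self, remote_history: list) -> None:
        self.remote = remote_history               # stands in for the server
        self.local_branch = list(remote_history)   # e.g. master
        self.tracking = list(remote_history)       # e.g. origin/master

    def fetch(self) -> None:
        # Download new commits; only the remote-tracking copy changes.
        self.tracking = list(self.remote)

    def pull(self) -> None:
        # Pull is a fetch followed by integrating the fetched commits.
        self.fetch()
        if self.tracking[: len(self.local_branch)] == self.local_branch:
            self.local_branch = list(self.tracking)  # fast-forward
        else:
            raise RuntimeError("histories diverged; a real merge is needed")

    def push(self) -> None:
        # Push publishes local commits if the remote has nothing newer.
        if self.local_branch[: len(self.remote)] != self.remote:
            raise RuntimeError("remote has new commits; pull first")
        self.remote[:] = self.local_branch
        self.tracking = list(self.local_branch)


server = ["c1", "c2"]
clone = ToyClone(server)
server.append("c3")  # a colleague pushes a new commit

clone.fetch()
after_fetch = list(clone.local_branch)
clone.pull()
after_pull = list(clone.local_branch)
print(after_fetch, after_pull)  # ['c1', 'c2'] ['c1', 'c2', 'c3']
```

After the fetch, the new commit is available locally for comparison, but the developer's own branch has not moved. Only the pull brings it in.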

Recommendations for using Git

When it comes to Git, usage is one area where even expert programmers tend to err. Moving to Git from conventional versioning systems like CVS and SVN takes not only a technological shift but a shift in mindset as well, and harnessing Git's full power often requires a different way of thinking altogether.

There are several published strategies for using Git, some of them laid out by hosting service providers. Notable among these are Git flow, GitHub flow and GitLab flow. These are in-depth philosophies on how to make use of a powerful technology rather than mere user manuals.

I find Git flow hard and cumbersome to use, and often unproductive regardless of team or project size. GitHub flow is usually good for small projects and medium-sized open-source projects, while GitLab flow is better suited to large projects and large teams. However, each has its own pros and cons, and I suggest having a look at all of them if possible.

Listed below are a few best practices that I have learnt over the years of working with Git. They are mainly based on my experiences with the GitHub and GitLab flows, and cover certain aspects that often do not get enough attention when using Git.

  • Get familiar with the Git command-line. This is very important. Although many IDEs come with their own visual implementation of Git, many are known to cause unwanted issues. For example, the Eclipse EGit plugin is notorious in my opinion and almost always runs into problems. I find the built-in Git features of Visual Studio Code reasonably good, but it is still better to be familiar with the Git command-line, which will definitely save a lot of trouble one day.
  • Commit (and push) as frequently as possible. This may seem odd, as many are accustomed to working for hours or days and committing only once the work is done. This is problematic when multiple people are working together. When the changes are small (due to frequent commits), it is often easier to resolve merge conflicts, and most of the time Git does the work by itself. Also, having a regular backup is an added advantage.
  • Avoid committing anything that breaks existing functionality. While it is good to commit frequently, do not commit any code that would break existing functionality. Committing incomplete features is fine, but make sure that the commit doesn't break anything that was working earlier. This is particularly important when many people are working on the same branch.
  • Utilise feature branches. In large projects, new features may take considerable time and effort to complete and may involve several people. In such cases, it is always better to create a feature-specific branch. This saves time and effort by deferring the handling of conflicts to a single merge. If commits were proper and frequent, even that merge won't be much trouble. However, in small projects it might be okay to commit directly to master, as in the GitHub flow.
  • Break large features into smaller milestones. One thing I would recommend is to keep feature branches alive for as short a time as possible. Keeping a branch alive for longer invites unwanted conflicts. Hence, it is beneficial to break big features into small milestones and either create a separate branch for each milestone or keep merging the feature branch into master after every milestone.
  • Frequently merge master into long-lived branches. In some rare situations, long-lived branches can be hard to avoid or to break down further. If such a scenario arises, merge master into the branch regularly so that the branch stays in sync with master and serious conflicts at the eventual merge are minimised.
  • Write proper commit comments. When committing code, avoid comments such as "updated file1" or "updated configurations". These can be inferred from the commit details anyway. Write comments such that you or someone else will be able to understand the purpose of the change and why it was made.

These are just a few of the recommendations for use and the applicability may vary depending on the specific scenario. However, certain aspects like getting familiar with the Git command-line interface and the concept of frequent commits and appropriate comments are more or less always applicable.

Conclusion

Git is too wide a topic to be covered in a single article like this one. The general intention of this article is to provide some insight into the basic concepts of Git and to help avoid common mistakes. There are many more advanced concepts like rebase, blame and revert that help in handling various situations, as well as provider-specific concepts like forking, pull requests and merge requests. To understand more about forking and pull/merge requests, you can read my article Git: Clone, Fork, Pull-request and Merge-request explained.

As I mentioned earlier, getting familiar with the Git command-line interface is instrumental in taming Git. Most of the commands and their usage can be learnt from this beautiful TutorialsPoint tutorial. It is also a good idea to read further on Git philosophies in this article from the GitLab docs.
