Code Archeology: How to learn from your version control history

Regardless of which software you use for version control, it keeps track of every change made during the lifetime of your code base, which makes it a gold mine of information. Still, it is often heavily underutilized: most of us only scan the last few commit messages to see what changed recently, or to find out who was responsible for a certain change.


By spending just a little more time analyzing your version control history, you can uncover interesting patterns and relationships in your code base, and sometimes even learn a thing or two about the interactions and culture in your team. Having this information available also lets you make more informed decisions about where to spend your time and effort, for example when it comes to refactoring. In this post, I will give some ideas on where to get started and some useful metrics to look for.

Getting started 

Before we can calculate any interesting metrics, we need to extract the raw data from our version control system. Luckily, this is the most straightforward part of the exercise, as all version control systems come with some kind of utility command for this purpose. Using Git as an example, we can run the following command to get information about all changes made during the lifetime of our project:

> git log --pretty=format:'%h %aN %ad %s' --date=short --numstat
46bab6b John Doe 2017-10-26 Added updated translations
10 16 app/translations/fi/LC_MESSAGES/messages.po

5af16c7 Jane Doe 2017-10-26 Minor clarification around build steps
3 0

6b22f7b John Doe 2017-10-26 Fix for PRJ-4711
2 1 app/
1 1 app/
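
To work with this output programmatically, it first has to be parsed into records. The following is a minimal Python sketch, assuming each commit header is a single line shaped like the sample above (hash, author, date, subject) followed by its `--numstat` lines. The names `Commit` and `parse_log` and the `app/example_*.py` paths are made up for illustration.

```python
import re
from dataclasses import dataclass, field

# Header shape assumed from the sample output: "<hash> <author> <date> <subject>".
HEADER = re.compile(r"^([0-9a-f]{7,40}) (.+?) (\d{4}-\d{2}-\d{2}) (.*)$")
# --numstat lines: "<added> <deleted> <path>".
NUMSTAT = re.compile(r"^(\d+|-)\s+(\d+|-)\s+(\S.*)$")


@dataclass
class Commit:
    sha: str
    author: str
    date: str
    subject: str
    files: list = field(default_factory=list)  # (added, deleted, path) tuples


def _count(value):
    # Binary files report "-" instead of line counts; treat that as 0.
    return 0 if value == "-" else int(value)


def parse_log(text):
    """Turn `git log --numstat` output into a list of Commit records."""
    commits = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            commits.append(Commit(*header.groups()))
            continue
        stat = NUMSTAT.match(line)
        if stat and commits:
            added, deleted, path = stat.groups()
            commits[-1].files.append((_count(added), _count(deleted), path))
    return commits


# Hypothetical sample data; the example_*.py paths are invented.
SAMPLE = """\
46bab6b John Doe 2017-10-26 Added updated translations
10\t16\tapp/translations/fi/LC_MESSAGES/messages.po

6b22f7b Jane Doe 2017-10-26 Fix for PRJ-4711
2\t1\tapp/example_a.py
1\t1\tapp/example_b.py
"""

commits = parse_log(SAMPLE)
for c in commits:
    print(c.sha, c.author, c.date, [path for _, _, path in c.files])
```

Author names that themselves contain something that looks like a date would confuse the header pattern; a delimiter that cannot occur in names would be more robust for real use.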

Now that we have some data to work with, let’s get started with the actual metrics.

File change frequency  

One of the most straightforward, and at the same time most useful, metrics to calculate from our version control history is the file change frequency. By stepping through the commit log one revision at a time and incrementing a per-file counter each time a file occurs, we get a list of every file in the code base together with the number of times it was changed. If we then sort the files by number of changes and plot the result as a bar chart, we will most likely see a change distribution similar to the one below.

[Figure: File change frequency]

From the hockey-stick shape of this graph, we can see that the change frequency seems to follow an exponential distribution. In other words, while the vast majority of the files in this particular project change very infrequently, a few files change pretty much all the time. This gives valuable insight when deciding where to spend test automation or refactoring effort: by prioritizing the files that change most often, we get a bigger impact from such initiatives. It also provides a strategy for refactoring a large legacy code base, something that can often feel intimidating and tedious. Using this metric as guidance, we can focus on, say, the top ten or twenty percent of the files and still make a big impact.
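
The counting step described in this section can be sketched in a few lines of Python. This is only an illustration: the input is a hypothetical list of changed file paths per commit, and `change_frequency` is a made-up name.

```python
from collections import Counter

# Hypothetical input: one list of changed file paths per commit.
commits = [
    ["app/views.py", "app/models.py"],
    ["app/views.py"],
    ["app/views.py", "README.md"],
    ["app/models.py"],
]


def change_frequency(commits):
    """Count in how many commits each file appears."""
    counts = Counter()
    for files in commits:
        counts.update(set(files))  # set(): count a file once per commit
    return counts


frequency = change_frequency(commits)
# most_common() sorts by descending count, ready to be plotted as a bar chart.
for path, changes in frequency.most_common():
    print(f"{changes:3d}  {path}")
```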

Code quality

To narrow down the set of files to focus on even further, let’s add another metric to the mix. By calculating some kind of code quality metric, such as cyclomatic complexity, for each file and plotting the values as a scatter plot with the number of changes on one axis and the complexity on the other, we get something like this:

[Figure: Code quality]

Here, files in the top right quadrant are the ones to focus on. These files have a high change rate as well as a high complexity score, indicating that these would benefit the most from refactoring or test automation efforts.
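
One possible way to pick out that quadrant in code is sketched below. It assumes the change counts and complexity scores have already been computed by other means, and it uses the median of each metric as the quadrant boundary, which is an arbitrary choice for this example. All file names and numbers are invented.

```python
from statistics import median

# Hypothetical per-file metrics: (number of changes, complexity score).
metrics = {
    "app/views.py": (42, 44),
    "app/models.py": (30, 8),
    "app/utils.py": (3, 40),
    "app/forms.py": (5, 6),
    "app/billing.py": (38, 51),
}


def hotspots(metrics):
    """Return files above the median on both axes (the top right quadrant)."""
    change_cut = median(changes for changes, _ in metrics.values())
    complexity_cut = median(score for _, score in metrics.values())
    return sorted(
        path
        for path, (changes, score) in metrics.items()
        if changes > change_cut and score > complexity_cut
    )


print(hotspots(metrics))
```

Other cut-offs, such as a fixed percentile or a threshold agreed on by the team, would work just as well; the point is to combine both metrics before deciding where to start.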

Code ownership 

Next, let’s see what the code ownership distribution in our project looks like. Do we have a shared ownership, where many people know most parts of the code base or do people tend to work in isolation and specialize in different areas? 

By calculating the number of unique contributors to each file, we can plot a pie chart showing what proportion of the files in our code base has one author, two authors, and so on.

In this example, we see that a fairly large proportion of files have just one or two authors in total, which could be a red flag, indicating that we need to do more knowledge sharing or pair programming. However, this doesn’t necessarily tell us the entire story. To get a more detailed view, we can plot the number of revisions on one axis and the number of unique authors on the other axis for each file.

[Figure: Code ownership]

Here, we see that things may not be as bad as the first picture suggested. Most of the single-author files seem to have relatively few changes, perhaps a feature implemented once and then never or rarely revisited. This also correlates well with the exponential change frequency graph we saw in the first example.

However, there are some files with just a few authors but with a fairly large number of commits, which indicates that they are being actively developed. In this case, it certainly seems like there may be isolated clusters of knowledge in our team that we should be on the lookout for and perhaps schedule some knowledge sharing or rotation.
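
The two numbers behind this plot, revisions and unique authors per file, can be derived from the same commit data. Here is a small sketch, assuming a hypothetical list of (author, file) pairs. The "at least three revisions" threshold used to flag files is arbitrary and only there for illustration.

```python
from collections import defaultdict

# Hypothetical input: one (author, file) pair per file touched in a commit.
changes = [
    ("John Doe", "app/views.py"),
    ("Jane Doe", "app/views.py"),
    ("John Doe", "app/views.py"),
    ("Jane Doe", "app/models.py"),
    ("Jane Doe", "app/models.py"),
    ("Jane Doe", "app/models.py"),
    ("John Doe", "README.md"),
]


def ownership(changes):
    """Map each file to (number of revisions, number of unique authors)."""
    revisions = defaultdict(int)
    authors = defaultdict(set)
    for author, path in changes:
        revisions[path] += 1
        authors[path].add(author)
    return {path: (revisions[path], len(authors[path])) for path in revisions}


stats = ownership(changes)
# Flag actively changed files with a single author; the threshold is arbitrary.
at_risk = [
    path
    for path, (revs, n_authors) in stats.items()
    if n_authors == 1 and revs >= 3
]
print(at_risk)
```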

Building on this, it is straightforward to calculate which developers on the team hold the most isolated knowledge. This could have perfectly natural reasons, for example that these developers have been on the project for a long time or indeed have some very specialized knowledge. Still, if one of them were hit by a bus tomorrow, you would certainly lose a lot of knowledge.

In some cases, the knowledge may already be lost. By calculating the set of single-author files for developers who are no longer part of the project or the company, you get a measure of the knowledge loss in your project.
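
Both ideas, isolated knowledge per developer and knowledge already lost, can be sketched as follows. The input mapping, the developer names and the `departed` set are hypothetical, and expressing the loss as a share of all files is just one possible measure.

```python
from collections import defaultdict

# Hypothetical input: unique authors per file, plus who has left the team.
authors_by_file = {
    "app/views.py": {"John Doe", "Jane Doe"},
    "app/models.py": {"Jane Doe"},
    "app/billing.py": {"Alex Roe"},
    "app/reports.py": {"Alex Roe"},
}
departed = {"Alex Roe"}


def sole_owner_files(authors_by_file):
    """Group files that have exactly one author by that author."""
    owned = defaultdict(list)
    for path, authors in authors_by_file.items():
        if len(authors) == 1:
            (owner,) = authors
            owned[owner].append(path)
    return dict(owned)


owned = sole_owner_files(authors_by_file)
# Files whose only author has left: a rough measure of lost knowledge.
lost = sorted(path for dev in departed for path in owned.get(dev, []))
loss_ratio = len(lost) / len(authors_by_file)
print(lost, f"{loss_ratio:.0%}")
```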

Knowledge distribution

The number of unique authors per file is a good starting point for identifying single points of knowledge, but it doesn’t give us the full picture. By zooming in on the commit distribution within each file or package/module, we can get a more detailed view.

[Figure: Knowledge distribution]

In this plot, we count the number of commits by each developer and group them per module to see who is most active within each module. We can see that many modules seem to have fairly evenly spread activity, but again there are outliers. In modules H and I, there are several contributing developers, but only one of them accounts for the vast majority of the commits.
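
A sketch of the underlying calculation: count commits per developer within each module, then look at how large a share the most active developer has. The (author, module) pairs are hypothetical, and how files map to modules is left out here.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (author, module) pair per commit touching that module.
commits = [
    ("John Doe", "A"), ("Jane Doe", "A"), ("Alex Roe", "A"),
    ("John Doe", "H"), ("John Doe", "H"), ("John Doe", "H"),
    ("John Doe", "H"), ("Jane Doe", "H"),
]


def main_contributors(commits):
    """Per module: the most active developer and their share of the commits."""
    per_module = defaultdict(Counter)
    for author, module in commits:
        per_module[module][author] += 1
    result = {}
    for module, counts in per_module.items():
        author, top = counts.most_common(1)[0]
        result[module] = (author, top / sum(counts.values()))
    return result


shares = main_contributors(commits)
for module, (author, share) in sorted(shares.items()):
    print(f"{module}: {author} has {share:.0%} of the commits")
```

A share close to one divided by the number of contributors suggests evenly spread activity, while a share close to 100% points to a single point of knowledge.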

Developer similarity 

We can also measure the similarity between developers in terms of which files they work on. Using a formula like the Jaccard index, we can compare the sets of files worked on by two developers and calculate how similar they are. This can give us insight into which developers usually work in the same area of the code base and which ones tend to keep to themselves.
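
The Jaccard index of two sets is the size of their intersection divided by the size of their union, so it ranges from 0 (no files in common) to 1 (exactly the same files). Here is a minimal sketch with hypothetical developers and file sets. Returning 0.0 for two empty sets is a convention chosen for this example.

```python
from itertools import combinations

# Hypothetical input: the set of files each developer has changed.
files_by_dev = {
    "John Doe": {"app/views.py", "app/models.py", "README.md"},
    "Jane Doe": {"app/views.py", "app/models.py"},
    "Alex Roe": {"app/billing.py"},
}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Compare every pair of developers once.
for dev_a, dev_b in combinations(sorted(files_by_dev), 2):
    score = jaccard(files_by_dev[dev_a], files_by_dev[dev_b])
    print(f"{dev_a} / {dev_b}: {score:.2f}")
```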

Other systems 

We are of course not limited to the data in our version control history. By fetching data from other systems that contain relevant information and cross-matching it with the commit data, we can gain some really powerful insights. For example, it is common for commit messages to include a reference to an issue in the project’s issue tracker, such as Trello or Jira. With a regular expression, we can easily extract this reference from the commit message and query the issue tracker to see if the commit refers to a bug or a feature. With this information in place, it is possible to calculate which files have the highest bug frequency (don’t be surprised if it follows an exponential distribution), which could again indicate a need for refactoring, test automation, pair programming or just more careful code reviews for features involving those files.
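
As an illustration, the sketch below extracts issue references from commit messages and counts bug-related commits per file. It assumes Jira-style keys such as the `PRJ-4711` seen in the sample log. The `ISSUE_TYPES` dictionary is a hypothetical stand-in for a real query against the issue tracker, and the commits and issue types are invented.

```python
import re
from collections import Counter

# Jira-style key such as "PRJ-4711"; other trackers need another pattern.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

# Hypothetical stand-in for querying the issue tracker.
ISSUE_TYPES = {"PRJ-4711": "bug", "PRJ-4712": "feature", "PRJ-4720": "bug"}

# Hypothetical input: (commit message, changed files).
commits = [
    ("Fix for PRJ-4711", ["app/views.py", "app/models.py"]),
    ("PRJ-4712 Add export", ["app/export.py"]),
    ("Minor clarification around build steps", ["README.md"]),
    ("PRJ-4720: handle empty input", ["app/views.py"]),
]


def bug_frequency(commits, issue_types):
    """Count, per file, the commits that reference an issue typed as a bug."""
    bugs = Counter()
    for message, files in commits:
        keys = ISSUE_KEY.findall(message)
        if any(issue_types.get(key) == "bug" for key in keys):
            bugs.update(set(files))
    return bugs


bugs = bug_frequency(commits, ISSUE_TYPES)
for path, count in bugs.most_common():
    print(f"{count:3d}  {path}")
```

Commits without an issue reference are simply ignored here, so the result is only as good as the team's discipline in tagging commits.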

Closing words

Hopefully, this has given you some ideas and starting points on where to look for interesting metrics in your own version control history. To dig even deeper, you can calculate more fine-grained metrics by looking at changes on the class and method level instead of on the file or module level. Exploring relationships, dependencies and change rates between different components in your code base, for example frontend vs backend or test code vs production code, is also something that could yield interesting insights.

So, now it is time you go ahead and explore your own code base. It is both fun and easy and I can almost guarantee that you will learn something useful along the way!

Niklas Sundbaum is ReachMee’s VP Engineering. He has extensive experience in development as a programmer, consultant, CTO and architect. If you have any questions regarding this post or about everyday life in Niklas’ dev team, don’t hesitate to contact him at

If you want to hear Niklas talk about tech irl you can always follow our Meetup-group TeachMee to know when our next Meetup is held and join us then.


Published 2018-02-02

Topics: Tech

