05 Aug 2013

40330572I have had the joy of working with some great teams and learning how they use Git, but I have also seen it misused and the number one issue I helped teams resolve when using Git is what I call, the Fat Repository. The fat repository (not to be confused with the funny & NSFW Fat Git Enterprises), is an especially easy trap to fall in for teams who have come from the client/server source control world (SVN, TFS, CVS etc…) because in their world, a fat repository isn’t a harmful thing and even may be a good thing – but in the DVCS world the fat repository is a major problem.

Fat Repository

A fat repository can be one of the following two characteristics, or it can have both of them. I think both of these characteristics will be familiar to most developers who have used some sort of source control any period of time.

Characteristic One: Multi-tasked Repository

The multi-tasked repository is where you have a single repository which has multiple root folders and each folder is for a different customer or project. This is very common with internal focused items. For example a consulting company may develop a set of common libraries that are used in a variety of projects and put all the common libraries code in a single repository, so they are easily accessible for everyone in the company. Another example would be a department in a company, which has all the projects that single department has developed in a single repository.

Many people, myself included have multiple Visual Studio projects inside a single repository, that is no problem – where it becomes a fat repository is where those items/folders/projects are not related to each other from a delivery perspective. Often those projects only are lumped together, because they share some common organisation relationship.

The reason this is called multi-tasked repository, is because the single repository is responsible for multiple unrelated deliverables.

Characteristic Two: Multi-focused repository

The repository where it has not fallen into the multi-tasked trap can still become fat. This often happens when a single repository is used to handle many requirements beyond code of a single deliverable. A common example of this characteristic is when you have a documents folder or database backup folder in your code repository. Here the core issue is that while everything is related to the software development, not everything relates to the building of the code.

This is named a multi-focused repository because the focus is split between specs, database backups, virtual machines and the code (for example). I find this issue hits not only repositories, but tools like DropBox where you get a folder shared because you need one document and you end up syncing 100’s of other documents, backups, install files and what-nots that you don’t care about too!

Fat repository – Summary

In short a fat repository is one where there is more going on in a repository than just code & just the code that is needed for a single project or deliverable.

There is a simple test to identify a fat repository just ask: Is everything in here vital my deliverables? If there are items that are not vital – you have a fat repository.

Client/Server and the fat repository

In the client/server world of source control, like SVN or TFS, this isn’t actually an issue and it is very common to find fat repositories. In fact I would say if you are using a client/server source control – having a fat repository is an advantage since there is a single place to go to get everything you need.

The reason this isn’t an issue in the client/server world, like we will find out in the DVCS world, is because of two key differences:

  1. In client/server you have only a working set, not the entire repo. So you initially need to pull a smaller set down to your machine, compared to pulling down the working set & all the history in DVCS.
  2. in client/server because you are working with a server, you can checkout a single sub-folder or even a single file without needing the rest of the files in the repository.

DVCS and the fat repository

The DVCS world of Mercurial & Git works completely differently to the client/server world. Firstly the repository is comparatively cheap to create – just type git init and viola, a repo. In client/server you need a server admin to create one, this relative cheapness encourages you to create loads of repositories – some may last & get a remote version on a server and some may live short lives just on your machine – that is perfectly fine.

Secondly, when you grab a repository for a team you not only get the one folder you want in a working set version – you get EVERYTHING. Every folder, every file & ALL the history for it.

This is where it becomes painful to deal with a Multi-tasked repository or a mutli-focused repository because you are forced to pull down content that you do not care about.

Issues a fat repository causes

40330634Initial check out times that are ridiculously long. Most of the time this is a once off pain but it is still a pain. I say most of the time, because I went to help a team that had each deliverable in a separate multi-focused repository and while the working sets were small and the team members had no issue with the one long initial check outs – the senior developers that floated between the teams to help out found that pulling the repositories down was ruining their productivity, especially if they were just trying to help with a single bug. This teams repositories were so huge that most people couldn’t keep more than two around, even though the working sets were less than a gig.

Disk space usage. You developers want a SSD and while it will make them more productive it also means that space is a premium or you will be forking out a lot for those massive SSDs. With DVCS, they get everything in the repository including all the history. Check in a 1Gb database backup, then delete it… in the client/server world the working set wouldn’t be impacted, but in the DVCS world everyone will pull that 1Gb file down initially and that means a lot more disk space usage.

Solution: The Lightweight repository

So when using a DVCS, the solution is simple:

  1. Keep the repository as lightweight as possible – only check in the things you NEED. This means taking care about what is checked in & using functionality like .gitignore & using staging to ensure you only check in what you should.
  2. Since DVCS repositories are cheap, have loads of them – have one for documentation, have one for code, have one for artwork etc… That way those who need something can cherry pick the repositories they need.
  3. Avoid the Law of the instrument – just because you can put things in repo’s doesn’t mean you should. Do you need the history on every DB backup, not likely – a file share or DropBox maybe enough for those! Or why not use a tool that is better suited to Word documents, with the ability to index their contents so that they can be searched, for example SharePoint. TFS does this with the full version – where every repository gets an SharePoint site to put the documents in.

In short a happy repository is a lightweight repository!


 Git: A happy repository is a lightweight repository's picture

[...] at Aphelion.bi with permission of Robert MacLean, author. (source) Share [...]

Jason Stangroome's picture

Hi Robert,

When you separate code from documentation and database backups, how do you like to reliably handle cross-repository references?

For example, if your integration test harness restores a database backup before the test run, how do you locate the backup file if it is not in the same repository? Do you prefer Git Submodules, well-known paths, or some other approach?

What are your thoughts on this when you don't control the development environment, ie open source. I personally don't like mandating that people must use "c:\source\foo" to be able to build and test the Foo project.



Robert MacLean's picture

"how do you like to reliably handle cross-repository references?"
Let's use cross-system references rather, since those they may or may not be in a repository. There is a number of ways to do it. For example if you put a document in say Google Drive, then a simple link file to that would help guide people to it.

"how do you locate the backup file if it is not in the same repository?"
I see this is a different issue to the first point you raised because the goal of the first is to find the documentation (discoverability) but this is about how do we have our tools work with this. You could easily have the build tool pull from multiple repo's & then if you are doing automation the file name should be the same each time. I know that doesn't cover every scenario but it is an option. As you perfectly pointed out, well known paths (both file system style & HTTP style) and sub modules provide alternatives which could solve different problems.

Important here is to also identify that sometimes having a database script in your code repo is the right thing because it is needed. DB backups I would avoid, because they are binary and thus do not diff or compress well but if you scripted the DB (including all the data) that could be a valid thing to have. There is no perfect scenario - that is why you need to ask yourself, is everything here vital to my deliverables? If your deliverable includes passing integration then a DB script could be vital.

"What are your thoughts on this when you don't control the development environment, ie open source. I personally don't like mandating that people must use "c:\source\foo" to be able to build and test the Foo project."

I agree fully with you. This isn't just an open source issue, it happens in companies a lot too. Person X setup the project & now everyone else must use his same folder structure. It feels very incorrect for me when I see those situations.

 GitHub: why it is right for open source and wrong for you's picture

[...] you apply the recommendation I mentioned in yesterday’s post about lightweight repositories to this model (with the free unlimited public repositories), it makes even more sense for open [...]

Add new comment