Fun with git and backdating commits


One thing that's somewhat surprising to a newcomer to git is how unprotected commit metadata is. In fact, it's worse than that: the default tools make it straight up convenient to tamper with metadata!

For example, to change the author date of the last commit:

git commit --amend --date '1970-02-03'

The reason for this is quite simple: the security of metadata lies not in how hard it is to change on the tamperer's own device, but in how hard it is to change on the authoritative entity's device and in the chain of trust from there to you (and if either of those is untrustworthy, you're screwed any way you slice it).

Technical details

That being said, changing the date of other commits is harder: since the hash of later commits depends on that of earlier ones, git would complain if we were to just rewrite a commit without adjusting its children's hashes. One way to do this somewhat easily is with interactive rebase: get a hold of the hash of the commit to modify, run git rebase -i ${COMMIT}~ (the tilde gets the parent [1], because rebase takes the commit to start after), mark the commit in question as edit in the todo list, then use git commit --amend --date to change its date and resume with git rebase --continue. Then, as usual, rebase will rewrite the hashes of all child commits before completing.

[1] Well, the first parent in the case of a merge.
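The rebase steps above can be sketched as follows, in a throwaway repository. The repository, file name, identity and date are made up for illustration. GIT_SEQUENCE_EDITOR is set to a sed one-liner (GNU sed assumed) only so that the sketch runs unattended; by hand, you would change pick to edit in the editor that rebase opens.

```shell
#!/bin/sh
# Sketch: backdate an older commit with interactive rebase, in a scratch repo.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Example"               # placeholder identity
git config user.email "example@example.org"

# Three commits; the middle one is the one we will backdate.
for n in 1 2 3; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done
COMMIT=$(git rev-parse HEAD~1)

# The first line of the todo list is the commit to modify: pick -> edit.
GIT_SEQUENCE_EDITOR="sed -i -e '1s/^pick/edit/'" git rebase -i "${COMMIT}~"

# Rebase is now stopped on that commit: change its author date, then resume.
git commit -q --amend --no-edit --date '2008-10-04T03:54:34'
git rebase --continue

# The middle commit and its child now have new hashes.
git log --pretty=fuller
```

Note that only the author date is backdated here; the committer date of every rewritten commit becomes the time of the rebase.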

The rebase approach is rather tedious and limited. In fact, git includes a super-duper general tool to automate just that: git filter-branch! And reading the manpage for this new command, we learn that, in fact, there are two dates in a commit: the author date and the committer date (the author date is the default date shown by git log; to show both dates, use git log --pretty=fuller). The difference is that the author date is when the commit was originally created and the committer date is when it was last applied to the tree. Hence, they are generally the same, except when a rebase, cherry-pick or other such command re-applies the commit, so to speak. This means that the --date option to git commit only touches up the author date, but not the committer date. But the thermo-nuclear option of filter-branch does allow changing both dates.
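As a rough illustration of the filter-branch route, here is a sketch that sets both dates of a single commit through --env-filter, again in a scratch repository with made-up contents. BACKDATE_COMMIT and BACKDATE_TO are names invented for this example. Recent git versions print a warning before filter-branch runs; FILTER_BRANCH_SQUELCH_WARNING silences it.

```shell
#!/bin/sh
# Sketch: set author AND committer date of one commit with filter-branch.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Example"               # placeholder identity
git config user.email "example@example.org"
for n in 1 2 3; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# Hypothetical variable names; exported so the filter below can see them.
BACKDATE_COMMIT=$(git rev-parse HEAD~1)
BACKDATE_TO='2008-10-04T03:54:34'
export BACKDATE_COMMIT BACKDATE_TO
export FILTER_BRANCH_SQUELCH_WARNING=1

# The env filter runs once per commit, with $GIT_COMMIT set to the
# original hash; the date variables must be re-exported to take effect.
git filter-branch --env-filter '
    if [ "$GIT_COMMIT" = "$BACKDATE_COMMIT" ]; then
        export GIT_AUTHOR_DATE="$BACKDATE_TO"
        export GIT_COMMITTER_DATE="$BACKDATE_TO"
    fi
' -- --all

git log --pretty=fuller
```

For the last commit only, the same effect can be had without filter-branch by setting GIT_COMMITTER_DATE in the environment of git commit --amend --date.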

Cool, but is there a constructive point to it all?

As a matter of fact, there is! For example, I have just decided to gather all the sort of cool projects I once did and make them presentable. The thing about old projects is that they tend to be more amateurish, disorganized and, well, less presentable. Indeed, this is exactly the case with one of the first projects I did (well, aside from the others whose source I just completely lost): libsdlinput. This is a project I did around 9 years ago, at a time when I didn't know about version control (in fact, it might be the project that made me realize all the advantages of version control).

So, as can be seen on the Sourceforge page, I was dumping versioned archives directly into the project rather than doing version control. Fortunately, Sourceforge has the dates at which I last uploaded the files which means I have some information about versioning. So, one thing that is possible is to gather the names and dates of every archive and commit them in order. I could even set the author date of every commit to the archive date using the information above.

Unfortunately, since I used so many directories, gathering this information by hand gets boring and error-prone very fast (a quick checkup from the shell tells me that I have 26 files in 16 different directories in this project, in total). So, I started writing a little Python script to scrape that information. But then, I realized that while I would have the information, I would still have to download each actual file and place it manually into the right directory (and I'm not going to do that 26 times). So, I started looking into what kind of file access Sourceforge gives to developers and it turns out that they have SSH and even rsync over SSH, neato!

So, after some reading and head-scratching (it turns out that you need to request a shell access creation every few hours, that your files are in /home/frs/, but your home dir is an empty directory named /home/U/US/USERNAME and some other oddities), I ended up being able to rsync my project files to my computer with timestamps preserved:

$ rsync -av --rsh="ssh -l vhann" vhann@shell.sourceforge.net:/home/frs/project/libsdlinput/ libsdlinput

Great. Now, the plan is to go through the whole thing and list everything by date (oldest first), then create a new directory and init a new git repo in it. Then, for each file: unpack it if it's an archive, otherwise simply copy it, then ask git for new or changed files and, if any are found, create a new git commit.

First order of business is listing files (excluding directories) sorted by mtime. ls -l can do that, but for older files it only displays dates to a resolution of days. ls -l --time-style=long-iso stops at minutes. ls -l --full-time finally lists seconds but it is, once again, the thermonuclear option: on my machine, it lists times in billionths of a second!

$ ls -ld --full-time .

drwxr-xr-x 3 dioo dioo 142 2017-08-10 21:54:12.025828392 -0400 .

No matter, with a little bit of shell magic, we'll arrange all that:

$ find libsdlinput/ -type f -exec ls --full-time -lGtr {} + | cut -c26-44,61-

There, now that looks more sensible:

$ find libsdlinput/ -type f -exec ls --full-time -lGtr {} + | cut -c26-44,61- | head -n 2

2008-10-04 03:54:34 libsdlinput/OldFiles/libSDL_Input4_0_1PortableFilesOnly.tar.gz

2008-10-04 03:56:20 libsdlinput/OldFiles/libSDL_Input_TTF1_0_0PortableFilesOnly.tar.gz

Okay, now save this to a file:

$ find libsdlinput/ -type f -exec ls --full-time -lGtr {} + | cut -c26-44,61- > listing.txt
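One caveat: the column offsets given to cut depend on the widths of the fields that ls prints, so they may need adjusting on another machine. A possible alternative, assuming GNU find, is to let find print the mtime itself and sort on the epoch value. The demo directory and file names below are made up so that the sketch is self-contained.

```shell
#!/bin/sh
# Sketch: list files oldest-first without relying on ls column positions.
# Assumes GNU find, sort, cut and sed.
demo=$(mktemp -d)
touch -d '2008-10-04 03:56:20' "$demo/second.tar.gz"
touch -d '2008-10-04 03:54:34' "$demo/first.tar.gz"

# %T@ is the mtime in seconds since the epoch (used as the sort key);
# %TY-%Tm-%Td %TH:%TM:%TS is the same mtime in readable form.
# GNU find prints %TS with a fractional part, which sed strips off.
listing=$(find "$demo" -type f -printf '%T@ %TY-%Tm-%Td %TH:%TM:%TS %p\n' \
    | sort -n \
    | cut -d' ' -f2- \
    | sed 's/\(:[0-9][0-9]\)\.[0-9]* /\1 /')

echo "$listing"
```

Each output line then has the form "YYYY-MM-DD HH:MM:SS path", with the oldest file first.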

And now we need only write a small script, create a new directory to put it into, run the script and VOILÀ!
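The script itself isn't shown here, so what follows is only a minimal sketch of what such an import could look like, under a few assumptions: each line of listing.txt has the form "YYYY-MM-DD HH:MM:SS path", only .tar.gz and .tgz files count as archives, and everything is unpacked or copied straight into the root of the new repo. The demo input (the src directory, the file names and the identity) is made up so that the sketch runs on its own.

```shell
#!/bin/sh
# Sketch: replay dated files from a listing into a fresh git repo.
work=$(mktemp -d)
cd "$work" || exit 1

# Demo input standing in for the rsync'ed tree and listing.txt.
mkdir -p src/OldFiles stage
echo 'int main(void) { return 0; }' > stage/main.c
tar -czf src/OldFiles/demo1_0.tar.gz -C stage main.c
echo 'Demo notes' > src/README.txt
cat > listing.txt <<'EOF'
2008-10-04 03:54:34 src/OldFiles/demo1_0.tar.gz
2008-10-05 10:00:00 src/README.txt
EOF

# The import itself.
git init -q repo
git -C repo config user.name "Example"             # placeholder identity
git -C repo config user.email "example@example.org"

# read puts the rest of the line into "path", so spaces in paths survive.
while read -r day time path; do
    case "$path" in
        *.tar.gz|*.tgz) tar -xzf "$path" -C repo ;;   # unpack archives
        *)              cp "$path" repo/ ;;           # copy anything else
    esac
    git -C repo add -A
    # Commit only if staging actually changed something.
    if ! git -C repo diff --cached --quiet; then
        # --date sets the author date; the variable sets the committer date.
        GIT_COMMITTER_DATE="$day $time" \
            git -C repo commit -q --date "$day $time" \
                -m "Import $(basename "$path")"
    fi
done < listing.txt

git -C repo log --pretty=fuller
```

Setting GIT_COMMITTER_DATE alongside --date is optional; without it, only the author date of each commit is backdated.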