Saturday, March 15, 2014

Linus and XML

Until I read this post from Linus Torvalds, I had never thought of the git subsystem as a replacement for a file format.

tl;dr:

Linus' project, Subsurface, a SCUBA diving logger, originally used XML. Linus hates XML, and has been looking to replace it. He ended up using the git object database format. With it, he gets efficient deduplication and compression, as well as backups and history.
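To make the deduplication and compression point concrete, here is a minimal Python sketch of a content-addressed store in the style of git's loose blob objects. The ID is the SHA-1 of a "blob <size>" header, a NUL byte, and the content, which is how git names blobs. Everything else (the `TinyObjectStore` class, the in-memory dict, the sample dive text) is made up for illustration and is not how Subsurface or git itself is implemented.

```python
import hashlib
import zlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class TinyObjectStore:
    """Hypothetical in-memory stand-in for a git object directory."""

    def __init__(self):
        self.objects = {}  # object ID -> zlib-compressed bytes

    def put(self, data: bytes) -> str:
        oid = git_blob_id(data)
        # Identical content hashes to the same ID, so it is stored only once.
        if oid not in self.objects:
            header = b"blob " + str(len(data)).encode("ascii") + b"\0"
            self.objects[oid] = zlib.compress(header + data)
        return oid

    def get(self, oid: str) -> bytes:
        raw = zlib.decompress(self.objects[oid])
        # Drop the "blob <size>\0" header and return the content.
        return raw.split(b"\0", 1)[1]


store = TinyObjectStore()
dive = b"depth 18.2m\nduration 42min\n"  # made-up sample record
first = store.put(dive)
second = store.put(dive)  # same content, so no new object is created
```

Storing the same record twice leaves a single object in the store, which is the deduplication Linus gets for free. History and backups come from the commit and tree objects layered on top, which this sketch leaves out.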


You might say "But the git object database format isn't a file format, it's a bunch of files tied together that make up the format. It's a database format." And you'd be correct.

In the discussion, Linus adds: "As to JSON, it's certainly a better format than XML both for humans and computers, but it ends up sharing a lot of the same issues in the end: putting everything in one file is just not a good idea. There's a reason people end up using simple databases for a lot of things."

There have been several projects that have used git and/or its technologies before. I can think of backup systems and FUSE filesystems meant as a substitute for ZFS or btrfs. In the comments on Linus' post, one of the Pitivi devs mentioned that they had toyed with using it for that project.

After reading up on git objects and libgit2, it seems to me that the git subsystem could be a good replacement for text file formats in certain scenarios, provided you're okay with having a good text database instead of wrapping everything into a single file. It depends on the situation. It sure beats the "Just use XML for everything, everywhere" mentality.

2 comments:

paul said...

I looked at the git subsystem some and thought it was nice for versioned, text-based data. I just wish a SQL-based database would add versioned data as a base idea. That seems to be somewhat what Google's new system does with Spanner/F1.

Redsaz said...

I agree, something like that would be nice, without throwing in something like Liquibase. But my guess is that the way things need to be grouped and versioned may be domain-specific when data is stored the way it is in tables. At least, that's the big reason I would see for doing things via key-value DBs or document-based DBs vs. table-based DBs.

Yes, versioning can be done with all of them, but it seems like you'd have to do more planning with the table-based DBs compared to the others.

But again, this is all a guess, I could be completely backwards in my thinking.