Sunday, April 1, 2012

Personal cloud

Personally cloudy with a chance of redundant offsite backup.


Despite extensive searching, I've been unable to locate a storage system that is a pure drop-in solution for all of my storage needs. For a young adult, my personal storage needs aren't that massive. I have a single home server, Fenris, which I primarily use for long-term file storage. My desktop, Odin, and my laptops, Huginn and Muninn, are clients for my storage system, and I make frequent use of it! Using a multi-disk BTRFS filesystem shared over Samba and SSHFS, Fenris provides me with a very convenient place to dump all of my files without any worry that I'll lose everything at a moment's notice to a drive failure. At the moment, I'm only using 80% of the storage available on Fenris, and a lot of that usage comes from poor organizing on my part.

So why am I writing this then?


How much of a poor organizational habit (computer information or physical objects) is primarily caused by the environment that surrounds that habit? I postulate that a full 70% or more of my bad computer information organization derives from a poor environment in which to do that organization. Needless to say, I need an environment overhaul.

I've identified a list of features that my "not perfect, but pretty close" personal storage system must have.

  1. Storage is not inherently centralized.
  2. Access to the storage system can be fully transparent (i.e. can be mounted as an arbitrary folder).
  3. Adding new clients / nodes to the system is relatively painless.
  4. Can handle frequent node connects / disconnects (laptops turned off / on).
  5. Frequently accessed AND user designated files / folders always cached locally.
  6. Data cached locally can be used without being connected to other nodes. (Laptop away from home).
  7. Snapshot support. Even better if on a per file, or per folder, basis.
  8. User-configurable redundancy settings (i.e. specify how many copies of a given file should be in the network simultaneously).
  9. Allow nodes to operate primarily as storage nodes or clients, allowing the server to handle the storage and redundancy, while laptops cache their files locally with permanent storage on the server.
  10. Some concession to offsite backup, including incremental transfer, and not requiring all of my nodes to be online simultaneously during a backup.
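To make requirements 4, 8 and 9 a little more concrete, here is a minimal sketch of the bookkeeping I have in mind. It is not taken from any existing filesystem: the `Node` class, the `under_replicated` function, the file names and the per-file policy dictionary are all hypothetical, and a real system would track far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical description of one machine in the storage network."""
    name: str
    role: str                 # "storage" (always on) or "client" (comes and goes)
    online: bool = True
    files: set = field(default_factory=set)


def under_replicated(nodes, wanted_copies):
    """Return {path: missing_copies} for files with too few copies on online nodes."""
    counts = {}
    for node in nodes:
        for path in node.files:
            counts.setdefault(path, 0)
            if node.online:
                counts[path] += 1
    return {
        path: wanted_copies.get(path, 1) - have
        for path, have in counts.items()
        if have < wanted_copies.get(path, 1)
    }


# Made-up example data: the server plus two clients, one of them switched off.
fenris = Node("Fenris", "storage", files={"taxes.pdf", "thesis.tex"})
odin = Node("Odin", "client", files={"thesis.tex"})
huginn = Node("Huginn", "client", online=False, files={"taxes.pdf", "thesis.tex"})

policy = {"taxes.pdf": 2, "thesis.tex": 2}   # requirement 8: copies wanted per file
print(under_replicated([fenris, odin, huginn], policy))
```

With Huginn switched off, only Fenris holds a reachable copy of `taxes.pdf`, so that file is reported as one copy short. Whether offline copies should count towards the target is a design choice; this sketch takes the cautious view that they don't.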

http://ceph.newdream.net/ comes close, offering non-centralized storage and mount-ability, plus several in-the-works features for snapshots and tunable redundancy settings, but ultimately lacks support for local data caching.


http://www.coda.cs.cmu.edu/ comes even closer, offering built-in support for disconnected operation and caching, but is unacceptable for having too many centralized components with no ability to designate failover instances, and a lack of offsite backup considerations. Even more disappointing is that Coda appears to no longer be developed, with no support in the Ubuntu package archives and no code commits in the last several years.

But wait, there's more!


My need for some kind of offsite storage doesn't stop with storing my own files. I manage my parents' home storage server, and also provide tech support to multiple friends to manage their storage systems. As everybody knows, if you don't have offsite backups for your critical data, you might as well assume you've already lost that critical data.

So I aimed requirement number 10 at providing me with some means by which I can provide automatic offsite backups of all of the storage systems I manage. By my current count, that's approaching 4 independent storage servers, in addition to another 3 or so friends who would probably jump on board if the barrier to entry for participating was low enough.

I've considered https://tahoe-lafs.org/trac/tahoe-lafs, which provides fantastic data-recovery features, but tahoe-lafs doesn't seem to provide much support for trusted storage partners. After all, who can you trust if not your family?

My reason for being wary of tahoe's hard adherence to trusting only yourself is that the filesystem loses significant opportunities for compression, and needs massively more storage space than the raw file sizes would suggest.
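As a rough illustration of where that extra space goes: with k-of-n erasure coding, a file is split into n shares of which any k are enough to rebuild it, so the stored size grows by a factor of n / k. The sketch below only does that arithmetic. The function name and the 3-of-10 parameters are assumptions picked for illustration, not measurements from a real grid, so check the configuration of whatever you actually run.

```python
from fractions import Fraction


def expansion_factor(needed, total):
    """Storage blow-up for k-of-n erasure coding: n shares, any k rebuild the file."""
    if not 0 < needed <= total:
        raise ValueError("need 0 < needed <= total")
    return Fraction(total, needed)


# Illustrative parameters only (assumed, not measured).
factor = expansion_factor(needed=3, total=10)
raw_gigabytes = 100
print(f"expansion factor: {float(factor):.2f}x")
print(f"{raw_gigabytes} GB of files needs about {float(raw_gigabytes * factor):.0f} GB of disk")
```

Plain mirroring is the special case `needed=1`: three full copies cost exactly three times the raw size. This counts only the redundancy itself, not any per-share metadata.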

I store a lot of text files, including homework, email archives, ebooks, source code, and so on. I also have a lot of video and pictures, but primarily I'm storing a massive amount of plaintext. I know for a fact that the storage behavior of my family is similar to mine, and I strongly suspect the same can be said of most of my friends. In order to take advantage of our limited hardware capacities, I'd want the perfect distributed / redundant wide area network backup system to take full advantage of the files being stored in the system to squeeze every last drop out of our hardware. Plain text compresses best when you collect it all into the same archive, after all!
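That last claim is easy to check with a quick experiment. The sketch below uses Python's standard `zlib` module to compress a set of small, similar text documents one by one and then as a single stream. The sample documents are made up for the demonstration, and the helper names are my own.

```python
import zlib


def separate_size(documents):
    """Total bytes when every document is compressed on its own."""
    return sum(len(zlib.compress(doc)) for doc in documents)


def combined_size(documents):
    """Bytes when all documents are concatenated and compressed as one stream."""
    return len(zlib.compress(b"".join(documents)))


# Made-up sample data: fifty short notes that share most of their wording.
documents = [
    f"Homework {i}: answer the questions at the end of chapter {i} "
    f"and hand them in before Friday.\n".encode()
    for i in range(50)
]

raw = sum(len(doc) for doc in documents)
print("raw bytes:     ", raw)
print("one at a time: ", separate_size(documents))
print("all together:  ", combined_size(documents))
```

Compressed together, the shared wording is stored once and later documents mostly become references to it, so the combined stream comes out far smaller. Real files will be less repetitive than this toy data, and zlib can only refer back 32 KB, so a backup system would want a compressor with a larger window or explicit deduplication to get the same effect across big files.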

Ultimately, I'm left with a bitter taste in my mouth. I can cobble something together that does meet all of the needs I espouse in this post with scripts and a few small programs, but that system won't scale well, and certainly won't be as reliable as a preexisting system.

I'm going to continue watching the development of new filesystems, and hopefully contribute in the needed direction to finally have my perfect personal cloud.

Update May 29th 2014

Now announcing Aerosta!
Check out our website: http://www.aerosta.com/
Check out a video: https://www.youtube.com/watch?v=4mpHPxVu1XA

2 comments:

  1. I can't tell whether you want an offsite service to already exist, as in Dropbox, or whether you want to set up the server yourself with relatively little effort.

    If the latter, then there might be some git-annex based solutions worth looking into.

    I just glanced at SparkleShare, and it is far closer to Dropbox than what you describe, but with the disadvantage of requiring you to set up the server manually.

  2. Hey Minifig,

    I'm actually interested in a mostly self-hosted offsite solution. After re-reading, I can see I need to add some clarification to my post, so I'll try to chug on that over the next couple of days.

    My reason for elaborating on tahoe-lafs is that it amounts to a wide-area-network cluster filesystem, where data is duplicated across multiple offsite destinations while also trying to optimize storage usage. tahoe is also designed to be usable when you don't trust any of the people who are holding your redundant data chunks. The main difference between tahoe-lafs and what would really fill my need is essentially this: tahoe-lafs assumes no trust and lots of storage, whereas I assume complete trust and limited storage.

    This results in tahoe-lafs having an extremely high storage requirement (something around inflating each byte stored in the tahoe-lafs cluster by a multiple of 7), and also a lot of computational power needed to encrypt the data being redundantly backed up. I'd personally rather have those compute resources targeted at compressing the files that I do want to store, than encrypting them. I wouldn't mind having encryption support though, in the case where friends want to join the wide area network cluster but don't necessarily want to show me or my parents their tax documents.

    I wouldn't be too upset if the wan-cluster needed 3 or 4 days of setup, so long as it was reliable, and could easily survive reboots and multi-day disconnections. In my humble opinion, for the home consumer market, a huge barrier to using more exotic storage schemes is having too many components, and too many shared configuration files. All of my nodes have their own domain name, and I'm more than happy to configure each of them with that information... but systems like Ceph fs (a LAN-cluster) pretty explicitly require that configurations between nodes in the storage cluster be in lock-step, and also need to have multiple orthogonal daemons running, which really increases the complexity.

    I also evaluated Freenet (and others), but Freenet is simply not targeted at my expected usage. *shrug*.

    SparkleShare is actually something that I'm seriously considering installing on my Dad's business machines. A major problem he's having is that he keeps forgetting to back up his documents to his storage server. SparkleShare would give him a (Windows-friendly) way to integrate doing backups into his normal workflow.

    Thanks for the comment :-)
