Borg backup to Amazon S3 on FUSE? #102
I'm still trying to get an idea of what exactly happens in the Borg repo from one run to the next. I used it to back up my ~/ directory (about 72 GB on disk) last night, and I messed around with creating and deleting files and re-combining ISO images to see how well the deduplication works. (It works extremely well, I might add!) I ran around 30 backups with no pruning. That was last night; today I used my computer for some web browsing and then ran another backup, with a before-and-after comparison. Then I pruned all but the most recent backup and ran another diff.

So assuming that all the chunks are around 5 MB, that would be around 3 GB of deleted data taking up wasted space in Amazon S3, which would cost me about $0.05/month in Glacier according to Amazon's calculator, and it would have to stay there for 90 days to avoid a penalty. In regular S3 storage it would cost something like $0.11/month. Additionally, there would be far fewer changes and much less total data stored in the case of the webserver I want to back up with this scheme. So I would tentatively think this could be a good option?
I might add that you can get 10 TB (that's ten terabytes) of "nearly" OpenStack Swift compatible storage from HubiC.com for 50 Euro a year (no kidding). I use this together with my Hubic Swift Gateway and the Swift duplicity backend. This is also EU storage (located in France), which solves some problems with German laws. I also think it would be fairly easy to implement as a backend for software with a chunked approach. P.S.: Their desktop client (still) sucks IMHO... but you even get 25 GB for free, which can also be used for experiments with the API.
Thanks @oderwat for the tip! Good to know.
I must say that I don't use "cloud data storage services", so I can't advise about their API/capabilities. Borg's backend is similar to a key/value store, and segment files only get created/written, not modified (apart from complete segment files being deleted), so it could be possible if someone writes such a backend. Borg has an "internals" doc that might be interesting for anybody wanting to write such a backend. If information is missing there, please file a docs issue here.
borg has some level of abstraction of remote repositories... there's currently only a single
this list is from
notice the similar issue in jborg/attic#136
Supporting storage services like AWS S3 would be huge and would make borg a real alternative to tools like tarsnap. I would support a bounty for a) a generic storage interface layer and b) S3 support based on it. Another interesting backend might be sftp/scp, as provided by some traditional hosting providers like Hetzner or Strato HiDrive.
@rmoriz your contribution would of course be welcome. bounties are organised on bountysource, in this case: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse

the main problem with S3 and other cloud providers is that we can't run native code on the other side, which we currently expect for remote server support. our remote server support involves calling fairly high-level functions like

the assumptions we make about the remotes also imply that the current good performance we get on SSH-based remotes would be affected by "dumb" remotes like key/object value storage. see also #36 (comment) for this.
Please correct me if I'm wrong. It looks like we have/need a three-tier architecture:
So the borg server part needs a storage abstraction model where backends like S3, FTPS, Google Cloud Storage, etc. can be added. Is that correct? I think using FUSE adapters is not a reliable approach (IMHO). Update:
the server is not necessarily needed - borg's internal structure would allow using something like a different k/v store as well - but someone needs to implement and test it
Thanks for putting a bounty on this. If someone wants to take it: please discuss the implementation here beforehand, do not work in the dark.
+1 from me on this. I want exactly what the original poster is talking about. Also, since I am relying on deduplication, I want to use some really highly durable storage like Amazon has. The versioning lifecycles to protect against the "compromised host" problem would also be fantastic... (I added to the bounty :) )
I've written up some of my thoughts on the limitations of S3, and a WIP discussion of some possible methods to address them. It is organised as a single document right now, but as it gets fleshed out, I will expand it as appropriate. Please comment there and I will try to keep the document up to date with as much information as possible. See https://gist.github.com/asteadman/bd79833a325df0776810. Any feedback is appreciated. Thank you.
the problematic points (as you have partly noticed already):
Yes, the target chunk size in 1.0 will be 1 or 2 MiB. That doesn't mean there will be no tiny chunks - if your file only has 1 byte, it will still be 1 chunk. So the average might be lower than the target size.

BTW, it is still unclear to me how you want to work without locking, with parallel operations allowed (including deletion).

I also do not think that making this github issue longer and longer with back-and-forth discussion posts is helping here very much - if we want to implement this, we need ONE relatively formal description of how it works (not many pages in discussion mode). So I'd suggest you please rather edit one of your posts and update it as needed until it covers everything needed, or until we find it can't be implemented. Also, the other posts (including mine) should be removed after integration. I am also not sure a gh issue is the best place for that; maybe a github repo, where one can see diffs and history, would be better.
http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html doesn't sound too promising about the possibility of reliably using S3 directly from a backup tool (he wrote a special server that sits between the backup client and S3).
@tw that post was from 2008… https://aws.amazon.com/de/s3/faqs/#How_durable_is_Amazon_S3
@ThomasWaldmann - actually it's promising - it's not too different from what borg is already doing in the local format, and it might not need too much of a change to make borg work against it.
Don't forget Backblaze's B2. Cheapest storage around. HashBackup already does all of that, but it's closed source, so who knows how it is done.
Amazon Cloud Drive offers unlimited storage for just $50 a year. Would be great if it were supported! :)
There's a FUSE FS for it: https://github.com/yadayada/acd_cli - that should work okayish (maybe not the best performance).

This thread here is about directly using the S3 key-value store as a backup target (no intermediate FS layer), at least that's how I understand it. I think it's kind of unrealistic, at least for now, to completely redo the Repository layer. An alternative Repository implementation could be possible, but I don't see how you could do reliable locking with only S3 as the IPC, when it explicitly states that all operations are only eventually consistent. Parallel operation might be possible, but really, it's not a good idea for a first implementation. Also, Repository works only on a chunk level, and most chunks are very small. That just won't work (as mentioned above).

Working on the LoggedIO level (i.e. an alternate implementation of that which doesn't store segments in the FS, but in S3) sounds more promising to me (but - eventual consistency - the Repository index must then be both local and remote, i.e. the remote one updated after a successful local transaction, so we would actually need to re-implement both LoggedIO and Repository).

Locking: either external (e.g. a simple(!) database - are there ACID RESTful databases that wouldn't need a lot of code or external deps?) or "user promise locking" (i.e. 'Yes dear Borg, I won't run things in parallel').

Eventual consistency: put the last (id_hash(Manifest), timestamp) in the locking storage or locally, and refuse to operate if the Manifest on S3 isn't ==?
For what it's worth, I'm currently using Issues as I see them
It might help to cache KV updates locally before writing them out in one burst periodically, but I don't have any easy way of testing this. (It would be nice if there were a generic FUSE caching layer, but I have not been able to find one.)
Increasing the segment size in the repo config might help if there is a long-ish ramp-up period for uploads (and increasing filesystem-level buffer sizes, if possible).
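For anyone experimenting with this, a minimal sketch of adjusting the segment size per repository, assuming borg 1.1+ (which provides the `borg config` command); the repo path and the 512 MiB value are placeholders:

```bash
# show the current value of max_segment_size for this repo
borg config /path/to/repo max_segment_size

# raise it to 512 MiB so uploads consist of fewer, larger segment files
# (only affects segments written after the change)
borg config /path/to/repo max_segment_size 536870912
```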
AFAICT it's sftp only. I don't believe Borg supports SFTP? My understanding is it needs to be able to invoke itself on the remote and talk to it over forwarded stdin/stdout.
Oh... I'm dumb. I forgot that Borg needs another Borg executable on the server side...
borg can work either with a (working, potentially remote) filesystem or client/server.
If anyone is still trying to do this, it's totally possible using the s3fuse plugin -- I know it isn't native. But here's a writeup of doing it with Linode object storage that could easily be adapted to Amazon S3: https://jthan.io/blog/borg-backups-on-linode-object-storage/
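For reference, a minimal sketch of that kind of FUSE-mount approach using s3fs; the bucket name, mount point, credentials file and endpoint URL are placeholders and are not taken from the writeup:

```bash
# mount the bucket as a local directory via FUSE
s3fs my-borg-bucket /mnt/borg-s3 \
    -o passwd_file="$HOME/.passwd-s3fs" \
    -o url=https://s3.amazonaws.com

# then treat the mount point like any local borg repository
borg init --encryption=repokey /mnt/borg-s3/repo
borg create --stats /mnt/borg-s3/repo::'{hostname}-{now}' /etc /home
```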
Stumbled upon this hoping to find some progress on a borg + S3-like (minio) backend.
I'd definitely be willing to pitch in for a Borg S3 backend.
Amazon S3 is now [strongly consistent](https://aws.amazon.com/s3/consistency/), eliminating the most annoying part of a theoretical borg S3 backend. Does this bounty still stand? If a working S3 borg backend were written, would it be merged?
My only suggestion would be that it works with any S3-compatible storage so we aren't locked to AWS, but I'm still interested if anyone is wondering. Borg is in every way superior to Restic, other than a lack of S3 compatibility. My two cents.
There is a new project, Kopia (written in Go, like Restic), which is not as mature as Borg but has many of the same features plus an S3 backend.
@milkey-mouse that is hard to say in advance. i guess the code change would have to be reasonably clean and not risk the stability of the existing file/ssh-based backend. not sure whether that is possible.
Some thoughts (and some repetition), trying to focus this on an achievable goal. The bounty stated S3, but practically it simply represents the massive community desire to "back up directly to the cloud". Borg doesn't have a clear story for this - this issue is full of misgivings and individual workarounds. Many are coping by doing the indirect 2-step process:

Given

This issue then reduces to
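As a concrete illustration of that indirect 2-step process, here is a minimal sketch; the local repo path, rclone remote name and bucket are hypothetical, and the rclone remote is assumed to be configured already:

```bash
# step 1: back up to a normal local repository
borg create --stats /srv/borg-repo::'{hostname}-{now}' /etc /home
borg prune --keep-daily 7 --keep-weekly 4 /srv/borg-repo

# step 2: mirror the whole repository directory to object storage
rclone sync /srv/borg-repo remote:borg-backups/myhost
```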
rclone/rclone#3641 covers this exact use case from
@drekbour Thank you for mentioning

I agree that having something in the

A minor friction point that remains is the need to dedicate a directory to serve as a mount point. It's not a show-stopper for me, but it is slightly inconvenient. For example, I have to figure out where I want to place the mount point and how to prevent
Well, I've found a couple more friction points. They aren't bugs in
My
Note that this requires creating the following symlink in order to work.
Also note that several online descriptions say you can use
Personally, I would prefer
Slightly unrelated to the issue, but regarding write caching: I would not recommend using any form of write caching with rclone mount + borg! It might have changed during the last ~6 months, but before that I had been using it a lot, and rclone might sync indexes or segments and then get interrupted at a point where one is not in sync with the other. This corrupts the borg repository. Just avoid any form of write caching; that has been stable for me so far.
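For what it's worth, a minimal sketch of an rclone mount with no write caching, in line with the advice above; the remote name, bucket and mount point are placeholders:

```bash
# --vfs-cache-mode off is the default and disables the VFS file cache entirely,
# so nothing is write-cached; reads are still buffered in memory via --buffer-size
rclone mount remote:borg-backups /mnt/borg \
    --vfs-cache-mode off \
    --buffer-size 64M \
    --dir-cache-time 30s &

borg create --stats /mnt/borg/repo::'{hostname}-{now}' /etc /home
```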
@dragetd Regarding caching: is there any way to do read caching without write caching? Maybe I'm looking in the wrong part of the docs, but I haven't found anything. (I am currently using
I haven't read all the messages here, so I don't know if it has come up: we could use [EFS](https://aws.amazon.com/efs/). We could also use a storage gateway in either file or volume mode. For EFS, maybe we could mount this NFS-based filesystem on our local system and just back up to that directory as well. I think it might be easier to configure than a gateway, but a storage gateway could probably end up cheaper.
Glad to have found the thread. I'm actively using
The only issue I've run into at this point is that I have a lot of zero-length segments in the data directory. Running

For reference:
@wsmeichinllc that sounds like the rclone fs is not working correctly. borg usually does not create zero-length files.
I expect so. Does borg delete files? Because then it would be an issue with the way rclone (or B2) handles deletes, which is very specific. In any case, it's working.
Yes, if borg compacts the repo's segment files, new segment files will be created and the old non-compact ones will be deleted.
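For context, in borg 1.2+ that compaction is an explicit step; a minimal sketch with a placeholder repo path:

```bash
# rewrite sparse segment files and delete the old, non-compact ones;
# on a FUSE/rclone-backed repo this is where deletions of old segments happen
borg compact /path/to/repo
```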
I use S3 with rclone and don't get any empty files.
#8332 was a big step to make any key/value store usable with borg2.
Hi everyone,
I'm interested in using Borg to back up my webserver to an Amazon S3 bucket. I've been using Duplicity, but I'm sick of the full/incremental model, as well as the difficulty of pruning backups. I love the ease of use and features that Borg provides, but I don't really understand the internals, and I'm not sure if it will work with Amazon S3 storage.
Specifically, I'm considering mounting my S3 bucket over FUSE, using one of the following three options:
Any comments on which, if any, would be more appropriate? And how tolerant would Borg be of S3's "eventual consistency" weirdness?
Additionally, I need to plan against the worst-case scenario of a hacker getting root access to my server and deleting the backups on S3 using the stored credentials on my server. To eliminate this possibility, I was thinking about enabling S3 versioning on the bucket so that files deleted with my server's S3 user account can still be recovered via my main Amazon user account. Then, I would have S3 lifecycle management configured to delete all versions of deleted files after X amount of time. In this case,
Again, my concerns are based on me not really understanding all the black magic that happens with all the chunks and indexes inside a Borg repository, and how much they change from one backup to the next.
Thanks in advance for the help!
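Regarding the versioning idea above, a minimal sketch of how that could be set up with the AWS CLI; the bucket name and the 90-day window are placeholders, and this is independent of borg itself:

```bash
# enable versioning so deletes made with the server's credentials only hide objects
aws s3api put-bucket-versioning \
    --bucket my-borg-bucket \
    --versioning-configuration Status=Enabled

# expire noncurrent (deleted/overwritten) versions after 90 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-borg-bucket \
    --lifecycle-configuration file://lifecycle.json
```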
💰 there is a bounty for this