
Borg backup to Amazon S3 on FUSE? #102

Closed
geckolinux opened this issue Jul 19, 2015 · 100 comments · Fixed by #8405

Comments

@geckolinux

Hi everyone,

I'm interested in using Borg to back up my webserver to an Amazon S3 bucket. I've been using Duplicity, but I'm sick of the full/incremental model, as well as the difficulty of pruning backups. I love the ease of use and features that Borg provides, but I don't really understand the internals and I'm not sure whether it will work with Amazon S3 storage.

Specifically, I'm considering mounting my S3 bucket over FUSE, using one of the following three options:

Any comments on which, if any, would be more appropriate? And how tolerant would Borg be of S3's "eventual consistency" weirdness?

Additionally, I need to plan against the worst-case scenario of a hacker getting root access to my server and deleting the backups on S3 using the stored credentials on my server. To eliminate this possibility, I was thinking about enabling S3 versioning on the bucket so that files deleted with my server's S3 user account can still be recovered via my main Amazon user account. Then, I would have S3 lifecycle management configured to delete all versions of deleted files after X amount of time. In this case,

  • How much of my S3 data would Borg routinely need to download in order to figure out which files have changed and need to be backed up? (I'm worried about bandwidth costs.)
  • How much accumulated clutter and wasted space could I expect from files that Borg "deletes" (which will actually be retained on S3 due to the versioning)?

Again, my concerns are based on me not really understanding all the black magic that happens with all the chunks and indexes inside a Borg repository, and how much they change from one backup to the next.

Thanks in advance for the help!


💰 there is a bounty for this

@geckolinux
Author

I'm still trying to get an idea of what exactly happens in the Borg repo from one run to the next. I used it to back up my ~/ directory (about 72GB on disk) last night, and I messed around with creating and deleting files and re-combining ISO images to see how well the de-dupe works. (It works extremely well, I might add!) I ran around 30 backups with no pruning. That was last night; today I used my computer for some web browsing and then ran another backup, with a before-and-after ls -sl on the repo/data/1 directory. Here's a diff of repo/data/1 before and after:
http://paste.ubuntu.com/11910814/
(1 chunk deleted, 4 added, total change of 5)

Then I pruned all but the most recent backup and ran another diff:
http://paste.ubuntu.com/11910824/
And here's the repo/data/0 directory, just the names of deleted files:
http://paste.ubuntu.com/11910839/
(580 chunks deleted, 75 added, total change of 655)

So assuming that all the chunks are around 5MB, that would be around 3GB of deleted data taking up wasted space in Amazon S3, which would cost me about $0.05/month in Glacier according to Amazon's calculator, and it would have to stay there for 90 days to avoid a penalty. Or else in regular S3 storage it would cost something like $0.11/month. Additionally there would be far fewer changes and much less total data stored in the case of my webserver I want to back up with this scheme.

So I would tentatively think this could be a good option?

@oderwat

oderwat commented Jul 20, 2015

I might add that you can get 10 TB (that's ten terabytes) as "nearly" OpenStack Swift compatible storage from HubiC.com for 50 Euro a year (no kidding). I use this together with my Hubic Swift Gateway and the Swift duplicity backend.

This also is EU storage (located in France), which solves some problems with German laws.

I also think that it is fairly easy to implement as a backend for software with a chunked approach.

P.S.: Their desktop client (still) sucks imho... but you even get 25 GB for free, which can also be used for experiments with the API.

@geckolinux
Author

Thanks @oderwat for the tip! Good to know.

@ThomasWaldmann
Member

I must say that I don't use "cloud data storage services", so I can't advise about their API/capabilities.

Borg's backend is similar to a key/value store, and segment files only get created/written, not modified (except for complete segment files being deleted), so it could be possible if someone writes such a backend.

Borg has an "internals" doc that might be interesting for anybody wanting to write such a backend. If information is missing there, please file a docs issue here.
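For illustration, here is a minimal sketch (in Python; the class and method names are made up, not borg's actual API) of the write-once key/value interface such a backend would have to provide:

    from abc import ABC, abstractmethod


    class SegmentStore(ABC):
        """Hypothetical backend interface: segments are written once,
        read back by id, and deleted as a whole - never modified in place."""

        @abstractmethod
        def put(self, segment_id: int, data: bytes) -> None:
            """Store a complete segment under its id (write-once)."""

        @abstractmethod
        def get(self, segment_id: int) -> bytes:
            """Return the full contents of a previously stored segment."""

        @abstractmethod
        def delete(self, segment_id: int) -> None:
            """Remove a whole segment (e.g. after compaction)."""

        @abstractmethod
        def list(self) -> list:
            """Return the ids of all stored segments."""

Anything with these semantics (filesystem, SFTP server, object store) could in principle back a repository, as long as the consistency caveats discussed further down are handled.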

@anarcat
Contributor

anarcat commented Oct 6, 2015

borg has some level of abstraction of remote repositories... there's currently only a single RemoteRepository implementation, and it hardcodes ssh in a bunch of places. we nevertheless have a list of methods we use in RPC calls that would need to be defined more clearly, maybe cleaned up, and then implemented in such a new implementation:

    rpc_methods = (
        '__len__',
        'check',
        'commit',
        'delete',
        'destroy',
        'get',
        'list',
        'negotiate',
        'open',
        'put',
        'repair',
        'rollback',
        'save_key',
        'load_key',
    )

this list is from remote.py, and is passed through the SSH pipe during communication with the borg serve command...
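For illustration only (this is not borg code), here is roughly how that list might split between calls a dumb object store like S3 could serve directly and calls that would have to be reimplemented client-side, since S3 cannot run a remote borg process:

    # Hypothetical mapping, assuming a plain object store with only
    # GET/PUT/DELETE/LIST semantics and no remote borg process.

    DIRECTLY_MAPPABLE = {
        'get':      'GET object by key',
        'put':      'PUT object under a key',
        'delete':   'DELETE object',
        'list':     'LIST keys (paginated)',
        'save_key': 'PUT the (encrypted) key file as an object',
        'load_key': 'GET the key file object',
        '__len__':  'derivable from LIST',
    }

    CLIENT_SIDE_ONLY = {
        'check', 'repair',      # must read and verify data, so the client does it
        'commit', 'rollback',   # transaction logic has to be rebuilt on the client
        'negotiate', 'open',    # there is no remote borg process to talk to
        'destroy',              # expressible as LIST plus many DELETEs
    }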

@anarcat
Contributor

anarcat commented Oct 6, 2015

notice the similar issue in jborg/attic#136

@rmoriz

rmoriz commented Nov 18, 2015

Supporting storage services like AWS S3 would be huge and would make borg a real alternative to tools like tarsnap. I would support a bounty for a) a generic storage interface layer and b) S3 support based on it.
I suggest libcloud (https://libcloud.readthedocs.org/en/latest/storage/supported_providers.html) for designing the interfaces and dealing with cloud storage services.

Another interesting storage backend might be sftp/scp, as provided by some traditional hosting providers like Hetzner or Strato HiDrive.
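As a rough sketch of the libcloud suggestion (the bucket name and credentials here are placeholders, and this only shows the shape of the storage API, not a borg backend):

    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver

    # Placeholder credentials and bucket name - replace with real values.
    S3Driver = get_driver(Provider.S3)
    driver = S3Driver('ACCESS_KEY', 'SECRET_KEY')
    container = driver.get_container('my-borg-bucket')

    # Upload one segment-sized blob under a repo-style key...
    segment = b'...segment bytes...'
    driver.upload_object_via_stream(iter([segment]), container, 'data/0/1')

    # ...and stream it back.
    obj = driver.get_object('my-borg-bucket', 'data/0/1')
    blob = b''.join(driver.download_object_as_stream(obj))

The same code works against any provider libcloud supports, which is the main attraction over coding to one vendor's SDK.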

@anarcat
Contributor

anarcat commented Nov 18, 2015

@rmoriz your contribution would of course be welcome. bounties are organised on bountysource, in this case: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse

the main problem with S3 and other cloud providers is that we can't run native code on the other side, which we currently expect for remote server support. our remote server support involves calling fairly high-level functions like check on the remote side, which can't possibly be implemented directly in the native S3 API: we'd need to treat those as different remotes. see also #191 (comment) about this...

the assumptions we make about the remotes also imply that the current good performance we get on SSH-based remotes would be affected by "dumb" remotes like key/object value storage. see also #36 (comment) for this.

@rmoriz

rmoriz commented Nov 24, 2015

Please correct me if I'm wrong.

It looks like we have/need a three-tier architecture:

  • borg client
  • borg server (via ssh)
  • (dumb) storage.

So the borg server part needs a storage abstraction model where backends like S3, ftps, Google Cloud Storage, etc. can be added.

Is that correct? I think using FUSE adapters is not a reliable approach (IMHO).

Update:

@RonnyPfannschmidt
Contributor

the server is not necessarily needed

borg's internal structure would allow using something like a different k/v store as well - but someone needs to implement and test it

@ThomasWaldmann
Member

Thanks for putting a bounty on this.

If someone wants to take it: please discuss the implementation here beforehand; do not work in the dark.

@jasonfharris

+1 for me on this. I want exactly what the original poster is talking about. Also, since I will be deduplicating, I want to use some really highly durable storage like Amazon has. Also, the versioning lifecycles to protect against the "compromised" host problem would be fantastic... (I added to the bounty :) )

@asteadman

I've written up some of my thoughts on some of the limitations of S3, and a WIP discussion about some possible methods to address them. It is organised as a single document right now, but as it fleshes out, I will expand it as appropriate. Please comment there and I will try to keep the document up to date with as much information as possible. See https://gist.github.com/asteadman/bd79833a325df0776810

Any feedback is appreciated. Thank you.

@ThomasWaldmann
Member

the problematic points (as you have partly noticed already):

  • using 1 file per chunk is not gonna work practically - too many chunks, too much overhead. you have to consider that 1 chunk is not just the usual 64kiB (or soon: 1MiB) target chunk size, but can be way smaller if the input file is smaller. you can't really ignore that in the end, this is something that has to be solved (see the packing sketch after this list).
  • the archive metadata (list of all files, metadata of files, chunk lists) can be quite large, so you won't be able / you won't want to store this in one piece. borg currently runs this metadata stream through chunker / deduplication also, which is quite nice because we always have the full(!) item list there and a lot of it is not changing usually.
  • "skipping chunks that already exist" - if you want to do that quickly, you need an up-to-date (consistent) local index / hash table. otherwise, you may have 1 network roundtrip per chunk.
  • that "eventually consistent" S3 property is scary. it's already hard enough to design such a system without that property.
  • "chunk staleness" is an interesting idea. but i think you could run into race conditions - e.g. you just decided that this 3 months old chunk shall be killed, when a parallel backup task decided to use it again. guess either atomicity or locking is needed here.

@ThomasWaldmann
Member

Yes, the target chunk size in 1.0 will be 1 or 2MiB. That doesn't mean that there will be no tiny chunks - if your file only has 1 byte, it will still be 1 chunk. So, the average might be lower than the target size.

BTW, it is still unclear to me how you want to work without locking, with parallel operations allowed (including deletion). I also do not think that making this GitHub issue longer and longer with back-and-forth discussion posts is helping very much - if we want to implement this, we need ONE relatively formal description of how it works (not many pages in discussion mode).

So I'd suggest you rather edit one of your posts and update it as needed until it covers everything needed or until we find it can't be implemented. Also, the other posts (including mine) should be removed after integration. I am also not sure a gh issue is the best place for that; maybe a github repo, where one can see diffs and history, would be better.

@ThomasWaldmann
Member

http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html doesn't sound too promising about the possibility of reliably using S3 directly from a backup tool (he wrote a special server that sits between the backup client and S3).

@rmoriz

rmoriz commented Jan 11, 2016

@tw that post was from 2008… https://aws.amazon.com/de/s3/faqs/#How_durable_is_Amazon_S3

@RonnyPfannschmidt
Contributor

@ThomasWaldmann - actually it's promising - it's not too different from what borg is already doing in the local format - and it might not need too much of a change to make borg work against it

@olivernz

olivernz commented Mar 7, 2016

Don't forget Backblaze's B2. Cheapest storage around. HashBackup already does all that, but it's closed source, so who knows how that is done.

@phime42

phime42 commented Mar 19, 2016

Amazon Cloud Drive offers unlimited storage for just $50 a year. Would be great if it'd be supported! :)

@enkore
Contributor

enkore commented Mar 24, 2016

There's a FUSE FS for it: https://github.com/yadayada/acd_cli

That should work okayish (maybe not the best performance).

This thread here is about directly using the S3 key-value store as a backup target (no intermediate FS layer), at least that's how I understand it.

I think it's kinda unrealistic, at least for now, to completely redo the Repository layer. An alternative Repository implementation could be possible, but I don't see how you could do reliable locking with only S3 as the IPC, when it explicitly states that all operations are only eventually consistent. Parallel operation might be possible, but really, it's not a good idea for a first impl. Also, Repository works only on a chunk-level, and most chunks are very small. That just won't work. (As mentioned above)

Working on the LoggedIO level (i.e. an alternate implementation of that, which doesn't store segments in the FS but in S3) sounds more promising to me (but - eventual consistency, so the Repository index must be both local and remote, i.e. the remote one updated after a successful local transaction, so we will actually need to re-implement both LoggedIO and Repository).

Locking: Either external (e.g. simple(!) database. Are there ACID RESTful databases, those wouldn't need a lot of code or external deps?) or "User promise locking" (i.e. 'Yes dear Borg, I won't run things in parallel').

Eventual consistency: Put last (id_hash(Manifest), timestamp) in locking storage or local, refuse to operate if Manifest of S3 isn't ==?
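One concrete (purely hypothetical) answer to the "simple external database" locking idea above would be a conditional write in an external store such as DynamoDB, which refuses the put if the lock item already exists. A sketch, assuming boto3 and a pre-created table named borg-locks with primary key repo_id:

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.client('dynamodb')


    def acquire_lock(repo_id: str, owner: str) -> bool:
        """Try to take an exclusive repo lock; return False if someone else holds it."""
        try:
            dynamodb.put_item(
                TableName='borg-locks',
                Item={'repo_id': {'S': repo_id}, 'owner': {'S': owner}},
                ConditionExpression='attribute_not_exists(repo_id)',
            )
            return True
        except ClientError as exc:
            if exc.response['Error']['Code'] == 'ConditionalCheckFailedException':
                return False
            raise


    def release_lock(repo_id: str, owner: str) -> None:
        """Release the lock, but only if we still own it."""
        dynamodb.delete_item(
            TableName='borg-locks',
            Key={'repo_id': {'S': repo_id}},
            ConditionExpression='#o = :owner',
            ExpressionAttributeNames={'#o': 'owner'},
            ExpressionAttributeValues={':owner': {'S': owner}},
        )

Of course that reintroduces a second service next to S3, so "user promise locking" stays attractive for a first implementation.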

@roaima

roaima commented Mar 25, 2016

For what it's worth, I'm currently using borg on top of a Hubic FUSE-based filesystem for my off-site backups. It's painfully slow - my net effective writing speed is only around 1 Mb/s - but other than that it works pretty well.

Issues as I see them:

  • Writes have a very high latency. Once you're writing it's fast (10 Mb/s, intentionally limited within Hubic), but there seems to be a two-second delay at the beginning of each file write.
  • Reads are reasonably fast. There's certainly nothing like the write latency, but I've yet to turn this from an impression into a quantified value.
  • The process is slow, so avoiding inter-feature locking would be a very good thing (borg list and borg extract, specifically).

It might help to cache KV updates locally before periodically writing them out in a blast, but I don't have any easy way of testing this. (It would be nice if there were a generic FUSE caching layer, but I have not been able to find one.)

@enkore
Contributor

enkore commented Mar 26, 2016

Increasing the segment size in the repo config might help if there is a long-ish ramp-up period for uploads. (And increasing filesystem level buffer sizes if possible)

@srd424

srd424 commented Mar 12, 2019

So S3 supports SFTP (SSH) now. You could spin up an SFTP gateway and back up directly to S3 with Borg!

AFAICT it's sftp only. I don't believe Borg supports SFTP? My understanding is it needs to be able to invoke itself on the remote and talk to it over forwarded stdin/stdout..

@wzyboy
Contributor

wzyboy commented Mar 12, 2019

So S3 supports SFTP (SSH) now. You could spin up an SFTP gateway and back up directly to S3 with Borg!

AFAICT it's sftp only. I don't believe Borg supports SFTP? My understanding is it needs to be able to invoke itself on the remote and talk to it over forwarded stdin/stdout..

Oh... I'm dumb. I forgot that Borg needs another Borg executable on the server side...

@ThomasWaldmann
Member

borg can work either with a (working, potentially remote) filesystem or client/server.

@jrdemasi

If anyone is still trying to do this, it's totally possible using the s3fuse plugin -- I know it isn't native. But here's a writeup of doing it with Linode object storage that could easily be adapted to Amazon S3: https://jthan.io/blog/borg-backups-on-linode-object-storage/

@i0x71

i0x71 commented Apr 17, 2020

Stumbled upon this hoping to find some progress on borg + S3-like (MinIO) backends.

@i0x71

i0x71 commented Apr 17, 2020

I'd definitely be willing to pitch in for a Borg S3 backend.

@milkey-mouse
Contributor

Amazon S3 is now strongly consistent, eliminating the most annoying part of a theoretical borg S3 backend.
Does this bounty still stand? If a working S3 borg backend were written, would it be merged?

@jrdemasi

jrdemasi commented Dec 2, 2020 via email

@flibustenet

There is a new project, Kopia (in Go, like Restic), which is not as mature as Borg but has many of the same features plus an S3 backend.

@ThomasWaldmann
Member

@milkey-mouse that is hard to say in advance. i guess the code change would have to be reasonably clean and not risk the stability of the existing file/ssh-based backend. not sure whether that is possible.

@drekbour

Some thoughts/repetition, trying to focus this to an achievable goal.

The bounty stated S3 but practically it simply represents the massive community desire to "backup directly to cloud". Borg doesn't have a clear story for this - this issue is full of misgivings and individual workarounds.

Many are coping by using the indirect 2-step process: client --(borg serve)--> NAS --(rclone sync)--> cloud, which gives LAN performance but has various downsides too.

Given borg's UNIX-style do-one-thing-well design, it feels preferable to remain as a chainable tool rather than maintain support for N clouds. Many cloud-specific fuse utils exist (seems a popular programming project!) but the best is surely rclone - a highly respected multi-cloud utility. I don't think anyone would come asking borg for more if there was a confirmed and documented solution with rclone.

This issue then reduces to:

  • What are the friction points of borg + rclone mount
  • What borg changes can be made to address those
  • What rclone changes can be made to address those
  • Document the pattern officially to supersede years of noise

rclone/rclone#3641 covers this exact use case from the rclone side and is similarly sprawling and inconclusive, but it does contain more detail around re-reading of segments during borg create (reading data costs $ on S3, and time in all cases), the use of various caching options to avoid rapid re-writes, and handling retries.

@adamsmd

adamsmd commented Dec 8, 2021

@drekbour Thank you for mentioning rclone mount. All the other online instructions that I've found involve keeping a local copy of the repository, which I don't have the disk space for and thus made borg impractical for me. I had no idea rclone had a mount option.

I agree that having something in the borg docs about rclone mount would have helped when I was trying to figure out if borg could back up directly to the cloud. ("Directly" being a key word.)

A minor friction point that remains is the need to dedicate a directory to serve as a mount point. It's not a show-stopper for me, but it is slightly inconvenient. For example, I have to figure out where I want to place the mount point and how to prevent locate/updatedb or even a naive grep from scanning into the mount point and causing network traffic.

@adamsmd

adamsmd commented Dec 13, 2021

Well, I've found a couple more friction points. They aren't bugs in borg per se, but I am documenting them here (1) for those who have similar issues and find this thread after me, and (2) if anyone wants to take on reducing these usability issues.

  1. I have not been able to get the rclone setup from /etc/fstab to work with the systemd automounter. (Thus I am currently using a manual mount.)

My fstab has:

<rclone-account-name>:<bucket-name>		<mount-point>	rclone	rw,nofail,x-systemd.automount,args2env,log_file=/var/log/rclone.log,vfs_cache_mode=off,config=/etc/rclone.conf,cache_dir=/var/cache/rclone,vfs-cache-max-size=5G	0	0

Note that this requires creating the following symlink in order to work.

$ ls -la /usr/sbin/mount.rclone 
lrwxrwxrwx 1 root root 15 Dec  8 23:05 /usr/sbin/mount.rclone -> /usr/bin/rclone

Also note that several online descriptions say you can use _netdev to prevent mounting until the network is up, but for me that results in rclone interpreting that as an (invalid) flag and producing an error. Since the automounter doesn't work for me, I manually run sudo mount <mount-point>.

  2. In my setup, borg outpaces the speed at which rclone can upload updates. With --vfs-cache-mode=writes (the most common recommendation), this can result in rclone's cache folder ballooning to ~100GB. Once borg stops, the cache folder slowly shrinks as rclone gets the data uploaded. The solution I've gone with is to disable caching.

Personally, I would prefer borg to use rclone (or some standard API it supports) as a library over having to deal with the complications of mounting. There is a comment (rclone/rclone#361 (comment)) briefly explaining in principle how to do that and a couple of borg issues discussing this (#1070 and #5324), but I haven't dug into this.

@dragetd
Contributor

dragetd commented Dec 13, 2021

Slightly unrelated to the issue, but:
@adamsmd The _netdev issue is fixed in the next rclone version, see here: rclone/rclone#5808

About write caching: I would not recommend using any form of write caching with rclone mount + borg! It might have changed during the last ~6 months, but before that I had been using it a lot, and rclone might sync indexes or segments and then get interrupted at a point where one is not in sync with the other. This corrupts the borg repository. Just avoid any form of write caching; this has been stable for me so far.

@adamsmd

adamsmd commented Dec 13, 2021

@dragetd Regarding caching: Is there any way to do read caching without write caching? Maybe I'm looking in the wrong part of the docs, but I haven't found anything. (I am currently using --vfs-cache-mode=off, which appears to be the only cache mode without any write caching, but which I think also turns off read caching.)

@Zackhardtoname

Zackhardtoname commented Jan 28, 2022

I haven't read all the messages here, so I don't know if it has come up: we could use EFS. We could also use a storage gateway in either file or volume mode.

For EFS (https://aws.amazon.com/efs/), maybe we could mount this NFS-based filesystem on our local system and just back up to this directory as well. I think it might be easier to configure than a gateway, but a storage gateway could probably end up cheaper.

@wsmeichinllc

Glad to have found the thread. I'm actively using rclone mount with Backblaze B2 as part of a script that runs borg backup:

  • rclone mount
  • borg create to mounted path
  • sync and unmount
    This also meets my requirement that the backup file system is not accessible when the backup isn't running.

The only issue I've run into at this point is that I have a lot of zero-length segments in the data directory. Running borg check shows that there are 75 backups, but there are 331 segments, 158 of which are zero-length. While this generates a lot of Data integrity error: invalid segment magic [segment x, offset 0] messages during checks, I get Archive consistency check complete, no problems found at the end. And restores are successful.

For reference:

  • rclone v1.53.3
  • borg 1.1.16
  • debian 11
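For anyone hitting the same thing, here is a quick hypothetical helper to count the zero-length segment files, assuming the usual borg 1.x layout of <repo>/data/<dir>/<segment> (note that walking a mounted remote causes network traffic):

    import os
    import sys

    repo = sys.argv[1]  # path to the (mounted) borg repository

    empty = []
    for dirpath, _dirs, files in os.walk(os.path.join(repo, 'data')):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                empty.append(path)

    print('%d zero-length segment files' % len(empty))
    for path in empty:
        print(path)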

@ThomasWaldmann
Member

@wsmeichinllc that sounds like the rclone fs is not working correctly. borg usually does not create zero-length files.

@wsmeichinllc

I expect so. Does borg delete files? Because then it would be an issue with the way rclone (or b2) handles delete, which is very specific.

In any case, it's working.

@ThomasWaldmann
Member

Yes, if borg compacts the repo's segment files, new segment files will be created and old non-compact ones will be deleted.

@hartmark

I use S3 with rclone and don't get any empty files.

@ThomasWaldmann
Member

ThomasWaldmann commented Sep 9, 2024

#8332 was a big step to make any key/value store usable with borg2.

borgstore implements:

  • file: borg uses that for local repos (and also via ssh: borg serve)
  • sftp: borg can use that for remote repos without the need for "borg serve" on the remote.

borgstore could implement:

  • guess it could also talk to cloud storage providers
  • it does not need FUSE for that; it could talk directly to their API (a rough sketch follows).
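To make that last point concrete, here is a rough sketch of what talking to S3 directly could look like (using boto3; the class shape is an assumption for illustration, not borgstore's actual interface):

    import boto3


    class S3Backend:
        """Hypothetical object-store backend: one S3 key per stored item."""

        def __init__(self, bucket: str, prefix: str = ''):
            self.s3 = boto3.client('s3')
            self.bucket = bucket
            self.prefix = prefix

        def store(self, name: str, value: bytes) -> None:
            self.s3.put_object(Bucket=self.bucket, Key=self.prefix + name, Body=value)

        def load(self, name: str) -> bytes:
            resp = self.s3.get_object(Bucket=self.bucket, Key=self.prefix + name)
            return resp['Body'].read()

        def delete(self, name: str) -> None:
            self.s3.delete_object(Bucket=self.bucket, Key=self.prefix + name)

        def list(self, prefix: str = ''):
            paginator = self.s3.get_paginator('list_objects_v2')
            for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix + prefix):
                for obj in page.get('Contents', []):
                    yield obj['Key'][len(self.prefix):]

Since S3 became strongly consistent (as noted above), such a backend no longer has to work around read-after-write surprises, though locking and per-request costs remain open questions.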
