Discussion:
[Dar-discussions] RSync like diff for incremental stuff
Cyril Russo
2009-06-13 10:38:14 UTC
Hi,

I'm wondering if DAR could use an rsync-like system for incremental backups.
Currently, if I want to back up my mail file (3 GB), which changes hourly,
DAR backs up the whole 3 GB every time it runs.

With an rsync-like algorithm, it would save only the difference
(which is not that much, something like 10 MB).
Usually rsync requires two copies of the file (the "previous file" and the
"new file") in order to find the difference.
This is an annoying requirement for an archive format, as the "previous
file" is usually compressed / encrypted, and as such would require too
much CPU to uncompress / decrypt for the comparison.

However, librsync has a mechanism to compute a "signature" of a file,
and this signature is used in place of the "previous file". The
signature itself is very small (compared to the file), so it might be
possible to store such a signature in the catalog.
The signature is compared to the "new file", and the diff can be computed
from it (so in my previous example, only 10 MB of data would be stored
in the backup).
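As an aside, the core of that signature-plus-delta scheme can be sketched in a few lines of Python. This is a toy model only: real librsync uses a rolling weak checksum and a compact wire format, and the block size, helper names and op format here are invented for illustration.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for the demo; real tools use a few KiB

def signature(data: bytes):
    """Per-block (weak, strong) checksums of the previous file: this small
    table is all that is needed later, not the file itself."""
    sig = {}
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        sig.setdefault(zlib.adler32(block), []).append(
            (hashlib.sha256(block).digest(), off))
    return sig

def delta(sig, new: bytes):
    """Scan the new file once; emit ('copy', old_offset) where a block from
    the old file is recognised, ('literal', bytes) for new data."""
    ops, lit, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + BLOCK]
        match = None
        if len(block) == BLOCK:
            for strong, off in sig.get(zlib.adler32(block), []):
                if strong == hashlib.sha256(block).digest():
                    match = off
                    break
        if match is not None:
            if lit:
                ops.append(("literal", bytes(lit)))
                lit = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
        else:
            lit.append(new[pos])
            pos += 1
    if lit:
        ops.append(("literal", bytes(lit)))
    return ops

def patch(old: bytes, ops):
    """Rebuild the new file from the old file plus the (small) delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + BLOCK] if kind == "copy" else arg
    return bytes(out)

old = b"hello world, unchanged tail"
new = b"hello brave world, unchanged tail"
ops = delta(signature(old), new)
assert patch(old, ops) == new  # delta + old file reproduces the new file
```

Note that only the signature table and the delta need to be stored; the unchanged tail of the file is carried as small "copy" references.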

Upon restoring, however, the whole chain of backups must be read (as the
final file is made of original_file + diff(s)).
As restoring happens very rarely, I don't think that's a problem: the
space gained by using diffs is worth the extra CPU time.

That way, the DAR format would really fit all the possible
requirements for a backup tool, as it would be optimal in size
(rsync-like algorithm), in speed (binary code), and in security
(encryption).

What do you think about this?
Cyril
Sterling Windmill
2009-06-14 15:55:05 UTC
Have you tried rdiff-backup ?




Sterling Windmill | Systems & Technology
Custom Data Solutions, Inc.

410 S. Main St | Romeo | MI | 48065
586-752-9671 ext 161 | fax: 586-752-6589
toll free: 800-441-9595 | fax: 800-383-4551
www.custdata.com


CONFIDENTIALITY NOTICE: This email contains information from the sender that may be CONFIDENTIAL, LEGALLY PRIVILEGED, PROPRIETARY or otherwise protected from disclosure. This email is intended for use only by the person or entity to whom it is addressed. If you are not the intended recipient, any use, disclosure, copying, distribution, printing, or any action taken in reliance on the contents of this email, is strictly prohibited. If you received this email in error, please contact the sending party by replying in an email to the sender, delete the email from your computer system and shred any paper copies of the email you may have printed.

Cyril Russo
2009-06-15 08:54:38 UTC
Post by Sterling Windmill
Have you tried rdiff-backup ?
Yes
And, even better, duplicity.

However, all those tools still lack a major technical feature.
rdiff-backup neither compresses (unless you use gzip --rsyncable) nor
encrypts, so it's useless if you don't control or trust the remote host.
Duplicity compresses, encrypts, and uses rsync, but it uses TAR
internally, so browsing a backup is a real pain (there is no catalog in
the archive itself, so it's very slow).
Duplicity does come with a huge list of backends for accessing remote
storage (sshfs, imap, webdav, ftp, S3, tahoe, local, etc.).

I thought about an ultimate backup tool that would combine the smartness
of DAR (catalog / compression / encryption / incremental) with optimal
disk consumption (thanks to rsync).
That way, browsing the backup set would still be very fast.

Following the Unix philosophy of "one tool does one thing well", I would
let DAR create a local backup archive on a remotely mounted filesystem
(NFS, SSHFS, WebdavFS, etc.).

Anyway, how complex would this be to implement in DAR?
Denis Corbin
2009-06-15 20:01:13 UTC
Hello,
Post by Cyril Russo
Post by Sterling Windmill
Have you tried rdiff-backup ?
Yes
And, even better, duplicity.
However, all those tools still lack a major technical feature.
rdiff-backup neither compresses (unless you use gzip --rsyncable) nor
encrypts, so it's useless if you don't control or trust the remote host.
Duplicity compresses, encrypts, and uses rsync, but it uses TAR
internally, so browsing a backup is a real pain (there is no catalog in
the archive itself, so it's very slow).
Duplicity does come with a huge list of backends for accessing remote
storage (sshfs, imap, webdav, ftp, S3, tahoe, local, etc.).
I thought about an ultimate backup tool that would combine the smartness
of DAR (catalog / compression / encryption / incremental) with optimal
disk consumption (thanks to rsync).
That way, browsing the backup set would still be very fast.
Following the Unix philosophy of "one tool does one thing well", I would
let DAR create a local backup archive on a remotely mounted filesystem
(NFS, SSHFS, WebdavFS, etc.).
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).

The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.

Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.

The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.

This was for the functional level.

Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).

What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.


Regards,
Denis.
Cyril Russo
2009-06-16 07:10:38 UTC
Hello,
Post by Cyril Russo
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).
Well, this is not exact, and that's one of the main features of librsync.
librsync creates a signature file from a binary file, and you can (and
should) use this signature as the first comparand.
The signature itself is very small compared to the file (usually only
1%, but it can be less).
Over successive incremental backups, you only have to update the signature.
The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.
The signature algorithm uses two different checksums (strong and rolling).
The rolling checksum can detect byte shifts (insertion or deletion
within the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matches.
So, it's pretty unlikely that the SHA-256 would be the same (the
probability is 1/2^128), but it could happen.
It's also pretty unlikely that the rolling checksum matches for 2
different inputs (the probability is 1/2^32 for a 64-bit rolling checksum).
In any case, it's very uncommon for both to give the same results.
At worst, as the strong checksum is done on a fixed block size, a
collision would mean that only that one block went unnoticed
(so no byte was ever added or removed, otherwise the other blocks would
have changed too and would be unlikely to go unnoticed).
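The two-level check described above can be sketched as follows (hypothetical helper name, not librsync's actual API; Adler-32 stands in for the rolling checksum and SHA-256 for the strong one):

```python
import hashlib
import zlib

def block_matches(block: bytes, weak: int, strong: bytes) -> bool:
    """Cheap weak check first; the expensive strong hash is only computed
    when the weak checksum already matched."""
    if zlib.adler32(block) != weak:       # ~1/2^32 false-positive rate
        return False
    return hashlib.sha256(block).digest() == strong  # confirms the hit

orig = b"mail folder block, version 1"
weak, strong = zlib.adler32(orig), hashlib.sha256(orig).digest()

assert block_matches(orig, weak, strong)
assert not block_matches(b"mail folder block, version 2", weak, strong)
```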

As rsync is widely used for distributing binary Linux distributions
(with data > 100 GB), I guess it's not a real issue in the real world.
Gentoo uses it, and YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses it, so
I'm pretty confident that this doesn't happen in practice.
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.
Yes, I agree with you here.
Restoring an incremental backup then requires the whole backup chain
(from the last full backup up to the last incremental version).
BTW, this wouldn't be much slower than it is currently: the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
If you get the wrong version (and you must have decompressed the whole
file to see its content), you have to repeat the above procedure for
each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.

The new code would do
(undiff(decrypt(decompress(orig_file)))^number_incremental).
This means that, once you've decompressed the original file, going from
one version to the next is only a diff (which is an O(1) operation), so
it's faster than the current code.
You can even compute the diff against the current file, if it still
exists, avoid the first (longer) step, and go backward.
Going from one version to the next is then an O(1) operation, so it's
faster than decompressing the whole file again.
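The restore chain described here can be pictured with a toy patch format, where each incremental stores (offset, bytes) overwrites. This is purely illustrative; a real delta would use copy/literal ops and a proper encoding.

```python
def apply_patch(base: bytes, edits):
    """Apply one incremental: a list of (offset, replacement_bytes) edits."""
    out = bytearray(base)
    for off, data in edits:
        out[off:off + len(data)] = data
    return bytes(out)

def restore(full_backup: bytes, chain):
    """Walk full backup -> diff1 -> diff2 -> ... up to the wanted version."""
    state = full_backup
    for edits in chain:
        state = apply_patch(state, edits)
    return state

full = b"version-0 of the mail file"
chain = [[(8, b"1")], [(8, b"2")]]          # two hourly incrementals
assert restore(full, chain) == b"version-2 of the mail file"
assert restore(full, chain[:1]) == b"version-1 of the mail file"
```

Truncating the chain gives any intermediate version, which is the "check 2 or 3 versions" workflow from the text.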
This was for the functional level.
Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).
CRCs have the bad habit of colliding.
Please store the rsync signature there instead, as you'll get (for the
same price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.
Please read up on the rsync algorithm, it's very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
Using a signature (as said above), you don't need to read the previous
file at all, and you only need to read the new file *once*, so it's
exactly like the current code in fact.
Even better, the system will be faster, as you won't be compressing /
encrypting the unchanged data, so the time you lose reading the small
signature file (usually 1% of the original file size) is more than
compensated by *not* compressing the redundant data.
The amount of I/O will be lower too (as you store fewer compressed
blocks on the FS), so you'll gain in network and disk bandwidth.

In fact, rsync usually provides a 2x to 16x speedup over gzipping the
whole file, for example (run an rsync test and you'll see for yourself).

Sure, if the rsync diff size > input size, then it's better to perform a
full backup; this never happens in reality.
I guess a better rule of thumb would be: if rsync_diff_size +
signature_size > input_size, then a full backup might be worthwhile.
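That rule of thumb is trivially codified (hypothetical function name; sizes in bytes):

```python
def backup_mode(diff_size: int, signature_size: int, input_size: int) -> str:
    """Fall back to a full backup once the delta plus the signature stop
    being smaller than the file itself."""
    if diff_size + signature_size > input_size:
        return "full"
    return "diff"

# 3 GB mail file, ~10 MB of real change, ~30 MB signature (about 1%)
assert backup_mode(10_000_000, 30_000_000, 3_000_000_000) == "diff"
# pathological case: the delta is as big as the file
assert backup_mode(3_000_000_000, 30_000_000, 3_000_000_000) == "full"
```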

Cyril
Denis Corbin
2009-06-16 21:42:34 UTC
Cyril Russo wrote:

[...]
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).
Well, this is not exact, and that's one of the main features of librsync.
librsync creates a signature file from a binary file, and you can (and
should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make the binary diff
without the original file, since you need to know where the change is
located.
Post by Cyril Russo
The signature itself is very small compared to the file (usually only
1%, but it can be less).
Over successive incremental backups, you only have to update the signature.
If there is no change, you have no signature to update, right? Otherwise
you also have to deal with the data, not just the signature, no?
Post by Cyril Russo
Post by Denis Corbin
The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
If there was a change instead, without the original copy you have to
back up the whole file (which is the current situation). And if you now
have a partially saved file, chances are you will have to back up the
whole file again... unless you ask the user to provide the former
backups up to the full backup... and this for each file in that
situation...
Post by Cyril Russo
Post by Denis Corbin
Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.
The signature algorithm uses two different checksums (strong and rolling).
The rolling checksum can detect byte shifts (insertion or deletion
within the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matches.
So, it's pretty unlikely that the SHA-256 would be the same (the
probability is 1/2^128), but it could happen.
Precisely, yes. However improbable it may be, the situation can occur
where data is not saved properly. For that reason this can only be an
optional feature (not activated by default), with a warning to the user
upon activation.
Post by Cyril Russo
It's also pretty unlikely that the rolling checksum matches for 2
different inputs (the probability is 1/2^32 for a 64-bit rolling checksum).
In any case, it's very uncommon for both to give the same results.
At worst, as the strong checksum is done on a fixed block size, a
collision would mean that only that one block went unnoticed
(so no byte was ever added or removed, otherwise the other blocks would
have changed too and would be unlikely to go unnoticed).
As rsync is widely used for distributing binary Linux distributions
(with data > 100 GB), I guess it's not a real issue in the real world.
Well, maybe in the rsync context this is not a real problem, because in
case of a signature collision you simply keep the previous file version
in the (remote) synced directory: the whole copied file stays a valid
but obsolete version of the file. This gets fixed at the next change, so
differences between copies are very rare and do not persist. At
restoration time instead, if the signatures match between the file in
the filesystem and the file that was used as the base of the binary
diff, while those files actually differ, this would lead to patching the
wrong file and producing file contents that had never existed before.
That would go unnoticed until you try to use the file, which would most
probably lead to a core dump if it is an executable, for example. The
probability of such an occurrence is no greater with backup than with
rsync, but the consequences are much more serious.

OK, now considering MD5 with 128 bits (thus 3.4E38 possibilities), the
average number of files one uses on a system (200,000), and the number
of people today (around 6 billion), such a collision would occur once
every 7.78E18 centuries... But, on principle, this does not remove the
need to warn the user.
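For comparison, a quick back-of-envelope with assumed figures (a 128-bit digest, daily backups, every file changing every day): the exact result depends heavily on the model, so this is only an order-of-magnitude sketch, not a reproduction of the figure above.

```python
# Expected number of strong-checksum collisions per year, worldwide,
# under the (assumed, illustrative) model described in the lead-in.
files_per_system = 200_000
systems = 6_000_000_000        # assumed: order of world population
backups_per_year = 365

p_collision = 2.0 ** -128                      # one changed-file comparison
comparisons = files_per_system * systems * backups_per_year
expected_per_year = p_collision * comparisons  # roughly 1e-21: vanishingly rare
assert expected_per_year < 1e-18
```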
Post by Cyril Russo
Gentoo uses it, and YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses it, so
I'm pretty confident that this doesn't happen in practice.
Just because something is extremely improbable does not mean that it
does not happen at all... there is a difference, and for me it matters.
Post by Cyril Russo
Post by Denis Corbin
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.
Yes, I agree with you here.
Restoring an incremental backup then requires the whole backup chain
(from the last full backup up to the last incremental version).
BTW, this wouldn't be much slower than it is currently: the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
It would probably be a bit faster for a full restoration, as there
would be less data to copy... on the other hand, there is a more complex
algorithm to run... but that's not the point here.

However, it would be more painful for the user to restore a small set of
files, as you would have to access more archives and slices for the same
set of files.
Post by Cyril Russo
If you get the wrong version (and you must have decompressed the whole
file to see its content), you have to repeat the above procedure for
each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.
The new code would do
(undiff(decrypt(decompress(orig_file)))^number_incremental).
This means that, once you've decompressed the original file, going from
one version to the next is only a diff (which is an O(1) operation), so
it's faster than the current code.
A diff is not a constant-time operation, it depends on the amount of
data to compare; isn't it rather an O(n) operation?
Post by Cyril Russo
You can even compute the diff against the current file, if it still
exists, avoid the first (longer) step, and go backward.
Going from one version to the next is then an O(1) operation, so it's
faster than decompressing the whole file again.
Post by Denis Corbin
This was for the functional level.
Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).
CRCs have the bad habit of colliding.
Please store the rsync signature there instead, as you'll get (for the
same price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
There are already scripts that simplify Parchive use with dar (automatic
redundancy generation, testing and repairing).
(See http://dar.linux.free.fr/doc/samples/index.html )

By the way, a CRC is just a way of signing a block of data. Its weakness
comes from the address space it has... usually 32 bits, where MD5 has
128 bits...
Post by Cyril Russo
Post by Denis Corbin
What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.
Please read up on the rsync algorithm, it's very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, and it is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Basing a differential backup on other differential backups needs another
algorithm for binary diff.
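The rolling property in question can be sketched with an Adler-style weak checksum (simplified from rsync's; components kept modulo 2^16). The point is that the window slides one byte in O(1) instead of re-reading the whole block:

```python
M = 1 << 16  # weak checksum components are kept modulo 2^16, as in rsync

def weak(block: bytes):
    """Direct O(len) computation of the (a, b) weak checksum of a window."""
    n = len(block)
    a = sum(block) % M
    b = 0
    for i, x in enumerate(block):
        b = (b + (n - i) * x) % M
    return a, b

def roll(a: int, b: int, n: int, out_byte: int, in_byte: int):
    """Slide the window one byte to the right in O(1): drop out_byte on
    the left, take in_byte on the right."""
    a2 = (a - out_byte + in_byte) % M
    b2 = (b - n * out_byte + a2) % M
    return a2, b2

data = b"the quick brown fox jumps"
n = 8
a, b = weak(data[:n])
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, n, data[i - 1], data[i + n - 1])
    assert (a, b) == weak(data[i:i + n])  # O(1) slide matches recomputation
```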
Post by Cyril Russo
Using a signature (as said above), you don't need to read the previous
file at all, and you only need to read the new file *once*, so it's
exactly like the current code in fact.
Maybe I do not understand what you mean here above.
Post by Cyril Russo
Even better, the system will be faster, as you won't be compressing /
encrypting the unchanged data, so the time you lose reading the small
signature file (usually 1% of the original file size) is more than
compensated by *not* compressing the redundant data.
The amount of I/O will be lower too (as you store fewer compressed
blocks on the FS), so you'll gain in network and disk bandwidth.
Actually, if a file has not changed (judging by its last modification
time), there is no need to compress / encrypt it. I probably do not
understand what you mean.
Post by Cyril Russo
In fact, rsync usually provides a 2x to 16x speedup over gzipping the
whole file, for example (run an rsync test and you'll see for yourself).
I use rsync daily (through fcron), though I had never made the effort to
look at how it works. Thanks for letting me know. ;-)
Post by Cyril Russo
Sure, if the rsync diff size > input size, then it's better to perform a
full backup. This never happens in reality.
I guess a better rule of thumb would be: if rsync_diff_size +
signature_size > input_size, then it might be worth doing a full backup.
Cyril
Back to dar, I see this possibility:

# is saved data
- is unsaved data

assuming there is only one file in the backup:

full backup : ###########################
diff1 backup : --------####-------###-----
diff2 backup : --------###----------#-----
but if some data changed at a place in the file which is unsaved (for
example at the beginning), the whole data block has to be saved again;
diff2 would then have been:
diff2 backup : ###########----------#-----

restoration would need "full" then "diff1", then "diff2". Restoration
would fail for a diff backup if no file exists on the filesystem or if its
global data signature does not match the expected one.

Dar would also store a signature for each block. By block I mean a
sequence of unsaved data or a sequence of saved data.

See
http://sourceforge.net/tracker/?func=detail&aid=2803478&group_id=65612&atid=511615
for implementation detail seen so far.

Regards,
Denis.
Cyril Russo
2009-06-17 08:08:25 UTC
Permalink
[...]
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would it be to implement in DAR?
First, to be able to make a binary diff you must have the original
binary beside the current one to compare them (actually dar relies on
inode data changes to decide whether to save the whole file again or
not save it at all).
Well, this is not exact, and it's one of the main features of librsync.
librsync creates a signature file from a binary file, and you
can (and should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make a binary diff
without the original file to know where the change is located.
I'll try to take an example here:
Let's say I have a 1/2 GB file.
The signature is set to use 512 KB blocks (something crazy, as it's very big).
In my signature, I'll get 1000 blocks, each entry of 32 bytes, so a 32 KB file.

Then, let's say I've modified the file, and I'm running an incremental
backup.
The signature is compared to the new file (and only the new file).
Blocks 234, 235, and 636 are modified (detected thanks to the signature), so:
1) Blocks 234, 235 and 636 are saved in the incremental
backup (and only those blocks)
2) The new signature is updated from the new_file

In the process, I've never read the initial file again.
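As a rough illustration of this per-block scheme, here is a Python sketch (hypothetical helper names, not DAR or librsync code; real rsync additionally uses a rolling checksum so blocks can match at any offset, while this sketch only compares blocks at fixed positions):

```python
import hashlib

def make_signature(path, block_size=512 * 1024):
    """Hash each fixed-size block of the file: this list of hashes is
    the 'signature', tiny compared to the file itself."""
    sig = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sig.append(hashlib.md5(block).digest())  # 16 bytes per block
    return sig

def changed_blocks(old_sig, new_path, block_size=512 * 1024):
    """Find modified blocks by reading the *new* file only: the old
    file is never opened again, just its stored signature."""
    new_sig = make_signature(new_path, block_size)
    return [i for i, h in enumerate(new_sig)
            if i >= len(old_sig) or old_sig[i] != h]
```

Only the block indices returned by `changed_blocks` (and their data) would need to be stored in the incremental backup, and the new signature replaces the old one in the catalog.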
Post by Cyril Russo
The signature itself is very small compared to the file (usually only
1%, but it can be less).
During the incremental backup, you only have to update the signature.
If there is no change you have no signature to update, right? Else you
also have to deal with data, not just signature, no?
Exactly. Signatures are derived from the data only. So no change in
data => no change in signature.
Post by Cyril Russo
Post by Denis Corbin
Consequences is that this is not compatible with differential backup
based on catalogue, the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
If there was a change instead, without the original copy you have to
backup the whole file (which is the current situation). And, if now you
have a partially saved file, there is a good chance you have to back up the
whole file again... unless you ask the user to provide former backups up
to the full backup... this for each file that falls in this situation...
This is wrong. The aim of the signature is to never use the original
file again.
Hopefully it's something very convenient to use.
Post by Cyril Russo
Post by Denis Corbin
Worse, it will be difficult to base such a backup on a differential
backup. Suppose the file is already binary diffed... how to know if the
part of the file that is not present in the backup has changed? Yes, using
signatures (or ~ CRC), but there is still the risk that the file changed
while the signature stays the same, and thus the change would not be saved.
The signature algorithm uses 2 different checksums (strong and rolling).
The rolling checksum is a checksum that can detect byte shifts (insertions
or deletions in the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matched.
So, it's pretty unlikely that the SHA-256 would be the same (the
birthday-bound probability is about 1/2^128), but it could happen.
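The two-level check described above can be sketched as follows (a simplified illustration with made-up helper names; the weak sum is an Adler-32-style sum in the spirit of, but not identical to, rsync's actual rolling checksum):

```python
import hashlib

MOD = 1 << 16  # 16-bit halves, as in rsync's original weak checksum

def weak_sum(block):
    """Weak checksum: cheap to compute and sensitive to byte shifts.
    (In real rsync it can also be updated in O(1) as the window slides.)"""
    a = sum(block) % MOD
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % MOD
    return (b << 16) | a

def block_matches(block, candidate_weak, candidate_strong):
    """Weak check first: almost all mismatches are rejected cheaply.
    The expensive strong hash is only computed on a weak-sum hit."""
    if weak_sum(block) != candidate_weak:
        return False
    return hashlib.sha1(block).digest() == candidate_strong
```

A false positive requires both checksums to collide at once, which is why the combined scheme is so much safer than either checksum alone.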
Precisely yes. However improbable it may be, the situation can occur where
data is not saved properly. For that reason this can only be an optional
feature (not activated by default), with a warning toward the user upon
activation.
Sure, as there is still the (far more probable) hard disk corruption (with a
probability of 0.70% per year, so ~10^21 times more probable than a hash
collision).
Anyway, it's always safe to inform the user, as (s)he's the final decision
maker.
Post by Cyril Russo
It's pretty unlikely that the rolling checksum matches for 2 different
inputs (the probability is 1/2^32 for a 32-bit rolling checksum).
However, it's very very uncommon that both would give the same results.
Worse, as the strong checksum is done on a fixed block size, this would
mean that only that block went unnoticed
(so no byte was ever added or removed, else the other blocks would have
changed too and would be unlikely to go unnoticed).
As rsync is widely used for binary Linux distributions (with data >
100 GB), I guess it's not a real issue in our real world.
Well, maybe in the rsync context this is not a real problem, because in
case of a signature collision you just keep the previous file version in
the (remote) synced directory; the whole copied file stays a valid but
obsolete version of the file. This will be fixed at the next change, so
differences between copies are very rare and do not tend to persist. In
case of restoration instead, if signatures are the same between the file
in the filesystem and the file that was used to base the binary diff on,
while those files do differ, this would lead to patching the wrong file
and getting file contents that had never existed before. This would be
unnoticeable unless you try to use the file, which would most probably
lead to a core dump if it is an executable, for example. The probability
of such an occurrence is not greater with backup than with rsync, but
the consequences are much more important.
Don't forget that rsync is block based, so if a hash collision happens,
you'll get a binary clash too (the new version with a block from
the old version).
This rarely happens in reality.
OK, now considering MD5 with 128 bits (thus 3.4E38 possibilities), the
average number of files one uses (200,000) on a system, and the number
of people today (around 6 billion), such a collision would occur once
every 7.78E18 centuries... But, as a matter of principle, this does not
remove the need to warn the user.
Exactly (moreover, most of the files are the same for each user).
Note however that MD5 really provides only about 64 safe bits (see
attacks on MD5), so the probability is still higher than it looks.
rsync can use SHA-1 and SHA-256 too, so safety can be enhanced.
Post by Cyril Russo
Gentoo uses this; YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses this, so
I'm pretty confident that this doesn't happen at all.
It is not because something is extremely improbable that it never happens
at all... there is a difference, which for me has its importance.
Yes.
BTW, if we look at current safety & security in Dar:
1) Data safety
- The minimum limit is hard drive failure (with a probability of 0.007 per
year, see Wikipedia on hard drives)
2) Data security
- Encryption derives the key from the password hash. That way, one
could find a hash that collides (the probability is 1/2^64 for a 128-bit
hash) to get access to the unencrypted file.

To me, the dual-hash failure (with probability 1/2^64 * 1/2^32 =>
1/2^96) is so far below the current limits (more than 10
orders of magnitude lower) that I wouldn't fear it at all.
We can improve on hard drive failure with Reed-Solomon parity, but never
gain the 10^10 difference.
Post by Cyril Russo
Post by Denis Corbin
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
The use of a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file is really
the same as the one over which the binary diff has been made (there is
still a little chance that two different files get the same signature),
and I don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at backup
time, except comparing byte by byte, thus requiring every previous
diff backup up to the first where the whole data is present. But now, if
the user is warned about this risk and wants to take it... I see no
problem.
Yes, I agree with you here.
Restoring an incremental backup here requires the whole backup chain (from
the last full backup to the last incremental version).
BTW, this wouldn't be much slower than it is currently, as the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
It would probably be a bit faster for a full restoration, as there would
be less data to copy... on the other hand, there is a more complex
algorithm to run... but that's not the point here.
However, it would be more painful for the user to restore a small set
of files, as you would have to access more archives and slices for the
same set of files to restore.
Yes. This is the drawback of such an algorithm. You have to have all steps
from full to last incremental.
At the same time, the saving in disk space / bandwidth might be worth the
extra effort on restoring.

I rarely restore my backup (only on failure, once a year ? once a month
?), but I'm more concerned by not sending my 3GB mail file
every day to my remote server.
Post by Cyril Russo
If you get the wrong version (as you must have decompressed the whole
file to see its content), then you have to reproduce the above procedure
for each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.
The new code would do
undiff(decrypt(decompress(orig_file)))^number_incremental.
This means that, once you've decompressed the original file, going from
one version to the other is only a diff (which is an O(1) operation), so
it's faster than the current code.
a diff is not a constant operation it depends on the amount of data to
compare, isn't it rather an O(n) operation?
Producing a diff is O(n*m).
Fortunately, the diff stores the position of each difference before the
difference itself, so patching (applying the diff) is O(1), as you'll only
touch the modified parts and won't read the rest of the file.

You're right that most filesystems don't allow inserting / removing
at arbitrary positions, so once you are past a block whose size changed,
you'll have to rewrite the following blocks too (thus an O(n) operation).
But this is a filesystem limit, not a mathematical limit.

And as you'll only write the file once, if you have 10 diffs from the
full to the last incremental, you'll have 9 O(1) operations (producing the
differences in memory) and 1 O(n) operation (writing the final file to
disk).
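The point about patching touching only the modified regions can be shown with a toy patch format of (offset, data) records (an illustrative assumption, not DAR's actual storage format):

```python
def apply_patch(path, patch):
    """Apply a list of (offset, data) records in place.

    Only the modified regions are written; the unchanged parts of the
    file are never read or rewritten, which is why chaining several
    diffs is cheap compared with rewriting the whole file each time.
    """
    with open(path, "r+b") as f:
        for offset, data in patch:
            f.seek(offset)
            f.write(data)
```

As noted above, this only holds for same-size replacements: an insertion or deletion forces rewriting everything past the change on ordinary filesystems.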
Post by Cyril Russo
You can even compute the diff against the current file, if it still exists,
and avoid the first (longer) step, and go backward.
Then going from one version to the other is an O(1) operation, so it's
faster than decompressing the whole thing again.
Post by Denis Corbin
This was for the functional level.
Seen now at the implementation level, this should be feasible, but at the
cost of some major changes in the way dar stores files. We must now not only
store a sequence of bytes to be copied to a brand new file, but a list of
blocks with offset, length and data (and probably a checksum). Well, there
is a feature in the pipe that will bring some changes at this level,
which is the possibility to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little bit to leave room for CRC fields
and the like (re-using the TLV class).
CRC is one thing that has the bad habit of colliding.
Please store the rsync signature in there, as you'll get (for the same
price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
There are already scripts that simplify Parchive use with dar (automatic
redundancy generation, testing and repairing).
(see http://dar.linux.free.fr/doc/samples/index.html )
By the way, a CRC is just a way of signing a block of data. Its weakness
comes from the small address space it has... it is usually 32 bits, where
MD5 is 128 bits...
Yes I agree.
That's great!
Post by Cyril Russo
Post by Denis Corbin
Will lack the algorithm to inspect binary files and decide which way to
define blocks of changed data. The problem here is that it will cost
much more disk I/O to compare two binaries: it is first necessary
to inspect both whole files (old backup and live filesystem) to find out
how much data has changed and which pattern the changes follow. It
may well be more costly to systematically do a diff in place of a full
backup if, for example, one byte every N bytes has changed. Then, we could
proceed either to a full backup or a binary diff backup.
Please try to read the rsync algorithm as it's very very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, this is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Having differential backup based on other differential backups needs
another algorithm for binary diff.
The smart algorithm does this:
1) Initial file => compute the rolling checksum + hash per block, this
gives the signature
2) For each incremental step:
Denis Corbin
2009-06-20 16:04:58 UTC
Permalink
Post by Denis Corbin
[...]
Post by Cyril Russo
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would it be to implement in DAR?
First, to be able to make a binary diff you must have the original
binary beside the current one to compare them (actually dar relies on
inode data changes to decide whether to save the whole file again or
not save it at all).
Well, this is not exact, and it's one of the main features of librsync.
librsync creates a signature file from a binary file, and you
can (and should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make a binary diff
without the original file to know where the change is located.
Post by Cyril Russo
Let's say I have a 1/2 GB file.
The signature is set to use 512 KB blocks (something crazy, as it's very big).
In my signature, I'll get 1000 blocks, each entry of 32 bytes, so a 32 KB file.
Then, let's say I've modified the file, and I'm running an incremental
backup.
The signature is compared to the new file (and only the new file).
1) Blocks 234, 235 and 636 are saved in the incremental
backup (and only those blocks)
2) The new signature is updated from the new_file
In the process, I've never read the initial file again.
OK, I see. This is not a signature of a given file but many signatures of
the file, one for each block of a given size.

[...]
Post by Denis Corbin
If there was a change instead, without the original copy you have to
backup the whole file (which is the current situation). And, if now you
have a partially saved file, there is a good chance you have to back up the
whole file again... unless you ask the user to provide former backups up
to the full backup... this for each file that falls in this situation...
Post by Cyril Russo
This is wrong. The aim of the signature is to never use the original
file again.
Yes, I was considering the "signature" as an overall hash of the file,
not, as you explained before, a by-block hash.


[not reproduced text but fully agreed]
Post by Denis Corbin
However, it would be more painful for the user to restore a small set
of files, as you would have to access more archives and slices for the
same set of files to restore.
Post by Cyril Russo
Yes. This is the drawback of such an algorithm. You have to have all steps
from full to last incremental.
At the same time, the saving in disk space / bandwidth might be worth the
extra effort on restoring.
I rarely restore my backup (only on failure, once a year ? once a month
?), but I'm more concerned by not sending my 3GB mail file
every day to my remote server.
:-) that's a good argument.


[...]
Post by Denis Corbin
Post by Cyril Russo
Post by Cyril Russo
The new code would do
undiff(decrypt(decompress(orig_file)))^number_incremental.
This means that, once you've decompressed the original file, going
from one version to the other is only a diff (which is an O(1)
operation), so it's faster than the current code.
a diff is not a constant operation it depends on the amount of data to
compare, isn't it rather an O(n) operation?
Post by Cyril Russo
Producing a diff is O(n*m).
Fortunately, the diff stores the position of each difference before the
difference itself, so patching (applying the diff) is O(1), as you'll only
touch the modified parts and won't read the rest of the file.
You're right that most filesystems don't allow inserting / removing
at arbitrary positions, so once you are past a block whose size changed,
you'll have to rewrite the following blocks too (thus an O(n) operation).
But this is a filesystem limit, not a mathematical limit.
And as you'll only write the file once, if you have 10 diffs from the
full to the last incremental, you'll have 9 O(1) operations (producing the
differences in memory) and 1 O(n) operation (writing the final file to
disk).
OK
[...]
Post by Denis Corbin
Post by Cyril Russo
Post by Cyril Russo
Please try to read the rsync algorithm as it's very very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, this is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Having differential backup based on other differential backups needs
another algorithm for binary diff.
Post by Cyril Russo
1) Initial file => compute the rolling checksum + hash per block, this
gives the signature
2.1)
Cyril Russo
2009-06-21 13:18:38 UTC
Permalink
Post by Denis Corbin
OK, I see. This is not a signature of a given file but many signatures of
the file, one for each block of a given size.
To be honest, it's the signature of the whole file, *including* the
signatures of the many "blocks" in the file.
In fact, the rolling checksum mechanism allows you to have the "instant"
signature of any possible block in the file, for almost nothing.
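This "instant signature for almost nothing" is the O(1) rolling-update property; here is a sketch with an Adler-32-style weak sum (simplified relative to rsync's exact checksum; `weak_parts` and `roll` are made-up names):

```python
MOD = 1 << 16  # 16-bit halves, as in rsync's original weak checksum

def weak_parts(block):
    """Full O(n) computation of the two halves (a, b) of the weak sum."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1): drop out_byte on
    the left, take in_byte on the right, without rescanning the block."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b
```

Recomputing `weak_parts` at every offset of a file would cost O(n * block_len); with `roll`, the checksum of every possible window is obtained in a single O(n) pass.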
Post by Denis Corbin
Post by Denis Corbin
# is saved data
- is unsaved data
full backup : ###########################
diff1 backup : --------####-------###-----
diff2 backup : --------###----------#-----
but if some data changed at a place in the file which is unsaved (for
example at the beginning), the whole data block has to be saved again;
diff2 backup : ###########----------#-----
diff2 backup : #-------###----------#-----
Yes, as we saw previously, with a hash per block you can know which
blocks alone to save. I must review the algorithm. What I liked here is
that there was no block size of arbitrary length to define. But it is
not efficient. Thus I have to keep the list of per-block hashes from the
differential backup and copy it to the new backup.
In fact, you could use librsync to compute such things for you (its
license is compatible).
Post by Denis Corbin
Post by Denis Corbin
restoration would need "full" then "diff1", then "diff2".
You're speaking the truth.
Restoration would fail for a diff backup if no file exists on the
filesystem or if its
global data signature does not match the expected one.
I disagree here.
Restoration fails if the full backup, diff1, or diff2 is missing.
Signatures are useless for restoration (unless going backward, see below).
full backup : ###########################
diff1 backup : --------###----------#-----
diff2 backup : --------####-------###-----
OK, you are right: using the by-block hashes you can restore even if diff1
is missing (assuming you have either the 'full' backup or the corresponding
file present in the filesystem). What I meant is that if you have *only*
diff1 or diff2 and no file in the filesystem, or a file that is not the
one you could find in "full", then restoration fails. In the current status
you can restore from a diff1 backup all files that have changed since
the archive of reference was made.
Yes. Less data means more dependence on data safety. It's a choice that
every user could make.
Either be a Mac user with Time Machine (this is what dar actually does,
storing every modified file in full), or be a Unix user with a
remote ssh/ftp machine (à la rdiff-backup / duplicity).
Local storage is cheap, but isn't safe at all (robbery / fire / flood /
electrical shock, etc.).
Remote storage is more expensive, but it's supposed to be safer.
Post by Denis Corbin
Post by Denis Corbin
Then diff1 can be missing; it'll still work (as diff2 is more complete
than diff1, and rsync stores whole blocks).
[backward restoration] Also, in your example, let's say you want to
restore up to step diff2.
If you have the signature in the diff2 backup set (as you should), AND the
new_file is still there on the filesystem, then it'll be faster to go
from new_file to diff2 than doing full + diff1 + diff2.
In order to achieve this, you compute the "hypothetical" diff3 from
diff2 to new_file (thanks to the signature) and revert new_file to diff2
by applying the reverse diff.
Nothing special here, but the speedup might really be worth the effort.
Dar would also store signature for each block. By block I mean the
sequence of unsaved data, or the sequence of saved data.
I'm sorry, I didn't understand, so I might say bullshit here.
That's not totally bullshit. My original idea was to not have a
fixed-length block size: instead, a "block" is either a continuous
portion of a file that has not changed since the reference, or a
continuous portion of a file that has changed. Each one is associated
with its own hash.
In the example below we have 5 blocks: 1, 3 and 5 only have a hash value
to detect any change in them, while 2 and 4 have hash and data.
diff1 backup : --------###----------#-----
111111112223333333333455555
As you have shown, this has the drawback of forcing the backup of a
larger amount of data than what rsync can do: if for example the first
byte changes, the whole of block 1 has to be saved:
diff2 backup : ########-------------------
               111111112223333333333455555
whereas rsync, matching at any offset, would only need to save:
diff2 backup : #--------------------------
Thus my algorithm is not good. I have to deal with a fixed block size and
with the decision of what arbitrary size to use.
I think rsync is now quite stable and used almost everywhere it can be
used, because it has proved suitable for most situations.
rsync with its rolling checksum algorithm can compute the minimum block
size quite well, is fast, and is the de facto standard.
Data are quite safe when handed to rsync.

I don't know how easy it would be to plug an rsync-like algorithm into Dar,
but I think it would be safer and simpler than re-writing a similar
algorithm (with all the risk to the data themselves).
Georg Sauthoff
2010-01-23 08:55:23 UTC
Permalink
On 2009-06-13, Cyril Russo <***@laposte.net> wrote:

Hi,
Post by Cyril Russo
I'm wondering if DAR could use rsync like system for incremental backup.
Currently, if I want to backup my mail file (3GB) that changes hourly,
it back it up everytime it's run (and the whole 3GB).
a 3 GB mbox file sounds scary. Why don't you use the maildir mailbox
format? With this format each mail is saved in a separate file, so
it is well suited for backup purposes.
Post by Cyril Russo
With a rsync like algorithm, it would have saved only the difference
(which is not that much, something like 10MB).
10 MB for 3 GB?
Post by Cyril Russo
Usually rsync requires 2 copies of the file (the "previous file" and the
"new file"), in order to find out the difference.
This is an annoying requirement for an archive format, as the "previous
file" is usually compressed / encrypted, and as such, would require too
much CPU to uncompress / decrypt for comparing.
However, librsync has a mechanism to compute a "signature" of a file,
and this signature is used in place of the "previous file". The
signature itself is very small (compared to the file), so it might be
possible to store such signature in the catalog.
The signature is compared to the "new file", and the diff can be made
from this (so in my previous example, only 10MB of data would be stored
in the backup).
Computing the diff for each incremental backup is a O(n^2) operation,
where n is the file length.
Post by Cyril Russo
Upon restoring however, the whole chain of backup must be read (as the
final file is made of original_file + diff(s)).
As restoring happen very rarely, I don't think it's a problem, as the
space gained by using diff worth the extra CPU time.
Plus a lot of extra I/O time, since for big files it is likely that the
original_file has to be restored to disk and O(n) diffs have to be
applied to this file, i.e. requiring O(nm) I/O operations, where m is
the number of I/O operations used to create original_file.
Post by Cyril Russo
That way, the DAR format would really, really fit all the possible
requirement for a backup tool, as it would be optimal in size (rsync
like algorithm), optimal in speed (binary code), optimal in security
(encryption).
What do you think about this ?
I am rating the fact that dar does not support an rsync-like operation as
a feature! If you use a Unix-like operating system, you get reliable
modification times for your files, thus there is no need to do very
costly checksum computations to see which files have to be backed up.

I used rsync for some time to back up a home directory, since the rsync
algorithm sounds attractive. In practice my usage of my home directory
was so 'strange' that one rsync run took about 2 times as long as just
tar-ing everything away! I.e. my usage pattern seemed to confuse the
complex rsync rolling-checksum heuristics.

Best regards
Georg
