Discussion:
[Dar-discussions] RSync like diff for incremental stuff
Cyril Russo
2009-06-13 10:38:14 UTC
Hi,

I'm wondering if DAR could use an rsync-like system for incremental backups.
Currently, if I want to back up my mail file (3 GB), which changes hourly,
DAR backs up the whole 3 GB every time it runs.

With an rsync-like algorithm, it would save only the difference
(which is not that much, something like 10 MB).
Usually rsync requires two copies of the file (the "previous file" and the
"new file") in order to find the difference.
This is an annoying requirement for an archive format, as the "previous
file" is usually compressed / encrypted, and as such would require too
much CPU to uncompress / decrypt for the comparison.

However, librsync has a mechanism to compute a "signature" of a file,
and this signature is used in place of the "previous file". The
signature itself is very small (compared to the file), so it might be
possible to store such a signature in the catalog.
The signature is compared to the "new file", and the diff can be computed
from it (so in my previous example, only 10 MB of data would be stored
in the backup).
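As an aside, the core of that signature-plus-delta scheme can be sketched in a few lines of Python. This is a toy model only: real librsync uses a rolling weak checksum and a compact wire format, and the block size, helper names and op format here are invented for illustration.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for the demo; real tools use a few KiB

def signature(data: bytes):
    """Per-block (weak, strong) checksums of the previous file: this small
    table is all that is needed later, not the file itself."""
    sig = {}
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        sig.setdefault(zlib.adler32(block), []).append(
            (hashlib.sha256(block).digest(), off))
    return sig

def delta(sig, new: bytes):
    """Scan the new file once; emit ('copy', old_offset) where a block from
    the old file is recognised, ('literal', bytes) for new data."""
    ops, lit, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + BLOCK]
        match = None
        if len(block) == BLOCK:
            for strong, off in sig.get(zlib.adler32(block), []):
                if strong == hashlib.sha256(block).digest():
                    match = off
                    break
        if match is not None:
            if lit:
                ops.append(("literal", bytes(lit)))
                lit = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
        else:
            lit.append(new[pos])
            pos += 1
    if lit:
        ops.append(("literal", bytes(lit)))
    return ops

def patch(old: bytes, ops):
    """Rebuild the new file from the old file plus the (small) delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + BLOCK] if kind == "copy" else arg
    return bytes(out)

old = b"hello world, unchanged tail"
new = b"hello brave world, unchanged tail"
ops = delta(signature(old), new)
assert patch(old, ops) == new  # delta + old file reproduces the new file
```

Note that only the signature table and the delta need to be stored; the unchanged tail of the file is carried as small "copy" references.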

Upon restoring, however, the whole chain of backups must be read (as the
final file is made of original_file + diff(s)).
As restoring happens very rarely, I don't think that's a problem: the
space gained by using diffs is worth the extra CPU time.

That way, the DAR format would really fit all the possible
requirements for a backup tool, as it would be optimal in size
(rsync-like algorithm), in speed (binary code), and in security
(encryption).

What do you think about this?
Cyril
Sterling Windmill
2009-06-14 15:55:05 UTC
Have you tried rdiff-backup ?




Sterling Windmill | Systems & Technology
Custom Data Solutions, Inc.

410 S. Main St | Romeo | MI | 48065
586-752-9671 ext 161 | fax: 586-752-6589
toll free: 800-441-9595 | fax: 800-383-4551
www.custdata.com


CONFIDENTIALITY NOTICE: This email contains information from the sender that may be CONFIDENTIAL, LEGALLY PRIVILEGED, PROPRIETARY or otherwise protected from disclosure. This email is intended for use only by the person or entity to whom it is addressed. If you are not the intended recipient, any use, disclosure, copying, distribution, printing, or any action taken in reliance on the contents of this email, is strictly prohibited. If you received this email in error, please contact the sending party by replying in an email to the sender, delete the email from your computer system and shred any paper copies of the email you may have printed.

Cyril Russo
2009-06-15 08:54:38 UTC
Post by Sterling Windmill
Have you tried rdiff-backup ?
Yes
And, even better, duplicity.

However, all those tools still lack a major technical feature.
rdiff-backup neither compresses (unless you use gzip --rsyncable) nor
encrypts, so it's useless if you don't control or trust the remote host.
Duplicity compresses, encrypts, and uses rsync, but it uses TAR
internally, so browsing a backup is a real pain (there is no catalog in
the archive itself, so it's very slow).
Duplicity does come with a huge list of backends for accessing remote
storage (sshfs, imap, webdav, ftp, S3, tahoe, local, etc.).

I thought about an ultimate backup tool that would combine the smartness
of DAR (catalog / compression / encryption / incremental) with optimal
disk consumption (thanks to rsync).
That way, browsing the backup set would still be very fast.

Following the Unix philosophy of "one tool does one thing well", I would
let DAR create a local backup archive on a remotely mounted filesystem
(NFS, SSHFS, WebdavFS, etc.).

Anyway, how complex would this be to implement in DAR?
Denis Corbin
2009-06-15 20:01:13 UTC
Hello,
Post by Cyril Russo
Post by Sterling Windmill
Have you tried rdiff-backup ?
Yes
And, even better, duplicity.
However, all those tools still lack a major technical feature.
rdiff-backup neither compresses (unless you use gzip --rsyncable) nor
encrypts, so it's useless if you don't control or trust the remote host.
Duplicity compresses, encrypts, and uses rsync, but it uses TAR
internally, so browsing a backup is a real pain (there is no catalog in
the archive itself, so it's very slow).
Duplicity does come with a huge list of backends for accessing remote
storage (sshfs, imap, webdav, ftp, S3, tahoe, local, etc.).
I thought about an ultimate backup tool that would combine the smartness
of DAR (catalog / compression / encryption / incremental) with optimal
disk consumption (thanks to rsync).
That way, browsing the backup set would still be very fast.
Following the Unix philosophy of "one tool does one thing well", I would
let DAR create a local backup archive on a remotely mounted filesystem
(NFS, SSHFS, WebdavFS, etc.).
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).

The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.

Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.

The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.

This was for the functional level.

Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).

What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.


Regards,
Denis.
Cyril Russo
2009-06-16 07:10:38 UTC
Hello,
Post by Cyril Russo
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).
Well, this is not exact, and that's one of the main features of librsync.
librsync creates a signature file from a binary file, and you can (and
should) use this signature as the first comparand.
The signature itself is very small compared to the file (usually only
1%, but it can be less).
Over successive incremental backups, you only have to update the signature.
The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.
The signature algorithm uses two different checksums (strong and rolling).
The rolling checksum can detect byte shifts (insertion or deletion
within the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matches.
So, it's pretty unlikely that the SHA-256 would be the same (the
probability is 1/2^128), but it could happen.
It's also pretty unlikely that the rolling checksum matches for 2
different inputs (the probability is 1/2^32 for a 64-bit rolling checksum).
In any case, it's very uncommon for both to give the same results.
At worst, as the strong checksum is done on a fixed block size, a
collision would mean that only that one block went unnoticed
(so no byte was ever added or removed, otherwise the other blocks would
have changed too and would be unlikely to go unnoticed).
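The two-level check described above can be sketched as follows (hypothetical helper name, not librsync's actual API; Adler-32 stands in for the rolling checksum and SHA-256 for the strong one):

```python
import hashlib
import zlib

def block_matches(block: bytes, weak: int, strong: bytes) -> bool:
    """Cheap weak check first; the expensive strong hash is only computed
    when the weak checksum already matched."""
    if zlib.adler32(block) != weak:       # ~1/2^32 false-positive rate
        return False
    return hashlib.sha256(block).digest() == strong  # confirms the hit

orig = b"mail folder block, version 1"
weak, strong = zlib.adler32(orig), hashlib.sha256(orig).digest()

assert block_matches(orig, weak, strong)
assert not block_matches(b"mail folder block, version 2", weak, strong)
```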

As rsync is widely used for distributing binary Linux distributions
(with data > 100 GB), I guess it's not a real issue in the real world.
Gentoo uses it, and YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses it, so
I'm pretty confident that this doesn't happen in practice.
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.
Yes, I agree with you here.
Restoring an incremental backup then requires the whole backup chain
(from the last full backup up to the last incremental version).
BTW, this wouldn't be much slower than it is currently: the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
If you get the wrong version (and you must have decompressed the whole
file to see its content), you have to repeat the above procedure for
each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.

The new code would do
(undiff(decrypt(decompress(orig_file)))^number_incremental).
This means that, once you've decompressed the original file, going from
one version to the next is only a diff (which is an O(1) operation), so
it's faster than the current code.
You can even compute the diff against the current file, if it still
exists, avoid the first (longer) step, and go backward.
Going from one version to the next is then an O(1) operation, so it's
faster than decompressing the whole file again.
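The restore chain described here can be pictured with a toy patch format, where each incremental stores (offset, bytes) overwrites. This is purely illustrative; a real delta would use copy/literal ops and a proper encoding.

```python
def apply_patch(base: bytes, edits):
    """Apply one incremental: a list of (offset, replacement_bytes) edits."""
    out = bytearray(base)
    for off, data in edits:
        out[off:off + len(data)] = data
    return bytes(out)

def restore(full_backup: bytes, chain):
    """Walk full backup -> diff1 -> diff2 -> ... up to the wanted version."""
    state = full_backup
    for edits in chain:
        state = apply_patch(state, edits)
    return state

full = b"version-0 of the mail file"
chain = [[(8, b"1")], [(8, b"2")]]          # two hourly incrementals
assert restore(full, chain) == b"version-2 of the mail file"
assert restore(full, chain[:1]) == b"version-1 of the mail file"
```

Truncating the chain gives any intermediate version, which is the "check 2 or 3 versions" workflow from the text.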
This was for the functional level.
Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).
CRCs have the bad habit of colliding.
Please store the rsync signature there instead, as you'll get (for the
same price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.
Please read up on the rsync algorithm, it's very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
Using a signature (as said above), you don't need to read the previous
file at all, and you only need to read the new file *once*, so it's
exactly like the current code in fact.
Even better, the system will be faster, as you won't be compressing /
encrypting the unchanged data, so the time you lose reading the small
signature file (usually 1% of the original file size) is more than
compensated by *not* compressing the redundant data.
The amount of I/O will be lower too (as you store fewer compressed
blocks on the FS), so you'll gain in network and disk bandwidth.

In fact, rsync usually provides a 2x to 16x speedup over gzipping the
whole file, for example (run an rsync test and you'll see for yourself).

Sure, if the rsync diff size > input size, then it's better to perform a
full backup; this never happens in reality.
I guess a better rule of thumb would be: if rsync_diff_size +
signature_size > input_size, then a full backup might be worthwhile.
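That rule of thumb is trivially codified (hypothetical function name; sizes in bytes):

```python
def backup_mode(diff_size: int, signature_size: int, input_size: int) -> str:
    """Fall back to a full backup once the delta plus the signature stop
    being smaller than the file itself."""
    if diff_size + signature_size > input_size:
        return "full"
    return "diff"

# 3 GB mail file, ~10 MB of real change, ~30 MB signature (about 1%)
assert backup_mode(10_000_000, 30_000_000, 3_000_000_000) == "diff"
# pathological case: the delta is as big as the file
assert backup_mode(3_000_000_000, 30_000_000, 3_000_000_000) == "full"
```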

Cyril
Denis Corbin
2009-06-16 21:42:34 UTC
Cyril Russo wrote:

[...]
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would this be to implement in DAR?
First, to make a binary diff you must have the original binary beside
the current one in order to compare them (currently dar relies on inode
metadata changes to decide whether to save the whole file again or not
to save it at all).
Well, this is not exact, and that's one of the main features of librsync.
librsync creates a signature file from a binary file, and you can (and
should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make the binary diff
without the original file, since you need to know where the change is
located.
Post by Cyril Russo
The signature itself is very small compared to the file (usually only
1%, but it can be less).
Over successive incremental backups, you only have to update the signature.
If there is no change, you have no signature to update, right? Otherwise
you also have to deal with the data, not just the signature, no?
Post by Cyril Russo
Post by Denis Corbin
The consequence is that this is not compatible with catalogue-based
differential backup; the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
If there was a change instead, without the original copy you have to
back up the whole file (which is the current situation). And if you now
have a partially saved file, chances are you will have to back up the
whole file again... unless you ask the user to provide the former
backups up to the full backup... and this for each file in that
situation...
Post by Cyril Russo
Post by Denis Corbin
Worse, it will be difficult to base such a backup on another
differential backup. Suppose the file is already binary-diffed... how do
you know whether the part of the file that is not present in that backup
has changed? Yes, by using signatures (or ~ CRCs), but there is still
the risk that the file changed while the signature stayed the same, in
which case the change would not be saved.
The signature algorithm uses two different checksums (strong and rolling).
The rolling checksum can detect byte shifts (insertion or deletion
within the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matches.
So, it's pretty unlikely that the SHA-256 would be the same (the
probability is 1/2^128), but it could happen.
Precisely, yes. However improbable it may be, the situation can occur
where data is not saved properly. For that reason this can only be an
optional feature (not activated by default), with a warning to the user
upon activation.
Post by Cyril Russo
It's also pretty unlikely that the rolling checksum matches for 2
different inputs (the probability is 1/2^32 for a 64-bit rolling checksum).
In any case, it's very uncommon for both to give the same results.
At worst, as the strong checksum is done on a fixed block size, a
collision would mean that only that one block went unnoticed
(so no byte was ever added or removed, otherwise the other blocks would
have changed too and would be unlikely to go unnoticed).
As rsync is widely used for distributing binary Linux distributions
(with data > 100 GB), I guess it's not a real issue in the real world.
Well, maybe in the rsync context this is not a real problem, because in
case of a signature collision you simply keep the previous file version
in the (remote) synced directory: the whole copied file stays a valid
but obsolete version of the file. This gets fixed at the next change, so
differences between copies are very rare and do not persist. At
restoration time instead, if the signatures match between the file in
the filesystem and the file that was used as the base of the binary
diff, while those files actually differ, this would lead to patching the
wrong file and producing file contents that had never existed before.
That would go unnoticed until you try to use the file, which would most
probably lead to a core dump if it is an executable, for example. The
probability of such an occurrence is no greater with backup than with
rsync, but the consequences are much more serious.

OK, now considering MD5 with 128 bits (thus 3.4E38 possibilities), the
average number of files one uses on a system (200,000), and the number
of people today (around 6 billion), such a collision would occur once
every 7.78E18 centuries... But, on principle, this does not remove the
need to warn the user.
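For comparison, a quick back-of-envelope with assumed figures (a 128-bit digest, daily backups, every file changing every day): the exact result depends heavily on the model, so this is only an order-of-magnitude sketch, not a reproduction of the figure above.

```python
# Expected number of strong-checksum collisions per year, worldwide,
# under the (assumed, illustrative) model described in the lead-in.
files_per_system = 200_000
systems = 6_000_000_000        # assumed: order of world population
backups_per_year = 365

p_collision = 2.0 ** -128                      # one changed-file comparison
comparisons = files_per_system * systems * backups_per_year
expected_per_year = p_collision * comparisons  # roughly 1e-21: vanishingly rare
assert expected_per_year < 1e-18
```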
Post by Cyril Russo
Gentoo uses it, and YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses it, so
I'm pretty confident that this doesn't happen in practice.
Just because something is extremely improbable does not mean that it
does not happen at all... there is a difference, and for me it matters.
Post by Cyril Russo
Post by Denis Corbin
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
Using a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file really is the
same as the one over which the binary diff was made (there is still a
small chance that two different files get the same signature), and I
don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, except comparing byte by byte, which would require every
previous diff backup up to the first one where the whole data is
present. But if the user is warned about this risk and wants to take
it... I see no problem.
Yes, I agree with you here.
Restoring an incremental backup then requires the whole backup chain
(from the last full backup up to the last incremental version).
BTW, this wouldn't be much slower than it is currently: the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
It would probably be a bit faster for a full restoration, as there
would be less data to copy... on the other hand, there is a more complex
algorithm to run... but that's not the point here.

However, it would be more painful for the user to restore a small set of
files, as you would have to access more archives and slices for the same
set of files.
Post by Cyril Russo
If you get the wrong version (and you must have decompressed the whole
file to see its content), you have to repeat the above procedure for
each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.
The new code would do
(undiff(decrypt(decompress(orig_file)))^number_incremental).
This means that, once you've decompressed the original file, going from
one version to the next is only a diff (which is an O(1) operation), so
it's faster than the current code.
A diff is not a constant-time operation, it depends on the amount of
data to compare; isn't it rather an O(n) operation?
Post by Cyril Russo
You can even compute the diff against the current file, if it still
exists, avoid the first (longer) step, and go backward.
Going from one version to the next is then an O(1) operation, so it's
faster than decompressing the whole file again.
Post by Denis Corbin
This was for the functional level.
Now for the implementation: this should be feasible, but at the cost of
some major changes in the way dar stores files. We would have to store
not just a sequence of bytes to be copied into a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at
this level, namely the ability to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little to leave room for CRC fields
and the like (re-using the TLV class).
CRCs have the bad habit of colliding.
Please store the rsync signature there instead, as you'll get (for the
same price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
There are already scripts that simplify Parchive use with dar (automatic
redundancy generation, testing and repairing).
(See http://dar.linux.free.fr/doc/samples/index.html )

By the way, a CRC is just a way of signing a block of data. Its weakness
comes from the address space it has... usually 32 bits, where MD5 has
128 bits...
Post by Cyril Russo
Post by Denis Corbin
What is still missing is the algorithm to inspect binary files and
decide how to delimit blocks of changed data. The problem here is that
it will cost much more disk I/O to compare two binaries: it is first
necessary to inspect the whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte in every N has changed. Based on that,
we could proceed with either a full backup or a binary-diff backup.
Please read up on the rsync algorithm, it's very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, and it is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Basing a differential backup on other differential backups needs another
algorithm for binary diff.
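The rolling property in question can be sketched with an Adler-style weak checksum (simplified from rsync's; components kept modulo 2^16). The point is that the window slides one byte in O(1) instead of re-reading the whole block:

```python
M = 1 << 16  # weak checksum components are kept modulo 2^16, as in rsync

def weak(block: bytes):
    """Direct O(len) computation of the (a, b) weak checksum of a window."""
    n = len(block)
    a = sum(block) % M
    b = 0
    for i, x in enumerate(block):
        b = (b + (n - i) * x) % M
    return a, b

def roll(a: int, b: int, n: int, out_byte: int, in_byte: int):
    """Slide the window one byte to the right in O(1): drop out_byte on
    the left, take in_byte on the right."""
    a2 = (a - out_byte + in_byte) % M
    b2 = (b - n * out_byte + a2) % M
    return a2, b2

data = b"the quick brown fox jumps"
n = 8
a, b = weak(data[:n])
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, n, data[i - 1], data[i + n - 1])
    assert (a, b) == weak(data[i:i + n])  # O(1) slide matches recomputation
```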
Post by Cyril Russo
Using a signature (as said above), you don't need to read the previous
file at all, and you only need to read the new file *once*, so it's
exactly like the current code in fact.
Maybe I do not understand what you mean here above.
Post by Cyril Russo
Even better, the system will be faster, as you won't be compressing /
encrypting the unchanged data, so the time you lose reading the small
signature file (usually 1% of the original file size) is more than
compensated by *not* compressing the redundant data.
The amount of I/O will be lower too (as you store fewer compressed
blocks on the FS), so you'll gain in network and disk bandwidth.
Actually, if a file has not changed (judging by its last modification
time), there is no need to compress / encrypt it. I probably do not
understand what you mean.
Post by Cyril Russo
In fact, rsync usually provides a 2x to 16x speedup over gzipping the
whole file, for example (run an rsync test and you'll see for yourself).
I use rsync daily (through fcron), though I had never made the effort to
look at how it works. Thanks for letting me know. ;-)
Post by Cyril Russo
Sure, if the rsync diff size > input size, then it's better to perform a
full backup. This never happens in reality.
I guess a better rule of thumb would be: if rsync_diff_size +
signature_size > input_size, then it might be worth doing a full backup.
Cyril
Back to dar, I see this possibility:

# is saved data
- is unsaved data

assuming there is only one file in the backup:

full backup : ###########################
diff1 backup : --------####-------###-----
diff2 backup : --------###----------#-----
but if some data changed at a place in the file which is unsaved (for
example at the beginning), the whole data block has to be saved again;
diff2 would then have been:
diff2 backup : ###########----------#-----

restoration would need "full" then "diff1", then "diff2". Restoration
would fail for a diff backup if no file exists on the filesystem or if its
global data signature does not match the expected one.

Dar would also store a signature for each block. By block I mean a
sequence of unsaved data or a sequence of saved data.

See
http://sourceforge.net/tracker/?func=detail&aid=2803478&group_id=65612&atid=511615
for implementation detail seen so far.

Regards,
Denis.
Cyril Russo
2009-06-17 08:08:25 UTC
Permalink
[...]
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would it be to implement in DAR?
First, to be able to make a binary diff you must have the original
binary beside the current one to compare them (actually dar relies on
inode data changes to decide whether to save the whole file again or
not save it at all).
Well, this is not exact, and it's one of the main features of librsync.
librsync creates a signature file from a binary file, and you
can (and should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make a binary diff
without the original file to know where the change is located.
I'll try to take an example here:
Let's say I have a 1/2 GB file.
The signature is set to use 512 KB blocks (something crazy, as it's very big).
In my signature, I'll get 1000 blocks, each entry of 32 bytes, so a 32 KB file.

Then, let's say I've modified the file, and I'm running an incremental
backup.
The signature is compared to the new file (and only the new file).
Blocks 234, 235, and 636 are modified (detected thanks to the signature), so:
1) Blocks 234, 235 and 636 are saved in the incremental
backup (and only those blocks)
2) The new signature is updated from the new_file

In the process, I've never read the initial file again.
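As a rough illustration of this per-block scheme, here is a Python sketch (hypothetical helper names, not DAR or librsync code; real rsync additionally uses a rolling checksum so blocks can match at any offset, while this sketch only compares blocks at fixed positions):

```python
import hashlib

def make_signature(path, block_size=512 * 1024):
    """Hash each fixed-size block of the file: this list of hashes is
    the 'signature', tiny compared to the file itself."""
    sig = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sig.append(hashlib.md5(block).digest())  # 16 bytes per block
    return sig

def changed_blocks(old_sig, new_path, block_size=512 * 1024):
    """Find modified blocks by reading the *new* file only: the old
    file is never opened again, just its stored signature."""
    new_sig = make_signature(new_path, block_size)
    return [i for i, h in enumerate(new_sig)
            if i >= len(old_sig) or old_sig[i] != h]
```

Only the block indices returned by `changed_blocks` (and their data) would need to be stored in the incremental backup, and the new signature replaces the old one in the catalog.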
Post by Cyril Russo
The signature itself is very small compared to the file (usually only
1%, but it can be less).
During the incremental backup, you only have to update the signature.
If there is no change you have no signature to update, right? Else you
also have to deal with data, not just signature, no?
Exactly. Signatures are derived from the data only. So no change in
data => no change in signature.
Post by Cyril Russo
Post by Denis Corbin
Consequences is that this is not compatible with differential backup
based on catalogue, the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
If there was a change instead, without the original copy you have to
backup the whole file (which is the current situation). And, if now you
have a partially saved file, there is a good chance you have to back up the
whole file again... unless you ask the user to provide former backups up
to the full backup... this for each file that falls in this situation...
This is wrong. The aim of the signature is to never use the original
file again.
Hopefully it's something very convenient to use.
Post by Cyril Russo
Post by Denis Corbin
Worse, it will be difficult to base such a backup on a differential
backup. Suppose the file is already binary diffed... how to know if the
part of the file that is not present in the backup has changed? Yes, using
signatures (or ~ CRC), but there is still the risk that the file changed
while the signature stays the same, and thus the change would not be saved.
The signature algorithm uses 2 different checksums (strong and rolling).
The rolling checksum is a checksum that can detect byte shifts (insertions
or deletions in the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used only when the
rolling checksum matched.
So, it's pretty unlikely that the SHA-256 would be the same (the
birthday-bound probability is about 1/2^128), but it could happen.
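The two-level check described above can be sketched as follows (a simplified illustration with made-up helper names; the weak sum is an Adler-32-style sum in the spirit of, but not identical to, rsync's actual rolling checksum):

```python
import hashlib

MOD = 1 << 16  # 16-bit halves, as in rsync's original weak checksum

def weak_sum(block):
    """Weak checksum: cheap to compute and sensitive to byte shifts.
    (In real rsync it can also be updated in O(1) as the window slides.)"""
    a = sum(block) % MOD
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % MOD
    return (b << 16) | a

def block_matches(block, candidate_weak, candidate_strong):
    """Weak check first: almost all mismatches are rejected cheaply.
    The expensive strong hash is only computed on a weak-sum hit."""
    if weak_sum(block) != candidate_weak:
        return False
    return hashlib.sha1(block).digest() == candidate_strong
```

A false positive requires both checksums to collide at once, which is why the combined scheme is so much safer than either checksum alone.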
Precisely yes. However improbable it may be, the situation can occur where
data is not saved properly. For that reason this can only be an optional
feature (not activated by default), with a warning toward the user upon
activation.
Sure, as there is still the (far more probable) hard disk corruption (with a
probability of 0.70% per year, so ~10^21 times more probable than a hash
collision).
Anyway, it's always safe to inform the user, as (s)he's the final decision
maker.
Post by Cyril Russo
It's pretty unlikely that the rolling checksum matches for 2 different
inputs (the probability is 1/2^32 for a 32-bit rolling checksum).
However, it's very very uncommon that both would give the same results.
Worse, as the strong checksum is done on a fixed block size, this would
mean that only that block went unnoticed
(so no byte was ever added or removed, else the other blocks would have
changed too and would be unlikely to go unnoticed).
As rsync is widely used for binary Linux distributions (with data >
100 GB), I guess it's not a real issue in our real world.
Well, maybe in the rsync context this is not a real problem, because in
case of a signature collision you just keep the previous file version in
the (remote) synced directory; the whole copied file stays a valid but
obsolete version of the file. This will be fixed at the next change, so
differences between copies are very rare and do not tend to persist. In
case of restoration instead, if signatures are the same between the file
in the filesystem and the file that was used to base the binary diff on,
while those files do differ, this would lead to patching the wrong file
and getting file contents that had never existed before. This would be
unnoticeable unless you try to use the file, which would most probably
lead to a core dump if it is an executable, for example. The probability
of such an occurrence is not greater with backup than with rsync, but
the consequences are much more important.
Don't forget that rsync is block based, so if a hash collision happens,
you'll get a binary clash too (the new version with a block from
the old version).
This rarely happens in reality.
OK, now considering MD5 with 128 bits (thus 3.4E38 possibilities), the
average number of files one uses (200,000) on a system, and the number
of people today (around 6 billion), such a collision would occur once
every 7.78E18 centuries... But, as a matter of principle, this does not
remove the need to warn the user.
Exactly (moreover, most of the files are the same for each user).
Note however that MD5 really provides only about 64 safe bits (see
attacks on MD5), so the probability is still higher than it looks.
rsync can use SHA-1 and SHA-256 too, so safety can be enhanced.
Post by Cyril Russo
Gentoo uses this; YUM (Fedora, Mandriva, Moblin, OpenSUSE) uses this, so
I'm pretty confident that this doesn't happen at all.
It is not because something is extremely improbable that it never happens
at all... there is a difference, which for me has its importance.
Yes.
BTW, if we look at current safety & security in Dar:
1) Data safety
- The minimum limit is hard drive failure (with a probability of 0.007 per
year, see Wikipedia on hard drives)
2) Data security
- Encryption derives the key from the password hash. That way, one
could find a hash that collides (the probability is 1/2^64 for a 128-bit
hash) to get access to the unencrypted file.

To me, the dual-hash failure (with probability 1/2^64 * 1/2^32 =>
1/2^96) is so far below the current limits (more than 10
orders of magnitude lower) that I wouldn't fear it at all.
We can improve on hard drive failure with Reed-Solomon parity, but never
gain the 10^10 difference.
Post by Cyril Russo
Post by Denis Corbin
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
The use of a signature (a CRC for example) computed on the original file is
necessary, but will not guarantee that the original file is really
the same as the one over which the binary diff has been made (there is
still a little chance that two different files get the same signature),
and I don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at backup
time, except comparing byte by byte, thus requiring every previous
diff backup up to the first where the whole data is present. But now, if
the user is warned about this risk and wants to take it... I see no
problem.
Yes, I agree with you here.
Restoring an incremental backup here requires the whole backup chain (from
the last full backup to the last incremental version).
BTW, this wouldn't be much slower than it is currently, as the current code
stores the whole file, so it has to decrypt(decompress(whole file)) for
the last incremental anyway.
It would probably be a bit faster for a full restoration, as there would
be less data to copy... on the other hand, there is a more complex
algorithm to run... but that's not the point here.
However, it would be more painful for the user to restore a small set
of files, as you would have to access more archives and slices for the
same set of files to restore.
Yes. This is the drawback of such an algorithm. You have to have all steps
from full to last incremental.
At the same time, the saving in disk space / bandwidth might be worth the
extra effort on restoring.

I rarely restore my backup (only on failure, once a year ? once a month
?), but I'm more concerned by not sending my 3GB mail file
every day to my remote server.
Post by Cyril Russo
If you get the wrong version (as you must have decompressed the whole
file to see its content), then you have to reproduce the above procedure
for each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.
The new code would do
undiff(decrypt(decompress(orig_file)))^number_incremental.
This means that, once you've decompressed the original file, going from
one version to the other is only a diff (which is an O(1) operation), so
it's faster than the current code.
a diff is not a constant operation it depends on the amount of data to
compare, isn't it rather an O(n) operation?
Producing a diff is O(n*m).
Fortunately, the diff stores the position of each difference before the
difference itself, so patching (applying the diff) is O(1), as you'll only
touch the modified parts and won't read the rest of the file.

You're right that most filesystems don't allow inserting / removing
at arbitrary positions, so once you are past a block whose size changed,
you'll have to rewrite the following blocks too (thus an O(n) operation).
But this is a filesystem limit, not a mathematical limit.

And as you'll only write the file once, if you have 10 diffs from the
full to the last incremental, you'll have 9 O(1) operations (producing the
differences in memory) and 1 O(n) operation (writing the final file to
disk).
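The point about patching touching only the modified regions can be shown with a toy patch format of (offset, data) records (an illustrative assumption, not DAR's actual storage format):

```python
def apply_patch(path, patch):
    """Apply a list of (offset, data) records in place.

    Only the modified regions are written; the unchanged parts of the
    file are never read or rewritten, which is why chaining several
    diffs is cheap compared with rewriting the whole file each time.
    """
    with open(path, "r+b") as f:
        for offset, data in patch:
            f.seek(offset)
            f.write(data)
```

As noted above, this only holds for same-size replacements: an insertion or deletion forces rewriting everything past the change on ordinary filesystems.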
Post by Cyril Russo
You can even compute the diff against the current file, if it still exists,
and avoid the first (longer) step, and go backward.
Then going from one version to the other is an O(1) operation, so it's
faster than decompressing the whole thing again.
Post by Denis Corbin
This was for the functional level.
Seen now at the implementation level, this should be feasible, but at the
cost of some major changes in the way dar stores files. We must now not only
store a sequence of bytes to be copied to a brand new file, but a list of
blocks with offset, length and data (and probably a checksum). Well, there
is a feature in the pipe that will bring some changes at this level,
which is the possibility to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little bit to leave room for CRC fields
and the like (re-using the TLV class).
CRC is one thing that has the bad habit of colliding.
Please store the rsync signature in there, as you'll get (for the same
price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be repaired.
There are already scripts that simplify Parchive use with dar (automatic
redundancy generation, testing and repairing).
(see http://dar.linux.free.fr/doc/samples/index.html )
By the way, a CRC is just a way of signing a block of data. Its weakness
comes from the small address space it has... it is usually 32 bits, where
MD5 is 128 bits...
Yes I agree.
That's great!
Post by Cyril Russo
Post by Denis Corbin
Will lack the algorithm to inspect binary files and decide which way to
define blocks of changed data. The problem here is that it will cost
much more disk I/O to compare two binaries: it is first necessary
to inspect both whole files (old backup and live filesystem) to find out
how much data has changed and which pattern the changes follow. It
may well be more costly to systematically do a diff in place of a full
backup if, for example, one byte every N bytes has changed. Then, we could
proceed either to a full backup or a binary diff backup.
Please try to read the rsync algorithm as it's very very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, this is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Having differential backup based on other differential backups needs
another algorithm for binary diff.
The smart algorithm does this:
1) Initial file => compute the rolling checksum + hash per block, this
gives the signature
2) For each incremental step:
Denis Corbin
2009-06-20 16:04:58 UTC
Permalink
Post by Denis Corbin
[...]
Post by Cyril Russo
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would it be to implement in DAR?
First, to be able to make a binary diff you must have the original
binary beside the current one to compare them (actually dar relies on
inode data changes to decide whether to save the whole file again or
not save it at all).
Well, this is not exact, and it's one of the main features of librsync.
librsync creates a signature file from a binary file, and you
can (and should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make a binary diff
without the original file to know where the change is located.
Post by Cyril Russo
Let's say I have a 1/2 GB file.
The signature is set to use 512 KB blocks (something crazy, as it's very big).
In my signature, I'll get 1000 blocks, each entry of 32 bytes, so a 32 KB file.
Then, let's say I've modified the file, and I'm running an incremental
backup.
The signature is compared to the new file (and only the new file).
1) Blocks 234, 235 and 636 are saved in the incremental
backup (and only those blocks)
2) The new signature is updated from the new_file
In the process, I've never read the initial file again.
OK, I see. This is not a signature of a given file but many signatures of
the file, one for each block of a given size.

[...]
Post by Denis Corbin
If there was a change instead, without the original copy you have to
backup the whole file (which is the current situation). And, if now you
have a partially saved file, there is a good chance you have to back up the
whole file again... unless you ask the user to provide former backups up
to the full backup... this for each file that falls in this situation...
Post by Cyril Russo
This is wrong. The aim of the signature is to never use the original
file again.
Yes, I was considering the "signature" as an overall hash of the file,
not, as you explained before, a by-block hash.


[not reproduced text but fully agreed]
Post by Denis Corbin
However, it would be more painful for the user to restore a small set
of files, as you would have to access more archives and slices for the
same set of files to restore.
Post by Cyril Russo
Yes. This is the drawback of such an algorithm. You have to have all steps
from full to last incremental.
At the same time, the saving in disk space / bandwidth might be worth the
extra effort on restoring.
I rarely restore my backup (only on failure, once a year ? once a month
?), but I'm more concerned by not sending my 3GB mail file
every day to my remote server.
:-) that's a good argument.


[...]
Post by Denis Corbin
Post by Cyril Russo
Post by Cyril Russo
The new code would do
undiff(decrypt(decompress(orig_file)))^number_incremental.
This means that, once you've decompressed the original file, going
from one version to the other is only a diff (which is an O(1)
operation), so it's faster than the current code.
a diff is not a constant operation it depends on the amount of data to
compare, isn't it rather an O(n) operation?
Post by Cyril Russo
Producing a diff is O(n*m).
Fortunately, the diff stores the position of each difference before the
difference itself, so patching (applying the diff) is O(1), as you'll only
touch the modified parts and won't read the rest of the file.
You're right that most filesystems don't allow inserting / removing
at arbitrary positions, so once you are past a block whose size changed,
you'll have to rewrite the following blocks too (thus an O(n) operation).
But this is a filesystem limit, not a mathematical limit.
And as you'll only write the file once, if you have 10 diffs from the
full to the last incremental, you'll have 9 O(1) operations (producing the
differences in memory) and 1 O(n) operation (writing the final file to
disk).
OK
[...]
Post by Denis Corbin
Post by Cyril Russo
Post by Cyril Russo
Please try to read the rsync algorithm as it's very very clever (
http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, this is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Having differential backup based on other differential backups needs
another algorithm for binary diff.
Post by Cyril Russo
1) Initial file => compute the rolling checksum + hash per block, this
gives the signature
2.1)
Cyril Russo
2009-06-21 13:18:38 UTC
Permalink
Post by Denis Corbin
OK, I see. This is not a signature of a given file but many signatures of
the file, one for each block of a given size.
To be honest, it's the signature of the whole file, *including* the
signatures of the many "blocks" in the file.
In fact, the rolling checksum mechanism allows you to have the "instant"
signature of any possible block in the file, for almost nothing.
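This "instant signature for almost nothing" is the O(1) rolling-update property; here is a sketch with an Adler-32-style weak sum (simplified relative to rsync's exact checksum; `weak_parts` and `roll` are made-up names):

```python
MOD = 1 << 16  # 16-bit halves, as in rsync's original weak checksum

def weak_parts(block):
    """Full O(n) computation of the two halves (a, b) of the weak sum."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1): drop out_byte on
    the left, take in_byte on the right, without rescanning the block."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b
```

Recomputing `weak_parts` at every offset of a file would cost O(n * block_len); with `roll`, the checksum of every possible window is obtained in a single O(n) pass.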
Post by Denis Corbin
Post by Denis Corbin
# is saved data
- is unsaved data
full backup : ###########################
diff1 backup : --------####-------###-----
diff2 backup : --------###----------#-----
but if some data changed at a place in the file which is unsaved (for
example at the beginning), the whole data block has to be saved again;
diff2 backup : ###########----------#-----
diff2 backup : #-------###----------#-----
Yes, as we saw previously, with a hash per block you can know which
blocks alone to save. I must review the algorithm. What I liked here is
that there was no block size of arbitrary length to define. But it is
not efficient. Thus I have to keep the list of per-block hashes from the
differential backup and copy it to the new backup.
In fact, you could use librsync to compute such things for you (its
license is compatible).
Post by Denis Corbin
Post by Denis Corbin
restoration would need "full" then "diff1", then "diff2".
You're speaking the truth.
Restoration would fail for a diff backup if no file exists on the
filesystem or if its
global data signature does not match the expected one.
I disagree here.
Restoration fails if the full backup, diff1, or diff2 is missing.
Signatures are useless for restoration (unless going backward, see below).
full backup : ###########################
diff1 backup : --------###----------#-----
diff2 backup : --------####-------###-----
OK, you are right: using the by-block hashes you can restore even if diff1
is missing (assuming you have either the 'full' backup or the corresponding
file present in the filesystem). What I meant is that if you have *only*
diff1 or diff2 and no file in the filesystem, or a file that is not the
one you could find in "full", then restoration fails. In the current status
you can restore from a diff1 backup all files that have changed since
the archive of reference was made.
Yes. Less data means more dependence on data safety. It's a choice that
every user could make.
Either be a Mac user with Time Machine (this is what dar actually does,
storing every modified file in full), or be a Unix user with a
remote ssh/ftp machine (à la rdiff-backup / duplicity).
Local storage is cheap, but isn't safe at all (robbery / fire / flood /
electrical shock, etc.).
Remote storage is more expensive, but it's supposed to be safer.
Post by Denis Corbin
Post by Denis Corbin
Then diff1 can be missing; it'll still work (as diff2 is more complete
than diff1, and rsync stores whole blocks).
[backward restoration] Also, in your example, let's say you want to
restore up to step diff2.
If you have the signature in the diff2 backup set (as you should), AND the
new_file is still there on the filesystem, then it'll be faster to go
from new_file to diff2 than doing full + diff1 + diff2.
In order to achieve this, you compute the "hypothetical" diff3 from
diff2 to new_file (thanks to the signature) and revert new_file to diff2
by applying the reverse diff.
Nothing special here, but the speedup might really be worth the effort.
Dar would also store signature for each block. By block I mean the
sequence of unsaved data, or the sequence of saved data.
I'm sorry, I didn't understand, so I might say bullshit here.
That's not totally bullshit. My original idea was to not have a
fixed-length block size: instead, a "block" is either a continuous
portion of a file that has not changed since the reference, or a
continuous portion of a file that has changed. Each one is associated
with its own hash.
In the example below we have 5 blocks: 1, 3 and 5 only have a hash value
to detect any change in them, while 2 and 4 have hash and data.
diff1 backup : --------###----------#-----
111111112223333333333455555
As you have shown, this has the drawback of forcing the backup of a
larger amount of data than what rsync can do: if for example the first
byte changes, the whole of block 1 has to be saved:
diff2 backup : ########-------------------
               111111112223333333333455555
whereas rsync, matching at any offset, would only need to save:
diff2 backup : #--------------------------
Thus my algorithm is not good. I have to deal with a fixed block size and
with the decision of what arbitrary size to use.
I think rsync is now quite stable and used almost everywhere it can be
used, because it has proved suitable for most situations.
rsync with its rolling checksum algorithm can compute the minimum block
size quite well, is fast, and is the de facto standard.
Data are quite safe when handed to rsync.

I don't know how easy it would be to plug an rsync-like algorithm into Dar,
but I think it would be safer and simpler than re-writing a similar
algorithm (with all the risk to the data themselves).
Georg Sauthoff
2010-01-23 08:55:23 UTC
Permalink
On 2009-06-13, Cyril Russo <***@laposte.net> wrote:

Hi,
Post by Cyril Russo
I'm wondering if DAR could use rsync like system for incremental backup.
Currently, if I want to backup my mail file (3GB) that changes hourly,
it back it up everytime it's run (and the whole 3GB).
a 3 GB mbox file sounds scary. Why don't you use the maildir mailbox
format? With this format each mail is saved in a separate file, so
it is well suited for backup purposes.
Post by Cyril Russo
With a rsync like algorithm, it would have saved only the difference
(which is not that much, something like 10MB).
10 MB for 3 GB?
Post by Cyril Russo
Usually rsync requires 2 copies of the file (the "previous file" and the
"new file"), in order to find out the difference.
This is an annoying requirement for an archive format, as the "previous
file" is usually compressed / encrypted, and as such, would require too
much CPU to uncompress / decrypt for comparing.
However, librsync has a mechanism to compute a "signature" of a file,
and this signature is used in place of the "previous file". The
signature itself is very small (compared to the file), so it might be
possible to store such signature in the catalog.
The signature is compared to the "new file", and the diff can be made
from this (so in my previous example, only 10MB of data would be stored
in the backup).
Computing the diff for each incremental backup is a O(n^2) operation,
where n is the file length.
Post by Cyril Russo
Upon restoring however, the whole chain of backup must be read (as the
final file is made of original_file + diff(s)).
As restoring happen very rarely, I don't think it's a problem, as the
space gained by using diff worth the extra CPU time.
Plus a lot of extra I/O time, since for big files it is likely that the
original_file has to be restored to disk and O(n) diffs have to be
applied to this file, i.e. requiring O(nm) I/O operations, where m is
the number of I/O operations used to create original_file.
Post by Cyril Russo
That way, the DAR format would really, really fit all the possible
requirement for a backup tool, as it would be optimal in size (rsync
like algorithm), optimal in speed (binary code), optimal in security
(encryption).
What do you think about this ?
I am rating the fact that dar does not support an rsync-like operation as
a feature! If you use a Unix-like operating system, you get reliable
modification times for your files, thus there is no need to do very
costly checksum computations to see which files have to be backed up.

I used rsync for some time to back up a home directory, since the rsync
algorithm sounds attractive. In practice my usage of my home directory
was so 'strange' that one rsync run took about 2 times as long as just
tar-ing everything away! I.e. my usage pattern seemed to confuse the
complex rsync rolling-checksum heuristics.

Best regards
Georg
