Cyril Russo wrote:
[...]
Post by Cyril Russo
Post by Denis Corbin
Post by Cyril Russo
Anyway, how complex would it be to implement this in DAR?
First, to be able to make a binary diff you must have the original
binary beside the current one to compare them (currently dar relies on
inode data changes to decide whether to save the whole file again or
not to save it at all).
Well, this is not exact, and it's one of the main features of librsync.
librsync creates a signature file from a binary file, and you
can (and should) use this signature as the first comparand.
OK, but if the signature is different, you cannot make a binary diff
without the original file, to know where the change is located.
Post by Cyril Russo
The signature itself is very small compared to the file (usually only
1%, but it can be less).
At each incremental backup, you only have to update the signature.
If there is no change, you have no signature to update, right? Otherwise
you also have to deal with the data, not just the signature, no?
Post by Cyril Russo
Post by Denis Corbin
The consequence is that this is not compatible with differential backups
based on the catalogue; the feature must thus be an option.
If the signature is stored in the catalog, I think it might be compatible.
If there was a change instead, without the original copy you have to
back up the whole file (which is the current situation). And if you now
have a partially saved file, there is a good chance you will have to back
up the whole file again... unless you ask the user to provide the former
backups up to the full backup... and this for each file in that situation...
Post by Cyril Russo
Post by Denis Corbin
Worse, it will be difficult to base such a backup on a differential
backup. Suppose the file is already binary diffed... how to know if the
part of the file that is not present in the backup has changed? Yes, using
signatures (or something like a CRC), but there is still the risk that the
file changed while the signature stays the same, and thus the change would
not be saved.
The signature algorithm uses 2 different checksums (strong and rolling).
The rolling checksum is a checksum that can detect byte shifts (insertions
or deletions in the block).
The strong checksum (SHA-256, SHA-1, or even MD5) is used when the
rolling checksum matches.
So it's pretty unlikely that the SHA-256 checksum would be the same
(the probability is 1/2^128), but it could happen.
Precisely, yes. However improbable it may be, the situation can occur where
data is not saved properly. For that reason this can only be an optional
feature (not activated by default), with a warning to the user upon
activation.
Post by Cyril Russo
It's pretty unlikely that the rolling checksum matches for 2 different
inputs (the probability is 1/2^32 for a 32-bit rolling checksum).
However, it's very, very uncommon that both would give the same result.
Worse, as the strong checksum is done on a fixed block size, this would
mean that only that block goes unnoticed
(so no byte was ever added or removed, else the other blocks would have
changed too and would be unlikely to go unnoticed).
As rsync is widely used for binary Linux distributions (with data >
100 GB), I guess it's not a real issue in the real world.
Well, maybe in the rsync context this is not a real problem, because in
case of a signature collision you just keep the previous file version in
the (remote) synced directory: the whole copied file stays a valid but
obsolete version of the file. This will be fixed at the next change, so
differences between copies are very rare and do not persist. In the case
of restoration instead, if signatures are the same between the file on the
filesystem and the file that was used as the base of the binary diff,
while those files do differ, this would lead to patching the wrong file
and getting file contents that never existed before. This would be
unnoticeable unless you try to use the file, which would most probably
lead to a core dump if it is an executable, for example. The probability
of such an occurrence is not greater with backup than with rsync, but the
consequences are much more important.
OK, now considering MD5 with 128 bits (thus 3.4E38 possibilities), the
average number of files on a system (around 200,000), the number of
people today (around 6 billion), and assuming every single file is backed
up once a day, such a collision would occur about once every 7.78E18
centuries... But, as a matter of principle, this does not remove the need
to warn the user.
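For illustration, here is a minimal sketch of an rsync-style weak rolling
checksum as described above (the two sums "a" and "b" from the rsync
paper). The names are purely illustrative; this is not librsync's actual
API:

    #include <cstdint>
    #include <cstddef>

    // Weak rolling checksum in the spirit of rsync's 32-bit check:
    // 'a' is the sum of the bytes in the window, 'b' the running sum of 'a'.
    // Sliding the window by one byte only needs the byte that leaves and
    // the byte that enters, which is what makes the checksum "rolling".
    struct RollingChecksum
    {
        std::uint32_t a = 0, b = 0;
        std::size_t len = 0;                 // current window length

        void init(const unsigned char *buf, std::size_t n)
        {
            a = b = 0;
            len = n;
            for (std::size_t i = 0; i < n; ++i)
            {
                a += buf[i];
                b += a;                      // overall b ends up as sum of (n - i) * buf[i]
            }
        }

        // Slide the window one byte forward: 'out' leaves, 'in' enters.
        void roll(unsigned char out, unsigned char in)
        {
            a += in;
            a -= out;
            b += a;
            b -= static_cast<std::uint32_t>(len) * out;
        }

        std::uint32_t digest() const { return (b << 16) | (a & 0xffffu); }
    };

Only when this weak checksum matches a block of the signature is the
strong checksum (MD5, SHA-1, ...) computed and compared, which is what
keeps the overall cost close to O(N).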
Post by Cyril Russo
Gentoo uses this, YUM (Fedora, Mandriva, Moblin, OpenSuse) uses this, so
I'm pretty confident that this doesn't happen at all.
It is not because something is extremely improbable that it does not
happen at all... there is a difference, which for me is important.
Post by Cyril Russo
Post by Denis Corbin
The second point is that at restoration time you must not blindly patch
an existing binary file with the portion of data that has been saved.
The use of a signature (a CRC for example) computed on the original file
is necessary, but will not guarantee that the original file is really
the same as the one over which the binary diff has been made (there is
still a little chance that two different files get the same signature),
and I don't see any means to be sure that the live file to restore over a
binary diff is exactly the same sequence of bytes as the one used at
backup time, other than comparing byte by byte, thus requiring every
previous diff backup up to the first one where the whole data is present.
But now, if the user is warned about this risk and wants to take it...
I see no problem.
Yes, I agree with you here.
Restoring an incremental backup here requires the whole backup chain (from
the last full backup to the last incremental version).
BTW, this wouldn't be much slower than it is currently, as the current
code stores the whole file, so it has to decrypt(decompress(whole file))
for the last incremental.
It would probably be a bit faster for a full restoration, as there would
be less data to copy... on the other hand, there is a more complex
algorithm to run... but that's not the point here.
However, it would be more painful for the user to restore a small set
of files, as you would have to access more archives and slices for the
same set of files to restore.
Post by Cyril Russo
If you get the wrong version (as you must have decompressed the whole
file to see its content), then you have to repeat the above procedure
for each version.
And, unless you know exactly when you made the mistake, you usually
check 2 or 3 versions before finding what you were looking for.
The new code would do
(undiff(decrypt(decompress(orig_file)))^number_incremental).
This means that, once you've decompressed the original file, going from
one version to the other is only a diff (which is an O(1) operation), so
it's faster than the current code.
A diff is not a constant-time operation, it depends on the amount of data
to compare; isn't it rather an O(n) operation?
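To make the cost concrete, here is a small sketch (with a purely
hypothetical record layout, not dar's nor librsync's delta format) of
applying a chain of diffs, where each diff is a list of (offset, data)
records; each step copies an amount of data proportional to the patch
size, so walking the chain is O(total patched data), not O(1) per version:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical patch record: overwrite data.size() bytes at 'offset'.
    struct PatchRecord
    {
        std::uint64_t offset;
        std::vector<unsigned char> data;
    };

    using Patch = std::vector<PatchRecord>;

    // Apply incremental patches on top of the fully restored base file.
    // Each patch costs time proportional to the bytes it carries.
    void apply_patch_chain(std::vector<unsigned char> &file,
                           const std::vector<Patch> &chain)
    {
        for (const Patch &patch : chain)
            for (const PatchRecord &rec : patch)
            {
                if (rec.offset + rec.data.size() > file.size())
                    file.resize(rec.offset + rec.data.size());
                std::copy(rec.data.begin(), rec.data.end(),
                          file.begin() + static_cast<std::ptrdiff_t>(rec.offset));
            }
    }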
Post by Cyril Russo
You can even compute the diff with the current file, if it still exists,
and avoid the first (longer) step, and go backward.
Then going from one version to the other is an O(1) operation, so it's
faster than doing the whole decompression again.
Post by Denis Corbin
This was for the functional level.
Looking now at the implementation, this should be feasible, but at the
cost of some major changes in the way dar stores files. We would now not
only store a sequence of bytes to be copied to a brand new file, but a
list of blocks with offset, length and data (and probably a checksum).
Well, there is a feature in the pipe that will bring some changes at this
level, which is the possibility to properly store sparse files [
https://sourceforge.net/tracker/?func=detail&aid=1457710&group_id=65612&atid=511615
]. I could extend this feature a little bit to leave room for CRC fields
and the like (re-using the TLV class).
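Purely as an illustration (the field names are hypothetical and do not
reflect dar's actual TLV layout), one block of a binary-diffed file could
be described by something like:

    #include <cstdint>
    #include <vector>

    // Hypothetical per-block record for a binary-diffed file. In dar this
    // would rather be encoded through the existing TLV (type/length/value)
    // mechanism than stored as a raw struct.
    struct DiffBlock
    {
        std::uint64_t offset;            // position of the block in the file
        std::uint64_t length;            // number of bytes covered by the block
        bool saved;                      // true: data follows, false: unchanged/sparse hole
        std::uint32_t weak_sum;          // rolling checksum of the block
        unsigned char strong_sum[16];    // strong checksum of the block (e.g. MD5)
        std::vector<unsigned char> data; // present only when 'saved' is true
    };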
A CRC is one thing that has the bad habit of colliding.
Please store the rsync signature in there, as you'll get (for the same
price) both a rolling checksum and a strong checksum.
Even better, add a PAR block (see http://en.wikipedia.org/wiki/Parchive
), so if you ever get a corruption, it can be restored.
There are already scripts that simplify Parchive use with dar (automatic
redundancy generation, testing and repairing).
(See http://dar.linux.free.fr/doc/samples/index.html )
By the way, a CRC is just a way of signing a block of data. Its weakness
comes from the address space it has... it is usually 32 bits, whereas MD5
is 128 bits...
Post by Cyril Russo
Post by Denis Corbin
What will be lacking is the algorithm to inspect binary files and decide
how to define blocks of changed data. The problem here is that it will
cost much more disk I/O to compare two binaries: it would first be
necessary to inspect both whole files (old backup and live filesystem) to
find out how much data has changed and which pattern the changes follow.
It may well be more costly to systematically do a diff instead of a full
backup if, for example, one byte out of every N bytes has changed. Then,
we could proceed either with a full backup or with a binary diff backup.
Please try to read about the rsync algorithm, as it's very, very clever
( http://en.wikipedia.org/wiki/Rsync ). Basically, the rsync algorithm is
O(N) to figure out the diff.
I just did, and it is an interesting article. But note that the rolling
checksum needs the whole data of the file in order to be computed.
Having differential backups based on other differential backups requires
another algorithm for the binary diff.
Post by Cyril Russo
Using a signature (as said above), you don't need to read the previous
file at all, and you only need to read the new file *once*, so it's
exactly like the current code in fact.
Maybe I do not understand what you mean here above.
Post by Cyril Russo
Even better, the system will be faster as you won't be compressing /
encrypting the unchanged data, so the time you lose reading the small
signature file (usually 1% of the original file size) is largely
compensated by *not* compressing the redundant data.
The amount of I/O will be less too (as you store fewer compressed blocks
on the FS), so you'll gain in network or disk bandwidth.
Actually, if a file has not changed (as shown by its last modification
time), there is no need to compress / encrypt it. I probably do not
understand what you mean.
Post by Cyril Russo
In fact, rsync usually provides a 2x to 16x speedup over gzipping the
whole file, for example (try to run an rsync test, you'll see for yourself).
I use rsync daily (through fcron), but I never made the effort to look
at the way it works. Thanks for letting me know about that. ;-)
Post by Cyril Russo
Sure, if the rsync diff size > input size, then it's better to perform a
full backup. This never happens in reality.
I guess a better rule of thumb would be: if rsync_diff_size +
signature_size > input_size, then the full backup might be worth it.
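That rule of thumb is simple enough to sketch (illustrative only), once
the delta and signature sizes are known for a given file:

    #include <cstdint>

    // Rule of thumb discussed above: a binary diff only pays off if the
    // diff plus the signature stored with it is smaller than the plain copy.
    bool prefer_full_backup(std::uint64_t diff_size,
                            std::uint64_t signature_size,
                            std::uint64_t file_size)
    {
        return diff_size + signature_size > file_size;
    }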
Cyril
Back to dar, I see this possibility:
# is saved data
- is unsaved data
assuming there is only one file in the backup:
full backup : ###########################
diff1 backup : --------####-------###-----
diff2 backup : --------###----------#-----
but if some data changed at a place in the file which is unsaved (for
example at the beginning), the whole data block has to be saved again;
diff2 would then have been:
diff2 backup : ###########----------#-----
Restoration would need "full", then "diff1", then "diff2". Restoration
of a diff backup would fail if no file exists on the filesystem or if its
global data signature does not match the expected one.
Dar would also store a signature for each block. By block I mean a
sequence of unsaved data or a sequence of saved data.
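A possible (purely illustrative) in-memory view of such an entry, with one
signature per saved or unsaved run, so that restoration can refuse to
patch a file whose unsaved parts do not match what was seen at backup
time:

    #include <cstdint>
    #include <vector>

    // One contiguous run of the file: either saved (its data is in this
    // archive) or unsaved (its data must come from previous backups or
    // from the file already present on the filesystem).
    struct Run
    {
        std::uint64_t offset;
        std::uint64_t length;
        bool saved;
        unsigned char signature[16];     // e.g. MD5 of this run
        std::vector<unsigned char> data; // filled only when 'saved' is true
    };

    struct DiffedFileEntry
    {
        unsigned char global_signature[16]; // checked before restoring over a live file
        std::vector<Run> runs;              // covers the whole file, in order
    };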
See
http://sourceforge.net/tracker/?func=detail&aid=2803478&group_id=65612&atid=511615
for the implementation details seen so far.
Regards,
Denis.