Post Syndicated from Lennart Poettering original https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
Introducing casync
In the past months I have been working on a new project: casync.
casync takes inspiration from the popular rsync file synchronization
tool as well as the probably even more popular git revision control
system. It combines the idea of the rsync algorithm with the idea of
git-style content-addressable file systems, and creates a new system for
efficiently storing and delivering file system images, optimized for
high-frequency update cycles over the Internet. Its current focus is
on delivering IoT, container, VM, application, portable service or OS
images, but I hope to extend it later in a generic fashion to become
useful for backups and home directory synchronization as well (but
more about that later).
The basic technological building blocks casync is built from are
neither new nor particularly innovative (at least not anymore),
however the way casync combines them is different from existing tools,
and that’s what makes it useful for a variety of use-cases that other
tools can’t cover that well.
Why?
I created casync after studying how today’s popular tools store and
deliver file system images. To briefly name a few: Docker has a
layered tarball approach, OSTree serves the individual files directly
via HTTP and maintains packed deltas to speed up updates, while other
systems operate on the block layer and place raw squashfs images (or
other archival file systems, such as ISO9660) for download on HTTP
shares (in the better cases combined with zsync data).
None of these approaches appeared fully convincing to me when used
in high-frequency update cycle systems. In such systems, it is
important to optimize towards a couple of goals:
- Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
- Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
- Put boundaries on disk space usage on clients
- Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
- Be simple to use for users, repository administrators and developers
I don’t think any of the tools mentioned above are really good on more
than a small subset of these points.
Specifically: Docker’s layered tarball approach dumps the “delta”
question onto the feet of the image creators: the best way to make
your image downloads minimal is basing your work on an existing image
clients might already have, and inherit its resources, maintaining full
history. Here, revision control (a tool for the developer) is
intermingled with update management (a concept for optimizing
production delivery). As container histories grow individual deltas
are likely to stay small, but on the other hand a brand-new deployment
usually requires downloading the full history onto the deployment
system, even though there’s no use for it there, and likely requires
substantially more disk space and download sizes.
OSTree’s serving of individual files is unfriendly to CDNs (as many
small files in file trees cause an explosion of HTTP GET
requests). To counter that OSTree supports placing pre-calculated
delta images between selected revisions on the delivery servers, which
means a certain amount of revision management, that leaks into the
clients.
Delivering direct squashfs (or other file system) images is almost
beautifully simple, but of course means every update requires a full
download of the newest image, which is both bad for disk usage and
generated traffic. Enhancing it with zsync makes this a much better
option, as it can reduce generated traffic substantially at very
little cost of history/meta-data (no explicit deltas between a large
number of versions need to be prepared server side). On the other hand
server requirements in disk space and functionality (HTTP Range
requests) are minus points for the use-case I am interested in.
(Note: all the mentioned systems have great properties, and it’s not
my intention to badmouth them. The only point I am trying to make is
that for the use case I care about, file system image delivery with
high-frequency update cycles, each system comes with certain
drawbacks.)
Security & Reproducibility
Besides the issues pointed out above I wasn’t happy with the security
and reproducibility properties of these systems. In today’s world
where security breaches involving hacking and breaking into connected
systems happen every day, an image delivery system that cannot make
strong guarantees regarding data integrity is out of
date. Specifically, the tarball format is famously nondeterministic:
the very same file tree can result in any number of different
valid serializations depending on the tool used, its version and the
underlying OS and file system. Some tar implementations attempt to
correct that by guaranteeing that each file tree maps to exactly
one valid serialization, but such a property is always only specific
to the tool used. I strongly believe that any good update system must
guarantee on every single link of the chain that there’s only one
valid representation of the data to deliver, that can easily be
verified.
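The point about tarballs can be illustrated with a small Python sketch. It uses the standard tarfile module and varies only the member order, which is just one of the ways real tar tools differ; the file names and contents are made up for the example. Sorting the entries and pinning the timestamp is shown as one possible canonicalization, not as the scheme casync uses.

```python
import hashlib
import io
import tarfile


def make_tar(entries, mtime):
    """Serialize (name, data) pairs into an uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def digest(blob):
    return hashlib.sha256(blob).hexdigest()


# Hypothetical two-file tree, listed in the order a file system might return it.
files = [("b.txt", b"second\n"), ("a.txt", b"first\n")]

# Same logical tree, two different member orders: the archives differ.
as_listed = make_tar(files, mtime=0)
reordered = make_tar(list(reversed(files)), mtime=0)

# Imposing one canonical order (sorted names, fixed mtime) makes the
# serialization depend only on the tree's contents.
canonical_1 = make_tar(sorted(files), mtime=0)
canonical_2 = make_tar(sorted(reversed(files)), mtime=0)

print(digest(as_listed) == digest(reordered))      # False
print(digest(canonical_1) == digest(canonical_2))  # True
```

A digest over the archive is only meaningful for verification if every producer is guaranteed to emit the canonical form.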
What casync Is
So much about the background why I created casync. Now, let’s have a
look what casync actually is like, and what it does. Here’s the brief
technical overview:
Encoding: Let’s take a large linear data stream, split it into
variable-sized chunks (the size of each being a function of the
chunk’s contents), and store these chunks in individual, compressed
files in some directory, each file named after a strong hash value of
its contents, so that the hash value may be used as key for
retrieving the full chunk data. Let’s call this directory a “chunk
store”. At the same time, generate a “chunk index” file that lists
these chunk hash values plus their respective chunk sizes in a simple
linear array. The chunking algorithm is supposed to create variable,
but similarly sized chunks from the data stream, and do so in a way
that the same data results in the same chunks even if placed at
varying offsets. For more information see this blog
story.
Decoding: Let’s take the chunk index file, and reassemble the large
linear data stream by concatenating the uncompressed chunks retrieved
from the chunk store, keyed by the listed chunk hash values.
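The encoding and decoding steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not casync's implementation: the rolling-sum chunker, the WINDOW and MASK values, the ".xz" file naming and the in-memory index (a list of digest/size pairs) are all invented for the example and do not reflect casync's on-disk formats.

```python
import hashlib
import lzma
import os
import tempfile

WINDOW = 16   # bytes covered by the rolling sum (illustrative value)
MASK = 0x3F   # cut when the low 6 bits are all set (about 1 in 64 positions)


def split_chunks(data):
    """Split data at content-defined boundaries using a simple rolling sum."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte leaving the window
        if i + 1 - start >= WINDOW and (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing remainder
    return chunks


def encode(data, store_dir):
    """Write compressed chunks into store_dir and return the chunk index."""
    index = []
    for chunk in split_chunks(data):
        name = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(store_dir, name + ".xz")   # assumed naming scheme
        if not os.path.exists(path):                   # identical chunks are stored once
            with open(path, "wb") as f:
                f.write(lzma.compress(chunk))
        index.append((name, len(chunk)))
    return index


def decode(index, store_dir):
    """Reassemble the stream by concatenating chunks looked up by hash."""
    out = bytearray()
    for name, size in index:
        with open(os.path.join(store_dir, name + ".xz"), "rb") as f:
            chunk = lzma.decompress(f.read())
        # Check the chunk against the digest and size listed in the index.
        if hashlib.sha256(chunk).hexdigest() != name or len(chunk) != size:
            raise ValueError("corrupt chunk " + name)
        out += chunk
    return bytes(out)


# Deterministic pseudo-random test stream (8 KiB).
stream = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(256))

with tempfile.TemporaryDirectory() as store:
    index = encode(stream, store)
    restored = decode(index, store)
    stored_files = len(os.listdir(store))

print(len(index), "chunks listed,", stored_files, "files in store")
print("round trip ok:", restored == stream)
```

Because chunk files are named after their digest, the decoder can verify every chunk it reads before using it.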
As an extra twist, we introduce a well-defined, reproducible,
random-access serialization format for file trees (think: a more
modern tar), to permit efficient, stable storage of complete file
trees in the system, simply by serializing them and then passing them
into the encoding step explained above.
Finally, let’s put all this on the network: for each image you want to
deliver, generate a chunk index file and place it on an HTTP
server. Do the same with the chunk store, and share it between the
various index files you intend to deliver.
Why bother with all of this? Streams with similar contents will result
in mostly the same chunk files in the chunk store. This means it is
very efficient to store many related versions of a data stream in the
same chunk store, thus minimizing disk usage. Moreover, when
transferring linear data streams chunks already known on the receiving
side can be made use of, thus minimizing network traffic.
Why is this different from rsync or OSTree, or similar tools? Well,
one major difference between casync and those tools is that we
remove file boundaries before chunking things up. This means that
small files are lumped together with their siblings and large files
are chopped into pieces, which permits us to recognize similarities in
files and directories beyond file boundaries, and makes sure our chunk
sizes are pretty evenly distributed, without the file boundaries
affecting them.
The “chunking” algorithm is based on the buzhash rolling hash
function. SHA256 is used as strong hash function to generate digests
of the chunks. xz is used to compress the individual chunks.
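To show why a rolling hash gives the same chunks for the same data at varying offsets, here is a minimal buzhash-style chunker in Python. The window size, the cut mask and the substitution table are assumptions chosen for the example and are not casync's parameters; the sketch also applies no minimum or maximum chunk size.

```python
import hashlib

BITS = 32
WORD = (1 << BITS) - 1
WINDOW = 48                 # sliding window length in bytes (illustrative)
AVG_MASK = (1 << 8) - 1     # cut at about 1 in 256 positions (illustrative)

# Fixed pseudo-random substitution table: one 32-bit value per byte value.
TABLE = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
         for b in range(256)]


def rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= BITS
    return ((x << n) | (x >> (BITS - n))) & WORD


def chunk_boundaries(data):
    """Return the end offsets of chunks chosen by a buzhash over a sliding window."""
    cuts, h = [], 0
    for i, byte in enumerate(data):
        h = rotl(h, 1) ^ TABLE[byte]
        if i >= WINDOW:
            # Cancel the contribution of the byte that just left the window.
            h ^= rotl(TABLE[data[i - WINDOW]], WINDOW)
        if i + 1 >= WINDOW and (h & AVG_MASK) == AVG_MASK:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts


def chunk_digests(data):
    """SHA256 digest of every chunk, in stream order."""
    digests, start = [], 0
    for end in chunk_boundaries(data):
        digests.append(hashlib.sha256(data[start:end]).hexdigest())
        start = end
    return digests


# Deterministic pseudo-random stream (16 KiB), then the same stream with a
# few bytes inserted at the front so that every offset shifts.
original = b"".join(hashlib.sha256(i.to_bytes(2, "big")).digest() for i in range(512))
shifted = b"a few inserted bytes" + original

old, new = chunk_digests(original), chunk_digests(shifted)
lost = set(old) - set(new)
print(len(old), "chunks before,", len(new), "after,", len(lost), "not reused")
```

Since the hash at each position depends only on the last WINDOW bytes, the cut points move along with the data, and only the chunk touching the inserted bytes can change.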
Here’s a diagram that would hopefully explain a bit how the encoding
process works, were it not for my crappy drawing skills:
The diagram shows the encoding process from top to bottom. It starts
with a block device or a file tree, which is then serialized and
chunked up into variable sized blocks. The compressed chunks are then
placed in the chunk store, while a chunk index file is written listing
the chunk hashes in order. (The original SVG of this graphic may be
found here.)
Details
Note that casync operates on two different layers, depending on the
use-case of the user:
- You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.
- You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.
The fact that it may be used on both the block and file system layer
opens it up for a variety of different use-cases. In the VM and IoT
ecosystems shipping images as block-level serializations is more
common, while in the container and application world file-system-level
serializations are more typically used.
Chunk index files referring to block-layer serializations carry the
.caibx suffix, while chunk index files referring to file system
serializations carry the .caidx suffix. Note that you may also use
casync as direct tar replacement, i.e. without the chunking, just
generating the plain linear file tree serialization. Such files
carry the .catar suffix. Internally .caibx files are identical to
.caidx files, the only difference is semantic: .caidx files
describe a .catar file, while .caibx files may describe any other
blob. Finally, chunk stores are directories carrying the .castr
suffix.
Features
Here are a couple of other features casync has:
- When downloading a new image you may use casync’s --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could typically be used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /srv/container-v3 is very similar but operates on the file system layer, and uses two old container versions to seed the new version.
- When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups, on the other hand, file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.
- casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) some more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don’t need FAT file bits (and I figure it’s pretty likely you don’t), then they won’t be stored.
- The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).
- Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.
- The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.
- When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files, the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.
- Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that’s the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.
- casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the “nodump” (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.
- When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user name-spacing.
- In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.
- When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to “dm-verity”, as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel’s block layer. This feature is implemented through Linux’ NBD kernel facility.
- Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux’ FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that they were mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s hopefully just a temporary gap to be fixed soon).
- In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much smaller than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there’s little point in seeding the white-space after the file system within the partition.
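The seeding idea from the feature list above can be sketched at the chunk level. This Python example is a simplification under stated assumptions: the chunk contents are made-up byte strings, the seed is assumed to have been split into chunks already, and a dictionary stands in for the remote chunk store. It only shows how an index plus a seed determines what still has to be fetched.

```python
import hashlib


def digest(chunk):
    return hashlib.sha256(chunk).hexdigest()


def make_index(chunks):
    """Chunk index: ordered list of (digest, size) pairs."""
    return [(digest(c), len(c)) for c in chunks]


def assemble(index, seed_chunks, fetch):
    """Rebuild an image, taking chunks from the seed whenever possible."""
    local = {digest(c): c for c in seed_chunks}
    image, fetched = bytearray(), []
    for name, size in index:
        chunk = local.get(name)
        if chunk is None:
            chunk = fetch(name)          # only unknown chunks hit the network
            fetched.append(name)
        # Validate every chunk against the index before using it.
        if digest(chunk) != name or len(chunk) != size:
            raise ValueError("chunk does not match index entry " + name)
        image += chunk
    return bytes(image), fetched


# Hypothetical chunk contents standing in for two image versions.
v1_chunks = [b"kernel", b"libc", b"app-1.0"]
v2_chunks = [b"kernel", b"libc", b"app-2.0", b"tzdata"]

# A dictionary stands in for the chunk store on the HTTP server.
remote_store = {digest(c): c for c in v2_chunks}
index_v2 = make_index(v2_chunks)

image, fetched = assemble(index_v2, v1_chunks, remote_store.__getitem__)
print(len(fetched), "of", len(index_v2), "chunks fetched")   # 2 of 4 chunks fetched
```

Note that nothing here requires the seed to be an earlier version of the same image: any data that happens to contain matching chunks helps.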
Example Command Lines
Here’s how to use casync, explained with a few examples:
$ casync make foobar.caidx /some/directory
This will create a chunk index file foobar.caidx in the local
directory, and populate the chunk store directory default.castr
located next to it with the chunks of the serialization (you can
change the name for the store directory with --store= if you
like). This command operates on the file-system level. A similar
command operating on the block level:
$ casync make foobar.caibx /dev/sda1
This command creates a chunk index file foobar.caibx in the local
directory describing the current contents of the /dev/sda1 block
device, and populates default.castr in the same way as above. Note
that you may as well read a raw disk image from a file instead of a
block device:
$ casync make foobar.caibx myimage.raw
To reconstruct the original file tree from the .caidx file and
the chunk store of the first command, use:
$ casync extract foobar.caidx /some/other/directory
And similarly for the block-layer version:
$ casync extract foobar.caibx /dev/sdb1
or, to extract the block-layer version into a raw disk image:
$ casync extract foobar.caibx myotherimage.raw
The above are the most basic commands, operating on local data
only. Now let’s make this more interesting, and reference remote
resources:
$ casync extract http://example.com/images/foobar.caidx /some/other/directory
This extracts the specified .caidx onto a local directory. This of
course assumes that foobar.caidx was uploaded to the HTTP server in
the first place, along with the chunk store. You can use any command
you like to accomplish that, for example scp or rsync. Alternatively,
you can let casync do this directly when generating the chunk index:
$ casync make ssh.example.com:images/foobar.caidx /some/directory
This will use ssh to connect to the ssh.example.com server, and then
place the .caidx file and the chunks on it. Note that this mode of
operation is “smart”: this scheme will only upload chunks currently
missing on the server side, and not re-transmit what already is
available.
Note that you can always configure the precise path or URL of the
chunk store via the --store= option. If you do not do that, then the
store path is automatically derived from the path or URL: the last
component of the path or URL is replaced by default.castr.
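The derivation rule just described (replace the last component of the path or URL with default.castr) can be modeled in a few lines of Python. The function name default_store is hypothetical and the sketch only mirrors the rule as stated in the text; it makes no claim about how casync parses locations internally.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def default_store(index_location):
    """Replace the last component of a path or URL with default.castr."""
    parts = urlsplit(index_location)
    if parts.scheme and parts.netloc:
        # Proper URL such as http://host/dir/file.caidx
        path = posixpath.join(posixpath.dirname(parts.path), "default.castr")
        return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    # Plain path, or an ssh-style host:path location
    return posixpath.join(posixpath.dirname(index_location), "default.castr")


print(default_store("http://example.com/images/foobar.caidx"))
print(default_store("foobar.caidx"))
print(default_store("ssh.example.com:images/foobar.caidx"))
```

With this rule, all index files placed in the same directory share one chunk store unless --store= says otherwise.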
Of course, when extracting .caidx or .caibx files from remote sources,
using a local seed is advisable:
$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory
Or on the block layer:
$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2
When creating chunk indexes on the file system layer casync will by
default store meta-data as accurately as possible. Let’s create a chunk
index with reduced meta-data:
$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir
This command will create a chunk index for a file tree serialization
that has three features above the absolute baseline supported: 1s
granularity time-stamps, symbolic links and a single read-only bit. In
this mode, all the other meta-data bits are not stored, including
nanosecond time-stamps, full UNIX permission bits, file ownership or
even ACLs or extended attributes.
Now let’s make a .caidx file available locally as a mounted file
system, without extracting it:
$ casync mount http://example.com/images/foobar.caidx /mnt/foobar
And similarly, let’s make a .caibx file available locally as a block device:
$ casync mkdev http://example.com/images/foobar.caibx
This will create a block device in /dev and print the used device
node path to STDOUT.
As mentioned, casync is big about reproducibility. Let’s make use of
that to calculate a digest identifying a very specific version of
a file tree:
$ casync digest .
This digest will include all meta-data bits casync and the underlying
file system know about. Usually, to make this useful you want to
configure exactly what meta-data to include:
$ casync digest --with=unix .
This makes use of the --with=unix shortcut for selecting meta-data
fields. Specifying --with=unix selects all meta-data that
traditional UNIX file systems support. It is a shortcut for writing out:
--with=16bit-uids --with=permissions --with=sec-time --with=symlinks
--with=device-nodes --with=fifos --with=sockets.
Note that when calculating digests or creating chunk indexes you may
also use the negative --without= option to remove specific features
but start from the most precise:
$ casync digest --without=flag-immutable
This generates a digest with the most accurate meta-data, but leaves
one feature out: chattr(1)’s immutable (+i) file flag.
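The interplay of --with= and --without= can be modeled as simple set arithmetic. The Python sketch below is only a mental model: the feature names are the ones quoted in this text, ALL_KNOWN is a hypothetical stand-in for "the most precise" set (the real tool knows more features), and the behavior when both options are combined is an assumption.

```python
# The expansion of the "unix" shortcut, as listed in the text.
UNIX = frozenset({
    "16bit-uids", "permissions", "sec-time", "symlinks",
    "device-nodes", "fifos", "sockets",
})
SHORTCUTS = {"unix": UNIX}

# Hypothetical stand-in for the most precise feature set.
ALL_KNOWN = UNIX | {"flag-immutable"}


def select_features(with_=(), without=()):
    """Model: --with= builds up from the baseline, --without= removes."""
    if with_:
        selected = set()
        for name in with_:
            selected |= SHORTCUTS.get(name, {name})
    else:
        selected = set(ALL_KNOWN)        # no --with=: start from everything
    for name in without:
        selected -= SHORTCUTS.get(name, {name})
    return selected


print(sorted(select_features(with_=["unix"])))
print(sorted(select_features(without=["flag-immutable"])))
```

Because the selected feature set is part of the serialization, two digests are only comparable when they were computed with the same selection.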
To list the contents of a .caidx file use a command like the following:
$ casync list http://example.com/images/foobar.caidx
or
$ casync mtree http://example.com/images/foobar.caidx
The former command will generate a brief list of files and
directories, not too different from tar t or ls -al in its
output. The latter command will generate a BSD mtree(5) compatible
manifest. Note that casync actually stores substantially more file
meta-data than mtree files can express, though.
What casync isn’t
- casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google’s Courgette.
- casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.
Where next?
casync is supposed to be a generic synchronization tool. Its primary
focus for now is delivery of OS images, but I’d like to make it useful
for a couple other use-cases, too. Specifically:
- To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.
- Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.
- In the longer run, I’d like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.
- casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync’s functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME’s gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync’s default HTTP/HTTPS back-ends built on CURL with GNOME’s own HTTP implementation, in order to share cookies, certificates, … There’s also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.
- I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.
Further plans are listed tersely in the
TODO file.
FAQ:
- Is this a systemd project? casync is hosted under the GitHub systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.
- Is casync portable? At the moment: no. I only run Linux and that’s what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn’t really make sense on non-Linux systems), as long as they don’t interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.
- Does casync require reflink-capable file systems to work, such as btrfs? No it doesn’t. The reflink magic in casync is employed when the file system permits it, and it’s good to have it, but it’s not a requirement, and casync will implicitly fall back to copying when it isn’t available. Note that casync supports a number of file system features on a variety of file systems that aren’t available everywhere, for example FAT’s system/hidden file flags or xfs’s projinherit file flag.
- Is casync stable? I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don’t see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.
- Are the .caidx/.caibx and .catar file formats open and documented? casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It’s definitely my intention to add comprehensive docs for both formats however. Don’t forget this is just the initial version right now.
- casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? Well, because casync isn’t “just like” some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.
- Why did you invent your own serialization format for file trees? Why don’t you just use tar? That’s a good question, and other systems (most prominently tarsnap) do that. However, as mentioned above tar doesn’t enforce reproducibility. It also doesn’t really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.
- Does casync save/restore SELinux/SMACK file labels? Not at the moment. That’s not because I wouldn’t want it to, but simply because I am not a guru of either of these systems, and didn’t want to implement something I do not fully grok nor can test. If you look at the sources you’ll find that there are already some definitions in place that keep room for them though. I’d be delighted to accept a patch implementing this fully.
- What about delivering squashfs images? How well does chunking work on compressed serializations? That’s a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let’s say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn’t apply that strictly to squashfs images, as squashfs provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync, as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync’s --chunk-size= option). How precisely to choose both values is left as a research subject for the user, for now.
- What does the name casync mean? It’s a synchronizing tool, hence the -sync suffix, following rsync’s naming. It makes use of the content-addressable concept of git, hence the ca- prefix.
- Where can I get this stuff? Is it already packaged? Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
Should you care? Is this a tool for you?
Well, that’s up to you really. If you are involved with projects that
need to deliver IoT, VM, container, application or OS images, then
maybe this is a great tool for you — but other options exist, some of
which are linked above.
Note that casync is an Open Source project: if it doesn’t do exactly
what you need, prepare a patch that adds what you need, and we’ll
consider it.
If you are interested in the project and would like to talk about this
in person, I’ll be presenting casync soon at Kinvolk’s Linux
Technologies Meetup in Berlin, Germany. You are invited. I also intend
to talk about it at All Systems Go!, also in Berlin.