

Controlling the Archive Format

@FIXME{need an intro here}

Making tar Archives More Portable

Creating a tar archive on a particular system that is meant to be useful later on many other machines and with other versions of tar is more challenging than you might think. tar archive formats have been evolving since the first versions of Unix. Many such formats are around, and are not always compatible with each other. This section discusses a few problems, and gives some advice about making tar archives more portable.

One golden rule is simplicity. For example, limit your tar archives to contain only regular files and directories, avoiding other kinds of special files. Do not attempt to save sparse files or contiguous files as such. Let's discuss a few more problems, in turn.

Portable Names

Use straight file and directory names, made up of printable ASCII characters, avoiding colons, slashes, backslashes, spaces, and other dangerous characters. Avoid deep directory nesting. To account for oldish System V machines, limit your file and directory names to 14 characters or fewer.

If you intend your tar archives to be read under MSDOS, you should not rely on case distinction for file names, and you might use the GNU doschk program to help you diagnose illegal MSDOS names, which are even more limited than System V's.
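The naming rules above can be checked mechanically before an archive is built. The following Python sketch is only an illustration: the function names and the exact set of rejected characters are assumptions made here (the text names colons, slashes, backslashes and spaces, and leaves "other dangerous characters" open). It does not attempt the stricter MSDOS checks that doschk performs.

```python
import string

# Characters the text singles out as dangerous; this set is an assumption
# and can be extended for stricter policies.
DANGEROUS = set(':/\\ ')
MAX_COMPONENT = 14  # limit suggested for oldish System V machines
PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation)


def is_portable_name(component):
    """Return True if a single file or directory name looks portable."""
    if not component or len(component) > MAX_COMPONENT:
        return False
    for ch in component:
        # Only printable ASCII, and none of the dangerous characters.
        if ch not in PRINTABLE or ch in DANGEROUS:
            return False
    return True


def nonportable_components(path):
    """List the components of a slash-separated path that fail the check."""
    return [c for c in path.split('/') if c and not is_portable_name(c)]
```

Running `nonportable_components` over every path intended for a distribution gives a quick list of names to shorten or rename.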

Symbolic Links

Normally, when tar archives a symbolic link, it writes a block to the archive naming the target of the link. In that way, the tar archive is a faithful record of the filesystem contents. --dereference (-h) is used with --create (-c), and causes tar to archive the files symbolic links point to, instead of the links themselves. With this option, when tar encounters a symbolic link, it archives the linked-to file instead of simply recording the presence of a symbolic link.

The name under which the file is stored in the file system is not recorded in the archive. To record both the symbolic link name and the file name in the system, archive the file under both names. If all links were recorded automatically by tar, an extracted file might be linked to a file name that no longer exists in the file system.

If a linked-to file is encountered again by tar while creating the same archive, an entire second copy of it will be stored. (This might be considered a bug.)

So, for portable archives, do not archive symbolic links as such, and use --dereference (-h): many systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.
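The effect of dereferencing can be seen with a small experiment. The sketch below does not run GNU tar; it uses the `dereference=True` parameter of Python's standard tarfile module, which plays a similar role to --dereference (-h). The file names `real.txt` and `alias.txt` and the helper `archive_dereferenced` are made up for the illustration, and the example assumes a system that supports symbolic links.

```python
import io
import os
import tarfile
import tempfile


def archive_dereferenced(directory):
    """Archive a directory, storing link targets instead of symbolic links."""
    buf = io.BytesIO()
    # dereference=True plays the role of --dereference (-h) here.
    with tarfile.open(fileobj=buf, mode="w", dereference=True) as tar:
        tar.add(directory, arcname="dist")
    buf.seek(0)
    return buf


with tempfile.TemporaryDirectory() as top:
    with open(os.path.join(top, "real.txt"), "w") as f:
        f.write("hello\n")
    os.symlink("real.txt", os.path.join(top, "alias.txt"))

    # Record what kind of entry each member became in the archive.
    with tarfile.open(fileobj=archive_dereferenced(top)) as tar:
        kinds = {m.name: ("link" if m.issym() else "file" if m.isreg() else "dir")
                 for m in tar.getmembers()}
```

In this sketch both `dist/real.txt` and `dist/alias.txt` end up as regular files, each carrying its own copy of the data, so the archive contains no symbolic links for the receiving system to resolve.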

Old V7 Archives

Certain old versions of tar cannot handle additional information recorded by newer tar programs. To create an archive in V7 format (not ANSI), which can be read by these old versions, specify the --old-archive (-o) option in conjunction with the --create (-c) operation. tar also accepts `--portability' for this option. When you specify it, tar leaves out information about directories, pipes, fifos, contiguous files, and device files, and specifies file ownership by group and user IDs instead of group and user names.

When updating an archive, do not use --old-archive (-o) unless the archive was created using this option.

In most cases, a new format archive can be read by an old tar program without serious trouble, so this option should seldom be needed. On the other hand, most modern tars are able to read old format archives, so it might be safer for you to always use --old-archive (-o) for your distributions.

GNU tar and POSIX tar

GNU tar was based on an early draft of the POSIX 1003.1 ustar standard. GNU extensions to tar, such as the support for file names longer than 100 characters, use portions of the tar header record which were specified in that POSIX draft as unused. Subsequent changes in POSIX have allocated the same parts of the header record for other purposes. As a result, GNU tar is incompatible with the current POSIX spec, and with tar programs that follow it.

We plan to reimplement these GNU extensions in a new way which is upward compatible with the latest POSIX tar format, but we don't know when this will be done.

In the meantime, there is simply no telling what might happen if you read a GNU tar archive, which uses the GNU extensions, using some other tar program. So if you want to read the archive with another tar program, be sure to write it using the `--old-archive' option (`-o').

@FIXME{is there a way to tell which flavor of tar was used to write a particular archive before you try to read it?}

Traditionally, old tars have a limit of 100 characters on file names. GNU tar attempted two different approaches to overcome this limit, using and extending a format specified by a draft of P1003.1. The first way was not that successful, and involved `@MaNgLeD@' file names, or such; while a second approach used `././@LongLink' and other tricks, yielding better success. In theory, GNU tar should be able to handle file names of practically unlimited length. So, if GNU tar fails to dump and retrieve files whose names have more than 100 characters, then there is indeed a bug in GNU tar.

But, under strict POSIX, the limit was still 100 characters. For various other purposes, GNU tar used areas left unassigned in the POSIX draft. POSIX later revised the P1003.1 ustar format by assigning previously unused header fields, in such a way that the upper limit for file name length was raised to 256 characters. However, the actual POSIX limit oscillates between 100 and 256, depending on the precise location of slashes in the full file name (this is rather ugly). Since GNU tar uses the same fields for quite other purposes, it became incompatible with the latest POSIX standards.

For longer or non-fitting file names, we plan to use yet another set of GNU extensions, but this time, complying with the provisions POSIX offers for extending the format, rather than conflicting with it. Whenever an archive uses old GNU tar extension format or POSIX extensions, whether for very long file names or other specialities, this archive becomes non-portable to other tar implementations. In fact, anything can happen. The most forgiving tars will merely unpack the file using a wrong name, and maybe create another file named something like `@LongName', with the true file name in it. tars not protecting themselves may crash with a segmentation violation!

Compatibility concerns make all of this more difficult, as we will have to support all these things together, for a while. GNU tar should be able to produce and read true POSIX format files, while being able to detect old GNU tar formats, besides old V7 format, and process them conveniently. It will take years before this whole area stabilizes...

There are plans to raise this 100 limit to 256, and yet produce POSIX conformant archives. Past 256, I do not know yet if GNU tar will go non-POSIX again, or merely refuse to archive the file.

There are plans for GNU tar to support the latest POSIX format more fully, while being able to read old V7 format, GNU (semi-POSIX plus extension), as well as full POSIX. One may ask if there is part of the POSIX format that we still cannot support. This simple question has a complex answer. Maybe, on closer inspection, some strong limitations will pop up, but until now, nothing sounds too difficult (but see below). I only have these few pages of POSIX telling about `Extended tar Format' (P1003.1-1990 -- section 10.1.1), and there are references to other parts of the standard I do not have, which should normally enforce limitations on stored file names (I suspect things like fixing what / and NUL means). There are also some points which the standard does not make clear. Existing practice will then drive what I should do.

POSIX mandates that, when a file name cannot fit within 100 to 256 characters (the variance comes from the fact a / is ideally needed as the 156'th character), or a link name cannot fit within 100 characters, a warning should be issued and the file not be stored. Unless some --posix option is given (or POSIXLY_CORRECT is set), I suspect that GNU tar should disobey this specification, and automatically switch to using GNU extensions to overcome file name or link name length limitations.
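The "100 to 256" variance can be made concrete. In the POSIX header a name is stored in a 100-byte name field plus a 155-byte prefix field, with the slash between them implied rather than stored. The following Python sketch tries every slash as a split point; the function name `split_ustar_name` is an assumption made for this illustration, and names are assumed to be plain ASCII so that characters and bytes coincide.

```python
NAME_MAX = 100    # size of the name field
PREFIX_MAX = 155  # size of the prefix field


def split_ustar_name(path):
    """Split path into (prefix, name) for a POSIX header, or return None."""
    if len(path) <= NAME_MAX:
        return "", path
    # Try every slash as the split point; the slash itself is not stored.
    for i, ch in enumerate(path):
        if ch != "/":
            continue
        prefix, name = path[:i], path[i + 1:]
        if 0 < len(prefix) <= PREFIX_MAX and 0 < len(name) <= NAME_MAX:
            return prefix, name
    return None
```

A 256-character name fits only when a slash falls exactly at the 156th character; a 101-character name with no slash at all does not fit, which is the case where POSIX asks for a warning and where GNU extensions would take over.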

There is a problem, however, which I have not yet studied closely. Given a truly POSIX archive with names having more than 100 characters, I guess that GNU tar up to 1.11.8 will process it as if it were an old V7 archive, and be fooled by some fields which are coded differently. So, the question is to decide if the next generation of GNU tar should produce POSIX format by default, whenever possible, producing archives older versions of GNU tar might not be able to read correctly. I fear that we will have to suffer such a choice one of these days, if we want GNU tar to go closer to POSIX. We can rush it. Another possibility is to produce the current GNU tar format by default for a few years, but have GNU tar versions from some 1.POSIX and up able to recognize all three formats, and let older GNU tar fade out slowly. Then, we could switch to producing POSIX format by default, with not much harm to those still having (very old at that time) GNU tar versions prior to 1.POSIX.

POSIX format cannot represent very long names, volume headers, splitting of files in multi-volumes, sparse files, and incremental dumps; these would all be disallowed if --posix is given or POSIXLY_CORRECT is set. Otherwise, if tar is given long names, or `-[VMSgG]', then it should automatically go non-POSIX. I think this is easily granted without much discussion.

Another point is that only mtime is stored in POSIX archives, while GNU tar currently also stores atime and ctime. If we want GNU tar to go closer to POSIX, my choice would be to drop atime and ctime support on average. On the other hand, I perceive that full dumps or incremental dumps need atime and ctime support, so for those special applications, POSIX has to be avoided altogether.

A few users requested that --sparse (-S) be always active by default. I think that before replying to them, we have to decide if we want GNU tar to go closer to POSIX on average, while producing files. My choice would be to go closer to POSIX in the long run. Besides possible double reading, I do not see any reason not to try saving files as sparse when creating archives which are neither POSIX nor old-V7, so the actual --sparse (-S) would become selected by default when producing such archives, whatever the reason is. So, --sparse (-S) alone might be redefined to force GNU-format archives, and recover its previous meaning from this fact.

GNU-format as it exists now can easily fool other POSIX tar, as it uses fields which POSIX considers to be part of the file name prefix. I wonder if it would not be a good idea, in the long run, to try changing GNU-format so any added field (like ctime, atime, file offset in subsequent volumes, or sparse file descriptions) be wholly and always pushed into an extension block, instead of using space in the POSIX header block. I could manage to do that portably between future GNU tars. So other POSIX tars might be at least able to provide kind of correct listings for the archives produced by GNU tar, if not able to process them otherwise.

Using these projected extensions might induce older tars to fail. We would use the same approach as for POSIX. I'll put out a tar capable of reading POSIXier, yet extended archives, but will not produce this format by default, in GNU mode. In a few years, when newer GNU tars will have flooded out tar 1.11.X and previous, we could switch to producing POSIXier extended archives, with no real harm to users, as almost all existing GNU tars will be ready to read POSIXier format. In fact, I'll do both changes at the same time, in a few years, and just prepare tar for both changes, without effecting them, from 1.POSIX. (Both changes: 1--using POSIX convention for getting over 100 characters; 2--avoiding mangling POSIX headers for GNU extensions, using only POSIX mandated extension techniques).

So, a future tar will have a --posix flag forcing the usage of truly POSIX headers, and so, producing archives previous GNU tar will not be able to read. So, once a pretest release announces that feature, it would be particularly useful for users to test how exchangeable archives are between GNU tar with --posix and other POSIX tar.

In a few years, when GNU tar produces POSIX headers by default, --posix will have a strong meaning and will disallow GNU extensions. But in the meantime, for a long while, --posix in GNU tar will not disallow GNU extensions like --label=archive-label (-V archive-label), --multi-volume (-M), --sparse (-S), or very long file or link names. However, --posix with GNU extensions will use POSIX headers with reserved-for-users extensions to headers, and I will be curious to know how well or badly POSIX tars will react to these.

GNU tar prior to 1.POSIX, and after 1.POSIX without --posix, generates and checks `ustar ', with two suffixed spaces. This is sufficient for older GNU tar not to recognize POSIX archives, and consequently, wrongly decide those archives are in old V7 format. It is a useful bug for me, because GNU tar has other POSIX incompatibilities, and I need to segregate GNU tar semi-POSIX archives from truly POSIX archives, for GNU tar should be somewhat compatible with itself, while migrating closer to latest POSIX standards. So, I'll be very careful about how and when I will do the correction.
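The magic string described above gives a rough way to guess which flavor wrote an archive before reading it. The sketch below inspects bytes 257 to 264 of the first header block. It is an illustration under stated assumptions: the function names are invented here, the POSIX value (`ustar`, a NUL, then version `00`) is taken from the ustar header layout rather than from this text, and the sample headers are produced by Python's standard tarfile module, not by GNU tar.

```python
import io
import tarfile


def header_flavor(block):
    """Classify a 512-byte header block by its magic and version fields."""
    magic = block[257:265]          # 6 bytes of magic plus 2 of version
    if magic == b"ustar\x0000":
        return "posix"
    if magic == b"ustar  \x00":     # `ustar ' with two spaces, then a NUL
        return "gnu"
    return "v7-or-unknown"


def first_header(fmt):
    """Build a one-member archive in memory and return its first block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo("hello.txt"))
    return buf.getvalue()[:512]
```

Calling `header_flavor(first_header(tarfile.GNU_FORMAT))` and the same with `tarfile.USTAR_FORMAT` shows the two magic values side by side. A block with neither value is either old V7 format or not a tar header at all, so the check is a hint and not a guarantee.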

Checksumming Problems

SunOS and HP-UX tar fail to accept archives created using GNU tar and containing non-ASCII file names, that is, file names having characters with the eighth bit set, because they use signed checksums, while GNU tar uses unsigned checksums while creating archives, as per POSIX standards. On reading, GNU tar computes both checksums and accepts either. It is somewhat worrying that a lot of people may go around doing backup of their files using faulty (or at least non-standard) software, not learning about it until it's time to restore their missing files with an incompatible file extractor, or vice versa.

GNU tar computes checksums both ways, and accepts either on read, so GNU tar can read Sun tapes even with their wrong checksums. GNU tar produces the standard checksum, however, raising incompatibilities with Sun. That is to say, GNU tar has not been modified to produce incorrect archives to be read by buggy tars. I've been told that more recent Sun tar now reads standard archives, so maybe Sun did a similar patch, after all?

The story seems to be that when Sun first imported tar sources on their system, they recompiled it without realizing that the checksums were computed differently, because of a change in the default signedness of chars in their compiler. So they started computing checksums wrongly. When they later realized their mistake, they merely decided to stay compatible with it, and with themselves afterwards. Presumably, but I do not really know, HP-UX has chosen to make their tar archives compatible with Sun's. The current standards do not favor Sun tar format. In any case, it now falls on the shoulders of SunOS and HP-UX users to get a tar able to read the good archives they receive.
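The difference between the two checksums is easy to reproduce. The header checksum is the sum of all 512 header bytes, with the 8-byte checksum field (offsets 148 to 155) counted as spaces; the only question is whether each byte is treated as unsigned or signed. The following Python sketch computes both. The function names are invented for the illustration, and the sample headers come from Python's standard tarfile module rather than from GNU tar.

```python
import tarfile


def header_checksums(block):
    """Return (unsigned, signed) sums of a 512-byte header block.

    The 8-byte checksum field (offsets 148-155) counts as eight spaces.
    """
    body = block[:148] + b" " * 8 + block[156:512]
    unsigned = sum(body)
    # A signed-char implementation sees bytes above 127 as negative.
    signed = sum(b - 256 if b > 127 else b for b in body)
    return unsigned, signed


def stored_checksum(block):
    """Decode the octal checksum recorded in the header."""
    return int(block[148:156].strip(b" \0"), 8)


def make_header(name):
    """Build a GNU-format header for an empty file (via Python's tarfile)."""
    return tarfile.TarInfo(name).tobuf(format=tarfile.GNU_FORMAT,
                                       encoding="utf-8")
```

For a header whose bytes are all ASCII the two sums are identical, which is why the problem goes unnoticed for so long. Each byte with the eighth bit set makes the signed sum 256 smaller than the unsigned one, so a reader that accepts either value, as described above, can read both kinds of archive.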

Using Less Space through Compression

Creating and Reading Compressed Archives

@UNREVISED

-z
--gzip
--ungzip
Filter the archive through gzip.

@FIXME{ach; these two bits orig from "compare" (?). where to put?} Some format parameters must be taken into consideration when modifying an archive: @FIXME{???}. Compressed archives cannot be modified.

You can use `--gzip' and `--gunzip' on physical devices (tape drives, etc.) and remote files as well as on normal files; data to or from such devices or remote files is reblocked by another copy of the tar program to enforce the specified (or default) record size. The default compression parameters are used; if you need to override them, avoid the --gzip (--gunzip, --ungzip, -z) option and run gzip explicitly. (Or set the `GZIP' environment variable.)

The --gzip (--gunzip, --ungzip, -z) option does not work with the --multi-volume (-M) option, or with the --update (-u), --append (-r), --concatenate (--catenate, -A), or --delete operations.

It is not accurate to say that GNU tar works in concert with gzip in the same way that zip works. Certainly, tar and gzip can be invoked with a single call, as in:

$ tar cfz archive.tar.gz subdir

to save all of `subdir' into a gzip'ed archive. Later you can do:

$ tar xfz archive.tar.gz

to explode and unpack.

The difference is that the whole archive is compressed. With zip, archive members are compressed individually. tar's method yields better compression. On the other hand, one can view the contents of a zip archive without having to decompress it. As for the tar and gzip tandem, you need to decompress the archive to see its contents. However, this may be done without needing disk space, by using pipes internally:

$ tar tfz archive.tar.gz
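The same idea, listing a compressed archive by decompressing it as a stream instead of unpacking it to disk, can be sketched with Python's standard tarfile module. This is an illustration and not how GNU tar is implemented; the helper names and the member names are made up for the example. The `"r|gz"` mode reads the archive strictly sequentially, much like data arriving through a pipe.

```python
import io
import tarfile


def make_gzipped_archive(members):
    """Return the bytes of a gzip-compressed archive holding `members`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_streaming(compressed):
    """List member names, decompressing as a stream (no seeking, no temp file)."""
    with tarfile.open(fileobj=io.BytesIO(compressed), mode="r|gz") as tar:
        return [member.name for member in tar]
```

Because the whole archive is one compressed stream, `list_streaming` still has to decompress everything up to the last header, even though nothing is written to disk.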

About corrupted compressed archives: gzip'ed files have no redundancy, for maximum compression. The adaptive nature of the compression scheme means that the compression tables are implicitly spread all over the archive. If you lose a few blocks, the dynamic construction of the compression tables becomes unsynchronized, and there is little chance that you could recover later in the archive.

There are pending suggestions for having a per-volume or per-file compression in GNU tar. This would allow for viewing the contents without decompression, and for resynchronizing decompression at every volume or file, in case of corrupted archives. Doing so, we might lose some compressibility. But this would make recovery easier. So, there are pros and cons. We'll see!

-Z
--compress
--uncompress
Filter the archive through compress. Otherwise like --gzip (--gunzip, --ungzip, -z).
--use-compress-program=prog
Filter through prog (must accept `-d').

--compress (--uncompress, -Z) stores an archive in compressed format. This option is useful in saving time over networks and space in pipes, and when storage space is at a premium. --compress (--uncompress, -Z) causes tar to compress when writing the archive, or to uncompress when reading the archive.

To perform compression and uncompression on the archive, tar runs the compress utility. tar uses the default compression parameters; if you need to override them, avoid the --compress (--uncompress, -Z) option and run the compress utility explicitly. It is useful to be able to call the compress utility from within tar because the compress utility by itself cannot access remote tape drives.

The --compress (--uncompress, -Z) option will not work in conjunction with the --multi-volume (-M) option or the --append (-r), --update (-u), --concatenate (--catenate, -A) and --delete operations. See section The Five Advanced tar Operations, for more information on these operations.

If there is no compress utility available, tar will report an error. Please note that the compress program may be covered by a patent, and therefore we recommend you stop using it.

--compress
--uncompress
-z
-Z
When this option is specified, tar will compress (when writing an archive), or uncompress (when reading an archive). Used in conjunction with the --create (-c), --extract (--get, -x), --list (-t) and --compare (--diff, -d) operations.

You can have archives compressed by using the --gzip (--gunzip, --ungzip, -z) option. This arranges for tar to use the gzip program to compress or uncompress the archive when writing or reading it.

To use the older, obsolete, compress program, use the --compress (--uncompress, -Z) option. The GNU Project recommends you not use compress, because there is a patent covering the algorithm it uses. You could be sued for patent infringement merely by running compress.

I have one question, or maybe it's a suggestion if there isn't a way to do it now. I would like to use --gzip (--gunzip, --ungzip, -z), but I'd also like the output to be fed through a program like GNU ecc (actually, right now that's `exactly' what I'd like to use :-)), basically adding ECC protection on top of compression. It seems as if this should be quite easy to do, but I can't work out exactly how to go about it. Of course, I can pipe the standard output of tar through ecc, but then I lose (though I haven't started using it yet, I confess) the ability to have tar use rmt for its I/O (I think).

I think the most straightforward thing would be to let me specify a general set of filters outboard of compression (preferably ordered, so the order can be automatically reversed on input operations, and with the options they require specifiable), but beggars shouldn't be choosers and anything you decide on would be fine with me.

By the way, I like ecc but if (as the comments say) it can't deal with loss of block sync, I'm tempted to throw some time at adding that capability. Supposing I were to actually do such a thing and get it (apparently) working, do you accept contributed changes to utilities like that? (Leigh Clayton `loc@soliton.com', May 1995).

Isn't that exactly the role of the --use-compress-prog=program option? I never tried it myself, but I suspect you may want to write a prog script or program able to filter stdin to stdout the way you want. It should recognize the `-d' option, for when extraction is needed rather than creation.
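The contract such a filter program has to meet is small: read the archive on standard input, write the transformed archive on standard output, and reverse the transformation when `-d' is given. The Python sketch below shows only that contract. It is a hypothetical example: the names `run_filter` and `main` are invented here, zlib stands in for whatever transformation (compression, ECC, or both) is actually wanted, and the whole input is held in memory for brevity.

```python
import sys
import zlib


def run_filter(argv, data):
    """Compress data, or decompress it when `-d' is among the arguments."""
    if "-d" in argv:
        return zlib.decompress(data)
    return zlib.compress(data)


def main():
    # tar would connect the archive to stdin and stdout of this program.
    sys.stdout.buffer.write(run_filter(sys.argv[1:], sys.stdin.buffer.read()))
```

Saved as an executable script that ends with a call to `main()`, a program of this shape is the kind of thing the prog argument is expected to name. A real filter for large archives would process its input in chunks instead of reading it all at once.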

It has been reported that if one writes compressed data (through the --gzip (--gunzip, --ungzip, -z) or --compress (--uncompress, -Z) options) to a DLT and tries to use the DLT compression mode, the data will actually get bigger and one will end up with less space on the tape.

Archiving Sparse Files

@UNREVISED

-S
--sparse
Handle sparse files efficiently.

This option causes all files to be put in the archive to be tested for sparseness, and handled specially if they are. The --sparse (-S) option is useful when many dbm files, for example, are being backed up. Using this option dramatically decreases the amount of space needed to store such a file.

In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it is an option needing to be specified on the command line with the creation or updating of an archive.

Files in the filesystem occasionally have "holes." A hole in a file is a section of the file's contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive longer than the original. To have tar attempt to recognize the holes in a file, use --sparse (-S). When you use the --sparse (-S) option, then, for any file using less disk space than would be expected from its length, tar searches the file for consecutive stretches of zeros. It then records in the archive for the file where the consecutive stretches of zeros are, and only archives the "real contents" of the file. On extraction (using --sparse (-S) is not needed on extraction) any such files have holes created wherever the continuous stretches of zeros were found. Thus, if you use --sparse (-S), tar archives won't take more space than the original.

A file is sparse if it contains blocks of zeros whose existence is recorded, but that have no space allocated on disk. When you specify the --sparse (-S) option in conjunction with the --create (-c) operation, tar tests all files for sparseness while archiving. If tar finds a file to be sparse, it uses a sparse representation of the file in the archive. See section How to Create Archives, for more information about creating archives.
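The two steps described above, noticing that a file uses less disk space than its length suggests and then locating the stretches of zeros, can be sketched as follows. This is an illustration of the idea and not GNU tar's actual code: the function names and the 512-byte granularity are assumptions made here, and `st_blocks` is only available on systems that report it.

```python
import os

BLOCK = 512


def looks_sparse(path):
    """True if the file occupies less disk space than its length suggests."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on systems that provide it.
    return st.st_blocks * 512 < st.st_size


def data_extents(data, block=BLOCK):
    """Return (offset, length) pairs covering the blocks that are not all zero."""
    extents = []
    zero = bytes(block)
    for offset in range(0, len(data), block):
        chunk = data[offset:offset + block]
        if chunk == zero[:len(chunk)]:
            continue                      # a stretch of zeros: part of a hole
        if extents and extents[-1][0] + extents[-1][1] == offset:
            start, length = extents[-1]
            extents[-1] = (start, length + len(chunk))   # extend the run
        else:
            extents.append((offset, len(chunk)))
    return extents
```

An archiver following this scheme would store only the extents returned by `data_extents`, together with their offsets and the total length, and recreate the gaps as holes on extraction. Note that finding the extents still means reading every byte of the file.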

--sparse (-S) is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.

Please Note: Always use --sparse (-S) when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use --sparse (-S) while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). @FIXME-xref{incremental when node name is set.}

tar ignores the --sparse (-S) option when reading an archive.

--sparse
-S
Files stored sparsely in the file system are represented sparsely in the archive. Use in conjunction with write operations.

However, users should be well aware that at archive creation time, GNU tar still has to read the whole file on disk to locate the holes, and so, even if sparse files use little space on disk and in the archive, they may sometimes require an inordinate amount of time for reading and examining all-zero blocks of a file. Although it works, it's painfully slow for a large (sparse) file, even though the resulting tar archive may be small. (One user reports that dumping a `core' file of over 400 megabytes, but with only about 3 megabytes of actual data, took about 9 minutes on a Sun SPARCstation ELC, with full CPU utilisation.)

This reading is required in all cases, whether or not the --sparse (-S) option is used, so merely omitting the option does not save time(6).

Programs like dump do not have to read the entire file; by examining the file system directly, they can determine in advance exactly where the holes are and thus avoid reading through them. The only data they need read are the actual allocated data blocks. GNU tar uses a more portable and straightforward archiving approach, and it would be fairly difficult for it to do otherwise. Elizabeth Zwicky writes to `comp.unix.internals', on 1990-12-10:

What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks. st_blocks at best tells you how many holes there are; it doesn't tell you where. Just as programs may, conceivably, care what st_blocks is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable).

I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.

Handling File Attributes

@UNREVISED

When tar reads files, their access times are updated as a side effect. To have tar attempt to set the access times back to what they were before the files were read, use the --atime-preserve option. This doesn't work for files that you don't own, unless you're root, and it doesn't interact with incremental dumps nicely (see section Performing Backups and Restoring Files), but it is good enough for some purposes.
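The mechanism behind restoring access times is simple: remember the timestamps, read the file, then set the timestamps back. The Python sketch below shows that sequence using `os.stat` and `os.utime`. It is an illustration under assumptions, not GNU tar's code; the function name `read_preserving_atime` and the temporary file are invented for the example.

```python
import os
import tempfile


def read_preserving_atime(path):
    """Read a file, then put its access and modification times back."""
    before = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    # Restoring the times needs ownership of the file (or root).
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return data


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"payload")
os.utime(tmp.name, (1_000_000_000, 1_000_000_000))   # known atime and mtime
content = read_preserving_atime(tmp.name)
atime_after = os.stat(tmp.name).st_atime
os.unlink(tmp.name)
```

Setting the times back is itself a change to the file's status, so it updates the inode change time (ctime). That side effect is one plausible reason the approach sits badly with incremental dumps.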

Handling of file attributes

--atime-preserve
Do not change access times on dumped files.
-m
--touch
Do not extract file modified time. When this option is used, tar leaves the modification times of the files it extracts as the time when the files were extracted, instead of setting it to the time recorded in the archive. This option is meaningless with --list (-t).
--same-owner
Create extracted files with the same ownership they have in the archive. When tar is run by the super-user at extraction time, ownership is always restored. So, this option is meaningful only for non-root users, when tar is executed on those systems able to give files away. This is considered a security flaw by many people, at least because it makes it quite difficult to correctly account users for the disk space they occupy. Also, the suid or sgid attributes of files are easily and silently lost when files are given away. When writing an archive, tar writes the user id and user name separately. If it can't find a user name (because the user id is not in `/etc/passwd'), then it does not write one. When restoring, and doing a chmod like when you use --same-permissions (--preserve-permissions, -p) (@FIXME{same-owner?}), it tries to look the name (if one was written) up in `/etc/passwd'. If it fails, then it uses the user id stored in the archive instead.
--numeric-owner
The --numeric-owner option allows (ANSI) archives to be written without user/group name information or such information to be ignored when extracting. It effectively disables the generation and/or use of user/group name information. This option forces extraction using the numeric ids from the archive, ignoring the names. This is useful in certain circumstances, when restoring a backup from an emergency floppy with different passwd/group files for example. It is otherwise impossible to extract files with the right ownerships if the password file in use during the extraction does not match the one belonging to the filesystem(s) being extracted. This occurs, for example, if you are restoring your files after a major crash and had booted from an emergency floppy with no password file or put your disk into another machine to do the restore. The numeric ids are always saved into tar archives. The identifying names are added at create time when provided by the system, unless --old-archive (-o) is used. Numeric ids could be used when moving archives between a collection of machines using a centralized management for attribution of numeric ids to users and groups. This is often done using the NIS capabilities. When making a tar file for distribution to other sites, it is sometimes cleaner to use a single owner for all files in the distribution, and nicer to specify the write permission bits of the files as stored in the archive independently of their actual value on the file system. The way to prepare a clean distribution is usually to have some Makefile rule creating a directory, copying all needed files in that directory, then setting ownership and permissions as wanted (there are a lot of possible schemes), and only then making a tar archive out of this directory, before cleaning everything out. Of course, we could add a lot of options to GNU tar for fine tuning permissions and ownership. This is not a good approach, I think. GNU tar is already crowded with options and moreover, the approach just explained gives you a great deal of control already.
-p
--same-permissions
--preserve-permissions
Extract all protection information. This option causes tar to set the modes (access permissions) of extracted files exactly as recorded in the archive. If this option is not used, the current umask setting limits the permissions on extracted files. This option is meaningless with --list (-t).
--preserve
Same as both --same-permissions (--preserve-permissions, -p) and --same-order (--preserve-order, -s). The --preserve option has no equivalent short option name. It is equivalent to --same-permissions (--preserve-permissions, -p) plus --same-order (--preserve-order, -s). @FIXME{I do not see the purpose of such an option. (Neither I. FP.)}
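The ownership lookup described under --same-owner and --numeric-owner amounts to a short decision: use the numeric id directly when names are to be ignored or none was recorded, otherwise look the name up and fall back to the numeric id if the lookup fails. The Python sketch below expresses that decision for user ids on a Unix system. The function name `resolve_uid` and its parameters are assumptions made for this illustration; it is not GNU tar's code.

```python
import pwd


def resolve_uid(uname, uid, numeric_owner=False):
    """Pick the user id to give an extracted file."""
    if numeric_owner or not uname:
        return uid                      # ignore names, as with --numeric-owner
    try:
        return pwd.getpwnam(uname).pw_uid
    except KeyError:
        return uid                      # name unknown here: fall back to the id
```

This also shows why --numeric-owner matters when restoring from an emergency system: with a different password file, the name lookup can succeed and silently return the wrong id, whereas the numeric id stored in the archive is used unchanged.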

The Standard Format

@UNREVISED

While an archive may contain many files, the archive itself is a single ordinary file. Like any other file, an archive file can be written to a storage device such as a tape or disk, sent through a pipe or over a network, saved on the active file system, or even stored in another archive. An archive file is not easy to read or manipulate without using the tar utility or Tar mode in GNU Emacs.

Physically, an archive consists of a series of file entries terminated by an end-of-archive entry, which consists of 512 zero bytes. A file entry usually describes one of the files in the archive (an archive member), and consists of a file header and the contents of the file. File headers contain file names and statistics, checksum information which tar uses to detect file corruption, and information about file types.

Archives are permitted to have more than one member with the same member name. One way this situation can occur is if more than one version of a file has been stored in the archive. For information about adding new versions of a file to an archive, see section Updating an Archive, and to learn more about having more than one archive member with the same name, see @FIXME-xref{-backup node, when it's written}.

In addition to entries describing archive members, an archive may contain entries which tar itself uses to store information. See section Including a Label in the Archive, for an example of such an archive entry.

A tar archive file contains a series of blocks. Each block contains BLOCKSIZE bytes. Although this format may be thought of as being on magnetic tape, other media are often used.

Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there may be a block filled with binary zeros as an end-of-file marker. A reasonable system should write a block of zeros at the end, but must not assume that such a block exists when reading an archive.

The blocks may be blocked for physical I/O operations. Each record of n blocks (where n is set by the --blocking-factor=512-size (-b 512-size) option to tar) is written with a single `write ()' operation. On magnetic tapes, the result of such a write is a single record. When writing an archive, the last record of blocks should be written at the full size, with blocks after the zero block containing all zeros. When reading an archive, a reasonable system should properly handle an archive whose last record is shorter than the rest, or which contains garbage records after a zero block.
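
The block and record arithmetic described above can be sketched in C as follows. This is only an illustration: the function names member_blocks and padded_blocks are made up for this example, and the sketch assumes that file contents are padded up to a whole number of 512-byte blocks.

```c
#include <assert.h>

#define BLOCKSIZE 512

/* Blocks occupied by one archive member: one header block plus the
   contents rounded up to a whole number of blocks.  */
unsigned long
member_blocks (unsigned long size)
{
  return 1 + (size + BLOCKSIZE - 1) / BLOCKSIZE;
}

/* Blocks actually written once the archive is padded out to a whole
   number of records of blocking_factor blocks each.  */
unsigned long
padded_blocks (unsigned long blocks, unsigned long blocking_factor)
{
  unsigned long records;

  assert (blocking_factor > 0);
  records = (blocks + blocking_factor - 1) / blocking_factor;
  return records * blocking_factor;
}
```

For instance, a 513-byte file takes three blocks (one header, two of contents), and with a blocking factor of 20 an archive of 21 blocks would be written as two full records, that is 40 blocks.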

The header block is defined in C as follows. In the GNU tar distribution, this is part of file `src/tar.h':

/* GNU tar Archive Format description.  */

/* If OLDGNU_COMPATIBILITY is not zero, tar produces archives which, by
   default, are readable by older versions of GNU tar.  This can be
   overridden by using --posix; in this case, POSIXLY_CORRECT in environment
   may be set for enforcing stricter conformance.  If OLDGNU_COMPATIBILITY
   is zero or undefined, tar will eventually produce archives which are, by
   default, POSIX compatible; then either using --posix or defining
   POSIXLY_CORRECT enforces stricter conformance.

   This #define will disappear in a few years.  FP, June 1995.  */
#define OLDGNU_COMPATIBILITY 1

/*---------------------------------------------.
| `tar' Header Block, from POSIX 1003.1-1990.  |
`---------------------------------------------*/

/* POSIX header.  */

struct posix_header
{                               /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
                                /* 500 */
};

#define TMAGIC   "ustar"        /* ustar and a null */
#define TMAGLEN  6
#define TVERSION "00"           /* 00 and no null */
#define TVERSLEN 2

/* Values used in typeflag field.  */
#define REGTYPE  '0'            /* regular file */
#define AREGTYPE '\0'           /* regular file */
#define LNKTYPE  '1'            /* link */
#define SYMTYPE  '2'            /* reserved */
#define CHRTYPE  '3'            /* character special */
#define BLKTYPE  '4'            /* block special */
#define DIRTYPE  '5'            /* directory */
#define FIFOTYPE '6'            /* FIFO special */
#define CONTTYPE '7'            /* reserved */

/* Bits used in the mode field, values in octal.  */
#define TSUID    04000          /* set UID on execution */
#define TSGID    02000          /* set GID on execution */
#define TSVTX    01000          /* reserved */
                                /* file permissions */
#define TUREAD   00400          /* read by owner */
#define TUWRITE  00200          /* write by owner */
#define TUEXEC   00100          /* execute/search by owner */
#define TGREAD   00040          /* read by group */
#define TGWRITE  00020          /* write by group */
#define TGEXEC   00010          /* execute/search by group */
#define TOREAD   00004          /* read by other */
#define TOWRITE  00002          /* write by other */
#define TOEXEC   00001          /* execute/search by other */

/*-------------------------------------.
| `tar' Header Block, GNU extensions.  |
`-------------------------------------*/

/* In GNU tar, SYMTYPE is for symbolic links, and CONTTYPE is for
   contiguous files, so maybe disobeying the `reserved' comment in POSIX
   header description.  I suspect these were meant to be used this way, and
   should not have really been `reserved' in the published standards.  */

/* *BEWARE* *BEWARE* *BEWARE* that the following information is still
   boiling, and may change.  Even if the OLDGNU format description should be
   accurate, the so-called GNU format is not yet fully decided.  It is
   surely meant to use only extensions allowed by POSIX, but the sketch
   below repeats some ugliness from the OLDGNU format, which should rather
   go away.  Sparse files should be saved in such a way that they do *not*
   require two passes at archive creation time.  Huge files get some POSIX
   fields to overflow, alternate solutions have to be sought for this.  */

/* Descriptor for a single file hole.  */

struct sparse
{                               /* byte offset */
  char offset[12];              /*   0 */
  char numbytes[12];            /*  12 */
                                /*  24 */
};

/* Sparse files are not supported in POSIX ustar format.  For sparse files
   with a POSIX header, a GNU extra header is provided which holds overall
   sparse information and a few sparse descriptors.  When an old GNU header
   replaces both the POSIX header and the GNU extra header, it holds some
   sparse descriptors too.  Whether POSIX or not, if more sparse descriptors
   are still needed, they are put into as many successive sparse headers as
   necessary.  The following constants tell how many sparse descriptors fit
   in each kind of header able to hold them.  */

#define SPARSES_IN_EXTRA_HEADER  16
#define SPARSES_IN_OLDGNU_HEADER 4
#define SPARSES_IN_SPARSE_HEADER 21

/* The GNU extra header contains some information GNU tar needs, but not
   foreseen in POSIX header format.  It is only used after a POSIX header
   (and never with old GNU headers), and immediately follows this POSIX
   header, when typeflag is a letter rather than a digit, so signaling a GNU
   extension.  */

struct extra_header
{                               /* byte offset */
  char atime[12];               /*   0 */
  char ctime[12];               /*  12 */
  char offset[12];              /*  24 */
  char realsize[12];            /*  36 */
  char longnames[4];            /*  48 */
  char unused_pad1[68];         /*  52 */
  struct sparse sp[SPARSES_IN_EXTRA_HEADER];
                                /* 120 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* Extension header for sparse files, used immediately after the GNU extra
   header, and used only if all sparse information cannot fit into that
   extra header.  There might even be many such extension headers, one after
   the other, until all sparse information has been recorded.  */

struct sparse_header
{                               /* byte offset */
  struct sparse sp[SPARSES_IN_SPARSE_HEADER];
                                /*   0 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* The old GNU format header conflicts with POSIX format in such a way that
   POSIX archives may fool old GNU tar's, and POSIX tar's might well be
   fooled by old GNU tar archives.  An old GNU format header uses the space
   used by the prefix field in a POSIX header, and cumulates information
   normally found in a GNU extra header.  With an old GNU tar header, we
   never see any POSIX header nor GNU extra header.  Supplementary sparse
   headers are allowed, however.  */

struct oldgnu_header
{                               /* byte offset */
  char unused_pad1[345];        /*   0 */
  char atime[12];               /* 345 */
  char ctime[12];               /* 357 */
  char offset[12];              /* 369 */
  char longnames[4];            /* 381 */
  char unused_pad2;             /* 385 */
  struct sparse sp[SPARSES_IN_OLDGNU_HEADER];
                                /* 386 */
  char isextended;              /* 482 */
  char realsize[12];            /* 483 */
                                /* 495 */
};

/* OLDGNU_MAGIC uses both magic and version fields, which are contiguous.
   Found in an archive, it indicates an old GNU header format, which will
   hopefully become obsolescent.  With OLDGNU_MAGIC, uname and gname are
   valid, though the header is not truly POSIX conforming.  */
#define OLDGNU_MAGIC "ustar  "  /* 7 chars and a null */

/* The standards committee allows only capital A through capital Z for
   user-defined expansion.  */

/* This is a dir entry that contains the names of files that were in the
   dir at the time the dump was made.  */
#define GNUTYPE_DUMPDIR 'D'

/* Identifies the *next* file on the tape as having a long linkname.  */
#define GNUTYPE_LONGLINK 'K'

/* Identifies the *next* file on the tape as having a long name.  */
#define GNUTYPE_LONGNAME 'L'

/* This is the continuation of a file that began on another volume.  */
#define GNUTYPE_MULTIVOL 'M'

/* For storing filenames that do not fit into the main header.  */
#define GNUTYPE_NAMES 'N'

/* This is for sparse files.  */
#define GNUTYPE_SPARSE 'S'

/* This file is a tape/volume header.  Ignore it on extraction.  */
#define GNUTYPE_VOLHDR 'V'

/*--------------------------------------.
| tar Header Block, overall structure.  |
`--------------------------------------*/

/* tar files are made in basic blocks of this size.  */
#define BLOCKSIZE 512

enum archive_format
{
  DEFAULT_FORMAT,               /* format to be decided later */
  V7_FORMAT,                    /* old V7 tar format */
  OLDGNU_FORMAT,                /* GNU format as per before tar 1.12 */
  POSIX_FORMAT,                 /* restricted, pure POSIX format */
  GNU_FORMAT                    /* POSIX format with GNU extensions */
};

union block
{
  char buffer[BLOCKSIZE];
  struct posix_header header;
  struct extra_header extra_header;
  struct oldgnu_header oldgnu_header;
  struct sparse_header sparse_header;
};

/* End of Format description.  */

All characters in header blocks are represented by using 8-bit characters in the local variant of ASCII. Each field within the structure is contiguous; that is, there is no padding used within the structure. Each character on the archive medium is stored contiguously.

Bytes representing the contents of files (after the header block of each file) are not translated in any way and are not constrained to represent characters in any character set. The tar format does not distinguish text files from binary files, and no translation of file contents is performed.

The name, linkname, magic, uname, and gname are null-terminated character strings. All other fields are zero-filled octal numbers in ASCII. Each numeric field of width w contains w minus 2 digits, a space, and a null, except size and mtime, which do not contain the trailing null.
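
As an illustration of the numeric field encoding, here is a minimal C sketch of a decoder. The name from_octal and its exact behavior on malformed input are assumptions made for this example; it is not claimed to match the routine GNU tar itself uses.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a numeric header field of the given width.  Leading spaces are
   skipped, octal digits are accumulated, and decoding stops at the first
   space, null, or other non-octal byte, or at the end of the field.  */
unsigned long
from_octal (const char *field, size_t width)
{
  size_t i = 0;
  unsigned long value = 0;

  assert (field != NULL);
  while (i < width && field[i] == ' ')
    i++;
  while (i < width && field[i] >= '0' && field[i] <= '7')
    value = value * 8 + (unsigned long) (field[i++] - '0');
  return value;
}
```

An 8-byte mode field holding six digits, a space, and a null decodes with from_octal (header.mode, 8); a 12-byte size field holds eleven digits and a space, with no trailing null, and is decoded with a width of 12.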

The name field is the file name of the file, with directory names (if any) preceding the file name, separated by slashes.

@FIXME{how big a name before field overflows?}

The mode field provides nine bits specifying file permissions and three bits to specify the Set UID, Set GID, and Save Text (sticky) modes. Values for these bits are defined above. When special permissions are required to create a file with a given mode, and the user restoring files from the archive does not hold such permissions, the mode bit(s) specifying those special permissions are ignored. Modes which are not supported by the operating system restoring files from the archive will be ignored. Unsupported modes should be faked up when creating or updating an archive; e.g. the group permission could be copied from the other permission.
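
To make the layout of the nine permission bits concrete, the following C sketch renders them in the style of `ls -l'. The function mode_to_string is hypothetical; the Set UID, Set GID, and sticky bits are deliberately left out of the rendering.

```c
#include <assert.h>
#include <string.h>

/* Render the nine permission bits of a tar mode value as "rwxrwxrwx"
   with dashes for cleared bits.  out must have room for 10 bytes
   (nine characters and a null).  Bits above 0777 are ignored.  */
void
mode_to_string (unsigned int mode, char *out)
{
  static const char flags[] = "rwxrwxrwx";
  int i;

  assert (out != NULL);
  memcpy (out, "---------", 10);
  for (i = 0; i < 9; i++)
    if (mode & (0400u >> i))    /* 0400 is TUREAD, 0001 is TOEXEC */
      out[i] = flags[i];
}
```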

The uid and gid fields are the numeric user and group ID of the file owners, respectively. If the operating system does not support numeric user or group IDs, these fields should be ignored.

The size field is the size of the file in bytes; linked files are archived with this field specified as zero. @FIXME-xref{Modifiers}, in particular the --incremental (-G) option.

The mtime field is the modification time of the file at the time it was archived. It is the ASCII representation of the octal value of the last time the file was modified, represented as an integer number of seconds since January 1, 1970, 00:00 Coordinated Universal Time.

The chksum field is the ASCII representation of the octal value of the simple sum of all bytes in the header block. Each 8-bit byte in the header is added to an unsigned integer, initialized to zero, the precision of which shall be no less than seventeen bits. When calculating the checksum, the chksum field is treated as if it were all blanks.
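
The checksum rule can be written down in a few lines of C. This is a sketch rather than the code GNU tar uses: the name header_checksum and the two CHKSUM_ constants are introduced here for illustration, with the offset and length taken from the posix_header layout shown earlier.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKSIZE     512
#define CHKSUM_OFFSET 148       /* offset of chksum in struct posix_header */
#define CHKSUM_LENGTH 8         /* width of the chksum field */

/* Sum all bytes of a header block as unsigned values, treating the
   chksum field itself as if it were filled with blanks.  unsigned long
   is at least 32 bits wide, which satisfies the seventeen-bit minimum.  */
unsigned long
header_checksum (const unsigned char *block)
{
  unsigned long sum = 0;
  size_t i;

  assert (block != NULL);
  for (i = 0; i < BLOCKSIZE; i++)
    {
      if (i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH)
        sum += ' ';
      else
        sum += block[i];
    }
  return sum;
}
```

A reader would compare this value with the octal number decoded from the chksum field; a header block that is otherwise all zeros sums to 8 blanks, that is 256.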

The typeflag field specifies the type of file archived. If a particular implementation does not recognize or permit the specified type, the file will be extracted as if it were a regular file. As this action occurs, tar issues a warning to the standard error.

The atime and ctime fields are used in making incremental backups; they store, respectively, the particular file's access time and last inode-change time.

The offset field is used by the --multi-volume (-M) option, when making a multi-volume archive. It records the number of bytes into the file at which tar must restart in order to continue the file on the next tape.

The following fields were added to deal with sparse files. A file is sparse if it contains unallocated blocks, which read back as zeros, i.e., no useful data. A test to see if a file is sparse is to look at the number of blocks allocated for it versus the number of characters in the file; if there are fewer blocks allocated for the file than would normally be allocated for a file of that size, then the file is sparse. This is the method tar uses to detect a sparse file, and once such a file is detected, it is treated differently from non-sparse files.
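
The test just described can be sketched as a pure function in C. The name looks_sparse and its parameters are assumptions for this example; on a real system the size and the block count would typically come from the st_size and st_blocks members returned by stat, and the unit of st_blocks varies between systems, which is why it is passed in explicitly here.

```c
#include <assert.h>

/* Return nonzero if a file of the given size occupies fewer blocks than
   a fully allocated file of that size would need.
   size:             file length in bytes.
   allocated_blocks: number of blocks the file system reports as allocated.
   block_bytes:      size in bytes of one such block (often 512).  */
int
looks_sparse (unsigned long size, unsigned long allocated_blocks,
              unsigned long block_bytes)
{
  unsigned long expected_blocks;

  assert (block_bytes > 0);
  expected_blocks = size / block_bytes + (size % block_bytes != 0);
  return allocated_blocks < expected_blocks;
}
```

A one-megabyte file reported as using only 8 blocks of 512 bytes would be flagged as sparse, while the same file using 2048 blocks would not.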

Sparse files are often dbm files, or other database-type files which have data at some points and emptiness in the greater part of the file. Such files can appear to be very large when an `ls -l' is done on them, when in truth, there may be a very small amount of important data contained in the file. It is thus undesirable to have tar think that it must back up this entire file, as great quantities of room are wasted on empty blocks, which can lead to running out of room on a tape far earlier than is necessary. Thus, sparse files are dealt with so that these empty blocks are not written to the tape. Instead, what is written to the tape is a description, of sorts, of the sparse file: where the holes are, how big the holes are, and how much data is found at the end of the hole. This way, the file takes up potentially far less room on the tape, and when the file is extracted later on, it will look exactly the way it looked beforehand. The following is a description of the fields used to handle a sparse file:

The sp is an array of struct sparse. Each struct sparse contains two 12-character strings which represent an offset into the file and a number of bytes to be written at that offset. The offset is absolute, and not relative to the offset in the preceding array element.

The header can hold four of these struct sparse at the moment; if more are needed, they are not stored in the header.

The isextended flag is set when an extended_header is needed to deal with a file. Note that this means that this flag can only be set when dealing with a sparse file, and it is only set in the event that the description of the file will not fit in the allotted room for sparse structures in the header.

The extended_header structure (declared as struct sparse_header in the listing above) is used for sparse files which need more sparse structures than can fit in the header. The header can fit 4 such structures; if more are needed, the flag isextended gets set and the next block is an extended_header.

Each extended_header structure contains an array of 21 sparse structures, along with a similar isextended flag that the header had. There can be an indeterminate number of such extended_headers to describe a sparse file.

REGTYPE
AREGTYPE
These flags represent a regular file. In order to be compatible with older versions of tar, a typeflag value of AREGTYPE should be silently recognized as a regular file. New archives should be created using REGTYPE. Also, for backward compatibility, tar treats a regular file whose name ends with a slash as a directory.
LNKTYPE
This flag represents a file linked to another file, of any type, previously archived. Such files are identified in Unix by each file having the same device and inode number. The linked-to name is specified in the linkname field with a trailing null.
SYMTYPE
This represents a symbolic link to another file. The linked-to name is specified in the linkname field with a trailing null.
CHRTYPE
BLKTYPE
These represent character special files and block special files respectively. In this case the devmajor and devminor fields will contain the major and minor device numbers respectively. Operating systems may map the device specifications to their own local specification, or may ignore the entry.
DIRTYPE
This flag specifies a directory or sub-directory. The directory name in the name field should end with a slash. On systems where disk allocation is performed on a directory basis, the size field will contain the maximum number of bytes (which may be rounded to the nearest disk block allocation unit) which the directory may hold. A size field of zero indicates no such limiting. Systems which do not support limiting in this manner should ignore the size field.
FIFOTYPE
This specifies a FIFO special file. Note that the archiving of a FIFO file archives the existence of this file and not its contents.
CONTTYPE
This specifies a contiguous file, which is the same as a normal file except that, in operating systems which support it, all its space is allocated contiguously on the disk. Operating systems which do not allow contiguous allocation should silently treat this type as a normal file.
A ... Z
These are reserved for custom implementations. Some of these are used in the GNU modified format, as described below.

Other values are reserved for specification in future revisions of the P1003 standard, and should not be used by any tar program.

The magic field indicates that this archive was output in the P1003 archive format. If this field contains TMAGIC, the uname and gname fields will contain the ASCII representation of the owner and group of the file respectively. If these names are found on the system extracting the archive, the corresponding user and group IDs are used rather than the values in the uid and gid fields.
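
A reader can tell the header flavors apart by looking at the bytes starting at offset 257. The following C sketch shows one way to do so; the enum header_kind and the function classify_magic are hypothetical names, and treating every other value as a V7 header is a simplifying assumption.

```c
#include <assert.h>
#include <string.h>

#define TMAGIC       "ustar"    /* ustar and a null */
#define TMAGLEN      6
#define OLDGNU_MAGIC "ustar  "  /* 7 chars and a null */
#define MAGIC_OFFSET 257        /* offset of magic in struct posix_header */

enum header_kind { KIND_V7, KIND_POSIX, KIND_OLDGNU };

/* Classify a 512-byte header block by its magic (and version) bytes.  */
enum header_kind
classify_magic (const char *block)
{
  const char *magic;

  assert (block != NULL);
  magic = block + MAGIC_OFFSET;
  if (memcmp (magic, OLDGNU_MAGIC, 8) == 0)
    return KIND_OLDGNU;         /* spans the magic and version fields */
  if (memcmp (magic, TMAGIC, TMAGLEN) == 0)
    return KIND_POSIX;
  return KIND_V7;               /* assumption: anything else is old V7 */
}
```

The two comparisons cannot both succeed, since the sixth byte is a null in a POSIX header and a space in an old GNU header.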

For references, see ISO/IEC 9945-1:1990 or IEEE Std 1003.1-1990, pages 169-173 (section 10.1) for Archive/Interchange File Format; and IEEE Std 1003.2-1992, pages 380-388 (section 4.48) and pages 936-940 (section E.4.48) for pax - Portable archive interchange.

GNU Extensions to the Archive Format

@UNREVISED

The GNU format uses additional file types to describe new types of files in an archive. These are listed below.

GNUTYPE_DUMPDIR
'D'
This represents a directory and a list of files created by the --incremental (-G) option. The size field gives the total size of the associated list of files. Each file name is preceded by either a `Y' (the file should be in this archive) or an `N'. (The file is a directory, or is not stored in the archive.) Each file name is terminated by a null. There is an additional null after the last file name.
GNUTYPE_MULTIVOL
'M'
This represents a file continued from another volume of a multi-volume archive created with the --multi-volume (-M) option. The original type of the file is not given here. The size field gives the maximum size of this piece of the file (assuming the volume does not end before the file is written out). The offset field gives the offset from the beginning of the file where this part of the file begins. Thus size plus offset should equal the original size of the file.
GNUTYPE_SPARSE
'S'
This flag indicates that we are dealing with a sparse file. Note that archiving a sparse file requires special operations to find holes in the file, which mark the positions of these holes, along with the number of bytes of data to be found after the hole.
GNUTYPE_VOLHDR
'V'
This file type is used to mark the volume header that was given with the --label=archive-label (-V archive-label) option when the archive was created. The name field contains the name given after the --label=archive-label (-V archive-label) option. The size field is zero. Only the first file in each volume of an archive should have this type.
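
The file list stored with a GNUTYPE_DUMPDIR entry, as described above, is simple to walk. The C sketch below counts its entries; the function name count_dumpdir is made up for this example, and the sketch assumes the list is well formed, that is, terminated by the additional null.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Walk a dumpdir list: each entry is a flag character (`Y' or `N'),
   the file name, and a terminating null; an extra null ends the list.
   Returns the number of entries and stores in *in_archive how many of
   them carry the `Y' flag.  */
size_t
count_dumpdir (const char *list, size_t *in_archive)
{
  size_t total = 0;

  assert (list != NULL && in_archive != NULL);
  *in_archive = 0;
  while (*list != '\0')
    {
      if (*list == 'Y')
        (*in_archive)++;
      total++;
      list += strlen (list) + 1;  /* skip flag, name, and its null */
    }
  return total;
}
```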

You may have trouble reading a GNU format archive on a non-GNU system if the options --incremental (-G), --multi-volume (-M), --sparse (-S), or --label=archive-label (-V archive-label) were used when writing the archive. In general, if tar does not use the GNU-added fields of the header, other versions of tar should be able to read the archive. Otherwise, the tar program will give an error, the most likely one being a checksum error.

Comparison of tar and cpio

@UNREVISED

@FIXME{Reorganize the following material}

The cpio archive formats, like tar, do have maximum pathname lengths. The binary and old ASCII formats have a max path length of 256, and the new ASCII and CRC ASCII formats have a max path length of 1024. GNU cpio can read and write archives with arbitrary pathname lengths, but other cpio implementations may crash inexplicably trying to read them.

tar handles symbolic links in the form in which it comes in BSD; cpio doesn't handle symbolic links in the form in which it comes in System V prior to SVR4, and some vendors may have added symlinks to their system without enhancing cpio to know about them. Others may have enhanced it in a way other than the way I did it at Sun, and which was adopted by AT&T (and which is, I think, also present in the cpio that Berkeley picked up from AT&T and put into a later BSD release--I think I gave them my changes).

(SVR4 does some funny stuff with tar; basically, its cpio can handle tar format input, and write it on output, and it probably handles symbolic links. They may not have bothered doing anything to enhance tar as a result.)

cpio handles special files; traditional tar doesn't.

tar comes with V7, System III, System V, and BSD source; cpio comes only with System III, System V, and later BSD (4.3-tahoe and later).

tar's way of handling multiple hard links to a file can handle file systems that support 32-bit inumbers (e.g., the BSD file system); cpio's way requires you to play some games (in its "binary" format, i-numbers are only 16 bits, and in its "portable ASCII" format, they're 18 bits--it would have to play games with the "file system ID" field of the header to make sure that the file system ID/i-number pairs of different files were always different), and I don't know which cpios, if any, play those games. Those that don't might get confused and think two files are the same file when they're not, and make hard links between them.

tar's way of handling multiple hard links to a file places only one copy of the link on the tape, but the name attached to that copy is the only one you can use to retrieve the file; cpio's way puts one copy for every link, but you can retrieve it using any of the names.

What type of checksum (if any) is used, and how is it calculated?

See the attached manual pages for tar and cpio format. tar uses a checksum which is the sum of all the bytes in the tar header for a file; cpio uses no checksum.

If anyone knows why cpio was made when tar was present at the unix scene,

It wasn't. cpio first showed up in PWB/UNIX 1.0; no generally-available version of UNIX had tar at the time. I don't know whether any version that was generally available within AT&T had tar, or, if so, whether the people within AT&T who did cpio knew about it.

On restore, if there is a corruption on a tape tar will stop at that point, while cpio will skip over it and try to restore the rest of the files.

The main difference is just in the command syntax and header format.

tar is a little more tape-oriented in that everything is blocked to start on a record boundary.

Are there any differences between the two in the ability to recover crashed archives? (Is there any chance of recovering crashed archives at all?)

Theoretically it should be easier under tar since the blocking lets you find a header with some variation of `dd skip=nn'. However, modern cpio's and variations have an option to just search for the next file header after an error with a reasonable chance of re-syncing. However, lots of tape driver software won't allow you to continue past a media error which should be the only reason for getting out of sync unless a file changed sizes while you were writing the archive.

If anyone knows why cpio was made when tar was present at the unix scene, please tell me about this too.

Probably because it is more media efficient (by not blocking everything and using only the space needed for the headers where tar always uses 512 bytes per file header) and it knows how to archive special files.

You might want to look at the freely available alternatives. The major ones are afio, GNU tar, and pax, each of which have their own extensions with some backwards compatibility.

Sparse files were tarred as sparse files (which you can easily test, because the resulting archive gets smaller, and GNU cpio can no longer read it).

