TAR Archive Format

The tar archive format offers a reliable way to store files. It preserves the original data byte for byte, placing a 512-byte header in front of each file and padding the file data to a multiple of 512 bytes. The header includes a built-in checksum for error detection. If the standard tar utility encounters a corrupted header, it may skip that header (and consequently the associated file), but it continues processing the rest of the archive, so undamaged files remain accessible.

About TAR Archive Information

The tar file format, short for "tape archive", originated with the tar utility on UNIX systems. It bundles files together into a single archive for tasks such as backup or distribution. A tar file, often called a tarball, stores multiple files in uncompressed form along with metadata about each file.
Because the format has no built-in compression, tar archives are often compressed with external utilities such as gzip, bzip2, XZ (which uses the LZMA / LZMA2 compression algorithms also found in 7-Zip / p7zip), Brotli, Zstandard, and similar tools. Compression reduces the size of the archive for easier transfer and more efficient backup. The resulting files may have single extensions such as tgz, tbz, txz, tzst, or double extensions such as tar.gz, tar.br, tar.bz2, tar.xz, tar.zst.
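
As a rough illustration of how external compression wraps a tar stream, the sketch below uses Python's standard tarfile module (not the Aspose.ZIP API) to write the same content as a plain tar, tar.gz, tar.bz2, and tar.xz archive in memory and compare the sizes. The helper build_archive, the entry name notes.txt, and the sample payload are hypothetical and exist only for this example.

```python
import io
import tarfile

def build_archive(mode: str, payload: bytes) -> bytes:
    """Return a tar archive holding one file, written with the given tarfile mode."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        info = tarfile.TarInfo(name="notes.txt")  # hypothetical entry name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

# Highly repetitive sample data, so every compressor has something to remove.
payload = b"tar stores data uncompressed\n" * 2000

# "w" is plain tar; the suffix after the colon selects the compressor.
sizes = {
    mode: len(build_archive(mode, payload))
    for mode in ("w", "w:gz", "w:bz2", "w:xz")
}
for mode, size in sizes.items():
    print(f"{mode:6} {size} bytes")
```

The plain archive is slightly larger than the payload because of the header and block padding, while the compressed variants shrink according to how repetitive the data is.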

Evolution of the TAR Archive Format

The tar archive format has evolved over time. New features added to the tar utility since the 1980s led to format extensions that include additional information for improved functionality. Early tar formats lacked consistency in how numeric fields were stored, but this was addressed in later versions to enhance portability. This improvement began with the first POSIX standard for tar formats in 1988.
POSIX.1-2001 introduced the "extended tar" format, also known as the pax format. This format is the most flexible, incorporating functionality from other tar specifications. It allows vendors to add custom features using tags. While documentation notes that not all tar implementations handle this format perfectly, its design ensures that any tool capable of reading “ustar” archives can also read most “posix” archives. Additionally, POSIX.1-2001 removed the 8 GB limit on the size of an individual file that earlier tar headers imposed.
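
The 8 GB figure follows from the 12-byte octal size field in the header: 11 octal digits can express at most 8^11 - 1 bytes. The sketch below, which assumes Python's standard tarfile module rather than any Aspose.ZIP API, builds only a header (no file data) for a hypothetical entry named big.bin that is one byte over that limit and checks how the ustar and pax formats react.

```python
import tarfile

# The size field is 12 bytes: 11 octal digits plus a terminator.
USTAR_MAX_SIZE = 8 ** 11 - 1  # 8 GiB minus one byte

info = tarfile.TarInfo(name="big.bin")  # hypothetical entry; no data is written
info.size = USTAR_MAX_SIZE + 1

# A ustar header cannot represent the size, so header creation fails.
try:
    info.tobuf(format=tarfile.USTAR_FORMAT)
    ustar_ok = True
except ValueError:
    ustar_ok = False

# The pax format stores the real size in an extended header record instead.
pax_header = info.tobuf(format=tarfile.PAX_FORMAT)
pax_has_size_record = b"size=8589934592" in pax_header

print("ustar header created:", ustar_ok)
print("pax size record present:", pax_has_size_record)
```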

TAR Structure

A TAR archive, at its core, is a sequence of data blocks. These fixed-size blocks, 512 bytes each, are arranged linearly. The end of the archive is marked by two consecutive blocks filled with zeros.
However, when viewed logically, a TAR archive is a series of file entries. Each entry is made up of multiple blocks, with the first block always being the entry header. The remaining blocks store the actual file content.
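
The block layout described above can be walked with a few lines of code. The following is a minimal sketch, assuming plain ustar entries with short names (extension entries written by pax or GNU tar would show up as entries of their own); it uses Python's standard tarfile module only to produce sample data, and the helper list_entries and the entry names are hypothetical.

```python
import io
import tarfile

BLOCK = 512

def list_entries(data: bytes):
    """Yield (name, size) pairs by stepping through raw 512-byte blocks."""
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block begins the end marker
            break
        name = header[0:100].rstrip(b"\0").decode("utf-8", errors="replace")
        size_text = header[124:136].rstrip(b"\0 ").decode("ascii")
        size = int(size_text, 8) if size_text else 0
        yield name, size
        data_blocks = (size + BLOCK - 1) // BLOCK  # content is padded to whole blocks
        offset += BLOCK * (1 + data_blocks)        # skip the header plus the content

# Build a two-entry sample archive in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    for name, payload in (("a.txt", b"hello"), ("b.bin", bytes(700))):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

data = buffer.getvalue()
entries = list(list_entries(data))
print(entries)
```

Here the 5-byte file occupies one content block and the 700-byte file occupies two, so the second header starts at offset 1024 and the end marker at offset 2560.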

Inside the Entry Header

The entry header acts as a blueprint for each file within the archive. It contains the following information:

  • File Name (100 bytes): The name of the file stored in this entry.
  • File Permissions (8 bytes): Permissions for accessing the file, represented as an octal string.
  • Owner ID (8 bytes): The numerical user ID of the file owner (octal format).
  • Group ID (8 bytes): The numerical group ID of the file owner (octal format).
  • File Size (12 bytes): The size of the file in octal format.
  • Last Modified Time (12 bytes): The octal timestamp of the last file modification.
  • Checksum (8 bytes): A checksum value used to verify the integrity of the header data.
  • File Type (1 byte): Indicates the type of file stored (regular file, hard link, or symbolic link).
  • Linked File Name (100 bytes): If the entry is a link (hard or symbolic), this field stores the name of the linked file.
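
The field list above maps directly onto the first 257 bytes of a header block. As a minimal sketch (using Python's standard struct and tarfile modules, not the Aspose.ZIP API), the code below unpacks those fields and verifies the checksum, which is the sum of all 512 header bytes with the checksum field itself counted as eight spaces. The helpers octal and parse_header and the entry name report.txt are hypothetical.

```python
import io
import struct
import tarfile

# name, mode, uid, gid, size, mtime, checksum, type flag, link name
HEADER_FORMAT = "=100s8s8s8s12s12s8sc100s"
HEADER_FIELDS_SIZE = struct.calcsize(HEADER_FORMAT)  # 257 bytes

def octal(field: bytes) -> int:
    """Decode a NUL- or space-terminated octal text field."""
    text = field.rstrip(b"\0 ").decode("ascii")
    return int(text, 8) if text else 0

def parse_header(block: bytes) -> dict:
    """Decode the basic fields of one 512-byte header block."""
    name, mode, uid, gid, size, mtime, chksum, typeflag, linkname = struct.unpack(
        HEADER_FORMAT, block[:HEADER_FIELDS_SIZE]
    )
    # The checksum field (bytes 148-155) counts as eight spaces while summing.
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:512])
    return {
        "name": name.rstrip(b"\0").decode("utf-8", errors="replace"),
        "mode": octal(mode),
        "uid": octal(uid),
        "gid": octal(gid),
        "size": octal(size),
        "mtime": octal(mtime),
        "typeflag": typeflag.decode("ascii"),
        "linkname": linkname.rstrip(b"\0").decode("utf-8", errors="replace"),
        "checksum_ok": computed == octal(chksum),
    }

# Produce a sample header with the standard library and parse it back.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    info = tarfile.TarInfo(name="report.txt")
    info.size = 11
    info.mode = 0o640
    archive.addfile(info, io.BytesIO(b"hello world"))

block = buffer.getvalue()[:512]
header = parse_header(block)
print(header)
```

A reader that finds checksum_ok to be false can treat the header as damaged and move on to the next block, which is the skip-and-continue behavior described at the start of this article.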

Benefits of this format

  • Versatility - TAR is a versatile format capable of storing multiple files and directories in a single archive file, making it suitable for various backup and distribution needs.
  • Preservation of File Attributes - TAR preserves important file attributes such as permissions, ownership, and timestamps, ensuring that the archived data retains its integrity and usability.
  • Simplicity - The structure of TAR files is simple and straightforward, making them easy to work with and process. This simplifies programming and automation of tasks related to TAR archives.

TAR Archive Supported Operations

Aspose.ZIP allows you to extract either a particular entry or the whole archive. With Aspose.ZIP for .NET, you can use the TarArchive class to open the .tar.gz file and then iterate through its entries, extracting them to a desired location. Aspose.ZIP for Java follows a similar approach, using TarArchive to open the .tar.gz file and extract its entries.
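
To make the two extraction styles concrete without guessing at Aspose.ZIP signatures, here is a generic sketch that uses Python's standard tarfile module as a stand-in. It is not the Aspose.ZIP API. It builds a small .tar.gz in memory, then reads one particular entry by name and, separately, every regular file in the archive. The entry names and contents are hypothetical.

```python
import io
import tarfile

# Build a small .tar.gz in memory so the sketch is self-contained.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name, payload in (("docs/readme.txt", b"read me"), ("data.bin", b"\x00\x01\x02")):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    # Extract one particular entry by its name.
    single = archive.extractfile("docs/readme.txt").read()

    # Or iterate over the whole archive and read every regular file.
    everything = {
        member.name: archive.extractfile(member).read()
        for member in archive.getmembers()
        if member.isfile()
    }

print(single)
print(sorted(everything))
```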

TAR-file - Internal Structure

Segment files store raw data about a segment. While different segment types exist, TAR files only differentiate between data and bulk segments. Bulk segments are directly saved "as-is" in the TAR file.
Data segments, however, are examined to find references to other segments or raw binary content. These references are simply stored as a list of unique identifiers (UUIDs) within the data segment. The referenced segments can be located either within the current TAR file or externally.
Internal references are found by checking the TAR file’s index. External references require an external tool to locate the segment in another TAR file. The list of referenced segments in a data segment is stored in the graph file for faster retrieval. This list is kept ordered to optimize the search process.

Inner Archive Structure

  • File Metadata - Similar to a tar archive, each file stores basic information like modification time and permissions. However, this section is flexible and allows omitting or including additional details like access control lists (ACLs) or extended attributes (EAs) based on your needs. It’s recommended to include a strong hash function (like SHA1) for regular files to ensure data integrity.
  • Multiple Content Streams - Unlike traditional archives, files can have more than one data stream within the inner data file. This is useful for storing extended attributes or resource forks associated with the file.
  • Headers - The inner index file holds file headers, mirroring those scattered throughout the inner data file. But, when stored separately, the index headers must reference the starting position of their corresponding data within the data file. Additionally, directory entries in the index list their contained files and their corresponding offsets within the inner file index.
  • Rationale for Duplicate Metadata - This design choice ensures both efficient data streaming/decoding and random file access. Additionally, metadata compresses well, resulting in minimal storage overhead. Tests show metadata typically occupies less than 0.3% of storage space, making the trade-off worthwhile.
  • Block Headers - Block headers, similar to the outer file, contain block size information and a unique identifier sequence.

Examples of Using TAR

The Aspose.ZIP API lets you extract archives in your applications without any other third-party software. It provides the TarArchive class for working with TAR archives.

Add entries to existing TAR archive via C#

All you need to do is open the archive and add an entry to it.

    using Aspose.Zip.Tar;

    // Open the existing archive, append one entry, and save the result under a new name.
    using (TarArchive archive = new TarArchive("existing.tar"))
    {
        archive.CreateEntry("one_more.bin", "data.bin");
        archive.Save("added.tar");
    }

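Delete entries from existing TAR archive via .NET

Entries of a tar archive can be deleted in a similar way, using the DeleteEntry methods.

    using Aspose.Zip.Tar;

    // Remove the first entry (index 0) and save the remaining content as a new archive.
    using (var archive = new TarArchive("two_files.tar"))
    {
        archive.DeleteEntry(0);
        archive.Save("single_file.tar");
    }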

Add files to TAR archive without compression

Tar is a file archival format used to group multiple files and directories into a single archive file without compression, unlike formats such as ZIP and RAR. To create a tar archive without specifying any compression settings, simply use a TarArchive instance.

    using System.IO;
    using Aspose.Zip.Tar;

    // Create the output file, add two files as entries, and write the archive to the stream.
    using (FileStream tarFile = File.Open("joint.tar", FileMode.Create))
    {
        FileInfo fi1 = new FileInfo("text.txt");
        FileInfo fi2 = new FileInfo("picture.png");
        using (TarArchive archive = new TarArchive())
        {
            archive.CreateEntry("text.txt", fi1);
            archive.CreateEntry("picture.png", fi2);
            archive.Save(tarFile);
        }
    }

Aspose.Zip offers individual archive processing APIs for popular development environments, listed below:

  • Aspose.Zip for .NET
  • Aspose.Zip via Java
  • Aspose.Zip via Python.NET

Additional information about TAR-archives

People have been asking

1. What is a TAR archive?

A TAR archive, short for Tape Archive, is a file format used to bundle multiple files and directories into a single archive file without compression. It is commonly used for backup and distribution purposes in Unix-based systems.

2. What are the benefits of TAR archives?

TAR boasts universality, as it is compatible with most operating systems and archive programs, facilitating seamless data sharing and exchange across different platforms. Its simplicity lies in the straightforward structure of TAR archives, enabling effortless creation, extraction, and manipulation of files. Moreover, TAR offers efficiency by allowing compression with external tools like gzip or bzip2, enabling users to reduce file size and conserve storage space and bandwidth during data transmission.

3. What are some limitations of TAR archives?

While TAR is a versatile file format commonly used for archiving and distributing files in Unix-based systems, it does come with some limitations to be aware of. Firstly, TAR lacks built-in compression capabilities, meaning you’ll need additional tools like gzip or bzip2 to reduce file sizes. Secondly, TAR archives do not offer native encryption features, so if you require data security, you’ll have to rely on external tools for password protection. Lastly, the basic TAR header records only core file attributes such as permissions, ownership, and modification time; richer metadata such as access control lists (ACLs) or extended attributes is not covered by the basic format and depends on format extensions.