12.5. File and Archiving Commands

Archiving

tar

The standard UNIX archiving utility. Originally a Tape ARchiving program, it has developed into a general purpose package that can handle all manner of archiving with all types of destination devices, ranging from tape drives to regular files to even stdout (see Example 4-4). GNU tar has long since been patched to accept gzip compression options, such as tar czvf archive-name.tar.gz *, which recursively archives and compresses all files in a directory tree except dotfiles in the current working directory ($PWD). [1]

Some useful tar options:

  1. -c create (a new archive)

  2. --delete delete (files from the archive)

  3. -r append (files to the archive)

  4. -t list (archive contents)

  5. -u update archive

  6. -x extract (files from the archive)

  7. -z gzip the archive

Caution

It may be difficult to recover data from a corrupted gzipped tar archive. When archiving important files, make multiple backups.

shar

Shell archiving utility. The files in a shell archive are concatenated without compression, and the resultant archive is essentially a shell script, complete with #!/bin/sh header, and containing all the necessary unarchiving commands. Shar archives still show up in Internet newsgroups, but otherwise shar has been pretty well replaced by tar/gzip. The unshar command unpacks shar archives.

ar

Creation and manipulation utility for archives, mainly used for binary object file libraries.

cpio

This specialized archiving copy command (copy input and output) is rarely seen any more, having been supplanted by tar/gzip. It still has its uses, such as moving a directory tree.

Compression

gzip

The standard GNU/UNIX compression utility, replacing the inferior and proprietary compress. The corresponding decompression command is gunzip, which is the equivalent of gzip -d.

The zcat filter decompresses a gzipped file to stdout, as possible input to a pipe or redirection. This is, in effect, a cat command that works on compressed files (including files processed with the older compress utility). The zcat command is equivalent to gzip -dc.

Caution

On some commercial UNIX systems, zcat is a synonym for uncompress -c, and will not work on gzipped files.

See also Example 7-6.

bzip2

An alternate compression utility, usually more efficient than gzip, especially on large files. The corresponding decompression command is bunzip2.

compress, uncompress

This is an older, proprietary compression utility found in commercial UNIX distributions. The more efficient gzip has largely replaced it. Linux distributions generally include a compress workalike for compatibility, although gunzip can unarchive files treated with compress.

Tip

The znew command transforms compressed files into gzipped ones.

sq

Yet another compression utility, a filter that works only on sorted ASCII word lists. It uses the standard invocation syntax for a filter, sq < input-file > output-file. Fast, but not nearly as efficient as gzip. The corresponding uncompression filter is unsq, invoked like sq.

Tip

The output of sq may be piped to gzip for further compression.

zip, unzip

Cross-platform file archiving and compression utility compatible with DOS PKZIP. "Zipped" archives seem to be a more acceptable medium of exchange on the Internet than "tarballs".

File Information

file

A utility for identifying file types. The command file file-name will return a file specification for file-name, such as ascii text or data. It references the magic numbers found in /usr/share/magic, /etc/magic, or /usr/lib/magic, depending on the Linux/UNIX distribution.

The -f option causes file to run in batch mode, to read from a designated file a list of filenames to analyze. The -z option, when used on a compressed target file, forces an attempt to analyze the uncompressed file type.

bash$ file test.tar.gz
test.tar.gz: gzip compressed data, deflated, last modified: Sun Sep 16 13:34:51 2001, os: Unix

bash file -z test.tar.gz
test.tar.gz: GNU tar archive (gzip compressed data, deflated, last modified: Sun Sep 16 13:34:51 2001, os: Unix)
	      

Example 12-24. stripping comments from C program files

#!/bin/bash
# strip-comment.sh: Strips out the comments (/* COMMENT */) in a C program.

E_NOARGS=65
E_ARGERROR=66
E_WRONG_FILE_TYPE=67

if [ $# -eq "$E_NOARGS" ]
then
  echo "Usage: `basename $0` C-program-file" >&2 # Error message to stderr.
  exit $E_ARGERROR
fi  

# Test for correct file type.
type=`eval file $1 | awk '{ print $2, $3, $4, $5 }'`
# "file $1" echoes file type...
# then awk removes the first field of this, the filename...
# then the result is fed into the variable "type".
correct_type="ASCII C program text"

if [ "$type" != "$correct_type" ]
then
  echo
  echo "This script works on C program files only."
  echo
  exit $E_WRONG_FILE_TYPE
fi  


# Rather cryptic sed script:
#--------
sed '
/^\/\*/d
/.*\/\*/d
' $1
#--------
# Easy to understand if you take several hours to learn sed fundamentals.


#  Need to add one more line to the sed script to deal with
#+ case where line of code has a comment following it on same line.
#  This is left as a non-trivial exercise.

# Also, the above code deletes lines with a "*/" or "/*",
# not a desirable result.

exit 0


# ----------------------------------------------------------------
# Code below this line will not execute because of 'exit 0' above.

# Stephane Chazelas suggests the following alternative:

usage() {
  echo "Usage: `basename $0` C-program-file" >&2
  exit 1
}

WEIRD=`echo -n -e '\377'`   # or WEIRD=$'\377'
[[ $# -eq 1 ]] || usage
case `file "$1"` in
  *"C program text"*) sed -e "s%/\*%${WEIRD}%g;s%\*/%${WEIRD}%g" "$1" \
     | tr '\377\n' '\n\377' \
     | sed -ne 'p;n' \
     | tr -d '\n' | tr '\377' '\n';;
  *) usage;;
esac

# This is still fooled by things like:
# printf("/*");
# or
# /*  /* buggy embedded comment */
#
# To handle all special cases (comments in strings, comments in string
# where there is a \", \\" ...) the only way is to write a C parser
# (lex or yacc perhaps?).

exit 0
which

which command-xxx gives the full path to "command-xxx". This is useful for finding out whether a particular command or utility is installed on the system.

$bash which rm
/usr/bin/rm

whereis

Similar to which, above, whereis command-xxx gives the full path to "command-xxx", but also to its manpage.

$bash whereis rm
rm: /bin/rm /usr/share/man/man1/rm.1.bz2

whatis

whatis filexxx looks up "filexxx" in the whatis database. This is useful for identifying system commands and important configuration files. Consider it a simplified man command.

$bash whatis whatis
whatis               (1)  - search the whatis database for complete words

See also Example 10-3.

vdir

Show a detailed directory listing. The effect is similar to ls -l.

This is one of the GNU fileutils.

bash$ vdir
total 10
 -rw-r--r--    1 bozo  bozo      4034 Jul 18 22:04 data1.xrolo
 -rw-r--r--    1 bozo  bozo      4602 May 25 13:58 data1.xrolo.bak
 -rw-r--r--    1 bozo  bozo       877 Dec 17  2000 employment.xrolo

bash ls -l
total 10
 -rw-r--r--    1 bozo  bozo      4034 Jul 18 22:04 data1.xrolo
 -rw-r--r--    1 bozo  bozo      4602 May 25 13:58 data1.xrolo.bak
 -rw-r--r--    1 bozo  bozo       877 Dec 17  2000 employment.xrolo
	      

shred

Securely erase a file by overwriting it multiple times with random bit patterns before deleting it. This command has the same effect as Example 12-36, but does it in a more thorough and elegant manner.

This is one of the GNU fileutils.

Caution

Using shred on a file may not prevent recovery of some or all of its contents using advanced forensic technology.

locate, slocate

The locate command searches for files using a database stored for just that purpose. The slocate command is the secure version of locate (which may be aliased to slocate).

$bash locate hickson
/usr/lib/xephem/catalogs/hickson.edb

strings

Use the strings command to find printable strings in a binary or data file. It will list sequences of printable characters found in the target file. This might be handy for a quick 'n dirty examination of a core dump or for looking at an unknown graphic image file (strings image-file | more might show something like JFIF, which would identify the file as a jpeg graphic). In a script, you would probably parse the output of strings with grep or sed. See Example 10-7 and Example 10-9.

Example 12-26. An "improved" strings command

#!/bin/bash
# wstrings.sh: "word-strings" (enhanced "strings" command)
#
#  This script filters the output of "strings" by checking it
#+ against a standard word list file.
#  This effectively eliminates all the gibberish and noise,
#+ and outputs only recognized words.

# =================================================================
#                 Standard Check for Script Argument(s)
ARGS=1
E_BADARGS=65
E_NOFILE=66

if [ $# -ne $ARGS ]
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

if [ -f "$1" ]                        # Check if file exists.
then
    file_name=$1
else
    echo "File \"$1\" does not exist."
    exit $E_NOFILE
fi
# =================================================================


MINSTRLEN=3                           #  Minimum string length.
WORDFILE=/usr/share/dict/linux.words  #  Dictionary file.
                                      #  May specify a different
                                      #+ word list file
                                      #+ of format 1 word per line.


wlist=`strings "$1" | tr A-Z a-z | tr '[:space:]' Z | \
tr -cs '[:alpha:]' Z | tr -s '\173-\377' Z | tr Z ' '`

# Translate output of 'strings' command with multiple passes of 'tr'.
#  "tr A-Z a-z"  converts to lowercase.
#  "tr '[:space:]'"  converts whitespace characters to Z's.
#  "tr -cs '[:alpha:]' Z"  converts non-alphabetic characters to Z's,
#+ and squeezes multiple consecutive Z's.
#  "tr -s '\173-\377' Z"  converts all characters past 'z' to Z's
#+ and squeezes multiple consecutive Z's,
#+ which gets rid of all the weird characters that the previous
#+ translation failed to deal with.
#  Finally, "tr Z ' '" converts all those Z's to whitespace,
#+ which will be seen as word separators in the loop below.

#  Note the technique of feeding the output of 'tr' back to itself,
#+ but with different arguments and/or options on each pass.


for word in $wlist                    # Important:
                                      # $wlist must not be quoted here.
                                      # "$wlist" does not work.
                                      # Why?
do

  strlen=${#word}                     # String length.
  if [ "$strlen" -lt "$MINSTRLEN" ]   # Skip over short strings.
  then
    continue
  fi

  grep -Fw $word "$WORDFILE"          # Match whole words only.

done  


exit 0

Utilities

basename

Strips the path information from a file name, printing only the file name. The construction basename $0 lets the script know its name, that is, the name it was invoked by. This can be used for "usage" messages if, for example a script is called with missing arguments:
echo "Usage: `basename $0` arg1 arg2 ... argn"

dirname

Strips the basename from a filename, printing only the path information.

Note

basename and dirname can operate on any arbitrary string. The argument does not need to refer to an existing file, or even be a filename for that matter (see Example A-7).

split

Utility for splitting a file into smaller chunks. Usually used for splitting up large files in order to back them up on floppies or preparatory to e-mailing or uploading them.

sum, cksum, md5sum

These are utilities for generating checksums. A checksum is a number mathematically calculated from the contents of a file, for the purpose of checking its integrity. A script might refer to a list of checksums for security purposes, such as ensuring that the contents of key system files have not been altered or corrupted. The md5sum command is the most appropriate of these in security applications.

Note that cksum also shows the size, in bytes, of the target file.

bash$ cksum /boot/vmlinuz
1670054224 804083 /boot/vmlinuz


bash$ md5sum /boot/vmlinuz
0f43eccea8f09e0a0b2b5cf1dcf333ba  /boot/vmlinuz
	      

Encoding and Encryption

uuencode

This utility encodes binary files into ASCII characters, making them suitable for transmission in the body of an e-mail message or in a newsgroup posting.

uudecode

This reverses the encoding, decoding uuencoded files back into the original binaries.

Tip

The fold -s command may be useful (possibly in a pipe) to process long uudecoded text messages downloaded from Usenet newsgroups.

mimencode, mmencode

The mimencode and mmencode commands process multimedia-encoded e-mail attachments. Although mail user agents (such as pine or kmail) normally handle this automatically, these particular utilities permit manipulating such attachments manually from the command line or in a batch by means of a shell script.

crypt

At one time, this was the standard UNIX file encryption utility. [2] Politically motivated government regulations prohibiting the export of encryption software resulted in the disappearance of crypt from much of the UNIX world, and it is still missing from most Linux distributions. Fortunately, programmers have come up with a number of decent alternatives to it, among them the author's very own cruft (see Example A-4).

Miscellaneous

make

Utility for building and compiling binary packages. This can also be used for any set of operations that is triggered by incremental changes in source files.

The make command checks a Makefile, a list of file dependencies and operations to be carried out.

install

Special purpose file copying command, similar to cp, but capable of setting permissions and attributes of the copied files. This command seems tailormade for installing software packages, and as such it shows up frequently in Makefiles (in the make install : section). It could likewise find use in installation scripts.

more, less

Pagers that display a text file or stream to stdout, one screenful at a time. These may be used to filter the output of a script.

Notes

[1]

A tar czvf ... will include dotfiles in directories below the current working directory. This is an undocumented tar "feature".

[2]

This is a symmetric block cipher, used to encrypt files on a single system or local network, as opposed to the "public key" cipher class, of which pgp is a well-known example.