by David G. Korn, Charles J. Northrup, and Jeffery Korn
The Unix system was one of the first systems that didn't make the command interpreter a part of the operating system or a privileged task. It was written as an ordinary user process with no special permissions or calls to unadvertised functions. This has led to a succession of better and better shells. The early generations of Unix came with a command shell written by Ken Thompson, one of the inventors of the Unix system. By the late 1970s, two vastly improved shells emerged. The Bourne shell, created by Steve Bourne at Bell Telephone Laboratories, was a big improvement as a language. The C shell, created by Bill Joy at the University of California at Berkeley, was a much improved command interpreter but a poor language.
The KornShell, written by David Korn at Bell Telephone Laboratories, combined the best features of both of these shells, and added the ability to edit and reenter the current and previous commands using the same keystrokes as either the vi or the Emacs editor as the user desired. This shell became very popular, but its distribution was restricted. As a result, several freely available imitations such as pdksh and bash were created. An enhanced version of C shell, tcsh, was created to provide visual editing to C shell users.
While the Bourne shell provided a good basis for programming, and this was improved upon by earlier versions of KornShell, it was not adequate for general purpose scripting without combining it with other languages such as the awk programming language. While in most instances the two languages work well together, the performance penalty of using two languages with separate processes is often prohibitive. The Perl language was created to provide a single language with the combined functionality of the shell and awk. However, Perl has a syntax that many find difficult to understand.
ksh93, the latest major revision of the KornShell language, provides an alternative to Tcl and Perl. As a programming language, it has comparable speed and functionality to each of these languages, yet is arguably the best interactive shell. It is a superset of the POSIX 1003.2 shell standard. Like Tcl, it is extensible and embeddable with a C language application programming interface. In fact, two graphical shells have been created using ksh93. One of these, dtksh, is a Motif-based language developed by Novell. The other, Tksh, written by Jeff Korn at Princeton University, uses the Tk library, and is briefly discussed here.
The best way to describe the new features found in ksh93 is to illustrate them through an example. We will create a shell script named lsc, shown in Listing 1, to provide an ls output with subdirectory names printed in bold. We will need to maintain the multi-column output associated with the standard ls.
The lsc script will produce the ls output for each directory name provided as a command line argument. The default action is to produce the ls output for the current directory. Several modifications can be made to the lsc script for enhanced performance. We leave them as an exercise for the reader. We perform the following high level actions for each directory name to be processed.
for each directory do
    load directory entries into array entries
    load entries
    calculate number of columns in multi-column output
    calculate maximum number of rows
    print the current directory name
    determine output layout
    add entries to row[] array
    add entries to col[] array
    calculate the column widths
    display the output
done
ksh93 provides one-dimensional indexed and associative arrays. An array element is referenced as varName[subscript]. Indexed arrays use arithmetic expressions for subscripts, which permits computation within the subscript expression. The reference varName[3+8], for example, refers to the element with subscript 11 of the indexed array. (Arithmetic expressions are described more fully below.)
The elements of an indexed array can be initialized from a list using the varName=(....) command. This provides a convenient notation for initializing an array to contain the names of files in a given directory. The number of entries in the array describes the number of files found. As an example, consider the following statement to initialize the entries indexed array with the names of files found in the current directory: entries=(*)
An associative array uses arbitrary strings for subscripts. We could, for example, create a state tax associative array and reference elements by the state name. This works even for space separated tokens within the string, such as New Jersey.
typeset -A StateTax
StateTax[New Jersey]=0.06
print ${StateTax[New Jersey]}
Several special positional parameter expansions are provided for array processing. Using ${varName[@]} refers to all elements of the array. The subscripts of an array can be referenced with ${!varName[@]}. The notation ${#varName[@]} provides the number of elements within the array. Elements within a numeric subscript range can be referenced using ${varName[@]:offset:length}. This special notation works with both indexed and associative arrays.
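The expansions above can be tried on a small indexed array. The following is a minimal sketch, not part of the lsc script: the sample values and the function names entryCount, entrySubscripts, and entrySlice are hypothetical, and printf is used in place of print so that each value appears on its own line.

```shell
# Hypothetical sample data standing in for the file names from entries=(*).
entries=(alpha beta gamma delta)

# ${#entries[@]}: the number of elements in the array.
function entryCount {
    printf '%d\n' "${#entries[@]}"
}

# ${!entries[@]}: the subscripts of the array, here 0 through 3.
function entrySubscripts {
    printf '%s\n' "${!entries[@]}"
}

# ${entries[@]:offset:length}: length elements starting at subscript offset.
# $1 is the offset and $2 is the length.
function entrySlice {
    printf '%s\n' "${entries[@]:$1:$2}"
}

entryCount          # prints 4
entrySlice 1 2      # prints beta and gamma, one per line
```

Subscripts of an indexed array start at 0, so an offset of 1 selects the second element.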
Arrays are used throughout the example lsc script. We define video as an associative array with capability names from the terminfo database as subscripts. The definition of video is provided as a compound assignment for an associative array.
video=(
    [bold]=$(tput bold)
    [reset]=$(tput reset)
    [reverse]=$(tput reverse)
)
Each element is assigned a value from the standard output of a tput execution for the capability name. For example, video[bold] is the terminfo sequence for bold lettering. Similarly, video[reverse] will provide reverse video output.
The notation $(command) causes command to execute in a subshell of the current ksh. In many instances, ksh will not actually fork/exec a subshell when command is a built-in or a shell function. (Built-in functions are described below.)
In ksh93 a variable is defined by a name=value pair. The variable name space is hierarchical with . (dot) delimiters. The expanded name space permits an aggregate definition for a variable.
The lsc script will produce multi-column output. We visualize the output as a table consisting of rows and columns. A common definition for row and column is provided by the definition of a compound variable named cell.
cell=(
    # maximum number of cells
    integer maximum=0
    # maximum width based on entries
    integer width=0
    # current index within the cell
    integer index=0
    # content of the cell
    typeset entries
)
This defines the variable cell, with aggregate members maximum, width, index, and entries. A reference of ${cell.index} provides the value associated with the index aggregate. Using the eval command we can create additional variables with the same aggregates. We can, for example, define variables row and col to have the same definition as cell:
eval row="$cell"
eval col="$cell"
ksh93 provides support for internationalization. Double-quoted strings preceded by a $ are checked for message substitution. If the string appears in the message catalog, then ksh93 will substitute the string with the corresponding string from the message catalog. Otherwise, the string is unchanged.
In the lsc example, we display an error message of "not found" for any command line arguments that are not readable directories. The error message we provide is defined with internationalization support (see line 33 of Listing 1). If the shell variable LANG is defined to some locale other than POSIX, ksh will attempt to replace the error message using internationalization support. Otherwise, the message remains unchanged.
Executing ksh -D on a shell script will output all messages identified for internationalization. In the lsc script, for example, ksh -D will output the following message.
"${video[reverse]} not found ${video[reset]}"
ksh93 is extensible through the KornShell Development Kit (KDK). You can write your own built-in functions in C and load them into the current shell environment through the builtin command. This feature is available on operating systems with the ability to load and link code into the current process at run time.
A built-in command is executed without creating a separate process. Instead, the command is invoked as a C function by ksh. If this function has no side effects in the shell process, then the behavior of this built-in is identical to that of the equivalent stand-alone command. The primary difference in this case is performance: the overhead of process creation is eliminated. For commands of short duration, the effect can be dramatic. For example, on SunOS 4.1, wc on a small file of about 1000 bytes runs about 50 times faster as a built-in command than as a separate process.
In addition, built-in commands that have side effects on the shell environment can be written. Using the API, available through the KornShell Development Kit, you can extend the application domain for shell programming. For example, an X-Windows extension that makes heavy use of the shell variable namespace was added as a group of built-in commands. The result is a windowing shell that can be used to write X-Windows applications.
While there are definite advantages to adding built-in commands, there are some disadvantages as well. Since the built-in command and ksh share the same address space, a coding error in the built-in program may affect the behavior of ksh, perhaps causing it to core dump or hang. Debugging is also more complex since the built-in's code is now a part of a larger entity. The isolation provided by a separate process guarantees that all resources used by the command will be freed when the command completes; this guarantee does not apply to built-ins. Also, since the address space of ksh will be larger, this may increase the time it takes ksh to fork() and exec() a non-builtin command [though not on more advanced operating systems like Linux, which conserve memory and time by doing ``copy-on-write'' when they fork--ED]. It makes no sense to add a built-in command that takes a long time to run or that is run only once, since the performance benefits will be negligible. Built-ins that have side effects in the current shell environment have the disadvantage of increasing the coupling between the built-in and ksh making the overall system less modular and more monolithic.
Despite these drawbacks, in many cases extending ksh by adding built-in commands makes sense and allows reuse of the shell scripting capability in an application-specific domain.
In the lsc example, we need to determine the maximum string size within a list of strings. This is required to determine the initial number of columns in the multi-column display. We will also use this to determine the maximum width for a column of entries. A typical shell implementation would be given as:
(( max_stringSize = 0 ))
for fileName in *
do
    if (( max_stringSize < ${#fileName} ))
    then
        (( max_stringSize = ${#fileName} ))
    fi
done
(See Arithmetic Expressions, below, for an explanation of (( and )).)
To improve performance, we can rewrite this function in C. In a simple test, the shell version required 0.58 seconds of CPU time, while the C built-in required 0.08 seconds for the same task. The C function name is prefixed with ``b_'' to indicate that it is a built-in function. When compiled, the strlenList.o object is archived into a shared library. To reference the strlenList function, we must load it into the current ksh environment through the builtin command (see line 29 of Listing 1).
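#pragma prototyped
#include "shell.h"
#include <stdio.h>
#include <string.h>

/* Built-in strlenList: print the length of the longest argument. */
int b_strlenList(int argc, char **argv, void *extra)
{
    int max = 0, n = 0;
    char **cp = argv;

    /* argv[0] is the command name; skip it and scan the remaining arguments */
    while ( *(++cp) )
    {
        n = (int)strlen(*cp);
        max = max < n ? n : max;
    }
    fprintf(stdout, "%d\n", max);
    return(0);
}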
ksh93 provides two methods for function definitions. The formats are given as:
function name { body }
name() { body }
The second function format is provided for compatibility with POSIX standards. The primary distinction is that of variable name scope. In a POSIX function, a variable definition has global scope. In the following POSIX function bar, variable foo is redefined to a value of 6.
typeset foo=5
bar() {
    typeset foo=6
    echo $foo
}
bar
6
echo $foo
6
Variable definitions in ksh93 functions have local scope. In the following ksh93 function bar, a local variable foo is defined and has precedence over the global variable foo.
typeset foo=5
function bar {
    typeset foo=6
    echo $foo
}
bar
6
echo $foo
5
ksh93 provides active variables through a series of discipline functions. From the shell level, you can write get, set, and unset disciplines. Through the KornShell Development Kit, you can also add disciplines unique to your environment.
When a variable is referenced, as in $foo, ksh will invoke the get discipline associated with foo. The default discipline is to simply return the current value associated with foo. From the shell level, you can define a foo.get discipline function.
The set discipline is called when a value is assigned to a variable. Within the set discipline, the special variable .sh.name is the name of the variable whose value is being set.
On line 31 of lsc, we define a max_stringSize.get discipline function. Every reference to ${max_stringSize} will result in this function being executed. The value of the special .sh.value variable is the value returned from the discipline.
In ksh93, a printf statement is available following the ANSI C printf definition. This permits formatting specifications to be applied to each argument. To appreciate the differences between the standard print and printf statements, consider how you would output the contents of the entries array (from the lsc example), one per line. The standard print statement would display the file names as space-separated tokens on a single line. Using the printf statement with a "%s\n" format, however, would produce the desired results.
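The one-per-line output described above relies on printf reusing its format string for each remaining argument. The following is a small sketch with hypothetical sample file names; the function names listEntries and listEntriesPadded are illustrative and do not appear in lsc.

```shell
# Hypothetical sample data standing in for the entries array of lsc.
entries=(lsc.ksh notes.txt src)

# The "%s\n" format is applied once per array element,
# so each name lands on its own line.
function listEntries {
    printf '%s\n' "${entries[@]}"
}

# A width in the format left-justifies each name in a 10-character field,
# which is the kind of padding a multi-column listing needs.
function listEntriesPadded {
    printf '%-10s|\n' "${entries[@]}"
}

listEntries
listEntriesPadded
```

The trailing | in the second format is there only to make the padding visible.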
ksh93 statements of the form (( expression )) are called arithmetic commands. Arithmetic commands return True when the value of the enclosed expression is non-zero, and False when the expression evaluates to zero. The construct $((expression)) can be used as a word or part of a word. It is replaced by the value of expression.
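As a brief illustration of both forms, the sketch below uses (( )) as the condition of an if command and $(( )) as a word. The function names isNonZero and columnsFor and the sample numbers are assumptions made for this example only.

```shell
# (( expression )) succeeds (True) when the expression is non-zero.
function isNonZero {
    if (( $1 ))
    then
        printf 'true\n'
    else
        printf 'false\n'
    fi
}

# $(( expression )) is replaced by the value of the expression.
# $1 is the screen width and $2 is the width of one column;
# both are integers, so the division is an integer division.
function columnsFor {
    printf '%d\n' "$(( $1 / $2 ))"
}

isNonZero 0         # prints false
isNonZero 5         # prints true
columnsFor 80 12    # prints 6
```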
In the lsc example, line 38, we evaluate the value of the discipline function using:
(( .sh.value = $(strlenList ${entries[@]}) + 3 ))
ksh93 will evaluate the expression, which includes an assignment to the .sh.value variable. Note that the command substitution:
$(strlenList ${entries[@]})
will invoke the strlenList built-in function and return the maximum width of the strings (given as element values) in the entries[] array. We add 3 to this value for formatting purposes.
An ANSI C string is defined by preceding the single-quoted string with a $. For example, $'*' is the literal asterisk, *. With ANSI C strings, all characters between the single quotes retain their literal meaning, except for escape sequences. An escape sequence is introduced by the escape character \.
ANSI C string support provides an essential feature for shell programmers. Consider, for example, having to process variables with embedded tabs in their values. Without ANSI C string support, we would not be able to effectively test the value of the variable for embedded tabs. As an example, consider the following script:
print "foo\tbar" > /tmp/foobar
read aline < /tmp/foobar
if [[ "${aline}" == "foo\tbar" ]]
then
    print TRUE
fi
The comparison (see Conditional Commands, below) will fail. We can replace the conditional with ANSI C strings and ensure proper functionality. The example above should be rewritten as:
print "foo\tbar" > /tmp/foobar
read aline < /tmp/foobar
if [[ "${aline}" == $'foo\tbar' ]]
then
    print TRUE
fi
On line 45 of Listing 1, we must test to see if the directory is empty. When no files are found, the pattern in the preceding entries=(*) is left unexpanded, so the first element of entries is set to the literal asterisk.
A conditional command in ksh93 evaluates a test-expression and returns either True or False. Conditional commands can be used as part of an ``Or list'' (||), ``And list'' (&&), or as part of an if-elif-else command. Conditional commands have the format:
[[ test-expression ]]
When used in conjunction with an ``And list'', ksh93 evaluates the test-expression and will execute the ``And component'' only if the test-expression evaluates to True. We use a conditional command as part of an ``And list'' such that the return statement will be executed only if the test-expression is True.
[[ ${entries[0]} == $'*' ]] && return 2
The for command has two formats. The traditional format is provided to iterate on each word in a list. The format is:
for variableName [ in word-list ]
do
    compound-list
done
An arithmetic for command has been provided that is very similar to the C programming language for statement. The format is:
for (( initExpr ; condition ; loopExpr ))
do
    compound-list
done
The initExpr is evaluated by ksh once, before the first iteration of the for command. The condition is then evaluated prior to each iteration of compound-list. If the condition is non-zero, then ksh executes the compound-list. The loopExpr is evaluated at the end of each iteration.
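A short sketch of the arithmetic for command, walking an indexed array by subscript. The sample data and the function name numberedEntries are hypothetical and are not taken from Listing 1.

```shell
# Hypothetical sample data standing in for the entries array of lsc.
entries=(alpha beta gamma)

# Print each subscript followed by the element stored there.
function numberedEntries {
    typeset -i i
    # i = 0 runs once; the condition is checked before every pass;
    # i++ runs at the end of every pass.
    for (( i = 0; i < ${#entries[@]}; i++ ))
    do
        printf '%d %s\n' "$i" "${entries[i]}"
    done
}

numberedEntries
```

Because the subscript of an indexed array is itself an arithmetic expression, ${entries[i]} needs no $ in front of i.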
A new typeset option has been added for name referencing. Using typeset -n nameReference=variableName will associate nameReference with variableName. A special alias, nameref, is provided as the equivalent for typeset -n. A shell script may use the reference name to refer to the variable name. Name referencing provides a convenient mechanism to pass the name of compound variables, or arrays, to ksh functions. This is more efficient than passing the variable's content.
In the lsc example, function setOutput must add the directory entries to the appropriate row and column. We could have defined separate functions named addToRow and addToColumn for this purpose. The main body of the functions, however, would be equivalent. Instead, we opted to write a single function addToCell that uses a nameref to the cell type passed as a parameter.
The addToCell function accepts three arguments, of which the first two are required. The first argument is the cell type and must be either row or col. We create a nameref using the local variable cell to be equivalent to the cell type specified. A reference to ${cell.index} would therefore be equivalent to ${row.index} or ${col.index}.
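The same idea can be shown on a smaller scale. The sketch below is not the addToCell function from Listing 1; it is a simplified, hypothetical function named bumpCounter that receives the name of a plain variable rather than a compound variable, with an optional second argument in the same spirit.

```shell
# Two hypothetical counters; the function is told which one to update.
rowCount=0
colCount=0

# $1 is the name of the variable to update (required).
# $2 is the amount to add (optional, defaults to 1).
function bumpCounter {
    # counter becomes another name for the variable named by $1
    typeset -n counter=$1
    counter=$(( counter + ${2:-1} ))
}

bumpCounter rowCount
bumpCounter colCount 3
printf '%d %d\n' "$rowCount" "$colCount"    # prints 1 3
```

Only the variable's name crosses the function boundary, so the caller's variable is updated in place and no value has to be copied back.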
ksh functions are not inherited across invocations of ksh. A child shell process, for example, does not have access to the functions defined within the parent ksh invocation. This has historically limited the re-usability of ksh functions. As a solution, ksh93 will search the colon-separated list of directories given by the FPATH variable value, for an executable file with the same name as the function. In the lsc example, we can eliminate the last statement:
lsc "${@}"
The FPATH can then be set to the directory containing the lsc file. From the shell level, we can now call lsc. ksh93 will load the lsc script and will call the lsc function with the command line arguments specified. Note that the supporting functions defined in the lsc script are available to the lsc function.
A function autoload feature is provided, in which an auto-loaded function definition is loaded and retained within the ksh93 environment upon the first reference to the function name. This provides better performance since the search and load steps are eliminated for subsequent references.
ksh93, the latest major revision of the KornShell language, provides an alternative to Tcl and Perl. As a programming language, it has comparable speed and functionality to each of these languages. Like Tcl, it is extensible and embeddable with a C language application programming interface. The New KornShell, ksh93, and the Tksh products are available from Global Technologies, Ltd., Inc., 5 West Ave, Old Bridge, NJ 908-251-2840.
David G. Korn: AT&T Research, Technical Manager
Charles J. Northrup: Global Technologies Ltd., Inc., CIO
Jeffery Korn: Princeton University, Computer Science Department