A look at the sysctl system call, which lets you fine-tune kernel parameters.
by Alessandro Rubini
The sysctl system call is an interesting feature of the Linux kernel; it is quite unique in the Unix world. The system call exports the ability to fine-tune kernel parameters and is tightly bound to the /proc file system, a simpler, file-based interface that can be used to perform the same tasks available by means of the system call. sysctl appeared in kernel 1.3.57 and has been fully supported ever since. This article explains how to use sysctl with any kernel between 2.0.0 and 2.1.35.
When running Unix kernels, system administrators often need to fine-tune some low-level features according to their specific needs. Usually, system tailoring requires rebuilding the kernel image and rebooting the computer. These tasks are lengthy and require good skills and a little luck to complete successfully. Linux developers diverged from this approach and chose to implement variable parameters in place of hardwired constants; run-time configuration can be performed by using the sysctl system call or, more easily, by exploiting the /proc file system. The internals of sysctl are designed not only to read and modify configuration parameters, but also to support a dynamic set of such variables. In other words, the module writer can insert new entries in the sysctl tree and allow run-time configuration of driver features.
Most Linux users are familiar with the /proc file system. In short, the file system can be considered a gateway to kernel internals: its files are entry points to certain kernel information. Such information is usually exchanged in textual form to ease interactive use, although the exchange can involve binary data when required. The typical example of a binary /proc file is /proc/kcore, a core file that represents the current kernel. Thus, you can execute the command:
    gdb /usr/src/linux/vmlinux /proc/kcore
and peek into your running kernel. Naturally, gdb on /proc/kcore gives much better results if vmlinux has been compiled using the -g compiler option.
Most of the /proc files are read-only: writing to them has no effect. This applies, for instance, to /proc/interrupts, /proc/ioports, /proc/net/route and all the other information nodes. The directory /proc/sys, on the other hand, behaves differently; it is the root of a file tree related to system control. Each subdirectory in /proc/sys deals with a kernel subsystem like net/ and vm/, while the kernel/ subdirectory is special as it includes kernel-wide parameters, like the file kernel/hostname.
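Because the entries under /proc/sys are plain text, a program can read one with ordinary file I/O. The following is a minimal sketch rather than code from the article's sources: read_first_line is a hypothetical helper name, and the only assumption is that the entry holds a single line of text, as kernel/hostname does.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a text file such as /proc/sys/kernel/hostname
 * into buf (at most size-1 characters) and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty.
 * read_first_line is a hypothetical helper, not a library function.
 */
int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *fp;

    if (size == 0)
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, (int)size, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';   /* cut at the first newline, if any */
    return 0;
}
```

A caller would pass "/proc/sys/kernel/hostname" as the path and a buffer large enough for the host name.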
Each sysctl file includes numeric or string values--sometimes a single value, sometimes an array of them. For example, if you go to the /proc/sys directory and give the command:
    grep . kernel/*
kernel 2.1.32 returns data similar to the following:
    kernel/ctrl-alt-del:0
    kernel/domainname:systemy.it
    kernel/file-max:1024
    kernel/file-nr:128
    kernel/hostname:morgana
    kernel/inode-max:3072
    kernel/inode-nr:384 263
    kernel/osrelease:2.1.32
    kernel/ostype:Linux
    kernel/panic:0
    kernel/printk:6 4 1 7
    kernel/securelevel:0
    kernel/version:#9 Mon Apr 7 23:08:18 MET DST 1997
It's worth stressing that reading /proc items with less doesn't work, because they appear as zero-length files to the stat system call, and less checks the attributes of the file before reading it. The inaccuracy of stat is a feature of /proc, rather than a bug. It's a saving in human resources (in writing code), and kernel size (in carrying the code around). stat information is completely irrelevant for most files, as cat, grep and all the other tools work fine. If you really need to use less to look at the contents of a /proc file, you can resort to:
    cat
If you want to change system parameters, all you need to do is write the new values to the correct file in /proc/sys. If the file contains an array of values, they will be overwritten in order. Let's look at the kernel/printk file as an example. This file was first introduced in kernel version 2.1.32. The four numbers in /proc/sys/kernel/printk control the ``verbosity'' level of the printk kernel function. The first number in the array is console_loglevel: kernel messages whose priority value is lower than the specified value will be printed to the system console (i.e., the active virtual console, unless you've changed it). This parameter doesn't affect the operation of klogd, which receives all the messages in any case. The following commands show how to change the log level:
    # cat kernel/printk
    6 4 1 7
    # echo 8 > kernel/printk
    # cat kernel/printk
    8 4 1 7
A level of 8 lets debug messages (priority 7) through; they are not printed on the console by default. The example session shown above changes the default behaviour so that every message, including the debug ones, is printed.
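A program that wants these numbers, rather than the raw text, has to split the line it reads from the file. As a minimal sketch, assuming the entry holds whitespace-separated decimal integers as in the session above, the parsing step could look like this (parse_int_vector is a hypothetical helper name):

```c
#include <stdlib.h>

/*
 * Parse whitespace-separated integers from text into out[], storing at
 * most max values. Returns the number of values stored.
 * parse_int_vector is a hypothetical helper name.
 */
int parse_int_vector(const char *text, int *out, int max)
{
    int count = 0;

    while (count < max) {
        char *end;
        long value = strtol(text, &end, 10);

        if (end == text)        /* no further number found */
            break;
        out[count++] = (int)value;
        text = end;             /* continue after the parsed number */
    }
    return count;
}
```

Fed the contents of kernel/printk, the function fills an array of four integers whose first element is the console log level.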
Similarly, you can change the host name by writing the new value to /proc/sys/kernel/hostname--a useful feature if the hostname command is not available.
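From a C program, the same change is an ordinary file write. The sketch below is illustrative only: write_text_value is a hypothetical helper name, and it assumes the caller has permission to write the entry, which for files under /proc/sys normally means running as root.

```c
#include <stdio.h>

/*
 * Write a string to a text file such as /proc/sys/kernel/hostname.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 * write_text_value is a hypothetical helper, not a library function.
 */
int write_text_value(const char *path, const char *value)
{
    FILE *fp = fopen(path, "w");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fputs(value, fp) != EOF);
    if (fclose(fp) != 0)        /* buffered data is flushed here */
        ok = 0;
    return ok ? 0 : -1;
}
```

The return value of fclose is checked because the data may reach the file only when the stream is flushed.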
Even though the /proc file system is a great resource, it is not always available in the kernel. Since it's not vital to system operation, there are times when you choose to leave it out of the kernel image or simply don't mount it. For example, when building an embedded system, saving 40 to 50KB can be advantageous. Also, if you are concerned about security, you may decide to hide system information by leaving /proc unmounted.
The system call interface to kernel tuning, namely sysctl, is an alternative way to peek into configurable parameters and modify them. One advantage of sysctl is that it's faster: no fork/exec is involved (i.e., no external programs are spawned), and no directory lookup is needed. However, unless you run an ancient platform, the performance savings are irrelevant.
To use the system call in a C program, the header file sys/sysctl.h must be included; it declares the sysctl function as:
    int sysctl (int *name, int nlen, void *oldval, size_t *oldlenp, void *newval, size_t newlen);
If your standard library is not up to date, the sysctl function will neither be prototyped in the headers nor defined in the library. I don't know exactly when the library function was first introduced, but I do know libc-5.0 does not have it, while libc-5.3 does. If you have an old library you must invoke the system call directly, using code such as:
    #include <linux/unistd.h>
    #include <linux/sysctl.h>
    /* now "_sysctl(struct __sysctl_args *args)" can be called */
    _syscall1(int, _sysctl, struct __sysctl_args *, args);
The system call gets a single argument instead of six of them, and the mismatch in the prototypes is solved by prepending an underscore to the name of the system call. Therefore, the system call is _sysctl and gets one argument, while the library function is sysctl and gets six arguments. The sample code introduced in this article uses the library function.
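The six arguments of the sysctl library function have the following meaning:
name: points to an array of integers; each integer identifies one sysctl item, either a directory or a leaf file. The symbolic names for these values are defined in linux/sysctl.h.
nlen: the number of integers in the name array.
oldval: points to a buffer where the current value of the item is stored; it can be NULL if the current value is not needed.
oldlenp: points to the size of the oldval buffer; on return it holds the number of bytes actually stored there.
newval: points to a buffer holding the replacement value; it can be NULL if the item is only being read.
newlen: the length of newval in bytes.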
Now, let's write some C code to access the four parameters contained in /proc/sys/kernel/printk. The numeric name of the file is KERN_PRINTK, within the directory CTL_KERN/ (both symbols are defined in linux/sysctl.h). The code shown in Listing 1, pkparms.c, is the complete program to access these values.
Changing sysctl values is similar to reading them--just use newval and newlen. A program similar to pkparms.c can be used to change the console log level, the first number in kernel/printk. The program is called setlevel.c, and the code at its core looks like:
    int newval[1];
    int newlen = sizeof(newval);
    /* assign newval[0] */
    error = sysctl (name, namelen, NULL /* oldval */, 0 /* len */, newval, newlen);
The program overwrites only the first sizeof(int) bytes of the kernel entry, which is exactly what we want.
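The effect of passing a short newval can be pictured with a small user-space model. This is only a sketch and not kernel code: overwrite_prefix is a hypothetical helper, and it assumes that the new bytes are copied over the start of the entry, truncated to the entry's size, while the remaining bytes stay untouched.

```c
#include <string.h>

/*
 * User-space model of applying a new value to a sysctl entry: copy at
 * most entry_size bytes from newval over the start of entry and leave
 * the rest as it was. Returns the number of bytes copied.
 * overwrite_prefix is a hypothetical name; this is not kernel code.
 */
size_t overwrite_prefix(void *entry, size_t entry_size,
                        const void *newval, size_t newlen)
{
    size_t n = newlen < entry_size ? newlen : entry_size;

    memcpy(entry, newval, n);
    return n;
}
```

Applied to an array holding 6 4 1 7 with a one-integer newval of 8, the model leaves 8 4 1 7, which matches what setlevel is meant to achieve.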
Please remember that the printk parameters are not exported to sysctl in version 2.0 of the kernel. The programs won't compile under 2.0 due to the missing KERN_PRINTK symbol; also, if you compile either of them against later versions and then run under 2.0, you'll get an error when invoking sysctl.
The source files for pkparms.c, setlevel.c and hname.c (which will be introduced in a while) are in the 2365.tgz file.
A simple run of the two programs introduced above looks like the following:
    # ./pkparms
    len is 16 bytes
    6 4 1 7
    # cat /proc/sys/kernel/printk
    6 4 1 7
    # ./setlevel 8
    # ./pkparms
    len is 16 bytes
    8 4 1 7
If you run kernel 2.0, don't despair--the files acting on kernel/printk are just samples, and the same code can be used to access any sysctl item available in 2.0 kernels with minimal modifications.
On the same ftp site you'll also find hname.c, a bare-bones hostname command based on sysctl. The source works with the 2.0 kernels and demonstrates how to invoke the system call with no library support, since my Linux-2.0 runs on a libc-5.0-based PC.
Although low-level, the tunable parameters of the kernel are very interesting to tweak and can help optimize system performance for the different environments where Linux is used.
The following list is an overview of some relevant /kernel and /vm files in /proc/sys. (This information applies to all kernels from 2.0 through 2.1.35.)
Module writers can easily add their own tunable features to /proc/sys by using the programming interface to extend the control tree. The kernel exports to modules the following two functions:
    struct ctl_table_header * register_sysctl_table(ctl_table * table, int insert_at_head);
    void unregister_sysctl_table(struct ctl_table_header * table);
The former function is used to register a ``table'' of entries and returns a token, which is used by the latter function to detach (unregister) your table. The argument insert_at_head tells whether the new table must be inserted before or after the other ones, and you can easily ignore the issue and specify 0, which means ``not at head''.
What is the ctl_table type? It is a structure made up of the following fields:
Well, the previous list may have scared most readers. Therefore, I won't show the prototypes for the handling functions and will instead switch directly to some sample code. Writing code is much easier than understanding it, because you can start by copying lines from existing files. The resulting code will fall under the GPL--of course, I don't see that as a disadvantage.
Let's write a module with two integer parameters, called ontime and offtime. The module will busy-loop for a few timer ticks and sleep for a few more; the parameters control the duration of each state. Yes, this is silly, but it is the simplest hardware-independent example I could imagine.
The parameters will be put in /proc/sys/kernel/busy, a new directory. To this end, we need to register a tree like the one shown in Figure 1. The /kernel directory won't be created by register_sysctl_table, because it already exists. Also, it won't be deleted at unregister time, because it still has active child files; thus, by specifying the whole tree of directories you can add files to every directory within /proc/sys.
Listing 2 is the interesting part of busy.c, which does all the work related to sysctl. The trick here is leaving all the hard work to proc_dointvec and sysctl_intvec. These handlers are exported only by version 2.1.8 and later of the kernel, so you need to copy them into your module (or implement something similar) when compiling for older kernels.
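The shape of the tree to be registered can be modelled in user space. The sketch below is not the module code and does not call register_sysctl_table: struct toy_ctl_table is a simplified stand-in whose layout is an assumption, the numeric ids and default values are made up, and toy_lookup is a hypothetical helper that resolves a path the way the directory tree under /proc/sys would.

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified user-space stand-in for the kernel's ctl_table. Only the
 * fields needed to describe the tree are kept; the layout is an
 * assumption made for this sketch, not the kernel definition.
 */
struct toy_ctl_table {
    int ctl_name;                 /* numeric id, 0 terminates a table */
    const char *procname;         /* name shown under /proc/sys */
    void *data;                   /* variable behind a leaf entry */
    int maxlen;                   /* size of that variable in bytes */
    int mode;                     /* file permissions */
    struct toy_ctl_table *child;  /* subdirectory table, or NULL */
};

/* The two tunables of the example module, with made-up defaults. */
static int ontime = 1;
static int offtime = 4;

/* Numeric ids below are hypothetical placeholders. */
static struct toy_ctl_table busy_table[] = {
    {1, "ontime",  &ontime,  sizeof(int), 0644, NULL},
    {2, "offtime", &offtime, sizeof(int), 0644, NULL},
    {0, NULL, NULL, 0, 0, NULL}
};
static struct toy_ctl_table kernel_table[] = {
    {1, "busy", NULL, 0, 0555, busy_table},
    {0, NULL, NULL, 0, 0, NULL}
};
struct toy_ctl_table root_table[] = {
    {1, "kernel", NULL, 0, 0555, kernel_table},
    {0, NULL, NULL, 0, 0, NULL}
};

/* Resolve a path such as "kernel/busy/ontime" to its table entry. */
struct toy_ctl_table *toy_lookup(struct toy_ctl_table *table, const char *path)
{
    while (table != NULL) {
        size_t len = strcspn(path, "/");
        struct toy_ctl_table *entry;

        for (entry = table; entry->ctl_name != 0; entry++)
            if (strlen(entry->procname) == len &&
                strncmp(entry->procname, path, len) == 0)
                break;
        if (entry->ctl_name == 0)
            return NULL;            /* component not found */
        if (path[len] == '\0')
            return entry;           /* last component reached */
        path += len + 1;            /* skip "name/" */
        table = entry->child;       /* descend; NULL for a leaf */
    }
    return NULL;
}
```

Each directory level is its own zero-terminated array, and a directory entry points at the array for the level below it through child. That nesting is what lets a table name the whole path down to kernel/busy.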
I won't show the code related to busy looping here, because it is completely out of the scope of this article. Once you have downloaded the source from the FTP site, it can be compiled on your own system. It works with both versions 2.0 and 2.1 on the Intel, Alpha and SPARC platforms.
Despite the usefulness of sysctl, it's hard to find documentation. This is not a concern for system programmers, who are accustomed to peeking at the source code to extract information. The main entry points to the sysctl internals are kernel/sysctl.c and net/sysctl_net.c. Most items in the sysctl tables act solely on strings or arrays of integers. So to search through the whole source tree for an item, you will end up using the data field as the argument to grep. I see no shortcut to this method.
As an example, let's trace the meaning of ip_log_martians in /proc/sys/net/ipv4. You'll first find that sysctl_net.c refers to ipv4_table, which in turn is exported by sysctl_net_ipv4.c. This file in turn includes the following entry in its table:
    {NET_IPV4_LOG_MARTIANS, "ip_log_martians", &ipv4_config.log_martians, sizeof(int), 0644, NULL, &proc_dointvec},
Understanding the role of our control file, therefore, reduces to looking for the field ipv4_config.log_martians throughout the sources. Some grepping will show that the field is used to control verbose reporting (via printk) of erroneous packets received by this host.
Unfortunately, many system administrators are not programmers and need other sources of information. For their benefit, kernel developers sometimes write a little documentation as a break from writing code, and this documentation is distributed with the kernel source. The bad news is that sysctl is quite recent in design, and such extra documentation is almost nonexistent.
The file Documentation/networking/Configurable is a short introduction to sysctl (much shorter than this article) and points to net/TUNABLE, which in turn is a huge list of configurable parameters in the network subtree. Unfortunately the description of each item is quite technical, so that people who don't know the details of networking can't proficiently tune network parameters. As I'm writing, this file is the only source of information about system control, if you don't count C source files.
Alessandro Rubini reads e-mail as rubini@linux.it and enjoys breeding oaks and playing with kernel code. He is currently looking for a job in either field.