https://sourceware.org/systemtap/wiki/PortingDTracetoSystemTap
If you are familiar with DTrace and have existing DTrace scripts to diagnose performance problems, it is not difficult to translate them into equivalent SystemTap scripts. The outline of the process is:
Match up DTrace providers to SystemTap probe points
Convert DTrace predicates into SystemTap conditional statements
These steps are described in greater detail below by converting some very simple DTrace examples from:
http://www.brendangregg.com/DTrace/dtrace_oneliners.txt
One example in the DTrace one-liners prints out detailed information on signals:
dtrace -n 'proc:::signal-send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'
This command-line DTrace script prints out the executable name, the signal number, and the process pid each time a user process sends a signal.
The first step is to use the proper SystemTap command and option to run a script from the command line ("stap -e"):
stap -e 'proc:::signal-send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'
Look at "man stap" for more details on the available options for the stap command.
There is not a one-to-one correspondence between DTrace providers and SystemTap probe points, but in most cases a match can be found. To understand what a particular DTrace provider supplies, look it up at:
SystemTap has similar information describing the probe points and supporting functions at:
For this particular example we find that the SystemTap signal.send probe point is a good match for proc:::signal-send and the script is now written as:
stap -e 'probe signal.send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'
SystemTap probe points and supporting functions are implemented as tapsets. These tapsets provide the equivalent of the DTrace built-in variables and provider arguments. The DTrace example uses pid and execname; these map to the pid() and execname() functions respectively. For the DTrace proc:::signal-send provider, args[2] is the signal number and args[1]->pr_pid is the pid of the process receiving the signal. As described in the SystemTap documentation, the signal.send probe point provides similar variables: sig and sig_pid. Thus, the script is now:
stap -e 'probe signal.send /pid()/ { printf("%s -%d %d",execname(), sig, sig_pid); }'
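When the tapset documentation is not at hand, the stap command itself can report what a probe point provides. The sketch below assumes SystemTap is installed and uses the -L option, which lists matching probe points together with the variables available in them; the probe_pattern shell variable is only an illustrative name.

```shell
# Probe point to inspect; wildcards such as 'signal.*' are also accepted.
probe_pattern='signal.send'

if command -v stap >/dev/null 2>&1; then
    # -L lists each matching probe point and the variables it provides.
    stap -L "$probe_pattern"
else
    echo "stap not found; install SystemTap first" >&2
fi
```

For signal.send the listing should include the sig and sig_pid variables used in the script above.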
DTrace has a more restrictive execution model for probe handlers than SystemTap; as a result, most DTrace scripts use predicates. SystemTap is more flexible and allows conditional code inside the probe handler. The direct translation of a predicate is to negate it and use the next statement to skip the rest of the SystemTap probe handler:
stap -e 'probe signal.send { if (!pid()) next; printf("%s -%d %d",execname(), sig, sig_pid); }'
In this case it would be clearer to simply write the code as:
stap -e 'probe signal.send { if (pid()) printf("%s -%d %d",execname(), sig, sig_pid); }'
This example does not use any thread-local storage, so nothing needs to be done for that step.
There are many differences between DTrace and SystemTap output. DTrace has more default rules for printing data without explicit code in the script. DTrace also adds a newline to printf statement output. To keep this example from printing all of its output on a single line, add a "\n" to the printf format string. The command line below is the completely translated script, suitable for use with SystemTap:
stap -e 'probe signal.send { if (pid()) printf("%s -%d %d\n",execname(), sig, sig_pid); }'
Another of the DTrace one-liners prints out distributions of the size of data written by each executable:
dtrace -n 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'
You need to change the "dtrace -n" into "stap -e", yielding:
stap -e 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'
The DTrace sysinfo:::writech provider instruments the write, writev and pwrite syscalls. The same syscalls exist in Linux. The script becomes:
stap -e 'probe syscall.write.return, syscall.writev.return, syscall.pwrite.return { @dist[execname] = quantize(arg0); }'
SystemTap allows multiple probe events to share the same probe handler. The probe events can be specified with wildcards or enumerated and separated by commas. For this example we need to know how much data was actually written and whether the write was successful, so the probes are placed on syscall.write.return, syscall.writev.return, and syscall.pwrite.return rather than on syscall.write, syscall.writev, and syscall.pwrite.
The DTrace execname is equivalent to the SystemTap execname() function. Each *.return probe event includes a $return context variable, which is the return value for the probe point. In this case that is the number of bytes actually written.
Like DTrace, SystemTap provides associative arrays and aggregates. However, SystemTap requires associative arrays to be declared as global variables, so you need to add "global dist" for the array that stores the information. Indexing of associative arrays is similar in SystemTap. SystemTap has the statistical operator "<<<" to add a sample to an aggregate. The data can later be printed as histograms or used to provide averages, counts, minimums, and maximums.
After modifying the script we now have:
stap -e 'global dist; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
{ dist[execname()] <<< $return; }
'
All of these probe events fire whether or not the write was successful. You need to test the $return value to ensure that negative error values are not included in the data:
stap -e 'global dist; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
{ if ($return >=0) dist[execname()] <<< $return; }
'
There are no thread local variables in this example, so nothing needs to be done for this step.
DTrace and SystemTap differ significantly in how they produce output. DTrace automatically selects the format of the output when the script exits. SystemTap needs a "probe end" event to print out the data in the desired format. In this case you want to print out @hist_log of each of the entries in the associative array. This is implemented with a "foreach" statement. You also want to label each histogram with the execname, so a printf precedes the printing of the histogram. The final SystemTap script, with the array renamed to bytes, is:
stap -e ' global bytes; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
{ if ($return>=0) bytes[execname()] <<< $return }
probe end
{foreach (e in bytes) {printf("%s\n", e); print(@hist_log(bytes[e]))}}
'
This script prints the histograms when it exits, for example when interrupted with Ctrl-C.
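If a full histogram is more detail than needed, the same aggregate can be summarized with the extractor functions @count, @avg, @min and @max. The following is a minimal sketch along those lines and is not part of the original one-liner; the file name write_stats.stp is made up for illustration, and the script is written to a file so it can be run with "stap write_stats.stp".

```shell
cat > write_stats.stp <<'EOF'
global bytes

probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
{
    # Record only successful writes; negative values are error codes.
    if ($return >= 0) bytes[execname()] <<< $return
}

probe end
{
    # One summary line per executable instead of a histogram.
    foreach (e in bytes)
        printf("%s: count=%d avg=%d min=%d max=%d\n", e,
               @count(bytes[e]), @avg(bytes[e]),
               @min(bytes[e]), @max(bytes[e]))
}
EOF

# Run it (normally as root) and stop with Ctrl-C to see the summary:
#   stap write_stats.stp
```

The quoted heredoc delimiter keeps the shell from expanding $return, which must reach SystemTap unchanged.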
This example is from:
http://www.tablespace.net/quicksheet/dtrace-quickstart.html
Let's assume that the example is called read_time.d and contains:
syscall::read:entry {
    self->stime = timestamp;
}
syscall::read:return /self->stime != 0/ {
    printf("%s read() %d nsecs\n",
           execname,
           timestamp - self->stime);
}
It prints the executable name followed by the wall-clock time in nanoseconds for each read syscall.
Rename the script with the ".stp" extension to read_time.stp.
Match up DTrace providers to SystemTap probe points
The DTrace providers used in this example directly match SystemTap syscall.read and syscall.read.return. The current script is:
probe syscall.read {
    self->stime = timestamp;
}
probe syscall.read.return /self->stime != 0/ {
    printf("%s read() %d nsecs\n",
           execname,
           timestamp - self->stime);
}
SystemTap does not implement thread-local variables in the same manner as DTrace; instead, you use a global array indexed by the thread ID (tid()) to hold each thread's value. When a thread-local value is no longer needed, it should be deleted to avoid filling the associative array with dead values. In this example the global array stime holds the thread-local values.
The DTrace timestamp and execname variables map to the SystemTap gettimeofday_ns() and execname() functions. This yields the following intermediate version of the script:
global stime
probe syscall.read {
    stime[tid()] = gettimeofday_ns();
}
probe syscall.read.return /self->stime != 0/ {
    printf("%s read() %d nsecs\n",
           execname(), gettimeofday_ns() - stime[tid()]);
    delete stime[tid()];
}
In the original DTrace script the predicate limited execution of the syscall::read:return handler to events that had a matching syscall::read:entry timestamp. The SystemTap version of the script needs to do the same. By default, if there is no entry in the associative array for an index value, the value is assumed to be 0. Subtracting zero from the current time would give a very large and incorrect value. The predicate is therefore implemented with the "in" operator, which checks whether the current tid() has an entry in the associative array:
global stime
probe syscall.read {
    stime[tid()] = gettimeofday_ns();
}
probe syscall.read.return {
    if (tid() in stime) {
        printf("%s read() %d nsecs\n",
               execname(), gettimeofday_ns() - stime[tid()]);
        delete stime[tid()];
    }
}
The SystemTap script instruments all read syscalls, including those made by SystemTap itself. Those can be filtered out with a conditional statement in the syscall.read event handler. This yields the following script:
global stime
probe syscall.read {
    if (pid() != stp_pid())
        stime[tid()] = gettimeofday_ns();
}
probe syscall.read.return {
    if (tid() in stime) {
        printf("%s read() %d nsecs\n",
               execname(), gettimeofday_ns() - stime[tid()]);
        delete stime[tid()];
    }
}
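Printing a line for every read can produce a lot of output on a busy system. One possible variation, sketched below on the assumption that per-executable distributions are acceptable, feeds the elapsed times into an aggregate and prints a histogram at exit, reusing the "<<<" operator and @hist_log from the previous example. The latency array and the read_hist.stp file name are illustrative and do not come from the original script.

```shell
cat > read_hist.stp <<'EOF'
global stime, latency

probe syscall.read {
    # Skip reads issued by SystemTap itself.
    if (pid() != stp_pid())
        stime[tid()] = gettimeofday_ns()
}

probe syscall.read.return {
    if (tid() in stime) {
        # Add the elapsed time to this executable's aggregate.
        latency[execname()] <<< (gettimeofday_ns() - stime[tid()])
        delete stime[tid()]
    }
}

probe end {
    # One log2 histogram of read() times per executable.
    foreach (e in latency) {
        printf("%s read() nsecs\n", e)
        print(@hist_log(latency[e]))
    }
}
EOF

# Run it (normally as root) and stop with Ctrl-C:
#   stap read_hist.stp
```

The stime array still holds only in-flight reads, so its size stays bounded by the number of threads currently inside read(), while latency grows with the number of distinct executable names.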