Format String Vulnerability

Format string vulnerabilities have been publicly known since the late 1990s and are still exploited today. By exploiting one, an attacker can read memory, modify memory, and even execute custom code. This article discusses the principle of format string vulnerabilities and the ways they can be exploited, particularly in the C and C++ languages. The exploitation techniques are described in depth, and the auxiliary techniques used by attackers are presented. The article also explores methods of protecting against this vulnerability and includes examples from various programming languages.

INTRODUCTION

Format string vulnerabilities in C and C++ began to appear at the end of the 1990s. They gained a lot of attention in June 2000, when one was found in WU-FTPD, the Washington University FTP server package [8]. This vulnerability offers the attacker a wide range of exploits and, if exploited correctly, can even give the attacker administrative access to the system.

This study explores all the possibilities of abusing this vulnerability and illustrates them through examples. The study first discusses what format strings and functions are and how they work in C and C++. The next section of the paper is devoted to the vulnerability in more detail and describes possible attacks on the vulnerability – from reading values on the stack to executing malicious code. After explaining these attacks, the auxiliary methods that attackers use to simplify attacks are described. Lastly, the section on defense focuses on practical solutions that can significantly contribute to mitigating the risks associated with this vulnerability, and two tools developed to eliminate this vulnerability are presented. At the end of the article, examples of format string vulnerability from other languages are discussed.

Related Resources

  • Exploiting Format String Vulnerabilities
  • Hacking: The Art of Exploitation
  • Format-String Vulnerability

FORMAT STRING AND FUNCTIONS

A format string is a special type of string in C and C++ programming that is used by functions like printf to output formatted text. This string contains text mixed with format specifiers (see Table I) that determine how variables should be output. In the example below, it is the first argument of the function printf, i.e.: "Numbers: %d %d %d\n". A format function is a special type of ANSI C function that accepts a variable number of arguments, one of which is the format string. It is used to display common C language data types in a human-readable form [1], [2], [7], [9]. The function copies the format string to the output and replaces each format parameter with the value of the corresponding argument, which it reads from the stack (see Section Stack). So, for example, if a format string contains three format parameters, the function call should contain three more arguments in addition to the format string [1], [2], [7].

Example showing the correct number of arguments with the correct number of formatting parameters:

printf("Numbers: %d %d %d\n", 1, 2, 3);

The code will print: Numbers: 1 2 3 (terminated by a newline).

Formatting functions access their arguments using the macro va_arg, which returns the value of the next argument and advances the va_list to the argument after it. The problem occurs when the format string contains more formatting parameters than there are arguments. The macro va_arg cannot detect that it has already consumed all arguments, so it continues reading data from the stack and keeps advancing [3], [7].

Examples of Functions

The C standard library and POSIX provide many formatting functions with different uses. Listed below are the functions from the printf family:

  • printf
  • fprintf
  • dprintf
  • sprintf
  • snprintf
  • vprintf
  • vfprintf
  • vdprintf
  • vsprintf
  • vsnprintf

However, there are even more functions, e.g.: syslog, setproctitle, err*, verr*, warn*, vwarn* [1], [8], [10].

Format Parameters

Table I describes some formatting parameters that may appear in the format string. These parameters are identified by the % character before the corresponding letter, but the % character does not have to appear literally in the source code. For example, the escape sequence \x25 is replaced by the percent sign at compile time, since in ASCII the % character has the value 0x25 [1].

Table I: Format Parameters

Parameter   Input     Output
%d          Value     Signed decimal number
%u          Value     Unsigned decimal number
%x          Value     Number in hexadecimal
%s          Pointer   String
%n          Pointer   Number of bytes written so far (stored through the pointer)

Stack

The format function goes through the format string character by character during its evaluation. If it is a normal character, it is copied to the output. However, if the % character is encountered (the beginning of a formatting parameter), some formatting parameter is expected after it. Based on the parameter, the function then increments the auxiliary pointer in the stack and expects to find the value of the desired function argument at that location and converts it to output according to the formatting parameter [1], [2], [5].

The function parameters are stored on the stack in reverse order, see the sample below [2], [5], [10]:

printf("Test: %d, %d, %08x", one, two, &three);

Figure: stack layout during the printf call above

  • the arrow shows the direction of stack growth
  • one: value of variable one
  • two: value of variable two
  • &three: address of variable three

VULNERABILITY

A format string vulnerability arises when an attacker can enter a text string into the program that is subsequently interpreted differently than the program originally intended [1], [5]–[7], [9], [10]. This can change the behavior of the formatting function, allowing an attacker to, for example, cause the program to crash, read arbitrary addresses from memory, write to them, and possibly execute malicious code [1], [5], [7], [10]. If he manages to write to the correct location, he can even gain administrative privileges [5].

Like a buffer overflow, this vulnerability is caused by a programming error, but unlike a buffer overflow, it is quite easy to find in the program [2], [6]. In some cases, programmers tend to write printf(string_var); instead of the correct printf("%s", string_var); so the function takes the text string of a variable and treats it as a format string. Both forms behave identically as long as the variable contains no formatting parameters. However, the moment a formatting parameter appears in the variable, the function starts reading stack values even though no additional arguments were stored there, as mentioned in Section Format String and Functions [2], [4], [9]. This in itself is not a serious problem. The problem occurs when the attacker is able to control the variable that is passed to the function.

good.c:

#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        return 1;               /* no input given */
    }
    char *var = argv[1];
    printf("Name: %s\n", var);  /* user input is only ever data for %s */
    return 0;
}
    
$ ./good John
Name: John
$ ./good %d
Name: %d

bad.c:

#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        return 1;               /* no input given */
    }
    char *var = argv[1];
    printf("Name: ");
    printf(var);                /* vulnerability: user input used as the format string */
    printf("\n");
    return 0;
}
    
$ ./bad John
Name: John
$ ./bad %d
Name: 1370186400

Program Crash – Segmentation Fault

If the attacker’s goal is only to cause the program to crash, this is the simplest variant of the format string attack. The attacker only needs to put enough formatting parameters such as %s into the format string so that one of them dereferences an invalid address. On almost all UNIX systems, access to an invalid address is caught by the kernel, the process receives the SIGSEGV signal (segmentation fault), and the execution is terminated [2], [3], [5].

$ ./bad %s%s%s%s%s%s%s

Output:

Segmentation fault (core dumped)

An example of such an attack is a situation in which a domain name is being resolved via DNS: an attacker might find it convenient to crash the DNS server at that moment [1], [7].

Reading Data From the Stack

If the attacker can see the output of the format function, he can use it to display the stack, since he can control the function with his input [1], [4]. This output is useful to the attacker for the next stages of the attack, for example, to determine the correct offset on the stack [1].

For displaying data from the stack, it is convenient to use the formatting parameter %08x, which displays the stack values as hexadecimal numbers zero-padded to eight digits [1], [5]. In the example below, the function assumes that four parameters are stored on the stack:

printf("%08x_%08x_%08x_%08x\n");

Output:

a90382c8_a90382d8_aaf3fdc0_125fdf10

Thus, this method can be used to display the stack (from bottom to top, assuming the stack grows towards the low addresses). An attacker may run into some limits here, e.g., the size of the format string buffer and of the output buffer, but in some cases it is possible to display the entire stack [1].

Reading Data From Arbitrary Address

Reading the stack can be interesting for an attacker, however, reading arbitrary memory is more valuable for attacks. For example, an attacker might want to know the value of a particular secret variable. With format string vulnerabilities, this is possible.

A successful attack takes two steps: placing the address to be read where the format function will pick it up as an argument, and then reading the value stored at that address. For the second step, the attacker uses the formatting parameter %s, since this parameter expects a pointer to the data, see Table I. What remains is to supply the address from which the attacker wants to read [1].

As mentioned in Section Stack, the parameters of the formatting function are placed on the stack and can be accessed. In many programs, the format string itself is also stored on the stack, for example in a local buffer. Here, the attacker uses a primitive technique to find the format string on the stack. He inserts the letters AAAA at the beginning of the format string and then tries to find them on the stack using, e.g., %08x. In the output of the function, the attacker looks for 41414141, i.e., AAAA in hexadecimal [1], [2], [5].

Format string:

AAAA_%08x_%08x_%08x_%08x

Output:

AAAA_6758b2a0_6652100e_6758b2a0_41414141

If the attacker chooses the correct number of format parameters, he will be able to find the text he is looking for. In the example above, the format string is the fourth parameter on the stack. If the attacker replaced the last parameter with %s, the function would read the value stored at the address taken from the fourth parameter (0x41414141) [2], [5].

Thus, using this technique, the attacker finds out where the beginning of the format string is on the stack – that is, the location in memory that he is able to control. The attacker then only needs to replace the letters AAAA with the actual address and then display the value stored there using the formatting parameter %s, since this parameter reads data from the given address [2], [5], [6]. Below is an example of inserting an address into the format string from the previous example:

Address:

0x12345678

Format string:

\x78\x56\x34\x12_%08x_%08x_%08x_%s

When inserting a custom address, it is necessary to enter its bytes in reverse order, because the target architecture (e.g., x86) stores values in little-endian order [1], [2], [5]. In addition, it is necessary to pay attention to how the input string is processed. For example, in the Bash shell, an unquoted command line argument \x11\x22\x33\x44 is interpreted as x11x22x33x44. It is therefore advisable to insert it using the printf command:

./bad $(printf "\x78\x56\x34\x12")...

If the attacker fails to hit the exact boundary of the format string using 4-byte jumps, he must align the format string using one to three redundant characters – since it is not possible to move around the stack by bytes. Instead, the format string itself must be shifted so that it can be reached using four-byte jumps [1].

Since a string in C is terminated by the \0 character (null byte), written in hexadecimal as \x00, an attacker is limited to addresses that do not contain a 00 byte, which would terminate the format string prematurely. This can be a problem, for example, with the Windows operating system, where addresses with the prefix 00 are often seen [6].

Changing Data in Memory

Once an attacker can display a value at an arbitrary address in memory, it should not be such a problem for him to write something to an arbitrary address [2]. In this case, the attacker will use the formatting parameter %n, which writes the number of bytes already written by the formatting function to the corresponding pointer, see the example below [6]:

int count = 0;
printf("TEST\n%n", &count);    /* %n stores the number of bytes written so far */
printf("Count = %d\n", count);

Output:

TEST
Count = 5

In this case, the program wrote the value 5 into the variable count, because there were 4 letters and 1 newline character before the formatting parameter %n.

Thus, if an attacker wants to write to a given address in memory, he must pass it as in Section Reading Data From Arbitrary Address, and then ensure that the number he wants to write to the address is equal to the number of characters already written – that is, the number of characters before %n [2].

This is where padding numbers to a given width can help the attacker. This trick was already used with the %x parameter, where values are padded to 8 characters using %08x. A width can also be placed before other formatting parameters, for example %u [1], [6].

The example below changes the value of a variable at a known address in memory:

  • Address of the variable: 0x12345678
  • Address of the variable prepared for attack: \x78\x56\x34\x12

  • Format string: \x78\x56\x34\x12_%08x_%08x_%08x_%n
  • Example output: xV4_b87152a0_00000000_b7d3800e_
  • Resulting variable value: 32
    • 8 bytes (x, V, 4, the non-printable byte 0x12, and 4 × "_")
    • 3 × 8 hexadecimal digits = 24

  • Format string: \x78\x56\x34\x12_%08x_%08x_%108x_%n
  • Resulting variable value: 132
    • same as above, except that the last %x is padded to 108 characters instead of 8

Using the width of the number we can adjust the number of characters written out, but unfortunately with this trick we won’t be able to write large numbers such as addresses in memory [1], [2].

Addresses in memory are stored backwards, and thus the least significant byte is the first in memory. If an attacker wants to write an entire address into memory, he can do so with four consecutive writes, each of which sets one byte of the address, starting with the least significant one [1], [2].

Consider that an attacker wants to write the address 0xDDCCBBAA to the variable at address 0x08049794. He does this using four writes as follows [2]:

Memory (0x080497..)      94  95  96  97  98  99  9a
1. write at 0x08049794   AA  00  00  00
2. write at 0x08049795       BB  00  00  00
3. write at 0x08049796           CC  00  00  00
4. write at 0x08049797               DD  00  00  00
Result                   AA  BB  CC  DD  00  00  00

To write the address, the attacker must create a format string containing three parts [1], [6]:

  1. Addresses
  2. Formatting parameters for moving the pointer to the addresses
  3. Formatting parameters for writing

So in the example above, the attacker places four write addresses (each offset by one byte) at the beginning of the format string, followed by the format parameters that perform the writes. Because a different value must be written to each address, the attacker pads the output between the writes by giving a width to a %x parameter. Each such %x consumes one four-byte word from the stack, so the addresses are separated by four arbitrary characters (JUNK in the example below): the padding parameter reads the junk word, and the following %n then reads the next address. Also note that each write stores a full four-byte value and therefore also overwrites the three bytes after the target byte, see the writing scheme above [2].

The resulting format string will look like this [2]:

\x94\x97\x04\x08JUNK\x95\x97\x04\x08JUNK\x96\x97\x04\x08JUNK\x97\x97\x04\x08%x%x%126x%n%17x%n%17x%n%17x%n

At the beginning of this string, there are addresses separated by four arbitrary characters. Next, there are three formatting parameters to move the pointer to the address (one of which is already modified to write the correct number to the first address), followed by the parameter for the first write (%n). Then at the end there are three pairs that are used to add the required number of characters for the next write and the write itself.

The attacker must calculate the correct numbers for the individual parts of the resulting address before creating the format string. To write 0xaa, 170 characters (0xaa in decimal) must be written before the first %n. The format string begins with four addresses and three junk words, 28 characters in total. The two plain %x parameters add eight characters of output each (assuming each value prints as eight hexadecimal digits), for a total of 44 characters so far. So, to reach 170 at the first write, the attacker pads the last parameter before the write to a width of 126 characters (170 - 44 = 126). For subsequent writes, the attacker just adds to the characters already written using further padding, in this case always 17 (0xbb - 0xaa = 17, etc.).

In the previous example, the attacker had it relatively easy, as each subsequent write needed a higher number. However, an attacker may want to write an address for which this is not the case, because smaller bytes follow larger ones in the address [2]. For this case, it is possible to wrap around by writing a larger number with the same low byte. If an attacker wanted to write the address 0x0806abcd in the above example, it would be easy for him to write the first part (0xcd = 205). The problem occurs with the second part (0xab = 171). By then the attacker has already written 205 characters, while the second part requires only 171, and the number of characters already written cannot be decreased. Therefore, the attacker writes 0x1ab (427) instead of 0xab, which requires 427 - 205 = 222 additional characters; only the least significant byte, 0xab, ends up at the target address. This technique can then be used for subsequent writes as well [1], [2].

Code Execution

In most cases, the attacker will not just want to overwrite the value of a variable, but rather try to execute their own malicious code. To be able to do this, the attacker will need to insert malicious code on the stack, find out at what address this code starts, and change some pointer to the executable code to point to the beginning of the malicious code [1], [3].

Today, it is possible to find many prepared shellcodes on the Internet (e.g., at http://shell-storm.org/shellcode/index.html). A shellcode is a piece of code that is inserted into a format string and then used to perform some action (e.g. to get the admin shell).

An attacker can find the address of the pointer that needs to be changed in, for example, the Global Offset Table (GOT). This table contains the addresses of the library functions that are used in the program. Consider a program that calls the function exit. An attacker can then use objdump -R ./binary_name to find the address where the pointer to this function is located and then write the start address of the embedded shellcode to this address. Thus, the program executes the embedded code when exit is called [1]–[3].

The advantage of this method is that the entries in the GOT table are bound to a binary, so if multiple systems use the same binary, the values of this table will be the same [1], [2].

Binaries produced by the GNU C compiler include a special section called .dtors. This section contains destructors that can be used to arrange the execution of code before the program exits [1], [2], [5]. An attacker can examine this section for a given program with the command objdump -s -j .dtors ./binary_name or nm ./binary_name | grep DTOR [2], [5]. To execute his own code, the attacker must find the address of __DTOR_LIST__, add four bytes to it, and write a pointer to his shellcode at the resulting address [1], [2], [5].

These two methods are discussed in more depth in the books Hacking: The Art of Exploitation [2] and Hacking: the hacker’s manual [5].

AUXILIARY TECHNIQUES

Since the implementation of the methods described in the previous section is not entirely straightforward, there are auxiliary techniques that will be discussed in this section. Using these techniques, an attacker can simplify the execution of some operations.

Writing the Short Type

This technique is very similar to the already mentioned four-write technique, but it simplifies it by writing two bytes instead of one. Thus, only two writes are needed instead of four. To do this, an attacker uses the formatting parameter %hn [1], [2], [5].

To write the address, the attacker splits the address into two parts (upper and lower half) and writes them separately with two writes, similar to the four writes. The advantage of this variant is the possibility of swapping the order of individual writes – the attacker can make his job easier and write the part of the address with the lower value first [1], [2], [5].

Fast Move of the Pointer in the Stack

It may happen that the format string is too short for an attacker to reach the embedded string by a sequence of format parameters. For example, if an attacker uses the formatting parameter %u, this parameter takes two bytes in the string, and the stack pointer moves by only four bytes. However, an attacker can help himself with the format parameter %f, which moves the stack pointer by eight bytes.

Since this parameter interprets the values from the stack as a floating-point number, division by zero could occur here, which would terminate the execution. To avoid this, the attacker instead uses the %.f parameter, which prints only the integer part of the number. Thus, the pointer on the stack is shifted by eight bytes using only three bytes in the format string [1].

Direct Parameter Access

Accessing a specific parameter on the stack can be somewhat tedious, especially if the parameter is located far away. Using the technique of direct access to parameters, an attacker can simplify this. Here, the attacker can use the formatting parameter %n$x, where n is replaced by the position of the desired parameter [1], [2], [5], [7]. This is illustrated in the example below:

Format string:

AAAA_%08x_%08x_%08x_%08x

Output:

AAAA_6758b2a0_6652100e_6758b2a0_41414141

Format string:

AAAA_%4$x

Output:

AAAA_41414141

DEFENSE

Many options have been proposed to solve this problem, but unfortunately they cannot be applied in practice [11]:

  • Removing the %n Parameter
    • Since this parameter represents the most serious problem, namely writing to memory, it was suggested to remove it. Unfortunately, there are many programs that use this parameter and the consequences of removing it would be large, so it can still be used today.
  • Static Format Strings
    • Another suggestion would be to allow only static format strings. Unfortunately, here we run into the same problem, namely that many programs today use dynamically generated strings.
  • Counting the Arguments
    • The ideal solution would be to count the arguments of the function and compare them with the number of format parameters provided. However, these functions accept a variable number of arguments, and the varargs mechanism in C does not allow this counting. Creating a new mechanism would very likely be incompatible with the previous one, so this variant is not used in practice either.

For these reasons, the simplest solution is to use the format function correctly. Furthermore, it is very important to ensure that the user input is not used as a format string. Some existing tools that can be used for defense are described below.

Check During Compilation

Both GCC and Clang compilers check for formatting functions that could pose a security risk, i.e. functions from the printf and scanf families. This check can be controlled using the -Wformat-nonliteral and -Wformat-security switches.

Existing Tools

The FormatGuard and Kimchi tools are worth mentioning. FormatGuard is a small patch to the glibc library that provides general protection against this vulnerability. This tool compares the number of actual arguments with the number of arguments referenced from the format string. If the number of actual arguments is lower, the call is treated as an attack, and the program is terminated [11].

The Kimchi tool rewrites a binary program at runtime by redirecting the printf function calls to safe_printf. Kimchi protects against access to arguments that are outside the stack frame address of the printf parent function [10].

The functioning of these tools is described in depth in the articles FormatGuard: Automatic Protection From printf Format String Vulnerabilities [11] and Kimchi: A Binary Rewriting Defense Against Format String Attacks [10].

EXAMPLES FROM VARIOUS PROGRAMMING LANGUAGES

C and C++ are not the only languages that suffer from format string vulnerabilities. Other languages that contain formatting parameters may also contain this vulnerability [8]. Examples of some other languages are described below.

Perl

In 2005, Stefan Schmidt discovered a problem in the Postgrey program written in Perl. The program was found to crash when the email address foo_bar%nowhere.co.xy[622.622.615.619] was entered. The Perl interpreter can detect a situation where the format string contains %n but the function call does not contain the corresponding parameter to write the value to. In this case, the email address ended up in the format string and the program execution was terminated with an error message [8]:

Modification of read-only value attempted at...

PHP

The PHP language does not support the formatting parameter %n, and if the number of formatting parameters is greater than the number of arguments provided to the function, a warning is printed:

Warning: sprintf(): Too few arguments in file.php on line...

The program execution is not stopped, but the string that should have been created by sprintf remains empty. This can cause, for example, empty log messages or other problems [8].

Java

Java has supported printf-style format strings since version 1.5. Java responds to incorrect format parameters with an exception. If exceptions are not handled correctly, attackers can use them to launch DoS attacks [8].

Ruby

If the Ruby language receives a format parameter it does not know (e.g. %z), or if the function contains more format parameters than arguments, the program execution is terminated with one of the following error messages [8]:

‘sprintf’: malformed format string - %z (ArgumentError)

‘sprintf’: too few arguments (ArgumentError)

CONCLUSION

Although the format string vulnerability is over twenty years old, dozens of instances are still found in real systems every year (see CVE). If format strings are not handled carefully, this error can cause major problems in real systems: crashing programs, leaking information, changing data in memory, and even gaining administrative privileges.

When writing new programs, it is therefore necessary to be careful to use format strings correctly, ideally leaving them static, and if this is not possible, to check user input carefully. One advantage is that C and C++ programs must be compiled prior to use, so the compiler can point out many instances of this vulnerability and help the programmer avoid them. To prevent it, the programmer can also use the tools mentioned in the section on defense.

BIBLIOGRAPHY

[1] scut/team teso, “Exploiting Format String Vulnerabilities,” ver. 1.2, September 2001.

[2] J. Erickson, “Hacking: The Art of Exploitation,” 2nd ed., No Starch Press, 2008.

[3] F. Zhang, “Format-String Vulnerability,” Southern University of Science and Technology, CS 315 Computer Security.

[4] S. El-sherei, “Format String Exploitation-Tutorial.”

[5] S. Harris, “Hacking: manuál hackera.” Praha: Grada, 2008.

[6] G. Hoglund and G. McGraw, “Exploiting software: how to break code.” Boston: Addison-Wesley, 2004.

[7] J. Koziol and D. Litchfield, “Shellcoder’s handbook: discovering and exploiting security holes.” Indianapolis: Wiley Publishing, 2004.

[8] H. Burch and R. C. Seacord, “Programming language format string vulnerabilities,” Dr. Dobb’s Journal, vol. 32, no. 3, p. 22, 2007.

[9] K.-S. Lhee and S. J. Chapin, “Buffer overflow and format string overflow vulnerabilities,” Software, practice & experience, vol. 33, no. 5, pp. 423-460, 2003.

[10] “Kimchi: A Binary Rewriting Defense Against Format String Attacks,” in Lecture notes in computer science, 2006, pp. 179-193.

[11] M. Barringer, M. Frantzen, and J. Lokier, “FormatGuard: Automatic Protection From printf Format String Vulnerabilities,” Aug. 2001.
