PChars: no strings attached

The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. — Alan Perlis

In the public newsgroups on the Embarcadero server, I often see that there is still great confusion about the PChar type on the one hand, and the string type on the other. In this article I would like to discuss the similarities and the differences between both types, as well as some things you should or shouldn’t do with them.

The general principles laid out in this article apply to all Win32, Win64 and OS X versions of Delphi, including Delphi 2009 and up. There is, however, a special “chapter” at the end of this article especially for those who use Delphi 2009 and up.

PChar

Trying to outsmart a compiler defeats much of the purpose of using one. — Kernighan and Plauger, The Elements of Programming Style.

PChars were inspired by strings, as used in the C language. Most Windows API functions have a C interface, and accept C style strings. To be able to use APIs, Borland had to introduce a type that mimicked them, in the ancestor of Delphi, Turbo Pascal.

In C, there is no real string type, like there is in Delphi. Strings are just arrays of characters, and the end of the text is marked by a character with ASCII code zero. This allows them to be very long (unlike Turbo Pascal’s string type, which was limited to 255 characters and a length byte – this is Delphi’s ShortString type now), but a bit awkward to use. The beginning of the array is simply marked by a pointer to a char, which has become a PChar in Delphi. To traverse the string, in C one can use the pointer as if it were an array (this is true for all pointers in C), and use s[20] to indicate the 21st character (counting starts at 0). But C pointer arithmetic not only allows incrementing and decrementing the pointer, it also allows calculating the sum of a pointer and a number, or the difference between two pointers. In C, *(s + 20) is equivalent to s[20] (* is the C pointer operator, much like Delphi’s ^). Borland allowed almost the same syntax for the PChar type.
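As a minimal sketch of that syntax on the Delphi side (the program name is only illustrative, and the block assumes nothing beyond a console application), array-style indexing and pointer arithmetic on a PChar look like this:

```pascal
program PCharIndexing;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  S: PChar;
begin
  S := 'Delphi';       // S points to the first character of the literal
  Writeln(S[0]);       // array-style access, prints D
  Writeln((S + 2)^);   // pointer arithmetic, same as S[2], prints l
  Inc(S);              // move the pointer one character forward
  Writeln(S);          // prints elphi
end.
```

Note that nothing is copied here: only the pointer value changes.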

A PChar is just a pointer, like in C. And also like in C, you can use it as if it were an array (i.e. the pointer points to the first character in the array). But it isn’t! A PChar has no automatic storage, like the convenient Delphi string. If you copy text to a PChar-“string”, you must always make sure that the PChar actually points to a valid array, and that the array is large enough to hold the text.

var
  S: PChar;
begin
  S[0] := 'D';
  S[1] := '6';

The code above did not allocate storage for the string, so it tries to store the characters starting at some random location in memory (the address S happens to hold before it is assigned one is undefined, see my article on pointers). This can cause problems, or even a program crash. It is your responsibility to ensure that the array exists. The easiest way is to use a local array:

var
  S: PChar;
  A: array[0..100] of Char;
begin
  S := A;
  S[0] := 'D'; // this is equivalent to A[0] := 'D';
  S[1] := '6'; // you could also write: (S + 1)^ := '6';

The above code stores the characters in the array. But if you try to display the string at S, it will probably display lots of nonsense. That is because the string doesn’t end in a #0 character. OK, you could simply add another line:

  S[2] := #0; // or: (S + 2)^ := #0;

and you would get a display of the text "D6". But storing characters one by one is really inconvenient. To display a text via a PChar is much simpler: you simply set the PChar to an already existing array with a text in it. Luckily, string constants like 'Delphi' are also such arrays, and can be used with PChars:

var
  S: PChar;
begin
  S := 'Delphi';

You should however be aware that that only changes the value of the pointer S. No text is moved or copied around. The text is simply stored somewhere in the program (and has a #0 delimiter), and S is pointed to its start address. If you do:

// WARNING: BAD EXAMPLE
var
  S: PChar;
  A: array[0..100] of Char;
begin
  S := A;
  S := 'Delphi';

this does not copy the text 'Delphi' to the array A. The first line after begin points S to the array A, but immediately after that, the next line only changes S to the address of the literal string. If you want to copy text to the array, you must do that using for instance StrCopy or StrLCopy:

var
  S: PChar;
  A: array[0..100] of Char;
begin
  S := A;
  StrCopy(S, 'Delphi');

or

  StrLCopy(S, 'Delphi', Length(A) - 1); // MaxLen excludes the terminating #0

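In this case it is obvious that 'Delphi' will fit in the array, so the use of StrLCopy seems a bit overdone, but on other occasions, where you don’t know the size of the string, you should use StrLCopy to avoid overrunning the array bounds. Note that the maximum length passed to StrLCopy does not include the terminating #0, so it must be at most one less than the number of characters in the array.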
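To see StrLCopy actually protect a buffer, here is a small sketch with a deliberately tiny array. The array size and the sample text are my own choices for illustration only:

```pascal
program TruncatingCopy;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  A: array[0..7] of Char;   // room for 7 characters plus the #0
  S: PChar;
begin
  S := A;
  // MaxLen counts characters only; StrLCopy appends the #0 itself.
  StrLCopy(S, 'Embarcadero', Length(A) - 1);
  Writeln(S);           // prints Embarca
  Writeln(StrLen(S));   // prints 7
end.
```

The text is silently truncated, which is usually preferable to writing past the end of the array.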

An array like A is useful as a text buffer for small strings of a known size, but often you’ll have strings of a size which is unknown when the program is compiled. In that case you’ll have to use dynamic allocation of a text buffer. You can for instance use StrAlloc or StrNew to create a buffer, or GetMem, but then you’ll have to remember to free the memory again, using StrDispose or FreeMem. You can also use a Delphi string as a buffer, but before I describe how to do that, I want to discuss that type first.
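A minimal sketch of such a dynamically allocated buffer, assuming StrAlloc and StrDispose from SysUtils (the program name and the sample text are just placeholders):

```pascal
program DynamicBuffer;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Source: string;
  Buffer: PChar;
begin
  Source := 'Text whose length is only known at run time';
  // StrAlloc takes a number of characters, including room for the #0.
  Buffer := StrAlloc(Length(Source) + 1);
  try
    StrLCopy(Buffer, PChar(Source), Length(Source));
    Writeln(Buffer);           // prints the copied text
    Writeln(StrLen(Buffer));   // prints its length in characters
  finally
    StrDispose(Buffer);        // a buffer from StrAlloc is released with StrDispose
  end;
end.
```

The try-finally block is there so the buffer is freed even if something in between raises an exception.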

String

A world without string is chaos — Rudolf Smuntz, Mouse Hunt

Allow me to confuse you: a string or, more precisely, an AnsiString (in Delphi 2009 and higher: UnicodeString) is in fact a PChar. Just like a PChar, it is a pointer to an array of characters, terminated by a #0 character. But there is one big difference: you normally don’t have to think about how strings work. They can be used almost like any other variable. The compiler makes sure that the appropriate code to allocate, copy and free the text is called, so you don’t have to call routines like StrCopy yourself.

But there is more. Although the text is sure to be always terminated by a #0, just to make AnsiStrings compatible with C-style strings, the compiler doesn’t need it. In front of the text in memory, at a negative offset, the length of the string is stored, as an Integer. So to know the length of the string, the compiler only has to read that Integer, and not count characters until it finds a #0. That means that you can store #0 characters in the middle of the string without confusing the compiler. But some output routines, which rely on the #0 and not on the length, might be confused.
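A small sketch makes the difference between the stored length and the C-style length visible. The string contents are arbitrary:

```pascal
program EmbeddedZero;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  S := 'abc';
  S := S + #0 + 'def';         // a #0 in the middle of the text
  Writeln(Length(S));          // prints 7: the stored length counts the #0 too
  Writeln(StrLen(PChar(S)));   // prints 3: C-style routines stop at the first #0
end.
```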

Normally, each time you’d assign one string to another variable, the compiler would have to allocate memory and copy the entire string to it. Because Delphi strings can be quite long (theoretically, up to 2GB), this could be slow. To avoid the copying, Delphi uses a concept called “copy on demand”: an assignment only copies the pointer, and the text itself is copied only when one of the variables sharing it is modified. To make this work, each string has another field of information stored in front of it: the reference count. This is the number of string variables that currently reference that particular text in memory. Only when it becomes 0 is the text no longer referenced, and the memory can be freed.

The compiler takes care that the reference count is always correct (but you can confuse the compiler by casting – more on that later). If a string variable is declared in a var section, or as a field of a class or record, it will start its life as nil, the internal representation of the empty string (''). As soon as string text is created and assigned to one of these variables, the reference count of the string will be 1. Each additional assignment of that particular string to a new variable will increment the reference count. If a string variable leaves its scope (when the function or class in which it was declared ends), or is pointed to a new string, the reference count of the text is decremented.

A simple example:

function PlayWithStrings: string;
var
  S1, S2: string;
begin
  S1 := IntToStr(123456);

Now S1 points to the text '123456' and has a reference count of 1.

  S2 := S1;

No text is copied yet, S2 is simply set to the same address as S1, but the reference count of the text '123456' is 2 now.

  S2 := 'The number is ' + S2;

Now a new, larger buffer is allocated, the text 'The number is ' is copied to it, and the text from '123456' concatenated. But, since S2 doesn’t point to the text '123456' anymore, the reference count of that text is decremented to 1 again.

  Result := S2;

Result will be set to point to the same address as S2, and the reference count of the text 'The number is 123456' is incremented to 2.

end;

Now S1 and S2 leave their scope. The reference count for '123456' will be decremented to 0, and the text buffer will be freed. The reference count for 'The number is 123456' will also be decremented, but only to 1, since the function result still points to it. So although the function has ended, the string is still around.

Complicated? Yes, it is complicated, and can get even more complicated with var, const and out parameters. But fortunately, you normally don’t have to worry about this. Only if you access strings in assembler, or using a typecast to a PChar, this can become important to know. But using strings with a typecast to PChar is something which is not uncommon.

The most important things to remember about strings are:

  • that text is only copied to a new string buffer if it is modified;
  • that the reference count and the length are not connected to a string variable, but to a specific text buffer (also known as payload), to which more than one string variable can point;
  • that the reference count is always correct unless you fool the compiler by casting to a different type;
  • that assignments to a variable decrement the reference count of the text buffer it previously pointed to;
  • that if the reference count becomes 0, the string buffer is freed.
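The first point can be made visible by comparing the addresses the string variables hold. This is only a sketch for experimenting, and casting a string to Pointer like this is not something to do in production code:

```pascal
program CopyOnDemand;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S1, S2: string;
begin
  S1 := IntToStr(123456);               // text created at run time, not a literal
  S2 := S1;                             // no copy: both variables share one buffer
  Writeln(Pointer(S1) = Pointer(S2));   // prints TRUE
  S2 := S2 + '!';                       // modification: S2 gets a buffer of its own
  Writeln(Pointer(S1) = Pointer(S2));   // prints FALSE
  Writeln(S1);                          // prints 123456, S1 is untouched
end.
```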

Using strings and PChars together

If you can’t be a good example, then you’ll just have to be a horrible warning. — Catherine Aird

PChars and character arrays are awkward to use. Most of the time, you must allocate memory, and not forget to free it. If you want to add text, you must first calculate the size of the resulting text, reallocate the text buffer if it is too small, and use StrCat or StrLCat to finally add the text. You must use StrComp or StrLComp to compare strings, etc. etc.

Strings, on the other hand, are much simpler to use. Most things are done automatically. But many Windows (or Linux) API functions require PChars, and not strings. Fortunately, since strings are also pointers to zero-terminated text, you can use them as a PChar by simply casting them:

var
  S: string;
begin
  S := ExtractFilePath(ParamStr(0)) + 'MyDoc.doc';
  ShellExecute(0, 'open', PChar(S), nil, nil, SW_SHOW);
end;

Don’t forget that an AnsiString variable is a pointer to text, and not a text buffer itself. If the text is modified, it will often be copied to a new location, and the address in the variable is adjusted accordingly. That means that you should not use a PChar to point to the string and then modify the string. It is best to avoid doing something like:

// WARNING: BAD EXAMPLE
var
  S: string;
  P: PChar;
begin
  S := ParamStr(0); // say, this returns 'C:\Test.exe';
  P := PChar(S);
  S := 'Something else';

If S is changed to 'Something else', P will not be changed with it, and will still point to 'C:\Test.exe'. Since P is not a string reference to that text, and there is no other string variable pointing to it, its reference count will become 0, and the text will be discarded. That means that P now points to invalid memory.

It is wise not to confuse the compiler by mixing PChar and string variables, unless you know what you are doing. The compiler does not recognize a PChar as a string, so it will not change the reference count of the string memory if you point a PChar to it. It is often better not to use a PChar variable like this at all. Simply use the string as much as possible, and only cast at the last moment. Functions accepting a PChar parameter should copy the text they receive to their own buffer.

Normally, string buffers are only as large as necessary to contain the text assigned to them. But using SetLength you can set the string buffer to any size you need. This makes string buffers useful as text buffers to receive text. Windows API functions that return a text in a character array can be used like this:

function WindowsDirectory: string;
begin
  SetLength(Result, MAX_PATH);
  GetWindowsDirectory(PChar(Result), Length(Result));
  SetLength(Result, StrLen(PChar(Result)));
end;

The last line of the function sets the length of the string back to the length of the C-style string that was stored in the buffer. Alternatively, since assigning a PChar to a string creates a new string with a copy of the text up to the first #0, you can replace that last line with this functionally equivalent code:

  Result := PChar(Result);

If you need the result as a PChar anyway, to be processed by further API routines, you may perhaps be tempted to do this instead:

// WARNING: BAD EXAMPLE
function WindowsDirectoryAsPChar: PChar;
var
  Buffer: array[0..MAX_PATH] of Char;
begin
  GetWindowsDirectory(Buffer, MAX_PATH);
  Result := Buffer;
end;

This will however fail. Because Buffer is a local variable, the entire buffer is in local memory (the processor stack). As soon as the function ends, the local memory is reused for other routines, so the text to which the result now points is turned into complete gibberish. Local buffers should never be used to return text.

But even if you had used a dynamic allocation with StrAlloc or a similar routine, the user would have to free the string. It generally is not a good idea to return PChars like that. Better follow the example of GetWindowsDirectory, and let the user of the function provide a buffer and its length. You then simply fill the buffer (using StrLCopy) up to the given length.

There is an alternative to the function WindowsDirectory, that could use a local buffer. This relies on the fact that you can assign a PChar to a string directly. To make the text a Delphi string (with length and reference count fields), a Delphi string buffer will be allocated to the required length, and the text copied to that. So even if the local buffer is discarded, the text in the string buffer is still there:

function WindowsDirectory: string;
var
  Buffer: array[0..MAX_PATH] of Char;
begin
  GetWindowsDirectory(Buffer, MAX_PATH);
  Result := Buffer; // StrLen(Buffer) characters copied!
end;
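The same technique can be tried without any Windows API call. In this sketch the function name Greeting, the buffer size and the text are made up for the example:

```pascal
program LocalBufferToString;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

function Greeting: string;
var
  Buffer: array[0..31] of Char;   // local buffer on the stack
begin
  StrCopy(Buffer, 'Hello from the stack');
  Result := Buffer;   // the text is copied into a newly allocated string
end;

begin
  Writeln(Greeting);           // prints Hello from the stack
  Writeln(Length(Greeting));   // prints 20
end.
```

Because the result is a string and not a PChar, it no longer matters that Buffer disappears when the function returns.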

But how would you write a function, for instance in a DLL, that must pass back data as a PChar, yourself? I think you should take the example of GetWindowsDirectory again. Here is a simple DLL function, returning a version string that is stored in our DLL:

// Having a separate function to get the length is clearer than
// asking GetDLLVersion to provide that length if parameters are nil.
function GetDLLVersionLength: Integer;
begin
  Result := Length(DLLVersion + IntToStr(VersionNum));
end;

// Returns the number of characters copied, excluding the terminating #0
function GetDLLVersion(Buffer: PChar; MaxLen: Integer): Integer;
begin
  if (Buffer <> nil) and (MaxLen > 1) then
  begin
    StrLCopy(Buffer, PChar(DLLVersion + IntToStr(VersionNum)), MaxLen - 1);
    Result := StrLen(Buffer);
  end
  else
    Result := 0;
end;

As you can see, the string is simply copied to the provided buffer with StrLCopy. Because the user must provide the buffer, you will avoid any memory management problems. If you provided it, the user would have to know how to free it. FreeMem doesn’t work across a DLL boundary. But even if it did, a user of the DLL that used C or Visual Basic would not know how to free the buffer in that language, since memory management is different in each language. Letting the user provide the buffer makes him or her independent of your implementation.
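From the caller's side, using such a pair of functions could look like the sketch below. To keep it self-contained, the two functions are repeated locally instead of being imported from a DLL, and the values of DLLVersion and VersionNum are hypothetical, since the article does not show them:

```pascal
program CallerProvidesBuffer;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  DLLVersion = 'Version ';   // hypothetical value
  VersionNum = 3;            // hypothetical value

// Local stand-ins for the two DLL functions.
function GetDLLVersionLength: Integer;
begin
  Result := Length(DLLVersion + IntToStr(VersionNum));
end;

function GetDLLVersion(Buffer: PChar; MaxLen: Integer): Integer;
begin
  if (Buffer <> nil) and (MaxLen > 1) then
  begin
    StrLCopy(Buffer, PChar(DLLVersion + IntToStr(VersionNum)), MaxLen - 1);
    Result := StrLen(Buffer);
  end
  else
    Result := 0;
end;

var
  Version: string;
  Copied: Integer;
begin
  // Ask for the length first, then let a string provide the buffer.
  SetLength(Version, GetDLLVersionLength);
  // A string of N characters also has room for the #0 that follows them.
  Copied := GetDLLVersion(PChar(Version), Length(Version) + 1);
  SetLength(Version, Copied);
  Writeln(Version);   // prints Version 3
end.
```

The caller owns the buffer from start to finish, so no memory ever has to be freed on the other side of the boundary.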

Delphi 2009 and up

In Delphi 2009, strings were changed big time. Before, i.e. in Delphi 2 up to Delphi 2007, string mapped to AnsiString, and each character was a single-byte AnsiChar. A PChar was in fact a PAnsiChar. But in Delphi 2009, strings were made to use Unicode, to be precise, UTF-16, which meant that a new string type was required: UnicodeString. This string type is based on WideChars. This became the default string type, which meant that string now mapped to UnicodeString, Char to WideChar and PChar to PWideChar.

Delphi for Win32 already had the string type WideString, but this is a type that is allocated by the OS and has no reference count or “copy on demand”, so each assignment meant that a new, unique, full copy of the text had to be made. The WideString type is not very performant, and that is why the new UnicodeString type was introduced.

Beside the length and the reference count field, each string type, i.e. AnsiString as well as UnicodeString, got extra fields stored before the text: a Word containing the encoding for the string (mainly used for single byte strings like AnsiString) and a Word containing the character size. The encoding of an AnsiString governs how characters with byte values 128 up to 255 are interpreted and converted, the character size is mainly necessary for interfacing with C++ code.

Additionally, a few other string types were introduced as well: RawByteString and UTF8String. UTF8Strings are meant to contain text in UTF-8 format, which means that each element is an AnsiChar, but that “characters” can be encoded as multiple AnsiChars. Note that I put “characters” in quotes, since in the context of Unicode, it is more accurate to speak of code points.

As you can see in the Wikipedia article about UTF-16, some code points require two WideChars in UTF-16, so-called “surrogate pairs”. So the Length of a UnicodeString or a UTF8String does not necessarily correspond to the number of code points it contains. In a UnicodeString, surrogate pairs will be fairly rare, however, while in UTF-8, multibyte encodings are quite common.
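The following sketch, which assumes a Unicode-enabled compiler (Delphi 2009 and up), shows both effects with one code point each. The two sample code points are my own choice:

```pascal
program CodePointsVersusLength;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  Smiley, Accent: UnicodeString;
  U8: UTF8String;
begin
  // U+1F600 lies outside the Basic Multilingual Plane: one code point,
  // stored in UTF-16 as the surrogate pair $D83D $DE00.
  Smiley := WideChar($D83D);
  Smiley := Smiley + WideChar($DE00);
  Writeln(Length(Smiley));   // prints 2

  // U+00E9 (e with acute accent) is one WideChar, but two bytes in UTF-8.
  Accent := WideChar($00E9);
  U8 := UTF8Encode(Accent);
  Writeln(Length(Accent));   // prints 1
  Writeln(Length(U8));       // prints 2
end.
```

In both cases Length counts storage elements (WideChars or AnsiChars), not code points.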

Another new string type is the RawByteString. If you assign an AnsiString with one type of encoding to a string with a different encoding, an automatic conversion will take place, which could result in a loss of data, if characters from one encoding have no equivalent in the other. AnsiStrings use a default encoding, governed by system settings. RawByteString, however, is a string without any encoding, so you can be sure that if you assign your AnsiString or UTF8String to one (usually when passing one of them as a parameter), no conversion will take place.

The Delphi 2009 help says about RawByteString:

RawByteString enables the passing of string data of any code page without doing any codepage conversions. Normally, this means that parameters of routines that process strings without regard for the string’s code page should be of type RawByteString. Declaring variables of type RawByteString should rarely, if ever, be done, because this can lead to undefined behavior and potential data loss.

So what to do?

As you can see, in the text of this article, I never make any reference to the size of a Char. So anything I wrote in the article above can also be applied in Delphi 2009 and up. Most code using the techniques mentioned will simply recompile in Delphi 2009 and up, but instead of using AnsiStrings, it will be using UnicodeStrings, WideChars and PWideChars.

Win32 API functions often come in two versions, one that takes Ansi (i.e. single-byte) characters and (C-style) strings and one that takes Wide (Unicode, double-byte) characters and (C-style) strings. These two are usually distinguished by an A or a W at the end of their name, respectively. Delphi’s interface units for such API functions, like Windows.pas, generally also define a third version, without A or W at the end of the name (just like Microsoft does, in the C headers for these functions), and map that to the Ansi-based functions. One example from a Windows.pas from before Delphi 2009:

function GetShortPathName(lpszLongPath: PChar; lpszShortPath: PChar;
  cchBuffer: DWORD): DWORD; stdcall;
{$EXTERNALSYM GetShortPathName}
function GetShortPathNameA(lpszLongPath: PAnsiChar; lpszShortPath: PAnsiChar;
  cchBuffer: DWORD): DWORD; stdcall;
{$EXTERNALSYM GetShortPathNameA}
function GetShortPathNameW(lpszLongPath: PWideChar; lpszShortPath: PWideChar;
  cchBuffer: DWORD): DWORD; stdcall;
{$EXTERNALSYM GetShortPathNameW}
...
function GetShortPathName; external kernel32 name 'GetShortPathNameA';
function GetShortPathNameA; external kernel32 name 'GetShortPathNameA';
function GetShortPathNameW; external kernel32 name 'GetShortPathNameW';

As you can see, GetShortPathName is mapped to the function ‘GetShortPathNameA’. You also see that the -A version is declared to take PAnsiChar strings, the -W version takes PWideChar strings, and the neutral version takes PChar strings.

In Delphi 2009 and up, such neutrally named function declarations are now mapped to the W variety, so now it becomes:

function GetShortPathName; external kernel32 name 'GetShortPathNameW';
function GetShortPathNameA; external kernel32 name 'GetShortPathNameA';
function GetShortPathNameW; external kernel32 name 'GetShortPathNameW';

This means that, in Delphi 2009 and up, you mostly don’t have to worry about character size, whether you call Windows API functions or runtime library and VCL functions. Strings are now Unicode, and the API functions are now (mapped to) Unicode too, so if you keep using the size-neutral types string, Char and PChar, you won’t have to modify much of your code. And if some code happens to have the wrong character size (some API functions, like GetProcAddress, only exist in an Ansi version), you will get a compiler warning or error, to which you can and should react.

SizeOf or Length?

Of course you must be careful with code that assumes that characters are byte-sized, especially code that uses low-level routines like Move or FillChar. So to clear an array of Char, don’t do:

var
  Buffer: array[0..MAX_PATH] of Char;
begin
  FillChar(Buffer, MAX_PATH + 1, 0);

because Buffer is now made up of WideChars, which means it is now 2*(MAX_PATH+1) bytes in size. So if the size of such a buffer is required, you must use SizeOf:

  FillChar(Buffer, SizeOf(Buffer), 0);

Note that SizeOf should only be applied to static arrays. It does not work on dynamic arrays of Char: there, SizeOf returns the size of the array reference (a pointer), not the size of the data. In that case, you use something like:

  SetLength(MyCharArray, MAX_PATH + 1);
  FillChar(MyCharArray[0], Length(MyCharArray) * SizeOf(Char), 0);

For situations where the number of characters is important, you use Length:

  StrLCopy(Buffer, PChar(MyString), Length(Buffer) - 1); // leave room for the #0
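Put together in one small program, the difference between bytes and characters looks like the sketch below. The array bounds are arbitrary, and I use a literal bound instead of MAX_PATH so that no Windows unit is needed:

```pascal
program SizeOfVersusLength;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Buffer: array[0..259] of Char;
  Dyn: array of Char;
begin
  Writeln(Length(Buffer));                                   // prints 260
  Writeln(SizeOf(Buffer) = Length(Buffer) * SizeOf(Char));   // prints TRUE

  FillChar(Buffer, SizeOf(Buffer), 0);              // size in bytes
  StrLCopy(Buffer, 'Delphi', Length(Buffer) - 1);   // size in characters
  Writeln(StrLen(Buffer));                          // prints 6

  SetLength(Dyn, 10);
  Writeln(SizeOf(Dyn) = SizeOf(Pointer));   // prints TRUE: only the reference is measured
  FillChar(Dyn[0], Length(Dyn) * SizeOf(Char), 0);
end.
```

The rule of thumb is: routines that work on raw memory want bytes (SizeOf), routines that work on text want characters (Length).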

Further Information

There is a whitepaper by Marco Cantù, which describes the various new string types and enhancements extensively and very clearly. I recommend you download it and read it at least once.

More tips and tricks about converting your strings to Delphi 2009 and up can be found in these articles by Nick Hodges, former Delphi R&D Manager: Delphi in a Unicode World, Part 1, Part 2 and Part 3.

There is a bunch of Unicode related articles and documents on the Embarcadero Developer Network.

Conclusions

The open secrets of good design practice include the importance of knowing what to keep whole, what to combine, what to separate, and what to throw away. — Kevlin Henney

Although strings and PChars are both string types, they are quite different. Strings are easier to use, whereas for PChars you must do almost everything yourself. You can use them together, and cast a string as PChar, and assign a PChar to a string, but because strings change their address when they are changed, you should not hold on very long to the address you obtain by casting a string to a PChar. Assigning a PChar to a string is less hazardous.

As the previous text demonstrated, allocating text in a function and then returning a PChar to the new buffer is usually not a good idea. It is even worse if it is done across a DLL boundary, since the user can perhaps not even free the memory – the DLL and the user probably use a different memory manager, and each has a different heap. It is also not a very good idea to use a local buffer to return text.

If you must use PChars, because a function requires them, you should use strings as much as possible, and only cast to PChar when you use the string as a parameter. Using strings is much easier, and less error prone, than using the C-style string functions.

Finally

A little inaccuracy sometimes saves a ton of explanation. — H. H. Munro (Saki)

I hope I have lifted a bit of the fog regarding PChars. I have not told everything there is to be known, and perhaps even twisted the exact truth a bit (for instance, not every Delphi string is reference counted – string literals always have a reference count of -1), but those internal details are not important for the big picture, and have no bearing on the safe use and interaction of strings and PChars.

Rudy Velthuis

Disclaimer and Copyright

The coding examples presented here are for illustration purposes only. The author takes no responsibility for end-user use. All content herein is copyrighted by Rudy Velthuis, and may not be reproduced in any form without the author's permission. Source code written by Rudy Velthuis presented as download is subject to the license in the files.
