Serializing std::string
As part of the serialization library I am writing (called draft), I recently added serialization for strings. The core serialization functionality looks something like this:
#include <cstddef>
#include <fstream>

std::ofstream ostrm("out", std::ios::binary);

void SaveToBinary(const void* addr, std::size_t size) {
  ostrm.write(reinterpret_cast<const char*>(addr), size);
}
ostrm is a std::basic_ofstream<char> initialized with std::ios::binary, which avoids character conversion. When reading a text file, some conversion of the data may occur. For example, newlines on Windows are encoded as \r\n, while on Unix they are encoded as \n. When moving between platforms, reading data in text mode may silently change it. That is undesirable behavior for draft, since everything must be written and read in a specific format, hence binary encoding.
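To make the difference concrete, here is a small sketch (the file names are made up for this example). On Windows, a text-mode stream translates the newline, while a binary-mode stream writes the bytes untouched:

#include <fstream>

int main() {
  std::ofstream text_out("text_out");                  // text mode (the default)
  std::ofstream bin_out("bin_out", std::ios::binary);  // binary mode

  text_out.put('\n');  // on Windows the file gets 0x0d 0x0a (\r\n)
  bin_out.put('\n');   // the file gets the single byte 0x0a everywhere

  return 0;
}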
std::basic_ostream::write accepts a pointer to some data and a count, and writes count characters starting at the given address. Serializing a basic type such as an int is as easy as passing its address and size.
int main() {
  int x = 15;
  SaveToBinary(&x, sizeof(x));
  return 0;
}
Running this program produces a binary file named out. Using xxd, we can dump its contents with xxd out, which on my Mac results in:
00000000: 0f00 0000 ....
We can also look at the binary dump with xxd -b out:
00000000: 00001111 00000000 00000000 00000000 ....
My Mac, with an Intel Core i5 processor, uses little-endian ordering to store data: the least significant byte comes first, which is the opposite of how we as humans usually write numbers. We serialized the 32-bit value 15, which written in binary looks like 00000000 00000000 00000000 00001111. But since my computer uses little-endian storage, the byte order is reversed, and 00001111 (the least significant byte) comes first.
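If you want to confirm the byte order without reaching for xxd, one small sketch is to inspect the bytes of an int directly through an unsigned char pointer:

#include <cstddef>
#include <cstdio>

int main() {
  int x = 15;
  const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&x);
  // On a little-endian machine this prints: 0f 00 00 00
  for (std::size_t i = 0; i < sizeof(x); ++i) {
    std::printf("%02x ", static_cast<unsigned>(bytes[i]));
  }
  std::printf("\n");
  return 0;
}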
Encoding std::string
With basic types covered, next up are strings. Draft does not yet support wide strings (wchar_t is 16 bits on Windows and 32 bits mostly everywhere else), so let's just discuss std::string for now.
Without any further work, what happens if we just do this?
int main() {
  std::string str("hello");
  SaveToBinary(&str, str.length());
  return 0;
}
Actually… it almost works. xxd out produces:
00000000: 0a68 656c 6c .hell
But only the first four characters of “hello” were encoded. Let’s dive into why this almost worked and how strings are stored.
std::string
std::string is a typedef of std::basic_string<char>. basic_string is a class used to store characters. The typical implementation of basic_string has three fields: a pointer to the stored characters, a length, and a total capacity. The idea is similar to std::vector in that more elements can be added to the container, so extra space is allocated to avoid constant resize operations. To keep the string object itself small, the variable-length character array used as storage is allocated on the heap.
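As a rough sketch (not the actual definition from any standard library), the layout can be pictured like this:

#include <cstddef>

// Simplified picture of a typical basic_string.
struct StringLayout {
  char* data;            // pointer to the heap-allocated character storage
  std::size_t size;      // number of characters currently in the string
  std::size_t capacity;  // total space allocated, so appends don't always reallocate
};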
With this in mind, passing the address of a string object to our serialization function seems like it should always produce garbage output. A string object contains a pointer and two sizes, so how come we saw 80% of our string when we serialized it by passing the address of the variable on the stack?
Something to keep in mind is that allocating memory on the heap is expensive. The heap is a big chunk of memory, and as a program runs, blocks are constantly being allocated and freed. This leads to a lot of fragmentation: many small sections of unused memory scattered throughout the heap. To find enough contiguous bytes to satisfy a request, the allocator has to search through the heap. There are many allocation algorithms, but allocating on the heap is never as fast as pushing onto the stack.
If std::basic_string can avoid going to the heap for allocation, it can run much faster. This is exactly what happens with small strings. Using a technique called "small string optimization" (SSO), if a string is short enough, the space normally used for the pointer to the heap-allocated character array is instead used to store the characters themselves! Thanks to unions, this can be done without the string class taking up any additional space.
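Here is one hypothetical way such a layout could look; real implementations (libc++, libstdc++, MSVC) pack things differently, but the idea of reusing the heap fields' bytes is the same:

#include <cstddef>

// Hypothetical SSO layout: the same bytes hold either the heap fields or a
// small inline buffer, so short strings never touch the heap.
struct SsoString {
  std::size_t size;
  struct Heap {
    char* data;             // heap storage, used when the string is long
    std::size_t capacity;
  };
  union {
    Heap heap;
    char small_buffer[16];  // short strings live directly inside the object
  };
  // Real implementations usually pack this flag into a spare bit of the size
  // or capacity field rather than a separate member.
  bool is_small;
};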
So that is why serializing the string “hello” resulted in the characters “hell” in the binary output, instead of garbage data. If we instead serialize “abcdefghijklmnopqrstuvwxyz”, we get something like:
00000000: 2100 0000 0000 0000 1a00 0000 0000 0000 !...............
00000010: 9006 c034 a87f 0000 3620 ...4....6
No strings in there! (Note: the output may differ somewhat depending on your compiler and system – different implementations of the C++ standard library may implement SSO differently, or not at all.)
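One informal way to see whether your implementation stored a short string inline is to check whether data() points inside the string object itself. Comparing pointers into unrelated objects is formally unspecified, so treat this purely as an experiment:

#include <iostream>
#include <string>

int main() {
  std::string s("hello");
  const char* obj_begin = reinterpret_cast<const char*>(&s);
  const char* obj_end = obj_begin + sizeof(s);
  // If the characters live inside the object itself, SSO kicked in for this string.
  bool inside_object = s.data() >= obj_begin && s.data() < obj_end;
  std::cout << (inside_object ? "looks like SSO\n" : "characters are on the heap\n");
  return 0;
}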
Serializing std::string correctly
One correct way to serialize strings is to write the length first, then write the characters in the string (see std::basic_string::data). Let's update main to correctly serialize strings:
int main() {
  std::string str("hello");
  std::string::size_type length = str.length();
  SaveToBinary(&length, sizeof(length));
  SaveToBinary(str.data(), length);
  return 0;
}
std::string::size_type is a type guaranteed to be able to hold the size of a string on your platform. For example, a 64-bit platform may use a 64-bit integer to store the number of characters in a string, whereas a 32-bit platform may only use a 32-bit integer.
Aside: this is not the right code for a serializer. Serializing a string should use the same number of bytes for the string's length on every platform, so that deserialization works correctly everywhere. Let's ignore cross-platform compatibility for now, though.
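For completeness, here is a sketch of what a more portable version might look like, reusing SaveToBinary from above and writing the length as a fixed-width std::uint64_t so it occupies eight bytes on every platform (byte order across platforms is a separate issue we're also ignoring):

#include <cstdint>
#include <string>

int main() {
  std::string str("hello");
  std::uint64_t length = str.length();  // always 8 bytes, regardless of platform
  SaveToBinary(&length, sizeof(length));
  SaveToBinary(str.data(), str.length());
  return 0;
}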
str.data() returns a pointer to the beginning of the array storing the string's characters. By saving length characters starting at this address, we write out the entire string. data() also returns the correct address wherever the characters are stored, be it on the stack or the heap. Running the above code and looking at the output with xxd produces something like:
00000000: 0500 0000 0000 0000 6865 6c6c 6f ........hello
The first eight bytes encode the length of the string (5), and the last five bytes encode the string itself.
Deserializing std::string
Now that we have a correct encoding, how would we decode this binary output file back into the original string? In order to read the string back, we need to know how long the string is, so the first step is to read the length of the string. Then, we can read that many bytes from the input stream.
#include <cstddef>
#include <fstream>
#include <string>

std::ifstream istrm("out", std::ios::binary);

void LoadBinary(void* addr, std::size_t size) {
  istrm.read(reinterpret_cast<char*>(addr), size);
}

int main() {
  std::string::size_type length;
  LoadBinary(&length, sizeof(length));

  std::string str;
  str.resize(length);
  LoadBinary((void*) str.data(), length);
  // str now holds "hello"

  return 0;
}
Most of this code is similar to the serialization code. One extra step is to resize the string we create to hold the decoded characters. This is equivalent to telling std::vector "I want exactly x elements"; if we don't resize the string, it has no storage for the characters and doesn't know how many it should hold.
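To make that concrete, here is the difference between reserve and resize in this situation:

std::string str;

str.reserve(5);  // would allocate space, but str.size() would still be 0
str.resize(5);   // str.size() becomes 5, so the 5 characters can be
                 // overwritten by LoadBinary as in the code above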
With that, we have serialized and deserialized a string to binary. Stay tuned for more posts about serialization.
Follow the development of draft on GitHub.