Thursday, 8 December 2011

Host endianness and data transfer over the network

Network components talk to each other by sending messages which are simply arrays of bytes. In order to understand them, parties in conversation need to know the communication protocol which defines message format and the length, order and the meaning of its parts.

Typically, message would comprise header and payload. Header can contain information about message itself, protocol version and information about the sender and receiver. Payload is actually information that sender wants to pass to receiver.

The simplest and shortest message one host can send is a message of a 1-byte length. In this case, protocol only needs to define how is this byte treated - as a character, signed or unsigned number. For example, if protocol says that message contains value of type unsigned char, and the message is 0x8b, receiver will treat this as a positive integer, of value 139. If that was a value of signed char, receiver would understand that this is a negative integer, -117.

There is one problem for messages made of two or more bytes. Bytes are send and received in the same order they are written in the sending buffer. But the way how are bytes copied from register to memory (buffer) and vice versa can be different on different hosts and this depends on their endianness. If sender has a big endian (BE) CPU and receiver has a small endian (SE) CPU, receiver might interpret received values in a wrong way.

Let's look at the case when the message comprises of 2-byte integer value, let's say of type unsigned short. This type has a range of values between 0 and 65535 (0x0000 and 0xffff). If BE sender wants to send value 0xabcd (43981) it will copy this value from registry to buffer keeping the same byte order and buffer will be like this: | 0xab | 0xcd |. Most significant byte (MSB) is at the lower address in memory. The other side will receive bytes in the same order. When copying bytes to the registry, BE receiver will treat the byte from the lowest memory address as the MSB and put it first so the registry will filled with bytes in the same order they are in the memory (0xabcd) and everything would be fine. But LE receiver will treat byte from the lowest address as the Least Significant Byte (LSB) and put it at the last position in the registry - it would swap the order and read received value as 0xcdab (52651), which is wrong! Sender should know the endianness of the client so can send bytes in the correct order, but that is impractical.

Solution to this is a simple rule: sender should always send bytes in big endian order (network byte order) and receiver should always convert received bytes from network to its own byte order. This makes sending and receiving code portable.

Both Windows and *NIX networking frameworks offer helper functions which are able to convert integers from host to network byte order and vice versa. They are:

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);


Obviously, if host has network byte order (big endian), no conversion would take place, no matter whether it is on the sending or receiving side.

Sending and receiving buffers can be declared as char or unsigned char arrays. Values transported could be of signed or unsigned types. I made a set of several utility functions that insert and export values of desired integer types into/out from sending/receiving (probably socket) buffers. Prior to inserting, values are converted to network byte order (big endian) and after extraction, values are converted from network to host byte order. Tests prove that hton/ntoh functions can be applied both to signed and unsigned types as all they do is actually swapping bytes (if necessary).

NetBuffUtilCore.h:



NetBuffUtil.h:



NetBuffUtil.cpp:



main.cpp:



Output:

unsigned char buff
Original (unsigned short): 43981
Received val = 43981

char buff
Original (unsigned short): 43981
Received val = 43981

unsigned char buff
Original (short): -31234
Received val = -31234

char buff
Original (short): -31234
Received val = -31234

unsigned char buff
Original (unsigned long): 2882343476
Received val = 2882343476

char buff
Original (unsigned long): 2882343476
Received val = 2882343476

unsigned char buff
Original (long): -1107401523
Received val = -1107401523

char buff
Original (long): -1107401523
Received val = -1107401523

To avoid dependency on Winsock library, I implemented a function which swaps bytes for a given type (well, template should be constrained to only integer types...):

EndiannessUtil.h:



main.cpp:



So far, we were focused on transfer of integer types. What if message payload needs to contain strings, or, mashup of strings and integers?

Let's say that we need to send some ASCII string and some unsigned long number. Protocol should define message payload like this:

|L0|L1|S1|S1|S2|........|SK|N0|N1|N2|N3|

|L0|L1| - 2 bytes for unsigned short value that defines string length (K bytes)
|S1|S1|S2|........|SK| - string (K bytes)
|N0|N1|N2|N3| - 4 bytes for unsigned long number

Both integers should be converted to the network byte order prior to writing into sending buffer. But string does not need to be changed - that is ASCII string and each character is placed in a single byte. Receiving side will first read 2 bytes of payload, extract string length (K), allocate memory for string (K bytes) and then read (copy) next K bytes from receiving buffer into the string buffer. After that, receiver will read next 4 bytes and convert them from network byte order before passing it for further processing.

If sending Unicode string, we need to take care about endianness again as some of its characters use two or more bytes. Our protocol will define encoding applied (e.g. UTF-8 or UTF-16) but this time sender needs to send additional information as well - its endianness. This information is contained in Byte Order Mark (BOM) sequence which is prepended to our string. BOM helps Unicode decoder on the client side to decide whether to swap or not bytes for multi-byte characters.

Links and references:
htons(), htonl(), ntohs(), ntohl() (Beej's guide)
htons function (MSDN)
htonl function (MSDN)
ntohl function (MSDN)
ntohs function (MSDN)
Linux functions
Encodings and Unicode (Python)
Byte Order (Codecs)

No comments: