My Public Notepad: Unicode

Friday, 4 February 2022

Running BOM-encoded .sql file in MySQL shell gives "ERROR: 1064 (42000): You have an error in your SQL syntax"

I wanted to import the Northwind DB into my local instance of MySQL and found, on a Google Code page, a file with all the necessary commands: Northwind.MySQL5.sql. I tried to execute it via MySQL Shell (mysqlsh) but got an error:

ERROR: 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '# ---------------------------------------------------------------------- #
#' at line 1

After replacing the entire content of this .sql file with a simple

select User from mysql.user;

...I was still getting the same error. When I ran this command directly at the SQL shell prompt, it succeeded. But when I passed the file via \source at the prompt, or via --file as a mysqlsh argument, the error appeared. So something was wrong with the file or with the way it was being passed to MySQL Shell.

After some fruitless attempts I created a new file, placed the same command in it, and this time it worked fine!

I then compared the hex content of both files and noticed the difference: the troublemaker file started with the UTF-8 byte order mark (BOM) byte sequence, 0xEF 0xBB 0xBF.
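The same check can be done in code instead of a hex viewer. The following is only a minimal sketch: the function names are made up for illustration, and it looks at nothing but the first three bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Returns true if `data` begins with the UTF-8 byte order mark EF BB BF.
inline bool starts_with_utf8_bom(const std::string& data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Reads at most three bytes from the file and checks them for the BOM.
// A file that cannot be opened is reported as "no BOM".
inline bool file_starts_with_utf8_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char head[3] = {0, 0, 0};
    in.read(head, 3);
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    return starts_with_utf8_bom(std::string(head, got));
}

// Usage check: the same statement with and without a leading BOM.
inline void check_bom_detection() {
    assert(starts_with_utf8_bom(std::string("\xEF\xBB\xBF") + "select User from mysql.user;"));
    assert(!starts_with_utf8_bom("select User from mysql.user;"));
}
```

Calling file_starts_with_utf8_bom("troublemaker.sql") before handing the file to mysqlsh would have pointed at the problem right away.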

I could not find any document which confirms that MySQL shell does not ignore BOM in .sql files but I found the following:

From MySQL :: MySQL 8.0 Reference Manual :: 4.5.1.5 Executing SQL Statements from a Text File:

mysql ignores Unicode byte order mark (BOM) characters at the beginning of input files. Previously, it read them and sent them to the server, resulting in a syntax error. Presence of a BOM does not cause mysql to change its default character set. To do that, invoke mysql with an option such as --default-character-set=utf8.

It seems that this is not the case for mysqlsh.

From MySQL :: MySQL Shell 8.0.21 for MySQL Server 8.0 and 5.7 has been released:

       MySQL Shell’s JSON import utility importJSON() now
       handles UTF-8 encoded files that include a BOM (byte mark
       order) at the start, which is the sequence 0xEF 0xBB
       0xBF. As a workaround in earlier releases, remove this
       byte sequence, which is not needed. (Bug #30993547, Bug
       #98836)

From MySQL :: MySQL 8.0 Reference Manual :: 10.9 Unicode Support:

The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values needs to be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.

MySQL uses no BOM for UTF-8 values.

From utf 8 - Detecting utf8 broken characters in MySQL - Stack Overflow:

Also make sure (since I use PHP and this had tripped me up a couple of times so I thought I'd mention it here) all your script files are UTF8 (without BOM)

From How to Import a CSV file into MySQL Workbench | by 👩🏻‍💻 Kessie Zhang | Towards Data Science:

“Table Data Import Wizard fails on UTF-8 encoded file with BOM.”

In any case, I came to the conclusion that .sql files intended to be executed by MySQL should not start with a BOM.

We can remove the BOM with sed (the \xHH escapes used here are a GNU sed feature):

$ sed -i '1s/^\xef\xbb\xbf//' troublemaker.sql

To check it:

$ xxd troublemaker.sql
00000000: 7365 6c65 6374 2055 7365 7220 6672 6f6d select User from
00000010: 206d 7973 716c 2e75 7365 723b mysql.user;

And finally, with the BOM removed, this troublemaker is not causing trouble anymore.

How to remove BOM Unicode character from a file (Linux)

Example of a file with byte order mark (BOM) bytes at the beginning:

$ xxd commands.sql
00000000: efbb bf73 656c 6563 7420 5573 6572 2066 ...select User f
00000010: 726f 6d20 6d79 7371 6c2e 7573 6572 3b rom mysql.user;

To remove these bytes, we can use the sed tool:

$ sed -i '1s/^\xef\xbb\xbf//' commands.sql

Let's check the file content now:

$ xxd commands.sql
00000000: 7365 6c65 6374 2055 7365 7220 6672 6f6d select User from
00000010: 206d 7973 716c 2e75 7365 723b mysql.user;
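Where sed is not available, the same edit is easy to do programmatically. This is a rough sketch under simple assumptions: the function names are hypothetical, the whole file is read into memory, and a write failure is not reported.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Removes a leading UTF-8 BOM (EF BB BF) in place.
// Returns true if a BOM was found and removed.
inline bool strip_utf8_bom(std::string& data) {
    if (data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        data.erase(0, 3);
        return true;
    }
    return false;
}

// Rewrites the file without its BOM, similar in effect to the sed command above.
// The file is left untouched if it does not start with a BOM.
inline bool strip_utf8_bom_in_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::stringstream buffer;
    buffer << in.rdbuf();
    in.close();

    std::string data = buffer.str();
    if (!strip_utf8_bom(data)) {
        return false;
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << data;
    return true;
}

// Usage check on an in-memory copy of the example file.
inline void check_strip() {
    std::string sql = std::string("\xEF\xBB\xBF") + "select User from mysql.user;";
    assert(strip_utf8_bom(sql));
    assert(sql == "select User from mysql.user;");
    assert(!strip_utf8_bom(sql));  // second call finds nothing to remove
}
```

Only a BOM at the very start of the data is removed; any later occurrence of the same bytes is left alone, just as with the 1s/^.../ address in the sed command.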

Thursday, 8 December 2011

Host endianness and data transfer over the network

Network components talk to each other by sending messages, which are simply arrays of bytes. To understand them, the parties in a conversation need to know the communication protocol, which defines the message format and the length, order and meaning of its parts.

Typically, a message comprises a header and a payload. The header can contain information about the message itself, the protocol version, and the sender and receiver. The payload is the actual information that the sender wants to pass to the receiver.

The simplest and shortest message one host can send is a message 1 byte long. In this case, the protocol only needs to define how this byte is treated: as a character, or as a signed or unsigned number. For example, if the protocol says that the message contains a value of type unsigned char, and the message is 0x8b, the receiver will treat it as a positive integer of value 139. If it were a value of type signed char, the receiver would understand it as a negative integer, -117.
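The 0x8b example can be checked with a few lines of code. This is just an illustrative sketch with made-up function names; the signed result assumes the usual two's complement representation.

```cpp
#include <cassert>
#include <cstdint>

// Interprets a received byte as an unsigned 8-bit number (0..255).
inline int as_unsigned(std::uint8_t byte) {
    return byte;
}

// Interprets the same bit pattern as a signed 8-bit number (-128..127).
inline int as_signed(std::uint8_t byte) {
    return static_cast<std::int8_t>(byte);
}

// Usage check for the single-byte message 0x8b from the text.
inline void check_single_byte_message() {
    assert(as_unsigned(0x8b) == 139);
    assert(as_signed(0x8b) == -117);
}
```

The byte on the wire is identical in both cases; only the type agreed on in the protocol decides which number the receiver sees.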

There is one problem with messages made of two or more bytes. Bytes are sent and received in the same order in which they are written to the sending buffer. But the way bytes are copied from a register to memory (the buffer) and vice versa can differ between hosts, and this depends on their endianness. If the sender has a big endian (BE) CPU and the receiver has a little endian (LE) CPU, the receiver might interpret received values incorrectly.

Let's look at the case when the message consists of a 2-byte integer value, say of type unsigned short. This type has a range of values between 0 and 65535 (0x0000 and 0xffff). If a BE sender wants to send the value 0xabcd (43981), it copies this value from the register to the buffer keeping the same byte order, so the buffer looks like this: | 0xab | 0xcd |. The most significant byte (MSB) is at the lower address in memory. The other side receives the bytes in the same order. When copying bytes to a register, a BE receiver treats the byte from the lowest memory address as the MSB and puts it first, so the register is filled with bytes in the same order as they are in memory (0xabcd) and everything is fine. But an LE receiver treats the byte from the lowest address as the least significant byte (LSB) and puts it at the last position in the register: it swaps the order and reads the received value as 0xcdab (52651), which is wrong! The sender could avoid this if it knew the endianness of the receiver and sent the bytes in the matching order, but that is impractical.
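The two readings of the same buffer can be reproduced on any machine by assembling the value from its bytes explicitly. A minimal sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Treats the byte at the lowest address as the most significant byte (BE view).
inline std::uint16_t read_u16_big_endian(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Treats the byte at the lowest address as the least significant byte (LE view).
inline std::uint16_t read_u16_little_endian(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Usage check for the buffer | 0xab | 0xcd | written by a BE sender.
inline void check_abcd_example() {
    const unsigned char wire[2] = {0xab, 0xcd};
    assert(read_u16_big_endian(wire) == 0xabcd);     // 43981, the intended value
    assert(read_u16_little_endian(wire) == 0xcdab);  // 52651, the misread value
}
```

Because these helpers use shifts instead of copying memory into an integer, they give the same result regardless of the endianness of the host they run on.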

The solution is a simple rule: the sender should always send bytes in big endian order (network byte order), and the receiver should always convert received bytes from network byte order to its own byte order. This makes the sending and receiving code portable.

Both Windows and *NIX networking APIs offer helper functions that convert integers from host to network byte order and vice versa. They are:

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);

Obviously, if the host itself uses network byte order (big endian), no conversion takes place, no matter whether it is on the sending or the receiving side.
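To see what these functions do to the bytes in memory, one can convert a value and look at it byte by byte. The sketch below assumes a POSIX system, where the functions are declared in <arpa/inet.h>; wire_bytes is a made-up helper name.

```cpp
#include <arpa/inet.h>  // POSIX header; on Windows the functions come from <winsock2.h>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the four bytes of `host_value` in the order they would go on the wire.
inline std::array<unsigned char, 4> wire_bytes(std::uint32_t host_value) {
    const std::uint32_t net_value = htonl(host_value);  // no-op on a big endian host
    std::array<unsigned char, 4> bytes;
    std::memcpy(bytes.data(), &net_value, bytes.size());
    return bytes;
}

// Usage check: the wire layout is the same on every host,
// and converting back restores the original value.
inline void check_wire_bytes() {
    const std::array<unsigned char, 4> b = wire_bytes(0xabcd1234u);
    assert(b[0] == 0xab && b[1] == 0xcd && b[2] == 0x12 && b[3] == 0x34);
    assert(ntohl(htonl(0xabcd1234u)) == 0xabcd1234u);
    assert(ntohs(htons(0xabcd)) == 0xabcd);
}
```

On a little endian host htonl swaps the bytes and on a big endian host it returns its argument unchanged, so the resulting memory layout is identical in both cases.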

Sending and receiving buffers can be declared as char or unsigned char arrays. The values transported can be of signed or unsigned types. I made a set of utility functions that insert values of the desired integer types into sending buffers and extract them from receiving (probably socket) buffers. Before insertion, values are converted to network byte order (big endian); after extraction, they are converted from network to host byte order. Tests show that the hton/ntoh functions can be applied to both signed and unsigned types, as all they do is swap bytes (if necessary).

NetBuffUtilCore.h:

NetBuffUtil.h:

NetBuffUtil.cpp:

main.cpp:

Output:

unsigned char buff
Original (unsigned short): 43981
Received val = 43981

char buff
Original (unsigned short): 43981
Received val = 43981

unsigned char buff
Original (short): -31234
Received val = -31234

char buff
Original (short): -31234
Received val = -31234

unsigned char buff
Original (unsigned long): 2882343476
Received val = 2882343476

char buff
Original (unsigned long): 2882343476
Received val = 2882343476

unsigned char buff
Original (long): -1107401523
Received val = -1107401523

char buff
Original (long): -1107401523
Received val = -1107401523
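For readers who want to reproduce round trips like the ones in this output, here is a rough sketch of such insert/extract helpers. It is not the NetBuffUtil code from this post: the function names are hypothetical, it assumes a POSIX system (<arpa/inet.h>), and signed values are passed as their unsigned bit pattern.

```cpp
#include <arpa/inet.h>  // POSIX; on Windows use <winsock2.h> instead
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration. CharT is expected to be char or unsigned char.
template <typename CharT>
void insert_u16(CharT* buf, std::uint16_t host_value) {
    static_assert(sizeof(CharT) == 1, "buffer element must be one byte");
    const std::uint16_t net_value = htons(host_value);
    std::memcpy(buf, &net_value, sizeof net_value);
}

template <typename CharT>
std::uint16_t extract_u16(const CharT* buf) {
    static_assert(sizeof(CharT) == 1, "buffer element must be one byte");
    std::uint16_t net_value;
    std::memcpy(&net_value, buf, sizeof net_value);
    return ntohs(net_value);
}

template <typename CharT>
void insert_u32(CharT* buf, std::uint32_t host_value) {
    static_assert(sizeof(CharT) == 1, "buffer element must be one byte");
    const std::uint32_t net_value = htonl(host_value);
    std::memcpy(buf, &net_value, sizeof net_value);
}

template <typename CharT>
std::uint32_t extract_u32(const CharT* buf) {
    static_assert(sizeof(CharT) == 1, "buffer element must be one byte");
    std::uint32_t net_value;
    std::memcpy(&net_value, buf, sizeof net_value);
    return ntohl(net_value);
}

// Round trips with the same values as in the output above.
inline void check_round_trips() {
    unsigned char ubuf[4];
    char cbuf[4];

    insert_u16(ubuf, 43981);
    assert(extract_u16(ubuf) == 43981);

    // Signed values travel as their unsigned bit pattern and are cast back.
    insert_u16(cbuf, static_cast<std::uint16_t>(-31234));
    assert(static_cast<std::int16_t>(extract_u16(cbuf)) == -31234);

    insert_u32(ubuf, 2882343476u);
    assert(extract_u32(ubuf) == 2882343476u);

    insert_u32(cbuf, static_cast<std::uint32_t>(-1107401523));
    assert(static_cast<std::int32_t>(extract_u32(cbuf)) == -1107401523);
}
```

Using memcpy instead of casting the buffer pointer to an integer pointer avoids alignment and aliasing problems when the value sits at an arbitrary offset in the buffer.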

To avoid a dependency on the Winsock library, I implemented a function which swaps bytes for a given type (well, the template should be constrained to integer types only...):

EndiannessUtil.h:

main.cpp:
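A byte-swapping template with the integer-only constraint mentioned above could look roughly like this. It is a sketch and not the EndiannessUtil.h from this post; all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reverses the byte order of an integral value.
// Non-integral types (float, pointers, structs) are rejected at compile time.
template <typename T>
T swap_bytes(T value) {
    static_assert(std::is_integral<T>::value, "swap_bytes requires an integral type");
    using U = typename std::make_unsigned<T>::type;
    U in = static_cast<U>(value);
    U out = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out = static_cast<U>((out << 8) | (in & 0xFF));  // append lowest byte of `in`
        in = static_cast<U>(in >> 8);
    }
    return static_cast<T>(out);
}

// Detects the host byte order at run time by inspecting the first byte of 1.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Host to network order without Winsock or <arpa/inet.h>.
// The same function converts back, since swapping twice restores the value.
template <typename T>
T to_network_order(T value) {
    return host_is_little_endian() ? swap_bytes(value) : value;
}

// Usage check.
inline void check_swap_bytes() {
    assert(swap_bytes<std::uint16_t>(0xabcd) == 0xcdab);
    assert(swap_bytes<std::uint32_t>(0xabcd1234u) == 0x3412cdabu);
}
```

The static_assert turns a call such as swap_bytes(1.5) into a compile error with a readable message instead of silently producing garbage.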

So far, we have focused on the transfer of integer types. What if the message payload needs to contain strings, or a mix of strings and integers?

Let's say that we need to send an ASCII string and an unsigned long number. The protocol should define the message payload like this:

|L0|L1|S1|S2|........|SK|N0|N1|N2|N3|

|L0|L1| - 2 bytes for an unsigned short value that defines the string length (K bytes)
|S1|S2|........|SK| - string (K bytes)
|N0|N1|N2|N3| - 4 bytes for the unsigned long number

Both integers should be converted to network byte order before being written into the sending buffer. The string does not need to be changed: it is an ASCII string, and each character occupies a single byte. The receiving side first reads 2 bytes of the payload, extracts the string length (K), allocates memory for the string (K bytes) and then reads (copies) the next K bytes from the receiving buffer into the string buffer. After that, the receiver reads the next 4 bytes and converts them from network byte order before passing the number on for further processing.
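The payload layout described above can be sketched as a pair of encode/decode functions. The Payload struct and the function names are assumptions made for this example, the number is taken to be 32 bits wide, and the integers are written with shifts so that the code does not depend on hton/ntoh.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of the payload: an ASCII string and a 32-bit number.
struct Payload {
    std::string text;
    std::uint32_t number;
};

// Builds |L0|L1|S1..SK|N0|N1|N2|N3| with both integers in network byte order.
inline std::vector<unsigned char> encode_payload(const Payload& p) {
    if (p.text.size() > 0xFFFF) {
        throw std::length_error("string does not fit a 2-byte length field");
    }
    std::vector<unsigned char> out;
    const std::uint16_t len = static_cast<std::uint16_t>(p.text.size());
    out.push_back(static_cast<unsigned char>(len >> 8));    // L0 (MSB)
    out.push_back(static_cast<unsigned char>(len & 0xFF));  // L1 (LSB)
    out.insert(out.end(), p.text.begin(), p.text.end());    // string bytes, unchanged
    for (int shift = 24; shift >= 0; shift -= 8) {          // N0..N3, MSB first
        out.push_back(static_cast<unsigned char>((p.number >> shift) & 0xFF));
    }
    return out;
}

// Reads the length, then K string bytes, then the 4-byte number.
inline Payload decode_payload(const std::vector<unsigned char>& in) {
    if (in.size() < 2) {
        throw std::runtime_error("truncated length field");
    }
    const std::size_t len = (static_cast<std::size_t>(in[0]) << 8) | in[1];
    if (in.size() < 2 + len + 4) {
        throw std::runtime_error("truncated payload");
    }
    Payload p;
    p.text.assign(in.begin() + 2, in.begin() + 2 + len);
    p.number = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        p.number = (p.number << 8) | in[2 + len + i];
    }
    return p;
}

// Usage check: a round trip restores both fields.
inline void check_payload_round_trip() {
    const Payload sent = {"abc", 0xabcd1234u};
    const Payload received = decode_payload(encode_payload(sent));
    assert(received.text == "abc");
    assert(received.number == 0xabcd1234u);
}
```

The length checks in decode_payload matter in practice: the length field comes from the network, so the receiver should never copy K bytes without first confirming that the buffer really holds them.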

When sending a Unicode string, we need to take care of endianness again if the encoding uses code units wider than one byte. Our protocol will define the encoding applied (e.g. UTF-8 or UTF-16). UTF-8 is a sequence of single bytes, so byte order does not affect it, but with UTF-16 or UTF-32 the sender needs to pass one more piece of information: the byte order used. This information is carried by the byte order mark (BOM) sequence, which is prepended to the string. The BOM tells the Unicode decoder on the receiving side whether or not to swap the bytes of each code unit.
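For UTF-16, the BOM is the code point U+FEFF, which arrives as the bytes FE FF from a big endian sender and FF FE from a little endian one. The sketch below shows how a receiver might use it; the function name is hypothetical, data without a BOM is assumed to be big endian, and surrogate pairs are left as two separate code units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts raw bytes into UTF-16 code units in host order.
// A leading BOM selects the byte order and is not included in the result.
// Assumption of this sketch: without a BOM the data is big endian.
inline std::vector<std::uint16_t> decode_utf16_units(const std::vector<unsigned char>& bytes) {
    bool little_endian = false;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            little_endian = true;
            pos = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            pos = 2;
        }
    }
    std::vector<std::uint16_t> units;
    for (; pos + 1 < bytes.size(); pos += 2) {  // a trailing odd byte is ignored
        const unsigned hi = little_endian ? bytes[pos + 1] : bytes[pos];
        const unsigned lo = little_endian ? bytes[pos] : bytes[pos + 1];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}

// Usage check: "A" followed by the euro sign (U+0041, U+20AC) in both byte orders.
inline void check_utf16_bom() {
    const std::vector<unsigned char> le = {0xFF, 0xFE, 0x41, 0x00, 0xAC, 0x20};
    const std::vector<unsigned char> be = {0xFE, 0xFF, 0x00, 0x41, 0x20, 0xAC};
    const std::vector<std::uint16_t> expected = {0x0041, 0x20AC};
    assert(decode_utf16_units(le) == expected);
    assert(decode_utf16_units(be) == expected);
}
```

The same three-byte sequence EF BB BF seen at the start of UTF-8 files is this U+FEFF code point encoded in UTF-8; there it carries no byte-order information and only marks the text as UTF-8.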

Links and references:
htons(), htonl(), ntohs(), ntohl() (Beej's guide)
htons function (MSDN)
htonl function (MSDN)
ntohl function (MSDN)
ntohs function (MSDN)
Linux functions
Encodings and Unicode (Python)
Byte Order (Codecs)
