About strings


In Creating memory, I provided some information about how to implement strings. In one example, (data (i32.const 0) "\48\65\6C\6C\6F") stores five bytes in memory. Viewed as unsigned 8-bit integers, they are the same as the Unicode numbers of the characters in "Hello". In JavaScript, you read the first 5 bytes from the ArrayBuffer of the exported memory and use String.fromCharCode to get the corresponding string.

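String.fromCharCode converts one or more UTF-16 code units into a string; for characters in the Basic Multilingual Plane, the code unit is the same as the Unicode number. "\48\65\6C\6C\6F" is a way to specify bytes in memory literally, one hexadecimal escape per byte. Each character in "Hello" happens to need only one byte to represent its Unicode number.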
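As a minimal sketch of the read described above, the following runs without a compiled module. The ArrayBuffer here is only a stand-in (an assumption for illustration) for the buffer of the exported memory, and the names asciiBuffer, helloBytes and hello are hypothetical.

```javascript
// Stand-in for the exported memory's buffer (assumption: with a real module
// this would be prog.instance.exports.mem.buffer).
const asciiBuffer = new ArrayBuffer(16);
new Uint8Array(asciiBuffer).set([0x48, 0x65, 0x6C, 0x6C, 0x6F]);

// View the first 5 bytes and turn each byte into one character.
const helloBytes = new Uint8Array(asciiBuffer, 0, 5);
const hello = String.fromCharCode(...helloBytes);

console.log(hello); // Hello
```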

What about non-ASCII characters? For example, how do you store '良' in memory?

The Unicode number of '良' is 33391 in decimal or 826F in hexadecimal. WebAssembly memory is little-endian, so remember to store the least significant byte first and the most significant byte last. You have to write "\6F\82" in the data segment.

Similarly, the Unicode number of '葛' is 845B and that of '格' is 683C. If you use String.fromCharCode and want to print "良葛格" in the console, you have to write "\6F\82\5B\84\3C\68".

(module
    (memory $mem 1)
    (data (i32.const 0) "\6F\82\5B\84\3C\68")
    (export "mem" (memory $mem))
    (func $nope)
)
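The little-endian byte order used in the data segment can be checked with a small helper. This is only a sketch: toLittleEndianEscapes and hex are hypothetical names, and the helper assumes every character fits in a single 16-bit code unit, as the three characters here do.

```javascript
// Hypothetical helper: formats one byte as two uppercase hex digits.
function hex(byte) {
    return byte.toString(16).toUpperCase().padStart(2, '0');
}

// Hypothetical helper: renders each 16-bit code unit of `str` as two wat
// byte escapes, least significant byte first (little-endian).
function toLittleEndianEscapes(str) {
    let out = '';
    for (let i = 0; i < str.length; i++) {
        const unit = str.charCodeAt(i);
        const low = unit & 0xFF;   // least significant byte
        const high = unit >> 8;    // most significant byte
        out += '\\' + hex(low) + '\\' + hex(high);
    }
    return out;
}

// '\u826F\u845B\u683C' is "良葛格".
const escapes = toLittleEndianEscapes('\u826F\u845B\u683C');
console.log(escapes); // \6F\82\5B\84\3C\68
```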

Because each character uses 2 bytes here, you read the memory through a Uint16Array in JavaScript.

WebAssembly.instantiateStreaming(fetch('program.wasm'))
            .then(prog => {
                // The third argument counts 16-bit elements, not bytes: three characters.
                console.log(String.fromCharCode.apply(null,
                    new Uint16Array(prog.instance.exports.mem.buffer, 0, 3)
                ));
            });
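A Uint16Array reads memory in the byte order of the host platform, which is little-endian on virtually all current hardware. As a sketch of a variant that states the byte order explicitly, DataView.getUint16 accepts a littleEndian flag. The ArrayBuffer below is only a stand-in (an assumption) for the exported memory, and the names utf16Buffer, view and text are hypothetical.

```javascript
// Stand-in for the exported memory, holding the six bytes of the data segment.
const utf16Buffer = new ArrayBuffer(16);
new Uint8Array(utf16Buffer).set([0x6F, 0x82, 0x5B, 0x84, 0x3C, 0x68]);

const view = new DataView(utf16Buffer);
let text = '';
for (let i = 0; i < 3; i++) {
    // Passing `true` as the second argument requests little-endian byte order.
    text += String.fromCharCode(view.getUint16(i * 2, true));
}

console.log(text); // 良葛格
```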

You may also write the characters themselves to specify data in memory. For example:

(data (i32.const 0) "Hello")

Wat files use UTF-8 encoding. When a wat file is stored, "Hello" is actually saved as the bytes 48 65 6C 6C 6F, so it has the same effect as "\48\65\6C\6C\6F". Each character in "Hello" happens to need only one byte to represent its Unicode number.

What happens when you write non-ASCII characters?

(data (i32.const 0) "良葛格")

Because the file uses UTF-8 encoding, "良葛格" is saved as the bytes of "\e8\89\af\e8\91\9b\e6\a0\bc". Of course, you can write "\e8\89\af\e8\91\9b\e6\a0\bc" directly in the data segment. The two written forms have the same effect.

(data (i32.const 0) "\e8\89\af\e8\91\9b\e6\a0\bc")

In this example, UTF-8 uses three bytes per character. "\e8\89\af" is the UTF-8 encoding of '良'. Don't confuse Unicode with its encodings. A valid Unicode code point can be encoded in different ways, such as UTF-8, UTF-16 and UTF-32. The Unicode number of '良' is 826F; however, its UTF-8 encoding is "\e8\89\af".
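The difference between a Unicode number and its UTF-8 encoding can be inspected from JavaScript. The following is a small sketch using the standard codePointAt method and the TextEncoder API, which always encodes to UTF-8; the variable names are hypothetical.

```javascript
const ch = '\u826F'; // 良

// The Unicode number (code point) as uppercase hexadecimal.
const codePointHex = ch.codePointAt(0).toString(16).toUpperCase();

// The UTF-8 encoding of the same character, one hex pair per byte.
const utf8Hex = Array.from(new TextEncoder().encode(ch))
    .map(b => b.toString(16).padStart(2, '0'))
    .join(' ');

console.log(codePointHex); // 826F
console.log(utf8Hex);      // e8 89 af
```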

That is, the following module stores the same bytes as the data segment above.

(module
    (memory $mem 1)
    (export "mem" (memory $mem))
    (func $main
        ;; 良
        (i32.store8 (i32.const 0) (i32.const 0xE8))
        (i32.store8 (i32.const 1) (i32.const 0x89))
        (i32.store8 (i32.const 2) (i32.const 0xAF))
        ;; 葛
        (i32.store8 (i32.const 3) (i32.const 0xE8))
        (i32.store8 (i32.const 4) (i32.const 0x91))
        (i32.store8 (i32.const 5) (i32.const 0x9B))
        ;; 格
        (i32.store8 (i32.const 6) (i32.const 0xE6))
        (i32.store8 (i32.const 7) (i32.const 0xA0))
        (i32.store8 (i32.const 8) (i32.const 0xBC))
    )
    (start $main)
)

In this case, don't use String.fromCharCode to decode the data in memory, because it would treat each UTF-8 byte as a separate character. You can use the TextDecoder API to do the task.

WebAssembly.instantiateStreaming(fetch('program.wasm'))
            .then(prog => {
                // Three characters, three UTF-8 bytes each: 9 bytes in total.
                const bytes = new Uint8Array(prog.instance.exports.mem.buffer, 0, 9);
                const string = new TextDecoder('utf-8').decode(bytes);
                console.log(string);
            });
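To see why String.fromCharCode is the wrong tool for UTF-8 data, the two approaches can be compared on the same nine bytes. This is a sketch that uses a plain Uint8Array as a stand-in (an assumption) for the view over the exported memory; the names utf8Bytes, wrong and right are hypothetical.

```javascript
// The nine UTF-8 bytes of "良葛格", as stored by the module above.
const utf8Bytes = new Uint8Array([
    0xE8, 0x89, 0xAF,   // 良
    0xE8, 0x91, 0x9B,   // 葛
    0xE6, 0xA0, 0xBC    // 格
]);

// Treats every byte as its own character: nine unrelated characters.
const wrong = String.fromCharCode(...utf8Bytes);

// Decodes each three-byte sequence into one character.
const right = new TextDecoder('utf-8').decode(utf8Bytes);

console.log(wrong.length); // 9
console.log(right.length); // 3
console.log(right);        // 良葛格
```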