String split with utf8? — Gideros Forum

String split with utf8?

test29test29 Member
edited July 2016 in General questions
How can I make this function work with UTF-8?

function string:split(sep)
    local sep, fields = sep or ":", {}
    local pattern = string.format("([^%s]+)", sep)
    self:gsub(pattern, function(c) fields[#fields + 1] = c end)
    return fields
end

I think the line:
self:gsub(pattern, function(c) fields[#fields + 1] = c end) should be replaced with:

self:utf8.gsub(pattern, function(c) fields[#fields + 1] = c end)

but that gives me an error: "function arguments expected near '.'"?

Comments

  • n1cken1cke Maintainer
    edited July 2016 Accepted Answer
    `utf8` is not the default metatable for strings (the `string` library is), so you cannot call its functions as string methods.
    Also, the `:` in `self:gsub(pattern, function(c) fields[#fields + 1] = c end)` is syntactic sugar for `string.gsub(self, pattern, function(c) fields[#fields + 1] = c end)`.
    Try this instead:
    utf8.gsub(self, pattern, function(c) fields[#fields + 1] = c end)
    Details are in this chapter of the PiL: https://www.lua.org/pil/16.html
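
The colon rule described above can be checked in plain Lua. The sketch below is only an illustration: `mylib` and `shout` are hypothetical names standing in for any library that is not the string metatable (such as `utf8` in Gideros), and the `__index` check assumes the stock string metatable has not been modified.

```lua
local s = "a:b:c"

-- All strings share one metatable; in stock Lua its __index is the `string` table.
print(getmetatable(s).__index == string)

-- The colon form passes `s` as the first argument, so these calls are equivalent.
local viaMethod = s:gsub(":", "-")
local viaFunction = string.gsub(s, ":", "-")
print(viaMethod, viaFunction)

-- Hypothetical stand-in for a separate library such as utf8.
local mylib = {}
function mylib.shout(str)
  return string.upper(str) .. "!"
end

-- Functions from another table take the string as an explicit first argument.
local shouted = mylib.shout(s)
print(shouted)

-- s:shout() looks the name up in `string`, finds nothing, and fails at run time.
local ok = pcall(function() return s:shout() end)
print(ok)
```

Note that `self:utf8.gsub(...)` fails even earlier, at compile time: after `self:utf8` the parser expects call arguments, so the following `.` triggers the "function arguments expected" message.
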
  • test29test29 Member
    edited July 2016
    Thank you. But this doesn't solve my problem (even though I thought it would).

    Example:

    st = {}

    a = "jezičnim|teškoću"

    st = a:split("|")

    print(st[1], st[2])
    print(#st[1], #st[2])

    After running this, the string is split, but there are errors:

    The word "jezičnim" has 8 letters, but the # operator gives 9.

    The word "teškoću" has 7 letters, but the # operator also gives 9.

    It seems that every letter č, š, or ć adds an empty space at the end (one in the first case, two in the second).

    The result is the same with self:gsub(pattern, function(c) fields[#fields + 1] = c end) and
    utf8.gsub(self, pattern, function(c) fields[#fields + 1] = c end), even though I thought the UTF-8 characters were the problem.

    Also, how do you put code in a "window" like in your post?
  • n1cken1cke Maintainer
    edited July 2016
    > After running this, the string is split, but there are errors
    There are no errors; this is how UTF-8 strings work: each UTF-8 character is encoded with 1, 2, 3, or 4 bytes, depending on the character.
    Lua strings are raw byte sequences in which each byte is an unsigned number in the range 0..255, and the `#` operator only gives you a string's size in bytes. To get the length in characters, use the `utf8.len` function instead of the `#` operator.
    print(utf8.len(st[1]), utf8.len(st[2]))
    > Also, how do you put code in a "window" like in your post?
    Select your text and press the "C" button (it's above the text you enter).
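
The byte-versus-character distinction explained in this answer can be reproduced without the `utf8` plugin. In UTF-8, each of č, š, and ć occupies two bytes, which is why "jezičnim" (one such letter) and "teškoću" (two) both report 9 with `#`. The sketch below counts characters by skipping continuation bytes; `utf8Length` is a hypothetical helper name, and it assumes the source file is saved as UTF-8 and the input is valid UTF-8.

```lua
-- Count UTF-8 characters: every character has exactly one byte that is
-- NOT a continuation byte (continuation bytes lie in the range 128..191).
local function utf8Length(s)
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end

local words = { "jezičnim", "teškoću" }
for _, word in ipairs(words) do
  -- #word is the size in bytes, utf8Length(word) the number of characters.
  print(word, #word, utf8Length(word))
end
```

In Gideros itself, `utf8.len` as shown above is the more direct choice; the helper only makes the counting rule visible.
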
  • test29test29 Member
    Thank you!
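
A related caveat concerns the separator itself: the original `split` places `sep` inside a character class, so a multi-byte separator such as `č` would be treated as a set of individual bytes rather than as one character. A minimal sketch of an alternative, assuming a literal (non-pattern) separator is wanted, uses `string.find` with its plain-search flag. `splitPlain` is a hypothetical name, not part of Gideros or Lua, and unlike the original it treats a multi-character `sep` as one unit.

```lua
-- Split `text` on a literal separator using plain (non-pattern) search.
-- Plain search compares whole byte sequences, so a multi-byte UTF-8
-- separator is matched as a unit rather than byte by byte.
local function splitPlain(text, sep)
  sep = sep or ":"
  assert(sep ~= "", "separator must not be empty")
  local fields = {}
  local start = 1
  while true do
    local first, last = string.find(text, sep, start, true)
    if not first then break end
    if first > start then
      -- Skip empty fields, matching the behaviour of the original split.
      fields[#fields + 1] = string.sub(text, start, first - 1)
    end
    start = last + 1
  end
  if start <= #text then
    fields[#fields + 1] = string.sub(text, start)
  end
  return fields
end

local parts = splitPlain("jezičnim|teškoću", "|")
print(parts[1], parts[2])
```

For a single-byte separator such as `|`, the original function already splits UTF-8 text correctly; the difference only shows up when the separator is itself a non-ASCII character.
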