Development: UTF-8 Support

From WxWiki

About UTF-8 Support

This page collects notes about the work of adding support for UTF-8 build mode in wxWidgets and improving the existing Unicode build.

Rationale for adding UTF-8 build mode

Currently wxWidgets has two build modes: ANSI and Unicode. This was inspired by both the C and C++ standards and by Win32, and while it seemed like a good idea back then (~1998), this model hasn't gained much popularity since, especially on the Unix side of things. Many popular open source libraries (GTK+, Pango, DirectFB, ...) use UTF-8 exclusively internally and people developing under Unix are much more used to this way of handling Unicode.

Moreover, having 2 build modes for wxWidgets (and in fact not 2 but 2 times more as we also have debug/release static/shared STL/non-STL ...) is very painful and the idea of replacing them with a single build mode (at least for each platform) is very enticing.

Goals of UTF-8 support

The goals are, in order of priority:

  1. Make Unicode easier to use, especially avoid the wxT() macro, which confuses people a lot, and, as far as possible, don't expose wchar_t either
  2. Allow using UTF-8 internally in wxString too
    • Save space (especially important for embedded systems, whether wxGTK-based (such as Maemo), or wxDFB ones)
    • Avoid conversions between wxString and the GUI toolkit (again, applies to both wxGTK and wxDFB and maybe even wxMotif)
  3. Keep compatibility with the existing ANSI build: the goal here is 100% backwards compatibility at the source code level
  4. Keep compatibility with the existing Unicode build: Unicode and ANSI builds are currently orthogonal, so it's going to be impossible to stay compatible with both. Compatibility with the ANSI build is more important because people using the Unicode build hopefully understand Unicode better and so will have fewer problems updating their code. Still, compatibility with the existing code should be preserved as much as possible.

Visions of the bright future

On completion of this work we will release wxWidgets 3. This version will provide (as before) the same API on both Windows and Unix but will use different implementations: wxString will store data as wchar_t * internally under Windows and Mac and as UTF-8 encoded char * under Unix and there will be only one official wxWidgets build for any platform.

The existing code will mostly continue to work but due to the scope of the changes there can be some incompatibilities, hence the jump in the version number.

Implementation plans

Platforms

The plan is to support a new UTF-8 build on Unix platforms only. Notice that for full Unicode support the current locale should use UTF-8 encoding as otherwise only characters supported by the current encoding can be used. However wxWidgets programs will still work at least as well as currently even if a different encoding is used.

Windows and Mac OS ports will use the (extended) current Unicode build as they don't use UTF-8 internally. Of course, the public API should be the same in any case, even though some functions (e.g. access to wchar_t buffer representing the string contents) may be more efficient under some platforms (where this buffer is used internally) than under others (where a temporary buffer will have to be allocated and filled).

Preprocessor definitions

The UTF-8 build will be identified by defining wxUSE_UNICODE_UTF8 macro. We may also define wxUSE_UNICODE_WCHAR for symmetry for the existing Unicode build and maybe even more precise wxUSE_UNICODE_UTF16/32 depending on sizeof(wchar_t) but these macros are not necessary strictly speaking. In any case, wxUSE_UNICODE will be always defined as 1 in wxWidgets 3 as it will always support Unicode fully, unlike the current ANSI build.

The wxT() (also known as _T()) macro will be preserved for backwards compatibility but won't be needed any more as either narrow or wide strings can be used in any case in wxWidgets 3 API.

Making Unicode easier to use: wxString constructors, mb_str() etc.

(All of this applies to both UTF-8 and Unicode build modes.)

The goal is to eliminate the need for using wxT() in the code. In particular, the following uses should work in all builds and not only in the ANSI one:

  1. creating wxString from char* (literal or variable)
  2. converting wxString to char*
  3. comparing wxString to char* and wxString elements to char
  4. using wxString, char* and wchar_t* in formatting functions like wxPrintf() or wxLogXXX()
Items 1-3 are easily done. In UTF-8 and ANSI build modes, this is trivial, as char* is already the internal representation. In Unicode mode, we currently require either using wxT() or passing char* together with a wxMBConv instance. Instead, we should use wxConvLocal by default, allowing code like this to work regardless of the build mode:
wxString foo("bar");
Likewise, wxString::mb_str() should use wxConvLocal if no converter is specified:
wxString str("foo");
puts(str);

The reasoning for using wxConvLocal here is that it's compatible with the ANSI build, which interprets all strings as being in current locale's charset (it's compatible with proposed UTF-8 build as well, because it only runs under locales using UTF-8). This is modeled after Qt's QString which is both Unicode aware and can be easily used with char* (unlike current wxString in Unicode build).

The problematic case is 4.: making wx functions that accept vararg arguments as well as wx wrappers for vararg CRT functions (wxPrintf() and friends) work. Eliminating wxT() from 1.-3. is not very useful if you still have to use it when logging or formatting strings. We could add wxLogXXX etc. overloads that accept char* for the format string, but while that would allow uses like this:
wxPrintf("foo value is %s\n", foo.c_str());
it wouldn't allow this
wxPrintf("enabled: %s", enabled ? "yes" : "no");
or this
wxPrintf("value: %s (%s)", str.c_str(), enable ? "enabled" : "disabled");
(both in Unicode build where c_str() returns wchar_t*).

The solution is to replace the wrappers with templates:

void wxPrintf(const wxString& fmt);
template<typename T1> void wxPrintf(const wxString& fmt, T1 a1);
template<typename T1, typename T2> void wxPrintf(const wxString& fmt, T1 a1, T2 a2);
// ... up to N arguments for some reasonably high and configurable value of N...

With this, string arguments - be they char*, wchar_t* or wxString - would be converted to a suitable form by the (thin) template wrapper. This would make existing Unicode and ANSI code using them work as before. As an added advantage, using c_str() wouldn't be necessary anymore, all of these uses would be equally valid:

wxString str("str");
const char *foo = "foo";
const wchar_t *bar = L"bar";
wxPrintf("%s", str);
wxPrintf("%s", str.c_str());
wxPrintf("%s", foo);
wxPrintf("%s", bar);
wxPrintf("%s %s %s %s", str, str.c_str(), foo, bar);

The same solution applies to all places where variadic arguments are accepted: wxLogXXX() functions, wxString::Printf() etc.
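To make the idea of a thin normalizing wrapper more concrete, here is a minimal standalone sketch. It is not wxWidgets code: Format(), Convert() and Raw() are hypothetical names, std::string stands in for wxString, the wide-to-narrow conversion is ASCII-only, and C++11 variadic templates are used for brevity where the plan above calls for N explicit overloads.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a real wide-to-narrow conversion (ASCII only here).
inline std::string Convert(const wchar_t* w)
{
    std::string out;
    for ( ; *w; ++w )
        out += (static_cast<unsigned long>(*w) < 0x80) ? static_cast<char>(*w) : '?';
    return out;
}

// Every other argument type passes through the first stage unchanged.
template <typename T>
inline T Convert(T v) { return v; }

// Second stage: string objects become the const char* that "%s" expects.
inline const char* Raw(const std::string& s) { return s.c_str(); }

template <typename T>
inline T Raw(T v) { return v; }

// The wrapper itself: normalize each argument, then call the CRT function.
// Temporaries created by Convert() live until the end of each snprintf() call.
template <typename... Args>
std::string Format(const char* fmt, Args... args)
{
    const int n = std::snprintf(nullptr, 0, fmt, Raw(Convert(args))...);
    if ( n < 0 )
        return std::string();

    std::string out(static_cast<std::size_t>(n) + 1, '\0');
    std::snprintf(&out[0], out.size(), fmt, Raw(Convert(args))...);
    out.resize(static_cast<std::size_t>(n));
    return out;
}
```

With this sketch, a call such as Format("%s %s %s", std::string("str"), "foo", L"bar") accepts all three string flavours with the same "%s" specifier, which is the behaviour wanted from wxPrintf() and friends.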

wxChar and wxUniChar

wxChar is currently a typedef for char or wchar_t, depending on the build. This is not sufficient for us: we need to return some type that could be transparently used as char or wchar_t from wxString::operator[] and wxString::iterator, in order to be as compatible as possible with both ANSI and Unicode builds. We also want to be able to represent any Unicode character in it, which is not possible using char in the UTF-8 build.

The solution is to add a lightweight wxUniChar class representing single Unicode character, modeled after Qt's QChar class. It will be returned by wxString::operator[] and used by wxString::iterator (see below). The point of using it is that it can be treated as either char or wchar_t by user code, which makes porting code to Unicode simpler.

As for wxChar, it will still have to be defined for backwards compatibility. To keep the code like this

for ( const wxChar *p = s.c_str(); *p; p++ )
{
  wxChar ch = *p;
  ... do something with ch ...
}

working we need to define it as wchar_t in all builds (if it were defined as char, this code would still compile but would be silently broken for strings containing Unicode characters outside of the current encoding).

Simplified mockup of wxUniChar interface:

class wxUniChar
{
public:
    wxUniChar(char c);
    wxUniChar(wchar_t c);

    operator char() const;
    operator wchar_t() const { return m_value; }

    wxUniChar& operator=(const wxUniChar& c) { m_value = c.m_value; return *this; }
    wxUniChar& operator=(char c);
    wxUniChar& operator=(wchar_t c);

    bool operator==(const wxUniChar& c) const { return m_value == c.m_value; }
    bool operator==(char c) const;
    bool operator==(wchar_t c) const { return m_value == c; }

    bool operator!=(const wxUniChar& c) const { return !(*this == c); }
    bool operator!=(char c) const { return !(*this == c); }
    bool operator!=(wchar_t c) const { return !(*this == c); }

private:
    unsigned m_value; // wchar_t is UCS-2 only on Win32, we need full Unicode range here
};

We'll need a similar class, wxUniCharRef, for (writable) references, too. You can currently change individual characters in wxString like this:

string[i] = wxT('a');

This works because there are two operator[]s, const one which returns wxChar and non-const one which returns wxChar&. We can't return wxChar& from operator[] in UTF-8 build because one character may be represented using several bytes, so we have to return a wxUniCharRef "reference" class with the following properties:

  • implicit cast to wxUniChar so that it can be used in exactly the same way wxUniChar is (because the non-const version of operator[] is always used if the wxString instance is not const)
  • implements operator= for setting the value; if used, ~wxUniCharRef will update the string (by partially reencoding it)
  • lightweight; in particular, don't do anything with the string unless a new value was assigned to it using operator= and be [almost] as cheap as current implementation for ASCII characters
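The following standalone sketch illustrates the proxy-reference idea on top of a plain std::string holding UTF-8. All names (Utf8Text, CharRef, SeqLen, EncodeUtf8, DecodeUtf8) are hypothetical, valid UTF-8 input is assumed, and, unlike the plan above, the string is re-encoded directly in operator= rather than in the destructor, purely to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by the given lead byte.
inline std::size_t SeqLen(unsigned char lead)
{
    if ( lead < 0x80 ) return 1;
    if ( (lead & 0xE0) == 0xC0 ) return 2;
    if ( (lead & 0xF0) == 0xE0 ) return 3;
    return 4;
}

inline std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if ( cp < 0x80 )
        out += static_cast<char>(cp);
    else if ( cp < 0x800 )
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 )
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

inline char32_t DecodeUtf8(const std::string& s, std::size_t pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    const std::size_t len = SeqLen(lead);
    char32_t cp = (len == 1) ? static_cast<char32_t>(lead)
                             : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
    for ( std::size_t k = 1; k < len; ++k )
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3F);
    return cp;
}

class Utf8Text
{
public:
    // Proxy playing the role of a writable character reference.
    class CharRef
    {
    public:
        CharRef(std::string& bytes, std::size_t pos) : m_bytes(bytes), m_pos(pos) {}

        // Reading only decodes; the string is not touched.
        operator char32_t() const { return DecodeUtf8(m_bytes, m_pos); }

        // Writing replaces the old byte sequence with the new encoding.
        CharRef& operator=(char32_t cp)
        {
            const std::size_t oldLen = SeqLen(static_cast<unsigned char>(m_bytes[m_pos]));
            m_bytes.replace(m_pos, oldLen, EncodeUtf8(cp));
            return *this;
        }

    private:
        std::string& m_bytes;
        std::size_t m_pos;
    };

    explicit Utf8Text(const std::string& utf8) : m_bytes(utf8) {}

    std::size_t length() const
    {
        std::size_t n = 0;
        for ( std::size_t off = 0; off < m_bytes.size(); ++n )
            off += SeqLen(static_cast<unsigned char>(m_bytes[off]));
        return n;
    }

    CharRef operator[](std::size_t n)
    {
        std::size_t off = 0;
        while ( n-- > 0 )
            off += SeqLen(static_cast<unsigned char>(m_bytes[off]));
        return CharRef(m_bytes, off);
    }

    const std::string& bytes() const { return m_bytes; }

private:
    std::string m_bytes;
};
```

Assigning through the proxy may change the byte length of the string (for example when a two-byte character is replaced by an ASCII one) while the character count stays the same, which is exactly why a plain wxChar& cannot be returned.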

c_str() and implicit conversions

The c_str() method is the main problem in making the new UTF-8 build compatible with both the existing ANSI and Unicode builds. This is because it returns const char * in the former and const wchar_t * in the latter.

Returning const wchar_t * is a non-starter as one of our main goals is to make wxWidgets easier to use for people who don't know or care about wchar_t. And this would, of course, break any existing ANSI-only code. Returning const char * is not possible either, unfortunately, as this would mean that we'd need to make wxChar a synonym for char so that const wxChar *p = s.c_str() could still be written, and we've seen above that wxChar should be defined as wchar_t.

The only remaining solution is to make c_str() return an object which is convertible to both of these types. Unfortunately this doesn't work everywhere, because c_str() is often used with printf()-like functions and such conversions don't happen automatically for the arguments of vararg functions. But it does work with the template-based wrappers described above. So by returning a lightweight wxCStrData object with implicit conversion operators for conversion to char* as well as wchar_t* and void* (for returning the native in-memory representation), we can retain compatibility with:

  1. CRT and other third party non-vararg functions taking char* (puts(), fopen() etc.)
  2. CRT and other third party non-vararg functions taking wchar_t* (fputws() etc.)
  3. all wx wrappers for CRT functions (wxPrintf() etc.)
  4. all wx vararg functions (wxLogDebug(), wxString::Printf() etc.)
  5. code handling wxString's data as memory (e.g. for storing in file) by means of wxCStrData::operator void*()

This leaves only one unsupported case: directly passing c_str() return value to printf() and its CRT friends in ANSI build or wprintf() in Unicode build. Some compilers (notably, gcc) will produce a warning about passing non-POD type through ..., thus warning the programmer about a problem. A similar problem exists with code that depends on implicit conversion of the return value of c_str() from char* to any other type. Since no implicit conversion from wxCStrData is available in the general case, and chaining of implicit constructors is not permitted by the C++ standard because it is O(N!) and likely ambiguous, all such code is broken by this mechanism, and will not compile.

Note that it's important to allocate memory for the temporary buffer in wxString itself and not wxCStrData, as the temporary wxCStrData object can be destroyed while the buffer is still in use as in this case:

for ( const wxChar *ptr = str.c_str(); *ptr; ptr++ )
{
  ...
}

So if a representation different from the one used internally is needed, the temporary buffer is allocated in wxString itself and not freed by ~wxCStrData. The buffer is only freed (invalidated) when the underlying string is modified.


wxString will also be implicitly convertible to both const char * and const wchar_t *.
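A minimal standalone sketch of this arrangement follows. DemoString and CStrProxy are hypothetical stand-ins for wxString and wxCStrData, the narrow representation is assumed to be the internal one, and the widening is ASCII-only. The point it illustrates is that the converted buffer is owned by the string, so a pointer obtained from the temporary proxy stays valid after the proxy is gone.

```cpp
#include <cstddef>
#include <string>

class DemoString;

// Lightweight object returned by c_str(); it owns no memory itself.
class CStrProxy
{
public:
    explicit CStrProxy(const DemoString& s) : m_str(s) {}

    operator const char*() const;
    operator const wchar_t*() const;

private:
    const DemoString& m_str;
};

class DemoString
{
public:
    DemoString(const char* s) : m_narrow(s) {}

    CStrProxy c_str() const { return CStrProxy(*this); }

    // Native representation: no conversion needed.
    const char* AsChar() const { return m_narrow.c_str(); }

    // Other representation: converted lazily into a buffer owned by the
    // string (not by the proxy), so it outlives the temporary CStrProxy.
    const wchar_t* AsWChar() const
    {
        if ( m_wide.empty() )
            m_wide.assign(m_narrow.begin(), m_narrow.end());
        return m_wide.c_str();
    }

    // Any modification must invalidate the cached buffer.
    void Append(const char* s)
    {
        m_narrow += s;
        m_wide.clear();
    }

private:
    std::string m_narrow;
    mutable std::wstring m_wide;
};

inline CStrProxy::operator const char*() const { return m_str.AsChar(); }
inline CStrProxy::operator const wchar_t*() const { return m_str.AsWChar(); }
```

Both const char *p = s.c_str() and const wchar_t *w = s.c_str() compile against this sketch, while passing s.c_str() straight to printf() would still hand a class object to a vararg function, which is the unsupported case described above.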

UTF-8 based wxString implementation

In the UTF-8 build, string data will be stored as a UTF-8-encoded char* string. In other words, the implementation will mostly be the same as in the ANSI build, but with one crucial difference: operations on string characters will be performed on Unicode characters (= Unicode code points for the purpose of this discussion) and not on the bytes of the string, as is done in the ANSI build.

This distinction wasn't visible in the ANSI build until relatively recently, because it was used with single-byte charsets such as iso-8859-*. However, virtually all modern Linux distributions now use UTF-8 locales by default and the ANSI build is broken under them precisely because wxString character operations are performed on bytes rather than Unicode characters. A common use of wxString when processing individual characters is doing something like this:

unsigned len = str.length();
for ( unsigned i = 0; i < len; i++ )
{
  wxChar c = str[i];
  if ( c == wxT('<') )
  {
      doSomething(c);
      str[i] = wxT(' '); // erase <
  }
  else
      doSomethingElse(c);
}

(Or, alternatively, using std::string::iterator.) The problem here is that it works as long as the strings are ASCII, but when they're not, some characters are represented by multiple bytes in UTF-8 encoding and these bytes would be processed individually, instead of calling doSomething[Else]() only once for the correctly-decoded Unicode character, as would be correct.

The UTF-8 build must handle this case correctly, so we need to provide an interface for iterating over characters and not bytes. Note that such an interface would be useful for the Win32 Unicode build as well: Win32 uses UTF-16 encoding for wide strings and so the above code is incorrect for it as well, because it only works for Plane 0 Unicode characters.

Therefore all wxString methods related to individual characters must work on Unicode characters. In particular:

  • length() will return the number of characters, not the length of C string used to represent it
  • operator[n] will return n-th character, not n-th byte
  • wxString::iterator will iterate over the characters, not bytes of the string
  • both operator[] and wxString::iterator will return a wxUniChar object instead of char or wchar_t. "Writable" wxString::operator[] will return wxUniCharRef instead of wxChar& (i.e. char& or wchar_t&). (Note that this will be done for all builds, it's also part of the "making Unicode easier to use" subtask.)
  • Mid(), Find() etc. will accept or return positions in # of characters, not # of bytes. (This may be incompatible with some obscure code that uses wxString methods to compute indexes and C string returned from c_str() to further manipulate the string, but that can be safely considered broken code, IMHO.)
  • ...etc. (?)

Note that using a QChar-like wxUniChar class is crucial here: it allows existing ANSI or Unicode code to work as before, because the returned character can be transparently handled as either char or wchar_t.

The code example above would still work or it could be written as this now (note the lack of wxT):

unsigned len = str.length();
for ( unsigned i = 0; i < len; i++ )
{
  wxChar c = str[i];
  if ( c == '<' )
  {
      doSomething(c);
      str[i] = ' '; // erase <
  }
  else
      doSomethingElse(c);
}

But this code is inefficient in the UTF-8 build, because operator[] has to find the i-th character in the UTF-8 encoding of the string, so the code runs in O(len^2) in the UTF-8 build compared to O(len) for other builds (but see below). That's why we have to provide wxString::iterator and wxString::const_iterator (with the same interface std:: iterators have):

for ( wxString::iterator i = str.begin(); i != str.end(); ++i )
{
  if ( *i == '<' )
  {
      doSomething(*i);
      *i = ' '; // erase <
  }
  else
      doSomethingElse(*i);
}
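To show why iterating is O(len) where indexing is not, here is a minimal standalone sketch of a read-only iterator that steps over a UTF-8 buffer one code point at a time. Utf8ConstIterator and CountCodePoint() are hypothetical names, valid UTF-8 input is assumed, and only the handful of operators needed for a simple loop are provided, so this is not a full std-style iterator.

```cpp
#include <cstddef>
#include <string>

class Utf8ConstIterator
{
public:
    explicit Utf8ConstIterator(const char* p) : m_p(p) {}

    // Decode the code point at the current position.
    char32_t operator*() const
    {
        const unsigned char lead = static_cast<unsigned char>(*m_p);
        const std::size_t len = Length(lead);
        char32_t cp = (len == 1) ? static_cast<char32_t>(lead)
                                 : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
        for ( std::size_t k = 1; k < len; ++k )
            cp = (cp << 6) | (static_cast<unsigned char>(m_p[k]) & 0x3F);
        return cp;
    }

    // Advance by one character, i.e. by 1 to 4 bytes: constant time per step.
    Utf8ConstIterator& operator++()
    {
        m_p += Length(static_cast<unsigned char>(*m_p));
        return *this;
    }

    bool operator==(const Utf8ConstIterator& other) const { return m_p == other.m_p; }
    bool operator!=(const Utf8ConstIterator& other) const { return m_p != other.m_p; }

private:
    static std::size_t Length(unsigned char lead)
    {
        if ( lead < 0x80 ) return 1;
        if ( (lead & 0xE0) == 0xC0 ) return 2;
        if ( (lead & 0xF0) == 0xE0 ) return 3;
        return 4;
    }

    const char* m_p;
};

// Example use: count the occurrences of one character in a UTF-8 string.
inline std::size_t CountCodePoint(const std::string& utf8, char32_t wanted)
{
    std::size_t hits = 0;
    const Utf8ConstIterator end(utf8.data() + utf8.size());
    for ( Utf8ConstIterator i(utf8.data()); i != end; ++i )
    {
        if ( *i == wanted )
            ++hits;
    }
    return hits;
}
```

Each multi-byte sequence is visited exactly once and compared as a whole character, so a byte such as 0xC3 inside an encoded character is never mistaken for a character of its own.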

wxString::operator[] performance

A simple implementation of operator[i] would have O(i) performance in the typical case in the UTF-8 build, because looking up the i-th character requires iterating from the start of the string over i characters. This means that a typical iteration over the string as in this example would be O(n^2):

unsigned len = str.length();
for ( unsigned i = 0; i < len; i++ )
{
  ...
}

This is probably the most common use of operator[], together with looking at the first, second or last character of the string:

c = str[0];
c = str[1];
c = str.Last();

While the worst-case performance of operator[] will always be O(i), we can optimize it in the common case:

  • Instead of iterating the string from the start until we find i-th character, iterate from the start or the end, whichever is closer to i. We can do this if we keep the length of the string in memory (we currently do and we have to as long as we allow embedded NULs).
  • Remember the last index used in operator[] (either in wxString itself or in a global, occasionally cleared cache) and start the iteration from there instead of from the start. This will make it run in O(1) in the typical for loop above. If we want to allow simultaneous read-only iteration by multiple threads (this is an open question, because e.g. wxString copying is not MT-safe), then we would have to store the cache in TLS memory (which may well negate the benefits of this optimization).
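The second optimization can be sketched as follows. CachedUtf8Index is a hypothetical helper, not a proposed wxString member. It assumes valid UTF-8, keeps the cache next to the string data, and ignores both thread safety and invalidation on modification. A step counter is included only to make the cost visible.

```cpp
#include <cstddef>
#include <string>

class CachedUtf8Index
{
public:
    explicit CachedUtf8Index(const std::string& utf8)
        : m_bytes(utf8), m_lastIndex(0), m_lastOffset(0), m_steps(0)
    {
    }

    // Return the byte offset of the n-th character.
    std::size_t ByteOffset(std::size_t n)
    {
        std::size_t idx = 0;
        std::size_t off = 0;

        // Resume from the remembered position when moving forwards;
        // otherwise fall back to scanning from the start.
        if ( n >= m_lastIndex )
        {
            idx = m_lastIndex;
            off = m_lastOffset;
        }

        while ( idx < n && off < m_bytes.size() )
        {
            off += Length(static_cast<unsigned char>(m_bytes[off]));
            ++idx;
            ++m_steps;
        }

        m_lastIndex = idx;
        m_lastOffset = off;
        return off;
    }

    // Total number of characters skipped over so far.
    std::size_t Steps() const { return m_steps; }

private:
    static std::size_t Length(unsigned char lead)
    {
        if ( lead < 0x80 ) return 1;
        if ( (lead & 0xE0) == 0xC0 ) return 2;
        if ( (lead & 0xF0) == 0xE0 ) return 3;
        return 4;
    }

    std::string m_bytes;
    std::size_t m_lastIndex;
    std::size_t m_lastOffset;
    std::size_t m_steps;
};
```

In an ascending loop each lookup advances by a single character, so the whole loop costs O(len) steps in total. A backwards jump falls back to a scan from the start, which is where the first optimization (scanning from the nearer end) would help.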

Makefiles changes

A third build mode, --enable-utf8 (on by default), will be added. On systems that support UTF-8 locales, this build will be used by default and will use shared libraries with the same names as the current ANSI build. The rationale is that it's compatible with the ANSI build and is meant to replace it. On systems that don't support UTF-8 locales, the existing ANSI build will be used by default.

Incompatible changes

Known incompatibilities with the existing code:

General

  • Existing code passing the c_str() return value directly to a function expecting char* as a vararg argument (e.g. printf()) won't work, but will compile (producing a warning with gcc). The recommended solution is to use wxWidgets wrappers for vararg functions (e.g. wxPrintf() instead of printf()). If a wrapper is not available, you can add an explicit cast:
    printf("foo=%s\n", (const char*)str.c_str());
    Or, alternatively, use mb_str():
    printf("foo=%s\n", str.mb_str());
  • The c_str() return value cannot be cast to non-const char* or wchar_t* anymore. The solution is to use the newly added wxString methods char_str() (which returns a buffer convertible to char*) or wchar_str() (which returns a buffer convertible to wchar_t*). These methods will be available in the wxWidgets 2.8 series beginning with 2.8.4, in order to allow writing code compatible with both 3.0 and 2.8. If compatibility with older versions of wxWidgets is required, a workaround is possible: casting to a const pointer and then to non-const, e.g. (char*)(const char*)str.c_str();
  • The return type of wxString::operator[] and wxString::iterator::operator* is no longer wxChar, but wxUniChar. This is not a problem in the vast majority of cases because of the conversion operators, but it breaks code that depends on the result being of the wxChar type (e.g. CPPUNIT_ASSERT_EQUAL).
  • wxUniChar cannot be used in a switch statement, because it has implicit conversions to several integer types (char, wchar_t, bool). Because wxUniChar is returned by operator[], this means that code like this will no longer compile:
    switch ( str[i] ) { ... }
    The error given by the compiler is: ambiguous default type conversion from ‘wxUniChar’; candidate conversions include ‘wxUniChar::operator wchar_t() const’ and ‘wxUniChar::operator bool() const’. This could be made to work by leaving only one implicit conversion in wxUniChar, namely removing the conversion to char and relying on C++'s integral promotion. That would compile, but would only give the correct result if the current locale's charset was ASCII or ISO-8859-1. Silent breakage like this would be worse than breaking compilation, so user code using operator[] in a switch argument will have to be modified to use
    switch ( str[i].GetValue() ) { ... }
    or
    switch ( (wxChar)str[i] )
    if compatibility with the previous wxWidgets versions is important.
  • wxString::iterator and const_iterator classes have changed and are not convertible to const wxChar * any more, so the code like
    const wxChar *p = s.begin() + 1
    doesn't work any longer and will have to be corrected to use wxString::iterator type for p.
  • It's no longer possible to pass anonymous enum values to wxPrintf() and other "vararg" function wrappers. For example, this code no longer compiles:
    enum {
        FOO
    };
    ...
    wxPrintf("FOO=%i\n", FOO);
    This can be easily fixed by naming the enum.
  • Some methods and functions that previously took a const wxChar* argument and accepted NULL were changed to use wxEmptyString instead of NULL. As a result, old code explicitly passing NULL no longer compiles and has to be changed to use wxEmptyString or "". In particular, these functions are affected:
    • wxFileSelector(), wxFileSelectorEx(), wxLoadFileSelector(), wxSaveFileSelector() functions
    • wxRegKey methods that used NULL to indicate key's default value
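The switch incompatibility listed above can be reproduced outside wxWidgets with any class that has more than one implicit conversion to an integral type. The sketch below uses a hypothetical DemoUniChar with only two conversion operators (the real class also converts to bool) and a GetValue() accessor like the one recommended above. Classify() is likewise only an illustration.

```cpp
class DemoUniChar
{
public:
    DemoUniChar(char c) : m_value(static_cast<unsigned char>(c)) {}
    DemoUniChar(wchar_t c) : m_value(static_cast<unsigned>(c)) {}

    // Two implicit conversions to integral types: enough to make
    // "switch ( c )" ambiguous, whichever one the programmer had in mind.
    operator char() const { return m_value < 0x80 ? static_cast<char>(m_value) : '?'; }
    operator wchar_t() const { return static_cast<wchar_t>(m_value); }

    // Explicit accessor: unambiguous, and keeps the full code point value.
    unsigned GetValue() const { return m_value; }

private:
    unsigned m_value;
};

inline int Classify(DemoUniChar c)
{
    // Writing "switch ( c )" here would not compile because of the ambiguity.
    switch ( c.GetValue() )
    {
        case '<':
            return 1;
        case '>':
            return 2;
        default:
            return 0;
    }
}
```

Comparisons such as c == '<' are unaffected in the real class because it provides dedicated operator== overloads. Only contexts that need a single conversion to an integral type, such as switch, hit the ambiguity.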

Compiler specific

  • (BCC, DMC only) The return type of wxString::c_str() is now the wxCStrData struct and not const wxChar*. wxCStrData is implicitly convertible to const char* and const wchar_t*, so this only presents a problem if the compiler cannot convert the type. In particular, the Borland C++ and DigitalMars compilers don't correctly convert operator?: operands to the same type and fail with a compilation error instead. This can be worked around by explicitly casting to const wxChar*:
    wxLogError(_("error: %s"), !err.empty() ? (const wxChar*)err.c_str() : "");
  • (DMC only) The DigitalMars compiler has a bug that prevents it from using wxUniChar::operator bool in conditions, and it erroneously reports a type conversion ambiguity in expressions such as this:
    for ( wxString::const_iterator p = s.begin(); *p; ++p )
    This can be worked around by explicitly casting to bool:
    for ( wxString::const_iterator p = s.begin(); (bool)*p; ++p )

Open Questions

  • Should we support, if only for testing purposes,
    • UTF-8 build under Windows?
    • wchar_t build under Unix?
  • Check that wxString can still be used to store arbitrary binary data after all these changes