Unicode, wxWidgets, and I

June 15, 2004

Over the last week or so, I’ve had the experience of using wxWidgets with Unicode. I read a fair number of Unicode Character Charts, and read the wxMBConv classes overview, and still didn’t get it.

Until a few days ago, when everything clicked. Here is what I learned, hopefully in a better form than what’s in the wxWidgets documentation.

Disclaimer

Please note that I’m not a Unicode expert - if something is obviously wrong, comment and I will see what I can do to fix the inaccuracy.

Basic Class Overview

wxMBConv

wxMBConv is the base class for the Multi-Byte converter classes. The multi-byte converter classes take 8-bit character encodings (such as Latin1) and convert them to properly formatted Unicode strings.

Note that 8-bit encoded characters fit easily inside a char - think of unsigned chars, which have 256 possible values (0 to 255). Typically the first 128 character values (0 to 127, the ASCII range) are the same between encodings, which only differ after that. A chart that might help you out here is the ANSI/Unicode/HTML entities character chart. The concept that an 8-bit non-Unicode string is (normally) a const char* type was very helpful for me. (Unicode strings, if you want to deal with them directly, are usually const wchar_t* types.) UTF-8 breaks the rules slightly - it is Unicode, but it can be stored inside char*s too.
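To make the char versus wchar_t idea concrete, here is a minimal sketch in plain standard C++ (no wxWidgets involved). It assumes the 8-bit input is Latin1, which is the easy case because each byte value is also its Unicode code point. The function name Latin1ToWide is my own invention for illustration, not anything from wxWidgets, and this is not how the wxMBConv classes are implemented.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen a Latin1 (ISO-8859-1) byte string.
// Assumption: the input really is Latin1, where byte value N is
// Unicode code point U+00NN. Other 8-bit encodings need a lookup table.
std::wstring Latin1ToWide(const std::string& latin1)
{
    std::wstring wide;
    wide.reserve(latin1.size());
    for (char c : latin1)
    {
        // Go through unsigned char first: plain char may be signed, and a
        // byte such as 0xE9 would otherwise sign-extend to a negative value.
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    // One wide character per input byte.
    assert(wide.size() == latin1.size());
    return wide;
}
```

Calling Latin1ToWide("caf\xE9") gives a four-character wide string whose last element is U+00E9 (é). A real converter for any other 8-bit encoding would replace the cast with a table lookup for the byte values above 127.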

OK, now all of these types have predefined instances - the trick is, of course, knowing which one to use at the right time. The first instance of this is wxConvLibc which is a wxMBConv. Personally, I’m not actually sure what this instance is useful for, but it’s there anyway.

These predefined instances are great, but feel free to create your own converter objects. There’s nothing preventing you from doing this, and sometimes it’s faster than figuring out which predefined instance you need.

wxCSConv

wxCSConv is a class that converts between any character set and Unicode. At first glance this looks just like what wxMBConv does, and it is, except that you specify the name of the encoding in the wxCSConv constructor.

wxCSConv also has one predefined instance - wxConvLocal. This holds the encoding for the user’s default character set. That default character set is the encoding in which input from the user is returned and in which GUI text such as button labels is drawn, among other things.

The wxMBConv Classes Overview says you shouldn’t use wxConvLocal directly - instead use the wxConvCurrent instance (which is actually a pointer to a wxMBConv object).

wxMBConvUTF8

wxMBConvUTF8 converts between UTF-8 and Unicode strings. Now, UTF-8 is neat because it uses clever bit tricks to let wide Unicode characters pass through code written for 8-bit encodings. Basically, this means that UTF-8 data can be stored in a char* (if you tried that with raw wide-character Unicode, well, no good would come of it).
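For the curious, the "clever bit tricks" look roughly like this. The sketch below is plain standard C++ following the UTF-8 bit layout; it is not the wxMBConvUTF8 source, and EncodeUtf8 is a name I made up for illustration. Code points below 0x80 are stored as-is, and everything else is split into 6-bit groups with marker bits on top.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string EncodeUtf8(unsigned long cp)
{
    assert(cp <= 0x10FFFF);               // must be inside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogates are not characters

    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx: plain ASCII, one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because every byte of a multi-byte sequence has its high bit set, no byte in it can be mistaken for an ASCII character or a terminating zero. That is why old char* code can carry UTF-8 around without noticing.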

wxMBConvUTF8 has one predefined instance - wxConvUTF8.

The Task: Writing a UTF8 encoded file

Enter wxUTF8OutputStream

So now you want to output some text to a UTF-8 encoded file - and you want to do this both when wxWidgets is compiled in Unicode mode, and when it is not. I created a class especially to do this: wxUTF8OutputStream. This class takes care of all the character conversions, and provides a wxStream/std::stream-like interface to do so.

Design of the wxUTF8OutputStream class

I think it’s important to let users of my classes and tools do as much with them as they possibly can. Sometimes this means using C++ templates, sometimes it just means designing a class with flexibility in mind from the start. wxWidgets end-user code doesn’t use any C++ templates at all (the library is old enough that some of it was written when templates were not a part of standard C++ - although basic data structures are slowly being moved to STL-based variants, while preserving legacy code). So I decided to eschew them as well for this class. Instead, I have the user pass me a wxStream-based object in the constructor to actually write the output to.

Now, our header:

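    #include <wx/string.h>
    #include <wx/stream.h>

    class wxUTF8OutputStream
    {
    public:
        wxUTF8OutputStream( wxOutputStream* outStream, bool withBOM = true );

        void Write( const wxString& outputString );

        // Stream-style operator so calls can be chained
        wxUTF8OutputStream& operator<<( const wxString& s )
        {
            Write( s );
            return *this;
        }

    private:
        wxOutputStream* m_stream;   // not owned; supplied by the caller
    };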

The code of the matter

Now, our .cpp file:

    #include <string.h>         // strlen
    #include <wx/strconv.h>     // wxCSConv, wxConvCurrent

    wxUTF8OutputStream::wxUTF8OutputStream( wxOutputStream* outStream, bool withBOM )
    {
        // The three bytes of the UTF-8 BOM
        unsigned char bomValueP1 = 0xEF;
        unsigned char bomValueP2 = 0xBB;
        unsigned char bomValueP3 = 0xBF;

        m_stream = outStream;
        if (withBOM)
        {
            m_stream->Write( &bomValueP1, sizeof(char) );
            m_stream->Write( &bomValueP2, sizeof(char) );
            m_stream->Write( &bomValueP3, sizeof(char) );
        }
    }

    void wxUTF8OutputStream::Write( const wxString& outputString )
    {
        wxCSConv* convFile = NULL;
        wxMBConv* convMem = NULL;
        convFile = new wxCSConv( wxT("utf-8") ); // translate to UTF-8

    #if wxUSE_UNICODE
        const wxWX2MBbuf buf( outputString.mb_str( *convFile ) );
        m_stream->Write( (const char*) buf, strlen( (const char*) buf ) );
    #else
        convMem = wxConvCurrent; // translate from the user's default encoding
        wxString str2( outputString.wc_str( *convMem ), *convFile );
        m_stream->Write( str2.mb_str(), str2.Len() );
    #endif

        if (convFile)
            delete convFile;
    }

wxUTF8OutputStream constructor (or: The Big Badda BOM)

Unicode documents (even UTF-8) should have a few bytes at the beginning of the file specifying what type of Unicode they are. Having a BOM (Byte Order Mark) in your document is a good idea, but we do allow you to turn off outputting a BOM. For UTF-8 the BOM is the three bytes 0xEF 0xBB 0xBF. We output the bytes of the BOM one at a time to keep our code simple.
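If you want to play with the BOM outside of wxWidgets, here is a small sketch using only standard C++ streams. All three function names are hypothetical, made up for this example. Unlike the constructor, it writes the three bytes in a single call, and it adds a check for whether a byte string starts with the BOM.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// The UTF-8 BOM: U+FEFF encoded as UTF-8.
static const char kUtf8Bom[] = { '\xEF', '\xBB', '\xBF' };

// Hypothetical helper: write the three BOM bytes in one call.
void WriteUtf8Bom(std::ostream& out)
{
    out.write(kUtf8Bom, sizeof(kUtf8Bom));
}

// Hypothetical helper: does this byte string begin with the UTF-8 BOM?
bool StartsWithUtf8Bom(const std::string& bytes)
{
    return bytes.size() >= sizeof(kUtf8Bom)
        && bytes.compare(0, sizeof(kUtf8Bom), kUtf8Bom, sizeof(kUtf8Bom)) == 0;
}

// Hypothetical helper mirroring the withBOM flag: build a document in
// memory from text that is assumed to be UTF-8 already.
std::string MakeUtf8Document(const std::string& utf8Body, bool withBOM = true)
{
    std::ostringstream out;
    if (withBOM)
        WriteUtf8Bom(out);
    out << utf8Body;
    assert(out.good());   // an in-memory stream should not fail here
    return out.str();
}
```

MakeUtf8Document("hi") yields five bytes (the BOM plus the two letters), while MakeUtf8Document("hi", false) yields just the two letters. Some tools that read UTF-8 do not expect the three extra bytes, which is one reason to make the BOM optional.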

wxUTF8OutputStream::Write (or: bringing it all together)

The Write method is where everything comes together. First we create a wxCSConv instance, ready to encode things into UTF-8. If we’re compiling wxWidgets in Unicode mode, we’re pretty much done here - put the data from outputString, encoded in UTF-8 format, into a buffer, and then write the buffer to the stream.

For non-Unicode builds, the code is a bit more complex. First, we need to retrieve the user’s default encoding. This really could be anything - although Latin1 may be the most common. (Latin1 is even the default encoding in the Mac version of wxWidgets.) Since this is so complex, let’s take this section by section.

outputString.wc_str(*convMem)

This expression returns the string in wide character (wchar_t) format. But wc_str() needs to know what encoding the string is currently in - so we pass convMem (which is just wxConvCurrent, the user’s default encoding).

wxString str2 ( ... , *convFile );

This line says “construct a string given the input wchar_t* buffer, and change it to this encoding” (which happens to be UTF8).

Then, the end - we write str2’s value as a multi-byte string, which is now encoded in UTF8, thanks to our constructor line.

We have now written some UTF8 data to the output stream.
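The same two-step idea (local 8-bit encoding to Unicode, then Unicode to UTF-8) can be sketched without wxWidgets at all. The version below is plain standard C++ and makes one big assumption: the user’s default encoding is Latin1, standing in for whatever wxConvCurrent would really report. Latin1ToUtf8 is a hypothetical name, not a wxWidgets function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: re-encode a Latin1 byte string as UTF-8.
// Assumption: the input is Latin1, so each byte is its own code point.
std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);   // worst case: every byte doubles
    for (char c : latin1)
    {
        // Step 1 (the role wc_str() plays): 8-bit byte -> Unicode code point.
        const unsigned int cp = static_cast<unsigned char>(c);

        // Step 2 (the role the UTF-8 converter plays): code point -> bytes.
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);                  // ASCII, unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));    // 110xxxxx
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));  // 10xxxxxx
        }
    }
    assert(utf8.size() >= latin1.size());   // UTF-8 never shrinks Latin1
    return utf8;
}
```

Note that Latin1ToUtf8("caf\xE9") is five bytes long even though it holds four characters. Whatever you hand to the stream’s Write call has to be the byte count of the encoded data, not the character count of the original string.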

More Example Code

A bit of my initial knowledge came from the wxMBConv classes overview - there are several two- or three-line samples down near the bottom of the page.

However, the breakthrough really came through looking at the wxXmlDocument code - this class is a real live example of a class that has to encode and decode UTF-8 (as well as dealing with XML entities). Under wxWidgets 2.4.x this class is at contrib/src/xrc/xml.cpp, while under 2.5 it’s at /src/xml/xml.cpp.

Conclusion

Hopefully this entry is more helpful than the official documentation for wxMBConv. If there’s anything you think I should add, please mention it in the comments and I’ll see what I can do. I hope you learned a lot, and took away with you some knowledge (and/or wxUTF8OutputStream) for future programs.

The latest version of wxUTF8OutputStream is always available from our OpenSource subversion repository (username: anon, no password), and its use is licensed under the Creative Commons Attribution License.