Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2013

unicodex — High-performance Unicode Library (C++)

Filed under: Software,Unicode — Patrick Durusau @ 11:42 am

unicodex — High-performance Unicode Library (C++) by Dustin Juliano.

From the post:

The following is a micro-optimized Unicode encoder/decoder for C++ that is capable of significant performance, sustaining 6 GiB/s for UTF-8 to UTF-16/32 on an AMD A8-3870 running in a single thread, and 8 GiB/s for UTF-16 to UTF-32. That would allow it to encode nearly the full English Wikipedia in approximately 6 seconds.

It maps between UTF-8, UTF-16, and UTF-32, and properly detects UTF-8 BOM and the UTF-16 BOMs. It has been unit tested with gigabytes of data and verified with binary analysis tools. Presently, only little-endian is supported, which should not pose any significant limitations on use. It is released under the BSD license, and can be used in both proprietary and free software projects.

The decoder is aware of malformed input and will raise an exception if the input sequence would cause a buffer overflow or is otherwise fatally incorrect. It does not, however, ensure that exact codepoints correspond to the specific Unicode planes; this is by design. The implementation has been designed to be robust against garbage input and specifically avoid encoding attacks.

One of those “practical” things that you may need for processing topic maps and or other digital information. 😉

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress