Archive for the ‘Unicode’ Category

Fun, Frustration, Curiosity, Murderous Rage – mimic

Monday, January 15th, 2018


From the webpage:

There are many more characters in the Unicode character set that look, to some extent or another, like others – homoglyphs. Mimic substitutes common ASCII characters for obscure homoglyphs.

Fun games to play with mimic:

  • Pipe some source code through and see if you can find all of the problems
  • Pipe someone else’s source code through without telling them
  • Be fired, and then killed
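For a sense of what mimic does, here is a toy homoglyph substitution in the same spirit. The table below is hypothetical, chosen for illustration, and is not mimic's actual mapping:

```python
# A few ASCII characters and visually similar Unicode lookalikes.
HOMOGLYPHS = {
    ';': '\u037E',  # GREEK QUESTION MARK, looks just like a semicolon
    'a': '\u0430',  # CYRILLIC SMALL LETTER A
    '-': '\u2010',  # HYPHEN, not the ASCII HYPHEN-MINUS
}

def mimic(text, table=HOMOGLYPHS):
    """Replace ASCII characters with homoglyphs from the table."""
    return ''.join(table.get(c, c) for c in text)

src = 'a = b - c;'
out = mimic(src)
print(out)          # renders identically in many fonts...
print(src == out)   # False: the strings only *look* the same
```

Pipe a co-worker's source through that and the compiler errors will be as baffling as the errant SGML space below.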

I can attest to the murderous rage from experience. There was a browser-based SGML parser that would barf on an extra whitespace character (a space, I think) in the SGML declaration. One file worked; another with the “same” declaration did not.

Only by printing and comparing the files (this was on Windoze machines) was the errant space discovered.


Shape Searching Dictionaries?

Thursday, November 16th, 2017

Facebook, despite its spying, censorship, and being a shill for the U.S. government, isn’t entirely without value.

For example, this post by Simon St. Laurent:

Drew this response from Peter Cooper:

If you follow the link, Shapecatcher: Unicode Character Recognition, you find:

Draw something in the box!

And let shapecatcher help you to find the most similar unicode characters!

Currently, there are 11817 unicode character glyphs in the database. Japanese, Korean and Chinese characters are currently not supported.
(emphasis in original)

I take “Japanese, Korean and Chinese characters are currently not supported.” to mean that Anatolian Hieroglyphs; Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform, Old Persian, Ugaritic; Egyptian Hieroglyphs; and Meroitic Cursive and Meroitic Hieroglyphs are not supported either.

But my first thought wasn’t discovery of glyphs in Unicode Code Charts, although useful, but shape searching dictionaries, such as Faulkner’s A Concise Dictionary of Middle Egyptian.

A sample from Faulkner’s (1991 edition):

Or, The Student’s English-Sanskrit Dictionary by Vaman Shivram Apte (1893):

Imagine being able to search by shape in either dictionary! Not just for a single glyph but for a set of glyphs, within any entry!

I suspect that’s doable based on Benjamin Milde‘s explanation of Shapecatcher:

Under the hood, Shapecatcher uses so called “shape contexts” to find similarities between two shapes. Shape contexts, a robust mathematical way of describing the concept of similarity between shapes, is a feature descriptor first proposed by Serge Belongie and Jitendra Malik.

You can find an in-depth explanation of the shape context matching framework that I used in my Bachelor thesis (“On the Security of reCAPTCHA”). In the end, it is quite a bit different from the matching framework that Belongie and Malik proposed in 2000, but still based on the idea of shape contexts.

The engine that runs this site is a rewrite of what I developed during my bachelor thesis. To make things faster, I used CUDA to accelerate some portions of the framework. This is a fairly new technology that enables me to use my NVIDIA graphics card for general purpose computing. Newer cards are quite powerful devices!
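The underlying idea is approachable: a shape context is a log-polar histogram, computed at each sample point, of where the shape's other sample points sit relative to it. A toy sketch of that descriptor (not Milde's implementation, and omitting the matching step entirely):

```python
import math

def shape_context(points, n_r=5, n_theta=12):
    """For each point, histogram the relative positions of every other
    point into log-polar (log-radius x angle) bins."""
    # Mean pairwise distance normalizes the radial scale.
    pairs = [(p, q) for i, p in enumerate(points)
                    for j, q in enumerate(points) if i != j]
    mean_d = sum(math.dist(p, q) for p, q in pairs) / len(pairs)
    descriptors = []
    for i, p in enumerate(points):
        hist = [[0] * n_theta for _ in range(n_r)]
        for j, q in enumerate(points):
            if i == j:
                continue
            r = math.dist(p, q) / mean_d
            theta = math.atan2(q[1] - p[1], q[0] - p[0]) % (2 * math.pi)
            # Clamp r to [1/8, 2] and bin its log2, so nearby structure
            # gets finer radial resolution than distant structure.
            log_r = math.log2(min(max(r, 0.125), 2.0))        # in [-3, 1]
            r_bin = min(int((log_r + 3) / 4 * n_r), n_r - 1)
            t_bin = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
            hist[r_bin][t_bin] += 1
        descriptors.append(hist)
    return descriptors

corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
descs = shape_context(corners)
print(sum(sum(row) for row in descs[0]))  # 3: every other point binned once
```

Matching two shapes then reduces to comparing histograms, which is what makes the approach robust to small distortions in a hand-drawn glyph.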

That was written in 2011 and no doubt shape matching has progressed since then.

No technique will be 100% but even less than 100% accuracy will unlock generations of scholarly dictionaries, in ways not imagined by their creators.

If you are interested, I’m sure Benjamin Milde would love to hear from you.

Unicode Egyptian Hieroglyphic Fonts

Monday, October 16th, 2017

Unicode Egyptian Hieroglyphic Fonts by Bob Richmond.

From the webpage:

These fonts all contain the Unicode 5.2 (2009) basic set of Egyptian Hieroglyphs.

Please contact me if you know of any others, or information to include.

Also of interest:

UMdC Coding Manual for Egyptian Hieroglyphic in Unicode

UMdC (Unicode MdC) aims to provide guidelines for encoding Egyptian Hieroglyphic and related scripts in Unicode using plain text with optional lightweight mark-up.

This GitHub project is the central point for development of UMdC and associated resources. Features of UMdC are still in a discussion phase so everything here should be regarded as preliminary and subject to change. As such the project is initially oriented towards expert Egyptologists and software developers who wish to help ensure the ancient Egyptian writing system is well supported in modern digital media.

The Manuel de Codage (MdC) system for digital encoding of Ancient Egyptian textual data was adopted as an informal standard in the 1980s and has formed the basis for most subsequent digital encodings, sometimes using extensions or revisions to the original scheme. UMdC links to the traditional methodology in various ways to help with the transition to Unicode-based solutions.

As with the original MdC system, UMdC data files (.umdc) can be viewed and edited in standard text editors (such as Windows Notepad) and the HTML <textarea></textarea> control. Specialist software applications can be adapted or developed to provide a simpler workflow or enable additional techniques for working with the material.

Also see UMdC overview [pdf].

A UMdC-compatible hieroglyphic font Aaron UMdC Alpha (relative to the current draft) can be downloaded from the Hieroglyphs Everywhere Fonts project.

For news and information on Ancient Egyptian in Unicode see

I understand the need for “plain text” viewing of hieroglyphics, especially for primers and possibly for search engines, but Egyptian hieroglyphs can be written facing right or left, top to bottom and more rarely bottom to top. Moreover, artistic and other considerations can result in transposition of glyphs out of their “linear” order in a Western reading sense.

Unicode hieroglyphs are a major step forward for the interchange of hieroglyphic texts but we should remain mindful “linear” presentation of inscription texts is a far cry from their originals.
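That linear encoding is easy to inspect from Python's `unicodedata`, assuming an interpreter whose bundled UCD is 5.2 or later; note the code points sit above U+FFFF, in the SMP:

```python
import unicodedata

# Egyptian Hieroglyphs occupy U+13000..U+1342F in the SMP; the first few:
for cp in range(0x13000, 0x13005):
    print(f'U+{cp:05X} {unicodedata.name(chr(cp))}')
```

The names follow the Gardiner sign list (A001, A002, ...), which is itself a linear cataloguing convention imposed on a decidedly non-linear writing system.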

The greater our capacity for graphic representation, the more we simplify complex representations from the past. Are the needs of our computers really that important?

Unicode 10.0 Beta Review

Thursday, March 9th, 2017

Unicode 10.0 Beta Review

In today’s mail:

The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones—plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.). The Unicode Standard, its associated standards, and data form the foundation for CLDR and ICU releases. Thus it is important to ensure a smooth transition to each new version of the standard.

Unicode 10.0 includes a number of changes. Some of the Unicode Standard Annexes have modifications for Unicode 10.0, often in coordination with changes to character properties. In particular, there are changes to UAX #14, Unicode Line Breaking Algorithm, UAX #29, Unicode Text Segmentation, and UAX #31, Unicode Identifier and Pattern Syntax. In addition, UAX #50, Unicode Vertical Text Layout, has been newly incorporated as a part of the standard. Four new scripts have been added in Unicode 10.0, including Nüshu. There are also 56 additional emoji characters, a major new extension of CJK ideographs, and 285 hentaigana, important historic variants for Hiragana syllables.

Please review the documentation, adjust your code, test the data files, and report errors and other issues to the Unicode Consortium by May 1, 2017. Feedback instructions are on the beta page.

See for more information about testing the 10.0.0 beta.

See for the current draft summary of Unicode 10.0.0.

It’s not too late for you to contribute to the Unicode party! There’s plenty of reviewing to do and by no means has all the work been done!

For this particular version, comments are due by May 1, 2017.


ScriptSource [Fonts but so much more]

Thursday, January 19th, 2017


From the about page:

ScriptSource is a dynamic, collaborative reference to the writing systems of the world, with detailed information on scripts, characters, languages – and the remaining needs for supporting them in the computing realm. It is sponsored, developed and maintained by SIL International. It currently contains only a skeleton of information, and so depends on your participation in order to grow and assist others.

The need for information on Writing Systems

In today’s expanding global community, designers, linguists and computer professionals are called upon more frequently to support the myriad writing systems around the world. A key to this development is consistent, trustworthy, complete and organised information on the alphabets and scripts used to write the world’s languages. The development of Writing System Implementations (WSIs) depends on the availability of this information, so a lack of it can hinder the cultural, economic and intellectual development of communities that communicate in minority languages and scripts.


The information needed varies widely, and can include:

  • Design information and guidelines – both for alphabets and for specific letters/glyphs
  • Linguistic information – how the script is used for specific languages
  • Encoding details – particularly Unicode, including new Unicode proposals
  • Script behaviour – how letters change shape and position in context
  • Keyboarding conventions – including information on data entry tools
  • Testing tools and sample texts – so developers can test their software, fonts, keyboards

Some of this information is available, but is scattered around among a variety of web sites that have different purposes and structures, and often lies undocumented in the minds of individual script experts, or hidden in library books.

This information is also often segregated by audience. A font designer may be frustrated to find that available resources on a script address the spoken/written language relationship, but not the background and visual rules of the letterforms. A linguist may find information on encoding the script – such as the information in The Unicode Standard – but not important details of which languages use which symbols. An application developer may find a long writeup on the development and use of the script, but nothing to tell them what script behaviours are required.

There are also relatively few opportunities for experts from these fields to cooperate and work together. What interaction does exist often happens at conferences, on various mailing lists and forums, and through personal email. There are few experts who have the time to participate in these exchanges, and those that do may be frustrated to find that the same questions keep coming up again and again. Until now, there has been no place where this knowledge can be captured, organised and maintained.

The purpose of ScriptSource

ScriptSource exists to provide this information and bridge the gap between the designer, developer, linguist and user. It seeks to document the writing systems of the world and help those wanting to implement them on computers and other devices.

The initial content is relatively sparse, but includes basic information on all scripts in the ISO 15924 standard. It will grow dynamically through public submissions, expert content development and live linkages with other web sites. Rather than being just another web site about writing systems, ScriptSource provides a single hub of information where both old and new content can be found.

A truly remarkable resource on writing systems by SIL International.

You can think of ScriptSource as a way to locate fonts, but you may be drawn into complexities others rarely see!


GNU Unifont Glyphs [Good News/Bad News]

Thursday, January 19th, 2017

GNU Unifont Glyphs 9.0.06.

From the webpage:

GNU Unifont is part of the GNU Project. This page contains the latest release of GNU Unifont, with glyphs for every printable code point in the Unicode 9.0 Basic Multilingual Plane (BMP). The BMP occupies the first 65,536 code points of the Unicode space, denoted as U+0000..U+FFFF. There is also growing coverage of the Supplemental Multilingual Plane (SMP), in the range U+010000..U+01FFFF, and of Michael Everson’s ConScript Unicode Registry (CSUR).
… (red highlight in original)

That’s the good news.

The bad news is shown by the coverage mapping:

0.0%  U+012000..U+0123FF  Cuneiform*
0.0%  U+012400..U+01247F  Cuneiform Numbers and Punctuation*
0.0%  U+012480..U+01254F  Early Dynastic Cuneiform*
0.0%  U+013000..U+01342F  Egyptian Hieroglyphs*
0.0%  U+014400..U+01467F  Anatolian Hieroglyphs*

These scripts will require a 32-by-32 pixel grid:

*Note: Scripts such as Cuneiform, Egyptian Hieroglyphs, and Bamum Supplement will not be drawn on a 16-by-16 pixel grid. There are plans to draw these scripts on a 32-by-32 pixel grid in the future.
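UCD assignment is separate from font coverage, but the block sizes behind the percentages above can be checked by counting assigned code points with Python's `unicodedata` (a sketch; it tells you how many glyphs a font would need, not what the font has drawn):

```python
import unicodedata

def assigned(start, end):
    """Count code points in [start, end] the bundled UCD assigns names to."""
    count = 0
    for cp in range(start, end + 1):
        try:
            unicodedata.name(chr(cp))
            count += 1
        except ValueError:  # unassigned code point
            pass
    return count

print('Egyptian Hieroglyphs:', assigned(0x13000, 0x1342F))
print('Cuneiform:           ', assigned(0x12000, 0x123FF))
```

That is over a thousand hieroglyphs and many hundreds of cuneiform signs still waiting for their 32-by-32 grids.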

One additional resource on creating cuneiform fonts:

Creating cuneiform fonts with MetaType1 and FontForge by Karel Píška:


A cuneiform font collection covering Akkadian, Ugaritic and Old Persian glyph subsets (about 600 signs) has been produced in two steps. With MetaType1 we generate intermediate Type 1 fonts, and then construct OpenType fonts using FontForge. We describe cuneiform design and the process of font development.

On creating fonts more generally with FontForge, see: Design With FontForge.


Your assignment, should you choose to accept it….

Friday, August 26th, 2016

You may (or may not) remember the TV show, Mission Impossible. It had a cast of regulars who formed a spy team to undertake “impossible” tasks that could not be traced back to the U.S. government.

Stories like: BAE Systems Sells Internet Surveillance Gear to United Arab Emirates make me wish for a non-nationalistic, modern equivalent of the Mission Impossible team.

You may recall the United Arab Emirates (UAE) were behind the attempted hack of Ahmed Mansoor, a prominent human rights activist.

So much for the UAE needing spyware for legitimate purposes.

From the article:

In a written statement, BAE Systems said, “It is against our policy to comment on contracts with specific countries or customers. BAE Systems works for a number of organizations around the world, within the regulatory frameworks of all relevant countries and within our own responsible trading principles.”

The Danish Business Authority told Andersen it found no issue approving the export license to the Ministry of the Interior of the United Arab Emirates after consulting with the Danish Ministry of Foreign Affairs, despite regulations put in place by the European Commission in October 2014 to control exports of spyware and internet surveillance equipment out of concern for human rights. The ministry told Andersen in an email it made a thorough assessment of all relevant concerns and saw no reason to deny the application.

It doesn’t sound like any sovereign government is going to restrain BAE Systems and/or the UAE.

Consequences for their mis-deeds will have to come from other quarters.

Like the TV show started every week:

Your assignment, should you choose to accept it….

Unicode® Standard, Version 9.0

Wednesday, July 6th, 2016

Unicode® Standard, Version 9.0

From the webpage:

Version 9.0 of the Unicode Standard is now available. Version 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.

The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including:

  • Osage, a Native American language
  • Nepal Bhasa, a language of Nepal
  • Fulani and other African languages
  • The Bravanese dialect of Swahili, used in Somalia
  • The Warsh orthography for Arabic, used in North and West Africa
  • Tangut, a major historic script of China

Important symbol additions include:

  • 19 symbols for the new 4K TV standard
  • 72 emoji characters such as the following

Why they choose to omit the bacon emoji from the short list is a mystery to me:


Get your baking books out! I see missing bread emojis. 😉
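The bacon emoji did make it into Unicode 9.0, at U+1F953; on a Python whose bundled UCD is 9.0 or later you can confirm it by name:

```python
import unicodedata

print(unicodedata.unidata_version)     # UCD version bundled with this Python
print(unicodedata.name('\U0001F953'))  # BACON
```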

“invisible entities having arcane but gravely important significances”

Sunday, June 19th, 2016

Allison Parrish tweeted: the “Other, Format” unicode category, full of invisible entities having arcane but gravely important significances

I just could not let a tweet with:

“invisible entities having arcane but gravely important significances”

pass without comment!

As of today, there are one hundred and fifty (150) such entities, all with multiple properties.
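You can enumerate the invisible entities yourself with Python's `unicodedata`; the exact total depends on the UCD version your interpreter ships, and it has only grown since this post:

```python
import sys
import unicodedata

# The "Other, Format" (Cf) category: invisible but significant characters
# such as SOFT HYPHEN, ZERO WIDTH JOINER, and the bidi controls.
cf = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == 'Cf']
print(len(cf))
print(f'U+{cf[0]:04X}', unicodedata.name(chr(cf[0])))  # U+00AD SOFT HYPHEN
```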

How many of these “invisible entities” are familiar to you?

Unicode Code Chart Reviewers Needed – Now!

Tuesday, May 17th, 2016

I saw an email from Rick McGowan of the Unicode Consortium that reads:

As we near the release of Unicode 9.0, we’re looking for volunteers to review the latest code charts for regressions from the 8.0 charts… If you have a block that you’re particularly fond of, please consider checking the glyphs and names against the 8.0 charts… To see the latest 9.0 charts, you can start here:

The “blocks” directory has all of the individual block charts, and the charts with specific additions/changes are here:

Not for everyone but if you can contribute, please do.

Just so you know, this is the 25th anniversary of the Unicode Consortium!

Even if you don’t proof the code charts, do remember to wish the Unicode Consortium a happy 25th anniversary!

UTF-8 encoding table and Unicode characters

Friday, April 22nd, 2016

UTF-8 encoding table and Unicode characters

The mapping between UTF-8 and binary representations doesn’t come up often, but it did today.

Rather than hunting through bookmarks in the future, I am capturing this resource here.
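The encoding itself is easy to inspect from code; a quick Python sketch that prints the bit patterns of a character's UTF-8 bytes:

```python
def utf8_bits(ch):
    """Render a character's UTF-8 bytes as bit patterns."""
    return f"U+{ord(ch):04X} -> " + " ".join(
        f"{b:08b}" for b in ch.encode('utf-8'))

print(utf8_bits('A'))  # one byte:    0xxxxxxx
print(utf8_bits('é'))  # two bytes:   110xxxxx 10xxxxxx
print(utf8_bits('€'))  # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
```

The leading bits of the first byte announce the sequence length, and every continuation byte starts with 10, which is what makes UTF-8 self-synchronizing.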

Helmification of XML Unicode

Friday, January 8th, 2016

XML Unicode by Norman Walsh.

From the webpage:

XML Unicode provides some convenience methods for inserting Unicode characters. When it started, the focus was on characters that were traditionally inserted with named character entities, things like é.

In practice, and in the age of UTF-8, the “insert unicode character” function, especially the Helm-enabled version, is much more broadly useful.

You’re most likely going to want to bind some or all of them to keys.

Complete with suggested key bindings!

Oh, the image from Norman’s tweet:


FYI, the earliest use of helm-ification (note the hyphen) I can find was on November 24, 2015 by Christian Romney. Citation authorities remain split on whether Christian’s helm-ification or Norman’s helmification is the correct usage. 😉

Unicode to LaTeX

Wednesday, December 2nd, 2015

Unicode to LaTeX by John D. Cook.

From the post:

I’ve run across a couple web sites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.
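That prefix-based dispatch is simple to sketch; a hypothetical miniature in Python with a three-entry table (John's actual table is of course much larger):

```python
# One table, looked up in either direction depending on the input's prefix.
TABLE = {0x03B1: r'\alpha', 0x00D7: r'\times', 0x2192: r'\rightarrow'}
REVERSE = {v: k for k, v in TABLE.items()}

def convert(s):
    if s.startswith('U+'):                  # Unicode code point -> LaTeX
        return TABLE.get(int(s[2:], 16), '?')
    if s.startswith('\\'):                  # LaTeX command -> code point
        cp = REVERSE.get(s)
        return f'U+{cp:04X}' if cp is not None else '?'
    raise ValueError('expected "U+XXXX" or a LaTeX command')

print(convert('U+03B1'))   # \alpha
print(convert(r'\times'))  # U+00D7
```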

I am having trouble visualizing when I would need to go from Unicode to LaTeX but on the off-chance that I find myself in that situation, I wanted to note John’s conversion page.

Knowing my luck, just after this post is pushed off the front page of the blog I will have need of it. 😉

Internationalization & Unicode Conference ICU 39

Thursday, June 25th, 2015

Internationalization & Unicode Conference ICU 39

October 26-28, 2015 – Santa Clara, CA USA

From the webpage:

The Internationalization and Unicode® Conference (IUC) is the premier event covering the latest in industry standards and best practices for bringing software and Web applications to worldwide markets. This annual event focuses on software and Web globalization, bringing together internationalization experts, tools vendors, software implementers, and business and program managers from around the world. 

Expert practitioners and industry leaders present detailed recommendations for businesses looking to expand to new international markets and those seeking to improve time to market and cost-efficiency of supporting existing markets. Recent conferences have provided specific advice on designing software for European countries, Latin America, China, India, Japan, Korea, the Middle East, and emerging markets.

This highly rated conference features excellent technical content, industry-tested recommendations and updates on the latest standards and technology. Subject areas include web globalization, programming practices, endangered languages and un-encoded scripts, integrating with social networking software, and implementing mobile apps. This year’s conference will also highlight new features in Unicode and other relevant standards. 

In addition, please join us in welcoming over 20 first-time speakers to the program! This is just another reason to attend; fresh talks, fresh faces, and fresh ideas!

(emphasis and colors in original)

If you want your software to be an edge case and hard to migrate in the future, go ahead, don’t support Unicode. Unicode libraries exist in all the major and many minor programming languages. Not supporting Unicode isn’t simpler, it’s just dumber.

Sorry, I have been a long time follower of the Unicode work and an occasional individual member of the Consortium. Those of us old enough to remember pre-Unicode days want to lessen the burden of interchanging texts, not increase it.

Enjoy the conference!

Unicode 8 – Coming Next Week!

Friday, June 12th, 2015

Unicode 8 will be released next week. Rick McGowan has posted directions to code charts for final review:

For the complete archival charts, as a single-file 100MB file, or as individual block files, please see the charts directory here:

For the set of “delta charts” only with highlighting for changes please see:

(NOTE: There is a known problem viewing the charts using the PDF Viewer plugin for Firefox on the Mac platform.)

And the 8.0 beta UCD files are also available for cross-reference:

The draft version page is here:

From the draft version homepage:

Unicode 8.0 adds a total of 7,716 characters, encompassing six new scripts and many new symbols, as well as character additions to several existing scripts. Notable character additions include the following:

  • A set of lowercase Cherokee syllables, forming case pairs with the existing Cherokee characters
  • A large collection of CJK unified ideographs
  • Emoji symbols and symbol modifiers for implementing skin tone diversity; see Unicode Emoji.
  • Georgian lari currency symbol
  • Letters to support the Ik language in Uganda, Kulango in the Côte d’Ivoire, and other languages of Africa
  • The Ahom script for support of the Tai Ahom language in India
  • Arabic letters to support Arwi—the Tamil language written in the Arabic script

Other important updates in Unicode Version 8.0 include:

  • Change in encoding model of New Tai Lue to visual order


Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 8.0, as well as other modifications:

If you have the time this weekend, take a quick look.
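One of the additions above, the Cherokee case pairs, is visible from any Python whose bundled UCD is 8.0 or later:

```python
# CHEROKEE LETTER A (U+13A0) gained a lowercase partner at U+AB70 in
# Unicode 8.0, and the default case mapping pairs them.
upper = '\u13A0'
lower = upper.lower()
print(f'U+{ord(upper):04X} -> U+{ord(lower):04X}')  # U+13A0 -> U+AB70
```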

Unicode 7.0 Core Specification (paperback)

Friday, January 16th, 2015


The Unicode 7.0 core specification is now available in paperback book form.

Responding to requests, the editorial committee has created a pair of modestly-priced print-on-demand volumes that contain the complete text of the core specification of Version 7.0 of the Unicode Standard.

The form-factor in this edition has been changed from US letter to 6×9 inch US trade paperback size, making the two volumes more compact than previous versions. The two volumes may be purchased separately or together. The cost for the pair is US$16.27, plus postage and applicable taxes. Please visit to order.

Note that these volumes do not include the Version 7.0 code charts, nor do they include the Version 7.0 Standard Annexes and Unicode Character Database, all of which are available only on the Unicode website,

Even with the aggressive pricing, I don’t see this getting onto the best seller list. 😉

It should be on the best seller list! The current version is the result of decades of work by Consortium staff and many volunteers.


PS: Blog about this at your site and/or forward to your favorite mailing list. Typographers, programmers, editors and the computer literate should have a basic working knowledge of Unicode.

Unicode Version 7.0…

Wednesday, October 8th, 2014

Unicode Version 7.0 – Complete Text of the Core Specification Published

From the post:

The Unicode® Consortium announces the publication of the core specification for Unicode 7.0. The Version 7.0 core specification contains significant changes:

  • Major reorganization of the chapters and overall layout
  • New page size tailored for easy viewing on e-readers and other mobile devices
  • Addition of twenty-two new scripts and a shorthand writing system
  • Alignment with updates to the Unicode Bidirectional Algorithm

In Version 7.0, the standard grew by 2,834 characters. This version continues the Unicode Consortium’s long-term commitment to support the full diversity of languages around the world with its newly encoded scripts and other additional characters. The text of the latest version documents two newly adopted currency symbols: the manat, used in Azerbaijan, and the ruble, used in Russia and other countries. It also includes information about newly added pictographic symbols, geometric symbols, arrows and ornaments.

This version of the Standard brings technical improvements to support implementers, including further clarification of the case pair stability policy, and a new stability policy for Numeric_Type.

All other components of Unicode 7.0 were released on June 16, 2014: the Unicode Standard Annexes, code charts, and the Unicode Character Database, to allow vendors to update their implementations of Unicode 7.0 as early as possible. The release of the core specification completes the definitive documentation of the Unicode Standard, Version 7.0.

For more information on all of The Unicode Standard, Version 7.0, see

For non-backtick + Unicode character applications, this is good news!

Following the Unicode standard should be the first test for consideration of an application. The time for ad hoc character hacks passed a long time ago.
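The two newly adopted currency symbols mentioned in the announcement are easy to confirm from Python's `unicodedata` (assuming a bundled UCD of 7.0 or later):

```python
import unicodedata

# U+20BC and U+20BD were both added in Unicode 7.0.
for ch in ('\u20BC', '\u20BD'):
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
```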

Juju Charm (HPCC Systems)

Friday, August 8th, 2014

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.


5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here:
The HPCC Systems platform can be found here:

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u’ma’ AND u’me’), it would bring back records for ‘mai’, ‘Mai’ and ‘may’. However a query on the index for that dataset, idx(ufield BETWEEN u’ma’ AND u’me’), only brings back a record for ‘mai’.

This is a result of the way unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches or is less than or greater than another value. Integers are stored in big-endian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using the ICU locale sensitive comparisons to ensure correct ordering. However, index lookup operations need to be fast and therefore the lookup operations perform binary comparisons on fixed length blocks of data. Equality checks will return data correctly, but queries involving between, > or < may fail.

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
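The failure mode described above is easy to reproduce outside ECL; a Python sketch contrasting a byte-wise range check with a crude, case-insensitive stand-in for ICU collation:

```python
# Records that a locale-aware comparison puts between 'ma' and 'me'.
names = ['mai', 'Mai', 'may']

# Case-insensitive "collation" (a crude stand-in for ICU): all three match.
collated = [s for s in names if 'ma' <= s.casefold() <= 'me']

# Byte-wise comparison, as a fast index lookup might do: 'M' is 0x4D,
# which sorts before 'm' (0x6D), so 'Mai' silently falls out of the range.
bytewise = [s for s in names if b'ma' <= s.encode('utf-8') <= b'me']

print(collated)  # ['mai', 'Mai', 'may']
print(bytewise)  # ['mai', 'may']
```

Same data, same query, different answers, which is exactly the mismatch between dataset filtering and index lookup that the HPCC page warns about.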

Alphabetical Order

Tuesday, July 29th, 2014

Alphabetical order explained in a mere 27,817 words by David Weinberger.

From the post:

This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.

Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.

As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.

The best part is the summary of Unicode Technical Standard #10:

This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in their full context, where that context is history, language, culture, and the ambiguity in which they thrive.

We all learned the meaning of “alphabetical order” in elementary school. But which “alphabetical order” depends upon language, culture, context, etc.
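The gap between code-point order and what a reader expects is easy to demonstrate; a Python sketch contrasting raw code-point sorting with a crude stand-in for a collation key (the real UCA is far more involved):

```python
import unicodedata

words = ['Zebra', 'apple', 'Éclair', 'banana']

def collation_key(s):
    # Crude "alphabetical" key: strip accents via NFD, ignore case.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed
                   if not unicodedata.combining(c)).casefold()

print(sorted(words))                      # code points: 'Zebra' first, 'Éclair' last
print(sorted(words, key=collation_key))  # apple, banana, Éclair, Zebra
```

Even this toy ignores language-specific rules (German vs. Swedish treatment of 'ä', say), which is precisely the complexity UTS #10 dives into.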

Other terms and phrases have the same problem. But the vast majority of them have no Unicode Technical Report with all the possible meanings.

For those terms there are topic maps.

I first saw this in a tweet by Computer Science.

Unicode Character Table

Wednesday, June 4th, 2014

Unicode Character Table

A useful webpage that I first saw in a tweet by Scott Chamberlain.

Displays Unicode characters on “buttons” that when selected displays the Unicode Hex code and HTML code for the selected character.

Quite useful when all you need is one entity value for a post.

If you need more information try Unicode Table – The Unicode Character Reference, which for “Latin Small Letter D” displays:

Unicode Character Information
  Unicode Hex: U+0064
  General Category: Lowercase Letter [Code: Ll]
  Canonical Combining Class: 0
  Bidirectional Category: L
  Mirrored: N
  Uppercase Version: U+0044
  Titlecase Version: U+0044

Unicode Character Encodings
  HTML Entity: &#100; (decimal entity), &#x0064; (hex entity)
  Windows Key Code: Alt 0100 or Alt +0064
  Programming Source Code Encodings: Python hex: u"\u0064", hex for C++ and Java: "\u0064"
  UTF-8 Hexadecimal Encoding: 0x64
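
Much of that table can be reproduced from a script with Python's standard unicodedata module, which reads the same Unicode Character Database. A small sketch for the same character:

```python
import unicodedata

ch = "d"
cp = ord(ch)                                     # 100, i.e. U+0064

print(f"Unicode Hex: U+{cp:04X}")                # U+0064
print("General Category:", unicodedata.category(ch))            # Ll
print("Canonical Combining Class:", unicodedata.combining(ch))  # 0
print("Bidirectional Category:", unicodedata.bidirectional(ch)) # L
print(f"Uppercase Version: U+{ord(ch.upper()):04X}")            # U+0044
print(f"HTML Entity: &#{cp}; or &#x{cp:04X};")                  # &#100; or &#x0064;
print("UTF-8:", "0x" + ch.encode("utf-8").hex())                # 0x64
```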

Or if you need all the information available on Unicode and to know it is the canonical information, see

(String/text processing)++:…

Thursday, May 15th, 2014

(String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

From the post:

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limiting to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in our minds.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.
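
stringi is an R package, but one item on that list, Unicode normalization, is easy to illustrate with Python's standard library: the same visible text can be stored as different code-point sequences, and normalization reconciles them.

```python
import unicodedata

composed = "\u00e9"       # é as a single precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```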

Interesting, isn’t it? How CS keeps circling back to strings?


Character(s) in Unicode 6.3.0

Wednesday, December 18th, 2013

Search for character(s) in Unicode 6.3.0 by Tomas Schild.

A site that allows you to search the latest Unicode character set by:

  • Word or phrase from the official Unicode character name
  • Word or phrase from the old, deprecated Unicode 1.0 character name
  • A single character
  • The hexadecimal value of the Unicode position
  • A numerical value

When you need just one or two characters to encode for HTML, this could be very handy.

Be aware that the search engine does not compensate for spelling differences in the Unicode character list.

Thus, a search for “aleph” returns:

code point | Unicode character name
U+1202A (UTF-8: f0 92 80 aa) | CUNEIFORM SIGN ALEPH

Whereas a search for “alef” returns:

128 characters found

code point | Unicode character name
Remaining 121 characters omitted

Semitic alphabets all contain the alef/aleph character, which represents a glottal stop.

I have no immediate explanation for why the Unicode standard chose different names for the same character in different languages.

But, be aware that it does happen.
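
Python's standard unicodedata module inherits the same naming quirks, since it reads names straight from the Unicode Character Database. A small sketch:

```python
import unicodedata

# The Hebrew and Arabic letters both use the "ALEF" spelling:
print(unicodedata.name("\u05d0"))   # HEBREW LETTER ALEF
print(unicodedata.name("\u0627"))   # ARABIC LETTER ALEF

# Lookup works the other way too, but only with the exact spelling:
print(unicodedata.lookup("HEBREW LETTER ALEF") == "\u05d0")  # True

# A crude substring search over the Hebrew block:
for cp in range(0x0590, 0x0600):
    name = unicodedata.name(chr(cp), "")
    if "ALEF" in name:
        print(f"U+{cp:04X} {name}")
```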

BTW, I modified the tables to omit the character and other fields.

WordPress seems to have difficulty with Imperial Aramaic, Inscriptional Parthian, Inscriptional Pahlavi, and Cuneiform code points for aleph.

Unicode Standard, Version 6.3

Tuesday, October 1st, 2013

Unicode Standard, Version 6.3

From the post:

The Unicode Consortium announces Version 6.3 of the Unicode Standard and with it, significantly improved bidirectional behavior. The updated Version 6.3 Unicode Bidirectional Algorithm now ensures that pairs of parentheses and brackets have consistent layout and provides a mechanism for isolating runs of text.

Based on contributions from major browser developers, the updated Bidirectional Algorithm and five new bidi format characters will improve the display of text for hundreds of millions of users of Arabic, Hebrew, Persian, Urdu, and many others. The display and positioning of parentheses will better match the normal behavior that users expect. By using the new methods for isolating runs of text, software will be able to construct messages from different sources without jumbling the order of characters. The new bidi format characters correspond to features in markup (such as in CSS). Overall, these improvements also bring greater interoperability and an improved ability for inserting text and assembling user interface elements.

The improvements come with new rigor: the Consortium now offers two reference implementations and greatly improved testing and test data.

In a major enhancement for CJK usage, this new version adds standardized variation sequences for all 1,002 CJK compatibility ideographs. These sequences address a well-known issue of the CJK compatibility ideographs — that they could change their appearance when any process normalized the text. Using the new standardized variation sequences allows authors to write text which will preserve the specific required shapes of these CJK ideographs, even under Unicode normalization.

Version 6.3 includes other improvements as well:

  • Improved Unihan data to better align with ISO/IEC 10646
  • Better support for Hebrew word break behavior and for ideographic space in line breaking

Get started with Unicode 6.3 today!

Now, there’s an interesting data set!

Much of the convenience you now experience with digital texts is due to the under-appreciated efforts of the Unicode project.
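
The bidirectional categories the algorithm consumes are per-character properties in the Unicode Character Database, and Python's standard unicodedata module exposes them. A small sketch (the two isolate characters shown are among the bidi format characters added alongside the 6.3 algorithm):

```python
import unicodedata

print(unicodedata.bidirectional("A"))       # 'L'  (left-to-right)
print(unicodedata.bidirectional("\u05d0"))  # 'R'  (right-to-left; Hebrew alef)
print(unicodedata.bidirectional("\u0627"))  # 'AL' (Arabic letter)
print(unicodedata.bidirectional("3"))       # 'EN' (European number)

# Two of the isolate format characters for isolating runs of text:
print(unicodedata.name("\u2066"))           # LEFT-TO-RIGHT ISOLATE
print(unicodedata.name("\u2069"))           # POP DIRECTIONAL ISOLATE
```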

Character Sorted Table….

Wednesday, June 19th, 2013

Character Sorted Table Showing Entity Names and Unicode Values

I often need to look up just one character and guessing which part of Unicode will have it is a pain.

I found this thirty-seven (37) page summary of characters with entity names and Unicode values at the U. S. Government Printing Office (GPO).

I could not find any directories above it with an index page or pointers to this file.

I have not verified the entries in this listing. Use at your own risk.

Common Locale Data Repository (CLDR) 23.1

Friday, May 17th, 2013

Common Locale Data Repository (CLDR) 23.1

From the CLDR project homepage:

What is CLDR?

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes:

  • Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, timezones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, postal and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layouts

CLDR uses the XML format provided by UTS #35: Unicode Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.

For a set of slides on the technical contents of CLDR, see Overview.

Great set of widely used mappings between locale data.

unicodex — High-performance Unicode Library (C++)

Monday, February 11th, 2013

unicodex — High-performance Unicode Library (C++) by Dustin Juliano.

From the post:

The following is a micro-optimized Unicode encoder/decoder for C++ that is capable of significant performance, sustaining 6 GiB/s for UTF-8 to UTF-16/32 on an AMD A8-3870 running in a single thread, and 8 GiB/s for UTF-16 to UTF-32. That would allow it to encode nearly the full English Wikipedia in approximately 6 seconds.

It maps between UTF-8, UTF-16, and UTF-32, and properly detects UTF-8 BOM and the UTF-16 BOMs. It has been unit tested with gigabytes of data and verified with binary analysis tools. Presently, only little-endian is supported, which should not pose any significant limitations on use. It is released under the BSD license, and can be used in both proprietary and free software projects.

The decoder is aware of malformed input and will raise an exception if the input sequence would cause a buffer overflow or is otherwise fatally incorrect. It does not, however, ensure that exact codepoints correspond to the specific Unicode planes; this is by design. The implementation has been designed to be robust against garbage input and specifically avoid encoding attacks.
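
unicodex itself is C++, but the UTF-8/16/32 mappings and BOM handling it implements can be sketched with Python's built-in codecs module, just to show what the conversions involve (at nowhere near 6 GiB/s, of course):

```python
import codecs

text = "Ünïcode"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")      # native byte order, BOM prepended
utf32 = text.encode("utf-32-le")   # explicit endianness, no BOM

# Detect the BOM, then round-trip every encoding back to the same text:
assert utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(utf8.decode("utf-8") == text)       # True
print(utf16.decode("utf-16") == text)     # True
print(utf32.decode("utf-32-le") == text)  # True
```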

One of those “practical” things that you may need for processing topic maps and/or other digital information. 😉

Unicode 6.2.0 Available

Friday, November 23rd, 2012

Unicode 6.2.0 Available

From the post:

Version 6.2 of the Unicode Standard is a special release dedicated to the early publication of the newly encoded Turkish lira sign. This version also rolls in various minor corrections for errata and other small updates for the Unicode Character Database. In addition, there are some significant changes to the Unicode algorithms for text segmentation and line breaking, including changes to the line break property to improve line breaking for emoji symbols.
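
The lira sign itself is U+20BA; with a reasonably recent Python (one whose unicodedata tables include Unicode 6.2 or later) you can confirm it directly:

```python
import unicodedata

lira = "\u20ba"
print(unicodedata.name(lira))             # TURKISH LIRA SIGN
print("0x" + lira.encode("utf-8").hex())  # 0xe282ba
```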

Just in case you don’t follow Unicode releases closely.

The character set against which all others should be mapped.