Archive for the ‘Unicode’ Category

Fun, Frustration, Curiosity, Murderous Rage – mimic

Monday, January 15th, 2018


From the webpage:

There are many more characters in the Unicode character set that look, to some extent or another, like others – homoglyphs. Mimic substitutes common ASCII characters for obscure homoglyphs.

Fun games to play with mimic:

  • Pipe some source code through and see if you can find all of the problems
  • Pipe someone else’s source code through without telling them
  • Be fired, and then killed
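For a sense of what mimic does, here is a toy homoglyph substitution in the same spirit. The table below is hypothetical, chosen for illustration, and is not mimic's actual mapping:

```python
# A few ASCII characters and visually similar Unicode lookalikes.
HOMOGLYPHS = {
    ';': '\u037E',  # GREEK QUESTION MARK, looks just like a semicolon
    'a': '\u0430',  # CYRILLIC SMALL LETTER A
    '-': '\u2010',  # HYPHEN, not the ASCII HYPHEN-MINUS
}

def mimic(text, table=HOMOGLYPHS):
    """Replace ASCII characters with homoglyphs from the table."""
    return ''.join(table.get(c, c) for c in text)

src = 'a = b - c;'
out = mimic(src)
print(out)          # renders identically in many fonts...
print(src == out)   # False: the strings only *look* the same
```

Pipe a co-worker's source through that and the compiler errors will be as baffling as the errant SGML space below.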

I can attest to the murderous rage from experience. There was a browser-based SGML parser that would barf on an extra whitespace character (a space, I think) in the SGML declaration. One file worked; another with the “same” declaration did not.

Only by printing and comparing the files (this was on Windoze machines) was the errant space discovered.


Shape Searching Dictionaries?

Thursday, November 16th, 2017

Facebook, despite its spying, censorship, and being a shill for the U.S. government, isn’t entirely without value.

For example, this post by Simon St. Laurent:

Drew this response from Peter Cooper:

If you follow the link, Shapecatcher: Unicode Character Recognition, you find:

Draw something in the box!

And let shapecatcher help you to find the most similar unicode characters!

Currently, there are 11817 unicode character glyphs in the database. Japanese, Korean and Chinese characters are currently not supported.
(emphasis in original)

I take “Japanese, Korean and Chinese characters are currently not supported.” to mean that Anatolian Hieroglyphs; Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform, Old Persian, Ugaritic; Egyptian Hieroglyphs; and Meroitic Cursive and Meroitic Hieroglyphs are not supported either.

But my first thought wasn’t discovery of glyphs in Unicode Code Charts, although useful, but shape searching dictionaries, such as Faulkner’s A Concise Dictionary of Middle Egyptian.

A sample from Faulkner’s (1991 edition):

Or, The Student’s English-Sanskrit Dictionary by Vaman Shivram Apte (1893):

Imagine being able to search by shape in either dictionary! Not just for a single glyph but for a set of glyphs, within any entry!

I suspect that’s doable based on Benjamin Milde‘s explanation of Shapecatcher:

Under the hood, Shapecatcher uses so called “shape contexts” to find similarities between two shapes. Shape contexts, a robust mathematical way of describing the concept of similarity between shapes, is a feature descriptor first proposed by Serge Belongie and Jitendra Malik.

You can find an in-depth explanation of the shape context matching framework that I used in my Bachelor thesis (“On the Security of reCAPTCHA”). In the end, it is quite a bit different from the matching framework that Belongie and Malik proposed in 2000, but still based on the idea of shape contexts.

The engine that runs this site is a rewrite of what I developed during my bachelor thesis. To make things faster, I used CUDA to accelerate some portions of the framework. This is a fairly new technology that enables me to use my NVIDIA graphics card for general purpose computing. Newer cards are quite powerful devices!
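The underlying idea is approachable: a shape context is a log-polar histogram, computed at each sample point, of where the shape's other sample points sit relative to it. A toy sketch of that descriptor (not Milde's implementation, and omitting the matching step entirely):

```python
import math

def shape_context(points, n_r=5, n_theta=12):
    """For each point, histogram the relative positions of every other
    point into log-polar (log-radius x angle) bins."""
    # Mean pairwise distance normalizes the radial scale.
    pairs = [(p, q) for i, p in enumerate(points)
                    for j, q in enumerate(points) if i != j]
    mean_d = sum(math.dist(p, q) for p, q in pairs) / len(pairs)
    descriptors = []
    for i, p in enumerate(points):
        hist = [[0] * n_theta for _ in range(n_r)]
        for j, q in enumerate(points):
            if i == j:
                continue
            r = math.dist(p, q) / mean_d
            theta = math.atan2(q[1] - p[1], q[0] - p[0]) % (2 * math.pi)
            # Clamp r to [1/8, 2] and bin its log2, so nearby structure
            # gets finer radial resolution than distant structure.
            log_r = math.log2(min(max(r, 0.125), 2.0))        # in [-3, 1]
            r_bin = min(int((log_r + 3) / 4 * n_r), n_r - 1)
            t_bin = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
            hist[r_bin][t_bin] += 1
        descriptors.append(hist)
    return descriptors

corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
descs = shape_context(corners)
print(sum(sum(row) for row in descs[0]))  # 3: every other point binned once
```

Matching two shapes then reduces to comparing histograms, which is what makes the approach robust to small distortions in a hand-drawn glyph.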

That was written in 2011 and no doubt shape matching has progressed since then.

No technique will be 100% but even less than 100% accuracy will unlock generations of scholarly dictionaries, in ways not imagined by their creators.

If you are interested, I’m sure Benjamin Milde would love to hear from you.

Unicode Egyptian Hieroglyphic Fonts

Monday, October 16th, 2017

Unicode Egyptian Hieroglyphic Fonts by Bob Richmond.

From the webpage:

These fonts all contain the Unicode 5.2 (2009) basic set of Egyptian Hieroglyphs.

Please contact me if you know of any others, or information to include.

Also of interest:

UMdC Coding Manual for Egyptian Hieroglyphic in Unicode

UMdC (Unicode MdC) aims to provide guidelines for encoding Egyptian Hieroglyphic and related scripts in Unicode using plain text with optional lightweight mark-up.

This GitHub project is the central point for development of UMdC and associated resources. Features of UMdC are still in a discussion phase so everything here should be regarded as preliminary and subject to change. As such the project is initially oriented towards expert Egyptologists and software developers who wish to help ensure the ancient Egyptian writing system is well supported in modern digital media.

The Manuel de Codage (MdC) system for digital encoding of Ancient Egyptian textual data was adopted as an informal standard in the 1980s and has formed the basis for most subsequent digital encodings, sometimes using extensions or revisions to the original scheme. UMdC links to the traditional methodology in various ways to help with the transition to Unicode-based solutions.

As with the original MdC system, UMdC data files (.umdc) can be viewed and edited in standard text editors (such as Windows Notepad) and the HTML <textarea></textarea> control. Specialist software applications can be adapted or developed to provide a simpler workflow or enable additional techniques for working with the material.

Also see UMdC overview [pdf].

A UMdC-compatible hieroglyphic font Aaron UMdC Alpha (relative to the current draft) can be downloaded from the Hieroglyphs Everywhere Fonts project.

For news and information on Ancient Egyptian in Unicode see

I understand the need for “plain text” viewing of hieroglyphics, especially for primers and possibly for search engines, but Egyptian hieroglyphs can be written facing right or left, top to bottom and more rarely bottom to top. Moreover, artistic and other considerations can result in transposition of glyphs out of their “linear” order in a Western reading sense.

Unicode hieroglyphs are a major step forward for the interchange of hieroglyphic texts but we should remain mindful “linear” presentation of inscription texts is a far cry from their originals.
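That linear encoding is easy to inspect from Python's `unicodedata`, assuming an interpreter whose bundled UCD is 5.2 or later; note the code points sit above U+FFFF, in the SMP:

```python
import unicodedata

# Egyptian Hieroglyphs occupy U+13000..U+1342F in the SMP; the first few:
for cp in range(0x13000, 0x13005):
    print(f'U+{cp:05X} {unicodedata.name(chr(cp))}')
```

The names follow the Gardiner sign list (A001, A002, ...), which is itself a linear cataloguing convention imposed on a decidedly non-linear writing system.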

The greater our capacity for graphic representation, the more we simplify complex representations from the past. Are the needs of our computers really that important?

Unicode 10.0 Beta Review

Thursday, March 9th, 2017

Unicode 10.0 Beta Review

In today’s mail:

The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones—plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.). The Unicode Standard, its associated standards, and data form the foundation for CLDR and ICU releases. Thus it is important to ensure a smooth transition to each new version of the standard.

Unicode 10.0 includes a number of changes. Some of the Unicode Standard Annexes have modifications for Unicode 10.0, often in coordination with changes to character properties. In particular, there are changes to UAX #14, Unicode Line Breaking Algorithm, UAX #29, Unicode Text Segmentation, and UAX #31, Unicode Identifier and Pattern Syntax. In addition, UAX #50, Unicode Vertical Text Layout, has been newly incorporated as a part of the standard. Four new scripts have been added in Unicode 10.0, including Nüshu. There are also 56 additional emoji characters, a major new extension of CJK ideographs, and 285 hentaigana, important historic variants for Hiragana syllables.

Please review the documentation, adjust your code, test the data files, and report errors and other issues to the Unicode Consortium by May 1, 2017. Feedback instructions are on the beta page.

See for more information about testing the 10.0.0 beta.

See for the current draft summary of Unicode 10.0.0.

It’s not too late for you to contribute to the Unicode party! There’s plenty of reviewing to do and by no means has all the work been done!

For this particular version, comments are due by May 1, 2017.


ScriptSource [Fonts but so much more]

Thursday, January 19th, 2017


From the about page:

ScriptSource is a dynamic, collaborative reference to the writing systems of the world, with detailed information on scripts, characters, languages – and the remaining needs for supporting them in the computing realm. It is sponsored, developed and maintained by SIL International. It currently contains only a skeleton of information, and so depends on your participation in order to grow and assist others.

The need for information on Writing Systems

In today’s expanding global community, designers, linguists and computer professionals are called upon more frequently to support the myriad writing systems around the world. A key to this development is consistent, trustworthy, complete and organised information on the alphabets and scripts used to write the world’s languages. The development of Writing System Implementations (WSIs) depends on the availability of this information, so a lack of it can hinder the cultural, economic and intellectual development of communities that communicate in minority languages and scripts.


The information needed varies widely, and can include:

  • Design information and guidelines – both for alphabets and for specific letters/glyphs
  • Linguistic information – how the script is used for specific languages
  • Encoding details – particularly Unicode, including new Unicode proposals
  • Script behaviour – how letters change shape and position in context
  • Keyboarding conventions – including information on data entry tools
  • Testing tools and sample texts – so developers can test their software, fonts, keyboards

Some of this information is available, but is scattered around among a variety of web sites that have different purposes and structures, and often lies undocumented in the minds of individual script experts, or hidden in library books.

This information is also often segregated by audience. A font designer may be frustrated to find that available resources on a script address the spoken/written language relationship, but not the background and visual rules of the letterforms. A linguist may find information on encoding the script – such as the information in The Unicode Standard – but not important details of which languages use which symbols. An application developer may find a long writeup on the development and use of the script, but nothing to tell them what script behaviours are required.

There are also relatively few opportunities for experts from these fields to cooperate and work together. What interaction does exist often happens at conferences, on various mailing lists and forums, and through personal email. There are few experts who have the time to participate in these exchanges, and those that do may be frustrated to find that the same questions keep coming up again and again. Until now, there has been no place where this knowledge can be captured, organised and maintained.

The purpose of ScriptSource

ScriptSource exists to provide this information and bridge the gap between the designer, developer, linguist and user. It seeks to document the writing systems of the world and help those wanting to implement them on computers and other devices.

The initial content is relatively sparse, but includes basic information on all scripts in the ISO 15924 standard. It will grow dynamically through public submissions, expert content development and live linkages with other web sites. Rather than being just another web site about writing systems, ScriptSource provides a single hub of information where both old and new content can be found.

A truly remarkable resource on writing systems by SIL International.

You can think of ScriptSource as a way to locate fonts, but you may be drawn into complexities others rarely see!


GNU Unifont Glyphs [Good News/Bad News]

Thursday, January 19th, 2017

GNU Unifont Glyphs 9.0.06.

From the webpage:

GNU Unifont is part of the GNU Project. This page contains the latest release of GNU Unifont, with glyphs for every printable code point in the Unicode 9.0 Basic Multilingual Plane (BMP). The BMP occupies the first 65,536 code points of the Unicode space, denoted as U+0000..U+FFFF. There is also growing coverage of the Supplemental Multilingual Plane (SMP), in the range U+010000..U+01FFFF, and of Michael Everson’s ConScript Unicode Registry (CSUR).
… (red highlight in original)

That’s the good news.

The bad news is shown by the coverage mapping:

0.0%  U+012000..U+0123FF  Cuneiform*
0.0%  U+012400..U+01247F  Cuneiform Numbers and Punctuation*
0.0%  U+012480..U+01254F  Early Dynastic Cuneiform*
0.0%  U+013000..U+01342F  Egyptian Hieroglyphs*
0.0%  U+014400..U+01467F  Anatolian Hieroglyphs*

These scripts will require a 32-by-32 pixel grid:

*Note: Scripts such as Cuneiform, Egyptian Hieroglyphs, and Bamum Supplement will not be drawn on a 16-by-16 pixel grid. There are plans to draw these scripts on a 32-by-32 pixel grid in the future.
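UCD assignment is separate from font coverage, but the block sizes behind the percentages above can be checked by counting assigned code points with Python's `unicodedata` (a sketch; it tells you how many glyphs a font would need, not what the font has drawn):

```python
import unicodedata

def assigned(start, end):
    """Count code points in [start, end] the bundled UCD assigns names to."""
    count = 0
    for cp in range(start, end + 1):
        try:
            unicodedata.name(chr(cp))
            count += 1
        except ValueError:  # unassigned code point
            pass
    return count

print('Egyptian Hieroglyphs:', assigned(0x13000, 0x1342F))
print('Cuneiform:           ', assigned(0x12000, 0x123FF))
```

That is over a thousand hieroglyphs and many hundreds of cuneiform signs still waiting for their 32-by-32 grids.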

One additional resource on creating cuneiform fonts:

Creating cuneiform fonts with MetaType1 and FontForge by Karel Píška:


A cuneiform font collection covering Akkadian, Ugaritic and Old Persian glyph subsets (about 600 signs) has been produced in two steps. With MetaType1 we generate intermediate Type 1 fonts, and then construct OpenType fonts using FontForge. We describe cuneiform design and the process of font development.

On creating fonts more generally with FontForge, see: Design With FontForge.


Your assignment, should you choose to accept it….

Friday, August 26th, 2016

You may (or may not) remember the TV show, Mission Impossible. It had a cast of regulars who formed a spy team to undertake “impossible” tasks that could not be traced back to the U.S. government.

Stories like: BAE Systems Sells Internet Surveillance Gear to United Arab Emirates make me wish for a non-nationalistic, modern equivalent of the Mission Impossible team.

You may recall the United Arab Emirates (UAE) were behind the attempted hack of Ahmed Mansoor, a prominent human rights activist.

So much for the UAE needing spyware for legitimate purposes.

From the article:

In a written statement, BAE Systems said, “It is against our policy to comment on contracts with specific countries or customers. BAE Systems works for a number of organizations around the world, within the regulatory frameworks of all relevant countries and within our own responsible trading principles.”

The Danish Business Authority told Andersen it found no issue approving the export license to the Ministry of the Interior of the United Arab Emirates after consulting with the Danish Ministry of Foreign Affairs, despite regulations put in place by the European Commission in October 2014 to control exports of spyware and internet surveillance equipment out of concern for human rights. The ministry told Andersen in an email it made a thorough assessment of all relevant concerns and saw no reason to deny the application.

It doesn’t sound like any sovereign government is going to restrain BAE Systems and/or the UAE.

Consequences for their mis-deeds will have to come from other quarters.

Like the TV show started every week:

Your assignment, should you choose to accept it….

Unicode® Standard, Version 9.0

Wednesday, July 6th, 2016

Unicode® Standard, Version 9.0

From the webpage:

Version 9.0 of the Unicode Standard is now available. Version 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.

The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including:

  • Osage, a Native American language
  • Nepal Bhasa, a language of Nepal
  • Fulani and other African languages
  • The Bravanese dialect of Swahili, used in Somalia
  • The Warsh orthography for Arabic, used in North and West Africa
  • Tangut, a major historic script of China

Important symbol additions include:

  • 19 symbols for the new 4K TV standard
  • 72 emoji characters such as the following

Why they choose to omit the bacon emoji from the short list is a mystery to me:


Get your baking books out! I see missing bread emojis. 😉
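The bacon emoji did make it into Unicode 9.0, at U+1F953; on a Python whose bundled UCD is 9.0 or later you can confirm it by name:

```python
import unicodedata

print(unicodedata.unidata_version)     # UCD version bundled with this Python
print(unicodedata.name('\U0001F953'))  # BACON
```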

“invisible entities having arcane but gravely important significances”

Sunday, June 19th, 2016

Allison Parrish tweeted: the “Other, Format” unicode category, full of invisible entities having arcane but gravely important significances

I just could not let a tweet with:

“invisible entities having arcane but gravely important significances”

pass without comment!

As of today, there are one hundred and fifty (150) such entities, all with multiple properties.
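You can enumerate the invisible entities yourself with Python's `unicodedata`; the exact total depends on the UCD version your interpreter ships, and it has only grown since this post:

```python
import sys
import unicodedata

# The "Other, Format" (Cf) category: invisible but significant characters
# such as SOFT HYPHEN, ZERO WIDTH JOINER, and the bidi controls.
cf = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == 'Cf']
print(len(cf))
print(f'U+{cf[0]:04X}', unicodedata.name(chr(cf[0])))  # U+00AD SOFT HYPHEN
```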

How many of these “invisible entities” are familiar to you?

Unicode Code Chart Reviewers Needed – Now!

Tuesday, May 17th, 2016

I saw an email from Rick McGowan of the Unicode Consortium that reads:

As we near the release of Unicode 9.0, we’re looking for volunteers to review the latest code charts for regressions from the 8.0 charts… If you have a block that you’re particularly fond of, please consider checking the glyphs and names against the 8.0 charts… To see the latest 9.0 charts, you can start here:

The “blocks” directory has all of the individual block charts, and the charts with specific additions/changes are here:

Not for everyone but if you can contribute, please do.

Just so you know, this is the 25th anniversary of the Unicode Consortium!

Even if you don’t proof the code charts, do remember to wish the Unicode Consortium a happy 25th anniversary!

UTF-8 encoding table and Unicode characters

Friday, April 22nd, 2016

UTF-8 encoding table and Unicode characters

The mapping between UTF-8 and binary representations doesn’t come up often, but it did today.

Rather than hunting through bookmarks in the future, I am capturing this resource here.
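The encoding itself is easy to inspect from code; a quick Python sketch that prints the bit patterns of a character's UTF-8 bytes:

```python
def utf8_bits(ch):
    """Render a character's UTF-8 bytes as bit patterns."""
    return f"U+{ord(ch):04X} -> " + " ".join(
        f"{b:08b}" for b in ch.encode('utf-8'))

print(utf8_bits('A'))  # one byte:    0xxxxxxx
print(utf8_bits('é'))  # two bytes:   110xxxxx 10xxxxxx
print(utf8_bits('€'))  # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
```

The leading bits of the first byte announce the sequence length, and every continuation byte starts with 10, which is what makes UTF-8 self-synchronizing.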

Helmification of XML Unicode

Friday, January 8th, 2016

XML Unicode by Norman Walsh.

From the webpage:

XML Unicode provides some convenience methods for inserting Unicode characters. When it started, the focus was on characters that were traditionally inserted with named character entities, things like é.

In practice, and in the age of UTF-8, the “insert unicode character” function, especially the Helm-enabled version, is much more broadly useful.

You’re most likely going to want to bind some or all of them to keys.

Complete with suggested key bindings!

Oh, the image from Norman’s tweet:


FYI, the earliest use of helm-ification (note the hyphen) I can find was on November 24, 2015 by Christian Romney. Citation authorities remain split on whether Christian’s helm-ification or Norman’s helmification is the correct usage. 😉

Unicode to LaTeX

Wednesday, December 2nd, 2015

Unicode to LaTeX by John D. Cook.

From the post:

I’ve run across a couple web sites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.
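That prefix-based dispatch is simple to sketch; a hypothetical miniature in Python with a three-entry table (John's actual table is of course much larger):

```python
# One table, looked up in either direction depending on the input's prefix.
TABLE = {0x03B1: r'\alpha', 0x00D7: r'\times', 0x2192: r'\rightarrow'}
REVERSE = {v: k for k, v in TABLE.items()}

def convert(s):
    if s.startswith('U+'):                  # Unicode code point -> LaTeX
        return TABLE.get(int(s[2:], 16), '?')
    if s.startswith('\\'):                  # LaTeX command -> code point
        cp = REVERSE.get(s)
        return f'U+{cp:04X}' if cp is not None else '?'
    raise ValueError('expected "U+XXXX" or a LaTeX command')

print(convert('U+03B1'))   # \alpha
print(convert(r'\times'))  # U+00D7
```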

I am having trouble visualizing when I would need to go from Unicode to LaTeX but on the off-chance that I find myself in that situation, I wanted to note John’s conversion page.

Knowing my luck, just after this post is pushed off the front page of the blog I will have need of it. 😉

Internationalization & Unicode Conference ICU 39

Thursday, June 25th, 2015

Internationalization & Unicode Conference ICU 39

October 26-28, 2015 – Santa Clara, CA USA

From the webpage:

The Internationalization and Unicode® Conference (IUC) is the premier event covering the latest in industry standards and best practices for bringing software and Web applications to worldwide markets. This annual event focuses on software and Web globalization, bringing together internationalization experts, tools vendors, software implementers, and business and program managers from around the world. 

Expert practitioners and industry leaders present detailed recommendations for businesses looking to expand to new international markets and those seeking to improve time to market and cost-efficiency of supporting existing markets. Recent conferences have provided specific advice on designing software for European countries, Latin America, China, India, Japan, Korea, the Middle East, and emerging markets.

This highly rated conference features excellent technical content, industry-tested recommendations and updates on the latest standards and technology. Subject areas include web globalization, programming practices, endangered languages and un-encoded scripts, integrating with social networking software, and implementing mobile apps. This year’s conference will also highlight new features in Unicode and other relevant standards. 

In addition, please join us in welcoming over 20 first-time speakers to the program! This is just another reason to attend; fresh talks, fresh faces, and fresh ideas!

(emphasis and colors in original)

If you want your software to be an edge case and hard to migrate in the future, go ahead, don’t support Unicode. Unicode libraries exist in all the major and many minor programming languages. Not supporting Unicode isn’t simpler, it’s just dumber.

Sorry, I have been a long time follower of the Unicode work and an occasional individual member of the Consortium. Those of us old enough to remember pre-Unicode days want to lessen the burden of interchanging texts, not increase it.

Enjoy the conference!

Unicode 8 – Coming Next Week!

Friday, June 12th, 2015

Unicode 8 will be released next week. Rick McGowan has posted directions to code charts for final review:

For the complete archival charts, as a single-file 100MB file, or as individual block files, please see the charts directory here:

For the set of “delta charts” only with highlighting for changes please see:

(NOTE: There is a known problem viewing the charts using the PDF Viewer plugin for Firefox on the Mac platform.)

And the 8.0 beta UCD files are also available for cross-reference:

The draft version page is here:

From the draft version homepage:

Unicode 8.0 adds a total of 7,716 characters, encompassing six new scripts and many new symbols, as well as character additions to several existing scripts. Notable character additions include the following:

  • A set of lowercase Cherokee syllables, forming case pairs with the existing Cherokee characters
  • A large collection of CJK unified ideographs
  • Emoji symbols and symbol modifiers for implementing skin tone diversity; see Unicode Emoji.
  • Georgian lari currency symbol
  • Letters to support the Ik language in Uganda, Kulango in the Côte d’Ivoire, and other languages of Africa
  • The Ahom script for support of the Tai Ahom language in India
  • Arabic letters to support Arwi—the Tamil language written in the Arabic script

Other important updates in Unicode Version 8.0 include:

  • Change in encoding model of New Tai Lue to visual order


Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 8.0, as well as other modifications:

If you have the time this weekend, take a quick look.
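One of the additions above, the Cherokee case pairs, is visible from any Python whose bundled UCD is 8.0 or later:

```python
# CHEROKEE LETTER A (U+13A0) gained a lowercase partner at U+AB70 in
# Unicode 8.0, and the default case mapping pairs them.
upper = '\u13A0'
lower = upper.lower()
print(f'U+{ord(upper):04X} -> U+{ord(lower):04X}')  # U+13A0 -> U+AB70
```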

Unicode 7.0 Core Specification (paperback)

Friday, January 16th, 2015


The Unicode 7.0 core specification is now available in paperback book form.

Responding to requests, the editorial committee has created a pair of modestly-priced print-on-demand volumes that contain the complete text of the core specification of Version 7.0 of the Unicode Standard.

The form-factor in this edition has been changed from US letter to 6×9 inch US trade paperback size, making the two volumes more compact than previous versions. The two volumes may be purchased separately or together. The cost for the pair is US$16.27, plus postage and applicable taxes. Please visit to order.

Note that these volumes do not include the Version 7.0 code charts, nor do they include the Version 7.0 Standard Annexes and Unicode Character Database, all of which are available only on the Unicode website,

Even with the aggressive pricing, I don’t see this getting onto the best seller list. 😉

It should be on the best seller list! The current version is the result of decades of work by Consortium staff and many volunteers.


PS: Blog about this at your site and/or forward to your favorite mailing list. Typographers, programmers, editors and the computer literate should have a basic working knowledge of Unicode.

Unicode Version 7.0…

Wednesday, October 8th, 2014

Unicode Version 7.0 – Complete Text of the Core Specification Published

From the post:

The Unicode® Consortium announces the publication of the core specification for Unicode 7.0. The Version 7.0 core specification contains significant changes:

  • Major reorganization of the chapters and overall layout
  • New page size tailored for easy viewing on e-readers and other mobile devices
  • Addition of twenty-two new scripts and a shorthand writing system
  • Alignment with updates to the Unicode Bidirectional Algorithm

In Version 7.0, the standard grew by 2,834 characters. This version continues the Unicode Consortium’s long-term commitment to support the full diversity of languages around the world with its newly encoded scripts and other additional characters. The text of the latest version documents two newly adopted currency symbols: the manat, used in Azerbaijan, and the ruble, used in Russia and other countries. It also includes information about newly added pictographic symbols, geometric symbols, arrows and ornaments.

This version of the Standard brings technical improvements to support implementers, including further clarification of the case pair stability policy, and a new stability policy for Numeric_Type.

All other components of Unicode 7.0 were released on June 16, 2014: the Unicode Standard Annexes, code charts, and the Unicode Character Database, to allow vendors to update their implementations of Unicode 7.0 as early as possible. The release of the core specification completes the definitive documentation of the Unicode Standard, Version 7.0.

For more information on all of The Unicode Standard, Version 7.0, see

For non-backtick + Unicode character applications, this is good news!

Following the Unicode standard should be the first test for consideration of an application. The time for ad hoc character hacks passed a long time ago.
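The two newly adopted currency symbols mentioned in the announcement are easy to confirm from Python's `unicodedata` (assuming a bundled UCD of 7.0 or later):

```python
import unicodedata

# U+20BC and U+20BD were both added in Unicode 7.0.
for ch in ('\u20BC', '\u20BD'):
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
```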

Juju Charm (HPCC Systems)

Friday, August 8th, 2014

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.


5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here:
The HPCC Systems platform can be found here:

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u’ma’ AND u’me’), it would bring back records for ‘mai’, ‘Mai’ and ‘may’. However a query on the index for that dataset, idx(ufield BETWEEN u’ma’ AND u’me’), only brings back a record for ‘mai’.

This is a result of the way unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches or is less than or greater than another value. Integers are stored in big-endian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using the ICU locale sensitive comparisons to ensure correct ordering. However, index lookup operations need to be fast and therefore the lookup operations perform binary comparisons on fixed length blocks of data. Equality checks will return data correctly, but queries involving between, > or < may fail.

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
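The failure mode described above is easy to reproduce outside ECL; a Python sketch contrasting a byte-wise range check with a crude, case-insensitive stand-in for ICU collation:

```python
# Records that a locale-aware comparison puts between 'ma' and 'me'.
names = ['mai', 'Mai', 'may']

# Case-insensitive "collation" (a crude stand-in for ICU): all three match.
collated = [s for s in names if 'ma' <= s.casefold() <= 'me']

# Byte-wise comparison, as a fast index lookup might do: 'M' is 0x4D,
# which sorts before 'm' (0x6D), so 'Mai' silently falls out of the range.
bytewise = [s for s in names if b'ma' <= s.encode('utf-8') <= b'me']

print(collated)  # ['mai', 'Mai', 'may']
print(bytewise)  # ['mai', 'may']
```

Same data, same query, different answers, which is exactly the mismatch between dataset filtering and index lookup that the HPCC page warns about.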

Alphabetical Order

Tuesday, July 29th, 2014

Alphabetical order explained in a mere 27,817 words by David Weinberger.

From the post:

This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.

Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.

As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.

The best part is the summary of Unicode Technical Standard #10:

This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in their full context, where that context is history, language, culture, and the ambiguity in which they thrive.

We all learned the meaning of “alphabetical order” in elementary school. But which “alphabetical order” depends upon language, culture, context, etc.
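The gap between code-point order and what a reader expects is easy to demonstrate; a Python sketch contrasting raw code-point sorting with a crude stand-in for a collation key (the real UCA is far more involved):

```python
import unicodedata

words = ['Zebra', 'apple', 'Éclair', 'banana']

def collation_key(s):
    # Crude "alphabetical" key: strip accents via NFD, ignore case.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed
                   if not unicodedata.combining(c)).casefold()

print(sorted(words))                      # code points: 'Zebra' first, 'Éclair' last
print(sorted(words, key=collation_key))  # apple, banana, Éclair, Zebra
```

Even this toy ignores language-specific rules (German vs. Swedish treatment of 'ä', say), which is precisely the complexity UTS #10 dives into.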

Other terms and phrases have the same problem. But the vast majority of them have no Unicode Technical Report with all the possible meanings.

For those terms there are topic maps.

I first saw this in a tweet by Computer Science.

Unicode Character Table

Wednesday, June 4th, 2014

Unicode Character Table

A useful webpage that I first saw in a tweet by Scott Chamberlain.

Displays Unicode characters on “buttons” that when selected displays the Unicode Hex code and HTML code for the selected character.

Quite useful when all you need is one entity value for a post.

If you need more information try Unicode Table – The Unicode Character Reference, which for “Latin Small Letter D” displays:

Unicode Character Information
  Unicode Hex: U+0064
  General Category: Lowercase Letter [Code: Ll]
  Canonical Combining Class: 0
  Bidirectional Category: L
  Mirrored: N
  Uppercase Version: U+0044
  Titlecase Version: U+0044

Unicode Character Encodings
  HTML Entity: &#100; (decimal entity), &#x0064; (hex entity)
  Windows Key Code: Alt 0100 or Alt +0064
  Programming Source Code Encodings: Python hex: u"\u0064", hex for C++ and Java: "\u0064"
  UTF-8 Hexadecimal Encoding: 0x64
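
Much of that table can be reproduced from a script with Python's standard unicodedata module, which reads the same Unicode Character Database. A small sketch for the same character:

```python
import unicodedata

ch = "d"
cp = ord(ch)                                     # 100, i.e. U+0064

print(f"Unicode Hex: U+{cp:04X}")                # U+0064
print("General Category:", unicodedata.category(ch))            # Ll
print("Canonical Combining Class:", unicodedata.combining(ch))  # 0
print("Bidirectional Category:", unicodedata.bidirectional(ch)) # L
print(f"Uppercase Version: U+{ord(ch.upper()):04X}")            # U+0044
print(f"HTML Entity: &#{cp}; or &#x{cp:04X};")                  # &#100; or &#x0064;
print("UTF-8:", "0x" + ch.encode("utf-8").hex())                # 0x64
```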

Or if you need all the information available on Unicode and to know it is the canonical information, see

(String/text processing)++:…

Thursday, May 15th, 2014

(String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

From the post:

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limiting to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in our minds.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.
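
stringi is an R package, but one item on that list, Unicode normalization, is easy to illustrate with Python's standard library: the same visible text can be stored as different code-point sequences, and normalization reconciles them.

```python
import unicodedata

composed = "\u00e9"       # é as a single precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```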

Interesting, isn’t it? How CS keeps circling back to strings?


Character(s) in Unicode 6.3.0

Wednesday, December 18th, 2013

Search for character(s) in Unicode 6.3.0 by Tomas Schild.

A site that allows you to search the latest Unicode character set by:

  • Word or phrase from the official Unicode character name
  • Word or phrase from the old, deprecated Unicode 1.0 character name
  • A single character
  • The hexadecimal value of the Unicode position
  • A numerical value

When you need just one or two characters to encode for HTML, this could be very handy.

Be aware that the search engine does not compensate for spelling differences in the Unicode character list.

Thus, a search for “aleph” returns:

code point | Unicode character name
U+1202A (UTF-8: f0 92 80 aa) | CUNEIFORM SIGN ALEPH

Whereas a search for “alef” returns:

128 characters found

code point | Unicode character name
Remaining 121 characters omitted

Semitic alphabets all contain the alef/aleph character, which represents a glottal stop.

I have no immediate explanation for why the Unicode standard chose different names for the same character in different languages.

But, be aware that it does happen.
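
Python's standard unicodedata module inherits the same naming quirks, since it reads names straight from the Unicode Character Database. A small sketch:

```python
import unicodedata

# The Hebrew and Arabic letters both use the "ALEF" spelling:
print(unicodedata.name("\u05d0"))   # HEBREW LETTER ALEF
print(unicodedata.name("\u0627"))   # ARABIC LETTER ALEF

# Lookup works the other way too, but only with the exact spelling:
print(unicodedata.lookup("HEBREW LETTER ALEF") == "\u05d0")  # True

# A crude substring search over the Hebrew block:
for cp in range(0x0590, 0x0600):
    name = unicodedata.name(chr(cp), "")
    if "ALEF" in name:
        print(f"U+{cp:04X} {name}")
```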

BTW, I modified the tables to omit the character and other fields.

WordPress seems to have difficulty with Imperial Aramaic, Inscriptional Parthian, Inscriptional Pahlavi, and Cuneiform code points for aleph.

Unicode Standard, Version 6.3

Tuesday, October 1st, 2013

Unicode Standard, Version 6.3

From the post:

The Unicode Consortium announces Version 6.3 of the Unicode Standard and with it, significantly improved bidirectional behavior. The updated Version 6.3 Unicode Bidirectional Algorithm now ensures that pairs of parentheses and brackets have consistent layout and provides a mechanism for isolating runs of text.

Based on contributions from major browser developers, the updated Bidirectional Algorithm and five new bidi format characters will improve the display of text for hundreds of millions of users of Arabic, Hebrew, Persian, Urdu, and many others. The display and positioning of parentheses will better match the normal behavior that users expect. By using the new methods for isolating runs of text, software will be able to construct messages from different sources without jumbling the order of characters. The new bidi format characters correspond to features in markup (such as in CSS). Overall, these improvements also bring greater interoperability and an improved ability for inserting text and assembling user interface elements.

The improvements come with new rigor: the Consortium now offers two reference implementations and greatly improved testing and test data.

In a major enhancement for CJK usage, this new version adds standardized variation sequences for all 1,002 CJK compatibility ideographs. These sequences address a well-known issue of the CJK compatibility ideographs — that they could change their appearance when any process normalized the text. Using the new standardized variation sequences allows authors to write text which will preserve the specific required shapes of these CJK ideographs, even under Unicode normalization.

Version 6.3 includes other improvements as well:

  • Improved Unihan data to better align with ISO/IEC 10646
  • Better support for Hebrew word break behavior and for ideographic space in line breaking

Get started with Unicode 6.3 today!

Now, there’s an interesting data set!

Much of the convenience you now experience with digital texts is due to the under-appreciated efforts of the Unicode project.
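
The bidirectional categories the algorithm consumes are per-character properties in the Unicode Character Database, and Python's standard unicodedata module exposes them. A small sketch (the two isolate characters shown are among the bidi format characters added alongside the 6.3 algorithm):

```python
import unicodedata

print(unicodedata.bidirectional("A"))       # 'L'  (left-to-right)
print(unicodedata.bidirectional("\u05d0"))  # 'R'  (right-to-left; Hebrew alef)
print(unicodedata.bidirectional("\u0627"))  # 'AL' (Arabic letter)
print(unicodedata.bidirectional("3"))       # 'EN' (European number)

# Two of the isolate format characters for isolating runs of text:
print(unicodedata.name("\u2066"))           # LEFT-TO-RIGHT ISOLATE
print(unicodedata.name("\u2069"))           # POP DIRECTIONAL ISOLATE
```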

Character Sorted Table….

Wednesday, June 19th, 2013

Character Sorted Table Showing Entity Names and Unicode Values

I often need to look up just one character and guessing which part of Unicode will have it is a pain.

I found this thirty-seven (37) page summary of characters with entity names and Unicode values at the U. S. Government Printing Office (GPO).

I could not find any directories above it with an index page or pointers to this file.

I have not verified the entries in this listing. Use at your own risk.

Common Locale Data Repository (CLDR) 23.1

Friday, May 17th, 2013

Common Locale Data Repository (CLDR) 23.1

From the CLDR project homepage:

What is CLDR?

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes:

  • Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, timezones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, postal and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layouts

CLDR uses the XML format provided by UTS #35: Unicode Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.

For a set of slides on the technical contents of CLDR, see Overview.

Great set of widely used mappings between locale data.

unicodex — High-performance Unicode Library (C++)

Monday, February 11th, 2013

unicodex — High-performance Unicode Library (C++) by Dustin Juliano.

From the post:

The following is a micro-optimized Unicode encoder/decoder for C++ that is capable of significant performance, sustaining 6 GiB/s for UTF-8 to UTF-16/32 on an AMD A8-3870 running in a single thread, and 8 GiB/s for UTF-16 to UTF-32. That would allow it to encode nearly the full English Wikipedia in approximately 6 seconds.

It maps between UTF-8, UTF-16, and UTF-32, and properly detects UTF-8 BOM and the UTF-16 BOMs. It has been unit tested with gigabytes of data and verified with binary analysis tools. Presently, only little-endian is supported, which should not pose any significant limitations on use. It is released under the BSD license, and can be used in both proprietary and free software projects.

The decoder is aware of malformed input and will raise an exception if the input sequence would cause a buffer overflow or is otherwise fatally incorrect. It does not, however, ensure that exact codepoints correspond to the specific Unicode planes; this is by design. The implementation has been designed to be robust against garbage input and specifically avoid encoding attacks.
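
unicodex itself is C++, but the UTF-8/16/32 mappings and BOM handling it implements can be sketched with Python's built-in codecs module, just to show what the conversions involve (at nowhere near 6 GiB/s, of course):

```python
import codecs

text = "Ünïcode"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")      # native byte order, BOM prepended
utf32 = text.encode("utf-32-le")   # explicit endianness, no BOM

# Detect the BOM, then round-trip every encoding back to the same text:
assert utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(utf8.decode("utf-8") == text)       # True
print(utf16.decode("utf-16") == text)     # True
print(utf32.decode("utf-32-le") == text)  # True
```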

One of those “practical” things that you may need for processing topic maps and/or other digital information. 😉

Unicode 6.2.0 Available

Friday, November 23rd, 2012

Unicode 6.2.0 Available

From the post:

Version 6.2 of the Unicode Standard is a special release dedicated to the early publication of the newly encoded Turkish lira sign. This version also rolls in various minor corrections for errata and other small updates for the Unicode Character Database. In addition, there are some significant changes to the Unicode algorithms for text segmentation and line breaking, including changes to the line break property to improve line breaking for emoji symbols.
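
The lira sign itself is U+20BA; with a reasonably recent Python (one whose unicodedata tables include Unicode 6.2 or later) you can confirm it directly:

```python
import unicodedata

lira = "\u20ba"
print(unicodedata.name(lira))             # TURKISH LIRA SIGN
print("0x" + lira.encode("utf-8").hex())  # 0xe282ba
```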

Just in case you don’t follow Unicode releases closely.

The character set against which all others should be mapped.