Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 8, 2014

Juju Charm (HPCC Systems)

Filed under: BigData,HPCC,Unicode — Patrick Durusau @ 1:44 pm

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.

….

5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here: https://github.com/hpcc-systems
The HPCC Systems platform can be found here: http://hpccsystems.com/download/free-community-edition

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, Unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u'ma' AND u'me'), it would bring back records for 'mai', 'Mai' and 'may'. However, a query on the index for that dataset, idx(ufield BETWEEN u'ma' AND u'me'), only brings back a record for 'mai'.

This is a result of the way Unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches or is less than or greater than another value. Integers are stored in big-endian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using the ICU locale sensitive comparisons to ensure correct ordering. However, index lookup operations need to be fast and therefore the lookup operations perform binary comparisons on fixed length blocks of data. Equality checks will return data correctly, but queries involving between, > or < may fail.
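To make the difference concrete, here is a minimal Python sketch of the two comparison styles. It is an illustration under stated assumptions, not HPCC's implementation: `str.casefold()` stands in for an ICU locale-sensitive comparison, UTF-16BE bytes padded to a hypothetical fixed width stand in for an index key, and the names `binary_key`, `collation_key` and `between` are made up for the example. It only demonstrates the upper/lower case mismatch. It does not reproduce the exact result quoted above, where 'may' is also missing from the index query.

```python
# Sketch only. casefold() approximates a locale-sensitive comparison and
# fixed-width UTF-16BE bytes approximate an index key; both are assumptions.

WIDTH = 8  # hypothetical fixed key width, in characters


def binary_key(s: str) -> bytes:
    """Pad to a fixed width and compare as raw bytes, like a binary index lookup."""
    return s.ljust(WIDTH).encode("utf-16-be")


def collation_key(s: str) -> str:
    """Case-insensitive key, a rough stand-in for ICU collation."""
    return s.casefold()


def between(values, lo, hi, key):
    """Return the values whose key falls in the inclusive range [lo, hi]."""
    return [v for v in values if key(lo) <= key(v) <= key(hi)]


records = ["mai", "Mai", "may", "mz"]

# Dataset-style range filter: compares with the collation-like key.
dataset_hits = between(records, "ma", "me", collation_key)

# Index-style range filter: compares the padded bytes directly.
index_hits = between(records, "ma", "me", binary_key)

print(dataset_hits)  # ['mai', 'Mai', 'may']
print(index_hits)    # ['mai', 'may']
```

In the byte comparison 'Mai' drops out because an uppercase 'M' (0x4D) sorts before a lowercase 'm' (0x6D), so it falls below the lower bound of the range. An equality test for 'Mai' would still find it, which is consistent with the documentation's remark that equality checks return data correctly.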

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
