## Archive for the ‘XML’ Category

### …Apache HBase REST Interface, Part 2

Friday, April 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 2 by Jesse Anderson.

From the post:

This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.
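The multi-row insert the post describes boils down to one JSON document listing several rows at once, with every key, column, and value base64-encoded. A minimal sketch in Python of building that payload (the table name, column family, and localhost URL in the comments are illustrative assumptions, not taken from the post):

```python
import base64
import json

def b64(s):
    """The HBase REST interface expects keys, columns, and values base64-encoded."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")

def build_multi_row_payload(rows):
    """Build the JSON body for a multi-row insert via the HBase REST interface.

    `rows` maps row keys to {"cf:qualifier": value} dicts.
    """
    return {
        "Row": [
            {
                "key": b64(row_key),
                "Cell": [
                    {"column": b64(col), "$": b64(val)}
                    for col, val in cells.items()
                ],
            }
            for row_key, cells in rows.items()
        ]
    }

if __name__ == "__main__":
    payload = build_multi_row_payload({
        "row1": {"cf:a": "value1"},
        "row2": {"cf:a": "value2"},
    })
    print(json.dumps(payload))
    # To send it (hypothetical host and table name):
    # import requests
    # requests.post("http://localhost:8080/mytable/fakerow",
    #               data=json.dumps(payload),
    #               headers={"Content-Type": "application/json"})
```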

Only fair to cover both XML and TBL’s new favorite, JSON. (Tim Berners-Lee Renounces XML?)

### Tim Berners-Lee Renounces XML?

Wednesday, April 10th, 2013

Draft TAG Teleconference Minutes 4th of April 2013

In a discussion of ISSUE-34: XML Transformation and composability (e.g., XSLT, XInclude, Encryption) the following exchange takes place:

Noah: Let’s go through the issues and see which we can close. … Processing model of XML. Is there any interest in this?

xmlFunctions-34

Tim: I’m happy to do things with XML. This came from when we were talking about how XML was processed. The meaning of XML has to be taken outside-in. Otherwise you cannot create new XML specifications that interweave with what exists. … Not clear people noticed that.

I note that Tracker has several status codes we can assign, including OPEN, PENDING, REVIEW, POSTPONED, and CLOSED.

Tim: Henry did a lot more work on that. I don’t feel we need to put a whole lot of energy into XML at all. JSON is the new way for me. It’s much more straightforward.

Suggestion: if we think this is now resolved or uninteresting, CLOSE it; if we think it’s interesting but not now, then POSTPONED?

Tim: We need another concept besides OPEN/CLOSED. Something like NOT WORKING ON IT.

Noah: It has POSTPONED.

Tim: POSTPONED expresses a feeling of guilt. But there’s no guilt.

Noah: It’s close enough and I’m not looking forward to changing Tracker.

ht, you wanted to add 0.02USD

Henry: I’m happy to move this to the backburner. I think there’s a genuine issue here and of interest to the community but I don’t have the bandwidth.

Noah: We need to tell ourselves a story as to what these codes mean. … Historically we used CLOSED for “it’s in pretty good shape”.

Henry: I’m happy with POSTPONED and it’s better than CLOSED.

+1 for postponing

+1

RESOLUTION: We mark ISSUE-34 (xmlFunctions-34) POSTPONED

I think this is important, thanks for doing it noah

XML can be improved to be sure but the concept is not inherently flawed.

To JSON supporters, all I can say is that when it started, XML wasn’t the bloated confusion you see now.

Sunday, April 7th, 2013

I rather hate to end the day on a practical note, but after going off Google Reader, I started using RSSOwl.

Feed Validator reported the feed was:

not well-formed (invalid token)

with a pointer to the letter “f” in the word “find.”

Captured the feed as XML and loaded it into oXygen.

A form feed character was immediately in front of the “f” in “find” but of course was not displaying.

The culprit in one case was a form feed character (0x0C), and in the other, an end-of-text character (0x03).

ASCII characters 0–31 and 127 are non-printing control characters called C0 controls.

Of the C0 control characters, only carriage return (0x0D), linefeed (0x0A) and horizontal tab (0x09) can appear in an XML feed.

For loading and parsing RSS feeds into a topic map, you may want to filter for C0 controls that should not appear in the XML feed.

PS: I suspect in both cases the control characters were introduced by copy-n-paste operations.
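A small filter along those lines, a sketch that strips every C0 control XML 1.0 disallows while keeping tab, linefeed, and carriage return:

```python
import re

# XML 1.0 allows only tab (0x09), linefeed (0x0A) and carriage return
# (0x0D) from the C0 control range; everything else in 0x00-0x1F, plus
# DEL (0x7F), is illegal and will make a feed "not well-formed".
ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")

def strip_illegal_controls(text):
    """Remove C0 controls that may not appear in an XML feed."""
    return ILLEGAL_C0.sub("", text)

if __name__ == "__main__":
    dirty = "you will \x0cfind\x03 it here"   # form feed + end-of-text
    print(strip_illegal_controls(dirty))      # -> "you will find it here"
```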

### XMLQuire Web Edition

Sunday, March 17th, 2013

XMLQuire Web Edition: A Free XSLT 2.0 Editor for the Web

From the webpage:

XSLT 2.0 processing within the browser is now a reality with the introduction of the open source Saxon-CE from Saxonica. This processor runs as a JavaScript app and supports JavaScript interoperability and user-event handling for the era of HTML5 and the dynamic web.

This Windows product, XMLQuire, is an XSLT editor specially extended to integrate with Saxon-CE and support the Saxon-CE language extensions that make interactive XSLT possible. Saxon-CE is not included with this product, but is available from Saxonica here.

*nix folks will have to install Windows 7 or 8 on a VM to take advantage of this software.

Worth the effort if for no other reason than to see how the market majority lives.

I first saw this in a tweet by Michael Kay.

### “…XML User Interfaces” As in Using XML?

Tuesday, February 19th, 2013

International Symposium on Native XML user interfaces

This came across the wire this morning and I need your help interpreting it.

Why would you want to have an interface to XML?

All these years I have been writing XML in Emacs because XML wasn’t supposed to have an interface.

Brave hearts, male, female and unknown, struggling with issues too obscure for mere mortals.

Now I find that isn’t supposed to be so? You can imagine my reaction.

I moved my laptop a bit closer to the peat fire to make sure I read it properly. Waiting for the ox cart later this week to take my complaint to the local bishop about this disturbing innovation.

15 March 2013 — Peer review applications due
19 April 2013 — Paper submissions due
19 April 2013 — Applications for student support awards due
21 May 2013 — Speakers notified
12 July 2013 — Final papers due
5 August 2013 — International Symposium on Native XML user interfaces
6–9 August 2013 — Balisage: The Markup Conference

International Symposium on Native XML user interfaces

Monday August 5, 2013 Hotel Europa, Montréal, Canada

XML is everywhere. It is created, gathered, manipulated, queried, browsed, read, and modified. XML systems need user interfaces to do all of these things. How can we make user interfaces for XML that are powerful, simple to use, quick to develop, and easy to maintain?

How are we building user interfaces today? How can we build them tomorrow? Are we using XML to drive our user interfaces? How?

This one-day symposium is devoted to the theory and practice of user interfaces for XML: the current state of implementations, practical case studies, challenges for users, and the outlook for the future development of the technology.

Relevant topics include:

• Editors customized for specific purposes or users
• User interfaces for creation, management, and use of XML documents
• Uses of XForms
• Making tools for creation of XML textual documents
• Using general-purpose user-interface libraries to build XML interfaces
• Looking at XML, especially looking at masses of XML documents
• XML, XSLT, and XQuery in the browser
• Specialized user interfaces for specialized tasks
• XML vocabularies for user-interface specification

Presentations can take a variety of forms, including technical papers, case studies, and tool demonstrations (technical overviews, not product pitches).

This is the same conference I wrote about in: Markup Olympics (Balisage) [No Drug Testing].

In times of lean funding for conferences, if you go to a conference this year, it really should be Balisage.

You will be the envy of your co-workers and have tales to tell your grandchildren.

Not bad for one conference registration fee.

### MarkLogic Announces Free Developer License for Enterprise [With Odd Condition]

Wednesday, February 13th, 2013

MarkLogic Announces Free Developer License for Enterprise

From the post:

MarkLogic Corporation today announced the availability of a free Developer License for MarkLogic Enterprise Edition.

The Developer License provides access to the features available in MarkLogic Enterprise Edition, including integrated search, government-grade security, clustering, replication, failover, alerting, geospatial indexing, conversion, and a suite of application development tools. MarkLogic also announced the Mongo2MarkLogic converter, a Java-based tool for importing data from MongoDB into MarkLogic providing developers immediate access to features needed to build out enterprise-ready big data solutions.

“By providing a free Developer License we enable developers to quickly deliver reliable, scalable and secure information and analytic applications that are production-ready,” said Gary Bloom, CEO and President of MarkLogic. “Many of our customers first experimented with other free NoSQL products, but turned to MarkLogic when they recognized the need for search, security, support for ACID transactions and other features necessary for enterprise environments. Our goal is to eliminate the cost barrier for developers and give them access to the best enterprise NoSQL platform from the start.”

The Developer License for MarkLogic Enterprise Edition includes tools for faster application development, business intelligence (BI) tool integration, analytic functions and visualization tools, and the ability to create user-defined functions for fast and flexible analysis of huge volumes of data.

You would think that story would merit at least one link to the free developer program.

That wasn’t hard. Two links and you have direct access to the topic of the story and the company.

One odd licensing condition:

Q. Can I publish my work done with MarkLogic Server?

A. We encourage you to share your work publicly, but note that you can not disclose, without MarkLogic prior written consent, any performance or capacity statistics or the results of any benchmark test performed on MarkLogic Server.

That sounds just a tad defensive doesn’t it?

I haven’t looked at MarkLogic for a couple of iterations but earlier versions had no need to fear statistics or benchmark tests.

Results vary depending on how testing is done but anyone authorized to recommend or sign acquisition orders should know that.

If they don’t, your organization has more serious problems than needing a MarkLogic server.

### Optimizing TM Queries?

Wednesday, January 16th, 2013

A recent paper by V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyễn, Optimizing XML querying using type-based document projection, suggests some interesting avenues for optimizing topic map queries.

Abstract:

XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D′; the query Q is then executed on D′, hence avoiding to allocate and process nodes that will never be reached by Q.

In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lesser pruning overhead, our solution (unlike current approaches) takes into account backward axes, predicates, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and Schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the main memory query engine’s performances (as expected) but also those of state of the art native XML databases.

Phrased in traditional XML terms but imagine pruning a topic map by topic or association types, for example, before execution of a query.

While it is true enough that a query could include topic type, there remains the matter of examining all the instances of that topic type before proceeding to the rest of the query.

For common query sub-maps, as it were, I suspect that pruning once and storing the results could be a viable alternative.
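To make the pruning idea concrete for topic maps, here is a toy sketch. It is not the paper’s type-based algorithm; a plain set of element names stands in for the type information a real implementation would infer from the query:

```python
import xml.etree.ElementTree as ET

def prune(elem, keep_tags):
    """Drop child subtrees whose element type is not needed by the query.

    A crude stand-in for type-based projection: subtrees whose tag is
    not in `keep_tags` will never be reached by the query, so they are
    removed before the query runs.
    """
    for child in list(elem):          # copy: we mutate while iterating
        if child.tag in keep_tags:
            prune(child, keep_tags)   # keep it, but prune inside it too
        else:
            elem.remove(child)
    return elem

if __name__ == "__main__":
    doc = ET.fromstring(
        "<map><topic id='t1'/><association/><topic id='t2'/><occurrence/></map>"
    )
    pruned = prune(doc, {"topic"})
    print(ET.tostring(pruned).decode())  # only the <topic> children survive
```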

Impressive as charts of millions or billions of processed nodes may be, processing the right set of nodes and producing a useful answer has its supporters.

### XQuery 3.0: An XML Query Language [Subject Identity Equivalence Language?]

Tuesday, January 15th, 2013

XQuery 3.0: An XML Query Language – W3C Candidate Recommendation

Abstract:

XML is a versatile markup language, capable of labeling the information content of diverse data sources including structured and semi-structured documents, relational databases, and object repositories. A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware. This specification describes a query language called XQuery, which is designed to be broadly applicable across many types of XML data sources.

Just starting to read the XQuery CR but the thought occurred to me that it could be a basis for a “subject identity equivalence language.”

Rather than duplicating the work on expressions, paths, data types, operators, etc., why not take all that as given?

Suffice it to define a “subject equivalence function,” the variables of which are XQuery statements that identify values (or value expressions) as required, optional or forbidden and the definition of the results of the function.

Reusing a well-tested query language seems preferable to writing an entirely new one from scratch.
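One hypothetical shape for such a subject equivalence function, sketched in Python with plain callables standing in for the XQuery statements (the required/optional/forbidden semantics below are my reading of the proposal, not a settled design):

```python
def subject_equivalent(a, b, required=(), optional=(), forbidden=()):
    """Hypothetical subject equivalence function.

    Each extractor stands in for an XQuery statement and returns a set
    of identifying values for a topic:
      required:  both topics must share at least one value per extractor
      optional:  if both topics yield values, they must not conflict
      forbidden: any shared value vetoes equivalence outright
    """
    for f in forbidden:
        if f(a) & f(b):
            return False
    for r in required:
        if not (r(a) & r(b)):
            return False
    for o in optional:
        va, vb = o(a), o(b)
        if va and vb and not (va & vb):
            return False
    return True

if __name__ == "__main__":
    # Topics as plain dicts for illustration; in practice the extractors
    # would be XQuery expressions run against topic map XML.
    psi = lambda t: set(t.get("subjectIdentifiers", []))
    name = lambda t: set(t.get("names", []))
    t1 = {"subjectIdentifiers": ["http://example.org/psi/puccini"],
          "names": ["Puccini", "Giacomo Puccini"]}
    t2 = {"subjectIdentifiers": ["http://example.org/psi/puccini"],
          "names": ["Puccini"]}
    print(subject_equivalent(t1, t2, required=[psi], optional=[name]))  # True
```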

Suggestions?

I first saw this in a tweet by Michael Kay.

### Markup Olympics (Balisage) [No Drug Testing]

Thursday, January 10th, 2013

Markup athletes take heart! Unlike venues that intrude into the personal lives of competitors, there are no, repeat no drug tests for presenters at Balisage!

Fear no trainer betrayals or years of being dogged by second-raters in the press.

Eat, drink, visit, ???, present, in the company of your peers.

The more traditional call for participation, yawn, has the following details:

Dates:

15 March 2013 – Peer review applications due
19 April 2013 – Paper submissions due
19 April 2013 – Applications for student support awards due
21 May 2013 – Speakers notified
12 July 2013 – Final papers due

5 August 2013 – Pre-conference Symposium on XForms
6-9 August 2013 – Balisage: The Markup Conference

From the call:

Balisage is where people interested in descriptive markup meet each year in August for informed technical discussion, occasionally impassioned debate, good coffee, and the incomparable ambience of one of North America’s greatest cities, Montreal. We welcome anyone interested in discussing the use of descriptive markup to build strong, lasting information systems.

Practitioner or theorist, tool-builder or tool-user, student or lecturer — you are invited to submit a paper proposal for Balisage 2013. As always, papers at Balisage can address any aspect of the use of markup and markup languages to represent information and build information systems. Possible topics include but are not limited to:

• XML and related technologies
• Non-XML markup languages
• Big Data and XML
• Implementation experience with XML parsing, XSLT processors, XQuery processors, XML databases, XProc integrations, or any markup-related technology
• Semantics, overlap, and other complex fundamental issues for markup languages
• Case studies of markup design and deployment
• Quality of information in markup systems
• JSON and XML
• Efficiency of Markup Software
• Markup systems in and for the mobile web
• The future of XML and of descriptive markup in general
• Interesting applications of markup

In addition, please consider becoming a Peer Reviewer. Reviewers play a critical role towards the success of Balisage. They review blind submissions — on topics that interest them — for technical merit, interest, and applicability. Your comments and recommendations can assist the Conference Committee in creating the program for Balisage 2013!

More IQ per square foot than any other conference you will attend in 2013!

### pdfx v1.0 [PDF-to-XML]

Thursday, December 27th, 2012

pdfx v1.0

From the homepage:

Fully-automated PDF-to-XML conversion of scientific text

I submitted Static and Dynamic Semantics of NoSQL Languages, a paper I blogged about earlier this week. Twenty-four pages of lots of citations and equations.

I forgot to set a timer but it isn’t for the impatient. I think the conversion ran more than ten (10) minutes.

Some mathematical notation defeats the conversion process.

See: Static-and-Dynamic-Semantics-NoSQL-Languages.tar.gz for the original PDF plus the HTML and XML outputs.

For occasional conversions where heavy math notation isn’t required, this may prove to be quite useful.

### BaseX. The XML Database. [XPath/XQuery]

Friday, December 21st, 2012

BaseX. The XML Database.

From the webpage:

News: BaseX 7.5 has just been released…

BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents.

Another XML editor but I mention it for its support of XQuery more than as an editor per se.

We continue to lack a standard query language for topic maps and experience with XQuery may prove informative.

Not to mention its possible role in gathering diverse data for presentation in a merged state to users.

### <ANGLES>

Friday, December 21st, 2012

<ANGLES>

From the homepage:

ANGLES is a research project aimed at developing a lightweight, online XML editor tuned to the needs of the scholarly text encoding community. By combining the model of intensive code development (the “code sprint”) with participatory design exercises, testing, and feedback from domain experts gathered at disciplinary conferences, ANGLES will contribute not only a working prototype of a new software tool but also another model for tool building in the digital humanities (the “community roadshow”).

Work on ANGLES began in November 2012.

We’ll have something to share very soon!

<ANGLES> is an extension of ACE:

ACE is an embeddable code editor written in JavaScript. It matches the features and performance of native editors such as Sublime, Vim and TextMate. It can be easily embedded in any web page and JavaScript application. ACE is maintained as the primary editor for Cloud9 IDE and is the successor of the Mozilla Skywriter (Bespin) project.

<ANGLES> code at Sourceforge.

I will be interested to see how ACE is extended. Just glancing at it this morning, it appears to be the traditional “display angle bang syntax” editor we all know so well.

What puzzles me is that we have been to the mountain of teaching users to be comfortable with raw XML markup, and the results have not been promising.

As opposed to the experience with OpenOffice, MS Office, etc., which have proven that creating documents that are then expressed in XML is within the range of ordinary users.

<ANGLES> looks like an interesting project but whether it brings XML editing within the reach of ordinary users is an open question.

If the XML editing puzzle is solved, perhaps it will have lessons for topic map editors.

### Balisage 2013 – Dates/Location

Tuesday, November 20th, 2012

Tommie Usdin just posted email with the Balisage 2013 dates and location:

Montreal, Hotel Europa, August 5–9, 2013

Hope that works with everything else.

That’s the entire email so I don’t know what was meant by:

Hope that works with everything else.

Short of it being your own funeral, open-heart surgery or giving birth (to your first child), I am not sure what “everything else” there could be?

You get a temporary excuse for the second two cases and a permanent excuse for the first one.

Now’s a good time to hint about plane fare plus hotel and expenses for Balisage as a stocking stuffer.

And to wish a happy holiday to Tommie Usdin and to all the folks at Mulberry Technology who make Balisage possible for all of us. Each and every one.

### XML-Print 1.0

Tuesday, September 11th, 2012

Prof. Marc W. Küster announced XML-Print 1.0 this week, “…an open source XML formatter designated especially for the needs of the Digital Humanities.”

Mapping from “…semantic structures to typesetting styles….” (from below)

We have always mapped from semantic structures to typesetting styles, but this time it will be explicit.

Consider whether you need a “transformation” (which implies static file output) or merely a “view” for some purpose, such as printing.

Both require mappings, but the latter keeps your options open, as it were.

Enjoy!

XML-Print allows the end user to directly interact with semantically annotated data. It consists of two independent, but well-integrated components, an Eclipse-based front-end that enables the user to map their semantic structures to typesetting styles, and the typesetting engine proper that produces the PDF based on this mapping. Both components build as much as possible on existing standards such as XML, XSL-T and XSL-FO and extend those only where absolutely necessary, e.g. for the handling of critical apparatuses.

XML-Print is a DFG-supported joint project of the FH Worms (Prof. Marc W. Küster) and the University of Trier (Prof. Claudine Moulin) in collaboration with the TU Darmstadt (Prof. Andrea Rapp). It is released under the Eclipse Public Licence (EPL) for the front-end and the Affero General Public Licence (AGPL) for the typesetting engine. The project is currently roughly half-way through its intended duration. In its final incarnation the PDF that is produced will satisfy the full set of requirements for the typesetting of (amongst others) critical editions including critical apparatuses, multicolumn synoptic texts and bidirectional text. At this stage it can already handle basic formatting as well as multiple apparatuses, albeit still with some restrictions and rough edges. It is work in progress with new releases coming out regularly.

If you have questions, please do not hesitate to contact us via our website http://www.xmlprint.eu or directly to print@uni-trier.de. Any and all feedback is welcome. Moreover, if you know some people you think could benefit from XML-Print, please feel free to spread the news amongst your peers.

Project homepage: http://www.xmlprint.eu
Source code: http://sourceforge.net/projects/xml-print/
Installers for Windows, Mac and Linux:
http://sourceforge.net/projects/xml-print/files/

### St. Laurent on Balisage

Sunday, August 12th, 2012

Applying markup to complexity: The blurry line between markup and programming by Simon St. Laurent.

Simon’s review of Balisage will make you want to attend next year, if you missed this year.

He misses an important issue with JSON (and XML) when he writes:

JSON gave programmers much of what they wanted: a simple format for shuttling (and sometimes storing) loosely structured data. Its simpler toolset, freed of a heritage of document formats and schemas, let programmers think less about information formats and more about the content of what they were sending.

XML and JSON look at data through different lenses. XML is a tree structure of elements, attributes, and content, while JSON is arrays, objects, and values. Element order matters by default in XML, while JSON is far less ordered and contains many more anonymous structures. (emphasis added)

The problem with JSON in a nutshell (apologies to O’Reilly): anonymous structures.

How is a subsequent programmer going to discover the semantics of “anonymous structures?”

Works great for job security, works less well for information integration several “generations” of programmers later.

XML can be poorly documented, just like JSON, but relationships between elements are explicit.

Anonymity, of all kinds, is the enemy of re-use of data, semantic integration and useful archiving of data.

If those aren’t your use cases, use anonymous JSON structures. (Or undocumented XML.)
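The point can be made in a few lines. Both snippets below carry the same record, but only the XML names what the values mean; the weather-reading example is invented purely for illustration:

```python
import json
import xml.etree.ElementTree as ET

# The same record, once as an anonymous JSON structure, once as XML.
anonymous = json.loads('[["2012-08-12", 23.4, true]]')

explicit = ET.fromstring(
    "<readings>"
    "<reading date='2012-08-12' tempCelsius='23.4' validated='true'/>"
    "</readings>"
)

# From the JSON alone there is no way to know what 23.4 *is*; a later
# programmer must guess. The XML names the relationship explicitly.
date, value, flag = anonymous[0]
reading = explicit.find("reading")
print(value, "vs", reading.get("tempCelsius"))
```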

### Using the flickr XML/API as a source of RSS feeds

Saturday, August 4th, 2012

Using the flickr XML/API as a source of RSS feeds by Pierre Lindenbaum.

Pierre has created an XSLT stylesheet to transform XML from flickr into an RSS feed.

Something for your data harvesting recipe box.
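For comparison, the same job can be sketched without XSLT. The `<photos>/<photo>` shape below is an assumed stand-in for the real flickr API response, and the link pattern is illustrative; Pierre’s stylesheet does this declaratively:

```python
import xml.etree.ElementTree as ET

def photos_to_rss(flickr_xml, feed_title="flickr photos"):
    """Turn a flickr-style <photos> response into a minimal RSS 2.0 feed.

    Assumes each <photo> carries id, owner, and title attributes, which
    is an approximation of the flickr API response, not a spec.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for photo in ET.fromstring(flickr_xml).iter("photo"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = photo.get("title")
        ET.SubElement(item, "link").text = (
            "https://www.flickr.com/photos/%s/%s"
            % (photo.get("owner"), photo.get("id"))
        )
    return rss

if __name__ == "__main__":
    sample = "<photos><photo id='1' owner='me' title='sunset'/></photos>"
    print(ET.tostring(photos_to_rss(sample)).decode())
```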

### If you are in Kolkata/Pune, India…a request.

Tuesday, July 17th, 2012

No email addresses are given for the authors of Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words, but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

Meaning of Web-page content plays a big role while produced a search result from a search engine. Most of the cases Web-page meaning stored in title or meta-tag area but those meanings do not always match with Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In such cases, where Webpage content holds dual meaning words that time it is really difficult to identify the meaning of the Web-page. In this paper, we are introducing a new design and development mechanism of identifying the Web-page content meaning which holds dual meaning words in their Web-page content.

### TUSTEP is open source – with TXSTEP providing a new XML interface

Monday, July 9th, 2012

TUSTEP is open source – with TXSTEP providing a new XML interface

I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP.

From the TUSTEP homepage:

TUSTEP is a professional toolbox for the scholarly processing of textual data (including those in non-latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).

Since the title “big data” is taken, perhaps we should take “complex data” for texts.

If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.

Or consider contributing to the project as well.

Wilhelm Ott writes (in part):

We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.

TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early 70ies as a tool for supporting humanities projects at the University of Tübingen, relying on own funds of the University. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (including lexicographic works and critical editions) can be found on the TUSTEP webpage.

TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:

• it will offer an up-to-date established syntax for scripting;
• it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
• it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
• it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
• the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.

At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEP’s functions will be presented. A demonstration of TXSTEP’s functionality will include tasks which cannot easily be performed by existing XML tools.

After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.

OK, I confess a fascination with complex textual analysis.

### MuteinDB

Friday, June 29th, 2012

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity you ask?

I will let the author’s answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back.

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML-based formats but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long-term useful results. (Which I think XML can deliver.)
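To see why a spreadsheet-based import format lowers the barrier, consider how little code it takes to lift rows into XML afterward. The sketch below is illustrative only: the column names (`enzyme`, `mutation`, `activity`) are hypothetical and not the actual MuteinDB import format.

```python
# Minimal sketch: turn spreadsheet-style CSV rows into XML records.
# Column names are invented for the example, not MuteinDB's real format.
import csv
import io
import xml.etree.ElementTree as ET

def rows_to_xml(csv_text, root_tag="muteins", row_tag="mutein"):
    """Convert CSV text (header row first) into a simple XML document."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rec = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # each spreadsheet column becomes a child element
            ET.SubElement(rec, column).text = value
    return ET.tostring(root, encoding="unicode")

data = "enzyme,mutation,activity\nCYP102A1,F87A,high\nCYP102A1,A74G,low\n"
print(rows_to_xml(data))
```

The point is that researchers can keep authoring in the spreadsheet while the database ingests a structured form.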

### Show Me The Money!

Monday, June 25th, 2012

I need to talk to Tommie Usdin about marketing the Balisage conference.

The final program came out today and here is what Tommie had to say:

When the regular (peer-reviewed) part of the Balisage 2012 program was scheduled, a few slots were reserved for presentation of “Late breaking” material. These presentations have now been selected and added to the program.

• making robust and multi-platform ebooks
• creating representative documents from large document collections
• validating RESTful services using XProc, XSLT, and XSD
• XML for design-based (e.g. magazine) publishing
• provenance in XSLT transformation (tracking what XSLT does to documents)
• literate programming
• managing the many XML-related standards and specifications
• leveraging XML for web applications

The program already included talks about adding RDF to TEI documents, compression of XML documents, exploring large XML collections, Schematron, relation of XML to JSON, overlap, higher-order functions in XSLT, the balance between XML and non-XML notations, and many other topics. Now it is a real must for anyone who thinks deeply about markup.

Balisage is the XML Geek-fest; the annual gathering of people who design markup and markup-based applications; who develop XML specifications, standards, and tools; the people who read and write books about publishing technologies in general and XML in particular; and super-users of XML and related technologies. You can read about the Balisage 2011 conference at http://www.balisage.net.

Yawn. Are we there yet?

Why you should care about XML and Balisage:

• The US government and others are publishing laws, regulations, and soon legislative material in XML
• Securities filings increasingly use XML for required government reports
• Texts and online data sets are being made available in XML
• All the major document formats are based on XML

A $billion here, a $billion there, and pretty soon you are talking about a real business opportunity.

Be smart, make your XML developers imaginative and productive.

Send your XML developers to Balisage.

### BaseX 7.3 (The Summer Edition) is now available!

Thursday, June 21st, 2012

BaseX 7.3 (The Summer Edition) is now available!

From the post:

we are glad to announce a great new release of BaseX, our XML database and XPath/XQuery 3.0 processor! Here are the latest features:

• Many new internal XQuery Modules have been added, and existing ones have been revised to ensure long-term stability of your future XQuery applications
• A new powerful Command API is provided to specify BaseX commands and scripts as XML
• The full-text fuzzy index was extended to also support wildcard queries
• The simple map operator of XQuery 3.0 gives you a compact syntax to process items of sequences
• BaseX as Web Application can now start its own server instance
• All command-line options will now be executed in the given order
• Charles Foster’s latest XQJ Driver supports XQuery 3.0 and the Update and Full Text extensions

For those of you in the Northern Hemisphere, we wish you a nice summer! No worries, we’ll stay busy..

Just in time for the start of summer in the Northern Hemisphere!

Something you can toss onto your laptop before you head to the beach.

Err, huh? Well, even if you don’t take BaseX 7.3 to the beach, it promises to be good fun for the summer and more serious work should the occasion arise.

I count twenty-three (23) modules in addition to the XQuery functions specified by the latest XPath/XQuery 3.0 draft.

Just so you know, the BaseX database server listens to port 1984 by default.

### XML to Graph Converter

Sunday, June 10th, 2012

XML to Graph Converter

From the webpage:

XML data can easily be converted into a graph. Simply paste the XML data into the left-hand side, convert it into Geoff, then view the results in the Neo4j console.

I would have modeled the XML differently, but that is probably a markup prejudice.

Still, an impressive demonstration and worth your time to review.
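The underlying idea is straightforward: treat each element as a node and each parent/child pair as a relationship. A minimal sketch, assuming a Geoff-like output notation (the exact Geoff syntax emitted by the demo may differ):

```python
# Rough sketch of XML-to-graph conversion: every element becomes a
# node, every parent/child pair a relationship. The output imitates
# Neo4j's Geoff notation; treat the exact syntax as an approximation.
import xml.etree.ElementTree as ET

def xml_to_geoff(xml_text):
    root = ET.fromstring(xml_text)
    lines, counter = [], 0

    def walk(elem, parent_id):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        lines.append(f'({node_id} {{"tag": "{elem.tag}"}})')
        if parent_id is not None:
            lines.append(f"({parent_id})-[:CHILD]->({node_id})")
        for child in elem:
            walk(child, node_id)

    walk(root, None)
    return "\n".join(lines)

print(xml_to_geoff("<book><title/><author/></book>"))
```

Note that this flattens document order and attributes away, which is exactly where modeling choices (and markup prejudices) come in.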

### A Pluggable XML Editor

Wednesday, June 6th, 2012

A Pluggable XML Editor by Grant Vergottini.

From the post:

Ever since I announced my HTML5-based XML editor, I’ve been getting all sorts of requests for a variety of implementations. While the focus has been, and continues to be, providing an Akoma Ntoso based legislative editor, I’ve realized that the interest in a web-based XML editor extends well beyond Akoma Ntoso and even legislative editors.

So… with that in mind I’ve started making some serious architectural changes to the base editor. From the get-go, my intent had been for the editor to be “pluggable” although I hadn’t totally thought it through. By “pluggable” I mean capable of allowing different information models to be used. I’m actually taking the model a bit further to allow modules to be built that can provide optional functionality to the base editor. What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

Let’s talk about the round-tripping problem for a moment. In the other XML editors I have worked with, the XML model has had to quite closely match the editing view that one works with. So you’re literally authoring the document using that information model. Think about HTML (or XHTML for an XML perspective). The arrangement of the tags pretty much exactly represents how you think of and deal with the components of the document. Paragraphs, headings, tables, images, etc, are all pretty much laid out how you would author them. This is the ideal situation as it makes building the editor quite straightforward.

Note the line:

What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

I think that means we don’t all have to use the same editing view and, at the same time, we can share an underlying format. Or perhaps we can even annotate texts with subject identities, not realizing we are helping others.
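Round-trippability is easy to state and easy to test: map the document model to an editing view, map it back, and verify nothing was lost. A toy illustration (real editors need far richer views; this one only preserves tags, text, and child order, and ignores attributes and tail text):

```python
# Toy round-trip check: XML model -> "editing view" (plain Python
# structures) -> XML model, then compare serializations.
import xml.etree.ElementTree as ET

def to_view(elem):
    """Project an element into a simplified editing view."""
    return {"tag": elem.tag,
            "text": elem.text or "",
            "children": [to_view(c) for c in elem]}

def from_view(view):
    """Rebuild the XML model from the editing view."""
    elem = ET.Element(view["tag"])
    elem.text = view["text"] or None
    for child in view["children"]:
        elem.append(from_view(child))
    return elem

doc = ET.fromstring("<p>Hello <b>world</b></p>")
round_tripped = from_view(to_view(doc))
assert ET.tostring(doc) == ET.tostring(round_tripped)
print(ET.tostring(round_tripped, encoding="unicode"))
```

Any information model that passes a check like this, for the features the view cares about, is a candidate for plugging into such an editor.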

This is an impressive bit of work and as the post promises, there is more to follow.

(I first saw this at Legal Informatics. http://legalinformatics.wordpress.com/2012/06/05/vergottini-on-improvements-to-akneditor-html-5-based-xml-editor-for-legislation/)

### Are You Going to Balisage?

Friday, June 1st, 2012

To the tune of “Are You Going to Scarborough Fair:”

Are you going to Balisage?
Parsley, sage, rosemary and thyme.
Remember me to one who is there,
she once was a true love of mine.

Tell her to make me an XML shirt,
Parsley, sage, rosemary, and thyme;
Without any seam or binary code,
Then she shall be a true lover of mine.

….

Oh, sorry! There you will see:

• higher-order functions in XSLT
• Schematron to enforce consistency constraints
• relation of the XML stack (the XDM data model) to JSON
• integrating JSON support into XDM-based technologies like XPath, XQuery, and XSLT
• XML and non-XML syntaxes for programming languages and documents
• type introspection in XQuery
• using XML to control processing in a document management system
• standardizing use of XQuery to support RESTful web interfaces
• RDF to record relations among TEI documents
• high-performance knowledge management system using an XML database
• a corpus of overlap samples
• an XSLT pipeline to translate non-XML markup for overlap into XML
• comparative entropy of various representations of XML
• interoperability of XML in web browsers
• XSLT extension functions to validate OCL constraints in UML models
• ontological analysis of documents
• statistical methods for exploring large collections of XML data

Balisage is an annual conference devoted to the theory and practice of descriptive markup and related technologies for structuring and managing information. Participants typically include XML users, librarians, archivists, computer scientists, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, members of the working groups which define the specifications, academics, industrial researchers, representatives of governmental bodies and NGOs, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Discussion is open, candid, and unashamedly technical.

The Balisage 2012 Program is now available at: http://www.balisage.net/2012/Program.html

### Destination: Montreal!

Tuesday, May 29th, 2012

If you remember the Saturday afternoon sci-fi movies, Destination: …., then you will appreciate the title for this post.

Tommie Usdin and company just posted: Balisage 2012 Call for Late-breaking News, written in torn bodice style:

The peer-reviewed part of the Balisage 2012 program has been scheduled (and will be announced in a few days). A few slots on the Balisage program have been reserved for presentation of “Late-breaking” material.

Proposals for late-breaking slots must be received by June 15, 2012. Selection of late-breaking proposals will be made by the Balisage conference committee, instead of being made in the course of the regular peer-review process.

If you have a presentation that should be part of Balisage, please send a proposal message as plain-text email to info@balisage.net.

In order to be considered for inclusion in the final program, your proposal message must supply the following information:

• The name(s) and affiliations of all author(s)/speaker(s)
• The email address of the presenter
• The title of the presentation
• An abstract of 100-150 words, suitable for immediate distribution
• Disclosure of when and where, if some part of this material has already been presented or published
• An indication as to whether the presenter is comfortable giving a conference presentation and answering questions in English about the material to be presented
• Your assurance that all authors are willing and able to sign the Balisage Non-exclusive Publication Agreement (http://www.balisage.net/BalisagePublicationAgreement.pdf) with respect to the proposed presentation

In order to be in serious contention for inclusion in the final program, your proposal should probably be either a) really late-breaking (it happened in the last month or two) or b) a paper, an extended paper proposal, or a very long abstract with references. Late-breaking slots are few and the competition is fiercer than for peer-reviewed papers. The more we know about your proposal, the better we can appreciate the quality of your submission.

Please feel encouraged to provide any other information that could aid the conference committee as it considers your proposal, such as a detailed outline, samples, code, and/or graphics. We expect to receive far more proposals than we can accept, so it’s important that you send enough information to make your proposal convincing and exciting. (This material may be attached to the email message, if appropriate.)

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity. (emphasis added to last sentence)

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity.

The conference committee might change your abstract and/or title to say something …. controversial? ….attention getting? ….CNN / Slashdot worthy?

Bring it on!

Submit late breaking proposals!

### TEI Boilerplate

Tuesday, April 24th, 2012

TEI Boilerplate

If you don’t know it, the TEI (Text Encoding Initiative) is one of the oldest digital humanities projects, dedicated to fashioning encoding solutions for non-digital texts. The Encoding Guidelines, as they are known, were designed to capture the complexities of pre-digital texts.

If you doubt the complexities of pre-digital texts, consider the following image of a cover page from the Leningrad Codex:

There are more complex pages, such as the mss. of Charles Peirce (Peirce Logic Notebook, Charles Sanders Peirce Papers MS Am 1632 (339). Houghton Library, Harvard University, Cambridge, Mass.):

And those are just a few random examples. Encoding pre-digital texts is a complex and rewarding field of study.

Not that “born digital” texts need concede anything to “pre-digital” texts. When you think about our capacity to capture versions, multiple authors, sources, interpretations of readers, discussions and the like, the wealth of material that can be associated with any one text becomes quite complex.

Consider for example the Harry Potter book series that spawned websites, discussion lists, interviews with the author, films and other resources. Not quite like the interpretative history of the Bible but enough to make an interesting problem.

Anything that can encode that range of texts is of necessity quite complex itself and therein lies the rub. You work very hard at document analysis, using or extending the TEI Guidelines to encode your text, now what?

You can:

1. Show the XML text to family and friends. Always a big hit at parties.
2. Use your tame XSLT wizard to create a custom conversion of the XML text so normal people will want to see and use it.
3. Use the TEI Boilerplate project for a stock delivery of the XML text so normal people will want to see and use it. (like your encoders, funders)

From the webpage:

TEI Boilerplate is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML. Our TEI Boilerplate Demo illustrates many TEI features rendered by TEI Boilerplate.

Browser Compatibility

TEI Boilerplate requires a robust, modern browser to do its work. It is compatible with current versions of Firefox, Chrome, Safari, and Internet Explorer (IE 9). If you have problems with TEI Boilerplate with a modern browser, please let us know by filing a bug report at https://sourceforge.net/p/teiboilerplate/tickets/.

Many thanks to John Walsh, Grant Simpson, and Saeed Moaddeli, all from Indiana University for this wonderful addition to the TEI toolbox!

PS: If you have disposable funds and aren’t planning on mining asteroids, please consider donating to the TEI (Text Encoding Initiative). Even asteroid miners need to know Earth history, a history written in texts.

### XML data clustering: An overview

Saturday, February 25th, 2012

XML data clustering: An overview by Alsayed Algergawy, Marco Mesiti, Richi Nayak, and Gunter Saake.

Abstract:

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article moves into the description of future trends and research issues that still need to be faced.

I thought this survey article would be of particular interest since it covers the syntax and semantics of XML that contains data.

Not to mention that our old friend, heterogeneous data, isn’t far behind:

Since XML data are engineered by different people, they often have different structural and terminological heterogeneities. The integration of heterogeneous data sources requires many tools for organizing and making their structure and content homogeneous. XML data integration is a complex activity that involves reconciliation at different levels: (1) at schema level, reconciling different representations of the same entity or property, and (2) at instance level, determining if different objects coming from different sources represent the same real-world entity. Moreover, the integration of Web data increases the integration process challenges in terms of heterogeneity of data. Such data come from different resources and it is quite hard to identify the relationship with the business subjects. Therefore, a first step in integrating XML data is to find clusters of the XML data that are similar in semantics and structure [Lee et al. 2002; Viyanon et al. 2008]. This allows system integrators to concentrate on XML data within each cluster. We remark that reconciling similar XML data is an easier task than reconciling XML data that are different in structures and semantics, since the later involves more restructuring. (emphasis added)
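One of the simplest structural similarity measures in the survey’s taxonomy is path-based: represent each document by its set of root-to-element tag paths and compare the sets with the Jaccard coefficient. A minimal sketch (the documents are invented for the example):

```python
# Path-based structural similarity: each document is reduced to the
# set of its root-to-element tag paths; similarity is the Jaccard
# coefficient of the two path sets.
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc1 = "<article><title/><body><p/></body></article>"
doc2 = "<article><title/><body><p/><p/></body></article>"
doc3 = "<recipe><ingredients/><steps/></recipe>"

print(jaccard(tag_paths(doc1), tag_paths(doc2)))  # same path set -> 1.0
print(jaccard(tag_paths(doc1), tag_paths(doc3)))  # disjoint paths -> 0.0
```

Documents scoring high on such a measure land in the same cluster, which is exactly the pre-grouping step the authors recommend before attempting integration.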

First, print out or photocopy Table II on page 35, “Features of XML Clustering Approaches.” It will be a handy reminder/guide as you read the coverage of the various techniques.

Second, on the last page, page 41, note that the article was accepted in October of 2009 but not published until October of 2011. It’s great that the ACM has an abundance of excellent survey articles, but a two-year delay in publication is unreasonable.

Surveys in rapidly developing fields are of most interest when they are timely. Electronic publication upon final acceptance should be the rule at an organization such as the ACM.

### Having a ChuQL at XML on the Cloud

Friday, February 24th, 2012

Having a ChuQL at XML on the Cloud by Shahan Khatchadourian, Mariano P. Consens, and Jérôme Siméon.

Abstract:

MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduce.

In this paper, we describe ChuQL, a MapReduce extension to XQuery, with its corresponding Hadoop implementation. The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework. The ChuQL implementation distributes computation to multiple XQuery engines, providing developers with an expressive language to describe tasks over big data.

The aggregation and co-grouping were the most interesting examples for me.

The description of ChuQL was a bit thin. Pointers to more resources would be appreciated.
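Since the paper is thin on detail, it may help to see the key/value flow the abstract describes in miniature. ChuQL itself extends XQuery; the sketch below only mimics the MapReduce pattern it builds on, in plain Python, and the record format is invented for the illustration:

```python
# MapReduce over XML records, in miniature: map each record to
# (key, value) pairs, shuffle by key, then reduce each group.
import xml.etree.ElementTree as ET
from collections import defaultdict

records = [
    "<log host='a'><hits>3</hits></log>",
    "<log host='b'><hits>5</hits></log>",
    "<log host='a'><hits>2</hits></log>",
]

def map_fn(xml_text):
    rec = ET.fromstring(xml_text)
    yield rec.get("host"), int(rec.findtext("hits"))

def reduce_fn(key, values):
    return key, sum(values)

# shuffle: group mapped pairs by key
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

results = dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
print(results)  # {'a': 5, 'b': 5}
```

In ChuQL the map and reduce bodies would be XQuery expressions handed to Hadoop, with records carrying the key/value pairs.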

### Would You Know “Good” XML If It Bit You?

Tuesday, February 14th, 2012

XML is a pale imitation of a markup language. It has resulted in real horrors across the markup landscape. After years in its service, I don’t have much hope of that changing.

But, the Princess of the Northern Marches has organized a war council to consider how to stem the tide of bad XML. Despite my personal misgivings, I wish them well and invite you to participate as you see fit.

Oh, and I found this message about the council meeting:

International Symposium on Quality Assurance and Quality Control in XML

Monday August 6, 2012

Paper submissions due April 20, 2012.

A one-day discussion of issues relating to Quality Control and Quality Assurance in the XML environment.

XML systems and software are complex and constantly changing. XML documents are highly varied, may be large or small, and often have complex life-cycles. In this challenging environment quality is difficult to define, measure, or control, yet the justifications for using XML often include promises or implications relating to quality.

We invite papers on all aspects of quality with respect to XML systems, including but not limited to:

• Defining, measuring, testing, improving, and documenting quality
• Quality in documents, document models, software, transformations, or queries
• Case studies in the control of quality in an XML environment
• Theoretical or practical approaches to measuring quality in XML
• Does the presence of XML, XML schemas, and XML tools make quality checking easier, harder, or even different from other computing environments?
• Should XML transforms and schemas be QAed as software? Or configuration files? Or documents? Does it matter?

Paper submissions due April 20, 2012.

Details at: http://www.balisage.net/QA-QC/

You do have to understand the semantics of even imitation markup languages before mapping them with more robust languages. Enjoy!

### XML Prague 2012 (proceedings)

Sunday, February 12th, 2012

Fourteen papers by the leading lights in the XML world covering everything from XProc and XQuery to NVDL and JSONiq, and places in between.