Archive for the ‘XML’ Category

Overlap – Attacking on Machine Learning Models

Saturday, August 5th, 2017

Robust Physical-World Attacks on Machine Learning Models by Ivan Evtimov, et al.


Deep neural network-based classifiers are known to be vulnerable to adversarial examples that can fool them into misclassifying their input through the addition of small-magnitude perturbations. However, recent studies have demonstrated that such adversarial examples are not very effective in the physical world–they either completely fail to cause misclassification or only work in restricted cases where a relatively complex image is perturbed and printed on paper. In this paper we propose a new attack algorithm–Robust Physical Perturbations (RP2)– that generates perturbations by taking images under different conditions into account. Our algorithm can create spatially-constrained perturbations that mimic vandalism or art to reduce the likelihood of detection by a casual observer. We show that adversarial examples generated by RP2 achieve high success rates under various conditions for real road sign recognition by using an evaluation methodology that captures physical world conditions. We physically realized and evaluated two attacks, one that causes a Stop sign to be misclassified as a Speed Limit sign in 100% of the testing conditions, and one that causes a Right Turn sign to be misclassified as either a Stop or Added Lane sign in 100% of the testing conditions.

I was struck by the image used for this paper in a tweet:

I recognized this as an “overlapping” markup problem before discovering the authors were attacking machine learning models. On overlapping markup, see: Towards the unification of formats for overlapping markup by Paolo Marinelli, Fabio Vitali, Stefano Zacchiroli, or more recently, It’s more than just overlap: Text As Graph – Refining our notion of what text really is—this time for sure! by Ronald Haentjens Dekker and David J. Birnbaum.

From the conclusion:

In this paper, we introduced Robust Physical Perturbations (RP2), an algorithm that generates robust, physically realizable adversarial perturbations. Previous algorithms assume that the inputs of DNNs can be modified digitally to achieve misclassification, but such an assumption is infeasible, as an attacker with control over DNN inputs can simply replace it with an input of his choice. Therefore, adversarial attack algorithms must apply perturbations physically, and in doing so, need to account for new challenges such as a changing viewpoint due to distances, camera angles, different lighting conditions, and occlusion of the sign. Furthermore, fabrication of a perturbation introduces a new source of error due to a limited color gamut in printers.

We use RP2 to create two types of perturbations: subtle perturbations, which are small, undetectable changes to the entire sign, and camouflage perturbations, which are visible perturbations in the shape of graffiti or art. When the Stop sign was overlayed with a print out, subtle perturbations fooled the classifier 100% of the time under different physical conditions. When only the perturbations were added to the sign, the classifier was fooled by camouflage graffiti and art perturbations 66.7% and 100% of the time respectively under different physical conditions. Finally, when an untargeted poster-printed camouflage perturbation was overlayed on a Right Turn sign, the classifier was fooled 100% of the time. In future work, we plan to test our algorithm further by varying some of the other conditions we did not consider in this paper, such as sign occlusion.

Excellent work but my question: Is the inability of the classifier to recognize overlapping images similar to the issues encountered as overlapping markup?

To be sure overlapping markup is in part an artifice of unimaginative XML rules, since overlapping texts are far more common than non-overlapping texts. Especially when talking about critical editions or even differing analysis of the same text.

But beyond syntax, there is the subtlety of treating separate “layers” or stacks of a text as separate and yet tracking the relationship between two or more such stacks, when arbitrary additions or deletions can occur in any of them. Additions and deletions that must be accounted for across all layers/stacks.

I don’t have a solution to offer but pose the question of layers of recognition in hopes that machine learning models can capitalize on the lessons learned about a very similar problem with overlapping markup.

It’s more than just overlap: Text As Graph

Wednesday, August 2nd, 2017

It’s more than just overlap: Text As Graph – Refining our notion of what text really is—this time for sure! by Ronald Haentjens Dekker and David J. Birnbaum.


The XML tree paradigm has several well-known limitations for document modeling and processing. Some of these have received a lot of attention (especially overlap), and some have received less (e.g., discontinuity, simultaneity, transposition, white space as crypto-overlap). Many of these have work-arounds, also well known, but—as is implicit in the term “work-around”—these work-arounds have disadvantages. Because they get the job done, however, and because XML has a large user community with diverse levels of technological expertise, it is difficult to overcome inertia and move to a technology that might offer a more comprehensive fit with the full range of document structures with which researchers need to interact both intellectually and programmatically. A high-level analysis of why XML has the limitations it has can enable us to explore how an alternative model of Text as Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic way than is available within an XML paradigm.

Hyperedges, texts and XML, what more could you need? 😉

This paper merits a deep read and testing by everyone interested in serious text modeling.

You can’t read the text but here is a hypergraph visualization of an excerpt from Lewis Carroll’s “The hunting of the Snark:”

The New Testament, the Hebrew Bible, to say nothing of the Rabbinic commentaries on the Hebrew Bible and centuries of commentary on other texts could profit from this approach.

Put your text to the test and share how to advance this technique!

XSL Transformations (XSLT) Version 3.0 (That’s a Wrap!)

Thursday, June 8th, 2017

XSL Transformations (XSLT) Version 3.0 W3C Recommendation 8 June 2017


This specification defines the syntax and semantics of XSLT 3.0, a language designed primarily for transforming XML documents into other XML documents.

XSLT 3.0 is a revised version of the XSLT 2.0 Recommendation [XSLT 2.0] published on 23 January 2007.

The primary purpose of the changes in this version of the language is to enable transformations to be performed in streaming mode, where neither the source document nor the result document is ever held in memory in its entirety. Another important aim is to improve the modularity of large stylesheets, allowing stylesheets to be developed from independently-developed components with a high level of software engineering robustness.

XSLT 3.0 is designed to be used in conjunction with XPath 3.0, which is defined in [XPath 3.0]. XSLT shares the same data model as XPath 3.0, which is defined in [XDM 3.0], and it uses the library of functions and operators defined in [Functions and Operators 3.0]. XPath 3.0 and the underlying function library introduce a number of enhancements, for example the availability of higher-order functions.

As an implementer option, XSLT 3.0 can also be used with XPath 3.1. All XSLT 3.0 processors provide maps, an addition to the data model which is specified (identically) in both XSLT 3.0 and XPath 3.1. Other features from XPath 3.1, such as arrays, and new functions such as random-number-generatorFO31 and sortFO31, are available in XSLT 3.0 stylesheets only if the implementer chooses to support XPath 3.1.

Some of the functions that were previously defined in the XSLT 2.0 specification, such as the format-dateFO30 and format-numberFO30 functions, are now defined in the standard function library to make them available to other host languages.

XSLT 3.0 also includes optional facilities to serialize the results of a transformation, by means of an interface to the serialization component described in [XSLT and XQuery Serialization]. Again, the new serialization capabilities of [XSLT and XQuery Serialization 3.1] are available at the implementer’s option.

This document contains hyperlinks to specific sections or definitions within other documents in this family of specifications. These links are indicated visually by a superscript identifying the target specification: for example XP30 for XPath 3.0, DM30 for the XDM data model version 3.0, FO30 for Functions and Operators version 3.0.

A special shout out to Michael Kay for, in his words, “Done and dusted: ten years’ work.”

Thanks from an appreciative audience!

Balisage: The Markup Conference 2017 Program Now Available

Wednesday, May 17th, 2017

Balisage: The Markup Conference 2017 Program Now Available

An email from Tommie Usdin, Chair, Chief Organizer and herder of markup cats for Balisage advises:

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2017 program includes papers discussing XML vocabularies, cutting-edge digital humanities, lossless JSON/XML roundtripping, reflections on concrete syntax and abstract syntax, parsing and generation, web app development using the XML stack, managing test cases, pipelining and micropipelinging, electronic health records, rethinking imperative algorithms for XSLT and XQuery, markup and intellectual property, why YOU should use (my favorite XML vocabulary), developing a system to aid in studying manuscripts in the tradition of the Ethiopian and Eritrean Highlands, exploring “shapes” in RDF and their relationship to schema validation, exposing XML data to users of varying technical skill, test-suite management, and use case studies about large conversion applications, DITA, and SaxonJS.

Up-Translation and Up-Transformation: A one-day Symposium on the goals, challenges, solutions, and workflows for significant XML enhancements, including approaches, tools, and techniques that may potentially be used for a variety of other tasks. The symposium will be of value not only to those facing up-translation and transformation but also to general XML practitioners seeking to get the most out of their data.

Are you interested in open information, reusable documents, and vendor and application independence? Then you need descriptive markup, and Balisage is your conference. Balisage brings together document architects, librarians, archivists, computer scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while other still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2017 Program:

Symposium Program:

NOTE: Members of the TEI and their employees are eligible for discount Balisage registration.

You need to see the program for yourself but the highlights (for me) include: Ethiopic manuscripts (ok, so I have odd tastes), Earley parsers (of particular interest), English Majors (my wife was an English major), and a number of other high points.

Mark your calendar for July 31 – August 4, 2017 – It’s Balisage!

ARM Releases Machine Readable Architecture Specification (Intel?)

Saturday, April 22nd, 2017

ARM Releases Machine Readable Architecture Specification by Alastair Reid.

From the post:

Today ARM released version 8.2 of the ARM v8-A processor specification in machine readable form. This specification describes almost all of the architecture: instructions, page table walks, taking interrupts, taking synchronous exceptions such as page faults, taking asynchronous exceptions such as bus faults, user mode, system mode, hypervisor mode, secure mode, debug mode. It details all the instruction formats and system register formats. The semantics is written in ARM’s ASL Specification Language so it is all executable and has been tested very thoroughly using the same architecture conformance tests that ARM uses to test its processors (See my paper “Trustworthy Specifications of ARM v8-A and v8-M System Level Architecture”.)

The specification is being released in three sets of XML files:

  • The System Register Specification consists of an XML file for each system register in the architecture. For each register, the XML details all the fields within the register, how to access the register and which privilege levels can access the register.
  • The AArch64 Specification consists of an XML file for each instruction in the 64-bit architecture. For each instruction, there is the encoding diagram for the instruction, ASL code for decoding the instruction, ASL code for executing the instruction and any supporting code needed to execute the instruction and the decode tree for finding the instruction corresponding to a given bit-pattern. This also contains the ASL code for the system architecture: page table walks, exceptions, debug, etc.
  • The AArch32 Specification is similar to the AArch64 specification: it contains encoding diagrams, decode trees, decode/execute ASL code and supporting ASL code.

Alastair provides starting points for use of this material by outlining his prior uses of the same.

Raises the question why an equivalent machine readable data set isn’t available for Intel® 64 and IA-32 Architectures? (PDF manuals)

The data is there, but not in a machine readable format.

Anyone know why Intel doesn’t provide the same convenience?

XSL Transformations (XSLT) Version 3.0 (Proposed Recommendation 18 April 2017)

Tuesday, April 18th, 2017

XSL Transformations (XSLT) Version 3.0 (Proposed Recommendation 18 April 2017)

Michael Kay tweeted today:

XSLT 3.0 is a Proposed Recommendation: It’s taken ten years but we’re nearly there!

Congratulations to Michael and the entire team!

What’s new?

A major focus for enhancements in XSLT 3.0 is the requirement to enable streaming of source documents. This is needed when source documents become too large to hold in main memory, and also for applications where it is important to start delivering results before the entire source document is available.

While implementations of XSLT that use streaming have always been theoretically possible, the nature of the language has made it very difficult to achieve this in practice. The approach adopted in this specification is twofold: it identifies a set of restrictions which, if followed by stylesheet authors, will enable implementations to adopt a streaming mode of operation without placing excessive demands on the optimization capabilities of the processor; and it provides new constructs to indicate that streaming is required, or to express transformations in a way that makes it easier for the processor to adopt a streaming execution plan.

Capabilities provided in this category include:

  • A new xsl:source-document instruction, which reads and processes a source document, optionally in streaming mode;
  • The ability to declare that a mode is a streaming mode, in which case all the template rules using that mode must be streamable;
  • A new xsl:iterate instruction, which iterates over the items in a sequence, allowing parameters for the processing of one item to be set during the processing of the previous item;
  • A new xsl:merge instruction, allowing multiple input streams to be merged into a single output stream;
  • A new xsl:fork instruction, allowing multiple computations to be performed in parallel during a single pass through an input document.
  • Accumulators, which allow a value to be computed progressively during streamed processing of a document, and accessed as a function of a node in the document, without compromise to the functional nature of the XSLT language.

A second focus for enhancements in XSLT 3.0 is the introduction of a new mechanism for stylesheet modularity, called the package. Unlike the stylesheet modules of XSLT 1.0 and 2.0 (which remain available), a package defines an interface that regulates which functions, variables, templates and other components are visible outside the package, and which can be overridden. There are two main goals for this facility: it is designed to deliver software engineering benefits by improving the reusability and maintainability of code, and it is intended to streamline stylesheet deployment by allowing packages to be compiled independently of each other, and compiled instances of packages to be shared between multiple applications.

Other significant features in XSLT 3.0 include:

  • An xsl:evaluate instruction allowing evaluation of XPath expressions that are dynamically constructed as strings, or that are read from a source document;
  • Enhancements to the syntax of patterns, in particular enabling the matching of atomic values as well as nodes;
  • An xsl:try instruction to allow recovery from dynamic errors;
  • The element xsl:global-context-item, used to declare the stylesheet’s expectations of the global context item (notably, its type).
  • A new instruction xsl:assert to assist developers in producing correct and robust code.

XSLT 3.0 also delivers enhancements made to the XPath language and to the standard function library, including the following:

  • Variables can now be bound in XPath using the let expression.
  • Functions are now first class values, and can be passed as arguments to other (higher-order) functions, making XSLT a fully-fledged functional programming language.
  • A number of new functions are available, for example trigonometric functions, and the functions parse-xmlFO30 and serializeFO30 to convert between lexical and tree representations of XML.

XSLT 3.0 also includes support for maps (a data structure consisting of key/value pairs, sometimes referred to in other programming languages as dictionaries, hashes, or associative arrays). This feature extends the data model, provides new syntax in XPath, and adds a number of new functions and operators. Initially developed as XSLT-specific extensions, maps have now been integrated into XPath 3.1 (see [XPath 3.1]). XSLT 3.0 does not require implementations to support XPath 3.1 in its entirety, but it does requires support for these specific features.

This will remain a proposed recommendation until 1 June 2017.

How close can you read? 😉


XQuery 3.1 and Company! (Deriving New Versions?)

Wednesday, March 22nd, 2017

XQuery 3.1: An XML Query Language W3C Recommendation 21 March 2017


Related reading of interest:

XML Path Language (XPath) 3.1

XPath and XQuery Functions and Operators 3.1

XQuery and XPath Data Model 3.1

These recommendations are subject to licenses that read in part:

No right to create modifications or derivatives of W3C documents is granted pursuant to this license, except as follows: To facilitate implementation of the technical specifications set forth in this document, anyone may prepare and distribute derivative works and portions of this document in software, in supporting materials accompanying software, and in documentation of software, PROVIDED that all such works include the notice below. HOWEVER, the publication of derivative works of this document for use as a technical specification is expressly prohibited.

You know I think the organization of XQuery 3.1 and friends could be improved but deriving and distributing “improved” versions is expressly prohibited.

Hmmm, but we are talking about XML and languages to query and transform XML.

Consider the potential of an query that calls XQuery 3.1: An XML Query Language and materials cited in it, then returns a version of XQuery 3.1 that has definitions from other standards off-set in the XQuery 3.1 text.

Or than inserts into the text examples or other materials.

For decades XML enthusiasts have bruited about dynamic texts but have produced damned few of them (as in zero) as their standards.

Let’s use the “no derivatives” language of the W3C as an incentive to not create another static document but a dynamic one that can grow or contract according to the wishes of its reader.

Suggestions for first round features?

Balisage Papers Due in 3 Weeks!

Thursday, March 16th, 2017

Apologies for the sudden lack of posting but I have been working on a rather large data set with XQuery and checking forwards and backwards to make sure it can be replicated. (I hate “it works on my computer.”)

Anyway, Tommie Usdin dropped an email bomb today with a reminder that Balisage papers are due on April 7, 2017.

From her email:

Submissions to “Balisage: The Markup Conference” and pre-conference symposium:
“Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions”
are on April 7.

It is time to start writing!

Balisage: The Markup Conference 2017
August 1 — 4, 2017, Rockville, MD (a suburb of Washington, DC)
July 31, 2017 — Symposium Up-Translation and Up-Transformation

Balisage: where serious markup practitioners and theoreticians meet every August. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

• Web application development with XML
• Informal data models and consensus-based vocabularies
• Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
• Performance issues in parsing, XML database retrieval, or XSLT processing
• Development of angle-bracket-free user interfaces for non-technical users
• Semistructured data and full text search
• Deployment of XML systems for enterprise data
• Web application development with XML
• Design and implementation of XML vocabularies
• Case studies of the use of XML for publishing, interchange, or archiving
• Alternatives to XML
• the role(s) of XML in the application lifecycle
• the role(s) of vocabularies in XML environments

Detailed Call for Participation:
About Balisage:

pre-conference symposium:
Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions
Chair: Evan Owens, Cenveo

Increasing the granularity and/or specificity of markup is an important task in many content and information workflows. Markup transformations might involve tasks such as high-level structuring, detailed component structuring, or enhancing information by matching or linking to external vocabularies or data. Enhancing markup presents secondary challenges including lack of structure of the inputs or inconsistency of input data down to the level of spelling, punctuation, and vocabulary. Source data for up-translation may be XML, word processing documents, plain text, scanned & OCRed text, or databases; transformation goals may be content suitable for page makeup, search, or repurposing, in XML, JSON, or any other markup language.

The range of approaches to up-transformation is as varied as the variety of specifics of the input and required outputs. Solutions may combine automated processing with human review or could be 100% software implementations. With the potential for requirements to evolve over time, tools may have to be actively maintained and enhanced. This is the place to discuss goals, challenges, solutions, and workflows for significant XML enhancements, including approaches, tools, and techniques that may potentially be used for a variety of other tasks.

For more information: or +1 301 315 9631

I’m planning on posting tomorrow one way or the other!

While you wait for that, get to work on your Balisage paper!

XQuery Ready CIA Vault7 Files

Friday, March 10th, 2017

I have extracted the HTML files from WikiLeaks Vault7 Year Zero 2017 V 1.7z, processed them with Tidy (see note on correction below), and uploaded the “tidied” HTML files to: Vault7-CIA-Clean-HTML-Only.

Beyond the usual activities of Tidy, I did have to correct the file page_26345506.html: by creating a comment around one line of content:

<!– <declarations><string name=”½ö”></string></declarations&>lt;p>›<br> –>

Otherwise, the files are only corrected HTML markup with no other changes.

The HTML compresses well, 7811 files coming in at 3.4 MB.

Demonstrate the power of basic XQuery skills!


Why I Love XML (and Good Thing, It’s Everywhere) [Needs Subject Identity Too]

Sunday, March 5th, 2017

Why I Love XML (and Good Thing, It’s Everywhere) by Lee Pollington.

Lee makes a compelling argument for XML as the underlying mechanism for data integration when saying:

…Perhaps the data in your relational databases is structured. What about your knowledge management systems, customer information systems, document systems, CMS, mail, etc.? How do you integrate that data with structured data to get a holistic view of all your data? What do you do when you want to bring a group of relational schemas from different systems together to get that elusive 360 view – which is being demanded by the world’s regulators banks? Mergers and acquisitions drive this requirement too. How do you search across that data?

Sure there are solution stack answers. We’ve all seen whiteboards with ever growing number of boxes and those innocuous puny arrows between them that translate to teams of people, buckets of code, test and operations teams. They all add up to ever-increasing costs, complexity, missed deadlines & market share loss. Sound overly dramatic? Gartner calculated a worldwide spend of $5 Billion on data integration software in 2015. How much did you spend … would you know where to start calculating that cost?

While pondering what you spend on a yearly basis for data integration, contemplate two more questions from Lee:

…So take a moment to think about how you treat the data format that underpins your intellectual property? First-class citizen or after-thought?…

If you are treating your XML elements as first class citizens, do tell me that you created subject identity tests for those subjects?

So that a programmer new to your years of legacy XML will understand that <MFBM>, <MBFT> and <MBF> elements are all expressed in units of 1,000 board feet.


Reducing the cost of data integration tomorrow, next year and five years after that, requires investment in the here and now.

Perhaps that is why data integration costs continue to climb.

Why pay for today what can be put off until tomorrow? (Future conversion costs are a line item in some future office holder’s budget.)

Oxford Dictionaries Thesaurus Data – XQuery

Sunday, February 12th, 2017

Retrieve Oxford Dictionaries API Thesaurus Data as XML with XQuery and BaseX by Adam Steffanick.

From the post:

We retrieved thesaurus data from the Oxford Dictionaries application programming interface (API) and returned Extensible Markup Language (XML) with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial illustrates how to retrieve thesaurus data—synonyms and antonyms—as XML from the Oxford Dictionaries API with XQuery and BaseX.

The Oxford Dictionaries API returns JavaScript Object Notation (JSON) responses that yield undesired XML structures when converted automatically with BaseX. Fortunately, we’re able to use XQuery to fill in some blanks after converting JSON to XML. My GitHub repository od-api-xquery contains XQuery code for this tutorial.

If you are having trouble staying at your computer during this unreasonably warm spring, this XQuery/Oxford Dictionary tutorial may help!

Ok, maybe that is an exaggeration but only a slight one. 😉


Digital Humanities / Studies: U.Pitt.Greenberg

Wednesday, February 1st, 2017

Digital Humanities / Studies: U.Pitt.Greenberg maintained by Elisa E. Beshero-Bondar.

I discovered this syllabus and course materials by accident when one of its modules on XQuery turned up in a search. Backing out of that module I discovered this gem of a digital humanities course.

The course description:

Our course in “digital humanities” and “digital studies” is designed to be interdisciplinary and practical, with an emphasis on learning through “hands-on” experience. It is a computer course, but not a course in which you learn programming for the sake of learning a programming language. It’s a course that will involve programming, and working with coding languages, and “putting things online,” but it’s not a course designed to make you, in fifteen weeks, a professional website designer. Instead, this is a course in which we prioritize what we can investigate in the Humanities and related Social Sciences fields about cultural, historical, and literary research questions through applications in computer coding and programming, which you will be learning and applying as you go in order to make new discoveries and transform cultural objects—what we call “texts” in their complex and multiple dimensions. We think of “texts” as the transmittable, sharable forms of human creativity (mainly through language), and we interface with a particular text in multiple ways through print and electronic “documents.” When we refer to a “document,” we mean a specific instance of a text, and much of our work will be in experimenting with the structures of texts in digital document formats, accessing them through scripts we write in computer code—scripts that in themselves are a kind of text, readable both by humans and machines.

Your professors are scholars and teachers of humanities, not computer programmers by trade, and we teach this course from our backgrounds (in literature and anthropology, respectively). We teach this course to share coding methods that are highly useful to us in our fields, with an emphasis on working with texts as artifacts of human culture shaped primarily with words and letters—the forms of “written” language transferable to many media (including image and sound) that we can study with computer modelling tools that we design for ourselves based on the questions we ask. We work with computers in this course as precision instruments that help us to read and process great quantities of information, and that lead us to make significant connections, ask new kinds of questions, and build models and interfaces to change our reading and thinking experience as people curious about human history, culture, and creativity.

Our focus in this course is primarily analytical: to apply computer technologies to represent and investigate cultural materials. As we design projects together, you will gain practical experience in editing and you will certainly fine-tune your precision in writing and thinking. We will be working primarily with eXtensible Markup Language (XML) because it is a powerful tool for modelling texts that we can adapt creatively to our interests and questions. XML represents a standard in adaptability and human-readability in digital code, and it works together with related technologies with which you will gain working experience: You’ll learn how to write XPath expressions: a formal language for searching and extracting information from XML code which serves as the basis for transforming XML into many publishable forms, using XSLT and XQuery. You’ll learn to write XSLT: a programming “stylesheet” transforming language designed to convert XML to publishable formats, as well as XQuery, a query (or search) language for extracting information from XML files bundled collectively. You will learn how to design your own systematic coding methods to work on projects, and how to write your own rules in schema languages (like Schematron and Relax-NG) to keep your projects organized and prevent errors. You’ll gain experience with an international XML language called TEI (after the Text Encoding Initiative) which serves as the international standard for coding digital archives of cultural materials. Since one of the best and most widely accessible ways to publish XML is on the worldwide web, you’ll gain working experience with HTML code (a markup language that is a kind of XML) and styling HTML with Cascading Stylesheets (CSS). We will do all of this with an eye to your understanding how coding works—and no longer relying without question on expensive commercial software as the “only” available solution, because such software is usually not designed with our research questions in mind.

We think you’ll gain enough experience at least to become a little dangerous, and at the very least more independent as investigators and makers who wield computers as fit instruments for your own tasks. Your success will require patience, dedication, and regular communication and interaction with us, working through assignments on a daily basis. Your success will NOT require perfection, but rather your regular efforts throughout the course, your documenting of problems when your coding doesn’t yield the results you want. Homework exercises are a back-and-forth, intensive dialogue between you and your instructors, and we plan to spend a great deal of time with you individually over these as we work together. Our guiding principle in developing assignments and working with you is that the best way for you to learn and succeed is through regular practice as you hone your skills. Our goal is not to make you expert programmers (as we are far from that ourselves)! Rather, we want you to learn how to manipulate coding technologies for your own purposes, how to track down answers to questions, how to think your way algorithmically through problems and find good solutions.

Skimming the syllabus rekindles an awareness of the distinction between the “hard” sciences and the “difficult” ones.



After yesterday’s post, Elisa Beshero-Bondar tweeted this one course is now two:

At a new homepage: newtFire {dh|ds}!


Up-Translation and Up-Transformation … [Balisage Rocks!]

Sunday, January 29th, 2017

Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions (a Balisage pre-conference symposium)

When & Where:

Monday July 31, 2017
CAMBRiA Hotel, Rockville, MD USA

Chair: Evan Owens, Cenveo

You need more details than that?

Ok, from the webpage:

Increasing the granularity and/or specificity of markup is an important task in many different content and information workflows. Markup transformations might involve tasks such as high-level structuring, detailed component structuring, or enhancing information by matching or linking to external vocabularies or data. Enhancing markup presents numerous secondary challenges including lack of structure of the inputs or inconsistency of input data down to the level of spelling, punctuation, and vocabulary. Source data for up-translation may be XML, word processing documents, plain text, scanned & OCRed text, or databases; transformation goals may be content suitable for page makeup, search, or repurposing, in XML, JSON, or any other markup language.

The range of approaches to up-transformation is as varied as the variety of specifics of the input and required outputs. Solutions may combine automated processing with human review or could be 100% software implementations. With the potential for requirements to evolve over time, tools may have to be actively maintained and enhanced.

The presentations in this pre-conference symposium will include goals, challenges, solutions, and workflows for significant XML enhancements, including approaches, tools, and techniques that may potentially be used for a variety of other tasks. The symposium will be of value not only to those facing up-translation and transformation but also to general XML practitioners seeking to get the most out of their data.

If I didn’t know better, up-translation and up-transformation sound suspiciously like conferred properties of topic maps fame.

Well, modulo that conferred properties could be predicated on explicit subject identity and not hidden in the personal knowledge of the author.

There are two categories of up-translation and up-transformation:

  1. Ones that preserve jobs like spaghetti Cobol code, and
  2. Ones that support easy long term maintenance.

While writing your paper for the pre-conference, which category fits yours the best?

XQuery Update 3.0 – WG Notes

Tuesday, January 24th, 2017

A tweet by Jonathan Robie points out:

XQuery Update Facility 3.0


XQuery Update Facility 3.0 Requirements and Use Cases

have been issued as working group notes.

It’s not clear to me why anyone continues to think of data as mutable.

Mutable data was an artifact of limited storage and processing.

Neither of which obtains in a modern computing environment. (Yes, there are edge cases, the Large Hadron Collider for example.)

Still, if your interested, read on!

BaseX 8.6 Is Out!

Tuesday, January 24th, 2017

Email from Christian Grün arrived today with great news!

BaseX 8.6 is out! The new version of our XML database system and XQuery 3.1 processor includes countless improvements and optimizations. Many of them have been triggered by your valuable feedback, many others are the result of BaseX used in productive and commercial environments.

The most prominent new features are:


  • jobs without database access will never be locked
  • read transactions are now favored (adjustable via FAIRLOCK)


  • file monitoring was improved (adjustable via PARSERESTXQ)
  • authentication was reintroduced (no passwords anymore in web.xml)
  • id session attributes will show up in log data


  • always accessible, even if job queue is full
  • pagination of table results


  • path index improved: distinct values storage for numeric types


  • aligned with latest version of XQuery 3.1
  • updated functions: map:find, map:merge, fn:sort, array:sort, …
  • enhancements in User, Process, Jobs, REST and Database Module


  • improved import/export compatibility with Excel data

Visit to find the latest release, and check out to get more information. As always, we are looking forward to your feedback. Enjoy!

If one or more of your colleagues goes missing this week, suspect BaseX 8.6 and the new W3C drafts for XQuery are responsible.


XQuery/XSLT Proposals – Comments by 28 February 2017

Tuesday, January 24th, 2017

Proposed Recommendations Published for XQuery WG and XSLT WG.

From the webpage:

The XML Query Working Group and XSLT Working Group have published a Proposed Recommendation for four documents:

  • XQuery and XPath Data Model 3.1: This document defines the XQuery and XPath Data Model 3.1, which is the data model of XML Path Language (XPath) 3.1, XSL Transformations (XSLT) Version 3.0, and XQuery 3.1: An XML Query Language. The XQuery and XPath Data Model 3.1 (henceforth “data model”) serves two purposes. First, it defines the information contained in the input to an XSLT or XQuery processor. Second, it defines all permissible values of expressions in the XSLT, XQuery, and XPath languages.
  • XPath and XQuery Functions and Operators 3.1: The purpose of this document is to catalog the functions and operators required for XPath 3.1, XQuery 3.1, and XSLT 3.0. It defines constructor functions, operators, and functions on the datatypes defined in XML Schema Part 2: Datatypes Second Edition and the datatypes defined in XQuery and XPath Data Model (XDM) 3.1. It also defines functions and operators on nodes and node sequences as defined in the XQuery and XPath Data Model (XDM) 3.1.
  • XML Path Language (XPath) 3.1: XPath 3.1 is an expression language that allows the processing of values conforming to the data model defined in XQuery and XPath Data Model (XDM) 3.1. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic values, function items, and sequences.
  • XSLT and XQuery Serialization 3.1: This document defines serialization of an instance of the data model as defined in XQuery and XPath Data Model (XDM) 3.1 into a sequence of octets. Serialization is designed to be a component that can be used by other specifications such as XSL Transformations (XSLT) Version 3.0 or XQuery 3.1: An XML Query Language.

Comments are welcome through 28 February 2017.

Get your red pen out!

Unlike political flame wars on social media, comments on these proposed recommendatons could make a useful difference.

Enjoy! Relaunch!

Monday, January 16th, 2017

Lauren Wood posted this note about the relaunch of recently:

I’ve relaunched (for some background, Tim Bray wrote an article here: I’m hoping it will become part of the community again, somewhere for people to post their news (submit your news here: and articles (see the guidelines at I added a job board to the site as well (if you’re in Berlin, Germany, or able to
move there, look at the job currently posted; thanks LambdaWerk!); if your employer might want to post XML-related jobs please email me.

The old content should mostly be available but some articles were previously available at two (or more) locations and may now only be at one; try the archive list ( if you’re looking for something. Please let me know if something major is missing from the archives.

XML is used in a lot of areas, and there is a wealth of knowledge in this community. If you’d like to write an article, send me your ideas. If you have comments on the site, let me know that as well.

Just in time as President Trump is about to stir, vigorously, that big pot of crazy known as federal data.

Mapping, processing, transformation demands will grow at an exponential rate.

Notice the emphasis on demand.

Taking a two weeks to write custom software to sort files (you know the Weiner/Abedin laptop story, yes?) won’t be acceptable quite soon.

How are your on-demand XML chops?

Looking up words in the OED with XQuery [Details on OED API Key As Well]

Saturday, January 14th, 2017

Looking up words in the OED with XQuery by Clifford Anderson.

Clifford has posted a gist of work from the @VandyLibraries XQuery group, looking up words in the Oxford English Dictionary (OED) with XQuery.

To make full use of Clifford’s post, you will need for the Oxford Dictionaries API.

If you go straight to the regular Oxford English Dictionary (I’m omitting the URL so you don’t make the same mistake), there is nary a mention of the Oxford Dictionaries API.

The free plan allows 3K queries a month.

Not enough to shut out the outside world for the next four/eight years but enough to decide if it’s where you want to hide.

Application for the free api key was simple enough.

Save that the dumb password checker insisted on one or more special characters, plus one or more digits, plus upper and lowercase. When you get beyond 12 characters the insistence on a special character is just a little lame.

Email response with the key was fast, so I’m in!

What about you?

BaseX – Bad News, Good News

Wednesday, January 4th, 2017

Good news and bad news from BaseX. Christian Grün posted to the BaseX discussion list today:

Dear all,

This year, there will be no BaseX Preconference Meeting in XML Prague. Sorry for that! The reason is simple: We were too busy in order to prepare ourselves for the event, look for speakers, etc.

At least we are happy to announce that BaseX 8.6 will finally be released this month. Stay tuned!

All the best,


That’s disappointing but understandable. Console yourself with watching presentations from 2013 – 2016 and reviewing issues for 8.6.

Just a guess on my part, ;-), but I suspect more testing of BaseX builds would not go unappreciated.

Something to keep yourself busy while waiting for BaseX 8.6 to drop.

XQuery/XPath CRs 3.1! [#DisruptJ20 Twitter Game]

Tuesday, December 13th, 2016

Just in time for the holidays, new CRs for XQuery/XPath hit the street! Comments due by 2017-01-10.

XQuery and XPath Data Model 3.1

XML Path Language (XPath) 3.1

XQuery 3.1: An XML Query Language

XPath and XQuery Functions and Operators 3.1

XQueryX 3.1

#DisruptJ20 is too late for comments to the W3C but you can break the boredom of indeterminate waiting to protest excitedly for TV cameras and/or to be arrested.


Play the XQuery/XPath 3.1 Twitter Game!

Definitions litter the drafts and appear as:

[Definition: A sequence is an ordered collection of zero or more items.]

You Tweet:

An ordered collection of zero or more items? #xquery

Correct response:

A sequence.

Some definitions are too long to be tweeted in full:

An expanded-QName is a value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1]): that is, a triple containing namespace prefix (optional), namespace URI (optional), and local name. (xpath-functions)

Suggest you tweet:

A triple containing namespace prefix (optional), namespace URI (optional), and local name.


A value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1].

In both cases, the correct response:

An expanded-QName.

Use a $10 burner phone and it unlocked at protests. If your phone is searched, imagine the attempts to break the “code.”

You could agree on definitions/responses as instructions for direct action. But I digress.

4 Days Left – Submission Alert – XML Prague

Sunday, December 11th, 2016

A tweet by Jirka Kosek reminded me there are only 4 days left for XML Prague submissions!

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

From the call for papers:

XML Prague 2017 now welcomes submissions for presentations on the following topics:

  • Markup and the Extensible Web – HTML5, XHTML, Web Components, JSON and XML sharing the common space
  • Semantic visions and the reality – micro-formats, semantic data in business, linked data
  • Publishing for the 21th century – publishing toolchains, eBooks, EPUB, DITA, DocBook, CSS for print, …
  • XML databases and Big Data – XML storage, indexing, query languages, …
  • State of the XML Union – updates on specs, the XML community news, …

All proposals will be submitted for review by a peer review panel made up of the XML Prague Program Committee. Submissions will be chosen based on interest, applicability, technical merit, and technical correctness.

Accepted papers will be included in published conference proceedings.

I don’t travel but if you need a last-minute co-author or proofer, you know where to find me!

Visualizing XML Schemas

Tuesday, November 29th, 2016

I don’t have one of the commercial XML packages at the moment and was casting about for a free visualization technique for a large XML schema when I encountered:


I won’t be trying it on my schema until tomorrow but I thought it looked interesting enough to pass along.

Further details: Visualizing Complex Content Models with Spatial Schemas by Joe Pairman.

This looks almost teachable.


Other “free” visualization tools to suggest?

Manipulate XML Text Data Using XQuery String Functions

Tuesday, November 22nd, 2016

Manipulate XML Text Data Using XQuery String Functions by Adam Steffanick.

Adam’s notes from the Vanderbilt University XQuery Working Group:

We evaluated and manipulated text data (i.e., strings) within Extensible Markup Language (XML) using string functions in XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial covers the basics of how to use XQuery string functions and manipulate text data with BaseX.

We used a limited dataset of English words as text data to evaluate and manipulate, and I’ve created a GitHub gist of XML input and XQuery code for use with this tutorial.

A quick run-through of basic XQuery string functions that takes you up to writing your own XQuery function.

While we wait for more reports from the Vanderbilt University XQuery Working Group, have you considered using XQuery to impose different views on a single text document?

For example, some Bible translations follow the “traditional” chapter and verse divisions (a very late addition), while others use paragraph level organization and largely ignore the tradition verses.

Creating a view of a single source text as either one or both should not involve permanent changes to a source file in XML. Or at least not the original source file.

If for processing purposes there was a need for a static file rendering one way or the other, that’s doable but should be separate from the original XML file.

XML Prague 2017, February 9-11, 2017 – Registration Opens!

Wednesday, November 16th, 2016

XML Prague 2017, February 9-11, 2017

I mentioned XML Prague 2017 last month and now, after the election of Donald Trump as president of the United States, registration for the conference opens!


Maybe. 😉

Even if you are returning to the U.S. after the conference, XML Prague will be a welcome respite from the tempest of news coverage of what isn’t known about the impending Trump administration.

At 120 Euros for three days, this is a great investment both professionally and emotionally.


XML Calabash 1.1.13 released…

Thursday, November 10th, 2016

Norm Walsh tweeted:

XML Calabash 1.1.13 released for Saxon 9.5, Saxon 9.6, and, with all praise to @fgeorges, Saxon 9.7.

You are using XML pipelines for processing XML files. (XProc)


See also: XML Calabash Reference.


XML Prague 2017 is coming

Sunday, October 16th, 2016

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2016.

Important Dates:

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentation (to gauge the competition): Watch videos from XML Prague 2016 on Youtube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barn yard posturing that is U.S. election politics this year!

XQuery Snippets on Gist

Thursday, October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)


XQuery Working Group (Vanderbilt)

Saturday, September 24th, 2016

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction W3C XML Query Working Group and XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

BaseX 8.5.3 Released!

Monday, August 15th, 2016

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

The changelog reads:

VERSION 8.5.3 (August 15, 2016) —————————————-

Minor bug fixes, improved thread-safety.

Still, not a bad idea to upgrade today!


PS: You do remember that Congress is throwing XML in ever increasing amounts at the internet?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.


Thursday, July 28th, 2016


From the webpage:

MorganaXProc is an implementation of W3C’s XProc: An XML Pipeline Language written in Java™. It is free software, released under GNU General Public License version 2.0 (GPLv2).

The current version is 0.95 (public beta). It is very close to the recommendation with all related tests of the XProc Test Suite passed.

News: MorganaXProc 0.95-11 released

You can follow <xml-project/> on Twitter: @xml_project and peruse their documentation.

I haven’t worked my way through A User’s Guide to MorganaXProc but it looks promising.