Archive for the ‘XML’ Category

Looking up words in the OED with XQuery [Details on OED API Key As Well]

Saturday, January 14th, 2017

Looking up words in the OED with XQuery by Clifford Anderson.

Clifford has posted a gist of work from the @VandyLibraries XQuery group, looking up words in the Oxford English Dictionary (OED) with XQuery.

To make full use of Clifford’s post, you will need for the Oxford Dictionaries API.

If you go straight to the regular Oxford English Dictionary (I’m omitting the URL so you don’t make the same mistake), there is nary a mention of the Oxford Dictionaries API.

The free plan allows 3K queries a month.

Not enough to shut out the outside world for the next four/eight years but enough to decide if it’s where you want to hide.

Application for the free api key was simple enough.

Save that the dumb password checker insisted on one or more special characters, plus one or more digits, plus upper and lowercase. When you get beyond 12 characters the insistence on a special character is just a little lame.

Email response with the key was fast, so I’m in!

What about you?

BaseX – Bad News, Good News

Wednesday, January 4th, 2017

Good news and bad news from BaseX. Christian Grün posted to the BaseX discussion list today:

Dear all,

This year, there will be no BaseX Preconference Meeting in XML Prague. Sorry for that! The reason is simple: We were too busy in order to prepare ourselves for the event, look for speakers, etc.

At least we are happy to announce that BaseX 8.6 will finally be released this month. Stay tuned!

All the best,


That’s disappointing but understandable. Console yourself with watching presentations from 2013 – 2016 and reviewing issues for 8.6.

Just a guess on my part, ;-), but I suspect more testing of BaseX builds would not go unappreciated.

Something to keep yourself busy while waiting for BaseX 8.6 to drop.

XQuery/XPath CRs 3.1! [#DisruptJ20 Twitter Game]

Tuesday, December 13th, 2016

Just in time for the holidays, new CRs for XQuery/XPath hit the street! Comments due by 2017-01-10.

XQuery and XPath Data Model 3.1

XML Path Language (XPath) 3.1

XQuery 3.1: An XML Query Language

XPath and XQuery Functions and Operators 3.1

XQueryX 3.1

#DisruptJ20 is too late for comments to the W3C but you can break the boredom of indeterminate waiting to protest excitedly for TV cameras and/or to be arrested.


Play the XQuery/XPath 3.1 Twitter Game!

Definitions litter the drafts and appear as:

[Definition: A sequence is an ordered collection of zero or more items.]

You Tweet:

An ordered collection of zero or more items? #xquery

Correct response:

A sequence.

Some definitions are too long to be tweeted in full:

An expanded-QName is a value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1]): that is, a triple containing namespace prefix (optional), namespace URI (optional), and local name. (xpath-functions)

Suggest you tweet:

A triple containing namespace prefix (optional), namespace URI (optional), and local name.


A value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1].

In both cases, the correct response:

An expanded-QName.

Use a $10 burner phone and it unlocked at protests. If your phone is searched, imagine the attempts to break the “code.”

You could agree on definitions/responses as instructions for direct action. But I digress.

4 Days Left – Submission Alert – XML Prague

Sunday, December 11th, 2016

A tweet by Jirka Kosek reminded me there are only 4 days left for XML Prague submissions!

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

From the call for papers:

XML Prague 2017 now welcomes submissions for presentations on the following topics:

  • Markup and the Extensible Web – HTML5, XHTML, Web Components, JSON and XML sharing the common space
  • Semantic visions and the reality – micro-formats, semantic data in business, linked data
  • Publishing for the 21th century – publishing toolchains, eBooks, EPUB, DITA, DocBook, CSS for print, …
  • XML databases and Big Data – XML storage, indexing, query languages, …
  • State of the XML Union – updates on specs, the XML community news, …

All proposals will be submitted for review by a peer review panel made up of the XML Prague Program Committee. Submissions will be chosen based on interest, applicability, technical merit, and technical correctness.

Accepted papers will be included in published conference proceedings.

I don’t travel but if you need a last-minute co-author or proofer, you know where to find me!

Visualizing XML Schemas

Tuesday, November 29th, 2016

I don’t have one of the commercial XML packages at the moment and was casting about for a free visualization technique for a large XML schema when I encountered:


I won’t be trying it on my schema until tomorrow but I thought it looked interesting enough to pass along.

Further details: Visualizing Complex Content Models with Spatial Schemas by Joe Pairman.

This looks almost teachable.


Other “free” visualization tools to suggest?

Manipulate XML Text Data Using XQuery String Functions

Tuesday, November 22nd, 2016

Manipulate XML Text Data Using XQuery String Functions by Adam Steffanick.

Adam’s notes from the Vanderbilt University XQuery Working Group:

We evaluated and manipulated text data (i.e., strings) within Extensible Markup Language (XML) using string functions in XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial covers the basics of how to use XQuery string functions and manipulate text data with BaseX.

We used a limited dataset of English words as text data to evaluate and manipulate, and I’ve created a GitHub gist of XML input and XQuery code for use with this tutorial.

A quick run-through of basic XQuery string functions that takes you up to writing your own XQuery function.

While we wait for more reports from the Vanderbilt University XQuery Working Group, have you considered using XQuery to impose different views on a single text document?

For example, some Bible translations follow the “traditional” chapter and verse divisions (a very late addition), while others use paragraph level organization and largely ignore the tradition verses.

Creating a view of a single source text as either one or both should not involve permanent changes to a source file in XML. Or at least not the original source file.

If for processing purposes there was a need for a static file rendering one way or the other, that’s doable but should be separate from the original XML file.

XML Prague 2017, February 9-11, 2017 – Registration Opens!

Wednesday, November 16th, 2016

XML Prague 2017, February 9-11, 2017

I mentioned XML Prague 2017 last month and now, after the election of Donald Trump as president of the United States, registration for the conference opens!


Maybe. 😉

Even if you are returning to the U.S. after the conference, XML Prague will be a welcome respite from the tempest of news coverage of what isn’t known about the impending Trump administration.

At 120 Euros for three days, this is a great investment both professionally and emotionally.


XML Calabash 1.1.13 released…

Thursday, November 10th, 2016

Norm Walsh tweeted:

XML Calabash 1.1.13 released for Saxon 9.5, Saxon 9.6, and, with all praise to @fgeorges, Saxon 9.7.

You are using XML pipelines for processing XML files. (XProc)


See also: XML Calabash Reference.


XML Prague 2017 is coming

Sunday, October 16th, 2016

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2016.

Important Dates:

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentation (to gauge the competition): Watch videos from XML Prague 2016 on Youtube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barn yard posturing that is U.S. election politics this year!

XQuery Snippets on Gist

Thursday, October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)


XQuery Working Group (Vanderbilt)

Saturday, September 24th, 2016

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction W3C XML Query Working Group and XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

BaseX 8.5.3 Released!

Monday, August 15th, 2016

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

The changelog reads:

VERSION 8.5.3 (August 15, 2016) —————————————-

Minor bug fixes, improved thread-safety.

Still, not a bad idea to upgrade today!


PS: You do remember that Congress is throwing XML in ever increasing amounts at the internet?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.


Thursday, July 28th, 2016


From the webpage:

MorganaXProc is an implementation of W3C’s XProc: An XML Pipeline Language written in Java™. It is free software, released under GNU General Public License version 2.0 (GPLv2).

The current version is 0.95 (public beta). It is very close to the recommendation with all related tests of the XProc Test Suite passed.

News: MorganaXProc 0.95-11 released

You can follow <xml-project/> on Twitter: @xml_project and peruse their documentation.

I haven’t worked my way through A User’s Guide to MorganaXProc but it looks promising.


Saxon-JS – Beta Release (EE-License)

Thursday, July 28th, 2016


From the webpage:

Saxon-JS is an XSLT 3.0 run-time written in pure JavaScript. It’s designed to execute Stylesheet Export Files compiled by Saxon-EE.

The first beta release is Saxon-JS 0.9 (released 28 July 2016), for use on web browsers. This can be used with Saxon-EE or later.

The beta release has been tested with current versions of Safari, Firefox, and Chrome browsers. It is known not to work under Internet Explorer. Browser support will be extended in future releases. Please let us know of any problems.

Saxon-JS documentation.

Goodies from the documentation:

Because people want to write rich interactive client-side applications, Saxon-JS does far more than simply converting XML to HTML, in the way that the original client-side XSLT 1.0 engines did. Instead, the stylesheet can contain rules that respond to user input, such as clicking on buttons, filling in form fields, or hovering the mouse. These events trigger template rules in the stylesheet which can be used to read additional data and modify the content of the HTML page.

We’re talking here primarily about running Saxon-JS in the browser. However, it’s also capable of running in server-side JavaScript environments such as Node.js (not yet fully supported in this beta release).

Grab a copy to get ready for discussions at Balisage!

Accessing IRS 990 Filings (Old School)

Monday, July 25th, 2016

Like many others, I was glad to see: IRS 990 Filings on AWS.

From the webpage:

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available withing this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identificiation Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint at No authentication is required to download data over HTTPS. For example, the index file can be accessed at and the example filing mentioned above can be accessed at (emphasis in original).

I open a terminal window and type:


which as of today, results in:

-rw-rw-r-- 1 patrick patrick 1036711819 Jun 16 10:23 index.json

A trial grep:

grep "NATIONAL RIFLE" index.json > nra.txt

Which produces:

{“EIN”: “530116130”, “SubmittedOn”: “2014-11-25”, “TaxPeriod”: “201312”, “DLN”: “93493309004174”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201423099349300417”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2013-12-20”, “TaxPeriod”: “201212”, “DLN”: “93493260005203”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201302609349300520”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2012-12-06”, “TaxPeriod”: “201112”, “DLN”: “93493311011202”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201203119349301120”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “396056607”, “SubmittedOn”: “2011-05-12”, “TaxPeriod”: “201012”, “FormType”: “990EZ”, “LastUpdated”: “2016-06-14T01:22:09.915971Z”, “OrganizationName”: “EAU CLAIRE NATIONAL RIFLE CLUB”, “IsElectronic”: false, “IsAvailable”: false},
{“EIN”: “530116130”, “SubmittedOn”: “2011-11-09”, “TaxPeriod”: “201012”, “DLN”: “93493270005081”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201132709349300508”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2016-01-11”, “TaxPeriod”: “201412”, “DLN”: “93493259005035”, “LastUpdated”: “2016-04-29T13:40:20”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201532599349300503”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},

We have one errant result, the “EAU CLAIRE NATIONAL RIFLE CLUB,” so let’s delete that, re-order by year and the NATIONAL RIFLE ASSOCIATION OF AMERICA result reads (most recent to oldest):

{“EIN”: “530116130”, “SubmittedOn”: “2016-01-11”, “TaxPeriod”: “201412”, “DLN”: “93493259005035”, “LastUpdated”: “2016-04-29T13:40:20”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201532599349300503”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2014-11-25”, “TaxPeriod”: “201312”, “DLN”: “93493309004174”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201423099349300417”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2013-12-20”, “TaxPeriod”: “201212”, “DLN”: “93493260005203”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201302609349300520”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2012-12-06”, “TaxPeriod”: “201112”, “DLN”: “93493311011202”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201203119349301120”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},
{“EIN”: “530116130”, “SubmittedOn”: “2011-11-09”, “TaxPeriod”: “201012”, “DLN”: “93493270005081”, “LastUpdated”: “2016-03-21T17:23:53”, “URL”: “”, “FormType”: “990”, “ObjectId”: “201132709349300508”, “OrganizationName”: “NATIONAL RIFLE ASSOCIATION OF AMERICA”, “IsElectronic”: true, “IsAvailable”: true},

Of course, now you want the XML 990 returns, so extract the URLs for the 990s to a file, here nra-urls.txt (I would use awk if it is more than a handful):

Back to wget:

wget -i nra-urls.txt


-rw-rw-r– 1 patrick patrick 111798 Mar 21 16:12 201132709349300508_public.xml
-rw-rw-r– 1 patrick patrick 123490 Mar 21 19:47 201203119349301120_public.xml
-rw-rw-r– 1 patrick patrick 116786 Mar 21 22:12 201302609349300520_public.xml
-rw-rw-r– 1 patrick patrick 122071 Mar 21 15:20 201423099349300417_public.xml
-rw-rw-r– 1 patrick patrick 132081 Apr 29 10:10 201532599349300503_public.xml

Ooooh, it’s in XML! 😉

For the XML you are going to need: Current Valid XML Schemas and Business Rules for Exempt Organizations Modernized e-File, not to mention a means of querying the data (may I suggest XQuery?).

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).


BaseX 8.5.1 Released! (XQuery Texts for Smart Phone?)

Saturday, July 16th, 2016

BaseX – 8.5.1 Released!

From the documentation page:

BaseX is both a light-weight, high-performance and scalable XML Database and an XQuery 3.1 Processor with full support for the W3C Update and Full Text extensions. It focuses on storing, querying, and visualizing large XML and JSON documents and collections. A visual frontend allows users to interactively explore data and evaluate XQuery expressions in realtime. BaseX is platform-independent and distributed under the free BSD License (find more in Wikipedia).

Besides Priscilia Walmsley’s XQuery 2nd Edition and the BaseX documentation as a PDF file, what other XQuery resources would you store on a smart phone? (For occasional reference, leisure reading, etc.)

The Feynman Technique – Contest for Balisage 2016?

Tuesday, June 28th, 2016

The Best Way to Learn Anything: The Feynman Technique by Shane Parrish.

From the post:

There are four simple steps to the Feynman Technique, which I’ll explain below:

  1. Choose a Concept
  2. Teach it to a Toddler
  3. Identify Gaps and Go Back to The Source Material
  4. Review and Simplify

This made me think of the late-breaking Balisage 2016 papers posted by Tommie Usdin in email:

  • Saxon-JS – XSLT 3.0 in the Browser, by Debbie Lockett and Michael Kay, Saxonica
  • A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content), by Uche Ogbuji, Zepheira
  • A catalog of Functional programming idioms in XQuery 3.1, James Fuller, MarkLogic

New contest for Balisage?

Pick a concept from a Balisage 2016 presentation and you have sixty (60) seconds to explain it to Balisage attendees.

What do you think?

Remember, you can’t play if you don’t attend! Register today!

If Tommie agrees, the winner gets me to record a voice mail greeting for their phone! 😉

The Symptom of Many Formats

Monday, June 13th, 2016

Distro.Mic: An Open Source Service for Creating Instant Articles, Google AMP and Apple News Articles

From the post:

Mic is always on the lookout for new ways to reach our audience. When Facebook, Google and Apple announced their own native news experiences, we jumped at the opportunity to publish there.

While setting Mic up on these services, David Björklund realized we needed a common article format that we could use for generating content on any platform. We call this format article-json, and we open-sourced parsers for it.

Article-json got a lot of support from Google and Apple, so we decided to take it a step further. Enter DistroMic. Distro lets anyone transform an HTML article into the format mandated by one of the various platforms.


While I applaud the DistroMic work, I am saddened that it was necessary.

From the DistroMic page, here is the same article in three formats:


“article”: [
“text”: “Astronomers just announced the universe might be expanding up to 9% faster than we thought.\n”,
“additions”: [
“type”: “link”,
“rangeStart”: 59,
“rangeLength”: 8,
“URL”: “”
“inlineTextStyles”: [
“rangeStart”: 59,
“rangeLength”: 8,
“textStyle”: “bodyLinkTextStyle”
“role”: “body”,
“layout”: “bodyLayout”
“text”: “It’s a surprising insight that could put us one step closer to finally figuring out what the hell dark energy and dark matter are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.\n”,
“additions”: [
“type”: “link”,
“rangeStart”: 98,
“rangeLength”: 28,
“URL”: “”
“inlineTextStyles”: [
“rangeStart”: 98,
“rangeLength”: 28,
“textStyle”: “bodyLinkTextStyle”
“role”: “body”,
“layout”: “bodyLayout”
“role”: “container”,
“components”: [
“role”: “photo”,
“URL”: “bundle://image-0.jpg”,
“style”: “embedMediaStyle”,
“layout”: “embedMediaLayout”,
“caption”: {
“text”: “Source: \n NASA\n \n”,
“additions”: [
“type”: “link”,
“rangeStart”: 13,
“rangeLength”: 4,
“URL”: “”
“inlineTextStyles”: [
“rangeStart”: 13,
“rangeLength”: 4,
“textStyle”: “embedCaptionTextStyle”
“textStyle”: “embedCaptionTextStyle”
“layout”: “embedLayout”,
“style”: “embedStyle”
“bundlesToUrls”: {
“image-0.jpg”: “”


<p>Astronomers just announced the universe might be expanding
<a href=””>up to 9%</a> faster than we thought.</p>
<p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href=””>
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure data-feedback=”fb:likes,fb:comments”>
<img src=””></img>
Source: <a href=”


<p>Astronomers just announced the universe might be expanding
<a href=””>up to 9%</a> faster than we thought.</p> <p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href=””> dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<amp-img width=”900″ height=”445″ layout=”responsive” src=””></amp-img>
<a href=”

All starting from the same HTML source:

<p>Astronomers just announced the universe might be expanding
<a href=””>up to 9%</a> faster than we thought.</p><p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href=””>
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<img width=”900″ height=”445″ src=””>
<a href=”

Three workflows based on what started life in one common format.

Three workflows that have their own bugs and vulnerabilities.

Three workflows that duplicate the capabilities of each other.

Three formats that require different indexing/searching.

This is not the cause of why we can’t have nice things in software, but it certainly is a symptom.

The next time someone proposes a new format for a project, challenge them to demonstrate a value-add over existing formats.

Balisage 2016 Program Posted! (Newcomers Welcome!)

Monday, May 23rd, 2016

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1.

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer
scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while other still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program:

Symposium Program:

Even if you don’t eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long time participants like new document/information problems or new ways of looking at old ones. Not to mention they, on occasion, learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing what controversial terms mean, like “document,” but that to one side, they are a good a crew as you are likely to meet.


TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Thursday, May 5th, 2016

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI 5 has 563 elements, which may appear in varying, valid, combinations. It also defines 256 attributes which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to insure that encoded texts conform your project’s definition of expected text encoding.

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Tuesday, February 2nd, 2016

Call for Participation


  • 25 March 2016 — Peer review applications due
  • 22 April 2016 — Paper submissions due
  • 21 May 2016 — Speakers notified
  • 10 June 2016 — Late-breaking News submissions due
  • 16 June 2016 — Late-breaking News speakers notified
  • 8 July 2016 — Final papers due from presenters of peer reviewed papers
  • 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
  • 1 August 2016 — Pre-conference Symposium
  • 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

  • Web application development with XML
  • Informal data models and consensus-based vocabularies
  • Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
  • Performance issues in parsing, XML database retrieval, or XSLT processing
  • Development of angle-bracket-free user interfaces for non-technical users
  • Semistructured data and full text search
  • Deployment of XML systems for enterprise data
  • Web application development with XML
  • Design and implementation of XML vocabularies
  • Case studies of the use of XML for publishing, interchange, or archiving
  • Alternatives to XML
  • the role(s) of XML in the application lifecycle
  • the role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of and processed as if from BOM to EOF is the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Wednesday, January 13th, 2016

Congressional Roll Call Vote – The Documents (XQuery) we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>, the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<legislator name-id=”A000374″ sort-field=”Abraham” unaccented-name=”Abraham” party=”R” state=”LA” role=”legislator”>Abraham</legislator><
<legislator name-id=”A000370″ sort-field=”Adams” unaccented-name=”Adams” party=”D” state=”NC” role=”legislator”>Adams</legislator>

These are only the first two (2) lines but only the content of other <recorded-vote> elements varies from these.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id=”A000374,” state=”LA” role=”legislator.”

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

For example, for name-id=”A000374,” we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374]

If we wanted to select that attribute value and/or the <legislator> element with that attribute and value.

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.

In preparation for our next post, trying searching for “A000374” and limiting your search to the domain,

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

Congressional Roll Call Vote – The Documents (XQuery)

Monday, January 11th, 2016

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data> reads:

<?xml version=”1.0″ encoding=”UTF-8″?>
<!DOCTYPE rollcall-vote PUBLIC “-//US Congress//DTDs/vote
v1.0 20031119 //EN” “”>
<?xml-stylesheet type=”text/xsl” href=””?>
<chamber>U.S. House of Representatives</chamber>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with
Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<action-time time-etz=”09:49″>9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the
Department of Veterans Affairs, and related agencies for the fiscal
year ending September 30, 2016, and for other purposes</vote-desc>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element – aka “parent” element

<vote-metadata> – first child element in this document
XPath rollcall-vote/vote-metadata

<majority>R</majority> first child of <majority>R</majority> of <vote-metadata>
XPath rollcall-vote/vote-metadata/majority


What do you think? Looks like the same level as <majority>R</majority> and it is. Called a sibling of <majority>R</majority>
XPath rollcall-vote/vote-metadata/congress

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, lets get good at spotting direct XPaths.

Let’s skip down in the markup until we come to <totals-by-party-header>. It’s not followed, at least not immediately, with </totals-by-party-header>. That’s a signal that the previous siblings have stopped and we have another step in the XPath.

XPath: rollcall-vote/vote-metadata/majority/totals-by-party-header

XPath: rollcall-vote/vote-metadata/majority/totals-by-party-header/party-header

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>

The closing element, shown by the “/,” signals the end of the <totals-by-party-header> element.


See how you do mapping out the remaining XPaths from the top of the document.


Tomorrow we are going to dive into the structure of the <vote-data> and how to address the attributes therein and their values.


JATS: Journal Article Tag Suite, Navigation Update!

Monday, January 11th, 2016

I posted about the appearance of JATS: Journal Article Tag Suite, version 1.1 and then began to lazily browse the pdf.

I forget what I was looking for now but I noticed the table of contents jumped from page 42 to page 235, and again from 272 to to 405. I’m thinking by this point “this is going to be a bear to find elements/attributes in.” I looked for an index only to find none. 🙁

But, there’s hope!

If you look at Chapter 7 “TAG Suite Components,” elements start on page 7 and attributes on page 28, you will find:


Each ✔ is a navigation link to that element (or attribute if you are in the attribute section) under each of those divisions, Archiving, Publishing, Authoring.

Very cool but falls under “non-obvious” for me.

Pass it on so others can safely and quickly navigate JATS 1.1!

PS: It was Tommie Usdin of Balisage fame who pointed out the table in chapter 7 to me. Thanks Tommie!

Congressional Roll Call Vote and XQuery (A Do Over)

Sunday, January 10th, 2016

Once words are written, as an author I consider them to be fixed. Even typos should be acknowledged as being corrected and not silently “improve” the original text. Rather than editing what has been said, more words can cover the same ground with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post to be meaningful so let’s cover getting that setup first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon, unzip the file and you can choose to put saxon9he.jar in your Java classpath (if set) or you can invoke it with the -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:$1

Which after creating that file, which I very imaginatively named “,” I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “” I type:

./ ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The “>” sign is a pipe in Linux that redirects the output to a file, in this case, ex-5.4.html.

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Patricia Walmsley. Highly recommended.

Patricia has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

Congressional Roll Call and XQuery – (Week 1 of XQuery)

Saturday, January 9th, 2016

Truthfully a little more than a week of daily XQuery posts, I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers of interesting software or techniques. Most of them are not “how-to” or step by step guides, but some of them are.

The posts on congressional roll call documents made sense to me but then I wrote them. Part of what I sensed was that either you know enough to follow my jumps, in which case you are looking for specific details, like the correspondence across documents for attribute values, and not so much for my XQuery expressions.

On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations, one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I’m going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with a very terse, short experts tips at the end of each post if needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

Will have the first longer style post up tomorrow, January 10, 2016 and we will see how the week develops from there.

Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Friday, January 8th, 2016

One of the things that yesterday’s output lacked was the full names of the Georgia representatives. Which aren’t reported in the roll call documents.

But, what the roll call documents do have, is the following:

<legislator name-id=”J000288″ sort-field=”Johnson (GA)” unaccented-name=”Johnson (GA)”
party=”D” state=”GA” role=”legislator”>Johnson (GA)</legislator>

With emphasis on name-id=”J000288″

I call that attribute out because there is a sample data file, just for the House of Representatives that has:


And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at

The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:


That is the result of joining/merging two XML files hosted at in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: and the House Member Data File was: The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery function for merging the two XML files:

{for $voter in doc(“”)//recorded-vote,
$mem in doc(“”)//member/member-info
where $voter/legislator[@state = ‘GA’] and $voter/legislator/@name-id = $mem//bioguideID
return <li> {string($mem//official-name)} — {string($voter/vote)} — {string($mem//phone)}&;lt;/li>

At a minimum, you can auto-generate a listing for representatives from your state, their vote on any roll-call vote and give readers their phone number to register their opinion.

This is a crude example of what you can do with XML, XQuery and online data from

BTW, if you work in “sharing” environment at a media outlet or library, you can also join/merge data that you hold internally, say the private phone number of a congressional aide, for example.

We are not nearly done with the congressional roll call vote but you can begin to see the potential that XQuery offers for very little effort. Not to mention that XQuery scripts can be rapidly adapted to your library or news room.

Try out today’s XQuery roll705-join-merge.xq.txt for yourself. (Apologies for the “.txt” extension but my ISP host has ideas about “safe” files to upload.)

I realize this first week has been kinda haphazard in its presentation. Suggestions welcome on improvements as this series goes forward.

The government and others are cranking out barely useful XML by the boatload. XQuery is your ticket to creating personalized presentations dynamically from that XML and other data.


PS: For display of XML and XQuery, should I be using a different Word template? Suggestions?

JATS: Journal Article Tag Suite, version 1.1

Friday, January 8th, 2016

JATS: Journal Article Tag Suite, version 1.1


The Journal Article Tag Suite provides a common XML format in which publishers and archives can exchange journal content. The JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews.

Documentation and help files: Journal Article Tag Suite.

Tommie Usdin (of Balisage fame) posted to Facebook:

JATS has added capabilities to encode:
– NISO Access License and Indicators
– additional support for multiple language documents and for Japanese documents (including Ruby)
– citation of datasets
and some other things users of version 1.0 have requested.

Another XML vocabulary that provides grist for your XQuery adventures!

Localizing A Congressional Roll Call Vote (XQuery)

Thursday, January 7th, 2016

I made some progress today on localizing a congressional roll call vote.

As you might expect, I chose to localize to the representatives from Georgia. 😉

I used a FLWOR expression to select legislators where the attribute state = GA.

Here is that expression:

{for $voter in doc(“”)//recorded-vote
where $voter/legislator[@state = ‘GA’]
return <li> {string($voter/legislator)} — {string($voter/vote)}</li>

Which makes our localized display a bit better for local readers but only just.

See roll705-local.html.

What we need is more information that can be found at:

More on that tomorrow!

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:


Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection, there was no anchor at

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”


A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.