## Archive for the ‘XML’ Category

### XQuery Snippets on Gist

Thursday, October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service
https://gist.github.com/search?q=xquery&ref=searchresults&utf8=%E2%9C%93

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)

Enjoy!

### XQuery Working Group (Vanderbilt)

Saturday, September 24th, 2016

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction between the W3C XML Query Working Group and the XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

### BaseX 8.5.3 Released!

Monday, August 15th, 2016

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

VERSION 8.5.3 (August 15, 2016)

Enjoy!

PS: You do remember that Congress is throwing XML in ever-increasing amounts at the internet?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.

### MorganaXProc

Thursday, July 28th, 2016

MorganaXProc

From the webpage:

MorganaXProc is an implementation of W3C’s XProc: An XML Pipeline Language written in Java™. It is free software, released under GNU General Public License version 2.0 (GPLv2).

The current version is 0.95 (public beta). It is very close to the recommendation with all related tests of the XProc Test Suite passed.

I haven’t worked my way through A User’s Guide to MorganaXProc but it looks promising.

Enjoy!

### Saxon-JS – Beta Release (EE-License)

Thursday, July 28th, 2016

From the webpage:

Saxon-JS is an XSLT 3.0 run-time written in pure JavaScript. It’s designed to execute Stylesheet Export Files compiled by Saxon-EE.

The first beta release is Saxon-JS 0.9 (released 28 July 2016), for use on web browsers. This can be used with Saxon-EE 9.7.0.7 or later.

The beta release has been tested with current versions of Safari, Firefox, and Chrome browsers. It is known not to work under Internet Explorer. Browser support will be extended in future releases. Please let us know of any problems.

Goodies from the documentation:

Because people want to write rich interactive client-side applications, Saxon-JS does far more than simply converting XML to HTML, in the way that the original client-side XSLT 1.0 engines did. Instead, the stylesheet can contain rules that respond to user input, such as clicking on buttons, filling in form fields, or hovering the mouse. These events trigger template rules in the stylesheet which can be used to read additional data and modify the content of the HTML page.

We’re talking here primarily about running Saxon-JS in the browser. However, it’s also capable of running in server-side JavaScript environments such as Node.js (not yet fully supported in this beta release).

Grab a copy to get ready for discussions at Balisage!

### Accessing IRS 990 Filings (Old School)

Monday, July 25th, 2016

Like many others, I was glad to see: IRS 990 Filings on AWS.

From the webpage:

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available within this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint at https://s3.amazonaws.com/irs-form-990. No authentication is required to download data over HTTPS. For example, the index file can be accessed at https://s3.amazonaws.com/irs-form-990/index.json and the example filing mentioned above can be accessed at https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml (emphasis in original).

I open a terminal window and type:

wget https://s3.amazonaws.com/irs-form-990/index.json

which as of today, results in:

-rw-rw-r-- 1 patrick patrick 1036711819 Jun 16 10:23 index.json

A trial grep:

grep "NATIONAL RIFLE" index.json > nra.txt

Which produces:

{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201423099349300417_public.xml", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201302609349300520_public.xml", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201203119349301120_public.xml", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "396056607", "SubmittedOn": "2011-05-12", "TaxPeriod": "201012", "FormType": "990EZ", "LastUpdated": "2016-06-14T01:22:09.915971Z", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB", "IsElectronic": false, "IsAvailable": false},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201132709349300508_public.xml", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "https://s3.amazonaws.com/irs-form-990/201532599349300503_public.xml", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},

We have one errant result, the “EAU CLAIRE NATIONAL RIFLE CLUB,” so let’s delete that and re-order by year. The NATIONAL RIFLE ASSOCIATION OF AMERICA results then read (most recent to oldest):

{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "https://s3.amazonaws.com/irs-form-990/201532599349300503_public.xml", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201423099349300417_public.xml", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201302609349300520_public.xml", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201203119349301120_public.xml", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201132709349300508_public.xml", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
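
The delete-and-re-order step can also be scripted. What follows is only a sketch in Python (my choice for illustration, not part of the original workflow). It assumes the grep hits are one JSON record per line, each ending in a comma as in the output above; the function name `select_filings` and the abbreviated sample records are made up for the example.

```python
import json

def select_filings(lines, name):
    """Keep records whose OrganizationName equals `name`, newest TaxPeriod first."""
    records = []
    for line in lines:
        text = line.strip().rstrip(",")  # each grep hit ends with a trailing comma
        if not text:
            continue
        record = json.loads(text)
        if record.get("OrganizationName") == name:
            records.append(record)
    # TaxPeriod is YYYYMM, so a plain string sort also sorts by date
    return sorted(records, key=lambda r: r["TaxPeriod"], reverse=True)

# Abbreviated records in the shape of the grep output (most fields omitted)
sample = [
    '{"EIN": "530116130", "TaxPeriod": "201312", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},',
    '{"EIN": "396056607", "TaxPeriod": "201012", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB"},',
    '{"EIN": "530116130", "TaxPeriod": "201412", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},',
]

for rec in select_filings(sample, "NATIONAL RIFLE ASSOCIATION OF AMERICA"):
    print(rec["TaxPeriod"], rec["OrganizationName"])
```

Matching on the exact organization name, rather than on a substring as grep does, is what drops the rifle club.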

Of course, now you want the XML 990 returns, so extract the URLs for the 990s to a file, here nra-urls.txt (I would use awk if it is more than a handful):

https://s3.amazonaws.com/irs-form-990/201532599349300503_public.xml
https://s3.amazonaws.com/irs-form-990/201423099349300417_public.xml
https://s3.amazonaws.com/irs-form-990/201302609349300520_public.xml
https://s3.amazonaws.com/irs-form-990/201203119349301120_public.xml
https://s3.amazonaws.com/irs-form-990/201132709349300508_public.xml
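
For more than a handful of filings, copying URLs by hand gets tedious. I mentioned awk; here is a rough Python equivalent, offered as a sketch rather than the one right way. It assumes each available filing carries a "URL" field as in the grep output, and `extract_urls` is an illustrative name of my own.

```python
import re

# Captures the value of the "URL" field in an index record
URL_FIELD = re.compile(r'"URL":\s*"([^"]+)"')

def extract_urls(text):
    """Return every URL value found in the text, in order of appearance."""
    return URL_FIELD.findall(text)

# Abbreviated sample: one record with a URL, one without (not available)
sample = (
    '{"EIN": "530116130", "URL": "https://s3.amazonaws.com/irs-form-990/201532599349300503_public.xml", "FormType": "990"},\n'
    '{"EIN": "396056607", "FormType": "990EZ", "IsAvailable": false},\n'
)

urls = extract_urls(sample)
print("\n".join(urls))

# To build the file that wget -i reads (assumes nra.txt exists):
# with open("nra.txt") as src, open("nra-urls.txt", "w") as dst:
#     dst.write("\n".join(extract_urls(src.read())) + "\n")
```

Records without a URL field, such as filings that are not available electronically, simply produce no match.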

Back to wget:

wget -i nra-urls.txt

Results:

-rw-rw-r-- 1 patrick patrick 111798 Mar 21 16:12 201132709349300508_public.xml
-rw-rw-r-- 1 patrick patrick 123490 Mar 21 19:47 201203119349301120_public.xml
-rw-rw-r-- 1 patrick patrick 116786 Mar 21 22:12 201302609349300520_public.xml
-rw-rw-r-- 1 patrick patrick 122071 Mar 21 15:20 201423099349300417_public.xml
-rw-rw-r-- 1 patrick patrick 132081 Apr 29 10:10 201532599349300503_public.xml

Ooooh, it’s in XML! 😉

For the XML you are going to need: Current Valid XML Schemas and Business Rules for Exempt Organizations Modernized e-File, not to mention a means of querying the data (may I suggest XQuery?).

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).

Enjoy!

### BaseX 8.5.1 Released! (XQuery Texts for Smart Phone?)

Saturday, July 16th, 2016

BaseX – 8.5.1 Released!

From the documentation page:

BaseX is both a light-weight, high-performance and scalable XML Database and an XQuery 3.1 Processor with full support for the W3C Update and Full Text extensions. It focuses on storing, querying, and visualizing large XML and JSON documents and collections. A visual frontend allows users to interactively explore data and evaluate XQuery expressions in realtime. BaseX is platform-independent and distributed under the free BSD License (find more in Wikipedia).

Besides Priscilla Walmsley’s XQuery 2nd Edition and the BaseX documentation as a PDF file, what other XQuery resources would you store on a smart phone? (For occasional reference, leisure reading, etc.)

### The Feynman Technique – Contest for Balisage 2016?

Tuesday, June 28th, 2016

The Best Way to Learn Anything: The Feynman Technique by Shane Parrish.

From the post:

There are four simple steps to the Feynman Technique, which I’ll explain below:

1. Choose a Concept
2. Teach it to a Toddler
3. Identify Gaps and Go Back to The Source Material
4. Review and Simplify

This made me think of the late-breaking Balisage 2016 papers posted by Tommie Usdin in email:

• Saxon-JS – XSLT 3.0 in the Browser, by Debbie Lockett and Michael Kay, Saxonica
• A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content), by Uche Ogbuji, Zepheira
• A catalog of Functional programming idioms in XQuery 3.1, James Fuller, MarkLogic

New contest for Balisage?

Pick a concept from a Balisage 2016 presentation and you have sixty (60) seconds to explain it to Balisage attendees.

What do you think?

Remember, you can’t play if you don’t attend! Register today!

If Tommie agrees, the winner gets me to record a voice mail greeting for their phone! 😉

### The Symptom of Many Formats

Monday, June 13th, 2016

Distro.Mic: An Open Source Service for Creating Instant Articles, Google AMP and Apple News Articles

From the post:

Mic is always on the lookout for new ways to reach our audience. When Facebook, Google and Apple announced their own native news experiences, we jumped at the opportunity to publish there.

While setting Mic up on these services, David Björklund realized we needed a common article format that we could use for generating content on any platform. We call this format article-json, and we open-sourced parsers for it.

Article-json got a lot of support from Google and Apple, so we decided to take it a step further. Enter DistroMic. Distro lets anyone transform an HTML article into the format mandated by one of the various platforms.

Sigh.

While I applaud the DistroMic work, I am saddened that it was necessary.

From the DistroMic page, here is the same article in three formats:

Apple:

{
"article": [
{
"text": "Astronomers just announced the universe might be expanding up to 9% faster than we thought.\n",
{
"rangeStart": 59,
"rangeLength": 8,
"URL": "http://hubblesite.org/newscenter/archive/releases/2016/17/text/"
}
],
"inlineTextStyles": [
{
"rangeStart": 59,
"rangeLength": 8,
}
],
"role": "body",
"layout": "bodyLayout"
},
{
"text": "It’s a surprising insight that could put us one step closer to finally figuring out what the hell dark energy and dark matter are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.\n",
{
"rangeStart": 98,
"rangeLength": 28,
"URL": "http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/"
}
],
"inlineTextStyles": [
{
"rangeStart": 98,
"rangeLength": 28,
}
],
"role": "body",
"layout": "bodyLayout"
},
{
"role": "container",
"components": [
{
"role": "photo",
"URL": "bundle://image-0.jpg",
"style": "embedMediaStyle",
"layout": "embedMediaLayout",
"caption": {
"text": "Source: \n NASA\n \n",
{
"rangeStart": 13,
"rangeLength": 4,
"URL": "http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html"
}
],
"inlineTextStyles": [
{
"rangeStart": 13,
"rangeLength": 4,
"textStyle": "embedCaptionTextStyle"
}
],
"textStyle": "embedCaptionTextStyle"
}
}
],
"layout": "embedLayout",
"style": "embedStyle"
}
],
"bundlesToUrls": {
"image-0.jpg": "http://bit.ly/1UFHdpf"
}
}

Facebook:

<article>
<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p>
<p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/">
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure>
<img src="http://bit.ly/1UFHdpf"></img>
<figcaption><cite>
Source: <a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</cite></figcaption>
</figure>
</article>

Google:

<article>
<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p> <p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/"> dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure>
<amp-img width="900" height="445" layout="responsive" src="http://bit.ly/1UFHdpf"></amp-img>
<figcaption>Source:
<a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</figcaption>
</figure>
</article>

All starting from the same HTML source:

<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p><p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/">
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure>
<img width="900" height="445" src="http://bit.ly/1UFHdpf">
<figcaption>Source:
<a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</figcaption>
</figure>

Three workflows based on what started life in one common format.

Three workflows that have their own bugs and vulnerabilities.

Three workflows that duplicate the capabilities of each other.

Three formats that require different indexing/searching.

This is not the reason we can’t have nice things in software, but it certainly is a symptom.

The next time someone proposes a new format for a project, challenge them to demonstrate a value-add over existing formats.

### Balisage 2016 Program Posted! (Newcomers Welcome!)

Monday, May 23rd, 2016

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available
http://www.balisage.net/2016/Program.html

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1. http://balisage.net/XML-In-Web-Out/

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while others still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program: http://www.balisage.net/2016/Program.html

Symposium Program: http://balisage.net/XML-In-Web-Out/symposiumProgram.html

Even if you don’t eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long-time participants like new document/information problems or new ways of looking at old ones. Not to mention they, on occasion, learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing on what controversial terms like “document” mean, but that to one side, they are as good a crew as you are likely to meet.

Enjoy!

### TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Thursday, May 5th, 2016

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI 5 has 563 elements, which may appear in varying, valid, combinations. It also defines 256 attributes which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
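
To make that kind of QA check concrete, here is a small sketch. It is written in Python with the standard library rather than XQuery, purely so it runs without installing BaseX. The allowed-element list, the function name `unexpected_elements` and the sample (a deliberately tiny fragment, not a valid TEI document) are all invented for illustration; a real project would substitute its own rules.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def unexpected_elements(xml_text, allowed):
    """Return the sorted local names of TEI elements that are not in `allowed`."""
    root = ET.fromstring(xml_text)
    found = set()
    for elem in root.iter():
        # ElementTree spells namespaced tags as {namespace}local-name
        if elem.tag.startswith("{" + TEI_NS + "}"):
            found.add(elem.tag.split("}", 1)[1])
    return sorted(found - set(allowed))

# Hypothetical project rule: only these elements are expected
allowed = {"TEI", "text", "body", "p", "persName"}

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Letter from <persName>Ada</persName> at <placeName>Turin</placeName>.</p>
  </body></text>
</TEI>"""

print(unexpected_elements(sample, allowed))
```

The same idea carries over to XQuery: collect the distinct element names in a document and report the ones your project did not expect.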

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

### Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Tuesday, February 2nd, 2016

Call for Participation

Dates:

• 25 March 2016 — Peer review applications due
• 22 April 2016 — Paper submissions due
• 21 May 2016 — Speakers notified
• 10 June 2016 — Late-breaking News submissions due
• 16 June 2016 — Late-breaking News speakers notified
• 8 July 2016 — Final papers due from presenters of peer reviewed papers
• 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
• 1 August 2016 — Pre-conference Symposium
• 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

• Web application development with XML
• Informal data models and consensus-based vocabularies
• Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
• Performance issues in parsing, XML database retrieval, or XSLT processing
• Development of angle-bracket-free user interfaces for non-technical users
• Semistructured data and full text search
• Deployment of XML systems for enterprise data
• Design and implementation of XML vocabularies
• Case studies of the use of XML for publishing, interchange, or archiving
• Alternatives to XML
• the role(s) of XML in the application lifecycle
• the role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of and processed as if from BOM to EOF is the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

### Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Wednesday, January 13th, 2016

In Congressional Roll Call Vote – The Documents (XQuery), we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>, the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<recorded-vote>
<legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
<vote>Nay</vote>
</recorded-vote>
<recorded-vote>
<vote>Yea</vote>
</recorded-vote>

These are only the first two (2) records; the other <recorded-vote> elements differ from these only in their content.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id="A000374", state="LA" and role="legislator".

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

For example, for name-id="A000374", we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374"]

if we wanted to select the <legislator> element that has that attribute and value.

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.
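
To try the path without setting up an XQuery engine, here is a small sketch using Python's standard xml.etree.ElementTree, which supports a limited subset of XPath that includes the [@attribute='value'] predicate. The first record is the one shown in this post; the second (X000000, “Example”) is made up so there is something to filter out. ElementTree paths are relative to the root element, so the leading rollcall-vote step is dropped.

```python
import xml.etree.ElementTree as ET

sample = """<rollcall-vote>
  <vote-data>
    <recorded-vote>
      <legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
      <vote>Nay</vote>
    </recorded-vote>
    <recorded-vote>
      <legislator name-id="X000000" party="D" state="ZZ" role="legislator">Example</legislator>
      <vote>Yea</vote>
    </recorded-vote>
  </vote-data>
</rollcall-vote>"""

root = ET.fromstring(sample)  # root is the <rollcall-vote> element

# Select the <legislator> element that carries the attribute value
matches = root.findall("vote-data/recorded-vote/legislator[@name-id='A000374']")
for legislator in matches:
    print(legislator.text, legislator.get("state"))

# Read the name-id attribute from every <legislator> on the same path
name_ids = [e.get("name-id") for e in root.findall("vote-data/recorded-vote/legislator")]
print(name_ids)
```

The predicate narrows the result to one element. Reading the attribute off each element on the path corresponds to appending the attribute to the path, as described above.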

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.

In preparation for our next post, try searching for “A000374” and limiting your search to the domain, congress.gov.

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

### Congressional Roll Call Vote – The Documents (XQuery)

Monday, January 11th, 2016

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data> reads:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rollcall-vote PUBLIC "-//US Congress//DTDs/vote v1.0 20031119 //EN" "http://clerk.house.gov/evs/vote.dtd">
<?xml-stylesheet type="text/xsl" href="http://clerk.house.gov/evs/vote.xsl"?>
<rollcall-vote>
<vote-metadata>
<majority>R</majority>
<congress>114</congress>
<session>1st</session>
<chamber>U.S. House of Representatives</chamber>
<rollcall-num>705</rollcall-num>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with
Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<vote-type>YEA-AND-NAY</vote-type>
<vote-result>Passed</vote-result>
<action-date>18-Dec-2015</action-date>
<action-time time-etz="09:49">9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the
Department of Veterans Affairs, and related agencies for the fiscal
year ending September 30, 2016, and for other purposes</vote-desc>
<vote-totals>
<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Independent</party>
<yea-total>0</yea-total>
<nay-total>0</nay-total>
<present-total>0</present-total>
<not-voting-total>0</not-voting-total>
</totals-by-party>
<totals-by-vote>
<total-stub>Totals</total-stub>
<yea-total>316</yea-total>
<nay-total>113</nay-total>
<present-total>0</present-total>
<not-voting-total>5</not-voting-total>
</totals-by-vote>
</vote-totals>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element – aka “parent” element

<vote-metadata> – first child element in this document

<majority>R</majority> – first child of <vote-metadata>

<congress>114</congress>

What do you think? Looks like the same level as <majority>R</majority> and it is. It is called a sibling of <majority>R</majority>.

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, let’s get good at spotting direct XPaths.

Let’s skip down in the markup until we come to <totals-by-party-header>. It’s not followed, at least not immediately, with </totals-by-party-header>. That’s a signal that the previous siblings have stopped and we have another step in the XPath.

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

The closing tag, shown by the “/,” signals the end of the <totals-by-party-header> element.

See how you do mapping out the remaining XPaths from the top of the document.

<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>

Tomorrow we are going to dive into the structure of the <vote-data> and how to address the attributes therein and their values.

Enjoy!

### JATS: Journal Article Tag Suite, Navigation Update!

Monday, January 11th, 2016

I posted about the appearance of JATS: Journal Article Tag Suite, version 1.1 and then began to lazily browse the pdf.

I forget what I was looking for now but I noticed the table of contents jumped from page 42 to page 235, and again from 272 to 405. I’m thinking by this point “this is going to be a bear to find elements/attributes in.” I looked for an index only to find none. 🙁

But, there’s hope!

If you look at Chapter 7 “TAG Suite Components,” elements start on page 7 and attributes on page 28, you will find:

Each ✔ is a navigation link to that element (or attribute if you are in the attribute section) under each of those divisions, Archiving, Publishing, Authoring.

Very cool but falls under “non-obvious” for me.

Pass it on so others can safely and quickly navigate JATS 1.1!

PS: It was Tommie Usdin of Balisage fame who pointed out the table in chapter 7 to me. Thanks Tommie!

### Congressional Roll Call Vote and XQuery (A Do Over)

Sunday, January 10th, 2016

Once words are written, as an author I consider them to be fixed. Even typo corrections should be acknowledged rather than silently “improving” the original text. Rather than editing what has been said, more words can cover the same ground with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post to be meaningful so let’s cover getting that set up first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon, unzip the file and you can choose to put saxon9he.jar in your Java classpath (if set) or you can invoke it with the -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.
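Before going further, a tiny query file is a handy way to check that invocation. The sketch below is only a smoke test: the file name hello.xq and the element names are hypothetical, and the query reads no external documents, so any failure should point at the Java or classpath setup rather than at your data.

```xquery
xquery version "3.0";

(: hello.xq is a hypothetical file name. The query touches no external
   documents, so any failure points at the Java or classpath setup. :)
<check>
  <sum>{ sum(1 to 10) }</sum>
  <words>{ string-join(("XQuery", "is", "running"), " ") }</words>
</check>
```

Saved as hello.xq, it can be run with java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:hello.xq and should come back with a small <check> element holding 55 and the phrase “XQuery is running.”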

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

#!/bin/bash
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:"$1"

After creating that file, which I very imaginatively named “runsaxon.sh,” I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “runsaxon.sh” I type:

./runsaxon.sh ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The “>” sign is the redirection operator in Linux; it sends the output to a file, in this case, ex-5.4.html.

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Priscilla Walmsley. Highly recommended. Priscilla has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

### Congressional Roll Call and XQuery – (Week 1 of XQuery)

Saturday, January 9th, 2016

Truthfully a little more than a week of daily XQuery posts, I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers of interesting software or techniques. Most of them are not “how-to” or step by step guides, but some of them are.

The posts on congressional roll call documents made sense to me but then I wrote them. Part of what I sensed was that either you know enough to follow my jumps, in which case you are looking for specific details, like the correspondence across documents for attribute values, and not so much for my XQuery expressions.
On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations, one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I’m going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with very terse expert tips at the end of each post if needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

Will have the first longer style post up tomorrow, January 10, 2016 and we will see how the week develops from there.

### Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Friday, January 8th, 2016

One of the things that yesterday’s output lacked was the full names of the Georgia representatives. Which aren’t reported in the roll call documents.

But, what the roll call documents do have, is the following:

<recorded-vote>
<legislator name-id="J000288" sort-field="Johnson (GA)" unaccented-name="Johnson (GA)" party="D" state="GA" role="legislator">Johnson (GA)</legislator>
<vote>Nay</vote>
</recorded-vote>

With emphasis on name-id="J000288"

I call that attribute out because there is a sample data file, just for the House of Representatives that has:

<bioguideID>J000288</bioguideID>

And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at Congress.gov.
The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:

That is the result of joining/merging two XML files hosted at congress.gov in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: http://clerk.house.gov/evs/2015/roll705.xml and the House Member Data File was: http://xml.house.gov/MemberData/MemberData.xml. The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery expression for merging the two XML files:

<ul>
{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
$mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li> {string($voter/legislator)} — {string($voter/vote)}</li>
}</ul>

Which makes our localized display a bit better for local readers but only just.

What we need is more information that can be found at: http://clerk.house.gov/evs/2015/roll705.xml.

More on that tomorrow!

### PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:

Downloads

Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, 9.14.2.2 XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection, there was no anchor at 9.14.2.2.

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes.
This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”

## UPSERT

A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

## Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed.
Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

## Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past.
It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

### A Lesson about Let Clauses (XQuery)

Wednesday, January 6th, 2016

I was going to demonstrate how to localize roll call votes so that only representatives from your state and their votes were displayed for any given roll call vote.

Which would enable libraries or local newsrooms, whose users/readers have little interest in how obscure representatives from other states voted, to pare down the roll call vote list to those that really matter, your state’s representatives.

But I remembered that I had promised to clean up the listings in yesterday’s post that read:

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}

and kept repeating doc("http://clerk.house.gov/evs/2015/roll705.xml").

My thought was to replace that string with a variable declared by a let clause and then substitute that variable for that string.

To save you from the same mistake, combining a let clause with direct element constructors returns an error saying, in this case:

Left operand of '>' needs parentheses

Not a terribly helpful error message.

I have found examples of using a let clause within a direct element constructor that would have defeated the rationale for declaring the variable to begin with.

Tomorrow I hope to post today’s content, which will enable you to display data relevant to local voters, news reporters, for any arbitrary roll call vote in Congress.

Mark today’s adventure as a mistake to avoid. 😉

### Jazzing a Roll Call Vote – Part 3 (XQuery)

Tuesday, January 5th, 2016

I posted Congressional Roll Call Vote – Accessibility Issues earlier today to deal with some accessibility issues noticed by @XQuery with my color coding.
Today we are going to start at the top of the boring original roll call vote and work our way down using XQuery.

Be forewarned that the XQuery you see today we will be shortening and cleaning up tomorrow. It works, but it’s not best practice.

You will need to open up the source of the original roll call vote to see the elements I select in the path expressions.

Here is the XQuery that is the goal for today:

xquery version "3.0";
declare boundary-space preserve;
<html>
<head></head>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>
<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>
<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>
<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}
</body>
</html>

The title of the document we obtain with:

<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

Two quick things to notice:

First, for very simple documents like this one, I use “//” rather than writing out the path to the rollcall-num element. I already know it only occurs once in each rollcall document.

Second, when using direct element constructors, the XQuery statements are enclosed by “{ }” brackets.
The rollcall number, date and time of the vote come next (I have introduced line breaks for readability):

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

If you compare my presentation of that string and that from the original, you will find the original has slightly more space between the items.

Here is the XSLT for that spacing:

<xsl:if test="legis-num[text()!='0']"><xsl:text> </xsl:text><b><xsl:value-of select="legis-num"/></b></xsl:if>
<xsl:text> </xsl:text><xsl:value-of select="vote-type"/>
<xsl:text> </xsl:text><xsl:value-of select="action-date"/>
<xsl:text> </xsl:text><xsl:value-of select="action-time"/><br/>

Since I already had white space separating my XQuery expressions, I just added to the prologue:

declare boundary-space preserve;

The last two lines:

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>
<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}

Are just standard queries for content. The string function extracts the content of the element you address.

Tomorrow we are going to talk about how to clean up and shorten the path statements and look around for information that should be at the top of this document, but isn’t!

PS: Did you notice that the vote totals, etc., are written as static data in the XML file? Curious isn’t it? Easy enough to generate from the voting data. I don’t have an answer but thought you might.

### Congressional Roll Call Vote – Accessibility Issues

Tuesday, January 5th, 2016

I posted a color coded version of a congressional roll call vote in Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway), using red for Republicans and blue for Democrats.

#XQuery points out accessibility issues which depend upon color perception.
Color coding works better for me than the more traditional roman versus italic font face distinction but let’s improve the color coding to remove the accessibility issue.

The first question is what colors should I use for accessibility?

In searching to answer that question I found this thread at Edward Tufte’s site (of course), Choice of colors in print and graphics for color-blind readers, which has a rich list of suggestions and pointers to other resources.

One in particular, Color Universal Design (CUD), posted by Maarten Boers, has this graphic on colors:

Relying on that palette, I changed the colors for the roll call vote to Republicans in orange; Democrats in sky blue and re-generated the roll call document.

Here is an accessible, but still color-coded, version of: FINAL VOTE RESULTS FOR ROLL CALL 705.

An upside of XML is that changing the presentation of all 429 votes took only a few seconds to change the stylesheet and re-generate the results.

Thanks to #XQuery for prodding me on the accessibility issue which resulted in finding the thread at Tufte and the Colorblind barrier-free color pallet.

Other posts on congressional roll call votes:

### Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Monday, January 4th, 2016

Apologies but did not make as much progress on the Congressional Roll Call vote as I had hoped.

I did find some interesting information about the vote.xsl stylesheet and managed to use color to code members of the House.

You probably remember me whining about how hard it is to tell between roman and italics to distinguish members of different parties.

Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

The XSLT code is worse than I imagined.
Here’s what I mean:

<b><center><font size="+2">FINAL VOTE RESULTS FOR ROLL CALL <xsl:value-of select="/rollcall-vote/vote-metadata/rollcall-num"/>
<xsl:if test="/rollcall-vote/vote-metadata/vote-correction[text()!='']">*</xsl:if></font></center></b>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'D']"> -->
<xsl:if test = "$Majority='D'">
<center>(Democrats in roman; Republicans in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'R']"> -->
<xsl:if test = "$Majority!='D'">
<center>(Republicans in roman; Democrats in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>

Which party is in the majority determines whether the names in a party appear in roman or italic font face. Now there’s a distinction that will be lost on a casual reader!

What’s more, if you are trying to reform the stylesheet, don’t look for R or D but again for majority party:

<xsl:template match="vote">
<!-- Handles formatting of Member names based on party. -->
<!-- <xsl:if test="../legislator/@party='R'"><xsl:value-of select="../legislator"/></xsl:if>
<xsl:if test="../legislator/@party='D'"><i><xsl:value-of select="../legislator"/></i></xsl:if> -->
<xsl:if test="../legislator/@party='I'"><u><xsl:value-of select="../legislator"/></u></xsl:if>
<xsl:if test="../legislator/@party!='I'">
<xsl:if test="../legislator/@party = $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<xsl:value-of select="../legislator"/>
</xsl:if>
<xsl:if test="../legislator/@party != $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<i><xsl:value-of select="../legislator"/></i>
</xsl:if>
</xsl:if>
</xsl:template>

As you can see, selecting by party has been commented out in favor of the roman/italic distinction based on the majority party.

I wanted to label the Republicans with an icon but my GIMP skills don’t extend to making an icon of young mothers throwing their children under the carriage wheels of the wealthy to save them from a life of poverty and degradation. A bit much to get into an HTML button sized icon.

I settled for using the traditional red for Republicans and blue for Democrats and ran the modified stylesheet against roll705.xml locally.

Here is FINAL VOTE RESULTS FOR ROLL CALL 705 as HTML.

Question: Are red and blue easier to distinguish than roman and italic?

If your answer is yes, why resort to typographic subtlety on something like party affiliation?

Are subtle distinctions used to confuse the uninitiated and unwary?

### Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

Sunday, January 3rd, 2016

Roll call votes in the US House of Representatives are a staple of local, state and national news. If you go looking for the “official” version, what you find is as boring as your 5th grade civics class.

Trigger Warning: Boring and Minimally Informative Page Produced By Following Link: Final Vote Results For Roll Call 705.

Take a deep breath and load the page. It will open in a new browser tab. Boring. Yes? (You were warned.)

It is the recent roll call vote to fund the US government, take another slice of privacy from citizens, and make a number of other dubious policy choices. (Everything after the first comma depending upon your point of view.)

Whatever your politics though, you have to agree this is sub-optimal presentation, even for a government document.
This is no accident, sans the header, you will find the identical presentation of this very roll call vote at: page H10696, Congressional Record for December 18, 2015 (pdf).

Disappointing so much XML, XSLT, XQuery, etc., has been wasted duplicating non-informative print formatting. Or should I say less-informative formatting than is possible with XML?

Once the data is in XML, legend has it, users can transform that XML in ways more suited to their purposes and not those of the content providers.

I say “legend has it,” because we all know if content providers had their way, web navigation would be via ads and not bare hyperlinks. You want to see the next page? You must select the ad + hyperlink, waiting for the ad to clear before the resource appears.

I can summarize my opinion about content provider control over information legally delivered to my computer: Screw that!

If a content provider enables access to content, I am free to transform that content into speech, graphics, add information, take away information, in short do anything that my imagination desires and my skill enables.

Let’s take the roll call vote in the House of Representatives, Final Vote Results For Roll Call 705.

Just under the title you will read:

(Republicans in roman; Democrats in italic; Independents underlined)

Boring.

For a bulk display of voting results, we can do better than that.

What if we had small images to identify the respective parties? Here are some candidates (sic) for the Republicans:

Of course we would have to reduce them to icon size, but XML processing is rarely ever just XML processing. Nearly every project includes some other skill set as well.

Which one do you think looks more neutral? 😉

It would certainly be more colorful and, depending upon your inclinations, more fun to play about with than the difference between roman and italic. Yes?

Presentation of the data in http://clerk.house.gov/evs/2015/roll705.xml is only one of the possibilities that XQuery offers.
Follow along and offer your suggestions for changes, additions and modifications.

First steps:

In the browser tab with Final Vote Results For Roll Call 705, use Ctrl-U to view the page source. First notice that the boring web presentation is controlled by http://clerk.house.gov/evs/vote.xsl.

Copy and paste: http://clerk.house.gov/evs/vote.xsl into a new browser tab and press return. The resulting xsl:stylesheet is responsible for generating the original page, from the vote totals to column presentation of the results.

Pay particular attention to the generation of totals from the <vote-data> element and its children. That generation is powered by these lines in vote.xsl:

<xsl:apply-templates select="/rollcall-vote/vote-metadata"/>
<!-- Create total variables based on counts. -->
<xsl:variable name="y" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Yea'])"/>
<xsl:variable name="a" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Aye'])"/>
<xsl:variable name="yeas" select="$y + $a"/>
<xsl:variable name="nay" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Nay'])"/>
<xsl:variable name="no" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='No'])"/>
<xsl:variable name="nays" select="$nay + $no"/>
<xsl:variable name="nvs" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Not Voting'])"/>
<xsl:variable name="presents" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Present'])"/>
<br/>

(Not entirely, I omitted the purely formatting stuff.)

For tomorrow I will be working on a more “visible” way to identify political party affiliation and “borrowing” the count code from vote.xsl.

Enjoy!

You may be wondering what XQuery has to do with topic maps? Well, if you think about it, every time we select, aggregate, etc., data, we are making choices based on notions of subject identity.
That is we think the data we are manipulating represents some subjects and/or information about some subjects, that we find sensible (for some unstated reason) to put together for others to read.

The first step towards a topic map, however, is the putting of information together so we can judge what subjects need explicit representation and how we choose to identify them.

Prior topic map work was never explicit about how we get to a topic map, putting that possibly divisive question behind us, we simply start with topic maps, ab initio.

I was in the car when we took that turn and for the many miles since then. I have come to think that a better starting place is choosing subjects, what we want to say about them and how we wish to say it, so that we have only so much machinery as is necessary for any particular set of subjects.

Some subjects can be identified by IRIs, others by multi-dimensional vectors, still others by unspecified processes of deep learning, etc. Which ones we choose will depend upon the immediate ROI from subject identity and relationships between subjects.

I don’t need triples, for instance, to recognize natural languages to a sufficient degree of accuracy. Unnecessary triples, topics or associations are just padding. If you are on a per-triple contract, they make sense, otherwise, not.

A long way of saying that subject identity lurks just underneath the application of XQuery and we will see where it is useful to call subject identity to the fore.

### Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery (Part 2)

Saturday, January 2nd, 2016

Despite heavy carousing during the holidays, you may still remember Great R packages for data import, wrangling & visualization [+ XQuery], where I re-sorted the table by Sharon Machlis, to present the R packages in package name order.
I followed that up with: Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery, where I detailed the travails of trying to sort the software packages by their short descriptions, again in alphabetical order.

My assumption in that post was that either the spaces or the “,” commas in the descriptions were fouling the sort. That wasn’t the case, which I should have known because the string function always returns a string. That is the spaces and “,” inside are just parts of a string, nothing more.

The up-side of the problem was that I spent more than a little while with Walmsley’s XQuery book, searching for ever more esoteric answers.

Here’s the failing XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

And here is the working XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

Here is the mistake highlighted:

order by lower-case(string($row/td[2]/a))


My first mistake was the inclusion of “/a” in the path.

Using string on ($row/td[1]), that is without having /a at the end of the path, gives the original script the same result. (Run that for yourself on favorite-R-packages.xml.)

Make any path as long as required and no longer!

My second mistake was not checking the XPath immediately upon the failure of the sort. (The simplest answer is usually the correct one.)

Enjoy!

Update: Removed the quote marks around (string($row/td[2])) in both queries, they were part of an explanation that did not make the cut. Thanks to XQuery for the catch!
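For anyone without favorite-R-packages.xml to hand, here is a minimal self-contained sketch of the working sort. The rows are placeholders: the package names and descriptions are invented, and the table layout (a link inside td[1], plain text in td[2]) is inferred from the discussion above rather than copied from the original file.

```xquery
xquery version "3.0";

(: Placeholder rows standing in for favorite-R-packages.xml. The names and
   descriptions are invented; the layout (link in td[1], plain text in
   td[2]) is an assumption inferred from the post. :)
let $table :=
  <table>
    <tr><td><a href="#">pkg-one</a></td><td>zebra charts, simple</td></tr>
    <tr><td><a href="#">pkg-two</a></td><td>Apple sorting, fast</td></tr>
    <tr><td><a href="#">pkg-three</a></td><td>mango import, export</td></tr>
  </table>
return
  <table>{
    (: $row/td[2]/a would select nothing in these rows, so string() of it
       is "" for every row and the order by has nothing to sort on. :)
    for $row in $table/tr
    order by lower-case(string($row/td[2]))
    return <tr>{$row/td[2]} {$row/td[1]}</tr>
  }</table>
```

With these rows the descriptions should come back in the order Apple, mango, zebra, commas and spaces included. Changing the sort key to $row/td[2]/a is a quick way to reproduce the failure on a small scale.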

### XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Friday, January 1st, 2016

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

Using a virtual machine to install the latest version of Ubuntu (15.10), which had the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations so you are likely to spend some time loading linux headers and the like to compile software.

The advantage of a virtual box is that I don’t have to risk doing something dumb or out of fatigue to my working setup. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the first of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.

### XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Thursday, December 31st, 2015

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

2. Build Xerces-C

cd xerces-c-3.1.2/
./configure
make

4. Build XQilla

cd xqilla/
./configure --with-xerces=`pwd`/../xerces-c-3.1.2/
make

A few notes that may help:

Obtain Xerces-c-3.1.2 from its homepage.

Xerces project homepage. Home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and, Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

the README presumes you built xerces-c-3.1.2 in a sub-directory of the XQilla source. You could do that; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2 

No error messages so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla
xqilla-commandline.o
-L/home/patrick/working/xerces-c-3.1.2/src
/home/patrick/working/xerces-c-3.1.2/src/
-Wl,/home/patrick/working/xerces-c-3.1.2/src
/usr/bin/ld: warning: libicuuc.so.55, needed by
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so,
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so:
undefined reference to uset_close_55'
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so:
undefined reference to ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory /home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1


To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have libicuuc.so.55.

If I manually upgrade libraries, I might create an inconsistency that package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” you should see some of the lists you can find at Congress.gov. Valid XML I’ll concede but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

### Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Wednesday, December 30th, 2015

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.

How often do you see strings with no whitespaces? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

| Short description | Package |
| --- | --- |
| package development, package installation | devtools |
| misc | installr |
| data import | readxl |
| data import, data export | googlesheets |
| data import | RMySQL |
| data import | readr |
| data import, data export | rio |
| data analysis | psych |
| data wrangling, data analysis | sqldf |
| data import, data wrangling | jsonlite |
| data import, data wrangling | XML |
| data import, data visualization, data analysis | quantmod |
| data import, web scraping | rvest |
| data wrangling, data analysis | dplyr |
| data wrangling | plyr |
| data wrangling | reshape2 |
| data wrangling | tidyr |
| data wrangling, data analysis | data.table |
| data wrangling | stringr |
| data wrangling | lubridate |
| data wrangling, data analysis | zoo |
| data display | editR |
| data display | knitr |
| data display, data wrangling | listviewer |
| data display | DT |
| data visualization | ggplot2 |
| data visualization | dygraphs |
| data visualization | googleVis |
| data visualization | metricsgraphics |
| data visualization | RColorBrewer |
| data visualization | plotly |
| mapping | leaflet |
| mapping | choroplethr |
| mapping | tmap |
| misc | fitbitScraper |
| Web analytics | rga |
| Web analytics | RSiteCatalyst |
| package development | roxygen2 |
| data visualization | shiny |
| misc | openxlsx |
| data wrangling, data analysis | gmodels |
| data wrangling | car |
| data visualization | rcdimple |
| data wrangling | foreach |
| data acquisition | downloader |
| data wrangling | scales |
| data visualization | plotly |

Err, that’s not right!

The XQuery from yesterday:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[1]/a))
return <tr>{$row/td[1]} {$row/td[2]}</tr>
}</table>
</html>

XQuery from today, changes in red (td[1] and td[2] trade places):

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document sort order, as they appeared in the document.

Here’s a snippet of that document:

<table>
<tr>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also
essential if you want to easily install other packages from GitHub. Install it!
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili &amp; others</td>
</tr>
<tr>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
</tr>
...
</table>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick one that can be easily adapted to the next soiled data that comes through the door.

PS: Neither Saxon (9.7), nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉
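As for the silence at the console, here is a minimal sketch of why no error is raised, assuming the cause is the one the follow-up post points to (the /a step on the end of the path). The two-row inline table is a hypothetical stand-in shaped like the installr row in the snippet above: the link lives in td[1], and td[2] is plain text.

```xquery
xquery version "1.0";
(: Hypothetical two-row table: link in td[1], plain text in td[2].
   The "#" hrefs are placeholders, not real URLs. :)
let $table :=
  <table>
    <tr><td><a href="#">installr</a></td><td>misc</td></tr>
    <tr><td><a href="#">readxl</a></td><td>data import</td></tr>
  </table>
(: For each row, show the sort key with and without the trailing /a step. :)
for $row in $table/tr
return
  <key with-a="{string($row/td[2]/a)}" without-a="{string($row/td[2])}"/>
```

For both rows the with-a attribute comes back empty while without-a carries the category text. string() applied to an empty sequence returns a zero-length string rather than raising an error, so every row gets the same sort key and the processor has nothing to complain about.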