Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 13, 2016

The Symptom of Many Formats

Filed under: JSON,Publishing,XML — Patrick Durusau @ 7:56 am

Distro.Mic: An Open Source Service for Creating Instant Articles, Google AMP and Apple News Articles

From the post:

Mic is always on the lookout for new ways to reach our audience. When Facebook, Google and Apple announced their own native news experiences, we jumped at the opportunity to publish there.

While setting Mic up on these services, David Björklund realized we needed a common article format that we could use for generating content on any platform. We call this format article-json, and we open-sourced parsers for it.

Article-json got a lot of support from Google and Apple, so we decided to take it a step further. Enter DistroMic. Distro lets anyone transform an HTML article into the format mandated by one of the various platforms.

Sigh.

While I applaud the DistroMic work, I am saddened that it was necessary.

From the DistroMic page, here is the same article in three formats:

Apple:

{
  "article": [
    {
      "text": "Astronomers just announced the universe might be expanding up to 9% faster than we thought.\n",
      "additions": [
        {
          "type": "link",
          "rangeStart": 59,
          "rangeLength": 8,
          "URL": "http://hubblesite.org/newscenter/archive/releases/2016/17/text/"
        }
      ],
      "inlineTextStyles": [
        {
          "rangeStart": 59,
          "rangeLength": 8,
          "textStyle": "bodyLinkTextStyle"
        }
      ],
      "role": "body",
      "layout": "bodyLayout"
    },
    {
      "text": "It’s a surprising insight that could put us one step closer to finally figuring out what the hell dark energy and dark matter are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.\n",
      "additions": [
        {
          "type": "link",
          "rangeStart": 98,
          "rangeLength": 28,
          "URL": "http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/"
        }
      ],
      "inlineTextStyles": [
        {
          "rangeStart": 98,
          "rangeLength": 28,
          "textStyle": "bodyLinkTextStyle"
        }
      ],
      "role": "body",
      "layout": "bodyLayout"
    },
    {
      "role": "container",
      "components": [
        {
          "role": "photo",
          "URL": "bundle://image-0.jpg",
          "style": "embedMediaStyle",
          "layout": "embedMediaLayout",
          "caption": {
            "text": "Source: \n NASA\n \n",
            "additions": [
              {
                "type": "link",
                "rangeStart": 13,
                "rangeLength": 4,
                "URL": "http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html"
              }
            ],
            "inlineTextStyles": [
              {
                "rangeStart": 13,
                "rangeLength": 4,
                "textStyle": "embedCaptionTextStyle"
              }
            ],
            "textStyle": "embedCaptionTextStyle"
          }
        }
      ],
      "layout": "embedLayout",
      "style": "embedStyle"
    }
  ],
  "bundlesToUrls": {
    "image-0.jpg": "http://bit.ly/1UFHdpf"
  }
}

Facebook:

<article>
<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p>
<p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/">
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure data-feedback="fb:likes,fb:comments">
<img src="http://bit.ly/1UFHdpf"></img>
<figcaption><cite>
Source: <a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</cite></figcaption>
</figure>
</article>

Google:

<article>
<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p>
<p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/">dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure>
<amp-img width="900" height="445" layout="responsive" src="http://bit.ly/1UFHdpf"></amp-img>
<figcaption>Source:
<a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</figcaption>
</figure>
</article>

All starting from the same HTML source:

<p>Astronomers just announced the universe might be expanding
<a href="http://hubblesite.org/newscenter/archive/releases/2016/17/text/">up to 9%</a> faster than we thought.</p><p>It’s a surprising insight that could put us one step closer to finally figuring out what the hell
<a href="http://science.nasa.gov/astrophysics/focus-areas/what-is-dark-energy/">
dark energy and dark matter</a> are. Or it could mean that we’ve gotten something fundamentally wrong in our understanding of physics, perhaps even poking a hole in Einstein’s theory of gravity.</p>
<figure>
<img width="900" height="445" src="http://bit.ly/1UFHdpf">
<figcaption>Source:
<a href="http://www.nasa.gov/mission_pages/hubble/hst_young_galaxies_200604.html">NASA</a>
</figcaption>
</figure>

Three workflows based on what started life in one common format.

Three workflows that have their own bugs and vulnerabilities.

Three workflows that duplicate the capabilities of each other.

Three formats that require different indexing/searching.
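To see how thin each of these transformations is, consider just the HTML-to-AMP step. A sketch in XQuery (this is not DistroMic’s code; it assumes the snippet above has been wrapped in an <article> root element and saved as article.xml):

xquery version "3.0";
(: A sketch: rewrite the common HTML source into AMP-style markup
   with a recursive typeswitch; only img needs special handling. :)
declare function local:to-amp($node as node()) as node()*
{
  typeswitch ($node)
    case element(img) return
      element amp-img {
        $node/@width, $node/@height,
        attribute layout { "responsive" },
        $node/@src
      }
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:to-amp($child)
      }
    default return $node
};
local:to-amp(doc("article.xml")/article)

The HTML targets differ mostly in attribute spelling; the Apple News JSON is the only genuinely different shape.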

This is not the entire reason why we can’t have nice things in software, but it certainly is a symptom.

The next time someone proposes a new format for a project, challenge them to demonstrate a value-add over existing formats.

May 23, 2016

Balisage 2016 Program Posted! (Newcomers Welcome!)

Filed under: Conferences,Topic Maps,XML,XML Schema,XPath,XProc,XQuery,XSLT — Patrick Durusau @ 8:03 pm

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available
http://www.balisage.net/2016/Program.html

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1. http://balisage.net/XML-In-Web-Out/

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer
scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while other still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program: http://www.balisage.net/2016/Program.html

Symposium Program: http://balisage.net/XML-In-Web-Out/symposiumProgram.html

Even if you don’t eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long-time participants like new document/information problems, or new ways of looking at old ones. Not to mention that they, on occasion, learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing on what controversial terms like “document” mean, but that to one side, they are as good a crew as you are likely to meet.

Enjoy!

May 5, 2016

TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Filed under: HTML,Text Encoding Initiative (TEI),XML,XQuery — Patrick Durusau @ 1:10 pm

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI P5 has 563 elements, which may appear in varying, valid combinations. It also defines 256 attributes, which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
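For instance, a sketch of that QA idea (the file name and the approved list here are illustrative only): list every element name in a document that your project has not blessed.

xquery version "3.0";
(: A sketch: flag element names outside a project-approved subset. :)
let $allowed := ("TEI", "teiHeader", "fileDesc", "text", "body", "div", "head", "p")
for $name in distinct-values(for $e in doc("tei-sample.xml")//* return local-name($e))
where not($name = $allowed)
return $name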

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

February 2, 2016

Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Filed under: Conferences,XLink,XML,XML Data Clustering,XML Schema,XPath,XProc,XQuery,XSLT — Patrick Durusau @ 9:47 pm

Call for Participation

Dates:

  • 25 March 2016 — Peer review applications due
  • 22 April 2016 — Paper submissions due
  • 21 May 2016 — Speakers notified
  • 10 June 2016 — Late-breaking News submissions due
  • 16 June 2016 — Late-breaking News speakers notified
  • 8 July 2016 — Final papers due from presenters of peer reviewed papers
  • 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
  • 1 August 2016 — Pre-conference Symposium
  • 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

  • Web application development with XML
  • Informal data models and consensus-based vocabularies
  • Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
  • Performance issues in parsing, XML database retrieval, or XSLT processing
  • Development of angle-bracket-free user interfaces for non-technical users
  • Semistructured data and full text search
  • Deployment of XML systems for enterprise data
  • Web application development with XML
  • Design and implementation of XML vocabularies
  • Case studies of the use of XML for publishing, interchange, or archiving
  • Alternatives to XML
  • the role(s) of XML in the application lifecycle
  • the role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of and processed as if from BOM to EOF is the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

January 13, 2016

Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 11:54 pm

Congressional Roll Call Vote – The Documents (XQuery) we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>, the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<recorded-vote>
<legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
<vote>Nay</vote>
</recorded-vote>
<recorded-vote>
<legislator name-id="A000370" sort-field="Adams" unaccented-name="Adams" party="D" state="NC" role="legislator">Adams</legislator>
<vote>Yea</vote>
</recorded-vote>

These are only the first two records; the other <recorded-vote> elements vary from these only in their content.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id="A000374", state="LA" and role="legislator".

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

For example, to select the <legislator> element whose name-id attribute has the value "A000374" (or just that attribute value), we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374"]
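Put to work in a complete query, a sketch you can run against the live file:

xquery version "3.0";
(: A sketch: fetch one legislator's vote by name-id. :)
for $rv in doc("http://clerk.house.gov/evs/2015/roll705.xml")
           /rollcall-vote/vote-data/recorded-vote
where $rv/legislator/@name-id = "A000374"
return concat(string($rv/legislator), " voted ", string($rv/vote))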

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.

In preparation for our next post, try searching for “A000374”, limiting your search to the domain congress.gov.

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

January 11, 2016

Congressional Roll Call Vote – The Documents (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:41 pm

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data> reads:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rollcall-vote PUBLIC "-//US Congress//DTDs/vote v1.0 20031119 //EN" "http://clerk.house.gov/evs/vote.dtd">
<?xml-stylesheet type="text/xsl" href="http://clerk.house.gov/evs/vote.xsl"?>
<rollcall-vote>
<vote-metadata>
<majority>R</majority>
<congress>114</congress>
<session>1st</session>
<chamber>U.S. House of Representatives</chamber>
<rollcall-num>705</rollcall-num>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with
Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<vote-type>YEA-AND-NAY</vote-type>
<vote-result>Passed</vote-result>
<action-date>18-Dec-2015</action-date>
<action-time time-etz="09:49">9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the
Department of Veterans Affairs, and related agencies for the fiscal
year ending September 30, 2016, and for other purposes</vote-desc>
<vote-totals>
<totals-by-party-header>
<party-header>Party</party-header>
<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered "Present"</present-header>
<not-voting-header>Not Voting</not-voting-header>
</totals-by-party-header>
<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Independent</party>
<yea-total>0</yea-total>
<nay-total>0</nay-total>
<present-total>0</present-total>
<not-voting-total>0</not-voting-total>
</totals-by-party>
<totals-by-vote>
<total-stub>Totals</total-stub>
<yea-total>316</yea-total>
<nay-total>113</nay-total>
<present-total>0</present-total>
<not-voting-total>5</not-voting-total>
</totals-by-vote>
</vote-totals>
</vote-metadata>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element, the outermost “parent” element

<vote-metadata> – first child element in this document
XPath rollcall-vote/vote-metadata

<majority>R</majority> – first child of <vote-metadata>
XPath rollcall-vote/vote-metadata/majority

<congress>114</congress>

What do you think? It looks like it is at the same level as <majority>R</majority>, and it is. It is called a sibling of <majority>R</majority>.
XPath rollcall-vote/vote-metadata/congress

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, lets get good at spotting direct XPaths.

Let’s skip down in the markup until we come to <totals-by-party-header>. It’s not followed, at least not immediately, by </totals-by-party-header>. That’s a signal that the previous siblings have stopped and we have another step in the XPath. (Notice that we also passed the opening <vote-totals> tag on the way down; it contributes a step of its own.)

<totals-by-party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header

<party-header>Party</party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header/party-header

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered "Present"</present-header>
<not-voting-header>Not Voting</not-voting-header>

The closing tag, shown by the “/”, signals the end of the <totals-by-party-header> element.

</totals-by-party-header>

See how you do mapping out the remaining XPaths from the top of the document.

<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
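Once you have the paths mapped, you can check your work with a short query; a sketch over the vote-totals path:

xquery version "3.0";
(: A sketch: report the yea/nay totals by party, using the paths above. :)
for $t in doc("http://clerk.house.gov/evs/2015/roll705.xml")
          /rollcall-vote/vote-metadata/vote-totals/totals-by-party
return concat(string($t/party), ": ",
              string($t/yea-total), " yeas, ",
              string($t/nay-total), " nays")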

Tomorrow we are going to dive into the structure of the <vote-data> and how to address the attributes therein and their values.

Enjoy!

JATS: Journal Article Tag Suite, Navigation Update!

Filed under: Publishing,XML — Patrick Durusau @ 8:28 am

I posted about the appearance of JATS: Journal Article Tag Suite, version 1.1 and then began to lazily browse the pdf.

I forget what I was looking for now but I noticed the table of contents jumped from page 42 to page 235, and again from 272 to 405. I’m thinking by this point “this is going to be a bear to find elements/attributes in.” I looked for an index, only to find none. 🙁

But, there’s hope!

If you look at Chapter 7, “TAG Suite Components,” where elements start on page 7 and attributes on page 28, you will find:

JATS-nav

Each ✔ is a navigation link to that element (or attribute if you are in the attribute section) under each of those divisions, Archiving, Publishing, Authoring.

Very cool but falls under “non-obvious” for me.

Pass it on so others can safely and quickly navigate JATS 1.1!

PS: It was Tommie Usdin of Balisage fame who pointed out the table in chapter 7 to me. Thanks Tommie!

January 10, 2016

Congressional Roll Call Vote and XQuery (A Do Over)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:11 pm

Once words are written, as an author I consider them to be fixed. Even typos should be acknowledged as corrected, not silently “improved” in the original text. Rather than editing what has been said, more words can cover the same ground with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post to be meaningful so let’s cover getting that setup first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon, unzip the file and you can choose to put saxon9he.jar in your Java classpath (if set) or you can invoke it with the -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

#!/bin/bash
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:$1

After creating that file, which I very imaginatively named “runsaxon.sh,” I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “runsaxon.sh” I type:

./runsaxon.sh ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The “>” sign is a redirection operator in Linux; it sends the output to a file, in this case ex-5.4.html.
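If you want a quick smoke test of the setup, save a trivial query as, say, test.xq (the file name is mine, use anything):

xquery version "3.0";
(: A trivial query: if this prints the current date and time, Saxon works. :)
<p>Saxon says the time is {string(current-dateTime())}</p>

and run ./runsaxon.sh test.xq.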

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Priscilla Walmsley. Highly recommended.

Priscilla has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

January 9, 2016

Congressional Roll Call and XQuery – (Week 1 of XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 9:49 pm

Truthfully it has been a little more than a week of daily XQuery posts; I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers of interesting software or techniques. Most of them are not “how-to” or step by step guides, but some of them are.

The posts on congressional roll call documents made sense to me but then I wrote them. Part of what I sensed was that either you know enough to follow my jumps, in which case you are looking for specific details, like the correspondence across documents for attribute values, and not so much for my XQuery expressions.

On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations, one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I’m going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with a very terse, short experts tips at the end of each post if needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

Will have the first longer style post up tomorrow, January 10, 2016 and we will see how the week develops from there.

January 8, 2016

Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:59 pm

One of the things that yesterday’s output lacked was the full names of the Georgia representatives, which aren’t reported in the roll call documents.

But, what the roll call documents do have, is the following:

<recorded-vote>
<legislator name-id="J000288" sort-field="Johnson (GA)" unaccented-name="Johnson (GA)"
party="D" state="GA" role="legislator">Johnson (GA)</legislator>
<vote>Nay</vote>
</recorded-vote>

With emphasis on name-id="J000288".

I call that attribute out because there is a member data file, just for the House of Representatives, that has:

<bioguideID>J000288</bioguideID>

And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at Congress.gov.
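Since the correspondence is undocumented, it is worth checking before you rely on it. Here is a sketch that counts roll call entries with no matching bioguideID in the member data file (zero is the answer you want):

xquery version "3.0";
(: A sketch: sanity-check the undocumented name-id/bioguideID
   correspondence. A result of 0 means every voter matched. :)
let $ids := doc("http://xml.house.gov/MemberData/MemberData.xml")
            //member/member-info/bioguideID/string()
return count(doc("http://clerk.house.gov/evs/2015/roll705.xml")
             //recorded-vote/legislator[not(@name-id = $ids)])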

The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:

congress-ga-phone

That is the result of joining/merging two XML files hosted at congress.gov in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: http://clerk.house.gov/evs/2015/roll705.xml and the House Member Data File was: http://xml.house.gov/MemberData/MemberData.xml. The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery expression for merging the two XML files:

{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
$mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li>{string($mem//official-name)} — {string($voter/vote)} — {string($mem//phone)}</li>
}

At a minimum, you can auto-generate a listing for representatives from your state, their vote on any roll-call vote and give readers their phone number to register their opinion.

This is a crude example of what you can do with XML, XQuery and online data from Congress.gov.

BTW, if you work in a “sharing” environment at a media outlet or library, you can also join/merge data that you hold internally, say the private phone number of a congressional aide, for example.

We are not nearly done with the congressional roll call vote but you can begin to see the potential that XQuery offers for very little effort. Not to mention that XQuery scripts can be rapidly adapted to your library or news room.

Try out today’s XQuery roll705-join-merge.xq.txt for yourself. (Apologies for the “.txt” extension but my ISP host has ideas about “safe” files to upload.)

I realize this first week has been kinda haphazard in its presentation. Suggestions welcome on improvements as this series goes forward.

The government and others are cranking out barely useful XML by the boatload. XQuery is your ticket to creating personalized presentations dynamically from that XML and other data.

Enjoy!

PS: For display of XML and XQuery, should I be using a different Word template? Suggestions?

JATS: Journal Article Tag Suite, version 1.1

Filed under: Publishing,XML — Patrick Durusau @ 5:42 pm

JATS: Journal Article Tag Suite, version 1.1

Abstract:

The Journal Article Tag Suite provides a common XML format in which publishers and archives can exchange journal content. The JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews.

Documentation and help files: Journal Article Tag Suite.

Tommie Usdin (of Balisage fame) posted to Facebook:

JATS has added capabilities to encode:
– NISO Access License and Indicators
– additional support for multiple language documents and for Japanese documents (including Ruby)
– citation of datasets
and some other things users of version 1.0 have requested.

Another XML vocabulary that provides grist for your XQuery adventures!

January 7, 2016

Localizing A Congressional Roll Call Vote (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:07 pm

I made some progress today on localizing a congressional roll call vote.

As you might expect, I chose to localize to the representatives from Georgia. 😉

I used a FLWOR expression to select legislators where the attribute state = GA.

Here is that expression:

<ul>
{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote
where $voter/legislator[@state = 'GA']
return <li> {string($voter/legislator)} — {string($voter/vote)}</li>
}</ul>

Which makes our localized display a bit better for local readers but only just.

See roll705-local.html.

What we need is more information than can be found at: http://clerk.house.gov/evs/2015/roll705.xml.

More on that tomorrow!

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Filed under: BigData,PostgreSQL,SQL,XML,XQuery — Patrick Durusau @ 5:26 pm

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:

Downloads

Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, 9.14.2.2 XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection, there was no anchor at 9.14.2.2.

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”

UPSERT

A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

January 6, 2016

A Lesson about Let Clauses (XQuery)

Filed under: XML,XQuery — Patrick Durusau @ 10:48 pm

I was going to demonstrate how to localize roll call votes so that only representatives from your state and their votes were displayed for any given roll call vote.

Which would enable libraries or local newsrooms, whose users/readers have little interest in how obscure representatives from other states voted, to pare down the roll call vote list to those that really matter, your state’s representatives.

But remembering that I promised to clean up the listings in yesterday’s post that read:

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}

and kept repeating doc("http://clerk.house.gov/evs/2015/roll705.xml").

My thought was to replace that string with a variable declared by a let clause and then substituting that variable for that string.

To save you from the same mistake, combining a let clause with direct element constructors returns an error saying, in this case:

Left operand of '>' needs parentheses

Not a terribly helpful error message.

I have found examples of using a let clause within a direct element constructor that would have defeated the rationale for declaring the variable to begin with.
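For the record, the shape that did work for me binds the variable before any direct element constructor is opened; a sketch:

xquery version "3.0";
(: A sketch: declare the document variable in a let clause and build
   the elements inside the return clause, not the other way around. :)
let $vote := doc("http://clerk.house.gov/evs/2015/roll705.xml")
return
<html>
<body>
<h2>FINAL VOTE RESULTS FOR ROLL CALL {string($vote//rollcall-num)}</h2>
<p><strong>Question:</strong> {string($vote//vote-question)}</p>
</body>
</html>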

Tomorrow I hope to post today’s content, which will enable you to display data relevant to local voters, news reporters, for any arbitrary roll call vote in Congress.

Mark today’s adventure as a mistake to avoid. 😉

January 5, 2016

Jazzing a Roll Call Vote – Part 3 (XQuery)

Filed under: XML,XQuery — Patrick Durusau @ 9:48 pm

I posted Congressional Roll Call Vote – Accessibility Issues earlier today to deal with some accessibility issues noticed by @XQuery with my color coding.

Today we are going to start at the top of the boring original roll call vote and work our way down using XQuery.

Be forewarned that the XQuery you see today we will be shortening and cleaning up tomorrow. It works, but it’s not best practice.

You will need to open up the source of the original roll call vote to see the elements I select in the path expressions.

Here is the XQuery that is the goal for today:

xquery version "3.0";
declare boundary-space preserve;
<html>
<head></head>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)} {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}
</body>
</html>

The title of the document we obtain with:

<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

Two quick things to notice:

First, for very simple documents like this one, I use “//” rather than writing out the path to the rollcall-num element. I already know it only occurs once in each rollcall document.

Second, when using direct element constructors, the XQuery expressions are enclosed in “{ }” braces.
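(If you want to convince yourself of the first point, a quick sketch: both spellings select the same node, so this returns true.)

xquery version "3.0";
let $doc := doc("http://clerk.house.gov/evs/2015/roll705.xml")
return $doc//rollcall-num is $doc/rollcall-vote/vote-metadata/rollcall-num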

The rollcall number, date and time of the vote come next (I have introduced line breaks for readability):

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

If you compare my presentation of that string and that from the original, you will find the original has slightly more space between the items.

Here is the XSLT for that spacing:

<xsl:if test="legis-num[text()!='0']"><xsl:text>      </xsl:text><b><xsl:value-of select="legis-num"/></b></xsl:if>
<xsl:text>      </xsl:text><xsl:value-of select="vote-type"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-date"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-time"/><br/>

Since I already had white space separating my XQuery expressions, I just added to the prologue:

declare boundary-space preserve;

The last two lines:

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}

are just standard queries for content. The string function extracts the content of the element you address.

Tomorrow we are going to talk about how to clean up and shorten the path statements and look around for information that should be at the top of this document, but isn’t!

PS: Did you notice that the vote totals, etc., are written as static data in the XML file? Curious, isn’t it? They would be easy enough to generate from the voting data. I don’t have an answer but thought you might.

Congressional Roll Call Vote – Accessibility Issues

Filed under: XML,XQuery,XSLT — Patrick Durusau @ 2:43 pm

I posted a color-coded version of a congressional roll call vote in Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway), using red for Republicans and blue for Democrats. @XQuery points out that distinctions depending upon color perception raise accessibility issues.

Color coding works better for me than the more traditional roman versus italic font face distinction but let’s improve the color coding to remove the accessibility issue.

The first question is what colors should I use for accessibility?

In searching to answer that question I found this thread at Edward Tufte’s site (of course), Choice of colors in print and graphics for color-blind readers, which has a rich list of suggestions and pointers to other resources.

One in particular, Color Universal Design (CUD), posted by Maarten Boers, has this graphic on colors:

colorblind_palette

Relying on that palette, I changed the colors for the roll call vote to Republicans in orange; Democrats in sky blue and re-generated the roll call document.

roll-call-access

Here is an accessible, but still color-coded, version of: FINAL VOTE RESULTS FOR ROLL CALL 705.

An upside of XML is that changing the presentation of all 429 votes took only a few seconds: change the stylesheet and re-generate the results.

Thanks to @XQuery for prodding me on the accessibility issue, which resulted in finding the thread at Tufte and the colorblind barrier-free color palette.


Other post on congressional roll call votes:

1. Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

2. Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

January 4, 2016

Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Filed under: XML,XQuery — Patrick Durusau @ 11:41 pm

Apologies, but I did not make as much progress on the congressional roll call vote as I had hoped.

I did find some interesting information about the vote.xsl stylesheet and managed to use color to code members of the House.

You probably remember me whining about how hard it is to distinguish roman from italics when marking members of different parties. Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

The XSLT code is worse than I imagined.

Here’s what I mean:

<b><center><font size="+2">FINAL VOTE RESULTS FOR ROLL CALL <xsl:value-of select="/rollcall-vote/vote-metadata/rollcall-num"/>
<xsl:if test="/rollcall-vote/vote-metadata/vote-correction[text()!='']">*</xsl:if></font></center></b>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'D']"> -->
<xsl:if test = "$Majority='D'">
<center>(Democrats in roman; Republicans in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'R']"> -->
<xsl:if test = "$Majority!='D'">
<center>(Republicans in roman; Democrats in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>

Which party is in the majority determines whether the names in a party appear in roman or italic face font.

Now there’s a distinction that will be lost on a casual reader!

What’s more, if you are trying to reform the stylesheet, don’t look for R or D but again for majority party:

<xsl:template match="vote">
<!-- Handles formatting of Member names based on party. -->
<!-- <xsl:if test="../legislator/@party='R'"><xsl:value-of select="../legislator"/></xsl:if>
<xsl:if test="../legislator/@party='D'"><i><xsl:value-of select="../legislator"/></i></xsl:if> -->
<xsl:if test="../legislator/@party='I'"><u><xsl:value-of select="../legislator"/></u></xsl:if>
<xsl:if test="../legislator/@party!='I'">
<xsl:if test="../legislator/@party = $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<xsl:value-of select="../legislator"/>
</xsl:if>
<xsl:if test="../legislator/@party != $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<i><xsl:value-of select="../legislator"/></i>
</xsl:if>
</xsl:if>
</xsl:template>

As you can see, selecting by party has been commented out in favor of the roman/italic distinction based on the majority party.
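If you would rather select by party than by majority, the same logic is short work in XQuery; a sketch (red and blue per this post; green for Independents is my own guess):

xquery version "3.0";
(: A sketch: color by the party attribute itself, not the majority. :)
<ul>{
for $rv in doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-data/recorded-vote
let $color := if ($rv/legislator/@party = "R") then "red"
              else if ($rv/legislator/@party = "D") then "blue"
              else "green"
return <li style="color:{$color}">{string($rv/legislator)}: {string($rv/vote)}</li>
}</ul>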

I wanted to label the Republicans with an icon but my GIMP skills don’t extend to making an icon of young mothers throwing their children under the carriage wheels of the wealthy to save them from a life of poverty and degradation. A bit much to get into an HTML button sized icon.

I settled for using the traditional red for Republicans and blue for Democrats and ran the modified stylesheet against roll705.xml locally.

vote-color-coded

Here is FINAL VOTE RESULTS FOR ROLL CALL 705 as HTML.

Question: Are red and blue easier to distinguish than roman and italic?

If your answer is yes, why resort to typographic subtlety on something like party affiliation?

Are subtle distinctions used to confuse the uninitiated and unwary?

January 3, 2016

Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 11:02 pm

Roll call votes in the US House of Representatives are a staple of local, state and national news. If you go looking for the “official” version, what you find is as boring as your 5th grade civics class.

Trigger Warning: Boring and Minimally Informative Page Produced By Following Link: Final Vote Results For Roll Call 705.

Take a deep breath and load the page. It will open in a new browser tab. Boring. Yes? (You were warned.)

It is the recent roll call vote to fund the US government, take another slice of privacy from citizens, and make a number of other dubious policy choices. (Everything after the first comma depending upon your point of view.)

Whatever your politics though, you have to agree this is sub-optimal presentation, even for a government document.

This is no accident: sans the header, you will find the identical presentation of this very roll call vote at page H10696, Congressional Record for December 18, 2015 (pdf).

It is disappointing that so much XML, XSLT, XQuery, etc., has been wasted duplicating non-informative print formatting. Or should I say formatting less informative than XML makes possible?

Once the data is in XML, legend has it, users can transform that XML in ways more suited to their purposes and not those of the content providers.

I say “legend has it,” because we all know if content providers had their way, web navigation would be via ads and not bare hyperlinks. You want to see the next page? You must select the ad + hyperlink, waiting for the ad to clear before the resource appears.

I can summarize my opinion about content provider control over information legally delivered to my computer: Screw that!

If a content provider enables access to content, I am free to transform that content into speech, graphics, add information, take away information, in short do anything that my imagination desires and my skill enables.

Let’s take the roll call vote in the House of Representatives, Final Vote Results For Roll Call 705.

Just under the title you will read:

(Republicans in roman; Democrats in italic; Independents underlined)

Boring.

For a bulk display of voting results, we can do better than that.

What if we had small images to identify the respective parties? Here are some candidates (sic) for the Republicans:

r-photo1

r-photo-2

r-photo-3

Of course we would have to reduce them to icon size, but XML processing is rarely ever just XML processing. Nearly every project includes some other skill set as well.

Which one do you think looks more neutral? 😉

Certainly be more colorful and depending upon your inclinations, more fun to play about with than the difference in roman and italic. Yes?

Presentation of the data in http://clerk.house.gov/evs/2015/roll705.xml is only one of the possibilities that XQuery offers. Follow along and offer your suggestions for changes, additions and modifications.

First steps:

In the browser tab with Final Vote Results For Roll Call 705, use ctrl-u to view the page source. First notice that the boring web presentation is controlled by http://clerk.house.gov/evs/vote.xsl.

Copy and paste: http://clerk.house.gov/evs/vote.xsl into a new browser tab and select return. The resulting xsl:stylesheet is responsible for generating the original page, from the vote totals to column presentation of the results.

Pay particular attention to the generation of totals from the <vote-data> element and its children. That generation is powered by these lines in vote.xsl:

<xsl:apply-templates select="/rollcall-vote/vote-metadata"/>
<!-- Create total variables based on counts. -->
<xsl:variable name="y" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Yea'])"/>
<xsl:variable name="a" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Aye'])"/>
<xsl:variable name="yeas" select="$y + $a"/>
<xsl:variable name="nay" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Nay'])"/>
<xsl:variable name="no" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='No'])"/>
<xsl:variable name="nays" select="$nay + $no"/>
<xsl:variable name="nvs" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Not Voting'])"/>
<xsl:variable name="presents" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Present'])"/>
<br/>

(Not entirely, I omitted the purely formatting stuff.)
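Those counts translate readily into XQuery, by the way; a sketch:

xquery version "3.0";
(: A sketch of vote.xsl's totals, restated as XQuery lets and counts. :)
let $votes := doc("http://clerk.house.gov/evs/2015/roll705.xml")
              /rollcall-vote/vote-data/recorded-vote/vote
let $yeas := count($votes[. = ("Yea", "Aye")])
let $nays := count($votes[. = ("Nay", "No")])
let $presents := count($votes[. = "Present"])
let $nvs := count($votes[. = "Not Voting"])
return concat($yeas, " yeas, ", $nays, " nays, ",
              $presents, " present, ", $nvs, " not voting")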

For tomorrow I will be working on a more “visible” way to identify political party affiliation and “borrowing” the count code from vote.xsl.

Enjoy!


You may be wondering what XQuery has to do with topic maps? Well, if you think about it, every time we select, aggregate, etc., data, we are making choices based on notions of subject identity.

That is we think the data we are manipulating represents some subjects and/or information about some subjects, that we find sensible (for some unstated reason) to put together for others to read.

The first step towards a topic map, however, is the putting of information together so we can judge what subjects need explicit representation and how we choose to identify them.

Prior topic map work was never explicit about how we get to a topic map; putting that possibly divisive question behind us, we simply started with topic maps, ab initio.

I was in the car when we took that turn and for the many miles since then. I have come to think that a better starting place is choosing subjects, what we want to say about them and how we wish to say it, so that we have only so much machinery as is necessary for any particular set of subjects.

Some subjects can be identified by IRIs, others by multi-dimensional vectors, still others by unspecified processes of deep learning, etc. Which ones we choose will depend upon the immediate ROI from subject identity and relationships between subjects.

I don’t need triples, for instance, to recognize natural languages to a sufficient degree of accuracy. Unnecessary triples, topics or associations are just padding. If you are on a per-triple contract, they make sense, otherwise, not.

A long way of saying that subject identity lurks just underneath the application of XQuery and we will see where it is useful to call subject identity to the fore.

January 2, 2016

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery (Part 2)

Filed under: XML,XQuery — Patrick Durusau @ 7:30 pm

Despite heavy carousing during the holidays, you may still remember Great R packages for data import, wrangling & visualization [+ XQuery], where I re-sorted the table by Sharon Machlis, to present the R packages in package name order.

I followed that up with: Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery, where I detailed the travails of trying to sort the software packages by their short descriptions, again in alphabetical order. My assumption in that post was that either the spaces or the “,” commas in the descriptions were fouling the order by clause.

That wasn’t the case, which I should have known because the string function always returns a string. That is, the spaces and “,” inside are just parts of a string, nothing more.

The up-side of the problem was that I spent more than a little while with Walmsley’s XQuery book, searching for ever more esoteric answers.

Here’s the failing XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

And here is the working XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

Here is the mistake highlighted:

order by lower-case(string($row/td[2]/a))

My first mistake was the inclusion of “/a” in the path. Using string on ($row/td[2]), that is, without /a at the end of the path, gives the original script the same result as the working one. (Run that for yourself on favorite-R-packages.xml.)

Make any path as long as required and no longer!

My second mistake was not checking the XPath immediately upon the failure of the sort. (The simplest answer is usually the correct one.)

Enjoy!


Update: Removed the quote marks around (string($row/td[2])) in both queries; they were part of an explanation that did not make the cut. Thanks to @XQuery for the catch!

January 1, 2016

XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Filed under: Virtual Machines,XML,XQilla,XQuery — Patrick Durusau @ 5:03 pm

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

Using a virtual machine to install the latest version of Ubuntu (15.10), which had the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations so you are likely to spend some time loading linux headers and the like to compile software.

The advantage of a virtual box is that I don’t have to risk doing something dumb, or out of fatigue, to my working setup. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day, as I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort, but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the first of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

apt-get install dkms build-essential linux-headers-generic

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.

December 31, 2015

XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Filed under: XML,XQilla,XQuery — Patrick Durusau @ 8:38 pm

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

1. Download a source distribution of Xerces-C 3.1.2

2. Build Xerces-C

cd xerces-c-3.1.2/
./configure
make

3. Download a source distribution of XQilla

4. Build XQilla

cd xqilla/
./configure --with-xerces=`pwd`/../xerces-c-3.1.2/
make

A few notes that may help:

Obtain Xerces-C 3.1.2 from the Xerces project homepage, home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

the README presumes you built xerces-c-3.1.2 in a sub-directory of the XQilla source. You could; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2

No error messages so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla 
   xqilla-commandline.o  
-L/home/patrick/working/xerces-c-3.1.2/src 
  /home/patrick/working/xerces-c-3.1.2/src/
.libs/libxerces-c.so ./.libs/libxqilla.so -lnsl -lpthread -Wl,-rpath 
-Wl,/home/patrick/working/xerces-c-3.1.2/src
/usr/bin/ld: warning: libicuuc.so.55, needed by 
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so, 
   not found (try using -rpath or -rpath-link)
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `uset_close_55'
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory `/home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1

To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have libicuuc.so.55.

If I manually upgrade libraries, I might create an inconsistency package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” but you should see some of the lists you can find at Congress.gov. Valid XML, I’ll concede, but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

December 30, 2015

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Filed under: XML,XQuery — Patrick Durusau @ 8:19 pm

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read enclose strings in quotes and/or use strings with no whitespace.

How often do you see strings with no whitespace? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation devtools
misc installr
data import readxl
data import, data export googlesheets
data import RMySQL
data import readr
data import, data export rio
data analysis psych
data wrangling, data analysis sqldf
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
data import, web scraping rvest
data wrangling, data analysis dplyr
data wrangling plyr
data wrangling reshape2
data wrangling tidyr
data wrangling, data analysis data.table
data wrangling stringr
data wrangling lubridate
data wrangling, data analysis zoo
data display editR
data display knitr
data display, data wrangling listviewer
data display DT
data visualization ggplot2
data visualization dygraphs
data visualization googleVis
data visualization metricsgraphics
data visualization RColorBrewer
data visualization plotly
mapping leaflet
mapping choroplethr
mapping tmap
misc fitbitScraper
Web analytics rga
Web analytics RSiteCatalyst
package development roxygen2
data visualization shiny
misc openxlsx
data wrangling, data analysis gmodels
data wrangling car
data visualization rcdimple
data wrangling foreach
data acquisition downloader
data wrangling scales
data visualization plotly

Err, that’s not right!

The XQuery from yesterday:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[1]/a))
return <tr>{$row/td[1]} {$row/td[2]}</tr>
}</table>
</html>

XQuery from today, the changes being td[2] as the sort key and the swapped column order in the return clause:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document order, that is, the order in which they appear in the source document.
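
One quick way to probe a suspect sort is to look at the keys themselves. A sketch of that check:

distinct-values(
  for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  return lower-case(string($row/td[2]/a))
)
(: a single empty string here would show that every row
   receives the same sort key :)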

Here’s a snippet of that document:

<table>
<tr>
<td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also 
essential if you want to easily install other packages from GitHub. Install it! 
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/" 
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
<td>Hadley Wickham & others</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili & others</td>
</tr>
<tr>
<td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>
</tr>
...
</table>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick the one that can be most easily adapted to the next batch of soiled data that comes through the door.

PS: Neither Saxon (9.7) nor BaseX (8.3) gave any error message at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

December 25, 2015

Facets for Christmas!

Filed under: XML,XPath,XQuery — Patrick Durusau @ 11:47 am

Facet Module

From the introduction:

Faceted search has proven enormously popular in real-world applications. Faceted search allows users to navigate and access information via a structured facet classification system. Combined with full-text search, it provides users with enormous power and flexibility to discover information.

This proposal defines a standardized approach to support faceted search in XQuery. It has been designed to be compatible with XQuery 3.0, and is intended to be used in conjunction with XQuery and XPath Full Text 3.0.

Imagine my surprise, after opening Christmas presents with family, at seeing a tweet by XQuery announcing yet another Christmas present:

“Facets”: A new EXPath spec w/extension functions & data models to enable faceted navigation & search in XQuery http://expath.org/spec/facet

The EXPath homepage says:

XPath is great. XPath-based languages like XQuery, XSLT, and XProc, are great. The XPath recommendation provides a foundation for writing expressions that evaluate the same way in a lot of processors, written in different languages, running in different environments, in XML databases, in in-memory processors, in servers or in clients.

Supporting so many different kinds of processor is a wonderful thing. But this also constrains which features are feasible at the XPath level and which are not. In the years since the release of XPath 2.0, experience has gradually revealed some missing features.

EXPath exists to provide specifications for such missing features in a collaborative- and implementation-independent way. EXPath also provides facilities to help and deliver implementations to as many processors as possible, via extensibility mechanisms from the XPath 2.0 Recommendation itself.

Other projects exist to define extensions for XPath-based languages or languages using XPath, as the famous EXSLT, and the more recent EXQuery and EXProc projects. We think that those projects are really useful and fill a gap in the XML core technologies landscape. Nevertheless, working at the XPath level allows common solutions when there is no sense in reinventing the wheel over and over again. This is just following the brilliant idea of the W3C’s XSLT and XQuery working groups, which joined forces to define XPath 2.0 together. EXPath’s purpose is not to compete with other projects, but to collaborate with them.

Be sure to visit the resources page. It has a manageable listing of processors that handle extensions.

What would you like to see added to XPath?

Enjoy!

December 14, 2015

35 Lines XQuery versus 604 of XSLT: A List of W3C Recommendations

Filed under: BaseX,Saxon,XML,XQuery,XSLT — Patrick Durusau @ 10:16 pm

Use Case

You should be familiar with the W3C Bibliography Generator. You can insert one or more URLs and the generator produces correctly formatted citations for W3C work products.

It’s quite handy but requires a URL to produce a useful response. I need authors to use correctly formatted W3C citations, but asking them to find URLs and generate correct citations was a bridge too far. It simply didn’t happen.

My current attempt is to produce a list of correctly formatted W3C citations in HTML. Authors can use CTRL-F in their browsers to find citations. (Time will tell if this is a successful approach or not.)

Goal: An HTML page of correctly formatted W3C Recommendations, sorted by title (ignoring case because W3C Recommendations are not consistent in their use of case in titles). “Correctly formatted” meaning that it matches the output from the W3C Bibliography Generator.

Resources

As a starting point, I viewed the source of http://www.w3.org/2002/01/tr-automation/tr-biblio.xsl, the XSLT script that generates the XHTML page with its responses.

The first XSLT script imports two more XSLT scripts, http://www.w3.org/2001/08/date-util.xslt and http://www.w3.org/2001/10/str-util.xsl.

I’m not going to reproduce the XSLT here, but can say that starting with <stylesheet> and ending with </stylesheet>, inclusive, I came up with 604 lines.

You will need to download the file used by the W3C Bibliography Generator, tr.rdf.

XQuery Script

I have used the XQuery script successfully with: BaseX 8.3, eXide 2.1.3 and SaxonHE9-6-0-7J.

Here’s the prolog:

declare default element namespace "http://www.w3.org/2001/02pd/rec54#";
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc = "http://purl.org/dc/elements/1.1/"; 
declare namespace doc = "http://www.w3.org/2000/10/swap/pim/doc#";
declare namespace contact = "http://www.w3.org/2000/10/swap/pim/contact#";
declare namespace functx = "http://www.functx.com";
declare function functx:substring-after-last
($string as xs:string?, $delim as xs:string) as xs:string?
{
if (contains ($string, $delim))
then functx:substring-after-last(substring-after($string, $delim), $delim)
else $string
};

This declares the namespaces and defines functx:substring-after-last, taken from Priscilla Walmsley’s excellent FunctX XQuery Functions site.
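
For example, a quick sketch of what the function returns:

functx:substring-after-last("http://www.w3.org/TR/2007/REC-xslt20-20070123", "/")
(: returns "REC-xslt20-20070123" :)

Back to the query: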

<html>
<head><title>XQuery Generated W3C Recommendation List</title></head>
<body>
<ul class="ul">

Start the HTML page and the unordered list that will contain the list items.

{
for $rec in doc("tr.rdf")//REC
    order by upper-case($rec/dc:title)

If you sort W3C Recommendations by dc:title and don’t specify upper-case, rdf:PlainLiteral: A Datatype for RDF Plain Literals,
rdf:PlainLiteral: A Datatype for RDF Plain Literals (Second Edition), and xml:id Version 1.0, appear at the end of the list sorted by title. Dirty data isn’t limited to databases.
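
A quick sketch with literal titles shows the effect:

for $title in ("rdf:PlainLiteral", "XQuery", "xml:id")
order by upper-case($title)
return $title
(: returns rdf:PlainLiteral, xml:id, XQuery; without upper-case,
   the default codepoint collation sorts both lowercase-initial
   titles after "XQuery" :)

Back to the query: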

return <li class="li">
  <a href="{string($rec/@rdf:about)}"> {string($rec/dc:title)} </a>, 
   { for $auth in $rec/editor
   return
   if (contains(string($auth/contact:fullName), "."))
   then (concat(string($auth/contact:fullName), ","))
   else (concat(concat(concat(substring(substring-before(string($auth/\
   contact:fullName), ' '), 0, 2), ". "), (substring-after(string\
   ($auth/contact:fullName), ' '))), ","))}

Watch for the line continuation marker “\”.

We begin by grabbing the URL and title for an entry and then confront dirty author data. The standard author listing by the W3C creates an initial plus a period for the author’s first name and then concatenates the rest of the author’s name to that initial plus period.

Problem: There is one entry for authors that already has initials, T.V. Raman, so I had to account for that one entry (as does the XSLT).

{if (count ($rec/editor) >= 2) then " Editors," else " Editor,"}
W3C Recommendation, 
{fn:format-date(xs:date(string($rec/dc:date)), "[MNn] [D], [Y]") }, 
{string($rec/@rdf:about)}. <a href="{string($rec/doc:versionOf/\
@rdf:resource)}">Latest version</a> \
available at {string($rec/doc:versionOf/@rdf:resource)}.
<br/>[Suggested label: <strong>{functx:substring-after-last(upper-case\
(replace(string($rec/doc:versionOf/@rdf:resource), '/$', '')), "/")}\
</strong>]<br/></li>} </ul></body></html>

Nothing remarkable here, except that I snipped the concluding “/” off of the values from doc:versionOf/@rdf:resource so I could use functx:substring-after-last to create the token for a suggested label.
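
Putting that label logic together on a sample latest-version URI (a sketch):

functx:substring-after-last(
  upper-case(replace("http://www.w3.org/TR/xquery-30/", '/$', '')),
  "/")
(: strips the trailing "/", upper-cases the URI,
   and returns "XQUERY-30" as the suggested label token :)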

Comments / Omissions

I depart from the XSLT in one case. It calls http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf here:

<!-- Special casing for when we have the name in Original Script (e.g. in \
Japanese); currently assume that the order is inversed in this case... -->

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

But that refers to only one case:

<REC rdf:about="http://www.w3.org/TR/2003/REC-SVG11-20030114/">
<dc:date>2003-01-14</dc:date>
<dc:title>Scalable Vector Graphics (SVG) 1.1 Specification</dc:title>

Where Jun Fujisawa appears as an editor.

Recalling my criteria for “correctness” being the output of the W3C Bibliography Generator:

[Screenshot: the W3C Bibliography Generator’s citation for the SVG 1.1 Recommendation]

Preparing for this post made me discover at least one bug in the XSLT that was supposed to report the name in original script:

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

Whereas the entry in http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf reads:

<rdf:Description>
<rdf:type rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
<firstName>Jun</firstName>
<firstNameInOriginalScript>藤沢 淳</firstNameInOriginalScript>
<lastName>Fujisawa</lastName>
<sortName>Fujisawa</sortName>
</rdf:Description>

The test looks for contact:lastNameInOriginalScript, but the entry only records firstNameInOriginalScript, so the test can never match this editor. Since the W3C Bibliography Generator doesn’t produce the name in original script, neither do I. When the W3C fixes its output, I will have to amend this script to pick up that entry.

String

While writing this query I found text(), fn:string() and fn:data() by Dave Cassels. Recommended reading. The weakness of text() is that if markup is inserted inside your target element after you write the query, you will get unexpected results. The use of fn:string() avoids that sort of surprise.
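
A minimal sketch of the difference:

let $td := <td>data <b>import</b></td>
return ($td/text(), string($td))
(: $td/text() returns only the text node "data ", while
   string($td) returns the full string value "data import" :)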

Recommendations Only

Unlike the W3C Bibliography Generator, my script as written only generates entries for Recommendations. It would be trivial to modify the script to include drafts, notes, etc., but I chose to not include material that should not be used as normative citations.

I can see the usefulness of the bibliography generator for works in progress, but external to the W3C, citing Recommendations is the better course.

Contra Search

The SpecRef project has a searchable interface to all the W3C documents. If you search for XQuery, the interface returns 385 “hits.”

Contrast that with using CTRL-F on the list of recommendations generated by the XQuery script: controlling for case, “XQuery” produced only 23 “hits.”

There are reasons for using search, but users repeatedly mining results of searches that could be captured (it was called curation once upon a time) is wasteful.

Reading

I can’t recommend Priscilla Walmsley’s XQuery, 2nd Edition strongly enough.

There is one danger to Walmsley’s book. You will be so ready to start using XQuery after the first ten chapters it’s hard to find the time to read the remaining ones. Great stuff!

You can download the XQuery file, tr.rdf and the resulting html file at: 35LinesOfXQuery.zip.

Congress.gov Enhancements: Quick Search, Congressional Record Index, and More

Filed under: Government,Government Data,XML,XQuery — Patrick Durusau @ 9:12 pm

New End of Year Congress.gov Enhancements: Quick Search, Congressional Record Index, and More by Andrew Weber.

From the post:

In our quest to retire THOMAS, we have made many enhancements to Congress.gov this year.  Our first big announcement was the addition of email alerts, which notify users of the status of legislation, new issues of the Congressional Record, and when Members of Congress sponsor and cosponsor legislation.  That development was soon followed by the addition of treaty documents and better default bill text in early spring; improved search, browse, and accessibility in late spring; user driven feedback in the summer; and Senate Executive Communications and a series of Two-Minute Tip videos in the fall.

Today’s update on end of year enhancements includes a new Quick Search for legislation, the Congressional Record Index (back to 1995), and the History of Bills from the Congressional Record Index (available from the Actions tab).  We have also brought over the State Legislature Websites page from THOMAS, which has links to state level websites similar to Congress.gov.

Text of legislation from the 101st and 102nd Congresses (1989-1992) has been migrated to Congress.gov. The Legislative Process infographic that has been available from the homepage as a JPG and PDF is now available in Spanish as a JPG and PDF (translated by Francisco Macías). Margaret and Robert added Fiscal Year 2003 and 2004 to the Congress.gov Appropriations Table. There is also a new About page on the site for XML Bulk Data.

The Quick Search provides a form-based search with fields similar to those available from the Advanced Legislation Search on THOMAS.  The Advanced Search on Congress.gov is still there with many additional fields and ways to search for those who want to delve deeper into the data.  We are providing the new Quick Search interface based on user feedback, which highlights selected fields most likely needed for a search.

There’s an impressive summary of changes!

Speaking of practicing programming, are you planning on practicing XQuery on congressional data in the coming year?

December 8, 2015

Congress: More XQuery Fodder

Filed under: Government,Government Data,Law - Sources,XML,XQuery — Patrick Durusau @ 8:07 pm

Congress Poised for Leap to Open Up Legislative Data by Daniel Schuman.

From the post:

Following bills in Congress requires three major pieces of information: the text of the bill, a summary of what the bill is about, and the status information associated with the bill. For the last few years, Congress has been publishing the text and summaries for all legislation moving in Congress, but has not published bill status information. This key information is necessary to identify the bill author, where the bill is in the legislative process, who introduced the legislation, and so on.

While it has been in the works for a while, this week Congress confirmed it will make “Bill Statuses in XML format available through the GPO’s Federal Digital System (FDsys) Bulk Data repository starting with the 113th Congress,” (i.e. January 2013). In “early 2016,” bill status information will be published online in bulk– here. This should mean that people who wish to use the legislative information published on Congress.gov and THOMAS will no longer need to scrape those websites for current legislative information, but instead should be able to access it automatically.

Congress isn’t just going to pull the plug without notice, however. Through the good offices of the Bulk Data Task Force, Congress will hold a public meeting with power users of legislative information to review how this will work. Eight sample bill status XML files and draft XML User Guides were published on GPO’s GitHub page this past Monday. Based on past positive experiences with the Task Force, the meeting is a tremendous opportunity for public feedback to make sure the XML files serve their intended purposes. It will take place next Tuesday, Dec. 15, from 1-2:30. RSVP details below.

If all goes as planned, this milestone has great significance.

  • It marks the publication of essential legislative information in a format that supports unlimited public reuse, analysis, and republication. It will be possible to see much of a bill’s life cycle.
  • It illustrates the positive relationship that has grown between Congress and the public on access to legislative information, where there is growing open dialog and conversation about how to best meet our collective needs.
  • It is an example of how different components within the legislative branch are engaging with one another on a range of data-related issues, sometimes for the first time ever, under the aegis of the Bulk Data Task Force.
  • It means the Library of Congress and GPO will no longer be tied to the antiquated THOMAS website and can focus on more rapid technological advancement.
  • It shows how a diverse community of outside organizations and interests came together and built a community to work with Congress for the common good.

To be sure, this is not the end of the story. There is much that Congress needs to do to address its antiquated technological infrastructure. But considering where things were a decade ago, the bulk publication of information about legislation is a real achievement, the culmination of a process that overcame high political barriers and significant inertia to support better public engagement with democracy and smarter congressional processes.

Much credit is due in particular to leadership in both parties in the House who have partnered together to push for public access to legislative information, as well as the staff who worked tirelessly to make it happen.

If you look at the sample XML files, pay close attention to the <bioguideID> element and its contents. It is the same value you will find in roll-call votes, though there it appears in the name-id attribute of the <legislator> element. See: http://clerk.house.gov/evs/2015/roll643.xml and do view source.

Oddly, the <bioguideID> element does not appear in the documentation on GitHub; you just have to know the correspondence to the name-id attribute of the <legislator> element.

As I said in the title, this is going to be XQuery fodder.
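
As a first nibble, here is a minimal sketch of the kind of join the shared value enables. The bill-status file name and the exact location of <bioguideID> are assumptions for illustration; roll643.xml and the name-id attribute of <legislator> are from above:

(: join bill-status sponsor IDs to roll-call vote records :)
let $status := doc("BILLSTATUS-sample.xml")  (: hypothetical file name :)
let $roll   := doc("roll643.xml")
for $id in distinct-values($status//bioguideID)
for $leg in $roll//legislator[@name-id = $id]
return <sponsor bioguide="{$id}">{string($leg)}</sponsor>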

XQuery, 2nd Edition, Updated! (A Drawback to XQuery)

Filed under: XML,XPath,XQuery — Patrick Durusau @ 3:57 pm

XQuery, 2nd Edition, Updated! by Priscilla Walmsley.

The updated version of XQuery, 2nd Edition has hit the streets!

As a plug for the early release program at O’Reilly, yours truly appears in the acknowledgments (page xxii) for having submitted comments on the early release version of XQuery. You can too. Early release participation is yet another way to contribute back to the community.

There is one drawback to XQuery which I discuss below.

For anyone not fortunate enough to already have a copy of XQuery, 2nd Edition, here is the full description from the O’Reilly site:

The W3C XQuery 3.1 standard provides a tool to search, extract, and manipulate content, whether it’s in XML, JSON or plain text. With this fully updated, in-depth tutorial, you’ll learn to program with this highly practical query language.

Designed for query writers who have some knowledge of XML basics, but not necessarily advanced knowledge of XML-related technologies, this book is ideal as both a tutorial and a reference. You’ll find background information for namespaces, schemas, built-in types, and regular expressions that are relevant to writing XML queries.

This second edition provides:

  • A high-level overview and quick tour of XQuery
  • New chapters on higher-order functions, maps, arrays, and JSON
  • A carefully paced tutorial that teaches XQuery without being bogged down by the details
  • Advanced concepts for taking advantage of modularity, namespaces, typing, and schemas
  • Guidelines for working with specific types of data, such as numbers, strings, dates, URIs, maps and arrays
  • XQuery’s implementation-specific features and its relationship to other standards including SQL and XSLT
  • A complete alphabetical reference to the built-in functions, types, and error messages

Drawback to XQuery:

You know I hate to complain, but the brevity of XQuery is a real drawback to billing.

For example, I have a post pending on taking 604 lines of XSLT down to 35 lines of XQuery.

Granted the XQuery is easier to maintain, modify, extend, but all a client will see is the 35 lines of XQuery. At least 604 lines of XSLT looks like you really worked to produce something.

I know about XQueryX but I haven’t seen any automatic way to convert XQuery into XQueryX. Am I missing something obvious? If that’s possible, I could just bulk up the deliverable with an XQueryX expression of the work and keep the XQuery version for production use.

As excellent as I think XQuery and Walmsley’s book both are, I did want to warn you about the brevity of your XQuery deliverables.

I look forward to finishing XQuery, 2nd Edition. I started doing so many things based on the first twelve or so chapters that I just read selectively from that point on. It merits a complete read. You won’t be sorry you did.

November 19, 2015

Stop Comparing JSON and XML

Filed under: JSON,XML — Patrick Durusau @ 4:04 pm

Stop Comparing JSON and XML by Yegor Bugayenko.

From the post:

JSON or XML? Which one is better? Which one is faster? Which one should I use in my next project? Stop it! These things are not comparable. It’s similar to comparing a bicycle and an AMG S65. Seriously, which one is better? They both can take you from home to the office, right? In some cases, a bicycle will do it better. But does that mean they can be compared to each other? The same applies here with JSON and XML. They are very different things with their own areas of applicability.

Yegor follows that time-honored Web tradition of telling people, who aren’t listening, why they should follow his advice.

😉

If nothing else, circulate this around the office to get everyone’s blood pumping this late in the week.

I would amend Yegor’s headline to read: Stop Comparing JSON and XML Online!

As long as your discussions don’t gum up email lists, news feeds, Twitter, have at it.

Enjoy!

XSL Transformations (XSLT) Version 3.0 [Comments by 31 March 2016]

Filed under: XML,XSLT — Patrick Durusau @ 10:06 am

XSL Transformations (XSLT) Version 3.0

Abstract:

This specification defines the syntax and semantics of XSLT 3.0, a language designed primarily for transforming XML documents into other XML documents.

XSLT 3.0 is a revised version of the XSLT 2.0 Recommendation [XSLT 2.0] published on 23 January 2007.

The primary purpose of the changes in this version of the language is to enable transformations to be performed in streaming mode, where neither the source document nor the result document is ever held in memory in its entirety. Another important aim is to improve the modularity of large stylesheets, allowing stylesheets to be developed from independently-developed components with a high level of software engineering robustness.

XSLT 3.0 is designed to be used in conjunction with XPath 3.0, which is defined in [XPath 3.0]. XSLT shares the same data model as XPath 3.0, which is defined in [XDM 3.0], and it uses the library of functions and operators defined in [Functions and Operators 3.0]. XPath 3.0 and the underlying function library introduce a number of enhancements, for example the availability of higher-order functions.

As an implementer option, XSLT 3.0 can also be used with XPath 3.1. All XSLT 3.0 processors provide maps, an addition to the data model which is specified (identically) in both XSLT 3.0 and XPath 3.1. Other features from XPath 3.1, such as arrays, and new functions such as random-number-generatorFO31 and sortFO31, are available in XSLT 3.0 stylesheets only if the implementer chooses to support XPath 3.1.

Some of the functions that were previously defined in the XSLT 2.0 specification, such as the format-dateFO30 and format-numberFO30 functions, are now defined in the standard function library to make them available to other host languages.

XSLT 3.0 also includes optional facilities to serialize the results of a transformation, by means of an interface to the serialization component described in [XSLT and XQuery Serialization]. Again, the new serialization capabilities of [XSLT and XQuery Serialization 3.1] are available at the implementer’s option.

This document contains hyperlinks to specific sections or definitions within other documents in this family of specifications. These links are indicated visually by a superscript identifying the target specification: for example XP30 for XPath 3.0, DM30 for the XDM data model version 3.0, FO30 for Functions and Operators version 3.0.

Comments are due by 31 March 2016.

That may sound like a long time for comments but it is shorter than you might think.

It is a long document and standards are never an “easy” read.

Fortunately it is cold weather, or about to be, in many parts of the world, with holidays rapidly approaching. That leaves some extra time to curl up with XSL Transformations (XSLT) Version 3.0 and its related documents for a slow read.

Something I have never done before that I plan to attempt with this draft is running the test cases, almost 11,000 of them. I’m not an implementer but becoming more familiar with the test cases will improve my understanding of the new features in XSLT 3.0.

Comment early and often!

Enjoy!

November 14, 2015

Querying Biblical Texts: Part 1 [Humanists Take Note!]

Filed under: Bible,Text Mining,XML,XQuery — Patrick Durusau @ 5:13 pm

Querying Biblical Texts: Part 1 by Jonathan Robie.

From the post:

This is the first in a series on querying Greek texts with XQuery. We will also look at the differences among various representations of the same text, starting with the base text, morphology, and three different treebank formats. As we will see, the representation of a text indicates what the producer of the text was most interested in, and it determines the structure and power of queries done on that particular representation. The principles discussed here also apply to other languages.

This is written as a tutorial, and it can be read in two ways. The first time through, you may want to simply read the text. If you want to really learn how to do this yourself, you should download an XQuery processor and some data (in your favorite biblical language) and try these queries and variations on them.

Humanists need to follow this series and pass it along to others.

Texts of interest to you will vary but the steps Jonathan covers are applicable to all texts (well, depending upon your encoding).

In exchange for learning a little XQuery, you can gain a good degree of mastery over XML encoded texts.

Enjoy!

