## Archive for the ‘XQuery’ Category

### BaseX 8.5.3 Released!

Monday, August 15th, 2016

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

The changelog reads:

VERSION 8.5.3 (August 15, 2016)

Minor bug fixes, improved thread-safety.

Still, not a bad idea to upgrade today!

Enjoy!

PS: You do remember that Congress is throwing XML in ever increasing amounts at the internet?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.

### BaseX 8.5.1 Released! (XQuery Texts for Smart Phone?)

Saturday, July 16th, 2016

BaseX – 8.5.1 Released!

From the documentation page:

BaseX is both a light-weight, high-performance and scalable XML Database and an XQuery 3.1 Processor with full support for the W3C Update and Full Text extensions. It focuses on storing, querying, and visualizing large XML and JSON documents and collections. A visual frontend allows users to interactively explore data and evaluate XQuery expressions in realtime. BaseX is platform-independent and distributed under the free BSD License (find more in Wikipedia).

Besides Priscilla Walmsley’s XQuery 2nd Edition and the BaseX documentation as a PDF file, what other XQuery resources would you store on a smart phone? (For occasional reference, leisure reading, etc.)

### The Feynman Technique – Contest for Balisage 2016?

Tuesday, June 28th, 2016

The Best Way to Learn Anything: The Feynman Technique by Shane Parrish.

From the post:

There are four simple steps to the Feynman Technique, which I’ll explain below:

1. Choose a Concept
2. Teach it to a Toddler
3. Identify Gaps and Go Back to The Source Material
4. Review and Simplify

This made me think of the late-breaking Balisage 2016 papers posted by Tommie Usdin in email:

• Saxon-JS – XSLT 3.0 in the Browser, by Debbie Lockett and Michael Kay, Saxonica
• A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content), by Uche Ogbuji, Zepheira
• A catalog of Functional programming idioms in XQuery 3.1, James Fuller, MarkLogic

New contest for Balisage?

Pick a concept from a Balisage 2016 presentation and you have sixty (60) seconds to explain it to Balisage attendees.

What do you think?

Remember, you can’t play if you don’t attend! Register today!

If Tommie agrees, the winner gets me to record a voice mail greeting for their phone! 😉

### Balisage 2016 Program Posted! (Newcomers Welcome!)

Monday, May 23rd, 2016

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available
http://www.balisage.net/2016/Program.html

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1. http://balisage.net/XML-In-Web-Out/

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while others still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program: http://www.balisage.net/2016/Program.html

Symposium Program: http://balisage.net/XML-In-Web-Out/symposiumProgram.html

Even if you don’t eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long-time participants like new document/information problems or new ways of looking at old ones. Not to mention they, on occasion, learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing on what controversial terms, like “document,” mean, but that to one side, they are as good a crew as you are likely to meet.

Enjoy!

### TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Thursday, May 5th, 2016

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI P5 has 563 elements, which may appear in varying, valid combinations. It also defines 256 attributes, which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
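A QA check of that kind can be sketched without an XQuery engine. The following is a minimal illustration in Python, using only the standard library, and is not the method from the post. The allowed-element list and the inline sample are hypothetical stand-ins for a real project policy and a real encoded text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical project policy: the only TEI elements this project expects.
ALLOWED = {"TEI", "text", "body", "div", "head", "p", "hi"}

# Small inline sample standing in for a real encoded text (an assumption).
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <div><head>Title</head><p>Some <hi rend="italic">text</hi>.</p></div>
    <div><p>A <foreign xml:lang="la">verbum</foreign> slipped in.</p></div>
  </body></text>
</TEI>"""


def unexpected_elements(xml_text, allowed):
    """Return the sorted local names of elements not in the allowed set."""
    root = ET.fromstring(xml_text)
    found = set()
    for el in root.iter():
        # ElementTree spells namespaced names as '{uri}local'.
        local = el.tag.rsplit("}", 1)[-1]
        if local not in allowed:
            found.add(local)
    return sorted(found)


print(unexpected_elements(SAMPLE, ALLOWED))
```

An empty list means the text stays within the project’s expected encoding; anything else is a list of element names to review.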

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

### Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Tuesday, February 2nd, 2016

Call for Participation

Dates:

• 25 March 2016 — Peer review applications due
• 22 April 2016 — Paper submissions due
• 21 May 2016 — Speakers notified
• 10 June 2016 — Late-breaking News submissions due
• 16 June 2016 — Late-breaking News speakers notified
• 8 July 2016 — Final papers due from presenters of peer reviewed papers
• 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
• 1 August 2016 — Pre-conference Symposium
• 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

• Web application development with XML
• Informal data models and consensus-based vocabularies
• Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
• Performance issues in parsing, XML database retrieval, or XSLT processing
• Development of angle-bracket-free user interfaces for non-technical users
• Semistructured data and full text search
• Deployment of XML systems for enterprise data
• Design and implementation of XML vocabularies
• Case studies of the use of XML for publishing, interchange, or archiving
• Alternatives to XML
• The role(s) of XML in the application lifecycle
• The role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of and processed as if from BOM to EOF is the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

### Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Wednesday, January 13th, 2016

In Congressional Roll Call Vote – The Documents (XQuery), we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>; the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<recorded-vote>
<legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
<vote>Nay</vote>
</recorded-vote>
<recorded-vote>
<legislator name-id="A000370" sort-field="Adams" unaccented-name="Adams" party="D" state="NC" role="legislator">Adams</legislator>
<vote>Yea</vote>
</recorded-vote>

These are only the first two (2) records; the other <recorded-vote> elements differ from these only in their content.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.
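Because <legislator> and <vote> are siblings inside each <recorded-vote>, every record can be read as a (legislator, vote) pair. Here is a minimal sketch of that reading in Python’s standard library, offered only as a way to check the structure without an XQuery engine. The inline sample is a trimmed copy of the two records above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of the two records shown above, wrapped in their parent elements.
SAMPLE = """<rollcall-vote>
  <vote-data>
    <recorded-vote>
      <legislator name-id="A000374" party="R" state="LA" role="legislator">Abraham</legislator>
      <vote>Nay</vote>
    </recorded-vote>
    <recorded-vote>
      <legislator name-id="A000370" party="D" state="NC" role="legislator">Adams</legislator>
      <vote>Yea</vote>
    </recorded-vote>
  </vote-data>
</rollcall-vote>"""


def tally_by_party(xml_text):
    """Count votes per (party, vote) pair, one recorded-vote at a time."""
    root = ET.fromstring(xml_text)  # root is <rollcall-vote>
    counts = Counter()
    for record in root.findall("vote-data/recorded-vote"):
        party = record.find("legislator").get("party")  # attribute on <legislator>
        vote = record.findtext("vote")                   # text of the sibling <vote>
        counts[(party, vote)] += 1
    return counts


for (party, vote), n in sorted(tally_by_party(SAMPLE).items()):
    print(party, vote, n)
```

Run against the full roll call file, the same loop would reproduce the per-party totals from the vote data itself.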

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id="A000374", state="LA" and role="legislator".

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

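For example, for name-id="A000374", we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374"]

That path selects the <legislator> element whose name-id attribute has that value. To select the attribute itself rather than the element, end the path with the attribute:

rollcall-vote/vote-data/recorded-vote/legislator/@name-id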

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.
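The attribute path above can be tried outside an XQuery engine as well. This is a minimal sketch in Python’s standard library, which supports only a small subset of XPath (not XQuery). The sample data is a trimmed copy of the two records shown earlier and the function names are hypothetical. Note that ElementTree paths start from the root element, so the leading rollcall-vote step is dropped.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two records shown earlier in this post.
SAMPLE = """<rollcall-vote>
  <vote-data>
    <recorded-vote>
      <legislator name-id="A000374" party="R" state="LA" role="legislator">Abraham</legislator>
      <vote>Nay</vote>
    </recorded-vote>
    <recorded-vote>
      <legislator name-id="A000370" party="D" state="NC" role="legislator">Adams</legislator>
      <vote>Yea</vote>
    </recorded-vote>
  </vote-data>
</rollcall-vote>"""


def legislators_by_id(xml_text, name_id):
    """Select <legislator> elements whose name-id attribute equals name_id."""
    root = ET.fromstring(xml_text)  # root is <rollcall-vote>
    path = "vote-data/recorded-vote/legislator[@name-id='%s']" % name_id
    return root.findall(path)


def all_name_ids(xml_text):
    """Collect the name-id attribute value of every <legislator>."""
    root = ET.fromstring(xml_text)
    return [leg.get("name-id")
            for leg in root.findall("vote-data/recorded-vote/legislator")]


print([leg.text for leg in legislators_by_id(SAMPLE, "A000374")])
print(all_name_ids(SAMPLE))
```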

In preparation for our next post, try searching for “A000374” and limiting your search to the domain congress.gov.

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

### Congressional Roll Call Vote – The Documents (XQuery)

Monday, January 11th, 2016

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data> reads:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rollcall-vote PUBLIC "-//US Congress//DTDs/vote v1.0 20031119 //EN" "http://clerk.house.gov/evs/vote.dtd">
<?xml-stylesheet type="text/xsl" href="http://clerk.house.gov/evs/vote.xsl"?>
<rollcall-vote>
<vote-metadata>
<majority>R</majority>
<congress>114</congress>
<session>1st</session>
<chamber>U.S. House of Representatives</chamber>
<rollcall-num>705</rollcall-num>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with
Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<vote-type>YEA-AND-NAY</vote-type>
<vote-result>Passed</vote-result>
<action-date>18-Dec-2015</action-date>
<action-time time-etz="09:49">9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the
Department of Veterans Affairs, and related agencies for the fiscal
year ending September 30, 2016, and for other purposes</vote-desc>
<vote-totals>
<totals-by-party-header>
<party-header>Party</party-header>
<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>
</totals-by-party-header>
<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Independent</party>
<yea-total>0</yea-total>
<nay-total>0</nay-total>
<present-total>0</present-total>
<not-voting-total>0</not-voting-total>
</totals-by-party>
<totals-by-vote>
<total-stub>Totals</total-stub>
<yea-total>316</yea-total>
<nay-total>113</nay-total>
<present-total>0</present-total>
<not-voting-total>5</not-voting-total>
</totals-by-vote>
</vote-totals>
</vote-metadata>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element – aka “parent” element

<vote-metadata> – first child element in this document
XPath rollcall-vote/vote-metadata

<majority>R</majority> – first child of <vote-metadata>
XPath rollcall-vote/vote-metadata/majority

<congress>114</congress>

What do you think? Looks like the same level as <majority>R</majority> and it is. Called a sibling of <majority>R</majority>
XPath rollcall-vote/vote-metadata/congress

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, let’s get good at spotting direct XPaths.
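One way to check a direct path you have spotted is to evaluate it. Here is a minimal sketch in Python’s standard library, assuming a trimmed copy of the metadata above; the helper name text_at is hypothetical. ElementTree evaluates paths relative to the root element, so the helper peels off the first step and compares it with the root’s name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the top of the roll call document.
SAMPLE = """<rollcall-vote>
  <vote-metadata>
    <majority>R</majority>
    <congress>114</congress>
    <session>1st</session>
    <rollcall-num>705</rollcall-num>
  </vote-metadata>
</rollcall-vote>"""


def text_at(root, path):
    """Return the text of the first element reached by a direct path."""
    first, _, rest = path.partition("/")
    if first != root.tag:
        return None  # the path does not start at this document's root
    return root.findtext(rest) if rest else root.text


root = ET.fromstring(SAMPLE)
for path in ("rollcall-vote/vote-metadata/majority",
             "rollcall-vote/vote-metadata/congress"):
    print(path, "->", text_at(root, path))
```

A path that names the wrong parent simply finds nothing, which is a quick way to catch a missed step.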

Let’s skip down in the markup until we come to <vote-totals> and, right after it, <totals-by-party-header>. Neither is followed, at least not immediately, by its closing tag. That’s a signal that the previous siblings have stopped and that each of these elements adds another step to the XPath.

<totals-by-party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header

<party-header>Party</party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header/party-header

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>

The closing tag, marked by the “/”, signals the end of the <totals-by-party-header> element.

</totals-by-party-header>

See how you do mapping out the remaining XPaths from the top of the document.

<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>

Tomorrow we are going to dive into the structure of the <vote-data> and how to address the attributes therein and their values.

Enjoy!

### Congressional Roll Call Vote and XQuery (A Do Over)

Sunday, January 10th, 2016

Once words are written, as an author I consider them to be fixed. Even typo corrections should be acknowledged rather than used to silently “improve” the original text. Rather than editing what has been said, more words can cover the same ground with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post to be meaningful, so let’s cover getting that set up first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon, unzip the file and you can choose to put saxon9he.jar in your Java classpath (if set) or you can invoke it with the -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

#!/bin/bash
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:"$1"

After creating that file, which I very imaginatively named “runsaxon.sh,” I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “runsaxon.sh” I type:

./runsaxon.sh ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The “>” sign is the redirection operator in Linux shells; it sends the output to a file, in this case, ex-5.4.html.

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Priscilla Walmsley. Highly recommended.

Priscilla has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

### Congressional Roll Call and XQuery – (Week 1 of XQuery)

Saturday, January 9th, 2016

Truthfully a little more than a week of daily XQuery posts, I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers of interesting software or techniques. Most of them are not “how-to” or step by step guides, but some of them are.

The posts on congressional roll call documents made sense to me but then I wrote them. Part of what I sensed was that either you know enough to follow my jumps, in which case you are looking for specific details, like the correspondence across documents for attribute values, and not so much for my XQuery expressions.
On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations; one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I’m going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with very terse expert tips at the end of each post if needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

I will have the first longer style post up tomorrow, January 10, 2016, and we will see how the week develops from there.

### Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Friday, January 8th, 2016

One of the things that yesterday’s output lacked was the full names of the Georgia representatives. Which aren’t reported in the roll call documents.

But, what the roll call documents do have, is the following:

<recorded-vote>
<legislator name-id="J000288" sort-field="Johnson (GA)" unaccented-name="Johnson (GA)" party="D" state="GA" role="legislator">Johnson (GA)</legislator>
<vote>Nay</vote>
</recorded-vote>

With emphasis on name-id="J000288".

I call that attribute out because there is a sample data file, just for the House of Representatives, that has:

<bioguideID>J000288</bioguideID>

And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at Congress.gov.
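The name-id/bioguideID correspondence is what makes a join possible. As a minimal sketch of the idea in Python’s standard library (not the XQuery used for this post), the following matches votes to member records on that shared value. The inline documents are hypothetical stand-ins: member, member-info and bioguideID come from the files discussed here, while the MemberData and members wrappers and the official-name element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the roll call file (records trimmed to the attributes used here).
VOTES = """<rollcall-vote><vote-data>
  <recorded-vote>
    <legislator name-id="J000288" party="D" state="GA" role="legislator">Johnson (GA)</legislator>
    <vote>Nay</vote>
  </recorded-vote>
  <recorded-vote>
    <legislator name-id="A000374" party="R" state="LA" role="legislator">Abraham</legislator>
    <vote>Nay</vote>
  </recorded-vote>
</vote-data></rollcall-vote>"""

# Stand-in for the member data file; <official-name> is an assumed element name.
MEMBERS = """<MemberData><members>
  <member><member-info>
    <bioguideID>J000288</bioguideID>
    <official-name>Henry C. "Hank" Johnson, Jr.</official-name>
  </member-info></member>
</members></MemberData>"""


def votes_with_names(votes_xml, members_xml, state):
    """Join votes to member records on name-id = bioguideID for one state."""
    votes = ET.fromstring(votes_xml)
    members = ET.fromstring(members_xml)
    # Build a lookup table keyed on the shared identifier.
    names = {info.findtext("bioguideID"): info.findtext("official-name")
             for info in members.iter("member-info")}
    rows = []
    for record in votes.iter("recorded-vote"):
        legislator = record.find("legislator")
        if legislator.get("state") != state:
            continue
        # Fall back to the short name if the member file has no match.
        full_name = names.get(legislator.get("name-id"), legislator.text)
        rows.append((full_name, record.findtext("vote")))
    return rows


print(votes_with_names(VOTES, MEMBERS, "GA"))
```

Building the lookup table once avoids scanning the member file for every vote, which matters more as the files grow.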
The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:

That is the result of joining/merging two XML files hosted at congress.gov in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: http://clerk.house.gov/evs/2015/roll705.xml and the House Member Data File was: http://xml.house.gov/MemberData/MemberData.xml. The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery expression for merging the two XML files:

<ul>
{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
$mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li> {string($voter/legislator)} — {string($voter/vote)}</li> }
</ul>

Which makes our localized display a bit better for local readers but only just. What we need is more information that can be found at: http://clerk.house.gov/evs/2015/roll705.xml.

More on that tomorrow!

### PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:

• Downloads
• Press kit
• Release Notes
• What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, 9.14.2.2 XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection; there was no anchor at 9.14.2.2.

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes.
This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”

#### UPSERT

A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

#### Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed.
Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

#### Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

• BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.
• Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.
• CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.
• Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.
• TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past.
It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

### A Lesson about Let Clauses (XQuery)

Wednesday, January 6th, 2016

I was going to demonstrate how to localize roll call votes so that only representatives from your state and their votes were displayed for any given roll call vote.

Which would enable libraries or local newsrooms, whose users/readers have little interest in how obscure representatives from other states voted, to pare down the roll call vote list to those that really matter, your state’s representatives.

But then I remembered that I promised to clean up the listings in yesterday’s post that read:

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}

and kept repeating doc("http://clerk.house.gov/evs/2015/roll705.xml").

My thought was to replace that string with a variable declared by a let clause and then substitute that variable for that string.

To save you from the same mistake: my attempt at combining a let clause with direct element constructors returned an error saying, in this case:

Left operand of '>' needs parentheses

Not a terribly helpful error message.

I have found examples of using a let clause within a direct element constructor that would have defeated the rationale for declaring the variable to begin with.

Tomorrow I hope to post today’s content, which will enable you to display data relevant to local voters and news reporters for any arbitrary roll call vote in Congress.

Mark today’s adventure as a mistake to avoid. 😉

### Jazzing a Roll Call Vote – Part 3 (XQuery)

Tuesday, January 5th, 2016

I posted Congressional Roll Call Vote – Accessibility Issues earlier today to deal with some accessibility issues noticed by @XQuery with my color coding.
Today we are going to start at the top of the boring original roll call vote and work our way down using XQuery.

Be forewarned that the XQuery you see today we will be shortening and cleaning up tomorrow. It works, but it’s not best practice.

You will need to open up the source of the original roll call vote to see the elements I select in the path expressions.

Here is the XQuery that is the goal for today:

xquery version "3.0";
declare boundary-space preserve;
<html>
<head></head>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>
<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>
<strong>Question:</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>
<strong>Bill Title:</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}
</body>
</html>

The title of the document we obtain with:

<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

Two quick things to notice:

First, for very simple documents like this one, I use “//” rather than writing out the path to the rollcall-num element. I already know it only occurs once in each rollcall document.

Second, when using direct element constructors, the XQuery expressions are enclosed in “{ }” braces.
The rollcall number, date and time of the vote come next (I have introduced line breaks for readability):

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

If you compare my presentation of that string and that from the original, you will find the original has slightly more space between the items.

Here is the XSLT for that spacing:

<xsl:if test="legis-num[text()!='0']"><xsl:text> </xsl:text><b><xsl:value-of select="legis-num"/></b></xsl:if>
<xsl:text> </xsl:text><xsl:value-of select="vote-type"/>
<xsl:text> </xsl:text><xsl:value-of select="action-date"/>
<xsl:text> </xsl:text><xsl:value-of select="action-time"/><br/>

Since I already had white space separating my XQuery expressions, I just added to the prologue:

declare boundary-space preserve;

The last two lines:

<strong>Question:</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>
<strong>Bill Title:</strong>
{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}

Are just standard queries for content. The string function returns the text content of the element you address.

Tomorrow we are going to talk about how to clean up and shorten the path statements and look around for information that should be at the top of this document, but isn’t!

PS: Did you notice that the vote totals, etc., are written as static data in the XML file? Curious, isn’t it? Easy enough to generate from the voting data. I don’t have an answer but thought you might.

### Congressional Roll Call Vote – Accessibility Issues

Tuesday, January 5th, 2016

I posted a color coded version of a congressional roll call vote in Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway), using red for Republicans and blue for Democrats.

#XQuery points out accessibility issues which depend upon color perception.
Color coding works better for me than the more traditional roman versus italic font face distinction but let’s improve the color coding to remove the accessibility issue.

The first question is what colors should I use for accessibility?

In searching to answer that question I found this thread at Edward Tufte’s site (of course), Choice of colors in print and graphics for color-blind readers, which has a rich list of suggestions and pointers to other resources.

One in particular, Color Universal Design (CUD), posted by Maarten Boers, has this graphic on colors:

Relying on that palette, I changed the colors for the roll call vote to Republicans in orange; Democrats in sky blue and re-generated the roll call document.

Here is an accessible, but still color-coded, version of: FINAL VOTE RESULTS FOR ROLL CALL 705.

An upside of XML is that changing the presentation of all 429 votes took only a few seconds: change the stylesheet and re-generate the results.

Thanks to #XQuery for prodding me on the accessibility issue which resulted in finding the thread at Tufte and the Colorblind barrier-free color pallet.

Other posts on congressional roll call votes:

### Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Monday, January 4th, 2016

Apologies but I did not make as much progress on the Congressional Roll Call vote as I had hoped.

I did find some interesting information about the vote.xsl stylesheet and managed to use color to code members of the House.

You probably remember me whining about how hard it is to tell roman from italics when they are used to distinguish members of different parties. Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

The XSLT code is worse than I imagined.
Here’s what I mean:

<b><center><font size="+2">FINAL VOTE RESULTS FOR ROLL CALL <xsl:value-of select="/rollcall-vote/vote-metadata/rollcall-num"/>
<xsl:if test="/rollcall-vote/vote-metadata/vote-correction[text()!='']">*</xsl:if></font></center></b>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'D']"> -->
<xsl:if test = "$Majority='D'">
<center>(Democrats in roman; Republicans in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'R']"> -->
<xsl:if test = "$Majority!='D'">
<center>(Republicans in roman; Democrats in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>

Which party is in the majority determines whether the names in a party appear in roman or italic face font. Now there’s a distinction that will be lost on a casual reader!

What’s more, if you are trying to reform the stylesheet, don’t look for R or D but again for majority party:

<xsl:template match="vote">
<!-- Handles formatting of Member names based on party. -->
<!-- <xsl:if test="../legislator/@party='R'"><xsl:value-of select="../legislator"/></xsl:if>
<xsl:if test="../legislator/@party='D'"><i><xsl:value-of select="../legislator"/></i></xsl:if> -->
<xsl:if test="../legislator/@party='I'"><u><xsl:value-of select="../legislator"/></u></xsl:if>
<xsl:if test="../legislator/@party!='I'">
<xsl:if test="../legislator/@party = $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<xsl:value-of select="../legislator"/>
</xsl:if>
<xsl:if test="../legislator/@party != $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<i><xsl:value-of select="../legislator"/></i>
</xsl:if>
</xsl:if>
</xsl:template>

As you can see, selecting by party has been commented out in favor of the roman/italic distinction based on the majority party.

I wanted to label the Republicans with an icon but my GIMP skills don’t extend to making an icon of young mothers throwing their children under the carriage wheels of the wealthy to save them from a life of poverty and degradation. A bit much to get into an HTML button sized icon.

I settled for using the traditional red for Republicans and blue for Democrats and ran the modified stylesheet against roll705.xml locally.

Here is FINAL VOTE RESULTS FOR ROLL CALL 705 as HTML.

Question: Are red and blue easier to distinguish than roman and italic?

If your answer is yes, why resort to typographic subtlety on something like party affiliation?

Are subtle distinctions used to confuse the uninitiated and unwary?

### Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

Sunday, January 3rd, 2016

Roll call votes in the US House of Representatives are a staple of local, state and national news. If you go looking for the “official” version, what you find is as boring as your 5th grade civics class.

Trigger Warning: Boring and Minimally Informative Page Produced By Following Link: Final Vote Results For Roll Call 705.

Take a deep breath and load the page. It will open in a new browser tab.

Boring. Yes? (You were warned.)

It is the recent roll call vote to fund the US government, take another slice of privacy from citizens, and make a number of other dubious policy choices. (Everything after the first comma depending upon your point of view.)

Whatever your politics though, you have to agree this is sub-optimal presentation, even for a government document.
This is no accident: sans the header, you will find the identical presentation of this very roll call vote at: page H10696, Congressional Record for December 18, 2015 (pdf).

Disappointing that so much XML, XSLT, XQuery, etc., has been wasted duplicating non-informative print formatting. Or should I say less-informative formatting than is possible with XML?

Once the data is in XML, legend has it, users can transform that XML in ways more suited to their purposes and not those of the content providers.

I say “legend has it,” because we all know if content providers had their way, web navigation would be via ads and not bare hyperlinks. You want to see the next page? You must select the ad + hyperlink, waiting for the ad to clear before the resource appears.

I can summarize my opinion about content provider control over information legally delivered to my computer: Screw that!

If a content provider enables access to content, I am free to transform that content into speech, graphics, add information, take away information, in short do anything that my imagination desires and my skill enables.

Let’s take the roll call vote in the House of Representatives, Final Vote Results For Roll Call 705.

Just under the title you will read:

(Republicans in roman; Democrats in italic; Independents underlined)

Boring.

For a bulk display of voting results, we can do better than that.

What if we had small images to identify the respective parties? Here are some candidates (sic) for the Republicans:

Of course we would have to reduce them to icon size, but XML processing is rarely ever just XML processing. Nearly every project includes some other skill set as well.

Which one do you think looks more neutral? 😉

Certainly it would be more colorful and, depending upon your inclinations, more fun to play about with than the difference between roman and italic. Yes?

Presentation of the data in http://clerk.house.gov/evs/2015/roll705.xml is only one of the possibilities that XQuery offers.
Follow along and offer your suggestions for changes, additions and modifications.

First steps:

In the browser tab with Final Vote Results For Roll Call 705, use Ctrl-U to view the page source.

First notice that the boring web presentation is controlled by http://clerk.house.gov/evs/vote.xsl.

Copy and paste: http://clerk.house.gov/evs/vote.xsl into a new browser tab and press Return.

The resulting xsl:stylesheet is responsible for generating the original page, from the vote totals to column presentation of the results.

Pay particular attention to the generation of totals from the <vote-data> element and its children. That generation is powered by these lines in vote.xsl:

<xsl:apply-templates select="/rollcall-vote/vote-metadata"/>
<!-- Create total variables based on counts. -->
<xsl:variable name="y" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Yea'])"/>
<xsl:variable name="a" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Aye'])"/>
<xsl:variable name="yeas" select="$y + $a"/>
<xsl:variable name="nay" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Nay'])"/>
<xsl:variable name="no" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='No'])"/>
<xsl:variable name="nays" select="$nay + $no"/>
<xsl:variable name="nvs" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Not Voting'])"/>
<xsl:variable name="presents" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Present'])"/>
<br/>

(Not entirely, I omitted the purely formatting stuff.)

For tomorrow I will be working on a more “visible” way to identify political party affiliation and “borrowing” the count code from vote.xsl.

Enjoy!

You may be wondering what XQuery has to do with topic maps? Well, if you think about it, every time we select, aggregate, etc., data, we are making choices based on notions of subject identity.
That is, we think the data we are manipulating represents some subjects and/or information about some subjects, that we find sensible (for some unstated reason) to put together for others to read.

The first step towards a topic map, however, is the putting of information together so we can judge what subjects need explicit representation and how we choose to identify them.

Prior topic map work was never explicit about how we get to a topic map; putting that possibly divisive question behind us, we simply started with topic maps, ab initio.

I was in the car when we took that turn and for the many miles since then. I have come to think that a better starting place is choosing subjects, what we want to say about them and how we wish to say it, so that we have only so much machinery as is necessary for any particular set of subjects.

Some subjects can be identified by IRIs, others by multi-dimensional vectors, still others by unspecified processes of deep learning, etc. Which ones we choose will depend upon the immediate ROI from subject identity and relationships between subjects.

I don’t need triples, for instance, to recognize natural languages to a sufficient degree of accuracy. Unnecessary triples, topics or associations are just padding. If you are on a per-triple contract, they make sense, otherwise, not.

A long way of saying that subject identity lurks just underneath the application of XQuery and we will see where it is useful to call subject identity to the fore.

### Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery (Part 2)

Saturday, January 2nd, 2016

Despite heavy carousing during the holidays, you may still remember Great R packages for data import, wrangling & visualization [+ XQuery], where I re-sorted the table by Sharon Machlis, to present the R packages in package name order.
I followed that up with: Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery, where I detailed the travails of trying to sort the software packages by their short descriptions, again in alphabetical order. My assumption in that post was that either the spaces or the “,” commas in the descriptions were fouling the sort.

That wasn’t the case, which I should have known because the string function always returns a string. That is, the spaces and “,” inside are just parts of a string, nothing more.

The up-side of the problem was that I spent more than a little while with Walmsley’s XQuery book, searching for ever more esoteric answers.

Here’s the failing XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

And here is the working XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

Here is the mistake highlighted:

order by lower-case"(string($row/td[2]/a))"


My first mistake was the inclusion of “/a” in the path. Using string on ($row/td[1]), that is without having /a at the end of the path, gives the original script the same result. (Run that for yourself on favorite-R-packages.xml.) Make any path as long as required and no longer!

My second mistake was not checking the XPath immediately upon the failure of the sort. (The simplest answer is usually the correct one.)

Enjoy!

Update: Removed the quote marks around (string($row/td[2])) in both queries; they were part of an explanation that did not make the cut. Thanks to XQuery for the catch!
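
A minimal sketch of why the extra “/a” step broke the sort, assuming a cut-down, hypothetical two-row table in place of favorite-R-packages.xml. The second <td> in each row has no <a> child, so the longer path selects nothing and string() returns a zero-length string for every row:

```xquery
xquery version "1.0";

(: Hypothetical two-row stand-in for favorite-R-packages.xml:
   td[1] wraps the package name in <a>, td[2] is plain text. :)
let $table :=
  <table>
    <tr><td><a href="#">readxl</a></td><td>data import</td></tr>
    <tr><td><a href="#">devtools</a></td><td>package development, package installation</td></tr>
  </table>
for $row in $table/tr
return
  <key row="{string($row/td[1]/a)}">
    <!-- td[2] has no <a> child: the path is empty, string() gives "" -->
    <with-a>{string($row/td[2]/a)}</with-a>
    <!-- td[2] itself: string() gives the short description -->
    <without-a>{string($row/td[2])}</without-a>
  </key>
```

Every <with-a> comes back empty while <without-a> carries the description. With identical sort keys for all rows, order by has nothing to distinguish them, which is consistent with the failing query leaving the rows in document order.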

### XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Friday, January 1st, 2016

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

Using a virtual machine to install the latest version of Ubuntu (15.10), which had the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations so you are likely to spend some time loading linux headers and the like to compile software.

The advantage of a virtual box is that I don’t have to risk doing something dumb or out of fatigue to my working setup. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day, as I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the first of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

apt-get install dkms build-essential linux-headers-generic

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.

### XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Thursday, December 31st, 2015

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

1. Download a source distribution of Xerces-C 3.1.2

2. Build Xerces-C

cd xerces-c-3.1.2/
./configure
make

4. Build XQilla

cd xqilla/
./configure --with-xerces=`pwd`/../xerces-c-3.1.2/
make

A few notes that may help:

Obtain Xerces-c-3.1.2 homepage.

Xerces project homepage. Home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and, Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

the README is presuming you unpacked the XQilla source in a directory next to xerces-c-3.1.2 (note the ../ in the path). You could; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2 

No error messages so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla
xqilla-commandline.o
-L/home/patrick/working/xerces-c-3.1.2/src
/home/patrick/working/xerces-c-3.1.2/src/
.libs/libxerces-c.so ./.libs/libxqilla.so -lnsl -lpthread -Wl,-rpath
-Wl,/home/patrick/working/xerces-c-3.1.2/src
/usr/bin/ld: warning: libicuuc.so.55, needed by
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so,
not found (try using -rpath or -rpath-link)
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so:
undefined reference to uset_close_55'
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so:
undefined reference to ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory /home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1


To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have libicuuc.so.55.

If I manually upgrade libraries, I might create an inconsistency package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” you should see some of the lists you can find at Congress.gov. Valid XML I’ll concede but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

### Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Wednesday, December 30th, 2015

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.

How often do you see strings with no whitespaces? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation devtools
misc installr
data import readxl
data import, data export googlesheets
data import RMySQL
data import readr
data import, data export rio
data analysis psych
data wrangling, data analysis sqldf
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
data import, web scraping rvest
data wrangling, data analysis dplyr
data wrangling plyr
data wrangling reshape2
data wrangling tidyr
data wrangling, data analysis data.table
data wrangling stringr
data wrangling lubridate
data wrangling, data analysis zoo
data display editR
data display knitr
data display, data wrangling listviewer
data display DT
data visualization ggplot2
data visualization dygraphs
data visualization googleVis
data visualization metricsgraphics
data visualization RColorBrewer
data visualization plotly
mapping leaflet
mapping choroplethr
mapping tmap
misc fitbitScraper
Web analytics rga
Web analytics RSiteCatalyst
package development roxygen2
data visualization shiny
misc openxlsx
data wrangling, data analysis gmodels
data wrangling car
data visualization rcdimple
data wrangling foreach
data acquisition downloader
data wrangling scales
data visualization plotly

Err, that’s not right!

The XQuery from yesterday:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[1]/a))
return <tr>{$row/td[1]} {$row/td[2]}</tr>
}</table>
</html>

XQuery from today, changes in red:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document sort order, as they appeared in the document.

Here’s a snippet of that document:

<table>
<tr>
<td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also
essential if you want to easily install other packages from GitHub. Install it!
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/"
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
<td>Hadley Wickham & others</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili & others</td>
</tr>
<tr>
<td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>
</tr>
...
</table>


I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick one that can be easily adapted to the next soiled data that comes through the door.

PS: Neither Saxon (9.7), nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

### Great R packages for data import, wrangling & visualization [+ XQuery]

Tuesday, December 29th, 2015

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below show my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name ).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis

Enjoy!

I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon’s table, I copied the table to a separate file and ran this XQuery:

1. xquery version "1.0";
2. <html>
3. <table>{
4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
5. order by lower-case(string($row/td[1]/a))
6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
7. }</table>
8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, in all lower-case on the first <td> element, while returning the same element as written in the original table. Which gives better (IMHO) sort order than UPPERCASE followed by lowercase.

This same technique should make you the master of any simple tables you encounter on the web.
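
To see the difference the lower-case() key makes, here is a small sketch using four package names from the table as a hypothetical inline sequence. It assumes the processor's default collation is the Unicode codepoint collation, which is common but implementation-defined:

```xquery
xquery version "1.0";

(: Four names taken from the table above; the inline sequence is
   an illustration, not the original data file. :)
let $names := ("XML", "car", "DT", "dplyr")
return
  <result>
    <codepoint-order>{
      (: sort on the names exactly as written :)
      for $n in $names order by $n return <n>{$n}</n>
    }</codepoint-order>
    <lower-case-order>{
      (: sort on a lower-cased key, return the name as written :)
      for $n in $names order by lower-case($n) return <n>{$n}</n>
    }</lower-case-order>
  </result>
```

Under a codepoint collation the first list puts DT and XML ahead of car and dplyr, because uppercase letters have lower codepoints. The second list interleaves them (car, dplyr, DT, XML) while still returning each name in its original case.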

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

### Facets for Christmas!

Friday, December 25th, 2015

Facet Module

From the introduction:

Faceted search has proven to be enormously popular in the real world applications. Faceted search allows user to navigate and access information via a structured facet classification system. Combined with full text search, it provides user with enormous power and flexibility to discover information.

This proposal defines a standardized approach to support the Faceted search in XQuery. It has been designed to be compatible with XQuery 3.0, and is intended to be used in conjunction with XQuery and XPath Full Text 3.0.

Imagine my surprise, after opening Christmas presents with family, to see a tweet by XQuery announcing yet another Christmas present:

“Facets”: A new EXPath spec w/extension functions & data models to enable faceted navigation & search in XQuery http://expath.org/spec/facet

The EXPath homepage says:

XPath is great. XPath-based languages like XQuery, XSLT, and XProc, are great. The XPath recommendation provides a foundation for writing expressions that evaluate the same way in a lot of processors, written in different languages, running in different environments, in XML databases, in in-memory processors, in servers or in clients.

Supporting so many different kinds of processor is wonderful thing. But this also contrains which features are feasible at the XPath level and which are not. In the years since the release of XPath 2.0, experience has gradually revealed some missing features.

EXPath exists to provide specifications for such missing features in a collaborative- and implementation-independent way. EXPath also provides facilities to help and deliver implementations to as many processors as possible, via extensibility mechanisms from the XPath 2.0 Recommendation itself.

Other projects exist to define extensions for XPath-based languages or languages using XPath, as the famous EXSLT, and the more recent EXQuery and EXProc projects. We think that those projects are really useful and fill a gap in the XML core technologies landscape. Nevertheless, working at the XPath level allows common solutions when there is no sense in reinventing the wheel over and over again. This is just following the brilliant idea of the W3C’s XSLT and XQuery working groups, which joined forces to define XPath 2.0 together. EXPath purpose is not to compete with other projects, but collaborate with them.

Be sure to visit the resources page. It has a manageable listing of processors that handle extensions.

What would you like to see added to XPath?

Enjoy!

### An XQuery Module For Simplifying Semantic Namespaces

Wednesday, December 23rd, 2015

From the post:

While I enjoy working with the MarkLogic 8 server, there are a number of features about the semantics library there that I still find a bit problematic. Declaring namespaces for semantics in particular is a pain—I normally have trouble remembering the namespaces for RDF or RDFS or OWL, even after working with them for several years, and once you start talking about namespaces that are specific to your own application domain, managing this list can get onerous pretty quickly.

I should point out however, that namespaces within semantics can be very useful in helping to organize and design an ontology, even a non-semantic ontology, and as such, my applications tend to be namespace rich. However, when working with Turtle, Sparql, RDFa, and other formats of namespaces, the need to incorporate these namespaces can be a real showstopper for any developer. Thus, like any good developer, I decided to automate my pain points and create a library that would allow me to simplify this process.

The code given here is in turtle and xquery, but I hope to build out similar libraries for use in JavaScript shortly. When I do, I’ll update this article to reflect those changes.

If you are forced to use a MarkLogic 8 server, great post on managing semantic namespaces.

If you have a choice of tools, something to consider before you willingly choose to use a MarkLogic 8 server.

I first saw this in a tweet by XQuery.

### My Bad – You Are Not! 747 Edits Away From Using XML Tools

Thursday, December 17th, 2015

The original, unedited post is below but in response to comments, I checked the XQuery, XPath, XSLT and XQuery Serialization 3.1 files in Chrome (Ctrl-U) before saving them.

All the empty elements were properly closed.

I then saved the files and re-opened in Emacs, to discover that Chrome had stripped the “/” from the empty elements, which then caused BaseX to complain. It was an accurate complaint but the files I was tossing against BaseX were not the files as published by the W3C.

So now I need to file a bug report on Chrome, Version 47.0.2526.80 (64-bit) on Ubuntu, for mangling closed empty elements.

You could tell in XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations! that I was really excited to see the new drafts hit the street.

Me and my big mouth.

I grabbed copies of all three and tossed the XQuery draft against an xquery to create a list of all the paths in it. Simple enough.

The results weren’t.

Here is the first error message:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 68): The element type “link” must be terminated by the matching end-tag “</link>”.

Ouch!

I corrected that and running the query a second time I got:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 68): The element type “meta” must be terminated by the matching end-tag “</meta>”.

The <meta> elements appear on lines three and four.

On the third try:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 69): The element type “img” must be terminated by the matching end-tag “</img>”.

There are 3 <img> elements that are not closed.

I’m getting fairly annoyed at this point.

Fourth try:

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 78): The element type “br” must be terminated by the matching end-tag “</br>”.

Of course at this point I revert to grep and discover there are 353 <br> elements that are not closed.

Sigh, nothing to do but correct and soldier on.

Fifth attempt.

[FODC0002] “file:/home/patrick/working/w3c/XQuery3.1.html” (Line 17618): The element type “hr” must be terminated by the matching end-tag “</hr>”.

There are 2 <hr> elements that are not closed.

A total of 361 edits in order to use XML based tools with the most recent XQuery 3.1 Candidate draft.

The most recent XPath 3.1 has 238 empty elements that aren’t closed (same elements as XQuery 3.1).

The XSLT and XQuery Serialization 3.1 draft has 149 empty elements that aren’t closed: the same elements as the others, plus four <col> elements.

Grand total: 747 edits in order to use XML tools.

Not an editorial but a production problem. A rather severe one it seems to me.

Anyone who wants to use XML tools on these drafts will have to perform the same edits.
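For anyone repeating the grep step, the count can also be sketched in XQuery 3.1, since a file that is not well-formed can still be read as plain text. This is a minimal sketch, not the method I used above: the sample string stands in for a saved draft (in practice something like unparsed-text("XQuery3.1.html")), and the list of element names is simply the set that shows up in the error messages in this post.

```xquery
xquery version "3.1";

(: Hypothetical sample markup; replace with unparsed-text(...) for a real file. :)
let $html := '<head><meta charset="utf-8"><link rel="stylesheet" href="a.css"/></head>
<body><p>one<br>two<br/>three</p><hr><img src="x.png" alt="x"></body>'

(: Start tags of the empty elements named in this post. :)
let $void := "<(link|meta|img|br|hr|col)(\s[^>]*)?>"

(: Keep only the matches that do not end in "/>", i.e. the unclosed ones. :)
for $tag in analyze-string($html, $void)/fn:match[not(ends-with(., "/>"))]
let $name := replace($tag, "^<([a-z]+).*$", "$1", "s")
group by $name
order by $name
return $name || ": " || count($tag)

(: Result for the sample: "br: 1", "hr: 1", "img: 1", "meta: 1" :)
```

The pattern assumes lower-case tag names and no ">" inside attribute values, which is good enough for a quick count but not a substitute for a real HTML parser.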

### XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations!

Thursday, December 17th, 2015

As I forecast 😉 earlier this week, new Candidate Recommendations for:

XQuery 3.1: An XML Query Language

XML Path Language (XPath) 3.1

XSLT and XQuery Serialization 3.1

have hit the streets for your review and comments!

Comments due by 2016-01-31.

That’s forty-five days, minus the ones spent with drugs/sex/rock-n-roll over the holidays and recovering from same.

Say something shy of forty-four actual working days (my endurance isn’t what it once was) for the review process.
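If you want to check the forty-five days, XQuery will do the date arithmetic for you. A small sketch, assuming the review period runs from the publication date of this post (2015-12-17) to the comment deadline:

```xquery
xquery version "3.1";

let $published := xs:date("2015-12-17")   (: date of this post :)
let $due       := xs:date("2016-01-31")   (: comment deadline :)
(: Subtracting two xs:date values yields an xs:dayTimeDuration. :)
return days-from-duration($due - $published)

(: Result: 45 :)
```

Deducting holidays is left to your own estimate.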

What tools, techniques are you going to use to review this latest set of candidates?

BTW, some people review software and check only the fixes. For standards, I start at the beginning, go to the end, then stop. (Or the reverse for backward proofing.)

My estimates on days spent with drugs/sex/rock-n-roll are approximate only and your experience may vary.

### 35 Lines XQuery versus 604 of XSLT: A List of W3C Recommendations

Monday, December 14th, 2015

Use Case

You should be familiar with the W3C Bibliography Generator. You can insert one or more URLs and the generator produces correctly formatted citations for W3C work products.

It’s quite handy but requires a URL to produce a useful response. I need authors to use correctly formatted W3C citations and asking them to find URLs and to generate correct citations was a bridge too far. Simply didn’t happen.

My current attempt is to produce a list of correctly formatted W3C citations in HTML. Authors can use CTRL-F in their browsers to find citations. (Time will tell if this is a successful approach or not.)

Goal: An HTML page of correctly formatted W3C Recommendations, sorted by title (ignoring case because W3C Recommendations are not consistent in their use of case in titles). “Correctly formatted” meaning that it matches the output from the W3C Bibliography Generator.

Resources

As a starting point, I viewed the source of http://www.w3.org/2002/01/tr-automation/tr-biblio.xsl, the XSLT script that generates the XHTML page with its responses.

The first XSLT script imports two more XSLT scripts, http://www.w3.org/2001/08/date-util.xslt and http://www.w3.org/2001/10/str-util.xsl.

I’m not going to reproduce the XSLT here, but can say that starting with <stylesheet> and ending with </stylesheet>, inclusive, I came up with 604 lines.

You will need to download the file used by the W3C Bibliography Generator, tr.rdf.

XQuery Script

I have used the XQuery script successfully with: BaseX 8.3, eXide 2.1.3 and SaxonHE-6-07J.

Here’s the prolog:

declare default element namespace "http://www.w3.org/2001/02pd/rec54#";
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc = "http://purl.org/dc/elements/1.1/";
declare namespace doc = "http://www.w3.org/2000/10/swap/pim/doc#";
declare namespace contact = "http://www.w3.org/2000/10/swap/pim/contact#";
declare namespace functx = "http://www.functx.com";
declare function functx:substring-after-last
($string as xs:string?,$delim as xs:string) as xs:string?
{
if (contains ($string,$delim))
then functx:substring-after-last(substring-after($string,$delim), $delim)
else $string
};


The prolog declares the namespaces and borrows functx:substring-after-last from Priscilla Walmsley’s excellent FunctX XQuery Functions site.

<html>
<head><title>XQuery Generated W3C Recommendation List</title></head>
<body>
<ul class="ul">


Start the HTML page and the unordered list that will contain the list items.

{
for $rec in doc("tr.rdf")//REC order by upper-case($rec/dc:title)


If you sort W3C Recommendations by dc:title and don’t specify upper-case, then rdf:PlainLiteral: A Datatype for RDF Plain Literals, rdf:PlainLiteral: A Datatype for RDF Plain Literals (Second Edition), and xml:id Version 1.0 appear at the end of the list. Dirty data isn’t limited to databases.
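The effect is easy to reproduce without tr.rdf. The sketch below uses a handful of sample titles and pins the default collation to the Unicode codepoint collation, which is an assumption about what your processor does by default; under it, lower-case letters sort after upper-case ones.

```xquery
xquery version "3.1";

declare default collation
  "http://www.w3.org/2005/xpath-functions/collation/codepoint";

(: Sample titles only; the real query reads dc:title from tr.rdf. :)
let $titles := (
  "xml:id Version 1.0",
  "XML Schema Part 0: Primer",
  "rdf:PlainLiteral: A Datatype for RDF Plain Literals",
  "Associating Style Sheets with XML documents"
)
return (
  "Sorted as written:",
  (for $t in $titles order by $t return $t),
  "Sorted ignoring case:",
  (for $t in $titles order by upper-case($t) return $t)
)
```

In the first list the two lower-case titles land at the end, after "XML Schema Part 0: Primer". In the second, rdf:PlainLiteral moves up next to the other "R" entries.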

return <li class="li">
<a href="{string($rec/@rdf:about)}"> {string($rec/dc:title)} </a>,
{ for $auth in $rec/editor
return
if (contains(string($auth/contact:fullName), ".")) then (concat(string($auth/contact:fullName), ","))
else (concat(concat(concat(substring(substring-before(string($auth/\
contact:fullName), ' '), 0, 2), ". "), (substring-after(string\
($auth/contact:fullName), ' '))), ","))}


Watch for the line continuation marker “\”.

We begin by grabbing the URL and title for an entry and then confront dirty author data. The standard author listing by the W3C creates an initial plus a period for the author’s first name and then concatenates the rest of the author’s name to that initial plus period.

Problem: There is one entry for authors that already has initials, T.V. Raman, so I had to account for that one entry (as does the XSLT).
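Pulled out of the nested concat calls, the name rule looks like this. It is a sketch only: local:cite-name is a hypothetical helper that is not part of the script, and it assumes a full name is a first name, a space, and the rest.

```xquery
xquery version "3.1";

(: Hypothetical helper illustrating the rule described above. :)
declare function local:cite-name($full as xs:string) as xs:string {
  if (contains($full, "."))
  then $full   (: already has initials, e.g. T.V. Raman :)
  else substring($full, 1, 1) || ". " || substring-after($full, " ")
};

for $name in ("Jun Fujisawa", "T.V. Raman")
return local:cite-name($name)

(: Result: "J. Fujisawa", "T.V. Raman" :)
```

A single-word name would come out as an initial followed by nothing, so real data may need one more branch.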

{if (count ($rec/editor) >= 2) then " Editors," else " Editor,"} W3C Recommendation, {fn:format-date(xs:date(string($rec/dc:date)), "[MNn] [D], [Y]") },
{string($rec/@rdf:about)}. <a href="{string($rec/doc:versionOf/\
@rdf:resource)}">Latest version</a> \
available at {string($rec/doc:versionOf/@rdf:resource)}. <br/>[Suggested label: <strong>{functx:substring-after-last(upper-case\
(replace(string($rec/doc:versionOf/@rdf:resource), '/$', '')), "/")}\
</strong>]<br/></li>} </ul></body></html>


Nothing remarkable here, except that I snipped the concluding “/” off of the values from doc:versionOf/@rdf:resource so I could use functx:substring-after-last to create the token for a suggested label.
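The label step can be tried on its own. This sketch swaps in tokenize for functx:substring-after-last so that it needs no extra declarations; local:label is a hypothetical helper, and the URI is just an example of the "latest version" form.

```xquery
xquery version "3.1";

(: Hypothetical helper: strip a trailing slash, keep the last path step, upper-case it. :)
declare function local:label($uri as xs:string) as xs:string {
  upper-case(tokenize(replace($uri, "/$", ""), "/")[last()])
};

local:label("http://www.w3.org/TR/SVG11/")

(: Result: "SVG11" :)
```

Without the replace, the last token after the final "/" would be the empty string, which is why the trailing slash has to go first.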

Comments / Omissions

I depart from the XSLT in one case. It calls http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf here:

<!-- Special casing for when we have the name in Original Script (e.g. in \
Japanese); currently assume that the order is inversed in this case... -->

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">


But that refers to only one case:

<REC rdf:about="http://www.w3.org/TR/2003/REC-SVG11-20030114/">
<dc:date>2003-01-14</dc:date>
<dc:title>Scalable Vector Graphics (SVG) 1.1 Specification</dc:title>


Where Jun Fujisawa appears as an editor.

Recall that my criterion for “correctness” is the output of the W3C Bibliography Generator.

Preparing for this post made me discover at least one bug in the XSLT that was supposed to report the name in original script:

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

The test looks for contact:lastNameInOriginalScript, whereas the entry in http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf has only firstNameInOriginalScript:

<rdf:Description>
<rdf:type rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
<firstName>Jun</firstName>
<firstNameInOriginalScript>藤沢 淳</firstNameInOriginalScript>
<lastName>Fujisawa</lastName>
<sortName>Fujisawa</sortName>
</rdf:Description>

Since the W3C Bibliography Generator doesn’t produce the name in original script, neither do I. When the W3C fixes its output, I will have to amend this script to pick up that entry.

String

While writing this query I found text(), fn:string() and fn:data() by Dave Cassels. Recommended reading. The weakness of text() is that if markup is inserted inside your target element after you write the query, you will get unexpected results. Using fn:string() avoids that sort of surprise.
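A small sketch of the difference, using two made-up title elements rather than anything from tr.rdf:

```xquery
xquery version "3.1";

let $plain  := <title>XQuery 3.1</title>
let $marked := <title>XQuery <em>3.1</em></title>   (: markup added later :)
return (
  string-join($plain/text(), "|"),    (: "XQuery 3.1" :)
  string-join($marked/text(), "|"),   (: "XQuery " ; the text inside em is lost :)
  string($marked)                     (: "XQuery 3.1" :)
)
```

text() returns only the text nodes that are direct children, so the version with inline markup silently drops "3.1", while string() returns the full string value either way.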

Recommendations Only

Unlike the W3C Bibliography Generator, my script as written only generates entries for Recommendations. It would be trivial to modify the script to include drafts, notes, etc., but I chose to not include material that should not be used as normative citations.

I can see the usefulness of the bibliography generator for works in progress, but outside the W3C, citing Recommendations is the better course.

Contra Search

The SpecRef project has a searchable interface to all the W3C documents. If you search for XQuery, the interface returns 385 “hits.”

Contrast that with using CTRL-F on the list of Recommendations generated by the XQuery script: controlling for case, “XQuery” produced only 23 “hits.”

There are reasons for using search, but users repeatedly mining results of searches that could be captured (it was called curation once upon a time) is wasteful.

Reading

I can’t recommend Priscilla Walmsley’s XQuery 2nd Edition strongly enough.

There is one danger to Walmsley’s book. You will be so ready to start using XQuery after the first ten chapters it’s hard to find the time to read the remaining ones. Great stuff!

You can download the XQuery file, tr.rdf and the resulting html file at: 35LinesOfXQuery.zip.

### Congress.gov Enhancements: Quick Search, Congressional Record Index, and More

Monday, December 14th, 2015

From the post:

In our quest to retire THOMAS, we have made many enhancements to Congress.gov this year.  Our first big announcement was the addition of email alerts, which notify users of the status of legislation, new issues of the Congressional Record, and when Members of Congress sponsor and cosponsor legislation.  That development was soon followed by the addition of treaty documents and better default bill text in early spring; improved search, browse, and accessibility in late spring; user driven feedback in the summer; and Senate Executive Communications and a series of Two-Minute Tip videos in the fall.

Today’s update on end of year enhancements includes a new Quick Search for legislation, the Congressional Record Index (back to 1995), and the History of Bills from the Congressional Record Index (available from the Actions tab).  We have also brought over the State Legislature Websites page from THOMAS, which has links to state level websites similar to Congress.gov.

Text of legislation from the 101st and 102nd Congresses (1989-1992) has been migrated to Congress.gov. The Legislative Process infographic that has been available from the homepage as a JPG and PDF is now available in Spanish as a JPG and PDF (translated by Francisco Macías). Margaret and Robert added Fiscal Year 2003 and 2004 to the Congress.gov Appropriations Table. There is also a new About page on the site for XML Bulk Data.

The Quick Search provides a form-based search with fields similar to those available from the Advanced Legislation Search on THOMAS.  The Advanced Search on Congress.gov is still there with many additional fields and ways to search for those who want to delve deeper into the data.  We are providing the new Quick Search interface based on user feedback, which highlights selected fields most likely needed for a search.

There’s an impressive summary of changes!

Speaking of practicing programming, are you planning on practicing XQuery on congressional data in the coming year?

### XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts (soon!)

Monday, December 14th, 2015

XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts will be published quite soon so I wanted to give you a heads up on your holiday reading schedule.

This is deep enough in the review cycle that a back-to-front reading is probably your best approach.

You have read the drafts and corrections often enough by this point that you read the first few words of a paragraph and you “know” what it says so you move on. (At the very least I can report that happens to me.)

By back-to-front reading I mean to start at the end of each draft and read the last sentence and then the next to last sentence and so on.

The back-to-front process does two things:

1. You are forced to read each sentence on its own.
2. It prevents skimming and filling in errors with silent corrections (unknown to your conscious mind).
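If you would rather have the machine do the reordering, here is a rough sketch in XQuery. It assumes sentences end in ".", "!" or "?" followed by whitespace, which will mis-split abbreviations such as "e.g.", so treat the output as a reading aid and not a faithful sentence segmentation.

```xquery
xquery version "3.1";

(: Sample text; for a real draft you would pass in the text of a section. :)
let $text := "Read the draft. Then read it again! Is anything missing?"

(: Mark each sentence boundary with a newline, split on it, then reverse. :)
let $marked    := replace(normalize-space($text), "([.!?])\s+", "$1&#10;")
let $sentences := tokenize($marked, "&#10;")
return reverse($sentences)

(: Result: "Is anything missing?", "Then read it again!", "Read the draft." :)
```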

The back-to-front method is quite time-consuming, so it’s fortunate these drafts are due to appear just before a series of holidays in a large number of places.

I hesitate to mention it but there is another way to proof these drafts.

If you have XML-experienced visitors, you could take turns reading the drafts to each other. It was a technique used by copyists many years ago where one person read and two others took down the text. The two versions were then compared to each other and the original.

Even with a great reading voice, I’m not certain many people would be up to that sort of exercise.

PS: I will post on the new drafts as soon as they are published.