Archive for the ‘XQuery’ Category

XML Prague 2017 is coming

Sunday, October 16th, 2016

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I'm happy to announce that the call for papers for XML Prague 2017 is finally open. We are looking forward to your interesting submissions related to XML. We have switched from CMT to EasyChair for managing the submission process – we hope that the new system will have fewer quirks for users than the previous one.

We are sorry for a slightly delayed start compared to past years. But we had to set up a new non-profit organization for running the conference, and sometimes we felt like characters from Kafka's Der Process during this process.

We are now working hard on redesigning and opening registration. The process should be smoother than in the past.

But these are just implementation details. XML Prague will again be the three-day gathering of XML geeks, users, vendors, … which we have all come to enjoy each year. I'm looking forward to meeting you in Prague in February.

Conference: February 9-11, 2017.

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year's presentations (to gauge the competition): watch videos from XML Prague 2016 on its YouTube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barnyard posturing that is U.S. election politics this year!

XQuery Snippets on Gist

Thursday, October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service
https://gist.github.com/search?q=xquery&ref=searchresults&utf8=%E2%9C%93

Not a bad way to get in a daily dose of XQuery!

You can also try these Stack Overflow tags:

  • XQuery (3,000)
  • xquery-sql (293)
  • xquery-3.0 (70)
  • xquery-update (55)

Enjoy!

XQuery Working Group (Vanderbilt)

Saturday, September 24th, 2016

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction between the W3C XML Query Working Group and the XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

Data Provenance: A Short Bibliography

Tuesday, September 6th, 2016

The video Provenance for Database Transformations by Val Tannen ends with a short bibliography.

Links and abstracts for the items in Val’s bibliography:

Provenance Semirings by Todd J. Green, Grigoris Karvounarakis, Val Tannen. (2007)

We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.

Update Exchange with Mappings and Provenance by Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, Val Tannen. (2007)

We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions — expressing what data and sources a peer judges to be authoritative — which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in [20].

In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the ORCHESTRA prototype system.

Annotated XML: Queries and Provenance by J. Nathan Foster, Todd J. Green, Val Tannen. (2008)

We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications.

Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.

Containment of Conjunctive Queries on Annotated Relations by Todd J. Green. (2009)

We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.

Collaborative Data Sharing with Mappings and Provenance by Todd J. Green, dissertation. (2009)

A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings.

We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of its implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.

Provenance in Collaborative Data Sharing by Grigoris Karvounarakis, dissertation. (2009)

This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange — which publishes a participant’s updates and then translates others’ updates to the participant’s local schema and imports them — while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases.

To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the ORCHESTRA prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS and indexing techniques to speed up provenance querying, evaluate experimentally the performance of provenance querying and the benefits of our indexing techniques.

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, Val Tannen. (2011)

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

Circuits for Datalog Provenance by Daniel Deutch, Tova Milo, Sudeepa Roy, Val Tannen. (2014)

The annotation of the results of database queries with provenance information has many applications. This paper studies provenance for datalog queries. We start by considering provenance representation by (positive) Boolean expressions, as pioneered in the theories of incomplete and probabilistic databases. We show that even for linear datalog programs the representation of provenance using Boolean expressions incurs a super-polynomial size blowup in data complexity. We address this with an approach that is novel in provenance studies, showing that we can construct in PTIME poly-size (data complexity) provenance representations as Boolean circuits. Then we present optimization techniques that embed the construction of circuits into seminaive datalog evaluation, and further reduce the size of the circuits. We also illustrate the usefulness of our approach in multiple application domains such as query evaluation in probabilistic databases, and in deletion propagation. Next, we study the possibility of extending the circuit approach to the more general framework of semiring annotations introduced in earlier work. We show that for a large and useful class of provenance semirings, we can construct in PTIME poly-size circuits that capture the provenance.

Incomplete, but a substantial starting point for exploring data provenance and its relationship to (and use with) topic map merging.

To get a feel for “data provenance” just prior to the earliest reference here (2007), consider A Survey of Data Provenance Techniques by Yogesh L. Simmhan, Beth Plale, Dennis Gannon, published in 2005.

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources.

The provenance of data products generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update a data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

In this paper we create a taxonomy of data provenance techniques, and apply the classification to current research efforts in the field. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. Our synthesis can help those building scientific and business metadata-management systems to understand existing provenance system designs. The survey culminates with an identification of open research problems in the field.

Another rich source of reading material!

BaseX 8.5.3 Released!

Monday, August 15th, 2016

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

The changelog reads:

VERSION 8.5.3 (August 15, 2016) —————————————-

Minor bug fixes, improved thread-safety.

Still, not a bad idea to upgrade today!

Enjoy!

PS: You do remember that Congress is throwing XML in ever increasing amounts at the internet?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.

BaseX 8.5.1 Released! (XQuery Texts for Smart Phone?)

Saturday, July 16th, 2016

BaseX – 8.5.1 Released!

From the documentation page:

BaseX is both a light-weight, high-performance and scalable XML Database and an XQuery 3.1 Processor with full support for the W3C Update and Full Text extensions. It focuses on storing, querying, and visualizing large XML and JSON documents and collections. A visual frontend allows users to interactively explore data and evaluate XQuery expressions in realtime. BaseX is platform-independent and distributed under the free BSD License (find more in Wikipedia).

Besides Priscilla Walmsley's XQuery 2nd Edition and the BaseX documentation as a PDF file, what other XQuery resources would you store on a smart phone? (For occasional reference, leisure reading, etc.)

The Feynman Technique – Contest for Balisage 2016?

Tuesday, June 28th, 2016

The Best Way to Learn Anything: The Feynman Technique by Shane Parrish.

From the post:

There are four simple steps to the Feynman Technique, which I’ll explain below:

  1. Choose a Concept
  2. Teach it to a Toddler
  3. Identify Gaps and Go Back to The Source Material
  4. Review and Simplify

This made me think of the late-breaking Balisage 2016 papers posted by Tommie Usdin in email:

  • Saxon-JS – XSLT 3.0 in the Browser, by Debbie Lockett and Michael Kay, Saxonica
  • A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content), by Uche Ogbuji, Zepheira
  • A catalog of Functional programming idioms in XQuery 3.1, James Fuller, MarkLogic

New contest for Balisage?

Pick a concept from a Balisage 2016 presentation and you have sixty (60) seconds to explain it to Balisage attendees.

What do you think?

Remember, you can’t play if you don’t attend! Register today!

If Tommie agrees, the winner gets me to record a voice mail greeting for their phone! 😉

Balisage 2016 Program Posted! (Newcomers Welcome!)

Monday, May 23rd, 2016

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available
http://www.balisage.net/2016/Program.html

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1. http://balisage.net/XML-In-Web-Out/

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world's greatest concentration of markup theorists. Some participants are busy designing replacements for XML while others still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program: http://www.balisage.net/2016/Program.html

Symposium Program: http://balisage.net/XML-In-Web-Out/symposiumProgram.html

Even if you don't eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long-time participants like new document/information problems or new ways of looking at old ones. Not to mention that they, on occasion, learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing on what controversial terms mean, like "document," but that to one side, they are as good a crew as you are likely to meet.

Enjoy!

TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Thursday, May 5th, 2016

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI P5 has 563 elements, which may appear in varying, valid combinations. It also defines 256 attributes, which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project's definition of expected text encoding.
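
As a minimal sketch of such a QA check (my own illustration, with a hypothetical allowed-element list and file name; substitute your project's element inventory and document URI), this reports any element names outside the set your project expects:

xquery version "3.0";
(: sketch: report TEI element names that fall outside a project's allowed set :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";
let $allowed := ("TEI", "teiHeader", "fileDesc", "titleStmt", "title",
                 "publicationStmt", "sourceDesc", "text", "body", "p", "head")
for $name in distinct-values(doc("sample-tei.xml")//tei:* ! local-name(.))
where not($name = $allowed)
return $name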

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.

Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Tuesday, February 2nd, 2016

Call for Participation

Dates:

  • 25 March 2016 — Peer review applications due
  • 22 April 2016 — Paper submissions due
  • 21 May 2016 — Speakers notified
  • 10 June 2016 — Late-breaking News submissions due
  • 16 June 2016 — Late-breaking News speakers notified
  • 8 July 2016 — Final papers due from presenters of peer reviewed papers
  • 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
  • 1 August 2016 — Pre-conference Symposium
  • 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

  • Web application development with XML
  • Informal data models and consensus-based vocabularies
  • Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
  • Performance issues in parsing, XML database retrieval, or XSLT processing
  • Development of angle-bracket-free user interfaces for non-technical users
  • Semistructured data and full text search
  • Deployment of XML systems for enterprise data
  • Design and implementation of XML vocabularies
  • Case studies of the use of XML for publishing, interchange, or archiving
  • Alternatives to XML
  • the role(s) of XML in the application lifecycle
  • the role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of and processed as if from BOM to EOF is the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Wednesday, January 13th, 2016

In Congressional Roll Call Vote – The Documents (XQuery) we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>; the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<recorded-vote>
<legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
<vote>Nay</vote>
</recorded-vote>
<recorded-vote>
<legislator name-id="A000370" sort-field="Adams" unaccented-name="Adams" party="D" state="NC" role="legislator">Adams</legislator>
<vote>Yea</vote>
</recorded-vote>

These are only the first two (2) records, but the other <recorded-vote> elements vary from these only in their content.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id="A000374", state="LA", and role="legislator".

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

For example, to select the <legislator> element whose name-id attribute has the value "A000374", we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374"]

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.
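
As a small sketch of what that enables (my own illustration, not from the original post), this query returns the vote cast by the legislator with name-id "A000374":

xquery version "3.0";
(: look up one member's vote by the name-id attribute :)
for $rv in doc("http://clerk.house.gov/evs/2015/roll705.xml")/rollcall-vote/vote-data/recorded-vote
where $rv/legislator/@name-id = "A000374"
return concat(string($rv/legislator), " voted ", string($rv/vote))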

In preparation for our next post, try searching for "A000374" and limiting your search to the domain congress.gov.

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

Congressional Roll Call Vote – The Documents (XQuery)

Monday, January 11th, 2016

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data>, reads:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rollcall-vote PUBLIC "-//US Congress//DTDs/vote v1.0 20031119 //EN" "http://clerk.house.gov/evs/vote.dtd">
<?xml-stylesheet type="text/xsl" href="http://clerk.house.gov/evs/vote.xsl"?>
<rollcall-vote>
<vote-metadata>
<majority>R</majority>
<congress>114</congress>
<session>1st</session>
<chamber>U.S. House of Representatives</chamber>
<rollcall-num>705</rollcall-num>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<vote-type>YEA-AND-NAY</vote-type>
<vote-result>Passed</vote-result>
<action-date>18-Dec-2015</action-date>
<action-time time-etz="09:49">9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the Department of Veterans Affairs, and related agencies for the fiscal year ending September 30, 2016, and for other purposes</vote-desc>
<vote-totals>
<totals-by-party-header>
<party-header>Party</party-header>
<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>
</totals-by-party-header>
<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Independent</party>
<yea-total>0</yea-total>
<nay-total>0</nay-total>
<present-total>0</present-total>
<not-voting-total>0</not-voting-total>
</totals-by-party>
<totals-by-vote>
<total-stub>Totals</total-stub>
<yea-total>316</yea-total>
<nay-total>113</nay-total>
<present-total>0</present-total>
<not-voting-total>5</not-voting-total>
</totals-by-vote>
</vote-totals>
</vote-metadata>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element – aka “parent” element

<vote-metadata> – first child element in this document
XPath rollcall-vote/vote-metadata

<majority>R</majority> – first child of <vote-metadata>
XPath rollcall-vote/vote-metadata/majority

<congress>114</congress>

What do you think? It looks like it is at the same level as <majority>R</majority>, and it is: <congress> is called a sibling of <majority>R</majority>.
XPath rollcall-vote/vote-metadata/congress

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, let's get good at spotting direct XPaths.

Let's skip down in the markup until we come to <vote-totals> and, just inside it, <totals-by-party-header>. Neither is followed, at least not immediately, by its closing tag. That's a signal that the previous run of siblings has stopped and we have more steps in the XPath.

<totals-by-party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header

<party-header>Party</party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header/party-header

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>

The closing tag, shown by the "/", signals the end of the <totals-by-party-header> element.

</totals-by-party-header>

See how you do mapping out the remaining XPaths from the top of the document.

<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
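
If you want to check your mapping with a query, here is a small sketch of my own (not from the post) that walks those paths to report each party's yea total:

xquery version "3.0";
for $t in doc("http://clerk.house.gov/evs/2015/roll705.xml")/rollcall-vote/vote-metadata/vote-totals/totals-by-party
return concat(string($t/party), ": ", string($t/yea-total), " yeas")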

Tomorrow we are going to dive into the structure of the <vote-data> element and how to address the attributes therein and their values.

Enjoy!

Congressional Roll Call Vote and XQuery (A Do Over)

Sunday, January 10th, 2016

Once words are written, as an author I consider them to be fixed. Even typos should be acknowledged as corrected, not silently "improved" in the original text. Rather than editing what has been said, more words can cover the same ground with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post to be meaningful, so let's cover getting that set up first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon and unzip the file; you can either put saxon9he.jar in your Java classpath (if set) or invoke it with -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

#!/bin/bash
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:"$1"

After creating that file, which I very imaginatively named "runsaxon.sh," I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “runsaxon.sh” I type:

./runsaxon.sh ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The ">" sign is a redirection operator in Linux that sends the output to a file, in this case, ex-5.4.html.
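
Before tackling the congressional documents, you can smoke-test the setup with a one-line query. This is my own example, and the file name is arbitrary:

xquery version "3.0";
(: save as test.xq and run: ./runsaxon.sh test.xq :)
"Saxon is working on " || string(current-date())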

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Priscilla Walmsley. Highly recommended.

Priscilla has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

Congressional Roll Call and XQuery – (Week 1 of XQuery)

Saturday, January 9th, 2016

Truthfully it has been a little more than a week of daily XQuery posts; I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers to interesting software or techniques. Most of them are not "how-to" or step-by-step guides, but some of them are.

The posts on congressional roll call documents made sense to me, but then I wrote them. Part of what I sensed was that if you knew enough to follow my jumps, you were looking for specific details, like the correspondence across documents for attribute values, and not so much for my XQuery expressions.

On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations, one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I'm going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with terse, short expert tips at the end of each post as needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

I will have the first longer-style post up tomorrow, January 10, 2016, and we will see how the week develops from there.

Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Friday, January 8th, 2016

One of the things that yesterday's output lacked was the full names of the Georgia representatives, which aren't reported in the roll call documents.

But, what the roll call documents do have, is the following:

<recorded-vote>
<legislator name-id="J000288" sort-field="Johnson (GA)" unaccented-name="Johnson (GA)" party="D" state="GA" role="legislator">Johnson (GA)</legislator>
<vote>Nay</vote>
</recorded-vote>

With emphasis on name-id="J000288".

I call that attribute out because there is a sample data file, just for the House of Representatives, that has:

<bioguideID>J000288</bioguideID>

And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at Congress.gov.

The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:

[Image: Georgia delegation members, their votes, and their phone numbers]

That is the result of joining/merging two XML files hosted at congress.gov in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: http://clerk.house.gov/evs/2015/roll705.xml and the House Member Data File was: http://xml.house.gov/MemberData/MemberData.xml. The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery expression for merging the two XML files:

{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
     $mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li>{string($mem//official-name)} — {string($voter/vote)} — {string($mem//phone)}</li>}
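
Assuming the runsaxon.sh script from the setup post, a run such as ./runsaxon.sh roll705-join-merge.xq > roll705-ga.html will write the generated page to disk.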

At a minimum, you can auto-generate a listing of representatives from your state and their votes on any roll-call vote, and give readers their phone numbers to register their opinions.

This is a crude example of what you can do with XML, XQuery and online data from Congress.gov.

BTW, if you work in a "sharing" environment at a media outlet or library, you can also join/merge data that you hold internally, say the private phone number of a congressional aide, for example.

We are not nearly done with the congressional roll call vote but you can begin to see the potential that XQuery offers for very little effort. Not to mention that XQuery scripts can be rapidly adapted to your library or news room.

Try out today’s XQuery roll705-join-merge.xq.txt for yourself. (Apologies for the “.txt” extension but my ISP host has ideas about “safe” files to upload.)

I realize this first week has been kinda haphazard in its presentation. Suggestions welcome on improvements as this series goes forward.

The government and others are cranking out barely useful XML by the boatload. XQuery is your ticket to creating personalized presentations dynamically from that XML and other data.

Enjoy!

PS: For display of XML and XQuery, should I be using a different WordPress template? Suggestions?

Localizing A Congressional Roll Call Vote (XQuery)

Thursday, January 7th, 2016

I made some progress today on localizing a congressional roll call vote.

As you might expect, I chose to localize to the representatives from Georgia. 😉

I used a FLWOR expression to select legislators where the state attribute equals "GA".

Here is that expression:

<ul>
{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote
where $voter/legislator[@state = 'GA']
return <li>{string($voter/legislator)} — {string($voter/vote)}</li>
}</ul>

That makes our localized display a bit better for local readers, but only just.

See roll705-local.html.

What we need is more information than can be found at: http://clerk.house.gov/evs/2015/roll705.xml.

More on that tomorrow!

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let's reverse the order of the announcement, to put it in reader-friendly order:

Downloads

Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, the PostgreSQL 9.5 documentation, 9.14.2.2 XMLEXISTS, says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies: you will have to scroll for the subsection; there was no anchor at 9.14.2.2.

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”

UPSERT

A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

A Lesson about Let Clauses (XQuery)

Wednesday, January 6th, 2016

I was going to demonstrate how to localize roll call votes so that only representatives from your state and their votes were displayed for any given roll call vote.

That would enable libraries or local newsrooms, whose users/readers have little interest in how obscure representatives from other states voted, to pare the roll call vote list down to the votes that really matter: your state's representatives.

But remembering that I promised to clean up the listings in yesterday’s post that read:

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}

and kept repeating doc("http://clerk.house.gov/evs/2015/roll705.xml").

My thought was to replace that string with a variable declared by a let clause and then substitute that variable for the string.

To save you from the same mistake, combining a let clause with direct element constructors returns an error saying, in this case:

Left operand of ‘>’ needs parentheses

Not a terribly helpful error message.

I have found examples of using a let clause within a direct element constructor that would have defeated the rationale for declaring the variable to begin with.
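
For what it is worth, here is a minimal sketch (my reconstruction, not the query that failed) of a pattern that does work: bind the document in the let clause of a FLWOR and put the direct element constructor in its return clause:

xquery version "3.0";
(: the binding happens once, outside the element constructor :)
let $vote := doc("http://clerk.house.gov/evs/2015/roll705.xml")
return
  <html>
    <body>
      <h2>FINAL VOTE RESULTS FOR ROLL CALL {string($vote//rollcall-num)}</h2>
      <p><strong>Question:</strong> {string($vote//vote-question)}</p>
    </body>
  </html>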

Tomorrow I hope to post today’s content, which will enable you to display data relevant to local voters, news reporters, for any arbitrary roll call vote in Congress.

Mark today’s adventure as a mistake to avoid. 😉

Jazzing a Roll Call Vote – Part 3 (XQuery)

Tuesday, January 5th, 2016

I posted Congressional Roll Call Vote – Accessibility Issues earlier today to deal with some accessibility issues noticed by @XQuery with my color coding.

Today we are going to start at the top of the boring original roll call vote and work our way down using XQuery.

Be forewarned that the XQuery you see today will be shortened and cleaned up tomorrow. It works, but it's not best practice.

You will need to open up the source of the original roll call vote to see the elements I select in the path expressions.

Here is the XQuery that is the goal for today:

xquery version "3.0";
declare boundary-space preserve;
<html>
<head></head>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)} {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}
</body>
</html>

The title of the document we obtain with:

<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

Two quick things to notice:

First, for very simple documents like this one, I use “//” rather than writing out the path to the rollcall-num element. I already know it only occurs once in each rollcall document.
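
As a small illustration of my own (not from the post), both of these expressions select the same element in this document:

(: the abbreviated and the fully spelled-out path :)
doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num,
doc("http://clerk.house.gov/evs/2015/roll705.xml")/rollcall-vote/vote-metadata/rollcall-num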

Second, when using direct element constructors, the XQuery statements are enclosed by “{ }” brackets.

The rollcall number, date and time of the vote come next (I have introduced line breaks for readability):

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

If you compare my presentation of that string with the original, you will find the original has slightly more space between the items.

Here is the XSLT for that spacing:

<xsl:if test="legis-num[text()!='0']"><xsl:text>      </xsl:text><b><xsl:value-of select="legis-num"/></b></xsl:if>
<xsl:text>      </xsl:text><xsl:value-of select="vote-type"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-date"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-time"/><br/>

Since I already had white space separating my XQuery expressions, I just added to the prolog:

declare boundary-space preserve;

The last two lines:

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}

These are just standard queries for content. The string function extracts the content of the element you address.
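
As a one-line illustration of my own (not from the post):

string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-result)
(: returns "Passed" for this roll call :)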

Tomorrow we are going to talk about how to clean up and shorten the path statements and look around for information that should be at the top of this document, but isn’t!

PS: Did you notice that the vote totals, etc., are written as static data in the XML file? Curious, isn't it? Easy enough to generate from the voting data. I don't have an answer but thought you might.

Congressional Roll Call Vote – Accessibility Issues

Tuesday, January 5th, 2016

I posted a color coded version of a congressional roll call vote in Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway), using red for Republicans and blue for Democrats. @XQuery points out accessibility issues which depend upon color perception.

Color coding works better for me than the more traditional roman versus italic font face distinction, but let's improve the color coding to remove the accessibility issue.

The first question is what colors should I use for accessibility?

In searching to answer that question I found this thread at Edward Tufte’s site (of course), Choice of colors in print and graphics for color-blind readers, which has a rich list of suggestions and pointers to other resources.

One in particular, Color Universal Design (CUD), posted by Maarten Boers, has this graphic on colors:

[Image: Color Universal Design colorblind-safe palette]

Relying on that palette, I changed the roll call vote colors to orange for Republicans and sky blue for Democrats, and re-generated the roll call document.

[Image: re-generated roll call vote in the accessible colors]

Here is an accessible, but still color-coded, version of: FINAL VOTE RESULTS FOR ROLL CALL 705.

An upside of XML is that changing the presentation of all 429 votes took only a few seconds: change the stylesheet and re-generate the results.

Thanks to @XQuery for prodding me on the accessibility issue, which resulted in finding the thread at Tufte and the colorblind barrier-free color palette.


Other posts on congressional roll call votes:

1. Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

2. Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Monday, January 4th, 2016

Apologies, but I did not make as much progress on the congressional roll call vote as I had hoped.

I did find some interesting information about the vote.xsl stylesheet and managed to use color to code members of the House.

You probably remember me whining about how hard it is to distinguish roman from italic type to tell members of different parties apart: Jazzing Up Roll Call Votes For Fun and Profit (XQuery).

The XSLT code is worse than I imagined.

Here’s what I mean:

<b><center><font size="+2">FINAL VOTE RESULTS FOR ROLL CALL <xsl:value-of select="/rollcall-vote/vote-metadata/rollcall-num"/>
<xsl:if test="/rollcall-vote/vote-metadata/vote-correction[text()!='']">*</xsl:if></font></center></b>
<!-- <xsl:if test="/rollcall-vote/vote-metadata/majority[text() = 'D']"> -->
<xsl:if test="$Majority='D'">
<center>(Democrats in roman; Republicans in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>
<!-- <xsl:if test="/rollcall-vote/vote-metadata/majority[text() = 'R']"> -->
<xsl:if test="$Majority!='D'">
<center>(Republicans in roman; Democrats in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>

Which party is in the majority determines whether the names of a party's members appear in roman or italic type.

Now there’s a distinction that will be lost on a casual reader!

What’s more, if you are trying to reform the stylesheet, don’t look for R or D but again for majority party:

<xsl:template match="vote">
<!-- Handles formatting of Member names based on party. -->
<!-- <xsl:if test="../legislator/@party='R'"><xsl:value-of select="../legislator"/></xsl:if>
<xsl:if test="../legislator/@party='D'"><i><xsl:value-of select="../legislator"/></i></xsl:if> -->
<xsl:if test="../legislator/@party='I'"><u><xsl:value-of select="../legislator"/></u></xsl:if>
<xsl:if test="../legislator/@party!='I'">
<xsl:if test="../legislator/@party = $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<xsl:value-of select="../legislator"/>
</xsl:if>
<xsl:if test="../legislator/@party != $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<i><xsl:value-of select="../legislator"/></i>
</xsl:if>
</xsl:if>
</xsl:template>

As you can see, selecting by party has been commented out in favor of the roman/italic distinction based on the majority party.

I wanted to label the Republicans with an icon but my GIMP skills don't extend to making an icon of young mothers throwing their children under the carriage wheels of the wealthy to save them from a life of poverty and degradation. A bit much to get into an HTML button sized icon.

I settled for using the traditional red for Republicans and blue for Democrats and ran the modified stylesheet against roll705.xml locally.

[Image: roll call vote color-coded red and blue]

Here is FINAL VOTE RESULTS FOR ROLL CALL 705 as HTML.

Question: Are red and blue easier to distinguish than roman and italic?

If your answer is yes, why resort to typographic subtlety on something like party affiliation?

Are subtle distinctions used to confuse the uninitiated and unwary?

Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

Sunday, January 3rd, 2016

Roll call votes in the US House of Representatives are a staple of local, state and national news. If you go looking for the "official" version, what you find is as boring as your 5th grade civics class.

Trigger Warning: Boring and Minimally Informative Page Produced By Following Link: Final Vote Results For Roll Call 705.

Take a deep breath and load the page. It will open in a new browser tab. Boring. Yes? (You were warned.)

It is the recent roll call vote to fund the US government, take another slice of privacy from citizens, and make a number of other dubious policy choices. (Everything after the first comma depending upon your point of view.)

Whatever your politics though, you have to agree this is sub-optimal presentation, even for a government document.

This is no accident; sans the header, you will find the identical presentation of this very roll call vote at page H10696, Congressional Record for December 18, 2015 (pdf).

It is disappointing that so much XML, XSLT, XQuery, etc., has been wasted duplicating non-informative print formatting. Or should I say formatting less informative than XML makes possible?

Once the data is in XML, legend has it, users can transform that XML in ways more suited to their purposes and not those of the content providers.

I say “legend has it,” because we all know if content providers had their way, web navigation would be via ads and not bare hyperlinks. You want to see the next page? You must select the ad + hyperlink, waiting for the ad to clear before the resource appears.

I can summarize my opinion about content provider control over information legally delivered to my computer: Screw that!

If a content provider enables access to content, I am free to transform that content into speech, graphics, add information, take away information, in short do anything that my imagination desires and my skill enables.

Let’s take the roll call vote in the House of Representatives, Final Vote Results For Roll Call 705.

Just under the title you will read:

(Republicans in roman; Democrats in italic; Independents underlined)

Boring.

For a bulk display of voting results, we can do better than that.

What if we had small images to identify the respective parties? Here are some candidates (sic) for the Republicans:

[three candidate images omitted]

Of course we would have to reduce them to icon size, but XML processing is rarely ever just XML processing. Nearly every project includes some other skill set as well.

Which one do you think looks more neutral? 😉

They would certainly be more colorful and, depending upon your inclinations, more fun to play about with than the difference between roman and italic. Yes?

Presentation of the data in http://clerk.house.gov/evs/2015/roll705.xml is only one of the possibilities that XQuery offers. Follow along and offer your suggestions for changes, additions and modifications.

First steps:

In the browser tab with Final Vote Results For Roll Call 705, use CNTR-u to view the page source. First notice that the boring web presentation is controlled by http://clerk.house.gov/evs/vote.xsl.

Copy and paste http://clerk.house.gov/evs/vote.xsl into a new browser tab and press return. The resulting xsl:stylesheet is responsible for generating the original page, from the vote totals to the column presentation of the results.

Pay particular attention to the generation of totals from the <vote-data> element and its children. That generation is powered by these lines in vote.xsl:

<xsl:apply-templates select="/rollcall-vote/vote-metadata"/>
<!-- Create total variables based on counts. -->
<xsl:variable name="y" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Yea'])"/>
<xsl:variable name="a" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Aye'])"/>
<xsl:variable name="yeas" select="$y + $a"/>
<xsl:variable name="nay" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Nay'])"/>
<xsl:variable name="no" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='No'])"/>
<xsl:variable name="nays" select="$nay + $no"/>
<xsl:variable name="nvs" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Not Voting'])"/>
<xsl:variable name="presents" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Present'])"/>
<br/>

(Not entirely, I omitted the purely formatting stuff.)
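
The same totals fall out of a few lines of XQuery. A minimal sketch, assuming your processor is allowed to dereference the URL with doc() (otherwise point it at a local copy):

xquery version "1.0";
let $votes := doc("http://clerk.house.gov/evs/2015/roll705.xml")
              /rollcall-vote/vote-data/recorded-vote/vote
return
  <vote-totals>
    <yeas>{count($votes[. = 'Yea']) + count($votes[. = 'Aye'])}</yeas>
    <nays>{count($votes[. = 'Nay']) + count($votes[. = 'No'])}</nays>
    <not-voting>{count($votes[. = 'Not Voting'])}</not-voting>
    <present>{count($votes[. = 'Present'])}</present>
  </vote-totals>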

For tomorrow I will be working on a more “visible” way to identify political party affiliation and “borrowing” the count code from vote.xsl.

Enjoy!


You may be wondering what XQuery has to do with topic maps. Well, if you think about it, every time we select, aggregate, etc., data, we are making choices based on notions of subject identity.

That is, we think the data we are manipulating represents some subjects and/or information about some subjects that we find sensible (for some unstated reason) to put together for others to read.

The first step towards a topic map, however, is the putting of information together so we can judge what subjects need explicit representation and how we choose to identify them.

Prior topic map work was never explicit about how we get to a topic map; putting that possibly divisive question behind us, we simply started with topic maps, ab initio.

I was in the car when we took that turn and for the many miles since then. I have come to think that a better starting place is choosing subjects, what we want to say about them and how we wish to say it, so that we have only so much machinery as is necessary for any particular set of subjects.

Some subjects can be identified by IRIs, others by multi-dimensional vectors, still others by unspecified processes of deep learning, etc. Which ones we choose will depend upon the immediate ROI from subject identity and relationships between subjects.

I don’t need triples, for instance, to recognize natural languages to a sufficient degree of accuracy. Unnecessary triples, topics or associations are just padding. If you are on a per-triple contract, they make sense, otherwise, not.

A long way of saying that subject identity lurks just underneath the application of XQuery and we will see where it is useful to call subject identity to the fore.

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery (Part 2)

Saturday, January 2nd, 2016

Despite heavy carousing during the holidays, you may still remember Great R packages for data import, wrangling & visualization [+ XQuery], where I re-sorted the table by Sharon Machlis, to present the R packages in package name order.

I followed that up with: Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery, where I detailed the travails of trying to sort the software packages by their short descriptions, again in alphabetical order. My assumption in that post was that either the spaces or the "," commas in the descriptions were fouling the order by.

That wasn’t the case, which I should have known because the string operator always returns a string. That is, the spaces and "," inside are just parts of a string, nothing more.

The up-side of the problem was that I spent more than a little while with Walmsley’s XQuery book, searching for ever more esoteric answers.

Here’s the failing XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

And here is the working XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

Here is the offending line:

order by lower-case(string($row/td[2]/a))

My first mistake was the inclusion of "/a" in the path. Using string() on $row/td[1], that is, without /a at the end of the path, gives the original script the same result. (Run that for yourself on favorite-R-packages.xml.)

Make any path as long as required and no longer!

My second mistake was not checking the XPath immediately upon the failure of the sort. (The simplest answer is usually the correct one.)
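
For completeness, here is why the failing query silently fell back to document order: td[2] has no <a> child, so the path returned the empty sequence, and string() on an empty sequence returns the zero-length string, leaving every sort key equal. A quick check makes it visible:

xquery version "1.0";
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
return concat("[", string($row/td[2]/a), "]")

Every row comes back as an empty [], that is, identical sort keys across the board.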

Enjoy!


Update: Removed the quote marks around (string($row/td[2])) in both queries; they were part of an explanation that did not make the cut. Thanks to XQuery for the catch!

XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Friday, January 1st, 2016

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

Using a virtual machine to install the latest version of Ubuntu (15.10), which had the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations, so you are likely to spend some time loading Linux headers and the like to compile software.

The advantage of a virtual box is that I don’t have to risk doing something dumb or out of fatigue to my working setup. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day, as I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort, but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the first of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

apt-get install dkms build-essential linux-headers-generic

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.

XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Thursday, December 31st, 2015

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

1. Download a source distribution of Xerces-C 3.1.2

2. Build Xerces-C

cd xerces-c-3.1.2/
./configure
make

4. Build XQilla

cd xqilla/
./configure --with-xerces=`pwd`/../xerces-c-3.1.2/
make

A few notes that may help:

Obtain Xerces-C 3.1.2 from the project homepage.

Xerces project homepage. Home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and, Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

the README is presuming you built xerces-c-3.1.2 in a sub-directory of the XQilla source. You could; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2

No error messages, so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla 
   xqilla-commandline.o  
-L/home/patrick/working/xerces-c-3.1.2/src 
  /home/patrick/working/xerces-c-3.1.2/src/
.libs/libxerces-c.so ./.libs/libxqilla.so -lnsl -lpthread -Wl,-rpath 
-Wl,/home/patrick/working/xerces-c-3.1.2/src
/usr/bin/ld: warning: libicuuc.so.55, needed by 
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so, 
   not found (try using -rpath or -rpath-link)
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `uset_close_55'
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory `/home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1

To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have libicuuc.so.55.

If I manually upgrade libraries, I might create an inconsistency package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” but you should see some of the lists you can find at Congress.gov. Valid XML, I’ll concede, but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Wednesday, December 30th, 2015

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or use strings with no whitespace.

How often do you see strings with no whitespace? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation devtools
misc installr
data import readxl
data import, data export googlesheets
data import RMySQL
data import readr
data import, data export rio
data analysis psych
data wrangling, data analysis sqldf
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
data import, web scraping rvest
data wrangling, data analysis dplyr
data wrangling plyr
data wrangling reshape2
data wrangling tidyr
data wrangling, data analysis data.table
data wrangling stringr
data wrangling lubridate
data wrangling, data analysis zoo
data display editR
data display knitr
data display, data wrangling listviewer
data display DT
data visualization ggplot2
data visualization dygraphs
data visualization googleVis
data visualization metricsgraphics
data visualization RColorBrewer
data visualization plotly
mapping leaflet
mapping choroplethr
mapping tmap
misc fitbitScraper
Web analytics rga
Web analytics RSiteCatalyst
package development roxygen2
data visualization shiny
misc openxlsx
data wrangling, data analysis gmodels
data wrangling car
data visualization rcdimple
data wrangling foreach
data acquisition downloader
data wrangling scales
data visualization plotly

Err, that’s not right!

The XQuery from yesterday:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

XQuery from today, changes in red (lines 5 and 6):

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[2]/a))
  6. return <tr>{$row/td[2]} {$row/td[1]}</tr>
  7. }</table>
  8. </html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully, it does have a sort order, just not the one you expected: document order, the rows returned exactly as they appear in the source document.

Here’s a snippet of that document:

<table>
<tr>
<td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also 
essential if you want to easily install other packages from GitHub. Install it! 
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/" 
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
<td>Hadley Wickham & others</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili & others</td>
</tr>
<tr>
<td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>
</tr>
...
</table>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick the one that can be most easily adapted to the next soiled data that comes through the door.

PS: Neither Saxon (9.7) nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

Great R packages for data import, wrangling & visualization [+ XQuery]

Tuesday, December 29th, 2015

Great R packages for data import, wrangling & visualization by Sharon Machlis.

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below shows my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis

Enjoy!


I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon’s table, I copied the table to a separate file and ran this XQuery:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, on an all-lower-case version of the first <td> element, while returning the element as written in the original table. That gives better (IMHO) sort order than the default codepoint order, where UPPERCASE sorts ahead of lowercase.
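
To see what lower-case() is buying you, sort a few of the package names without it. The default codepoint collation puts every uppercase letter ahead of any lowercase one:

xquery version "1.0";
(: returns DT, RMySQL, car, dplyr in that order :)
for $name in ("DT", "car", "RMySQL", "dplyr")
order by $name
return $name

Wrap the sort key in lower-case() and you get car, dplyr, DT, RMySQL instead.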

This same technique should make you the master of any simple tables you encounter on the web.

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

Facets for Christmas!

Friday, December 25th, 2015

Facet Module

From the introduction:

Faceted search has proven to be enormously popular in the real world applications. Faceted search allows user to navigate and access information via a structured facet classification system. Combined with full text search, it provides user with enormous power and flexibility to discover information.

This proposal defines a standardized approach to support the Faceted search in XQuery. It has been designed to be compatible with XQuery 3.0, and is intended to be used in conjunction with XQuery and XPath Full Text 3.0.

Imagine my surprise, after opening Christmas presents with family, at seeing a tweet by XQuery announcing yet another Christmas present:

“Facets”: A new EXPath spec w/extension functions & data models to enable faceted navigation & search in XQuery http://expath.org/spec/facet

The EXPath homepage says:

XPath is great. XPath-based languages like XQuery, XSLT, and XProc, are great. The XPath recommendation provides a foundation for writing expressions that evaluate the same way in a lot of processors, written in different languages, running in different environments, in XML databases, in in-memory processors, in servers or in clients.

Supporting so many different kinds of processor is a wonderful thing. But this also constrains which features are feasible at the XPath level and which are not. In the years since the release of XPath 2.0, experience has gradually revealed some missing features.

EXPath exists to provide specifications for such missing features in a collaborative- and implementation-independent way. EXPath also provides facilities to help and deliver implementations to as many processors as possible, via extensibility mechanisms from the XPath 2.0 Recommendation itself.

Other projects exist to define extensions for XPath-based languages or languages using XPath, as the famous EXSLT, and the more recent EXQuery and EXProc projects. We think that those projects are really useful and fill a gap in the XML core technologies landscape. Nevertheless, working at the XPath level allows common solutions when there is no sense in reinventing the wheel over and over again. This is just following the brilliant idea of the W3C’s XSLT and XQuery working groups, which joined forces to define XPath 2.0 together. EXPath purpose is not to compete with other projects, but collaborate with them.

Be sure to visit the resources page. It has a manageable listing of processors that handle extensions.

What would you like to see added to XPath?
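
Until implementations of the Facet spec land in your toolchain, plain XQuery 3.0 can already approximate simple facet counts with group by. A sketch against the favorite-R-packages.xml file from the earlier posts (local path assumed):

xquery version "3.0";
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
let $category := normalize-space($row/td[2])
group by $category
order by count($row) descending
return <facet name="{$category}" count="{count($row)}"/>

Each facet element is a navigation bucket, category name plus hit count, which is the heart of faceted search.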

Enjoy!

An XQuery Module For Simplifying Semantic Namespaces

Wednesday, December 23rd, 2015

An XQuery Module For Simplifying Semantic Namespaces by Kurt Cagle.

From the post:

While I enjoy working with the MarkLogic 8 server, there are a number of features about the semantics library there that I still find a bit problematic. Declaring namespaces for semantics in particular is a pain—I normally have trouble remembering the namespaces for RDF or RDFS or OWL, even after working with them for several years, and once you start talking about namespaces that are specific to your own application domain, managing this list can get onerous pretty quickly.

I should point out however, that namespaces within semantics can be very useful in helping to organize and design an ontology, even a non-semantic ontology, and as such, my applications tend to be namespace rich. However, when working with Turtle, Sparql, RDFa, and other formats of namespaces, the need to incorporate these namespaces can be a real showstopper for any developer. Thus, like any good developer, I decided to automate my pain points and create a library that would allow me to simplify this process.

The code given here is in turtle and xquery, but I hope to build out similar libraries for use in JavaScript shortly. When I do, I’ll update this article to reflect those changes.

If you are forced to use a MarkLogic 8 server, great post on managing semantic namespaces.
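
The core trick, a lookup table from prefix to namespace IRI plus an expansion function, is easy to sketch in plain XQuery. This is my own illustration, not Kurt’s MarkLogic library:

xquery version "1.0";
declare variable $local:prefixes :=
  <namespaces>
    <ns prefix="rdf"  uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
    <ns prefix="rdfs" uri="http://www.w3.org/2000/01/rdf-schema#"/>
    <ns prefix="owl"  uri="http://www.w3.org/2002/07/owl#"/>
  </namespaces>;

declare function local:expand($curie as xs:string) as xs:string {
  (: split the CURIE and look the prefix up in the table :)
  concat($local:prefixes/ns[@prefix = substring-before($curie, ":")]/@uri,
         substring-after($curie, ":"))
};

local:expand("rdfs:label")

That returns http://www.w3.org/2000/01/rdf-schema#label; grow the table once and stop retyping namespace declarations.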

If you have a choice of tools, something to consider before you willingly choose to use a MarkLogic 8 server.

I first saw this in a tweet by XQuery.

My Bad – You Are Not! 747 Edits Away From Using XML Tools

Thursday, December 17th, 2015

The original, unedited post is below but in response to comments, I checked the XQuery, XPath, XSLT and XQuery Serialization 3.1 files in Chrome (CNTR-U) before saving them.

All the empty elements were properly closed.

I then saved the files and re-opened in Emacs, to discover that Chrome had stripped the “/” from the empty elements, which then caused BaseX to complain. It was an accurate complaint but the files I was tossing against BaseX were not the files as published by the W3C.

So now I need to file a bug report on Chrome, Version 47.0.2526.80 (64-bit) on Ubuntu, for mangling closed empty elements.


You could tell in XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations! that I was really excited to see the new drafts hit the street.

Me and my big mouth.

I grabbed copies of all three and tossed the XQuery draft against an xquery to create a list of all the paths in it. Simple enough.
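
The path-lister was along these lines (my reconstruction; the exact query didn’t make it into the post):

xquery version "1.0";
(: one entry per distinct element path, root first :)
distinct-values(
  for $e in doc("/home/patrick/working/w3c/XQuery3.1.html")//*
  return string-join($e/ancestor-or-self::*/name(), "/")
)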

The results weren’t.

Here is the first error message:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 68): The element type "link" must be terminated by the matching end-tag "</link>".

Ouch!

I corrected that and running the query a second time I got:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 68): The element type "meta" must be terminated by the matching end-tag "</meta>".

The <meta> elements appear on lines three and four.

On the third try:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 69): The element type "img" must be terminated by the matching end-tag "</img>".

There are 3 <img> elements that are not closed.

I’m getting fairly annoyed at this point.

Fourth try:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 78): The element type "br" must be terminated by the matching end-tag "</br>".

Of course at this point I revert to grep and discover there are 353 <br> elements that are not closed.

Sigh, nothing to do but correct and soldier on.

Fifth attempt.

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 17618): The element type "hr" must be terminated by the matching end-tag "</hr>".

There are 2 <hr> elements that are not closed.

A total of 361 edits in order to use XML based tools with the most recent XQuery 3.1 Candidate draft.

The most recent XPath 3.1 has 238 empty elements that aren’t closed (same elements as XQuery 3.1).

The XSLT and XQuery Serialization 3.1 draft has 149 empty elements that aren’t closed, same as the other but with the addition of four <col> elements that weren’t closed.

Grand total: 747 edits in order to use XML tools.

Not an editorial but a production problem. A rather severe one it seems to me.

Anyone who wants to use XML tools on these drafts will have to perform the same edits.