Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 10, 2017

XQuery Ready CIA Vault7 Files

Filed under: CIA,Government,XML,XQuery — Patrick Durusau @ 11:26 am

I have extracted the HTML files from WikiLeaks Vault7 Year Zero 2017 V 1.7z, processed them with Tidy (see note on correction below), and uploaded the “tidied” HTML files to: Vault7-CIA-Clean-HTML-Only.

Beyond the usual activities of Tidy, I did have to correct the file page_26345506.html: by creating a comment around one line of content:

<!-- <declarations><string name="½ö"></string></declarations><p><br> -->

Otherwise, the files are only corrected HTML markup with no other changes.

The HTML compresses well, 7811 files coming in at 3.4 MB.

Demonstrate the power of basic XQuery skills!
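For instance, a couple of lines of XQuery will inventory the whole collection (a sketch only, assuming BaseX with the tidied files loaded from a local vault7/ directory; the database name is a placeholder):

```xquery
(: list the title of every tidied Vault7 page, with its document URI :)
for $doc in collection("vault7")
let $title := ($doc//*:title)[1]/normalize-space()
order by $title
return $title || "  " || document-uri($doc)
```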

Enjoy!

March 9, 2017

How Bad Is Wikileaks Vault7 (CIA) HTML?

Filed under: HTML,Wikileaks,WWW,XQuery — Patrick Durusau @ 8:29 pm

How bad?

Unless you want to hand correct 7,809 HTML files for use with XQuery, grab the latest copy of Tidy.

It’s not the worst HTML I have ever seen, but put that in the context of having seen a lot of really poor HTML.

I’ve “tidied” up a test collection and will grab a fresh copy of the files before producing and releasing a clean set of the HTML files.

I'm producing a document collection for XQuery processing, working towards something suitable for applying NLP and other tools.

February 12, 2017

Oxford Dictionaries Thesaurus Data – XQuery

Filed under: XML,XQuery — Patrick Durusau @ 8:50 pm

Retrieve Oxford Dictionaries API Thesaurus Data as XML with XQuery and BaseX by Adam Steffanick.

From the post:

We retrieved thesaurus data from the Oxford Dictionaries application programming interface (API) and returned Extensible Markup Language (XML) with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial illustrates how to retrieve thesaurus data—synonyms and antonyms—as XML from the Oxford Dictionaries API with XQuery and BaseX.

The Oxford Dictionaries API returns JavaScript Object Notation (JSON) responses that yield undesired XML structures when converted automatically with BaseX. Fortunately, we’re able to use XQuery to fill in some blanks after converting JSON to XML. My GitHub repository od-api-xquery contains XQuery code for this tutorial.
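The shape of the request looks roughly like this (a sketch only; the endpoint path follows the API's documented pattern at the time, the app_id/app_key values are placeholders, and Adam's repository has the working code):

```xquery
(: BaseX: fetch synonyms for "happy" as JSON, then convert to XML :)
let $response := http:send-request(
  <http:request method="get">
    <http:header name="app_id"  value="YOUR_APP_ID"/>
    <http:header name="app_key" value="YOUR_APP_KEY"/>
  </http:request>,
  "https://od-api.oxforddictionaries.com/api/v1/entries/en/happy/synonyms"
)[2]
return json:parse($response)
```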

If you are having trouble staying at your computer during this unreasonably warm spring, this XQuery/Oxford Dictionary tutorial may help!

Ok, maybe that is an exaggeration but only a slight one. 😉

Enjoy!

February 8, 2017

Functional Programming in Erlang – MOOC – 20 Feb. 2017

Filed under: Elixir,Erlang,Functional Programming,XQuery — Patrick Durusau @ 9:06 pm

Functional Programming in Erlang with Simon Thompson (co-author of Erlang Programming)

From the webpage:

Functional programming is increasingly important in providing global-scale applications on the internet. For example, it’s the basis of the WhatsApp messaging system, which has over a billion users worldwide.

This free online course is designed to teach the principles of functional programming to anyone who’s already able to program, but wants to find out more about the novel approach of Erlang.

Learn the theory of functional programming and apply it in Erlang

The course combines the theory of functional programming and the practice of how that works in Erlang. You’ll get the opportunity to reinforce what you learn through practical exercises and more substantial, optional practical projects.

Over three weeks, you’ll:

  • learn why Erlang was developed, how its design was shaped by the context in which it was used, and how Erlang can be used in practice today;
  • write programs using the concepts of functional programming, including, in particular, recursion, pattern matching and immutable data;
  • apply your knowledge of lists and other Erlang data types in your programs;
  • and implement higher-order functions using generic patterns.

The course will also help you if you are interested in Elixir, which is based on the same virtual machine as Erlang, and shares its fundamental approach as well as its libraries, and indeed will help you to get going with any functional language, and any message-passing concurrency language – for example, Google Go and the Akka library for Scala/Java.

If you are not excited already, remember that XQuery is a functional programming language. What if your documents were “immutable data?”
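Immutability in XQuery is easy to see in a couple of lines: "changing" a value always yields a new value and leaves the original intact.

```xquery
(: insert-before returns a new sequence; $xs itself never changes :)
let $xs := (1, 2, 3)
let $ys := insert-before($xs, 1, 0)
return (string-join($xs ! string(), ","), "|", string-join($ys ! string(), ","))
(: 1,2,3 | 0,1,2,3 :)
```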

Use #FLerlangfunc to see Twitter discussions on the course.

That looks like a committee drafted hashtag. 😉

February 1, 2017

Digital Humanities / Studies: U.Pitt.Greenberg

Filed under: Digital Research,Humanities,Literature,Social Sciences,XML,XQuery — Patrick Durusau @ 9:13 pm

Digital Humanities / Studies: U.Pitt.Greenberg maintained by Elisa E. Beshero-Bondar.

I discovered this syllabus and course materials by accident when one of its modules on XQuery turned up in a search. Backing out of that module I discovered this gem of a digital humanities course.

The course description:

Our course in “digital humanities” and “digital studies” is designed to be interdisciplinary and practical, with an emphasis on learning through “hands-on” experience. It is a computer course, but not a course in which you learn programming for the sake of learning a programming language. It’s a course that will involve programming, and working with coding languages, and “putting things online,” but it’s not a course designed to make you, in fifteen weeks, a professional website designer. Instead, this is a course in which we prioritize what we can investigate in the Humanities and related Social Sciences fields about cultural, historical, and literary research questions through applications in computer coding and programming, which you will be learning and applying as you go in order to make new discoveries and transform cultural objects—what we call “texts” in their complex and multiple dimensions. We think of “texts” as the transmittable, sharable forms of human creativity (mainly through language), and we interface with a particular text in multiple ways through print and electronic “documents.” When we refer to a “document,” we mean a specific instance of a text, and much of our work will be in experimenting with the structures of texts in digital document formats, accessing them through scripts we write in computer code—scripts that in themselves are a kind of text, readable both by humans and machines.

Your professors are scholars and teachers of humanities, not computer programmers by trade, and we teach this course from our backgrounds (in literature and anthropology, respectively). We teach this course to share coding methods that are highly useful to us in our fields, with an emphasis on working with texts as artifacts of human culture shaped primarily with words and letters—the forms of “written” language transferable to many media (including image and sound) that we can study with computer modelling tools that we design for ourselves based on the questions we ask. We work with computers in this course as precision instruments that help us to read and process great quantities of information, and that lead us to make significant connections, ask new kinds of questions, and build models and interfaces to change our reading and thinking experience as people curious about human history, culture, and creativity.

Our focus in this course is primarily analytical: to apply computer technologies to represent and investigate cultural materials. As we design projects together, you will gain practical experience in editing and you will certainly fine-tune your precision in writing and thinking. We will be working primarily with eXtensible Markup Language (XML) because it is a powerful tool for modelling texts that we can adapt creatively to our interests and questions. XML represents a standard in adaptability and human-readability in digital code, and it works together with related technologies with which you will gain working experience: You’ll learn how to write XPath expressions: a formal language for searching and extracting information from XML code which serves as the basis for transforming XML into many publishable forms, using XSLT and XQuery. You’ll learn to write XSLT: a programming “stylesheet” transforming language designed to convert XML to publishable formats, as well as XQuery, a query (or search) language for extracting information from XML files bundled collectively. You will learn how to design your own systematic coding methods to work on projects, and how to write your own rules in schema languages (like Schematron and Relax-NG) to keep your projects organized and prevent errors. You’ll gain experience with an international XML language called TEI (after the Text Encoding Initiative) which serves as the international standard for coding digital archives of cultural materials. Since one of the best and most widely accessible ways to publish XML is on the worldwide web, you’ll gain working experience with HTML code (a markup language that is a kind of XML) and styling HTML with Cascading Stylesheets (CSS). 
We will do all of this with an eye to your understanding how coding works—and no longer relying without question on expensive commercial software as the “only” available solution, because such software is usually not designed with our research questions in mind.

We think you’ll gain enough experience at least to become a little dangerous, and at the very least more independent as investigators and makers who wield computers as fit instruments for your own tasks. Your success will require patience, dedication, and regular communication and interaction with us, working through assignments on a daily basis. Your success will NOT require perfection, but rather your regular efforts throughout the course, your documenting of problems when your coding doesn’t yield the results you want. Homework exercises are a back-and-forth, intensive dialogue between you and your instructors, and we plan to spend a great deal of time with you individually over these as we work together. Our guiding principle in developing assignments and working with you is that the best way for you to learn and succeed is through regular practice as you hone your skills. Our goal is not to make you expert programmers (as we are far from that ourselves)! Rather, we want you to learn how to manipulate coding technologies for your own purposes, how to track down answers to questions, how to think your way algorithmically through problems and find good solutions.

Skimming the syllabus rekindles an awareness of the distinction between the “hard” sciences and the “difficult” ones.

Enjoy!

Update:

After yesterday’s post, Elisa Beshero-Bondar tweeted that this one course is now two:

At a new homepage: newtFire {dh|ds}!

Enjoy!

January 29, 2017

Up-Translation and Up-Transformation … [Balisage Rocks!]

Filed under: Conferences,XML,XPath,XQuery,XSLT — Patrick Durusau @ 8:48 pm

Up-Translation and Up-Transformation: Tasks, Challenges, and Solutions (a Balisage pre-conference symposium)

When & Where:

Monday July 31, 2017
CAMBRiA Hotel, Rockville, MD USA

Chair: Evan Owens, Cenveo

You need more details than that?

Ok, from the webpage:

Increasing the granularity and/or specificity of markup is an important task in many different content and information workflows. Markup transformations might involve tasks such as high-level structuring, detailed component structuring, or enhancing information by matching or linking to external vocabularies or data. Enhancing markup presents numerous secondary challenges including lack of structure of the inputs or inconsistency of input data down to the level of spelling, punctuation, and vocabulary. Source data for up-translation may be XML, word processing documents, plain text, scanned & OCRed text, or databases; transformation goals may be content suitable for page makeup, search, or repurposing, in XML, JSON, or any other markup language.

The range of approaches to up-transformation is as varied as the variety of specifics of the input and required outputs. Solutions may combine automated processing with human review or could be 100% software implementations. With the potential for requirements to evolve over time, tools may have to be actively maintained and enhanced.

The presentations in this pre-conference symposium will include goals, challenges, solutions, and workflows for significant XML enhancements, including approaches, tools, and techniques that may potentially be used for a variety of other tasks. The symposium will be of value not only to those facing up-translation and transformation but also to general XML practitioners seeking to get the most out of their data.

If I didn’t know better, up-translation and up-transformation sound suspiciously like conferred properties of topic maps fame.

Well, modulo that conferred properties could be predicated on explicit subject identity and not hidden in the personal knowledge of the author.

There are two categories of up-translation and up-transformation:

  1. Ones that preserve jobs, like spaghetti COBOL code, and
  2. Ones that support easy long term maintenance.

While writing your paper for the pre-conference symposium, ask which category fits yours best.

January 24, 2017

XQuery Update 3.0 – WG Notes

Filed under: XML,XQuery — Patrick Durusau @ 7:41 pm

A tweet by Jonathan Robie points out:

XQuery Update Facility 3.0

and

XQuery Update Facility 3.0 Requirements and Use Cases

have been issued as working group notes.

It’s not clear to me why anyone continues to think of data as mutable.

Mutable data was an artifact of limited storage and processing.

Neither of which obtains in a modern computing environment. (Yes, there are edge cases, the Large Hadron Collider for example.)

Still, if you’re interested, read on!
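For the curious, XQuery Update itself sidesteps in-place mutation at the language level: update expressions build a pending update list that is applied atomically only when the query ends. A minimal sketch (the document and element names are hypothetical):

```xquery
(: mark one record as archived; nothing visibly changes until the
   pending update list is applied at the end of the query :)
insert node <status>archived</status>
  as last into doc("records.xml")/records/record[@id = "r1"]
```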

BaseX 8.6 Is Out!

Filed under: BaseX,XML,XQuery — Patrick Durusau @ 5:50 pm

Email from Christian Grün arrived today with great news!

BaseX 8.6 is out! The new version of our XML database system and XQuery 3.1 processor includes countless improvements and optimizations. Many of them have been triggered by your valuable feedback, many others are the result of BaseX used in productive and commercial environments.

The most prominent new features are:

LOCKING

  • jobs without database access will never be locked
  • read transactions are now favored (adjustable via FAIRLOCK)

RESTXQ

  • file monitoring was improved (adjustable via PARSERESTXQ)
  • authentication was reintroduced (no passwords anymore in web.xml)
  • id session attributes will show up in log data

DBA

  • always accessible, even if job queue is full
  • pagination of table results

INDEXING

  • path index improved: distinct values storage for numeric types

XQUERY

  • aligned with latest version of XQuery 3.1
  • updated functions: map:find, map:merge, fn:sort, array:sort, …
  • enhancements in User, Process, Jobs, REST and Database Module

CSV DATA

  • improved import/export compatibility with Excel data

Visit http://basex.org to find the latest release, and check out http://docs.basex.org/ to get more information. As always, we are looking forward to your feedback. Enjoy!
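Of the updated functions, fn:sort is easy to try; per XQuery 3.1 it takes an optional collation and a key function:

```xquery
(: sort descending by negating the key :)
sort((3, 1, 2), (), function($n) { -$n })  (: 3, 2, 1 :)
```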

If one or more of your colleagues goes missing this week, suspect BaseX 8.6 and the new W3C drafts for XQuery are responsible.

😉

XQuery/XSLT Proposals – Comments by 28 February 2017

Filed under: XML,XPath,XQuery,XSLT — Patrick Durusau @ 4:09 pm

Proposed Recommendations Published for XQuery WG and XSLT WG.

From the webpage:

The XML Query Working Group and XSLT Working Group have published a Proposed Recommendation for four documents:

  • XQuery and XPath Data Model 3.1: This document defines the XQuery and XPath Data Model 3.1, which is the data model of XML Path Language (XPath) 3.1, XSL Transformations (XSLT) Version 3.0, and XQuery 3.1: An XML Query Language. The XQuery and XPath Data Model 3.1 (henceforth “data model”) serves two purposes. First, it defines the information contained in the input to an XSLT or XQuery processor. Second, it defines all permissible values of expressions in the XSLT, XQuery, and XPath languages.
  • XPath and XQuery Functions and Operators 3.1: The purpose of this document is to catalog the functions and operators required for XPath 3.1, XQuery 3.1, and XSLT 3.0. It defines constructor functions, operators, and functions on the datatypes defined in XML Schema Part 2: Datatypes Second Edition and the datatypes defined in XQuery and XPath Data Model (XDM) 3.1. It also defines functions and operators on nodes and node sequences as defined in the XQuery and XPath Data Model (XDM) 3.1.
  • XML Path Language (XPath) 3.1: XPath 3.1 is an expression language that allows the processing of values conforming to the data model defined in XQuery and XPath Data Model (XDM) 3.1. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree. As well as modeling the tree structure of XML, the data model also includes atomic values, function items, and sequences.
  • XSLT and XQuery Serialization 3.1: This document defines serialization of an instance of the data model as defined in XQuery and XPath Data Model (XDM) 3.1 into a sequence of octets. Serialization is designed to be a component that can be used by other specifications such as XSL Transformations (XSLT) Version 3.0 or XQuery 3.1: An XML Query Language.

Comments are welcome through 28 February 2017.
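If you want a quick taste before commenting, maps and arrays are among the headline additions in XPath 3.1 and the 3.1 function library (a minimal sketch):

```xquery
(: XPath/XQuery 3.1 map and array constructors, with lookup syntax :)
let $m := map { "name": "XPath", "version": 3.1 }
let $a := [ "one", "two", "three" ]
return ($m?version, $a?2)  (: 3.1, "two" :)
```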

Get your red pen out!

Unlike political flame wars on social media, comments on these proposed recommendations could make a useful difference.

Enjoy!

January 16, 2017

XML.com Relaunch!

Filed under: XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 4:11 pm

XML.com

Lauren Wood posted this note about the relaunch of XML.com recently:

I’ve relaunched XML.com (for some background, Tim Bray wrote an article here: https://www.xml.com/articles/2017/01/01/xmlcom-redux/). I’m hoping it will become part of the community again, somewhere for people to post their news (submit your news here: https://www.xml.com/news/submit-news-item/) and articles (see the guidelines at https://www.xml.com/about/contribute/). I added a job board to the site as well (if you’re in Berlin, Germany, or able to move there, look at the job currently posted; thanks LambdaWerk!); if your employer might want to post XML-related jobs please email me.

The old content should mostly be available but some articles were previously available at two (or more) locations and may now only be at one; try the archive list (https://www.xml.com/pub/a/archive/) if you’re looking for something. Please let me know if something major is missing from the archives.

XML is used in a lot of areas, and there is a wealth of knowledge in this community. If you’d like to write an article, send me your ideas. If you have comments on the site, let me know that as well.

Just in time as President Trump is about to stir, vigorously, that big pot of crazy known as federal data.

Mapping, processing, transformation demands will grow at an exponential rate.

Notice the emphasis on demand.

Taking two weeks to write custom software to sort files (you know the Weiner/Abedin laptop story, yes?) won’t be acceptable for much longer.

How are your on-demand XML chops?

January 14, 2017

Looking up words in the OED with XQuery [Details on OED API Key As Well]

Filed under: Dictionary,XML,XQuery — Patrick Durusau @ 4:14 pm

Looking up words in the OED with XQuery by Clifford Anderson.

Clifford has posted a gist of work from the @VandyLibraries XQuery group, looking up words in the Oxford English Dictionary (OED) with XQuery.

To make full use of Clifford’s post, you will need a key for the Oxford Dictionaries API.

If you go straight to the regular Oxford English Dictionary (I’m omitting the URL so you don’t make the same mistake), there is nary a mention of the Oxford Dictionaries API.

The free plan allows 3K queries a month.

Not enough to shut out the outside world for the next four/eight years but enough to decide if it’s where you want to hide.

Application for the free API key was simple enough.

Save that the dumb password checker insisted on one or more special characters, plus one or more digits, plus upper and lowercase. When you get beyond 12 characters the insistence on a special character is just a little lame.

Email response with the key was fast, so I’m in!

What about you?

January 4, 2017

BaseX – Bad News, Good News

Filed under: BaseX,XML,XQuery — Patrick Durusau @ 5:39 pm

Good news and bad news from BaseX. Christian Grün posted to the BaseX discussion list today:

Dear all,

This year, there will be no BaseX Preconference Meeting in XML Prague. Sorry for that! The reason is simple: We were too busy in order to prepare ourselves for the event, look for speakers, etc.

At least we are happy to announce that BaseX 8.6 will finally be released this month. Stay tuned!

All the best,

Christian

That’s disappointing but understandable. Console yourself by watching presentations from 2013-2016 and reviewing the issues for 8.6.

Just a guess on my part, ;-), but I suspect more testing of BaseX builds would not go unappreciated.

Something to keep yourself busy while waiting for BaseX 8.6 to drop.

December 13, 2016

XQuery/XPath CRs 3.1! [#DisruptJ20 Twitter Game]

Filed under: Saxon,XML,XPath,XQuery — Patrick Durusau @ 1:57 pm

Just in time for the holidays, new CRs for XQuery/XPath hit the street! Comments due by 2017-01-10.

XQuery and XPath Data Model 3.1 https://www.w3.org/TR/2016/CR-xpath-datamodel-31-20161213/

XML Path Language (XPath) 3.1 https://www.w3.org/TR/2016/CR-xpath-31-20161213/

XQuery 3.1: An XML Query Language https://www.w3.org/TR/2016/CR-xquery-31-20161213/

XPath and XQuery Functions and Operators 3.1 https://www.w3.org/TR/2016/CR-xpath-functions-31-20161213/

XQueryX 3.1 https://www.w3.org/TR/2016/CR-xqueryx-31-20161213/

#DisruptJ20 is too late for comments to the W3C but you can break the boredom of indeterminate waiting to protest excitedly for TV cameras and/or to be arrested.

How?

Play the XQuery/XPath 3.1 Twitter Game!

Definitions litter the drafts and appear as:

[Definition: A sequence is an ordered collection of zero or more items.]

You Tweet:

An ordered collection of zero or more items? #xquery

Correct response:

A sequence.
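(Sequences reward a closer look, too: they are flat and ordered, so nesting the constructors buys you nothing.)

```xquery
(: nested sequence constructors flatten; the result has three items :)
count(((1, 2), (), (3)))  (: 3 :)
```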

Some definitions are too long to be tweeted in full:

An expanded-QName is a value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1]): that is, a triple containing namespace prefix (optional), namespace URI (optional), and local name. (xpath-functions)

Suggest you tweet:

A triple containing namespace prefix (optional), namespace URI (optional), and local name.

or

A value in the value space of the xs:QName datatype as defined in the XDM data model (see [XQuery and XPath Data Model (XDM) 3.1]).

In both cases, the correct response:

An expanded-QName.

Use a $10 burner phone and keep it unlocked at protests. If your phone is searched, imagine the attempts to break the “code.”

You could agree on definitions/responses as instructions for direct action. But I digress.

December 11, 2016

4 Days Left – Submission Alert – XML Prague

Filed under: Conferences,XML,XPath,XQuery,XSLT — Patrick Durusau @ 7:53 pm

A tweet by Jirka Kosek reminded me there are only 4 days left for XML Prague submissions!

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

From the call for papers:

XML Prague 2017 now welcomes submissions for presentations on the following topics:

  • Markup and the Extensible Web – HTML5, XHTML, Web Components, JSON and XML sharing the common space
  • Semantic visions and the reality – micro-formats, semantic data in business, linked data
  • Publishing for the 21th century – publishing toolchains, eBooks, EPUB, DITA, DocBook, CSS for print, …
  • XML databases and Big Data – XML storage, indexing, query languages, …
  • State of the XML Union – updates on specs, the XML community news, …

All proposals will be submitted for review by a peer review panel made up of the XML Prague Program Committee. Submissions will be chosen based on interest, applicability, technical merit, and technical correctness.

Accepted papers will be included in published conference proceedings.

I don’t travel but if you need a last-minute co-author or proofer, you know where to find me!

November 22, 2016

Manipulate XML Text Data Using XQuery String Functions

Filed under: BaseX,XML,XQuery — Patrick Durusau @ 3:31 pm

Manipulate XML Text Data Using XQuery String Functions by Adam Steffanick.

Adam’s notes from the Vanderbilt University XQuery Working Group:

We evaluated and manipulated text data (i.e., strings) within Extensible Markup Language (XML) using string functions in XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This tutorial covers the basics of how to use XQuery string functions and manipulate text data with BaseX.

We used a limited dataset of English words as text data to evaluate and manipulate, and I’ve created a GitHub gist of XML input and XQuery code for use with this tutorial.

A quick run-through of basic XQuery string functions that takes you up to writing your own XQuery function.
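If you want a preview before working through the tutorial, a few of the basics:

```xquery
(: core XQuery string functions applied to a single word :)
let $w := "Semantic"
return (
  upper-case($w),         (: "SEMANTIC" :)
  substring($w, 1, 3),    (: "Sem" :)
  string-length($w),      (: 8 :)
  tokenize("a,b,c", ",")  (: "a", "b", "c" :)
)
```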

While we wait for more reports from the Vanderbilt University XQuery Working Group, have you considered using XQuery to impose different views on a single text document?

For example, some Bible translations follow the “traditional” chapter and verse divisions (a very late addition), while others use paragraph-level organization and largely ignore the traditional verses.

Creating a view of a single source text as either one or both should not involve permanent changes to a source file in XML. Or at least not the original source file.

If processing requires a static file rendered one way or the other, that’s doable, but it should be kept separate from the original XML file.
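A sketch of the idea, using hypothetical markup in which each verse records both its verse number and its paragraph:

```xquery
declare variable $src :=
  <book>
    <v n="1" p="1">In the beginning ...</v>
    <v n="2" p="1">And the earth ...</v>
    <v n="3" p="2">And God said ...</v>
  </book>;

(: a paragraph view, derived on the fly; $src is never modified :)
for $p in distinct-values($src/v/@p)
return <para n="{$p}">{ string-join($src/v[@p = $p], " ") }</para>
```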

November 16, 2016

XML Prague 2017, February 9-11, 2017 – Registration Opens!

Filed under: Conferences,XML,XQuery,XSLT — Patrick Durusau @ 3:22 pm

XML Prague 2017, February 9-11, 2017

I mentioned XML Prague 2017 last month and now, after the election of Donald Trump as president of the United States, registration for the conference opens!

Coincidence?

Maybe. 😉

Even if you are returning to the U.S. after the conference, XML Prague will be a welcome respite from the tempest of news coverage of what isn’t known about the impending Trump administration.

At 120 Euros for three days, this is a great investment both professionally and emotionally.

Enjoy!

October 16, 2016

XML Prague 2017 is coming

Filed under: Conferences,XML,XQuery — Patrick Durusau @ 9:18 pm

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2017.

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentations (to gauge the competition): Watch videos from XML Prague 2016 on the XML Prague YouTube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barn yard posturing that is U.S. election politics this year!

October 6, 2016

XQuery Snippets on Gist

Filed under: XML,XQuery — Patrick Durusau @ 12:36 pm

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service
https://gist.github.com/search?q=xquery&ref=searchresults&utf8=%E2%9C%93

Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

  • XQuery (3,000)
  • xquery-sql (293)
  • xquery-3.0 (70)
  • xquery-update (55)

Enjoy!

September 24, 2016

XQuery Working Group (Vanderbilt)

Filed under: eXist,XML,XQuery — Patrick Durusau @ 1:54 pm

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction between the W3C XML Query Working Group and the XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

September 6, 2016

Data Provenance: A Short Bibliography

Filed under: Data Aggregation,Data Provenance,Merging,Topic Maps,XQuery — Patrick Durusau @ 7:45 pm

The video Provenance for Database Transformations by Val Tannen ends with a short bibliography.

Links and abstracts for the items in Val’s bibliography:

Provenance Semirings by Todd J. Green, Grigoris Karvounarakis, Val Tannen. (2007)

We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and whyprovenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.
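A toy illustration of the idea: annotate each source tuple with an indeterminate, let union add and join multiply, and a result tuple derivable in two ways carries a provenance polynomial such as

```latex
(p \cdot q) + (p \cdot r) \;=\; p \cdot (q + r)
```

where $p$, $q$, $r$ are the annotations of the source tuples. Evaluating the polynomial in the Boolean semiring recovers set semantics; evaluating it in $(\mathbb{N}, +, \times)$ recovers bag semantics.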

Update Exchange with Mappings and Provenance by Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, Val Tannen. (2007)

We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions — expressing what data and sources a peer judges to be authoritative — which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in [20].

In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the ORCHESTRA prototype system.

Annotated XML: Queries and Provenance by J. Nathan Foster, Todd J. Green, Val Tannen. (2008)

We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications.

Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.

Containment of Conjunctive Queries on Annotated Relations by Todd J. Green. (2009)

We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.

Collaborative Data Sharing with Mappings and Provenance by Todd J. Green, dissertation. (2009)

A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings.

We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of its implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.

Provenance in Collaborative Data Sharing by Grigoris Karvounarakis, dissertation. (2009)

This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange — which publishes a participant’s updates and then translates others’ updates to the participant’s local schema and imports them — while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases.

To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the ORCHESTRA prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS and indexing techniques to speed up provenance querying, and evaluate experimentally the performance of provenance querying and the benefits of our indexing techniques.

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, Val Tannen. (2011)

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

Circuits for Datalog Provenance by Daniel Deutch, Tova Milo, Sudeepa Roy, Val Tannen. (2014)

The annotation of the results of database queries with provenance information has many applications. This paper studies provenance for datalog queries. We start by considering provenance representation by (positive) Boolean expressions, as pioneered in the theories of incomplete and probabilistic databases. We show that even for linear datalog programs the representation of provenance using Boolean expressions incurs a super-polynomial size blowup in data complexity. We address this with an approach that is novel in provenance studies, showing that we can construct in PTIME poly-size (data complexity) provenance representations as Boolean circuits. Then we present optimization techniques that embed the construction of circuits into seminaive datalog evaluation, and further reduce the size of the circuits. We also illustrate the usefulness of our approach in multiple application domains such as query evaluation in probabilistic databases, and in deletion propagation. Next, we study the possibility of extending the circuit approach to the more general framework of semiring annotations introduced in earlier work. We show that for a large and useful class of provenance semirings, we can construct in PTIME poly-size circuits that capture the provenance.

Incomplete, but a substantial starting point for exploring data provenance and its relationship to, and use with, topic map merging.

To get a feel for “data provenance” just prior to the earliest reference here (2007), consider A Survey of Data Provenance Techniques by Yogesh L. Simmhan, Beth Plale, Dennis Gannon, published in 2005.

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources.

The provenance of data products generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update a data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

In this paper we create a taxonomy of data provenance techniques, and apply the classification to current research efforts in the field. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. Our synthesis can help those building scientific and business metadata-management systems to understand existing provenance system designs. The survey culminates with an identification of open research problems in the field.

Another rich source of reading material!

August 15, 2016

BaseX 8.5.3 Released!

Filed under: BaseX,XML,XQuery — Patrick Durusau @ 7:09 pm

BaseX 8.5.3 Released! (2016/08/15)

BaseX 8.5.3 was released today!

The changelog reads:

VERSION 8.5.3 (August 15, 2016) —————————————-

Minor bug fixes, improved thread-safety.

Still, not a bad idea to upgrade today!

Enjoy!

PS: You do remember that Congress is throwing XML at the internet in ever-increasing amounts?

Perhaps in hopes of burying information in angle-bang syntax.

XQuery can help disappoint them.

July 16, 2016

BaseX 8.5.1 Released! (XQuery Texts for Smart Phone?)

Filed under: BaseX,XML,XQuery — Patrick Durusau @ 3:44 pm

BaseX – 8.5.1 Released!

From the documentation page:

BaseX is both a light-weight, high-performance and scalable XML Database and an XQuery 3.1 Processor with full support for the W3C Update and Full Text extensions. It focuses on storing, querying, and visualizing large XML and JSON documents and collections. A visual frontend allows users to interactively explore data and evaluate XQuery expressions in realtime. BaseX is platform-independent and distributed under the free BSD License (find more in Wikipedia).

Besides Priscilla Walmsley’s XQuery, 2nd Edition and the BaseX documentation as a PDF file, what other XQuery resources would you store on a smart phone? (For occasional reference, leisure reading, etc.)

June 28, 2016

The Feynman Technique – Contest for Balisage 2016?

Filed under: Conferences,XML,XML Schema,XQuery — Patrick Durusau @ 7:23 pm

The Best Way to Learn Anything: The Feynman Technique by Shane Parrish.

From the post:

There are four simple steps to the Feynman Technique, which I’ll explain below:

  1. Choose a Concept
  2. Teach it to a Toddler
  3. Identify Gaps and Go Back to The Source Material
  4. Review and Simplify

This made me think of the late-breaking Balisage 2016 papers posted by Tommie Usdin in email:

  • Saxon-JS – XSLT 3.0 in the Browser, by Debbie Lockett and Michael Kay, Saxonica
  • A MicroXPath for MicroXML (AKA A New, Simpler Way of Looking at XML Data Content), by Uche Ogbuji, Zepheira
  • A catalog of Functional programming idioms in XQuery 3.1, James Fuller, MarkLogic

New contest for Balisage?

Pick a concept from a Balisage 2016 presentation and you have sixty (60) seconds to explain it to Balisage attendees.

What do you think?

Remember, you can’t play if you don’t attend! Register today!

If Tommie agrees, the winner gets me to record a voice mail greeting for their phone! 😉

May 23, 2016

Balisage 2016 Program Posted! (Newcomers Welcome!)

Filed under: Conferences,Topic Maps,XML,XML Schema,XPath,XProc,XQuery,XSLT — Patrick Durusau @ 8:03 pm

Tommie Usdin wrote today to say:

Balisage: The Markup Conference
2016 Program Now Available
http://www.balisage.net/2016/Program.html

Balisage: where serious markup practitioners and theoreticians meet every August.

The 2016 program includes papers discussing reducing ambiguity in linked-open-data annotations, the visualization of XSLT execution patterns, automatic recognition of grant- and funding-related information in scientific papers, construction of an interactive interface to assist cybersecurity analysts, rules for graceful extension and customization of standard vocabularies, case studies of agile schema development, a report on XML encoding of subtitles for video, an extension of XPath to file systems, handling soft hyphens in historical texts, an automated validity checker for formatted pages, one no-angle-brackets editing interface for scholars of German family names and another for scholars of Roman legal history, and a survey of non-XML markup such as Markdown.

XML In, Web Out: A one-day Symposium on the sub rosa XML that powers an increasing number of websites will be held on Monday, August 1. http://balisage.net/XML-In-Web-Out/

If you are interested in open information, reusable documents, and vendor and application independence, then you need descriptive markup, and Balisage is the conference you should attend. Balisage brings together document architects, librarians, archivists, computer scientists, XML practitioners, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, standards developers, academics, industrial researchers, government and NGO staff, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Some participants are busy designing replacements for XML while others still use SGML (and know why they do).

Discussion is open, candid, and unashamedly technical.

Balisage 2016 Program: http://www.balisage.net/2016/Program.html

Symposium Program: http://balisage.net/XML-In-Web-Out/symposiumProgram.html

Even if you don’t eat RELAX grammars at snack time, put Balisage on your conference schedule. Even if a bit scruffy looking, the long-time participants like new document/information problems, or new ways of looking at old ones. Not to mention that, on occasion, they learn something from newcomers as well.

It is a unique opportunity to meet the people who engineered the tools and specs that you use day to day.

Be forewarned that most of them have difficulty agreeing on what controversial terms mean, like “document,” but that to one side, they are as good a crew as you are likely to meet.

Enjoy!

May 5, 2016

TEI XML -> HTML w/ XQuery [+ CSS -> XML]

Filed under: HTML,Text Encoding Initiative (TEI),XML,XQuery — Patrick Durusau @ 1:10 pm

Convert TEI XML to HTML with XQuery and BaseX by Adam Steffanick.

From the post:

We converted a document from the Text Encoding Initiative’s (TEI) Extensible Markup Language (XML) scheme to HTML with XQuery, an XML query language, and BaseX, an XML database engine and XQuery processor. This guide covers the basics of how to convert a document from TEI XML to HTML while retaining element attributes with XQuery and BaseX.

I’ve created a GitHub repository of sample TEI XML files to convert from TEI XML to HTML. This guide references a GitHub gist of XQuery code and HTML output to illustrate each step of the TEI XML to HTML conversion process.

The post only treats six (6) TEI elements but the methods presented could be extended to a larger set of TEI elements.

TEI P5 has 563 elements, which may appear in varying, valid combinations. It also defines 256 attributes, which are distributed among those 563 elements.

Consider using XQuery as a quality assurance (QA) tool to ensure that encoded texts conform to your project’s definition of expected text encoding.
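As a sketch of that QA idea, outside XQuery: walk the document and flag any element not in the project’s approved subset. The whitelist below is invented for illustration; a real project would derive it from its own encoding guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical project policy: only these TEI-like elements are expected.
# (Invented whitelist, for illustration only.)
ALLOWED = {"TEI", "text", "body", "div", "head", "p"}

def unexpected_elements(xml_text):
    """Return the set of element names not on the project's whitelist."""
    root = ET.fromstring(xml_text)
    return {el.tag for el in root.iter()} - ALLOWED

doc = ("<TEI><text><body><div><head>T</head>"
       "<p>Hi <foreign>bonjour</foreign></p></div></body></text></TEI>")
assert unexpected_elements(doc) == {"foreign"}
```

The same check in XQuery would be a one-liner over `distinct-values`, but the principle is identical: compare what the text actually uses against what the project says it should use.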

While I was at Adam’s site I encountered: Convert CSV to XML with XQuery and BaseX, which you should bookmark for future reference.
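The CSV-to-XML idea is simple enough to sketch outside XQuery as well; here is a rough Python equivalent (element names are my own invention, not Adam’s):

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_name="records", row_name="record"):
    """Turn each CSV row into an element, with one child per column."""
    root = ET.Element(root_name)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rec = ET.SubElement(root, row_name)
        for field, value in row.items():
            ET.SubElement(rec, field).text = value
    return ET.tostring(root, encoding="unicode")

xml = csv_to_xml("name,state\nAbraham,LA\nAdams,NC")
assert "<record><name>Abraham</name><state>LA</state></record>" in xml
```

Adam’s XQuery version has the advantage of running inside BaseX, where the result can be stored and queried immediately.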

February 2, 2016

Balisage 2016, 2–5 August 2016 [XML That Makes A Difference!]

Filed under: Conferences,XLink,XML,XML Data Clustering,XML Schema,XPath,XProc,XQuery,XSLT — Patrick Durusau @ 9:47 pm

Call for Participation

Dates:

  • 25 March 2016 — Peer review applications due
  • 22 April 2016 — Paper submissions due
  • 21 May 2016 — Speakers notified
  • 10 June 2016 — Late-breaking News submissions due
  • 16 June 2016 — Late-breaking News speakers notified
  • 8 July 2016 — Final papers due from presenters of peer reviewed papers
  • 8 July 2016 — Short paper or slide summary due from presenters of late-breaking news
  • 1 August 2016 — Pre-conference Symposium
  • 2–5 August 2016 — Balisage: The Markup Conference

From the call:

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

  • Web application development with XML
  • Informal data models and consensus-based vocabularies
  • Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
  • Performance issues in parsing, XML database retrieval, or XSLT processing
  • Development of angle-bracket-free user interfaces for non-technical users
  • Semistructured data and full text search
  • Deployment of XML systems for enterprise data
  • Design and implementation of XML vocabularies
  • Case studies of the use of XML for publishing, interchange, or archiving
  • Alternatives to XML
  • the role(s) of XML in the application lifecycle
  • the role(s) of vocabularies in XML environments

Full papers should be submitted by the deadline given below. All papers are peer-reviewed — we pride ourselves that you will seldom get a more thorough, skeptical, or helpful review than the one provided by Balisage reviewers.

Whether in theory or practice, let’s make Balisage 2016 the one people speak of in hushed tones at future markup and information conferences.

Useful semantics continues to flounder about, cf. Vice-President Biden’s interest in “one cancer research language.” Easy enough to say. How hard could it be?

Documents are commonly thought of, and processed, as if “from BOM to EOF” were the definition of a document. Much to our impoverishment.

Silo dissing has gotten popular. What if we could have our silos and eat them too?

Let’s set our sights on a Balisage 2016 where non-technicals come away saying “I want that!”

Have your first drafts done well before the end of February, 2016!

January 13, 2016

Congressional Roll Call Vote – The Documents – Part 2 (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 11:54 pm

In Congressional Roll Call Vote – The Documents (XQuery) we looked at the initial elements found in FINAL VOTE RESULTS FOR ROLL CALL 705. Today we continue our examination of those elements, starting with <vote-data>.

As before, use ctrl-u in your browser to display the XML source for that page. Look for </vote-metadata>; the next element is <vote-data>, which contains all the votes cast by members of Congress as follows:

<recorded-vote>
<legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham" party="R" state="LA" role="legislator">Abraham</legislator>
<vote>Nay</vote>
</recorded-vote>
<recorded-vote>
<legislator name-id="A000370" sort-field="Adams" unaccented-name="Adams" party="D" state="NC" role="legislator">Adams</legislator>
<vote>Yea</vote>
</recorded-vote>

These are only the first two (2) records; the remaining <recorded-vote> elements differ from these only in their content.

I have introduced line returns to make it clear that <recorded-vote> … </recorded-vote> begin and end each record. Also note that <legislator> and <vote> are siblings.

What you didn’t see in the upper part of this document were the attributes that appear inside the <legislator> element.

Some of the attributes are: name-id="A000374", state="LA", and role="legislator".

In an XQuery, we address attributes by writing out the path to the element containing the attributes and then appending the attribute.

For example, for name-id="A000374", we could write:

rollcall-vote/vote-data/recorded-vote/legislator[@name-id = "A000374"]

to select the <legislator> element carrying that attribute value.

Recalling that:

rollcall-vote – Root element of the document.

vote-data – Direct child of the root element.

recorded-vote – Direct child of the vote-data element (with many siblings).

legislator – Direct child of recorded-vote.

@name-id – One of the attributes of legislator.

As I mentioned in our last post, there are other ways to access elements and attributes but many useful things can be done with direct descendant XPaths.
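If you want to experiment with these paths without firing up an XQuery engine, Python’s ElementTree accepts the same direct-descendant style with attribute predicates. A sketch using a two-record sample I typed in (not the full roll call file):

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-typed sample in the shape of the roll call document.
xml = """<rollcall-vote><vote-data>
<recorded-vote>
  <legislator name-id="A000374" party="R" state="LA">Abraham</legislator>
  <vote>Nay</vote>
</recorded-vote>
<recorded-vote>
  <legislator name-id="A000370" party="D" state="NC">Adams</legislator>
  <vote>Yea</vote>
</recorded-vote>
</vote-data></rollcall-vote>"""

root = ET.fromstring(xml)

# Same path shape as in the post, with an attribute predicate.
hits = root.findall("./vote-data/recorded-vote/legislator[@name-id='A000374']")
assert [el.text for el in hits] == ["Abraham"]

# Reading an attribute value back off the selected element.
assert hits[0].get("state") == "LA"
```

ElementTree only supports a subset of XPath, but for direct descendant paths with simple predicates it behaves just like the XQuery paths above.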

In preparation for our next post, try searching for “A000374”, limiting your search to the domain congress.gov.

It is a good practice to search on unfamiliar attribute values. You never know what you may find!

Until next time!

January 11, 2016

Congressional Roll Call Vote – The Documents (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:41 pm

I assume you have read my new starter post for this series: Congressional Roll Call Vote and XQuery (A Do Over). If you haven’t and aren’t already familiar with XQuery, take a few minutes to go read it now. I’ll wait.

The first XML document we need to look at is FINAL VOTE RESULTS FOR ROLL CALL 705. If you press ctrl-u in your browser, the XML source of that document will be displayed.

The top portion of that document, before you see <vote-data> reads:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rollcall-vote PUBLIC "-//US Congress//DTDs/vote
v1.0 20031119 //EN" "http://clerk.house.gov/evs/vote.dtd">
<?xml-stylesheet type="text/xsl" href="http://clerk.house.gov/evs/vote.xsl"?>
<rollcall-vote>
<vote-metadata>
<majority>R</majority>
<congress>114</congress>
<session>1st</session>
<chamber>U.S. House of Representatives</chamber>
<rollcall-num>705</rollcall-num>
<legis-num>H R 2029</legis-num>
<vote-question>On Concurring in Senate Amdt with
Amdt Specified in Section 3(a) of H.Res. 566</vote-question>
<vote-type>YEA-AND-NAY</vote-type>
<vote-result>Passed</vote-result>
<action-date>18-Dec-2015</action-date>
<action-time time-etz="09:49">9:49 AM</action-time>
<vote-desc>Making appropriations for military construction, the
Department of Veterans Affairs, and related agencies for the fiscal
year ending September 30, 2016, and for other purposes</vote-desc>
<vote-totals>
<totals-by-party-header>
<party-header>Party</party-header>
<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered "Present"</present-header>
<not-voting-header>Not Voting</not-voting-header>
</totals-by-party-header>
<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Independent</party>
<yea-total>0</yea-total>
<nay-total>0</nay-total>
<present-total>0</present-total>
<not-voting-total>0</not-voting-total>
</totals-by-party>
<totals-by-vote>
<total-stub>Totals</total-stub>
<yea-total>316</yea-total>
<nay-total>113</nay-total>
<present-total>0</present-total>
<not-voting-total>5</not-voting-total>
</totals-by-vote>
</vote-totals>
</vote-metadata>

One of the first skills you need to learn to make effective use of XQuery is how to recognize paths in an XML document.

I’ll do the first several and leave some of the others for you.

<rollcall-vote> – the root element, the top-level parent of everything that follows

<vote-metadata> – first child element in this document
XPath rollcall-vote/vote-metadata

<majority>R</majority> – first child of <vote-metadata>
XPath rollcall-vote/vote-metadata/majority

<congress>114</congress>

What do you think? It looks to be at the same level as <majority>R</majority>, and it is. It is called a sibling of <majority>R</majority>
XPath rollcall-vote/vote-metadata/congress

Caveat: There are ways to go back up the XPath and to reach siblings and attributes. For the moment, let’s get good at spotting direct XPaths.

Let’s skip down in the markup until we come to <vote-totals> and, inside it, <totals-by-party-header>. The latter is not followed, at least not immediately, by </totals-by-party-header>. That’s a signal that the run of siblings has stopped and we have another step in the XPath.

<totals-by-party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header

<party-header>Party</party-header>
XPath: rollcall-vote/vote-metadata/vote-totals/totals-by-party-header/party-header

As you may suspect, the next four elements are siblings of <party-header>Party</party-header>

<yea-header>Yeas</yea-header>
<nay-header>Nays</nay-header>
<present-header>Answered “Present”</present-header>
<not-voting-header>Not Voting</not-voting-header>

The closing tag, shown by the “/”, signals the end of the <totals-by-party-header> element.

</totals-by-party-header>

See how you do mapping out the remaining XPaths from the top of the document.

<totals-by-party>
<party>Republican</party>
<yea-total>150</yea-total>
<nay-total>95</nay-total>
<present-total>0</present-total>
<not-voting-total>1</not-voting-total>
</totals-by-party>
<totals-by-party>
<party>Democratic</party>
<yea-total>166</yea-total>
<nay-total>18</nay-total>
<present-total>0</present-total>
<not-voting-total>4</not-voting-total>
</totals-by-party>
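One way to check your hand-mapped XPaths is to let a program enumerate every element path in the document. A small Python sketch, run here against a trimmed sample of the metadata I typed in (not the full file):

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the roll call metadata.
xml = """<rollcall-vote><vote-metadata>
  <majority>R</majority>
  <congress>114</congress>
  <vote-totals>
    <totals-by-party-header><party-header>Party</party-header></totals-by-party-header>
  </vote-totals>
</vote-metadata></rollcall-vote>"""

def paths(el, prefix=""):
    """Yield the slash-separated path of every element, depth first."""
    here = f"{prefix}/{el.tag}" if prefix else el.tag
    yield here
    for child in el:
        yield from paths(child, here)

all_paths = list(paths(ET.fromstring(xml)))
assert "rollcall-vote/vote-metadata/congress" in all_paths
assert ("rollcall-vote/vote-metadata/vote-totals/"
        "totals-by-party-header/party-header") in all_paths
```

Run this against the full document and compare the printed paths with the ones you worked out by hand; it is a quick way to catch a missed step in the hierarchy.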

Tomorrow we are going to dive into the structure of the <vote-data> and how to address the attributes therein and their values.

Enjoy!

January 10, 2016

Congressional Roll Call Vote and XQuery (A Do Over)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:11 pm

Once words are written, as an author I consider them fixed. Even typos should be acknowledged and corrected, not silently “improved” out of the original text. Rather than editing what has been said, more words can cover the same ground, with the hope of doing so more completely or usefully.

I am starting my XQuery series of posts with the view of being more systematic, including references to at least one popular XQuery book, along with my progress through a series of uses of XQuery.

You are going to need an XQuery engine for all but this first post, so let’s cover getting that set up first.

There are any number of GUI interface tools that I will mention over time but for now, let’s start with Saxon.

Download Saxon and unzip the file. You can either put saxon9he.jar in your Java classpath (if set) or invoke it with -cp (path to saxon9he.jar), as in java -cp (path to saxon9he.jar) net.sf.saxon.Query -q:query-file.

Classpaths are a mixed blessing at best but who wants to keep typing -cp (your path to saxon9he.jar) net.sf.saxon.Query -q: all the time?

What I have found very useful (Ubuntu system) is to create a short shell script that I can invoke from the command line, thus:

#!/bin/bash
java -cp /home/patrick/saxon/saxon9he.jar net.sf.saxon.Query -q:$1

After creating that file, which I very imaginatively named “runsaxon.sh,” I used chmod 755 to make it executable.

When I want to run Saxon at the command line, in the same directory with “runsaxon.sh” I type:

./runsaxon.sh ex-5.4.xq > ex-5.4.html

It is a lot easier and not subject to my fat-fingering of the keyboard.

The “>” sign is a redirection operator in Linux that sends the output to a file, in this case, ex-5.4.html.

The source of ex-5.4.xq (and its data file) is: XQuery, 2nd Edition by Priscilla Walmsley. Highly recommended.

Priscilla has put all of her examples online, XQuery Examples. Please pass that along with a link to her book if you use her examples.

If you have ten minutes, take a look at: Learn XQuery in 10 Minutes: An XQuery Tutorial *UPDATED* by Dr. Michael Kay. Michael Kay is also the author of Saxon.

By this point you should be well on your way to having a working XQuery engine and tomorrow we will start exploring the structure of the congressional roll call vote documents.

January 9, 2016

Congressional Roll Call and XQuery – (Week 1 of XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 9:49 pm

Truthfully, it has been a little more than a week of daily XQuery posts; I started a day or so before January 1, 2016.

I haven’t been flooded with suggestions or comments, ;-), so I read back over my XQuery posts and I see lots of room for improvement.

Most of my posts are on fairly technical topics and are meant to alert other researchers of interesting software or techniques. Most of them are not “how-to” or step by step guides, but some of them are.

The posts on congressional roll call documents made sense to me, but then I wrote them. Part of what I sensed was that either you knew enough to follow my jumps, in which case you were looking for specific details, like the correspondence of attribute values across documents, and not so much for my XQuery expressions.

On the other hand, if you weren’t already comfortable with XQuery, the correspondence of values between documents was the least of your concerns. Where the hell was all this terminology coming from?

I’m no stranger to long explanations, one of the standards I edit crosses the line at over 1,500 pages. But it hasn’t been my habit to write really long posts on this blog.

I’m going to spend the next week, starting tomorrow, re-working and expanding the congressional roll call vote posts to be more detailed for those getting into XQuery, with very terse, short expert tips at the end of each post if needed.

The expert part will have observations such as the correspondences in attribute values and other oddities that either you know or you don’t.

I will have the first longer-style post up tomorrow, January 10, 2016, and we will see how the week develops from there.

