Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 8, 2016

Congressional Roll Call Vote – Join/Merge Remote XML Files (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:59 pm

One of the things that yesterday’s output lacked was the full names of the Georgia representatives, which aren’t reported in the roll call documents.

But, what the roll call documents do have, is the following:

<recorded-vote>
<legislator name-id="J000288" sort-field="Johnson (GA)" unaccented-name="Johnson (GA)"
party="D" state="GA" role="legislator">Johnson (GA)</legislator>
<vote>Nay</vote>
</recorded-vote>

With emphasis on name-id="J000288"

I call that attribute out because there is a member data file, just for the House of Representatives, that has:

<bioguideID>J000288</bioguideID>

And yes, the “name-id” attribute and the <bioguideID> share the same value for Henry C. “Hank” Johnson, Jr. of Georgia.

As far as I can find, that relationship between the “name-id” value in roll call result files and the House Member Data File is undocumented. You have to be paying attention to the data values in the various XML files at Congress.gov.
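Once you know the shared value, a short query (a sketch that uses only the element names you will see in the merge expression below) pulls up the matching member record:

for $mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $mem//bioguideID = 'J000288'
return string($mem//official-name)

That returns the full name of the member whose roll call entries carry name-id="J000288".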

The result of the XQuery script today has the usual header but for members of the Georgia delegation, the following:

congress-ga-phone

That is the result of joining/merging two XML files hosted at congress.gov in real time. You can substitute any roll call vote and your state as appropriate and generate a similar webpage for that roll call vote.

The roll call vote file I used for this example is: http://clerk.house.gov/evs/2015/roll705.xml and the House Member Data File was: http://xml.house.gov/MemberData/MemberData.xml. The MemberData.xml file dates from April of 2015 so it may not have the latest data on any given member. Documentation for House Member Data in XML (pdf).

The main XQuery function for merging the two XML files:

{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
$mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li> {string($mem//official-name)} — {string($voter/vote)} — {string($mem//phone)}</li>
}

At a minimum, you can auto-generate a listing for representatives from your state, their vote on any roll-call vote and give readers their phone number to register their opinion.
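If you want the fragment above as a single, self-contained script, a minimal wrapper looks something like this (the surrounding HTML and heading text are my own framing; the full script I actually used is the roll705-join-merge.xq.txt linked below):

xquery version "3.0";
<html>
<body>
<h2>Georgia Delegation, Roll Call 705</h2>
<ul>{
for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote,
    $mem in doc("http://xml.house.gov/MemberData/MemberData.xml")//member/member-info
where $voter/legislator[@state = 'GA'] and $voter/legislator/@name-id = $mem//bioguideID
return <li>{string($mem//official-name)} — {string($voter/vote)} — {string($mem//phone)}</li>
}</ul>
</body>
</html>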

This is a crude example of what you can do with XML, XQuery and online data from Congress.gov.

BTW, if you work in “sharing” environment at a media outlet or library, you can also join/merge data that you hold internally, say the private phone number of a congressional aide, for example.

We are not nearly done with the congressional roll call vote but you can begin to see the potential that XQuery offers for very little effort. Not to mention that XQuery scripts can be rapidly adapted to your library or news room.

Try out today’s XQuery roll705-join-merge.xq.txt for yourself. (Apologies for the “.txt” extension but my ISP host has ideas about “safe” files to upload.)

I realize this first week has been kinda haphazard in its presentation. Suggestions welcome on improvements as this series goes forward.

The government and others are cranking out barely useful XML by the boatload. XQuery is your ticket to creating personalized presentations dynamically from that XML and other data.

Enjoy!

PS: For display of XML and XQuery, should I be using a different Word template? Suggestions?

January 7, 2016

Localizing A Congressional Roll Call Vote (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 10:07 pm

I made some progress today on localizing a congressional roll call vote.

As you might expect, I chose to localize to the representatives from Georgia. 😉

I used a FLWOR expression to select legislators where the state attribute equals "GA".

Here is that expression:

<ul>
{for $voter in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote
where $voter/legislator[@state = 'GA']
return <li> {string($voter/legislator)} — {string($voter/vote)}</li>
}</ul>

That makes our localized display a bit better for local readers, but only just.

See roll705-local.html.

What we need is more information than can be found at: http://clerk.house.gov/evs/2015/roll705.xml.

More on that tomorrow!

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Filed under: BigData,PostgreSQL,SQL,XML,XQuery — Patrick Durusau @ 5:26 pm

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:

Downloads

Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, PostgreSQL 9.5 documentation, 9.14.2.2 XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection; there was no anchor at 9.14.2.2.

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”

UPSERT

A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.

January 6, 2016

A Lesson about Let Clauses (XQuery)

Filed under: XML,XQuery — Patrick Durusau @ 10:48 pm

I was going to demonstrate how to localize roll call votes so that only representatives from your state and their votes were displayed for any given roll call vote.

Which would enable libraries or local newsrooms, whose users/readers have little interest in how obscure representatives from other states voted, to pare down the roll call vote list to those that really matter, your state’s representatives.

But remembering that I promised to clean up the listings in yesterday’s post that read:

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}

and kept repeating doc("http://clerk.house.gov/evs/2015/roll705.xml").

My thought was to replace that string with a variable declared by a let clause and then substituting that variable for that string.

To save you from the same mistake, combining a let clause with direct element constructors returns an error saying, in this case:

Left operand of '>' needs parentheses

Not a terribly helpful error message.

I have found examples of using a let clause within a direct element constructor that would have defeated the rationale for declaring the variable to begin with.
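For what it is worth, here is a minimal sketch of one arrangement that should parse cleanly: bind the document once in the prolog with declare variable (the variable name is my own choice, not from the series so far) and reference it from inside the direct element constructor:

xquery version "3.0";
declare boundary-space preserve;
(: $roll is a name chosen for illustration :)
declare variable $roll := doc("http://clerk.house.gov/evs/2015/roll705.xml");
<html>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string($roll//rollcall-num)}</h2>
</body>
</html>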

Tomorrow I hope to post today’s content, which will enable you to display data relevant to local voters, news reporters, for any arbitrary roll call vote in Congress.

Mark today’s adventure as a mistake to avoid. 😉

January 5, 2016

Jazzing a Roll Call Vote – Part 3 (XQuery)

Filed under: XML,XQuery — Patrick Durusau @ 9:48 pm

I posted Congressional Roll Call Vote – Accessibility Issues earlier today to deal with some accessibility issues noticed by @XQuery with my color coding.

Today we are going to start at the top of the boring original roll call vote and work our way down using XQuery.

Be forewarned that the XQuery you see today will be shortened and cleaned up tomorrow. It works, but it’s not best practice.

You will need to open up the source of the original roll call vote to see the elements I select in the path expressions.

Here is the XQuery that is the goal for today:

xquery version "3.0";
declare boundary-space preserve;
<html>
<head></head>
<body>
<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)} {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}
</body>
</html>

The title of the document we obtain with:

<h2 align="center">FINAL VOTE RESULTS FOR ROLL CALL {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)} </h2>

Two quick things to notice:

First, for very simple documents like this one, I use “//” rather than writing out the path to the rollcall-num element. I already know it only occurs once in each rollcall document.
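For instance, both of these expressions return the same roll call number for this document; the second spells out the full path to the element:

string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num),
string(doc("http://clerk.house.gov/evs/2015/roll705.xml")/rollcall-vote/vote-metadata/rollcall-num)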

Second, when using direct element constructors, the XQuery expressions are enclosed by "{ }" (curly braces).

The rollcall number, date and time of the vote come next (I have introduced line breaks for readability):

<strong>{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//rollcall-num)}</strong>

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-date)}

{string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//action-time)} <br/>

If you compare my presentation of that string and that from the original, you will find the original has slightly more space between the items.

Here is the XSLT for that spacing:

<xsl:if test="legis-num[text()!='0']"><xsl:text>      </xsl:text><b><xsl:value-of select="legis-num"/></b></xsl:if>
<xsl:text>      </xsl:text><xsl:value-of select="vote-type"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-date"/>
<xsl:text>      </xsl:text><xsl:value-of select="action-time"/><br/>

Since I already had white space separating my XQuery expressions, I just added this to the prolog:

declare boundary-space preserve;
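A tiny illustration of what that declaration changes (my own example, not part of the roll call query):

xquery version "3.0";
declare boundary-space preserve;
(: with "preserve" this returns <p>A B</p>; with the default "strip", the space
   between the two enclosed expressions is boundary whitespace and is dropped,
   giving <p>AB</p> :)
<p>{ "A" } { "B" }</p>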

The last two lines:

<strong>Question:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-question)} <br/>

<strong>Bill Title:</strong> {string(doc("http://clerk.house.gov/evs/2015/roll705.xml")//vote-desc)}

These are just standard queries for content. The string operator extracts the content of the element you address.

Tomorrow we are going to talk about how to clean up and shorten the path statements and look around for information that should be at the top of this document, but isn’t!

PS: Did you notice that the vote totals, etc., are written as static data in the XML file? Curious isn’t it? Easy enough to generate from the voting data. I don’t have an answer but thought you might.

Congressional Roll Call Vote – Accessibility Issues

Filed under: XML,XQuery,XSLT — Patrick Durusau @ 2:43 pm

I posted a color coded version of a congressional roll call vote in Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway), using red for Republicans and blue for Democrats. #XQuery points out accessibility issues which depend upon color perception.

Color coding works better for me than the more traditional roman versus italic font face distinction but let’s improve the color coding to remove the accessibility issue.

The first question is what colors should I use for accessibility?

In searching to answer that question I found this thread at Edward Tufte’s site (of course), Choice of colors in print and graphics for color-blind readers, which has a rich list of suggestions and pointers to other resources.

One in particular, Color Universal Design (CUD), posted by Maarten Boers, has this graphic on colors:

colorblind_palette

Relying on that palette, I changed the colors for the roll call vote to Republicans in orange; Democrats in sky blue and re-generated the roll call document.

roll-call-access

Here is an accessible, but still color-coded, version of: FINAL VOTE RESULTS FOR ROLL CALL 705.

An upside of XML is that changing the presentation of all 429 votes took only a few seconds to change the stylesheet and re-generate the results.

Thanks to #XQuery for prodding me on the accessibility issue which resulted in finding the thread at Tufte and the Colorblind barrier-free color palette.


Other posts on congressional roll call votes:

1. Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

2. Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

January 4, 2016

Jazzing a Roll Call Vote – Part 2 (XQuery, well XSLT anyway)

Filed under: XML,XQuery — Patrick Durusau @ 11:41 pm

Apologies, but I did not make as much progress on the Congressional Roll Call vote as I had hoped.

I did find some interesting information about the vote.xsl stylesheet and managed to use color to code members of the House.

You probably remember me whining in Jazzing Up Roll Call Votes For Fun and Profit (XQuery) about how hard it is to tell roman from italic type when distinguishing members of different parties.

The XSLT code is worse than I imagined.

Here’s what I mean:

<b><center><font size="+2">FINAL VOTE RESULTS FOR ROLL CALL <xsl:value-of select="/rollcall-vote/vote-metadata/rollcall-num"/>
<xsl:if test="/rollcall-vote/vote-metadata/vote-correction[text()!='']">*</xsl:if></font></center></b>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'D']"> -->
<xsl:if test = "$Majority='D'">
<center>(Democrats in roman; Republicans in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>
<!-- <xsl:if test = "/rollcall-vote/vote-metadata/majority[text() = 'R']"> -->
<xsl:if test = "$Majority!='D'">
<center>(Republicans in roman; Democrats in <i>italic</i>; Independents <u>underlined</u>)</center><br/>
</xsl:if>

Which party is in the majority determines whether the names in a party appear in roman or italic face font.

Now there’s a distinction that will be lost on a casual reader!

What’s more, if you are trying to reform the stylesheet, don’t look for R or D but again for majority party:

<xsl:template match="vote">
<!-- Handles formatting of Member names based on party. -->
<!-- <xsl:if test="../legislator/@party='R'"><xsl:value-of select="../legislator"/></xsl:if>
<xsl:if test="../legislator/@party='D'"><i><xsl:value-of select="../legislator"/></i></xsl:if> -->
<xsl:if test="../legislator/@party='I'"><u><xsl:value-of select="../legislator"/></u></xsl:if>
<xsl:if test="../legislator/@party!='I'">
<xsl:if test="../legislator/@party = $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<xsl:value-of select="../legislator"/>
</xsl:if>
<xsl:if test="../legislator/@party != $Majority"><!-- /rollcall-vote/vote-metadata/majority/text()"> -->
<i><xsl:value-of select="../legislator"/></i>
</xsl:if>
</xsl:if>
</xsl:template>

As you can see, selecting by party has been commented out in favor of the roman/italic distinction based on the majority party.

I wanted to label the Republicans with an icon but my GIMP skills don’t extend to making an icon of young mothers throwing their children under the carriage wheels of the wealthy to save them from a life of poverty and degradation. A bit much to get into an HTML button sized icon.

I settled for using the traditional red for Republicans and blue for Democrats and ran the modified stylesheet against roll705.xml locally.

vote-color-coded

Here is FINAL VOTE RESULTS FOR ROLL CALL 705 as HTML.
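If you would rather stay in XQuery than modify vote.xsl, the same party-based coloring can be sketched like this (red and blue follow the post; the green for Independents is my own choice):

<ul>{
for $v in doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote
let $color := if ($v/legislator/@party = 'R') then 'red'
              else if ($v/legislator/@party = 'D') then 'blue'
              else 'green'
return <li style="color:{$color}">{string($v/legislator)} — {string($v/vote)}</li>
}</ul>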

Question: Are red and blue easier to distinguish than roman and italic?

If your answer is yes, why resort to typographic subtlety on something like party affiliation?

Are subtle distinctions used to confuse the uninitiated and unwary?

January 3, 2016

Jazzing Up Roll Call Votes For Fun and Profit (XQuery)

Filed under: Government,XML,XQuery — Patrick Durusau @ 11:02 pm

Roll call votes in the US House of Representatives are a staple of local, state and national news. If you go looking for the “official” version, what you find is as boring as your 5th grade civics class.

Trigger Warning: Boring and Minimally Informative Page Produced By Following Link: Final Vote Results For Roll Call 705.

Take a deep breath and load the page. It will open in a new browser tab. Boring. Yes? (You were warned.)

It is the recent roll call vote to fund the US government, take another slice of privacy from citizens, and make a number of other dubious policy choices. (Everything after the first comma depending upon your point of view.)

Whatever your politics though, you have to agree this is sub-optimal presentation, even for a government document.

This is no accident: sans the header, you will find the identical presentation of this very roll call vote at: page H10696, Congressional Record for December 18, 2015 (pdf).

It is disappointing that so much XML, XSLT, XQuery, etc., has been wasted duplicating non-informative print formatting. Or should I say less-informative formatting than is possible with XML?

Once the data is in XML, legend has it, users can transform that XML in ways more suited to their purposes and not those of the content providers.

I say “legend has it,” because we all know if content providers had their way, web navigation would be via ads and not bare hyperlinks. You want to see the next page? You must select the ad + hyperlink, waiting for the ad to clear before the resource appears.

I can summarize my opinion about content provider control over information legally delivered to my computer: Screw that!

If a content provider enables access to content, I am free to transform that content into speech, graphics, add information, take away information, in short do anything that my imagination desires and my skill enables.

Let’s take the roll call vote in the House of Representatives, Final Vote Results For Roll Call 705.

Just under the title you will read:

(Republicans in roman; Democrats in italic; Independents underlined)

Boring.

For a bulk display of voting results, we can do better than that.

What if we had small images to identify the respective parties? Here are some candidates (sic) for the Republicans:

r-photo1

r-photo-2

r-photo-3

Of course we would have to reduce them to icon size, but XML processing is rarely ever just XML processing. Nearly every project includes some other skill set as well.

Which one do you think looks more neutral? 😉

It would certainly be more colorful and, depending upon your inclinations, more fun to play about with than the difference between roman and italic. Yes?

Presentation of the data in http://clerk.house.gov/evs/2015/roll705.xml is only one of the possibilities that XQuery offers. Follow along and offer your suggestions for changes, additions and modifications.

First steps:

In the browser tab with Final Vote Results For Roll Call 705, use CNTR-u to view the page source. First notice that the boring web presentation is controlled by http://clerk.house.gov/evs/vote.xsl.

Copy and paste: http://clerk.house.gov/evs/vote.xsl into a new browser tab and select return. The resulting xsl:stylesheet is responsible for generating the original page, from the vote totals to column presentation of the results.

Pay particular attention to the generation of totals from the <vote-data> element and its children. That generation is powered by these lines in vote.xsl:

<xsl:apply-templates select="/rollcall-vote/vote-metadata"/>
<!-- Create total variables based on counts. -->
<xsl:variable name="y" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Yea'])"/>
<xsl:variable name="a" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Aye'])"/>
<xsl:variable name="yeas" select="$y + $a"/>
<xsl:variable name="nay" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Nay'])"/>
<xsl:variable name="no" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='No'])"/>
<xsl:variable name="nays" select="$nay + $no"/>
<xsl:variable name="nvs" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Not Voting'])"/>
<xsl:variable name="presents" select="count(/rollcall-vote/vote-data/recorded-vote/vote[text()='Present'])"/>
<br/>

(Not entirely, I omitted the purely formatting stuff.)

For tomorrow I will be working on a more “visible” way to identify political party affiliation and “borrowing” the count code from vote.xsl.
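If you want to jump ahead, here is a sketch of those counts ported to XQuery (mirroring the variables above, not code taken from vote.xsl):

let $votes := doc("http://clerk.house.gov/evs/2015/roll705.xml")//recorded-vote/vote
let $yeas := count($votes[. = ('Yea', 'Aye')])
let $nays := count($votes[. = ('Nay', 'No')])
let $nvs := count($votes[. = 'Not Voting'])
let $presents := count($votes[. = 'Present'])
return <p>Yeas: {$yeas} Nays: {$nays} Present: {$presents} Not Voting: {$nvs}</p>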

Enjoy!


You may be wondering what XQuery has to do with topic maps? Well, if you think about it, every time we select, aggregate, etc., data, we are making choices based on notions of subject identity.

That is we think the data we are manipulating represents some subjects and/or information about some subjects, that we find sensible (for some unstated reason) to put together for others to read.

The first step towards a topic map, however, is the putting of information together so we can judge what subjects need explicit representation and how we choose to identify them.

Prior topic map work was never explicit about how we get to a topic map; putting that possibly divisive question behind us, we simply started with topic maps, ab initio.

I was in the car when we took that turn and for the many miles since then. I have come to think that a better starting place is choosing subjects, what we want to say about them and how we wish to say it, so that we have only so much machinery as is necessary for any particular set of subjects.

Some subjects can be identified by IRIs, others by multi-dimensional vectors, still others by unspecified processes of deep learning, etc. Which ones we choose will depend upon the immediate ROI from subject identity and relationships between subjects.

I don’t need triples, for instance, to recognize natural languages to a sufficient degree of accuracy. Unnecessary triples, topics or associations are just padding. If you are on a per-triple contract, they make sense, otherwise, not.

A long way of saying that subject identity lurks just underneath the application of XQuery and we will see where it is useful to call subject identity to the fore.

January 2, 2016

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery (Part 2)

Filed under: XML,XQuery — Patrick Durusau @ 7:30 pm

Despite heavy carousing during the holidays, you may still remember Great R packages for data import, wrangling & visualization [+ XQuery], where I re-sorted the table by Sharon Machlis, to present the R packages in package name order.

I followed that up with: Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery, where I detailed the travails of trying to sort the software packages by their short descriptions, again in alphabetical order. My assumption in that post was that either the spaces or the “,” commas in the descriptions were fouling the order by clause.

That wasn’t the case, which I should have known because the string operator always returns a string. That is, the spaces and “,” inside are just part of a string, nothing more.
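A quick check makes the point (the cell below is typed in by hand, in the shape of the table’s second column):

(: the comma and spaces are just characters in one string value,
   "data import, data export", a perfectly good single sort key :)
lower-case(string(<td>data import, data export</td>))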

The up-side of the problem was that I spent more than a little while with Walmsley’s XQuery book, searching for ever more esoteric answers.

Here’s the failing XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

And here is the working XQuery:

<html>
<body>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</body>
</html>

Here is the mistake highlighted:

order by lower-case(string($row/td[2]/a))

My first mistake was the inclusion of “/a” in the path. Using string() on $row/td[1], that is, without /a at the end of the path, gives the original script the same result. (Run that for yourself on favorite-R-packages.xml.)

Make any path as long as required and no longer!

My second mistake was not checking the XPath immediately upon the failure of the sort. (The simplest answer is usually the correct one.)

Enjoy!


Update: Removed the quote marks around (string($row/td[2])) in both queries; they were part of an explanation that did not make the cut. Thanks to XQuery for the catch!

January 1, 2016

XQilla-2.3.2 – Tooling up for 2016 (Part 2) (XQuery)

Filed under: Virtual Machines,XML,XQilla,XQuery — Patrick Durusau @ 5:03 pm

As I promised yesterday, a solution to the XQilla-2.3.2 installation problem!

Using a virtual machine to install the latest version of Ubuntu (15.10), which had the libraries required to install XQilla!

I use VirtualBox from Oracle but people also use VMware.

Virtual boxes come in all manner of configurations so you are likely to spend some time loading linux headers and the like to compile software.

The advantage of a virtual box is that I don’t have to risk doing something dumb or out of fatigue to my working setup. If I have to blow away the entire virtual machine, it takes only a few minutes to download another one.

Well, on any day other than New Year’s Day, as I found out today. I don’t know if people were streaming that many football games or streaming live “acts” of some sort, but the Net was very slow today.

Introducing XQuery to humanists, librarians and reporters using a VM with the usual XQuery suspects pre-loaded would be very cool!

Great way to distribute xqueries* and shell scripts that run them for immediate results.

If you have any thoughts about what such a VM should contain, etc., drop me an email patrick@durusau.net or leave a comment. Thanks!

PS: XQueries returned approximately 26K “hits,” and xquerys returned approximately 1,700 “hits.” Usage favors the plural as “xqueries” so that is what I am following. At the first of a sentence, XQueries?

PPS: I could have written this without the woes of failed downloads, missing header files, etc. but I wanted to know for myself that Ubuntu (15.10) with all the appropriate header files would in fact compile XQilla-2.3.2.

You may need this line to get all the headers:

apt-get install dkms build-essential linux-headers-generic

Not to mention that I would update everything before trying to compile software. Hard to say how long your VM has been on the shelf.

December 31, 2015

XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Filed under: XML,XQilla,XQuery — Patrick Durusau @ 8:38 pm

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

1. Download a source distribution of Xerces-C 3.1.2

2. Build Xerces-C

cd xerces-c-3.1.2/
./configure
make

4. Build XQilla

cd xqilla/
./configure –with-xerces=`pwd`/../xerces-c-3.1.2/
make

A few notes that may help:

Obtain Xerces-C 3.1.2 from its homepage.

Xerces project homepage. Home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and, Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

the README presumes you built xerces-c-3.1.2 in a sub-directory of the XQilla source. You could do that; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2

No error messages so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla 
   xqilla-commandline.o  
-L/home/patrick/working/xerces-c-3.1.2/src 
  /home/patrick/working/xerces-c-3.1.2/src/
.libs/libxerces-c.so ./.libs/libxqilla.so -lnsl -lpthread -Wl,-rpath 
-Wl,/home/patrick/working/xerces-c-3.1.2/src
/usr/bin/ld: warning: libicuuc.so.55, needed by 
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so, 
   not found (try using -rpath or -rpath-link)
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `uset_close_55'
/home/patrick/working/xerces-c-3.1.2/src/.libs/libxerces-c.so: 
   undefined reference to `ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory `/home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1

To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have libicuuc.so.55.

If I manually upgrade libraries, I might create an inconsistency package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” you should see some of the lists you can find at Congress.gov. Valid XML I’ll concede but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

December 30, 2015

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Filed under: XML,XQuery — Patrick Durusau @ 8:19 pm

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.

How often do you see strings with no whitespace? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation devtools
misc installr
data import readxl
data import, data export googlesheets
data import RMySQL
data import readr
data import, data export rio
data analysis psych
data wrangling, data analysis sqldf
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
data import, web scraping rvest
data wrangling, data analysis dplyr
data wrangling plyr
data wrangling reshape2
data wrangling tidyr
data wrangling, data analysis data.table
data wrangling stringr
data wrangling lubridate
data wrangling, data analysis zoo
data display editR
data display knitr
data display, data wrangling listviewer
data display DT
data visualization ggplot2
data visualization dygraphs
data visualization googleVis
data visualization metricsgraphics
data visualization RColorBrewer
data visualization plotly
mapping leaflet
mapping choroplethr
mapping tmap
misc fitbitScraper
Web analytics rga
Web analytics RSiteCatalyst
package development roxygen2
data visualization shiny
misc openxlsx
data wrangling, data analysis gmodels
data wrangling car
data visualization rcdimple
data wrangling foreach
data acquisition downloader
data wrangling scales
data visualization plotly

Err, that’s not right!

The XQuery from yesterday:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

XQuery from today, with the changes on lines 5 and 6:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[2]/a))
  6. return <tr>{$row/td[2]} {$row/td[1]}</tr>
  7. }</table>
  8. </html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document order, that is, in the order they appear in the source document.

Here’s a snippet of that document:

<table>
<tr>
<td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also 
essential if you want to easily install other packages from GitHub. Install it! 
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/" 
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
<td>Hadley Wickham & others</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili & others</td>
</tr>
<tr>
<td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>
</tr>
...
</table>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick one that can be easily adapted to the next batch of soiled data that comes through the door.

PS: Neither Saxon (9.7) nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

December 29, 2015

Great R packages for data import, wrangling & visualization [+ XQuery]

Filed under: Data Mining,R,Visualization,XQuery — Patrick Durusau @ 5:37 pm

Great R packages for data import, wrangling & visualization by Sharon Machlis.

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below show my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name ).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis

Enjoy!


I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon’s table, I copied the table to a separate file and ran this XQuery:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, in all lower-case on the first <td> element, while returning the same element as written in the original table, which gives better (IMHO) sort order than UPPERCASE followed by lowercase.

This same technique should make you the master of any simple tables you encounter on the web.

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

December 25, 2015

Facets for Christmas!

Filed under: XML,XPath,XQuery — Patrick Durusau @ 11:47 am

Facet Module

From the introduction:

Faceted search has proven to be enormously popular in the real world applications. Faceted search allows user to navigate and access information via a structured facet classification system. Combined with full text search, it provides user with enormous power and flexibility to discover information.

This proposal defines a standardized approach to support the Faceted search in XQuery. It has been designed to be compatible with XQuery 3.0, and is intended to be used in conjunction with XQuery and XPath Full Text 3.0.

Imagine my surprise when after opening Christmas presents with family to see a tweet by XQuery announcing yet another Christmas present:

“Facets”: A new EXPath spec w/extension functions & data models to enable faceted navigation & search in XQuery http://expath.org/spec/facet

The EXPath homepage says:

XPath is great. XPath-based languages like XQuery, XSLT, and XProc, are great. The XPath recommendation provides a foundation for writing expressions that evaluate the same way in a lot of processors, written in different languages, running in different environments, in XML databases, in in-memory processors, in servers or in clients.

Supporting so many different kinds of processor is a wonderful thing. But this also constrains which features are feasible at the XPath level and which are not. In the years since the release of XPath 2.0, experience has gradually revealed some missing features.

EXPath exists to provide specifications for such missing features in a collaborative- and implementation-independent way. EXPath also provides facilities to help and deliver implementations to as many processors as possible, via extensibility mechanisms from the XPath 2.0 Recommendation itself.

Other projects exist to define extensions for XPath-based languages or languages using XPath, as the famous EXSLT, and the more recent EXQuery and EXProc projects. We think that those projects are really useful and fill a gap in the XML core technologies landscape. Nevertheless, working at the XPath level allows common solutions when there is no sense in reinventing the wheel over and over again. This is just following the brilliant idea of the W3C’s XSLT and XQuery working groups, which joined forces to define XPath 2.0 together. EXPath purpose is not to compete with other projects, but collaborate with them.

Be sure to visit the resources page. It has a manageable listing of processors that handle extensions.

What would you like to see added to XPath?

Enjoy!

December 23, 2015

An XQuery Module For Simplifying Semantic Namespaces

Filed under: MarkLogic,Namespace,Semantics,XQuery — Patrick Durusau @ 10:26 am

An XQuery Module For Simplifying Semantic Namespaces by Kurt Cagle.

From the post:

While I enjoy working with the MarkLogic 8 server, there are a number of features about the semantics library there that I still find a bit problematic. Declaring namespaces for semantics in particular is a pain—I normally have trouble remembering the namespaces for RDF or RDFS or OWL, even after working with them for several years, and once you start talking about namespaces that are specific to your own application domain, managing this list can get onerous pretty quickly.

I should point out however, that namespaces within semantics can be very useful in helping to organize and design an ontology, even a non-semantic ontology, and as such, my applications tend to be namespace rich. However, when working with Turtle, Sparql, RDFa, and other formats of namespaces, the need to incorporate these namespaces can be a real showstopper for any developer. Thus, like any good developer, I decided to automate my pain points and create a library that would allow me to simplify this process.

The code given here is in turtle and xquery, but I hope to build out similar libraries for use in JavaScript shortly. When I do, I’ll update this article to reflect those changes.

If you are forced to use a MarkLogic 8 server, great post on managing semantic namespaces.

If you have a choice of tools, something to consider before you willingly choose to use a MarkLogic 8 server.

I first saw this in a tweet by XQuery.

December 17, 2015

My Bad – You Are Not! 747 Edits Away From Using XML Tools

Filed under: XPath,XQuery,XSLT — Patrick Durusau @ 4:11 pm

The original, unedited post is below but in response to comments, I checked the XQuery, XPath, XSLT and XQuery Serialization 3.1 files in Chrome (CNTR-U) before saving them.

All the empty elements were properly closed.

I then saved the files and re-opened in Emacs, to discover that Chrome had stripped the “/” from the empty elements, which then caused BaseX to complain. It was an accurate complaint but the files I was tossing against BaseX were not the files as published by the W3C.

So now I need to file a bug report on Chrome, Version 47.0.2526.80 (64-bit) on Ubuntu, for mangling closed empty elements.


You could tell in XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations! that I was really excited to see the new drafts hit the street.

Me and my big mouth.

I grabbed copies of all three and tossed the XQuery draft against an xquery to create a list of all the paths in it. Simple enough.
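The path listing query was along these lines (a sketch, not necessarily the exact query I ran, against a local copy of the draft):

(: list the distinct element paths that occur in the draft :)
distinct-values(
  for $e in doc("XQuery3.1.html")//*
  return string-join($e/ancestor-or-self::*/local-name(), '/')
)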

The results weren’t.

Here is the first error message:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 68): The element type "link" must be terminated by the matching end-tag "</link>".

Ouch!

I corrected that and running the query a second time I got:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 68): The element type "meta" must be terminated by the matching end-tag "</meta>".

The <meta> elements appear on lines three and four.

On the third try:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 69): The element type "img" must be terminated by the matching end-tag "</img>".

There are 3 <img> elements that are not closed.

I’m getting fairly annoyed at this point.

Fourth try:

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 78): The element type "br" must be terminated by the matching end-tag "</br>".

Of course at this point I revert to grep and discover there are 353 <br> elements that are not closed.

Sigh, nothing to do but correct and soldier on.

Fifth attempt.

[FODC0002] "file:/home/patrick/working/w3c/XQuery3.1.html" (Line 17618): The element type "hr" must be terminated by the matching end-tag "</hr>".

There are 2 <hr> elements that are not closed.

A total of 361 edits in order to use XML based tools with the most recent XQuery 3.1 Candidate draft.

The most recent XPath 3.1 has 238 empty elements that aren’t closed (same elements as XQuery 3.1).

The XSLT and XQuery Serialization 3.1 draft has 149 empty elements that aren’t closed, same as the other but with the addition of four <col> elements that weren’t closed.

Grand total: 747 edits in order to use XML tools.

Not an editorial but a production problem. A rather severe one it seems to me.

Anyone who wants to use XML tools on these drafts will have to perform the same edits.

XQuery, XPath, XSLT and XQuery Serialization 3.1, New Candidate Recommendations!

Filed under: W3C,XPath,XQuery,XSLT — Patrick Durusau @ 11:10 am

As I forecast 😉 earlier this week, new Candidate Recommendations for:

XQuery 3.1: An XML Query Language

XML Path Language (XPath) 3.1

XSLT and XQuery Serialization 3.1

have hit the streets for your review and comments!

Comments due by 2016-01-31.

That’s forty-five days, minus the ones spent with drugs/sex/rock-n-roll over the holidays and recovering from same.

Say something shy of forty-four actual working days (my endurance isn’t what it once was) for the review process.

What tools, techniques are you going to use to review this latest set of candidates?

BTW, some people review software and check only fixes; for standards, I start at the beginning, go to the end, then stop. (Or the reverse for backward proofing.)

My estimates on days spent with drugs/sex/rock-n-roll are approximate only and your experience may vary.

December 14, 2015

35 Lines XQuery versus 604 of XSLT: A List of W3C Recommendations

Filed under: BaseX,Saxon,XML,XQuery,XSLT — Patrick Durusau @ 10:16 pm

Use Case

You should be familiar with the W3C Bibliography Generator. You can insert one or more URLs and the generator produces correctly formatted citations for W3C work products.

It’s quite handy but requires a URL to produce a useful response. I need authors to use correctly formatted W3C citations and asking them to find URLs and to generate correct citations was a bridge too far. Simply didn’t happen.

My current attempt is to produce a list of correctly formatted W3C citations in HTML. Authors can use CTRL-F in their browsers to find citations. (Time will tell if this is a successful approach or not.)

Goal: An HTML page of correctly formatted W3C Recommendations, sorted by title (ignoring case because W3C Recommendations are not consistent in their use of case in titles). “Correctly formatted” meaning that it matches the output from the W3C Bibliography Generator.

Resources

As a starting point, I viewed the source of http://www.w3.org/2002/01/tr-automation/tr-biblio.xsl, the XSLT script that generates the XHTML page with its responses.

The first XSLT script imports two more XSLT scripts, http://www.w3.org/2001/08/date-util.xslt and http://www.w3.org/2001/10/str-util.xsl.

I’m not going to reproduce the XSLT here, but can say that starting with <stylesheet> and ending with </stylesheet>, inclusive, I came up with 604 lines.

You will need to download the file used by the W3C Bibliography Generator, tr.rdf.

XQuery Script

I have used the XQuery script successfully with: BaseX 8.3, eXide 2.1.3 and SaxonHE-6-07J.

Here’s the prolog:

declare default element namespace "http://www.w3.org/2001/02pd/rec54#";
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace dc = "http://purl.org/dc/elements/1.1/"; 
declare namespace doc = "http://www.w3.org/2000/10/swap/pim/doc#";
declare namespace contact = "http://www.w3.org/2000/10/swap/pim/contact#";
declare namespace functx = "http://www.functx.com";
declare function functx:substring-after-last
($string as xs:string?, $delim as xs:string) as xs:string?
{
if (contains ($string, $delim))
then functx:substring-after-last(substring-after($string, $delim), $delim)
else $string
};

Declaring the namespaces and, from Priscilla Walmsley’s excellent FunctX XQuery Functions site, the functx:substring-after-last function.

<html>
<head>XQuery Generated W3C Recommendation List</head>
<body>
<ul class="ul">

Start the HTML page and the unordered list that will contain the list items.

{
for $rec in doc("tr.rdf")//REC
    order by upper-case($rec/dc:title)

If you sort W3C Recommendations by dc:title and don’t specify upper-case, rdf:PlainLiteral: A Datatype for RDF Plain Literals,
rdf:PlainLiteral: A Datatype for RDF Plain Literals (Second Edition), and xml:id Version 1.0, appear at the end of the list sorted by title. Dirty data isn’t limited to databases.

return <li class="li">
  <a href="{string($rec/@rdf:about)}"> {string($rec/dc:title)} </a>, 
   { for $auth in $rec/editor
   return
   if (contains(string($auth/contact:fullName), "."))
   then (concat(string($auth/contact:fullName), ","))
   else (concat(concat(concat(substring(substring-before(string($auth/\
   contact:fullName), ' '), 0, 2), ". "), (substring-after(string\
   ($auth/contact:fullName), ' '))), ","))}

Watch for the line continuation marker “\”.

We begin by grabbing the URL and title for an entry and then confront dirty author data. The standard author listing by the W3C creates an initial plus a period for the author’s first name and then concatenates the rest of the author’s name to that initial plus period.

Problem: There is one entry for authors that already has initials, T.V. Raman, so I had to account for that one entry (as does the XSLT).

{if (count ($rec/editor) >= 2) then " Editors," else " Editor,"}
W3C Recommendation, 
{fn:format-date(xs:date(string($rec/dc:date)), "[MNn] [D], [Y]") }, 
{string($rec/@rdf:about)}. <a href="{string($rec/doc:versionOf/\
@rdf:resource)}">Latest version</a> \
available at {string($rec/doc:versionOf/@rdf:resource)}.
<br/>[Suggested label: <strong>{functx:substring-after-last(upper-case\
(replace(string($rec/doc:versionOf/@rdf:resource), '/$', '')), "/")}\
</strong>]<br/></li>} </ul></body></html>

Nothing remarkable here, except that I snipped the concluding “/” off of the values from doc:versionOf/@rdf:resource so I could use functx:substring-after-last to create the token for a suggested label.
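As a worked example, for a latest-version URI of the usual /TR/ shortname form (the URI below is my illustration, not pulled from tr.rdf), the label expression evaluates like this:

(: strips the trailing slash, upper-cases the URI, and returns "XQUERY-31" :)
functx:substring-after-last(
  upper-case(replace("http://www.w3.org/TR/xquery-31/", '/$', '')), "/")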

Comments / Omissions

I depart from the XSLT in one case. It calls http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf here:

<!-- Special casing for when we have the name in Original Script (e.g. in \
Japanese); currently assume that the order is inversed in this case... -->

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

But that refers to only one case:

<REC rdf:about="http://www.w3.org/TR/2003/REC-SVG11-20030114/">
<dc:date>2003-01-14</dc:date>
<dc:title>Scalable Vector Graphics (SVG) 1.1 Specification</dc:title>

Where Jun Fujisawa appears as an editor.

Recalling my criteria for “correctness” being the output of the W3C Bibliography Generator:

svg-cite-image

Preparing for this post made me discover at least one bug in the XSLT that was supposed to report the name in original script:

<xsl:when test="document('http://www.w3.org/2002/01/tr-automation/\
known-tr-editors.rdf')/rdf:RDF/*[contact:lastNameInOriginalScript=\
substring-before(current(),' ')]">

Whereas the entry in http://www.w3.org/2002/01/tr-automation/known-tr-editors.rdf reads:

<rdf:Description>
<rdf:type rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
<firstName>Jun</firstName>
<firstNameInOriginalScript>藤沢 淳</firstNameInOriginalScript>
<lastName>Fujisawa</lastName>
<sortName>Fujisawa</sortName>
</rdf:Description>

Since the W3C Bibliography Generator doesn’t produce the name in original script, neither do I. When the W3C fixes its output, I will have to amend this script to pick up that entry.

String

While writing this query I found text(), fn:string() and fn:data() by Dave Cassels. Recommended reading. The weakness of text() is that if markup is inserted inside your target element after you write the query, you will get unexpected results. The use of fn:string() avoids that sort of surprise.
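A minimal sketch of the difference (the element here is hypothetical):

let $title :=
  <title>rdf:PlainLiteral: A Datatype for <abbr>RDF</abbr> Plain Literals</title>
return (
  $title/text(),  (: two text nodes; the text inside <abbr> is not among them :)
  string($title)  (: "rdf:PlainLiteral: A Datatype for RDF Plain Literals" :)
)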

Recommendations Only

Unlike the W3C Bibliography Generator, my script as written only generates entries for Recommendations. It would be trivial to modify the script to include drafts, notes, etc., but I chose to not include material that should not be used as normative citations.
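For the curious, a hedged sketch of that modification. It assumes tr.rdf uses element names such as WD and NOTE for those document classes, alongside the REC elements shown above; namespace declarations are as in the full query and elided here:

(: Recommendations only :)
for $rec in doc("tr.rdf")//REC
...

(: Recommendations, Working Drafts, and Notes :)
for $rec in doc("tr.rdf")//(REC | WD | NOTE)
...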

I can see the usefulness of the bibliography generator for works in progress, but outside the W3C, citing Recommendations is the better course.

Contra Search

The SpecRef project has a searchable interface to all the W3C documents. If you search for XQuery, the interface returns 385 “hits.”

Contrast that with using Ctrl-F on the list of Recommendations generated by the XQuery script: controlling for case, “XQuery” produced only 23 “hits.”

There are reasons for using search, but users repeatedly mining results of searches that could be captured (it was called curation once upon a time) is wasteful.

Reading

I can’t recommend Priscilla Walmsley’s XQuery, 2nd Edition strongly enough.

There is one danger to Walmsley’s book. You will be so ready to start using XQuery after the first ten chapters that it’s hard to find the time to read the remaining ones. Great stuff!

You can download the XQuery file, tr.rdf and the resulting html file at: 35LinesOfXQuery.zip.

Congress.gov Enhancements: Quick Search, Congressional Record Index, and More

Filed under: Government,Government Data,XML,XQuery — Patrick Durusau @ 9:12 pm

New End of Year Congress.gov Enhancements: Quick Search, Congressional Record Index, and More by Andrew Weber.

From the post:

In our quest to retire THOMAS, we have made many enhancements to Congress.gov this year.  Our first big announcement was the addition of email alerts, which notify users of the status of legislation, new issues of the Congressional Record, and when Members of Congress sponsor and cosponsor legislation.  That development was soon followed by the addition of treaty documents and better default bill text in early spring; improved search, browse, and accessibility in late spring; user driven feedback in the summer; and Senate Executive Communications and a series of Two-Minute Tip videos in the fall.

Today’s update on end of year enhancements includes a new Quick Search for legislation, the Congressional Record Index (back to 1995), and the History of Bills from the Congressional Record Index (available from the Actions tab).  We have also brought over the State Legislature Websites page from THOMAS, which has links to state level websites similar to Congress.gov.

Text of legislation from the 101st and 102nd Congresses (1989-1992) has been migrated to Congress.gov. The Legislative Process infographic that has been available from the homepage as a JPG and PDF is now available in Spanish as a JPG and PDF (translated by Francisco Macías). Margaret and Robert added Fiscal Year 2003 and 2004 to the Congress.gov Appropriations Table. There is also a new About page on the site for XML Bulk Data.

The Quick Search provides a form-based search with fields similar to those available from the Advanced Legislation Search on THOMAS.  The Advanced Search on Congress.gov is still there with many additional fields and ways to search for those who want to delve deeper into the data.  We are providing the new Quick Search interface based on user feedback, which highlights selected fields most likely needed for a search.

There’s an impressive summary of changes!

Speaking of practicing programming, are you planning on practicing XQuery on congressional data in the coming year?

XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts (soon!)

Filed under: W3C,XPath,XQuery,XSLT — Patrick Durusau @ 4:04 pm

XQuery, XPath, XSLT and XQuery Serialization 3.1 (Back-to-Front) Drafts will be published quite soon so I wanted to give you a heads up on your holiday reading schedule.

This is deep enough in the review cycle that a back-to-front reading is probably your best approach.

You have read the drafts and corrections often enough by this point that you read the first few words of a paragraph, “know” what it says, and move on. (At the very least I can report that happens to me.)

By back-to-front reading I mean to start at the end of each draft and read the last sentence and then the next to last sentence and so on.

The back-to-front process does two things:

  1. You are forced to read each sentence on its own.
  2. It prevents skimming and filling in errors with silent corrections (unknown to your conscious mind).

The back-to-front method is quite time consuming, so it’s fortunate these drafts are due to appear just before a series of holidays in a large number of places.

I hesitate to mention it but there is another way to proof these drafts.

If you have XML experienced visitors, you could take turns reading the drafts to each other. It was a technique used by copyists many years ago where one person read and two others took down the text. The two versions were then compared to each other and the original.

Even with a great reading voice, I’m not certain many people would be up to that sort of exercise.

PS: I will post on the new drafts as soon as they are published.

December 10, 2015

Cursive 1.0, Gingerbread Cuneiform Tablets, XQuery

Filed under: Clojure,ClojureScript,Editor,XQuery — Patrick Durusau @ 4:32 pm

Cursive 1.0

From the webpage:

Cursive 1.0 is finally here! Here’s everything you need to know about the release.

One important thing first, we’ve had to move some things around. Our website is now at cursive-ide.com, we’re on Twitter at @CursiveIDE, and our Github handle is now cursive-ide. Hopefully everything should be redirected as much as possible, but let me know if you can’t find anything or if you find anything where it shouldn’t be. The main Cursive email address is now cursive@cursive-ide.com but the previous one will continue to work. The mailing lists will continue to use the previous domain for the moment.

Cursive 1.0 is considered a stable build, and is now in the JetBrains plugin repo (here). Going forward, we’ll have a stable release channel which will have new builds every 2-3 months, and an EAP channel which will have builds more or less as they have been during the EAP program. Stable builds will be published to the JetBrains repo, and EAP builds will continue to be published to our private repo. You’ll be prompted at startup which channel you’d like to use, and Cursive will set up the new URLs for the EAP channel automatically if required. Stable builds will be numbered x.y.0, and EAP builds will use x.y.z numbering.

From https://cursive-ide.com/:

Cursive is available as an IntelliJ plugin for use with the Community or Ultimate editions, and will be available in the future as a standalone Clojure-focused IDE. It is a commercial product, with a free non-commercial licence for open-source work, personal hacking, and student work.

I’m not brave enough to install software on someone’s computer as a “surprise” holiday present. Unless you are a sysadmin, I suggest you don’t either.

Of course, you could take the UPenn instructions for making gingerbread cuneiform tablets and make a “card” for the lucky recipient of a license for the full version of Cursive 1.0.

For that matter, you could inscribe smaller gingerbread pieces with bits of XQuery syntax and play holiday games with arbitrary XML documents. Timed trials for the most imaginative XQuery, shortest, etc.

PS: Suggestions on an XQuery corpus to determine the frequency of XQuery syntax representations? “for,” “(,” and “)” are likely the most common tokens, but that’s just guessing on my part.

The Preservation of Favoured Traces [Multiple Editions of Darwin]

Filed under: Books,Graphics,Text Encoding Initiative (TEI),Visualization,XQuery — Patrick Durusau @ 1:19 pm

The Preservation of Favoured Traces

From the webpage:

Charles Darwin first published On the Origin of Species in 1859, and continued revising it for several years. As a result, his final work reads as a composite, containing more than a decade’s worth of shifting approaches to his theory of evolution. In fact, it wasn’t until his fifth edition that he introduced the concept of “survival of the fittest,” a phrase that actually came from philosopher Herbert Spencer. By color-coding each word of Darwin’s final text by the edition in which it first appeared, our latest book and poster of his work trace his thoughts and revisions, demonstrating how scientific theories undergo adaptation before their widespread acceptance.

The original interactive version was built in tandem with exploratory and teaching tools, enabling users to see changes at both the macro level, and word-by-word. The printed poster allows you to see the patterns where edits and additions were made and—for those with good vision—you can read all 190,000 words on one page. For those interested in curling up and reading at a more reasonable type size, we’ve also created a book.

The poster and book are available for purchase below. All proceeds are donated to charity.

For textual history fans this is an impressive visualization of the various editions of On the Origin of Species.

To help students get away from the notion of texts as static creations, and to give them some experience with markup, consider choosing a well-known work with multiple editions that is available in TEI.

Then have the students write XQuery expressions to transform a chapter of such a work into a later (or earlier) edition.

Depending on the quality of the work, that could be a means of contributing to the number of TEI encoded texts and your students would gain experience with both TEI and XQuery.
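A hedged sketch of one way such an exercise might start, assuming the text is encoded with a standard TEI critical apparatus (tei:app/tei:rdg) and that each reading carries a @wit pointer to an edition; the file name, chapter selection, and the “#ed5” identifier are all made up:

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Rebuild a chapter as a single edition: wherever the apparatus records
   variant readings, keep only the reading attributed to that edition. :)
declare function local:edition($node as node(), $ed as xs:string) as node()*
{
  typeswitch ($node)
    case element(tei:app)
      return $node/tei:rdg[@wit = $ed]/node() ! local:edition(., $ed)
    case element()
      return element { node-name($node) }
             { $node/@*, $node/node() ! local:edition(., $ed) }
    default
      return $node
};

local:edition(doc("origin.xml")//tei:div[@type = 'chapter'][3], '#ed5')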

December 8, 2015

Congress: More XQuery Fodder

Filed under: Government,Government Data,Law - Sources,XML,XQuery — Patrick Durusau @ 8:07 pm

Congress Poised for Leap to Open Up Legislative Data by Daniel Schuman.

From the post:

Following bills in Congress requires three major pieces of information: the text of the bill, a summary of what the bill is about, and the status information associated with the bill. For the last few years, Congress has been publishing the text and summaries for all legislation moving in Congress, but has not published bill status information. This key information is necessary to identify the bill author, where the bill is in the legislative process, who introduced the legislation, and so on.

While it has been in the works for a while, this week Congress confirmed it will make “Bill Statuses in XML format available through the GPO’s Federal Digital System (FDsys) Bulk Data repository starting with the 113th Congress,” (i.e. January 2013). In “early 2016,” bill status information will be published online in bulk– here. This should mean that people who wish to use the legislative information published on Congress.gov and THOMAS will no longer need to scrape those websites for current legislative information, but instead should be able to access it automatically.

Congress isn’t just going to pull the plug without notice, however. Through the good offices of the Bulk Data Task Force, Congress will hold a public meeting with power users of legislative information to review how this will work. Eight sample bill status XML files and draft XML User Guides were published on GPO’s GitHub page this past Monday. Based on past positive experiences with the Task Force, the meeting is a tremendous opportunity for public feedback to make sure the XML files serve their intended purposes. It will take place next Tuesday, Dec. 15, from 1-2:30. RSVP details below.

If all goes as planned, this milestone has great significance.

  • It marks the publication of essential legislative information in a format that supports unlimited public reuse, analysis, and republication. It will be possible to see much of a bill’s life cycle.
  • It illustrates the positive relationship that has grown between Congress and the public on access to legislative information, where there is growing open dialog and conversation about how to best meet our collective needs.
  • It is an example of how different components within the legislative branch are engaging with one another on a range of data-related issues, sometimes for the first time ever, under the aegis of the Bulk Data Task Force.
  • It means the Library of Congress and GPO will no longer be tied to the antiquated THOMAS website and can focus on more rapid technological advancement.
  • It shows how a diverse community of outside organizations and interests came together and built a community to work with Congress for the common good.

To be sure, this is not the end of the story. There is much that Congress needs to do to address its antiquated technological infrastructure. But considering where things were a decade ago, the bulk publication of information about legislation is a real achievement, the culmination of a process that overcame high political barriers and significant inertia to support better public engagement with democracy and smarter congressional processes.

Much credit is due in particular to leadership in both parties in the House who have partnered together to push for public access to legislative information, as well as the staff who worked tirelessly to make it happen.

If you look at the sample XML files, pay close attention to the <bioguideID> element and its contents. It is the same value as you will find for roll-call votes, but there the value appears in the name-id attribute of the <legislator> element. See: http://clerk.house.gov/evs/2015/roll643.xml and do view source.

Oddly, the <bioguideID> element does not appear in the documentation on GitHub; you just have to know the correspondence to the name-id attribute of the <legislator> element.

As I said in the title, this is going to be XQuery fodder.

XQuery, 2nd Edition, Updated! (A Drawback to XQuery)

Filed under: XML,XPath,XQuery — Patrick Durusau @ 3:57 pm

XQuery, 2nd Edition, Updated! by Priscilla Walmsley.

The updated version of XQuery, 2nd Edition has hit the streets!

As a plug for the early release program at O’Reilly, yours truly appears in the acknowledgments (page xxii) for having submitted comments on the early release version of XQuery. You can too. Early release participation is yet another way to contribute back to the community.

There is one drawback to XQuery which I discuss below.

For anyone not fortunate enough to already have a copy of XQuery, 2nd Edition, here is the full description from the O’Reilly site:

The W3C XQuery 3.1 standard provides a tool to search, extract, and manipulate content, whether it’s in XML, JSON or plain text. With this fully updated, in-depth tutorial, you’ll learn to program with this highly practical query language.

Designed for query writers who have some knowledge of XML basics, but not necessarily advanced knowledge of XML-related technologies, this book is ideal as both a tutorial and a reference. You’ll find background information for namespaces, schemas, built-in types, and regular expressions that are relevant to writing XML queries.

This second edition provides:

  • A high-level overview and quick tour of XQuery
  • New chapters on higher-order functions, maps, arrays, and JSON
  • A carefully paced tutorial that teaches XQuery without being bogged down by the details
  • Advanced concepts for taking advantage of modularity, namespaces, typing, and schemas
  • Guidelines for working with specific types of data, such as numbers, strings, dates, URIs, maps and arrays
  • XQuery’s implementation-specific features and its relationship to other standards including SQL and XSLT
  • A complete alphabetical reference to the built-in functions, types, and error messages

Drawback to XQuery:

You know I hate to complain, but the brevity of XQuery is a real drawback to billing.

For example, I have a post pending on taking 604 lines of XSLT down to 35 lines of XQuery.

Granted, the XQuery is easier to maintain, modify, and extend, but all a client will see is the 35 lines of XQuery. At least 604 lines of XSLT looks like you really worked to produce something.

I know about XQueryX but I haven’t seen any automatic way to convert XQuery into XQueryX. Am I missing something obvious? If that’s possible, I could just bulk up the deliverable with an XQueryX expression of the work and keep the XQuery version for production use.

As excellent as I think XQuery and Walmsley’s book both are, I did want to warn you about the brevity of your XQuery deliverables.

I look forward to finishing XQuery, 2nd Edition. I started doing so many things based on the first twelve or so chapters that I just read selectively from that point on. It merits a complete read. You won’t be sorry you did.

December 1, 2015

Progress on Connecting Votes and Members of Congress (XQuery)

Filed under: Government,Government Data,XQuery — Patrick Durusau @ 5:34 pm

I’m not nearly at the planned end point, but I have corrected a file I generated with XQuery that provides the name-id numbers for members of the House and a link to their websites at house.gov.

It is a rough draft but you can find it at: http://www.durusau.net/publications/name-id-member-website-draft.html.

While I was casting about for the resources for this posting, I had the sinking feeling that I had wasted a lot of time and effort when I found: http://clerk.house.gov/xml/lists/MemberData.xml.

But, if you read that file carefully, what is the one thing it lacks?

A link to every member’s website at “….house.gov.”

Isn’t that interesting?

Of all the things to omit, why that one?

Especially since you can’t auto-generate the website names from the member names. What appear to be older site names use just the member’s last name. But that strategy must have broken down pretty quickly when members with the same last name appeared.

The conflicting names and even some non-conflicting names follow a new naming protocol that appears to be firstname+lastname.house.gov.

That will work for a while until the next generation starts inheriting positions in the House.

Anyway, that is as far as I got today, but at least it is a useful list for looking up the name-id of a member of the House and obtaining their website.

The next step will be hitting the websites to extract contact information.

Yes, I know that http://clerk.house.gov/xml/lists/MemberData.xml has the “official” contact information, along with their forms for email, etc.

If I wanted to throw my comment into a round file I could do that myself.

No, what I want to extract is their local office data so that when they are “back home” meeting with constituents, the average voter has a better chance of being one of those constituents. Not just those who maxed out on campaign donation limits.

November 30, 2015

Connecting Roll Call Votes to Members of Congress (XQuery)

Filed under: Data Mining,Government,Government Data,XQuery — Patrick Durusau @ 10:29 pm

Apologies for the lack of posting today but I have been trying to connect up roll call votes in the House of Representatives to additional information on members of Congress.

In case you didn’t know, roll call votes are reported in XML and have this form:

<recorded-vote>
  <legislator name-id="A000374" sort-field="Abraham" unaccented-name="Abraham"
    party="R" state="LA" role="legislator">Abraham</legislator>
  <vote>Aye</vote>
</recorded-vote>
<recorded-vote>
  <legislator name-id="A000370" sort-field="Adams" unaccented-name="Adams"
    party="D" state="NC" role="legislator">Adams</legislator>
  <vote>No</vote>
</recorded-vote>
<recorded-vote>
  <legislator name-id="A000055" sort-field="Aderholt" unaccented-name="Aderholt"
    party="R" state="AL" role="legislator">Aderholt</legislator>
  <vote>Aye</vote>
</recorded-vote>
<recorded-vote>
  <legislator name-id="A000371" sort-field="Aguilar" unaccented-name="Aguilar"
    party="D" state="CA" role="legislator">Aguilar</legislator>
  <vote>Aye</vote>
</recorded-vote>
...

For a full example: http://clerk.house.gov/evs/2015/roll643.xml

With the name-id attribute value, I can automatically construct URIs to the Biographical Directory of the United States Congress, for example, the entry on Abraham, Ralph.

More information than a poke with a sharp stick would give you, but it’s only self-serving cant.
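A minimal sketch of that URI construction over the roll call file linked above (the Biographical Directory URL pattern shown here is an assumption about its query interface, so verify it before relying on it):

for $vote in doc("http://clerk.house.gov/evs/2015/roll643.xml")//recorded-vote
return concat("http://bioguide.congress.gov/scripts/biodisplay.pl?index=",
              string($vote/legislator/@name-id))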

One of the things that would be nice to link up with roll call votes would be the homepages of those voting.

Continuing with Ralph Abraham, mapping A000374 to https://abraham.house.gov/ would be helpful in gathering other information, such as the various offices where Representative Abraham can be contacted.

If you are reading the URIs, you might think just prepending the last name of each representative to “house.gov” would be sufficient. Well, it would be, except that there are eighty-three cases where representatives share last names and/or a newer naming scheme uses more than the last name + house.gov.

After I was satisfied that there wasn’t a direct mapping between the current uses of name-id and House member websites, I started creating such a mapping that you can drop into XQuery as a lookup table and/or use as an external file.
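A hedged sketch of what such a lookup table might look like inline in XQuery (only the Abraham entry is taken from this post; the rest of the table is exactly the part that has to be built by hand):

declare variable $member-sites :=
  <members>
    <member name-id="A000374" site="https://abraham.house.gov/"/>
    <!-- ... one entry per member of the House ... -->
  </members>;

declare function local:site($name-id as xs:string) as xs:string?
{
  data($member-sites/member[@name-id = $name-id]/@site)
};

local:site("A000374")  (: => "https://abraham.house.gov/" :)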

The lookup table should be finished tomorrow so check back.

PS: Yes, I am aware there are tables of contact information for members of Congress, but I have yet to see one that lists all their local offices. Moreover, a lookup table for XQuery may encourage people to connect more data to their representatives, such as articles in local newspapers, property deeds, and other such material.

November 28, 2015

Saxon 9.7 Release!

Filed under: Saxon,XQuery,XSLT — Patrick Durusau @ 4:29 pm

I saw a tweet from Michael Kay announcing that Saxon 9.7 has been released!

Saxon 9.7 is up to date with the XSLT 3.0 Candidate Recommendation released by the W3C on November 19, 2015.

From the details page:

  • XSLT 3.0 implementation largely complete (requires Saxon-PE or Saxon-EE): The new XSLT 3.0 Candidate Recommendation was published on 19 November 2015, and Saxon 9.7 is a complete implementation with a very small number of exceptions. Apart from general updating of Saxon as the spec has developed, the main area of new functionality is in packaging, which allows stylesheet modules to be independently compiled and distributed, and provides much more “software engineering” control over public and private interfaces, and the like. The ability to save packages in compiled form gives much faster loading of frequently used stylesheets.
  • Schema validation: improved error reporting. The schema validator now offers customised error reporting, with an option to create an XML report detailing all validation errors found. This has structured information about each error so the report can readily be customised; it has been developed in conjunction with some of our IDE partners who can use this information to provide an improved interactive presentation of the validation report.
  • Arrays, Maps, and JSON: Arrays are implemented as a new data type (defined in XPath 3.1). Along with maps, which were already available in Saxon 9.6, this provides the infrastructure for full support of JSON, including functions such as parse-json() which converts JSON to a structure of maps and arrays, and the JSON serialization method which does the inverse.
  • Miscellaneous new functions: two of the most interesting are random-number-generator(), and parse-ietf-date().
  • Streaming: further improvements to the set of constructs that can be streamed, and the diagnostics when constructs cannot be streamed.
  • Collections: In line with XPath 3.1 changes, a major overhaul of the way collections work. They can now contain any kind of item, and new abstractions are provided to give better control over asynchronous and parallel resource fetching, parsing, and validation.
  • Concurrency improvements: Saxon 9.6 already offered various options for executing stylesheets in parallel to take advantage of multi-core processors. These facilities have now been tuned for performance and made more robust, by taking advantage of more advanced concurrency features in the JDK platform. The Saxon NamePool, which could be a performance bottleneck in high throughput workloads, has been completely redesigned to allow much higher concurrency.
  • Cost-based optimization: Saxon’s optimizer now makes cost estimates in order to decide the best execution strategy. Although the estimates are crude, they have been found to make a vast difference to the execution speed of some stylesheets. Saxon 9.7 particularly addresses the performance of XSLT pattern matching.
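As a quick taste of the XPath 3.1 maps-and-JSON support mentioned in that list (a minimal sketch; it needs a 3.1-capable processor such as Saxon 9.7):

let $config := parse-json('{"state": "GA", "roll": 705}')
return concat("http://clerk.house.gov/evs/2015/roll", $config?roll, ".xml")
(: parse-json() returns a map; ?roll is the 3.1 lookup operator;
   the result is "http://clerk.house.gov/evs/2015/roll705.xml" :)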

There was no indication of when these features will appear in the free “home edition.”

In the meantime, you can go the Online Shop for Saxonica.

Currency conversion rates vary but as of today, Saxon-PE (Professional Edition) is about $75 U.S. and some change.

I’m considering treating myself to Saxon-PE as a Christmas present to myself.

And you?

November 24, 2015

XQuery and XPath Full Text 3.0 (Recommendation)

Filed under: Searching,XPath,XQuery — Patrick Durusau @ 4:29 pm

XQuery and XPath Full Text 3.0

From 1.1 Full-Text Search and XML:

As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 3.0 and XPath 3.0.

XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.

Full-text search is different from substring search in many ways:

  1. A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string “lease” will return a news item that contains “Foobar Corporation releases version 20.9 …”. A full-text search for the token “lease” will not.
  2. There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is “find me all the news items that contain a token with the same linguistic stem as ‘mouse'” (finds “mouse” and “mice”). Another example based on token proximity is “find me all the news items that contain the tokens ‘XML’ and ‘Query’ allowing up to 3 intervening tokens”.
  3. Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for “mouse”, there is only 1 expected result set. When you do a full-text search for all the news items that contain the token “mouse”, you probably expect to find news items containing the token “mice”, and possibly “rodents”, or possibly “computers”. Not all results are equal. Some results are more “mousey” than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.

Note:

As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text 3.0.

[Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of tokens, units of punctuation, and spaces.

Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.

Tokenization, including the definition of the term “tokens”, SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization operates on the string value of an item; for element nodes this does not include the content of attribute nodes, but for attribute nodes it does. Tokenization is defined more formally in 4.1 Tokenization.

[Definition: A token is a non-empty sequence of characters returned by a tokenizer as a basic unit to be searched. Beyond that, tokens are implementation-defined.] [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]

Not a fast read but a welcome one!

XQuery and XPath increase the value of all XML-encoded documents, at least down to the level of their markup. Beyond nodes, you are on your own.

XQuery and XPath Full Text 3.0 extend XQuery and XPath beyond existing markup in documents. Content that was too expensive or simply not of enough interest to encode, can still be reached in a robust and reliable way.
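For a concrete sense of the syntax, two small sketches along the lines of the spec’s own examples (news items are assumed to be marked up as <news> elements, and full-text support is optional, so these only run on processors that implement it):

(: token search with stemming: matches “mouse” and “mice” :)
//news[. contains text "mouse" using stemming]

(: proximity: “XML” and “Query” with at most 3 intervening tokens :)
//news[. contains text "XML" ftand "Query" distance at most 3 words]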

If you can “see” it with your computer, you can annotate it.

You might have to possess a copy of the copyrighted content, but still, it isn’t a closed box that resists annotation, enabling you to sell the annotation as a value-add to the copyrighted content.

XQuery and XPath Full Text 3.0 says token and phrase are implementation defined.

Imagine the user (name) commented version of movie X: a driver file that has XQuery links into the DVD playing on your computer (or rather, into its data stream).

I rather like that idea.

PS: Check with a lawyer before you commercialize that annotation idea. I am not familiar with all EULAs and national laws.

November 18, 2015

On Teaching XQuery to Digital Humanists [Lesson in Immediate ROI]

Filed under: Humanities,XQuery — Patrick Durusau @ 3:55 pm

On Teaching XQuery to Digital Humanists by Clifford B. Anderson.

A paper presented at Balisage 2014 but still a great read for today. In particular where Clifford makes the case for teaching XQuery to humanists:

Making the Case for XQuery

I may as well state upfront that I regard XQuery as a fantastic language for digital humanists. If you are involved in marking up documents in XML, then learning XQuery will pay long-term dividends. I do have arguments for this bit of bravado. My reasons for lifting up XQuery as a programing language of particular interest to digital humanists are essentially three:

  • XQuery is domain-appropriate for digital humanists.

Let’s take each of these points in turn.

First, XQuery fits the domain of digital humanists. Admittedly, I am focusing here on a particular area of the digital humanities, namely the domain of digital text editing and analysis. In that domain, however, XQuery proves a nearly perfect match to the needs of digital humanists.

If you scour the online communities related to digital humanities, you will repeatedly find conversations about which programming languages to learn. Predictably, the advice is all over the map. PHP is easy to learn, readily accessible, and the language of many popular projects in the digital humanities such as Omeka and Neatline. Javascript is another obvious choice given its ubiquity. Others recommend Python or Ruby. At the margins, you’ll find the statistically-inclined recommending R. There are pluses and minuses to learning any of these languages. When you are working with XML, however, they all fall short. Inevitably, working with XML in these languages will require learning how to use packages to read XML and convert it to other formats.

Learning XQuery eliminates any impedance between data and code. There is no need to import any special packages to work with XML. Rather, you can proceed smoothly from teaching XML basics to showing how to navigate XML documents with XPath to querying XML with XQuery. You do not need to jump out of context to teach students about classes, objects, tables, or anything as awful-sounding as “shredding” XML documents or storing them as “blobs.” XQuery makes it possible for students to become productive without having to learn as many computer science or software engineering concepts. A simple four or five line FLWOR expression can easily demonstrate the power of XQuery and provide a basis for students’ tinkering and exploration. (emphasis added)

I commend the rest of the paper to you for reading but Clifford’s first point nails why learn XQuery for humanists and others.

The part I highlighted above sums it up:

XQuery makes it possible for students to become productive without having to learn as many computer science or software engineering concepts. A simple four or five line FLWOR expression can easily demonstrate the power of XQuery and provide a basis for students’ tinkering and exploration. (emphasis added)

Whether you are a student, scholar or even a type-A business type, what do you want?

To get sh*t done!

A few of us like tinkering with edge cases, proofs, theorems, and automata, but having the needed output on time or sooner really makes the day for most folks.

A minimal amount of XQuery expressions will open up XML encoded data for your custom exploration. You can experience an immediate ROI from the time you spend learning XQuery. Which will prompt you to learn more XQuery.
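For instance, a FLWOR of roughly the size Clifford has in mind (the file and markup are hypothetical, TEI-ish, with namespaces omitted for brevity): count how often each speaker speaks in a play, busiest first.

for $sp in doc("play.xml")//sp
let $who := string($sp/@who)
group by $who
order by count($sp) descending
return <speaker count="{count($sp)}">{$who}</speaker>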

Think of learning XQuery as a step towards user independence. Independence from the choices made by unseen and unknown programmers.

Are you ready to take that step?

November 14, 2015

Querying Biblical Texts: Part 1 [Humanists Take Note!]

Filed under: Bible,Text Mining,XML,XQuery — Patrick Durusau @ 5:13 pm

Querying Biblical Texts: Part 1 by Jonathan Robie.

From the post:

This is the first in a series on querying Greek texts with XQuery. We will also look at the differences among various representations of the same text, starting with the base text, morphology, and three different treebank formats. As we will see, the representation of a text indicates what the producer of the text was most interested in, and it determines the structure and power of queries done on that particular representation. The principles discussed here also apply to other languages.

This is written as a tutorial, and it can be read in two ways. The first time through, you may want to simply read the text. If you want to really learn how to do this yourself, you should download an XQuery processor and some data (in your favorite biblical language) and try these queries and variations on them.

Humanists need to follow this series and pass it along to others.

Texts of interest to you will vary but the steps Jonathan covers are applicable to all texts (well, depending upon your encoding).

In exchange for learning a little XQuery, you can gain a good degree of mastery over XML encoded texts.

Enjoy!

