Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data)

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 30, 2015

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Filed under: XML,XQuery — Patrick Durusau @ 8:19 pm

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.

How often to you see strings with no whitespaces? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation	devtools
misc	installr
data import	readxl
data import, data export	googlesheets
data import	RMySQL
data import	readr
data import, data export	rio
data analysis	psych
data wrangling, data analysis	sqldf
data import, data wrangling	jsonlite
data import, data wrangling	XML
data import, data visualization, data analysis	quantmod
data import, web scraping	rvest
data wrangling, data analysis	dplyr
data wrangling	plyr
data wrangling	reshape2
data wrangling	tidyr
data wrangling, data analysis	data.table
data wrangling	stringr
data wrangling	lubridate
data wrangling, data analysis	zoo
data display	editR
data display	knitr
data display, data wrangling	listviewer
data display	DT
data visualization	ggplot2
data visualization	dygraphs
data visualization	googleVis
data visualization	metricsgraphics
data visualization	RColorBrewer
data visualization	plotly
mapping	leaflet
mapping	choroplethr
mapping	tmap
misc	fitbitScraper
Web analytics	rga
Web analytics	RSiteCatalyst
package development	roxygen2
data visualization	shiny
misc	openxlsx
data wrangling, data analysis	gmodels
data wrangling	car
data visualization	rcdimple
data wrangling	foreach
data acquisition	downloader
data wrangling	scales
data visualization	plotly

Err, that’s not right!

The XQuery from yesterday:

xquery version “1.0”;
<html>
<table>{
for $row in doc(“/home/patrick/working/favorite-R-packages.xml”)/table/tr
order by lower-case(string($row/td[1]/a))
return <tr>{$row/td[1]} {$row/td[2]}</tr>
}</table>
</html>

XQuery from today, changes in red:

xquery version “1.0”;
<html>
<table>{
for $row in doc(“/home/patrick/working/favorite-R-packages.xml”)/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document sort order, as they appeared in the document.

Here’s a snippet of that document:

<table>
<tr>
<td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also 
essential if you want to easily install other packages from GitHub. Install it! 
Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new">
Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/" 
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>install_github("rstudio/leaflet")</td>
<td>Hadley Wickham & others</td>
</tr>
<tr>
<td><a href="https://github.com/talgalili/installr/" target="_new">installr</a>
</td><td>misc</td>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>updateR()</td>
<td>Tal Galili & others</td>
</tr>
<tr>
<td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>
</tr>
...
</table>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick that can be easily adapted to the next soiled data that comes through the door.

PS: Neither Saxon (9.7), nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 30, 2015

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

No Comments