Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!
The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.
How often to you see strings with no whitespaces? Outside of highly constrained environments?
Why is that a problem?
Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:
package development, package installation | devtools |
misc | installr |
data import | readxl |
data import, data export | googlesheets |
data import | RMySQL |
data import | readr |
data import, data export | rio |
data analysis | psych |
data wrangling, data analysis | sqldf |
data import, data wrangling | jsonlite |
data import, data wrangling | XML |
data import, data visualization, data analysis | quantmod |
data import, web scraping | rvest |
data wrangling, data analysis | dplyr |
data wrangling | plyr |
data wrangling | reshape2 |
data wrangling | tidyr |
data wrangling, data analysis | data.table |
data wrangling | stringr |
data wrangling | lubridate |
data wrangling, data analysis | zoo |
data display | editR |
data display | knitr |
data display, data wrangling | listviewer |
data display | DT |
data visualization | ggplot2 |
data visualization | dygraphs |
data visualization | googleVis |
data visualization | metricsgraphics |
data visualization | RColorBrewer |
data visualization | plotly |
mapping | leaflet |
mapping | choroplethr |
mapping | tmap |
misc | fitbitScraper |
Web analytics | rga |
Web analytics | RSiteCatalyst |
package development | roxygen2 |
data visualization | shiny |
misc | openxlsx |
data wrangling, data analysis | gmodels |
data wrangling | car |
data visualization | rcdimple |
data wrangling | foreach |
data acquisition | downloader |
data wrangling | scales |
data visualization | plotly |
Err, that’s not right!
The XQuery from yesterday:
- xquery version “1.0”;
- <html>
- <table>{
- for $row in doc(“/home/patrick/working/favorite-R-packages.xml”)/table/tr
- order by lower-case(string($row/td[1]/a))
- return <tr>{$row/td[1]} {$row/td[2]}</tr>
- }</table>
- </html>
XQuery from today, changes in red:
- xquery version “1.0”;
- <html>
- <table>{
- for $row in doc(“/home/patrick/working/favorite-R-packages.xml”)/table/tr
- order by lower-case(string($row/td[2]/a))
- return <tr>{$row/td[2]} {$row/td[1]}</tr>
- }</table>
- </html>
First, how do you explain the failure? Looks like no sort order at all.
Truthfully it does have a sort order, just not the one you expected. The results appear in document sort order, as they appeared in the document.
Here’s a snippet of that document:
<table> <tr> <td><a href="https://github.com/hadley/devtools" target="_new">devtools</a></td> <td>package development, package installation</td> <td>While devtools is aimed at helping you create your own R packages, it's also essential if you want to easily install other packages from GitHub. Install it! Requires <a href="http://cran.r-project.org/bin/windows/Rtools/" target="_new"> Rtools</a> on Windows and <a href="https://developer.apple.com/xcode/downloads/" target="_new">XCode</a> on a Mac. On CRAN.</td> <td>install_github("rstudio/leaflet")</td> <td>Hadley Wickham & others</td> </tr> <tr> <td><a href="https://github.com/talgalili/installr/" target="_new">installr</a> </td><td>misc</td> <td>Windows only: Update your installed version of R from within R. On CRAN.</td> <td>updateR()</td> <td>Tal Galili & others</td> </tr> <tr> <td><a href="https://github.com/hadley/readxl/" target="_new">readxl</a> </td><td>data import</td> <td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td> <td>read_excel("my-spreadsheet.xls", sheet = 1)</td> <td>Hadley Wickham</td> </tr> ... </table>
I haven’t run the problem entirely to ground but as you can see from the output:
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
Most of the descriptions have spaces, not to mention “,” separating categories.
It is always possible to clean up the data but I want to avoid that if at all possible.
Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.
I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick that can be easily adapted to the next soiled data that comes through the door.
PS: Neither Saxon (9.7), nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.
You could say that document order is about as large an error message as can be given. 😉