Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 9, 2015

Solr 2014: A Year in Review

Filed under: Search Engines,Solr — Patrick Durusau @ 8:25 pm

Solr 2014: A Year in Review by Anshum Gupta.

If you aren’t already excited about Solr 5, targeted for later this month, perhaps these section headings from Anshum’s post will capture your interest:

Usability – Ease of use and management

SolrCloud and Collection APIs

Scalability and optimizations

CursorMark: Distributed deep paging

TTL: Auto-expiration for documents

Distributed Pivot Faceting

Query Parsers

Distributed IDF

Solr Scale Toolkit

Testing

No more war

Solr 5

Community

That is a lot of improvement for a single year! See Anshum’s post and you will be excited about Solr 5 too!
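
As a small taste of one of those items, here is a minimal sketch of what CursorMark deep paging looks like from a client. This is not code from Anshum’s post; the core name and the "id" uniqueKey field are assumptions you would adjust for your own setup.

# Deep paging with Solr's cursorMark (Solr 4.7+), against a hypothetical
# local core named "demo" whose uniqueKey field is "id".
import requests

SOLR = "http://localhost:8983/solr/demo/select"
params = {
    "q": "*:*",
    "rows": 100,
    "sort": "id asc",        # cursorMark requires a sort that ends on the uniqueKey
    "wt": "json",
    "cursorMark": "*",       # "*" means start at the beginning
}

while True:
    resp = requests.get(SOLR, params=params).json()
    for doc in resp["response"]["docs"]:
        print(doc["id"])
    next_cursor = resp["nextCursorMark"]
    if next_cursor == params["cursorMark"]:   # cursor stopped moving: done
        break
    params["cursorMark"] = next_cursor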

Machine Learning (Andrew Ng) – Jan. 19th

Filed under: Computer Science,Education,Machine Learning — Patrick Durusau @ 6:00 pm

Machine Learning (Andrew Ng) – Jan. 19th

From the course page:

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

I could have just posted Machine Learning, Andrew Ng and 19 Jan., but there are people who have never heard of this course before. Hard to believe, but I have been assured that is in fact the case.

So the prose stuff is for them. Why are you reading this far? Go register for the course!

I have heard rumors the first course had an enrollment of over 100,000! I wonder if this course will break current records?

Enjoy!

Using graph databases to perform pathing analysis… [In XML too?]

Filed under: Graphs,Neo4j,Path Enumeration — Patrick Durusau @ 5:51 pm

Using graph databases to perform pathing analysis – initial experiments with Neo4J by Nick Dingwall.

From the post:

In the first post in this series, we raised the possibility that graph databases might allow us to analyze event data in new ways, especially where we were interested in understanding the sequences that events occured in. In the second post, we walked through loading Snowplow page view event data into Neo4J in a graph designed to enable pathing analytics. In this post, we’re going to see whether the hypothesis we raised in the first post is right: can we perform the type of pathing analysis on Snowplow data that is so difficult and expensive when it’s in a SQL database, once it’s loaded in a graph?

In this blog post, we’re going to answer a set of questions related to the journeys that users have taken through our own (this) website. We’ll start by answering some easy questions to get used to working with Cypher. Note that some of these simpler queries could be easily written in SQL; we’re just interested in checking out how Cypher works at this stage. Later on, we’ll move on to answering questions that are not feasible using SQL.

If you dream in markup, ;-), you are probably thinking what I’m thinking. Yes, what about modeling paths in markup documents? What is more, visualizing those paths. Would certainly beat the hell out of some of the examples you find in the XML specifications.

Not to mention that they would be paths in your own documents.

Question: I am assuming you would not collapse all the <p> nodes, yes? That is, for some purposes we display the tree as though every node is unique, identified by its location in the markup tree. For other purposes it might be useful to visualize some paths as a collapsed node, where size or color indicates the number of nodes collapsed into that path.
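
To make that concrete, here is a rough sketch of one way to get element paths out of an XML document and into Neo4j: walk the tree with the Python standard library and emit Cypher, one node per element occurrence (so the <p> nodes are not collapsed). The file name, label, and relationship names are my own inventions, not anything from Nick’s post.

# Emit Cypher that models an XML document as a graph: one node per element
# occurrence, with HAS_CHILD and NEXT_SIBLING relationships. Pipe the output
# into the Neo4j shell. "document.xml" is a placeholder.
import xml.etree.ElementTree as ET
from itertools import count

def emit(elem, parent_path, ids, parent_id=None):
    node_id = next(ids)
    path = f"{parent_path}/{elem.tag}"           # path by tag, not positional
    print(f'CREATE (n{node_id}:Element {{tag: "{elem.tag}", path: "{path}"}})')
    if parent_id is not None:
        print(f"CREATE (n{parent_id})-[:HAS_CHILD]->(n{node_id})")
    prev = None
    for child in elem:
        child_id = emit(child, path, ids, node_id)
        if prev is not None:
            print(f"CREATE (n{prev})-[:NEXT_SIBLING]->(n{child_id})")
        prev = child_id
    return node_id

tree = ET.parse("document.xml")
emit(tree.getroot(), "", count(1))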

That sounds like a Balisage presentation for 2015.

Frequently brought up topics in #haskell [Do you have a frequency-based FAQ?]

Filed under: Functional Programming,Haskell — Patrick Durusau @ 5:32 pm

Frequently brought up topics in #haskell by David Luposchainsky.

From the webpage:

(This is like an FAQ, except that the F stands for frequently instead of someone thought this might be worth mentioning.)

Ouch!

David doesn’t say how he determined the frequency of these topics but it does suggest a number of interesting data mining and curation projects.

Where would you go to find and measure the frequency of questions on particular issues?

My first thought was StackOverflow. I didn’t see any obvious way to download their data so I pinged the admin team.

While waiting for a response, I searched and found:

Stack Exchange Data Dump (September 26, 2014) at the Internet Archive.

How cool is that?! You can get the “big” file or individual files. Well, I don’t see “computer science” listed as a separate file so I assume that would require downloading the “big” file.

Rather than a “worth mentioning” FAQ, you could build a genuinely frequency-based FAQ for a large number of areas.

The further you move away from the dump date, the less accurate your frequencies become, though I don't know how much of an impact that would have.
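
Assuming the dump’s Posts.xml keeps the schema I remember (rows with a PostTypeId attribute and a Tags attribute like "<haskell><monads>"), a first cut at counting question frequency by tag might look like this rough sketch:

# Count how often each tag appears on questions in a Stack Exchange dump.
# Uses iterparse so the (large) Posts.xml never has to fit in memory.
import xml.etree.ElementTree as ET
from collections import Counter

tag_counts = Counter()
for _, row in ET.iterparse("Posts.xml", events=("end",)):
    if row.tag == "row" and row.get("PostTypeId") == "1":   # "1" = question
        tags = row.get("Tags", "")                          # e.g. "<haskell><monads>"
        tag_counts.update(t for t in tags.strip("<>").split("><") if t)
    row.clear()   # free the element once processed

for tag, count in tag_counts.most_common(25):
    print(f"{tag}\t{count}")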

Still, has a great deal of promise!

Natural Language Analytics made simple and visual with Neo4j

Filed under: Graphs,Natural Language Processing,Neo4j — Patrick Durusau @ 5:10 pm

Natural Language Analytics made simple and visual with Neo4j by Michael Hunger.

From the post:

I was really impressed by this blog post on Summarizing Opinions with a Graph from Max and always waited for Part 2 to show up 🙂

The blog post explains a really interesting approach by Kavita Ganesan which uses a graph representation of sentences of review content to extract the most significant statements about a product.

From later in the post:

The essence of creating the graph can be formulated as: “Each word of the sentence is represented by a shared node in the graph with order of words being reflected by relationships pointing to the next word”.

Michael goes on to create features with Cypher and admits near the end that “LOAD CSV” doesn’t really care if you have CSV files or not. You can split on a space and load text such as the “Lord of the Rings poem of the One Ring” into Neo4j.
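
The data model itself is simple enough to sketch without Neo4j at all. Here is the same idea, one shared node per word with “next word” relationships, worked out with plain Python dictionaries (the sentences are stand-ins, and the output is only Cypher-flavored pretty printing, not Michael’s code):

# One shared node per distinct word; an edge (a, b) counts how often
# word b directly follows word a across all sentences.
from collections import defaultdict

sentences = [
    "one ring to rule them all",
    "one ring to find them",
]

words = set()                      # the shared word nodes
follows = defaultdict(int)         # (word, next_word) -> frequency

for sentence in sentences:
    tokens = sentence.lower().split()      # the same "split on a space" trick
    words.update(tokens)
    for a, b in zip(tokens, tokens[1:]):
        follows[(a, b)] += 1

for (a, b), n in sorted(follows.items(), key=lambda kv: -kv[1]):
    print(f"({a})-[:NEXT {{count: {n}}}]->({b})")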

Interesting work and a good way to play with text and Neo4j.

The single node per unique word presented here will be problematic if you need to capture the changing roles of words in a sentence.

A Master List of 1,100 Free Courses From Top Universities:…

Filed under: Computer Science,Education — Patrick Durusau @ 4:27 pm

A Master List of 1,100 Free Courses From Top Universities: 33,000 Hours of Audio/Video Lectures

From the post:

While you were eating turkey, we were busy rummaging around the internet and adding new courses to our big list of Free Online Courses, which now features 1,100 courses from top universities. Let’s give you the quick overview: The list lets you download audio & video lectures from schools like Stanford, Yale, MIT, Oxford and Harvard. Generally, the courses can be accessed via YouTube, iTunes or university web sites, and you can listen to the lectures anytime, anywhere, on your computer or smart phone. We didn’t do a precise calculation, but there’s probably about 33,000 hours of free audio & video lectures here. Enough to keep you busy for a very long time.

Right now you’ll find 127 free philosophy courses, 82 free history courses, 116 free computer science courses, 64 free physics courses and 55 Free Literature Courses in the collection, and that’s just beginning to scratch the surface. You can peruse sections covering Astronomy, Biology, Business, Chemistry, Economics, Engineering, Math, Political Science, Psychology and Religion.

OpenCulture has gathered up a large variety of materials.

Sadly I must report that Akkadian, Egyptian, Hittite, Sanskrit, and Sumerian are all missing from their language resources. Maybe next year.

In the meantime, there are a number of other course selections to enjoy!

GHC (STG,Cmm,asm) illustrated for hardware persons

Filed under: Functional Programming,Haskell — Patrick Durusau @ 3:53 pm

GHC (STG,Cmm,asm) illustrated for hardware persons

From the slides:

NOTE
– This is not an official document by the ghc development team.
– Please don’t forget “semantics”. It’s very important.
– This is written for ghc 7.8 (and ghc 7.10).

As the weekend approaches I thought you might enjoy the discipline of getting a bit closer to the metal than usual. 😉

Enjoy!

I first saw this in a tweet by Pat Shaughnessy.

Special Issue on Visionary Ideas in Data Management

Filed under: Data Management — Patrick Durusau @ 3:14 pm

SIGMOD Record Call For Papers: Special Issue on Visionary Ideas in Data Management. Guest Editor: Jun Yang. Editor-in-Chief: Yanlei Diao.

Important Dates

Submission deadline: March 15, 2015
Publication of the special issue: June 2015

From the announcement:

This special issue of SIGMOD Record seeks papers describing visions of future systems, frameworks, algorithms, applications, and technology related to the management or use of data. The goal of this special issue is to promote the discussion and sharing of challenges and ideas that are not necessarily well-explored at the time of writing, but have potential for significantly expanding the possibilities and horizons of the field of databases and data management. The submissions will be evaluated on their originality, significance, potential impact, and interest to the community, with less emphasis on the current level of maturity, technical depth, and evaluation.

Submission Guidelines: http://www.sigmod.org/publications/sigmod-record/authors

If “visionary” means not yet widely implemented, I think topic maps would easily qualify for this issue. From HDFS to CSV, I haven’t seen another solution for documenting the identity of subjects in data sets. Thoughts? (Modulo the CSV work I mentioned from the W3C quite recently. CSV on the Web:… [ .csv 5,250,000, .rdf 72,700].)

The Myth of Islamic/Muslim Terrorism

Filed under: Politics,Semantics — Patrick Durusau @ 2:55 pm

The recent Charlie Hebdo attacks have given the media a fresh opportunity to refer to “Islamic or Muslim terrorists.” While there is no doubt those who attacked Charlie Hebdo were in fact Muslims, that does not justify referring to them as “Islamic or Muslim terrorists.”

Use of “Islamic or Muslim terrorists” reflects the underlying bigotry of the speaker and/or a failure to realize they are repeating the bigotry of others.

If you doubt my take on Islam since I am not a Muslim, consider What everyone gets wrong about Islam and cartoons of Mohammed by Amanda Taub, who talks to Muslims about the Charlie Hebdo event.

If you want to call the attackers of Charlie Hebdo, well, attackers, murderers, etc., all of that is true and isn’t problematic.

Do you use “Christian terrorists” to refer to American service personnel who kill women and children with cruise missiles, drones and bombs? Or perhaps you would prefer “American terrorists,” or “Israeli terrorists,” as news labels?

Using Islamic or Muslim, you aren’t identifying a person’s motivation, you are smearing a historic and honorable religion with the outrages of the few. Whether that is your intention or not.

I’m not advocating politically correct speech. You can wrap yourself in a cocoon of ignorance and intolerance and speak freely from that position.

But before you are beyond the reach of reasonable speech, let me make a suggestion.

Contact a local mosque and make arrangements to attend outreach events/programs at the mosque. Not just once but go often enough to be a regular participant for several months. You will find Muslims are very much like other people you know. Some you will like and some perhaps not. But it will be as individuals that you like/dislike them, not because of their religion.

As a bonus, in addition to meeting Muslims, you will have an opportunity to learn about Islam first hand.

After such experiences, you will be able to distinguish the acts of a few criminals from a religion whose followers number in the billions.

Structural Issues in XPath/XQuery/XPath-XQuery F&O Drafts

Filed under: Standards,W3C,XML,XPath,XQuery — Patrick Durusau @ 1:02 pm

Apologies as I thought I was going to be further along in demonstrating some proofing techniques for XPath 3.1, XQuery 3.1, and XPath and XQuery Functions and Operators 3.1 by today.

Instead, I encountered structural issues that are common to all three drafts that I didn’t anticipate but that need to be noted before going further with proofing. I will be using sample material to illustrate the problems and will not always have a sample from all three drafts or even note every occurrence of the issues. They are too numerous for that treatment and it would be repetition for repetition’s sake.

First, consider these passages from XPath 3.1, 1 Introduction:

[Definition: XPath 3.1 operates on the abstract, logical structure of an XML document, rather than its surface syntax. This logical structure, known as the data model, is defined in [XQuery and XPath Data Model (XDM) 3.1].]

[Definition: An XPath 3.0 Processor processes a query according to the XPath 3.0 specification.] [Definition: An XPath 2.0 Processor processes a query according to the XPath 2.0 specification.] [Definition: An XPath 1.0 Processor processes a query according to the XPath 1.0 specification.]

1. Unnumbered Definitions – Unidentified Cross-References

The first structural issue that you will note with the “[Definition…” material is that all such definitions are unnumbered and appear throughout all three texts. The lack of numbering means that it is difficult to refer with any precision to a particular definition. How would I draw your attention to the third definition of the second grouping? Searching for XPath 1.0 turns up 79 occurrences in XPath 3.1 so that doesn’t sound satisfactory. (FYI, “Definition” turns up 193 instances.)

While the “Definitions” have anchors that allow them to be addressed by cross-references, you should note that the cross-references are text hyperlinks that have no identifier by which a reader can find the definition without using the hyperlink. That is to say when I see:

A lexical QName with a prefix can be converted into an expanded QName by resolving its namespace prefix to a namespace URI, using the statically known namespaces. [These are fake links to draw your attention to the text in question.]

The hyperlinks in the original will take me to various parts of the document where these definitions occur, but if I have printed the document, I have no clue where to look for these definitions.

The better practice is to number all the definitions and since they are all self-contained, to put them in a single location. Additionally, all interlinear references to those definitions (or other internal cross-references) should have a visible reference that enables a reader to find the definition or cross-reference, without use of an internal hyperlink.

Example:

A lexical QName Def-21 with a prefix can be converted into an expanded QName Def-19 by resolving its namespace prefix to a namespace URI, using the statically known namespaces. Def-99 [These are fake links to draw your attention to the text in question. The Def numbers are fictitious in this example. Actual references would have the visible definition numbers assigned to the appropriate definition.]

2. Vague references – $N versus 5000 x $N

Another problem I encountered was what I call “vague references,” or less generously, $N versus 5,000 x $N.

For example:

[Definition: An atomic value is a value in the value space of an atomic type, as defined in [XML Schema 1.0] or [XML Schema 1.1].] [Definition: A node is an instance of one of the node kinds defined in [XQuery and XPath Data Model (XDM) 3.1].

Contrary to popular opinion, standards don’t write themselves and every jot and tittle was placed in a draft at the expense of someone’s time and resources. Let’s call that $N.

In the example, you and I both know that somewhere in XML Schema 1.0 and XML Schema 1.1 the “value space of an atomic type” is defined. The same is true for nodes and XQuery and XPath Data Model (XDM) 3.1. But where? The authors of these specifications could insert that information at a cost of $N.

What is the cost of not inserting that information in the current drafts? I estimate the number of people interested in reading these drafts to be 5,000. So each of those people will have to find the same information omitted from these specifications, which is a cost of 5,000 x $N. In terms of convenience to readers and reducing their costs of reading these specifications, references to exact locations in other materials are a necessity.

In full disclosure, I have no more or less reason to think 5,000 people are interested in these drafts than the United States has for positing the existence of approximately 5,000 terrorists in the world. I suspect the number of people interested in XML is actually higher but the number works to make the point. Editors can either convenience themselves or their readers.

Vague references are also problematic in terms of users finding the correct reference. The citation above, [XML Schema 1.0] for “value space of an atomic type,” refers to all three parts of XML Schema 1.0.

Part 1, at 3.14.1 (non-normative) The Simple Type Definition Schema Component, has the only reference to “atomic type.”

Part 2 actually has zero hits for “atomic type.” True enough, “2.5.1.1 Atomic datatypes” is likely the intended reference but that isn’t what the specification says to look for.

Bottom line is that any external reference needs to include in the inline citation the precise internal reference in the work being cited. If you want to inconvenience readers by pointing to internal bibliographies rather than online HTML documents, where available, that’s an editorial choice. But in any event, for every external reference, give the internal reference in the work being cited.

Your readers will appreciate it and it could make your work more accurate as well.

3. Normative vs. Non-Normative Text

Another structural issue which is important for proofing is the distinction between normative and non-normative text.

In XPath 3.1, still in the Introduction we read:

This document normatively defines the static and dynamic semantics of XPath 3.1. In this document, examples and material labeled as “Note” are provided for explanatory purposes and are not normative.

OK, and under 2.2.3.1 Static Analysis Phase (XPath 3.1), we find:

Examples of inferred static types might be:

Which is followed by a list so at least we know where the examples end.

However, there are numerous cases of:

For example, with the expression substring($a, $b, $c), $a must be of type xs:string (or something that can be converted to xs:string by the function calling rules), while $b and $c must be of type xs:double. [also in 2.2.3.1 Static Analysis Phase (XPath 3.1)]

So, is that a non-normative example? If so, what is the nature of the “must” that occurs in it? Is that normative?

Moreover, the examples (XPath 3.1 has 283 occurrences of that term, XQuery has 455 occurrences, and XPath and XQuery Functions and Operators has 537 occurrences) are unnumbered, which makes referencing the examples from other materials very imprecise and wordy. For the use of authors creating secondary literature on these materials, to promote adoption, etc., numbering all examples should be the default.

Oh, before anyone protests that XPath and XQuery Functions and Operators has separated its examples into lists, that is true but only partially. There remain 199 occurrences of “for example” which do not occur in lists. Where lists are used, converting to numbered examples should be trivial. The elimination of “for example” material may be more difficult. Hard to say without a good sampling of the cases.

Conclusion:

As I said at the outset, apologies for not reaching more substantive proofing techniques but structural issues are important for the readability and usability of specifications for readers. Being correct and unreadable isn’t a useful goal.

It may seem like some of the changes I suggest are a big “ask” this late in the processing of these specifications. If this were a hand edited document, I would quickly agree with you. But it’s not. Or at least it shouldn’t be. I don’t know where the source is held but the HTML you read is a generated artifact.

Gathering and numbering the definitions and inserting those numbers into the internal cross-references are a matter of applying a different style sheet to the source. Fixing the vague references and unnumbered example texts would take more editorial work but readers would greatly benefit from precise references and a clear separation of normative from non-normative text.
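
To be clear about how mechanical such a pass could be, here is a sketch of the idea in Python. The class names and link conventions below are assumptions about the generated HTML, not an inspection of the actual W3C sources, so treat it as an illustration of the approach rather than a working tool.

# Number every definition in document order and make each internal
# cross-reference display that number. Assumes definitions carry a
# class="definition" attribute and cross-references are fragment links.
from lxml import html

doc = html.parse("xpath-31.html")        # hypothetical local copy of the draft
root = doc.getroot()

def_numbers = {}
for n, dfn in enumerate(root.xpath('//*[@class="definition"]'), start=1):
    if dfn.get("id"):
        def_numbers[dfn.get("id")] = n
    dfn.text = f"Definition {n}: " + (dfn.text or "")

for link in root.xpath("//a[starts-with(@href, '#')]"):
    target = link.get("href")[1:]
    if target in def_numbers:
        # Make the definition number visible next to the hyperlink text.
        link.tail = f" [Def-{def_numbers[target]}]" + (link.tail or "")

doc.write("xpath-31-numbered.html", method="html")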

I will try again over the weekend to reach aids for substantive proofing on these drafts. With luck, I will return to these drafts on Monday of next week (12 January 2015).

January 8, 2015

Simple Pictures That State-of-the-Art AI Still Can’t Recognize

Filed under: Artificial Intelligence,Deep Learning,Machine Learning,Neural Networks — Patrick Durusau @ 3:58 pm

Simple Pictures That State-of-the-Art AI Still Can’t Recognize by Kyle VanHemert.

I encountered this non-technical summary of Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, which I covered as: Deep Neural Networks are Easily Fooled:… earlier today.

While I am sure you have read the fuller explanation, I wanted to replicate the top 40 images for your consideration:

[Image: the top 40 fooling images from the paper]

Select the image to see a larger, readable version.

Enjoy the images and pass the Wired article along to friends.

Wikipedia in Python, Gephi, and Neo4j

Filed under: Gephi,Giraph,Neo4j,NetworkX,Python,Wikipedia — Patrick Durusau @ 3:22 pm

Wikipedia in Python, Gephi, and Neo4j: Vizualizing relationships in Wikipedia by Matt Krzus.

From the introduction:


We have had a bit of a stretch here where we used Wikipedia for a good number of things. From Doc2Vec to experimenting with word2vec layers in deep RNNs, here are a few of those cool visualization tools we’ve used along the way.

Cool things you will find in this post:

  • Building relationship links between Categories and Subcategories
  • Visualization with Networkx (think Betweenness Centrality and PageRank; a minimal sketch follows after this list)
  • Neo4j and Cypher (the author thinks avoiding the Giraph learning curve is a plus, I leave that for you to decide)
  • Visualization with Gephi
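
For anyone who hasn’t used NetworkX before, the centrality step is only a few lines. A toy version, with made-up category names rather than Matt’s Wikipedia data:

# Build a small category -> subcategory graph and compute PageRank and
# betweenness centrality for each node.
import networkx as nx

G = nx.DiGraph()
edges = [
    ("Machine_learning", "Artificial_neural_networks"),
    ("Machine_learning", "Classification_algorithms"),
    ("Artificial_intelligence", "Machine_learning"),
    ("Artificial_intelligence", "Knowledge_representation"),
    ("Knowledge_representation", "Ontology_(information_science)"),
]
G.add_edges_from(edges)

pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

for node in G.nodes():
    print(f"{node:35s} pagerank={pagerank[node]:.3f} "
          f"betweenness={betweenness[node]:.3f}")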

Enjoy!

CSV on the Web:… [ .csv 5,250,000, .rdf 72,700]

Filed under: CSV,JSON,RDF — Patrick Durusau @ 3:09 pm

CSV on the Web: Metadata Vocabulary for Tabular Data, and Their Conversion to JSON and RDF

From the post:

The CSV on the Web Working Group has published First Public Working Drafts of the Generating JSON from Tabular Data on the Web and the Generating RDF from Tabular Data on the Web documents, and has also issued new releases of the Metadata Vocabulary for Tabular Data and the Model for Tabular Data and Metadata on the Web Working Drafts. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. Validation, conversion, display, and search of that tabular data requires additional information on that data. The “Metadata vocabulary” document defines a vocabulary for metadata that annotates tabular data, providing such information as datatypes, linkage among different tables, license information, or human readable description of columns. The standard conversion of the tabular data to JSON and/or RDF makes use of that metadata to provide representations of the data for various applications. All these technologies rely on a basic data model for tabular data described in the “Model” document. The Working Group welcomes comments on these documents and on their motivating use cases. Learn more about the Data Activity.

These are working drafts and as such have a number of issues noted in the text of each one. Excellent opportunity to participate in the W3C process.

There aren’t any reliable numbers but searching for “.csv” returns 5,250,000 “hits” and searching on “.rdf” returns 72,700 “hits.”

That sounds really low for CSV and doesn’t include all the CSV files on local systems.

Still, I would say that CSV files continue to be important and that this work merits your attention.
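
If the drafts feel abstract, the core idea is easy to illustrate: a metadata description says what each column means, and a converter uses it to emit JSON. The keys in this toy sketch are simplified stand-ins, not the property names defined by the drafts, so consult the drafts for the real vocabulary.

# Toy metadata-driven CSV -> JSON conversion. The metadata structure here is
# illustrative only; the W3C drafts define the actual vocabulary.
import csv
import json

metadata = {
    "url": "population.csv",                      # hypothetical input file
    "columns": [
        {"name": "country", "datatype": "string"},
        {"name": "year", "datatype": "integer"},
        {"name": "population", "datatype": "integer"},
    ],
}

def convert(value, datatype):
    return int(value) if datatype == "integer" else value

rows = []
with open(metadata["url"], newline="") as f:
    for record in csv.DictReader(f):
        rows.append({
            col["name"]: convert(record[col["name"]], col["datatype"])
            for col in metadata["columns"]
        })

print(json.dumps(rows, indent=2))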

PANDA Project (News Reporters)

Filed under: News,Reporting — Patrick Durusau @ 2:44 pm

PANDA Project (News Reporters)

From the webpage:

Information on a deadline: The newsroom’s data at your fingertips, available at the speed of breaking news.

Smarter, not harder: Subscribe to your favorite searches to get an email when news happens.

Institutional memory: People are going to leave, but your data shouldn’t. Make it faster for new reporters to find stories in data.

Newsroom born & raised: PANDA was built by newsroom developers, with the support of The Knight Foundation. It is sustained by Investigative Reporters and Editors.

PANDA is …

You can install PANDA on Amazon EC2 or on your own hardware, assuming you are using Ubuntu 12.04.

I haven’t set this up (yet) but it looks promising. I don’t see an obvious way to store observations about data for discovery by others or how to create links (associations) between data. To say nothing of annotating subjects I find in the data.

Capturing that level of institutional knowledge might or might not be socially acceptable. I recall reading about an automatic collaborative bookmarks application developed by a news room that faced opposition from reporters not wanting to share their links. Sounded odd to me but I pass it along for your consideration.

New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live

Filed under: Cloudera,Spark — Patrick Durusau @ 2:06 pm

New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live by Alex Gutow.

From the post:

When it comes to learning Apache Hadoop and CDH (Cloudera’s open source platform including Hadoop), there is no better place to start than Cloudera Live. With a quick, one-button deployment option, Cloudera Live launches a four-node Cloudera cluster that you can learn and experiment in free for two-weeks. To help plan and extend the capabilities of your cluster, we also offer various partner deployments. Building on the addition of interactive tutorials and Tableau and Zoomdata integration, we have added a new tutorial on Apache Spark and a new Trifacta partner deployment.

One of the most popular tools in the Hadoop ecosystem is Apache Spark. This easy-to-use, general-purpose framework is extensible across multiple use cases – including batch processing, iterative advanced analytics, and real-time stream processing. With support and development from multiple industry vendors and partner tools, Spark has quickly become a standard within Hadoop.

With the new tutorial, “Relationship Strength Analytics Using Spark,” it will walk you through the basics of Spark and how you can utilize the same, unified enterprise data hub to launch into advanced analytics. Using the example of product relationships, it will walk you through how to discover what products are commonly viewed together, how to optimize product campaigns together for better sales, and discover other insights about product relationships to help build advanced recommendations.
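
The tutorial’s own code isn’t reproduced here, but the “commonly viewed together” idea boils down to counting co-occurring pairs. A minimal PySpark sketch, assuming an input file with one line per session (a session id followed by the product ids viewed in it):

# Count how often pairs of products are viewed in the same session.
from itertools import combinations
from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="co-viewed-products")

sessions = sc.textFile("sessions.txt")            # hypothetical input
pairs = (sessions
         .map(lambda line: sorted(set(line.split()[1:])))   # product ids only
         .flatMap(lambda prods: combinations(prods, 2))
         .map(lambda pair: (pair, 1))
         .reduceByKey(add))

for (a, b), count in pairs.takeOrdered(10, key=lambda kv: -kv[1]):
    print(a, b, count)

sc.stop()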

There is enough high grade educational material for data science that I think with some slicing and dicing, an entire curriculum could be fashioned out of online resources alone.

A number of Cloudera tutorials would find their way into such a listing.

Enjoy!

The Type Theory Podcast

Filed under: Functional Programming,Programming,Types — Patrick Durusau @ 1:58 pm

The Type Theory Podcast

From the about page:

The Type Theory Podcast is a podcast about type theory and its interactions with programming, mathematics, and philosophy. Our goal is to bring interesting research to a wider audience.

I’m not sure how I missed this when it started. 😉

There are three episodes available now:

Episode 1: Peter Dybjer on types and testing

In our inaugural episode, we speak with Peter Dybjer from Chalmers University of Technology. Peter has made significant contributions to type theory, including inductive families, induction-recursion, and categorical models of dependent types. He is generally interested in program correctness, programming language semantics, and the connection between mathematics and programming. Today, we will talk about the relationship between QuickCheck-style testing and proofs and verification in type theory.

Episode 2: Edwin Brady on Idris

In our second episode, we speak with Edwin Brady from the University of St. Andrews. Since 2008, Edwin has been working on Idris, a functional programming language with dependent types. This episode is very much about programming: we discuss the language Idris, its history, its implementation strategies, and plans for the future.

Episode 3: Dan Licata on Homotopy Type Theory

In our third episode, we discuss homotopy type theory (HoTT) with Wesleyan University’s Dan Licata. Dan has participated during much of the development of HoTT, having completed his PhD at CMU and having been a part of the Institute for Advanced Study’s special year on the subject. In our interview, we discuss the basics of HoTT, some potential applications in both mathematics and computing, as well as ongoing work on computation, univalence, and cubes.

Each episode has links to additional reading materials and resources.

Enjoy!

GraphLab Changes Name to Dato

Filed under: GraphLab — Patrick Durusau @ 11:35 am

GraphLab Changes Name to Dato, Raises $18.5 Million to Enable Creation of Intelligent Applications

From the post:

GraphLab today announced it closed an $18.5 million Series B funding round led by Vulcan Capital with participation from Opus Capital Ventures and existing investors New Enterprise Associates (NEA) and Madrona Venture Group. The company has also changed its name and brand from GraphLab to Dato, reflecting the evolution of its popular machine learning platform which now enables the creation of intelligent applications based on any type of data, including graphs, tables, text and images. Dato will use the investment to expand its business development, engineering and customer support teams to serve a rapidly growing customer base. The Series B round brings the total amount raised by Dato to $25.25 million. Steve Hall from Vulcan Capital will join Dato’s board of directors.

Those pesky startups. Begin with one name, soon there is another and that’s before even getting to raising capital. Then some smartass marketing person thinks they have a name that someday will be as universal as IBM or Nike. So now it has another name.

With a topic map approach, the changing of names isn’t a problem because the legal obligations of the entity continue, whatever its outward facing name.

If you wanted to track the name of the entity at particular times, I would create an association between the entity and its then-present name, and use that association as a role player in the association you want to place in time.

Bootkit for Macs

Filed under: Cybersecurity,Security — Patrick Durusau @ 11:04 am

World’s first (known) bootkit for OS X can permanently backdoor Macs by Dan Goodin.

From the post:

Securing Macs against stealthy malware infections could get more complicated thanks to a new proof-of-concept exploit that allows attackers with brief physical access to covertly replace the firmware of most machines built since 2011.

Once installed, the bootkit—that is, malware that replaces the firmware that is normally used to boot Macs—can control the system from the very first instruction. That allows the malware to bypass firmware passwords, passwords users enter to decrypt hard drives and to preinstall backdoors in the operating system before it starts running. Because it’s independent of the operating system and hard drive, it will survive both reformatting and OS reinstallation. And since it replaces the digital signature Apple uses to ensure only authorized firmware runs on Macs, there are few viable ways to disinfect infected boot systems. The proof-of-concept is the first of its kind on the OS X platform. While there are no known instances of bootkits for OS X in the wild, there is currently no way to detect them, either.

The malware has been dubbed Thunderstrike, because it spreads through maliciously modified peripheral devices that connect to a Mac’s Thunderbolt interface. When plugged into a Mac that’s in the process of booting up, the device injects what’s known as an Option ROM into the extensible firmware interface (EFI), the firmware responsible for starting a Mac’s system management mode and enabling other low-level functions before loading the OS. The Option ROM replaces the RSA encryption key Macs use to ensure only authorized firmware is installed. From there, the Thunderbolt device can install malicious firmware that can’t easily be removed by anyone who doesn’t have the new key.

There are similar bootkits for Windows, Stoned Bootkit being one of them.

Physical security is the first principle of computer security that is most often overlooked. You can have encrypted drives, passwords, etc., but if your computer isn’t physically secure, all that security is just window dressing.

Special Issue on Arabic NLP

Filed under: Natural Language Processing — Patrick Durusau @ 10:33 am

Special Issue on Arabic NLP. Editor-in-Chief: M.M. Alsulaiman.

Including the introduction, twelve open access articles on Arabic NLP.

From the introduction:

Arabic natural language processing (NLP) is still in its initial stage compared to the work in English and other languages. NLP is made possible by the collaboration of many disciplines including computer science, linguistics, mathematics, psychology and artificial intelligence. The results of which is highly beneficial to many applications such as Machine Translation, Information Retrieval, Information Extraction, Text Summarization and Question Answering.

This special issue of the Journal of King Saud University – Computer and Information Sciences (CIS) synthesizes current research in the field of Arabic NLP. A total of 56 submissions was received, 11 of which were finally accepted for this special issue. Each accepted paper has gone through three rounds of reviews, each round with two to three reviewers. The content of this special issue covers different topics such as: Dialectal Arabic Morphology, Arabic Corpus, Transliteration, Annotation, Discourse Relations, Sentiment Lexicon, Arabic named entities, Arabic Treebank, Text Summarization, Ontological Relations and Authorship attribution. The following is a brief summary of each of the main articles in this issue.

If you are interested in doing original NLP work, not a bad place to start looking for projects.

I first saw this in a tweet by Tony McEnery.

WorldWide Telescope (MS) Goes Open Source!

Filed under: Astroinformatics,Open Source — Patrick Durusau @ 10:15 am

Microsoft is Open‐Sourcing WorldWide Telescope in 2015

From the post:

Why is this great news?

Millions of people rely on WorldWide Telescope (WWT) as their unified astronomical image and data environment for exploratory research, teaching, and public outreach. With OpenWWT, any individual or organization will be able to adapt and extend the functionality of WorldWide Telescope to meet any research or educational need. Extensions to the software will continuously enhance astronomical research, formal and informal learning, and public outreach.

What is WWT, and where did it come from?

WorldWide Telescope began in 2007 as a research project, led from within Microsoft Research. Early partners included astronomers and educators from Caltech, Harvard, Johns Hopkins, Northwestern, the University of Chicago, and several NASA facilities. Thanks to these collaborations and Microsoft’s leadership, WWT has reached its goal of creating a free unified contextual visualization of the Universe with global reach that lets users explore multispectral imagery, all of which is deeply connected to scholarly publications and online research databases.

The WWT software was designed with rich interactivity in mind. Guided tours which can be created within the program, offer scripted paths through the 3D environment, allowing media-rich interactive stories to be told, about anything from star formation to the discovery of the large scale structure of the Universe. On the web, WWT is used as both as a standalone program and as an API, in teaching and in research—where it offers unparalleled options for sharing and contextualizing data sets, on the “2D” multispectral sky and/or within the “3D” Universe.

How can you help?

Open-sourcing WWT will allow the people who can best imagine how WWT should evolve to meet the expanding research and teaching challenges in astronomy to guide and foster future development. The OpenWWT Consortium’s members are institutions who will guide WWT’s transition from Microsoft Research to a new host organization. The Consortium and hosting organization will work with the broader astronomical community on a three-part mission of: 1) advancing astronomical research, 2) improving formal and informal astronomy education; and 3) enhancing public outreach.

Join us. If you and your institution want to help shape the future of WWT to support your needs, and the future of open-source software development in Astronomy, then ask us about joining the OpenWWT Consortium.

To contact the WWT team, or inquire about joining the OpenWWT Consortium, contact Doug Roberts at doug-roberts@northwestern.edu.

What a nice way to start the day!

I’m Twitter follower #30 for OpenWWT. What Twitter follower are you going to be?

If you are interested in astronomy, teaching, interfaces, coding great interfaces, etc., there is something of interest for you here.

Enjoy!

January 7, 2015

Harvard Library adopts LibraryCloud

Filed under: Library,Library software — Patrick Durusau @ 8:37 pm

Harvard Library adopts LibraryCloud by David Weinberger.

From the post:

According to a post by the Harvard Library, LibraryCloud is now officially a part of the Library toolset. It doesn’t even have the word “pilot” next to it. I’m very happy and a little proud about this.

LibraryCloud is two things at once. Internal to Harvard Library, it’s a metadata hub that lets lots of different data inputs be normalized, enriched, and distributed. As those inputs change, you can change LibraryCloud’s workflow process once, and all the apps and services that depend upon those data can continue to work without making any changes. That’s because LibraryCloud makes the data that’s been input available through an API which provides a stable interface to that data. (I am overstating the smoothness here. But that’s the idea.)

To the Harvard community and beyond, LibraryCloud provides open APIs to access tons of metadata gathered by Harvard Library. LibraryCloud already has metadata about 18M items in the Harvard Library collection — one of the great collections — including virtually all the books and other items in the catalog (nearly 13M), a couple of million of images in the VIA collection, and archives at the folder level in Harvard OASIS. New data can be added relatively easily, and because LibraryCloud is workflow based, that data can be updated, normalized and enriched automatically. (Note that we’re talking about metadata here, not the content. That’s a different kettle of copyrighted fish.)

LibraryCloud began as an idea of mine (yes, this is me taking credit for the idea) about 4.5 years ago. With the help of the Harvard Library Innovation Lab, which I co-directed until a few months ago, we invited in local libraries and had a great conversation about what could be done if there were an open API to metadata from multiple libraries. Over time, the Lab built an initial version of LibraryCloud primarily with Harvard data, but with scads of data from non-Harvard sources. (Paul Deschner, take many many bows. Matt Phillips, too.) This version of LibraryCloud — now called lilCloud — is still available and is still awesome.

Very impressive news from Harvard!

Plus, the LibraryCloud is open source!

Documentation. Well, that’s the future home of the documentation. For now, the current documentation is on Google Doc: LibraryCloud Item API

Overview:

The LibraryCloud Item API provides access to metadata about items in the Harvard Library collections. For the purposes of this API, an “item” is the metadata describing a catalog record within the Harvard Library.

Enjoy!

University Administrations and Data Checking

Filed under: Data Mining,Data Replication,Skepticism — Patrick Durusau @ 7:40 pm

Axel Brennicke and Björn Brembs posted the following about university administrations in Germany.

Noam Chomsky, writing about the Death of American Universities, recently reminded us that reforming universities using a corporate business model leads to several easy to understand consequences. The increase of the precariat of faculty without benefits or tenure, a growing layer of administration and bureaucracy, or the increase in student debt. In part, this well-known corporate strategy serves to increase labor servility. The student debt problem is particularly obvious in countries with tuition fees, especially in the US where a convincing argument has been made that the tuition system is nearing its breaking point. The decrease in tenured positions is also quite well documented (see e.g., an old post). So far, and perhaps as may have been expected, Chomsky was dead on with his assessment. But how about the administrations?

To my knowledge, nobody has so far checked if there really is any growth in university administration and bureaucracy, apart from everybody complaining about it. So Axel Brennicke and I decided to have a look at the numbers. In Germany, employment statistics can be obtained from the federal statistics registry, Destatis. We sampled data from 2005 (the year before the Excellence Initiative and the Higher Education Pact) and the latest year we were able to obtain, 2012.

I’m sympathetic to the authors and their position, but that doesn’t equal verification of their claims about the data.

They have offered the data to anyone who wants to check: Raw Data for Axel Brennicke and Björn Brembs.

Granting the article doesn’t detail their analysis, after downloading the data, what’s next? How would you go about verifying statements made in the article?

If people get in the habit of offering data for verification and no one looks, what guarantee of correctness will that bring?


The data passes the first test: it is actually present at the download site. Don’t laugh; the NSA has trouble making that commitment.

Do note that the files have underscores in their names, which makes them appear to have spaces in their names. HINT: Don’t use underscores in file names. Ever.

The files are old style .xls files so just about anything recent should read them. Do be aware the column headers are in German.

The only description reads:

Employment data from DESTATIS about German university employment in 2005 and 2012

My first curiosity is that the data covers only two years, 2005 and 2012. Just note that for now. What steps would you take with the data sets as they are?
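
As a first step, I would load both spreadsheets and compare headcounts by personnel category. The file names and the German column headers in this sketch are placeholders; check the actual headers after downloading.

# Compare 2005 vs. 2012 university staff counts by personnel category.
import pandas as pd

df_2005 = pd.read_excel("hochschulpersonal_2005.xls")   # hypothetical file names
df_2012 = pd.read_excel("hochschulpersonal_2012.xls")

# Suppose each file has a personnel-category column and a headcount column.
CATEGORY, COUNT = "Personalgruppe", "Anzahl"             # placeholder headers

summary = pd.DataFrame({
    "2005": df_2005.groupby(CATEGORY)[COUNT].sum(),
    "2012": df_2012.groupby(CATEGORY)[COUNT].sum(),
})
summary["change_pct"] = 100 * (summary["2012"] - summary["2005"]) / summary["2005"]
print(summary.sort_values("change_pct", ascending=False))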

I first saw this in a tweet by David Colquhoun.

New in CDH 5.3: Transparent Encryption in HDFS

Filed under: Cloudera,Cybersecurity,Hadoop,Security — Patrick Durusau @ 5:21 pm

New in CDH 5.3: Transparent Encryption in HDFS by Charles Lamb, Yi Liu & Andrew Wang

From the post:

Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.

Cloudera’s HDFS and Cloudera Navigator Key Trustee (formerly Gazzang zTrustee) engineering teams did this work under HDFS-6134 in collaboration with engineers at Intel as an extension of earlier Project Rhino work. In this post, we’ll explain how it works, and how to use it.

Excellent news! Especially for data centers who are responsible for the data of others.

The authors do mention the problem of rogue users, that is, on the client side:

Finally, since each file is encrypted with a unique DEK and each EZ can have a different key, the potential damage from a single rogue user is limited. A rogue user can only access EDEKs and ciphertext of files for which they have HDFS permissions, and can only decrypt EDEKs for which they have KMS permissions. Their ability to access plaintext is limited to the intersection of the two. In a secure setup, both sets of permissions will be heavily restricted.

Just so you know, it won’t be a security problem with Hadoop 2.6 if Sony is hacked while running Hadoop 2.6 at a data center. Anyone who copies the master access codes from sticky notes will be able to do a lot of damage. North Korea will be the whipping boy for major future cyberhacks. That’s policy, not facts talking.

For users who do understand what secure environments should look like, this is a great advance.

Non-Uniform Random Variate Generation

Filed under: Random Numbers,Statistics — Patrick Durusau @ 5:05 pm

Non-Uniform Random Variate Generation by Luc Devroye.

From the introduction:

Random number generation has intrigued scientists for a few decades, and a lot of effort has been spent on the creation of randomness on a deterministic (non-random) machine, that is, on the design of computer algorithms that are able to produce “random” sequences of integers. This is a difficult task. Such algorithms are called generators, and all generators have flaws because all of them construct the n-th number in the sequence in function of the n-1 numbers preceding it, initialized with a nonrandom seed. Numerous quantities have been invented over the years that measure just how “random” a sequence is, and most well-known generators have been subjected to rigorous statistical testing. However, for every generator, it is always possible to find a statistical test of a (possibly odd) property to make the generator flunk. The mathematical tools that are needed to design and analyze these generators are largely number theoretic and combinatorial. These tools differ drastically from those needed when we want to generate sequences of integers with certain non-uniform distributions, given that a perfect uniform random number generator is available. The reader should be aware that we provide him with only half the story (the second half). The assumption that a perfect uniform random number generator is available is now quite unrealistic, but, with time, it should become less so. Having made the assumption, we can build quite a powerful theory of non-uniform random variate generation.
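
One of the classic techniques the book covers is inversion: push a uniform random number through the inverse of the target distribution’s CDF. A minimal sketch for the exponential distribution (standard textbook material, not a quotation from Devroye):

# Exponential(rate) variates via inversion: F^-1(u) = -ln(1 - u) / rate,
# assuming random.random() stands in for the "perfect" uniform generator.
import math
import random

def exponential_variate(rate):
    u = random.random()
    return -math.log(1.0 - u) / rate

samples = [exponential_variate(rate=2.0) for _ in range(100000)]
print("sample mean:", sum(samples) / len(samples))   # should be near 1/rate = 0.5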

You will need random numbers for some purposes in information retrieval but that isn’t why I mention this eight hundred (800) + page tome.

The author has been good enough to put the entire work up on the Internet and you are free to use it for any purpose, even reselling it.

I mention it because in a recent podcast about Solr 5, the greatest emphasis was on building and managing Solr clusters. Which is a very important use case if you are indexing and searching “big data.”

But in the rush to index and search “big data,” to what extent are we ignoring the need to index and search Small But Important Data (SBID)?

This book would qualify as SBID and even better, it already has an index against which to judge your Solr indexing.

And there are other smallish collections of texts. The Michael Brown grand jury transcripts, which are < 5,000 pages, the CIA Torture Report at 6,000 pages, and many others. Texts that don’t qualify as “big data” but still require highly robust indexing capabilities.

Take Non-Uniform Random Variate Generation as a SBID and practice target for Solr.

I first saw this in a tweet by Computer Science.

Serendipity in the Stacks:…

Filed under: Library,Serendipity — Patrick Durusau @ 3:45 pm

Serendipity in the Stacks: Libraries, Information Architecture, and the Problems of Accidental Discovery by Patrick L. Carr.

Abstract:

Serendipity in the library stacks is generally regarded as a positive occurrence. While acknowledging its benefits, this essay draws on research in library science, information systems, and other fields to argue that, in two important respects, this form of discovery can be usefully framed as a problem. To make this argument, the essay examines serendipity both as the outcome of a process situated within the information architecture of the stacks and as a user perception about that outcome.

A deeply dissatisfying essay on serendipity, as evidenced by the author’s conclusion, which reads in part:

While acknowledging the validity of Morville’s points, I nevertheless believe that, along with its positive aspects, serendipity in the stacks can be usefully framed as a problem. From a process-based standpoint, serendipity is problematic because it is an indicator of a potential misalignment between user intention and process outcome. And, from a perception-based standpoint, serendipity is problematic because it can encourage user-constructed meanings for libraries that are rooted in opposition to change rather than in users’ immediate and evolving information needs.

To illustrate the “…potential misalignment between user intention and process outcome,” Carr uses the illustration of a user looking for a specific volume by call number; the book’s absence from its expected location results in the discovery of an even more useful book nearby. That Carr describes as:

Even if this information were to prove to be more valuable to the user than the information in the book that was sought, the user’s serendipitous discovery nevertheless signifies a misalignment of user intention and process outcome.

Sorry, that went by rather quickly. If the user considers the discovery to be a favorable outcome, why should we take Carr’s word that it “signifies a misalignment of user intention and process outcome?” What other measure for success should an information retrieval system have other than satisfaction of its users? What other measure would be meaningful?

Carr refuses to consider how libraries could seek to maximize what is seen as a positive experience by users because:

By situating the library as a tool that functions to facilitate serendipitous discovery in the stacks, librarians risk also situating the library as a mechanism that functions as a symbolic antithesis to the tools for discovery that are emerging in online environments. In this way, the library could signify a kind of bastion against change. Rather than being cast as a vital tool for meeting discovery needs in emergent online environments, the library could be marginalized in a way that suggests to users that they perceive it as a means of retreat from online environments.

I don’t doubt the same people who think librarians are superfluous since “everyone can find what they need on the Internet” would be quick to find libraries as being “bastion[s] against change.” For any number of reasons. But the opinions of semi-literates should not dictate library policy.

What Carr fails to take into account is that a stacks “environment,” which he concedes does facilitate serendipitous discovery, can be replicated in digital space.

For example, while it is currently a prototype, StackLife at Harvard is an excellent demonstration of a virtual stack environment.

[Image: StackLife virtual stack interface]

Jonathan Zittrain, Vice-Dean for Library and Information Resources, Harvard Law School; Professor of Law at Harvard Law School and the Harvard Kennedy School of Government; Professor of Computer Science at the Harvard School of Engineering and Applied Sciences; Co-founder of the Berkman Center for Internet & Society, nominated StackLife for Stanford Prize for Innovation in Research Libraries, saying in part:

  • It always shows a book (or other item) in a context of other books.
  • That context is represented visually as a scrollable stack of items — a shelf rotated so that users can more easily read the information on the spines.
  • The stack integrates holdings from multiple libraries.
  • That stack is sorted by “StackScore,” a measure of how often the library’s community has used a book. At the Harvard Library installation, the computation includes ten year aggregated checkouts weighted by faculty, grad, or undergrad; number of holdings in the 73 campus libraries, times put on reserve, etc.
  • The visualization is simple and clean but also information-rich. (a) The horizontal length of the book reflects the physical book’s height. (b) The vertical height of the book in the stack represents its page count. (c) The depth of the color blue of the spine indicates its StackScore; a deeper blue means that the work is more often used by the community.
  • When clicked, a work displays its Library of Congress Subject Headings (among other metadata). Clicking one of those headings creates a new stack consisting of all the library’s items that share that heading.
  • If there is a Wikipedia page about that work, Stacklife also displays the Wikipedia categories on that page, and lets the user explore by clicking on them.
  • Clicking on a work creates an information box that includes bibliographic information, real-time availability at the various libraries, and, when available: (a) the table of contents; (b) a link to Google Books’ online reader; (c) a link to the Wikipedia page about that book; (d) a link to any National Public Radio audio about the work; (e) a link to the book’s page at Amazon.
  • Every author gets a page that shows all of her works in the library in a virtual stack. The user can click to see any of those works on a shelf with works on the same topic by other authors.
  • Stacklife is scalable, presenting enormous collections of items in a familiar way, and enabling one-click browsing, faceting, and subject-based clustering.

Does StackLife sound like a library “…that [is] rooted in opposition to change rather than in users’ immediate and evolving information needs”?

I can’t speak for you but it doesn’t sound that way to me. It sounds like a library that isn’t imposing its definition of satisfaction upon users (good for Harvard) and that is working to blend the familiar with the new to the benefit of its users.

We can only hope that College & Research Libraries will have a response from the StackLife project to Carr’s essay in the same issue.

PS: If you have library friends who don’t read this blog, please forward a link to this post to their attention. I know they are consumed with their current tasks but the StackLife project is one they need to be aware of. Thanks!

I first saw the essay on Facebook in a posting by Simon St.Laurent.

Codelists by Statistics Belgium

Filed under: Linked Data — Patrick Durusau @ 2:41 pm

Codelists by Statistics Belgium

From the webpage:

The following codelists have been published according to the principles of linked data:

  • NIS codes, the list of alphanumeric codes for indicating administrative geographical areas as applied in statistical applications in Belgium
  • NACE 2003, the statistical Classification of Economic Activities in the European Community, Belgian adaptation, version 2003
  • NACE 2008, the statistical Classification of Economic Activities in the European Community, Belgian adaptation, version 2008

We hope to publish a mapping between NACE 2003 and NACE 2008 soon.

The data sets themselves may not be interesting but I did find the “Explore the dataset” options of interest. You can:

  • Lookup a resource by its identifier
  • Retrieve identifiers for a label (Reconciliation Service API)
  • Search by keyword

I tried Keerbergen, one of the examples under Retrieve identifiers for a label and got this result:

{"result": [
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true}
]}

Amazing.

Wikipedia reports on Keerbergen as follows:

Keerbergen (Dutch pronunciation: [ˈkeːrbɛrɣə(n)]) is a municipality located in the Belgian province of Flemish Brabant. The municipality comprises only the town of Keerbergen proper. On January 1, 2006 Keerbergen had a total population of 12,444. The total area is 18.39 km² which gives a population density of 677 inhabitants per km².

I would think town or municipality is more of a concept and Keerbergen is a specific place. Yes?

Just to be sure, I also searched for Keerbergen as a keyword: same results, Keerbergen is a concept.

You should check these datasets before drawing any conclusions based upon them.
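
A minimal sketch of that kind of check, assuming you have saved a reconciliation response like the one above to a local file (the filename is mine; substitute whatever you saved the response as):

import json

# Load a reconciliation response saved from the "Retrieve identifiers
# for a label" service, e.g. the Keerbergen response shown above.
with open("keerbergen.json", encoding="utf-8") as f:
    response = json.load(f)

results = response["result"]
unique_ids = {r["id"] for r in results}
types = {t for r in results for t in r["type"]}

print(len(results), "results,", len(unique_ids), "distinct identifier(s)")
print("types:", sorted(types))
# For the response above: 4 results, 1 distinct identifier,
# and the only type is skos:Concept -- no geographic class in sight.

Four copies of the same identifier and a single, very generic type is exactly the sort of thing a quick check like this surfaces.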

I first saw this in a tweet by Paul Hermans (who lives in Keerbergen, the town).

Review of Large-Scale RDF Data Processing in MapReduce

Filed under: MapReduce,RDF,Semantic Web — Patrick Durusau @ 1:38 pm

Review of Large-Scale RDF Data Processing in MapReduce by Ke Hou, Jing Zhang and Xing Fang.

Abstract:

Resource Description Framework (RDF) is an important data presenting standard of semantic web and how to process, the increasing RDF data is a key problem for development of semantic web. MapReduce is a widely-used parallel programming model which can provide a solution to large-scale RDF data processing. This study reviews the recent literatures on RDF data processing in MapReduce framework in aspects of the forward-chaining reasoning, the simple querying and the storage mode determined by the related querying method. Finally, it is proposed that the future research direction of RDF data processing should aim at the scalable, increasing and complex RDF data query.

I count twenty-nine (29) projects with two- to three-sentence summaries of each one. A great starting point for an in-depth review of RDF data processing using MapReduce.
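
The reviewed systems run at Hadoop scale, but the underlying map/shuffle/reduce pattern they apply to triples is easy to sketch. The toy, in-memory illustration below is not code from any of the reviewed projects; it just groups N-Triples by predicate with a deliberately naive parse:

from collections import defaultdict

# Toy map/reduce over N-Triples: count triples per predicate.
# Real systems distribute both phases across a cluster.
triples = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .',
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .',
]

def map_phase(line):
    # Naive split: assumes simple, space-separated N-Triples.
    subject, predicate, remainder = line.split(" ", 2)
    yield predicate, 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for predicate, count in pairs:
        counts[predicate] += count
    return dict(counts)

mapped = (pair for line in triples for pair in map_phase(line))
print(reduce_phase(mapped))
# {'<http://xmlns.com/foaf/0.1/name>': 2, '<http://xmlns.com/foaf/0.1/knows>': 1}

Forward-chaining reasoning and the querying strategies the survey covers are elaborations of this same pattern, with much of the design work in how triples are keyed and stored between passes.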

I first saw this in a tweet by Marin Dimitrov.

MUST in XPath 3.1/XQuery 3.1/XQueryX 3.1

Filed under: Standards,XPath,XQuery — Patrick Durusau @ 12:13 pm

Last Monday, in “Redefining RFC 2119? Danger! Danger! Will Robinson!”, I mentioned the problems with redefining may and must in XPath and XQuery Functions and Operators 3.1.

Requirements language is one of the first things to check in any specification, so I thought I should round that issue out by looking at the requirements language in XPath 3.1, XQuery 3.1, and XQueryX 3.1.

XPath 3.1

XPath 3.1 includes RFC 2119 as a normative reference but then never cites RFC 2119 in the document or uses uppercase MUST.

I suspect that is the case because of Appendix F Conformance:

XPath is intended primarily as a component that can be used by other specifications. Therefore, XPath relies on specifications that use it (such as [XPointer] and [XSL Transformations (XSLT) Version 3.0]) to specify conformance criteria for XPath in their respective environments. Specifications that set conformance criteria for their use of XPath must not change the syntactic or semantic definitions of XPath as given in this specification, except by subsetting and/or compatible extensions.

The specification of such a language may describe it as an extension of XPath provided that every expression that conforms to the XPath grammar behaves as described in this specification. (Edited to include the actual links to XPointer and XSLT; pointing internally to a bibliography defeats the purpose of hyperlinking.)

Personally, I would simply remove the RFC 2119 reference, since XPath 3.1 is a set of definitions to which conformance is mandated, or not, by other specifications.

XQuery 3.1 and XQueryX 3.1

XQuery 3.1 5 Conformance reads in part:

This section defines the conformance criteria for an XQuery processor. In this section, the following terms are used to indicate the requirement levels defined in [RFC 2119]. [Definition: MUST means that the item is an absolute requirement of the specification.] [Definition: MUST NOT means that the item is an absolute prohibition of the specification.] [Definition: MAY means that an item is truly optional.] [Definition: SHOULD means that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.] (Emphasis in the original)

XQueryX 3.1 5 Conformance reads in part:

This section defines the conformance criteria for an XQueryX processor (see Figure 1, “Processing Model Overview”, in [XQuery 3.1: An XML Query Language], Section 2.2 Processing Model XQ31).

In this section, the following terms are used to indicate the requirement levels defined in [RFC 2119]. [Definition: MUST means that the item is an absolute requirement of the specification.] [Definition: SHOULD means that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.] [Definition: MAY means that an item is truly optional.]

First, the better practice is not to repeat definitions found elsewhere (a source of error and misstatement) but to cite RFC 2119 as follows:

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119].

Second, the bolding of MUST, etc., found in XQuery 3.1 is unnecessary, particularly when it is not followed by bolding of MUST where it is used in the conformance clauses. The best practice is to simply use UPPERCASE in both places.

Third, and really my principal reason for mentioning XQuery 3.1 and XQueryX 3.1, is to call attention to their use of RFC 2119 keywords. That is to say, you will find the keywords in the conformance clauses and not anywhere else in the specification.

Both use the word “must” in their texts, but only as it would normally appear in prose, so implementers don’t have to pore through a sprinkling of MUST as you see in some drafts, which makes for stilted writing and traps for the unwary.

The usage of RFC 2119 keywords in XQuery 3.1 and XQueryX 3.1 makes the job of writing in declarative prose easier, eliminates the need to distinguish MUST and must in the normative text, and gives clear guidance to implementers as to the requirements to be met for conformance.
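
That distribution is also easy to check for yourself. A quick sketch along these lines (the filename and invocation are my own; save the specification as plain text first) reports where uppercase RFC 2119 keywords actually occur:

import re
import sys

# Report the line numbers where uppercase RFC 2119 keywords occur in a
# specification saved as plain text.
# Usage: python rfc2119_scan.py spec.txt
KEYWORDS = re.compile(r"\b(MUST NOT|MUST|SHALL NOT|SHALL|SHOULD NOT|SHOULD|"
                      r"REQUIRED|RECOMMENDED|MAY|OPTIONAL)\b")

def keyword_lines(path):
    hits = []
    with open(path, encoding="utf-8") as spec:
        for number, line in enumerate(spec, start=1):
            for match in KEYWORDS.finditer(line):
                hits.append((number, match.group(0)))
    return hits

if __name__ == "__main__":
    for number, keyword in keyword_lines(sys.argv[1]):
        print("line {0}: {1}".format(number, keyword))

If the reading above is right, the hits for XQuery 3.1 and XQueryX 3.1 will cluster in their conformance sections; stray uppercase keywords anywhere else are worth a second look.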

I was quick to point out an error in my last post so it is only proper that I be quick to point out a best practice in XQuery 3.1 and XQueryX 3.1 as well.

This coming Friday, 9 January 2015, I will have a post on proofing the content proper of this bundle of specifications.

PS: I am encouraging you to take on this venture into proofing specifications because this particular bundle of W3C specification work is important for pointing into data. If we don’t have reliable and consistent pointing, your topic maps will suffer.

January 6, 2015

Become Data Literate

Filed under: Data Mining — Patrick Durusau @ 7:10 pm

Become Data Literate

From the webpage:

Sign up to receive a new dataset and fun problems every two weeks.

Improving your sense with data is as easy as trying our problems!

The first one is available now!

No endorsement (I haven’t seen the first problem or dataset) but this could be fun.

I will keep you updated on what shows up as datasets and problems.

PS: Not off to a great start: after signing up I got a pop-up window asking me to invite a friend. 🙁 If I had Dick Cheney’s email address I might, but I don’t. If and when I am impressed by the datasets/problems, I will mention it here and maybe in an email to a friend.

Social networks can be very useful but they also are distractions. I prefer to allow my friends to choose their own distractions.

Scientific Computing on the Erlang VM

Filed under: Erlang,Programming,Science,Scientific Computing — Patrick Durusau @ 6:48 pm

Scientific Computing on the Erlang VM by Duncan McGreggor.

From the post:

This tutorial brings in the New Year by introducing the Erlang/LFE scientific computing library lsci – a ports wrapper of NumPy and SciPy (among others) for the Erlang ecosystem. The topic of the tutorial is polynomial curve-fitting for a given data set. Additionally, this post further demonstrates py usage, the previously discussed Erlang/LFE library for running Python code from the Erlang VM.

Background

The content of this post was taken from a similar tutorial done by the same author for the Python Lisp Hy in an IPython notebook. It, in turn, was completely inspired by the Clojure Incantor tutorial on the same subject, by David Edgar Liebke.

This content is also available in the lsci examples directory.

Introduction

The lsci library (pronounced “Elsie”) provides access to the fast numerical processing libraries that have become so popular in the scientific computing community. lsci is written in LFE but can be used just as easily from Erlang.

Just in case Erlang was among your New Year’s Resolutions. 😉

Well, that’s not the only reason. You are going to encounter data processing that was performed in systems or languages that are strange to you. Assuming access to the data and a sufficient explanation of what was done, you need to be able to verify the analysis in a language that is comfortable to you.
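
For example, if you wanted to re-check a polynomial fit like the one in the lsci tutorial, the same step in NumPy (which lsci wraps) takes only a few lines. The data below is synthetic, standing in for whatever data set the original analysis used:

import numpy as np

# Synthetic data roughly following a quadratic, plus noise, standing in
# for the data set of the original analysis.
np.random.seed(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0 + np.random.normal(0, 5, size=x.size)

# Fit a degree-2 polynomial and measure how well it reproduces the data.
coefficients = np.polyfit(x, y, 2)   # highest-order coefficient first
fitted = np.polyval(coefficients, x)
rms_residual = np.sqrt(np.mean((y - fitted) ** 2))

print("coefficients:", np.round(coefficients, 2))
print("RMS residual:", round(rms_residual, 2))

If the recovered coefficients and the residual match what the original analysis reported, you have independent confirmation in a language you already trust.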

There isn’t now, nor is there likely to be, a shortage of languages and applications for data processing. Apologies to the various evangelists who dream of world domination for their favorite. Unless and until that happy day arrives for someone, the rest of us need to survive in a multilingual and multi-application space.

Which means having the necessary tools for data analysis/verification in your favorite tool suite counts for a lot. It is the difference between taking someone’s word for an analysis and verifying the analysis for yourself. There is a world of difference between those two positions.
