Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 16, 2013

Optimizing TM Queries?

Filed under: Query Language,TMQL,XML,XQuery — Patrick Durusau @ 7:56 pm

A recent paper by V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyễn, Optimizing XML querying using type-based document projection, suggests some interesting avenues for optimizing topic map queries.

Abstract:

XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D′; the query Q is then executed on D′, hence avoiding to allocate and process nodes that will never be reached by Q.

In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lesser pruning overhead, our solution (unlike current approaches) takes into account backward axes, predicates, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and Schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the main memory query engine’s performances (as expected) but also those of state of the art native XML databases.

Phrased in traditional XML terms but imagine pruning a topic map by topic or association types, for example, before execution of a query.

While it is true enough that a query could include topic type, there remains the matter of examining all the instances of a topic type before proceeding to the rest of the query.

For common query sub-maps, as it were, I suspect pruning once and storing the results could be a viable alternative.
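The idea can be sketched in a few lines. This is a minimal illustration, not the paper's type-based method: the document layout, attribute names, and the `prune` function below are all hypothetical, standing in for pruning a topic map by topic type before a query runs.

```python
import xml.etree.ElementTree as ET

def prune(doc_xml, keep_types):
    """Drop top-level <topic> elements whose type the query never touches."""
    root = ET.fromstring(doc_xml)
    for topic in list(root):          # copy the child list so removal is safe
        if topic.get("type") not in keep_types:
            root.remove(topic)
    return root

doc = """<topicmap>
  <topic type="person" id="t1"/>
  <topic type="place" id="t2"/>
  <topic type="person" id="t3"/>
</topicmap>"""

# A query over "person" topics only ever needs the pruned tree:
pruned = prune(doc, {"person"})
print([t.get("id") for t in pruned])  # ['t1', 't3']
```

Storing `pruned` once and reusing it for a family of queries is the "prune once" alternative suggested above.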

Processing millions or billions of nodes may make for impressive charts, but processing the right set of nodes and producing a useful answer has its supporters.

January 15, 2013

XQuery 3.0: An XML Query Language [Subject Identity Equivalence Language?]

Filed under: Identity,XML,XQuery — Patrick Durusau @ 8:32 pm

XQuery 3.0: An XML Query Language – W3C Candidate Recommendation

Abstract:

XML is a versatile markup language, capable of labeling the information content of diverse data sources including structured and semi-structured documents, relational databases, and object repositories. A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware. This specification describes a query language called XQuery, which is designed to be broadly applicable across many types of XML data sources.

Just starting to read the XQuery CR but the thought occurred to me that it could be a basis for a “subject identity equivalence language.”

Rather than duplicating the work on expressions, paths, data types, operators, etc., why not take all that as given?

It would suffice to define a “subject equivalence function,” the variables of which are XQuery statements that identify values (or value expressions) as required, optional or forbidden, together with a definition of the results of the function.

Reusing a well-tested query language seems preferable to writing an entirely new one from scratch.
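The shape of such a function can be sketched without the XQuery machinery. In the real thing the value expressions would be XQuery paths; here they are plain dictionary keys, and every name below is hypothetical:

```python
def equivalent(a, b, required, forbidden):
    """Two subjects are equivalent if every required value is present and
    matches, and no forbidden value matches."""
    for key in required:
        if a.get(key) is None or a.get(key) != b.get(key):
            return False
    for key in forbidden:
        if a.get(key) is not None and a.get(key) == b.get(key):
            return False
    return True

s1 = {"isbn": "978-3-16-148410-0", "title": "Example"}
s2 = {"isbn": "978-3-16-148410-0", "title": "Example, 2nd ed."}

print(equivalent(s1, s2, required={"isbn"}, forbidden=set()))   # True
print(equivalent(s1, s2, required={"title"}, forbidden=set()))  # False
```

Swapping the dictionary lookups for XQuery value expressions is the point of reusing the existing language: the equivalence logic stays this small.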

Suggestions?

I first saw this in a tweet by Michael Kay.

January 10, 2013

Markup Olympics (Balisage) [No Drug Testing]

Filed under: Conferences,XML,XML Database,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 1:46 pm

Markup athletes take heart! Unlike venues that intrude into the personal lives of competitors, there are no, repeat no drug tests for presenters at Balisage!

Fear no trainer betrayals or years of being dogged by second-raters in the press.

Eat, drink, visit, ???, present, in the company of your peers.

The more traditional call for participation, yawn, has the following details:

Dates:

15 March 2013 – Peer review applications due
19 April 2013 – Paper submissions due
19 April 2013 – Applications for student support awards due
21 May 2013 – Speakers notified
12 July 2013 – Final papers due

5 August 2013 – Pre-conference Symposium on XForms
6-9 August 2013 – Balisage: The Markup Conference

From the call:

Balisage is where people interested in descriptive markup meet each year in August for informed technical discussion, occasionally impassioned debate, good coffee, and the incomparable ambience of one of North America’s greatest cities, Montreal. We welcome anyone interested in discussing the use of descriptive markup to build strong, lasting information systems.

Practitioner or theorist, tool-builder or tool-user, student or lecturer — you are invited to submit a paper proposal for Balisage 2013. As always, papers at Balisage can address any aspect of the use of markup and markup languages to represent information and build information systems. Possible topics include but are not limited to:

  • XML and related technologies
  • Non-XML markup languages
  • Big Data and XML
  • Implementation experience with XML parsing, XSLT processors, XQuery processors, XML databases, XProc integrations, or any markup-related technology
  • Semantics, overlap, and other complex fundamental issues for markup languages
  • Case studies of markup design and deployment
  • Quality of information in markup systems
  • JSON and XML
  • Efficiency of Markup Software
  • Markup systems in and for the mobile web
  • The future of XML and of descriptive markup in general
  • Interesting applications of markup

In addition, please consider becoming a Peer Reviewer. Reviewers play a critical role towards the success of Balisage. They review blind submissions — on topics that interest them — for technical merit, interest, and applicability. Your comments and recommendations can assist the Conference Committee in creating the program for Balisage 2013!

How:

More IQ per square foot than any other conference you will attend in 2013!

December 27, 2012

pdfx v1.0 [PDF-to-XML]

Filed under: Conversion,PDF,XML — Patrick Durusau @ 3:18 pm

pdfx v1.0

From the homepage:

Fully-automated PDF-to-XML conversion of scientific text

I submitted Static and Dynamic Semantics of NoSQL Languages, a paper I blogged about earlier this week. Twenty-four pages of lots of citations and equations.

I forgot to set a timer but it isn’t for the impatient. I think the conversion ran more than ten (10) minutes.

Some mathematical notation defeats the conversion process.

See: Static-and-Dynamic-Semantics-NoSQL-Languages.tar.gz for the original PDF plus the HTML and PDF outputs.

For occasional conversions where heavy math notation isn’t required, this may prove to be quite useful.

December 21, 2012

BaseX. The XML Database. [XPath/XQuery]

Filed under: Editor,XML,XQuery — Patrick Durusau @ 11:08 am

BaseX. The XML Database.

From the webpage:

News: BaseX 7.5 has just been released…

BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents.

Another XML editor but I mention it for its support of XQuery more than as an editor per se.

We continue to lack a standard query language for topic maps and experience with XQuery may prove informative.

Not to mention its possible role in gathering diverse data for presentation in a merged state to users.

<ANGLES>

Filed under: Editor,Software,XML — Patrick Durusau @ 10:33 am

<ANGLES>

From the homepage:

ANGLES is a research project aimed at developing a lightweight, online XML editor tuned to the needs of the scholarly text encoding community. By combining the model of intensive code development (the “code sprint”) with participatory design exercises, testing, and feedback from domain experts gathered at disciplinary conferences, ANGLES will contribute not only a working prototype of a new software tool but also another model for tool building in the digital humanities (the “community roadshow”).

Work on ANGLES began in November 2012.

We’ll have something to share very soon!

<ANGLES> is an extension of ACE:

ACE is an embeddable code editor written in JavaScript. It matches the features and performance of native editors such as Sublime, Vim and TextMate. It can be easily embedded in any web page and JavaScript application. ACE is maintained as the primary editor for Cloud9 IDE and is the successor of the Mozilla Skywriter (Bespin) project.

<ANGLES> code at Sourceforge.

I will be interested to see how ACE is extended. Just glancing at it this morning, it appears to be the traditional “display angle bang syntax” editor we all know so well.

What puzzles me is that we have been to the mountain of teaching users to be comfortable with raw XML markup and the results have not been promising.

As opposed to the experience with OpenOffice, MS Office, etc., which have proven that creating documents that are then expressed in XML is within the range of ordinary users.

<ANGLES> looks like an interesting project but whether it brings XML editing within the reach of ordinary users is an open question.

If the XML editing puzzle is solved, perhaps it will have lessons for topic map editors.

November 20, 2012

Balisage 2013 – Dates/Location

Filed under: Conferences,XML,XML Database,XML Query Rewriting,XML Schema,XPath,XQuery,XSLT,XTM — Patrick Durusau @ 3:19 pm

Tommie Usdin just posted email with the Balisage 2013 dates and location:

Montreal, Hotel Europa, August 5 – 9 , 2013

Hope that works with everything else.

That’s the entire email so I don’t know what was meant by:

Hope that works with everything else.

Short of it being your own funeral, open-heart surgery or giving birth (to your first child), I am not sure what “everything else” there could be?

You get a temporary excuse for the second two cases and a permanent excuse for the first one.

Now’s a good time to hint about plane fare plus hotel and expenses for Balisage as a stocking stuffer.

And to wish a happy holiday to Tommie Usdin and to all the folks at Mulberry Technology who make Balisage possible for all of us. Each and every one.

September 11, 2012

XML-Print 1.0

Filed under: Mapping,Visualization,XML — Patrick Durusau @ 2:46 pm

Prof. Marc W. Küster announced XML-Print 1.0 this week, “…an open source XML formatter designated especially for the needs of the Digital Humanities.”

Mapping from “…semantic structures to typesetting styles….” (from below)

We have always mapped from semantic structures to typesetting styles, but this time it will be explicit.

Consider whether you need “transformation” (implies a static file output) or merely a “view” for some purpose, such as printing?

Both require mappings but the latter keeps your options open, as it were.

Enjoy!

XML-Print allows the end user to directly interact with semantically annotated data. It consists of two independent, but well-integrated components, an Eclipse-based front-end that enables the user to map their semantic structures to typesetting styles, and the typesetting engine proper that produces the PDF based on this mapping. Both components build as much as possible on existing standards such as XML, XSL-T and XSL-FO and extend those only where absolutely necessary, e.g. for the handling of critical apparatuses.

XML-Print is a DFG-supported joint project of the FH Worms (Prof. Marc W. Küster) and the University of Trier (Prof. Claudine Moulin) in collaboration with the TU Darmstadt (Prof. Andrea Rapp). It is released under the Eclipse Public Licence (EPL) for the front-end and the Affero General Public Licence (AGPL) for the typesetting engine. The project is currently roughly half-way through its intended duration. In its final incarnation the PDF that is produced will satisfy the full set of requirements for the typesetting of (amongst others) critical editions including critical apparatuses, multicolumn synoptic texts and bidirectional text. At this stage it can already handle basic formatting as well as multiple apparatuses, albeit still with some restrictions and rough edges. It is work in progress with new releases coming out regularly.

If you have questions, please do not hesitate to contact us via our website http://www.xmlprint.eu or directly to print@uni-trier.de. Any and all feedback is welcome. Moreover, if you know some people you think could benefit from XML-Print, please feel free to spread the news amongst your peers.

Project homepage: http://www.xmlprint.eu
Source code: http://sourceforge.net/projects/xml-print/
Installers for Windows, Mac and Linux:
http://sourceforge.net/projects/xml-print/files/

August 12, 2012

St. Laurent on Balisage

Filed under: JSON,XML — Patrick Durusau @ 4:29 pm

Applying markup to complexity: The blurry line between markup and programming by Simon St. Laurent.

Simon’s review of Balisage will make you want to attend next year, if you missed this year.

He misses an important issue with JSON (and XML) when he writes:

JSON gave programmers much of what they wanted: a simple format for shuttling (and sometimes storing) loosely structured data. Its simpler toolset, freed of a heritage of document formats and schemas, let programmers think less about information formats and more about the content of what they were sending.

XML and JSON look at data through different lenses. XML is a tree structure of elements, attributes, and content, while JSON is arrays, objects, and values. Element order matters by default in XML, while JSON is far less ordered and contains many more anonymous structures. (emphasis added)

The problem with JSON in a nutshell (apologies to O’Reilly): anonymous structures.

How is a subsequent programmer going to discover the semantics of “anonymous structures?”

Works great for job security, works less well for information integration several “generations” of programmers later.

XML can be as poorly documented as JSON, but the relationships between elements are at least explicit.

Anonymity, of all kinds, is the enemy of re-use of data, semantic integration and useful archiving of data.

If those aren’t your use cases, use anonymous JSON structures. (Or undocumented XML.)
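The contrast is easy to see side by side. The record below is made up, but it shows what a later programmer inherits in each case:

```python
import json
import xml.etree.ElementTree as ET

# The JSON array is an anonymous structure: positions, no names.
record = json.loads('["Durusau", "Patrick", 2012]')
print(record[2])  # 2012 -- but is it a year? an article count? who knows

# The XML equivalent carries its semantics, however thin, in element names.
xml_record = ("<author><surname>Durusau</surname>"
              "<given>Patrick</given><since>2012</since></author>")
print(ET.fromstring(xml_record).find("since").text)  # 2012, labeled "since"
```

Neither format documents itself fully, but the XML version gives the archaeologist of five programmer-generations hence something to dig with.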

August 4, 2012

Using the flickr XML/API as a source of RSS feeds

Filed under: Data,XML,XSLT — Patrick Durusau @ 2:07 pm

Using the flickr XML/API as a source of RSS feeds by Pierre Lindenbaum.

Pierre has created an XSLT stylesheet to transform XML from flickr into an RSS feed.

Something for your data harvesting recipe box.
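Pierre's version is XSLT; the same reshaping can be sketched in a few lines of Python. The element and attribute names below are invented for illustration, not flickr's actual API fields:

```python
import xml.etree.ElementTree as ET

# Stand-in for an API response (not flickr's real schema):
photos = """<photos>
  <photo id="1" title="Sunset"/>
  <photo id="2" title="Harbor"/>
</photos>"""

# Reshape each <photo> into an RSS <item>:
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
for photo in ET.fromstring(photos):
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = photo.get("title")
    ET.SubElement(item, "link").text = "https://example.org/photos/" + photo.get("id")

print(ET.tostring(rss, encoding="unicode"))
```

Either way, the recipe is the same: walk the source elements, emit one item per record.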

July 17, 2012

If you are in Kolkata/Pune, India…a request.

Filed under: Search Engines,Synonymy,Word Meaning,XML — Patrick Durusau @ 1:55 pm

No email addresses are given for the authors of: Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words, but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

Meaning of Web-page content plays a big role while producing a search result from a search engine. In most cases the Web-page meaning is stored in the title or meta-tag area, but those meanings do not always match the Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In cases where the Web-page content holds dual meaning words, it is really difficult to identify the meaning of the Web-page. In this paper, we introduce a new design and development mechanism for identifying the meaning of Web-page content which holds dual meaning words.

July 9, 2012

TUSTEP is open source – with TXSTEP providing a new XML interface

Filed under: Text Analytics,Text Mining,TUSTEP/TXSTEP,XML — Patrick Durusau @ 9:15 am

TUSTEP is open source – with TXSTEP providing a new XML interface

I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP. 😉

From the TUSTEP homepage:

TUSTEP is a professional toolbox for scholarly processing textual data (including those in non-latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).

Since the title “big data” is taken, perhaps we should take “complex data” for texts.

If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.

Or consider contributing to the project as well.

Wilhelm Ott writes (in part):

We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.

TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early '70s as a tool for supporting humanities projects at the University of Tübingen, relying on the University's own funds. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German-speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (including lexicographic works and critical editions) can be found on the TUSTEP webpage.

TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:

  • it will offer an up-to-date established syntax for scripting;
  • it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
  • it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
  • it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
  • the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.

At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEP's functions will be presented. A demonstration of TXSTEP's functionality will include tasks which cannot easily be performed by existing XML tools.

After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.

OK, I confess a fascination with complex textual analysis.

June 29, 2012

MuteinDB

Filed under: Bioinformatics,Biomedical,Genome,XML — Patrick Durusau @ 3:16 pm

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity you ask?

I will let the author’s answer:

Information about specific proteins and their muteins is widely spread in the literature. Many studies describe only a single mutation and its effects, without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easily understandable and comprehensive way. Essential information may be lost, such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place is highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back. 😉

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML based formats but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long term useful results. (Which I think XML can deliver.)

June 25, 2012

Show Me The Money!

Filed under: Conferences,XBRL,XML,XPath,XQuery — Patrick Durusau @ 2:28 pm

I need to talk to Tommie Usdin about marketing the Balisage conference.

The final program came out today and here is what Tommie had to say:

When the regular (peer-reviewed) part of the Balisage 2012 program was scheduled, a few slots were reserved for presentation of “Late breaking” material. These presentations have now been selected and added to the program.

Topics added include:

  • making robust and multi-platform ebooks
  • creating representative documents from large document collections
  • validating RESTful services using XProc, XSLT, and XSD
  • XML for design-based (e.g. magazine) publishing
  • provenance in XSLT transformation (tracking what XSLT does to documents)
  • literate programming
  • managing the many XML-related standards and specifications
  • leveraging XML for web applications

The program already included talks about adding RDF to TEI documents, compression of XML documents, exploring large XML collections, Schematron, relation of XML to JSON, overlap, higher-order functions in XSLT, the balance between XML and non-XML notations, and many other topics. Now it is a real must for anyone who thinks deeply about markup.

Balisage is the XML Geek-fest; the annual gathering of people who design markup and markup-based applications; who develop XML specifications, standards, and tools; the people who read and write, books about publishing technologies in general and XML in particular; and super-users of XML and related technologies. You can read about the Balisage 2011 conference at http://www.balisage.net.

Yawn. Are we there yet? 😉

Why you should care about XML and Balisage:

  • US government and others are publishing laws and regulations and soon to be legislative material in XML
  • Securities are increasingly using XML for required government reports
  • Texts and online data sets are being made available in XML
  • All the major document formats are based in XML

A $billion here, a $billion there and pretty soon you are talking about real business opportunity.

Your un-Balisaged XML developers have $1,000 bills blowing overhead.

Be smart, make your XML developers imaginative and productive.

Send your XML developers to Balisage.

(http://www.balisage.net/registration.html)

June 21, 2012

BaseX 7.3 (The Summer Edition) is now available!

Filed under: BaseX,XML,XML Database,XML Schema,XPath,XQuery — Patrick Durusau @ 7:47 am

BaseX 7.3 (The Summer Edition) is now available!

From the post:

we are glad to announce a great new release of BaseX, our XML database and XPath/XQuery 3.0 processor! Here are the latest features:

  • Many new internal XQuery Modules have been added, and existing ones have been revised to ensure long-term stability of your future XQuery applications
  • A new powerful Command API is provided to specify BaseX commands and scripts as XML
  • The full-text fuzzy index was extended to also support wildcard queries
  • The simple map operator of XQuery 3.0 gives you a compact syntax to process items of sequences
  • BaseX as Web Application can now start its own server instance
  • All command-line options will now be executed in the given order
  • Charles Foster’s latest XQJ Driver supports XQuery 3.0 and the Update and Full Text extensions

For those of you in the Northern Hemisphere, we wish you a nice summer! No worries, we’ll stay busy..

Just in time for the start of summer in the Northern Hemisphere!

Something you can toss onto your laptop before you head to the beach.

Err, huh? Well, even if you don’t take BaseX 7.3 to the beach, it promises to be good fun for the summer and more serious work should the occasion arise.

I count twenty-three (23) modules in addition to the XQuery functions specified by the latest XPath/XQuery 3.0 draft.

Just so you know, the BaseX database server listens to port 1984 by default.

June 10, 2012

XML to Graph Converter

Filed under: Geoff,Graphs,Neo4j,XML — Patrick Durusau @ 8:17 pm

XML to Graph Converter

From the webpage:

XML data can easily be converted into a graph. Simply paste the XML data into the left-hand side, convert into Geoff, then view the results in the Neo4j console.

I would have modeled the XML differently, but that is probably a markup prejudice.

Still, an impressive demonstration and worth your time to review.
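The core of such a converter is just a tree walk. This sketch skips Geoff and Neo4j entirely and emits plain node and edge lists, one node per element and one edge per nesting relationship; how the demo actually models things may well differ:

```python
import xml.etree.ElementTree as ET

def xml_to_graph(xml_text):
    """Convert an XML tree into (nodes, edges): one node per element,
    one parent->child edge per nesting."""
    nodes, edges = [], []

    def walk(elem, parent_id):
        node_id = len(nodes)
        nodes.append(elem.tag)
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in elem:
            walk(child, node_id)

    walk(ET.fromstring(xml_text), None)
    return nodes, edges

nodes, edges = xml_to_graph("<a><b><c/></b><d/></a>")
print(nodes)  # ['a', 'b', 'c', 'd']
print(edges)  # [(0, 1), (1, 2), (0, 3)]
```

Modeling choices (attributes as properties? text as nodes?) are exactly where my markup prejudices would kick in.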

June 6, 2012

A Pluggable XML Editor

Filed under: Editor,XML — Patrick Durusau @ 7:50 pm

A Pluggable XML Editor by Grant Vergottini.

From the post:

Ever since I announced my HTML5-based XML editor, I’ve been getting all sorts of requests for a variety of implementations. While the focus has been, and continues to be, providing an Akoma Ntoso based legislative editor, I’ve realized that the interest in a web-based XML editor extends well beyond Akoma Ntoso and even legislative editors.

So… with that in mind I’ve started making some serious architectural changes to the base editor. From the get-go, my intent had been for the editor to be “pluggable” although I hadn’t totally thought it through. By “pluggable” I mean capable of allowing different information models to be used. I’m actually taking the model a bit further to allow modules to be built that can provide optional functionality to the base editor. What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

Let’s talk about the round-tripping problem for a moment. In the other XML editors I have worked with, the XML model has had to quite closely match the editing view that one works with. So you’re literally authoring the document using that information model. Think about HTML (or XHTML for an XML perspective). The arrangement of the tags pretty much exactly represents how you think of an deal with the components of the document. Paragraphs, headings, tables, images, etc, are all pretty much laid out how you would author them. This is the ideal situation as it makes building the editor quite straight-forward.

Note the line:

What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

I think that means we don't all have to use the same editing view and, at the same time, we can share an underlying format. Or perhaps even annotate texts with subject identities, not even realizing we are helping others.

This is an impressive bit of work and as the post promises, there is more to follow.

(I first saw this at Legal Informatics. http://legalinformatics.wordpress.com/2012/06/05/vergottini-on-improvements-to-akneditor-html-5-based-xml-editor-for-legislation/)

June 1, 2012

Are You Going to Balisage?

Filed under: Conferences,RDF,RDFa,Semantic Web,XML,XML Database,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 2:48 pm

To the tune of “Are You Going to Scarborough Fair:”

Are you going to Balisage?
Parsley, sage, rosemary and thyme.
Remember me to one who is there,
she once was a true love of mine.

Tell her to make me an XML shirt,
Parsley, sage, rosemary, and thyme;
Without any seam or binary code,
Then she shall be a true lover of mine.

….

Oh, sorry! There you will see:

  • higher-order functions in XSLT
  • Schematron to enforce consistency constraints
  • relation of the XML stack (the XDM data model) to JSON
  • integrating JSON support into XDM-based technologies like XPath, XQuery, and XSLT
  • XML and non-XML syntaxes for programming languages and documents
  • type introspection in XQuery
  • using XML to control processing in a document management system
  • standardizing use of XQuery to support RESTful web interfaces
  • RDF to record relations among TEI documents
  • high-performance knowledge management system using an XML database
  • a corpus of overlap samples
  • an XSLT pipeline to translate non-XML markup for overlap into XML
  • comparative entropy of various representations of XML
  • interoperability of XML in web browsers
  • XSLT extension functions to validate OCL constraints in UML models
  • ontological analysis of documents
  • statistical methods for exploring large collections of XML data

Balisage is an annual conference devoted to the theory and practice of descriptive markup and related technologies for structuring and managing information. Participants typically include XML users, librarians, archivists, computer scientists, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, members of the working groups which define the specifications, academics, industrial researchers, representatives of governmental bodies and NGOs, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Discussion is open, candid, and unashamedly technical.

The Balisage 2012 Program is now available at: http://www.balisage.net/2012/Program.html

May 29, 2012

Destination: Montreal!

If you remember the Saturday afternoon sci-fi movies, Destination: …., then you will appreciate the title for this post. 😉

Tommie Usdin and company just posted: Balisage 2012 Call for Late-breaking News, written in torn bodice style:

The peer-reviewed part of the Balisage 2012 program has been scheduled (and will be announced in a few days). A few slots on the Balisage program have been reserved for presentation of “Late-breaking” material.

Proposals for late-breaking slots must be received by June 15, 2012. Selection of late-breaking proposals will be made by the Balisage conference committee, instead of being made in the course of the regular peer-review process.

If you have a presentation that should be part of Balisage, please send a proposal message as plain-text email to info@balisage.net.

In order to be considered for inclusion in the final program, your proposal message must supply the following information:

  • The name(s) and affiliations of all author(s)/speaker(s)
  • The email address of the presenter
  • The title of the presentation
  • An abstract of 100-150 words, suitable for immediate distribution
  • Disclosure of when and where, if some part of this material has already been presented or published
  • An indication as to whether the presenter is comfortable giving a conference presentation and answering questions in English about the material to be presented
  • Your assurance that all authors are willing and able to sign the Balisage Non-exclusive Publication Agreement (http://www.balisage.net/BalisagePublicationAgreement.pdf) with respect to the proposed presentation

In order to be in serious contention for inclusion in the final program, your proposal should probably be either a) really late-breaking (it happened in the last month or two) or b) a paper, an extended paper proposal, or a very long abstract with references. Late-breaking slots are few and the competition is fiercer than for peer-reviewed papers. The more we know about your proposal, the better we can appreciate the quality of your submission.

Please feel encouraged to provide any other information that could aid the conference committee as it considers your proposal, such as a detailed outline, samples, code, and/or graphics. We expect to receive far more proposals than we can accept, so it’s important that you send enough information to make your proposal convincing and exciting. (This material may be attached to the email message, if appropriate.)

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity. (emphasis added to last sentence)

Read that last sentence again!

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity.

The conference committee might change your abstract and/or title to say something …. controversial? ….attention getting? ….CNN / Slashdot worthy?

Bring it on!

Submit late breaking proposals!

Please!

April 24, 2012

TEI Boilerplate

Filed under: Text Encoding Initiative (TEI),XML — Patrick Durusau @ 7:15 pm

TEI Boilerplate

If you don’t know it, the TEI (Text Encoding Initiative) is one of the oldest digital humanities projects, dedicated to fashioning encoding solutions for non-digital texts. The Encoding Guidelines, as they are known, were designed to capture the complexities of pre-digital texts.

If you doubt the complexities of pre-digital texts, consider the following image of a cover page from the Leningrad Codex:

Leningrad Codex Image

Or, consider this page from the Mikraot Gedolot:

Mikraot Gedolot Image

There are more complex pages, such as the mss. of Charles Peirce (Peirce Logic Notebook, Charles Sanders Peirce Papers MS Am 1632 (339). Houghton Library, Harvard University, Cambridge, Mass.):

Peirce Logic Notebook, Charles Sanders Peirce Papers MS AM 1632 (339)

And those are just a few random examples. Encoding pre-digital texts is a complex and rewarding field of study.

Not that “born digital” texts need concede anything to “pre-digital” texts. When you think about our capacity to capture versions, multiple authors, sources, interpretations of readers, discussions and the like, the wealth of material that can be associated with any one text becomes quite complex.

Consider for example the Harry Potter book series that spawned websites, discussion lists, interviews with the author, films and other resources. Not quite like the interpretative history of the Bible but enough to make an interesting problem.

Anything that can encode that range of texts is of necessity quite complex itself and therein lies the rub. You work very hard at document analysis, using or extending the TEI Guidelines to encode your text, now what?

You can:

  1. Show the XML text to family and friends. Always a big hit at parties. 😉
  2. Use your tame XSLT wizard to create a custom conversion of the XML text so normal people will want to see and use it.
  3. Use the TEI Boilerplate project for a stock delivery of the XML text so normal people (like your encoders and funders) will want to see and use it.

From the webpage:

TEI Boilerplate is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML. Our TEI Boilerplate Demo illustrates many TEI features rendered by TEI Boilerplate.
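The “no server-side processing” claim rests on the browser applying a stylesheet referenced from the TEI file itself via an xml-stylesheet processing instruction. As a minimal sketch (the stylesheet filename teibp.xsl and the helper name are illustrative, not taken from the project’s documentation), prepending that instruction with Python looks like:

```python
# Sketch: make a TEI P5 file browser-renderable by prepending an
# xml-stylesheet processing instruction, the mechanism that lets the
# browser transform the XML itself. The stylesheet path is illustrative.
PI = '<?xml-stylesheet type="text/xsl" href="teibp.xsl"?>\n'

def add_boilerplate_pi(tei_xml: str) -> str:
    """Insert the stylesheet PI after the XML declaration, if any."""
    if tei_xml.startswith("<?xml"):
        decl, _, rest = tei_xml.partition("?>")
        return decl + "?>\n" + PI + rest.lstrip("\n")
    return PI + tei_xml

doc = '<?xml version="1.0"?>\n<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'
print(add_boilerplate_pi(doc))
```

Once the instruction is in place, opening the file in a browser runs the transform client-side, which is what makes static hosting sufficient.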

Browser Compatibility

TEI Boilerplate requires a robust, modern browser to do its work. It is compatible with current versions of Firefox, Chrome, Safari, and Internet Explorer (IE 9). If you have problems with TEI Boilerplate with a modern browser, please let us know by filing a bug report at https://sourceforge.net/p/teiboilerplate/tickets/.

Many thanks to John Walsh, Grant Simpson, and Saeed Moaddeli, all from Indiana University for this wonderful addition to the TEI toolbox!

PS: If you have disposable funds and aren’t planning on mining asteroids, please consider donating to the TEI (Text Encoding Initiative). Even asteroid miners need to know Earth history, a history written in texts.

February 25, 2012

XML data clustering: An overview

Filed under: Clustering,Data Clustering,XML,XML Data Clustering — Patrick Durusau @ 7:39 pm

XML data clustering: An overview by Alsayed Algergawy, Marco Mesiti, Richi Nayak, and Gunter Saake.

Abstract:

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article moves into the description of future trends and research issues that still need to be faced.

I thought this survey article would be of particular interest since it covers the syntax and semantics of XML that contains data.

Not to mention that our old friend, heterogeneous data, isn’t far behind:

Since XML data are engineered by different people, they often have different structural and terminological heterogeneities. The integration of heterogeneous data sources requires many tools for organizing and making their structure and content homogeneous. XML data integration is a complex activity that involves reconciliation at different levels: (1) at schema level, reconciling different representations of the same entity or property, and (2) at instance level, determining if different objects coming from different sources represent the same real-world entity. Moreover, the integration of Web data increases the integration process challenges in terms of heterogeneity of data. Such data come from different resources and it is quite hard to identify the relationship with the business subjects. Therefore, a first step in integrating XML data is to find clusters of the XML data that are similar in semantics and structure [Lee et al. 2002; Viyanon et al. 2008]. This allows system integrators to concentrate on XML data within each cluster. We remark that reconciling similar XML data is an easier task than reconciling XML data that are different in structures and semantics, since the latter involves more restructuring. (emphasis added)
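To make the structural side of that clustering concrete, one of the simplest similarity measures in this family represents each document by its set of root-to-element tag paths and compares the sets with the Jaccard coefficient. A minimal sketch (real systems, per the survey, add content and semantic features on top):

```python
# Sketch: structural similarity of two XML documents via tag-path sets
# and the Jaccard coefficient. Documents with high similarity would land
# in the same cluster for integrators to reconcile first.
import xml.etree.ElementTree as ET

def tag_paths(xml_text: str) -> set:
    """Collect every root-to-element tag path in the document."""
    root = ET.fromstring(xml_text)
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(root, "")
    return paths

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

doc1 = "<order><item><name/><price/></item></order>"
doc2 = "<order><item><name/><qty/></item></order>"
print(jaccard(tag_paths(doc1), tag_paths(doc2)))  # 0.6
```

The same skeleton extends naturally: swap the feature extractor (paths, tags, content terms) and the distance measure, and you have moved along the axes of the survey’s taxonomy.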

Two comments to bear in mind while reading this paper.

First, print out or photocopy Table II on page 35, “Features of XML Clustering Approaches.” It will be a handy reminder/guide as you read the coverage of the various techniques.

Second, on the last page, page 41, note that the article was accepted in October of 2009 but not published until October of 2011. It’s great that the ACM has an abundance of excellent survey articles, but a two-year delay in publication is unreasonable.

Surveys in rapidly developing fields are of most interest when they are timely. Electronic publication upon final acceptance should be the rule at an organization such as the ACM.

February 24, 2012

Having a ChuQL at XML on the Cloud

Filed under: ChuQL,Cloud Computing,XML — Patrick Durusau @ 5:04 pm

Having a ChuQL at XML on the Cloud by Shahan Khatchadourian, Mariano P. Consens, and Jérôme Siméon.

Abstract:

MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduce.

In this paper, we describe ChuQL, a MapReduce extension to XQuery, with its corresponding Hadoop implementation. The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework. The ChuQL implementation distributes computation to multiple XQuery engines, providing developers with an expressive language to describe tasks over big data.
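The key/value-records-over-XML pattern the abstract describes can be illustrated in plain Python. This is an analogue of the MapReduce shape ChuQL builds on, not ChuQL syntax: records carry XML fragments as values, a map function emits (word, 1) pairs from element text, and a reduce function sums the counts.

```python
# Sketch of MapReduce over XML records: map emits key/value pairs from
# XML fragments, reduce aggregates by key. An illustration of the model,
# not of ChuQL itself.
import xml.etree.ElementTree as ET
from collections import defaultdict

def map_phase(records):
    """records: iterable of (key, xml_fragment) pairs."""
    for _key, fragment in records:
        for elem in ET.fromstring(fragment).iter():
            if elem.text:
                for word in elem.text.split():
                    yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = [(1, "<doc><p>xml on the cloud</p></doc>"),
           (2, "<doc><p>xml querying</p></doc>")]
print(reduce_phase(map_phase(records)))
```

In Hadoop the two phases run on different machines with a shuffle in between; ChuQL’s contribution is letting XQuery expressions, rather than Java, fill the map and reduce slots.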

The aggregation and co-grouping were the most interesting examples for me.

The description of ChuQL was a bit thin. Pointers to more resources would be appreciated.

February 14, 2012

Would You Know “Good” XML If It Bit You?

Filed under: Uncategorized,XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 5:16 pm

XML is a pale imitation of a markup language. It has resulted in real horrors across the markup landscape. After years in its service, I don’t have much hope of that changing.

But, the Princess of the Northern Marches has organized a war council to consider how to stem the tide of bad XML. Despite my personal misgivings, I wish them well and invite you to participate as you see fit.

Oh, and I found this message about the council meeting:

International Symposium on Quality Assurance and Quality Control in XML

Monday August 6, 2012
Hotel Europa, Montréal, Canada

Paper submissions due April 20, 2012.

A one-day discussion of issues relating to Quality Control and Quality Assurance in the XML environment.

XML systems and software are complex and constantly changing. XML documents are highly varied, may be large or small, and often have complex life-cycles. In this challenging environment quality is difficult to define, measure, or control, yet the justifications for using XML often include promises or implications relating to quality.

We invite papers on all aspects of quality with respect to XML systems, including but not limited to:

  • Defining, measuring, testing, improving, and documenting quality
  • Quality in documents, document models, software, transformations, or queries
  • Case studies in the control of quality in an XML environment
  • Theoretical or practical approaches to measuring quality in XML
  • Does the presence of XML, XML schemas, and XML tools make quality checking easier, harder, or even different from other computing environments?
  • Should XML transforms and schemas be QAed as software? Or configuration files? Or documents? Does it matter?

Paper submissions due April 20, 2012.

Details at: http://www.balisage.net/QA-QC/

You do have to understand the semantics of even imitation markup languages before mapping them to more robust languages. Enjoy!

February 12, 2012

XML Prague 2012 (proceedings)

Filed under: Conferences,XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 5:11 pm

XML Prague 2012 (proceedings) (PDF)

Fourteen papers by the leading lights in the XML world covering everything from XProc and XQuery to NVDL and JSONiq, and places in between.

Put it on your XML reading list.

December 31, 2011

Webdam Project: Foundations of Web Data Management

Filed under: Data,Data Management,Web Applications,XML — Patrick Durusau @ 7:28 pm

Webdam Project: Foundations of Web Data Management

From the homepage:

The goal of the Webdam project is to develop a formal model for Web data management. This model will open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.

Books from the project:

  • Foundations of Databases, Serge Abiteboul, Rick Hull, Victor Vianu, open access online edition
  • Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, open access online edition
  • Modeling, Querying and Mining Uncertain XML Data, Evgeny Kharlamov and Pierre Senellart, in A. Tagarelli, editor, XML Data Mining: Models, Methods, and Applications. IGI Global, 2011. open access online edition

I discovered this project via a link to “Web Data Management and Distribution” in Christophe Lalanne’s A bag of tweets / Dec 2011, that pointed to the PDF file, some 400 pages. I went looking for the HTML page with the link and discovered this project along with these titles.

There are a number of other publications associated with the project that you may find useful. The “Querying and Mining Uncertain XML” is only a chapter out of a larger publication by IGI Global. About what one expects from IGI Global. Cambridge University Press published the title just preceding this chapter and allows download for personal use of the entire book.

I think there is a lot to be learned from this project, even if it has not resulted in a universal framework for web applications that exchange data. I don’t think we are in any danger of universal frameworks on or off the web. And we are better for it.

December 24, 2011

Development Life Cycle and Tools for Data Exchange Specification

Filed under: Integration,XML,XML Schema — Patrick Durusau @ 4:42 pm

Development Life Cycle and Tools for Data Exchange Specification (2008) by KC Morris and Puja Goyal.

Abstract:

In enterprise integration, a data exchange specification is an architectural artifact that evolves along with the business. Developing and maintaining a coherent semantic model for data exchange is an important, yet non-trivial, task. A coherent semantic model of data exchange specifications supports reuse, promotes interoperability, and, consequently, reduces integration costs. Components of data exchange specifications must be consistent and valid in terms of agreed upon standards and guidelines. In this paper, we describe an activity model and NIST developed tools for the creation, test, and maintenance of a shared semantic model that is coherent and supports scalable, standards-based enterprise integration. The activity model frames our research and helps define tools to support the development of data exchange specification implemented using XML (Extensible Markup Language) Schema.

A paper that makes it clear that interoperability is not a trivial task. Could be helpful in convincing the ‘powers that be’ that projects on semantic integration or interoperability have to be properly resourced in order to have a useful result.

Manufacturing System Integration Division – MSID XML Testbed (NIST)

Filed under: Integration,XML,XML Schema — Patrick Durusau @ 4:42 pm

Manufacturing System Integration Division – MSID XML Testbed (NIST)

From the website:

NIST’s efforts to define methods and tools for developing XML Schemas to support systems integration will help you effectively build and deploy XML Schemas amongst partners in integration projects. Through the Manufacturing Interoperability Program (MIP) XML Testbed, NIST provides guidance on how to build XML Schemas as well as a collection of tools that will help with the process allowing projects to more quickly and efficiently meet their goals.

The NIST XML Schema development and testing process is documented as the Model Development Life Cycle, which is an activity model for the creation, use, and maintenance of shared semantic models, and has been used to frame our research and development tools. We have worked with a number of industries on refining and automating the specification process and provide a wealth of information on how to use XML to address your integration needs.

On this site you will find a collection of tools and ideas to help you in developing high quality XML schemas. The tools available on this site are offered to the general public free of charge. They have been developed by the United States Government and as such are not subject to copyright or other restrictions.

If you are interested in seeing the tools extended or having some of your work included in the service please contact us.

The thought did occur to me that you could write an XML schema that governs the documentation of the subjects, their properties and merging conditions in your information systems. Perhaps even to the point of using XSLT to run against the resulting documentation to create SQL statements for the integration of information resources held in your database (or accessible therefrom).
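That idea can be sketched quickly. The post imagines XSLT; the same transformation in Python makes the shape obvious. All element and attribute names here (subjects, subject, property, name, type) are hypothetical, invented for illustration, not from any NIST schema.

```python
# Sketch: generate SQL DDL from XML documentation of subjects and their
# properties. The XML vocabulary is hypothetical; the point is that the
# documentation is itself machine-processable.
import xml.etree.ElementTree as ET

DOC = """
<subjects>
  <subject name="person">
    <property name="id" type="INTEGER"/>
    <property name="full_name" type="TEXT"/>
  </subject>
</subjects>
"""

def to_sql(xml_text: str) -> str:
    """Emit one CREATE TABLE statement per documented subject."""
    statements = []
    for subject in ET.fromstring(xml_text).findall("subject"):
        cols = ", ".join(
            f'{p.get("name")} {p.get("type")}'
            for p in subject.findall("property"))
        statements.append(
            f'CREATE TABLE {subject.get("name")} ({cols});')
    return "\n".join(statements)

print(to_sql(DOC))
# CREATE TABLE person (id INTEGER, full_name TEXT);
```

Extending the vocabulary with merging conditions and emitting the corresponding INSERT/SELECT statements is the same mechanical step, whether done in Python or XSLT.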
