Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 17, 2012

R for Quants, Part II (A)

Filed under: News,R — Patrick Durusau @ 5:04 pm

R for Quants, Part II (A) by Brian Lee Yung Rowe.

From the post:

This is the second part in a three part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on basic statistical methods in R.

No news but a good introduction to R.

rNews 1.0: Introduction to rNews

Filed under: News,RDFa,rNews — Patrick Durusau @ 5:03 pm

rNews 1.0: Introduction to rNews

The New York Times started using rNews to tag content on the 23rd of January, 2012. To use rNews as fodder for your application (mapping or otherwise), it won’t hurt to look over this introduction to rNews.

From the website:

rNews is a data model for embedding machine-readable publishing metadata in web documents and a set of suggested implementations. In this document, we’ll provide an overview of rNews and an implementation guide. We’ll get started by reviewing the class diagram of the rNews data model. Following that we’ll review each individual class. After that we will use rNews to annotate a sample news document. We will conclude with a guide for implementors of rNews.

I would periodically validate the rNews markup from any site, just as a sanity check.

rNews is here. And this is what it means.

Filed under: Microdata,Microformats,rNews — Patrick Durusau @ 5:02 pm

rNews is here. And this is what it means. by Evan Sandhaus.

From the post:

On January 23rd, 2012, The Times made a subtle change to articles published on nytimes.com. We rolled out phase one of our implementation of rNews – a new standard for embedding machine-readable publishing metadata into HTML documents. Many of our users will never see the change but the change will likely impact how they experience the news.

Far beneath the surface of nytimes.com lurk the databases — databases of articles, metadata and images, databases that took tremendous effort to develop, databases that the world only glimpses through the dark lens of HTML.

A rather slow lead into the crux of the story: the New York Times started embedding rNews snippets in its news stories on January 23rd, 2012, with its use of rNews expected to expand in the future.

Interesting result if you follow the request to paste the URL for The Bookstore’s Last Stand, http://www.nytimes.com/2012/01/29/business/barnes-noble-taking-on-amazon-in-the-fight-of-its-life.html, into the Google Rich Snippets testing tool. Go ahead, I’m not going anywhere, try it.

The New York Times has already diverged from the schema that it wants others to follow: “Warning: Page contains property “identifier” which is not part of the schema.”

Earlier in the article Evan notes:

Several extensions to HTML have emerged that allow web publishers to explicitly markup structural metadata. These technologies include Microformats, HTML 5 Microdata and the Resource Description Framework in Attributes (RDFa).

For these technologies to be usefully applied, however, everybody has to agree what things should be called. For example, what The Times calls a “Headline,” a blogger might call a “Title,” and a German publisher might call an “überschrift.”

To use these new technologies for expressing underlying structure, the web publishing industry has to agree on a standard set of names and attributes, not an easy task. (emphasis added)

Using common names whenever possible but adapting (rather than breaking) in the event of change would be a better strategy.

One that would serve the NYT until 2173 and keep articles back to January 23rd 2012 as accessible as the day they were published.

February 16, 2012

Effectopedia

Filed under: Bioinformatics,Biomedical,Collaboration — Patrick Durusau @ 7:03 pm

Effectopedia – An Open Data Project for Collaborative Scientific Research, with the aim of reducing Animal Testing by Velichka Dimitrova, Coordinator of the Open Economics Working Group, and Hristo Alajdov, Associate Professor at the Institute of Biomedical Engineering at the Bulgarian Academy of Sciences.

From the post:

One of the key problems in natural science research is the lack of effective collaboration. A lot of research is conducted by scientists from different disciplines, yet cross-discipline collaboration is rare. Even within a discipline, research is often duplicated, which wastes resources and valuable scientific potential. Furthermore, without a common framework and context, research that involves animal testing often becomes phenomenological and little or no general knowledge can be gained from it. The peer reviewed publishing process is also not very effective in stimulating scientific collaboration, mainly due to the loss of an underlying machine readable structure for the data and the duration of the process itself.

If research results were more effectively shared and re-used by a wider scientific community – including scientists with different disciplinary backgrounds – many of these problems could be addressed. We could hope to see a more efficient use of resources, an accelerated rate of academic publications, and, ultimately, a reduction in animal testing.

Effectopedia is a project of the International QSAR Foundation. Effectopedia itself is an open knowledge aggregation and collaboration tool that provides a means of describing adverse outcome pathways (AOPs) in an encyclopedic manner. Effectopedia defines internal organizational space which helps scientists with different backgrounds to know exactly where their knowledge belongs and aids them in identifying both the larger context of their research and the individual experts who might be actively interested in it. Using automated notifications when researchers create causal linkage between parts of the pathways, they can simultaneously create a valuable contact with a fellow researcher interested in the same topic who might have a different background or perspective towards the subject. Effectopedia allows creation of live scientific documents which are instantly open for focused discussions and feedback whilst giving credit to the original authors and reviewers involved. The review process is never closed and if new evidence arises it can be presented immediately, allowing the information in Effectopedia to remain current, while keeping track of its complete evolution.

Sounds interesting, but there is no link to the Effectopedia website in the post. I followed links a bit and found: Effectopedia at SourceForge.

Apparently still in pre-alpha state.

I remember more than one workspace project, so how do we decide whose identifications/terminology gets used?

Isn’t that the tough nut of collaboration? If scholars (given my background in biblical studies) decide to collaborate beyond their departments, they form projects, but ones that are less inclusive than all workers in a particular area. The end result is multiple projects with different identifications/terminologies. How do we bridge those gaps?

As you know, my suggestion is that everyone keeps their own identifications/terminologies.

Curious, though: if everyone does keep their own identifications/terminologies, will they be able to read enough of another project’s content to recognize that it is meaningful to their quest?

That is, a topic map author’s decision that two or more representatives stand for the same subject may not carry over to users of the topic map, who may not share that appreciation.

Cascading

Filed under: Cascading,Hadoop,MapReduce — Patrick Durusau @ 7:02 pm

Cascading

Since Cascading got called out today in the graph partitioning posts, I thought it would not hurt to point it out here.

From the webpage:

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Cascading is a thin Java library and API that sits on top of Hadoop’s MapReduce layer and is executed from the command line like any other Hadoop application.

As a library and API that can be driven from any JVM based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are “operationalized”. That is, a single deployable Jar can be used to encapsulate a series of complex and dynamic processes all driven from the command line or a shell. Instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.

The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business critical applications on both Amazon Web Services (like Elastic MapReduce) or on dedicated hardware.
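
To give a flavor of what “not having to think in MapReduce” looks like, here is a minimal word-count sketch in the spirit of the classic Cascading example. The class names follow the Cascading 1.x Java API as I recall it, and the input/output paths are placeholders, so treat it as a sketch rather than production code.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // Placeholder HDFS paths -- substitute your own.
    String inputPath = "hdfs:///tmp/wordcount/input";
    String outputPath = "hdfs:///tmp/wordcount/output";

    // Source and sink taps: where the data comes from and where it goes.
    Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
    Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

    // Pipe assembly: split lines into words, group by word, count each group.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // The planner turns the assembly into one or more MapReduce jobs.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCountFlow.class);
    Flow flow = new FlowConnector(properties)
        .connect("word-count", source, sink, assembly);
    flow.complete();
  }
}
```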

Apache ZooKeeper 3.4.3 has been released

Filed under: Zookeeper — Patrick Durusau @ 7:00 pm

Apache ZooKeeper 3.4.3 has been released by Patrick Hunt.

From the post:

Apache ZooKeeper release 3.4.3 is now available. This is a bug fix release covering 18 issues, one of which was considered a blocker.

ZooKeeper 3.4 is incorporated into CDH4 and now available in beta 1!

ZOOKEEPER-1367 is the most serious of the issues addressed, it could cause data corruption on restart. This version also adds support for compiling the client on ARM architectures.

  • ZOOKEEPER-1367 Data inconsistencies and unexpired ephemeral nodes after cluster restart
  • ZOOKEEPER-1343 getEpochToPropose should check if lastAcceptedEpoch is greater or equal than epoch
  • ZOOKEEPER-1373 Hardcoded SASL login context name clashes with Hadoop security configuration override
  • ZOOKEEPER-1089 zkServer.sh status does not work due to invalid option of nc
  • ZOOKEEPER-973 bind() could fail on Leader because it does not setReuseAddress on its ServerSocket
  • ZOOKEEPER-1374 C client multi-threaded test suite fails to compile on ARM architectures.
  • ZOOKEEPER-1348 Zookeeper 3.4.2 C client incorrectly reports string version of 3.4.1

If you are running 3.4.2 or earlier, be sure to upgrade immediately. See my earlier post for details on what’s new in 3.4.

From the Apache ZooKeeper homepage:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Just in case you hope to manage distributed applications some day, ZooKeeper should be on your resume.
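
If you want a feel for the client side before it goes on your resume, here is a minimal sketch with the standard Java client: connect, create a znode holding a scrap of configuration, and read it back. The connection string and path are placeholders and error handling is omitted.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkHello {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);

    // Placeholder connection string -- point it at your own ensemble.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();

    // Create a persistent znode holding a small piece of configuration.
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    }

    // Read it back; the Stat argument is null because we don't need metadata here.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```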

Graph partitioning in MapReduce with Cascading

Filed under: Cascading,Graph Partitioning,MapReduce — Patrick Durusau @ 7:00 pm

Graph partitioning in MapReduce with Cascading in two parts by Friso van Vollenhoven.

Graph partitioning in MapReduce with Cascading (Part 1).

From the post:

I have recently had the joy of doing MapReduce based graph partitioning. Here’s a post about how I did that. I decided to use Cascading for writing my MR jobs, as it is a lot less verbose than raw Java based MR. The graph algorithm consists of one step to prepare the input data and then an iterative part that runs until convergence. The program uses a Hadoop counter to check for convergence and will stop iterating once there. All code is available. Also, the explanation has colorful images of graphs. (And everything is written very informally and there is no math.)

Graph partitioning part 2: connected graphs.

From the post:

In a previous post, we talked about finding the partitions in a disconnected graph using Cascading. In reality, most graphs are actually fully connected, so only being able to partition already disconnected graphs is not very helpful. In this post, we’ll take a look at partitioning a connected graph based on some criterium for creating a partition boundary.

Very accessible explanations complete with source code (github).

What puzzles me about the efforts to develop an algorithm to automatically partition a graph database is that there is no corresponding effort to develop an algorithm to automatically partition relational databases. Yet we know that relational databases can be represented as graphs. So what’s the difference?

I concede that graphs such as Facebook, the WWW, etc., have grown without planning and so aren’t subject to the same partitioning considerations as relational databases. But isn’t there a class of graphs that is closer to relational databases than to Facebook?

Consider that diverse research facilities for a drug company could use graph databases for research purposes but that doesn’t mean that any user can create edges between nodes at random. Any more than a user of a sharded database can create arbitrary joins.

I deeply enjoy graph posts such as these by Friso van Vollenhoven but the “cool” aspects of using MapReduce should not keep us from seeing heuristics we can use to enhance the performance of graph databases.
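
To make the underlying problem concrete without spinning up a Hadoop cluster, here is a tiny in-memory union-find that splits a disconnected graph into its connected components. This is not Friso’s Cascading implementation, just a single-machine sketch of the same idea, run against a made-up edge list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphPartitions {

  // Union-find over node ids; parent(x) == x means x is a partition root.
  private final Map<String, String> parent = new HashMap<String, String>();

  private String find(String node) {
    if (!parent.containsKey(node)) {
      parent.put(node, node);
    }
    String p = parent.get(node);
    if (!p.equals(node)) {
      p = find(p);            // path compression
      parent.put(node, p);
    }
    return p;
  }

  private void union(String a, String b) {
    parent.put(find(a), find(b));
  }

  public static void main(String[] args) {
    // A made-up disconnected graph: two components, {a, b, c} and {d, e}.
    String[][] edges = { {"a", "b"}, {"b", "c"}, {"d", "e"} };

    GraphPartitions gp = new GraphPartitions();
    for (String[] edge : edges) {
      gp.union(edge[0], edge[1]);
    }

    // Group nodes by the root of their partition.
    Map<String, List<String>> partitions = new HashMap<String, List<String>>();
    for (String node : new ArrayList<String>(gp.parent.keySet())) {
      String root = gp.find(node);
      if (!partitions.containsKey(root)) {
        partitions.put(root, new ArrayList<String>());
      }
      partitions.get(root).add(node);
    }
    System.out.println(partitions);  // e.g. {c=[a, b, c], e=[d, e]}
  }
}
```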

Metrography: London Reshaped to Match the Classic Tube Map

Filed under: Mapping,Maps,Visualization — Patrick Durusau @ 6:58 pm

Metrography: London Reshaped to Match the Classic Tube Map.

From the post:

In Metrography [looksgood.de], interaction design students Benedikt Groß [looksgood.de] and Bertrand Clerc [bertrandclerc.com] present us with an alternative view on London. What if the street map was reshaped according to the positions of the tube stations as placed on the Tube map?

The result is a ‘warped’ or ‘morphed’ map of London, that highlights the discrepancy between the stylized metro map and the geographically correct depiction. The resulting high-resolution prints can be viewed online in all detail.

I am not sure I agree there is a “geographically correct depiction” of London or any other locale. Depends on whose “geography” you are using. We are so schooled in some depictions being “correct,” that we fail to speak up when lines of advantage/disadvantage are being drawn. That is “just the way things fall on the map,” no personal motive involved. Right.

Topic maps are one way to empower alternative views, geographic or otherwise.

Slicing Obama’s 2013 budget proposal four ways

Filed under: Visualization — Patrick Durusau @ 6:56 pm

Slicing Obama’s 2013 budget proposal four ways first seen at Flowing Data.

Whatever your politics, the New York Times’ four-way visualization of President Obama’s 2013 budget proposal demonstrates the power of visualization. Very much worth your time to appreciate and, if you are interested in visualization, to study.

Interactive and animated word cloud

Filed under: Visualization — Patrick Durusau @ 6:55 pm

Interactive and animated word cloud seen at Flowing Data.

The post reports that Jason Davies has created a configurable word cloud that is also animated.

Not my favorite form of visualization but it may work for you.

Akoma Ntoso

Filed under: Law,Legal Informatics — Patrick Durusau @ 6:54 pm

Akoma Ntoso

From the webpage:

Akoma Ntoso (“linked hearts” in Akan language of West Africa) defines a “machine readable” set of simple technology-neutral electronic representations (in XML format) of parliamentary, legislative and judiciary documents.

Akoma Ntoso is a set of simple, technology-neutral XML machine-readable descriptions of official documents such as legislation, debate record, minutes, etc. that enable addition of descriptive structure (markup) to the content of parliamentary and legislative documents.

Akoma Ntoso XML schema make “accessible” structure and semantic components of digital documents supporting the creation of high value information services to deliver the power of ICTs to support efficiency and accountability in the parliamentary, legislative and judiciary contexts.

Akoma Ntoso is an initiative of “Africa i-Parliament Action Plan” (www.parliaments.info) a programme of UN/DESA.

Be aware that a new TC has been proposed at OASIS, LeDML, to move Akoma Ntoso towards becoming an international standard.

Applying Akoma Ntoso to the United States Code is a post by Grant Vergottini about his experiences converting the US Code markup into Akoma Ntoso.

Markup can, not necessarily will, simplify the task of creating topic maps of legal materials.

Google MapReduce/Pregel – Graph Reading Club – 15 February 2012

Filed under: Graphs,MapReduce,Pregel — Patrick Durusau @ 6:53 pm

Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club by René Pickhardt.

From the post:

The first meeting of our reading club was quite a success. Everyone was well prepared and we discussed some issues about Google’s Map Reduce framework and I had the feeling that everyone now better understands what is going on there. I will now post a summary of what has been discussed and will also post some feedback and reading for next week to the end of this post. Most importantly: The reading club will meet next week Wednesday February 22nd at 2 o’clock pm CET.

René includes some rules/guidance for the next meeting and a very interesting looking reading list!

Code for Machine Learning for Hackers

Filed under: Machine Learning,R — Patrick Durusau @ 6:52 pm

Code for Machine Learning for Hackers by Drew Conway.

Drew writes:

For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:

https://github.com/johnmyleswhite/ML_for_Hackers

Drew and John wrote Machine Learning for Hackers, which has just been released but their publisher hasn’t updated its website to point to the code repository. (As of the time of Drew’s post.)

February 15, 2012

Get Introduced to Graph Databases with a Webinar from Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:35 pm

Get Introduced to Graph Databases with a Webinar from Neo4j. Post by Allison Sparrow.

Link to Intro to Graph Databases webinar, along with overflow questions and answers from the webinar!

Graphviz – Graph Visualization Software

Filed under: Graphs,Graphviz,Visualization — Patrick Durusau @ 8:35 pm

Graphviz – Graph Visualization Software

From the webpage:

What is Graphviz?

Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains.

Features

The Graphviz layout programs take descriptions of graphs in a simple text language, and make diagrams in useful formats, such as images and SVG for web pages, PDF or Postscript for inclusion in other documents; or display in an interactive graph browser. (Graphviz also supports GXL, an XML dialect.) Graphviz has many useful features for concrete diagrams, such as options for colors, fonts, tabular node layouts, line styles, hyperlinks, and custom shapes.

I thought I had posted on Graphviz but it was just a casual reference in the body of a post. I needed to visualize some graphs for import into a document and that made me think about it.
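
That kind of use is simple enough to script: emit a graph description in Graphviz’s DOT language and hand it to the dot layout program for rendering. A minimal sketch follows; the graph, file names and output format are made up for illustration, not anything Graphviz requires.

```java
import java.io.FileWriter;
import java.io.IOException;

public class EmitDot {
  public static void main(String[] args) throws IOException, InterruptedException {
    // A tiny made-up graph in the DOT language.
    String dot = "digraph topics {\n"
        + "  rankdir=LR;\n"
        + "  node [shape=box];\n"
        + "  \"topic\" -> \"occurrence\";\n"
        + "  \"topic\" -> \"association\";\n"
        + "}\n";

    FileWriter out = new FileWriter("topics.dot");
    out.write(dot);
    out.close();

    // Render to SVG for inclusion in a document:
    //   dot -Tsvg topics.dot -o topics.svg
    // (or -Tpdf / -Tpng, depending on where the diagram is going)
    Process p = Runtime.getRuntime().exec(
        new String[] { "dot", "-Tsvg", "topics.dot", "-o", "topics.svg" });
    p.waitFor();
  }
}
```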

From Relational Databases to Linked Data in Epigraphy: Hispania Epigraphica Online

Filed under: Linked Data — Patrick Durusau @ 8:34 pm

From Relational Databases to Linked Data in Epigraphy: Hispania Epigraphica Online by Fernando-Luis Álvarez, Joaquín-L. Gómez-Pantoja and Elena García-Barriocanal.

Abstract:

Epigraphic databases store metadata and digital representations of inscriptions for information purposes, heritage conservation or scientific use. At present, there are several of such databases available, but our focus is on those that are part of the EAGLE consortium, which aims to make available the epigraphy from the ancient classical civilization. Right now, the EAGLE partners share a basic data schema and an agreement on workload and responsibilities, but each repository has its own storage structure, data identification system and even its different idea of what an epigraphic database is or should be. Any of these aspects may lead to redundancy and hampers search and linking. This paper describes a system implementation for epigraphic data sharing as linked data. Although the described system was tested on a specific database, i.e. Hispania Epigraphica Online, it could be easily tailored to other systems, enabling the advantage of semantic search on several disparate databases.

Good work but isn’t it true that most approaches, “…could be easily tailored to other systems, enabling the advantage of semantic search over several disparate databases”?

That is, the ability to query disparate databases as disparate databases is what continues to elude us.

Isn’t that the question that we need to answer? Yes?

Open Innovator’s Toolkit

Filed under: Government — Patrick Durusau @ 8:33 pm

Open Innovator’s Toolkit

From the announcement:

President Obama emphasizes a “bottom-up” philosophy that taps citizen expertise to make government smarter and more responsive to private sector demands. This philosophy of “open innovation” has already delivered tangible results in public and regulated sectors of the economy – areas like health IT, learning technologies, and smart grid – that are poised to deliver productivity growth and grow the jobs of the future. We have surfaced new or improved policy tools deployed by our government to achieve them. We’ve posted the Open Innovator’s Toolkit as a roster of 20 leading practices that an “open innovator” should consider when confronting any policy challenge – at any level of government. Our aspiration is to build upon this list, adding new tools and case studies to form an evidence base that will help to scale “open innovation” across the public sector.

What new tools/case studies would you like to add to this list? With reference to topic maps in particular.

Unstructured data is a myth

Filed under: Data,Data Mining — Patrick Durusau @ 8:33 pm

Unstructured data is a myth by Ram Subramanyam Gopalan.

From the post:

Couldn’t resist that headline! But seriously, if you peel the proverbial onion enough, you will see that the lack of tools to discover / analyze the structure of that data is the truth behind the opaqueness that is implied by calling the data “unstructured”.

This article will give you a firm basis for arguing against the casual use of “unstructured data” as a phrase.

One point that stands above the others is that all the so-called “unstructured” data is generated by some process, automated or otherwise. That you may be temporarily ignorant of that process doesn’t mean that the data is “unstructured.” Worth reading, more than once.

OWL: Yet to Arrive on the Web of Data?

Filed under: Linked Data,OWL,Semantic Web — Patrick Durusau @ 8:33 pm

OWL: Yet to Arrive on the Web of Data? by Angela Guess.

From the post:

A new paper is currently available for download entitled OWL: Yet to arrive on the Web of Data? The paper was written by Birte Glimm, Aidan Hogan, Markus Krötzsch, and Axel Polleres. The abstract states, “Seven years on from OWL becoming a W3C recommendation, and two years on from the more recent OWL 2 W3C recommendation, OWL has still experienced only patchy uptake on the Web. Although certain OWL features (like owl:sameAs) are very popular, other features of OWL are largely neglected by publishers in the Linked Data world.”

It continues, “This may suggest that despite the promise of easy implementations and the proposal of tractable profiles suggested in OWL’s second version, there is still no “right” standard fragment for the Linked Data community. In this paper, we (1) analyse uptake of OWL on the Web of Data, (2) gain insights into the OWL fragment that is actually used/usable on the Web, where we arrive at the conclusion that this fragment is likely to be a simplified profile based on OWL RL, (3) propose and discuss such a new fragment, which we call OWL LD (for Linked Data).”

Interesting and perhaps valuable data about the use of RDFS/OWL primitives on the Web.

I find it curious that the authors don’t survey users about what OWL capabilities they would find compelling. It could be that users are interested in and willing to support some subset of OWL that hasn’t been considered by the authors or others.

Might not be the Semantic Web as the authors envision it, but without broad user support, the authors’ Semantic Web will never come to pass.

International Conference on Knowledge Management and Information Sharing

Filed under: Conferences,Information Sharing,Knowledge Management — Patrick Durusau @ 8:32 pm

International Conference on Knowledge Management and Information Sharing

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Management (KM) is a discipline concerned with the analysis and technical support of practices used in an organization to identify, create, represent, distribute and enable the adoption and leveraging of good practices embedded in collaborative settings and, in particular, in organizational processes. Effective knowledge management is an increasingly important source of competitive advantage, and a key to the success of contemporary organizations, bolstering the collective expertise of its employees and partners.

Information Sharing (IS) is a term used for a long time in the information technology (IT) lexicon, related to data exchange, communication protocols and technological infrastructures. Although standardization is indeed an essential element for sharing information, IS effectiveness requires going beyond the syntactic nature of IT and delve into the human functions involved in the semantic, pragmatic and social levels of organizational semiotics.

The two areas are intertwined as information sharing is the foundation for knowledge management.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

Although all three conferences at IC3K 2012 will be of interest to topic mappers, the line:

Although standardization is indeed an essential element for sharing information, IS effectiveness requires going beyond the syntactic nature of IT and delve into the human functions involved in the semantic, pragmatic and social levels of organizational semiotics.

did catch my attention.

I am not sure that I would treat syntactic standardization as a prerequisite for sharing information. If anything, syntactic diversity increases more quickly than semantic diversity, as every project to address the latter starts by claiming a need to address the former.

Let’s start with extant syntaxes, whether COBOL, relational tables, topic maps, RDF, etc., and specify semantics that we wish to map between them. To see if there is any ROI. If not, stop there and select other data sets. If yes, then specify only so much in the way of syntax/semantics as results in ROI.

Don’t have to plan on integrating all the data from all federal agencies. Just don’t do anything inconsistent with that as a long term goal. Like failing to document why you arrived at particular mappings. (You will forget by tomorrow or the next day.)

International Conference on Knowledge Engineering and Ontology Development

Filed under: Conferences,Knowledge Management,Ontology — Patrick Durusau @ 8:32 pm

International Conference on Knowledge Engineering and Ontology Development

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Engineering (KE) refers to all technical, scientific and social aspects involved in building, maintaining and using knowledge-based systems. KE is a multidisciplinary field, bringing in concepts and methods from several computer science domains such as artificial intelligence, databases, expert systems, decision support systems and geographic information systems.

Ontology Development (OD) aims at building reusable semantic structures that can be informal vocabularies, catalogs, glossaries as well as more complex finite formal structures representing the entities within a domain and the relationships between those entities. Ontologies, have been gaining interest and acceptance in computational audiences: formal ontologies are a form of software, thus software development methodologies can be adapted to serve ontology development. A wide range of applications is emerging, especially given the current web emphasis, including library science, ontology-enhanced search, e-commerce and business process design.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

KDIR 2012: International Conference on Knowledge Discovery and Information Retrieval

Filed under: Conferences,Information Retrieval,Knowledge Discovery — Patrick Durusau @ 8:31 pm

KDIR 2012: International Conference on Knowledge Discovery and Information Retrieval

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Discovery is an interdisciplinary area focusing upon methodologies for identifying valid, novel, potentially useful and meaningful patterns from data, often based on underlying large data sets. A major aspect of Knowledge Discovery is data mining, i.e. applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data. Knowledge Discovery also includes the evaluation of patterns and identification of which add to knowledge. This has proven to be a promising approach for enhancing the intelligence of software systems and services. The ongoing rapid growth of online data due to the Internet and the widespread use of large databases have created an important need for knowledge discovery methodologies. The challenge of extracting knowledge from data draws upon research in a large number of disciplines including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions.

Information retrieval (IR) is concerned with gathering relevant information from unstructured and semantically fuzzy data in texts and other media, searching for information within documents and for metadata about documents, as well as searching relational databases and the Web. Automation of information retrieval enables the reduction of what has been called “information overload”.

Information retrieval can be combined with knowledge discovery to create software tools that empower users of decision support systems to better understand and use the knowledge underlying large data sets.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

February 14, 2012

Would You Know “Good” XML If It Bit You?

Filed under: Uncategorized,XML,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 5:16 pm

XML is a pale imitation of a markup language. It has resulted in real horrors across the markup landscape. After years in its service, I don’t have much hope of that changing.

But, the Princess of the Northern Marches has organized a war council to consider how to stem the tide of bad XML. Despite my personal misgivings, I wish them well and invite you to participate as you see fit.

Oh, and I found this message about the council meeting:

International Symposium on Quality Assurance and Quality Control in XML

Monday August 6, 2012
Hotel Europa, Montréal, Canada

Paper submissions due April 20, 2012.

A one-day discussion of issues relating to Quality Control and Quality Assurance in the XML environment.

XML systems and software are complex and constantly changing. XML documents are highly varied, may be large or small, and often have complex life-cycles. In this challenging environment quality is difficult to define, measure, or control, yet the justifications for using XML often include promises or implications relating to quality.

We invite papers on all aspects of quality with respect to XML systems, including but not limited to:

  • Defining, measuring, testing, improving, and documenting quality
  • Quality in documents, document models, software, transformations, or queries
  • Case studies in the control of quality in an XML environment
  • Theoretical or practical approaches to measuring quality in XML
  • Does the presence of XML, XML schemas, and XML tools make quality checking easier, harder, or even different from other computing environments
  • Should XML transforms and schemas be QAed as software? Or configuration files? Or documents? Does it matter?

Paper submissions due April 20, 2012.

Details at: http://www.balisage.net/QA-QC/

You do have to understand the semantics of even imitation markup languages before mapping them with more robust languages. Enjoy!

Cloudera Manager | Service and Configuration Management Demo Videos

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:11 pm

Cloudera Manager | Service and Configuration Management Demo Videos by Jon Zuanich.

From the post:

Service and Configuration Management (Part I & II)

We’ve recently recorded a series of demo videos intended to highlight the extensive set of features and functions included with Cloudera Manager, the industry’s first end-to-end management application for Apache Hadoop. These demo videos showcase the newly enhanced Cloudera Manager interface and reveal how to use this powerful application to simplify the administration of Hadoop clusters, optimize performance and enhance the quality of service.

In the first two videos of this series, Philip Langdale, a software engineer at Cloudera, walks through Cloudera Manager’s Service and Configuration Management module. He demonstrates how simple it is to set up and configure the full range of Hadoop services in CDH (including HDFS, MR and HBase); enable security; perform configuration rollbacks; and add, delete and decommission nodes.

Interesting that Vimeo detects the “embedding” of these videos in my RSS reader and displays a blocked message. At the Cloudera site, all is well.

Management may not be as romantic as the latest graph algorithms but it is a pre-condition to widespread enterprise adoption.

Introducing CDH4

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:10 pm

Introducing CDH4 by Charles Zedlewski.

From the post:

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

  • Availability – a high availability namenode, better job isolation, hard drive failure handling, and multi-version support
  • Utilization – multiple namespaces, co-processors and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce and compression performance
  • Usability – broader BI support, expanded API access, unified file formats & compression codecs
  • Security – scheduler ACL’s

Some items of note about this beta:

This is the first beta for CDH4. We plan to do a second beta some weeks after the first beta. The second beta will roll in updates to Apache Flume, Apache Sqoop, Hue, Apache Oozie and Apache Whirr that did not make the first beta. It will also broaden the platform support back out to our normal release matrix of Red Hat, Centos, Suse, Ubuntu and Debian. Our plan is for this second beta to have the last significant component changes before CDH goes GA.

Some CDH components are getting substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them.

The only better time to be in data mining, information retrieval, data analysis is next week. 😉

Open Source OData Tools for MySQL and PHP Developers

Filed under: MySQL,Odata,PHP — Patrick Durusau @ 5:09 pm

Open Source OData Tools for MySQL and PHP Developers by Doug Mahugh.

To enable more interoperability scenarios, Microsoft has released today two open source tools that provide support for the Open Data Protocol (OData) for PHP and MySQL developers working on any platform.

The growing popularity of OData is creating new opportunities for developers working with a wide variety of platforms and languages. An ever increasing number of data sources are being exposed as OData producers, and a variety of OData consumers can be used to query these data sources via OData’s simple REST API.

In this post, we’ll take a look at the latest releases of two open source tools that help PHP developers implement OData producer support quickly and easily on Windows and Linux platforms:

  • The OData Producer Library for PHP, an open source server library that helps PHP developers expose data sources for querying via OData. (This is essentially a PHP port of certain aspects of the OData functionality found in System.Data.Services.)
  • The OData Connector for MySQL, an open source command-line tool that generates an implementation of the OData Producer Library for PHP from a specified MySQL database.

These tools are written in platform-agnostic PHP, with no dependencies on .NET.

This is way cool!

Seriously consider Doug’s request: what other tools would you like to see for OData?
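
On the consumer side of the quote above, an OData feed is just HTTP plus a standard query syntax, so any language can read one. Here is a minimal sketch in Java against a hypothetical producer URL; the service root and entity set name are placeholders, while $top and $format are standard OData query options.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ODataPeek {
  public static void main(String[] args) throws Exception {
    // Hypothetical OData producer exposed over a MySQL database.
    String query = "http://example.com/odata.svc/Products?$top=5&$format=json";

    HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
    conn.setRequestMethod("GET");

    // Read the JSON response; a real client would parse it rather than print it.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}
```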

On Approximating String Selection Problems with Outliers

Filed under: Algorithms,Bioinformatics,String Matching — Patrick Durusau @ 5:07 pm

On Approximating String Selection Problems with Outliers by Christina Boucher, Gad M. Landau, Avivit Levy, David Pritchard and Oren Weimann.

Abstract:

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove this problem has no PTAS unless ZPP=NP, correcting a decade-old mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe Closest to k Strings has no EPTAS unless W[1]=FPT. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.

Just in case you need a break from graph algorithms, intractable and otherwise. 😉
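
To make the Close to Most Strings objective concrete, here is a tiny sketch that scores a candidate string x against a set S by counting the “non-outliers” within Hamming distance d. The hard part the paper analyzes, choosing an x that maximizes this count, is precisely what is not shown; the data is made up.

```java
public class CloseToMost {

  // Hamming distance between two same-length strings.
  static int hamming(String a, String b) {
    int dist = 0;
    for (int i = 0; i < a.length(); i++) {
      if (a.charAt(i) != b.charAt(i)) {
        dist++;
      }
    }
    return dist;
  }

  // Number of strings in s within Hamming distance d of candidate x.
  static int nonOutliers(String[] s, String x, int d) {
    int count = 0;
    for (String t : s) {
      if (hamming(x, t) <= d) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Made-up example: the last string is the "outlier".
    String[] s = { "ACGT", "ACGA", "ACTT", "TTTT" };
    System.out.println(nonOutliers(s, "ACGT", 1));  // prints 3
  }
}
```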

Sublinear Time Algorithm for PageRank Computations and Related Applications

Filed under: Algorithms,PageRank — Patrick Durusau @ 5:06 pm

Sublinear Time Algorithm for PageRank Computations and Related Applications by Christian Borgs, Michael Brautbar, Jennifer Chayes, and Shang-Hua Teng.

In a network, identifying all vertices whose PageRank is more than a given threshold value $\Delta$ is a basic problem that has arisen in Web and social network analyses. In this paper, we develop a nearly optimal, sublinear time, randomized algorithm for a close variant of this problem. When given a network \graph, a threshold value $\Delta$, and a positive constant $c>1$, with probability $1-o(1)$, our algorithm will return a subset $S\subseteq V$ with the property that $S$ contains all vertices of PageRank at least $\Delta$ and no vertex with PageRank less than $\Delta/c$. The running time of our algorithm is always $\tilde{O}(\frac{n}{\Delta})$. In addition, our algorithm can be efficiently implemented in various network access models including the Jump and Crawl query model recently studied by \cite{brautbar_kearns10}, making it suitable for dealing with large social and information networks.

As part of our analysis, we show that any algorithm for solving this problem must have expected time complexity of ${\Omega}(\frac{n}{\Delta})$. Thus, our algorithm is optimal up to a logarithmic factor. Our algorithm (for identifying vertices with significant PageRank) applies a multi-scale sampling scheme that uses a fast personalized PageRank estimator as its main subroutine. We develop a new local randomized algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed by Jeh and Widom \cite{JehW03} and by Andersen, Chung, and Lang \cite{AndersenCL06}. Our multi-scale sampling scheme can also be adapted to handle a large class of matrix sampling problems that may have potential applications to online advertising on large social networks (See the appendix).

Pay close attention to the authors’ definition of “significant” vertices:

A basic problem in network analysis is to identify the set of its vertices that are “significant.” For example, the significant nodes in the web graph defined by a query could provide the authoritative contents in web search; they could be the critical proteins in a protein interaction network; and they could be the set of people (in a social network) most effective to seed the influence for online advertising. As the networks become larger, we need more efficient algorithms to identify these “significant” nodes.

As far as online advertising goes, I await the discovery by vendors that “pull” models of advertising pre-qualify potential purchasers. “Push” models spam everyone within reach, with correspondingly low success rates.

For your convenience, the cites that don’t work well as source in the abstract:

brautbar_kearns10 – Local Algorithms for Finding Interesting Individuals in Large Networks by Mickey Brautbar, Michael Kearns.

Jeh and Widom – Scaling personalized web search (ACM), Scaling personalized web search (Stanford, free).

Andersen, Chung, and Lang – Local graph partitioning using PageRank vectors.
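
For a sense of the baseline the authors are improving on, here is a plain power-iteration PageRank over an adjacency list: every iteration touches every edge, which is the linear cost their sampling scheme avoids when you only care about vertices above a given threshold. The tiny graph and parameter choices are made up.

```java
import java.util.Arrays;

public class PowerIterationPageRank {

  // Standard power iteration: pr = (1-d)/n + d * sum(pr[in]/outdeg[in]).
  static double[] pageRank(int[][] outLinks, double damping, int iterations) {
    int n = outLinks.length;
    double[] pr = new double[n];
    Arrays.fill(pr, 1.0 / n);

    for (int iter = 0; iter < iterations; iter++) {
      double[] next = new double[n];
      Arrays.fill(next, (1.0 - damping) / n);
      for (int v = 0; v < n; v++) {
        if (outLinks[v].length == 0) {
          // Dangling node: spread its rank uniformly.
          for (int u = 0; u < n; u++) {
            next[u] += damping * pr[v] / n;
          }
        } else {
          double share = damping * pr[v] / outLinks[v].length;
          for (int u : outLinks[v]) {
            next[u] += share;   // every edge is touched every iteration
          }
        }
      }
      pr = next;
    }
    return pr;
  }

  public static void main(String[] args) {
    // Made-up 4-node graph: 0->1, 0->2, 1->2, 2->0, 3->2.
    int[][] outLinks = { {1, 2}, {2}, {0}, {2} };
    double[] pr = pageRank(outLinks, 0.85, 50);
    System.out.println(Arrays.toString(pr));  // node 2 ends up with the largest rank
  }
}
```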

Scienceography: the study of how science is written

Filed under: Data Mining,Information Retrieval — Patrick Durusau @ 5:05 pm

Scienceography: the study of how science is written by Graham Cormode, S. Muthukrishnan and Jinyun Yun.

Abstract:

Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of ‘scienceography’, which focuses on the writing of science. We provide a first large scale study using data derived from the arXiv e-print repository. Crucially, our data includes the “source code” of scientific papers (the LaTeX source), which enables us to study features not present in the “final product”, such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas, computer science and mathematics, as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography.

What content are you searching/indexing in a scientific context?

The authors discover what many of us have overlooked: the “source” of scientific papers, a source that can reflect a richer history than the final product.

Some questions:

Will searching the source give us finer-grained access to the content? That is, can we separate portions of text that recite history, related research, and background from new insights/conclusions, and access the other material only if needed? (Every graph paper starts off with nodes and edges, complete with citations. Anyone reading a graph paper is likely to know those terms.)

Other disciplines use LaTeX. Do those LaTeX files differ from the ones reported here? If so, in what way?

Responsive UX Design

Filed under: Interface Research/Design — Patrick Durusau @ 5:04 pm

Responsive UX Design by Darrin Henein.

From the post:

In recent years, the deluge of new connected devices that have entered the market has created an increasingly complex challenge for designers and developers alike. Until relatively recently, the role of a UX or UI designer was comparatively straightforward. Digital experiences lived on their own, and were built and tailored for the specific mediums by which they were to be consumed. Moreover, the number of devices through which a user could access your brand was more limited. Digital experiences were confined to (typically) either a desktop or laptop computer, or in some cases a relatively basic browser embedded in a mobile phone. Furthermore, these devices were fairly homogenous within their classes, sporting similar screen resolutions and hardware capabilities.

Henein urges the use of CSS and HTML5 to plan for the display of content on a variety of platforms, rather than designing for one UX and taking one’s chances on other devices.

Not simply a matter of getting larger or smaller but a UX/UI design issue.

Here responsiveness is taken to be a function of the device used to view the content, but what if that were extended to the content itself?

That is, for manuals of various kinds, could the content be augmented with additional safeguards or warnings? If a particular repair sequence is chosen, additional cautionary content is loaded or checks are added to the process.
