Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 7, 2012

Cascading 2.0

Filed under: Cascading,Data,Data Integration,Data Management,Data Streams — Patrick Durusau @ 2:16 pm

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

http://www.cascading.org/downloads/

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

https://github.com/Cascading

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
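
As a rough illustration of the local mode described above, here is a minimal sketch of a Cascading flow that copies a text file entirely in memory, with no Hadoop dependencies. It is not from the release notes; the file paths are invented, and the local Tap/Scheme class names are my best recollection of the 2.0 API, so check them against the current javadocs.

  import cascading.flow.local.LocalFlowConnector;
  import cascading.pipe.Pipe;
  import cascading.scheme.local.TextLine;
  import cascading.tap.local.FileTap;

  public class LocalCopyFlow {
    public static void main(String[] args) {
      // Hypothetical input/output paths on the local filesystem.
      FileTap source = new FileTap(new TextLine(), "data/input.txt");
      FileTap sink = new FileTap(new TextLine(), "output/copy.txt");

      // A pass-through pipe: no operations, tuples simply move from source to sink.
      Pipe copy = new Pipe("copy");

      // The local planner runs the flow in memory; no cluster required.
      new LocalFlowConnector().connect(source, sink, copy).complete();
    }
  }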

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair, so take a look for yourself. Are there any you would like to see covered here?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

Lucene Revolution 2012 – Slides/Videos

Filed under: Conferences,Lucene,Mahout,Solr,SolrCloud — Patrick Durusau @ 2:16 pm

Lucene Revolution 2012 – Slides/Videos

The slides and videos from Lucene Revolution 2012 are up!

Now you don’t have to search for old re-runs on Hulu to watch during lunch!

June 6, 2012

A Pluggable XML Editor

Filed under: Editor,XML — Patrick Durusau @ 7:50 pm

A Pluggable XML Editor by Grant Vergottini.

From the post:

Ever since I announced my HTML5-based XML editor, I’ve been getting all sorts of requests for a variety of implementations. While the focus has been, and continues to be, providing an Akoma Ntoso based legislative editor, I’ve realized that the interest in a web-based XML editor extends well beyond Akoma Ntoso and even legislative editors.

So… with that in mind I’ve started making some serious architectural changes to the base editor. From the get-go, my intent had been for the editor to be “pluggable” although I hadn’t totally thought it through. By “pluggable” I mean capable of allowing different information models to be used. I’m actually taking the model a bit further to allow modules to be built that can provide optional functionality to the base editor. What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

Let’s talk about the round-tripping problem for a moment. In the other XML editors I have worked with, the XML model has had to quite closely match the editing view that one works with. So you’re literally authoring the document using that information model. Think about HTML (or XHTML for an XML perspective). The arrangement of the tags pretty much exactly represents how you think of and deal with the components of the document. Paragraphs, headings, tables, images, etc., are all pretty much laid out how you would author them. This is the ideal situation as it makes building the editor quite straight-forward.

Note the line:

What this means is that if you have a different document information model, and it is capable of being round-tripped in some way with an editing view, then I can probably adapt it to the editor.

I think that means we don’t all have to use the same editing view and yet, at the same time, we can share an underlying format. Or perhaps even annotate texts with subject identities without realizing we are helping others.

This is an impressive bit of work and as the post promises, there is more to follow.

(I first saw this at Legal Informatics. http://legalinformatics.wordpress.com/2012/06/05/vergottini-on-improvements-to-akneditor-html-5-based-xml-editor-for-legislation/)

Boosting: Foundations and Algorithms

Filed under: Boosting,Searching — Patrick Durusau @ 7:50 pm

Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund. (Amazon link)

From the description:

Boosting is an approach to machine learning based on the idea of creating a highly accurate predictor by combining many weak and inaccurate “rules of thumb.” A remarkably rich theory has evolved around boosting, with connections to a range of topics, including statistics, game theory, convex optimization, and information geometry. Boosting algorithms have also enjoyed practical success in such fields as biology, vision, and speech processing. At various times in its history, boosting has been perceived as mysterious, controversial, even paradoxical.

This book, written by the inventors of the method, brings together, organizes, simplifies, and substantially extends two decades of research on boosting, presenting both theory and applications in a way that is accessible to readers from diverse backgrounds while also providing an authoritative reference for advanced researchers. With its introductory treatment of all material and its inclusion of exercises in every chapter, the book is appropriate for course use as well. The book begins with a general introduction to machine learning algorithms and their analysis; then explores the core theory of boosting, especially its ability to generalize; examines some of the myriad other theoretical viewpoints that help to explain and understand boosting; provides practical extensions of boosting for more complex learning problems; and finally presents a number of advanced theoretical topics. Numerous applications and practical illustrations are offered throughout.
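
Since the description turns on combining “many weak and inaccurate rules of thumb,” here is a minimal AdaBoost sketch using one-dimensional decision stumps. The toy data, round count, and class name are invented for illustration; this is my sketch of the textbook algorithm, not code from the book.

  import java.util.Arrays;

  public class AdaBoostSketch {
    public static void main(String[] args) {
      // Toy 1-D data with labels in {-1, +1}; values are made up for illustration.
      double[] x = { 0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95 };
      int[]    y = {   1,   1,  -1,   1,  -1,  -1,  -1,   -1 };

      int n = x.length, rounds = 5;
      double[] w = new double[n];
      Arrays.fill(w, 1.0 / n);               // start with uniform example weights

      double[] thr = new double[rounds];     // chosen stump thresholds
      int[] sign = new int[rounds];          // chosen stump orientations
      double[] alpha = new double[rounds];   // weight of each weak rule in the final vote

      for (int t = 0; t < rounds; t++) {
        // 1. Pick the decision stump with the lowest weighted error.
        double bestErr = Double.MAX_VALUE;
        for (double cand : x)
          for (int s : new int[] { -1, 1 }) {
            double err = 0;
            for (int i = 0; i < n; i++)
              if (stump(x[i], cand, s) != y[i]) err += w[i];
            if (err < bestErr) { bestErr = err; thr[t] = cand; sign[t] = s; }
          }

        // 2. A rule far better than chance gets a large say in the final vote.
        alpha[t] = 0.5 * Math.log((1 - bestErr) / Math.max(bestErr, 1e-12));

        // 3. Re-weight: examples this rule got wrong matter more next round.
        double norm = 0;
        for (int i = 0; i < n; i++) {
          w[i] *= Math.exp(-alpha[t] * y[i] * stump(x[i], thr[t], sign[t]));
          norm += w[i];
        }
        for (int i = 0; i < n; i++) w[i] /= norm;
      }

      // Final strong classifier: sign of the alpha-weighted vote of the weak rules.
      for (double point : new double[] { 0.25, 0.65 }) {
        double vote = 0;
        for (int t = 0; t < rounds; t++) vote += alpha[t] * stump(point, thr[t], sign[t]);
        System.out.printf("x = %.2f -> %+d%n", point, vote >= 0 ? 1 : -1);
      }
    }

    // Weak learner: a one-split "rule of thumb" on a single number.
    static int stump(double x, double threshold, int sign) {
      return sign * (x - threshold) > 0 ? 1 : -1;
    }
  }

The three numbered comments are the whole algorithm: pick the best weak rule, weight it by its accuracy, and re-weight the examples it got wrong.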

If you can’t recognize a subject, how can you reliably boost it? (Inquiring minds want to know.)

(I first saw this title mentioned at KDnuggets, http://www.kdnuggets.com/2012/06/book-boosting-foundations-algorithms.html)

A Taxonomy of Site Search

Filed under: Search Behavior,Search Interface,Searching — Patrick Durusau @ 7:50 pm

A Taxonomy of Site Search by Tony Russell-Rose.

From the post:

Here are the slides from the talk I gave at Enterprise Search Europe last week on A Taxonomy of Site Search. This talk extends and validates the taxonomy of information search strategies (aka ‘search modes’) presented at last year’s event, and reviews some of their implications for design. But this year we looked specifically at site search rather than enterprise search, and explored the key differences in user needs and behaviours between the two domains. [see Tony’s post for the slides]

There is a lot to be learned (and put to use) from investigations of search behavior.

Concurrent Programming for Scalable Web Architectures

Filed under: Concurrent Programming,Parallel Programming,Scalability,Web Applications — Patrick Durusau @ 7:49 pm

Concurrent Programming for Scalable Web Architectures by Benjamin Erb.

Abstract:

Web architectures are an important asset for various large-scale web applications, such as social networks or e-commerce sites. Being able to handle huge numbers of users concurrently is essential, thus scalability is one of the most important features of these architectures. Multi-core processors, highly distributed backend architectures and new web technologies force us to reconsider approaches for concurrent programming in order to implement web applications and fulfil scalability demands. While focusing on different stages of scalable web architectures, we provide a survey of competing concurrency approaches and point to their adequate usages.

High Scalability has a good list of topics and the table of contents.

Or you can jump to the thesis homepage.

Just in case you are thinking about taking your application to “web scale.” 😉

Apache Camel Tutorial

Filed under: Apache Camel,Data Integration,Data Streams — Patrick Durusau @ 7:49 pm

If you haven’t seen Apache Camel Tutorial Business Partners (other tutorials here), you need to give it a close look:

So there’s a company, which we’ll call Acme. Acme sells widgets, in a fairly unusual way. Their customers are responsible for telling Acme what they purchased. The customer enters into their own systems (ERP or whatever) which widgets they bought from Acme. Then at some point, their systems emit a record of the sale which needs to go to Acme so Acme can bill them for it. Obviously, everyone wants this to be as automated as possible, so there needs to be integration between the customer’s system and Acme.

Sadly, Acme’s sales people are, technically speaking, doormats. They tell all their prospects, “you can send us the data in whatever format, using whatever protocols, whatever. You just can’t change once it’s up and running.”

The result is pretty much what you’d expect. Taking a random sample of 3 customers:

  • Customer 1: XML over FTP
  • Customer 2: CSV over HTTP
  • Customer 3: Excel via e-mail

Now on the Acme side, all this has to be converted to a canonical XML format and submitted to the Acme accounting system via JMS. Then the Acme accounting system does its stuff and sends an XML reply via JMS, with a summary of what it processed (e.g. 3 line items accepted, line item #2 in error, total invoice 123.45). Finally, that data needs to be formatted into an e-mail, and sent to a contact at the customer in question (“Dear Joyce, we received an invoice on 1/2/08. We accepted 3 line items totaling 123.45, though there was an error with line items #2 [invalid quantity ordered]. Thank you for your business. Love, Acme.”).

You don’t have to be a “doormat” to take data as you find it.

Intercepted communications are unlikely to use your preferred terminology for locations or actions. Ditto for web/blog pages.

If you are thinking about normalization of data streams by producing subject-identity enhanced data streams, then you are thinking what I am thinking about Apache Camel.
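
To make the scenario concrete, here is a rough sketch of how the three inbound formats might funnel into one canonical route using Camel’s Java DSL. It is not from the tutorial; every endpoint URI and the converter bean names are hypothetical.

  import org.apache.camel.builder.RouteBuilder;

  public class AcmeInboundRoutes extends RouteBuilder {
    @Override
    public void configure() {
      // Customer 1: XML over FTP (assumes their XML is already canonical,
      // otherwise a transform step would go here).
      from("ftp://acme@ftp.example.com/inbox?password=secret")
        .to("direct:canonical");

      // Customer 2: CSV over HTTP, converted by a hypothetical bean.
      from("jetty:http://0.0.0.0:8080/orders")
        .unmarshal().csv()
        .to("bean:csvToCanonicalXml")
        .to("direct:canonical");

      // Customer 3: Excel attachments via e-mail, converted by a hypothetical bean.
      from("imap://orders@acme.example.com?password=secret")
        .to("bean:excelToCanonicalXml")
        .to("direct:canonical");

      // Everything funnels into the accounting system over JMS.
      from("direct:canonical")
        .to("jms:queue:ACME.ACCOUNTING.IN");
    }
  }

The point of the pattern is that each customer-specific route does only protocol and format adaptation; everything downstream of direct:canonical sees a single vocabulary, which is also where subject-identity enhancement would live.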

For further information:

Apache Camel Documentation

Apache Camel homepage

Java Annotations

Filed under: Java,Java Annotations — Patrick Durusau @ 7:48 pm

Java Annotations

From the post:

Annotation is code about the code, that is metadata about the program itself. In other words, organized data about the code, embedded within the code itself. It can be parsed by the compiler, annotation processing tools and can also be made available at run-time too.

We have a basic Java comments infrastructure using which we add information about the code/logic so that in the future another programmer, or the same programmer, can understand the code in a better way. Javadoc is an additional step over it, where we add information about the class, methods, and variables in the source code. The way we add it is organized using a syntax, so we can use a tool to parse those comments and prepare a javadoc document which can be distributed separately.

The javadoc facility gives the option of understanding the code externally: instead of opening the code, the javadoc document can be used separately. IDEs benefit from javadoc as they are able to render information about the code as we develop. Annotations were introduced in JDK 1.5.

A reminder to myself of an opportunity to apply topic maps to Java code. Obviously with regard to “custom” annotations, but I suspect the range of usage for “standard” annotations is quite wide as well.
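
For example, a minimal custom annotation might look like the sketch below. The @SubjectIdentifier name and its iri element are hypothetical, chosen only to show how identity metadata can ride along with the code and be read back at run time.

  import java.lang.annotation.ElementType;
  import java.lang.annotation.Retention;
  import java.lang.annotation.RetentionPolicy;
  import java.lang.annotation.Target;

  // Hypothetical custom annotation carrying a subject identifier.
  @Retention(RetentionPolicy.RUNTIME)               // keep it available via reflection
  @Target({ElementType.TYPE, ElementType.METHOD})
  @interface SubjectIdentifier {
    String iri();
  }

  // Usage: the class declares which subject it represents.
  @SubjectIdentifier(iri = "http://example.org/subject/invoice")
  class Invoice {
  }

  class AnnotationDemo {
    public static void main(String[] args) {
      SubjectIdentifier id = Invoice.class.getAnnotation(SubjectIdentifier.class);
      System.out.println(id.iri());   // the identifier read back at run time
    }
  }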

Not that there aren’t other “subjects” that could be usefully organized out of source code using topic maps. Such as which developers use which classes, methods, etc.

Neo4j in Action

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:48 pm

Neo4j in Action – EARLY ACCESS EDITION – Jonas Partner and Aleksa Vukotic – MEAP Began: June 2012

Description:

Databases are easier to develop and use when the structure of your data matches the way you think and talk about them. Neo4j is a new graph database that allows you to persist data more naturally from domains such as social networking and recommendation engines, where representing data as a graph of interconnected nodes is a natural choice. Neo4j significantly outperforms relational databases when querying graph data. It supports large data sets while preserving full transactional database attributes.

Neo4j in Action is a comprehensive guide to Neo4j, aimed mainly at application developers and software architects. Using the hands-on examples, you’ll learn to model graph domains naturally with Neo4j graph structures. The book explores the full power of the native Java APIs for graph data manipulation and querying. It also covers Cypher – a declarative graph query language developed specifically for Neo4j. In addition to the native API, this book provides a practical example of integration with the popular Spring framework.

Along the way, you’ll learn how to efficiently install, set up, and configure Neo4j databases both as standalone servers and in embedded mode, including performance and memory tuning techniques. You’ll work with the recommended tools for maintenance and monitoring of a Neo4j database instance and configure Neo4j in High Availability mode in a clustered environment.

Manning Publications has released an early release edition of Neo4j in Action.

Currently available:

PART 1: INTRODUCTION TO NEO4J

  1. A case for Neo4j database – FREE
  2. Starting development with Neo4j – AVAILABLE
  3. The power of traversals – AVAILABLE
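
To give a flavor of the native Java API the description mentions, here is a minimal embedded-mode sketch, not taken from the book. The store path, property names, and relationship type are made up, and the transaction idiom is the 1.x style current when the MEAP began.

  import org.neo4j.graphdb.DynamicRelationshipType;
  import org.neo4j.graphdb.GraphDatabaseService;
  import org.neo4j.graphdb.Node;
  import org.neo4j.graphdb.Transaction;
  import org.neo4j.graphdb.factory.GraphDatabaseFactory;

  public class EmbeddedNeo4jSketch {
    public static void main(String[] args) {
      // Hypothetical on-disk store location for the embedded database.
      GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");

      Transaction tx = db.beginTx();
      try {
        Node alice = db.createNode();
        alice.setProperty("name", "Alice");

        Node bob = db.createNode();
        bob.setProperty("name", "Bob");

        // Relationships are first-class: this is what makes graph queries natural.
        alice.createRelationshipTo(bob, DynamicRelationshipType.withName("KNOWS"));

        tx.success();
      } finally {
        tx.finish();   // 1.x transaction idiom
      }

      db.shutdown();
    }
  }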

Recycling RDF and SPARQL

Filed under: Graphs,RDF,SPARQL — Patrick Durusau @ 7:48 pm

I was surprised to learn the W3C is recycling RDF and SPARQL for graph analytics:

RDF and SPARQL (both standards developed by the World Wide Web Consortium) [were developed] as the industry standard[s] for graph analytics.

It doesn’t hurt to repurpose those standards, assuming they are appropriate for graph analytics.

Or rather, assuming they are appropriate for your graph analytic needs.

BTW, there is a contest to promote recycling of RDF and SPARQL with a $70,000 first prize:

YarcData Announces $100,000 Big Data Graph Analytics Challenge

From the post:

At the 2012 Semantic Technology & Business Conference in San Francisco, YarcData, a Cray company, has announced the planned launch of a “Big Data” contest featuring $100,000 in prizes. The YarcData Graph Analytics Challenge will recognize the best submissions for solutions of un-partitionable, Big Data graph problems.

YarcData is holding the contest to showcase the increasing applicability and adoption of graph analytics in solving Big Data problems. The contest also is intended to promote the use and development of RDF and SPARQL (both standards developed by the World Wide Web Consortium) as the industry standard for graph analytics.

“Graph databases have a significant role to play in analytic environments, and they can solve problems like relationship discovery that other traditional technologies do not handle easily,” said Philip Howard, Research Director, Bloor Research. “YarcData driving thought leadership in this area will be positive for the overall graph database market, and this contest could help expand the use of RDF and SPARQL as valuable tools for solving Big Data problems.”

The grand prize for the first place winner is $70,000. The second place winner will receive $10,000, and the third place winner will receive $5,000. There also will be additional prizes for the other finalists. Contest judges, which will include a combination of Big Data industry analysts, experts from academia and semantic web, and YarcData customers, will review the submissions and select the 10 best contestants.

The YarcData Graph Analytics Challenge will officially begin on Tuesday, June 26, 2012, and winners will be announced during a live Web event on December 4, 2012. Full contest details, including specific criteria and the contest judges, will be announced on June 26. To pre-register for a contest information packet, please visit the YarcData website at www.yarcdata.com. Information packets will be sent out June 26. The contest will be open only to those individuals who are eligible to participate under U.S. and other applicable laws and regulations.

Full details to follow on June 26, 2012.

How Do You Define Failure?

Filed under: Modeling,Requirements — Patrick Durusau @ 7:48 pm

… business intelligence implementations are often called failures when they fail to meet the required objectives, lack user acceptance or are only implemented after numerous long delays.

Called failures? Sounds like failures to me. You?

News: The cause of such failures has been discovered:

…an improperly modeled repository not adhering to basic dimensional modeling principles

Really?

I would have said that not having a shared semantic, one shared by all the stakeholders in the project, would be the root cause for most project failures.

I’m not particular about how you achieve that shared semantic. You could use white boards, sticky notes or have people physically act out the system. The important thing being to avoid the assumption that other stakeholders “know what I mean by….” They probably don’t. And several months into building of data structures, interfaces, etc., is a bad time to find out you assumed incorrectly.

The lack of a shared semantic can result in an “…improperly modeled repository…” but that is much later in the process.

Quotes from: Oracle Expert Shares Implementation Key

June 5, 2012

Are You a Bystander to Bad Data?

Filed under: Data,Data Quality — Patrick Durusau @ 7:58 pm

Are You a Bystander to Bad Data? by Jim Harris.

From the post:

In his recent Harvard Business Review blog post “Break the Bad Data Habit,” Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.

“At a minimum,” Redman explained, “others using the erred data may not spot the error. There is no telling where it might turn up or who might be victimized.” And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause. The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all. People and departments must continue to seek out and correct errors. They must also provide feedback and communicate requirements to their data sources.”

In his blog post, “The Secret to an Effective Data Quality Feedback Loop,” Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

[I removed two incorrect links in the quoted portion of Jim’s article. They were pointers to the rapper “Redman” and not Tom Redman. I posted a comment on Jim’s blog about the error.]

Take the time to think about providing feedback on bad data.

Would bad data get corrected more often if correction was easier?

What if a data stream could be intercepted and corrected? Would that make correction easier?

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Filed under: Avro,Hive,Pig — Patrick Durusau @ 7:58 pm

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE by Russell Jurney.

From the post:

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

Continues the high standard set in part one for walking through an entire data lifecycle in the Hadoop ecosystem.

CDH4 and Cloudera Enterprise 4.0 Now Available

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 7:58 pm

CDH4 and Cloudera Enterprise 4.0 Now Available by Charles Zedlewski.

From the post:

I’m very pleased to announce the immediate General Availability of CDH4 and Cloudera Manager 4 (part of the Cloudera Enterprise 4.0 subscription). These releases are an exciting milestone for Cloudera customers, Cloudera users and the open source community as a whole.

Functionality

Both CDH4 and Cloudera Manager 4 are chock full of new features. Many new features will appeal to enterprises looking to move more important workloads onto the Hadoop platform. CDH4 includes high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations.

Other features will appeal to developers and ISVs looking to build applications on top of CDH and/or Cloudera Manager. HBase coprocessors enable the development of new kinds of real-time applications. MapReduce2 opens up Hadoop clusters to new data processing frameworks other than MapReduce. There are new REST APIs both for the Hadoop distributed filesystem and for Cloudera Manager.

Download and install. What new features do you find the most interesting?

Dominic Widdows

Filed under: Data Mining,Natural Language Processing,Researchers,Visualization — Patrick Durusau @ 7:57 pm

While tracking references, I ran across the homepage of Dominic Widdows at Google.

Actually I found the Papers and Publications page for Dominic Widdows and then found his homepage. 😉

There is much to be read here.

DBLP page for Dominic Widdows.

Negation for Document Re-ranking in Ad-hoc Retrieval

Filed under: Disjunction (Widdows),Information Retrieval,Negation (Widdows) — Patrick Durusau @ 7:57 pm

Negation for Document Re-ranking in Ad-hoc Retrieval by Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro.

Interesting slide deck that was pointed out to me by Jack Park.

On the “negation” aspects, I found it helpful to review Word Vectors and Quantum Logic Experiments with negation and disjunction by Dominic Widdows and Stanley Peters (cited as an inspiration by the slide authors).
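
The core move in that paper, as I read it, is vector negation by orthogonal projection: “a NOT b” keeps only the part of a that has no component along b. Sketching from memory (check the paper for the exact formulation):

  a \,\text{NOT}\, b \;=\; a \;-\; \frac{a \cdot b}{b \cdot b}\, b

The result is orthogonal to b, so documents strongly associated with the negated term score low, while disjunction is handled as the subspace spanned by the disjuncts.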

Depending upon your definition of subject identity and subject sameness, you may find negation/disjunction useful for topic map processing.

Quick-R

Filed under: R — Patrick Durusau @ 7:56 pm

Quick-R

From the homepage:

R is an elegant and comprehensive statistical and graphical programming language. Unfortunately, it can also have a steep learning curve. I created this website for both current R users, and experienced users of other statistical packages (e.g., SAS, SPSS, Stata) who would like to transition to R. My goal is to help you quickly access this language in your work.

I assume that you are already familiar with the statistical methods covered and instead provide you with a roadmap and the code necessary to get started quickly, and orient yourself for future learning. I designed this web site to be an easily accessible reference. Look at the sitemap to get an overview.

From the author of R in Action, if you know the book.

Sourcing Semantics

Filed under: Semantics — Patrick Durusau @ 7:56 pm

Ancient Jugs Hold the Secret to Practical Mathematics in Biblical Times is a good illustration of the source of semantics.

From the post:

Archaeologists in the eastern Mediterranean region have been unearthing spherical jugs, used by the ancients for storing and trading oil, wine, and other valuable commodities. Because we’re used to the metric system, which defines units of volume based on the cube, modern archaeologists believed that the merchants of antiquity could only approximately assess the capacity of these round jugs, says Prof. Itzhak Benenson of Tel Aviv University’s Department of Geography.

Now an interdisciplinary collaboration between Prof. Benenson and Prof. Israel Finkelstein of TAU’s Department of Archaeology and Ancient Near Eastern Cultures has revealed that, far from relying on approximations, merchants would have had precise measurements of their wares — and therefore known exactly what to charge their clients.

The researchers discovered that the ancients devised convenient mathematical systems in order to determine the volume of each jug. They theorize that the original owners and users of the jugs measured their contents through a system that linked units of length to units of volume, possibly by using a string to measure the circumference of the spherical container to determine the precise quantity of liquid within.

The system, which the researchers believe was developed by the ancient Egyptians and used in the Eastern Mediterranean from about 1,500 to 700 BCE, was recently reported in the journal PLoS ONE. Its discovery was part of the Reconstruction of Ancient Israel project supported by the European Union.
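
For reference, the geometry that makes a string measurement sufficient is elementary: for a sphere, the circumference alone fixes the volume.

  r = \frac{C}{2\pi}, \qquad V = \frac{4}{3}\pi r^{3} = \frac{C^{3}}{6\pi^{2}}

So a merchant who could measure C in convenient units of length could, in principle, read off the capacity directly, which is the kind of relation the researchers reconstruct.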

The artifacts in question are between 2,700 and 3,500 years old.

When did they take on the semantic of being a standardized unit of measurement based on circumference?

A. When they were in common use, approximately 1,500 to 700 BCE?

B. When this discovery was made as per this article?

Understanding that the artifacts have not changed, was this semantic “lost” during the time period between A and B?

Or have we re-attributed to these artifacts the semantic of being a standardized unit of measurement based on circumference?

If you have some explanation other than our being the source of the measurement semantic, I am interested to hear about it.

That may seem like a trivial point but consider its implications carefully.

If we are the source of semantics, then we are the source of semantics for ontologies, classification systems, IR, etc.

Making those semantics subject to the same uncertainty, vagueness, competing semantics as any other.

Making them subject to being defined/disclosed to be as precise as necessary.

Not defining semantics for the ages. Defining semantics against particular requirements. Not the same thing.


The journal reference:

Elena Zapassky, Yuval Gadot, Israel Finkelstein, Itzhak Benenson. An Ancient Relation between Units of Length and Volume Based on a Sphere. PLoS ONE, 2012; 7 (3): e33895 DOI: 10.1371/journal.pone.0033895

Geometric and Quantum Methods for Information Retrieval

Filed under: Geometry,Information Retrieval,Quantum — Patrick Durusau @ 7:55 pm

Geometric and Quantum Methods for Information Retrieval by Yaoyong Li and Hamish Cunningham.

Abstract:

This paper reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing. It discusses the interesting analogies between components of information retrieval and quantum mechanics. It then describes some quantum mechanics phenomena found in the conventional data analysis and in the psychological experiments for word association. It also presents the applications of the concepts and methods in quantum mechanics such as quantum logic and tensor product to document retrieval and meaning of composite words, respectively. The purpose of the paper is to give the state of the art on and to draw attention of the IR community to the geometric and quantum methods and their potential applications in IR and NLP.

More complex models can (may?) lead to better IR methods, but:

Moreover, as Hilbert space is the mathematical foundation for quantum mechanics (QM), basing IR on Hilbert space creates an analogy between IR and QM and may usefully bring some concepts and methods from QM into IR. (p.24)

is a dubious claim at best.

The “analogy” between QM and IR makes the point:

QM → IR

  • a quantum system → a collection of objects for retrieval
  • complex Hilbert space → information space
  • state vector → objects in collection
  • observable → query
  • measurement → search
  • eigenvalues → relevant or not for one object
  • probability of getting one eigenvalue → relevance degree of object to query

The authors are comparing apples and oranges. For example, “complex Hilbert space” and “information space.”

A “complex Hilbert space” is a model that has been found useful with another model, one called quantum mechanics.

An “information space,” on the other hand, encompasses models known to use “complex Hilbert spaces” and more. Depends on the information space of interest.

Or the notion of “observable” being paired with “query.”

Complex Hilbert spaces may be quite useful for IR, but tying IR to quantum mechanics isn’t required to make use of it.

Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Filed under: Conferences,Filters,Information Retrieval — Patrick Durusau @ 7:55 pm

6th International Workshop on Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Paper Submission: June 21, 2012
Authors Notification: July 10, 2012
Final Paper Submission and Registration: July 24, 2012

In conjunction with International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – IC3K 2012 – 04 – 07 October, 2012 – Barcelona, Spain.

Scope

Nowadays users are more and more interested in information rather than in mere raw data. The huge amount of accessible data sources is growing rapidly. This calls for novel systems providing effective means of searching and retrieving information with the fundamental goal of making it exploitable by humans and machines.
DART focuses on researching and studying new challenges in distributed information filtering and retrieval. In particular, DART aims to investigate novel systems and tools to distributed scenarios and environments. DART will contribute to discuss and compare suitable novel solutions based on intelligent techniques and applied in real-world applications.
Information Retrieval attempts to address similar filtering and ranking problems for pieces of information such as links, pages, and documents. Information Retrieval systems generally focus on the development of global retrieval techniques, often neglecting individual user needs and preferences.
Information Filtering has drastically changed the way information seekers find what they are searching for. In fact, they effectively prune large information spaces and help users in selecting items that best meet their needs, interests, preferences, and tastes. These systems rely strongly on the use of various machine learning tools and algorithms for learning how to rank items and predict user evaluation.

Topics of Interest

Topics of interest will include (but not are limited to):

  • Web Information Filtering and Retrieval
  • Web Personalization and Recommendation
  • Web Advertising
  • Web Agents
  • Web of Data
  • Semantic Web
  • Linked Data
  • Semantics and Ontology Engineering
  • Search for Social Networks and Social Media
  • Natural Language and Information Retrieval in the Social Web
  • Real-time Search
  • Text categorization

If you are interested and have the time (or graduate students with the time), abstracts from prior conferences are here. Would be a useful exercise to search out publicly available copies. (As far as I can tell, no abstracts from DART.)

Capturing…Quantitative and Semantic Information in Radiology Images

Filed under: Biomedical,Ontology — Patrick Durusau @ 7:55 pm

Daniel Rubin from Stanford University on “Capturing and Computer Reasoning with Quantitative and Semantic Information in Radiology Images” at 10:00am PT, Wednesday, June 6.

ABSTRACT:

The use of semantic Web technologies to make the myriad of data in cyberspace accessible to intelligent agents is well established. However, a crucial type of information on the Web–and especially in life sciences–is imaging, which is largely being overlooked in current semantic Web endeavors. We are developing methods and tools to enable the transparent discovery and use of large distributed collections of medical images within hospital information systems and ultimately on the Web. Our approach is to make the human and machine descriptions of image content machine-accessible through “semantic annotation” using ontologies, capturing semantic and quantitative information from images as physicians view them in a manner that minimally affects their current workflow. We exploit new standards for making image contents explicit and publishable on the semantic Web. We will describe tools and methods we are developing and preliminary results using them for response assessment in cancer. While this work is focused on images in the life sciences, it has broader applicability to all images on the Web. Our ultimate goal is to enable semantic integration of images and all the related scientific data pertaining to their content so that physicians and basic scientists can have the best understanding of the biological and physiological significance of image content.

SPEAKER BIO:

Daniel L. Rubin, MD, MS is Assistant Professor of Radiology and Medicine (Biomedical Informatics Research) at Stanford University. He is a Member of the Stanford Cancer Center and the Bio-X interdisciplinary research program. His NIH-funded research program focuses on the intersection of biomedical informatics and imaging science, developing computational methods and applications to extract quantitative information and meaning from clinical, molecular, and imaging data, and to translate these methods into practice through applications to improve diagnostic accuracy and clinical effectiveness. He is Principal Investigator of one of the centers in the National Cancer Institute’s recently-established Quantitative Imaging Network (QIN), Chair of the RadLex Steering Committee of the Radiological Society of North America (RSNA), and Chair of the Informatics Committee of the American College of Radiology Imaging Network (ACRIN). Dr. Rubin has published over 100 scientific publications in biomedical imaging informatics and radiology.

WEBEX DETAILS:
——————————————————-
To start or join the online meeting
——————————————————-
Go to https://stanford.webex.com/stanford/j.php?ED=175352027&UID=481527042&PW=NYjM4OTVlZTFj&RT=MiM0

——————————————————-
Audio conference information
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=175352027&tollFree=0

Access code: 925 343 903

Whether you are using topic maps for image annotation or mapping between systems of image annotation, this promises to be an interesting presentation.

June 4, 2012

Cloudera Manager 3.7.6 released!

Filed under: Cloudera,Hadoop,HDFS,MapReduce — Patrick Durusau @ 4:34 pm

Cloudera Manager 3.7.6 released! by Jon Zuanich.

Jon writes:

We are pleased to announce that Cloudera Manager 3.7.6 is now available! The most notable updates in this release are:

  • Support for multiple Hue service instances
  • Separating RPC queue and processing time metrics for HDFS
  • Performance tuning of the Resource Manager components
  • Several bug fixes and performance improvements

The detailed Cloudera Manager 3.7.6 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.6 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

It’s only fair, since I mentioned the Cray earlier, that I get a post about Cloudera out today as well.

How big is R on CRAN #rstats

Filed under: Graphs,R — Patrick Durusau @ 4:33 pm

How big is R on CRAN #rstats by Ajay Ohri.

Ajay writes:

3.87 GB and 3786 packages. That’s what you need to install the whole of R as on CRAN

Just in case you are looking for a data set to map to a graph. Such as packages that call other packages, etc.

Think of the resulting graph as being a lens for viewing R literature on the Web.

(Curious what sort of download time you get?)

Data hoarding and bias among big challenges in big data and analytics

Filed under: Analytics,BigData — Patrick Durusau @ 4:33 pm

Data hoarding and bias among big challenges in big data and analytics by Linda Tucci.

From the post:

Hype aside, exploiting big data and analytics will matter hugely to companies’ future performance, remaking whole industries and spawning new ones. The list of challenges is long, however. They range from the well-documented paucity of data scientists available to crunch that big data, to more intractable but less-mentioned problems rooted in human nature.

One of the latter is humans’ tendency to hoard data. Another is their tendency to hold on to preconceived beliefs even when the data screams otherwise. That was the consensus of a panel of data experts speaking on big data and analytics at the recent MIT Sloan CIO Symposium in Cambridge, Mass. Another landmine? False hope. There is no final truth in big data and analytics, as the enterprises that do big data well already know. Iteration is all, the panel agreed.

Moreover, except for the value of iteration, CIOs can forget about best practices. Emerging so-called next practices are about the best companies can lean on as they dive into big data, said computer scientist Michael Chui, San Francisco-based senior fellow at the McKinsey Global Institute, the research arm of New York-based McKinsey & Co. Inc.

“The one thing we know that doesn’t work: Wait five years until the perfect data warehouse is ready,” said Chui, who’s an author of last year’s massive McKinsey report on the value of big data.

Seeing data quality in relative terms

In fact, obsessing over data quality is one of the first hurdles many companies have to overcome if they hope to use big data effectively, Chui said. Data accuracy is of paramount importance in banks’ financial statements. Messy data, however, contains patterns that can highlight business problems or provide insights that generate significant value, as laid out in a related story about the symposium panel, “Seize big data and analytics or fall behind, MIT panel says.”

Issues that you will have to face in the creation of topic maps, big data or no.

Entry-Level HPC: Proven at a Petaflop, Affordably Priced!

Filed under: Cray — Patrick Durusau @ 4:32 pm

Entry-Level HPC: Proven at a Petaflop, Affordably Priced!

AMD sponsored this content at www.Datanami.com.

As a long time admirer of Cray I had to repost:

Computing needs at many commercial enterprises, research universities, and government labs continue to grow as more complex problems are explored using ever-more sophisticated modeling and analysis programs.

A new class of Cray XE6 and Cray XK6 high performance computing (HPC) systems, based on AMD Opteron™ processors, now offer teraFLOPS of processing power, reliability, utilization rates, and other advantages of high-end supercomputers, but with a great low purchase price. Entry-level supercomputing systems in this model line target midrange HPC applications, have an expected performance in the 6.5 teraflop to 200 teraFLOPS range, and scale in price from $200,000 to $3 million.

These systems can give organizations an alternative to high-end HPC clusters. One potential advantage of these entry-level systems is that they are designed to deliver supercomputing reliability and sustained performance. Users can be confident their jobs will run to completion. And the systems also offer predictability. “There is reduced OS noise, so you get similar run times every time,” said Margaret Williams, senior vice president of HPC Systems at Cray Inc.

Not enough to get you into “web scale” data but certainly enough for many semantic integration problems.

Where’s your database’s ER Diagram?

Filed under: Database,Documentation — Patrick Durusau @ 4:32 pm

Where’s your database’s ER Diagram? by Scott Selikoff.

From the post:

I was recently training a new software developer, explaining the joys of three-tier architecture and the importance of the proper black-box encapsulation, when the subject switched to database design and ER diagrams. For those unfamiliar with the subject, entity-relationship diagrams, or ER diagrams for short, are a visual technique for modelling entities, aka tables in relational databases, and the relationships between the entities, such as foreign key constraints, 1-to-many relationships, etc. Below is a sample of such a diagram.

Scott’s post is particularly appropriate since we were talking about documentation of your aggregation strategy in MongoDB.

My experience is that maintenance of documentation in general, not just E-R diagrams, is a very low priority.

Which means that migration of databases and other information resources is far more expensive and problematic than necessary.

There is a solution to the absence of current documentation.

No, it isn’t topic maps, at least not necessarily, although topic map could be part of a solution to the documentation problem.

What could make a difference would be the tracking of changes to the system/schema/database/etc. with relationships to the people who made them.

So that at the end of each week, for example, it would be easy to tell who had or had not created the necessary documentation for the changes they had made.

Think of it as bringing accountability to change tracking. It isn’t enough to track a change or to know who made it, if we lack the documentation necessary to understand the change.

When I said you would not necessarily have to use a topic map, I was thinking of JIRA, which has ample opportunities for documentation of changes. (Insert your favorite solution, JIRA happens to be one that is familiar.) It does require the discipline to enter the documentation.

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2)

Filed under: Aggregation,MongoDB,NoSQL,Python — Patrick Durusau @ 4:32 pm

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2) by Rick Copeland.

From the post:

Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you’re all caught up, let’s jump right in….

Why a new framework?

If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let’s go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.

Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: you should use only as much aggregation (or, in topic map terminology, “merging”) as you need.

It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given.

In part because new semantics are emerging every day and there are too many previous semantics that are poorly documented or unknown.

What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.
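
To make the “mean value of a property” example concrete, here is a small sketch of the same pipeline idea using the MongoDB Java driver rather than the Python shown in Rick’s post. The database, collection, and field names are invented, and the driver API shown is the current one, not the 2.1-era interface.

  import java.util.Arrays;

  import com.mongodb.client.MongoClient;
  import com.mongodb.client.MongoClients;
  import com.mongodb.client.MongoCollection;
  import org.bson.Document;

  import static com.mongodb.client.model.Accumulators.avg;
  import static com.mongodb.client.model.Aggregates.group;
  import static com.mongodb.client.model.Aggregates.match;
  import static com.mongodb.client.model.Filters.eq;

  public class MeanValueSketch {
    public static void main(String[] args) {
      try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
        MongoCollection<Document> orders =
            client.getDatabase("acme").getCollection("orders");

        // Pipeline: filter documents, then group and average a numeric field.
        // No map, reduce, or finalize functions required.
        orders.aggregate(Arrays.asList(
            match(eq("status", "accepted")),
            group("$customer", avg("meanAmount", "$amount"))
        )).forEach(doc -> System.out.println(doc.toJson()));
      }
    }
  }

Compare that with framing the same mean as map, reduce, and finalize steps: the pipeline states the intent directly.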

Aggregation in MongoDB (Part 1)

Filed under: Aggregation,MongoDB,NoSQL,Python — Patrick Durusau @ 4:31 pm

Aggregation in MongoDB (Part 1) by Rick Copeland.

From the post:

In some previous posts on mongodb and python, pymongo, and gridfs, I introduced the NoSQL database MongoDB, how to use it from Python, and how to use it to store large (more than 16 MB) files in it. Here, I’ll be showing you a few of the features that the current (2.0) version of MongoDB includes for performing aggregation. In a future post, I’ll give you a peek into the new aggregation framework included in MongoDB version 2.1.

An index “aggregates” information about a subject (called an ‘entry’), where the information is traditionally found between the covers of a book.

MongoDB offers predefined as well as custom “aggregations,” where the information field can be larger than a single book.

Good introduction to aggregation in MongoDB, although you (and I) really should get around to reading the MongoDB documentation.

Different ways to make auto suggestions with Solr

Filed under: AutoSuggestion,Lucene,LucidWorks,Solr — Patrick Durusau @ 4:30 pm

Different ways to make auto suggestions with Solr

From the post:

Nowadays almost every website has a full text search box as well as the auto suggestion feature in order to help users find what they are looking for by typing the least possible number of characters. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typo errors. That’s a meaningful example which contains multi-term suggestions depending on the most popular queries, combined with spelling correction.

Starts with seven (7) questions you should ask yourself about auto-suggestions and then covers four methods for implementing them in Solr.

You can have the typical word completion seen in most search engines or you can be more imaginative, using custom dictionaries.

Stop Labeling Everything as an Impedance Mismatch!

Filed under: Communication,Marketing — Patrick Durusau @ 4:30 pm

Stop Labeling Everything as an Impedance Mismatch! by Jos Dirksen (DZone Java Lobby).

Jos writes:

I recently ran across an article that was talking (again) about the Object-Relational mismatch. And just like in many articles this mismatch is called the Object-Relational Impedance mismatch. This “impedance mismatch” label isn’t just added when talking about object and relational databases, but pretty much in any situation where we have two concepts that don’t match nicely:

As someone who has abused “semantic impedance” in the past (and probably will in the future), this caught my eye.

Particularly because Jos goes on to say:

…In the way we use it impedance mismatch sounds like a bad thing. In electrical engineering it is just a property of an electronic circuit. In some circuits you might need to have impedance matching, in others you don’t.

Saying we have an object relation impedance mismatch doesn’t mean anything. Yes we have a problem between the OO world and the relation world, no discussion about that. Same goes for the other examples I gave in the beginning of this article. But labelling it with the “impedance mismatch” doesn’t tell us anything about the kind of problem we have. We have a “concept mismatch”, a “model mismatch”, or a “technology mismatch”.

The important point is that impedance, being a property of every circuit, doesn’t tell us anything by itself.

Just as “semantic impedance” doesn’t tell us anything about the nature of the “impedance.”

Or possible ways to reduce it.

Suggestion: Let’s take “semantic impedance” as a universal given.

Next question: What can we do to lessen it in specific situations? With enough details, that’s a question we may be able to answer, in part.
