Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 16, 2013

10 Important Questions [Business Questions]

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:49 pm

10 Important Questions by Kevin Hillstrom.

Planning on making a profit on topic maps or any other semantic technology product/service?

Print out and answer Kevin’s questions, in writing.

I would suggest doing the same exercise annually, without peeking at last year’s.

Then have someone else compare the two.

If semantic technologies are going to be taken seriously, we all need to have good answers to these questions.

Pig, ToJson, and Redis to publish data with Flask

Filed under: JSON,Pig,Redis — Patrick Durusau @ 4:48 pm

Pig, ToJson, and Redis to publish data with Flask by Russell Jurney.

From the post:

Pig can easily stuff Redis full of data. To do so, we’ll need to convert our data to JSON. We’ve previously talked about pig-to-json in JSONize anything in Pig with ToJson. Once we convert our data to json, we can use the pig-redis project to load redis.

What do you think?

Something “lite” to test a URI dictionary locally?
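For a quick local experiment with the load step, here is a minimal Python sketch (my stand-in, not the pig-redis loader the post describes) that pushes JSON records into Redis by hand. The record fields and key scheme are invented for illustration; it assumes the redis-py client and a Redis server on the default port.

import json

import redis  # redis-py client; assumes a local Redis server on the default port

# Hypothetical records standing in for the output of a Pig job.
records = [
    {"id": "1", "name": "alice", "score": 42},
    {"id": "2", "name": "bob", "score": 17},
]

r = redis.Redis(host="localhost", port=6379, db=0)

for rec in records:
    # Serialize each record to JSON and store it under a key derived from its id,
    # roughly what the ToJson + pig-redis combination does at scale.
    r.set("record:%s" % rec["id"], json.dumps(rec))

print(r.get("record:1"))

From there, a small Flask view could serve those Redis values back out, which is the publishing step the post walks through.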

Announcing Apache Hadoop 2.0.3 Release and Roadmap

Filed under: Hadoop YARN — Patrick Durusau @ 4:48 pm

Announcing Apache Hadoop 2.0.3 Release and Roadmap by Arun Murthy.

From the post:

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers significant major enhancements and stability over previous releases in hadoop-2.x series. Notably, it includes:

  • QJM for HDFS HA for NameNode (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from folks at Yahoo! here)

A beta release is a couple of months off so now is your chance to review the alpha and contribute towards the beta.

Working with Pig

Filed under: Pig,Regex,Regexes — Patrick Durusau @ 4:48 pm

Working with Pig by Dan Morrill. (video)

From the description:

Pig is a SQL like command language for use with Hadoop, we review a simple PIG script line by line to help you understand how pig works, and regular expressions to help parse data. If you want a copy of the slide presentation – they are over on slide share http://www.slideshare.net/rmorrill.

Very good intro to PIG!

Mentions a couple of resources you need to bookmark:

Input Validation Cheat Sheet (The Open Web Application Security Project – OWASP) – regexes to re-use in Pig scripts. Lots of other regex cheat sheet pointers. (Being mindful that “\” must be escaped in Pig.)

Regular-Expressions.info A more general resource on regexes.
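To make the escaping point concrete, here is a small Python sketch (mine, using a simplified pattern as a hypothetical stand-in for the longer OWASP one) that reuses a cheat-sheet-style email regex; the closing comment notes what changes when the same pattern is pasted into a Pig script.

import re

# A simplified email pattern of the kind found on regex cheat sheets
# (a hypothetical stand-in, not the full OWASP pattern).
email = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

print(bool(email.match("user@example.com")))  # True
print(bool(email.match("not-an-email")))      # False

# Inside a Pig script string the backslashes must be escaped again,
# so \. in the pattern above becomes \\. when embedded in Pig.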

I first saw this at: This Quick Pig Overview Brings You Up to Speed Line by Line.

Amazon Web Services Announces Amazon Redshift

Filed under: Amazon Web Services AWS,Cloud Computing,Data Warehouse — Patrick Durusau @ 4:47 pm

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re:Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resource to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. 😉

Sigh, semantic confusion continues unabated.

atomic<> Weapons

Filed under: Memory,Multi-Core,Programming — Patrick Durusau @ 4:47 pm

atomic<> Weapons by Herb Sutter.

C++ and Beyond 2012: Herb Sutter – atomic<> Weapons, 1 of 2

C++ and Beyond 2012: Herb Sutter – atomic<> Weapons, 2 of 2

Abstract:

This session in one word: Deep.

It’s a session that includes topics I’ve publicly said for years is Stuff You Shouldn’t Need To Know and I Just Won’t Teach, but it’s becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they’ll reach for the big red levers with the flashing warning lights. Since we can’t keep people from pulling the big red levers, we’d better document the A to Z of what the levers actually do, so that people don’t SCRAM unless they really, really, really meant to.

With all the recent posts about simplicity and user interaction, some readers may be getting bored.

Never fear, something a bit more challenging for you.

Multicore memory models along with comments that cite even more research.

Plus I liked the line: “…reach for the big red levers with the flashing warning lights.”

Enjoy!

NBA Stats Like Never Before [No RDF/Linked Data/Topic Maps In Sight]

Filed under: Design,Interface Research/Design,Linked Data,RDF,Statistics,Topic Maps — Patrick Durusau @ 4:47 pm

NBA Stats Like Never Before by Timo Elliott.

From the post:

The National Basketball Association today unveiled a new site for fans of game statistics: NBA.com/stats, powered by SAP Analytics technology. The multi-year marketing partnership between SAP and the NBA was announced six months ago:

“We are constantly researching new and emerging technologies in an effort to provide our fans with new ways to connect with our game,” said NBA Commissioner David Stern. “SAP is a leader in providing innovative software solutions and an ideal partner to provide a dynamic and comprehensive statistical offering as fans interact with NBA basketball on a global basis.”

“SAP is honored to partner with the NBA, one of the world’s most respected sports organizations,” said Bill McDermott, co-CEO, SAP. “Through SAP HANA, fans will be able to experience the NBA as never before. This is a slam dunk for SAP, the NBA and the many fans who will now have access to unprecedented insight and analysis.”

The free database contains every box score of every game played since the league’s inception in 1946, including graphical displays of players’ shooting tendencies.

To the average fan, NBA.com/stats delivers information that is of immediate interest to them, not their computers.

Another way to think about it:

Computers don’t make purchasing decisions, users do.

Something to think about when deciding on your next semantic technology.

Poll: Which Solr version are you using?

Filed under: Solr — Patrick Durusau @ 4:47 pm

Poll: Which Solr version are you using?

From the post:

With Solr 4.1 recently released, let’s see which version(s) of Solr people are using. Please tweet it to help us get more votes and better stats.

Voted and Tweeted: Total Elapsed Time: 6 Seconds.

Can you do better? 😉

Seriously, do the poll and retweet (whether you use Solr or not).

It’s for a good cause.

February 15, 2013

Saving the “Semantic” Web (part 5)

Filed under: RDF,Semantic Web,Topic Maps — Patrick Durusau @ 4:33 pm

Simple Web Semantics

For what it’s worth, what follows in this post is a partial, non-universal proposal, useful only in some cases.

That has been forgotten by this point but in my defense, I did try to warn you. 😉

1. Division of Semantic Labor

The first step towards useful semantics on the web must be a division of semantic labor.

I won’t recount the various failures of the Semantic Web, topic maps and other initiatives to “educate” users on how they should encode semantics.

All such efforts have failed, are failing now, and will fail.

That is not a negative comment on users.

In another life I advocated tools that would enable biblical scholars to work in XML without having to learn angle-bang syntax. It wasn’t for lack of intelligence; most of them were fluent in five or six ancient languages.

They were focused on being biblical scholars and had no interest in learning the minutiae of XML encoding.

After many years, due to a cast of hundreds if not thousands, OpenOffice, the OpenDocument Format (ODF) and XML editing became available to ordinary users.

Not the fine tuned XML of the Text Encoding Initiative (TEI) or DocBook, but having a 50 million plus user share is better than being in the 5 to 6 digit range.

Users have not succeeded in authoring structured data, such as RDF, but have demonstrated competence at authoring <a> elements with URIs.

I propose the following division of semantic labor:

Users – Responsible for identification of subjects in content they author, using URIs in the <a> element.

Experts – Responsible for annotation (by whatever means) of URIs that can be found in <a> elements in content.

2. URIs as Pointers into a Dictionary

One of the comments on this series pointed out that URIs are like “pointers into a dictionary.” I like that imagery and it is easier to understand than the way I intended to say it.

If you think of words as pointers into a dictionary, how many dictionaries does a word point into?

Then contrast your answer with the number of dictionaries into which a URI points.

If we are going to use URIs as “pointers into a dictionary,” then there should be no limit on the number of dictionaries into which they can point.

A URI can be posed to any number of dictionaries as a query, with possibly different results from each dictionary.
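As a sketch of what that could look like from the client side, here is a minimal Python example (the dictionary endpoints are hypothetical example.org URLs, not real services, and it assumes the requests library):

import requests

# Hypothetical dictionary endpoints; each is independently maintained
# and free to answer in its own way.
dictionaries = [
    "https://dictionary-one.example.org/lookup",
    "https://dictionary-two.example.org/lookup",
]

uri = "http://data.nytimes.com/47271465269191538193"

for endpoint in dictionaries:
    # Pose the same URI to each dictionary as a query.
    response = requests.get(endpoint, params={"uri": uri}, timeout=10)
    if response.ok:
        print(endpoint, "->", response.json())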

3. Of Dictionaries

Take the URI http://data.nytimes.com/47271465269191538193 as an example of a URI that can appear in a dictionary.

If you follow that URI, you will notice a couple of things:

  1. It isn’t content suitable for primary or secondary education.
  2. The content is limited to that of the New York Times.
  3. The content of the NYT consists of article pointers

Not to mention it is a “pull” interface that requires effort on the part of users, as opposed to a “push” interface that reduces that effort.

What if, rather than “following” the URI http://data.nytimes.com/47271465269191538193, you could submit that same URI to another dictionary, one that had different information?

A dictionary that for that URI returns:

  1. Links to content suitable for primary or secondary education.
  2. Broader content than just New York Times.
  3. Curated content and not just article pointers

Just as we have semantic diversity:

URI dictionaries shall not be required to use a particular technology or paradigm.

4. Immediate Feedback

Whether you will admit it or not, we have all coded HTML and then loaded it in a browser to see the results.

That’s called “immediate feedback” and made HTML, the early versions anyway, extremely successful.

When <a> elements with URIs are used to identify subjects, how can we duplicate that “immediate feedback” experience?

My suggestion is that users encode in the <head> of their documents a meta element that reads:

<meta name="dictionary" content="URI">

And insert JavaScript or jQuery code that creates an array of all the URIs in the document, passes those URIs to the dictionary specified by the user, and then displays a set of values when a user mouses over a particular URI.

Think of it as being the equivalent of spell checking except for subjects. You could even call it “subject checking.”

For most purposes, dictionaries should only return 3 or 4 key/value pairs, enough for users to verify their choice of a URI. With an option to see more information.

True enough, I haven’t asked for users to say which of those properties identify the subject in question and I don’t intend to. That lies in the domain of experts.

The inline URI mechanism lends itself to automatic insertion of URIs, which users could then verify as capturing their meaning. (Wikifier is a good example, assuming you have a dictionary based on Wikipedia URIs.)

Users should be able to choose the dictionaries they prefer for identification of subjects. Further, users should be able to verify their identifications from observing properties associated with a URI.
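To make the dictionary side concrete, here is a minimal Flask sketch (in the spirit of the Flask post above) of a URI dictionary endpoint that returns a handful of key/value pairs for a URI. The URI and its properties are placeholders, not a claim about any real dictionary:

from flask import Flask, jsonify, request

app = Flask(__name__)

# A toy, in-memory dictionary: URI -> a few key/value pairs.
# Entries are made up for illustration only.
DICTIONARY = {
    "http://example.org/subject/1": {
        "label": "Example subject",
        "type": "Person",
        "source": "demo dictionary",
    },
}

@app.route("/lookup")
def lookup():
    uri = request.args.get("uri", "")
    entry = DICTIONARY.get(uri)
    if entry is None:
        return jsonify({"uri": uri, "found": False}), 404
    # Return just enough for a user to confirm the URI identifies
    # the subject they intended, with more available on request.
    return jsonify({"uri": uri, "found": True, "properties": entry})

if __name__ == "__main__":
    app.run(port=5000)

A page’s script could call /lookup?uri=… for each <a> href it finds and show the returned properties on mouseover, which is the “subject checking” experience described above.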

5. Incentives, Economic and Otherwise

There are economic and other incentives that arise from “Simple Web Semantics.”

First, divorcing URI dictionaries from any particular technology will create an easy on ramp for dictionary creators to offer as many or few services as they choose. Users can vote with their feet on which URI dictionaries meet their needs.

Second, divorcing URIs from their sources creates the potential for economic opportunities and competition in the creation of URI dictionaries. Dictionary creators can serve up definitions for popular URIs, along with pointers to other content, free and otherwise.

Third, giving users the right to choose their URI dictionaries is a step towards returning democracy to the WWW.

Fourth, giving users immediate feedback based on URIs they choose, makes users the judges of their own semantics, again.

Fifth, with the rise of URI dictionaries, the need to maintain URIs, “cool” or otherwise, simply disappears. No one maintains the existence of words. We have dictionaries.

There are technical refinements that I could suggest but I wanted to draw the proposal in broad strokes and improve it based on your comments.

Comments/Suggestions?

PS: As I promised at the beginning, this proposal does not address many of the endless complexities of semantic integration. If you need a different solution, for a different semantic integration problem, you know where to find me.


The Family of MapReduce and Large Scale Data Processing Systems

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:03 pm

The Family of MapReduce and Large Scale Data Processing Systems by Sherif Sakr, Anna Liu, Ayman G. Fayoumi.

Abstract:

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

At twenty-seven pages and one hundred and thirty-five references, this is one for the weekend and perhaps beyond!

Definitely a paper to master if you are interested in seeing the next generation of MapReduce techniques before your competition.

I first saw this at Alex Popescu’s The Family of MapReduce and Large Scale Data Processing Systems.

“Document Design and Purpose, Not Mechanics”

Filed under: Documentation,Software — Patrick Durusau @ 1:51 pm

“Document Design and Purpose, Not Mechanics” by Stephen Turner.

From the post:

If you ever write code for scientific computing (chances are you do if you’re here), stop what you’re doing and spend 8 minutes reading this open-access paper:

Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).

The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven’t been properly trained in fundamental software development skills.

The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you’d probably guess – using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics.

We all know that good comments and documentation is critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation. (emphasis added)

There is no shortage of advice (largely unread) on good writing practices. 😉

Stephen calling out the advice to “…document design and purpose, not mechanics” struck me as relevant to semantic integration solutions.

In both RDF and XTM topic maps, use of the same URI as an identifier is taken as identifying the same subject.

But that’s mechanics, isn’t it? Just string-to-string comparison.

Mechanics are important but they are just mechanics.

Documenting the conditions for using a URI will help guide you or your successor to using the same URI the same way.

But that takes more than mechanics.

That takes “…document[ing] the underlying ideas, interface, and reasons, not the implementation.”
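A minimal Python sketch of the difference, using a made-up comparison function; the mechanics are one line, while the documentation records why that line can be trusted:

def same_subject(uri_a, uri_b):
    """Decide whether two URIs identify the same subject.

    Mechanics: today this is a plain string comparison.

    Design and purpose: a URI is treated as an identifier only when it was
    assigned under our documented naming conventions; if those conventions
    change, this function is the one place the comparison rule lives.
    """
    return uri_a == uri_b

print(same_subject("http://example.org/a", "http://example.org/a"))  # True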

New Query Tool Searches EHR Unstructured Data

Filed under: Biomedical,Medical Informatics,Searching,Unstructured Data — Patrick Durusau @ 1:32 pm

New Query Tool Searches EHR Unstructured Data by Ken Terry.

From the post:

A new electronic health record “intelligence platform” developed at Massachusetts General Hospital (MGH) and its parent organization, Partners Healthcare, is being touted as a solution to the problem of searching structured and unstructured data in EHRs for clinically useful information.

QPID Inc., a new firm spun off from Partners and backed by venture capital funds, is now selling its Web-based search engine to other healthcare organizations. Known as the Queriable Patient Inference Dossier (QPID), the tool is designed to allow clinicians to make ad hoc queries about particular patients and receive the desired information within seconds.

Today, 80% of stored health information is believed to be unstructured. It is trapped in free text such as physician notes and reports, discharge summaries, scanned documents and e-mail messages. One reason for the prevalence of unstructured data is that the standard methods for entering structured data, such as drop-down menus and check boxes, don’t fit into traditional physician workflow. Many doctors still dictate their notes, and the transcription goes into the EHR as free text.

and,

QPID, which was first used in the radiology department of MGH in 2005, incorporates an EHR search engine, a library of search queries based on clinical concepts, and a programming system for application and query development. When a clinician submits a query, QPID presents the desired data in a “dashboard” format that includes abnormal results, contraindications and other alerts, Doyle said.

The core of the system is a form of natural language processing (NLP) based on a library encompassing “thousands and thousands” of clinical concepts, he said. Because it was developed collaboratively by physicians and scientists, QPID identifies medical concepts imbedded in unstructured data more effectively than do other NLP systems from IBM, Nuance and M*Modal, Doyle maintained.

Take away points for data search/integration solutions:

  1. 80% of stored health information (need)
  2. traditional methods for data entry….don’t fit into traditional physician workflow (user requirement)
  3. developed collaboratively by physicians and scientists (semantics originate with users, not top down)

I am interested in how QPID conforms (or not) to local medical terminology practices.

To duplicate their earlier success, conforming to local terminology practices is critical.

If for no other reason it will give physicians and other health professionals “ownership” of the vocabulary and hence faith in the system.

Capturing the “Semantic Differential”?

Filed under: Language,Semantics — Patrick Durusau @ 11:51 am

Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential by John G. Fennell and Roland J. Baddeley. (Fennell JG, Baddeley RJ (2013) Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential. PLoS ONE 8(2): e55588. doi:10.1371/journal.pone.0055588)

Abstract:

If choices are to be made between alternatives like should I go for a walk or grab a coffee, a ‘common currency’ is needed to compare them. This quantity, often known as reward in psychology and utility in economics, is usually conceptualised as a single dimension. Here we propose that to make a comparison between different options it is important to know not only the average reward, but also both the risk and level of certainty (or control) associated with an option. Almost all objects can be the subject of choice, so if these dimensions are required in order to make a decision, they should be part of the meaning of those objects. We propose that this ubiquity is unique, so if we take an average over many concepts and domains these three dimensions (reward, risk, and uncertainty) should emerge as the three most important dimensions in the “meaning” of objects. We investigated this possibility by relating the three dimensions of reward to an old, robust and extensively studied factor analytic instrument known as the semantic differential. Across a very wide range of situations, concepts and cultures, factor analysis shows that 50% of the variance in rating scales is accounted for by just three dimensions, with these dimensions being Evaluation, Potency, and Activity [1]. Using a statistical analysis of internet blog entries and a betting experiment, we show that these three factors of the semantic differential are strongly correlated with the reward history associated with a given concept: Evaluation measures relative reward; Potency measures absolute risk; and Activity measures the uncertainty or lack of control associated with a concept. We argue that the 50% of meaning captured by the semantic differential is simply a summary of the reward history that allows decisions to be made between widely different options.

“Semantic Differential,” as defined by Wikipedia:

Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.

Invented over 50 years ago, semantic differential scales, ranking a concept on a scale anchored by opposites such as good-evil, have proven to be very useful.

What the scale was measuring, despite its success, was unknown. (It may still be, depending on how persuasive you find the authors’ proposal.)

The proposal merits serious discussion and additional research but I am leery about relying on blogs as representative of language usage.

Or rather I take blogs as representative of people who blog, which is a decided minority of all language users.

Just as I would take transcripts of “Sex and the City” as representing the fantasies of socially deprived writers. Interesting perhaps but not the same as the mores of New York City. (If that lowers your expectations about a trip to New York City, my apologies.)

Using molecular networks to assess molecular similarity

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto others, less well known domains.
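A small Python sketch of what documented, domain-specific tests might look like in code; the domains, fields and rules here are invented for illustration, not proposals for any particular discipline:

def same_person(a, b):
    """People: matching ORCID identifiers are taken as the same researcher."""
    return a.get("orcid") is not None and a.get("orcid") == b.get("orcid")

def same_molecule(a, b):
    """Chemistry: identical canonical InChI strings are treated as the same compound."""
    return a.get("inchi") is not None and a.get("inchi") == b.get("inchi")

# Each test is documented and scoped to its domain; there is no universal fallback.
SIMILARITY_TESTS = {
    "person": same_person,
    "molecule": same_molecule,
}

def is_same(domain, a, b):
    if domain not in SIMILARITY_TESTS:
        raise ValueError("no documented similarity test for domain: %s" % domain)
    return SIMILARITY_TESTS[domain](a, b)

print(is_same("person", {"orcid": "0000-0001-0000-0000"}, {"orcid": "0000-0001-0000-0000"}))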

How to set up Semantic Logging…

Filed under: .Net,Log Analysis,Semantics — Patrick Durusau @ 10:55 am

How to set up Semantic Logging: part one with Logstash, Kibana, ElasticSearch and Puppet, by Henrik Feldt.

While we are on the topic of semantic logging:

Logging today is mostly done too unstructured; each application developer has his own syntax for the logs, optimized for his personal requirements and when it is time to deploy, ops consider themselves lucky if there is even some logging in the application, and even luckier if that logging can be used to find problems as they occur by being able to adjust verbosity where needed.

I’ve come to the point where I want a really awesome piece of logging from the get-go – something I can pick up and install in a couple of minutes when I come to a new customer site without proper operations support.

I want to be able to search, drill down into, filter out patterns and have good tooling that allows me to let logging be an obvious support as the application is brought through its life cycle, from development to production. And I don’t want to write my own log parsers, thank you very much!

That’s where semantic logging comes in – my applications should be broadcasting log data in a manner that allow code to route, filter and index it. That’s why I’ve spent a lot of time researching how logging is done in a bloody good manner – this post and upcoming ones will teach you how to make your logs talk!

It’s worth noting that you can read this post no matter your programming language. In fact, the tooling that I’m about to discuss will span multiple operating systems; Linux, Windows, and multiple programming languages: Erlang, Java, Puppet, Ruby, PHP, JavaScript and C#. I will demo logging from C#/Win initially and continue with Python, Haskell and Scala in upcoming posts.

I didn’t see any posts following this one. But it is complete enough to get you started on semantic logging.
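For readers outside the Logstash/.NET stacks, here is a language-neutral illustration of the idea in Python (my sketch, not Henrik’s setup): each log record goes out as one JSON object per line, with the event name and its fields kept separate so an indexer such as ElasticSearch can filter on them without custom parsers. The logger name and fields are invented for the example.

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Carry any structured fields attached via the `extra` argument.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The event name stays a stable token and the details stay structured.
log.info("order_placed", extra={"fields": {"order_id": 1234, "total": 99.95}})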

Embracing Semantic Logging

Filed under: .Net,Log Analysis,Semantics — Patrick Durusau @ 10:49 am

Embracing Semantic Logging by Grigori Melnik.

From the post:

In the world of software engineering, every system needs to log. Logging helps to diagnose and troubleshoot problems with your system both in development and in production. This requires proper, well-designed instrumentation. All too often, however, developers instrument their code to do logging without having a clear strategy and without thinking through the ways the logs are going to be consumed, parsed, and interpreted. Valuable contextual information about events frequently gets lost, or is buried inside the log messages. Furthermore, in some cases logging is done simply for the sake of logging, more like a checkmark on the list. This situation is analogous to people fallaciously believing their backup system is properly implemented by enabling the backup but never, actually, trying to restore from those backups.

This lack of a thought-through logging strategy results in systems producing huge amounts of log data which is less useful or entirely useless for problem resolution.

Many logging frameworks exist today (including our own Logging Application Block and log4net). In a nutshell, they provide high-level APIs to help with formatting log messages, grouping (by means of categories or hierarchies) and writing them to various destinations. They provide you with an entry point – some sort of a logger object through which you call log writing methods (conceptually, not very different from Console.WriteLine(message)). While supporting dynamic reconfiguration of certain knobs, they require the developer to decide upfront on the template of the logging message itself. Even when this can be changed, the message is usually intertwined with the application code, including metadata about the entry such as the severity and entry id.

As ever in all discussions, even those of semantics, there is some impedance:

Imagine another world, where the events get logged and their semantic meaning is preserved. You don’t lose any fidelity in your data. Welcome to the world of semantic logging. Note, some people refer to semantic logging as “structured logging”, “strongly-typed logging” or “schematized logging”.

Whatever you want to call it:

The technology to enable semantic logging in Windows has been around for a while (since Windows 2000). It’s called ETW – Event Tracing for Windows. It is a fast, scalable logging mechanism built into the Windows operating system itself. As Vance Morrison explains, “it is powerful because of three reasons:

  1. The operating system comes pre-wired with a bunch of useful events
  2. It can capture stack traces along with the event, which is INCREDIBLY USEFUL.
  3. It is extensible, which means that you can add your own information that is relevant to your code.

ETW has been improved in .NET Framework 4.5 but I will leave you to Grigori’s post to ferret out those details.

Semantic logging is important for all the reasons mentioned in Grigori’s post and because captured semantics provide grist for semantic mapping mills.

Solr Unleashed

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 10:14 am

Solr Unleashed: A Hands-On Workshop for Building Killer Search Apps (LucidWorks)

From the post:

Having consulted with clients on Lucene and Solr for the better part of a decade, we’ve seen the same mistakes made over and over again: applications built on shaky foundations, stretched to the breaking point. In this two day class, learn from the experts about how to do it right and make sure your apps are rock solid, scalable, and produce relevant results. Also check the course outline.

The course looks great, but if you don’t have the fees, I have reproduced the course outline below.

Using online documentation, mailing lists and other online resources, track the outline and fill it in for yourself.

If you want a real challenge, work through the outline and then build a Solr application around the outline.

To keep your newly acquired skills polished to a fine sheen.

1. The Fundamentals

  • About Solr
  • Installing and running Solr
  • Adding content to Solr
  • Reading a Solr XML response
  • Changing parameters in the URL
  • Using the browse interface

2. Searching

  • Sorting results
  • Query parsers
  • More queries
  • Hardwiring request parameters
  • Adding fields to default search
  • Faceting on fields
  • Range faceting
  • Date range faceting
  • Hierarchical faceting
  • Result grouping

3. Indexing

  • Adding your own content to Solr
  • Deleting data from solr
  • Building a bookstore search
  • Adding book data
  • Exploring the book data
  • Dedupe updateprocessor

4. Updating your schema

  • Adding fields to the schema
  • Analyzing text
5. Relevance

  • Field weighting
  • Phrase queries
  • Function queries

6. Extended features

  • More-like-this
  • Fuzzier search
  • Sounds-like
  • Geospatial
  • Spell checking
  • Suggestions
  • Highlighting

7. Multilanguage

  • Working with English
  • Working with other languages
  • Non-whitespace languages
  • Identifying languages
  • Language specific sorting

8. SolrCloud

  • Introduction
  • How SolrCloud works
  • Commit strategies
  • ZooKeeper
  • Managing Solr config files

Not the same as the class but will help you ask better questions of LucidWorks experts when you need them.
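If you work through the outline against a local Solr instance, a query like the following Python sketch exercises several of the items above (query parsers, faceting). The core name and field names are placeholders for whatever schema you build during the indexing exercises, and it assumes the requests library:

import requests

params = {
    "q": "title:solr",      # basic query parser usage
    "facet": "true",        # faceting on a field
    "facet.field": "author",
    "rows": 5,
    "wt": "json",
}

response = requests.get("http://localhost:8983/solr/books/select", params=params)
response.raise_for_status()

data = response.json()
print("hits:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    print(doc.get("title"))
print("author facets:", data["facet_counts"]["facet_fields"]["author"])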

DataDive to Fight Poverty and Corruption with the World Bank!

Filed under: Data Mining,Government,Transparency — Patrick Durusau @ 5:44 am

DataDive to Fight Poverty and Corruption with the World Bank!

From the post:

We’re thrilled to announce a huge new DataKind DataDive coming to DC the weekend of 3/15! We’re teaming up with the World Bank to put a dent in some of the most serious problems in poverty and corruption through the use of data. Low bar, right?

We’re calling on all socially conscious analysts, statisticians, data scientists, coders, hackers, designers, or eager-to-learn do-gooders to come out with us on the weekend of 3/15 to work with data to improve the world. You’ll be working alongside experts in the field to analyze, visualize, and mashup the most cutting-edge data from the World Bank, UN, and other sources to improve poverty monitoring and root out corruption. We’ve started digging into the data a little ourselves and we’re already so excited for how cool this event is going to be. “Oh, what’d you do this weekend? I reduced global poverty and rooted out corruption. No big deal.”

BTW, there is an Open Data Day on 2/23 to prepare for the DataDive on 3/15.

What isn’t clear from the announcement(s) is what data is to be mined to fight poverty and corruption.

Or what is meant by “corruption?”

Graph solutions, for example, would be better at tracking American style corruption that shuns quid pro quo in favor of a community of interest of the wealthy and well-connected.

Such communities aren’t any less corrupt than members of government with cash in their freezers, just less obvious.

February 14, 2013

InChI in the wild: An Assessment of InChIKey searching in Google

Filed under: Bioinformatics,Cheminformatics,InChl — Patrick Durusau @ 8:19 pm

InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)

Abstract:

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.

An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.
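As a tiny illustration of the “skeleton layer” point from the abstract (the key below is a made-up placeholder that only follows the standard layout of a 14-character connectivity block, a 10-character block, and a final protonation character; it is not a real compound):

# Hypothetical InChIKey; only the layout is real, the value is not.
inchikey = "XXXXXXXXXXXXXX-YYYYYYYYSA-N"

# The first block hashes connectivity only, so searching on it alone
# casts a wider net that also captures isomers of the same skeleton.
skeleton = inchikey.split("-")[0]
print(skeleton)  # XXXXXXXXXXXXXX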

How else would you use identifiers “in the wild?”

When All The Program’s A Graph…

Filed under: Flow-Based Programming (FBP),Functional Programming,Graphs — Patrick Durusau @ 8:00 pm

When All The Program’s A Graph – Prismatic’s Plumbing Library

From the post:

At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff.

Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows.

Then you may think hey, a program isn’t only useful for coding datastructures, but a program is a kind of datastructure and that with a meta level jump you could program a program in terms of flows over data and flow over other flows.

That’s the kind of stuff Prismatic is making available in the Graph extension to their plumbing package (code examples), which is described in an excellent post: Graph: Abstractions for Structured Computation.

Formalizing the structure of FP code. Who could argue with that?

Read the first post as a quick introduction to the second.

Intellectual Property Rights: Fiscal Year 2012 Seizure Statistics

Filed under: Government,Intellectual Property (IP),Transparency — Patrick Durusau @ 7:51 pm

Intellectual Property Rights: Fiscal Year 2012 Seizure Statistics

Fulltextreports.com quotes this report as saying:

In Fiscal Year (FY) 2012, DHS and its agencies, CBP and ICE, remained vigilant in their commitment to protect American consumers from intellectual property theft as well as enforce the rights of intellectual property rights holders by expanding their efforts to seize infringing goods, leading to 691 arrests, 423 indictments and 334 prosecutions. Counterfeit and pirated goods pose a serious threat to America’s economic vitality, the health and safety of American consumers, and our critical infrastructure and national security. Through coordinated efforts to interdict infringing merchandise, including joint operations, DHS enforced intellectual property rights while facilitating the secure flow of legitimate trade and travel.

I just feel so…. underwhelmed.

When was the last time you felt frightened by a fake French handbag? Or imitation Italian shoes?

I mean, they may be ugly but so were the originals.

I mention this because tracking data across the various intellectual property enforcement agencies isn’t straightforward.

I found that out while looking into some historical data on copyright enforcement. After the Aaron Swartz tragedy.

The question I want to pursue with topic maps is: Who benefits from these government enforcement efforts?

As far as I can tell now, today, I never have. I bet the same is true for you.

More on gathering the information to make that case anon.

Core Public Service Vocabulary released for public review [Deadline 27 February 2013]

Filed under: Government,Vocabularies — Patrick Durusau @ 7:32 pm

Core Public Service Vocabulary released for public review

From the post:

The Core Public Service Vocabulary has entered in public review period. Anyone interested is invited to provide feedback until 27 February 2013 (inclusive).

In December 2012, the ISA Programme launched the Core Public Service Vocabulary (CPSV) initiative as part of Action 1.1 on improving semantic interoperability in European e-Government systems. The CPSV is a simplified, reusable and extensible data model that captures the fundamental characteristics of a service offered by public administrations.

The CPSV is designed to make it easy to exchange basic information about the functions carried out by the public sector and the services in which those functions are carried out. By using the vocabulary, organisations publishing data about their services will for example enable:

  • easier discovery of those services within and across countries;
  • easier discovery of the legislation and policies that underpin service provision;
  • easier comparison of similar services provided by different organisations.

Download the draft specification and comment by 27 February 2013.

From text at the draft download site, it appears the Public Review Period was to run from 8 February to 27 February 2013.

Take a look and see if you think that is enough time, or if you have other comments.

Neo4j and Gatling Sitting in a Tree, Performance T-E-S-T-ING

Filed under: Gatling,Neo4j,Scala — Patrick Durusau @ 7:16 pm

Neo4j and Gatling Sitting in a Tree, Performance T-E-S-T-ING by Max De Marzi.

From the post:

I was introduced to the open-source performance testing tool Gatling a few months ago by Dustin Barnes and fell in love with it. It has an easy to use DSL, and even though I don’t know a lick of Scala, I was able to figure out how to use it. It creates pretty awesome graphics and takes care of a lot of work for you behind the scenes. They have great documentation and a pretty active google group where newbies and questions are welcomed.

It requires you to have Scala installed, but once you do all you need to do is create your tests and use a command line to execute it. I’ll show you how to do a few basic things, like test that you have everything working, then we’ll create nodes and relationships, and then query those nodes.

You did run performance tests on your semantic application. Yes?

“Improving Critical Infrastructure Cybersecurity” Executive Order

Filed under: Government,Government Data,Security — Patrick Durusau @ 2:53 pm

Unless you have been asleep for the last couple of days, you have heard about President Obama’s “Improving Critical Infrastructure Cybersecurity” Executive Order.

Wanted to point you to one of the lesser discussed provisions of the order:

Section 4 (e) reads:

In order to maximize the utility of cyber threat information sharing with the private sector, the Secretary shall expand the use of programs that bring private sector subject-matter experts into Federal service on a temporary basis. These subject matter experts should provide advice regarding the content, structure, and types of information most useful to critical infrastructure owners and operators in reducing and mitigating cyber risks.

I didn’t know which “…programs that bring private sector subject-matter experts into Federal Service…” he meant.

So, I wrote to the GSA (General Services Administration) and they said to look at schedules 70 and 874 at www.gsaelibrary.gsa.gov.

I won’t try to advise you on the steps to register for government contract work.

But this is an opportunity for building bridges across the semantic divides in any inter-agency effort.

Do remember where you heard the news!

Hypertable Has Reached A Major Milestone

Filed under: Hypertable,NoSQL — Patrick Durusau @ 2:20 pm

Hypertable Has Reached A Major Milestone by Doug Judd.

From the post:

RangeServer Failover

With the release of Hypertable version 0.9.7.0 comes support for automatic RangeServer failover. Hypertable will now detect when a RangeServer has failed, logically remove it from the system, and automatically re-assign the ranges that it was managing to other RangeServers. This represents a major milestone for Hypertable and allows for very large scale deployments. We have been actively working on this feature, full-time, for 1 1/2 years. To give you an idea of the magnitude of the change, here are the commit statistics:

  • 441 changed files
  • 17,522 line additions
  • 6,384 line deletions

The reason that this feature has been a long time in the making is because we placed a very high standard of quality for this feature so that under no circumstance would a RangeServer failure lead to consistency problems or data loss. We’re confident that we’ve achieved 100% correctness under every conceivable circumstance. The two primary goals for the feature, robustness and application transparency, are described below.

That is a major milestone!

High-end data processing is becoming as crowded with viable options as low-end data processing. And the “low-end” of data processing keeps getting bigger.

HyperGraphDB: A Generalized Graph Database

Filed under: Hyperedges,Hypergraphs — Patrick Durusau @ 2:11 pm

HyperGraphDB: A Generalized Graph Database by Borislav Iordanov.

Abstract:

We present HyperGraphDB, a novel graph database based on generalized hypergraphs where hyperedges can contain other hyperedges. This generalization automatically reifies every entity expressed in the database thus removing many of the usual difficulties in dealing with higher-order relationships. An open two-layered architecture of the data organization yields a highly customizable system where specific domain representations can be optimized while remaining within a uniform conceptual framework. HyperGraphDB is an embedded, transactional database designed as a universal data model for highly complex, large scale knowledge representation applications such as found in artificial intelligence, bioinformatics and natural language processing.

A formal treatment of HyperGraphDB.

Merits being printed out and given a slow read.

Borislav comments on both RDF and Topic Maps:

…Two other prominent issues are contextuality (scoping) and reification.

Those and other considerations from semantic web research disappear or find natural solutions in the model implemented by HyperGraphDB.

But when I search the paper, scoping comes up in an NLP example as:

The tree-like structure of the document is also recorded in HyperGraphDB with scoping parent-child binary links between (a) the document and its paragraphs, (b) a paragraph and its sentences, (c) a sentence and each linguistic relationship inferred from it.

Scoping in at least one sense of the word, but not in the sense of, say, a name being “scoped” by the language French.

Reification, other than the discussion of RDF and topic maps, doesn’t appear again in the paper.

As I said, it needs a slow read but if you see something about scoping and/or reification that I have missed, please give a shout!

Hypergraph-based multidimensional data modeling…

Filed under: Graphs,Hyperedges,Hypergraphs,Networks — Patrick Durusau @ 1:48 pm

Hypergraph-based multidimensional data modeling towards on-demand business analysis by Duong Thi Anh Hoang, Torsten Priebe and A. Min Tjoa. (iiWAS ’11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services, pages 36-43)

Abstract:

In the last few years, web-based environments have witnessed the emergence of new types of on-demand business analysis that facilitate complex and integrated analytical information from multidimensional databases. In these on-demand environments, users of business intelligence architectures can have very different reporting and analytical needs, requiring much greater flexibility and adaptability of today’s multidimensional data modeling. While structured data models for OLAP have been studied in detail, a majority of current approaches has not put its focus on the dynamic aspect of the multidimensional design and/or semantic enriched impact model. Within the scope of this paper, we present a flexible approach to model multidimensional databases in the context of dynamic web-based analysis and adaptive users’ requirements. The introduced formal approach is based on hypergraphs with the ability to provide formal constructs specifying the different types of multidimensional elements and relationships which enable the support of highly customized business analysis. The introduced hypergraphs are used to formally define the semantics of multidimensional models with unstructured ad-hoc analytic activities. The proposed model also supports a formal representation of advanced concepts like dynamic hierarchies, many-to-many associations, additivity constraints etc. Some scenario example are also provided to motivate and illustrate the proposed approach.

If you like illustrations of technologies with examples from the banking industry, this is the paper on hypergraphs for you.

Besides, banks are where they keep the money. 😉

Seriously, a very well illustrated introduction to the use of hypergraphs and multidimensional data modeling, plus why multidimensional data models matter to clients. (Another place where they keep money.)

Survey of graph database models

Filed under: Database,Graphs,Networks — Patrick Durusau @ 5:07 am

Survey of graph database models by Renzo Angles and Claudio Gutierrez. (ACM Computing Surveys (CSUR), Volume 40, Issue 1, February 2008, Article No. 1)

Abstract:

Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.

If you need an antidote for graph database hype, look no further than this thirty-nine (39) page survey article.

You will come away with a deeper appreciation for graph databases and their history.

If you are looking for a self-improvement reading program, you could do far worse than starting with this article and reading the cited references one by one.

February 13, 2013

Saving the “Semantic” Web (part 4)

Filed under: RDF,Semantic Diversity,Semantic Web,Semantics — Patrick Durusau @ 4:15 pm

Democracy vs. Aristocracy

Part of a recent comment on this series reads:

What should we have been doing instead of the semantic web? ISO Topic Maps? There is some great work in there, but has it been a better success?

That is an important question and I wanted to capture it outside of comments on a prior post.

Earlier in this series of posts I pointed out the success of HTML, especially when contrasted with Semantic Web proposals.

Let me hasten to add the same observation is true for ISO Topic Maps (HyTime or later versions).

The critical difference between HTML (the early and quite serviceable versions) and Semantic Web/Topic Maps is that the former democratizes communication and the latter fosters a technical aristocracy.

Every user who can type, and some who hunt-n-peck, can author HTML and publish their content for others around the world to read, discuss, etc.

That is a very powerful and democratizing notion about content creation.

The previous guardians, gate keepers, insiders, and their familiars, who didn’t add anything of value to prior publication processes, are still reeling from the blow.

Even as old aristocracies crumble, new ones evolve.

Technical aristocracies for example. A phrase relevant to both the Semantic Web and ISO Topic Maps.

Having tasted freedom, the crowds aren’t as accepting of the lash/leash as they once were. Nor of the aristocracies who would wield them. Nor should they be.

Which makes me wonder: Why the emphasis on creating dumbed down semantics for computers?

We already have billions of people who are far more competent semantically than computers.

Where are our efforts to enable them to traverse the different semantics of other users?

Such as the semantics of the aristocrats who have anointed themselves to labor on their behalf?

If you have guessed that I have little patience with aristocracies, you are right in one.

I came by that aversion honestly.

I practiced law in a civilian jurisdiction for a decade. A specialist language, law, can be more precise, but it also excludes others from participation. The same proved true when I studied theology and ANE languages. A bit later, in markup technologies (then SGML/HyTime), the same lesson was repeated. What I do now with ODF and topic maps involves two more specialized languages.

Yet a reasonably intelligent person can discuss issues in any of those fields, if they can get past the language barriers aristocrats take so much comfort in maintaining.

My answer to what we should be doing is:

Looking for ways to enable people to traverse and enjoy the semantic diversity that accounts for the richness of the human experience.

PS: Computers have a role to play in that quest, but a subordinate one.


Designing to Reward our Tribal Sides

Filed under: Design,Interface Research/Design,Usability — Patrick Durusau @ 3:07 pm

Designing to Reward our Tribal Sides by Nir Eyal.

From the post:

We are a species of beings that depend on one another. Scientists theorize humans have specially adapted neurons that help us feel what others feel, providing evidence that we survive through our empathy for others. We’re meant to be part of a tribe and our brains seek out rewards that make us feel accepted, important, attractive, and included.

Many of our institutions and industries are built around this need for social reinforcement. From civic and religious groups to spectator sports, the need to feel social connectedness informs our values and drives much of how we spend our time.

Communication technology in particular has given rise to a long history of companies that have provided better ways of delivering what I call, “rewards of the tribe.”

However, it’s not only the reward we seek. Variability also keeps us engaged. From the telegraph to email, products that connect us are highly valued, but those that invoke an element of surprise are even more so. Recently, the explosion of Web technologies that cater to our insatiable search for validation provides clear examples of the tremendous appeal of the promise of social reward.

Do you capture the Stack Overflow lesson in your UI?

UI design that builds on and rewards our hard wiring seems like a good idea to me.
