Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 7, 2015

RawTherapee

Filed under: Image Processing,Topic Maps,Visualization — Patrick Durusau @ 4:41 pm

RawTherapee

From the RawPedia (Getting Started)

RawTherapee is a cross-platform raw image processing program, released under the GNU General Public License Version 3. It was originally written by Gábor Horváth of Budapest. Rather than being a raster graphics editor such as Photoshop or GIMP, it is specifically aimed at raw photo post-production. And it does it very well – at a minimum, RawTherapee is one of the most powerful raw processing programs available. Many of us would make bigger claims…

At intervals of a bit more than a month to roughly two months, there is a Play Raw competition with an image and voting (plus commentary along the way).

Very impressive!

Thoughts on topic map competitions?

I first saw this in a tweet by Neil Saunders.

January 25, 2015

Understanding Context

Filed under: Context,Context Models,Context-aware,Topic Maps — Patrick Durusau @ 8:43 pm

Understanding Context by Andrew Hinton.

From the post:

Technology is destabilizing the way we understand our surroundings. From social identity to ubiquitous mobility, digital information keeps changing what here means, how to get there, and even who we are. Why does software so easily confound our perception and scramble meaning? And how can we make all this complexity still make sense to our users?

Understanding Context — written by Andrew Hinton of The Understanding Group — offers a powerful toolset for grasping and solving the challenges of contextual ambiguity. By starting with the foundation of how people perceive the world around them, it shows how users touch, navigate, and comprehend environments made of language and pixels, and how we can make those places better.

Understanding Context is ideal for information architects, user experience professionals, and designers of digital products and services of any scope. If what you create connects one context to another, you need this book.


Amazon summarizes in part:

You’ll discover not only how to design for a given context, but also how design participates in making context.

  • Learn how people perceive context when touching and navigating digital environments
  • See how labels, relationships, and rules work as building blocks for context
  • Find out how to make better sense of cross-channel, multi-device products or services
  • Discover how language creates infrastructure in organizations, software, and the Internet of Things
  • Learn models for figuring out the contextual angles of any user experience

This book is definitely going on my birthday wish list at Amazon. (There, done!)

Looking forward to a slow read; in the meantime, I will start looking for items from the bibliography.

My question, of course, is: after expending all the effort to discover and/or design a context, how do I pass that context on to another?

To someone coming from a slightly different context? (Assuming always that the designer is “in” a context.)

From a topic map perspective, what subjects do I need to represent to capture a visual context? Even more difficult, what properties of those subjects do I need to capture to enable their discovery by others? Or to facilitate mapping those subjects to another context/domain?

Definitely a volume I would assign as reading for a course on topic maps.

I first saw this in a tweet by subjectcentric.

January 21, 2015

TM-Gen: A Topic Map Generator from Text Documents

Filed under: Authoring Topic Maps,Text Mining,Topic Maps — Patrick Durusau @ 4:55 pm

TM-Gen: A Topic Map Generator from Text Documents by Angel L. Garrido, et al.

From the post:

The vast amount of text documents stored in digital format is growing at a frantic rhythm each day. Therefore, tools able to find accurate information by searching in natural language information repositories are gaining great interest in recent years. In this context, there are especially interesting tools capable of dealing with large amounts of text information and deriving human-readable summaries. However, one step further is to be able not only to summarize, but to extract the knowledge stored in those texts, and even represent it graphically.

In this paper we present an architecture to generate automatically a conceptual representation of knowledge stored in a set of text-based documents. For this purpose we have used the topic maps standard and we have developed a method that combines text mining, statistics, linguistic tools, and semantics to obtain a graphical representation of the information contained therein, which can be coded using a knowledge representation language such as RDF or OWL. The procedure is language-independent, fully automatic, self-adjusting, and it does not need manual configuration by the user. Although the validation of a graphic knowledge representation system is very subjective, we have been able to take advantage of an intermediate product of the process to make an experimental validation of our proposal.

Of particular note on the automatic construction of topic maps:

Addition of associations:

TM-Gen adds to the topic map the associations between topics found in each sentence. These associations are given by the verbs present in the sentence. TM-Gen performs this task by searching the subject included as topic, and then it adds the verb as its association. Finally, it links its verb complement with the topic and with the association as a new topic.

Depending on the archive, one would expect associations between authors and articles, but also topics within articles, to say nothing of dates, publications, etc. Once established, a user can request a view with more or less detail. If not captured, however, more detail will not be available.

There is only a general description of TM-Gen but enough to put you on the way to assembling something quite similar.
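If you want to experiment along those lines, here is a minimal sketch of the association-building step described above, assuming sentences have already been reduced to (subject, verb, complement) triples by whatever parser you prefer. The triples and the record layout are my own illustration, not TM-Gen's actual code.

```python
from collections import defaultdict

# Hypothetical output of a parsing step: (subject, verb, complement) per sentence.
triples = [
    ("RawTherapee", "processes", "raw images"),
    ("Gabor Horvath", "wrote", "RawTherapee"),
]

topics = {}        # topic id -> topic record
associations = []  # association records linking topics by role

def topic_for(label):
    """Create (or reuse) a topic for a label, keyed on the label itself."""
    tid = label.lower()
    topics.setdefault(tid, {"id": tid, "names": {label}})
    return tid

for subj, verb, comp in triples:
    s_id = topic_for(subj)   # the sentence subject becomes a topic
    c_id = topic_for(comp)   # the verb complement becomes a topic as well
    associations.append({
        "type": verb,        # the verb supplies the association type
        "roles": {"subject": s_id, "complement": c_id},
    })

for assoc in associations:
    print(assoc)
```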

TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions

Filed under: Recommendation,Topic Maps — Patrick Durusau @ 3:37 pm

TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions by Angel L. Garrido and Sergio Ilarri.

Abstract:

Recommendation systems have become increasingly popular these days. Their utility has been proved to filter and to suggest items archived at web sites to the users. Even though recommendation systems have been developed for the past two decades, existing recommenders are still inadequate to achieve their objectives and must be enhanced to generate appealing personalized recommendations effectively. In this paper we present TMR, a context-independent tool based on topic maps that works with item’s descriptions and reviews to provide suitable recommendations to users. TMR takes advantage of lexical and semantic resources to infer users’ preferences and thus the recommender is not restricted by the syntactic constraints imposed on some existing recommenders. We have verified the correctness of TMR using a popular benchmark dataset.

One of the more exciting aspects of this paper is the building of topic maps from free texts that are then used in the recommendation process.

I haven’t seen the generated topic maps (yet) but suspect that editing an existing topic map is far easier than creating one ab initio.

January 2, 2015

Augmented Reality or Same Old Sh*t (just closer)

ODG just set the new bar for augmented reality by Signe Brewster.

From the post:

Back in the fall of 2014, a little-known San Francisco company called ODG released two pairs of augmented reality glasses. While the industry’s software companies were busy hawking Epson’s respectable BT-200 glasses, developers were telling me something different: It’s all about ODG.

Now, ODG is expanding into the consumer space with a new headset it will announce at the CES conference. The yet-to-be named glasses will be designed similarly to Wayfarer sunglasses (every consumer augmented reality company’s choice these days) and weigh a relatively light 125 grams. They run on an integrated battery and work with ODG’s series of input devices, plus anything else that relies on Bluetooth. They will cost less than $1,000 and are scheduled to be released by the end of the year. ODG will debut a new software platform next week to complement the glasses. It all runs on Android.

Signe’s post isn’t long on details but she does have direct experience using the ODG headsets. Most of the rest of us will have to wait until the end of 2015. Rats!

In the meantime, however, I suspect you are going to be more interested in the developer resources:

Developer Resources

ODG supports Developers through its ReticleOS™ SDK and Developer Support Site with API documentation, tutorials, sample code, UI/UX guide, and forums that will allow developers to program new applications and modify existing ones. You can also apply for a 25% discount for glasses, up to 2 sets.

In Q4, we will offer a hardware development kit consisting of same board, sensors, controls, and camera as in the glasses with an HDMI out and serial port.

Reticle OS Marketplace

Follow our UI/UX suggestions and your app can have a home in the future ODG App Marketplace to be launched shortly. For app and in-app products that you sell on the ODG marketplace, the transaction fee will be equivalent to 25% of the price.

My primary interest is in the authoring of data that could then be used by applications for ODG headsets.

For example, (speculation follows) you ask the interface for the latest news on your congressional representative, Rep. Scalise. Assume it has been discovered that he is a known associate of a former leader of the KKK. Do you really want every link to every story on Rep. Scalise?

Wouldn’t you prefer a de-duped news feed that gave you one link? To the most complete story on that issue, with the rest suppressed? When you have time to waste, you can return to the story and pursue the endless repetition without new information, just like on CNN.

Is your augmented reality going to be better than your everyday reality or is it going to be the same old sh*t, just closer to your eyes?

December 27, 2014

Accidental vs Deliberate Context

Filed under: Communication,Communities of Practice,Context,Diversity,Topic Maps — Patrick Durusau @ 2:16 pm

Accidental vs Deliberate Context by Jessica Kerr.

From the post:

In all decisions, we bring our context with us. Layers of context, from what we read about that morning to who our heroes were growing up. We don’t realize how much context we assume in our communications, and in our code.

One time I taught someone how to make the Baby Vampire face. It involves poking out both corners of my lower lip, so they stick up like poky gums. Very silly. To my surprise, the person couldn’t do it. They could only poke one side of the lower lip out at a time.


Turns out, few outside my family can make this face. My mom can do it, my sister can do it, my daughters can do it – so it came as a complete surprise to me when someone couldn’t. There is a lip-flexibility that’s part of my context, always has been, and I didn’t even realize it.

Jessica goes on to illustrate that communication depends upon the existence of some degree of shared context and that additional context can be explained to others, as on a team.

She distinguishes between “incidental” shared contexts and “deliberate” shared contexts. Incidental contexts arise from family or long association with friends; common or shared experiences form an incidental context.

Deliberate contexts, on the other hand, are the intentional melding of a variety of contexts, in her examples, the contexts of biologists and programmers. Who at the outset, lacked a common context in which to communicate.

Forming teams with diverse backgrounds is a way to create a “deliberate” context, but my question would be how to preserve that “deliberate” context for others? It becomes an “incidental” context if others must join the team in order to absorb the previously “deliberate” context. If that is a requirement, then others will not be able to benefit from deliberately created contexts in which they did not participate.

If the process and decisions made in forming a “deliberate” context were captured by a topic map, then others could apply this “new” deliberate context to develop other “deliberate” contexts. Perhaps some of the decisions or mappings made would not suit another “deliberate” context but perhaps some would. And perhaps other “deliberate” contexts would evolve beyond the end of their inputs.

The point being that unless these “deliberate” contexts are captured, to whatever degree of granularity is desired, every “deliberate” context for say biologists and programmers is starting off at ground zero. Have you ever heard of a chemistry experiment starting off by recreating the periodic table? I haven’t. Perhaps we should abandon that model in the building of “deliberate” contexts as well.

Not to mention that re-usable “deliberate” contexts might enable greater diversity in teams.

Topic maps anyone?

PS: I suggest topic maps to capture “deliberate” context because topic maps are not constrained by logic. You can capture any subject and any relationship between subjects, logical or not. For example, a user of a modern dictionary, which lists words in alphabetical order, would be quite surprised if given a dictionary of Biblical Hebrew and asked to find a word (assuming they know the alphabet). The most common dictionaries of Biblical Hebrew list words by their roots and not as they appear to the common reader. There are arguments to be made for each arrangement but neither one is a “logical” answer.

The arrangement of dictionaries is another example of differing contexts. With a topic map I can offer a reader whichever Biblical Hebrew dictionary is desired, with only one text underlying both displays. As opposed to the printed version which can offer only one context or another.

December 23, 2014

5 Ways to Find Trending Topics (Other than Twitter)

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:52 pm

5 Ways to Find Trending Topics (Other than Twitter) by Elisabeth Michaud.

From the post:

Like every community or social media manager, one type of social media content you’re likely to share is posts that play on what’s happening in the world– the trends of the day, week, or month. To find content for these posts, many of you are probably turning to Twitter’s Trending Topics–that friendly little section on the left-hand side of your browser when you visit Twitter.com, and something that can be personalized (or not) to what Twitter thinks you’ll be most interested in.

We admit that Trending Topics are pretty handy when it comes to inspiring content, but it’s also the same place EVERY. OTHER. BRAND (and probably your competitors) is looking for content ideas. Boring! Today, we’ve got 5 other places you can look for trending stories to inspire you.

Not recent, but I think Elisabeth’s tips bear repeating. At least if you are interested in creating popular topic maps. That is, topic maps that may be of interest to someone other than yourself. 😉

I still aspire to create a topic map of the Chicago Assyrian Dictionary by using Tesseract to extract the text from image-based PDFs, etc., but the only buyers for that item would be me and the folks at the Oriental Archives at the University of Chicago. Maybe a few others, but not something you want to bet the rent on.

Beyond Elisabeth’s suggestions, which are all social media, I would suggest you also monitor:

CNN

Guardian (UK edition)

New York Times

Spiegel Online International

The Wall Street Journal

To see if you can pick up trends in stories there as well.

The biggest problem with news channels is that stories blow hot and cold, and it isn’t possible to know ahead of time which ones will last (like the Michael Brown shooting) and which ones are going to be dropped like a hot potato (the CIA torture report).

One suggestion would be to create a Twitter account to follow a representative sample of main news outlets and keep a word count, excluding noise words, on a weekly and monthly basis. Anything that spans more than a week is likely to be a persistent topic of interest. At least to someone.
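Here is a minimal sketch of that kind of word-count tracking, assuming you already have dated headlines harvested from the accounts you follow; the sample headlines and stop-word list are placeholders.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Placeholder data: (date, headline) pairs harvested from the accounts you follow.
headlines = [
    (date(2014, 12, 20), "CIA torture report dominates Senate debate"),
    (date(2014, 12, 27), "Senate debate on torture report continues"),
    (date(2015, 1, 3), "Ferguson protests continue after grand jury decision"),
]

STOP_WORDS = {"the", "on", "after", "continues", "continue"}

weekly_counts = defaultdict(Counter)   # (year, ISO week) -> word counts
for day, text in headlines:
    year, week, _ = day.isocalendar()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in STOP_WORDS:
            weekly_counts[(year, week)][word] += 1

# A word appearing in more than one week is a candidate persistent topic.
weeks_per_word = Counter()
for counts in weekly_counts.values():
    weeks_per_word.update(counts.keys())

persistent = [w for w, n in weeks_per_word.items() if n > 1]
print(persistent)   # e.g. words like 'torture', 'report', 'senate', 'debate'
```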

And when something flares up in social media, you can track it there as well. Like #gamergate. Where are you going to find a curated archive of all the tweets and other social media messages on that topic? Where you can track the principals, aggregate content, etc.? You could search for that now but I suspect some of it is already missing or edited.

The ultimate question is not whether topic maps as a technology are popular, but rather whether topic maps deliver a value-add for information that is of interest to others.

Is that a Golden Rule (A rule that will make you some gold.)?

Provide unto others the information they want

PS: Don’t confuse “provide” with “give.” The economic model for “providing” is your choice.

December 14, 2014

Machine Learning: The High-Interest Credit Card of Technical Debt (and Merging)

Filed under: Machine Learning,Merging,Topic Maps — Patrick Durusau @ 1:55 pm

Machine Learning: The High-Interest Credit Card of Technical Debt by D. Sculley, et al.

Abstract:

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Under “entanglement” (referring to inputs) the authors announce the CACE principle:

Changing Anything Changes Everything

The net result of such changes is that prediction behavior may alter, either subtly or dramatically, on various slices of the distribution. The same principle applies to hyper-parameters. Changes in regularization strength, learning settings, sampling methods in training, convergence thresholds, and essentially every other possible tweak can have similarly wide ranging effects.

Entanglement is a native condition in topic maps as a result of the merging process. Yet, I don’t recall there being much discussion of how to evaluate the potential for unwanted entanglement or how to avoid entanglement (if desired).

You may have topics in a topic map where merging with later additions to the topic map is to be avoided. Perhaps to avoid the merging of spam topics that would otherwise overwhelm your content.

One way to avoid that, while still allowing users to use links reported as subjectIdentifiers and subjectLocators under the TMDM, would be to not report those properties to the topic map engine for some set of topics. The only property they could merge on would be their topic ID, which hopefully you have concealed from public users.

Not unlike the Unix tradition where certain low-numbered ports are unavailable to any user other than root. Topics with IDs below N are skipped by the topic map engine for merging purposes, unless the merging is invoked by the equivalent of root.

No change in current syntax or modeling required, although a filter on topic IDs would need to be implemented to add this to current topic map applications.
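A rough sketch of such a filter, under my own assumption that topics are plain records with numeric IDs and that the engine computes merge keys per topic:

```python
PRIVILEGED_BELOW = 1000   # topics with IDs below this never merge for ordinary users

def merge_keys(topic, as_root=False):
    """Return the keys a topic may merge on; privileged topics expose none
    unless merging is invoked by the equivalent of root."""
    if topic["id"] < PRIVILEGED_BELOW and not as_root:
        return set()                      # withhold identifiers from the engine
    keys = set(topic.get("subject_identifiers", []))
    keys |= set(topic.get("subject_locators", []))
    return keys

def merge_pairs(topics, as_root=False):
    """Yield pairs of topic IDs that share at least one merge key."""
    seen = {}                             # key -> first topic id seen with it
    for t in topics:
        for key in merge_keys(t, as_root):
            if key in seen and seen[key] != t["id"]:
                yield (seen[key], t["id"])
            else:
                seen[key] = t["id"]

topics = [
    {"id": 7, "subject_identifiers": ["http://example.org/si/spam"]},
    {"id": 4321, "subject_identifiers": ["http://example.org/si/spam"]},
]
print(list(merge_pairs(topics)))                # [] because topic 7 is protected
print(list(merge_pairs(topics, as_root=True)))  # [(7, 4321)]
```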

I am sure there are other ways to prevent merging of some topics but this seems like a simple way to achieve that end.

Unfortunately it does not address the larger question of the “technical debt” incurred to maintain a topic map of any degree of sophistication.

Thoughts?

I first saw this in a tweet by Elias Ponvert.

December 2, 2014

Promoting Topic Maps (and writing)

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:11 pm

Ted Underwood posted a tweet today that seems relevant to marketing topic maps:

When Sumeria got psyched about writing, I bet they spent the first two decades mostly traveling around giving talks about writing.

I think Ted has a very good point.

You?

November 30, 2014

What is Walmart Doing Right and Topic Maps Doing Wrong?

Filed under: Advertising,Marketing,Topic Maps — Patrick Durusau @ 12:59 pm

Sentences to ponder by Chris Blattman.

From the post:

Walmart reported brisk traffic overnight. The retailer, based in Bentonville, Ark., said that 22 million shoppers streamed through stores across the country on Thanksgiving Day. That is more than the number of people who visit Disney’s Magic Kingdom in an entire year.

A blog at the Wall Street Journal suggests the numbers are even better than those reported by Chris:

Wal-Mart said it had more than 22 million customers at its stores between 6 p.m. and 10 p.m. Thursday, similar to its numbers a year ago.

In four (4) hours, Walmart had more customers than visit Disney’s Magic Kingdom in a year.

Granting that, as of October 31, 2014, Walmart had forty-nine hundred and eighty-seven (4,987) locations in the United States, that remains an impressive number.

Suffice it to say the number of people actively using topic maps is substantially less than the Thanksgiving customer numbers for Walmart.

I don’t have the answer to the title question.

Asking you to ponder it as you do holiday shopping.

What is different about your online or offline shopping experience compared to your experience with topic maps? Or about the pre- or post-shopping experience?

I will take this question up again after the first of 2015 so be working on your thoughts and suggestions over the holiday season.

Thanks!

November 29, 2014

VSEARCH

Filed under: Bioinformatics,Clustering,Merging,Topic Maps — Patrick Durusau @ 4:15 pm

VSEARCH: Open and free 64-bit multithreaded tool for processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling

From the webpage:

The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:

  • have open source code with an appropriate open source license
  • be free of charge, gratis
  • have a 64-bit design that handles very large databases and much more than 4GB of memory
  • be as accurate or more accurate than usearch
  • be as fast or faster than usearch

We have implemented a tool called VSEARCH which supports searching, clustering, chimera detection, dereplication, sorting and masking (commands --usearch_global, --cluster_smallmem, --cluster_fast, --uchime_ref, --uchime_denovo, --derep_fulllength, --sortbysize, --sortbylength and --maskfasta, as well as almost all their options).

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

The same option names as in USEARCH version 7 has been used in order to make VSEARCH an almost drop-in replacement.

The reconciliation of differing characteristics is the only way that merging in topic maps varies from the clustering found in bioinformatics programs like VSEARCH. The results are a cluster of items deemed “similar” on some basis and, with topic maps, subject to further processing.
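As a toy illustration of that further processing, here is a sketch (mine, not VSEARCH's) that clusters records on a shared key and then reconciles their differing characteristics instead of discarding them:

```python
from collections import defaultdict

# Records deemed "similar" because they share a gene identifier;
# their remaining characteristics differ and must be reconciled, not dropped.
records = [
    {"gene": "BRCA1", "organism": "human", "source": "db-A"},
    {"gene": "BRCA1", "organism": "Homo sapiens", "source": "db-B"},
    {"gene": "TP53", "organism": "human", "source": "db-A"},
]

clusters = defaultdict(list)
for rec in records:
    clusters[rec["gene"]].append(rec)      # the clustering step

def reconcile(cluster):
    """Merge a cluster into one record, keeping every distinct value."""
    merged = defaultdict(set)
    for rec in cluster:
        for field, value in rec.items():
            merged[field].add(value)
    return {field: sorted(values) for field, values in merged.items()}

for gene, cluster in clusters.items():
    print(gene, reconcile(cluster))
```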

Scaling isn’t easy in bioinformatics but it hasn’t been found daunting either.

There is much to be learned from projects such as VSEARCH to inform the processing of topic maps.

I first saw this in a tweet by Torbjørn Rognes.

November 20, 2014

Senate Republicans are getting ready to declare war on patent trolls

Filed under: Intellectual Property (IP),Topic Maps — Patrick Durusau @ 7:00 pm

Senate Republicans are getting ready to declare war on patent trolls by Timothy B. Lee

From the post:

Republicans are about to take control of the US Senate. And when they do, one of the big items on their agenda will be the fight against patent trolls.

In a Wednesday speech on the Senate floor, Sen. Orrin Hatch (R-UT) outlined a proposal to stop abusive patent lawsuits. “Patent trolls – which are often shell companies that do not make or sell anything – are crippling innovation and growth across all sectors of our economy,” Hatch said.

Hatch, the longest-serving Republican in the US Senate, is far from the only Republican in Congress who is enthusiastic about patent reform. The incoming Republican chairmen of both the House and Senate Judiciary committees have signaled their support for patent legislation. And they largely see eye to eye with President Obama, who has also called for reform.

“We must improve the quality of patents issued by the U.S. Patent and Trademark Office,” Hatch said. “Low-quality patents are essential to a patent troll’s business model.” His speech was short on specifics here, but one approach he endorsed was better funding for the patent office. That, he argued, would allow “more and better-trained patent examiners, more complete libraries of prior art, and greater access to modern information technologies to address the agency’s growing needs.”

I would hate to agree with Senator Hatch on anything, but there is no doubt that low-quality patents are rife at the U.S. Patent and Trademark Office. Whether patent trolls simply took advantage of low-quality patents or are responsible for them is hard to say.

In any event, the call for “…more complete libraries of prior art, and greater access to modern information technologies…” sounds like a business opportunity for topic maps.

After all, we all know that faster, more comprehensive search engines for the patent literature only give you more material to review. They don’t give you more relevant material to review, or material you did not know to look for. Only additional semantics has the power to accomplish either of those tasks.

There are those who will keep beating bags of words in hopes that semantics will appear.

Don’t be one of those. Choose an area of patents of interest and use interactive text mining to annotate existing terms with semantics (subject identity) which will reduce misses and increase the usefulness of “hits.”

That isn’t a recipe for mining all existing patents but who wants to do that? If you gain a large enough semantic advantage in genomics, semiconductors, etc., the start-up cost to catch up will be a tough nut to crack. Particularly since you are already selling a better product for a lower price than a start-up can match.

I first saw this in a tweet by Tim O’Reilly.

PS: A better solution for software patent trolls would be a Supreme Court ruling that eliminates all software patents. Then Congress could pass a software copyright bill that grants copyright status on published code for three (3) years, non-renewable. If that sounds harsh, consider the credibility impact of nineteen-year-old bugs.

If code had to be recast every three years and all vendors were on the same footing, there would be a commercial incentive for better software. Yes? If I had the coding advantages of a major vendor, I would start lobbying for three (3) year software copyrights tomorrow. Besides, it would make software piracy a lot easier to track.

November 19, 2014

Less Than Universal & Uniform Indexing

Filed under: Indexing,Topic Maps — Patrick Durusau @ 1:32 pm

In Suffix Trees and their Applications in String Algorithms, I pointed out that a subset of the terms for “suffix tree” resulted in “About 1,830,000 results (0.22 seconds).”

Not a very useful result, even for the most dedicated of graduate students. 😉

A better result would be an indexing entry for “suffix tree” that included results using its alternative names and enabled the user to quickly navigate to sub-entries under “suffix tree.”

To illustrate the benefit from actual indexing, consider that “Suffix Trees and their Applications in String Algorithms” lists only three keywords: “Pattern matching, String algorithms, Suffix tree.” Would you look at this paper for techniques on software maintenance?

Probably not, which would be a mistake. Section 4 covers the use of “parameterized pattern matching” for software maintenance of large programs in a fair amount of depth. Certainly more so than it covers “multidimensional pattern matching,” which is mentioned in the abstract and in the conclusion but not elsewhere in the paper (“higher dimensions” is mentioned on page 3, but only in two sentences with references), despite being presented there as a major theme of the paper.

A properly constructed index would break out both “parameterized pattern matching” and “software maintenance” as key subjects that occur in this paper. A bit easier to find than wading through 1,830,000 “results.”

Before anyone comments that such granular indexing would be too time-consuming or expensive, recall the citation rates for computer science, 2000–2010:

Field              2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  All years
Computer science   7.17  7.66  7.93  5.35  3.99  3.51  2.51  3.26  2.13  0.98  0.15  3.75

From: Citation averages, 2000-2010, by fields and years

The reason for the declining numbers is that more recently published papers have had less time to accumulate citations.

But the highest percentage rate, 7.93 in 2002, is far less than the total number of papers published in 2000.

At one point in journal publication history, manual indexing was universal. But that was before full text searching became a reality and the scientific publication rate exploded.


The STM Report by Mark Ware and Michael Mabe.

Rather than an all-human indexing model (not possible due to the rate of publication and the costs) or an all-computer search model (which leads to poor results, as described above), why not consider a bifurcated indexing/search model?

The well over 90% of CS publications that aren’t cited should be subject to computer-based indexing and search models. On the other hand, the meager 8% or so that are cited, perhaps weighted by citation count, could be curated with human/machine-assisted indexing.

Human/machine-assisted indexing would increase access to material already selected by other readers. Perhaps even as a value-add product, as opposed to take-your-chances search access.
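A sketch of that routing decision, with the citation threshold and the paper records invented for illustration:

```python
CITATION_THRESHOLD = 1   # papers cited at least this often get human attention

papers = [
    {"title": "Suffix trees and their applications", "citations": 12},
    {"title": "Yet another workshop paper", "citations": 0},
]

def indexing_route(paper):
    """Route cited papers to curated indexing, the rest to automated indexing."""
    if paper["citations"] >= CITATION_THRESHOLD:
        return "human/machine-assisted indexing queue"
    return "computer-based indexing and search only"

for paper in papers:
    print(paper["title"], "->", indexing_route(paper))
```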

November 18, 2014

Topic Maps By Another Name

Filed under: Merging,Topic Maps — Patrick Durusau @ 11:32 am

Data Integration up to 90% Faster and Cheaper by Marty Loughlin.

From the post:


A powerful new approach to addressing this challenge involves using semantic web technology as the “data glue” to guide integration and dramatically simplify the process. There are several key components to this approach:

  • Using semantic models to describe data in standard business terms (e.g., FIBO, CDISC, existing enterprise model etc.)
  • Mapping source and target data to the semantic model instead of directly from source to target
  • Combining these maps as needed to create end-to-end semantic descriptions of ETL jobs
  • Automatically generating ETL code from the semantic descriptions for leading ETL tools (e.g., Informatica and Pentaho)

There are significant benefits to this approach:

  • Data integration can be done by business analysts with minimal IT involvement
  • Adding a new source or target only requires an expert in that system to map to the common model as all maps are reusable
  • The time and cost to do an integration project can be reduced up to 90%
  • Projects can be repurposed to a new ETL tool with the click of a mouse
  • The semantic model that describes that data, sources, maps and transformation is always up-to-date and can be queried for data meaning and lineage

The mapping of the source and target data to a semantic model is one use for a topic map. The topic map itself is then a data store to be queried using the source or target data models.
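A minimal sketch of that mapping style, with made-up source and target schemas and an invented common model; the point is only that both sides map to the model rather than directly to each other.

```python
# Invented common semantic model terms and per-system field mappings.
SOURCE_TO_MODEL = {          # legacy CRM -> common model
    "cust_nm": "customer_name",
    "cust_dob": "date_of_birth",
}
TARGET_FROM_MODEL = {        # common model -> warehouse schema
    "customer_name": "CUSTOMER.FULL_NAME",
    "date_of_birth": "CUSTOMER.BIRTH_DATE",
}

def end_to_end_map(source_field):
    """Compose source->model and model->target maps into a source->target map."""
    model_term = SOURCE_TO_MODEL[source_field]
    return TARGET_FROM_MODEL[model_term]

record = {"cust_nm": "Ada Lovelace", "cust_dob": "1815-12-10"}
translated = {end_to_end_map(field): value for field, value in record.items()}
print(translated)
# {'CUSTOMER.FULL_NAME': 'Ada Lovelace', 'CUSTOMER.BIRTH_DATE': '1815-12-10'}
```

Adding a new source then only requires a new source-to-model dictionary; the model-to-target maps are untouched, which is where the claimed reuse comes from.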

The primary differences (there are others) between topic maps and “data glue” are that topic maps don’t necessarily use MS Excel spreadsheets and aren’t called “data glue.”

I do appreciate Cambridge Semantics determining that a topic map-like mapping approach can save 90% on data integration projects.

That sounds a bit optimistic but marketing literature is always optimistic.

November 11, 2014

I/O Problem @ OpenStreetMap France

Filed under: Clustering,Mapping,Maps,Topic Maps — Patrick Durusau @ 2:54 pm

Benefit of data clustering for osm2pgsql/mapnik rending by Christian Quest.

The main server for OpenStreetMap France had an I/O problem.

See Christian’s post for the details, but the essence of the solution was to cluster geographic data on the basis of its location, to reduce the amount of I/O. Not unlike random seeks for topics with similar characteristics.

How much did clustering reduce the I/O?


I/O utilization dropped from nearly 100% to about 15%, an 85% improvement.

An 85% improvement in I/O doesn’t look bad on a weekly/monthly activity report!

Now imagine clustering topics for dynamic merging and presentation to a user. Among other things, you can have an “auditing” view that shows all the topics that will merge to form a single topic in a presentation view.

Or a “pay-per-view” view that uses a different cluster to reveal more information for paying customers.

All while retaining the capacity to produce a serialized static file as an information product.
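For the curious, here is a toy sketch of the underlying idea: order records by an interleaved (Morton/Z-order) key so that items near each other on the map end up near each other on disk. The coordinates and the key precision are invented for illustration.

```python
def morton_key(lat, lon, bits=16):
    """Interleave the bits of quantized latitude and longitude so that
    nearby points tend to get nearby keys (better I/O locality when sorted)."""
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

points = [
    ("Paris",     48.8566,  2.3522),
    ("Lyon",      45.7640,  4.8357),
    ("Marseille", 43.2965,  5.3698),
    ("Brest",     48.3904, -4.4861),
]

# Writing rows in this order clusters geographically close data together.
for name, lat, lon in sorted(points, key=lambda p: morton_key(p[1], p[2])):
    print(name)
```

The same trick applies to topics: store topics that are likely to merge under nearby keys, and a dynamic merging pass touches far fewer pages.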

November 6, 2014

Caselaw is Set Free, What Next? [Expanding navigation/search targets]

Filed under: Law,Law - Sources,Legal Informatics,Topic Maps — Patrick Durusau @ 1:31 pm

Caselaw is Set Free, What Next? by Thomas Bruce, Director, Legal Information Institute, Cornell.

Thomas provides a great history of Google Scholar’s caselaw efforts and its impact on the legal profession.

More importantly, at least to me, were his observations on how to go beyond the traditional indexing and linking in legal publications:

A trivial example may help. Right now, a full-text search for “tylenol” in the US Code of Federal Regulations will find… nothing. Mind you, Tylenol is regulated, but it’s regulated as “acetaminophen”. But if we link up the data here in Cornell’s CFR collection with data in the DrugBank pharmaceutical collection, we can automatically determine that the user needs to know about acetaminophen — and we can do that with any name-brand drug in which acetaminophen is a component. By classifying regulations using the same system that science librarians use to organize papers in agriculture, we can determine which scientific papers may form the rationale for particular regulations, and link the regulations to the papers that explain the underlying science. These techniques, informed by emerging approaches in natural-language processing and the Semantic Web, hold great promise.

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know. By using technology to vastly expand the number of things that can meaningfully and precisely be submitted for search, we can dramatically improve results for a wide swath of users. In our shop, we refer to this as the process of “getting from barking dog to nuisance”, an in-joke that centers around mapping a problem expressed in real-world terms to a legal concept. Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

(emphasis added)

The first line I bolded in the quote:

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know.

captures the essence of a topic map. Yes? That is, a user navigates or queries a topic map on the basis of terms they already know. In so doing, they can find other terms that are interchangeable with theirs, but more importantly, if information is indexed using a different term than theirs, they can still find the information.
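A minimal sketch of that exchange, using the Tylenol/acetaminophen example from the quote; the subject table and index below are hand-made stand-ins for whatever topic map or DrugBank-derived data you actually have.

```python
# Hand-made stand-in for a topic map: each subject carries every name it is known by.
SUBJECTS = [
    {"id": "acetaminophen", "names": {"acetaminophen", "tylenol", "paracetamol"}},
    {"id": "ibuprofen", "names": {"ibuprofen", "advil"}},
]

# Documents indexed under whichever term their authors happened to use.
INDEX = {
    "acetaminophen": ["a regulation that only mentions acetaminophen"],
    "advil": ["a document that only mentions Advil"],
}

def expand(term):
    """Exchange the term the searcher knows for every name of the same subject."""
    for subject in SUBJECTS:
        if term.lower() in subject["names"]:
            return subject["names"]
    return {term.lower()}

def search(term):
    hits = []
    for name in expand(term):
        hits.extend(INDEX.get(name, []))
    return hits

print(search("Tylenol"))   # finds the acetaminophen document anyway
```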

In traditional indexing systems (think of the Readers’ Guide to Periodical Literature or the Library of Congress Subject Headings), some users learned those systems in order to become better searchers. Still an exchange of what you know for what you don’t know, but with a large front-end investment.

Thomas is positing a system, like topic maps, that enables users to navigate by the terms they already know to find information they don’t know.

The second block of text I bolded:

Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

Making wide scale mappings certainly is a challenge. In part because there are so many mappings to be made and so many different ways to make them. Not to mention that the mappings will evolve over time as usages change.

There is growing realization that indexing or linking data results in a very large pile of indexed or linked data. You can’t really navigate it unless or until you hit upon the correct terms to make the next link. We could try to teach everyone the correct terms, but as more correct terms appear every day, that seems an unlikely solution. Thomas has the right of it when he suggests expanding the target of “correct” terms.

Topic maps are poised to help expand the target of “correct” terms, and to do so in such a way as to combine with other expanded targets of “correct” terms.

I first saw this in a tweet by Aaron Kirschenfeld.


Update: The Tarlton Law Library (University of Texas at Austin) Legal Research Guide has a great page of tips and pointers on the Google Scholar caselaw collection. Bookmark this guide.

September 30, 2014

New Wandora Release 2014-09-25

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 7:29 pm

New Wandora Release 2014-09-25

This release features:

Sounds good to me!

Download the latest release today!

September 29, 2014

Peaxy Hyperfiler Redefines Data Management to Deliver on the Promise of Advanced Analytics

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:22 pm

Peaxy Hyperfiler Redefines Data Management to Deliver on the Promise of Advanced Analytics

From the post:

Peaxy, Inc. (www.peaxy.net) today announced general availability of the Peaxy Hyperfiler, its hyperscale data management system that enables enterprises to access and manage massive amounts of unstructured data without disrupting business operations. For engineers and researchers who must search for datasets across multiple geographies, platforms and drives, accessing all the data necessary to inform the product lifecycle, from design to predictive maintenance, presents a major challenge. By making all data, regardless of quantity or location, immediately accessible via a consistent data path, companies will be able to dramatically accelerate their highly technical, data-intensive initiatives. These organizations will be able to manage data in a way that allows them to employ advanced analytics that have been promised for years but never truly realized.

…Key product features include:

  • Scalability to tens of thousands of nodes enabling the creation of an exabyte-scale data infrastructure in which performance scales in parallel with capacity
  • Fully distributed namespace and data space that eliminate data silos to make all data easily accessible and manageable
  • Simple, intuitive user interface built for engineers and researchers as well as for IT
  • Data tiered in storage classes based on performance, capacity and replication factor
  • Automated, policy-based data migration
  • Flexible, customizable data management
  • Remote, asynchronous replication to facilitate disaster recovery
  • Call home remote monitoring
  • Software-based, hardware-agnostic architecture that eliminates proprietary lock-in
  • Addition or replacement of hardware resources with no down time
  • A version of the Hyperfiler that has been successfully beta tested on Amazon Web Services (AWS)

I would not say that the “how it works” page is opaque but it does remind me of the Grinch telling Cindy Lou that he was taking their Christmas tree to be repaired. Possible but lacking in detail.

What do you think?


Do you see:

  1. Any mention of mapping multiple sources of data into a consolidated view?
  2. Any mention of managing changing terminology over a product history?
  3. Any mention of indexing heterogeneous data?
  4. Any mention of natural language processing unstructured data?
  5. Any mention of machine learning over unstructured data?
  6. Anything beyond an implied “a miracle occurs” between data and the Hyperfiler?

The documentation promises “data filters” but is also short on specifics.

A safe bet that mapping of terminology and semantics, for an enterprise and/or long product history, remains fertile ground for topic maps.

I first saw this in a tweet by Gregory Piatetsky.

PS: Answers to the questions I raise may exist somewhere but I warrant they weren’t posted on September 29, 2014 at the locations listed in this post.

September 21, 2014

Fixing Pentagon Intelligence [‘data glut but an information deficit’]

Filed under: Intelligence,Marketing,Topic Maps — Patrick Durusau @ 4:24 pm

Fixing Pentagon Intelligence by John R. Schindler.

From the post:

The U.S. Intelligence Community (IC), that vast agglomeration of seventeen different hush-hush agencies, is an espionage behemoth without peer anywhere on earth in terms of budget and capabilities. Fully eight of those spy agencies, plus the lion’s share of the IC’s budget, belong to the Department of Defense (DoD), making the Pentagon’s intelligence arm something special. It includes the intelligence agencies of all the armed services, but the jewel in the crown is the National Security Agency (NSA), America’s “big ears,” with the National Geospatial-Intelligence Agency (NGA), which produces amazing imagery, following close behind.

None can question the technical capabilities of DoD intelligence, but do the Pentagon’s spies actually know what they are talking about? This is an important, and too infrequently asked, question. Yet it was more or less asked this week, in a public forum, by a top military intelligence leader. The venue was an annual Washington, DC, intelligence conference that hosts IC higher-ups while defense contractors attempt a feeding frenzy, and the speaker was Rear Admiral Paul Becker, who serves as the Director of Intelligence (J2) on the Joint Chiefs of Staff (JCS). A career Navy intelligence officer, Becker’s job is keeping the Pentagon’s military bosses in the know on hot-button issues: it’s a firehose-drinking position, made bureaucratically complicated because JCS intelligence support comes from the Defense Intelligence Agency (DIA), which is an all-source shop that has never been a top-tier IC agency, and which happens to have some serious leadership churn at present.

Admiral Becker’s comments on the state of DoD intelligence, which were rather direct, merit attention. Not surprisingly for a Navy guy, he focused on China. He correctly noted that we have no trouble collecting the “dots” of (alleged) 9/11 infamy, but can the Pentagon’s big battalions of intel folks actually derive the necessary knowledge from all those tasty SIGINT, HUMINT, and IMINT morsels? Becker observed — accurately — that DoD intelligence possesses a “data glut but an information deficit” about China, adding that “We need to understand their strategy better.” In addition, he rued the absence of top-notch intelligence analysts of the sort the IC used to possess, asking pointedly: “Where are those people for China? We need them.”

Admiral Becker’s phrase:

“data glut but an information deficit” (emphasis added)

captures the essence of phone record subpoenas, mass collection of emails, etc., all designed to give the impression of frenzied activity, with no proof of effectiveness. That is an “information deficit.”

Be reassured you can host a data glut in a topic map so topic maps per se are not a threat to current data gluts. It is possible, however, to use topic maps over existing data gluts to create information and actionable intelligence. Without disturbing the underlying data gluts and their contractors.

I tried to find a video of Adm. Becker’s presentation but apparently the Intelligence and National Security Summit 2014 does not provide video recordings of presentations. Whether that is to prevent any contemporaneous record of remarks or just being low-tech kinda folks isn’t clear.

I can point out the meeting did have a known liar, “The Honorable James Clapper,” on the agenda. Hard to know if having perjured himself in front of Congress has made him gun shy of recorded speeches or not. (For Clapper’s latest “spin,” on “the least untruthful,” see: James Clapper says he misspoke, didn’t lie about NSA surveillance.) One hopes by next year’s conference Clapper will appear as: James Clapper, former DNI, convicted felon, Federal Prison Register #….

If you are interested in intelligence issues, you should be following John R. Schindler. A U.S. perspective but handling issues in intelligence with topic maps will vary in the details but not the underlying principles from one intelligence service to another.

Disclosure: I rag on the intelligence services of the United States due to greater access to public information on those services. Don’t take that as greater interest in how their operations could be improved by topic maps compared to other intelligence services.

I am happy to discuss how your intelligence services can (or can’t) be improved by topic maps. There are problems, such as those discussed by Admiral Becker, that can’t be fixed by using topic maps. I will be as quick to point those out as I will problems where topic maps are relevant. My goal is your satisfaction that topic maps made a difference for you, not having a government entity in a billing database.

September 15, 2014

A Cambrian Explosion In AI Is Coming

Filed under: Artificial Intelligence,Topic Maps — Patrick Durusau @ 6:35 am

A Cambrian Explosion In AI Is Coming by Dag Kittlaus.

From the post:

However, done properly, this emerging conversational paradigm enables a new fluidity for achieving tasks in the digital realm. Such an interface requires no user manual, makes short work of complex tasks via simple conversational commands and, once it gets to know you, makes obsolete many of the most tedious aspects of using the apps, sites and services of today. What if you didn’t have to: register and form-fill; continuously express your preferences; navigate new interfaces with every new app; and the biggest one of them all, discover and navigate each single-purpose app or service at a time?

Let me repeat the last one.

When you can use AI as a conduit, as an orchestrating mechanism to the world of information and services, you find yourself in a place where services don’t need to be discovered by an app store or search engine. It’s a new space where users will no longer be required to navigate each individual application or service to find and do what they want. Rather they move effortlessly from one need to the next with thousands of services competing and cooperating to accomplish their desires and tasks simply by expressing their desires. Just by asking.

Need a babysitter tomorrow night in a jam? Just ask your assistant to find one and it will immediately present you with a near complete set of personalized options: it already knows where you live, knows how many kids you have and their ages, knows which of the babysitting services has the highest reputation and which ones cover your geographic area. You didn’t need to search and discover a babysitting app, download it, register for it, enter your location and dates you are requesting and so on.

Dag uses the time-worn acronym AI (artificial intelligence), which covers any number of intellectual sins. For the scenarios that Dag describes, I propose a new acronym, UsI (user intelligence).

Take the babysitter example to make UsI concrete. The assistant has captured your current identification of “babysitter” (it could change over time) and uses that to find information carrying that identification. Otherwise, searching for “babysitter” would return both useful and useless results, much like contemporary search engines.

It is the capturing of your subject identifications, to use topic map language, that enables an assistant to “understand” the world as you do. Perhaps the reverse of “personalization” where an application attempts to guess your preferences for marketing purposes, this is “individualization” where the assistant becomes more like you and knows the usually unspoken facts that underlie your requests.

If I say “check utility bill,” my assistant will already “know” that I mean for Covington, Georgia, not any of the other places I have resided, and implicitly that I mean the current (unpaid) bill.

The easier and faster it is for an assistant to capture UsI, the faster and more seamless it will become for users.

Specifying and inspecting properties that underlie identifications will play an important role in fueling a useful Cambrian explosion in UsI.
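A toy sketch of what capturing such an identification might look like; every property name here is my invention, meant only to show matching on the properties behind the word rather than on the word “babysitter” itself.

```python
# The user's identification of "babysitter", captured as explicit properties.
user_subject = {
    "label": "babysitter",
    "properties": {
        "service": "child care",
        "setting": "in home",
        "children_ages": (3, 7),
        "area": "Covington, GA",
    },
}

# Candidate services, each described by properties rather than by a label alone.
candidates = [
    {"name": "SitterCo", "service": "child care", "setting": "in home",
     "areas": {"Covington, GA", "Conyers, GA"}},
    {"name": "a movie called 'The Babysitter'", "service": "entertainment",
     "setting": "streaming", "areas": set()},
]

def matches(subject, candidate):
    """Match on the properties behind the identification, not the word itself."""
    props = subject["properties"]
    return (candidate["service"] == props["service"]
            and candidate["setting"] == props["setting"]
            and props["area"] in candidate["areas"])

print([c["name"] for c in candidates if matches(user_subject, c)])  # ['SitterCo']
```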

Who wants a “babysitter” using your definition? Could have quite unexpected (to me) results. http://www.imdb.com/title/tt0796302/ (Be mindful of your corporate policies on what you can or can’t view at work.)

PS: Did I mention topic maps as collections of properties for identifications?

I first saw this in a tweet by Subject-centric.

August 26, 2014

Probabilistic Topic Maps?

Filed under: Probalistic Models,Topic Maps — Patrick Durusau @ 6:21 pm

Probabilistic Soft Logic

From the webpage:

Probabilistic soft logic (PSL) is a modeling language (with accompanying implementation) for learning and predicting in relational domains. Such tasks occur in many areas such as natural language processing, social-network analysis, computer vision, and machine learning in general.

PSL allows users to describe their problems in an intuitive, logic-like language and then apply their models to data.

Details:

  • PSL models are templates for hinge-loss Markov random fields (HL-MRFs), a powerful class of probabilistic graphical models.
  • HL-MRFs are extremely scalable models because they are log-concave densities over continuous variables that can be optimized using the alternating direction method of multipliers.
  • See the publications page for more technical information and applications.

This homepage lists three introductory videos and has a set of slides on PSL.

Under entity resolution, the slides illustrate rules that govern the “evidence” that two entities represent the same person. You will also find link prediction, mapping of different ontologies, discussion of mapreduce implementations and other materials in the slides.

Probabilistic rules could be included in a TMDM instance but I don’t know of any topic map software that supports probabilistic merging. Would be a nice authoring feature to have.
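As a rough sketch of what probabilistic merging could mean in practice (my own toy scoring, not PSL and not any existing topic map engine): compute a merge score from weighted evidence rules and merge only above a threshold.

```python
def merge_score(a, b):
    """Combine weighted evidence that two topics represent the same subject."""
    score = 0.0
    if a.get("email") and a.get("email") == b.get("email"):
        score += 0.6                      # strong evidence
    if a.get("name", "").lower() == b.get("name", "").lower():
        score += 0.3                      # weaker evidence, names collide
    if a.get("city") == b.get("city"):
        score += 0.1
    return min(score, 1.0)

MERGE_THRESHOLD = 0.7

t1 = {"name": "A. Garrido", "email": "garrido@example.org", "city": "Zaragoza"}
t2 = {"name": "Angel L. Garrido", "email": "garrido@example.org", "city": "Zaragoza"}

score = merge_score(t1, t2)
print(score, "merge" if score >= MERGE_THRESHOLD else "keep separate")
```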

The source code is on GitHub if you want to take a closer look.

August 25, 2014

Research topics in e-discovery

Filed under: e-Discovery,Legal Informatics,Topic Maps — Patrick Durusau @ 2:12 pm

Research topics in e-discovery by William Webber.

From the post:

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I’d capture some of these topics here:

E-discovery, if you don’t know, is found in civil litigation and government investigations. Think of it as hacking with rules, since the purpose of e-discovery is to find information that supports your claims or defense. E-discovery is high-stakes data mining that pays very well. Need I say more?

Webber lists the following research topics:

  1. Classification across heterogeneous document types
  2. Automatic detection of document types
  3. Faceted categorization
  4. Label propagation across related documents
  5. Identifying unclassifiable documents
  6. Identifying poor training examples
  7. Identifying significant fragments in non-significant text
  8. Routing of documents to specialized trainers
  9. Total cost of annotation

“Label propagation across related documents” looks like a natural for topic maps, but searching over defined properties that identify subjects, as opposed to opaque tokens, would enhance the results for a number of these topics.
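A bare-bones sketch of label propagation over a document graph, where the edges stand for whatever “related” means in your collection (email threads, near-duplicates, shared custodians); the documents and labels are invented.

```python
from collections import Counter

# Edges between related documents (threads, near-duplicates, attachments...).
edges = [("doc1", "doc2"), ("doc2", "doc3"), ("doc4", "doc5")]
labels = {"doc1": "responsive", "doc5": "privileged"}   # reviewer-assigned seeds

neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def propagate(labels, neighbors, rounds=3):
    """Each unlabeled document takes the majority label of its labeled neighbors."""
    labels = dict(labels)
    for _ in range(rounds):
        updates = {}
        for doc, nbrs in neighbors.items():
            if doc in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[doc] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels

print(propagate(labels, neighbors))
# doc2 and doc3 become 'responsive', doc4 becomes 'privileged'
```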

August 23, 2014

Large-Scale Object Classification…

Filed under: Classification,Image Recognition,Image Understanding,Topic Maps — Patrick Durusau @ 3:37 pm

Large-Scale Object Classification using Label Relation Graphs by Jia Deng, et al.

Abstract:

In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

Let’s hear it for “real world labels!”

By which the authors mean:

  • An object can have more than one label.
  • There are relationships between labels.

From the introduction:

We first introduce Hierarchy and Exclusion (HEX) graphs, a new formalism allowing flexible specification of relations between labels applied to the same object: (1) mutual exclusion (e.g. an object cannot be dog and cat), (2) overlapping (e.g. a husky may or may not be a puppy and vice versa), and (3) subsumption (e.g. all huskies are dogs). We provide theoretical analysis on properties of HEX graphs such as consistency, equivalence, and computational implications.

Next, we propose a probabilistic classification model leveraging HEX graphs. In particular, it is a special type of Conditional Random Field (CRF) that encodes the label relations as pairwise potentials. We show that this model enjoys a number of desirable properties, including flexible encoding of label relations, predictions consistent with label relations, efficient exact inference for typical graphs, learning labels with varying specificity, knowledge transfer, and unification of existing models.

Having more than one label is trivially possible in topic maps. The more interesting case is the authors choosing to treat semantic labels as subjects and to define permitted associations between those subjects.

A world of possibilities opens up when you can treat something as a subject that can have relationships defined to other subjects. Noting that those relationships can also be treated as subjects should someone desire to do so.

I first saw this at: Is that husky a puppy?

August 18, 2014

Topic Maps Are For Data Janitors

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:20 am

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights by Steve Lohr.

From the post:

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

Data formats are one challenge, but so is the ambiguity of human language. Iodine, a new health start-up, gives consumers information on drug side effects and interactions. Its lists, graphics and text descriptions are the result of combining the data from clinical research, government reports and online surveys of people’s experience with specific drugs.

But the Food and Drug Administration, National Institutes of Health and pharmaceutical companies often apply slightly different terms to describe the same side effect. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.

Plenty of progress is still to be made in easing the analysis of data. “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff,” said Michael Cavaretta, a data scientist at Ford Motor, which has used big data analysis to trim inventory levels and guide changes in car design.

Mr. Cavaretta is familiar with the work of ClearStory, Trifacta, Paxata and other start-ups in the field. “I’d encourage these start-ups to keep at it,” he said. “It’s a good problem, and a big one.”

Topic maps were only fifteen (15) years ahead of Big Data’s need for them.

How do you avoid:

That kind of painstaking work must be repeated, time and again, on data projects.

?

By annotating existing data once with a topic map and re-using that annotation over and over again.

Or by creating data that is already annotated with a topic map, so the annotation never has to be redone at all.

Recall that topic map annotations can represent “logic” but, more importantly, can represent any human insight that can be expressed about data.
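As a toy illustration of “annotate once, reuse over and over,” here is a sketch in Python that maps the side-effect synonyms from Lohr’s example onto a single subject and then applies that one mapping to records from different sources. The subject identifier and sample records are invented; this is not Iodine’s actual pipeline.

# A reusable annotation: three surface terms, one subject.
# Subject identifier and sample records are invented for illustration.
SIDE_EFFECT_SUBJECTS = {
    "drowsiness": "subject:side-effect/somnolence",
    "somnolence": "subject:side-effect/somnolence",
    "sleepiness": "subject:side-effect/somnolence",
}

def annotate(records, term_field):
    # Attach a subject identifier to each record once, so downstream
    # projects can merge on the subject instead of the raw token.
    for record in records:
        term = record[term_field].strip().lower()
        record["subject"] = SIDE_EFFECT_SUBJECTS.get(
            term, "subject:unresolved/" + term)
    return records

fda_reports = [{"term": "Somnolence", "count": 412}]
survey_data = [{"term": "sleepiness", "respondents": 97}]

annotate(fda_reports, "term")
annotate(survey_data, "term")

# Both datasets now share one subject identifier and can be merged
# without redoing the synonym work on the next project.
print(fda_reports[0]["subject"] == survey_data[0]["subject"])  # True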

See Lohr’s post for startups and others who are talking about a problem the topic maps community solved fifteen years ago.

August 11, 2014

Patent Fraud, As In Patent Office Fraud

Filed under: Intellectual Property (IP),Topic Maps — Patrick Durusau @ 2:09 pm

Patent Office staff engaged in fraud and rushed exams, report says by Jeff John Roberts.

From the post:

…One version of the report also flags a culture of “end-loading” in which examiners “can go from unacceptable performance to award levels in one bi-week by doing 500% to more than 1000% of their production goal.”…

See Jeff’s post for other details and resources.

Assuming the records for patent examiners can be pried loose from the Patent Office, this would make a great topic map project. Associate the 500% periods with specific patents and any litigation over those patents, creating a resource for future challenges to patents approved by a particular examiner.

By the time a gravy train like patent examining makes the news, you know the train has already left the station.

On the up side, perhaps Congress will re-establish the Patent Office and prohibit any prior staff, contractors, etc. from working at the new one. The new Patent Office could adopt rules designed both to enable innovation and to track prior innovation effectively. Present Patent Office goals have little to do with either.

August 9, 2014

PHPTMAPI – Documentation Complete

Filed under: PHP,PHPTMAPI,Topic Maps — Patrick Durusau @ 1:55 pm

Johannes Schmidt tweeted today to announce that PHPTMAPI “…is now fully documented.”

In case you are unfamiliar with PHPTMAPI:

PHPTMAPI is a PHP5 API for creating and manipulating topic maps, based on the http://tmapi.sourceforge.net/ project. This API enables PHP developers an easy and standardized implementation of ISO/IEC 13250 Topic Maps in their applications.

What is TMAPI?

TMAPI is a programming interface for accessing and manipulating data held in a topic map. The TMAPI specification defines a set of core interfaces which must be implemented by a compliant application as well as (eventually) a set of additional interfaces which may be implemented by a compliant application or which may be built upon the core interfaces.
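For readers who have not seen a TMAPI-style interface, here is a rough conceptual sketch, in Python, of the kind of operations it standardizes: create a topic map, create topics by subject identifier, and relate them with a typed association. The class and method names are illustrative only and are not PHPTMAPI’s actual signatures; for those, see the documentation Johannes has completed.

# Conceptual sketch of TMAPI-style operations (names are illustrative;
# consult the PHPTMAPI documentation for the real interfaces).

class Topic:
    def __init__(self, subject_identifier):
        self.subject_identifiers = {subject_identifier}
        self.names = []

    def create_name(self, value):
        self.names.append(value)

class Association:
    def __init__(self, assoc_type):
        self.type = assoc_type
        self.roles = []            # (role type, player topic) pairs

    def create_role(self, role_type, player):
        self.roles.append((role_type, player))

class TopicMap:
    def __init__(self, locator):
        self.locator = locator
        self.topics = {}           # subject identifier -> Topic
        self.associations = []

    def create_topic_by_subject_identifier(self, iri):
        return self.topics.setdefault(iri, Topic(iri))

    def create_association(self, assoc_type):
        assoc = Association(assoc_type)
        self.associations.append(assoc)
        return assoc

tm = TopicMap("http://example.org/map")
composer = tm.create_topic_by_subject_identifier("http://example.org/puccini")
composer.create_name("Giacomo Puccini")
opera = tm.create_topic_by_subject_identifier("http://example.org/tosca")
written_by = tm.create_association("composed-by")
written_by.create_role("work", opera)
written_by.create_role("composer", composer)
print(len(tm.topics), len(tm.associations))  # 2 1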

Thanks Johannes!

August 8, 2014

ROpenSci News – August 2014

Filed under: R,Topic Maps — Patrick Durusau @ 3:23 pm

Community conversations and a new package for full text by Scott Chamberlain and Karthik Ram.

ROpenSci announces they are reopening their public Google list.

We encourage you to sign up and post ideas for packages, solicit feedback on new ideas, and most importantly find other collaborators who share your domain interests. We also plan to use the list to solicit feedback on some of the bigger rOpenSci projects early on in the development phase allowing our community to shape future direction and also collaborate where appropriate.

Among the work that is underway:

Through time we have been attempting to unify our R packages that interact with individual data sources into single packages that handle one use case. For example, spocc aims to create a single entry point to many different sources (currently 6) of species occurrence data, including GBIF, AntWeb, and others.

Another area we hope to simplify is acquiring text data, specifically text from scholarly journal articles. We call this R package fulltext. The goal of fulltext is to allow a single user interface to searching for and retrieving full text data from scholarly journal articles. Rather than learning a different interface for each data source, you can learn one interface, making your work easier. fulltext will likely only get you data, and make it easy to browse that data, and use it downstream for manipulation, analysis, and visualization.

We currently have R packages for a number of sources of scholarly article text, including for Public Library of Science (PLOS), Biomed Central (BMC), and eLife – which could all be included in fulltext. We can add more sources as they become available.

Instead of us rOpenSci core members planning out the whole package, we'd love to get the community involved at the beginning.

The “individual data sources into single packages” approach sounds particularly ripe for enhancement with topic map based ideas.

This is not a plea for topic map syntax or modeling, although either would make a nice output option. The critical idea is to identify central subjects with key/value pairs, enabling robust identification of those subjects by later users.

Surface tokens with unexpressed contexts set hard boundaries on the usefulness and accuracy of search results. If we capture what is known to identify surface tokens, we enrich our world and the world of others.
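Here is what “a single entry point plus identified subjects” might look like in miniature: a thin dispatcher over two hypothetical source clients, where hits merge on a key/value pair that identifies the article (a DOI) rather than on title strings. The source functions, records, and DOI are invented; this is not the fulltext package’s API.

# Hypothetical source clients; real ones would call PLOS, BMC, etc.
def search_plos(query):
    return [{"doi": "10.9999/example.0001",
             "title": "Example study", "source": "plos"}]

def search_bmc(query):
    return [{"doi": "10.9999/example.0001",
             "title": "Example Study", "source": "bmc"}]

SOURCES = {"plos": search_plos, "bmc": search_bmc}

def search_all(query, sources=("plos", "bmc")):
    # One interface over several sources; hits merge on the DOI, a
    # key/value pair identifying the subject, not on surface tokens.
    merged = {}
    for name in sources:
        for hit in SOURCES[name](query):
            record = merged.setdefault(
                hit["doi"],
                {"doi": hit["doi"], "titles": set(), "sources": set()})
            record["titles"].add(hit["title"])
            record["sources"].add(hit["source"])
    return list(merged.values())

results = search_all("example")
print(len(results))                   # 1: two sources, one identified subject
print(sorted(results[0]["sources"]))  # ['bmc', 'plos']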

August 7, 2014

Ebola: “Highly contagious…” or…

Filed under: Semantics,Topic Maps — Patrick Durusau @ 2:30 pm

NPR has developed a disturbing range of semantics for the current Ebola crisis.

Consider these two reports, one on August 7th and one on August 2nd, 2014.

Aug. 7th: Officials Fear Ebola Will Spread Across Nigeria

Dave Greene reports that there are only two or three cases in Lagos, but that Nigeria is declaring a state of emergency because Ebola is “…highly contagious….”

Aug. 2nd: Atlanta Hospital Prepares To Treat 2 Ebola Patients

Jim Burress comments on a news conference at Emory:

“He downplayed any threat to public safety because the virus can only be spread through close contact with an infected person.”

To me, “highly contagious” and “close contact with an infected person” are worlds apart. Why the shift in semantics in only five days?

Curious if you have noticed this or other shifting semantics around the Ebola outbreak from other news outlets?

Not that I would advocate any one “true” semantic for the crisis, but I wonder who would benefit from an Ebola-fear panic in Nigeria? Or who would benefit from no panic and a possible successful treatment for Ebola?

Working on the assumption that semantics vary depending on who benefits from a particular semantic.

Topic maps could help you “out” the beneficiaries. Or help you plan to conceal connections to the beneficiaries, depending upon your business model.


Update: A close friend pointed me to: FILOVIR: Scientific Resource for Research on Filoviruses. Website, twitter feed, etc. In case you are looking for a current feed of Ebola information, both public and professional.

June 5, 2014

A Topic Map Classic

Filed under: Topic Maps — Patrick Durusau @ 10:12 am

I ran into a classic topic map problem today.

I have been trying to find a way to move a very large refrigerator out of an alcove to clean its coils.

Searching the web, I found a product by Airsled that is the perfect solution for me. Unfortunately, the cheapest one is over $500.00 US.

I really want to move the refrigerator, but for something I do only once every four or five years, that seems really expensive.

Reasoning that other people would have the same reaction, I started calling equipment rental places, describing the tool and calling it by the manufacturer’s name, Airsled.

The last place I talked to this afternoon offered several other solutions but no, they had no such device.

This evening I was searching the web again, added “rental” to my search for Airsled, and got the last place I had called today.

You already know what the problem turns out to be.

Their name for the device?

700lb Air Appliance Mover Dolly

But when you compare:

The Airsled:

Airsled

to the 700 Pound Appliance Mover Dolly:

700 pound appliance mover

You get the idea they are the same thing.

Yes?

But for the chance finding of that reference to the local rental store and following it up, they would have lost the rental, my refrigerator would not have been moved, etc.

That’s just one experience. Imagine all the similar experiences today.
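In topic map terms the fix is small: give each vendor’s name for the device its own topic, and let the two topics merge because they share an identifier for the same subject. A minimal sketch in Python, with an invented subject identifier:

# Two names, one subject: merge topics sharing a subject identifier.
# The identifier IRI is invented for illustration.
SUBJECT = "http://example.org/subject/air-film-appliance-mover"

topics = [
    {"names": {"Airsled"}, "identifiers": {SUBJECT}},
    {"names": {"700lb Air Appliance Mover Dolly"}, "identifiers": {SUBJECT}},
]

def merge_by_identifier(topics):
    # Collapse topics that share any subject identifier.
    merged = {}
    for topic in topics:
        for iri in topic["identifiers"]:
            if iri in merged:
                merged[iri]["names"] |= topic["names"]
            else:
                merged[iri] = {"names": set(topic["names"]),
                               "identifiers": {iri}}
    return list(merged.values())

print(merge_by_identifier(topics)[0]["names"])
# {'Airsled', '700lb Air Appliance Mover Dolly'}; search either name, find the rental.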

May 25, 2014

Emotion Markup Language 1.0 (No Repeat of RDF Mistake)

Filed under: EmotionML,Subject Identity,Topic Maps,W3C — Patrick Durusau @ 3:19 pm

Emotion Markup Language (EmotionML) 1.0

Abstract:

As the Web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The specification of Emotion Markup Language 1.0 aims to strike a balance between practical applicability and scientific well-foundedness. The language is conceived as a “plug-in” language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.

I started reading EmotionML expecting that the W3C had repeated its “one way and one way only” identification mistake from RDF.

Much to my pleasant surprise I found:

1.2 The challenge of defining a generally usable Emotion Markup Language

Any attempt to standardize the description of emotions using a finite set of fixed descriptors is doomed to failure: even scientists cannot agree on the number of relevant emotions, or on the names that should be given to them. Even more basically, the list of emotion-related states that should be distinguished varies depending on the application domain and the aspect of emotions to be focused. Basically, the vocabulary needed depends on the context of use. On the other hand, the basic structure of concepts is less controversial: it is generally agreed that emotions involve triggers, appraisals, feelings, expressive behavior including physiological changes, and action tendencies; emotions in their entirety can be described in terms of categories or a small number of dimensions; emotions have an intensity, and so on. For details, see Scientific Descriptions of Emotions in the Final Report of the Emotion Incubator Group.

Given this lack of agreement on descriptors in the field, the only practical way of defining an EmotionML is the definition of possible structural elements and their valid child elements and attributes, but to allow users to “plug in” vocabularies that they consider appropriate for their work. A separate W3C Working Draft complements this specification to provide a central repository of [Vocabularies for EmotionML] which can serve as a starting point; where the vocabularies listed there seem inappropriate, users can create their custom vocabularies.

An additional challenge lies in the aim to provide a generally usable markup, as the requirements arising from the three different use cases (annotation, recognition, and generation) are rather different. Whereas manual annotation tends to require all the fine-grained distinctions considered in the scientific literature, automatic recognition systems can usually distinguish only a very small number of different states.

For the reasons outlined here, it is clear that there is an inevitable tension between flexibility and interoperability, which need to be weighed in the formulation of an EmotionML. The guiding principle in the following specification has been to provide a choice only where it is needed, and to propose reasonable default options for every choice.

Everything that is said about emotions is equally true for identification, emotions being only one of the infinite set of subjects that you might want to identify.

Had the W3C avoided RDF’s single identifier scheme (and its reliance on a subset of reasoning, logic), RDF could have had plug-in “identifier” modules, enabling the use of all extant and future identifiers, not to mention “reasoning” according to users’ own designs.
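As a sketch of what plug-in “identifier” modules might have looked like (in Python, with invented scheme names and deliberately crude matching rules): each scheme supplies its own tests for recognizing its identifiers and for deciding when two of them name the same subject, instead of one mandated identifier form.

# Pluggable identifier schemes: each scheme decides what counts as one of
# its identifiers and when two of them name the same subject.
# Schemes and matching rules below are illustrative only.
IDENTIFIER_SCHEMES = {}

def register_scheme(name, recognizes, same):
    IDENTIFIER_SCHEMES[name] = {"recognizes": recognizes, "same": same}

# A URI scheme: exact match after stripping a trailing slash.
register_scheme(
    "uri",
    recognizes=lambda s: s.startswith("http://") or s.startswith("https://"),
    same=lambda a, b: a.rstrip("/") == b.rstrip("/"))

# An ISBN-like scheme: ignore hyphens (check characters omitted for brevity).
register_scheme(
    "isbn",
    recognizes=lambda s: s.replace("-", "").isdigit()
                         and len(s.replace("-", "")) in (10, 13),
    same=lambda a, b: a.replace("-", "") == b.replace("-", ""))

def same_subject(id_a, id_b):
    # Two identifiers name the same subject if some registered scheme
    # recognizes both and judges them equal.
    for scheme in IDENTIFIER_SCHEMES.values():
        if scheme["recognizes"](id_a) and scheme["recognizes"](id_b):
            if scheme["same"](id_a, id_b):
                return True
    return False

print(same_subject("http://example.org/topic/", "http://example.org/topic"))  # True
print(same_subject("978-0-306-40615-7", "9780306406157"))                     # True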

It is good to see the W3C learning from its earlier mistakes and enabling users to express their own world views, rather than a world view prescribed by the W3C.

When users declare their emotion vocabularies, those vocabularies are subjects that merit identification in their own right. That avoids the problem of us not meaning the same thing by “owl:sameAs” that someone else means by “owl:sameAs.” (When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes.)

Topic maps are a good solution for documenting subject identity and deciding when two or more identifications of subjects are the same subject.

I first saw this in a tweet by Inge Henriksen.
