Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 5, 2011

Radoop (Will you rat out your friends?)

Filed under: BigData — Patrick Durusau @ 6:48 pm

Radoop

From the site after you enter your email address:

You can be one of the early adopters of Radoop, an easy-to-use interface for Big Data analytics and machine learning over Hadoop!

Invite some friends using the link below. The more friends you invite, the sooner you’ll get access!

You know, I’m not really interested in sharing data, contact or otherwise, about my friends. So I may be missing the next big thing, but so be it.

Oracle, NoSQL and Topic Maps

Filed under: RDBMS,SQL — Patrick Durusau @ 3:27 am

There have been more tweets about Oracle’s recent NoSQL offering than Lucene turning 10 years old. The information content has been about the same.

The Oracle tweets, “if you can’t beat them, join them,” “we have been waiting for you,” etc., don’t appreciate a software vendor’s view of the world.

Software vendors, as opposed to software cultists, offer products customers are likely to lease or purchase. A software vendor would port vi to the iPhone 5 if there was enough customer demand.

Which, in an embarrassing way, explains why Oracle doesn’t support topic maps: lack of customer demand.

Topic maps do have customer demand, at least enough to keep any number of topic map service/software vendors afloat. But those customers don’t represent enough demand for Oracle to move into the topic map field.

The NoSQL people may have a model we can follow (perhaps even using NoSQL as backends).

They isolated use cases of interest to customers, then demonstrated impressive performance numbers on those use cases.

Question: So how do I learn which use cases are of interest to others, and which of those could be addressed by topic maps?*


*I know what use cases are of interest to me but a comparative Semitic linguistics topic map isn’t likely to have high demand as an iPhone app, for example. Quite doable with topic maps but not commercially compelling.

October 4, 2011

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

Filed under: Lucene,Mahout,Recommendation — Patrick Durusau @ 7:57 pm

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

From the webpage:

This is the first post in a four part series about a wine rating and recommendation Web application, named VinWiki, built using open source technology. The purpose of this series is to document key design and implementation decisions, which may be of interest to anyone wanting to build an intelligent Web application using Java technologies. The end result will not be a 100% functioning Web application, but will have enough functionality to prove the concepts.

I thought about Lars Marius and his expertise at beer evaluation when I saw this series. Not that Lars would need it but it looks like the sort of thing you could build to recommend things you know something about, and like. Whatever that may be. 😉
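
Not part of the series, but to make the “recommend things you know and like” idea concrete, here is a toy item-based recommender: cosine similarity over shared ratings and nothing more. The series itself leans on Mahout for the real work; treat this as a sketch of the idea, not of their code.

```scala
// Toy item-based recommender: cosine similarity over user ratings.
// Illustrative only; the VinWiki series itself builds on Mahout.
object WineRecommender {

  type Ratings = Map[String, Map[String, Double]] // user -> (wine -> rating)

  // Cosine similarity between two wines, computed over users who rated both.
  def similarity(a: String, b: String, ratings: Ratings): Double = {
    val pairs = for {
      (_, byWine) <- ratings.toSeq
      ra <- byWine.get(a)
      rb <- byWine.get(b)
    } yield (ra, rb)
    if (pairs.isEmpty) 0.0
    else {
      val dot   = pairs.map { case (x, y) => x * y }.sum
      val normA = math.sqrt(pairs.map { case (x, _) => x * x }.sum)
      val normB = math.sqrt(pairs.map { case (_, y) => y * y }.sum)
      if (normA == 0 || normB == 0) 0.0 else dot / (normA * normB)
    }
  }

  // Recommend wines the user has not rated, ranked by similarity to wines they liked.
  def recommend(user: String, ratings: Ratings, topN: Int = 3): Seq[String] = {
    val rated    = ratings.getOrElse(user, Map.empty)
    val allWines = ratings.values.flatMap(_.keys).toSet
    (allWines -- rated.keys).toSeq
      .map(w => w -> rated.map { case (liked, r) => r * similarity(liked, w, ratings) }.sum)
      .sortBy(-_._2)
      .take(topN)
      .map(_._1)
  }

  def main(args: Array[String]): Unit = {
    val ratings: Ratings = Map(
      "alice" -> Map("merlot" -> 5.0, "zinfandel" -> 4.0),
      "bob"   -> Map("merlot" -> 4.0, "riesling"  -> 5.0),
      "carol" -> Map("zinfandel" -> 5.0, "riesling" -> 3.0)
    )
    println(recommend("alice", ratings)) // wines alice has not tried, best guesses first
  }
}
```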

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 7:57 pm

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Jack Park did due diligence on the SILK materials before I did and forwarded a link to this paper.

Abstract:

Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called MultiBlock which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property increasing the efficiency of the index significantly. In addition, it guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several 100 for large datasets compared to the full evaluation without losing recall.

From deeper in the paper:

If the similarity between two entities exceeds a threshold $\theta$, a link between these two entities is generated. $sim$ is computed by evaluating a link specification $s$ (in record linkage typically called linkage decision rule [23]) which specifies the conditions two entities must fulfill in order to be interlinked.

If I am reading this paper correctly, there isn’t a requirement (as in record linkage) that we normalize the data to a common format before writing the rule for comparisons. That in and of itself is a major boon. To say nothing of the other contributions of this paper.
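
To make the linkage-rule idea concrete, here is a toy sketch of a link specification that aggregates weighted similarity measures and emits a link when the aggregate clears θ. It is not Silk’s implementation, and note that it naively enumerates every source/target pair; avoiding exactly that cross product, without losing recall, is what MultiBlock’s multidimensional index is for.

```scala
// Toy "link specification": a weighted aggregation of similarity measures
// with a threshold. Not Silk's code, just the shape of the idea.
object LinkSpec {

  case class Entity(id: String, props: Map[String, String])

  // A similarity measure compares one property of two entities, yielding [0,1].
  type Measure = (Entity, Entity) => Double

  def exactMatch(prop: String): Measure =
    (a, b) => if (a.props.contains(prop) && a.props.get(prop) == b.props.get(prop)) 1.0 else 0.0

  def tokenOverlap(prop: String): Measure = (a, b) => {
    val ta = a.props.getOrElse(prop, "").toLowerCase.split("\\s+").toSet
    val tb = b.props.getOrElse(prop, "").toLowerCase.split("\\s+").toSet
    if (ta.isEmpty || tb.isEmpty) 0.0 else (ta & tb).size.toDouble / (ta | tb).size
  }

  // Aggregate several weighted measures; link when the score exceeds theta.
  // Deliberately naive: every source/target pair is compared.
  def links(sources: Seq[Entity], targets: Seq[Entity],
            measures: Seq[(Measure, Double)], theta: Double): Seq[(String, String, Double)] =
    for {
      s <- sources
      t <- targets
      score = measures.map { case (m, w) => w * m(s, t) }.sum / measures.map(_._2).sum
      if score >= theta
    } yield (s.id, t.id, score)

  def main(args: Array[String]): Unit = {
    val a = Entity("src:1", Map("label" -> "Berlin Coworking Space", "city" -> "Berlin"))
    val b = Entity("tgt:7", Map("label" -> "Coworking Space Berlin", "city" -> "Berlin"))
    println(links(Seq(a), Seq(b), Seq(tokenOverlap("label") -> 0.7, exactMatch("city") -> 0.3), 0.8))
  }
}
```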

Buckets of Sockets

Filed under: Erlang,Software — Patrick Durusau @ 7:56 pm

Buckets of Sockets

OK, so some of the stuff I have pointed to lately hasn’t been “hard core.” 😉

This should give you some ideas about building communications (including servers) in connection with topic maps.

From the webpage:

So far we’ve had some fun dealing with Erlang itself, barely communicating to the outside world, if only by text files that we read here and there. As much of relationships with yourself might be fun, it’s time to get out of our lair and start talking to the rest of the world.

This chapter will cover three components of using sockets: IO lists, UDP sockets and TCP sockets. IO lists aren’t extremely complex as a topic. They’re just a clever way to efficiently build strings to be sent over sockets and other Erlang drivers.

SolrMarc

Filed under: MARC,Solr,SolrMarc — Patrick Durusau @ 7:55 pm

SolrMarc

From the webpage:

Solrmarc can index your marc records into apache solr. It also comes with an improved version of marc4j that improves handling of UTF-8 characters, is more forgiving of malformed marc data, and can recover from data errors gracefully. This indexer is used by blacklight (http://blacklight.rubyforge.org) and vufind (http://www.vufind.org/) but it can also be used as a standalone project.

Nice if short discussion of custom indexing with SolrMarc.
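
For a sense of what SolrMarc automates, here is a minimal hand-rolled sketch of the same job. It assumes marc4j’s MarcStreamReader/Record API and SolrJ’s HttpSolrServer; class names shift a little between releases, so treat it as illustrative rather than copy-and-paste.

```scala
// Minimal sketch of MARC -> Solr indexing, the job SolrMarc automates.
// Assumes marc4j (MarcStreamReader/Record) and SolrJ (HttpSolrServer);
// adjust class names to the releases you actually have.
import java.io.FileInputStream
import org.marc4j.MarcStreamReader
import org.marc4j.marc.DataField
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

object MarcIndexer {
  def main(args: Array[String]): Unit = {
    val solr   = new HttpSolrServer("http://localhost:8983/solr")
    val reader = new MarcStreamReader(new FileInputStream("records.mrc"))

    while (reader.hasNext) {
      val record = reader.next()
      val doc    = new SolrInputDocument()
      doc.addField("id", record.getControlNumber)

      // 245 $a is the title proper; skip records where it is missing or malformed.
      record.getVariableField("245") match {
        case df: DataField if df.getSubfield('a') != null =>
          doc.addField("title", df.getSubfield('a').getData)
        case _ => // leave title unset
      }
      solr.add(doc)
    }
    solr.commit()
  }
}
```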

Berlin Graph Coding Dojo 27 Oct. 2011

Filed under: Conferences,Graphs — Patrick Durusau @ 7:55 pm

From Pere Urbón Bayes, news of a Berlin Graph Coding Dojo, 27 October 2011.

I won’t be there but here are a couple of questions to explore over coffee/beer: Are relational tables, columns, key-value stores, triple stores, etc., restrictions on a more general underlying graph? If so, how do we exploit that underlying graph to merge information held in disparate data sources?

Graph databases, together with graph processing problems, are a trendy topic right now. Neo4j is a well known graph database, but there are also others like OrientDB, DEX, etc., and there is also a big set of graph processing toolsets like Blueprints, Apache Hama, Google Pregel-like systems, etc. So, from recommendation systems to routing problems, graph processing is an amazing thing to have in your toolset.

With the objective of bringing together experts and newbies, and giving all of them the opportunity to learn new things by doing, we are launching the Berlin Graph Coding Dojo. On 27 October 2011 we will meet with the main task of learning and practicing new graph related tasks.

There will be enough food for the more experienced people, but also for those who just say, “Hey, graph databases are cool, let’s see what I can do with them in a short time.”

If interested, no matter your level of experience with this topic, show up on 27 October at the Berlin Coworking Space. Bring your laptop, and in a couple of hours you will have solved something new using graphs.

For more information you can join: graph-b@googlegroups.com

Lots of thanks to the Berlin Coworking Space for making this event possible. Also, if you want to be a sponsor, collaborate, give your five cents, whatever, contact us!

Details
27/October/2011 19:30h
Berlin Coworking Space [http://g.co/maps/j2tmb]
Adalbertstr. 7-8
10999 Berlin


/purbon
– @purbon
http://www.purbon.com

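To make the first of those coffee/beer questions a bit more concrete: a relational row and a key-value pair can both be flattened into the same primitive node/edge structure. A toy sketch, not tied to any of the products above:

```scala
// Thought experiment: a relational row and a key-value pair expressed as the
// same primitive node/edge structure. Toy model only.
object EverythingIsAGraph {

  case class Node(id: String, label: String)
  case class Edge(from: String, rel: String, to: String)
  case class Graph(nodes: Set[Node], edges: Set[Edge])

  // A relational row: the row is a node, each column value is a node,
  // and each column name becomes an edge label.
  def fromRow(table: String, pk: String, row: Map[String, String]): Graph = {
    val rowId      = s"$table/$pk"
    val rowNode    = Node(rowId, table)
    val valueNodes = row.map { case (col, v) => Node(s"$rowId/$col", v) }.toSet
    val edges      = row.map { case (col, _) => Edge(rowId, col, s"$rowId/$col") }.toSet
    Graph(valueNodes + rowNode, edges)
  }

  // A key-value pair is the degenerate case: one edge.
  def fromKeyValue(key: String, value: String): Graph =
    Graph(Set(Node(key, "key"), Node(s"$key/value", value)),
          Set(Edge(key, "value", s"$key/value")))

  def main(args: Array[String]): Unit = {
    println(fromRow("person", "42", Map("name" -> "Ada", "city" -> "Berlin")))
    println(fromKeyValue("person:42:name", "Ada"))
  }
}
```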

Activity 1: Search for Meaning Using Topic Maps

Filed under: Education,Topic Maps — Patrick Durusau @ 7:54 pm

Activity 1: Search for Meaning Using Topic Maps

This is from:

Intro to US Writing: 9th Grade Writing/Physics Design Thinking Integrated Curriculum.

I could not find contact information for the instructor on the blog so have contacted the school to see if I can get more information.

Encouraging example of topic maps being used in secondary education!

Definitely need to find out what the instructor did to make it successful.

SILK – Link Discovery Framework Version 2.5 released

Filed under: Linked Data,LOD,RDF,Semantic Web,SPARQL — Patrick Durusau @ 7:54 pm

SILK – Link Discovery Framework Version 2.5 released

I was quite excited to see under “New Data Transformations”…”Merge Values of different inputs.”

But the documentation for Transformation must be lagging behind or I have a different understanding of what it means to “Merge Values of different inputs.”

Perhaps I should ask: What does SILK mean by “Merge Values of different inputs?”

Picking out an issue that is of particular interest to me is not meant to be a negative comment on the project. An impressive bit of work for any EU funded project.

Another question: Has anyone looked at the SILK- Link Specification Language (SILK-LSL) as an input into declaring equivalence/processing for arbitrary data objects? Just curious.

Robert Isele posted this announcement about SILK on October 3, 2011:

we are happy to announce version 2.5 of the Silk Link Discovery Framework for the Web of Data.

The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Using the declarative Silk – Link Specification Language (Silk-LSL), developers can specify the linkage rules data items must fulfill in order to be interlinked. These linkage rules may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language.

Linkage rules can either be written manually or developed using the Silk Workbench. The Silk Workbench is a web application which guides the user through the process of interlinking different data sources.

Version 2.5 includes the following additions to the last major release 2.4:

(1) Silk Workbench now includes a function to learn linkage rules from the reference links. The learning function is based on genetic programming and capable of learning complex linkage rules. Similar to a genetic algorithm, genetic programming starts with a randomly created population of linkage rules. From that starting point, the algorithm iteratively transforms the population into a population with better linkage rules by applying a number of genetic operators. As soon as either a linkage rule with a full f-Measure has been found or a specified maximum number of iterations is reached, the algorithm stops and the user can select a linkage rule.

(2) A new sampling tab allows for fast creation of the reference link set. It can be used to bootstrap the learning algorithm by generating a number of links which are then rated by the user either as correct or incorrect. In this way positive and negative reference links are defined which in turn can be used to learn a linkage rule. If a previous learning run has already been executed, the sampling tries to generate links which contain features which are not yet covered by the current reference link set.

(3) The new help sidebar provides the user with a general description of the current tab as well as with suggestions for the next steps in the linking process. As new users are usually not familiar with the steps involved in interlinking two data sources, the help sidebar currently provides basic guidance to the user and will be extended in future versions.

(4) Introducing per-comparison thresholds:

  • On popular request, thresholds can now be specified on each comparison.
  • Backwards-compatible: Link specifications using a global threshold can still be executed.

(5) New distance measures:

  • Jaccard Similarity
  • Dice’s coefficient
  • DateTime Similarity
  • Tokenwise Similarity, contributed by Florian Kleedorfer, Research Studios Austria

(6) New data transformations:

  • RemoveEmptyValues
  • Tokenizer
  • Merge Values of multiple inputs

(7) New DataSources and Outputs

  • In addition to reading from SPARQL endpoints, Silk now also supports reading from RDF dumps in all common formats. Currently the data set is held in memory and it is not available in the Workbench yet, but future versions will improve this.
  • New SPARQL/Update Output: In addition to writing the links to a file, Silk now also supports writing directly to a triple store using SPARQL/Update.

(8) Various improvements and bugfixes

———————————————————————————

More information about the Silk Link Discovery Framework is available at:

http://www4.wiwiss.fu-berlin.de/bizer/silk/

The Silk framework is provided under the terms of the Apache License, Version 2.0 and can be downloaded from:

http://www4.wiwiss.fu-berlin.de/bizer/silk/releases/

The development of Silk was supported by Vulcan Inc. as part of its Project Halo (www.projecthalo.com) and by the EU FP7 project LOD2-Creating Knowledge out of Interlinked Data (http://lod2.eu/, Ref. No. 257943).

Thanks to Christian Becker, Michal Murawicki and Andrea Matteini for contributing to the Silk Workbench.
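
Item (1) above, learning linkage rules by genetic programming, boils down to a familiar loop: random population, score against the reference links, keep the fittest, mutate, repeat. Here is a toy sketch of that loop, where a “rule” is just a weight per measure plus a threshold scored by f-measure. Silk’s learner evolves full rule trees, so this is only the shape of the idea.

```scala
// Toy illustration of learning a linkage rule by genetic search.
// A rule = weights over similarity measures + a threshold, scored by F-measure
// against reference links. Nothing like Silk's actual implementation.
import scala.util.Random

object ToyRuleLearner {

  case class Rule(weights: Vector[Double], threshold: Double)

  // Similarity vectors for candidate pairs, with a boolean reference judgment.
  type Example = (Vector[Double], Boolean)

  def predict(rule: Rule, sims: Vector[Double]): Boolean =
    rule.weights.zip(sims).map { case (w, s) => w * s }.sum / rule.weights.sum >= rule.threshold

  def fMeasure(rule: Rule, examples: Seq[Example]): Double = {
    val tp = examples.count { case (s, truth) => truth && predict(rule, s) }
    val fp = examples.count { case (s, truth) => !truth && predict(rule, s) }
    val fn = examples.count { case (s, truth) => truth && !predict(rule, s) }
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  def mutate(rule: Rule, rnd: Random): Rule =
    Rule(rule.weights.map(w => (w + rnd.nextGaussian() * 0.1).max(0.01)),
         (rule.threshold + rnd.nextGaussian() * 0.05).max(0.0).min(1.0))

  def learn(examples: Seq[Example], dims: Int, generations: Int = 50): Rule = {
    val rnd = new Random(0)
    var population = Vector.fill(20)(Rule(Vector.fill(dims)(rnd.nextDouble()), rnd.nextDouble()))
    for (_ <- 1 to generations) {
      val survivors = population.sortBy(r => -fMeasure(r, examples)).take(5)
      population = survivors ++ survivors.flatMap(r => Vector.fill(3)(mutate(r, rnd)))
    }
    population.maxBy(r => fMeasure(r, examples))
  }

  def main(args: Array[String]): Unit = {
    val examples = Seq((Vector(0.9, 0.8), true), (Vector(0.2, 0.9), false),
                       (Vector(0.85, 0.7), true), (Vector(0.3, 0.2), false))
    println(learn(examples, dims = 2))
  }
}
```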

Using Oracle Berkeley DB as a NoSQL Data Store

Filed under: BerkeleyDB,NoSQL — Patrick Durusau @ 7:53 pm

Using Oracle Berkeley DB as a NoSQL Data Store

I saw this on Twitter but waited until I could confirm with documentation I knew to exist on an Oracle website. 😉

I take this as a sign that storage, query and retrieval technology may be about to undergo a fundamental change. Unlike “big data,” which is just that, data that requires a lot of storage, how we store, query and retrieve data is a much more fundamental question.

BerkeleyDB as a storage engine may be a clue to future changes. What if there were a common substrate for database engines, SQL, NoSQL, graph, etc., onto which you imposed whatever higher level operations you wished to perform? Done with a copy-on-write mechanism so every view is persisted across the data set.

A common storage substrate would be a great boon to everyone. Think of three dimensional or even crystalline storage which isn’t that far away. Now would be a good time for the major vendors to start working towards a common substrate for database engines.
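
To be clear about what I mean by copy-on-write, here is a toy sketch: every write produces a new immutable snapshot, older views stay queryable, and higher-level engines would simply be interpretations of a snapshot. Pure speculation rendered as code.

```scala
// Toy of the "common substrate" idea: a copy-on-write key-value core where
// every write yields a new immutable snapshot and earlier views stay intact.
object CowSubstrate {

  // "Copy-on-write" falls out of persistence: updating returns a new snapshot
  // that shares structure with the old one.
  final case class Snapshot(version: Long, data: Map[String, String]) {
    def put(key: String, value: String): Snapshot =
      Snapshot(version + 1, data + (key -> value))
    def get(key: String): Option[String] = data.get(key)
  }

  def main(args: Array[String]): Unit = {
    val v0 = Snapshot(0, Map.empty)
    val v1 = v0.put("person:42:name", "Ada")
    val v2 = v1.put("person:42:city", "Berlin")

    // Every earlier view is still queryable.
    println(v1.get("person:42:city")) // None
    println(v2.get("person:42:city")) // Some(Berlin)
  }
}
```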

Adding Machine Learning to a Web App

Filed under: Artificial Intelligence,Machine Learning,Web Applications — Patrick Durusau @ 7:53 pm

Adding Machine Learning to a Web App by Richard Dallaway.

As Richard points out, the example is contrived and I don’t think you will be rushing off to add machine learning to a web app based on these slides.

That said, I think his point that you should pilot the data first is a good one.

If you misunderstand the data, then your results are not going to be very useful. Hmmm, maybe there is an AI/ML axiom in there somewhere. Probably already discovered, let me know if you run across it.

Who’s Afraid of Topic Maps? (see end for alt title)

Filed under: Marketing,Topic Maps — Patrick Durusau @ 7:52 pm

I saw a post asking why programmers don’t use topic maps. I replied at the time but, on reflection, I think the answer is simpler than I first thought.

What do ontologies, classification systems and terminologies all have in common?

Ontology

SUMO and Cyc are projects that would be admitted by most to fall under the rubric of “ontology.”

Classification System

The Library of Congress Subject Headings (LCSH) is an example of a classification system.

Terminology

SNOMED-CT self-identifies as a terminology, it makes a good example.

Substitute other projects that fall under these labels. It doesn’t change the following analysis.

What do these projects have in common?

SNOMED-CT and LCSH were both produced by organs of the U.S. government but SUMO and Cyc were not.

SUMO and Cyc both claim to be universal upper ontologies while SNOMED-CT and LCSH make no such claims.

All four have organizations that have grown up around them or fostered them. Is that a clue? Perhaps.

Question: If you had a question about SUMO, Cyc, SNOMED-CT or LCSH and needed an authoritative answer, who would you ask?

Answer: Ask the appropriate project. Yes? That is, only the maintainers of SUMO, Cyc, SNOMED-CT or LCSH can provide authoritative answers for their projects. Only their answers have interoperability with their systems.

Topic maps offer the capability of decentralized authority over terms, while maintaining the use of extended terms across topic map systems.

I know, that didn’t read very smoothly so let me try to explain by example.

In Semantic Integration: N-Squared to N+1 (and decentralized) I demonstrated how four (4) different authors could have four (4) different identifiers for Superman and write different things about Superman.

As a topic map author I notice they are talking about the same subject and unknown to those authors, I create a topic that merges all of their information about Superman together.

Those authors may continue to write different information about Superman using their identifier, but anyone using my topic map will see all the information gathered together.

The same reasoning applies to SNOMED-CT and LCSH, both of which have medical classifications that are different. The medical community and their patients could wait until the SNOMED-CT and LCSH organizations create a mapping between the two, but there are other options.

A medical researcher who notices a mapping between terms in SNOMED-CT and LCSH could mark that mapping, and future researchers (assuming they accept the mapping) would find information from either source, using either identifier, together. Creating sets of identifiers is just that simple. (I lie, it would take a useful interface as well, but that’s doable.)
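
If that prose is too hand-wavy, here is the merging rule as a toy sketch (not a TMDM implementation): topics merge when their sets of subject identifiers intersect, and the merged topic keeps everything said under any of those identifiers.

```scala
// Minimal sketch of the merging described above: topics merge when their sets
// of subject identifiers intersect, and the merged topic keeps everything.
object Merging {

  case class Topic(identifiers: Set[String], statements: Set[String])

  def merge(topics: Seq[Topic]): Seq[Topic] =
    topics.foldLeft(Vector.empty[Topic]) { (acc, t) =>
      val (overlapping, rest) = acc.partition(_.identifiers.exists(t.identifiers))
      val merged = overlapping.foldLeft(t) { (m, o) =>
        Topic(m.identifiers ++ o.identifiers, m.statements ++ o.statements)
      }
      rest :+ merged
    }

  def main(args: Array[String]): Unit = {
    val authors = Seq(
      Topic(Set("http://a.example/superman"), Set("works at the Daily Planet")),
      Topic(Set("http://b.example/kal-el"),   Set("born on Krypton")),
      // My mapping topic: it carries both identifiers, so the three merge into one.
      Topic(Set("http://a.example/superman", "http://b.example/kal-el"), Set.empty)
    )
    merge(authors).foreach(println)
  }
}
```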

Note the difference in process.

In one case, highly bureaucratic organizations who have a stake in the use of “their” ontology/classification/terminology make all the decisions about what maps are made and what those maps include.

In the topic map case, the person with the need for the information and the expertise notices a correspondence between information sources and adds a mapping on the spot. A breadcrumb, if you will, that may give future researchers more information than existed at that location before.

Oh, and one other issue, interoperability.

If you construct topic maps using the Topic Maps Data Model (TMDM) and one of the standard syntaxes, no matter what new material you add to it, it will work with other standard topic map software.

Try that with your local library catalog software for example.

The question isn’t why programmers aren’t using topic maps.

Topic maps enable decentralized decision making about information, without preset limits on that information, with interoperability by default.

The question is: Why aren’t you demanding the use of topic maps? (you have to answer that one for yourself)

Alt title: Topic Maps: Priesthood of the User

PS: Quite serious about the alt title. Remember when browsers were supposed to display webpages the way users wanted them? That didn’t last very long, did it? Maybe we should have another go at it with information?

October 3, 2011

Algorithms of the Intelligent Web Review

Algorithms of the Intelligent Web Review by Pearlene McKinley

From the post:

I have always had an interest in AI, machine learning, and data mining but I found the introductory books too mathematical and focused mostly on solving academic problems rather than real-world industrial problems. So, I was curious to see what this book was about.

I have read the book front-to-back (twice!) before I write this report. I started reading the electronic version a couple of months ago and read the paper print again over the weekend. This is the best practical book in machine learning that you can buy today — period. All the examples are written in Java and all algorithms are explained in plain English. The writing style is superb! The book was written by one author (Marmanis) while the other one (Babenko) contributed in the source code, so there are no gaps in the narrative; it is engaging, pleasant, and fluent. The author leads the reader from the very introductory concepts to some fairly advanced topics. Some of the topics are covered in the book and some are left as an exercise at the end of each chapter (there is a “To Do” section, which was a wonderful idea!). I did not like some of the figures (they were probably made by the authors not an artist) but this was only a minor aesthetic inconvenience.

The book covers four cornerstones of machine learning and intelligence, i.e. intelligent search, recommendations, clustering, and classification. It also covers a subject that today you can find only in the academic literature, i.e. combination techniques. Combination techniques are very powerful and although the author presents the techniques in the context of classifiers, it is clear that the same can be done for recommendations — as the BellKor team did for the Netflix prize.

Wonder if this will be useful in the Stanford AI course that starts next week with more than 130,000 students? Introduction to Artificial Intelligence – Stanford Class

I am going to order a copy, if for no other reason than to evaluate the reviewer’s claim of explanations “in plain English.” I have seen some fairly clever explanations of AI algorithms and would like to see how these stack up.

Automated extraction of domain-specific clinical ontologies – Weds Oct. 5th

Filed under: Bioinformatics,Biomedical,Ontology,SNOMED — Patrick Durusau @ 7:09 pm

Automated extraction of domain-specific clinical ontologies by Chimezie Ogbuji from Case Western Reserve University School of Medicine. 10 AM PT Weds Oct. 5, 2011.

Full NCBO Webinar schedule: http://www.bioontology.org/webinar-series

ABSTRACT:

A significant set of challenges in the use of large, source ontologies in the medical domain include: automated translation, customization of source ontologies, and performance issues associated with the use of logical reasoning systems to interpret the meaning of a domain captured in a formal knowledge representation.

SNOMED-CT and FMA are two reference ontologies that cover much of the domain of clinical medicine and motivate a better means for the re-use of such ontologies. In this presentation, the author will present a set of automated methods (and tools) for segmenting, merging, and surveying modules extracted from these ontologies for a specific domain.

I’m interested generally but in particular about the merging aspects, for obvious reasons. Another reason to be interested is some research I encountered recently on “outliers” in reasoning systems. Apparently there is a class of reasoning systems that simply “fall over” if they encounter a concept they recognize (or “think” they do) only to find it has some property (what makes it an “outlier”) that they don’t expect. Seems rather fragile to me but I haven’t finished running it to ground. Curious how these methods and tools handle the “outlier” issue.

SPEAKER BIO:

Chimezie is a senior research associate in the Clinical Investigations Department of the Case Western Reserve University School of Medicine where he is responsible for managing, developing, and implementing Clinical and Translational Science Collaborative (CTSC) projects as well as clinical, biomedical, and administrative informatics projects for the Case Comprehensive Cancer Center.

His research interests are in applied ontology, knowledge representation, content repository infrastructure, and medical informatics. He has a BS in computer engineering from the University of Illinois and is a part-time PhD student in the Case Western School of Engineering. He most recently appeared as a guest editor in IEEE Internet Computing’s special issue on Personal Health Records in the August 2011 edition.

DETAILS:

——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=107799137&UID=0&PW=NNjE3OWYzODk3&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=107799137&tollFree=0

Access code: 926 719 478

Write Yourself a Scheme in 48 Hours

Filed under: Functional Programming,Haskell,Scheme — Patrick Durusau @ 7:08 pm

Write Yourself a Scheme in 48 Hours

From the webpage:

Most Haskell tutorials on the web seem to take a language-reference-manual approach to teaching. They show you the syntax of the language, a few language constructs, and then have you construct a few simple functions at the interactive prompt. The “hard stuff” of how to write a functioning, useful program is left to the end, or sometimes omitted entirely.

This tutorial takes a different tack. You’ll start off with command-line arguments and parsing, and progress to writing a fully-functional Scheme interpreter that implements a good-sized subset of R5RS Scheme. Along the way, you’ll learn Haskell’s I/O, mutable state, dynamic typing, error handling, and parsing features. By the time you finish, you should be fairly fluent in both Haskell and Scheme.

Functional programming is gaining ground. Are you?

DataCleaner

Filed under: Data Analysis,Data Governance,Data Management,DataCleaner,Software — Patrick Durusau @ 7:08 pm

DataCleaner

From the website:

DataCleaner is an Open Source application for analyzing, profiling, transforming and cleansing data. These activities help you administer and monitor your data quality. High quality data is key to making data useful and applicable to any modern business.

DataCleaner is the free alternative to software for master data management (MDM) methodologies, data warehousing (DW) projects, statistical research, preparation for extract-transform-load (ETL) activities and more.

Err, “…cleansing data”? Did someone just call topic maps’ name? 😉

If it is important to eliminate duplicate data, then everyone using the duplicated data needs the updates and relationships attached to it, unless the duplication was the result of poor design or simply wasted drive space.

This looks like an interesting project and certainly one where topic maps are clearly relevant as one possible output.

Parallel Haskell Tutorial: The Par Monad [Topic Map Profiling?]

Filed under: Haskell,Parallel Programming,Parallelism — Patrick Durusau @ 7:07 pm

Parallel Haskell Tutorial: The Par Monad

Parallel programming will become largely transparent at some point but not today. 😉

It walks through parallel processing of Sudoku and k-means examples, as well as measuring performance and debugging. Code is available.

I think the debugging aspects of this tutorial stand out the most for me. Understanding a performance issue as opposed to throwing resources at it seems like the better approach to me.

I know that a lot of time has been spent by the vendors of topic maps software profiling their code, but I wonder if anyone has profiled a topic map?

That is, we make choices in topic map construction, some of which may result in more or less processing demand, to reach the same ends.

As topic maps grow in size, the “how” a topic map is written may be as important as the “why” certain subjects were represented and merged.

Have you profiled the construction of your topic maps? Comments appreciated.

ScalaDays 2011 Resources

Filed under: Programming,Scala — Patrick Durusau @ 7:06 pm

ScalaDays 2011 Resources

From the webpage:

Below, you’ll find links to any publicly-available material relating to presentations given at ScalaDays 2011.

This includes, but is not limited to:

  • slides
  • videos
  • projects referenced
  • source code
  • blog articles
  • follow-ups / corrections

A number of resources that will be of interest to Scala programmers.

Scala – [Java – *] Documentation – Marketing Topic Maps

Filed under: Interface Research/Design,Marketing,Scala,Topic Maps — Patrick Durusau @ 7:06 pm

Scala Documentation

As usual, when I am pursuing one lead to interesting material for or on topic maps, another pops up!

The Scala Days 2011 wiki had the following note:

Please note that the Scala wikis are in a state of flux. We strongly encourage you to add content but avoid creating permanent links. URLs will frequently change. For our long-term plans see this post by the doc czar.

A post that was followed by the usual comments about re-inventing the wheel, documentation being produced but not known to many, etc.

I mentioned topic maps as a method to improve program documentation to a very skilled Java/Topic Maps programmer, who responded: How would that be an improvement over Javadoc?

How indeed?

Hmmm, well, for starters the API documentation would not be limited to a particular program. That is to say, for common code, the API documentation for, say, a package could be shared across several independent programs, so that when the package documentation is improved for one, it is improved for all.

Second, it is possible, although certainly not required, to maintain API documentation as “active” documentation, that is to say it has a “fixed” representation such as HTML, only because we have chosen to render it that way. Topic maps can reach out and incorporate content from any source as part of API documentation.

Third, this does not require any change in current documentation systems, which is fortunate because that would require the re-invention of the wheel in all documentation systems for source code/programming documentation. A wheel that continues to be re-invented with every new source repository and programming language.

So long as the content is addressable (it is hard to think of content that is non-addressable, do you have a counter-example?), topic maps can envelop and incorporate that content with other content in a meaningful way. Granting that incorporating some content requires more effort than other content. (The pointer “Go ask Bill with a street address” would be unusual but not unusable.)

The real question is, as always, is it worth the effort in a particular context to create such a topic map? Answers to that are going to vary depending upon your requirements and interests.

Comments?

PS: For extra points, how would you handle the pointer “Go ask Bill + street address” so that the pointer and its results can be used in a TMDM instance for merging purposes? It is possible. The result of any identifier can be represented as an IRI. That much TBL got right. It was the failure to realize that it is necessary to distinguish between the use of an address as an identifier versus a locator that has caused so much wasted effort in the SW project.

Well, that and an identifier imperialism that requires every identifier to be transposed into IRI syntax. Given all the extant identifiers, with new ones being invented every day, let’s just say that replacing all extant identifiers comes under the “fairy tales we tell children” label, where they all live happily ever after.
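
For the “Go ask Bill + street address” case, one answer is simply to wrap any (authority, value) pair in an IRI and to record, separately, whether it is being used as a subject identifier or a subject locator. The urn:x-identifier scheme below is made up purely for illustration:

```scala
// Wrap an arbitrary (authority, value) identifier in an IRI by encoding, and
// record explicitly whether it is used as a subject identifier or a subject
// locator. The urn:x-identifier scheme is hypothetical, for illustration only.
import java.net.URLEncoder

object AnyIdentifierAsIri {

  sealed trait Use
  case object SubjectIdentifier extends Use // identifies the subject
  case object SubjectLocator    extends Use // addresses the information resource

  def toIri(authority: String, value: String): String = {
    def enc(s: String) = URLEncoder.encode(s, "UTF-8")
    s"urn:x-identifier:${enc(authority)}:${enc(value)}"
  }

  def main(args: Array[String]): Unit = {
    val iri = toIri("ask-bill", "1600 Example St, Springfield")
    println((iri, SubjectIdentifier)) // usable for TMDM merging
  }
}
```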

Our big data/total data survey is now live [the 451 Group]

Filed under: BigData,Data Warehouse,Hadoop,NoSQL,SQL — Patrick Durusau @ 7:05 pm

Our big data/total data survey is now live [the 451 Group]

The post reads in part:

The 451 Group is conducting a survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

In return for your participation, you will receive a copy of a forthcoming long-format report covering introducing Total Data, The 451 Group’s concept for explaining the changing data management landscape, which will include the results. Respondents will also have the opportunity to become members of TheInfoPro’s peer network.

Just a word about the survey.

Question 10 reads:

What is the primary platform used for storing and querying from each of the following types of data?

Good question, but you have to choose one of the offered answers or pick “other” (and say what “other” means); you are not allowed to skip any type of data.

Data types are:

  • Customer Data
  • Transactional Data
  • Online Transaction Data
  • Domain-specific Application Data (e.g., Trade Data in Financial Services, and Call Data in Telecoms)
  • Application Log Data
  • Web Log Data
  • Network Log Data
  • Other Log Files
  • Social Media/Online Data
  • Search Log
  • Audio/Video/Graphics
  • Other Documents/Content

Same thing happens for Question 11:

What is the primary platform used for each of the following analytics workloads?

Eleven required answers that I won’t bother to repeat here.

As a consultant I really don’t have serious iron/data on the premises, but that doesn’t seem to have occurred to the survey designers. Nor does it seem to have occurred to them that even a major IT installation might not have all forms of data or analytics.

My solution? I just marked Hadoop on Questions 10 and 11 so I could get to the rest of the survey.

Q12. Which are the top three benefits associated with each of the following data management technologies?

Q13. Which are the top three challenges associated with each of the following data management technologies?

Q14. To what extent do you agree with the following statements? (which includes: “The enterprise data warehouse is the single version of the truth for business intelligence.”)

Questions 12 – 14 all require answers to all options.

Note the clever first agree/disagree statement for Q.14.

Someone will conduct a useful survey of business opinions about big data and likely responses to it.

Hopefully with a technical survey of the various options and their advantages/disadvantages.

Please let me know when you see it, I would like to point people to it.

(I completed this form on Sunday, October 2, 2011, around 11 AM Eastern time.)

What resources & practices (teaching Haskell) [or learning n]

Filed under: Education,Haskell,Teaching — Patrick Durusau @ 7:04 pm

What resources & practices (teaching Haskell)

Clifford Beshers answers (in part, the most important part):

I have two recommendations: teach them the simplest definitions of the fundamentals; read programs with them, out loud, like children’s books, skipping nothing.

The second one, reading aloud, is one that I have advocated for standards editors. Mostly because it helps you slow down and not “skim” text that you already “know.”

And the same technique can be applied for self-study of any subject, whether it is Haskell, some other programming language, mathematics, or some other domain.

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

Filed under: Computational Linguistics,Linguistics,Scala — Patrick Durusau @ 7:04 pm

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

I mention this not only because it looks like a good Scala tutorial series but also because it is being developed in connection with a course on computational linguistics at UT Austin (sorry, University of Texas at Austin, USA).
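
To give a flavor of the ground the tutorial covers, a few lines of Scala with tuples, Lists, and the usual methods on Lists and Strings, dressed up as a computational-linguistics word count:

```scala
// A quick taste of tuples, Lists, and methods on Lists and Strings.
object TuplesListsStrings {
  def main(args: Array[String]): Unit = {
    val sentence = "the cat sat on the mat"

    // String -> List of tokens
    val tokens: List[String] = sentence.split(" ").toList

    // List methods: groupBy, map, sortBy; the result is a list of tuples
    val counts: List[(String, Int)] =
      tokens.groupBy(identity).map { case (w, ws) => (w, ws.length) }.toList.sortBy(-_._2)

    // Tuple destructuring and String methods
    val (word, freq) = counts.head
    println(s"most frequent: ${word.toUpperCase} ($freq times)")
    println(counts.mkString(", "))
  }
}
```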

The cross-over between computer programming and computational linguistics illustrates the artificial nature of the divisions we make between disciplines and professions.

October 2, 2011

Super Simple Data Integration with RESTx: An Example

Filed under: Enterprise Service Bus (ESB),Mule,RESTx — Patrick Durusau @ 6:38 pm

Super Simple Data Integration with RESTx: An Example

From the webpage:

Most people who ever worked in real-world data integration projects agree that at some point custom code becomes necessary. Pre-fabricated connectors, filter and pipeline logic can only go so far. And to top it off, using those pre-fabricated integration logic components often becomes cumbersome for anything but the most trivial data integration and processing tasks.

With RESTx – a platform for the rapid creation of RESTful web services – we recognize that custom code will always remain part of serious data integration tasks. As developers, we already know about a concise, standardized and very well defined way to express what we want: The programming languages we use every day! Why should we have to deal with complex, unfamiliar configuration files or UI tools that still restrict us in what we can do, if it is often so much more concise and simple to just write down in code what you want to have done?

Therefore, RESTx embraces custom code: Writing it and expressing your data integration logic with it is made as simple as possible.

Let me illustrate how straight forward it is to integrate data resources using just a few lines of clear, easy to read code.

In my experience “custom code” means “undocumented code.” But leaving that misgiving to one side.

RESTx gets us part way to the TMDM by its use of URIs.

We just have to use them as appropriate to create TMDM output for further integration with existing resources. That is we have to decide which of these URIs are subjectIdentifiers and which function as subjectLocators as part of our integration activity.

I have just started to wander around the Mule site. Feel free to suggest examples or places I need to look at sooner than others. Examples of RESTx output as topic maps would be nice! Hint, hint.

PS: I’ve got a small data set I need to clean up for a post next week, but I am also planning a post on URIs: simple or complex identification?

“Algorithm” is not a !@%#$@ 4-letter Word

Filed under: Algorithms,Graphs — Patrick Durusau @ 6:37 pm

“Algorithm” is not a !@%#$@ 4-letter Word by Jamis Buck (via @CompSciFact)

Very nice algorithm and graph presentation.

Instructive on algorithms and graphs, and despite the fact that more than a little skill went into the presentation, it is simple enough technically that it ought to encourage all of us to do more such explanations for topic maps. (Well, I am talking to myself in particular, to be honest. I keep wanting to do the “perfect” presentation rather than the presently “possible” ones.)

DMO (Data Mining Ontology) Foundry

Filed under: Data Mining,Ontology — Patrick Durusau @ 6:37 pm

Email from Agnieszka Lawrynowicz advises:

We are happy to announce the opening of the DMO (Data Mining Ontology) Foundry (http://www.dmo-foundry.org/), an initiative designed to promote the development of ontologies representing the data mining domain. The DMO Foundry will gather the most significant ontologies concerning data mining and the different algorithms and resources that have been developed to support the knowledge discovery process.

Each ontology in the DMO Foundry is freely available for browsing and open discussion, as well as collaborative development, by data mining specialists all over the world. We cordially welcome all interested researchers and practitioners to join the initiative. To find out how you can participate in ontology development, click on the “How to join” tab at the top of the DMO-Foundry page.

To access and navigate an ontology, and contribute to it, click on the “Ontologies” tab, then on your selected ontology and its OWL Browser tool. As you browse, you can click on the “Comment” button to share your insights, criticisms, and suggestions on the concept or relation you are currently exploring. For more general comments, go to the “Forum” tab and post a message to initiate a discussion thread. Please note that until the end of March 2012, this site is being road-tested on the Data Mining OPtimization (DMOP) Ontology developed in the EU FP7 ICT project e-LICO (2009-2012). We are in contact with authors of other DM ontologies, but if you are developing a relevant ontology that you think we are not aware of, please set up a post in the Forum. You are also invited to contact us by writing to info@dmo-foundry.org.

Sad to say but they have omitted topic maps from their ontology. I am writing up a post for the authors. At a minimum, the terms with PSIs at http://psi.topicmaps.org. Others?

This sounds like a link I need to forward to the astronomy folks I mentioned in > 100 New KDD Models/Methods Appear Every Month. Could at least use the class listing as a starter set for mining journal literature.

Oracle rigs MySQL for NoSQL-like access

Filed under: MySQL,NoSQL — Patrick Durusau @ 6:36 pm

Oracle rigs MySQL for NoSQL-like access by Joab Jackson at CIO.

Joab writes:

In an interview in May with the IDG News Service, Tomas Ulin, Oracle vice president of MySQL engineering, described a project to bring the NoSQL-like speed of access to SQL-based MySQL.

“We feel very strongly we can combine SQL and NoSQL,” he said. “If you have really high-scalability performance requirements for certain parts of your application, you can share the dataset” across both NoSQL and SQL interfaces.

The key to Oracle’s effort is the use of Memcached, which Internet-based service providers, Facebook being the largest, have long used to quickly serve MySQL data to their users. Memcached creates a hash table of commonly accessed database items that is stored in a server’s working memory for quick access, by way of an API (application programming interface).

Memcached would provide a natural non-SQL interface for MySQL, Ulin said. Memcached “is heavily used in the Web world. It is something [webmasters] already have installed on their systems, and they know how to use [it]. So we felt that would be a good way to provide NoSQL access,” Ulin said.

Oracle’s thinking is that the Memcached interface can serve as an alternative access point for MySQL itself. Much of the putative sluggishness of SQL-based systems actually stems from the overhead of supporting a fully ACID-based query infrastructure needed to execute complex queries, industry observers said. By providing a NoSQL alternative access method, Oracle could offer customers the best of both worlds–a database that is fully ACID-compliant and has the speed of a NoSQL database.

With Memcached you are not accessing the data through SQL, but by a simple key-value lookup. “You can do a simple key-value-type lookup and get very optimal performance,” Ulin said.

The technology would not require any changes to MySQL itself. “We can just plug it in,” Ulin said. He added that Oracle was considering including this technology in the next version of MySQL, version 5.6.
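
The access pattern Ulin describes is plain get/set by key: MySQL 5.6’s plugin maps those keys onto InnoDB rows on the server side, while the client just speaks the memcached protocol. A minimal client-side sketch, assuming the spymemcached Java client (API details may differ between versions):

```scala
// Key-value access over the memcached protocol; no SQL from the client's side.
// Assumes the spymemcached client library on the classpath.
import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient

object KeyValueLookup {
  def main(args: Array[String]): Unit = {
    val client = new MemcachedClient(new InetSocketAddress("localhost", 11211))

    // Just a key and a value: set, then look it up again.
    client.set("person:42:name", 0, "Ada") // 0 = no expiry
    println(client.get("person:42:name"))  // Ada

    client.shutdown()
  }
}
```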

While you are thinking about what that will mean for using MySQL engines, remember Stratified B-Tree and Versioned Dictionaries.

Suddenly, being able to map the structures of data stores as subjects (née topics) and to merge them, reliably, with the structures of other data stores doesn’t seem all that far-fetched, does it? The thing to remember is that all that “big data” was stored in some “big structure,” a structure that topic maps can view as subjects to be represented by topics.

Not to mention knowing when you are accessing content (addressing) or authoring information about the content (identification).

Neo4j – Manual and Questions

Filed under: Neo4j — Patrick Durusau @ 6:36 pm

Peter Neubauer tweeted over the weekend asking for information or topics for the Neo4j Manual or its Appendix B: Questions.

Now is your chance to:

  • Take that slow read through the documentation you have been promising yourself, with a hot cup of coffee nearby, or
  • Ask that question that has been bothering you but you never got time to write up properly (just go ahead and ask), or
  • Demonstrate your ability to state clearly (or otherwise) information or an answer that should appear in the manual or questions.

This could be the last time Peter asks so don’t miss what could be your only opportunity to contribute to Neo4j. 😉 (Just kidding. But do make the effort anyway.)

Let Peter know or comment via #disqus.

Let’s make an Elemental Type System in Haskell – Part I

Filed under: Games,Haskell — Patrick Durusau @ 6:35 pm

Let’s make an Elemental Type System in Haskell – Part I

From the post:

Recently I fell in love (again) with Haskell, and I’ve decided to start a simple toy project just to stretch my Haskell muscle fibers.

Meanwhile, I’ve started Final Fantasy VII again, so I thought I would build a simple Elemental Type System. I don’t know how complex this project will get, but a lot of fun awaits.

By post three (3) the author changes the name to …Elemental Battle System….

Same series.

To put a topic map cast (shadow?) on this adventure think about the creatures, places, events and players as all having identities. Identities that we want to manage in a topic map. Their properties change during the game. So we need to update the topic map, but how often? Does a player “identity” change if they die?

Perhaps not from a system perspective but what about in the game? Or in the view of other players? Assume those differing “views” are also subjects that we want to represent. How do we manage game subjects as they move in and out of those “views?”


Other posts in this series are:

Let’s make an Elemental Type System in Haskell – Part II
Let’s make an Elemental Battle System in Haskell – Part III
Players are coming next.

Mule ESB

Filed under: Enterprise Service Bus (ESB),Mule — Patrick Durusau @ 6:35 pm

Mule ESB

The Mule ESB has an amusing stat line that reads: “103,000+ developers use Mule, 3,200 companies in production, 0 headaches.”

I don’t know if I believe the headache stat but if the other two are even approximately correct, this sounds like a place to be pushing topic maps.

To get the flavor of the community:

From “What is Mule ESB?”:

What is Mule ESB?

Mule ESB is a lightweight Java-based enterprise service bus (ESB) and integration platform that allows developers to connect applications together quickly and easily, enabling them to exchange data. Mule ESB enables easy integration of existing systems, regardless of the different technologies that the applications use, including JMS, Web Services, JDBC, HTTP, and more.

The key advantage of an ESB is that it allows different applications to communicate with each other by acting as a transit system for carrying data between applications within your enterprise or across the Internet. Mule ESB includes powerful capabilities that include:

  • Service creation and hosting — expose and host reusable services, using Mule ESB as a lightweight service container
  • Service mediation — shield services from message formats and protocols, separate business logic from messaging, and enable location-independent service calls
  • Message routing — route, filter, aggregate, and re-sequence messages based on content and rules
  • Data transformation — exchange data across varying formats and transport protocols

(graphic omitted)

Do I need an ESB?

Mule and other ESBs offer real value in scenarios where there are at least a few integration points or at least 3 applications to integrate. They are also well suited to scenarios where loose coupling, scalability and robustness are required.

Below is a quick ESB selection checklist. To read a much more comprehensive take on when to select an ESB, read this article written by MuleSoft founder and CTO Ross Mason: To ESB or not to ESB.

  1. Are you integrating 3 or more applications/services?
  2. Will you need to plug in more applications in the future?
  3. Do you need to use more than one type of communication protocol?
  4. Do you need message routing capabilities such as forking and aggregating message flows, or content-based routing?
  5. Do you need to publish services for consumption by other applications?

I am exploring the Mule site, looking in particular for data integration entries that may be of interest to topic mappers. More anon.

Apache Flume incubation wiki

Filed under: Flume,Hadoop,Probabilistic Models — Patrick Durusau @ 6:34 pm

Apache Flume incubation wiki

From the website:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

A number of resources for Flume.

Will “data flows” as the dominant means of accessing data be a consequence of an environment where a “local copy” of data is no longer meaningful or an enabler of such an environment? Or both?

I think topic maps would do well to develop models for streaming and perhaps probabilistic merging or even probabilistic creation of topics/associations from data streams. Static structures only give the appearance of certainty.
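
As a back-of-the-envelope sketch of what probabilistic merging over a stream might look like: each incoming record either merges into the best-matching existing topic, when a (toy) match score clears a threshold, or starts a new topic.

```scala
// Toy streaming merge: each incoming record merges into the best-matching
// existing topic when the match score clears a threshold, else starts a new one.
object StreamingMerge {

  case class Topic(properties: Map[String, Set[String]])

  // Crude match probability: fraction of the record's property values already on the topic.
  def matchScore(t: Topic, record: Map[String, String]): Double = {
    val hits = record.count { case (k, v) => t.properties.getOrElse(k, Set.empty).contains(v) }
    if (record.isEmpty) 0.0 else hits.toDouble / record.size
  }

  def absorb(t: Topic, record: Map[String, String]): Topic =
    Topic(record.foldLeft(t.properties) { case (props, (k, v)) =>
      props.updated(k, props.getOrElse(k, Set.empty) + v)
    })

  def process(stream: Iterator[Map[String, String]], threshold: Double): Vector[Topic] =
    stream.foldLeft(Vector.empty[Topic]) { (topics, record) =>
      val scored = topics.zipWithIndex.map { case (t, i) => (matchScore(t, record), i) }
      scored.sortBy(-_._1).headOption match {
        case Some((score, i)) if score >= threshold =>
          topics.updated(i, absorb(topics(i), record))
        case _ =>
          topics :+ Topic(record.map { case (k, v) => k -> Set(v) })
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Iterator(
      Map("host" -> "web-1", "ip" -> "10.0.0.5"),
      Map("ip" -> "10.0.0.5", "service" -> "httpd"),
      Map("host" -> "db-1")
    )
    process(events, threshold = 0.5).foreach(println)
  }
}
```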
