Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 11, 2012

AWS Documentation Now Available on the Kindle

Filed under: Amazon Web Services AWS — Patrick Durusau @ 4:35 pm

AWS Documentation Now Available on the Kindle

From the post:

AWS documentation is now available on the Kindle – if this is all you need to know, start here and you’ll have access to the new documents in seconds.

I “purchased” (the actual cost is $0.00) the EC2 Getting Started Guide and had it delivered to my trusty Kindle DX, where it looked great:

[graphic omitted]

You can highlight, annotate, and search the content as desired.

We’ve uploaded 43 documents so far; others will follow shortly.

Two observations:

For the “cloud” Kindle (what I use on Linux to read Kindle titles), it should be possible to select multiple AWS documentation titles in a single batch download. Yes?

Ahem, at least the “Analyzing Big Data with AWS” guide did not have an index.

Indexing all the AWS titles together (not entirely auto-magically) would make AWS documentation a cut above its competitors. (At least a goal to start with. Later versions can mix in titles from publishers, blogs, etc.)

April 10, 2012

Ontopia 5.2.1 and the CLASSPATH

Filed under: Ontopia — Patrick Durusau @ 6:48 pm

In the latest release of the Ontopia suite, you will find the following comments in the installation document:

Verifying

Now that you’ve set up your CLASSPATH environment variable you can verify it by issuing the following command:

java net.ontopia.Ontopia

It will run and produce the following output if it can find all the classes required:

Ontopia Topic Maps Engine [version]
Success: All required classes found.

If it fails you will get output similar to the following:

Ontopia Topic Maps Engine [version]
Class 'org.apache.log4j.BasicConfigurator' not found. Please add log4j.jar to your CLASSPATH.

The message is hopefully self-explanatory.

Well…, not quite.

At least in the latest release (5.2.1), I encountered the following error message from both Windows and Ubuntu setups:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder
at org.slf4j.LoggerFactory.getSingleton(LoggerFactory.java:230)

To cure this problem, I now have the CLASSPATH setting:

CLASSPATH=/home/patrick/working/ontopia-5.2.1/lib/ontopia-engine-5.2.1.jar:/home/patrick/working/ontopia-5.2.1/lib/log4j-1.2.14.jar:/home/patrick/working/ontopia-5.2.1/lib/slf4j-log4j12-1.5.11.jar:
export CLASSPATH

I say it cures the problem because I can replicate the problem at will on either box and this fix cures it on both boxes (different JDKs).

So, in addition to ontopia-engine-x.x.x.jar, add to your CLASSPATH:

log4j-1.2.14.jar
slf4j-log4j12-1.5.11.jar

Norwegian National Broadcasting

Filed under: Ontopia,Topic Maps — Patrick Durusau @ 6:46 pm

Norwegian National Broadcasting

I wrote to Lars Marius Garshol recently about some examples of the use of topic maps in education in Norway. This was one of the resources in his response.

NRK/Skole is an educational site for school children publishing sound and video clips from the archives of the Norwegian National Broadcasting Company, the Norwegian equivalent of the BBC. A team of editors scour the archives to find suitable content, then cut it into clips suitable for use in an educational setting and attach metadata to these clips. The content ranges from interviews with historical figures and clips from the daily news to documentaries and even comedy gags.

All clips on the site are represented as topics in the topic map, and associated with topics representing people and subjects that the clips are about. In addition, clips are also attached to the programs they were taken from, providing three navigational entry points into the portal: person, subject, or program.

In addition, clips are connected with knowledge goals taken from the national curriculum, which has been published as a topic map by the Ministry of Education. Thus, teachers can navigate the curriculum for their subject to find clips supporting any particular knowledge goal in the curriculum.

It occurs to me that merging content from this topic map with one on the same news subjects, say from Fox News, could be quite amusing. Even without translation. I don’t remember if news clips count as “fair use” or not. You would need to check with legal counsel before re-use of Fox content.

BTW, other examples of topic maps are welcome!

Popularizing public data

Filed under: Graphics,Visualization — Patrick Durusau @ 6:45 pm

Popularizing public data

Kaiser Fung writes:

Dona Wong, whose graphics book I reviewed two years ago (link), has recently joined the New York Fed to lead an effort to visualize data. This is exciting because consumers are unlikely to learn anything from Excel spreadsheets, HTML tables, etc. which are the typical formats of public data.

One of their efforts is visualization of mortgage delinquency data in the Tri-state and Long Island regions (link). This animation reminds me of the CDC obesity map, to which I gave a positive review in 2005 (link). This type of chart is great for revealing the evolution of a metric over time and over space. The sliding control is a very nice extra touch. This allows readers to freeze-frame the map and examine the details.

Kaiser walks the reader through his suggestions on improving the visualizations in question. It is one thing for someone to dislike a graphic. Quite another (and useful) to have someone explore what does or doesn’t work. And why?

Enjoy!

The Trend Point

Filed under: Analytics,Blogs,Open Source — Patrick Durusau @ 6:45 pm

The Trend Point

Described by a “sister” publication as:

ArnoldIT has rolled out The Trend Point information service. Published Monday through Friday, the information services focuses on the intersection of open source software and next-generation analytics. The approach will be for the editors and researchers to identify high-value source documents and then encapsulate these documents into easily-digested articles and stories. In addition, critical commentary, supplementary links, and important facts from the source document are provided. Unlike a news aggregation service run by automated agents, librarians and researchers use the ArnoldIT Overflight tools to track companies, concepts, and products. The combination of human-intermediated research with Overflight provide an executive or business professional with a quick, easy, and free way to keep track of important developments in open source analytics. There is no charge for the service.

I was looking for something different to say other than just reporting a new data stream and found this under the “about” link:

I write for fee columns for Enterprise Technology Management, Information Today, Online Magazine, and KMWorld plus a few occasional items. My content reaches somewhere between one and three people each month.

I started to monetize Beyond Search in 2008. I have expanded our content services to white papers about a search, content processing or analytics. These reports are prepared for a client. The approach is objective and we include information that makes these documents suitable for the client’s marketing and sales efforts. Clients work closely with the Beyond Search professional to help ensure that the message is on target and clear. Rates are set to be within reach of organizations regardless of their size.

You can get coverage in this or one of our other information services, but we charge for our time. Stated another way: If you want a story about you, your company, or your product, you will be expected to write a check or pay via PayPal. We do not do news. We do this. (emphasis added to the first paragraph)

For some reason, I would have expected Stephen E. Arnold to reach more than …between one and three people each month. That sounds low to me. 😉

The line “We do not do news” makes me wonder what the University of Southampton paid to have a four-page document described as a “dissertation.” See: New Paper: Linked Data Strategy for Global Identity. Or for that matter, what will it cost to get into “The Trend Point?”

Thoughts?

HBase Hackathon at Cloudera

Filed under: Cloudera,HBase — Patrick Durusau @ 6:45 pm

HBase Hackathon at Cloudera by David S. Wang

From the post:

Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.

More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!

If you get the opportunity, attend.

Studies (American Library Association) show that building social relationships, and then continuing them, helps to sustain virtual communities.

Here is your chance to get to know other HBase folks.

Metablogging MADlib

Filed under: Data Analysis,SQL — Patrick Durusau @ 6:44 pm

Metablogging MADlib

Joseph M. Hellerstein writes:

When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.

I kicked off MADlib as a part-time consulting project for Greenplum during my sabbatical in 2010-2011. As I built out the first two methods (FM and CountMin sketches) and an installer, Greenplum started assembling a team of their own engineers and data scientists to overlap with and eventually replace me when I returned to campus. They also developed a roadmap of additional methods that their customers wanted in the field. Eighteen months later, Greenplum now contributes the bulk of the labor, management and expertise for the project, and has built bridges to leading academics as well.

Like they said at Woodstock, “if you don’t think SQL is all that weird….” you might want to stop by the MADlib project. (I will have to go listen to the soundtrack. That may not be an exact quote.)

This is an important project for database analytics in an SQL context.

Infinite Weft (Exploring the Old Aesthetic)

Filed under: Griswold,Punch Cards,Uncategorized — Patrick Durusau @ 6:44 pm

Infinite Weft (Exploring the Old Aesthetic)

Jer Thorp writes:

How can a textile function as a digital object? This is a central question of Infinite Weft, a project that I’ve been working on for the last few months. The project is a collaboration with my mother, Diane Thorp, who has been weaving for almost 40 years – it’s a chance for me to combine my usually screen-based digital practice with her extraordinary hand-woven work. It’s also an exploration of mathematics, computational history, and the concept of pattern.

Most of us probably know that the loom played a part in the early days of computing – the Jacquard loom was the first machine to use punch cards, and its workings were very influential in the early design of programmable machines (In my 1980s basement this history was actually physically embodied; sitting about 10 feet away from my mother’s two floor looms, on an Ikea bookshelf, sat a box of IBM punch cards that we mostly used to make paper airplanes out of). But how many of us know how a loom actually works? Though I have watched my mother weave many times, it didn’t take long at the start of this project to realize that I had no real idea how the binary weaving patterns called ‘drawdowns’ ended up making a pattern in a textile.

[graphic omitted]

To teach myself how this process actually happened, I built a functional software loom, where I could see the pattern manifest itself in the warp and weft (if you have Chrome you can see it in action here – better documentation is coming soon). This gave me a kind of sandbox which let me see how typical weaving patterns were constructed, and what kind of problems I could expect when I started to write my own. And run into problems, I did. My first attempts at generating patterns were sloppy and boring (at best) and the generative methods I was applying weren’t very successful. Enter Ralph E. Griswold.

By this point, between “concept of pattern,” “punch cards,” “software loom,” and “Ralph E. Griswold,” I was completely hooked.
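To make the drawdown idea concrete, here is a toy sketch of the arithmetic behind a weaving draft. The threading, tie-up, and treadling values are made-up sample data, and display conventions vary from weaver to weaver:

threading = [0, 1, 2, 3] * 4   # shaft carrying each warp thread (sample data)
treadling = [0, 1, 2, 3] * 4   # treadle pressed on each weft pick (sample data)
tie_up = [                     # tie_up[treadle][shaft] == 1 means that treadle lifts that shaft
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]

def drawdown(threading, treadling, tie_up):
    # Cell (pick, thread) is 1 when that thread's shaft is lifted on that pick.
    return [[tie_up[treadle][shaft] for shaft in threading] for treadle in treadling]

for row in drawdown(threading, treadling, tie_up):
    print("".join("#" if cell else "." for cell in row))

The binary matrix that prints is the drawdown; change the threading or the tie-up and a different textile pattern falls out.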

Comments?

Tiny New Zealand Company Brings Cool Microsoft Video Tech To The World

Filed under: Searching,Video — Patrick Durusau @ 6:41 pm

Tiny New Zealand Company Brings Cool Microsoft Video Tech To The World

Whitney Grace writes:

New Zealand is known for its beautiful countryside and all the popular movies filmed there, sheep, and Dot Com. Business Insider reports there is another item to add to the island nation’s “list of reasons to be famous,” “Tiny New Zealand Company Brings Cool Microsoft Video Tech to the World.” The small startup GreenButton used search technology from Microsoft Research and created InCus, a service that transcribes audio and video files to make them searchable. It is aimed at corporate enterprises to make their digital media libraries searchable.

Where there is searching, there are subjects.

Take that as a given.

The startup: GreenButton.

Apparently speech transcription. No motion detection/analysis for indexing. That would be a lot tougher.

Interesting opportunity for an “add-on” to this service that uses a topic map to map to other resources.

One service invents the potential for another.

A new framework for innovation in journalism: How a computer scientist would do it

Filed under: Journalism,News,Subject Identity — Patrick Durusau @ 6:40 pm

A new framework for innovation in journalism: How a computer scientist would do it

Andrew Phelps writes:

What if journalism were invented today? How would a computer scientist go about building it, improving it, iterating it?

He might start by mapping out some fundamental questions: What are the project’s values and goals? What consumer needs would it satisfy? How much should be automated, how much human-powered? How could it be designed to be as efficient as possible?

Computer science Ph.D. Nick Diakopoulos has attempted to create a new framework for innovation in journalism. His new white paper, commissioned by CUNY’s Tow-Knight Center for Entrepreneurial Journalism, does not provide answers so much as a different way to come up with questions.

Diakopoulos identified 27 computing concepts that could apply to journalism — think natural language processing, machine learning, game engines, virtual reality, information visualization — and pored over thousands of research papers to determine which topics get the most (and least) attention. (There are untapped opportunities in robotics, augmented reality, and motion capture, it turns out.)

He thinks computer science and journalism have a lot in common, actually. They are both fundamentally concerned with information. Acquiring it, storing it, modifying it, presenting it.

Suggest you read his paper in full: Cultivating the Landscape of Innovation in Computational Journalism.

Intrigued by the idea of gauging the opportunities along a continuum of activities. Could be a stunning visual of how subject identity is handled across activities and/or technologies.

Interested?

Big Data Reference Model (And Panopticons)

Filed under: BigData,Data Structures,Data Warehouse,Panopticon — Patrick Durusau @ 6:40 pm

Big Data Reference Model

Michael Nygard writes:

A project that approaches Big Data as a purely technical challenge will not deliver results. It is about more than just massive Hadoop clusters and number-crunching. In order to deliver value, a Big Data project has to enable change and adaptation. This requires that there are known problems to be solved. Yet, identifying the problem can be the hardest part. It’s often the case that you have to collect some information to even discover what problem to solve. Deciding how to solve that problem creates a need for more information and analysis. This is an empirical discovery loop similar to that found in any research project or Six Sigma initiative.

Michael takes you on a sensible loop of discovery and evaluation, making you more likely (no guarantees) to succeed with your next “big data” project. In particular see the following caution:

… it is tempting to think that we could build a complete panopticon: a universal data warehouse with everything in the company. This is an expensive endeavor, and not a historically successful path. Whether structured or unstructured, any data store is suited to answer some questions but not others. No matter how much you invest in building the panopticon, there will be dimensions you don’t think to support. It is better to skip the massive up-front time and expense, focusing instead on making it very fast and easy to add new data sources or new elements to existing sources.

I like the term panopticon. In part because of its historical association with prisons.

Data warehouses/structures are prisons, better suited for one purpose (or group of purposes) than another.

We must build prisons for today and leave tomorrow’s prisons for tomorrow.

The problem that topic maps try to address is how to safely transfer prisoners from today’s prisons to tomorrow’s. That is made more complicated by some people still using old prisons, sometimes generations of prisons older than most people. Not to mention the variety of prisons across businesses, governments, and nationalities.

All of them have legitimate purposes and serve some purpose now, else their users would have migrated their prisoners to a new prison.

I will have to think about the prison metaphor. I think it works fairly well.

Comments?

April 9, 2012

Where am I, who am I?

Filed under: Mapping,Maps — Patrick Durusau @ 4:57 pm

Where am I, who am I?

Pete Warden writes:

“Queequeg was a native of Rokovoko, an island far away to the West and South. It is not down in any map; true places never are.”

Where am I right now? Depending on who I’m talking to, I’m in SoMa, San Francisco, South Park, the City, or the Bay Area. What neighborhood is my apartment in? Craigslist had it down as Castro when it was listed. Long-time locals often describe it as Duboce Triangle, but people less concerned with fine differences lump it into the Lower Haight, since I’m only two blocks from Haight Street.

When I first started working with geographic data, I imagined this was a problem to be solved. There had to be a way to cut through the confusion and find a true definition, a clear answer to the question of “Where am I?”.

What I’ve come to realize over the last few years is that geography is a folksonomy. Sure, there’s political boundaries, but the only ones that people pay much attention to are states and countries. City limits don’t have much effect on people’s descriptions of where they live. Just take a look at this map of Los Angeles’ official boundaries:

Pete is onto a more general principle.

Semantics are a folksonomy, the precision of which varies depending upon the reason for your interest and your community.

Biblical scholars split hairs, sorry, try to correct errors committed by others, by citing imagined nuances of languages used thousands of years ago. To the average person on the street, the Bible may as well have been written in King James English. Not that one is more precise than the other, just a different community and different habits for reading the text.

The question is: which community do you hail from, and for what purpose are you asking about semantics? We can short-circuit a lot of discussion by recognizing that communities vary in their semantics. Each to his/her own.

Topincs 6.1.0 – (Works for > 97% of all U.S. Businesses)

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 4:33 pm

Topincs 6.1.0

From the release notes:

This release has shown good performance under commercial conditions with:

  • 30 concurrent users
  • 60.000 topics
  • 200.000 associations
  • 200.000 occurrences
  • 7.000 files
  • + 3 smaller stores

If that sounds like a small number of concurrent users, consider the following statistics from 2008 on businesses in the United States:

Total businesses 27,757,676
Nonemployers (no payroll) 21,708,021
Firms with 1 to 4 employees 3,617,764
Firms with 5 to 9 employees 1,044,065
Firms with 10 to 20 employees 633,141

The next break is 20 to 99 employees.

At 30 concurrent users, Topincs supports more simultaneous users than the entire headcount of over 97% of all U.S. businesses.
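A quick back-of-the-envelope check of that figure, using only the 2008 categories listed above (the published break points do not land exactly on 20 employees, so treat this as approximate):

total = 27757676
under_20 = 21708021 + 3617764 + 1044065 + 633141   # nonemployers through firms with 10 to 20 employees
print(under_20, round(under_20 / total, 3))        # 27002991 and roughly 0.973, i.e. a bit over 97%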

Different way to think about marketing a product.

For < 20 employees, there are 26,646,290 potential purchasers. For > 20 employees there are 655,587 potential purchasers.

Which target sounds larger to you?

Anyone care to supply the numbers for other geographic areas?

Play Color Cipher and Visual Cryptography

Filed under: Cryptography — Patrick Durusau @ 4:32 pm

Play Color Cipher and Visual Cryptography by Ajay Ohri.

From the post:

I was just reading up on my weekly to-read list and came across this interesting method. It is called Play Color Cipher-

Each Character ( Capital, Small letters, Numbers (0-9), Symbols on the keyboard ) in the plain text is substituted with a color block from the available 18 Decillions of colors in the world [11][12][13] and at the receiving end the cipher text block (in color) is decrypted in to plain text block. It overcomes the problems like “Meet in the middle attack, Birthday attack and Brute force attacks [1]”.

It also reduces the size of the plain text when it is encrypted into cipher text by 4 times, without any loss of content. Cipher text occupies very less buffer space; hence transmitting through channel is very fast. With this the transportation cost through channel comes down.

If your topic map software needs a cryptography option, this could be an interesting one to explore.

Reference article: A Block Cipher Generation using Color Substitution.
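For a rough feel of the substitution idea only, here is a toy keyed character-to-color mapping. It is not the Play Color Cipher construction from the paper, which draws on a far larger color space:

import random
import string

# Toy illustration: substitute each character with a 24-bit RGB value chosen by a keyed shuffle.
alphabet = string.ascii_letters + string.digits + string.punctuation + " "

def make_color_key(seed):
    rng = random.Random(seed)
    colors = rng.sample(range(256 ** 3), len(alphabet))   # one distinct color per character
    to_color = dict(zip(alphabet, colors))
    to_char = {color: ch for ch, color in to_color.items()}
    return to_color, to_char

to_color, to_char = make_color_key(seed=42)
cipher = [to_color[ch] for ch in "topic maps"]             # "encrypt" to a sequence of colors
plain = "".join(to_char[color] for color in cipher)        # reverse the substitution
print(cipher)
print(plain)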

Iowa Government Gets a Digital Dictionary Provided By Access

Filed under: Indexing,Law,Legal Informatics,Thesaurus — Patrick Durusau @ 4:32 pm

Iowa Government Gets a Digital Dictionary Provided By Access

Whitney Grace writes:

How did we get by without the invention of the quick search to look up information? We used to use dictionaries, encyclopedias, and a place called the library. Access Innovations, Inc. has brought the Iowa Legislature General Assembly into the twenty-first century.

The write-up “Access Innovations, Inc. Creates Taxonomy for Iowa Code, Administrative Code and Acts” tells us the data management industry leader has built a thesaurus that allows the Legislature to search its library of proposed laws, bills, acts, and regulations. Users can also add their unstructured data to the thesaurus. Access used their Data Harmony software to provide subscription-based delivery and they built the thesaurus on MAIstro.

Sounds very much like a topic map-like project, doesn’t it? I will be following up for more details.

Cricinfo StatsGuru Database for Statistical and Graphical Analysis

Filed under: Data,ESPN — Patrick Durusau @ 4:32 pm

Cricinfo StatsGuru Database for Statistical and Graphical Analysis by Ajay Ohri.

From the post:

However ESPN has unleashed the API (including both free and premium) for Developers at http://developer.espn.com/docs.

and especially these sports http://developer.espn.com/docs/headlines#parameters

[parameters omitted]

What puzzled me at first was the title and then Ajay jumping right in to illustrate the use of the parameters, before I could understand what sport was being described.

Ok, grin, laugh, whatever. 😉 I did not recognize Cricinfo.

I am sure that many of you will find creative ways to incorporate sports information into your topic maps.

The New World of Massive Data Mining

Filed under: BigData,Data Mining — Patrick Durusau @ 4:32 pm

The New World of Massive Data Mining

From the webpage:

Every time you go on the Internet, make a phone call, send an email, pass a traffic camera or pay a bill, you create data, electronic information. In all, 2.5 quintillion bytes of data are created each day. This massive pile of information from all sources is called “Big Data.” It gets stored somewhere, and everyday the pile gets bigger. Government and industry are finding new ways to analyze it. Last week the administration announced an initiative to aid the development of Big Data computing. A panel of experts join guest host Tom Gjelten to discuss the opportunities — for business, science, medicine, education, and security … but also the privacy concerns.

Guests

John Villasenor, senior fellow at the Brookings Institution and professor of electrical engineering at UCLA.

Michael Leiter, senior counselor, Palantir Technologies; former director, National Counterterrorism Center.

Dr. Suzanne Iacono, co-chair, Big Data Senior Steering Group, and senior science adviser, Directorate for Computer and Information Science and Engineering at the National Science Foundation.

Daphne Koller, professor, Stanford Artificial Intelligence Laboratory.

You can listen to the show, download the podcast, or read a transcript of the discussion.

May help shape your rhetoric with NPR listeners who caught the show.

First Look – Pervasive RushAnalyzer

Filed under: Hadoop,Knime,Pervasive RushAnalyzer — Patrick Durusau @ 4:31 pm

First Look – Pervasive RushAnalyzer

James Taylor writes:

Pervasive is best known for its data integration products but has recently been developing and releasing a series of products focused on analytics. RushAnalyzer is a combination of the KNIME data mining workbench (reviewed here) and Pervasive DataRush, a platform for parallelization and automatic scaling of data manipulation and analysis (reviewed here).

In the combined product, the base KNIME workbench has been extended for faster processing of larger data sets (big data) with a particular focus on use by analysts without any skills in parallelism or Hadoop programming. Pervasive has added parallelized KNIME nodes that include data access, data preparation and analytic modeling routines. KNIME’s support for extension means that KNIME’s interface is still what you use to define the modeling process but these processes can use the DataRush nodes to access and process larger volumes of data, read/write Hadoop-based data and automatically take full advantage of multi core, multi processor servers and clusters (including operations on Amazon’s EMR).

There is plenty of life left in closed source software but have you noticed the growing robustness of open source software?

I don’t know if that is attributable to the “open source” model so much as to commercial enterprises that find contributing professional software skills to “open source” projects a cost-effective way to get more programming for their money.

Think about it. They can hire some of the best software talent around, who then associate with more world class programming talent than any one company is likely to have in house.

And, the resulting “product” is the result of all those world class programmers and not just the investment of one vendor. (So their investment is less than if they were creating a product on their own.)

Not to mention that any government or enterprise who wants to use the software will need support contracts from, you guessed it, the vendors who contributed to the creation of the software.

And we all know that the return on service contracts is an order of magnitude or greater than the return on software products.

Support your local open source project. Your local vendor will be glad you did.

DEAP: Distributed Evolutionary Algorithms in Python

Filed under: DEAP,Evoluntionary,Python — Patrick Durusau @ 4:31 pm

DEAP: Distributed Evolutionary Algorithms in Python

From the website:

DEAP is intended to be an easy-to-use distributed evolutionary algorithm library in the Python language. Its two main components are modular and can be used separately. The first module is a Distributed Task Manager (DTM), which is intended to run on a cluster of computers. The second part is the Evolutionary Algorithms in Python (EAP) framework.

Components

DTM

DTM is a distributed task manager that is able to spread workload over a bunch of computers using a TCP or an MPI connection.

DTM includes the following features:

  • Easy to use parallelization paradigms
  • Offers an interface similar to Python's multiprocessing module
  • Basic load balancing algorithm
  • Works over mpi4py
  • Support for TCP communication manager

EAP

EAP is the evolutionary core of DEAP; it provides data structures, methods, and tools to design any kind of evolutionary algorithm. It works in perfect harmony with DTM, allowing easy parallelization of any demanding evolutionary task.

EAP includes the following features:

  • Genetic algorithm using any imaginable representation
    • List, Array, Set, Dictionary, Tree, Numpy Array (tip revision), etc.
  • Genetic programing using prefix trees
    • Loosely typed, Strongly typed
    • Automatically defined functions
  • Evolution strategies (including CMA-ES)
  • Multi-objective optimisation (NSGA-II, SPEA-II)
  • Co-evolution of multiple populations
  • Parallelization of the evaluations (and more)
  • Hall of Fame of the best individuals that lived in the population
  • Checkpoints that take snapshots of a system regularly
  • Benchmarks module containing most common test functions
  • Genealogy of an evolution (that is compatible with NetworkX)
  • Examples of alternative algorithms : Particle Swarm Optimization, Differential Evolution

If you are interested in evolutionary approaches to data mining, this is not a bad place to start.
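To give a flavor of the EAP side, here is a minimal sketch along the lines of DEAP's well-known OneMax example (evolve a bit string toward all ones). The module and function names follow DEAP's documentation, but details can shift between releases, so treat it as illustrative:

import random
from deap import base, creator, tools, algorithms

# Maximize a single fitness value: the count of ones in the bit string.
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, 50)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def eval_one_max(individual):
    return (sum(individual),)            # DEAP fitnesses are tuples

toolbox.register("evaluate", eval_one_max)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=100)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)
best = tools.selBest(pop, 1)[0]
print(sum(best), "ones out of", len(best))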

The Database Nirvana (And an Alternative)

Filed under: Database,Open Source — Patrick Durusau @ 4:31 pm

The Database Nirvana

Alex Popescu of myNoSQL sides with Jim Webber in thinking we need to avoid a “winner-takes-it-all-war” among database advocates.

Saying that people should pick the best store for their data model is a nice sentiment but I rather doubt it will change long or short term outcomes between competing data stores.

I don’t know that anything will but I do have a concrete suggestion that might stand a chance in the short run at any rate.

We are all familiar with “to many eyes all bugs are shallow” and other Ben Franklin-like sayings.

OK, so rather than seeing another dozen, two dozen, or more data stores this year (that is, 2012), why not pick an existing store, learn its community, and offer your talents: writing code, tests, and debugging; creating useful documentation; creating tutorials; etc.

The data store community, if you look for database projects at Sourceforge for example, is like a professional sports league with too many teams. The talent is so spread out that there are only one or two very successful teams and the others, well, are not so great.

If all of the existing data store projects picked up another 100 volunteers each, there would be enough good code, documentation and other resources to hold off both major/minor vendors and other store projects.

The various store projects would have to welcome volunteers. That means doing more than protesting that the way things are done now is the best possible way for them to be done.

If we don’t continue to have a rich ecosystem of store projects, it won’t be entirely the fault of vendors or winner-takes-it-all wars. A lack of volunteers, and of acceptance of volunteers, will share part of the blame.

EBNF Parser & Syntax Diagram Renderer

Filed under: Graphics,Visualization — Patrick Durusau @ 4:30 pm

EBNF Parser & Syntax Diagram Renderer

Just what the title says. Written in PHP, it can be used standalone or as a DokuWiki plugin.

“Railroad” diagrams.

Sam Hunting forwarded this to my attention.

April 8, 2012

Casellas et al. on Linked Legal Data: Improving Access to Regulatory Information

Filed under: Law - Sources,Legal Informatics,Linked Data — Patrick Durusau @ 4:21 pm

Casellas et al. on Linked Legal Data: Improving Access to Regulatory Information

From the post:

Dr. Núria Casellas of the Legal Information Institute at Cornell University Law School, and colleagues, have posted Linked Legal Data: Improving Access to Regulatory Information, a poster presented at Bits on Our Mind (BOOM) 2012, held 4 April 2012 at the Cornell University Department of Computing and Information Science, in Ithaca, New York, USA.

Here are excerpts from the poster:

The application of Linked Open Data (LOD) principles to legal information (URI naming of resources, assertions about named relationships between resources or between resources and data values, and the possibility to easily extend, update and modify these relationships and resources) could offer better access and understanding of legal knowledge to individual citizens, businesses and government agencies and administrations, and allow sharing and reuse of legal information across applications, organizations and jurisdictions. […]

With this project, we will enhance access to the Code of Federal Regulations (a text with 96.5 million words in total; ~823MB XML file size) with an RDF dataset created with a number of semantic-search and retrieval applications and information extraction techniques based on the development and the reuse of RDF product taxonomies, the application of semantic matching algorithms between these materials and the CFR content (Syntactic and Semantic Mapping), the detection of product-related terms and relations (Vocabulary Extraction), obligations and product definitions (Definition and Obligations Extraction). […]

You know, lawyers always speculated whether the “Avoid Probate” books (for non-U.S. readers, publications to help citizens avoid the use of lawyers for inheritance issues) were in fact shadow publications of the bar association to promote the use of lawyers.

You haven’t seen a legal mess until someone tries “self-help” in a legal context. Probably doubles if not triples the legal fees involved.

Still, this may be an interesting source of data for services for lawyers and foolhardy citizens.

I shudder though at the “sharing of legal information across jurisdictions.” In most of the U.S., a creditor can claim say a car where a mortgage is past due. Without going to court. In Louisiana, at least a number of years ago, there was another name for self-help repossession. It was called felony theft. Like I said, self-help when it comes to the law isn’t a good idea.

Context matters: Search can’t replace a high-quality index

Filed under: eBooks,Indexing,Marketing — Patrick Durusau @ 4:21 pm

Context matters: Search can’t replace a high-quality index

Joe Wikert writes:

I’ve never consulted an index in an ebook. From a digital content point of view, indexes seem to be an unnecessary relic of the print world. The problem with my logic is that I’m thinking of simply dropping a print index into an ebook, and that’s as shortsighted as thinking the future of ebooks in general is nothing more than quick-and-dirty conversions of print books. In this TOC podcast interview, Kevin Broccoli, CEO of BIM Publishing Services, talks about how indexes can and should evolve in the digital world.

Key points from the full video interview (below) include:

  • Why bother with e-indexes? — Searching for raw text strings completely removes context, which is one of the most valuable attributes of a good index. [Discussed at the 1:05 mark.]
  • Index mashups are part of the future — In the digital world you should be able to combine indexes from books on common topics in your library. That’s exactly what IndexMasher sets out to do. [Discussed at 3:37.]
  • Indexes with links — It seems simple but almost nobody is doing it. And as Kevin notes, wouldn’t it be nice for ebook retailers to offer something like this as part of the browsing experience? [Discussed at 6:24.]
  • Index as cross-selling tool — The index mashup could be designed to show live links to content you own but also include entries without links to content in ebooks you don’t own. Those entries could offer a way to quickly buy the other books, right from within the index. [Discussed at 7:28.]
  • Making indexes more dynamic — The entry for “Anderson, Chris” in the “Poke The Box” index on IndexMasher shows a simple step in this direction by integrating a Google and Amazon search into the index. [Discussed at 9:42.]

Apologies, but I left out the links to the interview to encourage you to visit the original. It is really worth your time.

Do these points sound like something a topic map could do? 😉

BTW, I am posting a note to IndexMasher and will advise. Sounds very interesting.

Nature Publishing Group releases linked data platform

Filed under: Linked Data,LOD,Semantic Web — Patrick Durusau @ 4:21 pm

Nature Publishing Group releases linked data platform

From the post:

Nature Publishing Group (NPG) today is pleased to join the linked data community by opening up access to its publication data via a linked data platform. NPG’s Linked Data Platform is available at http://data.nature.com.

The platform includes more than 20 million Resource Description Framework (RDF) statements, including primary metadata for more than 450,000 articles published by NPG since 1869. In this first release, the datasets include basic citation information (title, author, publication date, etc) as well as NPG specific ontologies. These datasets are being released under an open metadata license, Creative Commons Zero (CC0), which permits maximal use/re-use of this data.

NPG’s platform allows for easy querying, exploration and extraction of data and relationships about articles, contributors, publications, and subjects. Users can run web-standard SPARQL Protocol and RDF Query Language (SPARQL) queries to obtain and manipulate data stored as RDF. The platform uses standard vocabularies such as Dublin Core, FOAF, PRISM, BIBO and OWL, and the data is integrated with existing public datasets including CrossRef and PubMed.

More information about NPG’s Linked Data Platform is available at http://developers.nature.com/docs. Sample queries can be found at http://data.nature.com/query.

You may find it odd that I would cite such a resource on the same day as penning Technology speedup graph where I speak so harshly about the Semantic Web.

On the contrary, disagreement about the success/failure of the Semantic Web and its retreat to Linked Data is an example of conflicting semantics. Conflicting semantics not being a “feature” of the Semantic Web.

Besides, Nature is a major science publisher and their experience with Linked Data is instructive.

Such as the NPG specific ontologies. 😉 Not what you were expecting?

This is a very useful resource and the Nature Publishing Group is to be commended for it.

The creation of metadata about the terms used within articles, and the relationships between those terms as well as to other publications, will make it more useful still.
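If you want to poke at the platform yourself, a query against a SPARQL endpoint generally takes the following shape. This is a sketch only: the endpoint path, the predicates, and the use of the SPARQLWrapper package are my assumptions, not NPG's documented schema, so check http://developers.nature.com/docs for the real details.

from SPARQLWrapper import SPARQLWrapper, JSON   # assumes the SPARQLWrapper package is installed

# Illustrative only: endpoint path and predicates are assumptions, not NPG's documented schema.
endpoint = SPARQLWrapper("http://data.nature.com/sparql")
endpoint.setQuery("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?article ?title ?date
    WHERE {
        ?article dc:title ?title ;
                 dc:date  ?date .
        FILTER regex(?title, "genome", "i")
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["date"]["value"], row["title"]["value"])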

Indexing the content of Gene Ontology with apache SOLR

Filed under: Bioinformatics,Biomedical,Gene Ontology,Solr — Patrick Durusau @ 4:21 pm

Indexing the content of Gene Ontology with apache SOLR by Pierre Lindenbaum.

Pierre walks you through the use of Solr to index GeneOntology. As with all of his work, impressive!

Of course, one awesome post deserves another! So Pierre follows with:

Apache SOLR and GeneOntology: Creating the JQUERY-UI client (with autocompletion)

So you get to learn jQuery UI stuff as well.
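If you would rather experiment before reading Pierre's code, a rough sketch of pushing a few Gene Ontology root terms into a local Solr core with the pysolr client might look like the following. The core name and field names are assumptions of mine; Pierre's posts define their own schema.

import pysolr   # assumes a local Solr instance and the pysolr package

# Illustrative core and field names only; not the schema from Pierre's posts.
solr = pysolr.Solr("http://localhost:8983/solr/geneontology", timeout=10)

go_terms = [
    {"id": "GO:0008150", "name": "biological_process", "namespace": "biological_process"},
    {"id": "GO:0003674", "name": "molecular_function", "namespace": "molecular_function"},
    {"id": "GO:0005575", "name": "cellular_component", "namespace": "cellular_component"},
]

solr.add(go_terms)      # index the documents
solr.commit()

for hit in solr.search("name:molecular*"):
    print(hit["id"], hit["name"])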

Technology speedup graph

Filed under: Semantic Diversity,Semantic Web — Patrick Durusau @ 4:21 pm

Technology speedup graph

Andrew Gelman posts an interesting graphic showing the adoption of various technologies from 1900 forward. See the post for the lineage on the graph and the details. Good graphic.

What caught my eye for topic maps was the rapid adoption of the Internet/WWW and the now well recognized failure of the Semantic Web.

You may feel like disputing my evaluation of the Semantic Web. Recall that agents were predicted to be roaming the Semantic Web by this point in Tim Berners-Lee’s first puff piece in Scientific American. After a few heady years of announcements that realization was just around the corner, we got the 21st-century technology equivalent of the long retreat (think Napoleon).

Now the last gasp is Linked Data, where the “meaning” of URIs is to be determined on Mount W3C and then imposed on the rest of us.

Make no mistake, I think the WWW was a truly great technological achievement.

But the technological progress graph prompted me to wonder, yet again, how is the WWW different from the Semantic Web?

Not sure this is helpful but consider the level of agreement on semantics required by the WWW versus the Semantic Web.

For the WWW, there are a handful of RFCs that specify the treatment of syntax. That is, addresses and the composition of resources that you find at those addresses. Users may attach semantics to those resources, but none of those semantics are required for processing or delivery of the resources.

That is, for the WWW to succeed, all we need is agreement on the addressing and processing of resources, and not at all on their semantics.

A resource can have a crazy quilt of semantics attached to it by users, diverse, inconsistent, contradictory, because its addressing and processing is independent of those semantics and those who would impose them.

Resources on the WWW certainly have semantics, but processing those resources doesn’t depend on our agreement on those semantics.

So, the semantic agreement of the WWW = ~ 0. (Leaving aside the certainly true contention that protocols have semantics.)

The semantic agreement required by the Semantic Web is “web scale agreement.” That is, everyone who encounters a semantic has to either honor it or break that part of the Semantic Web.

Wait until after you watch the BBC News or Al Jazeera (English), الجزيرة.نت, before you suggest universal semantics are just around the corner.

An R programmer looks at Julia

Filed under: Julia,Marketing,R — Patrick Durusau @ 4:20 pm

An R programmer looks at Julia by Douglas Bates.

Douglas writes:

In January of this year I first saw mention of the Julia language in the release notes for LLVM. I mentioned this to Dirk Eddelbuettel and later we got in contact with Viral Shah regarding a Debian package for Julia.

There are many aspects of Julia that are quite intriguing to an R programmer. I am interested in programming languages for “Computing with Data”, in John Chambers’ term, or “Technical Computing”, as the authors of Julia classify it. I believe that learning a programming language is somewhat like learning a natural language in that you need to live with it and use it for a while before you feel comfortable with it and with the culture surrounding it.

A common complaint for those learning R is finding the name of the function to perform a particular task. In writing a bit of Julia code for fitting generalized linear models, as described below, I found myself in exactly the same position of having to search through documentation to find how to do something that I felt should be simple. The experience is frustrating but I don’t know of a way of avoiding it. One word of advice for R programmers looking at Julia, the names of most functions correspond to the Matlab/octave names, not the R names. One exception is the d-p-q-r functions for distributions, as I described in an earlier posting. [bold emphasis added in last paragraph]

Problem: Programming languages with different names for the same operation.

Suggestions anyone?

😉

Do topic maps spring to mind?

Perhaps with select match language, select target language and auto-completion capabilities?

An unintrusive window or pop-up for text entry of a name (or signature) in the match language that displays the equivalent name/signature (would Hamming distance work here?) in the target language. Using XTM/CTM as the format would enable distributed (and yet interchangeable) construction of editorial artifacts for various programming languages.
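As a toy sketch of the lookup side of such a tool (the table is a tiny illustrative sample, and difflib's similarity ratio stands in for the Hamming distance idea):

import difflib

# Tiny illustrative sample of "same operation, different name" across languages.
r_to_julia = {
    "length": "length",
    "nrow": "size",        # size(A, 1) in Julia/Matlab style
    "ncol": "size",        # size(A, 2)
    "cbind": "hcat",
    "rbind": "vcat",
    "apply": "mapslices",
}

def suggest(name, table, cutoff=0.6):
    # Exact match first; otherwise fuzzy-match the entered name against known names.
    if name in table:
        return table[name]
    close = difflib.get_close_matches(name, table.keys(), n=1, cutoff=cutoff)
    return table[close[0]] if close else None

print(suggest("rbind", r_to_julia))   # vcat
print(suggest("rbnd", r_to_julia))    # a fuzzy match still finds vcat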

Not the path to world domination or peace but on the other hand, it would be useful.

Lumia Review Cluster

Filed under: Clustering,Marketing — Patrick Durusau @ 4:20 pm

Lumia Review Cluster

Matthew Hurst has clustered reviews of the Nokia Lumia 900.

His blog post is an image of the cluster at a point in time so you will have to go to http://d8taplex.com/track/microsoft-widescreen.html to interact with the cluster.

What would you add to this cluster to make it more useful? Such as sub-clustering strictly reviews of the Nokia Lumia 900 or perhaps clustering based on the mentioning of other phones for comparison?

Other navigation?

You have the subject (what is being buzzed about) and you have a relative measure of the “buzz.” How do we take advantage of that “buzz?”

Navigating to a place is great fun but most people expect something at the end of the journey.

What does your topic map enable at the end of a journey?

Data and the Liar’s Paradox

Filed under: Data,Data Quality,Marketing — Patrick Durusau @ 4:20 pm

Data and the Liar’s Paradox by Jim Harris.

Jim writes:

“This statement is a lie.”

That is an example of what is known in philosophy and logic as the Liar’s Paradox because if “this statement is a lie” is true, then the statement is false, which would in turn mean that it’s actually true, but this would mean that it’s false, and so on in an infinite, and paradoxical, loop of simultaneous truth and falsehood.

I have never been a fan of the data management concept known as the Single Version of the Truth, and I often quote Bob Kotch, via Tom Redman’s excellent book, Data Driven: “For all important data, there are too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone’s part; it is simply a fact of life. Getting everyone to work from a Single Version of the Truth may be a noble goal, but it is better to call this the One Lie Strategy than anything resembling truth.”

More business/data quality reading.

Imagine my chagrin after years of studying literary criticism in graduate seminary classes (don’t ask, it’s a long and boring story) to discover that business types already know “truth” is a relative thing.

What does that mean for topic maps?

I would argue that with careful design we can capture several points of view, using a point of view as our vantage point.

As opposed to strategies that can only capture a single point of view, their own.

Capturing multiple viewpoints will be a hot topic when “big data” starts to hit the “big fan.”

Books That Influenced My Thinking: Quality, Productivity and Competitive Position

Filed under: Data Quality,Marketing — Patrick Durusau @ 4:19 pm

Books That Influenced My Thinking: Quality, Productivity and Competitive Position by Thomas Redman.

From the post:

I recently learned that Technics Publications, led by Steve Hoberman, is re-issuing one of my favorites, Data and Reality by William Kent. It led me to conclude I ought to review some of the books that most influenced my thinking about data quality. (I’ll include Data and Reality, when the re-issue appears). I am explicitly excluding books on data quality per se.

First up is Dr. Deming’s Quality, Productivity and Competitive Position (QPC). First published in 1982, to me this is Deming at his finest. The more famous Out of The Crisis came out about the same time and the two cover much the same material. But QPC is raw, powerful Deming. He is fed up with the economic malaise of corporate America at the time and he rails against top management for simply not understanding the role of quality in marketplace competition.

Data quality is a “hot” topic these days. I thought it might be useful to see what business perspective resources were available on the topic.

Both to learn management “speak” about data quality and how solutions are evaluated.

QPC sounds a bit dated (1982) but I rather doubt management has changed that much, albeit the terms by which management is described have probably changed a lot. Not the terms used by their employees but the terms used by consultants who are being paid by management. 😉

Not to forget that topic maps as information products, information services or software, all face the same issues of quality, productivity and competitive position.
