Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 7, 2013

The Semantic Web Is Failing — But Why? (Part 2)

Filed under: RDF,Semantic Web — Patrick Durusau @ 4:30 pm

Should You Be Using RDF?

Pat Hayes (editor of RDF Semantics) and Richard Cyganiak (a linked data expert) had this exchange on the RDF Working Group discussion list:

Cyganiak: The text stresses that the presence of ill-typed literals does not constitute an inconsistency.

Cyganiak: But why does the distinction matter?

Hayes: I am not sure what you mean by “the distinction” here. Why would you expect that an ill-typed literal would produce an inconsistency? Why would the presence of an ill-typed literal make a triple false?

Cyganiak: Is there any reason anybody needs to know about this distinction who isn’t interested in the arcana of the model theory?

Hayes: I’m not sure what you consider to be “arcana”. Someone who cannot follow the model theory probably shouldn’t be using RDF. (emphasis added) Re: Ill-typed vs. inconsistent? (Mon, 12 Nov 2012 01:58:51 -0600)

When challenged on the need to follow model theory, Hayes retreats, but only slightly:

Well, it was rather late and I had just finished driving 2400 miles so maybe I was a bit abrupt. But I do think that anyone who does not understand what “inconsistent” means should not be using RDF, or at any rate should only be using it under the supervision of someone who *does* know the basics of semantic notions. Its not like nails versus metallurgy so much as nails versus hammers. If you are trying to push the nails in by hand, you probably need to hire a framer. (emphasis added) Re: Ill-typed vs. inconsistent? (Mon, 12 Nov 2012 09:58:52 -0600)
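For readers wondering what an “ill-typed” literal even looks like, here is a minimal sketch in Python with rdflib (the example is mine, not from the thread): the lexical form "abc" is not in the lexical space of xsd:integer, which is exactly the situation being argued over.

```python
# A minimal sketch, assuming rdflib is installed; the graph and names are invented.
from rdflib import Graph

data = """
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:bob ex:age "abc"^^xsd:integer .   # ill-typed: "abc" is not an xsd:integer
"""

g = Graph()
g.parse(data=data, format="turtle")  # rdflib stores the triple rather than rejecting it
print(len(g))                        # 1

# Whether the graph is "inconsistent" is a question for the model theory
# (D-entailment), not for the parser; that is the distinction in the exchange above.
```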

A portion of the Introduction to RDF Semantics reads:

RDF is an assertional language intended to be used to express propositions using precise formal vocabularies, particularly those specified using RDFS [RDF-VOCABULARY], for access and use over the World Wide Web, and is intended to provide a basic foundation for more advanced assertional languages with a similar purpose. The overall design goals emphasise generality and precision in expressing propositions about any topic, rather than conformity to any particular processing model: see the RDF Concepts document [RDF-CONCEPTS] for more discussion.

Exactly what is considered to be the ‘meaning’ of an assertion in RDF or RDFS in some broad sense may depend on many factors, including social conventions, comments in natural language or links to other content-bearing documents. Much of this meaning will be inaccessible to machine processing and is mentioned here only to emphasize that the formal semantics described in this document is not intended to provide a full analysis of ‘meaning’ in this broad sense; that would be a large research topic. The semantics given here restricts itself to a formal notion of meaning which could be characterized as the part that is common to all other accounts of meaning, and can be captured in mechanical inference rules.

This document uses a basic technique called model theory for specifying the semantics of a formal language. Readers unfamiliar with model theory may find the glossary in appendix B helpful; throughout the text, uses of terms in a technical sense are linked to their glossary definitions. Model theory assumes that the language refers to a ‘world‘, and describes the minimal conditions that a world must satisfy in order to assign an appropriate meaning for every expression in the language. A particular world is called an interpretation, so that model theory might be better called ‘interpretation theory’. The idea is to provide an abstract, mathematical account of the properties that any such interpretation must have, making as few assumptions as possible about its actual nature or intrinsic structure, thereby retaining as much generality as possible. The chief utility of a formal semantic theory is not to provide any deep analysis of the nature of the things being described by the language or to suggest any particular processing model, but rather to provide a technical way to determine when inference processes are valid, i.e. when they preserve truth. This provides the maximal freedom for implementations while preserving a globally coherent notion of meaning.

Model theory tries to be metaphysically and ontologically neutral. It is typically couched in the language of set theory simply because that is the normal language of mathematics – for example, this semantics assumes that names denote things in a set IR called the ‘universe‘ – but the use of set-theoretic language here is not supposed to imply that the things in the universe are set-theoretic in nature. Model theory is usually most relevant to implementation via the notion of entailment, described later, which makes it possible to define valid inference rules.

Readers should read RDF Semantics and decide for themselves whether they understand “inconsistent” as defined therein, noting that Richard Cyganiak, a linked data expert, did not.


The next series starts with Saving the “Semantic” Web (Part 1)

The Semantic Web Is Failing — But Why? (Part 1)

Filed under: Identity,OWL,RDF,Semantic Web — Patrick Durusau @ 4:29 pm

Introduction

Before proposing yet another method for identification and annotation of entities in digital media, it is important to draw lessons from existing systems. Failing systems in particular, so their mistakes are not repeated or compounded. The Semantic Web is an example of such a system.

Doubters of that claim should read the report Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus by Web Data Commons.

Web Data Commons is a structured data research project based at the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. Supported by PlanetData and LOD2 research projects, the Web Data Commons is not opposed to the Semantic Web.

But the Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus document reports:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together. (emphasis added)

To sharpen the point, RDFa is used by 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008).

Or more generally:

Parsed HTML URLs 3,005,629,093
URLs with Triples 369,254,196

Or, in layperson’s terms, for this web corpus parsed HTML URLs outnumber URLs with triples by approximately eight to one.
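The arithmetic behind those figures, for anyone who wants to check it (the counts are the ones quoted above):

```python
# Reproducing the percentages and the ratio from the quoted counts.
urls_parsed = 3005629093
urls_with_triples = 369254196
websites_total = 40500000
websites_rdfa = 519000

print(round(100.0 * urls_with_triples / urls_parsed, 1))   # 12.3 (% of parsed URLs with triples)
print(round(float(urls_parsed) / urls_with_triples, 1))    # 8.1  (roughly eight to one)
print(round(100.0 * websites_rdfa / websites_total, 2))    # 1.28 (% of websites using RDFa)
```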

Being mindful that the corpus is only web accessible data and excludes “dark data,” the need for a more robust solution than the Semantic Web is self-evident.

The failure of the Semantic Web is no assurance that any alternative proposal will fare better. Understanding why the Semantic Web is failing is a prerequisite to any successful alternative.


Before you “flame on,” you might want to read the entire series. I end up with a suggestion based on work by Ding, Shinavier, Finin and McGuinness.


The next series starts with Saving the “Semantic” Web (Part 1)

Open Source Rookies of the Year

Filed under: Marketing,Open Source — Patrick Durusau @ 11:26 am

Open Source Rookies of the Year

From the webpage:

The fifth annual Black Duck Open Source Rookies of the Year program recognizes the top new open source projects initiated in 2012. This year’s Open Source Rookies honorees span JavaScript frameworks, cloud, mobile, and messaging projects that address needs in the enterprise, government, gaming and consumer applications, among others, and reflect important trends in the open source community.

The 2012 Winners:

Honorable Mention: DCPUToolChain – an assembler, compiler, emulator and IDE for the DCPU-16 virtual CPU (Ohloh entry).

What lessons do you draw from these awards about possible topic map projects for the coming year?

Projects that would interest developers that is. 😉

For example, InaSAFE is described as:

InaSAFE provides a simple but rigorous way to combine data from scientists, local governments and communities to provide insights into the likely impacts of future disaster events. The software is focused on examining, in detail, the impacts that a single hazard would have on specific sectors, for example, the location of primary schools and estimated number of students affected by a possible tsunami like in Maumere, for instance, when it happened during the school hours.

Which is fine so long as I am seated in a reinforced concrete bunker with redundant power supplies.

On the other hand, if I am using a mobile device to access the same data source during a tornado watch, shouldn’t I get the nearest safe location?

A topic map could return minimal data, with navigation reduced or even eliminated, based on geo-location and active weather alerts.

I am sure there are others.

Comments/suggestions?

A Quick Guide to Hadoop Map-Reduce Frameworks

Filed under: Hadoop,Hive,MapReduce,Pig,Python,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 10:45 am

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides to MapReduce frameworks:

Thanks Alex!

Seamless Astronomy

Filed under: Astroinformatics,Data,Data Integration,Integration — Patrick Durusau @ 10:33 am

Seamless Astronomy: Linking scientific data, publications, and communities

From the webpage:

Seamless integration of scientific data and literature

Astronomical data artifacts and publications exist in disjointed repositories. The conceptual relationship that links data and publications is rarely made explicit. In collaboration with ADS and ADSlabs, and through our work in conjunction with the Institute for Quantitative Social Science (IQSS), we are working on developing a platform that allows data and literature to be seamlessly integrated, interlinked, mutually discoverable.

Projects:

  • ADS All-SKy Survey (ADSASS)
  • Astronomy Dataverse
  • WorldWide Telescope (WWT)
  • Viz-e-Lab
  • Glue
  • Study of the impact of social media and networking sites on scientific dissemination
  • Network analysis and visualization of astronomical research communities
  • Data citation practices in Astronomy
  • Semantic description and annotation of scientific resources

A project with large amounts of data for integration.

Moreover, unlike the U.S. Intelligence Community, they are working towards data integration, not resisting it.

I first saw this in Four short links: 6 February 2013 by Nat Torkington.

Marketplace in Query Libraries? Marketplace in Identified Entities?

Filed under: Entities,Marketing,SPARQL — Patrick Durusau @ 10:20 am

Using SPARQL Query Libraries to Generate Simple Linked Data API Wrappers by Tony Hirst.

From the post:

A handful of open Linked Data have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.

I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2, but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…

For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have laid hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.

Tony’s marketplace of queries has a great deal of potential.
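To make the idea concrete, here is a minimal sketch of what a shared, parameterised query library might look like in Python with SPARQLWrapper. The endpoint URL and the query itself are placeholders of my own, not queries from Tony’s or Leigh’s posts:

```python
# A sketch only: the endpoint and query are illustrative, not from the posts above.
from string import Template
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY_LIBRARY = {
    # Each entry is a documented, parameterised query that anyone can reuse.
    "labels_containing": Template("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE {
            ?s rdfs:label ?label .
            FILTER(CONTAINS(LCASE(STR(?label)), "$needle"))
        } LIMIT $limit
    """),
}

def run(endpoint, name, **params):
    """Fill in the named query's parameters and run it against an endpoint."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(QUERY_LIBRARY[name].substitute(**params))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

# Hypothetical usage against a public SPARQL endpoint of your choosing:
# rows = run("http://example.org/sparql", "labels_containing", needle="manor", limit=10)
```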

But I don’t think they need to be limited to SPARQL queries.

By extension his arguments should be true for searches on Google, Bing, etc., as well as vendor specialized search interfaces.

I would take that a step further into libraries for post-processing the results of such queries and presenting users with enhanced presentations and/or content.

And as part of that post-processing, I would add robust identification of entities as an additional feature of such a library/service.

For example, suppose you have curated some significant portion of the ACM digital library. When passed what could be an ambiguous reference to a concept, you return to the user the properties that distinguish the possible referents, sorted into groups.

That frees every user from wading through unrelated papers and proceedings when that reference comes up.

Would that be a service users would pay for?

I suppose that depends on how valuable their time is to them and/or their employers.

Ads 182 Times More Dangerous Than Porn

Filed under: Malware,Marketing,Security — Patrick Durusau @ 5:44 am

Cisco Annual Security Report: Threats Step Out of the Shadows

From the post:

Despite popular assumptions that security risks increase as a person’s online activity becomes shadier, findings from Cisco’s 2013 Annual Security Report (ASR) reveal that the highest concentration of online security threats do not target pornography, pharmaceutical or gambling sites as much as they do legitimate destinations visited by mass audiences, such as major search engines, retail sites and social media outlets. In fact, Cisco found that online shopping sites are 21 times as likely, and search engines are 27 times as likely, to deliver malicious content than a counterfeit software site. Viewing online advertisements? Advertisements are 182 times as likely to deliver malicious content than pornography. (emphasis added)

Numbers like this make me wonder: Is anyone indexing ads?

Or better yet, creating a topic map that maps back to the creators/origins of ad content?

That has the potential to be a useful service, unlike porn blocking ones.

Legitimate brands would have an incentive to stop malware in their ads, and the origins of malware ads would be exposed (blocked?).

I first saw this at Quick Links by Greg Linden.

February 6, 2013

BabelNet 1.1 [5.5 million concepts – TM Starter Kit]

Filed under: BabelNet,Topic Maps — Patrick Durusau @ 3:53 pm

BabelNet 1.1, A very large multilingual ontology.

From the homepage:

A very large multilingual ontology with 5.5 millions of concepts • A wide-coverage “encyclopedic dictionary” • Obtained from the automatic integration of WordNet and Wikipedia • Enriched with automatic translations of its concepts • Connected to the Linguistic Linked Open Data cloud!

From: BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network by Roberto Navigli and Simone Paolo Ponzetto.

In this paper, we take a major step towards realizing the vision of a wide-coverage multilingual knowledge resource. We present a novel integration and enrichment methodology that produces a very large multilingual semantic network: BabelNet. This resource is created by linking the largest multilingual Web encyclopedia – i.e., Wikipedia – to the most popular computational lexicon – i.e., WordNet. The integration is performed via an automatic mapping and by filling in lexical gaps in resource-poor languages with the aid of Machine Translation. The result is an “encyclopedic dictionary” that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations.

A stunning achievement that will be very useful for any number of projects.

However, to put the 5.5 million concepts in perspective, consider that the Saturn V command module alone had more than 2 million parts.

All of which had names, other designations and relationships to other parts.

Even starting with 5.5 million concepts, there is no shortage of subjects to be identified.

Properties you use to identify these concepts and other subjects help capture your meaning in a topic map.

Simon Rogers

Filed under: Data Mining,Journalism,News — Patrick Durusau @ 2:56 pm

Simon Rogers

From the “about” page:

Simon Rogers is editor of guardian.co.uk/data, an online data resource which publishes hundreds of raw datasets and encourages its users to visualise and analyse them – and probably the world’s most popular data journalism website.

He is also a news editor on the Guardian, working with the graphics team to visualise and interpret huge datasets.

He was closely involved in the Guardian’s exercise to crowdsource 450,000 MP expenses records and the organisation’s coverage of the Afghanistan and Iraq Wikileaks war logs. He was also a key part of the Reading the Riots team which investigated the causes of the 2011 England disturbances.

Previously he was the launch editor of the Guardian’s online news service and has edited the paper’s science section. He has edited three Guardian books, including How Slow Can You Waterski and The Hutton Inquiry and its impact.

If you are interested in “data journalism,” data mining or visualization, Simon’s site is one of the first to bookmark.

Improving User Experience in Manuals

Filed under: Documentation,Writing — Patrick Durusau @ 2:49 pm

Improving User Experience in Manuals by Anastasios Karafillis.

From the post:

The manual: possibly the most awkward part of a user’s experience with a product. People avoid manuals whenever possible and designers try to build interfaces that need not rely on them. And yet, users and designers would certainly agree that you simply must provide a proper manual.

The manual can be a powerful tool for unleashing the full potential of an application, something of benefit to users and vendors. Why is it, then, that manuals so often seem to confuse users rather than help them?

Let’s look at the most common difficulties faced by technical writers, and how to best deal with them to improve the user experience of manuals.

“…a proper manual.” Doesn’t seem to be a lot to ask for.

I have seen some better than others but they were all fixed compromises of one sort or another.

Ironic because SGML and then XML advocates have been promising users dynamic content for years. Content that could adapt to circumstances and users.

Instead we gave them dead SGML/XML trees.

What if you had a manual that improved along with you?

A manual composed of different levels of information, which can be chosen by the user or adapted based on the user’s performance on internal tests.

A beginning sysadmin isn’t going to be confronted with a chapter on diagnosing core dumps or long-deprecated backup commands.
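As a toy sketch of that idea (the sections and levels are invented, and this is not a topic map engine), content selection might look something like this:

```python
# Invented sections; "level" is the reader proficiency needed to see a section.
SECTIONS = [
    {"title": "Creating user accounts",       "level": 1, "topics": ["users"]},
    {"title": "Scheduling backups",           "level": 1, "topics": ["backup"]},
    {"title": "Tuning filesystem parameters", "level": 2, "topics": ["storage"]},
    {"title": "Diagnosing core dumps",        "level": 3, "topics": ["debugging"]},
]

def manual_for(reader_level, interests=None):
    """Return only the section titles appropriate to this reader."""
    visible = [s for s in SECTIONS if s["level"] <= reader_level]
    if interests:
        visible = [s for s in visible if set(s["topics"]) & set(interests)]
    return [s["title"] for s in visible]

print(manual_for(reader_level=1))   # a beginner never sees "Diagnosing core dumps"
print(manual_for(reader_level=3, interests=["debugging"]))
```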

A topic map based manual could do that as well as integrate information from other resources.

Imagine a sysadmin manual with text imported from blogs, websites, lists, etc.

A manual that becomes a gateway to an entire area of knowledge.

That would be a great improvement in user experience with manuals!

Oracle’s MySQL 5.6 released

Filed under: Database,MySQL,NoSQL,Oracle — Patrick Durusau @ 2:00 pm

Oracle’s MySQL 5.6 released

From the post:

Just over two years after the release of MySQL 5.5, the developers at Oracle have released a GA (General Availability) version of Oracle MySQL 5.6, labelled MySQL 5.6.10. In MySQL 5.5, the developers replaced the old MyISAM backend and used the transactional InnoDB as the default for database tables. With 5.6, the retrofitting of full-text search capabilities has enabled InnoDB to now take on the position of default storage engine for all purposes.

Accelerating the performance of sub-queries was also a focus of development; they are now run using a process of semi-joins and materialise much faster; this means it should not be necessary to replace subqueries with joins. Many operations that change the data structures, such as ALTER TABLE, are now performed online, which avoids long downtimes. EXPLAIN also gives information about the execution plans of UPDATE, DELETE and INSERT commands. Other optimisations of queries include changes which can eliminate table scans where the query has a small LIMIT value.

MySQL’s row-oriented replication now supports “row image control” which only logs the columns needed to identify and make changes on each row rather than all the columns in the changing row. This could be particularly expensive if the row contained BLOBs, so this change not only saves disk space and other resources but it can also increase performance. “Index Condition Pushdown” is a new optimisation which, when resolving a query, attempts to use indexed fields in the query first, before applying the rest of the WHERE condition.

MySQL 5.6 also introduces a “NoSQL interface” which uses the memcached API to offer applications direct access to the InnoDB storage engine while maintaining compatibility with the relational database engine. That underlying InnoDB engine has also been enhanced with persistent optimisation statistics, multithreaded purging and more system tables and monitoring data available.

Download MySQL 5.6.

I mentioned Oracle earlier today (When Oracle bought MySQL [Humor]) so it’s only fair that I point out their most recent release of MySQL.
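If you want to poke at one of the new features, the sketch below tries EXPLAIN on an UPDATE (new in 5.6) from Python. The connection details and the orders table are placeholders of mine; adjust them to your own installation:

```python
# A sketch only: host, credentials and table are assumptions, not from the article.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="test", passwd="secret", db="shop")
cur = conn.cursor()

# MySQL 5.6 can EXPLAIN data-changing statements, not just SELECTs.
cur.execute("EXPLAIN UPDATE orders SET status = 'shipped' WHERE id = 42")
for row in cur.fetchall():
    print(row)

conn.close()
```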

Three charts are all I need

Filed under: Graphics,Marketing,Visualization — Patrick Durusau @ 1:53 pm

Three charts are all I need by Noah Lorang.

See Noah’s post for his top three chart picks.

I am more interested in his reasons for his choices:

  1. Spend your energy on selling the message, not the medium
  2. Your job is to solve a problem, not make a picture
  3. Safe doesn’t mean boring

How would you apply #1 and #2 to marketing topic maps?

Introduction To R For Data Mining

Filed under: Data Mining,R — Patrick Durusau @ 1:42 pm

Introduction To R For Data Mining

Date: Thursday, February 14, 2013
Time: 10:00am – 11:00am Pacific Time
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

From the post:

We at Revolution Analytics are often asked “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people we have identified three factors that greatly facilitate learning R. For a quick start:

  • Find a way of orienting yourself in the open source R world
  • Have a definite application area in mind
  • Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

  • Provide an orientation to R’s data mining resources
  • Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
  • Show the simple R commands to accomplish these same tasks without the GUI
  • Demonstrate how to build on these fundamental skills to gain further competence in R
  • Move away from using small test data sets and show with the same level of skill one could analyze some fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software as well as students who are new to data mining should come away with a plan for getting started with R.

You have to do something while waiting for your significant other to get off work on Valentine’s Day. 😉

So long as you don’t try to watch the webinar on a smart phone at the restaurant, you should be ok.


Update: Video of the webinar: An Introduction to R for Data Mining.

The Evolution of Regression Modeling… [Webinar]

Filed under: Mathematics,Modeling,Regression — Patrick Durusau @ 12:46 pm

The Evolution of Regression Modeling: From Classical Linear Regression to Modern Ensembles by Mikhail Golovnya and Illia Polosukhin.

Dates/Times:

Part 1: Fri March 1, 10 am, PST

Part 2: Friday, March 15, 10 am, PST

Part 3: Friday, March 29, 10 am, PST

Part 4: Friday, April 12, 10 am, PST

From the webpage:

Class Description: Regression is one of the most popular modeling methods, but the classical approach has significant problems. This webinar series addresses these problems. Are you working with larger datasets? Is your data challenging? Does your data include missing values, nonlinear relationships, local patterns and interactions? This webinar series is for you! We will cover improvements to conventional and logistic regression, and will include a discussion of classical, regularized, and nonlinear regression, as well as modern ensemble and data mining approaches. This series will be of value to any classically trained statistician or modeler.

Details:

Part 1: March 1 – Regression methods discussed

  •     Classical Regression
  •     Logistic Regression
  •     Regularized Regression: GPS Generalized Path Seeker
  •     Nonlinear Regression: MARS Regression Splines

Part 2: March 15 – Hands-on demonstration of concepts discussed in Part 1

  •     Step-by-step demonstration
  •     Datasets and software available for download
  •     Instructions for reproducing demo at your leisure
  •     For the dedicated student: apply these methods to your own data (optional)

Part 3: March 29 – Regression methods discussed
*Part 1 is a recommended pre-requisite

  •     Nonlinear Ensemble Approaches: TreeNet Gradient Boosting; Random Forests; Gradient Boosting incorporating RF
  •     Ensemble Post-Processing: ISLE; RuleLearner

Part 4: April 12 – Hands-on demonstration of concepts discussed in part 3

  •     Step-by-step demonstration
  •     Datasets and software available for download
  •     Instructions for reproducing demo at your leisure
  •     For the dedicated student: apply these methods to your own data (optional)

Salford Systems offers other introductory videos, webinars, tutorials and case studies.

Regression modeling is a tool you will encounter in data analysis and is likely to be an important part of your exploration toolkit.
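If you would rather not wait for the webinars, the sketch below walks the same progression (classical, regularized, ensemble) in Python with scikit-learn on synthetic nonlinear data; it is my own illustration, not material from the series:

```python
# A minimal sketch: invented data, default model settings.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.randn(500)   # nonlinear target

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

for name, model in [("classical OLS", LinearRegression()),
                    ("regularized (lasso)", Lasso(alpha=0.1)),
                    ("ensemble (gradient boosting)", GradientBoostingRegressor())]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print("%-30s MSE: %.3f" % (name, mse))
```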

I first saw this at KDNuggets.

Informer

Filed under: Information Retrieval,News — Patrick Durusau @ 12:25 pm

Informer Newsletter of the BCS Information Retrieval Specialist Group.

The Winter 2013 issue of the Informer has been published!

You will find:

Prior issues are also available.

Need to Pad Your Resume? Innovation Fellows Round 2

Filed under: Government,Government Data — Patrick Durusau @ 11:39 am

White House Seeks Tech Innovation Fellows by Elena Malykhina.

The White House death march farce I covered in A Competent CTO Can Say No continues.

Next group of six to twelve month projects are:

  • Disaster Response and Recovery: The project will “pre-position” tech tools for disaster readiness in order to diminish economic damage and save lives.
  • Cyber-Physical Systems: A new generation of cyber-physical “smart systems” will be developed to help the economy and job creation. These systems will combine distributed sensing, control and data analytics.
  • 21st Century Financial Systems: The 21st Century Financial Systems initiative will transition agency-specific federal financial accounting systems to a more modular, scalable and cost-effective model.
  • Innovation Toolkit: A suite of tools will be created for federal workers, allowing them to become more responsive and efficient in their jobs.
  • Development Innovation Ventures: The Development Innovation Ventures project will address tough global problems by allowing the U.S. government to identify, test and scale new technologies.

Sound like six to twelve month projects? Yes?

I know, I know, I should be lining up to participate in this fraud on the public and be paid for doing it. Looks nice on the resume.

Successful solutions will not be developed on fixed timelines before problems are defined or understood.

Some will say, “So what? So long as you are paid for time, travel, etc., why would you care if the solution is successful?”

That must be why there are no links in the Round 2 announcement to “successes” of the first round of innovation.

Take the first one on the list from round one:

Open Data Initiatives have unleashed data from the vaults of the government as fuel for entrepreneurs and innovators to create new apps, products, and services that benefit the American people in myriad ways and contribute to job growth.

Can you name one? Just one.

Sequestration data (except for my releases) continues to be dead PDF files. And the data in those files is too incomplete for useful analysis.

Is that “…unleash[ing] data from the vaults of the government…?”

Or did the sequestration debate escape their attention?

The number of people willing to defraud the public even in these hard economic times was encouragingly low.

Only 700 people applied for round one. Out of hundreds of thousands of highly qualified IT people who could have applied.

Is defrauding the public becoming unfashionable?

Perhaps there is hope.


Lest there be some misunderstanding, government at all levels is filled with public servants.

But you have to get away from elected/appointed positions to find them.

They mostly don’t appear on Sunday talk shows but tirelessly do the public’s business out of the limelight.

Public servants I would gladly help, public parasites, not so much.

When Oracle bought MySQL [Humor]

Filed under: Humor — Patrick Durusau @ 10:26 am

When Oracle bought MySQL from DBA Reactions.

Start your topic map day’s reading with some humor!

More seriously, suggestions of pics or video clips with topic map related captions are welcome! (for the renovated topicmaps.com)

If we can’t smile at ourselves, very few are going to smile on us.

February 5, 2013

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Filed under: Algorithms,Ontology,Semantics,Taxonomy,Topic Maps — Patrick Durusau @ 5:54 pm

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

She’s wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why is asking users what they meant missing as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.

Any questions?

Green Book – Semantic and Governmental Failure

Filed under: Government,Government Data — Patrick Durusau @ 5:30 pm

Full Text Reports carried a report today about the House Ways and Means Committee — 2012 Green Book (released November 2012).

I am always looking for data that might be of interesting for topic maps and the quoted blurb:

Since 1981, the Committee on Ways and Means has published the Green Book, which presents background material and statistical data on the major entitlement programs and other activities within the Committee’s jurisdiction. Over the decades, the Green Book has become a valuable resource and standard reference on American social policy. It is widely used by Members of Congress and their staffs, analysts in congressional and administrative agencies, members of the media, scholars, and citizens interested in the Nation’s social policy.

Seemed to fill the bill.

Until I got to: Committee on Ways and Means, U.S. House of Representatives, Green Book: Background Material and Data on the Programs within the Jurisdiction of the Committee on Ways and Means.

I sh*t you not. That is really the title.

No wonder they call it the “Green Book.”

When I got to the book itself, stop laughing!, you are ahead of me, all the tables are in PDF files.

No, I’m not going to convert them this time.

Why don’t they share machine-readable files? That is a question you should ask your representative.

Thinking there may be a machine readable copy elsewhere, I searched for the “Green Book.”

Did you know the Department of Defense has a “Green Book?”

Or that Financial Management Services (Treasury) has a Green Book?

Or that the Treasury has another Green Book?

Or the U.S. Army Green Books? (apparently there are later ones than cited here)

Or that Obama has a Green Book?

Counting the one from Congress, that’s six and I suspect there are many more that any search will turn up.

Don’t suppose it ever occurred to anyone in government that distinguishing any of these for search purposes would be useful?

MapReduce Algorithms

Filed under: Algorithms,MapReduce,Texts — Patrick Durusau @ 4:55 pm

MapReduce Algorithms by Bill Bejeck.

Bill is writing a series of posts on implementing the algorithms given in pseudo-code in: Data-Intensive Text Processing with MapReduce.

  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II
  3. Calculating A Co-Occurrence Matrix with Hadoop
  4. MapReduce Algorithms – Order Inversion
  5. Secondary Sorting

Another resource to try with your Hadoop Sandbox install!

I first saw this at Alex Popescu’s 3 MapReduce and Hadoop Links: Secondary Sorting, Hadoop-Based Letterpress, and Hadoop Vaidya.

Understanding MapReduce via Boggle [Topic Map Game Suggestions?]

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:54 pm

Understanding MapReduce via Boggle by Jesse Anderson.

From the post:

Graph theory is a growing part of Big Data. Using graph theory, we can find relationships in networks.

MapReduce is a great platform for traversing graphs. Therefore, one can leverage the power of an Apache Hadoop cluster to efficiently run an algorithm on the graph.

One such graph problem is playing Boggle*. Boggle is played by rolling a group of 16 dice. Each player’s job is to find the most words spelled out by the dice. These dice are six-sided with a single letter that faces up:

Cool!

Any suggestions for a game that illustrates topic maps?

Perhaps a “discovery” game that leads to more points, etc., as merges occur?
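For a sense of the traversal Jesse is parallelizing, here is a single-machine sketch in plain Python. The board and word list are invented; a MapReduce version would partition the starting cells across mappers rather than loop over them:

```python
# A toy Boggle word search: depth-first traversal of the adjacency graph of cells.
WORDS = {"cat", "cart", "care", "tar", "rat", "art", "ace"}  # stand-in word list

def neighbors(r, c, size):
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < size and 0 <= cc < size:
                yield rr, cc

def solve(board, words):
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    found = set()

    def dfs(r, c, path, used):
        path += board[r][c]
        if path not in prefixes:
            return                      # prune: no word starts this way
        if path in words:
            found.add(path)
        for rr, cc in neighbors(r, c, len(board)):
            if (rr, cc) not in used:
                dfs(rr, cc, path, used | {(rr, cc)})

    for r in range(len(board)):
        for c in range(len(board)):
            dfs(r, c, "", {(r, c)})
    return found

board = [["c", "a", "t", "e"],
         ["r", "a", "r", "c"],
         ["t", "o", "a", "e"],
         ["s", "l", "n", "d"]]
print(sorted(solve(board, WORDS)))      # the words reachable on this board
```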

I first saw this at Alex Popescu’s 3 MapReduce and Hadoop Links: Secondary Sorting, Hadoop-Based Letterpress, and Hadoop Vaidya.

Introduction to Complexity course is now enrolling!

Filed under: Cellular Automata,Complexity,Fractals — Patrick Durusau @ 3:34 pm

Santa Fe Institute’s Introduction to Complexity course is now enrolling!

From the webpage:

This free online course is open to anyone, and has no prerequisites. Watch the Intro Video to learn what this course is about and how to take it. Enroll to sign up, and you can start the course immediately. See the Syllabus and the Frequently Asked Questions to learn more.

I am waiting for the confirmation email now.

Definitely worth your attention.

Not that I think subject identity is fractal in nature.

Fractals, as you know, have a self-similarity property and, at least in my view, subject identity does not.

As you explore a subject identity, you encounter other subjects’ identities, which isn’t the same thing as being self-similar.

Or should I say you will encounter complexities of subject identities?

Like all social constructs, identification of a subject is simple because we have chosen to view it that way.

Are you ready to look beyond our usual assumptions?

Concurrency Improvements in TokuDB v6.6 (Part 2)

Filed under: Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 3:02 pm

Concurrency Improvements in TokuDB v6.6 (Part 2)

From the post:

In Part 1, we showed performance results of some of the work that’s gone in to TokuDB v6.6. In this post, we’ll take a closer look at how this happened, on the engineering side, and how to think about the performance characteristics in the new version.

Background

It’s easiest to think about our concurrency changes in terms of a Fractal Tree® index that has nodes like a B-tree index, and buffers on each node that batch changes for the subtree rooted at that node. We have materials that describe this available here, but we can proceed just knowing that:

  1. To inject data into the tree, you need to store a message in a buffer at the root of the tree. These messages are moved down the tree, so you can find messages in all the internal nodes of the tree (the mechanism that moves them is irrelevant for now).
  2. To read data out of the tree, you need to find a leaf node that contains your key, check the buffers on the path up to the root for messages that affect your query, and apply any such messages to the value in the leaf before using that value to answer your query.

It’s these operations that modify and examine the buffers in the root that were the main reason we used to serialize operations inside a single index.

Just so not everything today is “soft” user stuff. 😉

Interesting avoidance of the root node as an I/O bottleneck.

Sort of thing that gets me to thinking about distributed topic map writing/querying.
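To make the buffering idea concrete, here is a toy, single-threaded sketch of messages flowing down a buffered tree. It is my own illustration, not TokuDB code, and it ignores the concurrency machinery the post is actually about:

```python
# Writes land in the root's buffer and trickle down in batches; reads walk
# root-to-leaf and let the newest pending message win.
BUFFER_LIMIT = 4

class Node:
    def __init__(self, children=None, pivots=None):
        self.children = children or []   # empty list => leaf
        self.pivots = pivots or []       # split keys between children
        self.buffer = []                 # pending (key, value) messages, oldest first
        self.data = {}                   # materialized values (leaves only)

    def child_for(self, key):
        for i, pivot in enumerate(self.pivots):
            if key < pivot:
                return self.children[i]
        return self.children[-1]

def insert(root, key, value):
    root.buffer.append((key, value))     # every write touches only the root at first
    flush_if_needed(root)

def flush_if_needed(node):
    if not node.children:                # leaf: apply messages to real storage
        for k, v in node.buffer:
            node.data[k] = v
        node.buffer = []
    elif len(node.buffer) > BUFFER_LIMIT:
        for k, v in node.buffer:         # push the batch one level down
            node.child_for(k).buffer.append((k, v))
        node.buffer = []
        for child in node.children:
            flush_if_needed(child)

def lookup(root, key):
    node = root
    while True:
        hits = [v for k, v in node.buffer if k == key]
        if hits:
            return hits[-1]              # higher (newer) buffers shadow lower ones
        if not node.children:
            return node.data.get(key)
        node = node.child_for(key)

leaves = [Node(), Node()]
root = Node(children=leaves, pivots=["m"])
for k, v in [("apple", 1), ("pear", 2), ("apple", 3)]:
    insert(root, k, v)
print(lookup(root, "apple"))             # 3: the newest message still sits in the root buffer
```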

4 Reasons Your UX Investment Isn’t Paying Off [Topic Map UX?]

Filed under: Design,Interface Research/Design,Usability,Users — Patrick Durusau @ 2:18 pm

4 Reasons Your UX Investment Isn’t Paying Off by Hilary Little.

You can imagine why this caught my eye.

From the post:

“Every dollar spent on UX brings in between $2 and $100 dollars in return.”

We all know the business case for doing user experience work: investing upfront in making products easy to use really pays off. It reduces project risk, cost, and time while improving efficiency, effectiveness, and end user satisfaction.

(Don’t know the business case? Read this or this. Or this.) But what if you’re investing in UX and not getting results?

There can be many factors behind an under-performing user experience effort. Anything from a lack of tools to the zombie apocalypse can wreak havoc on your teams. Addressing either of those factors are outside my area of expertise.

Here’s where I do know what I’m talking about. First, rule out the obvious: your UX folks are jerks, they don’t communicate well, they don’t understand business, they aren’t team players, they have such terrible body odor people stay 10 feet away …

Next, look at your organization. I’ve based the following list on observations accumulated over my years as a UX professional. These are some common organizational “behavior” patterns that can make even the best UX efforts ineffective.

Let that first line soak in for a bit: “Every dollar spent on UX brings in between $2 and $100 dollars in return.”

Then go read the rest of the post for the four organizational patterns to watch for.

Assuming you have invested in professional UX work at all.

I haven’t and my ability to communicate topic maps to the average user is poorer as a result.

Not that I expect average users to “get” that identifications exist in fabrics of identifiers and any identified subject is at the intersection of multiple fabrics of identifiers, whether represented or not.

But to use and appreciate topic maps, that isn’t necessary.

Any more than I have to understand thermodynamics to drive an automobile.

And yes, yes I am working on an automobile level explanation of why topic maps are important.

Or better yet, simply presenting a new automobile and being real quiet about why it works so well. 😉

Doing More with the Hortonworks Sandbox

Filed under: Data,Dataset,Hadoop,Hortonworks — Patrick Durusau @ 2:01 pm

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.
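A toy sketch of what “design the merge once, run many times” can look like in Python (the two record sets and field names are invented, and this is a spreadsheet-level illustration, not a topic map engine):

```python
# Hypothetical extracts that name the same identifying field differently.
batting  = [{"player_id": "ruthba01", "team": "NYA", "hr": 54}]
salaries = [{"playerID": "ruthba01", "yearID": 1921, "salary": 52000}]

# The merge is declared once: which column names are the same property,
# and which field identifies the subject. Re-run whenever the data changes;
# if requirements change, edit the mapping, not the pipeline.
FIELD_MAP = {"playerID": "player_id"}   # right-hand names -> canonical names
KEY = "player_id"

def merge(left, right, field_map, key):
    normalized = [{field_map.get(k, k): v for k, v in row.items()} for row in right]
    merged = {row[key]: dict(row) for row in left}
    for row in normalized:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

print(merge(batting, salaries, FIELD_MAP, KEY))
# one record per player_id, with fields drawn from both sources
```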

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

Sharpening Your Competitive Edge…

Filed under: Design,Interface Research/Design,Usability,Users — Patrick Durusau @ 1:44 pm

Sharpening Your Competitive Edge with UX Research by Rebecca Flavin.

From the post:

It’s part of our daily work. We can’t imagine creating a product or an application without doing it: understanding the user.

Most of the clients we work with at EffectiveUI already have a good understanding of their customers from a market point of view. They know their target demographics and often have a solid sense of psychographics: their customers’ interests, media habits, and lifestyles.

This is all great information that is critical to a company’s success, but what about learning more about a customer than his or her age, gender, interests, and market segment? What about understanding the customer from a UX perspective?

Not all companies take the time to thoroughly understand exactly why, how, when, and where their customers interact with their brands, products and digital properties, as well as those of competing products and services. What are the influences, distractions, desires, and emotions that affect users as they try to purchase or engage with your product or interact with your service?

At EffectiveUI, we’ve seen that user research can be a powerful and invaluable tool for aiding strategic business decisions, identifying market opportunities, and ultimately driving better organizational results. When we’re talking to customers about a digital experience, we frequently uncover opportunities for their business as a whole to shift its strategic direction. Sometimes we even find out that the company has completely missed an opportunity with their customers.

As part of the holistic UX process, user research helps us learn more about customers’ pain points, needs, desires, and goals in order to inform digital design or product direction. The methods we generally employ include:

Great post that merits your attention!

What I continue to puzzle over is how to develop user testing for topic map interfaces.

The broad strokes of user testing are fairly well known, but how to implement those for topic map interfaces isn’t clear.

On one hand, a topic map could present its content much as any other web interface.

On the other hand, a topic map could present a “topicmappish” flavor interface.

And there are all the cases in between.

If it doesn’t involve trade secrets, can anyone comment on how they have tested topic map interfaces?

Bill Gates is naive, data is not objective [Neither is Identification]

Filed under: Data,Identity — Patrick Durusau @ 10:54 am

Bill Gates is naive, data is not objective by Cathy O’Neil.

From the post:

In his recent essay in the Wall Street Journal, Bill Gates proposed to “fix the world’s biggest problems” through “good measurement and a commitment to follow the data.” Sounds great!

Unfortunately it’s not so simple.

Gates describes a positive feedback loop when good data is collected and acted on. It’s hard to argue against this: given perfect data-collection procedures with relevant data, specific models do tend to improve, according to their chosen metrics of success. In fact this is almost tautological.

As I’ll explain, however, rather than focusing on how individual models improve with more data, we need to worry more about which models and which data have been chosen in the first place, why that process is successful when it is, and – most importantly – who gets to decide what data is collected and what models are trained.

Cathy makes a compelling case for data not being objective and concludes:

Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.

Sounds a lot like identifying subjects.

No identification is objective. All identifications occur as part of social processes and are bound by those processes.

No identification is “better” than another one, although in some contexts particular identifications may be more useful than others.

I first saw this in Four short links: 4 February 2013 by Nat Torkington.

February 4, 2013

Dark Patterns Library

Filed under: Design,Interface Research/Design — Patrick Durusau @ 7:12 pm

Dark Patterns Library by Harry Brignull and Marc Miquel.

From the homepage:

A Dark Pattern is a type of user interface that has been carefully crafted to trick users into doing things, such as buying insurance with their purchase or signing up for recurring bills.

Normally when you think of “bad design”, you think of the creator as being sloppy or lazy but with no ill intent. This type of bad design is known as a “UI anti-pattern”. Dark Patterns are different – they are not mistakes, they are carefully crafted with a solid understanding of human psychology, and they do not have the user’s interests in mind.

Has the potential to make you a better consumer.

First saw this at Four short links: 4 February 2013 by Nat Torkington.

International Space Apps Challenge

Filed under: Challenges,Contest,NASA — Patrick Durusau @ 7:12 pm

International Space Apps Challenge

From the webpage:

The International Space Apps Challenge is a two-day technology development event during which citizens from around the world will work together to address current challenges relevant to both space exploration and social need.

NASA believes that mass collaboration is key to creating and discovering state-of-the-art technology. The International Space Apps Challenge aims to engage YOU in developing innovative solutions to our toughest challenges.

Join us on April 20-21, 2013, as we join together cities around the world to be part of pioneering the future. Sign up to be notified when registration opens in early 2013!

The list of challenges will be released around March 15th at spaceappschallenge.org.

I won’t be able to attend in person but would be interested in participating with others should a semantic integration challenge come up.

I first saw this at: NASA launches second International Space Apps Challenge by Alex Howard.

jQuery Rain

Filed under: Design,Interface Research/Design,JQuery — Patrick Durusau @ 7:12 pm

jQuery Rain

I happened across jQuery Rain today and thought it worth passing on.

Other web interface sites that you would recommend?

Thinking: if you can look at an interface and think, “topic map,” then the interface needs work.

The focus of any interface should be delivery of content and/or capabilities.

Including topic map interfaces.

