Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 8, 2011

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

Filed under: Conferences,Hadoop,HBase,Hive,MySQL,Oracle,Toad — Patrick Durusau @ 7:46 pm

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

From the website:

24 hours of Toad is here! Join us on 11.11.11, and take an around the world journey with Toad and database experts who will share database development and administration best practices. This is your chance to see new products and new features in action, virtually collaborate with other users – and Quest’s own experts, and get a first-hand look at what’s coming in the world of Toad.

If you are not going to see the Immortals on 11.11.11, or are looking for something to do after the movie, drop in on the Toad Virtual Expo! 😉 (It doesn’t look like a “chick” movie anyway.)

Times:

Register today for Quest Software’s 24-hour Toad Virtual Expo and learn why the best just got better.

  1. Tokyo Friday, November 11, 2011 6:00 a.m. JST – Saturday, November 12, 2011 6:00 a.m. JST
  2. Sydney Friday, November 11, 2011 8:00 a.m. EDT – Saturday, November 12, 2011 8:00 a.m. EDT
  3. Tel Aviv Thursday, November 10, 2011 11:00 p.m. IST – Friday, November 11, 2011 11:00 p.m. IST
  4. Central Europe Thursday, November 10, 2011 10:00 p.m. CET – Friday, November 11, 2011 10:00 p.m. CET
  5. London Thursday, November 10, 2011 9:00 p.m. GMT – Friday, November 11, 2011 9:00 p.m. GMT
  6. New York Thursday, November 10, 2011 4:00 p.m. EST – Friday, November 11, 2011 4:00 p.m. EST
  7. Los Angeles Thursday, November 10, 2011 1:00 p.m. PST – Friday, November 11, 2011 1:00 p.m. PST

The site wasn’t long on specifics but this could be fun!

Toad for Cloud Databases (Quest Software)

Filed under: BigData,Cloud Computing,Hadoop,HBase,Hive,MySQL,Oracle,SQL Server — Patrick Durusau @ 7:45 pm

Toad for Cloud Databases (Quest Software)

From the news release:

The data management industry is experiencing more disruption than at any other time in more than 20 years. Technologies around cloud, Hadoop and NoSQL are changing the way people manage and analyze data, but the general lack of skill sets required to manage these new technologies continues to be a significant barrier to mainstream adoption. IT departments are left without a clear understanding of whether development and DBA teams, whose expertise lies with traditional technology platforms, can effectively support these new systems. Toad® for Cloud Databases addresses the skill-set shortage head-on, empowering database professionals to directly apply their existing skills to emerging Big Data systems through an easy-to-use and familiar SQL-based interface for managing non-relational data. 

News Facts:

  • Toad for Cloud Databases is now available as a fully functional, commercial-grade product, for free, at www.quest.com/toad-for-cloud-databases.  Toad for Cloud Databases enables users to generate queries, migrate, browse, and edit data, as well as create reports and tables in a familiar SQL view. By simplifying these tasks, Toad for Cloud Databases opens the door to a wider audience of developers, allowing more IT teams to experience the productivity gains and cost benefits of NoSQL and Big Data.
  • Quest first released Toad for Cloud Databases into beta in June 2010, making the company one of the first to provide a SQL-based database management tool to support emerging, non-relational platforms. Over the past 18 months, Quest has continued to drive innovation for the product, growing its list of supported platforms and integrating a UI for its bi-directional data connector between Oracle and Hadoop.
  • Quest’s connector between Oracle and Hadoop, available within Toad for Cloud Databases, delivers a fast and scalable method for data transfer between Oracle and Hadoop in both directions. The bidirectional characteristic of the utility enables organizations to take advantage of Hadoop’s lower cost of storage and analytical capabilities. Quest also contributed the connector to the Apache Hadoop project as an extension to the existing SQOOP framework, and it is also available as part of Cloudera’s Distribution Including Apache Hadoop. 
  • Toad for Cloud Databases today supports:
    • Apache Hive
    • Apache HBase
    • Apache Cassandra
    • MongoDB
    • Amazon SimpleDB
    • Microsoft Azure Table Services
    • Microsoft SQL Azure, and
    • All Open Database Connectivity (ODBC)-enabled relational databases (Oracle, SQL Server, MySQL, DB2, etc)

 

Anything that eases the transition to cloud computing is going to be welcome. Toad being free will increase the ranks of DBAs who will at least experiment on their own.
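
Quest’s pitch above is a SQL front end over non-relational stores. Toad itself is a GUI, but the underlying idea is easy to sketch in code. Here is a minimal, hypothetical Java example going straight at Hive over JDBC, nothing Toad-specific: the driver class and URL are the pre-HiveServer2 ones and vary by Hive version, and the weblogs table is made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSqlSketch {
        public static void main(String[] args) throws Exception {
            // Assumption: the original HiveServer JDBC driver; the class name
            // and URL differ in later Hive releases.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // Plain SQL over data that actually lives in Hadoop.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }

The point being: the query is ordinary SQL, even though the data and the execution engine underneath are anything but.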

Statistical Learning Part III

Filed under: Statistical Learning,Statistics — Patrick Durusau @ 7:45 pm

Statistical Learning Part III by Steve Miller.

From the post:

I finally got around to cleaning up my home office the other day. The biggest challenge was putting away all the loose books in such a way that I can quickly retrieve them when needed.

In the clutter I found two copies of “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani and Jerome Friedman – one I purchased two years ago and the other I received at a recent Statistical Learning and Data Mining (SLDM III) seminar taught by the first two authors. ESL is quite popular in the predictive modeling world, often referred to by aficionados as “the book”, “the SL book” or the “big yellow book” in reverence to its status as the SL bible.

Hastie, Tibshirani and Friedman are Professors of Statistics at Stanford University, the top-rated stats department in the country. For over 20 years, the three have been leaders in the field of statistical learning and prediction that sits between traditional statistical modeling and data mining algorithms from computer science. I was introduced to their work when I took the SLDM course three years ago.

Interesting discussion of statistical learning with Q/A session at the end.

Apache Lucene Eurocon 2011 – Presentations

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 7:45 pm

Apache Lucene Eurocon 2011 – Presentations

From the website:

Apache Lucene Eurocon 2011, held in Barcelona October 17-20, was a huge success. The conference was packed with technical sessions, developer content, user case studies, panels, and networking opportunities; Lucene Revolution featured the thought leaders building and deploying Lucene/Solr open source search technology. Compelling speakers and unmatched networking opportunities created a unique community of practice and experience, so you too can unlock the power, versatility and cost-effective capabilities of search across industries, data, and applications.

If you missed the chance to attend the Apache Lucene Eurocon, or any part of it, you can still get your hands on what the speakers delivered. We have posted most of the presentations below for download and review, along with videos of select speakers (as available).

Many thanks to Lucid Imagination for the conference and for making these conference materials available. It is not like being there, but it does extend the conversation to include those who were not present.

Making Data Work – O’Reilly Strata Conference – Santa Barbara – Feb. 28 – March 1, 2012

Filed under: BigData,Conferences — Patrick Durusau @ 7:45 pm

Making Data Work – O’Reilly Strata Conference – Santa Barbara – Feb. 28 – March 1, 2012

Important Dates:

  • Best Pricing ends November 10, 2011
  • Early Registration ends January 12, 2012
  • Standard Registration ends February 27, 2012
  • Conference February 28 – March 1, 2012

Conference program topics:

  • Data: Big data and the Hadoop ecosystem, real-time data processing and analytics, crowdsourcing, data acquisition and cleaning, data distribution and markets, data science best practice, predictive analytics, machine learning, data security
  • Business: From research to product, data protection, privacy and policy, becoming a data-driven organization, training, recruitment, management for data, the changing role of business intelligence
  • Interfaces: Visualization and design principles, mobile strategy, applications & futures, augmented reality and immersive interfaces, dashboards, sensors, mobile & wireless, physical interfaces and robotics

Moving to Big Data – 7 December 2011 – 9 AM PT

Filed under: BigData,Conferences — Patrick Durusau @ 7:45 pm

Moving to Big Data

Free O’Reilly online conference, 7 December 2011 – 9 AM PT until 10:30 AM PT.

Short presentations but you should pick up talking points and vocabulary for further exploration.

From the website:

Everyone buys in to the mantra of the data-driven enterprise. Companies that put data to work make smarter decisions than their competitors. They can engage customers, employees, and partners more effectively. And they can adapt faster to changing market conditions. It’s not just internal data, either: a social, connected web has given us new firehoses to drink from, and combining public and private data yields valuable new insights.

Unfortunately for many businesses, the information they need is languishing in data warehouses. It’s accessible only to Business Intelligence experts and database experts. It’s encased in legacy databases with arcane interfaces.

Big Data promises to unlock this data for the entire company. But getting there will be hard: replacing decades-old platforms and entire skill sets doesn’t happen overnight. In this online event, we’ll look at how Big Data stacks and analytical approaches are gradually finding their way into organizations, as well as the roadblocks that can thwart efforts to become more data-driven.

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Filed under: Hadoop,Lucene,LucidWorks,Mahout,Solr,Topic Maps — Patrick Durusau @ 7:44 pm

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Slides

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon? 😉

On slide 4 (I said to review the slides), Grant presents four overlapping areas: Documents (models, feature selection); Content Relationships (Page Rank, organization, etc.); Queries (phrases, NLP); and User Interaction (clicks, ratings/reviews, learning to rank, social graph). The intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant says that mining that intersection is a loop composed of: Search -> Discovery -> Analytics -> (back to Search). All of which involves processing of data that has been collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like, may find interesting data, may not.

My point is that there is no way to capture a user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson, so that the next user who searches for LBJ will find the information contributed by a prior user. That would include distinguishing uses of Lyndon Baines Johnson for a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is just in the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

What if a colleague found a bug in program code, but you had to start at the beginning of the program and work your way to it yourself? Would that be a good use of your time? To reset with every user? That is what happens with search: nearly a complete reset. (Not complete, because of page rank and the like, but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*”Clicks” are not insights. Could be an unsteady hand, DTs, etc.
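
To make that concrete, here is a toy sketch in Java (no search library; every name in it is hypothetical, and the Wikipedia IRIs are only illustrative) of what an “insight index” could look like: users contribute the assertion that “LBJ” names several distinct subjects, each identified by an IRI, and later searchers inherit those distinctions instead of starting over.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A toy "insight index": user-contributed mappings from a search term
    // to distinct subjects, each identified by an IRI, so later searchers
    // inherit earlier users' distinctions instead of re-deriving them.
    public class InsightIndex {
        private final Map<String, List<String>> insights = new HashMap<String, List<String>>();

        public void contribute(String term, String subjectIri) {
            String key = term.toLowerCase();
            List<String> subjects = insights.get(key);
            if (subjects == null) {
                subjects = new ArrayList<String>();
                insights.put(key, subjects);
            }
            if (!subjects.contains(subjectIri)) {
                subjects.add(subjectIri);
            }
        }

        public List<String> lookup(String term) {
            List<String> subjects = insights.get(term.toLowerCase());
            return subjects == null ? new ArrayList<String>() : subjects;
        }

        public static void main(String[] args) {
            InsightIndex index = new InsightIndex();
            // One user's insight: "LBJ" names several distinct subjects.
            index.contribute("LBJ", "http://en.wikipedia.org/wiki/Lyndon_B._Johnson");
            index.contribute("LBJ", "http://en.wikipedia.org/wiki/Lyndon_B._Johnson_School_of_Public_Affairs");
            index.contribute("LBJ", "http://en.wikipedia.org/wiki/Lyndon_B._Johnson_General_Hospital");
            // The next user who searches for "LBJ" starts from those distinctions.
            System.out.println(index.lookup("lbj"));
        }
    }

In a real system those contributions would feed back into the search index itself, for example as query expansions or topic map associations, which is the Search -> Discovery -> Analytics loop applied to users rather than documents.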

Someone Is Being Honest on the Internet?

Filed under: MongoDB,NoSQL,Riak — Patrick Durusau @ 7:44 pm

After seeing the raft of Twitter traffic on “MongoDB and Riak, In Context (and an apology),” I just had to look. The thought of someone being honest on the Internet is even more novel than someone being wrong on the Internet.

At least I would not have to stay up late correcting them. 😉

Sean Cribbs writes:

There has been quite a bit of furor and excitement on the Internet this week regarding some very public criticisms (and defenses) of MongoDB and its creators, 10gen. Unfortunately, a ghost from my recent past also resurfaced as a result. Let me begin by apologizing to 10gen and its engineers for what I said at JSConf, and then I will reframe my comments in a more constructive form.

Mea culpa. It’s way too easy in our industry to set up and knock down strawmen, as I did, than to convey messages of objective and constructive criticism. It’s also too easy, when you are passionate about what you believe in, to ignore the feelings and efforts of others, which I did. I have great respect for the engineers I have met from 10gen, Mathias Stern and Kyle Banker. They are friendly, approachable, helpful and fun to socialize with at conferences. Thanks for being stand-up guys.

Also, whether we like it or not, these kinds of public embarrassments have rippling effects across the whole NoSQL ecosystem. While Basho has tried to distance itself from other players in the NoSQL field, we cannot deny our origins, and the ecosystem as a “thing” is only about 3 years old. Are developers, technical managers and CTOs more wary of new database technologies as a result of these embarrassments? Probably. Should we continue to work hard to develop and promote alternative data-storage solutions? Absolutely.

Sean’s following comments are useful but even more useful was his suggestion that both MongoDB and Riak push to improve their respective capabilities. There is always room for improvement.

Oh, I did notice one thing that needs correcting in Sean’s blog entry. 😉 See: Munnecke, Health Records and VistA (NoSQL 35 years old?) NoSQL is at least 35 years old, probably older, but I don’t have the citation at hand.

Visualizations as vocabulary…or know the big words, use the small ones

Filed under: Visualization — Patrick Durusau @ 7:44 pm

Visualizations as vocabulary…or know the big words, use the small ones by Zach Gemignani.

From the post:

Telling stories with data. It is an increasingly common business intelligence refrain–and it may well be part of your job description. If it is, why not tap into the time-tested lessons of those who tell stories with words? Just as words are the basic unit of written stories, visualization techniques (charts, visualizations, colors, font sizes, sparklines, etc.) are the tools we have when telling data stories. No matter the form, authors will agonize about choosing the right units of expression, finding a balance between being concise and being comprehensive, simplicity and sophistication.

Illustrates the following three rules:

  1. Smaller, simpler words
  2. Too many words is a symptom of poor understanding
  3. Words are for communication, not show

Very much worth your time to visit.

Jena

Filed under: Jena,RDF,RDFa — Patrick Durusau @ 7:44 pm

Jena

Did you know that Jena is incubating at Apache now?

Welcome to the Apache Jena project! Jena is a Java framework for building Semantic Web applications. Jena provides a collection of tools and Java libraries to help you to develop semantic web and linked-data apps, tools and servers.

The Jena Framework includes:

  • an API for reading, processing and writing RDF data in XML, N-triples and Turtle formats;
  • an ontology API for handling OWL and RDFS ontologies;
  • a rule-based inference engine for reasoning with RDF and OWL data sources;
  • stores to allow large numbers of RDF triples to be efficiently stored on disk;
  • a query engine compliant with the latest SPARQL specification
  • servers to allow RDF data to be published to other applications using a variety of protocols, including SPARQL

Apache Incubator: Apache Jena is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator project. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
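
As a rough illustration of the first and fifth items in the feature list above, reading RDF and running a SPARQL query with Jena looks something like the sketch below. The package names are the pre-Apache com.hp.hpl.jena ones in use around the time of incubation, and the data file name is a placeholder.

    import com.hp.hpl.jena.query.Query;
    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import java.io.FileInputStream;

    public class JenaSketch {
        public static void main(String[] args) throws Exception {
            // Read an RDF file (Turtle here; "data.ttl" is a placeholder).
            Model model = ModelFactory.createDefaultModel();
            model.read(new FileInputStream("data.ttl"), null, "TURTLE");

            // Ask for every resource that has an rdfs:label.
            String sparql =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "SELECT ?s ?label WHERE { ?s rdfs:label ?label }";
            Query query = QueryFactory.create(sparql);
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.get("s") + " -> " + row.get("label"));
                }
            } finally {
                qe.close();
            }
        }
    }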

Rob Weir has pointed out that since ODF (OpenDocument Format) 1.2 includes support for RDFa and RDF/XML, Jena may have a role to play in ODF’s future.

You can learn more about ODF 1.2 at the OpenDocument TC.

Adding support to the ODFToolkit for RDFa/RDF and/or demonstrating the benefits of RDFa/RDF in ODF 1.2 would be most welcome!

November 7, 2011

Stanford NLP

Filed under: Natural Language Processing,Stanford NLP — Patrick Durusau @ 7:29 pm

Stanford NLP

“Stanford NLP” is usually a reference to the Stanford NLP parser, but I have put in the link to “The Stanford Natural Language Processing Group.”

From its webpage:

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. Our work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

A distinguishing feature of the Stanford NLP Group is our effective combination of sophisticated and deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Our research has resulted in state-of-the-art technology for robust, broad-coverage natural-language processing in many languages. These technologies include our part-of-speech tagger, which currently has the best published performance in the world; a high performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is affiliated with the Stanford AI Lab and the Stanford InfoLab.

Quick link to Stanford NLP Software page.
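
If you want a feel for what the group’s software looks like in use, here is a minimal sketch with the Stanford CoreNLP pipeline. It assumes the CoreNLP jars and bundled models are on the classpath, and the annotator names vary a bit across releases.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    import java.util.Properties;

    public class CoreNlpSketch {
        public static void main(String[] args) {
            // Tokenize, split sentences, tag parts of speech and named entities.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation(
                    "Lyndon Baines Johnson became president in 1963.");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    System.out.println(token.word() + "\t"
                            + token.get(CoreAnnotations.PartOfSpeechAnnotation.class) + "\t"
                            + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
                }
            }
        }
    }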

Using Lucene and Cascalog for Fast Text Processing at Scale

Filed under: Cascalog,Clojure,LingPipe,Lucene,Natural Language Processing,OpenNLP,Stanford NLP — Patrick Durusau @ 7:29 pm

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop; written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

The world of text exploration just gets better all the time!
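
The Lucene half of that combination is mostly analyzers and token streams. Stripped of the Cascalog/Clojure wrapper, the kind of text processing being reused looks something like this minimal Java sketch (the Version constant should match whatever Lucene release you are actually running).

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class LuceneTokenizeSketch {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer lower-cases, drops stop words and splits on
            // word boundaries; swap in another Analyzer for different behavior.
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            TokenStream stream = analyzer.tokenStream("body",
                    new StringReader("Fast text processing at scale with Lucene and Cascalog"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
            stream.close();
        }
    }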

CumulusRDF

Filed under: Cassandra,RDF — Patrick Durusau @ 7:28 pm

CumulusRDF

From Andreas Harth and Günter Ladwig:

[W]e are happy to announce the first public release of CumulusRDF, a Linked Data server that uses Apache Cassandra [1] as a cloud-based storage backend. CumulusRDF provides a simple HTTP interface [2] to manage RDF data stored in an Apache Cassandra cluster.

Features
* By way of Apache Cassandra, CumulusRDF provides distributed, fault-tolerant and elastic RDF storage
* Supports Linked Data and triple pattern lookups
* Proxy mode: CumulusRDF can act as a proxy server [3] for other Linked Data applications, allowing to deploy any RDF dataset as Linked Data

This is a first beta release that is still somewhat rough around the edges, but the basic functionality works well. The HTTP interface is work-in-progress. Eventually, we plan to extend the storage model to support quads.

CumulusRDF is available from http://code.google.com/p/cumulusrdf/

See http://code.google.com/p/cumulusrdf/wiki/GettingStarted to get started using CumulusRDF.

There is also a paper [4] on CumulusRDF that I presented at the Scalable Semantic Knowledge Base Systems (SSWS) workshop at ISWC last week.

Cheers,
Andreas Harth and Günter Ladwig

[1] http://cassandra.apache.org/
[2] http://code.google.com/p/cumulusrdf/wiki/HttpInterface
[3] http://code.google.com/p/cumulusrdf/wiki/ProxyMode
[4] http://people.aifb.kit.edu/gla/cumulusrdf/cumulusrdf-ssws2011.pdf

Everybody knows I hate to be picky but the abstract of [4] promises:

Results on a cluster of up to 8 machines indicate that CumulusRDF is competitive to state-of-the-art distributed RDF stores.

But I didn’t see any comparison to “state-of-the-art” RDF stores, distributed or not. Did I just overlook something?

I ask because I think this approach has promise, at least as an exploration of indexing strategies for RDF and how usage scenarios may influence those strategies. But that will be difficult to evaluate in the absence of comparison to less imaginative approaches to RDF indexing.

Development Setup for Neo4j and PHP: Part 1 + 2

Filed under: Neo4j,PHP — Patrick Durusau @ 7:28 pm

Development Setup for Neo4j and PHP: Part 1

Development Setup for Neo4j and PHP: Part 2

by Josh Adell.

From part 1:

I would really love to see more of my fellow PHP developers talking about, playing with, and building awesome applications on top of graph databases. They really are a powerful storage solution that fits well into a wide variety of domains.

In this two part series, I’ll detail how to set up a development environment for building a project with a graph database (specifically Neo4j). Part 1 will show how to set up the development and unit testing databases. In Part 2, we’ll create a basic application that talks to the database, including unit tests.

All the steps below were performed on Ubuntu 10.10 Maverick, but should be easy to translate to any other OS.

Forward this to your library and other friends using PHP.

Investing in Big Data’s Critical Needs

Filed under: BigData — Patrick Durusau @ 7:28 pm

I was puzzled by the following summary of investing in Big Data’s “critical needs:”

Recognizing the potential of data exploitation in today’s business environment, my firm, Trident Capital, continues to look for investments in three areas that address Big Data’s critical needs:

  1. People. Data scientists and business analysts have emerged as the critical personnel for the analysis and utilization of data. We are looking for services companies with large teams of such people and scalable analytic processes.
  2. Tools. Today’s Big Data analysis tools remain too low-level requiring analysts to perform many tasks manually. For this reason we are searching for new technologies to help with data ingestion, manipulation and exploitation.
  3. Applications. Businesses are still not able to capitalize on the results of the performed analyses in a timely manner often missing important opportunities. We are searching for companies developing analytic applications that enable business users to actionalize the Big Data sets their organizations collect. Example applications include customer experience data analysis enabling organizations to offer the right level of customer support, Internet of Things data analysis to optimize supply chains, and applications that use analysis results to assist professionals with complex tasks, such as a doctor during the diagnosis process.

The post, Presentation on Big Data Analytics and Watson at IBM’s Information on Demand Conference, makes assumptions about “big data” that are subject to change.

For example, consider the first two points, the need for scalable service organizations and tools to avoid manual work with data. Both are related to and directly impacted by the quality of the data in question. It may be “big data,” but if it is “clean” data, then the need for scaling service organizations and manual manipulation goes down.

But the presumption underlying this analysis is that we have “dirty big data” and there isn’t anything we can do about it. Really?

What if data, when produced by human or automated means, were less “dirty,” if not in fact “clean”? At least for some purposes.

  1. Choose a “dirty” data set.
  2. What steps in its production would lessen the amount of dirt in the data?
  3. What would you suggest as ways to evaluate the cost vs. benefit of cleaner data?

Save the Pies for Dessert

Filed under: Graphics,Interface Research/Design — Patrick Durusau @ 7:28 pm

Save the Pies for Dessert by Stephen Few.

A paper on pie charts and why they are misleading.

I encountered this quite by accident but it is quite good. It illustrates well why pie charts should not be your first choice if you want to communicate effectively.

Something I assume to be a goal of all topic map interface designers, that is, to communicate effectively.

From the paper:

Not long ago I received an email from a colleague who keeps watch on business intelligence vendors and rates their products. She was puzzled that a particular product that I happen to like did not support pie charts, a feature that she assumed was basic and indispensable. Because of previous discussions between us, when I pointed out ineffective graphing practices that are popular in many BI products, she wondered if there might also be a problem with pie charts. Could this vendor’s omission of pie charts be intentional and justified? I explained that this was indeed the case, and praised the vendor’s design team for their good sense.

After reading the paper I think you will pause before including pie chart capabilities in your topic map interface.

The future of information workers according to Microsoft, and BI plays a big part

Filed under: Business Intelligence,Information Workers — Patrick Durusau @ 7:28 pm

The future of information workers according to Microsoft, and BI plays a big part by Kasper de Jonge.

New video from Microsoft about a possible IT future.

Does it have a Futurama (New York World’s Fair) sense to you?

Futurama was an exhibition at the 1939 World’s Fair. To get a sense of the exhibit, view: To New Horizons (1940)

Running time is almost 23 minutes and it takes almost a third of that to get to the Futurama part. A vision of what 1960 will look like.

Watch the MS video and then the other.

Discussion questions:

  • How near/far from the future is the MS video?
  • What semantic impedances need to be reduced for such a future? (people to people, people to data (searching), machine to machine)
  • Which ones first?

Challenge.gov

Filed under: Contest,Marketing — Patrick Durusau @ 7:27 pm

Challenge.gov

From the FAQ:

About challenges

What is a challenge?

A government challenge or contest is exactly what the name suggests: it is a challenge by the government to a third party or parties to identify a solution to a particular problem or reward contestants for accomplishing a particular goal. Prizes (monetary or non–monetary) often accompany challenges and contests.

Challenges can range from fairly simple (idea suggestions, creation of logos, videos, digital games and mobile applications) to proofs of concept, designs, or finished products that solve the grand challenges of the 21st century. Find current federal challenges on Challenge.gov.

About Challenge.gov

Why would the government run a challenge?

Federal agencies can use challenges and prizes to find innovative or cost–effective submissions or improvements to ideas, products and processes. Government can identify the goal without first choosing the approach or team most likely to succeed, and pay only for performance if a winning submission is submitted. Challenges and prizes can tap into innovations from unexpected people and places.

Hard to think of better PR for topic maps than being the solution to one or more of these challenges.

If you know of challenges in other countries or by other organizations, please post or email pointers to them.

Semantic Division of Labor

Filed under: Semantic Web,Semantics — Patrick Durusau @ 7:27 pm

I was talking to Sam Hunting the other day about identifying subjects in texts. Since we were talking about HTML pages, the use of an <a> element to surround PCDATA seems like a logical choice. Simple, easy, something users are accustomed to doing.

Sam mentioned that this is cleaner than RDFa or any of its kin, which require additional effort, a good bit of additional effort, on the part of users. Which made me wonder: why the extra effort? If a user has identified a subject, using an IRI, what more is necessary?

After all, if you identify a subject for me, you don’t have to push a lot of information along with the identification. If I want more information, in addition to what I already have, it is my responsibility to obtain it.

The scenario where you as a user contribute semantics, to the benefit of others, is a semantic division of labor.

What is really ironic is that you have to create the ontologies, invoke them in the correct way and then use a special syntax that works with their machines, to contribute your knowledge and/or identification of subjects. Not only do you get a beating, but you have to bring your own stick.

It isn’t hard to imagine a different division of labor. One where users identify their subjects using simple <a> elements and Wikipedia or other sources that seem useful to them. I am sure the chemistry folks have sources they would prefer as do other areas of activity.

If someone else wants to impose semantics on the identifications of those subjects, that is on their watch, not yours.

True, people will argue that you are contributing to the rise of an intelligent web by your efforts, etc. Sure, and the click tracking done by Google, merchandisers and others is designed to get better products into my hands for my benefit. You know, there are some things I won’t even politely pretend to believe.

People are not tracking information or semantics to benefit you. Sorry, I wish I could say that were different but it’s not. To paraphrase Westley in The Princess Bride, “… anyone who says differently is selling something.”

When Gamers Innovate

When Gamers Innovate

The problem (partially):

Typically, proteins have only one correct configuration. Trying to virtually simulate all of them to find the right one would require enormous computational resources and time.

On top of that there are factors concerning translational-regulation. As the protein chain is produced in a step-wise fashion on the ribosome, one end of a protein might start folding quicker and dictate how the opposite end should fold. Other factors to consider are chaperones (proteins which guide its misfolded partner into the right shape) and post-translation modifications (bits and pieces removed and/or added to the amino acids), which all make protein prediction even harder. That is why homology modelling or “machine learning” techniques tend to be more accurate. However, they all require similar proteins to be already analysed and cracked in the first place.

The solution:

Rather than locking another group of structural shamans in a basement to perform their biophysical black magic, the “Fold It” team created a game. It uses human brainpower, which is fuelled by high-octane logic and catalysed by giving it a competitive edge. Players challenge their three-dimensional problem-solving skills by trying to: 1) pack the protein 2) hide the hydrophobics and 3) clear the clashes.

Read the post or jump to the Foldit site.

Seems to me there are a lot of subject identity and relationship (association) issues that are a lot less complex than protein folding. Not that topic mappers should shy away from protein folding, but we should be more imaginative about our authoring interfaces. Yes?

November 6, 2011

TimesOpen: Social Media on Nov 14

Filed under: Conferences,Social Media — Patrick Durusau @ 5:45 pm

TimesOpen: Social Media on Nov 14

I won’t be in New York for this event but if you are around, it is well worth the time to attend! Social media content (and its semantics) is going to figure prominently in some future topic map applications. Get a glimpse of the future!

We’re excited to announce the next TimesOpen event, a discussion of what’s next in social media technology, interfaces and business models–Monday, November 14, starting at 6:30 p.m. in the Times Building conference facility on the 15th floor. Registration is open. There is no cost to attend. Seats are limited.

BTW, if you do attend, I would appreciate a pointer to your posts about the event. Thanks!

Rdatamarket Tutorial

Filed under: Data Mining,Government Data,R — Patrick Durusau @ 5:44 pm

From the Revolutions blog:

The good folks at DataMarket have posted a new tutorial on using the rdatamarket package (covered here in August) to easily download public data sets into R for analysis.

The tutorial describes how to install the rdatamarket package, how to extract metadata for data sets, and how to download the data themselves into R. The tutorial also illustrates a feature of the package I wasn’t previously aware of: you can use dimension filtering to extract just the portion of the dataset you need: for example, to read just the population data for specific countries from the entire UN World Population dataset.

DataMarket Blog: Using DataMarket from within R

JDBM

Filed under: B+Tree,HTree,JDBM — Patrick Durusau @ 5:44 pm

JDBM

From the webpage:

JDBM is a transactional persistence engine for Java. It aims to be for Java what GDBM is for other languages (C/C++, Python, Perl, etc.): a fast, simple persistence engine. You can use it to store a mix of objects and BLOBs, and all updates are done in a transactionally safe manner. JDBM also provides scalable data structures, such as HTree and B+Tree, to support persistence of large object collections.

This came up in a discussion of data structures very close to the metal and their influence on “accepted” operating characteristics. Would not hurt to run down links and information on GDBM as well. Something for later this week.
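
For the curious, a minimal sketch of the usage described above follows. A caveat: the method names here (createRecordManager, treeMap, commit) are my recollection of the JDBM2 branch of the API and may well differ across JDBM versions, so treat them as assumptions to verify against the project’s documentation.

    import jdbm.PrimaryTreeMap;
    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;

    public class JdbmSketch {
        public static void main(String[] args) throws Exception {
            // Open (or create) a file-backed store; "demo-db" is a placeholder name.
            RecordManager recMan = RecordManagerFactory.createRecordManager("demo-db");

            // A persistent, B+Tree-backed map of record ids to values.
            PrimaryTreeMap<Long, String> articles = recMan.treeMap("articles");
            articles.put(1L, "JDBM: a transactional persistence engine for Java");

            // Updates become durable at commit; rollback() would discard them.
            recMan.commit();
            recMan.close();
        }
    }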

Search Analytics

Filed under: Search Analytics,Searching — Patrick Durusau @ 5:44 pm

Search Analytics

From the post:

Here is another take on Search Analytics, this one being presented at Enterprise Search Summit Fall 2011 in Washington DC, to an audience coming mainly from the US government agencies, very large enterprises, and large international companies with 10s of thousands of employees world wide. The audience was good and posed a number of good questions after the talk. The full slide deck is below as well as in Sematext@Slideshare.

I like the:

If you can’t measure it, you can’t fix it! [emphasis in original, I did fix the punctuation to move the comma from “measure, it” to “measure it,”.]

line. Although I would have liked it better when I was an undergraduate student taking empirical methodology in political science. A number of years later I still agree that measurement is important but am less militant that measurement is always possible or even useful.

Still, a very good slide deck and a good way to start off the week!

Clive Thompson on Why Kids Can’t Search (Wired)

Filed under: Interface Research/Design,Marketing,Searching — Patrick Durusau @ 5:44 pm

Clive Thompson on Why Kids Can’t Search (Wired)

From the post:

We’re often told that young people tend to be the most tech-savvy among us. But just how savvy are they? A group of researchers led by College of Charleston business professor Bing Pan tried to find out. Specifically, Pan wanted to know how skillful young folks are at online search. His team gathered a group of college students and asked them to look up the answers to a handful of questions. Perhaps not surprisingly, the students generally relied on the web pages at the top of Google’s results list.

But Pan pulled a trick: He changed the order of the results for some students. More often than not, those kids went for the bait and also used the (falsely) top-ranked pages. Pan grimly concluded that students aren’t assessing information sources on their own merit—they’re putting too much trust in the machine.

I agree with the conclusion but would add it illustrates a market for topic maps.

A market to deliver critically assessed information as opposed to teaching people to critically assess information. Critical assessment of information, like tensor calculus, can be taught, but how many people are capable of learning/applying it?

Take a practical example (US centric) of the evening news. Every night, for an hour, the news is mostly about murders, fatal accidents, crimes of other sorts, etc. So much so that personal security is a concern for most Americans and they want leaders who are tough on crime, terrorism, etc. Err, but crime rates, including violent crime, have been falling for the last decade. They are approaching all-time lows.

As far as terrorism, well, that is just a bogeyman for security and military budgets. Yes, 9/11, but 9/11 isn’t anything like having monthly or weekly suicide bombers, is it? Americans are in more danger from the annual flu, medical malpractice, drunk drivers, heart disease, and a host of other causes than from terrorism. The insurance companies admit to 100,000 deaths a year from medical “misadventure.” How many Americans died from terrorist attacks last year in the U.S.? That would be the “naught” or “0,” as I was taught.

I suppose my last point, about terrorism, brings up another point about “critical assessment” of information for topic maps. It depends on what your client thinks is “critical assessment.” If you are doing a topic map for the terror defense industry, I would suggest skipping my comparisons with medical malpractice. Legislative auditors, on the other hand, might appreciate a map of expenditures and results, which, for the US Department of Homeland Security, would have a naught in the results column. Local police and the FBI, traditional law enforcement agencies, have been responsible for the few terrorist arrests since 9/11.

I read somewhere a long time ago that advertisers think of us as: “Insecure, sex-starved neurotics with attention spans of about 15 seconds.” I am not sure how to codify that as rules for content and interface design but it is a starting point.

I say all that to illustrate that critical assessment of information isn’t a strong point for the general population (or some of its leaders for that matter). Not just kids but their parents, grandparents, etc.

We may as well ask: Why people can’t critically assess information?

I don’t know the answer to that but I think the evidence indicates it is a rare talent, statistically speaking. And probably varies by domain. People who are capable of critical assessment in one domain may not be capable of it in another.

So, if it is a rare talent, statistically speaking, like hitting home runs, let’s market the ability to critically assess information.

piecemeal geodata

Filed under: Geographic Data,Geographic Information Retrieval,Maps,Visualization — Patrick Durusau @ 5:43 pm

piecemeal geodata

Michal Migurski on the difficulties of using OpenStreetMap data:

Two weeks ago, I attended the 5th annual OpenStreetMap conference in Denver, State of the Map. My second talk was called Piecemeal Geodata, and I hoped to communicate some of the pain (and opportunity) in dealing with OpenStreetMap data as a consumer of the information, downstream from the mappers but hoping to make maps or work with the dataset. Harry Wood took notes that suggested I didn’t entirely miss the mark, but after I was done Tom MacWright congratulated me on my “excellent stealth rage talk”. It wasn’t really supposed to be ragey as such, so here are some of my slides and notes along with some followup to the problems I talked about.

Topic maps are in use in a number of commercial and governmental venues but aren’t the sort of thing you hear about like Twitter or Blackberries (mostly about outages).

Anticipating more civil disturbances over the next several years, do topic maps have something to offer when coupled with a technology like Google Maps or OSM?

It is one thing to indicate your location using an app, but can you report movement of forces in a way that updates the maps of some colleagues? In a secure manner?

What features would a topic map need for such an environment?

high road, for better OSM cartography

Filed under: Geographic Data,Geographic Information Retrieval,Maps,Visualization — Patrick Durusau @ 5:43 pm

high road, for better OSM cartography

From the post:

High Road is a framework for normalizing the rendering of highways from OSM data, a critical piece of every OSM-based road map we’ve ever designed at Stamen. Deciding exactly which kinds of roads appear at each zoom level can really be done just once, and ideally shouldn’t be part of a lengthy database query in your stylesheet. In Cascadenik and regular Mapnik’s XML-based layer definitions, long queries balloon the size of a style until it’s impossible to scan quickly. In Carto’s JSON-based layer definitions the multiline-formatting of a complex query is completely out of the question. Further, each system has its own preferred way of helping you handle road casings.

Useful rendering of geographic maps (and the data you attach to them) is likely to be useful in a number of topic map contexts.

PS: OSM = OpenStreetMap.

Smallest Federated Wiki

Filed under: Federation,Wiki — Patrick Durusau @ 5:43 pm

Smallest Federated Wiki

An interesting wiki application that enables moving information between wiki pages (yours, theirs, ours) while retaining integrity for the original author. I have just started watching the videos.

Videos:

Videos at Github

Videos at YouTube

I had sent Jack Park the link (he was already aware of it) and Jack sent back:

Wiki inventor Ward Cunningham in Conversation with Tom at HealthCamp Oregon

with a note to watch this space. Jack has good instincts so take heed. The Tom in question is Tom Munnecke, inventor of the data core (VistA) that supports half of the operational electronic health records today. I take that as credentials for talking about data systems. Ward Cunningham is the inventor of the wiki. Yeah, that Ward Cunningham. I think Jack is probably right: this is a space to watch.

Munnecke, Health Records and VistA (NoSQL 35 years old?)

Filed under: Data Management,Data Structures,Medical Informatics,MUMPS — Patrick Durusau @ 5:42 pm

Tom Munnecke is the inventor of Veterans Health Information Systems and Technology Architecture (VISTA), which is the core for half of the operational electronic health records in existence today.

From the VISTA monograph:

In 1996, the Chief Information Office introduced VISTA, which is the Veterans Health Information Systems and Technology Architecture. It is a rich, automated environment that supports day-to-day operations at local Department of Veterans Affairs (VA) health care facilities.

VISTA is built on a client-server architecture, which ties together workstations and personal computers with graphical user interfaces at Veterans Health Administration (VHA) facilities, as well as software developed by local medical facility staff. VISTA also includes the links that allow commercial off-the-shelf software and products to be used with existing and future technologies. The Decision Support System (DSS) and other national databases that might be derived from locally generated data lie outside the scope of VISTA.

When development began on the Decentralized Hospital Computer Program (DHCP) in the early 1980s, information systems were in their infancy in VA medical facilities and emphasized primarily hospital-based activities. DHCP grew rapidly and is used by many private and public health care facilities throughout the United States and the world. Although DHCP represented the total automation activity at most VA medical centers in 1985, DHCP is now only one part of the overall information resources at the local facility level. VISTA incorporates all of the benefits of DHCP as well as including the rich array of other information resources that are becoming vital to the day-to-day operations at VA medical facilities. It represents the culmination of DHCP’s evolution and metamorphosis into a new, open system, client-server based environment that takes full advantage of commercial solutions, including those provided by Internet technologies.

Yeah, you caught the alternative expansion of DHCP. Surprised me the first time I saw it.

A couple of other posts/resources on Munnecke to consider:

Some of my original notes on the design of VistA and Rehashing MUMPS/Data Dictionary vs. Relational Model.

From the MUMPS/Data Dictionary post:

This is another never-ending story, now going 35 years. It seems that there are these Mongolean hordes of people coming over the horizon, saying the same thing about treating medical informatics as just another transaction processing system. They know banking, insurance, or retail, so therefore they must understand medical informatics as well.

I looked very seriously at the relational model, and rejected it because I thought it was too rigid for the expression of medical informatics information. I made a “grand tour” of the leading medical informatics sites to look at what was working for them. I read and spoke extensively with Chris Date http://en.wikipedia.org/wiki/Christopher_J._Date , Stanford CS prof Gio Wiederhold http://infolab.stanford.edu/people/gio.html (who was later to become the major professor of PhD dropout Sergy Brin), and Wharton professor Richard Hackathorn. I presented papers at national conventions AFIPS and SCAMC, gave colloquia at Stanford, Harvard Medical School, Linkoping University in Sweden, Frankfurt University in Germany, and Chiba University in Japan.

So successful, widespread and mainstream NoSQL has been around for 35 years? 😉

End-to-end NLP packages

Filed under: Natural Language Processing — Patrick Durusau @ 5:42 pm

End-to-end NLP packages

From the post:

What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time. But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.

If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren’t a ton of such end-to-end, multilevel systems. Here are ones I can think of. Corrections and clarifications welcome.

Brendan O’Connor provides a nice listing of end-to-end NLP packages. One or more may be useful in the creation of topic maps based on large amounts of textual data.
