Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 8, 2011

…Harnessing Big Data

Filed under: BigData,Marketing — Patrick Durusau @ 8:12 pm

How Government Could Boost Its Performance by Harnessing Big Data by Robert Atkinson, President, Information Technology and Innovation Foundation.

From the post:

  1. Electric power utilities can use data analytics and smart meters to better manage resources and avoid blackouts,
  2. Food inspectors can use data to better track meat and produce safety from farm to fork,
  3. Public health officials can use health data to detect infectious disease outbreaks,
  4. Regulators can track pharmaceutical and medical device safety and effectiveness through better data analytics,
  5. Police departments can use data analytics to target crime hotspots and prevent crime waves,
  6. Public utilities can use sensors to collect data on water and sewer usage to detect leaks and reduce water consumption,
  7. First responders can use sensors, GPS, cameras and better communication systems to let police and fire fighters better protect citizens when responding to emergencies, and
  8. State departments of transportation can use data to reduce traffic, more efficiently deploy resources, and implement congestion pricing systems

Numbering added for ease of reference.

By the numbers:

  1. Electric power utilities…[investment in smart meters is required, and blackouts are usually the result of system failure; monitoring demand isn’t going to help],
  2. Food inspectors… [without adequate food inspectors to enforce standards, tracking potentially unhealthy food isn’t all that interesting a problem],
  3. Public health officials… [already use data to detect disease outbreaks, how did you think it happened?],
  4. Regulators can track… [to do what? medical devices are already tracked],
  5. Police departments… [police officers don’t know the usual crime spots? then they need different police officers],
  6. Public utilities… [only if they have the sensors and the ability to effect repairs],
  7. First responders… [being able to talk to each other would have a higher priority; most still don’t, ten years after 9/11], and
  8. State departments of transportation… [counting cars will reduce their numbers? I must tell our local transportation department].

“Big data” is the flavor of the month but it doesn’t improve your credibility to invoke “big data” when there is none to be seen.

Let’s not make the same mistake with semantic identity topics.

Machine Learning on Hadoop at Huffington Post | AOL

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 8:12 pm

Machine Learning on Hadoop at Huffington Post | AOL

Nice slide deck on creating a pluggable platform for testing large numbers of algorithms and then selecting the best.

October 7, 2011

HaptiMap toolkit beta release

Filed under: Interface Research/Design,Software — Patrick Durusau @ 6:20 pm

HaptiMap toolkit beta release

From the HaptiMap homepage:

HaptiMap toolkit beta release. The HaptiMap toolkit provides a simple cross-platform API that abstracts the complexities of

  • dealing with haptic / audio / visual input and output on a cross-platform basis
  • retrieving, storing and manipulating geographic data

behind a simple interface, leaving user interface developers free to concentrate on maximizing the usability and accessibility of their applications.

Hmmm, a new interface for your topic map?

Graphic Database, NoSQL and Neo4j

Filed under: Neo4j,NoSQL — Patrick Durusau @ 6:19 pm

Graphic Database, NoSQL and Neo4j

Skip if you already know the basics, but it could be an explanation that resonates with newbies to NoSQL/Neo4j.

Usergrid Source Code Release on GitHub

Filed under: Cassandra,Usergrid — Patrick Durusau @ 6:19 pm

Usergrid Source Code Release on GitHub

From the webpage:

We’re announcing today the first source code release of Usergrid, a comprehensive platform stack for mobile and rich client applications. The entire codebase is now available on GitHub at https://github.com/usergrid/stack. Usergrid is built in Java and runs on top of Cassandra. Although we built Usergrid as a highly scalable cloud service, we’ve also taken a few steps to make it easy to run “small”, including providing a double-clickable desktop app that lets you run your own personal installation on your desktop, so you can get started right away.

I thought I read about “rich clients” with HTML5.

But the W3C web design team buried the HTML 5 draft 5 clicks deep from their homepage. Good thing I knew to keep looking. That’s not just poor marketing, that’s also poor design.

A future of incompatibility awaits.

Deterministic Parallel Programming in Haskell

Filed under: Haskell,Parallelism — Patrick Durusau @ 6:18 pm

Deterministic Parallel Programming in Haskell by Andres Löh (pdf)

The bundle for the tutorial with slides, source code and exercises: http://www.well-typed.com/Hal6/Hal6.zip.

Very interesting slide deck on parallel programming in Haskell but you will also learn about parallelism in general.

Whether needed or not, you have to admit parallelism is a current marketing phrase.

Optimizing Findability in Lucene and Solr

Filed under: Lucene,Solr — Patrick Durusau @ 6:18 pm

Optimizing Findability in Lucene and Solr

From the post:

To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.

I would ask:

“If content is available on the WWW and you can’t find it, does it still exist?”

Unlike the tree example, I think that has a fairly clear answer: No.

It can’t influence your decisions, it can’t shape your policies, it can’t form the basis for new ideas or products, or to help you avoid costly mistakes. That sounds like it doesn’t exist to me.

The post is fairly detailed but well worth the effort. Enjoy!

DeepaMehta 3 v0.5 – Property-Less Data Model

Filed under: Software,Subject Identity — Patrick Durusau @ 6:18 pm

DeepaMehta 3 v0.5 – Property-Less Data Model

I started to outline all the issues with the property-less solution but then thought, what a nice classroom exercise!

What do you think are the issues with the “solution?” Write a maximum of three (3) pages with no citations.

LDIF – Linked Data Integration Framework Version 0.3

Filed under: Data Integration,Linked Data,LOD — Patrick Durusau @ 6:17 pm

LDIF – Linked Data Integration Framework Version 0.3

From the email announcement:

The LDIF – Linked Data Integration Framework can be used within Linked Data applications to translate heterogeneous data from the Web of Linked Data into a clean local target representation while keeping track of data provenance. LDIF provides an expressive mapping language for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary. LDIF includes an identity resolution component which discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics. For provenance tracking, the LDIF framework employs the Named Graphs data model.

Compared to the previous release 0.2, the new LDIF release provides:

  • data access modules for gathering data from the Web via file download, crawling and accessing SPARQL endpoints. Web data is cached locally for further processing.
  • a scheduler for launching data import and integration jobs as well as for regularly updating the local cache with data from remote sources.
  • a second use case that shows how LDIF is used to gather and integrate data from several music-related Web data sources.

More information about LDIF, concrete usage examples and performance details are available at http://www4.wiwiss.fu-berlin.de/bizer/ldif/

Over the next months, we plan to extend LDIF along the following lines:

  1. Implement a Hadoop Version of the Runtime Environment in order to be able to scale to really large amounts of input data. Processes and data will be distributed over a cluster of machines.
  2. Add a Data Quality Evaluation and Data Fusion Module which allows Web data to be filtered according to different data quality assessment policies and provides for fusing Web data according to different conflict resolution methods.

Uses the identity resolution semantics of SILK (SILK – Link Discovery Framework Version 2.5).
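
The identity-resolution step is easy to picture in miniature: discover URI aliases, then rewrite every alias to a single target URI. A minimal sketch in Python follows; this is not LDIF’s actual API, and the function, URIs and triples are invented for illustration:

    # Rewrite subject/object URIs using an alias -> target URI table,
    # the output of some matching heuristic (SILK plays that role in LDIF).
    def resolve_aliases(triples, same_as):
        canonical = lambda uri: same_as.get(uri, uri)
        return [(canonical(s), p, canonical(o)) for s, p, o in triples]

    same_as = {
        "http://dbpedia.org/resource/The_Beatles": "http://example.org/band/beatles",
    }
    triples = [
        ("http://dbpedia.org/resource/The_Beatles", "name", "The Beatles"),
        ("http://example.org/band/beatles", "founded", "1960"),
    ]
    print(resolve_aliases(triples, same_as))
    # Both triples now share the single target URI.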

Uberblic

Filed under: Linked Data,Semantic Diversity — Patrick Durusau @ 6:17 pm

Uberblic

From the about documentation:

The Doppelganger service translates between IDs of entities in third party APIs. When you query Doppelganger with an entity ID, you’ll get back IDs of that same entity in other APIs. In addition, a persistent Uberblic ID serves as an anchor for your application that you can use for subsequent queries.

“So why link APIs?” is answered in a blog entry:

There is an ever-increasing amount of data available on the Web via APIs, waiting to be integrated by product developers. But actually integrating more than just one API into a product poses a problem to developers and their product managers: how do we make the data sources interoperable, both with one another and with our existing databases? Uberblic launches a service today to make that easy.

A location based product, for example, would aim to pull in information like checkins from Foursquare, reviews from Lonely Planet, concerts from LastFM and social connections from Facebook, and display all that along with one place’s description. To do that, one would need to identify this particular place in all the APIs – identify the place’s ‘doppelgangers’, if you will. Uberblic does exactly that, mapping doppelgangers across APIs, as a web service. It’s like a dictionary for IDs, the Rosetta Stone of APIs. And geolocation is just the beginning.

Uberblic’s doppelganger engine links data across a variety of data APIs. By matching equivalent records, the engine connects an entity graph that spans APIs and data services. This entity graph provides rich contextual data for product developers, and Uberblic’s APIs serve as a switchboard and broker between data sources.

See the full post at: http://uberblic.com/2011/08/one-api-to-link-them-all/

Useful. But as you have already noticed, no associations, no types, no way to map to other identifiers.

Not that a topic map could not use Uberblic data if available; it’s just not all that is possible.
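
Still, the “dictionary for IDs” idea is worth a sketch. A toy version in Python; the service names, IDs and lookup structure are invented, not Uberblic’s API:

    # Map any one service's entity ID to the IDs of the same entity
    # elsewhere, via a persistent anchor ID. All IDs below are made up.
    DOPPELGANGERS = {
        "ub:place/42": {                      # anchor ID -> per-service IDs
            "foursquare": "4af5a2c1f964a520aa1b",
            "lonelyplanet": "poi-355102",
            "facebook": "110506962309835",
        },
    }
    ANCHORS = {sid: anchor
               for anchor, ids in DOPPELGANGERS.items()
               for sid in ids.values()}

    def doppelgangers(service_id):
        anchor = ANCHORS.get(service_id)
        return DOPPELGANGERS.get(anchor, {}) if anchor else {}

    print(doppelgangers("poi-355102"))        # every known ID for that place

Notice what the sketch makes obvious: the mapping records that two IDs name the same thing, but nothing about what kind of thing it is or how it relates to anything else, which is exactly the limitation noted above.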

Getting Creative with MapReduce

Filed under: Algorithms,Cascalog,MapReduce — Patrick Durusau @ 6:16 pm

Getting Creative with MapReduce

From the post:

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data.

In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms.

The author advocates the use of Cascalog and its testing suite. Comments?
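
To see the argument in miniature: if the map and reduce logic are pure functions, tests never need a cluster. A hedged sketch in plain Python rather than Cascalog, with a hypothetical word count as the workflow:

    from collections import defaultdict

    def mapper(line):                        # pure: no HDFS, no job config
        for word in line.split():
            yield word.lower(), 1

    def reducer(pairs):                      # pure: just aggregation logic
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    def run(lines):                          # storage decisions live here only
        return reducer(pair for line in lines for pair in mapper(line))

    # An end-to-end test that runs in microseconds, no Hadoop required.
    assert run(["a b a"]) == {"a": 2, "b": 1}

Swap the driver for a real Hadoop or Cascalog runner and the tested logic goes along unchanged; that separation is the post’s whole point.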

SpicyNodes

Filed under: Graphs,Visualization,Vizigator — Patrick Durusau @ 6:16 pm

SpicyNodes

Suggested to me by Jill Nelson as an example of node rendering. Possibly a useful example for topic maps.

I didn’t spend a lot of time evaluating the site, particularly since we have a history of software like Vizigator, which accomplishes the same thing, and a bit more directly for topic maps.

Let me know if you think SpicyNodes merits a deeper review for its usefulness in connection with topic maps.

October 6, 2011

PostScript as a Programming Language for Bioinformatics

Filed under: Bioinformatics,PostScript,Visualization — Patrick Durusau @ 5:36 pm

PostScript as a Programming Language for Bioinformatics

From the post:

“PostScript (PS) is an interpreted, stack-based programming language. It is best known for its use as a page description language in the electronic and desktop publishing areas.”[wikipedia]. In this post, I’ll show how I’ve used it to create a simple and lightweight view of the genome.

Awesome in a number of respects! Have you used PostScript to visualize a topic map? Not that it would be likely to be a production device but the discipline of doing it could be interesting.
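
Since PostScript is plain text, “using PostScript” can simply mean generating it from another language. A minimal sketch of the genome-view idea, emitting raw PostScript from Python; the gene names, coordinates and scale are invented:

    # Draw one labeled box per gene interval onto a PostScript page.
    genes = [("geneA", 100, 400), ("geneB", 550, 900)]  # (name, start, end) in bp
    scale = 500.0 / 1000                                # 1000 bp across 500 points

    ps = ["%!PS-Adobe-3.0", "/Helvetica findfont 8 scalefont setfont"]
    for i, (name, start, end) in enumerate(genes):
        x, w, y = 50 + start * scale, (end - start) * scale, 700 - i * 20
        ps.append(f"newpath {x} {y} moveto {w} 0 rlineto "
                  f"0 10 rlineto {-w} 0 rlineto closepath stroke")
        ps.append(f"{x} {y + 12} moveto ({name}) show")
    ps.append("showpage")

    with open("genome.ps", "w") as f:
        f.write("\n".join(ps))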

Data Without Borders

Filed under: Data,Non-Profit,Volunteer — Patrick Durusau @ 5:35 pm

Data Without Borders

From the blog site:

Data Without Borders seeks to match non-profits in need of data analysis with freelance and pro bono data scientists who can work to help them with data collection, analysis, visualization, or decision support.

Would not be a bad place to show off your topic maps skills and the results of using them!

Sandbox from YAHOO! Research

Filed under: Dataset — Patrick Durusau @ 5:34 pm

Sandbox from YAHOO! Research

Saw this on Machine Learning (Theory) as reported by Lihong Li.

The data sets sound really great, but then I read:

Eligibility:

Yahoo! is pleased to make these datasets available to researchers who are advancing the state of knowledge and understanding in web sciences. The datasets are only available for academic use by faculty and university researchers who agree to the Data Sharing Agreement.

To be eligible to receive Webscope data you must:

  • Be a faculty member, research employee or student from an accredited university
  • Send the data request from an accredited university .edu (or equivalent domain name for international universities) email address
  • Ensure that your request has been acknowledged by your Department Chair

We are not able to share data with:

  • Commercial entities
  • Employees of commercial entities with university appointment
  • Research institutions not affiliated with a research university

Note: You must have a Yahoo! account to apply for Webscope datasets.

I think I can pass everything except “employees of commercial entities with university appointment” since I am an adjunct faculty member and work outside the university as my primary means of support.

This reads like someone who doesn’t want to share data trying to think of foot-faults to build into the sharing process. Such as “acknowledged by your Department Chair.” Acknowledged to whom? By what means? Is once enough?

I can understand reasonable restrictions, say non-commercial use, attribution on publication, contribution of improvements back to the community, etc., but the user community deserves better rules than these.

KDD and MUCMD 2011

Filed under: Bioinformatics,Biomedical,Data Mining,Knowledge Discovery — Patrick Durusau @ 5:33 pm

KDD and MUCMD 2011

An interesting review of KDD and MUCMD (Meaningful Use of Complex Medical Data) 2011:

At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

Cites a number of their favorite papers. Which ones are yours?

Domain Adaptation with Hierarchical Logistic Regression

Filed under: Bayesian Models,Classifier,LingPipe,Regression — Patrick Durusau @ 5:32 pm

Domain Adaptation with Hierarchical Logistic Regression

Bob Carpenter continues his series on domain adaptation:

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.
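
The hierarchical idea is compact enough to sketch. A minimal numpy version, not Bob Carpenter’s code: each domain d gets its own weight vector w[d], penalized toward a shared mean, i.e. a MAP estimate under w[d] ~ Normal(mu, tau^2); all data below is synthetic:

    import numpy as np

    def fit(Xs, ys, tau=1.0, lr=0.1, steps=500):
        # Xs, ys: per-domain design matrices and {0,1} label vectors.
        D, p = len(Xs), Xs[0].shape[1]
        w = np.zeros((D, p))
        for _ in range(steps):
            mu = w.mean(axis=0)                      # shared, cross-domain mean
            for d in range(D):
                z = 1 / (1 + np.exp(-Xs[d] @ w[d]))  # logistic predictions
                grad = Xs[d].T @ (z - ys[d]) + (w[d] - mu) / tau**2
                w[d] -= lr * grad / len(ys[d])
        return w

    rng = np.random.default_rng(0)
    Xs = [rng.normal(size=(50, 3)) for _ in range(2)]   # two toy domains
    ys = [(X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float) for X in Xs]
    print(fit(Xs, ys))

Shrink tau and the domains collapse toward one shared classifier; grow it and each domain fits independently, which is the domain-adaptation dial.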

Who Is Using SharePoint? The Fortune 500 That Is Who

Filed under: Marketing,SharePoint — Patrick Durusau @ 5:31 pm

Who Is Using SharePoint? The Fortune 500 That Is Who

From the post (Beyond Search):

Oh boy! Our content wranglers found another great list and we are excited about it. Once more TopSharePoint.com pools its sources to gather twenty-five “Fortune 500 Companies Using SharePoint.”

The title of the post is slightly misleading. From the TopSharePoint.com article:

Below you will find a list of a few Fortune 500 companies using SharePoint technology for their public-facing websites. This review is trying to highlight the adoption of SharePoint in the corporate world as well as the customization level these companies accomplished.

So this list covers only some of the Fortune 500 companies using SharePoint for their public-facing websites. That sounds like a much smaller number than the Beyond Search post would imply. Granted, I think search can be improved by any number of technologies, including topic maps, but let’s be accurate in how we represent other posts. A list of Fortune 500 companies that use SharePoint would be quite a bit longer than the one at TopSharePoint.com.

Q-Sensei: Multi-Dimensional Information Management

Filed under: Humor — Patrick Durusau @ 5:30 pm

Q-Sensei: Multi-Dimensional Information Management

From the post (at Beyond Search):

I found the MarketWatch story or news release “Frost & Sullivan Recognizes Q-Sensei’s Innovative Enterprise Search Platform for Providing Relevant Search Results across Information Sources” a buzzword bonanza. The system seems more versatile than Autonomy’s, Exalead’s, and Apache Lucene combined if I believe the story or news release. I am confident some of the azure chip crowd and the former librarians laboring away as search experts will gobble the hook and its plastic worm. Geese eat bread crumbs and trash, by the way.

This post is amusing beyond description. Please read.

VII PythonBrasil

Filed under: Python,Recommendation — Patrick Durusau @ 5:30 pm

VII PythonBrasil

Marcel Caraciolo covers his slides from keynotes at VII PythonBrasil; the most interesting for topic mappers would be Crab – A Python Framework for Building Recommender Systems.

Recommender systems by necessity have to identify the interests of a user (2 subjects, interests and user), match those to other interests (another subject) and then produce a recommendation (yet another subject), plus relationship subjects if you are interested. Recommender systems are already identifying all those subjects and gathering instances of them together.

What would you do to make their constantly interim results available to other systems?
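
A toy sketch of what “making interim results available” could mean, in Python; the users, interests and record layout are invented:

    from itertools import combinations

    # Each user's interests: the subjects a recommender already identifies.
    interests = {"ann": {"jazz", "blues"}, "bob": {"jazz", "rock"}}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    interim = []                    # the normally-discarded interim results
    for u, v in combinations(interests, 2):
        interim.append({"user": u, "match": v,
                        "score": jaccard(interests[u], interests[v]),
                        "shared": sorted(interests[u] & interests[v])})

    for record in interim:          # another system could merge on these
        print(record)

Each record names its subjects explicitly (user, matched user, shared interests), which is what another system would need in order to merge on them.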

High Performance Computing with Python – Part 03

Filed under: Python,Similarity — Patrick Durusau @ 5:29 pm

High Performance Computing with Python – Part 03

From the post:

In this series we will analyze how to optimize the statistical Spearman Rank Correlation coefficient, which is a particular measure used to compute the similarity between items in recommender systems and assesses how well the relationship between two variables can be described using a monotonic function. The source code for this metric can be found in the first post.

Great series and all the greater for covering measures of similarity between items. If something becomes similar enough, it could be considered to be the same thing.
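
For readers who have not met it, the measure itself fits in a few lines of plain Python: rank both lists, then take the Pearson correlation of the ranks. (This sketch gives tied values consecutive ranks rather than averaged ranks, so it is for intuition, not production.)

    def rank(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0.0] * len(xs)
        for position, idx in enumerate(order):
            r[idx] = position + 1.0
        return r

    def spearman(xs, ys):
        rx, ry = rank(xs), rank(ys)
        n = len(xs)
        mx, my = sum(rx) / n, sum(ry) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
        vx = sum((a - mx) ** 2 for a in rx) ** 0.5
        vy = sum((b - my) ** 2 for b in ry) ** 0.5
        return cov / (vx * vy)

    print(spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]))   # 0.9 on this toy data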

Event Stream Processor Matrix

Filed under: Event Stream,Stream Analytics — Patrick Durusau @ 5:28 pm

Event Stream Processor Matrix

From the post:

We published our first ever UI-focused post on Top JavaScript Dynamic Table Libraries the other day and got some valuable feedback – thanks!

We are back to talking about the backend again. Our Search Analytics and Scalable Performance Monitoring services/products accept, process, and store huge amounts of data. One thing both of these services do is process a stream of events in real-time (and batch, of course). So what solutions are there that help one process data in real-time and perform some operations on a rolling window of data, such as the last 5 or 30 minutes of incoming event stream? We know of several solutions that fit that bill, so we decided to put together a matrix with essential attributes of those tools in order to compare them and make our pick. Below is the matrix we came up with. If you are viewing this on our site, the table is likely going to be too wide, but it should look fine in a proper feed reader.

Another great collection of resources from Sematext!

So, there must be subjects, because something is being recognized in order to have “events.” That implies to me that with subjects being recognized, it is likely enterprises have other information about those subjects that would be useful to have together, “merged” I think is the term I want. 😉

Has anyone seen “events” recognized in one system being populated with information from another system? With different criteria for the same subject?
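
For concreteness, here is a minimal rolling-window sketch in Python, the kind of operation every tool in the matrix performs; the event names and fields are invented:

    import time
    from collections import deque

    class RollingWindow:
        def __init__(self, seconds):
            self.seconds = seconds
            self.events = deque()              # (timestamp, subject, payload)

        def add(self, subject, payload, now=None):
            now = time.time() if now is None else now
            self.events.append((now, subject, payload))
            while self.events and self.events[0][0] < now - self.seconds:
                self.events.popleft()          # expire events outside the window

        def count(self, subject):
            return sum(1 for _, s, _ in self.events if s == subject)

    w = RollingWindow(seconds=300)             # "the last 5 minutes"
    w.add("login-failure", {"host": "web1"}, now=0)
    w.add("login-failure", {"host": "web2"}, now=290)
    w.add("heartbeat", {}, now=301)            # pushes the t=0 event out
    print(w.count("login-failure"))            # -> 1

The subject field is doing the interesting work: it is the hook on which information from another system could be merged.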

October 5, 2011

Machine Learning Module

Filed under: Machine Learning — Patrick Durusau @ 6:58 pm

Machine Learning Module

For anyone taking the machine learning course at Stanford this Fall, some supplemental materials you may find interesting.

Early Music Online

Filed under: Library,Music — Patrick Durusau @ 6:57 pm

Early Music Online

From the website:

Early Music Online is a pilot project in which 300 of the world’s earliest surviving volumes of printed music, held in the British Library, have been digitised and made freely available online.

You can explore the digitised content via the British Library Catalogue. Included are full details of each digitised book, with an inventory of the contents of each, searchable by composer name, title of composition, date and subject, and with links to the digitised content. (Click ‘I want this’ in the Library catalogue to access the digitised music.)

(The British Library link takes you directly to the collection and not to the British Library homepage.)

There are a number of uses which suggest themselves for this data.

On Understanding Data Abstraction, Revisited

Filed under: Data,Semantics — Patrick Durusau @ 6:56 pm

On Understanding Data Abstraction, Revisited by William R. Cook.

Abstract:

In 1985 Luca Cardelli and Peter Wegner, my advisor, published an ACM Computing Surveys paper called “On understanding types, data abstraction, and polymorphism”. Their work kicked off a flood of research on semantics and type theory for object-oriented programming, which continues to this day. Despite 25 years of research, there is still widespread confusion about the two forms of data abstraction, abstract data types and objects. This essay attempts to explain the differences and also why the differences matter.

With all the talk about data types, this is worth re-reading.

C-Rank: A Link-based Similarity Measure for Scientific Literature Databases

Filed under: C-Rank,Similarity — Patrick Durusau @ 6:55 pm

C-Rank: A Link-based Similarity Measure for Scientific Literature Databases by Seok-Ho Yoon, Sang-Wook Kim, and Sunju Park.

Abstract:

As the number of people who use scientific literature databases grows, the demand for literature retrieval services has been steadily increased. One of the most popular retrieval services is to find a set of papers similar to the paper under consideration, which requires a measure that computes similarities between papers. Scientific literature databases exhibit two interesting characteristics that are different from general databases. First, the papers cited by old papers are often not included in the database due to technical and economic reasons. Second, since a paper references the papers published before it, few papers cite recently-published papers. These two characteristics cause all existing similarity measures to fail in at least one of the following cases: (1) measuring the similarity between old, but similar papers, (2) measuring the similarity between recent, but similar papers, and (3) measuring the similarity between two similar papers: one old, the other recent. In this paper, we propose a new link-based similarity measure called C-Rank, which uses both in-link and out-link by disregarding the direction of references. In addition, we discuss the most suitable normalization method for scientific literature databases and propose an evaluation method for measuring the accuracy of similarity measures. We have used a database with real-world papers from DBLP and their reference information crawled from Libra for experiments and compared the performance of C-Rank with those of existing similarity measures. Experimental results show that C-Rank achieves a higher accuracy than existing similarity measures.

Reviews other link-based similarity measures and compares them to the proposed C-Rank measure, both in theory and in actual experiments.
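
The core move is easy to sketch: treat a paper’s in-links and out-links as one undirected neighborhood, then compare neighborhoods. The toy below uses a one-step Jaccard overlap as a stand-in for C-Rank’s full recursive computation, with an invented citation graph:

    cites = {                       # paper -> papers it references
        "old1": {"ancient"},
        "old2": {"ancient"},
        "new1": {"old1", "old2"},
        "new2": {"old1", "old2"},
    }

    def neighbors(p):
        out = cites.get(p, set())
        inn = {q for q, refs in cites.items() if p in refs}
        return out | inn            # reference direction disregarded

    def sim(a, b):
        na, nb = neighbors(a), neighbors(b)
        return len(na & nb) / len(na | nb) if na | nb else 0.0

    print(sim("new1", "new2"))      # recent papers: linked through out-links
    print(sim("old1", "old2"))      # old papers: linked through in- and out-links

Even this crude version shows why ignoring direction helps with the recent-paper and old-paper cases the abstract describes.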

Interesting use of domain experts to create baseline similarity measures, against which the proposed measures were compared.

I am not quite convinced by the “difference” argument for scientific literature:

First, few papers exist which are referenced by old papers. This is because very old papers are often not included in the database due to technical and economic reasons. Second, since a paper can reference only the papers published before it (and never the papers published after it), there exist few papers which reference recently-published papers.

As far as old papers not being included in the database, the authors should try philosophy articles, which cite a wide range of material that is very unlikely to be in a literature database. (Google Books may be changing that for “recent” literature.)

On scientific papers not citing recent papers, I suspect that simply isn’t true. David P. Hamilton (Science, 251:25, 1991), in Research Papers: Who’s Uncited Now?, comments on work by David Pendlebury of the Institute for Scientific Information that demonstrated “Atomic, molecular, and chemical physics, a field in which only 9.2% of articles go uncited…” within five years of publication. That sounds like recent papers being cited to me.

If you are interested in citation practices for monographs, see: Citation Characteristics and Intellectual Acceptance of Scholarly Monographs by Rong Tang (Coll. res. libr. July 2008 69:356-369).

If it isn’t already on your regular reading list, College & Research Libraries should be.

I mention all that to point out that exploring the characteristics of information collections may turn up surprising facts, facts that can influence the development of algorithms for search and similarity, and ultimately their usefulness to users.

Catalog QUDT

Filed under: Measurement,Ontology,Semantic Web — Patrick Durusau @ 6:54 pm

Catalog QUDT

From the website:

The QUDT, or ‘Quantity, Unit, Dimension and Type’ collection of ontologies define base classes, properties, and instances for modeling physical quantities, units of measure, and their dimensions in various measurement systems. The goal of the QUDT collection of models is to provide a machine-processable approach for specifying measurable quantities, units for measuring different kinds of quantities, the numerical values of quantities in different units of measure and the data structures and data types used to store and manipulate these objects in software. A simple treatment of units is separated from a full dimensional treatment of units. Vocabulary graphs will be used to organize units for different disciplines.

Useful in a number of domains. Comparison to other measurement ontology efforts should prove to be interesting.
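
The core modeling idea is simple to sketch: a unit carries its dimension and a conversion factor to a base unit, so quantities convert mechanically and dimension mismatches are caught. A minimal Python sketch, far short of QUDT’s actual ontology, with two invented unit definitions:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Unit:
        symbol: str
        dimension: str      # e.g. "length"
        to_base: float      # multiplier into the dimension's base unit

    METRE = Unit("m", "length", 1.0)
    FOOT = Unit("ft", "length", 0.3048)

    @dataclass(frozen=True)
    class Quantity:
        value: float
        unit: Unit

        def to(self, target):
            if target.dimension != self.unit.dimension:
                raise ValueError("dimension mismatch")
            return Quantity(self.value * self.unit.to_base / target.to_base,
                            target)

    print(Quantity(10, FOOT).to(METRE))   # 3.048 metres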

In Defense of Ambiguity

Filed under: Ambiguity,RDF,Semantic Web — Patrick Durusau @ 6:52 pm

In Defense of Ambiguity by Patrick J. Hayes and Harry A. Halpin.

Abstract:

URIs, a universal identification scheme, are different from human names insofar as they can provide the ability to reliably access the thing identified. URIs also can function to reference a non-accessible thing in a similar manner to how names function in natural language. There are two distinctly different relationships between names and things: access and reference. To confuse the two relations leads to underlying problems with Web architecture. Reference is by nature ambiguous in any language. So any attempts by Web architecture to make reference completely unambiguous will fail on the Web. Despite popular belief otherwise, making further ontological distinctions often leads to more ambiguity, not less. Contrary to appeals to Kripke for some sort of eternal and unique identification, reference on the Web uses descriptions and therefore there is no unambiguous resolution of reference. On the Web, what is needed is not just a simple redirection, but a uniform and logically consistent manner of associating descriptions with URIs that can be done in a number of practical ways that should be made consistent.

Highly readable critique with passages such as:

There are two distinct relationships between names and things: reference and access. The architecture of the Web determines access, but has no direct influence on reference. Identifiers like URIs can be considered types of names. It is important to distinguish these two possible different relationships between a name and a thing.

1. accesses, meaning that the name provides a causal pathway to the thing, perhaps mediated by the Web.

2. refers to, meaning that the name is being used to mention the thing.

Current practice in Web Architecture uses “identifies” to mean both or either of these, apparently in the belief that they are synonyms. They are not, and to think of them as being the same is to be profoundly confused. For example, when uttering the name “Eiffel Tower” one does not in any way get magically transported to the Eiffel Tower. One can talk about it, have beliefs, plan a trip there, and otherwise have intentions about the Eiffel Tower, but the name has no causal path to the Eiffel Tower itself. In contrast, the URI http://www.tour-eiffel.fr/ offers us access to a group of Web pages via an HTTP-compliant agent. A great deal of the muddle Web architecture finds itself in can be directly traced to this confusion between access and reference.

The solution proffered by Hayes and Halpin:

Regardless of the details, the use of any technology in Web architecture to distinguish between access and reference, including our proposed ex:refersTo and ex:describedBy, does nothing more than allow the author of a URI to explain how they would like the URI to be used.
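
Their proposal reduces to two distinct assertions about one URI, which is easy to state in triples. A sketch using Python’s rdflib; ex: is the paper’s example namespace and the target URIs are merely illustrative:

    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/terms/")
    g = Graph()
    name = URIRef("http://www.tour-eiffel.fr/")

    # What the URI provides access to: web pages.
    g.add((name, EX.describedBy, URIRef("http://www.tour-eiffel.fr/en.html")))
    # What the URI refers to: the tower itself, which no protocol can fetch.
    g.add((name, EX.refersTo, URIRef("http://dbpedia.org/resource/Eiffel_Tower")))

    print(g.serialize(format="turtle"))

Access and reference become two explicit, separately queryable relationships instead of one ambiguous “identifies.”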

For those interested in previous recognitions of this distinction, see <resourceRef> and <subjectIndicatorRef> in XTM 1.0.

Datawrangler

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 6:50 pm

Datawrangler

From the post:

Formatting data is a necessary pain, so anything that makes formatting easier is always welcome. Data Wrangler, from the Stanford Visualization Group, is the latest in the growing set of tools to get your data the way you need it (so that you can get to the fun part already). It’s similar to Google Refine in that they’re both browser-based, but my first impression is that Data Wrangler is more lightweight and it feels more responsive.

Data Wrangler also seems to do more guesswork, so you can set less specific parameters. Just roll over stuff, and it’ll show a preview of possible changes or formatting. Keep the change or easily undo it.

The video below describes what all the tool can do, but it’s better to just try it out. Copy and paste your own mangled data or give Data Wrangler a whirl with the sample provided.

From our friends at FlowingData. Perhaps we should ask: Does data exist if it isn’t visualized?

The concept of “aboutness” in subject indexing

Filed under: Indexing,Information Overload — Patrick Durusau @ 6:49 pm

I just finished reading a delightful paper by W. J. Hutchins, ‘The concept of “aboutness” in subject indexing,’ which was presented at a Colloquium on Aboutness held by the Co-ordinate Indexing Group, 18 April 1977 and was reprinted in Readings in Information Retrieval, edited by Karen Sparck Jones and Peter Willett, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1997.

I discovered the paper in a hard copy of Readings in Information Retrieval, but it is also online, The concept of “aboutness” in subject indexing.

Hutchins writes in his abstract:

The common view of the ‘aboutness’ of documents is that the index entries (or classifications) assigned to documents represent or indicate in some way the total contents of documents; indexing and classifying are seen as processes involving the ‘summarization’ of the texts of documents. In this paper an alternative concept of ‘aboutness’ is proposed based on an analysis of the linguistic organization of texts, which is felt to be more appropriate in many indexing environments (particularly in non-specialized libraries and information services) and which has implications for the evaluation of the effectiveness of indexing systems.

You can read the details of how he suggests discovering the “aboutness” of documents but I was struck by his observation that the ‘summarization’ practice furthers the end of exhaustive search. Under Objectives of indexing, Hutchins says:

In the context of the special library and similarly specialized information services, the ‘summarization’ approach to subject indexing is most appropriate. Indexers are generally able to define clearly the interests and levels of knowledge of the readers they are serving; they are thus able to produce ‘summaries’ biased in the most helpful directions for their readers. More importantly, indexers can normally assume that most users are already very knowledgeable on most of the topics they look for in the indexes provided. They can assume that the usual search is for references to all documents treating a particular topic, since any one may have something ‘new’ to say about it that the reader did not know before. The fact that some references will lead users to texts which tell them nothing they did not previously know should not normally worry them unduly—it is the penalty they expect to pay for the assurance that the search has been as exhaustive as feasible.

Exhaustive search is one type of search that drives tests for the success of indexing:

The now traditional parameters of ‘recall’, ‘precision’ and ‘fallout’ are clearly valid for systems in which success is measured in terms of the ability to retrieve all documents which have something to say on a particular topic—that is to say, in systems based on the ‘summarization’ approach.*
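
For readers who want the three parameters pinned down, they are simple set arithmetic over a document collection. A small worked example in Python with an invented collection:

    # recall    = relevant retrieved / all relevant
    # precision = relevant retrieved / all retrieved
    # fallout   = non-relevant retrieved / all non-relevant
    def measures(retrieved, relevant, collection):
        hits = retrieved & relevant
        recall = len(hits) / len(relevant)
        precision = len(hits) / len(retrieved)
        fallout = len(retrieved - relevant) / len(collection - relevant)
        return recall, precision, fallout

    collection = set(range(100))           # 100 documents
    relevant = set(range(10))              # 10 are relevant to the topic
    retrieved = {0, 1, 2, 3, 4, 50, 51}    # a search returns 7 of them

    print(measures(retrieved, relevant, collection))
    # (0.5, 0.714..., 0.022...): half the relevant documents were found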

You could say that full-text indexing/searching is different from ‘summarization’ by a professional indexer, but is it? Or have we simply substituted non-professional indexers into the process?

With ‘summarization,’ a professional indexer chooses terms that represent the content of a document. With full-text searching, the terms chosen on an ad-hoc basis by a user come to represent a ‘summary’ of entire documents. And in both cases, all the documents so summarized are returned to the user, in other words, the search is exhaustive.

Google/Bing/Yahoo! searches are examples of exhaustive searches of little value. I can find two or three thousand (2000-3000) new pages of material relevant to topic map issues every day. Can you say information overload?

Or is that information volume overload? That out of the two or three thousand (2000-3000) pages per day, probably more like fifty to one hundred (50-100) pages are worth my attention. That is what “old-style” indexing brought to the professional researcher.
