Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 16, 2012

Neo4j 1.7.M01 – “Bastuträsk Bänk”

Filed under: Cypher,Neo4j — Patrick Durusau @ 7:35 pm

Neo4j 1.7.M01 – “Bastuträsk Bänk”

The first milestone for Neo4j 1.7 has a number of new features, including improvements to the Cypher query language.

See the post or better yet, grab a copy of the milestone release!

JUNG in Neo4j – Part 2

Filed under: Cypher,D3,Graphs,Neo4j,Visualization — Patrick Durusau @ 7:35 pm

JUNG in Neo4j – Part 2

Max De Marzi writes:

A few weeks ago I showed you how to visualize a graph using the chord flare visualization and how to visualize a network using a force directed graph visualization from D3.js.

Not content to rest on his laurels, Max points to additional resources on non-traditional graph visualizations and starts work on a matrix visualization of a graph (one step on the way to a node quilt).

I better post this quickly before Max posts another part. 😉

JUNG in Neo4j – Part 1

Filed under: JUNG,Neo4j — Patrick Durusau @ 7:35 pm

JUNG in Neo4j – Part 1

Max De Marzi writes:

It’s nice to have an arsenal. In the world of graph databases, one such stock room is the Java Universal Network/Graph Framework (JUNG), which contains a cache of algorithms from graph theory, data mining, and social network analysis, such as routines for clustering, decomposition, optimization, random graph generation, statistical analysis, and calculation of network distances, flows, and importance measures (centrality, PageRank, HITS, etc.).

In very clear writing Max gets you started with JUNG and Neo4j.
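
If you want a feel for the JUNG side before wiring it into Neo4j, here is a minimal, self-contained sketch against the JUNG 2.x API as I recall it. Max's posts cover the actual Neo4j integration; the graph here is an in-memory DirectedSparseGraph, not one backed by the database.

    import edu.uci.ics.jung.algorithms.scoring.PageRank;
    import edu.uci.ics.jung.graph.DirectedSparseGraph;

    // Build a tiny directed graph in memory and score its nodes with PageRank.
    public class JungPageRankDemo {
        public static void main(String[] args) {
            DirectedSparseGraph<String, String> graph = new DirectedSparseGraph<>();
            graph.addVertex("A");
            graph.addVertex("B");
            graph.addVertex("C");
            graph.addEdge("A->B", "A", "B");
            graph.addEdge("B->C", "B", "C");
            graph.addEdge("C->A", "C", "A");
            graph.addEdge("A->C", "A", "C");

            PageRank<String, String> pageRank = new PageRank<>(graph, 0.15); // 0.15 = random-jump probability
            pageRank.evaluate();

            for (String vertex : graph.getVertices()) {
                System.out.printf("%s: %.4f%n", vertex, pageRank.getVertexScore(vertex));
            }
        }
    }

The scorer only sees the JUNG graph interface, which is presumably why wrapping a Neo4j store behind that interface, as Max does, lets the same algorithms run against the database.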

A series to follow. Closely.

Using the DataJS Library for Data-Centric JavaScript Web Applications

Filed under: DataJS,Javascript,Odata — Patrick Durusau @ 7:35 pm

Using the DataJS Library for Data-Centric JavaScript Web Applications.

Brian Rinaldi writes:

I recently attended the HTML5 Summit in Miami Beach where one of the speakers was David Zhang of Microsoft speaking about the DataJS library created by Microsoft for building data-centric web applications. The DataJS library can make it easy to integrate services into your application and add features like pre-fetching, paging and caching in local storage, including support for IndexedDB. The project sounded interesting so I made a note to try it out and finally got around to it. Here’s an overview.

From the DataJS project page:

datajs is a new cross-browser JavaScript library that enables data-centric web applications by leveraging modern protocols such as JSON and OData and HTML5-enabled browser features. It’s designed to be small, fast and easy to use.
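
datajs itself is JavaScript, but it helps to remember that an OData read is just an HTTP GET returning JSON (or Atom). As a rough illustration only, here is the same kind of request made by hand in Java; the Northwind URL is the well-known public OData sample service and may change, so treat it as a placeholder.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // An OData feed is plain HTTP + JSON; datajs wraps this plumbing (plus caching, paging, etc.).
    public class ODataPeek {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://services.odata.org/Northwind/Northwind.svc/Categories?$format=json&$top=2");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }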

Are you delivering your next topic map to a web browser or a standalone app?

30 MDM Customer Use Cases (Master Data Management in action)

Filed under: Master Data Management,Use Cases — Patrick Durusau @ 7:34 pm

30 MDM Customer Use Cases (Master Data Management in action)

Jakki Geiger writes:

Master Data Management (MDM) has been used by companies for more than eight years to address the challenge of fragmented and inconsistent data across systems. Over the years we’ve compiled quite a cadre of use cases across industries and strategic initiatives. I thought this outline of the 30 most common MDM initiatives may be of interest to those of you who are just getting started on your MDM journey.

Although these organizations span different industries, face varied business problems and started with diverse domains, you’ll notice that revenue, compliance and operational efficiency are the most common drivers of MDM initiatives. The impetus is to improve the foundational data that’s used for analysis and daily operations. (Click on the chart to make it larger.)

Curious what you make of the “use cases” in the charts?

They are all good goals but I am not sure I would call them “use cases.”

Take HealthCare under Marketing, which reads:

To improve the customer experience and marketing effectiveness with a better understanding of members, their household relationships and plan/policy information.

Is that a use case? For master data management?

The Wikipedia entry on master data management says in part:

At a basic level, MDM seeks to ensure that an organization does not use multiple (potentially inconsistent) versions of the same master data in different parts of its operations, which can occur in large organizations. A common example of poor MDM is the scenario of a bank at which a customer has taken out a mortgage and the bank begins to send mortgage solicitations to that customer, ignoring the fact that the person already has a mortgage account relationship with the bank. This happens because the customer information used by the marketing section within the bank lacks integration with the customer information used by the customer services section of the bank. Thus the two groups remain unaware that an existing customer is also considered a sales lead. The process of record linkage is used to associate different records that correspond to the same entity, in this case the same person.

Other problems include (for example) issues with the quality of data, consistent classification and identification of data, and data-reconciliation issues.

Can you find any “use cases” in the Informatica post?

BTW, topic maps avoid “inconsistent” data without forcing you to reconcile and update all your data records. (Inquire.)

Look Ma! I Can Draw!

Filed under: Graphs,Visualization — Patrick Durusau @ 7:34 pm

Diagrammer: Buy PowerPoint-Ready Diagrams for $0.99 a Piece

Finally, a course that can teach me the art of illustration. 😉

From the post:

Diagrammer is based on an intriguing business proposition. So, here you are, in the middle of designing a Powerpoint presentation, and clueless how to visualize the next hypercomplex concept containing heaps of relationships. Based on a topology of 5 different kinds of relations – Flow, Join, Network, Segment, or Stack – you can now browse a collection of over 4,000 unique diagrams and pick the one that suits your communication goals the most.

If you have ever struggled with Gimp or other drawing/image editing programs, this may be the “app” for you. It is not like having an internal graphics department; the shapes are varied but fixed choices. Still, I think you may find something you like.

Apache MRUnit 0.8.1-incubating has been released!

Filed under: MapReduce,TMCL,Unit Testing — Patrick Durusau @ 7:34 pm

Apache MRUnit 0.8.1-incubating has been released!

From the post:

We (the Apache MRUnit team) have just released Apache MRUnit 0.8.1-incubating. Apache MRUnit is an Apache Incubator project. MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.

The MRUnit project is actively looking for contributors, even ones brand new to the world of open source software. There are many ways to contribute: documentation, bug reports, blog articles, etc. If you are interested but have no idea where to start, please email brock at cloudera dot com. If you are an experienced open source contributor, the MRUnit wiki explains How you can Contribute.

Opportunity to contribute to unit testing for MapReduce jobs.
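
To make that concrete, here is a minimal sketch of what an MRUnit map test looks like, written against the new-API MapDriver. The mapper is a throwaway word-count example of my own, not from the MRUnit release notes, and exact package names can vary between MRUnit versions.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    import java.io.IOException;

    // A word-count style mapper and an MRUnit test that exercises it without a cluster.
    public class WordCountMapperTest {

        static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        context.write(new Text(token), ONE);
                    }
                }
            }
        }

        @Test
        public void emitsOneCountPerToken() throws Exception {
            new MapDriver<LongWritable, Text, Text, IntWritable>()
                    .withMapper(new TokenMapper())
                    .withInput(new LongWritable(0), new Text("hello topic maps"))
                    .withOutput(new Text("hello"), new IntWritable(1))
                    .withOutput(new Text("topic"), new IntWritable(1))
                    .withOutput(new Text("maps"), new IntWritable(1))
                    .runTest();
        }
    }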

BTW, what would unit testing for topic maps, ontologies, data models look like?

I grabbed Developing High Quality Data Models (Matthew West) off the shelf and flipped to the index. No entry for “unit testing,” “testing,” “validation,” etc.

Would the Topic Maps Constraint Language be sufficient for topic maps following the TMDM? So that you could incrementally test the design of streams of data into a topic map?

Neo4j Aces State Competition “Jugend Forscht Hessen” and Best Project Award

Filed under: Graphs,N-Grams,Neo4j — Patrick Durusau @ 7:34 pm

Paul Wagner and Till Speicher won State Competition “Jugend Forscht Hessen” and best Project award using neo4j, writes René Pickhardt.

From the post:

6 months of hard coding and supervising by me are over and end with a huge success! After analyzing 80 GB of Google ngrams data, Paul and Till put them into a neo4j graph database in order to make predictions for fast sentence completion. Today was the award ceremony and the two students from Darmstadt and Saarbrücken (respectively) won the first place. Additionally they received the “beste schöpferische Arbeit” award, which is the award for the best project in the entire competition (over all disciplines).

With their technology and the almost finished Android app, typing will be revolutionized! While typing a sentence they are able to predict the next word with a recall of 67%, creating a huge additional value for today’s smartphones.

So stay tuned for the upcoming news and the federal competition in May in Erfurt.

Not that you could tell that René is proud of the team! 😉
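
To make the prediction idea concrete, here is a toy next-word lookup over bigram counts in plain Java. It is nothing like the students' Neo4j system, which works over 80 GB of Google ngrams; it only shows the core idea of ranking observed continuations by frequency.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Toy next-word prediction from bigram counts: pick the most frequent follower.
    public class NextWordToy {
        private final Map<String, Map<String, Integer>> bigrams = new HashMap<>();

        public void observe(String word, String next) {
            bigrams.computeIfAbsent(word, w -> new HashMap<>())
                   .merge(next, 1, Integer::sum);
        }

        public Optional<String> predict(String word) {
            Map<String, Integer> followers = bigrams.get(word);
            if (followers == null) return Optional.empty();
            return followers.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
        }

        public static void main(String[] args) {
            NextWordToy model = new NextWordToy();
            model.observe("topic", "maps");
            model.observe("topic", "maps");
            model.observe("topic", "sentence");
            System.out.println(model.predict("topic")); // Optional[maps]
        }
    }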

Curious: Can you use a Neo4j database to securely exchange messages? Could the display of messages be triggered by a series of tokens, so that the smartphone operator knows only their sequence and nothing more?

March 15, 2012

The Limitation of MapReduce: A Probing Case and a Lightweight Solution

Filed under: MapReduce — Patrick Durusau @ 8:03 pm

The Limitation of MapReduce: A Probing Case and a Lightweight Solution

From the post:

While we usually see enough papers that deal with the applications of the Map Reduce programming model this one for a change tries to address the limitations of the MR model. It argues that MR only allows a program to scale up to process very large data sets, but constrains a program’s ability to process smaller data items. This ability or inability (depending on how you see it) is what it terms as “one-way scalability”. Obviously this “one-wayness” was a requirement for Google but here the authors turn our attention to how this impacts the application of this framework to other computation forms.

The system they base their argument on is a distributed compiler, and their solution is a more scaled-“down” parallelization framework called MRLite that handles more moderate volumes of data. The workload characteristics of a compiler are a bit different from analytical workloads. The primary difference is that compilation workloads deal with much more humble volumes of data, albeit with much greater intertwining amongst the files.

All models have limits so it isn’t surprising that MapReduce does as well.

It will be interesting to see if the limitations of MapReduce are mapped out and avoided in “good practice,” or if some other model becomes the new darling until limits are found for it. Only time will tell.

Google Gives Search a Refresh

Filed under: News,Searching,Semantics — Patrick Durusau @ 8:03 pm

Google Gives Search a Refresh by Amir Efrati.

Google is moving towards “semantic search:”

Google isn’t replacing its current keyword-search system, which determines the importance of a website based on the words it contains, how often other sites link to it, and dozens of other measures. Rather, the company is aiming to provide more relevant results by incorporating technology called “semantic search,” which refers to the process of understanding the actual meaning of words.

Just when I had decided that it wasn’t meant to be a technical article, and to give it a pass on “…understanding the actual meaning of words,” I read:

One person briefed on Google’s plans said the shift to semantic search could directly impact the search results for 10% to 20% of all search queries, or tens of billions per month.

How’s that? Some 80% to 90% of Google search queries don’t involve semantics? As Ben Stein would say, “Wow.” I would have never guessed. Then I suppose slang for female body parts doesn’t require a lot of semantics for a successful search. Is that really 80% to 90% of all Google queries?

And Google has expanded a set of entities for use with its semantic search:

It also approached organizations and government agencies to obtain access to databases, including the CIA World Factbook, which houses up-to-date encyclopedic information about countries worldwide.

Just so you know, the CIA World Factbook, as a U.S. government publication, isn’t subject to copyright. You can use it without permission.

Let’s hope that Google does better than this report on its efforts would lead you to believe.

Linguamatics Puts Big Data Mining on the Cloud

Filed under: Cloud Computing,Data Mining,Medical Informatics — Patrick Durusau @ 8:03 pm

Linguamatics Puts Big Data Mining on the Cloud

From the post:

In response to market demand, Linguamatics is pleased to announce the launch of the first NLP-based, scaleable text mining platform on the cloud. Text mining allows users to extract more value from vast amounts of unstructured textual data. The new service builds on the successful launch by Linguamatics last year of I2E OnDemand, the Software-as-a-Service version of Linguamatics’ I2E text mining software. I2E OnDemand proved to be so popular with both small and large organizations, that I2E is now fully available as a managed services offering, with the same flexibility in choice of data resources as with the in-house, Enterprise version of I2E. Customers are thus able to benefit from best-of-breed text mining with minimum setup and maintenance costs. Such is the strength of demand for this new service that Linguamatics believes that by 2015, well over 50% of its revenues could be earned from cloud and mobile-based products and services.

Linguamatics is responding to the established trend in industry to move software applications on to the cloud or to externally managed servers run by service providers. This allows a company to concentrate on its core competencies whilst reducing the overhead of managing an application in-house. The new service, called “I2E Managed Services”, is a hosted and managed cloud-based text mining service which includes: a dedicated, secure I2E server with full-time operational support; the MEDLINE document set, updated and indexed regularly; and access to features to enable the creation and tailoring of proprietary indexes. Upgrades to the latest version of I2E happen automatically, as soon as they become available. (emphasis added)

Interesting but not terribly so, until I saw the MEDLINE document set was part of the service.

I single that out as an example of creating a value-add for a service by including a data set of known interest.

You could do a serious value-add for MEDLINE or find a collection that hasn’t been made available to an interested audience. Perhaps one for which you could obtain an exclusive license for some period of time. State/local governments are hurting for money and they have lots of data. Can’t buy it but exclusive licensing isn’t the same as buying, in most jurisdictions. Check with local counsel to be sure.

Mancrush on Todd Park?

Filed under: Governance,Government,Marketing — Patrick Durusau @ 8:02 pm

OK, I admit It. I have a mancrush on the new Federal CTO, Todd Park by Tim O’Reilly.

Tim waxes on about Todd’s success with startups and what I would call a vendor/startup show, Health Datapalooza. (Does the agenda for Health Datapalooza 2012 look just a little vague to you? Not what I would call a “technical” conference.)

And Tim closes with this suggestion:

I want to put out a request to all my friends in the technology world: if Todd calls you and asks you for help, please take the call, and do whatever he asks.

Since every denizen of K-Street already has Todd’s private cell number on speed dial, the technology community needs to take another tack.

Assuming you don’t already own several members of Congress and/or federal agencies, watch for news of IT issues relevant to your speciality.

Send in one (1) suggestion on a one (1) page letter that clearly summarizes why your proposal is relevant, cost-effective and worthy of further discussion. The brevity will be such a shocker that your suggestion will stand out from the hand cart stuff that pours in from, err, traditional sources.

The Office of Science and Technology Policy (there is no link from the White House homepage, so this saves you the hunt). This is where Todd will be working.

Contact page for The Office of Science and Technology (You can attach a document to your message.)

I would copy your representative/senators, particularly if you donate on a regular basis.

Todd’s predecessor is described as having had an “…inspired and productive three years on the job.” (Todd Park Named New U.S. Chief Technology Officer.) I wonder if that is what Tim means by “productive.”

A Distributed C Compiler System on MapReduce: Mrcc

Filed under: Compilers,Distributed Systems — Patrick Durusau @ 8:02 pm

A Distributed C Compiler System on MapReduce: Mrcc

Alex Popescu of myNoSQL points to software and a paper on distributed compilation of C code.

Changing to distributed architectures may uncover undocumented decisions made long ago and far away. Decisions that we may choose to make differently this time. Hopefully we will do a better job of documenting them. (Not that it will happen but there is no law against hoping.)

Documenting decisions separately from use cases

Filed under: Decision Making,Documentation,Information Integration,Use Cases — Patrick Durusau @ 8:02 pm

Documenting decisions separately from use cases by James Taylor.

From the post:

I do propose making decisions visible. By visible, I mean a separate and explicit step for each decision being made. These steps help the developer identify where possible alternate and exception paths may be placed. These decision points occur when an actor’s input drives the scenario down various paths.

I could not have put this better myself. I am a strong believer in this kind of separation, and of documenting how the decision is made independently of the use case so it can be reused. The only thing I would add is that these decisions need to be decomposed and analyzed, not simply documented. Many of these decisions are non-trivial and decomposing them to find the information, know-how and decisions on which they depend can be tremendously helpful.

James describes development and documentation of use cases and decisions in a context broader than software development. His point on decomposition of decisions is particularly important for systems designed to integrate information.

He describes decomposition of decisions as leading to discovery of “information, know-how and decisions on which they depend….”

Compare and contrast that with simple mapping decisions that map one column in a table to another. Can you say on what basis that mapping was made? Or with more complex systems, what “know-how” is required or on what other decisions that mapping may depend?

If your integration software/practice/system doesn’t encourage or allow such decomposition of decisions, you may need another system.
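
As a purely illustrative sketch (the field names are mine, not from James' post), documenting a column mapping as a decision might look something like this, with the basis and the dependencies recorded alongside the mapping itself.

    import java.time.LocalDate;
    import java.util.List;

    // Illustrative structure for documenting a mapping decision, not just the mapping.
    public final class MappingDecision {
        public final String sourceColumn;    // e.g. "CRM.CUST_NM"
        public final String targetColumn;    // e.g. "MDM.CUSTOMER_NAME"
        public final String basis;           // why the two were judged to be the same
        public final List<String> dependsOn; // prior decisions or know-how this relies on
        public final String decidedBy;
        public final LocalDate decidedOn;

        public MappingDecision(String sourceColumn, String targetColumn, String basis,
                               List<String> dependsOn, String decidedBy, LocalDate decidedOn) {
            this.sourceColumn = sourceColumn;
            this.targetColumn = targetColumn;
            this.basis = basis;
            this.dependsOn = List.copyOf(dependsOn);
            this.decidedBy = decidedBy;
            this.decidedOn = decidedOn;
        }
    }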

James also covers some other decision management materials that you may find useful in designing, authoring, and evaluating information systems. (I started to say “semantic information systems” but all information systems have semantics, so that would be prepending an unnecessary noise word.)

Data and Reality

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality,” from a presentation he made in 2003 entitled “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message: all of our solutions are partial (even within a domain), temporary (chronologically speaking), and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

March 14, 2012

BitPath — Label Order Constrained Reachability Queries over Large Graphs

Filed under: Graphs,Reachability — Patrick Durusau @ 7:37 pm

BitPath — Label Order Constrained Reachability Queries over Large Graphs by Medha Atre, Vineet Chaoji, and Mohammed J. Zaki.

Abstract:

In this paper we focus on the following constrained reachability problem over edge-labeled graphs like RDF — “given source node x, destination node y, and a sequence of edge labels (a, b, c, d), is there a path between the two nodes such that the edge labels on the path satisfy a regular expression “*a.*b.*c.*d.*“. A “*” before “a” allows any other edge label to appear on the path before edge “a”. “a.*” forces at least one edge with label “a”. “.*” after “a” allows zero or more edge labels after “a” and before “b”. Our query processing algorithm uses simple divide-and-conquer and greedy pruning procedures to limit the search space. However, our graph indexing technique — based on “compressed bit-vectors” — allows indexing large graphs which otherwise would have been infeasible. We have evaluated our approach on graphs with more than 22 million edges and 6 million nodes — much larger compared to the datasets used in the contemporary work on path queries.

If similarity is a type of relationship (as per Peter Neubauer, Neo4j), does it stand to reason that similarity may be expressed by a series of labeled edges? And if so, would this technique enable robust processing of the same? Suggestions?
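
To make the query semantics concrete, here is a naive baseline in Java: a breadth-first search over (node, labels-matched-so-far) states. It is nothing like the paper's compressed bit-vector index, and it will not scale to 22 million edges, but it shows what “*a.*b.*c.*d.*” reachability is asking for.

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Baseline for label-order constrained reachability: is there a path from
    // source to target whose edge labels contain (a, b, c, d) as a subsequence?
    // State = (node, number of pattern labels matched so far).
    public class LabelOrderReachability {

        static final class Edge {
            final int to;
            final String label;
            Edge(int to, String label) { this.to = to; this.label = label; }
        }

        static boolean reachable(Map<Integer, List<Edge>> graph,
                                 int source, int target, String[] pattern) {
            Deque<int[]> queue = new ArrayDeque<>();
            Set<Long> seen = new HashSet<>();
            queue.add(new int[]{source, 0});
            seen.add(encode(source, 0));
            while (!queue.isEmpty()) {
                int[] state = queue.poll();
                int node = state[0], matched = state[1];
                if (node == target && matched == pattern.length) return true;
                for (Edge e : graph.getOrDefault(node, Collections.emptyList())) {
                    // Greedy subsequence matching: advance the pattern whenever the label agrees.
                    int next = (matched < pattern.length && e.label.equals(pattern[matched]))
                            ? matched + 1 : matched;
                    if (seen.add(encode(e.to, next))) queue.add(new int[]{e.to, next});
                }
            }
            return false;
        }

        private static long encode(int node, int matched) {
            return ((long) node << 16) | matched;   // assumes pattern length < 65536
        }
    }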

NoSQL Paper: The Trinity Graph Engine

Filed under: Trinity — Patrick Durusau @ 7:37 pm

NoSQL Paper: The Trinity Graph Engine.

Alex Popescu of myNoSQL has discovered a paper on the MS Trinity Graph Engine.

There hasn’t been a lot of information on it so this could be helpful.

Thanks Alex!

Segmenting Words and Sentences

Filed under: Linguistics,Segmentation — Patrick Durusau @ 7:36 pm

Segmenting Words and Sentences by Richard Marsden.

From the post:

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. It probably is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words. Punctuation (full stop, question mark, exclamation mark) could be used to divide sentences. This quickly runs into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?

A good introduction to segmentation but I would test the segmentation with a sample text before trusting it too far. Writing habits vary even within languages.
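
For a quick baseline to test against your own sample text, the JDK's BreakIterator does locale-sensitive sentence segmentation out of the box. This is not the classification approach Richard describes, and it still stumbles over some abbreviations, which rather proves his point.

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Baseline sentence segmentation with the JDK's BreakIterator.
    public class SentenceSplit {
        public static List<String> sentences(String text, Locale locale) {
            BreakIterator it = BreakIterator.getSentenceInstance(locale);
            it.setText(text);
            List<String> out = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                out.add(text.substring(start, end).trim());
            }
            return out;
        }

        public static void main(String[] args) {
            String text = "Dr. Smith went to the U.N. building. He arrived at 9 a.m. It was raining.";
            sentences(text, Locale.US).forEach(System.out::println);
        }
    }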

R and Hadoop: Step-by-step tutorials

Filed under: Hadoop,R — Patrick Durusau @ 7:36 pm

R and Hadoop: Step-by-step tutorials by David Smith.

From the post:

At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure. Firstly, as a local virtual instance of Hadoop with R, using VMWare and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.) Then, as single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more Hadoop experimentation, now with more realistic data sizes.) And finally, as a true distributed Hadoop cluster in the cloud, using Apache whirr to spin up multiple nodes running Hadoop and R.

More pointers and resources await you at David’s post.

No Honor Among Thieves

Filed under: Ad Targeting,Data Mining — Patrick Durusau @ 7:36 pm

Well, the original title is: 50% of the online ads are never seen by Panos Ipeirotis.

About my title: the purpose of ads is to sell you something, whatever the consequences may be for you. That is a lesson well taught by US Tobacco, Big Pharma and the corn lobby (think of all the unnatural fructose products in your food).

That said, the post by Panos is a remarkable piece about investigation and data analysis.

From the post:

Almost a year back, I was involved in an advertising fraud case, as part of my involvement with AdSafe Media. (See the related Wall Street Journal story.) Long story short, it was a sophisticated scheme for generating user traffic to websites that were displaying ads to real users but these users could never see these ads, as they were never visible to the user. While we were able to uncover the scheme, what triggered our investigation was almost an accident: our adult-content classifier seemed to detect porn in websites that had absolutely nothing suspicious. While it was a great investigative success, we could not overlook the fact that this was not a systematic method for discovering such attempts for fraud. As part of the effort to make this more systematic, the following idea came up:

Let’s monitor the duration for which a user can actually see an ad?

After a few months of development to get this feature to work, it became possible to measure the exact amount of time an ad was visible to a user. While this feature could easily now detect any fraud attempt that delivers ads to users that never see them, this was now almost secondary. It was the first time that we could monitor the amount of time that users get exposed to ads.

50% of the Ads are (almost) Never Seen.

By measuring the statistics of more than 1.5 billion ad impressions per day, it was possible to understand deeply how different websites perform. Some of the high level results:

  • 38% of the ads are never in view to a user
  • 50% of the ads are in view for less than 0.5 seconds
  • 56% of the ads are in view for less than 5 seconds

Personally, I found these numbers impressive. 50% of the delivered ads are never seen for more than 0.5 seconds! I wanted to check myself whether 0.5 seconds is sufficient to understand the ad. Apparently, the guys at AdSafe thought about that as well, so here is their experiment:

A “pull” advertising model avoids this type of fraud because advertisers could deliver directly to pre-qualified consumers. Better use of funds for psycho-sexual manipulation of pre-qualified consumers, rather than scatter-shot across demographics.

If you are tired of wasting money on “push” advertising (with the hazards and dangers of fraud), consider a different model. Consider topic maps.

Plastic Surgeon Holds Video Contest, Offers Free Nose Job to Winner

Filed under: Contest,Marketing — Patrick Durusau @ 7:35 pm

Plastic Surgeon Holds Video Contest, Offers Free Nose Job to Winner by Tim Nudd.

From the post:

Plastic surgeons aren’t known for their innovative marketing. But then, Michael Salzhauer isn’t your ordinary plastic surgeon. He’s “Dr. Schnoz,” the self-described “Nose King of Miami,” and he’s got an unorthodox offer for would-be patients—a free nose job to the winner of a just-announced video contest.

Can’t give away a nose job but what about a topic map?

What sort of contest should we have?

What would you do for a topic map?

HBase + Hadoop + Xceivers

Filed under: Hadoop,HBase — Patrick Durusau @ 7:35 pm

HBase + Hadoop + Xceivers by Lars George.

From the post:

Introduction

Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called “dfs.datanode.max.xcievers”, and belongs to the HDFS subproject. It defines the number of server side threads and – to some extent – sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.

The Problem

Since HBase is storing everything it needs inside HDFS, the hard upper boundary imposed by the ”dfs.datanode.max.xcievers” configuration property can result in too few resources being available to HBase, manifesting itself as IOExceptions on either side of the connection.

This is a true sysadmin type post.

Error messages say “DataXceiver,” but the property to set is “dfs.datanode.max.xcievers.” The post notes that “xcievers” is misspelled.

Detailed coverage of the nature of the problem, complete with sample log entries, along with suggested solutions.

There is also word of ongoing work to improve the situation.

If you are using HBase and Hadoop, put a copy of this with your sysadmin stuff.
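
For reference, the property lives in hdfs-site.xml on the DataNode side. A minimal entry might look like the following; the value 4096 is only a commonly suggested starting point, not a recommendation from the post, so size it for your cluster as Lars describes.

    <!-- hdfs-site.xml (DataNode); note the property name keeps the historical misspelling -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>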

New index statistics in Lucene 4.0

Filed under: Indexing,Lucene — Patrick Durusau @ 7:35 pm

New index statistics in Lucene 4.0

Mike McCandless writes:

In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.

Mike uses a simple example to illustrate the statistics available in Lucene 4.0.
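
As a rough sketch of the kind of per-field and per-term statistics that become available (written against the 4.x APIs as I recall them; check the Lucene 4.0 javadocs before relying on exact signatures, and note the index path and field name are placeholders):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;

    // Reading some of the new aggregate statistics from a Lucene 4.x index.
    public class IndexStats {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                         DirectoryReader.open(FSDirectory.open(new File("/path/to/index")))) {
                Terms terms = MultiFields.getTerms(reader, "body"); // "body" is a placeholder field name
                System.out.println("docCount         = " + terms.getDocCount());
                System.out.println("sumDocFreq       = " + terms.getSumDocFreq());
                System.out.println("sumTotalTermFreq = " + terms.getSumTotalTermFreq());

                Term term = new Term("body", "lucene");             // per-term statistics
                System.out.println("docFreq          = " + reader.docFreq(term));
                System.out.println("totalTermFreq    = " + reader.totalTermFreq(term));
            }
        }
    }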

Keyword Indexing for Books vs. Webpages

Filed under: Books,Indexing,Keywords,Search Engines — Patrick Durusau @ 7:35 pm

I was watching a lecture on keyword indexing that started off with a demonstration of an index to a book, which was being compared to indexing web pages. The statement was made that the keyword pointed the reader to a page where that keyword could be found, much like a search engine does for a web page.

Leaving aside the more complex roles that indexes for books play, such as giving alternative terms, classifying the nature of the occurrence of the term (definition, mentioned, footnote, etc.), cross-references, etc., I wondered if there is a difference between a page reference in a book index vs. a web page reference by a search engine?

In some 19th century indexes I have used, the page references are followed by a letter of the alphabet, to indicate that the page is divided into sections, sometimes as many as a – h or even higher. Mostly those are complex reference works, dictionaries, lexicons, works of that type, where the information is fairly dense. (Do you know of any modern examples of indexes where pages are divided? A note would be appreciated.)

I have the sense that an index of a book, without sub-dividing a page, is different from an index pointing to a web page. It may be a difference that has never been made explicit but I think it is important.

Some facts about word length on a “page”: with a short amount of content, roughly an average book page, the user has little difficulty finding an index term on the page. But the longer the web page, the less useful our instinctive (trained?) scan of the page becomes.

That is partly because part of the page scrolls out of view. As you may know, that doesn’t happen with a print book.

Scanning of a print book is different from scanning of a webpage. How to account for that difference I don’t know.

Before you suggest Ctrl-F, see Do You Ctrl-F?. What was it you were saying about Ctrl-F?

Web pages (or other electronic media) that don’t replicate the fixed display of book pages result in a different indexing experience for the reader.

If a search engine index could point into a page, it would still be different from a traditional index but would come closer to a traditional index.

(The W3C has steadfastly resisted any effective subpage pointing. See the sad history of XLink/XPointer. You will probably have to ask insiders but it is a well known story.)

BTW, in case you are interested in blog length, see: Bloggers: This Is How Long Your Posts Should Be. Informative and amusing.

HyperANF: Graph Neighborhood Functions < 15 Minutes On a Laptop

Filed under: Graphs,HyperANF,HyperLogLog,WebGraph — Patrick Durusau @ 11:01 am

HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget (2011) by Paolo Boldi, Marco Rosa, and Sebastiano Vigna.

Inducement to read the abstract or paper:

Recently, a MapReduce-based distributed implementation of ANF called HADI [KTA+10] has been presented. HADI runs on one of the fifty largest supercomputers—the Hadoop cluster M45. The only published data about HADI’s performance is the computation of the neighbourhood function of a Kronecker graph with 2 billion links, which required half an hour using 90 machines. HyperANF can compute the same function in less than fifteen minutes on a laptop. (emphasis in original)

Abstract:

The neighbourhood function N_G(t) of a graph G gives, for each t ∈ N, the number of pairs of nodes ⟨x, y⟩ such that y is reachable from x in less than t hops. The neighbourhood function provides a wealth of information about the graph [PGF02] (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm [PGF02] (approximate neighbourhood function) has been proposed with the purpose of approximating N_G(t) on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters [FFGM07] and combines them efficiently through broadword programming [Knu07]; our implementation uses task decomposition to exploit multi-core parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation.

Then, we turn to the study of the distribution of distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clear-cut characterisation of proper social networks vs. web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new, informative statistics that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new non-local structural index for complex networks whose computation is highly scalable.

New algorithm for studying the structure of large graphs. Part of the WebGraph project. The “large” version of the software handles 2^31 nodes.
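
To get a feel for the HyperLogLog counters the paper builds on, here is a toy Java sketch: hash each item, use the first b bits to pick a register, and keep the maximum leading-zero rank seen in the remaining bits. It includes the standard small-range correction but none of the broadword tricks or careful engineering of the real implementation.

    // Toy HyperLogLog: m = 2^b registers, each holding the maximum "rank"
    // (position of the first 1-bit) observed among hashes routed to it.
    public class ToyHyperLogLog {
        private final int b;           // number of index bits
        private final int m;           // number of registers, 2^b
        private final byte[] registers;

        public ToyHyperLogLog(int b) {
            this.b = b;
            this.m = 1 << b;
            this.registers = new byte[m];
        }

        public void add(long item) {
            long h = hash64(item);
            int idx = (int) (h >>> (64 - b));                 // first b bits pick a register
            int rank = Long.numberOfLeadingZeros(h << b) + 1; // rank of the remaining bits
            if (rank > registers[idx]) registers[idx] = (byte) rank;
        }

        public double estimate() {
            double alpha = 0.7213 / (1 + 1.079 / m);          // bias correction constant
            double sum = 0;
            int zeros = 0;
            for (byte r : registers) {
                sum += Math.pow(2, -r);
                if (r == 0) zeros++;
            }
            double e = alpha * m * m / sum;
            if (e <= 2.5 * m && zeros > 0) {                  // small-range (linear counting) correction
                e = m * Math.log((double) m / zeros);
            }
            return e;
        }

        private static long hash64(long x) {                  // splitmix64 finalizer as a stand-in hash
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }

        public static void main(String[] args) {
            ToyHyperLogLog hll = new ToyHyperLogLog(12);      // 4096 registers, ~1.6% typical error
            for (long i = 0; i < 1_000_000; i++) hll.add(i);
            System.out.printf("estimated distinct values: %.0f%n", hll.estimate());
        }
    }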

March 13, 2012

ESPN API

Filed under: ESPN,Marketing — Patrick Durusau @ 8:16 pm

ESPN API

ESPN is developing a public API!

Sam Hunting often said that sports, with all the fan trivia, was a natural for topic maps.

Here is a golden opportunity!

Imagine a topic map that accesses ESPN and merges with local arrest/divorce records, fan blogs, photos from various sources.

First seen at Simply Statistics.

W3C HTML Data Task Force Publishes 2 Notes

Filed under: HTML Data,Microdata,RDF,Semantic Web — Patrick Durusau @ 8:16 pm

W3C HTML Data Task Force Publishes 2 Notes

From the post:

The W3C HTML Data Task Force has published two notes, the HTML Data Guide and Microdata to RDF. According to the abstract of the former, “This guide aims to help publishers and consumers of HTML data use it well. With several syntaxes and vocabularies to choose from, it provides guidance about how to decide which meets the publisher’s or consumer’s needs. It discusses when it is necessary to mix syntaxes and vocabularies and how to publish and consume data that uses multiple formats. It describes how to create vocabularies that can be used in multiple syntaxes and general best practices about the publication and consumption of HTML data.”

One can only hope that the W3C will eventually sanctify industry standard practices for metadata. Perhaps they will call it RDF-NG. Whatever.

Common Crawl To Add New Data In Amazon Web Services Bucket

Filed under: Common Crawl,Dataset — Patrick Durusau @ 8:15 pm

Common Crawl To Add New Data In Amazon Web Services Bucket

From the post:

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

That’s good news!

At least I think so.

I am sure, like everyone else, I will be trying to find the cycles (or at least thinking about it) to play with (sorry, explore) the Common Crawl data set.

I hesitate to say without reservation this is a good thing because my data needs are more modest than searching the entire WWW.

That wasn’t so hard to say. Hurt a little but not that much. 😉

I am exploring how to get better focus on information resources of interest to me. I rather doubt that focus is going to start with the entire WWW as an information space. Will keep you posted.

How To Use Google To Find Vulnerabilities In Your IT Environment

Filed under: Searching,Security — Patrick Durusau @ 8:15 pm

How To Use Google To Find Vulnerabilities In Your IT Environment

Francis Brown writes:

The vast volumes of information available on the Internet are of great value to businesses — and to hackers. For years, hackers have been using Google and other search engines to identify vulnerable systems and sensitive data on publicly exposed networks. The practice, known as Google hacking, has seen a resurgence of late, providing new challenges for IT professionals striving to protect their companies from threats growing in number and sophistication.

Google hacking — a term used for penetration testing using any search engine — surged in popularity around 2004, when computer security expert Johnny Long first released his book Google Hacking for Penetration Testers and the Google Hacking Database (GHDB). The database was designed to serve as a repository for search terms, called Google-Dorks, that exposed sensitive information, vulnerabilities, passwords, and much more.

There recently has been an upswing in Google hacking, with a few factors playing a role in the practice’s growth. For one thing, the amount of data indexed and searchable by Google and other search engines has skyrocketed in the last few years. Simply put, this has given hackers much more to work with.

It has always seemed to me that topic maps have a natural role to play in computer security, whatever your hat color.

From efficient access to exploits for particular versions of software packages to tracking weaknesses in source code.

Do you even have a complete list of all the software on premises with versions and latest patches? Not that you need a topic map for that but it could help track hacker exploits that may appear in a wide number of forums, using any number of rubrics.

Then BI and Data Science Thinking Are Flawed, Too

Filed under: Identification,Identifiers,Marketing,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:15 pm

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?
