Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 15, 2011

Lucy (not in the sky with diamonds)

Filed under: Lucy,Perl — Patrick Durusau @ 7:57 pm

Lucy (not in the sky with diamonds)

From the overview:

Apache Lucy is a full-text search engine library written in C and targeted at dynamic languages. It is a “loose C” port of Apache Lucene™, a search engine library for Java.

From the FAQ:

Are Lucy and Lucene compatible?

No. Lucy is a “loose” port of Lucene designed to take full advantage of C’s unique feature set, rather than a line-by-line translation from Java. The two libraries are not compatible in terms of either file format or API, and there are no plans to establish such compatibility.

Is Lucy faster than Lucene? It’s written in C, after all.

That depends. As of this writing, Lucy launches faster than Lucene thanks to tighter integration with the system IO cache, but Lucene is faster in terms of raw indexing and search throughput once it gets going. These differences reflect the distinct priorities of the most active developers within the Lucy and Lucene communities more than anything else.

Does Lucy provide a search server like Solr?

Lucy is a low-level library, like Lucene. We’d like to provide a search server eventually, but it will likely be a thin wrapper rather than a comprehensive application like Solr. The low-level capabilities are our core mission.

Why don’t you use Swig?

A major design goal of Lucy is to present bindings which are as idiomatic as possible so that our users feel as though they are programming in their native language and not in C. Swig is a great tool, but it does not offer support for many of the features which make Lucy so user friendly: subclassing, named parameters, default argument values, etc.

I poked around in the former KinoSearch archives; does “dynamic languages” = Perl? Just curious.
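The binding features the FAQ mentions — subclassing, named parameters, default argument values — are easiest to see in a dynamic language. A hypothetical sketch in Python (Lucy’s first host language was Perl; none of these class or parameter names come from the actual Lucy API):

```python
# Hypothetical sketch of what "idiomatic bindings" buy you: named
# parameters with default values, and direct subclassing of library
# classes.  These names are invented for illustration, not Lucy's API.

class Analyzer:
    """Base class a host-language user can subclass directly."""
    def transform(self, text):
        return text.lower().split()

class StopFilter(Analyzer):
    # Named parameter with a default value -- no config objects needed.
    def __init__(self, stoplist=("the", "a", "an")):
        self.stoplist = set(stoplist)

    def transform(self, text):
        return [t for t in super().transform(text) if t not in self.stoplist]

tokens = StopFilter().transform("The quick brown fox")
print(tokens)  # ['quick', 'brown', 'fox']
```

That kind of surface is hard to generate mechanically with SWIG, which is the point of the FAQ answer.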

November 14, 2011

Ten recent algorithm changes (Google)

Filed under: Search Algorithms,Searching — Patrick Durusau @ 8:01 pm

Ten recent algorithm changes (Google)

From the post:

Today we’re continuing our long-standing series of blog posts to share the methodology and process behind our search ranking, evaluation and algorithmic changes. This summer we published a video that gives a glimpse into our overall process, and today we want to give you a flavor of specific algorithm changes by publishing a highlight list of many of the improvements we’ve made over the past couple weeks.

We’ve published hundreds of blog posts about search over the years on this blog, our Official Google Blog, and even on my personal blog. But we’re always looking for ways to give you even deeper insight into the over 500 changes we make to search in a given year. In that spirit, here’s a list of ten improvements from the past couple weeks:

(skipping the good stuff, go to the post to read it)

If you’re a site owner, before you go wild tuning your anchor text or thinking about your web presence for Icelandic users, please remember that this is only a sampling of the hundreds of changes we make to our search algorithms in a given year, and even these changes may not work precisely as you’d imagine. We’ve decided to publish these descriptions in part because these specific changes are less susceptible to gaming.

I don’t doubt very large vendors (as well as small ones) would try to “game” Google search results.

But describing search changes in generalities cuts Google off from suggestions from the community that could improve its legitimate search results. I am sure Google staff scan all the search conferences and journals for new ideas, which is an indication that Google staff aren’t the only source of good ideas about search.

I don’t know what the mechanism would be like but I think Google should work towards some method to include more interested outsiders in the development of its search algorithms.

I don’t think Google has anything to fear from, say, Bing making such a move, but it isn’t that hard to imagine a niche start-up search company coming up with a viable way to harness an entire community of insight that filters upward into its search algorithms.

Now that would be a fearsome competitor for someone limited only to the “best and the brightest.”

PS: Do go read the ten algorithm changes. See what new ideas they spark in you and vote for Google to continue with the disclosures.

PPS: Is there a general discussion list for search algorithms? I don’t remember one off hand. Lots of specific ones for particular search engines.

Single Build Tool for Clojure

Filed under: Clojure,Functional Programming — Patrick Durusau @ 7:16 pm

Single Build Tool for Clojure

Cake and Leiningen are going to merge for the Leiningen 2.0 release.

From the post by Justin Balthrop:

Please join the Leiningen mailing list if you haven’t already (http://groups.google.com/group/leiningen) and join us in #leiningen on irc.freenode.com too. Phil also set up a wiki page for brainstorming version 2.0 changes (https://github.com/technomancy/leiningen/wiki/VersionTwo).

As I told those of you I talked to at the Conj, I feel really positive about this transition. For those of us who have contributed to Cake, it means our contributions will have a bigger impact and user base than they previously did. I’m also really excited to be working with Phil. He’s done a terrific job on Leiningen, and he has an amazing ability to organize and motivate open-source contributors.

One more note: I’m planning to rename the Cake mailing list to Flatland as I did for the IRC channel (#cake.clj -> #flatland). Feel free to unsubscribe if you aren’t interested in http://flatland.org open source projects, but we’d love for you to stay 😉

Must be nice to have a small enough body of highly motivated people to make this sort of merger work.

Will be a nice advantage for Clojure programmers in general. And for those wanting to take a functional approach to topic map engines.

SearcherLifetimeManager prevents a broken search user experience

Filed under: Interface Research/Design,Lucene,Searching — Patrick Durusau @ 7:16 pm

SearcherLifetimeManager prevents a broken search user experience

From the post:

In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn’t change them very often.

But these days it’s just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.

Lucene’s near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.

But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?

Wrong!

Read the post for why that’s wrong (it involves getting different searchers for the same search), then consider your topic map.

Does it have the same issue?

A C-Suite user queries your topic map and gets one answer. Several minutes later, a non-C-Suite user runs the same query and gets an updated answer, one that isn’t consistent with the information given the C-Suite user. Obviously the non-C-Suite user is wrong, as is your software, should push come to shove.

How do you avoid a “broken search user experience” with your topic map? Or do you just hope information isn’t updated often enough for anyone to notice?
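The fix Lucene’s SearcherLifetimeManager implements is to record the searcher used for the first query and then re-acquire that same point-in-time view of the index for follow-on requests. A toy Python sketch of the pattern (the method names loosely mirror Lucene’s record/acquire API, but this is an illustrative toy, not Lucene code):

```python
# Toy sketch of the searcher-lifetime idea: follow-on searches reuse
# the same point-in-time snapshot instead of the latest index state.

class Snapshot:
    def __init__(self, version, docs):
        self.version = version
        self.docs = list(docs)  # frozen view of the index at this version

    def search(self, term):
        return [d for d in self.docs if term in d]

class SearcherLifetimeManager:
    def __init__(self):
        self._live = {}

    def record(self, snapshot):
        """Return a token the client echoes back on follow-on requests."""
        self._live[snapshot.version] = snapshot
        return snapshot.version

    def acquire(self, token):
        return self._live[token]

mgr = SearcherLifetimeManager()
index = ["alpha report", "beta report"]
first = Snapshot(version=1, docs=index)
token = mgr.record(first)          # sent back to Alice with page 1

index.append("alpha update")       # index changes between her requests
page2 = mgr.acquire(token)         # follow-on reuses the same snapshot
print(page2.search("alpha"))       # ['alpha report'] -- a consistent view
```

A real implementation also has to release and eventually prune old snapshots, which is exactly the housekeeping the Lucene class handles for you.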

Bet You Didn’t Know Lucene Can…

Filed under: Lucene — Patrick Durusau @ 7:15 pm

Bet You Didn’t Know Lucene Can… by Grant Ingersoll.

Grant’s slides from ApacheCon 2011.

Imaginative uses of Lucene for non-search(?) purposes. Depends on how you define search.

Enable SPARQL query in your MVC3 application

Filed under: BrightstarDB,NoSQL — Patrick Durusau @ 7:15 pm

Enable SPARQL query in your MVC3 application

From the post:

BrightstarDB uses SPARQL as its primary query language. Because of this and because all the entities you create with the BrightstarDB entity framework are RDF resources, it is possible to turn your application into a part of the Linked Data web with just a few lines of code. The easiest way to achieve this is to add a controller for running SPARQL queries.

BrightstarDB is a recent .Net NoSQL offering from Networked Planet. Or, as better known to us in the topic maps community, Graham Moore and Kal Ahmed. 😉 I don’t run a Windows server, but with Graham and Kal you can count on this being performance oriented software.
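Once a SPARQL controller is exposed, consuming it is mostly plumbing: POST a query string and read the results, per the SPARQL protocol. A minimal client-side sketch in Python, with a placeholder endpoint URL (the store name and port are invented, not BrightstarDB specifics):

```python
# Minimal sketch of building a SPARQL Protocol request over HTTP.
# The endpoint URL is a placeholder; swap in your store's address.
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8090/brightstar/mystore/sparql"  # hypothetical

def sparql_request(query):
    """Build (but do not send) a form-encoded SPARQL query request."""
    body = urlencode({"query": query}).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+xml",
        },
    )

req = sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
print(req.get_method())  # POST -- a Request with a body defaults to POST
```

Sending it with `urllib.request.urlopen(req)` returns the results document, which is the “few lines of code” the post is talking about.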

Tinkerpop (New Homepage, Logos)

Filed under: TinkerPop — Patrick Durusau @ 7:15 pm

Tinkerpop (New Homepage, Logos)

Peter Neubauer tweeted about the new Tinkerpop homepage.

I don’t “do” graphics but can appreciate a clean, open design.

Take a look. While you are there, consider “selecting” one or more of the icons to see where they go. I don’t think you will be disappointed.

Top Five Articles in Data Mining

Filed under: Data Mining — Patrick Durusau @ 7:15 pm

Top Five Articles in Data Mining by Sandro Saitta.

Sandro writes:

During the last years, I’ve read several data mining articles. Here is a list of my top five articles in data mining. For each article, I put the title, the authors and part of the abstract. Feel free to suggest your favorite ones.

From earlier this year but the sort of material that stands the test of time.

Enjoy!

Twitter POS Tagging with LingPipe and ARK Tweet Data

Filed under: LingPipe,Linguistics,POS,Tweets — Patrick Durusau @ 7:15 pm

Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter.

From the post:

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

We will train and test on anything that’s easy to parse.

How’s that for a motto! 😉
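The “train and test on anything that’s easy to parse” spirit extends down to the simplest possible tagger: give each word its most frequent tag from the training data, and back off to the overall majority tag for unknown words. A toy Python sketch with invented training data (real taggers like the LingPipe HMM models do far better, of course):

```python
# Most-frequent-tag baseline for POS tagging, trained on a tiny
# hand-tagged corpus.  The data here is made up for illustration.
from collections import Counter, defaultdict

train = [("the", "D"), ("cat", "N"), ("sat", "V"),
         ("the", "D"), ("dog", "N"), ("ran", "V"),
         ("mat", "N")]

counts = defaultdict(Counter)
for word, pos in train:
    counts[word][pos] += 1

# Overall majority tag, used as the back-off for unseen words.
majority = Counter(pos for _, pos in train).most_common(1)[0][0]

def tag(word):
    """Most frequent tag for known words; overall majority otherwise."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return majority

print([tag(w) for w in ["the", "wombat", "ran"]])  # ['D', 'N', 'V']
```

Baselines like this are worth having around: if a fancy model can’t beat the most-frequent-tag tagger on your tweets, something is wrong with the data, not the model.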

Social media may be more important than I thought it was several years ago. It may just be the serialization in digital form of all the banter in bars, at block parties and around the water cooler. If that is true, then governments would be well advised to encourage and assist with access to social media. To give them an even chance of leaving ahead of the widow maker.

Think of mining Twitter data like the NSA and phone traffic, but you aren’t doing anything illegal.

Stephen Robertson on Why Recall Matters

Filed under: Information Retrieval,Precision,Recall — Patrick Durusau @ 7:14 pm

Stephen Robertson on Why Recall Matters November 14th, 2011 by Daniel Tunkelang.

Daniel has the slides and an extensive summary of the presentation. Just to give you a taste of what awaits at Daniel’s post:

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.

Interested? There’s more where that came from; see the link to Daniel’s post above.
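The asymmetry Stephen points out is easy to verify with a toy example: extend the cutoff of a ranked list and recall can only go up or stay flat, while precision can move either way. A small Python check, with an invented relevance labeling:

```python
# Toy check of the precision/recall asymmetry for a truncated ranked list.
# 1 = relevant, 0 = not; invented labels for a ten-result ranking.

ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
total_relevant = sum(ranked)  # assume all relevant docs were retrieved

def precision_recall(cutoff):
    hits = sum(ranked[:cutoff])
    return hits / cutoff, hits / total_relevant

for k in (1, 4, 10):
    p, r = precision_recall(k)
    print(f"top-{k}: precision={p:.2f} recall={r:.2f}")
# Recall never decreases as k grows; precision here goes 1.00 -> 0.75 -> 0.40,
# but with a different labeling it could just as easily rise at some step.
```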

A Simple News Exploration Interface

Filed under: Filters,Interface Research/Design,News — Patrick Durusau @ 7:14 pm

A Simple News Exploration Interface

Matthew Hurst writes:

I’ve just pushed out the next version of the hapax page. I’ve changed the interface to allow for dynamic filtering of the news stories presented. You can now type in filter terms (such as ‘bbc’ or ‘greece’) and the page will only display those stories that are related to those terms.

Very cool!

A Google A Day

Filed under: Searching — Patrick Durusau @ 7:14 pm

From Matthew Hurst’s Data Mining blog, a post about ‘A Google A Day‘.

With this description:

At the site quiz questions are presented which essentially require long form queries (and often more than one) to solve. By creating such a site, Google is indicating that it wants to train users to leverage what it believes are its strengths and what will be viewed as the emerging area of natural language search.

I started to say this will only work for puzzle lovers, but then again Sudoku was wildly popular, so this search game may have a broad audience.

Certainly the “best” training is training the user does not recognize as training, or enjoys, or, if possible, both.

November 13, 2011

The Seven-Billion-Person Question

Filed under: Humor — Patrick Durusau @ 10:01 pm

The Seven-Billion-Person Question from Carl Bialik, “The Numbers Guy,” at the Wall Street Journal.

From the post:

My print column examines the numbers underlying the designation by the United Nations of Oct. 31 as 7 Billion Day — the day when the world population will hit that milestone number.

Unlike its approach to the equivalent milestone 12 years and a billion people ago, the U.N. won’t be naming the seven billionth inhabitant of the planet. Instead, the agency is calling for hundreds of newborns to take the mantle, by encouraging all countries to identify their own seven billionth baby. The Canadian magazine Maclean’s recently tracked down Adnan Nevic, the designated No. 6,000,000,000, who receives attention each year on his birthday for his achievement but whose Bosnian family has trouble making ends meet.

Personally I thought they should have auctioned off the right to pick the seven billionth inhabitant on eBay. Only countries allowed to bid and hard currency as a requirement.

And since we may cross the 7 billion line several times (due to wars, natural disasters, people who lost in the bidding, etc.), an auction could be held each time.

As the post points out, the counts are more in the nature of weather forecasts. If you don’t like the forecast on one channel, simply change to another one.

The UN hasn’t called (ever) so I can’t suggest the eBay solution to them.

Perhaps I should write a topic map of political advice I would give if asked so that if a political body calls I won’t fumble around to find all the advice I have for them.

5 Interesting Free Books for R from beginner to experts

Filed under: R — Patrick Durusau @ 10:01 pm

5 Interesting Free Books for R from beginner to experts

From the post:

Learning a new software language is always difficult; normally good documentation can help. Here are five books for using the R software, for beginners and for experts:

  • “Introduction to the R Project for Statistical Computing for Use at the ITC” by David Rossiter (PDF, 2010-11-21).
  • “R for Beginners” by Emmanuel Paradis (PDF,10 pages).
  • A Little Book of R for Multivariate Analysis (pdf, 49 pages) is a simple introduction to multivariate analysis using the R statistics software. It covers topics such as reading and plotting multivariate data, principal components analysis, and linear discriminant analysis.
  • A Little Book of R for Biomedical Statistics (pdf, 33 pages) is a simple introduction to biomedical statistics using the R statistics software, with sections on relative risks and odds ratios, dose-response analysis, clinical trial design and meta-analysis.
  • A Little Book of R for Time Series (pdf, 71 pages) is a simple introduction to time series analysis using the R statistics software (have you spotted the pattern yet?). It includes instruction on how to read and plot time series, time series decomposition, forecasting, and ARIMA models.

All books are free to use, share and remix under a Creative Commons license, and are available online.

A very nice collection of materials on R!

Hadoop Distributions And Kids’ Soccer

Filed under: BigData,Hadoop — Patrick Durusau @ 10:00 pm

Hadoop Distributions And Kids’ Soccer

From the post:

The big players are moving in for a piece of the Big Data action. IBM, EMC, and NetApp have stepped up their messaging, in part to prevent startup upstarts like Cloudera from cornering the Apache Hadoop distribution market. They are all elbowing one another to get closest to “pure Apache” while still “adding value.” Numerous other startups have emerged, with greater or lesser reliance on, and extensions or substitutions for, the core Apache distribution. Yahoo! has found a funding partner and spun its team out, forming a new firm called Hortonworks, whose claim to fame begins with an impressive roster responsible for much of the code in the core Hadoop projects. Think of the Dr. Seuss children’s book featuring that famous elephant, and you’ll understand the name.

While we’re talking about kids – ever watch young kids play soccer? Everyone surrounds the ball. It takes years to learn their position on the field and play accordingly. There are emerging alphas, a few stragglers on the sidelines hoping for a chance to play, community participants – and a clear need for governance. Tech markets can be like that, and with 1600 attendees packing late June’s Hadoop Summit event, all of those scenarios were playing out. Leaders, new entrants, and the big silents, like the absent Oracle and Microsoft.

The ball is indeed in play; the open source Apache Hadoop stack today boasts “customers” among numerous Fortune 500 companies, running critical business workloads on Hadoop clusters constructed for data scientists and business sponsors – and very often with little or no participation by IT and the corporate data governance and enterprise architecture teams. Thousands of servers, multiple petabytes of data, and growing numbers of users are increasingly to be seen.

…. (after many amusing and interesting observations)

That governance will be critical for the future. Other Apache and non-Apache projects, like HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie, et al all have their own agendas. In Apache locution, each has its own “committers” – owners of the code lines – and the task of integrating disparate pieces – each on its own time line – will fall to somebody. Will your distribution owner test the combination of the particular ones you’re using? If not, that will be up to you. One of the biggest barriers to open source adoption so far has been precisely that degree of required self-integration. Gartner’s second half 2010 open source survey showed that more than half of the 547 surveyed organizations have adopted OSS solutions as part of their IT strategy. Data management and integration is the top initiative they name; 46% of surveyed companies named it. This is where the game is.

Topic maps as a mechanism for easing the process of self-integration?

Would certainly be more agile than searching blog posts, user email lists, FAQs, etc.

Cross-Industry Standard Process for Data Mining (CRISP-DM 1.0)

Filed under: CRISP-DM,Data Mining — Patrick Durusau @ 10:00 pm

Cross-Industry Standard Process for Data Mining (CRISP-DM 1.0) (pdf file)

From the foreword:

CRISP-DM was conceived in late 1996 by three “veterans” of the young and immature data mining market. DaimlerChrysler (then Daimler-Benz) was already experienced, ahead of most industrial and commercial organizations, in applying data mining in its business operations. SPSS (then ISL) had been providing services based on data mining since 1990 and had launched the first commercial data mining workbench – Clementine – in 1994. NCR, as part of its aim to deliver added value to its Teradata data warehouse customers, had established teams of data mining consultants and technology specialists to service its clients’ requirements.

At that time, early market interest in data mining was showing signs of exploding into widespread uptake. This was both exciting and terrifying. All of us had developed our approaches to data mining as we went along. Were we doing it right? Was every new adopter of data mining going to have to learn, as we had initially, by trial and error? And from a
supplier’s perspective, how could we demonstrate to prospective customers that data mining was sufficiently mature to be adopted as a key part of their business processes? A standard process model, we reasoned, non-proprietary and freely available, would address these issues for us and for all practitioners.

CRISP-DM has not been built in a theoretical, academic manner working from technical principles, nor did elite committees of gurus create it behind closed doors. Both these approaches to developing methodologies have been tried in the past, but have seldom led to practical, successful and widely–adopted standards. CRISP-DM succeeds because it is soundly based on the practical, real-world experience of how people do data mining projects. And in that respect, we are overwhelmingly indebted to the many practitioners who contributed their efforts and their ideas throughout the project.

You might want to note that despite the issue date of 2000:

Eric King, founder and president of The Modeling Agency, a Pittsburgh-based consulting firm that focuses on analytics and data mining, [said]:

While King believes a guide in the form of a consultant is an invaluable resource for businesses in the planning phase, he noted that his firm follows the Cross Industry Standard Process for Data Mining, a public document he describes as “a cheat sheet,” when it’s working with clients. (emphasis added; source: Developing a predictive analytics program doable on a limited budget)

Developing a predictive analytics program doable on a limited budget

Filed under: Marketing,Prediction — Patrick Durusau @ 10:00 pm

Developing a predictive analytics program doable on a limited budget

From the post:

Predictive analytics is experiencing what David Menninger, a research director and vice president at Ventana Research Inc., calls “a renewed interest.” And he’s not the only one who is seeing a surge in the number of organizations looking to set up a predictive analytics program.

In September, Hurwitz & Associates, a consulting and market research firm in Needham, Mass., released a report ranking 12 predictive analytics vendors that it views as “strong contenders” in the market. Fern Halper, a Hurwitz partner and the principal researcher for the report, thinks predictive analytics is moving into the user mainstream. She said its growing popularity is being driven by better tools, increased access to high-performance computing resources, reduced storage costs and an economic climate that has businesses hungry for better forecasting.

“Especially in today’s economy, they’re realizing they can’t just look in the rearview mirror and look at what has happened,” said Halper. “They need to look at what can happen and what will happen and become as smart as they can possibly be if they’re going to compete.”

While predictive analytics basks in the limelight, the nuances of developing an effective program are tricky and sometimes can be overwhelming for organizations. But the good news, according to a variety of analysts and consultants, is that finding the right strategy is possible — even on a shoestring budget.

Here are some of their best-practices tips for succeeding on predictive analytics without breaking the bank:

What caught my eye was doable on a limited budget.

Limited budgets aren’t uncommon at the best of times, and in today’s economy they are downright plentiful, in both the private and public sectors.

The lessons in this post apply to topic maps. Don’t try to sell converting an entire enterprise or operation to topic maps. Pick some small area of pain or obvious improvement and sell a solution for that part. ROI that they can see this quarter or maybe next. Then build on that experience to propose larger or longer range projects.

The Marriage of R and Hadoop

Filed under: Hadoop,R — Patrick Durusau @ 10:00 pm

The marriage of R and Hadoop: Revolution Analytics at Hadoop World by Josh Willis.

Josh covers the presentation of David Champagne, CTO of Revolution Analytics, titled: Leveraging R in Hadoop Environments.

The slides are very good but not for the C-Suite. More for people who want to get enthusiastic about using R and Hadoop.

What is a “Hadoop”? Explaining Big Data to the C-Suite

Filed under: Hadoop,Humor — Patrick Durusau @ 9:59 pm

What is a “Hadoop”? Explaining Big Data to the C-Suite by Vincent Granville.

From the post:

Keep hearing about Big Data and Hadoop? Having a hard time understanding what is behind the curtain?

Hadoop is an emerging framework for Web 2.0 and enterprise businesses who are dealing with data deluge challenges – store, process and analyze large amounts of data as part of their business requirements.

The continuous challenge online is how to improve site relevance, performance, understand user behavior, and predictive insight. This is a never ending arms race as each firm tries to become the portal of choice in a fast changing world. Take for instance, the competitive world of travel. Every site has to improve at analytics and machine learning as the contextual data is changing by the second- inventory, pricing, recommendations, economic conditions, natural disasters etc.

Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.
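The programming model behind Hadoop is simple enough to demonstrate in a few lines, which may be the most honest way to explain it to anyone: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step folds each group. A pure-Python simulation of the classic word count (no Hadoop involved; this only illustrates the model, not the distributed machinery that is Hadoop’s real value):

```python
# Pure-Python simulation of the MapReduce model behind Hadoop:
# map emits (key, value) pairs, shuffle groups by key, reduce folds.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, ones):
    return word, sum(ones)

lines = ["Big Data Big Plans", "big iron"]

# Shuffle: group mapped values by key, as the framework would do
# across machines.
shuffled = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        shuffled[key].append(value)

counts = dict(reduce_phase(w, vs) for w, vs in shuffled.items())
print(counts)  # {'big': 3, 'data': 1, 'plans': 1, 'iron': 1}
```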

I don’t find it helpful to confuse Big Data and Hadoop. Very different things and not helpful for folks in the C-Suite to confuse them. Unless, of course, you are selling Hadoop services and want people to think Hadoop every time they hear Big Data.

But I am really too close to Hadoop and related technologies to reliably judge explanations for the C-Suite so why not have a poll? Nothing fancy, just comment using one of the following descriptions or make up your own if mine aren’t enough:

I think the “What is ‘Hadoop’?…” explanation:

  1. Is as good as IT explanations get for the C-Suite.
  2. Is adequate but could use (specify changes)
  3. “Everyone […] is now dumber for having listened to it. I award you no points and may God have mercy on your soul.” (Billy Madison)

Comments?

Looking for volunteers for collaborative search study

Filed under: Authoring Topic Maps,Collaboration,Searching,Volunteers — Patrick Durusau @ 9:59 pm

Looking for volunteers for collaborative search study

From the post:

We are about to deploy an experimental system for searching through CiteSeer data. The system, Querium, is designed to support collaborative, session-based search. This means that it will keep track of your searches, help you make sense of what you’ve already seen, and help you to collaborate with your colleagues. The short video shown below (recorded on a slightly older version of the system) will give you a hint about what it’s like to use Querium.

You may also want to visit the Session Search page.

Could be your opportunity to help shape the future of searching! Not to mention being a window into the potential for collaborative topic map authoring!

November 12, 2011

Recommendation with Apache Mahout in CDH3

Filed under: Hadoop,Mahout — Patrick Durusau @ 8:46 pm

Recommendation with Apache Mahout in CDH3 by Josh Patterson.

From the introduction:

The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content we have available. Ideally we’d like machines and algorithms to help us find the more interesting (for us individually) things so we more easily focus our attention on items of relevance.

Have you ever been recommended a friend on Facebook or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems. These systems face the dual challenge of producing high quality recommendations while also calculating recommendations for millions of users. In recent years collaborative filtering (CF) has become popular as a way to effectively meet these challenges. CF techniques start off by analyzing the user-item matrix to identify relationships between different users or items and then use that information to produce recommendations for each user.
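The user-item matrix analysis the introduction describes can be sketched in miniature: score user similarity, then recommend items the most similar user liked. A toy Python version with invented ratings (Mahout’s Taste-based recommenders work on the same idea at vastly larger scale):

```python
# Toy user-based collaborative filtering over a user-item matrix:
# recommend unseen items from the most similar other user.
from math import sqrt

ratings = {
    "alice": {"book": 5, "film": 3, "game": 4},
    "bob":   {"book": 5, "film": 2, "game": 5, "album": 4},
    "carol": {"film": 5, "album": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    den = (sqrt(sum(x * x for x in u.values()))
           * sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(user):
    """Unseen items from the most cosine-similar other user."""
    best = max((u for u in ratings if u != user),
               key=lambda u: cosine(ratings[user], ratings[u]))
    return sorted(i for i in ratings[best] if i not in ratings[user])

print(recommend("alice"))  # ['album'] -- bob is the closest match
```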

If you were to use this post as an introduction to recommendation with Apache Mahout, is there anything you would change, subtract from, or add to it? If anything.

I am working on my answer to that question but am curious what you think?

I want to use this and similar material in a graduate library course, more to demonstrate the principles than to turn any of the students into Hadoop hackers. (Although that would be a nice result as well.)

Big Data and Text

Filed under: BigData,Text Analytics — Patrick Durusau @ 8:44 pm

Big Data and Text by Bill Inmon.

From the post:

Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.

Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why back in the good old days, a designer could create a data model and expect the data to fit reasonably well into the data model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that had the same structure. The only thing that was different was the content of the records.

Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.

Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.

While reading this post (only part of which appears here) it occurred to me that “unstructured data” is being used to mean data that lacks the appearance of outward semantics. That is, for any database table, you can show it to a variety of users and all of them will claim to understand the meanings, both explicit and implicit, in the tables. At least until they are asked to merge databases together as part of a reorganization of a business operation. Then out come old notebooks, emails, guesses and questions for older staff.

True, having outward structure can help, but the divide really isn’t between structured and unstructured data. Mostly because both of them normally lack any explicit semantics.

Mining Lending Club’s Goldmine of Loan Data Part I of II…

Filed under: Dataset,R,Visualization — Patrick Durusau @ 8:43 pm

Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State by Tanya Cashorali.

Very cool post that combines using R with analysis of a financial data set, plus visualization by state in the United States.

Of course the data has uniform semantics, so it really doesn’t present the issues that topic maps normally deal with. Or does it?

What if instead of loan data I had campaign contributions and the promised (but not delivered, so far as I know) federal contract database? That database would no doubt have very different terminology, as well as shadow and shell companies to conceal interested parties.

Developing your skills with R and visualization of mono-semantic data sets will stand you in good stead when you encounter more complex cases.
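For readers who would rather prototype than fire up R, here is a minimal sketch in Python of the kind of per-state aggregation that sits behind a visualization like the post’s. The field meanings (state code, loan amount) mirror Lending Club’s public export, but treat the names as assumptions, not the post’s actual code:

```python
from collections import defaultdict

def average_by_state(loans):
    """loans: iterable of (state, amount) pairs -> {state: mean amount}.

    A stand-in for the grouping step you would do in R before
    handing the per-state values to a choropleth plot.
    """
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum, count]
    for state, amount in loans:
        totals[state][0] += amount
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}
```

The same one-liner in R would be an `aggregate()` or `dplyr` call; the point is only that the data is already mono-semantic, so the aggregation is mechanical.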

Search Silver Bullets, Elixirs, and Magic Potions: Thinking about Findability in 2012

Filed under: Humor,Searching — Patrick Durusau @ 8:42 pm

Search Silver Bullets, Elixirs, and Magic Potions: Thinking about Findability in 2012

From the post:

I feel expansive today (November 9, 2011), generous even. My left eye seems to be working at 70 percent capacity. No babies are screaming in the airport waiting area. In fact, I am sitting in a not too sticky seat, enjoying the announcements about keeping pets in their cage and reporting suspicious packages to law enforcement by dialing 250.

I wonder if the mother who left a pink and white plastic bag with a small bunny and box of animal crackers is evil. Much in today’s society is crazy marketing hype and fear mongering.

Whilst thinking about pets in cages and animal crackers which may be laced with rat poison, and plump, fabric bunnies, my thoughts turned to the notion of instant fixes for horribly broken search and content processing systems.

….

I think you will enjoy the humor in this post and learn a good bit about why search is a hard problem. Enjoy!

PS: And develop a degree of skepticism towards vendor claims.

Real scientists never report fraud

Filed under: Peer Review,Publishing,Research Methods — Patrick Durusau @ 8:41 pm

Real scientists never report fraud

Daniel Lemire writes (in part):

People who want to believe that “peer reviewed work” means “correct work” will object that this is just one case. But what about the recently dismissed Harvard professor Marc Hauser? We find exactly the same story. Marc Hauser published over 200 papers in the best journals, making up data as he went. Again colleagues, journals and collaborators failed to openly challenge him: it took naive students, that is, outsiders, to report the fraud.

While I agree that other “professionals” may not have time to closely check work in the peer review process (see some of the comments), I think that illustrates the valuable role that students can play in the publication process.

Why not have a departmental requirement that papers for publication be circulated among students with an anonymous but public comment mechanism? Students are as pressed for time as anyone but they have the added incentive of wanting to become skilled at criticism of ideas and writing.

Not only would such a review process increase the likelihood of detection of fraud, but it would catch all manner of poor writing or citation practices. I regularly encounter published CS papers that incorrectly cite other published work or that cite work eventually published but under other titles. No fraud, just poor practices.

HCIR 2011 keynote

Filed under: HCIR,Information Retrieval — Patrick Durusau @ 8:40 pm

HCIR 2011 keynote by Gene Golovchinsky

From the post:

HCIR 2011 took place almost three weeks ago, but I am just getting caught up after a week at CIKM 2011 and an actual almost-no-internet-access vacation. I wanted to start off my reflections on HCIR with a summary of Gary Marchionini‘s keynote, titled “HCIR: Now the Tricky Part.” Gary coined the term “HCIR” and has been a persuasive advocate of the concepts represented by the term. The talk used three case studies of HCIR projects as a lens to focus the audience’s attention on one of the main challenges of HCIR: how to evaluate the systems we build.

The projects reviewed are themselves worthy of separate treatments, at length.

Gene’s summary makes one wish for video of the keynote. Perhaps I have overlooked it? If so, please post the link.

Entities, Relationships, and Semantics: the State of Structured Search

Filed under: Entity Extraction,Library,Relation Extraction,Searching,Semantics — Patrick Durusau @ 8:39 pm

Entities, Relationships, and Semantics: the State of Structured Search

Jeff Dalton’s notes on a panel discussion moderated by Daniel Tunkelang. The panel consisted of Andrew Hogue (Google NY), Breck Baldwin (alias-i), Evan Sandhause (NY Times), and Wlodek Zadrozny (IBM Watson).

Read the notes, watch the discussion.

BTW, Sandhause (New York Times) points out that librarians have been working with structured data for a very long time.

So, libraries want to be more like web search engines and the folks building search engines want to be more like libraries.

Sounds to me like both communities need to spend more time reading each others blogs, cross-attending conferences, etc.

Humans Plus Computers Equals Better Crowdsourcing

Filed under: Crowd Sourcing,Human Cognition — Patrick Durusau @ 8:38 pm

Humans Plus Computers Equals Better Crowdsourcing by Karen Weise.

Business Week isn’t a place I frequent for technology news. This article may change my attitude about it. Not its editorial policy but its technical content, at least sometimes.

From the article:

Greek-born computer scientist Panagiotis Ipeirotis is developing technology that gets computers to help people work smarter, and vice versa

If computer scientist Panagiotis Ipeirotis were to write a profile of himself, he’d start by hiring people online to summarize the key concepts in his published papers. Then he’d write a program to download every word in his 187 blog entries and examine which posts visitors to the site read most. Ipeirotis, an associate professor at New York University’s Stern School of Business, would do all that because his research shows that pairing computer and human intelligence can unearth discoveries neither can find alone. Ipeirotis, 35, is an expert on crowdsourcing, a way to break down big projects into small tasks that many people perform online. He tries to find ways, as he puts it, of using computer databases to augment human inputs.

Ipeirotis describes a recent real-world success with Magnum Photos. The renowned photo agency had hundreds of thousands of images scanned into its digital archive that it couldn’t search because they weren’t tagged with keywords. So Magnum hired Tagasauris, a startup Ipeirotis co-founded, to begin annotating. As Tagasauris’s online workers typed in tags, its analytical software queried databases to make the descriptions more specific. For example, when workers tagged a photo with the word “chicken,” the software tried to clarify whether the worker meant the feathery animal, the raw meat, or the death-defying game.

I really like the line:

He tries to find ways, as he puts it, of using computer databases to augment human inputs.

Rather than either humans or computers trying to do any task alone, divide it up so that each is doing what it does well. For example, if photos are sorted down to a few possible matches, why not ask a human? Or if you have thousands of records to roughly sort, why not ask a computer?
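As a rough sketch of that division of labor: the software resolves a tag automatically when a lookup returns a single sense, and falls back to a human only when several senses survive. The `candidates_db` and `ask_human` names here are hypothetical stand-ins, not Tagasauris’s actual interfaces:

```python
def hybrid_tag(photo_tags, candidates_db, ask_human):
    """Resolve worker-supplied tags to specific senses.

    candidates_db: dict mapping a raw tag to its possible senses.
    ask_human: callback (tag, senses) -> chosen sense; only invoked
    when the machine cannot decide alone.
    """
    resolved = {}
    for tag in photo_tags:
        senses = candidates_db.get(tag, [tag])  # unknown tag: keep as-is
        if len(senses) == 1:
            resolved[tag] = senses[0]            # machine decides alone
        else:
            resolved[tag] = ask_human(tag, senses)  # human disambiguates
    return resolved
```

The “chicken” example from the article would show up as a three-sense entry in `candidates_db`, routing that one tag (and only that one) to a person.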

Augmenting human inputs is something topic maps do well. They provide access to content that may have been input differently than at present. They can also enhance human knowledge of the data structures that hold information, augmenting our knowledge there as well.

Mneme: Scalable Duplicate Filtering Service

Filed under: Duplicates,Redis,Ruby — Patrick Durusau @ 8:36 pm

Mneme: Scalable Duplicate Filtering Service

From the post:

Detecting and dealing with duplicates is a common problem: sometimes we want to avoid performing an operation based on this knowledge, and at other times, as in the case of a database, we may want to only permit an operation based on a hit in the filter (ex: skip disk access on a cache miss). How do we build a system to solve the problem? The solution will depend on the amount of data, frequency of access, maintenance overhead, language, and so on. There are many ways to solve this puzzle.

In fact, that is the problem – there are too many ways. Having reimplemented at least half a dozen solutions in various languages and with various characteristics at PostRank, we arrived at the following requirements: we want a system that is able to scale to hundreds of millions of keys, we want it to be as space efficient as possible, have minimal maintenance, provide low latency access, and impose no language barriers. The tradeoff: we will accept a certain (customizable) degree of error, and we will not persist the keys forever.

Mneme: Duplicate filter & detection

Mneme is an HTTP web-service for recording and identifying previously seen records – aka, duplicate detection. To achieve the above requirements, it is implemented via a collection of bloomfilters. Each bloomfilter is responsible for efficiently storing the set membership information about a particular key for a defined period of time. Need to filter your keys for the trailing 24 hours? Mneme can create and automatically rotate 24 hourly filters on your behalf – no maintenance required.

Interesting in several respects:

  1. Duplicate detection
  2. Duplicate detection for a defined period of time
  3. Duplicate detection for a defined period of time with “customizable” degree of error

Would depend on your topic map project requirements. Assuming absolute truth forever and ever isn’t one of them, detecting duplicate subject representatives for some time period at a specified error rate may be the concept you are looking for.

Enables a discussion of how much certainty (error rate), for how long (time period), for detection of duplicates (subject representatives), on what basis? All of those are going to impact project complexity and duration.

It is also interesting as a solution that will work quite well for some duplicate detection requirements.
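To make the rotating-filter idea concrete, here is a minimal sketch of time-windowed duplicate detection with a ring of bloom filters, in the spirit of Mneme’s hourly rotation. This is an illustration of the technique only; the class and method names are invented, not Mneme’s API:

```python
import hashlib

class RotatingBloomFilter:
    """A ring of bloom filters: a key counts as a duplicate if any
    live filter may have seen it. Rotating expires the oldest filter,
    giving duplicate detection over a trailing time window."""

    def __init__(self, num_filters=24, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.filters = [bytearray(size_bits // 8) for _ in range(num_filters)]
        self.current = 0  # index of the filter receiving new keys

    def _positions(self, key):
        # Derive k bit positions from salted SHA-1 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def seen(self, key):
        """True if any live filter may have recorded this key
        (false positives possible, false negatives not)."""
        return any(
            all(f[p // 8] & (1 << (p % 8)) for p in self._positions(key))
            for f in self.filters
        )

    def add(self, key):
        f = self.filters[self.current]
        for p in self._positions(key):
            f[p // 8] |= 1 << (p % 8)

    def rotate(self):
        # Called e.g. hourly: advance the ring and clear the filter
        # being reused, which expires its keys.
        self.current = (self.current + 1) % len(self.filters)
        self.filters[self.current] = bytearray(self.size // 8)
```

With 24 filters rotated hourly you get exactly the “trailing 24 hours” behavior described in the post; the filter size and hash count set the customizable error rate.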

Misunderstanding Creates Super-Bugs

Filed under: Marketing,Visualization — Patrick Durusau @ 8:35 pm

Misunderstood terminology between doctors and their patients contributes to the evolution of resistant bacteria. That is, bacteria that will resist medical treatment. Further translation: You could die.

Colin Purrington has a great graphic in Venn guide to pills that kill things that explains the difference between antibiotic, antibacterial, antifungal, antiviral, etc. What most doctors mean is antibacterial but they don’t say so.

Knowing what your doctor means is a good thing. Same is true for effective data processing.
