Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 18, 2011

Deja vu: a Database of Highly Similar Citations

Filed under: Bioinformatics,Biomedical,Deja vu — Patrick Durusau @ 9:37 pm

Deja vu: a Database of Highly Similar Citations

From the webpage:

Deja vu is a database of extremely similar Medline citations. Many, but not all, of which contain instances of duplicate publication and potential plagiarism. Deja vu is a dynamic resource for the community, with manual curation ongoing continuously, and we welcome input and comments.

In the scientific research community plagiarism and multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy. Our peers and the public have high expectations for the performance and behavior of scientists during the execution and reporting of research. With little chance for discovery and decreasing budgets, yet sustained pressure to publish, or without a clear understanding of acceptable publication practices, the unethical practices of duplicate publication and plagiarism can be enticing to some. Until now, discovery has been through serendipity alone, so these practices have largely gone unchecked.

The application of text similarity searching can robustly detect highly similar text records, offering a new tool for ensuring integrity in scientific publications. Deja vu is a database of computationally identified, manually confirmed highly similar citations (abstracts and titles), as well as user provided commentary and evidence to affirm or deny a given documents putative categorization. It is available via the web and to other database curators for tagging of their indexed articles. The availability of a search tool, eTBLAST, by which journal submissions can be compared to existing databases to identify potential duplicate citations and intercept them before they are published, and this database of highly similar citations (or exhaustive searching and tagging within Medline and other databases) could be deterrents to this questionable scientific behavior and excellent examples of citations that are highly similar but represent very distinct research publications.

I would broaden the statement:

multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy.

to include repeating the same analysis or discoveries out of sheer ignorance of prior work.

Not as an ethical issue but one of “…waste of time and energy.”

Given the semantic diversity in all fields, work is repeated simply due to “tribes” as Jack Park calls them, using different terminology.

I will be using Deja vu to explore topics in *informatics, to discover related materials.

If you are already using Deja vu that way, your experience, observations, comments would be deeply appreciated.
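
If you are wondering what “text similarity searching” amounts to in practice, here is a minimal sketch of one common baseline, cosine similarity over term-frequency vectors. This is not eTBLAST’s algorithm, just an illustration of the general idea (the sample abstracts are made up):

    import math
    import re
    from collections import Counter

    def term_vector(text):
        """Lowercase, tokenize on word characters, and count term frequencies."""
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def cosine_similarity(a, b):
        """Cosine similarity between two term-frequency vectors (Counters)."""
        shared = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in shared)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Hypothetical abstracts, for illustration only.
    abstract_1 = "We report a novel gene associated with heart disease in a cohort study."
    abstract_2 = "A cohort study reporting a novel gene associated with heart disease."

    score = cosine_similarity(term_vector(abstract_1), term_vector(abstract_2))
    print("similarity: %.2f" % score)  # a score near 1.0 flags a pair worth manual review

A pair scoring near 1.0 is exactly the kind of candidate that Deja vu's manual curation would then confirm or deny.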

collocations in wikipedia – parts 2 and 3

Filed under: Collocation,Linguistics — Patrick Durusau @ 9:37 pm

Matthew Kelcey continues his series on collocations, although the title of part 3 doesn’t say as much.

collocations in wikipedia, part 2

In part 2 Matt discusses alternatives to “magic” frequency cut-offs for collocation analysis.

I rather like the idea of looking for alternatives to “it’s just that way” methodologies. Accepting traditional cut-offs, etc., may be the right thing to do in some cases, but only with experience and an understanding of the alternatives.

finding phrases with mutual information [collocations, part 3]

In part 3 Matt discusses taking collocations beyond just two terms that occur together and techniques for that analysis.
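
If you want to experiment alongside Matt's posts, here is a minimal sketch of pointwise mutual information over adjacent word pairs. It is not Matt's code, just the textbook PMI formulation applied to a toy corpus (swap in your own Wikipedia text):

    import math
    import re
    from collections import Counter

    def pmi_bigrams(text, min_count=2):
        """Score adjacent word pairs by pointwise mutual information:
        PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
        words = re.findall(r"[a-z']+", text.lower())
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        n_uni = sum(unigrams.values())
        n_bi = sum(bigrams.values())
        scores = {}
        for (x, y), c in bigrams.items():
            if c < min_count:   # the "magic" frequency cut-off part 2 questions
                continue
            p_xy = c / n_bi
            p_x = unigrams[x] / n_uni
            p_y = unigrams[y] / n_uni
            scores[(x, y)] = math.log(p_xy / (p_x * p_y), 2)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Toy corpus; any plain-text Wikipedia dump would be more interesting.
    text = ("new york is a large city . new york city has a large population . "
            "the city of new york is in new york state .")
    for pair, score in pmi_bigrams(text)[:20]:
        print("%-20s %.2f" % (" ".join(pair), score))

Note how rare-but-exclusive pairs score highly, which is one of the behaviors Matt's posts dig into.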

Matt is also posting to-do thoughts for further investigation.

If you have the time and interest, drop by Matt’s blog to leave suggestions or comments.

(See collocations in wikipedia, part 1 for our coverage of the first post.)

November 17, 2011

Last Call Working Draft of SPARQL 1.1 Federated Query

Filed under: Federated Search,Query Language,SPARQL — Patrick Durusau @ 8:39 pm

Last Call Working Draft of SPARQL 1.1 Federated Query

From the W3C:

A Last Call Working Draft of SPARQL 1.1 Federated Query, which offers data consumers an opportunity to merge data distributed across the Web from multiple SPARQL query services. Comments on this working draft are welcome before 31 December 2011.

Some “lite” holiday reading. 😉
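
If you want a taste of what a federated query looks like before reading the draft, here is a rough sketch: the SERVICE keyword pulls results from a remote SPARQL endpoint into the middle of a query. The local endpoint URL below is a placeholder, and the Python wrapper (SPARQLWrapper against DBpedia) is just one way to send it:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical local endpoint that federates part of the query out to DBpedia.
    LOCAL_ENDPOINT = "http://localhost:3030/dataset/query"  # placeholder

    query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name
    WHERE {
      ?person a foaf:Person .                   # matched against local data
      SERVICE <http://dbpedia.org/sparql> {     # ...merged with results from DBpedia
        ?person foaf:name ?name .
      }
    }
    LIMIT 10
    """

    sparql = SPARQLWrapper(LOCAL_ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row["name"]["value"])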

Machine Learning with Python – Logistic Regression

Filed under: Machine Learning,Python — Patrick Durusau @ 8:39 pm

Machine Learning with Python – Logistic Regression

From the post:

I decided to start a new series of posts now focusing on general machine learning with several snippets for anyone to use with real problems or real datasets. Since I am studying machine learning again with a great course online offered this semester by Stanford University, one of the best ways to review the content learned is to write some notes about what I learned. The best part is that it will include examples with Python, Numpy and Scipy. I expect you enjoy all those posts!

This could be really nice; I will post updates when new posts arrive.
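
While waiting for the series, here is a minimal sketch of the sort of thing I expect it to cover, logistic regression fit by batch gradient descent in plain Numpy. Not the author's code, just a baseline to compare against:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, iters=1000):
        """Batch gradient descent on the logistic (cross-entropy) loss.
        X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
        X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T.dot(sigmoid(X.dot(w)) - y) / len(y)
            w -= lr * grad
        return w

    def predict(w, X):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return (sigmoid(X.dot(w)) >= 0.5).astype(int)

    # Toy data: one feature, classes separated around x = 0.
    X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    w = fit_logistic(X, y)
    print("weights:", w)
    print("predictions:", predict(w, X))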

Mindbreeze Picks Up Where SharePoint Leaves Off

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:39 pm

Mindbreeze Picks Up Where SharePoint Leaves Off

From the post:

SharePoint 2010 is a widely implemented application, but not one that solves every solution. The issue is explored further in, “SharePoint 2010 collaboration ISVs focus on workflow, analytics.” The author, Jonathan Gourlay, reports that users are increasingly relying on a number of independent software vendors to plug the holes in the service that SharePoint provides.

Mark Gilbert, lead analyst for Gartner Research had this to say:

“’Just because SharePoint is a lot of stuff, it doesn’t mean it’s all good stuff, but a lot of it is,’ said Gilbert, who estimates he’s spoken to 3,000 companies about SharePoint. He compares the platform to a Swiss Army Knife that allows the user to add tools. ‘To make [SharePoint] a real enterprise-class tool, you typically have to pay a lot of attention to the care and feeding of it and you have to add a lot of third-party tools.’”

Here’s the main question: if SharePoint is being advertised as enterprise-class, why do so many users need ISVs to bring it up to that level? The article goes on to argue that the opportunity for vendors to build upon the SharePoint platform is huge.

We argue that one smart and agile solution could single-handedly solve an organization’s enterprise and SharePoint woes. Fabasoft Mindbreeze is getting good feedback regarding its suite of solutions.

I must admit I will sleep easier tonight knowing that:

SharePoint 2010 is a widely implemented application, but not one that solves every solution.

As long as SharePoint 2010 tries to solve problems, we may stand a chance. 😉

Seriously, I don’t think you have to go very far to find enterprise level solutions by people who work in the .Net world. If it were me, I would ring up Networked Planet, whose website isn’t being rebuilt so no apologies are necessary. (Disclosure: I don’t work for Networked Planet but I do know both of its founders.)

This is another example of where the practice of topic maps can solve real-world problems. If you have ever used any version of SharePoint, then you know what it means to have problems in need of solutions. Fortunately for you, you don’t have to learn topic maps or even hear the term to enjoy a solution to the problems SharePoint poses.

AI2012: The 25th Canadian Conference on Artificial Intelligence

Filed under: Artificial Intelligence,Conferences — Patrick Durusau @ 8:39 pm

AI2012: The 25th Canadian Conference on Artificial Intelligence

Dates:

When May 28, 2012 – May 30, 2012
Where York University, Toronto, Ontario, Canada
Submission Deadline Jan 16, 2012
Notification Due Feb 20, 2012
Final Version Due Mar 5, 2012

Topics of interest include, but are not limited to:

Agent Systems
AI Applications
Automated Reasoning
Bioinformatics and BioNLP
Case-based Reasoning
Cognitive Models
Constraint Satisfaction
Data Mining
E-Commerce
Evolutionary Computation
Games
Information Retrieval
Knowledge Representation
Machine Learning
Multi-media Processing
Natural Language Processing
Neural Nets
Planning
Robotics
Search
Smart Graphics
Uncertainty
User Modeling
Web Applications

The “usual suspects,” in other words. 😉

Angels of the Right – version 2.0

Filed under: Networks,Visualization — Patrick Durusau @ 8:38 pm

Angels of the Right – version 2.0

From the post:

I’ve been working for the past several months to build AngelsOfTheRight.net a new interactive version of the conservative philanthropy network data from the Media Matters Conservative Transparency Project and other sources. The idea is to have an atlas where you can dive in, explore, and see which organisations have similar patterns of funding relationships. As always, my hope is to make some of these invisible economic and power relationships a bit more tangible.

If you want to see network maps pushed really hard in HTML5, this looks like the place to be.

Certainly useful visualization techniques for a number of purposes.

Modular Unified Tagging Ontology (MUTO)

Filed under: Ontology,Tagging,Taxonomy — Patrick Durusau @ 8:38 pm

Modular Unified Tagging Ontology (MUTO)

From the webpage:

The Modular Unified Tagging Ontology (MUTO) is an ontology for tagging and folksonomies. It is based on a thorough review of earlier tagging ontologies and unifies core concepts in one consistent schema. It supports different forms of tagging, such as common, semantic, group, private, and automatic tagging, and is easily extensible.

I thought the tagging axioms were worth repeating:

  • A tag has always exactly one label – otherwise it is not a tag.

    (Additional labels can be separately defined, e.g. via skos:Concept.)
  • Tags with the same label are not necessarily semantically identical.

    (Each tag has its own identity and property values.)
  • A tag can itself be a resource of tagging (tagging of tags).

From the properties defined, however, it isn’t clear how to determine when tags do have the same meaning, or how to communicate that understanding to others.

Ah, or would that be a tagging of a tagging?

That sounds like it leaves a lot of semantic detail on the cutting room floor, but it may be that viable semantic systems, oh, say natural languages, do exactly that. Something to think about, isn’t it?

Apps for Science Challenge

Filed under: Marketing,SciVerse — Patrick Durusau @ 8:38 pm

SciVerse held a challenge recently on apps for science. Two of the three top-place finishers had distinctly topic map-like features.

Altmetric – First place: Reads in part:

Once the Altmetric app is installed you’ll notice a new ‘Altmetric’ box appear in the sidebar whenever you search on the SciVerse Hub. It’ll show you the articles in the first few pages of your search results that your peers and the general public have been talking about online; if you prefer you can choose to only see articles from the page of results that you’re currently on. You’ll also see some basic information about how and where articles are being discussed underneath the search results themselves.

Refinder – Second place: Reads in part:

When you found the right papers on SciVerse, bring them together with Refinder. Refinder is an intelligent online collaboration tool for teams. Scientists are using it to collect papers, research notes, and more information in one place. Once collected, important facts about documents can be added as comments. By using links, related things are connected. When reading an article in SciVerse, an intelligent algorithm automatically searches and suggests relevant collections, topics, documents, or experts from Refinder.

Teams love it. Shared collections are provided for each team. They are individually configured by inviting members and setting access rights. Teams use collections to share articles, ideas, dicuss topics, ask questions and get answers. Organizations can use Refiner both internally and externally, a useful feature to communicate with partners in projects.

Sounds a lot like a topic map, doesn’t it? Except that they have a fairly loose authoring model, which is probably a good thing in some cases. I don’t know whether the relations between things are typed or whether they have some notion of identity.

iHelp – Third place: Reads in part:

iHelp enables researchers to do search in their native languages. Articles with different languages are retrieved using this multi-lingual search. Option is provided for phonetic typing of search text. User can either do native search (the typed language) or translate and search in English.

I assume this is talking about full-text search, but at least it attempts that across languages. I suspect it has all the issues of full-text search plus the perils of mechanized translation. Still, if the alternative is no searching at all, this must seem pretty good.

All of these applications represent some subset of what topic maps are about, ranging from subjects being described in different languages to being able to easily collaborate with others or discover other characteristics of a work, such as its popularity.

Offering even modest improvements over current interfaces, improvements that fall far short of the capabilities of topic maps, seems to attract a fair amount of interest. Something for naysayers about improved information applications to keep in mind.

SciVerse Applications Beta

Filed under: Bibliography,SciVerse — Patrick Durusau @ 8:38 pm

SciVerse Applications Beta

From the webpage:

SciVerse Applications Beta lets you integrate search and discovery applications into SciVerse, to help you be more productive in your research. Login or register, find an application and get started – there is nothing to download or install, the applications you’ve selected will appear immediately within SciVerse.

Developers can create applications for over 15 million SciVerse users worldwide. SciVerse Applications Beta lets you integrate your application directly into the core SciVerse user experience on article, record and search results pages. To learn more, please visit the Developer Network.

SciVerse Applications Beta has just launched and we continue to make improvements. We welcome your feedback on all aspects of this service.

Not a lot of folks but every application has to start somewhere. 😉

There was a contest recently for new apps. I will cover the winners in a separate post.

Author Wordle™

Filed under: Visualization,Word Cloud — Patrick Durusau @ 8:38 pm

Author Wordle™

From the SciVerse description:

The Author Wordle™ application lets you create a Wordle word cloud out of the titles of the last 100 papers from any author in Scopus.

Wordle is a toy for generating word clouds from text. The clouds give greater prominence to words that appear more frequently in the source text. Clouds can be tweaked with different fonts, layouts, and color schemes. The images created with Wordle are yours to use however you like. You can print them out, or save them to the Wordle gallery to share with your friends. Authors can use these Wordle word clouds on their own website as a representation of their research, or just for fun.

Can’t say I am a big fan of word clouds but a lot of people find them quite useful. See how it works for you in evaluating recent work by a particular author.

Next Generation Cluster Computing on Amazon EC2 – The CC2 Instance Type

Filed under: Cloud Computing,Topic Map Software,Topic Maps — Patrick Durusau @ 8:37 pm

Next Generation Cluster Computing on Amazon EC2 – The CC2 Instance Type

From the post:

Today we are introducing a new member of the Cluster Compute Family, the Cluster Compute Eight Extra Large. The API name of this instance is cc2.8xlarge so we’ve taken to calling it the CC2 for short. This instance features some incredible specifications at a remarkably low price. Let’s take a look at the specs:

Processing – The CC2 instance type includes 2 Intel Xeon processors, each with 8 hardware cores. We’ve enabled Hyper-Threading, allowing each core to process a pair of instruction streams in parallel. Net-net, there are 32 hardware execution threads and you can expect 88 EC2 Compute Units (ECU’s) from this 64-bit instance type. That’s nearly 90x the rating of the original EC2 small instance, and almost 3x the rating of the first-generation Cluster Compute instance.

Storage – On the storage front, the CC2 instance type is packed with 60.5 GB of RAM and 3.37 TB of instance storage.

Networking – As a member of our Cluster Compute family, this instance is connected to a 10 Gigabit network and offers low latency connectivity with full bisection bandwidth to other CC2 instances within a Placement Group. You can create a Placement Group using the AWS Management Console:

Pricing – You can launch an On-Demand CC2 instance for just $2.40 per hour. You can buy Reserved Instances, and you can also bid for CC2 time on the EC2 Spot Market. We have also lowered the price of the existing CC1 instances to $1.30 per hour.

You have the flexibility to choose the pricing model that works for you based on your application, your budget, your deadlines, and your ability to utilize the instances. We believe that the price-performance of this new instance type, combined with the number of ways that you can choose to acquire it, will result in a compelling value for scientists, engineers, and researchers.

Seems like it was only yesterday that I posted a note that NuvolaBase.com was running a free cloud beta. Hey! That was only yesterday!

Still a ways off from unmetered computing resources but moving in that direction.

If you have some experience with one of the cloud services, consider writing up a pricing example for experimenting with topic maps. I suspect that would help a lot of people (including me) get their feet wet with topic maps and cloud computing.
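
To get the ball rolling, here is a back-of-the-envelope sketch using the on-demand prices quoted above. The workload figures (a four-node cluster for a weekend) are made up for illustration; reserved or spot pricing would change the numbers:

    # Back-of-the-envelope cost for a weekend of topic map merging experiments.
    # Prices are from the post above; the workload assumptions are hypothetical.
    CC2_PER_HOUR = 2.40   # cc2.8xlarge on-demand
    CC1_PER_HOUR = 1.30   # first-generation cluster compute, at the new lower price

    hours = 48            # say, a weekend-long run
    nodes = 4             # a small Placement Group

    cc2_cost = CC2_PER_HOUR * hours * nodes
    cc1_cost = CC1_PER_HOUR * hours * nodes
    print("CC2 cluster: $%.2f   CC1 cluster: $%.2f" % (cc2_cost, cc1_cost))
    # CC2 cluster: $460.80   CC1 cluster: $249.60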

November 16, 2011

NuvolaBase.com

Filed under: Cloud Computing,Graphs,OrientDB — Patrick Durusau @ 8:19 pm

NuvolaBase.com

I was surprised to see this at the end of the OrientDB slides on the multi-master architecture, “the first graph database on the Cloud,” but I am used to odd things in slide decks. 😉

From the FAQ:

What is the technology behind NuvolaBase?

NuvolaBase is a cloud of several OrientDB servers deployed in multiple data centers around the globe.

What is the architecture of your cloud?

The cloud is based on multiple servers in different server farms around the globe. This guarantee low latency and high availability. Today we have three server farms, two in Europe and one in USA. We’ve future plans to expand the cloud in China and South America.

Oh, did I mention that during the beta test it is free?

OrientDB – Distributed Architecture…

Filed under: OrientDB — Patrick Durusau @ 8:19 pm

OrientDB – Distributed Architecture with a Multi-Master Approach (available version 1.0, due December 2011) by Luca Garulli.

Tossing the old master/slave approach in favor of a multi-master approach.

Great set of slides! One more reason to be looking forward to December!

Yandex – Relevance Prediction Challenge

Filed under: Contest,Relevance — Patrick Durusau @ 8:18 pm

Yandex – Relevance Prediction Challenge

Important Dates:

Oct 15, 2011 – Challenge opens

Dec 22, 2011 – End of challenge (extended from Dec 15)

Dec 25, 2011 – Winners candidacy notification

Jan 20, 2012 – Reports deadline

Feb 12, 2012 – WSCD workshop at WSDM 2012, Winners announcement

Sorry, you are already starting late. Here are some of the details; see the website for more:

From the webpage:

The Relevance Prediction Challenge provides a unique opportunity to consolidate and scrutinize the work from industrial labs on predicting the relevance of URLs using user search behavior. It provides a fully anonymized dataset shared by Yandex which has clicks and relevance judgements. Predicting relevance based on clicks is difficult, and is not a solved problem. This Challenge and the shared dataset will enable a whole new set of researchers to conduct such experiments.

The Relevance Prediction Challenge is a part of series of contests organized by Yandex called Internet Mathematics. This year’s event is the sixth since 2004. Participants will again compete in finding solutions to a real-life problem based on real-life data. In previous years, participants tried to learn to rank documents, predict traffic jams and find similar images.

I can’t think of very many “better” days to find out you won such a contest!
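
If you are tempted to jump in anyway, the usual starting point is a naive click-through baseline: estimate a URL’s relevance from how often it was clicked when shown. A rough sketch (mine, not part of the challenge materials):

    from collections import defaultdict

    def ctr_baseline(sessions):
        """sessions: iterable of (url, shown, clicked) tuples from a click log.
        Returns a smoothed click-through-rate estimate per URL."""
        shown = defaultdict(int)
        clicked = defaultdict(int)
        for url, was_shown, was_clicked in sessions:
            shown[url] += was_shown
            clicked[url] += was_clicked
        # Add-one smoothing so rarely shown URLs are not scored 0 or 1 outright.
        return {url: (clicked[url] + 1.0) / (shown[url] + 2.0) for url in shown}

    # Tiny hypothetical log: (url, shown, clicked)
    log = [("u1", 1, 1), ("u1", 1, 0), ("u2", 1, 0), ("u2", 1, 0), ("u3", 1, 1)]
    for url, score in sorted(ctr_baseline(log).items(), key=lambda kv: -kv[1]):
        print(url, round(score, 3))

Beating a baseline like this with session context is, as the organizers say, not a solved problem.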

Big Data Just Got Smaller: New Approach to Find Information

Filed under: Artificial Intelligence,Graphs — Patrick Durusau @ 8:18 pm

Big Data Just Got Smaller: New Approach to Find Information

From the post:

San Diego, CA – Artificial intelligence vendor ai-one will unveil a new approach to graphically represent knowledge at the SuperData conference in San Diego on Wednesday November 16, 2011. The discovery, named ai-Fingerprint, is a significant breakthrough because it allows computers to understand the meaning of language much like a person. Unlike other technologies, ai-Fingerprints compresses knowledge in way that can work on any kind of device, in any language and shows how clusters of information relate to each other. This enables almost any developer to use off-the-shelf and open-source tools to build systems like Apple’s SIRI and IBM Watson.

Ondrej Florian, ai-one’s VP of Core Technology invented ai-Fingerprints as a way to find information by comparing the differences, similarities and intersections of information on multiple websites. The approach is dynamic so that the ai-Fingerprint transforms as the source information changes. For example, the shape for a Twitter feed adapts with the conversation. This enables someone to see new information evolve and immediately understand its significance.

“The big idea is that we use artificial intelligence to identify clusters and show how each cluster relates to another,” said Florian. “Our approach enables computers to compare ai-Fingerprints across many documents to find hidden patterns and interesting relationships.”

The ai-Fingerprint is the collection of all the keywords and their associations identified by ai-one’s Topic-Mapper tool. Each keyword and its associations is a coordinate – much like what you would find on a map. The combination of these keywords and associations forms a graph that encapsulates the entire meaning of the document. (emphasis added)

The line “…encapsulates the entire meaning of the document.” goes a bit far.

Whose “entire meaning” of the document? Which documents, and against whom were they tested? Can it understand the tariff portion of a phone bill? (Which I rather doubt has a meaning other than the total.)

There have been computational approaches to knowledge before and there will be others that follow this one. It makes for good press and gets all the pundits exercised, but that is about all. It will prove useful in some cases, but that doesn’t mean it is a truly generalized solution.

I did want to bring it to your attention for whatever use you can make of it in the long term and, in the short term, as something to annoy your cubicle neighbour.

“Big Data” and the Failure to Communicate…

Filed under: BigData,Communication,Semantics — Patrick Durusau @ 8:18 pm

“Big Data” and the Failure to Communicate… by Richard Murnane.

From the post:

All the talk about “Big Data” reminds me of a line or two from an old movie I like, “What we have here is failure to communicate” (Cool Hand Luke, 1967). Why? Well, we’re all talking about a concept which means different things to different people. To make things worse, the press and all the technology vendors are trying to figure this out before the people who need to operationally deal with this “Big Data” every day know what the heck is going on.

The fact is that “Big Data” is in the air and there is no denying that something is up and we all need to grow up and figure this out. The following chart is a Google Trends snapshot comparing “Big Data” to another common (but mature) IT term, “Network Security.” Notice that a year ago “Big Data” was essentially non-existent as something people were searching Google for and now it’s getting about 50% the activity as this much more common term.

You will enjoy the post and it has much to offer, but I do have one small niggle. Well, maybe not small; medium? That’s not right either. Let’s just say really, really big and let it go at that:

Richard says “Big Data” “…means different things to different people.”

What he fails to say, is that the data “inside” big data has the same issue of meaning different things to different people.

Processing (outside of a topic map or other semantically nuanced application) requires us to treat data as having one and only one meaning. We may actually view or consider some data to have only one meaning. But our viewing or processing data as having only one meaning doesn’t make it so.

Data can have at least as many meanings as there are users to process or view it. (Allowing for users who ascribe multiple meanings to the same data. Post-modernists for the most part.)

expressor – Data Integration Platform

Filed under: Data Integration,Software — Patrick Durusau @ 8:18 pm

expressor – Data Integration Platform

I ran across expressor while reading a blog entry that uses it as integration software to work through Facebook and Twitter data.

It has a community edition but apparently only runs on Windows (XP and Windows 7, there’s a smart move).

Before I download/install, any comments? Suggestions for other integration tasks?

Thanks!

Oh, the post that got me started on this: expressor: Enterprise Application Integration with Social Networking Applications. Realize that expressor is an ETL tool but sometimes that is what a job requires.

Bayesian variable selection [off again]

Filed under: Bayesian Models,Mathematics — Patrick Durusau @ 8:18 pm

Bayesian variable selection [off again]

From the post:

As indicated a few weeks ago, we have received very encouraging reviews from Bayesian Analysis about our [Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin and myself] our comparative study of Bayesian and non-Bayesian variable selections procedures (“Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation“) to Bayesian Analysis. We have just rearXived and resubmitted it with additional material and hope this is the last round. (I must acknowledge a limited involvement at this final stage of the paper. Had I had more time available, I would have liked to remove the numerous tables and turn them into graphs…)

If you are not conversant in Bayesian thinking and recent work, this paper is going to be … difficult. Despite having only just gotten past the introduction and still looking for references to help with part 2, I think it will be a good intellectual exercise and important for your use of Bayesian models in the future. Two very good reasons to spend the time to understand this paper.

Or to put it another way, the world is non-probabilistic only when viewed with a certain degree of coarseness. How useful a coarse view is varies from circumstance to circumstance. If you don’t have the capability to use a probabilistic view, you will be limited to a coarse one. (Neither is better than the other, but having both seems advantageous to me.)

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS)

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 8:17 pm

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS) by Pierre Lindenbaum.

From the post:

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.“.

In the current post, I’ll show how I’ve used this web service to annotate a VCF file with this information.

The web service provided by the ESP is based on the SOAP protocol.

Important news/post for several reasons:

First and foremost, “for the potential to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

Second, thanks to Pierre, we have a fully worked example of how to perform the annotation.

Last but not least, the NHLBI Exome Sequencing Project (ESP) did not try to go it alone for the annotations. It did what it does well and then offered the data up for others to use and extend.

I can’t count the number of projects of varying sorts that I have seen that tried to do every feature, every annotation, every imaging, every transcription, on their own. All of which resulted in being less than they could have been with greater openness.

I am not suggesting that vendors need to give away data. Vendors for the most part support all of us. It is disingenuous to pretend otherwise. So vendors making money means we get to pay our bills, buy books and computers, etc.

What I am suggesting is that vendors, researchers, and users need to work (yelling at each other doesn’t count) towards commercially viable solutions that enable greater collaboration with regard to research and data.

Otherwise we will have impoverished data sets that are never quite what they could be, and vendors will charge many, many times the real cost of developing data. Those two conditions don’t benefit anyone. “You, me, them.” (Blues Brothers) 😉
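
Pierre's worked example is in Java with JAX-WS. For a rough sense of the workflow in Python, here is a sketch that walks a VCF file and hands each variant to a stubbed annotation call; the ESP service's actual SOAP operations and endpoint are covered in Pierre's post, so the call here is left as a placeholder:

    def read_vcf(path):
        """Yield (chrom, pos, ref, alt) for each data line of a VCF file."""
        with open(path) as handle:
            for line in handle:
                if line.startswith("#"):      # skip headers and metadata
                    continue
                fields = line.rstrip("\n").split("\t")
                chrom, pos, _id, ref, alt = fields[:5]
                yield chrom, int(pos), ref, alt

    def esp_annotation(chrom, pos):
        """Placeholder for the SOAP call to the ESP web service.
        See Pierre's post for the real client code and operation names."""
        return {"maf": None}   # stub: the real call returns allele frequencies, etc.

    def annotate(path):
        for chrom, pos, ref, alt in read_vcf(path):
            info = esp_annotation(chrom, pos)
            print(chrom, pos, ref, alt, info)

    if __name__ == "__main__":
        import sys
        annotate(sys.argv[1])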

search Google Books by ISSN

Filed under: GoogleBooks,ISBN,ISSN — Patrick Durusau @ 8:17 pm

search Google Books by ISSN

From the post:

Turns out Google Books does support searching by ISSN, using ordinary fielded search syntax, although I don’t believe it’s documented anywhere.

Mostly what you’ll find is digitized bound journals from libraries (that is, digitization of some volumes of the journal, probably not all of them, which may or may not have full text access). Sometimes things that physically look like monographs but are published serially also get ISSNs, you might get some of those too, not sure. Has to be in GBS, and GBS has to have ISSN metadata for the record, not sure how often that happens.

Of particular interest to library students and librarians.

My only caution is that like many “undocumented” features, this may or may not persist. Still, take advantage of it while it is around.

(And a good excuse for me to add ISSN/ISBN to my category list.)

Data Integration Remains a Major IT Headache

Filed under: Data Integration,Marketing — Patrick Durusau @ 2:13 pm

Data Integration Remains a Major IT Headache

From the webpage:

Click through for results from a survey on data integration, conducted by BeyeNetwork on behalf of Syncsort.

…. (with regard to data integration tools)

In particular, the survey makes it clear that not only is data integration still costly, a lot of manual coding is required. The end result is that the fundamentals of data integration are still a big enough issue in most IT organizations to thwart the achievement of strategic business goals.

Complete with bar and pie charts! 😉

If data integration is a problem in the insular data enclaves of today, do you think data integration will get easier when foreign big data comes on the scene?

That’s what I think too.

I will ask BeyeNetwork if they asked this question:

How much manually coded data had been the subject of manual coding before?

Or perhaps better:

Where did coders get the information for repeated manual coding of the data? (with follow up questions based on the responses to refine that further)

Reasoning that how we maintain information about data (read: metadata) can have an influence on the cost of manual coding, i.e., discovery of what the data means (or is thought to mean).

It isn’t possible to escape manual coding, at least if we want useful data integration. We can, however, explore how to make manual coding less burdensome.

I say we can’t escape manual coding because unless by happenstance two data sets shared the same semantics, I am not really sure how they would be integrated sight unseen with any expectation of a meaningful result.

Or to put it differently, meaningful data integration efforts, like lunches, are not free.

PS: And you thought I was going to say topic maps were the answer to data integration headaches. 😉 Maybe, maybe, depends on your requirements.

You should never buy technology or software because of its name, everyone else is using it, your boss saw it during a Super Bowl half-time show, or similar reasons. I am as confident that topic maps will prove to be the viable solution in some cases as I am that other solutions are more appropriate in others. Topic mappers should not be afraid to say so.

November 15, 2011

Hadoop and Data Quality, Data Integration, Data Analysis

Filed under: Data Analysis,Data Integration,Hadoop — Patrick Durusau @ 7:58 pm

Hadoop and Data Quality, Data Integration, Data Analysis by David Loshin.

From the post:

If you have been following my recent thread, you will of course be anticipating this note, in which we examine the degree to which our favorite data-oriented activities are suited to the elastic yet scalable massive parallelism promised by Hadoop. Let me first summarize the characteristics of problems or tasks that are amenable to the programming model:

  1. Two-Phased (2-φ) – one or more iterations of “computation” followed by “reduction.”
  2. Big data – massive data volumes preclude using traditional platforms
  3. Data parallel (Data-||) – little or no data dependence
  4. Task parallel (Task-||) – task dependence collapsible within phase-switch from Map to Reduce
  5. Unstructured data – No limit on requiring data to be structured
  6. Communication “light” – requires limited or no inter-process communication except what is required for phase-switch from Map to Reduce

OK, so I happen to agree with David’s conclusions (see his post for the table). That isn’t the only reason I posted this note.

Rather, I think this sort of careful analysis lends itself to test cases, which we can post and share with a specification of the tasks performed.

Much cleaner and more enjoyable than the debates measured by who can sink the lowest fastest.

Test cases to suggest, anyone?
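
As an opening suggestion, the canonical word count exhibits most of the properties on David's list: two-phased, data parallel, and communication light. Here is a plain Python sketch of the two phases that could serve as a reference result for a real Hadoop run:

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        """Map: emit (word, 1) for every word in one document."""
        return [(word, 1) for word in document.lower().split()]

    def reduce_phase(pairs):
        """Reduce: sum the counts per word (the grouping is the phase switch)."""
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
    result = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))
    print(result)   # {'the': 3, 'quick': 2, 'brown': 1, ...}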

John Giannandrea on Freebase – A Rosetta Stone for Entities

Filed under: Entities,Semantic Diversity,Semantics — Patrick Durusau @ 7:58 pm

John Giannandrea on Freebase – A Rosetta Stone for Entities by Daniel Tunkelang.

From the post:

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).

(emphasis added)

<tedious-self-justification>…., entity is a collection of keys indeed! Key/value pairs I would say, with no presumptions about the structure of either one.</tedious-self-justification>

There is not now nor will there ever be agreement on the “unique objects in the world.” And why should that be a value? If we have the key/value pairs, we can each arrive at our own conclusions about whether certain “unique nodes” correspond to what we think of as “unique objects in the world.”

I suspect, but don’t know (having never asked former President Bush II), that we disagree on the existence of any unique objects in the world and it is unlikely there is any evidence that would persuade either one of us to change.

Remember the Rosetta Stone had three (3) versions of the same inscription. It did not try to say one version was closer to the original than the others.

The Rosetta Stone is one of the earliest honorings of semantic diversity, unlike systems that try to push only one common semantic or vision.

Models for MapReduce

Filed under: MapReduce,Mathematics — Patrick Durusau @ 7:58 pm

Models for MapReduce by Suresh Venkatasubramanian

From the post:

I’ve been listening to Jeff Phillips‘ comparison of different models for MapReduce (he’s teaching a class on models for massive data). In what follows, I’ll add the disclaimer IANACT (I am not a complexity theorist).

There’s something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al, the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

I won’t spoil the rest of the post for you, read it and the comments.

There is a lot of work to be done towards modeling and understanding MapReduce.

Personally I suspect there will be some general models that give way to more specialized ones for some domains.

Serendipity Is Not An Intent

Filed under: Advertising,Intent,Searching,Serendipity — Patrick Durusau @ 7:58 pm

Serendipity Is Not An Intent

From the post:

Wired had two amazing pieces on online advertising yesterday and while Felix Salmon’s piece The Future of Online Advertising could be Yieldbot’s manifesto it is the piece Can ‘Serendipity’ Be a Business Model? that deals more directly with our favorite topic, intent.

…..

Twitter is the greatest discovery engine ever created on the web. But discovery can be and not be serendipitous. Sometimes, as Dorsey alludes to, you discover things you had no idea existed but much more often you discover things after you have intent around what you want to discover. This is an important differentiation for Twitter to consider. It’s important because it’s a different algorithm.

Discovery intent is not an algo about “how do we introduce you to something that would otherwise be difficult for you to find, but something that you probably have a deep interest in?” There is no “introduce” and “probably” in the discovery intent algo. Most importantly, there is no “we.” It’s an algo about “how do you discover what you’re interested in.”

Discovering more about what you’re interested in has always been Twitter’s greatest strength. It leverages both user-defined inputs and the rich content streams where context and realtime matching can occur. Just like Search.

If Twitter wants to build a discovery system for advertising it should look like this. (emphasis added)

This inverts the advertising and, when you think about it, the search algorithm. Rather than discovering, poorly, what interests the user or answers a question, it enables the user to discover (a pull model) what interests them.

Completely different way of thinking about advertising and search.

Priesthood of the user? It worked (depending on who you ask) a long time ago.

Maybe, just maybe, a service architecture based on that as a goal could disrupt the current “I know better than you” push models for search and advertising.

VC funding for Hadoop and NoSQL tops $350m

Filed under: Funding,Hadoop,NoSQL — Patrick Durusau @ 7:58 pm

VC funding for Hadoop and NoSQL tops $350m

From the post:

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but below is a snapshot of the report, along with a graphic representation of the recent up-tick in funding.

According to our figures, between the beginning of 2008 and the end of 2010 $95.8m had been invested in the various Apache Hadoop- and NoSQL-related vendors. That figure now stands at more than $350.8m, up 266%.

That statistic does not really do justice to the sudden uptick of interest, however. The figures indicate that funding for Apache Hadoop- and NoSQL-related firms has more than doubled since the end of August, at which point the total stood at $157.5m.

It takes more work than winning the lottery, but on the other hand it is encouraging to see that kind of money being spread around.

But, past funding is just that, past funding. Encouraging but the real task is creating solutions that attract future funding.

Suggestions/comments?

Data Science by Analyticbridge

Filed under: Data Science — Patrick Durusau @ 7:57 pm

Data Science by Analyticbridge

From the post:

Our Data Science e-Book provides recipes, intriguing discussions and resources for data scientists and executives or decision makers. You don’t need an advanced degree to understand the concepts. Most of the material is written in simple English, however it offers simple, better and patentable solutions to many modern business problems, especially about how to leverage big data.

Emphasis is on providing high-level information that executives can easily understand, while being detailed enough so that data scientists can easily implement our proposed solutions. Unlike most other data science books, we do not favor any specific analytic method nor any particular programming language: we stay one level above practical implementations. But we do provide recommendations about which methods to use when necessary.

Most of the material is original, and can be used to develop better systems, derive patents or write scientific articles. We also provide several rules of the thumbs and details about craftsmanship used to avoid traditional pitfalls when working with data sets. The book also contains interviews with analytic leaders, and material about what should be included in a business analytics curriculum, or about how to efficiently optimize a search to fill an analytic position.

I am not sure about the book offering “better and patentable solutions to many modern business problems” or else there will be a run on the patent office the day of its release. 😉

Seriously, I expect it to be a very useful book and you do have the opportunity to review the fifty-six (56) pages released so far.

Due out in December of 2011 as a free ebook. Something for your new ebook reader. 😉

Computational Statistics: An Introduction

Filed under: Computational Statistics,Statistics — Patrick Durusau @ 7:57 pm

Computational Statistics: An Introduction by James E. Gentle, Wolfgang Härdle, Yuichi Mori.

I suspect this to be:

Handbook of Computational Statistics (Volume I) Concepts and Methods
Gentle, James E.; Härdle, Wolfgang; Mori, Yuichi (Eds.)
2004, XII, 1070 p. 236 illus., Hardcover
ISBN: 3-540-40464-3
Language: English
Publisher: Springer-Verlag New York

(source: http://www.rmi.nus.edu.sg/csf/webpages/)

But there is no date or other information about the book on the webpages that I could find.

If it is the 2004 edition (seems likely), I doubt the basics of computational statistics have changed all that much. The only way to know for sure would be to get a copy of the second edition, if one is issued, and compare the two.

If you make a comparison or have other information about this resource, please post a response.

I checked at the Springer site and they no longer list this work in the series. There is a nice selection of > $300 (US) books in the series if you are interested.

Data mining is going in the direction of greater reliance on computational analysis so a firm grounding in statistics (and their limitations) is a good skill set to have.

Recall vs. Precision

Filed under: Precision,Recall — Patrick Durusau @ 7:57 pm

Recall vs. Precision by Gene Golovchinsky.

From the post:

Stephen Robertson’s talk at the CIKM 2011 Industry event caused me to think about recall and precision again. Over the last decade precision-oriented searches have become synonymous with web searches, while recall has been relegated to narrow verticals. But is precision@5 or NDCG@1 really the right way to measure the effectiveness of interactive search? If you’re doing a known-item search, looking up a common factoid, etc., then perhaps it is. But for most searches, even ones that might be classified as precision-oriented ones, the searcher might wind up with several attempts to get at the answer. Dan Russell’s a Google a day lists exactly those kinds of challenges: find a fact that’s hard to find.

So how should we think about evaluating the kinds of searches that take more than one query, ones we might term session-based searches?

Read the post and the comments more than once!

Then think about how you would answer the questions raised, in or out of a topic map context.

Much food for thought here.
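
If you want something concrete to poke at while thinking it over, here is a minimal sketch of precision@k and recall for a single query, which is exactly the per-query framing the post argues is too narrow for session-based searches:

    def precision_at_k(ranked, relevant, k):
        """Fraction of the top-k results that are relevant."""
        top_k = ranked[:k]
        return sum(1 for doc in top_k if doc in relevant) / float(k)

    def recall(ranked, relevant):
        """Fraction of all relevant documents that were retrieved."""
        if not relevant:
            return 0.0
        return sum(1 for doc in ranked if doc in relevant) / float(len(relevant))

    ranked = ["d3", "d1", "d7", "d2", "d9"]    # one system's ranking for one query
    relevant = {"d1", "d2", "d4"}              # judged relevant for that query

    print("P@5    =", precision_at_k(ranked, relevant, 5))   # 0.4
    print("recall =", recall(ranked, relevant))              # 2/3

Extending measures like these across the several queries of a session is the open question Gene and Stephen are pointing at.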
