Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 27, 2011

Siri’s Sibling Launches Intelligent Discovery Engine

Filed under: Agents,Artificial Intelligence,Search Engines,Searching — Patrick Durusau @ 8:56 pm

Siri’s Sibling Launches Intelligent Discovery Engine

Completely unintentional, but I ran across this article that concerns Siri as well:

We’re all familiar with the standard search engines such as Google and Yahoo, but there is a new technology on the scene that does more than just search the web – it discovers it.

Trapit, which is a personalized discovery engine for the web that’s powered by the same artificial intelligence technology behind Apple’s Siri, launched its public beta last week. Just like Siri, Trapit is a product of the $200 million CALO Project (Cognitive Assistant that Learns and Organizes), which was the largest artificial intelligence project in U.S. history, according to Mashable. This million-dollar project was funded by DARPA (Defense Advanced Research Projects Agency), the Department of Defense’s research arm.

Trapit, which was first unveiled in June, is a system that personalizes content for its users based on keywords, URLs and reading habits. This service, which can identify related content based on contextual data from more than 50,000 sources, provides a simple, carefree way to discover news articles, images, videos and other content on specific topics.

So, I put in keywords and Trapit uses those to return content to me, which if I then “trapit,” the system will continue to hunt for related content. Yawn. Stop me if you have heard this story before.

Keywords? That’s what we get from “…the largest artificial intelligence project in U.S. history?”

From Wikipedia on CALO:

Its five-year contract brought together 300+ researchers from 25 of the top university and commercial research institutions, with the goal of building a new generation of cognitive assistants that can reason, learn from experience, be told what to do, explain what they are doing, reflect on their experience, and respond robustly to surprise.

And we got keywords. Which Trapit uses to feed back similar content to us. I don’t need similar content, I need content that doesn’t use my keywords and yet is relevant to my query.

But rather than complain, why not build a topic map system based upon “…cognitive assistants that can reason, learn from experience, be told what to do, explain what they are doing, reflect on their experience, and respond robustly to surprise.” Err, that would be crowdsourcing topic map authoring, yes?

‘Siri, You’re Stupid’: Limitations of artificial intelligence baffle kids who expect more

Filed under: Artificial Intelligence — Patrick Durusau @ 8:55 pm

‘Siri, You’re Stupid’: Limitations of artificial intelligence baffle kids who expect more by Lauren Barack.

A deeply amusing post that begins:

My eight-year-old daughter, Harper, got her hands on a new iPhone 4S, and that’s when trouble started. Within minutes, she grew impatient with Siri after posing some queries to Apple’s speech-recognition “assistant” feature: “Can you pronounce my Mother’s name?” “Where do I live?” and “Is there dust on the moon?”—questions she did not assume the artificial voice wouldn’t answer. As it failed, delivering replies such as “Sorry, I don’t know where that is,” Harper became increasingly irritated, until she loudly concluded, “Siri, you’re stupid!” It responded “I’m doing my best.”

I think there is a lesson here: do not create unrealistic expectations among our users. True, I think semantic technologies can be useful, but they are not magical, nor can they convert management/personnel issues into technical ones, much less solve them.

If two departments are not reliably sharing information now, the first question to investigate is why? It may well be simply a terminology issue, in which case a topic map could help them overcome that barrier and more effectively share information.

If the problem is that they are constantly undermining each other’s work and would rather the business fail than share information that might make the other department stand out, then topic maps are unlikely to be of assistance.

Tracking Scholars (and their work)

Filed under: Authoring Topic Maps — Patrick Durusau @ 8:54 pm

In An R function to analyze your Google Scholar Citations page I mused:

Scholars are fairly peripatetic these days and so have webpages, projects, courses, not to mention social media postings using various university identities. A topic map would be a nice complement to this function to gather up the “grey” literature that underlies final publications.

Matt O’Donnell followed that post up with a tweet asking what such a map would look like.

An example would help make the point but I did not want to choose one with a known outcome. Since I recently blogged about the Natural Language Processing course being taught by Christopher Manning and Dan Jurafsky, I will use both of them as examples.

From the course description we know:

Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received his Bachelors degree in Linguistics in 1983 and his Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder before joining the Stanford faculty in 2004. He is the recipient of a MacArthur Fellowship and has served on a variety of editorial boards, corporate advisory boards, and program committees. Dan’s research extends broadly throughout natural language processing as well as its application to the behavioral and social sciences.

Jurafsky has at least three (possibly more) email addresses:

  • University of California at Berkeley – ending somewhere in the early 1990’s
  • University of Colorado, Boulder – between early 1990’s and 2004
  • Stanford – starting in 2004

Just following the link in the class blurb we have: jurafsky(at)stanford.edu for his (current) email at Stanford (it may have changed, can’t say based on what we know now) and a URL to use as a subject identifier, http://www.stanford.edu/~jurafsky/.

I should make up some really difficult technique at this point for discovering prior email addresses. 😉 Some of those may be necessary but what follows is a technique that works for most academics.

We know that Jurafsky started at Stanford in 2004 and for purposes of this exercise we will assume his email at Stanford has been stable. So we need email addresses prior to 2004. At least for CS or CS related fields, the first place I would go is The DBLP Computer Science Bibliography. Choosing author search and inputting “jurafsky” I get two “hits.”

  • Dan Jurafsky
  • Daniel Jurafsky

You will note on the right hand side of the listing of articles, on the “Ask Others…” line, there is a text box with the value used by DBLP to conduct the search. For both “Dan Jurafsky” and “Daniel Jurafsky” it is using author:daniel_jurafsky:. That is, DBLP has regularized the name so that when you ask for “Dan Jurafsky,” the search is run on the longer form.
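If you would rather script that lookup than browse, DBLP also answers author searches over HTTP. A minimal Python sketch, assuming the current dblp.org JSON search API (which postdates this post) and the requests library; the response layout below is my reading of that API, so treat it as an assumption rather than gospel:

import requests

# Query DBLP's author search API (assumed endpoint and JSON layout).
def dblp_author_search(name):
    resp = requests.get(
        "https://dblp.org/search/author/api",
        params={"q": name, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()["result"]["hits"].get("hit", [])
    # Each hit carries the regularized author name and a DBLP author page URL.
    return [(h["info"]["author"], h["info"]["url"]) for h in hits]

for author, url in dblp_author_search("jurafsky"):
    print(author, url)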

Sorry, that was a digression. Anyway, we know we need an address from sometime prior to 2004, and scanning the publications before that date, I saw the following citation:

Daniel Gildea, Daniel Jurafsky: Automatic Labeling of Semantic Roles. Computational Linguistics 28(3): 245-288 (2002)

The source in Computational Linguistics is important because if you follow the Computational Linguistics 28 link, it will take you to a listing of that article in that particular issue of Computational Linguistics.

Oh, the icons:

  • Electronic Edition – link to an electronic version, if one exists (may be a pay-per-view site)
  • CiteSeerX – searches the title as a string at CiteSeerX
  • Google Scholar – searches the title as a string at Google Scholar
  • pubzone.org – links to the article if it appears in PubZone, a service of ETH Zurich in cooperation with ACM SIGMOD
  • BibTeX – the article’s citation in BibTeX format
  • bibliographical record in XML – the article’s citation in XML

If you choose the first icon, it will take you to a paper by Dan Jurafsky in 2002, where his email address is listed as: jurafsky@colorado.edu. (Computational Linguistics is now open access, all issues, which is why I suggested it first.)

You could also look at Jurafsky’s publication page and find the same paper.

Where there is a listing of publications, try there first, but realize that DBLP is a valuable research tool.

The oldest paper that Jurafsky has listed:

Jurafsky, Daniel, Chuck Wooters, Gary Tajchman, Jonathan Segal, Andreas Stolcke, Eric Fosler, and Nelson Morgan. 1994. Integrating Experimental Models of Syntax, Phonology, and Accent/Dialect in a Speech Recognizer (in AAAI-94 workshop)

Gives us his old Berkeley address: jurafsky@icsi.berkeley.edu.

Updating the information we have for Jurafsky:

  • University of California at Berkeley – jurafsky@icsi.berkeley.edu
  • University of Colorado, Boulder – jurafsky@colorado.edu
  • Stanford – jurafsky(at)stanford.edu

And his current homepage for a subject identifier: http://www.stanford.edu/~jurafsky/.

Or, in CTM notation for a topic map:

http://www.stanford.edu/~jurafsky/ # subject identifier
- "Dan Jurafsky"; # name with default type
email: "jurafsky(at)stanford.edu" @stanford; # occurrence with scope
email: "jurafsky@colorado.edu" @colorado; # occurrence with scope
email: "jurafsky@icsi.berkeley.edu" @icsi.berkeley . # occurrence with scope; note the period ending the topic "block"

I thought about and declined to use the notion of “currentEmail.” Using scopes allows for future changes in emails, while maintaining a sense of when certain email addresses were in use. Search engine policies notwithstanding, the world is not a timeless place.

I have some of the results of using Prof. Jurafsky’s prior addresses, but want to polish that up a bit before posting it.

(I will get to Christopher in the next part.)

Topic Map Tool Chain

Filed under: Authoring Topic Maps,Topic Maps — Patrick Durusau @ 8:52 pm

I have talked about a lot of software and techniques since starting this blog but I don’t have an easy way to organize them by topic map task. That is, when do you need which tool? And how would you evaluate one tool against another?

The second question, comparing tools, probably isn’t something I will get to in the coming year. I might, but don’t get your hopes up. I do think I can start to outline one view of when you need which tool.

To talk about tools for topic maps, I need to have an outline of the process of creating a topic map.

My first cut at that process looks like this:

I already see some places that need repair/expansion so don’t take this as anything but a rough draft.

It can become better but only with your comments.

For example, I like the cloud metaphor, mostly because it is popular and people think they know what it means. 😉 But here it leaves the false impression that “clouds” are the only source of data for a topic map.

What about people and their experiences? Or museums, art, books (those hard rectangular things), sensors, etc. Public vs. private clouds.

Maybe what I should do is keep the cloud, remove data/text, and let the cloud be a hyperlink to another image that has more detail? Something like “universes of knowledge – enter here.” What do you think?

Question: For purposes of just blocking out the process, should indexing point to “processing”? I know it can occur later or earlier, but I am just curious how others feel.

The double-ended arrows show that interaction is possible between stages, such as between authoring and the topic map instance. The act of authoring a topic map can lead the author to create different paths than originally intended. That happens so constantly that I thought it important to capture.

Question: And similarity measures. Where do I put them? Personally I think they fall under mining/analysis, because that will be the basis for creation of the topic map, but I can see an argument that merging/processing of the topic map also needs such rules, in case another topic map ventures within merging distance.
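For concreteness, here is the kind of thing I mean by a similarity measure, sketched in Python: a token-based Jaccard score over names that could be applied during mining/analysis and reused at merge time. Nothing topic map specific, just an illustration of where such a rule might live:

def tokens(name):
    # Lowercase a name and split it into a set of tokens.
    return set(name.lower().replace(",", " ").split())

def jaccard(a, b):
    # Jaccard similarity between two token sets: shared tokens over all tokens.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Same rule, two uses: scoring candidates during mining/analysis, and deciding
# whether a topic in another map has ventured within "merging distance."
print(jaccard("Dan Jurafsky", "Jurafsky, Daniel"))     # 0.33, only "jurafsky" is shared
print(jaccard("Daniel Jurafsky", "Jurafsky, Daniel"))  # 1.0, same tokens in either order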

Comments/suggestions?

PS: I would like to keep the diagram fairly uncluttered, even if I have to use the images or arrows to lead to other information or expand in some way. Diagrams that can’t be interpreted at a glance seem to defeat the purpose of having a diagram. (Not claiming that quality for this diagram, which is one of the reasons I am asking for your help.)

H2

Filed under: Database,Java — Patrick Durusau @ 8:51 pm

H2

From the webpage:

Welcome to H2, the Java SQL database. The main features of H2 are:

  • Very fast, open source, JDBC API
  • Embedded and server modes; in-memory databases
  • Browser based Console application
  • Small footprint: around 1 MB jar file size

I ran across this the other day and it looked interesting.

Particularly since I want to start exploring the topic map tool chain and asking what parts can best be done by what software.

Top Three Technologies to Tame the Big Data Beast

Filed under: Description Logic,RDF,Semantic Web — Patrick Durusau @ 8:51 pm

Top Three Technologies to Tame the Big Data Beast by Steve Hamby.

I would re-order some of Steve’s remarks. For example, on the Semantic Web, why not put those paragraphs first:

The first technology needed to tame Big Data — derived from the “memex” concept — is semantic technology, which loosely implements the concept of associative indexing. Dr. Bush is generally considered the godfather of hypertext based on the associative indexing concept, per his 1945 article. The Semantic Web, paraphrased from a definition by the World Wide Web Consortium (W3C), extends hyperlinked Web pages by adding machine-readable metadata about the Web page, including relationships across Web pages, thus allowing machine agents to process the hyperlinks automatically. The W3C provides a series of standards to implement the Semantic Web, such as Web Ontology Language (OWL), Resource Description Framework (RDF), Rule Interchange Format (RIF), and several others.

The May 2001 Scientific American article “The Semantic Web” by Tim Berners-Lee, Jim Hendler, and Ora Lassila described the Semantic Web as agents that query ontologies representing human knowledge to find information requested by a human. OWL ontology is based on Description Logics, which are both expressive and decidable, and provide a foundation for developing precise models about various domains of knowledge. These ontologies provide the “memory index” that enables searches across vast amounts of data to return relevant, actionable information, while addressing key data trust challenges as well. The ability to deliver semantics to a mobile device, such as what the recent release of the iPhone 4S does with Siri, is an excellent step in taming the Big Data beast, since users can get the data they need when and where they need it. Big Data continues to grow, but semantic technologies provide the needed check points to properly index vital information in methods that imitate the way humans think, as Dr. Bush aptly noted.

Follow that with the amount of data recitation and the comments about Vannevar Bush:

In the July 1945 issue of The Atlantic Monthly, Dr. Vannevar Bush’s famous essay, “As We May Think,” was published as one of the first articles addressing Big Data, information overload, or the “growing mountain of research” as stated in the article. The 2010 IOUG Database Growth Survey, conducted in July-August 2010, estimates that more than a zettabyte (or a trillion gigabytes) of data exists in databases, and that 16 percent of organizations surveyed reported a data growth rate in excess of 50 percent annually. A Gartner survey, also conducted in July-August 2010, reported that 47 percent of IT staffers surveyed ranked data growth as one of the top three challenges faced by their IT organization. Based on two recent IBM articles derived from their CIO Survey, one in three CIOs make decisions based on untrusted data; one in two feel they do not have the data they need to make an informed decision; and 83 percent cite better analytics as a top concern. A recent survey conducted for MarkLogic asserts that 35 percent of respondents believe their unstructured data sources will surpass their structured data sources in size in the next 36 months, while 86 percent of respondents claim that unstructured data is important to their organization. The survey further asserts that only 11 percent of those that consider unstructured data important have an infrastructure that addresses unstructured data.

Dr. Bush conceptualized a “private library,” coined “memex” (mem[ory ind]ex) in his essay, which could ingest the “mountain of research,” and use associative indexing — how we think — to correlate trusted data to support human decision making. Although Dr. Bush conceptualized “memex” as a desk-based device complete with levers, buttons, and a microfilm-based storage device, he recognized that future mechanisms and gadgetry would enhance the basic concepts. The core capabilities of “memex” were needed to allow man to “encompass the great record and to grow in the wisdom of race experience.”

That would allow exploration of questions and comments like:

1) With a zettabyte of data and more coming in every day, precisely how are we going to create/impose OWL ontologies to develop “…precise models about various domains of knowledge?”

2) Curious on what grounds hyperlinking is considered the equivalent of associative indexing? Hyperlinks can be used by indexes but hyperlinking isn’t indexing. Wasn’t then, isn’t now.

3) The act of indexing is collecting references to a list of subjects. Imposing RDF/OWL may be preparatory steps towards indexing but are not indexing in and of themselves.

4) Description Logics are decidable but why does Steve think human knowledge can be expressed in decidable fashion? There is a vast amount of human knowledge in religion, philosophy, politics, ethics, economics, etc., that cannot be expressed in decidable fashion. Parking regulations can be expressed in decidable fashion, I think, but I don’t know if they are worth the trouble of RDF/OWL.

5) For that matter, where does Steve get the idea that human knowledge is precise? I suppose you could have made that argument in the 1890’s; except for some odd cases, classical physics was sufficient. At least until 1905. (Hint: Think of Albert Einstein.) Human knowledge is always provisional, uncertain and subject to revision. CERN has apparently observed neutrinos going faster than the speed of light, for example. More revisions of physics are on the way.

Part of what we need to tame the big data “beast” is acceptance that we need information systems that are like ourselves.

That is to say information systems that are tolerant of imprecision, perhaps even inconsistency, that don’t offer a false sense of decidability and omniscience. Then at least we can talk about and recognize the parts of big data that remain to be tackled.

DOD looks to semantics for better data-sharing, cost savings

Filed under: Federation,Funding,Government Data — Patrick Durusau @ 8:50 pm

DOD looks to semantics for better data-sharing, cost savings by Amber Currin.

From Federal Computer Week:

In its ongoing quest to catalyze cost efficiencies and improve information-sharing, the Defense Department is increasingly looking to IT to solve problems of all sizes. The latest bid involves high-tech search capabilities, interoperable data and a futuristic, data-rich internet known as semantic web.

In a new RFI, the Defense Information Systems Agency and Deputy Chief Management Office are looking to strengthen interoperability and data-sharing for a vast array of requirements through an enterprise information web (EIW). Their envisioned EIW is built on semantic web, which will allow better enterprise-wide collection, analysis and reporting of data necessary for managing personnel information and business systems, as well as protecting troops on the ground with crucial intelligence.

“At its heart, semantic web is about making it possible to integrate and share information at a web scale in a simple way that traditional databases don’t allow,” said James Hendler, senior constellation professor of the Tetherless World Research Constellation at Rensselaer Polytechnic Institute.

One way semantic web helps is by standardizing information to enable databases to better communicate with each other – something that could be particularly helpful for DOD’s diverse systems and lexicons.

“The information necessary for decision-making is often contained in multiple source systems managed by the military services, components and/or defense agencies. In order to provide an enterprise view or answer questions that involve multiple services or components, each organization receives data requests then must interpret the question and collect, combine and present the requested information,” the RFI reads.

Oh, and:

“DOD historically spends more than $6 billion annually developing and maintaining a portfolio of more than 2,000 business systems and web services. Many of these systems, and the underlying processes they support, are poorly integrated. They often deliver redundant capabilities that optimize a single business process with little consideration to the overall business enterprise,” DOD Deputy Chief Management Officer Beth McGrath said in an April 4 memo. “It is imperative, especially in today’s limited budget environment, to optimize our business processes and the systems that support them to reduce our annual business systems spending.”

Just in case you are interested, the deadline for responses is 19 December 2011. A direct link to the RFI.

I may actually respond. Would there be any interest in my posting my response to the RFI here, to get reader input on it?

So I could revise it week by week until the deadline.

Might be a nice way to educate other contenders and the DoD about topic maps in general.

Comments?

BTW, if you are interested in technology and the U.S. federal government, try reading Federal Computer Week on a regular basis. At least you will know what issues are “up in the air” and the vocabulary being used to talk about them.

Concord: A Tool That Automates the Construction of Record Linkage Systems

Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.

From the webpage:

Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.

A nice way to start off the week! Deeply interesting paper and a new name for record linkage.

Several features of Concord that merit your attention (among many):

A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.

The blocking functions, which operate just as you suspect, narrowing the potential set of records for matching, are also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match. (A rough sketch of this blocking-plus-matching shape follows below.)

Surrogate learning; I have located the paper cited on this subject and will be covering it in another post.
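Since Concord itself may not be generally available, here is a rough Python sketch of the general record linkage shape the paper describes: a blocking function to cut down the candidates, a couple of match feature functions, and a scoring step. The features and the hand-set threshold are mine, purely for illustration; Concord learns its matching model rather than using a fixed cutoff:

from difflib import SequenceMatcher

def block_key(record):
    # Blocking: only compare records that share a crude key, here the first
    # letter of the surname plus the first digit of the zip code.
    return (record["surname"][:1].lower(), record["zip"][:1])

def name_similarity(a, b):
    # Match feature 1: fuzzy string similarity on full names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(incoming, master):
    # Match feature 2: an exact email match is strong evidence on its own.
    if incoming["email"] and incoming["email"] == master["email"]:
        return 1.0
    return name_similarity(incoming["name"], master["name"])

def resolve(records, masters, threshold=0.85):
    # Pair each incoming record with master records in the same block and
    # keep the pairs that score above the (hand-set) threshold.
    index = {}
    for m in masters:
        index.setdefault(block_key(m), []).append(m)
    for r in records:
        for m in index.get(block_key(r), []):
            if match_score(r, m) >= threshold:
                yield r, m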

I have written to Thomson Reuters inquiring about the availability of Concord and its ability to interchange mapping settings between instances of Concord or beyond. Will update when I hear back from them.

November 26, 2011

CNetS

Filed under: Complex Networks,Systems Research — Patrick Durusau @ 8:07 pm

CNetS: Center for Complex Networks and Systems Research

Work of the Center:

The types of problems that we work on include mining usage and traffic patterns in technological networks such as the Web and the Internet; studying the interaction between social dynamics and online behaviors; modeling the evolution of complex social and technological networks; developing adaptive, distributed, collaborative, agent-based applications for Web search and recommendation; understanding complex biological networks and complex reaction in biochemistry; developing models for the spread of diseases; understanding how coordinated behavior arises from the dynamical interaction of nervous system, body, and environment; studying social human behavior; exploring reasons underlying species diversity; studying the interplay between self-organization and natural selection; understanding how information arises and is used in biological systems; and so on. All these examples are characterized by complex nonlinear feedback mechanisms and it is now being increasingly recognized that the outcome of such interactions can only be understood through mathematical and computational models.

Lots of interesting content. I will be calling some of it out in the future.

Invenio

Filed under: Invenio,Library software — Patrick Durusau @ 8:05 pm

Invenio

From the webpage:

Invenio is a free software suite enabling you to run your own digital library or document repository on the web. The technology offered by the software covers all aspects of digital library management from document ingestion through classification, indexing, and curation to dissemination. Invenio complies with standards such as the Open Archives Initiative metadata harvesting protocol (OAI-PMH) and uses MARC 21 as its underlying bibliographic format. The flexibility and performance of Invenio make it a comprehensive solution for management of document repositories of moderate to large sizes (several millions of records).

Invenio has been originally developed at CERN to run the CERN document server, managing over 1,000,000 bibliographic records in high-energy physics since 2002, covering articles, books, journals, photos, videos, and more. Invenio is being co-developed by an international collaboration comprising institutes such as CERN, DESY, EPFL, FNAL, SLAC and is being used by about thirty scientific institutions worldwide (see demo).

If you would like to try it out yourself, please download our latest version. If you have any questions about the software or the support behind it, please join our mailing lists or contact us.

A stage where even modest improvements in results would be likely to attract attention.

Lucene/Solr 3.5 Release Imminent!

Filed under: Lucene,Solr — Patrick Durusau @ 8:04 pm

The mirrors are being updated for the release of Lucene/Solr 3.5!

Expect the formal announcement any time now.

My mirror sites show directory creation dates of 25 Nov. 2011.

For Lucene, your nearest download site.

For Solr, your nearest download site.

MontySolr: A Search Solution for Python Lovers With the Speed of Native Java

Filed under: INSPIRE,Solr — Patrick Durusau @ 8:02 pm

MontySolr: A Search Solution for Python Lovers With the Speed of Native Java

From the post:

The folks at CERN wanted a better way to search High Energy Physics fulltext paper repositories and bibliographical databases that produce result set numbers in the multi-millions. INSPIRE, the system that merges the sources’ query results, though, is written in Python. In order to move back and forth as quickly as possible between the two systems, CERN decided among a number of options to embed INSPIRE in Solr.

The result, MontySolr, utilizes the power of Java and works with any Python application, as well as any C/C++ app that Python understands. For more information on MontySolr, check this video of Roman Chyla (CERN).

Let’s run all counters back to zero and start again. This time with the abstract from the original presentation in San Francisco, May, 2011:

SPIRES is the biggest bibliographic database for High Energy Physics, ArXiv is the biggest fulltext repository for the fulltext papers in High Energy Physics, and INSPIRE is the biggest digital library that merges the two. We must work with result sets bigger than 1 million for citation related queries and our partners from Astrophysics with 6 million sets, however INSPIRE is written in Python. So how do we move several million result sets between the two systems fast? How do we take advantage of our special NLP processing pipeline written in Python? How do we join them? We do not use Jython. We do not use pipes. We do not embed Solr inside INSPIRE. We embed INSPIRE into Solr! The talk shows benefits and challenges of this surprisingly elegant solution.

With the original title:

CPython Embedded in Solr – Search Solution for Python Lovers With the Speed of Native Java

You will need the slides to really appreciate the video.

And MontySolr on Github.

Impressive results!

But the real kicker is that C and C++ apps are made available inside Solr. Such as for NLP!

INSPIRE

Filed under: CERN,INSPIRE — Patrick Durusau @ 8:01 pm

INSPIRE

From the webpage:

CERN, DESY, Fermilab and SLAC have built the next-generation High Energy Physics (HEP) information system, INSPIRE, which empowers scientists with innovative tools for successful research at the dawn of an era of new discoveries.

INSPIRE combines the successful SPIRES database content, curated at DESY, Fermilab and SLAC, with the Invenio digital library technology developed at CERN. INSPIRE is run by a collaboration of the four labs, and interacts closely with HEP publishers, arXiv.org, NASA-ADS, PDG, and other information resources.

INSPIRE represents a natural evolution of scholarly communication, built on successful community-based information systems, and provides a vision for information management in other fields of science.

INSPIRE builds on SPIRES’ expertise

  • Decades of trusted, curated content
  • Experience in managing a discipline’s wide information resources
  • Close relationship with the worldwide user community

What are the major innovations of INSPIRE?

  • Author disambiguation for high-quality profiles and improved search capabilities
  • Fulltext search and snippet display for access restricted content
  • Faster results
  • Variety of search and display options
  • Detailed record pages
  • Searchable fulltext for 5 years of arXiv content
  • Figures and searchable figure captions extracted from 5 years of arXiv articles
  • LHC experimental notes

What will be available soon?

  • Personalized features (bookshelves, author pages, paper claiming)
  • More APIs for third parties to build new tools
  • More historical content
  • Conference slides

Deeply cool digital library system from CERN.

Recommendation with Apache Mahout in CDH3 – Update

Filed under: Mahout — Patrick Durusau @ 7:59 pm

Recommendation with Apache Mahout in CDH3 – Update

My original post was to a page at Cloudera. That page has now gone away.

I saw a tweet by Alex Popescu asking about the page and when I checked, all I got was a 404.

I started to update my post but then decided there is a broader question: should I cache local copies of pages and resources, so that at least you will see the page as I saw it when I made the entry?

Comments?

Neo4j 1.6 – Milestone 01!

Filed under: Graphs,Neo4j,NoSQL — Patrick Durusau @ 7:57 pm

Neo4j 1.6 – Milestone 01!

From the post:

The theme of 1.6 is mainly about improving infrastructure and QA. These improvements include faster builds, moving from TC to Jenkins, and extending our tests to cover more client platforms, both browser and operating system wise. The reason for these changes is that, while we’ve delivered many great features very rapidly over the last few months, we’re always looking to do better. Improving our internal build infrastructure helps us deliver quality features faster, and helps us better turn around responses to the community’s requests for features.

Infrastructure isn’t our only focus for 1.6, however. We are also working on Neo4j so that we can store graph metadata, e.g. configuration settings. This will help us to better evolve the internal infrastructure.

As always, there are a number of bugs that have been fixed, both internally and for the community issues. See: https://github.com/neo4j/community/issues?sort=created&direction=desc&state=closed&page=1

Something to keep you busy over the holidays!

November 25, 2011

GeoIQ API Overview

Filed under: Geo Analytics,Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 4:29 pm

GeoIQ API Overview

From the webpage:

GeoIQ is the engine that powers the GeoCommons Community. GeoIQ includes a full Application Programming Interface (API) that allows developers to build unique and powerful domain specific applications. The API provides capability for uploading and download data, searching for data and maps, building, embedding, and theming maps or charts, as well as general user, group, and permissions management.

The GeoIQ API consists of a REST API and a JavaScript API. REST means that it uses simple URL’s and HTTP methods to perform all of the actions. For example, a dataset is a specific endpoint that a user can create, read, update or delete (CRUD).

Another resource for topic mappers who want to link information to “real” locations. 😉
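To make the REST/CRUD point concrete, here is a minimal Python sketch. The endpoint paths, field names and authentication below are hypothetical stand-ins (check the GeoIQ API documentation for the real ones); the only point is the shape of create and read calls against a dataset resource:

import requests

BASE = "http://geocommons.com"          # hypothetical base URL, for illustration only
AUTH = ("your_user", "your_password")   # placeholder credentials

# Create a dataset (hypothetical endpoint and fields).
created = requests.post(
    BASE + "/datasets.json",
    json={"title": "Sample points", "description": "Illustration only"},
    auth=AUTH,
    timeout=10,
)
dataset = created.json()

# Read it back by id (again, the URL pattern is illustrative, not from the docs).
fetched = requests.get(BASE + "/datasets/%s.json" % dataset["id"], auth=AUTH, timeout=10)
print(fetched.json())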

CALI – eBooks for Legal Education

Filed under: Law - Sources,Legal Informatics — Patrick Durusau @ 4:28 pm

CALI – eBooks for Legal Education

CALI is a center for projects to make legal materials freely available.

A recent example would be the Federal Rules of Civil Procedure, Federal Rules of Criminal Procedure and the Federal Rules of Evidence in free ebook form.

All three of which could be mapped into current case law streams, or other topic map type projects.

If you are interested in law and topic map type projects, CALI would be a good starting point for inquiries into what is needed the most.

PS: If you can, please join CALI to support their work.

Hadapt is moving forward

Filed under: Data Warehouse,Hadapt,Hadoop,Query Language — Patrick Durusau @ 4:27 pm

Hadapt is moving forward

A bullet-point type review, mostly a summary of information from the vendor. Not a bad thing, it can be useful. But you would think that when reviewing a vendor or their product, there would be a link to the vendor/product. Yes? Not one that I can find in that post.

Let me make it easy for you: Hadapt.com. How hard was that? Maybe 10 seconds of my time and that is because I have gotten slow? The point of the WWW, at least as I understand it, is to make information more accessible to users. But it doesn’t happen by itself. Put in hyperlinks where appropriate.

There is a datasheet on the Adaptive Analytic Platform™.

You can follow the link for the technical report and register, but it is little more than a sales brochure.

More informative is: Efficient Processing of Data Warehousing Queries in a Split Execution Environment.

I don’t have a local setup that would exercise Hadapt. If you do, or if you are using it in the cloud, I would appreciate any comments or pointers you have.

How to Cache PHP Sessions in Membase

Filed under: Membase,PHP — Patrick Durusau @ 4:26 pm

How to Cache PHP Sessions in Membase

Another “practical” post for today! 😉

A good tutorial that outlines the issues with Memcache and then proceeds to solve them with Membase.

From the blog:

Membase is memcache with data persistence. And it doesn’t use something like memcache, it is memcache. So if you have code that already is using memcache, you can have it use membase right away, usually with no change to your code.

The improvement of having data persistence is that if you need to bring down a server, you don’t have to worry about all that dainty, floaty data in memory that is gonna get burned. Since membase has replication and persistence built-in, you can feel free to restart a troublesome server without fear of your database getting pounded as the caches need to refill, or that a set of unlucky users will get logged out. I’ll let you read about all the many other advantages of membase here. It’s much more than I’ve mentioned here.

I know a lot of libraries run PHP-based interfaces, so please forward this to any librarians that you know.
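The post is PHP-centric, but the protocol compatibility point is easy to see from any memcached client. A small Python sketch, assuming a Membase bucket listening on the standard memcached port (11211) and the pymemcache library; the session key and payload are made up for illustration:

import json
from pymemcache.client.base import Client

# Talk to Membase exactly as you would talk to memcached: same protocol, same calls.
client = Client(("localhost", 11211))

session_id = "sess:abc123"   # made-up session key
client.set(session_id, json.dumps({"user": "patrick", "logged_in": True}), expire=1800)

raw = client.get(session_id)
session = json.loads(raw) if raw else None
print(session)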

SpiderDuck: Twitter’s Real-time URL Fetcher

Filed under: Software,Topic Map Software,Topic Map Systems,Tweets — Patrick Durusau @ 4:26 pm

SpiderDuck: Twitter’s Real-time URL Fetcher

A bit of a walk on the engineering side but in order to be relevant, topic maps do have to be written and topic map software implemented.

This is a very interesting write-up of how Twitter relied mostly on open source tools to create a system that could be very relevant to topic map implementations.

For example, the fetch/no-fetch decision for URLs is based on a comparison to URLs fetched within X days. Hmmm, comparison of URLs, oh, those things that occur in subjectIdentifier and subjectLocator properties of topics. Do you smell relevance?
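To make the parallel explicit, here is a Python sketch of the kind of URL comparison involved: normalize the URL (much as you would before comparing subjectIdentifiers or subjectLocators) and skip the fetch if an equivalent form was seen within the last X days. This is my reading of the idea, not SpiderDuck’s actual code:

import time
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # Reduce trivially different URLs to a single comparison form:
    # lowercase scheme and host, default the path, drop the fragment.
    parts = urlsplit(url.strip())
    host = parts.hostname.lower() if parts.hostname else ""
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

FETCH_WINDOW = 7 * 24 * 3600   # "X days," here 7
fetched_at = {}                # normalized URL -> last fetch time (epoch seconds)

def should_fetch(url, now=None):
    now = now or time.time()
    key = normalize(url)
    last = fetched_at.get(key)
    if last is not None and now - last < FETCH_WINDOW:
        return False           # an equivalent form was fetched recently
    fetched_at[key] = now
    return True

print(should_fetch("HTTP://Example.com/page#frag"))  # True, first sighting
print(should_fetch("http://example.com/page"))       # False, same normalized URL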

And there is harvesting of information from web pages; one assumes that could be done on “information items” from a topic map as well, except there it would be properties, etc. Even more relevance.

What parts of SpiderDuck do you find most relevant to a topic map implementation?

Topic Map Query Language (TMQL) – Last Draft

Filed under: TMQL — Patrick Durusau @ 4:25 pm

Topic Map Query Language (TMQL) – Last Draft

The last draft of TMQL – ISO/IEC 18048, has been posted to the SC 34 repository.

There will be no further drafts of the TMQL standard unless and until WG 3 has sufficient resources to take up work in this area in the future.

Stanford Courses

Filed under: CS Lectures — Patrick Durusau @ 4:24 pm

Stanford Courses

Kirk Lowery forwarded this link, which has all the current and, one supposes, future Stanford courses that you can take for free online.

Ones that are of particular interest to the practice of topic maps I will continue to call out separately.

Natural Language Processing

Filed under: Natural Language Processing — Patrick Durusau @ 4:24 pm

Natural Language Processing with Christopher Manning and Dan Jurafsky.

From the webpage:

We are offering this course on Natural Language Processing free and online to students worldwide, January 23rd – March 18th 2012, continuing Stanford’s exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford’s courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.

Course Description

The course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering, We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.

The class will start January 23 2012, and will last approximately 8 weeks.

If you don’t know any more about natural language processing at the end of March 2012 than you did at New Years, whose fault is that? 😉

SPARQL 1.1 Overview

Filed under: SPARQL — Patrick Durusau @ 4:23 pm

SPARQL 1.1 Overview

From the webpage:

Abstract:

This document is an overview of SPARQL 1.1. It provides an introduction to a set of W3C specifications that facilitate querying and manipulating RDF graph content on the Web or in an RDF store. (First Public Working draft)

Not a deep introduction but does include enough pointers and other material that it is worth reading.
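If you want to poke at SPARQL 1.1 from code while reading the overview, here is a small Python sketch using the SPARQLWrapper library against DBpedia’s public endpoint. The property path in the query is a SPARQL 1.1 feature; the endpoint choice and DBpedia’s category layout are my assumptions, so substitute your own endpoint and graph as needed:

from SPARQLWrapper import SPARQLWrapper, JSON

# Any SPARQL 1.1 endpoint will do; DBpedia's public endpoint is used here as an example.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# The property path dct:subject/skos:broader (a SPARQL 1.1 feature) walks from the
# article's categories one step up the category tree.
sparql.setQuery("""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?category WHERE {
      <http://dbpedia.org/resource/Topic_map> dct:subject/skos:broader ?category .
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["category"]["value"])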

StatTrek

Filed under: Statistics — Patrick Durusau @ 4:22 pm

StatTrek

I saw this site referenced by the Analysis Factor when discussing calculation of the binomial distribution. Like the writer there, I just fell in love with the name.

There is a fair amount of advertising but that isn’t going to hurt you. Besides, the site has a number of useful resources.

November 24, 2011

Libraries: Where It All Went Wrong

Filed under: Library — Patrick Durusau @ 3:57 pm

Libraries: Where It All Went Wrong

From the introduction:

Bill Gates wrote a bestseller in 1995. He was on a roll: Microsoft Windows had finally crushed its old foe the Macintosh computer from Apple, Microsoft was minting money hand over fist, and he was hugely respected in the industry he had helped start. He roped in other big brains from Microsoft to write a book to answer the question, “what next?” The Road Ahead talked about the implications of everyone having a computer and how they would use the great Information Superhighway that was going to happen.

The World Wide Web appears in the index to The Road Ahead precisely four times. Bill Gates didn’t think the Internet would be big. The Information Superhighway of Gates’s fantasies would have more structure than the Internet, be better controlled than the Internet, in short it would be more the sort of thing that a company like Microsoft would make.

Bill Gates and Microsoft were caught flat-footed by the take-up of the Internet. They had built an incredibly profitable and strong company which treated computers as disconnected islands: Microsoft software ran on the computers, but didn’t help connect them. Gates and Microsoft soon realized the Internet was here to stay and rushed to fix Windows to deal with it, but they never made up for that initial wrong-footing.

At least part of the reason for this was because they had this fantastic cash cow in Windows, the island software. They were victims of what Clayton Christenson calls the Innovator’s Dilemma: they couldn’t think past their own successes to build the next big thing, the thing that’d eat their lunch. They still haven’t got there: Bing, their rival to Google, has eaten $5.5B since 2009 and it isn’t profitable yet.

I’m telling you this because libraries are like Microsoft.

Read this post and then re-read this post if you care about libraries.

I think he is spot on in his analysis but it is going to be up to you to find how libraries can be a value-add that is visible and vital to the general public. What difference do you make in their lives?

Before anyone flames me, let me point out that my wife is a librarian, my daughter (only child) is studying to be a librarian, I am adjunct faculty at a library school (GSLIS/UIUC), and have spent most of my life in libraries of one sort or another.

However, I do agree with Nat Torkington that libraries have an enormous value-add but they need to make that case in terms of the Internet and networking knowledge. Making the case for pre-Internet libraries is doomed to failure and dooms the libraries that make it.

FactLab

Filed under: Data,Data Source,Interface Research/Design — Patrick Durusau @ 3:55 pm

FactLab

From the webpage:

Factlab collects official stats from around the world, bringing together the World Bank, UN, the EU and the US Census Bureau. How does it work for you – and what can you do with the data?

From the guardian in the UK.

Very impressive and interactive site.

Don’t agree with their philosophical assumptions about “facts,” but none the less, a number of potential clients do. So long as they are paying the freight, facts they are. 😉

Weather forecast and good development practices

Filed under: Data,Data Management — Patrick Durusau @ 3:54 pm

Weather forecast and good development practices by Paolo Sonego.

From the post:

Inspired by this tutorial, I thought that it would be nice to have the possibility to have access to weather forecast directly from the R command line, for example for a personalized start-up message such as the one below:

Weather summary for Trieste, Friuli-Venezia Giulia:
The weather in Trieste is clear. The temperature is currently 14°C (57°F). Humidity: 63%.

Fortunately, thanks to the always useful Duncan Temple Lang’s XML package (see here for a tutorial about XML programming under R), it is straightforward to write few lines of R code to invoke the google weather api for the location of interest, retrieve the XML file, parse it using the XPath paradigm and get the required informations:

You may need weather information for your topic map but, more importantly, it will be useful if small routines or libraries are written for common data sets. There is little reason for multiple libraries for, say, census data, unless the data is substantially different.

An R function to determine if you are a data scientist

Filed under: Data Science,Humor,R — Patrick Durusau @ 3:52 pm

An R function to determine if you are a data scientist

From the post:

“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, described the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples.

I mention this more for amusement than serious use in hiring. Probably best not to tell HR that you have such an R function.

Labeling a person or concept using a computer does not mean:

  1. The person or concept is correctly labeled, or
  2. If correctly labeled, you understand the label.

An R function to analyze your Google Scholar Citations page

Filed under: Citation Indexing,Social Media — Patrick Durusau @ 3:51 pm

An R function to analyze your Google Scholar Citations page

From the post:

Google scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here.

I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.

Features include:

The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the pdf file specified (There is an option to turn off plotting if you want).

It can also calculate citation indices.

Scholars are fairly peripatetic these days and so have webpages, projects, courses, not to mention social media postings using various university identities. A topic map would be a nice complement to this function to gather up the “grey” literature that underlies final publications.
