Archive for December, 2011

High-Quality Images from the Amsterdam Rijksmuseum

Saturday, December 31st, 2011

High-Quality Images from the Amsterdam Rijksmuseum

From the post:

The Amsterdam Rijksmuseum has made images from its “basic collection” – a little over 103,000 objects – available under a Creative Commons BY 3.0 license which allows you to:

  • Share — to copy, distribute and transmit the work
  • Remix — to adapt the work
  • Make commercial use of the work

These images may be used not only for classroom study and research but also for publishing, as long as the museum receives proper attribution. The collections database, in Dutch, is available here. Over 70,000 objects are also cataloged using ICONCLASS subject headings in English; this interface is available here. Click here for an example of the scan quality.

Geertje Jacobs posted a response:

Geertje Jacobs says:
December 14, 2011 at 1:16 am

Thank you for the post on our new API service!

I’d like to add an extra link to the API page. On this page http://www.rijksmuseum.nl/api, you’ll find information about our service (very soon also in English). This is also the place to ask for the key to make use of our data and images!
If there are any questions please contact api@rijksmuseum.nl.

Enjoy our collection!

A very promising resource for studies in European history, historical theology and the intellectual history of Europe. Coupled with a topic map, geographic, written and other resources can be combined with the visual resources from the Amsterdam Rijksmuseum.
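If you request a key, a first program against the service might start by building query URLs. The endpoint path and parameter names below are my guesses, not the documented contract — check http://www.rijksmuseum.nl/api for the actual service details before relying on them:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical endpoint layout -- consult http://www.rijksmuseum.nl/api
# for the real paths and parameter names.
API_BASE = "http://www.rijksmuseum.nl/api/"

def collection_url(api_key, query, page=1):
    """Build a collection-search URL for the (assumed) API layout."""
    params = urlencode({"key": api_key, "q": query, "page": page})
    return urljoin(API_BASE, "collection") + "?" + params

# The actual HTTP request is left out; the point is only the shape of the call.
url = collection_url("YOUR-KEY", "Rembrandt")
print(url)
```

Swap in the real endpoint once the English documentation appears.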

Webdam Project: Foundations of Web Data Management

Saturday, December 31st, 2011

Webdam Project: Foundations of Web Data Management

From the homepage:

The goal of the Webdam project is to develop a formal model for Web data management. This model will open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.

Books from the project:

  • Foundations of Databases, Serge Abiteboul, Richard Hull, Victor Vianu, open access online edition
  • Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, open access online edition
  • Modeling, Querying and Mining Uncertain XML Data, Evgeny Kharlamov and Pierre Senellart. In A. Tagarelli, editor, XML Data Mining: Models, Methods, and Applications. IGI Global, 2011. open access online edition

I discovered this project via a link to “Web Data Management and Distribution” in Christophe Lalanne’s A bag of tweets / Dec 2011, that pointed to the PDF file, some 400 pages. I went looking for the HTML page with the link and discovered this project along with these titles.

There are a number of other publications associated with the project that you may find useful. “Modeling, Querying and Mining Uncertain XML Data” is only a chapter out of a larger publication by IGI Global. About what one expects from IGI Global. Cambridge University Press published the title just preceding this chapter and allows download of the entire book for personal use.

I think there is a lot to be learned from this project, even if it has not resulted in a universal framework for web applications that exchange data. I don’t think we are in any danger of universal frameworks on or off the web. And we are better for it.

Guava project

Saturday, December 31st, 2011

Guava project

From the web page:

The Guava project contains several of Google’s core libraries that we rely on in our Java-based projects: collections, caching, primitives support, concurrency libraries, common annotations, string processing, I/O, and so forth.

Something you may find useful in the coming year.

I first saw this in Christophe Lalanne’s A bag of tweets / Dec 2011.

News Cracking 1 : Meet the Editors

Saturday, December 31st, 2011

News Cracking 1 : Meet the Editors

Matthew Hurst writes:

I posted recently about visualizing the relationships between editors and the countries appearing in articles they edit for news articles published by Reuters. I’ve since updated my experimental news aggregation site (now it is intended to eventually be more of a meta-news analysis site) to display only Reuters articles and to extract the names of contributors, including the editors. The overall list of editors is maintained (in the right column) and each editor is displayed with the number of articles observed for which they have attribution. Currently Cynthia Johnston and Tim Dobbyn are at the top of the list.

What do you think about Matthew’s plans for future tracking? Thoughts on how subject identities might/might not be helpful? Comment at Matthew’s blog.

I don’t know if CNN is still this way, since it has been a long time since I have watched it, but it used to repeat the same news over and over every 24-hour cycle. It might be amusing to see how short a summary could be created for some declared 24-hour news cycle. I suppose the only problem would be that if CNN “knew” it was being watched, it would introduce artificial diversity into the newscast.

Still, I suppose one could capture the audio track and using voice recognition software collapse all the repetitive statements, excluding the commercials (or including commercials as well). Maybe I do need a cable TV connection in my home office. 😉
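The “collapse all the repetitive statements” step is easy to sketch once you have text in hand. Here is exact-match deduplication after crude normalization — real broadcast transcripts, with their paraphrases and transcription noise, would need much fuzzier matching:

```python
import re

def collapse_repeats(sentences):
    """Keep only the first occurrence of each statement.
    Normalization (lowercase, strip punctuation) is a crude stand-in
    for the fuzzy matching a real transcript would require."""
    seen, unique = set(), []
    for s in sentences:
        key = re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

cycle = [
    "Markets closed higher today.",
    "Storm warnings remain in effect.",
    "Markets closed higher today.",   # repeated in the next hour's broadcast
    "markets closed higher today",    # same statement, rendered differently
]
print(collapse_repeats(cycle))
```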

Topic Models

Saturday, December 31st, 2011

Topic Models

From the post:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

Just in case you need some starter materials on discovering “topics” (non-topic map sense) in documents.
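If you want to see the machinery rather than just read about it, here is a toy collapsed Gibbs sampler for LDA. It is deliberately minimal, the corpus is invented, and for any real analysis you should reach for an established library (gensim, MALLET, and the like):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA -- for intuition only."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    # Random initial topic for every token, plus the count tables.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]  # remove this token's current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

docs = [["dna", "gene", "cell"], ["gene", "protein", "dna"],
        ["ball", "game", "score"], ["game", "team", "score"]]
topics = lda_gibbs(docs)
for k, counts in enumerate(topics):
    print(k, sorted(counts, key=counts.get, reverse=True)[:3])
```

With luck (and enough iterations) the biology words and the sports words end up in different topics, which is the whole trick.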

Handling Criticism the Michelangelo Way

Saturday, December 31st, 2011

Handling Criticism the Michelangelo Way

From the post:

I had a chance to visit Florence, Italy earlier this month and visited the Galleria dell’Accademia Museum, the home of Michelangelo’s David. The presentation of David was captivating and awe-inspiring. The famous sculpture contained such incredible detail and every chisel and angle contributed to the exact message that the artist wanted to convey. It just worked on so many levels.

As I sat there, I remembered some of the backstory of David. The piece of marble was deemed of very high quality and for a long time awaited its artist and its ultimate use. Eventually, the job landed with Michelangelo and the target was determined to be a young, naked David from the Bible about to go into battle.

Michelangelo preferred to do his work in private and even shielded himself during David to avoid any would-be onlookers. Then one day well into the final product, along came Piero Soderini, an official of some sort who was a sponsor of the work. Soderini, the story goes, commented that the “nose was too thick.” We’d like to think Michelangelo would know better than Soderini about this and that the nose was not really “too thick.” However, it put Michelangelo in a dilemma.

Have you ever had this dilemma?

What was your response? (write it down here)

Now read the original post for how Michelangelo responded.

I am going to try to use the Michelangelo response to criticism in 2012!

How about you?

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora

Saturday, December 31st, 2011

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora (Special topic: Language Resources for Machine Translation in Less-Resourced Languages and Domains)

Dates:

DEADLINE FOR PAPERS: 15 February 2012
Workshop Saturday, 26 May 2012
Lütfi Kirdar Istanbul Exhibition and Congress Centre
Istanbul, Turkey

Some of the information is from: Call for papers. The main conference site does not (yet) have the call for papers posted. Suggest that you verify dates with the conference organizers before making travel arrangements.

From the call for papers:

In the language engineering and the linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves by making possible inter-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form in various degrees and dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.

The scarcity of parallel corpora has motivated research concerning the use of comparable corpora: pairs of monolingual corpora selected according to the same set of criteria, but in different languages or language varieties. Non-parallel yet comparable corpora overcome the two limitations of parallel corpora, since sources for original, monolingual texts are much more abundant than translated texts. However, because of their nature, mining translations in comparable corpora is much more challenging than in parallel corpora. What constitutes a good comparable corpus, for a given task or per se, also requires specific attention: while the definition of a parallel corpus is fairly straightforward, building a non-parallel corpus requires control over the selection of source texts in both languages.

Parallel corpora are a key resource as training data for statistical machine translation, and for building or extending bilingual lexicons and terminologies. However, beyond a few language pairs such as English-French or English-Chinese and a few contexts such as parliamentary debates or legal texts, they remain a scarce resource, despite the creation of automated methods to collect parallel corpora from the Web. To exemplify such issues in a practical setting, this year’s special focus will be on

Language Resources for Machine Translation in Less-Resourced Languages and Domains

with the aim of overcoming the shortage of parallel resources when building MT systems for less-resourced languages and domains, particularly by usage of comparable corpora for finding parallel data within and by reaching out for “hidden” parallel data. Lack of sufficient language resources for many language pairs and domains is currently one of the major obstacles in further advancement of machine translation.

Curious about the use of topic maps in the creation of comparable corpora? Seems like the use of language/domain scopes on linguistic data could result in easier construction of comparable corpora.
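For a rough feel of what “comparable in content” might mean operationally, here is a deliberately naive, same-language comparability score. The real work — bilingual lexicons, dates, named entities, cross-language similarity — is exactly what the workshop is about:

```python
def comparability(doc_a, doc_b):
    """Crude comparability score: Jaccard overlap of vocabularies.
    A same-language toy; cross-language versions need a bilingual
    lexicon or shared named entities."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

news_a = "earthquake strikes istanbul damaging historic buildings"
news_b = "historic buildings damaged as earthquake strikes istanbul"
news_c = "central bank raises interest rates again"
print(comparability(news_a, news_b))  # high overlap
print(comparability(news_a, news_c))  # low overlap
```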

ADMI2012: The Eighth International Workshop on Agents and Data Mining Interaction

Saturday, December 31st, 2011

ADMI2012: The Eighth International Workshop on Agents and Data Mining Interaction

Dates:

Electronic submission of full papers: February 28, 2012
Notification of paper acceptance: April 10, 2012
Camera-ready copies of accepted papers: April 15, 2012
AAMAS-2012 workshop: June 4-5, 2012

From the Call for Papers:

The ADMI workshop provides a premier forum for sharing research and engineering results, as well as potential challenges and prospects encountered in the respective communities and the coupling between agents and data mining. The workshop welcomes theoretical work and applied dissemination aiming to:

  1. exploit agent-enriched data mining and demonstrate how intelligent agent technology can contribute to critical data mining problems in theory and practice;
  2. improve data mining-driven agents and show how data mining can strengthen agent intelligence in research and practical applications;
  3. explore the integration of agents and data mining towards a super-intelligent system;
  4. discuss existing results, new problems, challenges and impact of integration of agent and data mining technologies as applied to highly distributed heterogeneous, including mobile, systems operating in ubiquitous and P2P environments;
  5. identify challenges and directions for future research and development on the synergy between agents and data mining.

See the call for further details.

I almost forgot: June 4-8, 2012, Valencia, Spain

Early summer, Spain? And a conference where subject identity (as in data mining) is going to be discussed? Not sure what more to ask for!

Weecology … has new mammal dataset

Saturday, December 31st, 2011

Weecology … has new mammal dataset

A post on using R with the Weecology data set.

From the post:

So the Weecology folks have published a large dataset on mammal communities in a data paper in Ecology. I know nothing about mammal communities, but that doesn’t mean one can’t play with the data…

Knowing nothing about a data set hasn’t deterred any number of think tanks and government organizations. It does make useful merging more difficult. But if you really don’t know anything about the data set, then one merging is just as good as another. Perhaps you should go to work in one of the 2012 campaigns. 😉

On the other hand, you could learn about the data set and merge it usefully into other data sets for a particular area. Or for use in data-based planning of civic projects.
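The post works in R; for the Python-inclined, the same kind of first look takes a few lines of standard library. The column names and values below are invented for illustration, not the data paper’s actual schema:

```python
import csv
import io
from collections import Counter

# Invented miniature of a site/species abundance table -- the real
# dataset defines its own columns.
raw = io.StringIO("""site,species,count
portal,DM,12
portal,DO,5
cascade,DM,3
cascade,PB,9
""")

# Total individuals observed per site.
per_site = Counter()
for row in csv.DictReader(raw):
    per_site[row["site"]] += int(row["count"])
print(dict(per_site))
```

Knowing what the columns actually mean — the point of the post above — is what makes the grouping worth doing.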

FreeBookCentre.Net

Saturday, December 31st, 2011

FreeBookCentre.Net

Books and online materials on:

  • Computer Science
  • Physics
  • Mathematics
  • Electronics

I just scanned a few of the categories and the coverage isn’t systematic. Still, if you need a text for quick study, the price is right.

Hadoop Hits 1.0!

Friday, December 30th, 2011

Hadoop Hits 1.0!

From the news:

After six years of gestation, Hadoop reaches 1.0.0! This release is from the 0.20-security code line, and includes support for:

  • security
  • HBase (append/hsynch/hflush, and security)
  • webhdfs (with full support for security)
  • performance enhanced access to local files for HBase
  • other performance enhancements, bug fixes, and features

Please see the complete Hadoop 1.0.0 Release Notes for details.

With the release prior to this one being 0.22.0, I was reminded of a publication by the Bulletin of the Atomic Scientists that has a clock on the cover, showing how close or how far away the world is from a nuclear “midnight.” Always counting toward midnight, except for occasions when more time was added. The explanation I remember was that these were nuclear scientists, not clock experts. 😉

I am sure there will be some explanation for the jump in revisions that will pass into folklore and then into publications about Hadoop.

In the meantime, I would suggest that we all download copies and see what 2012 holds with Hadoop 1.0 under our belts.

Sexual Accommodation

Friday, December 30th, 2011

Sexual Accommodation by Mark Liberman.

From the post:

You’ve probably noticed that how people talk depends on who they’re talking with. And for 40 years or so, linguists and psychologists and sociologists have referred to this process as “speech accommodation” or “communication accommodation” — or, for short, just plain “accommodation”. This morning’s Breakfast Experiment™ explores a version of the speech accommodation effect as applied to groups rather than individuals — some ways that men and women talk differently in same-sex vs. mixed-sex conversations.

I got the idea of doing this a couple of days ago, as I was indexing some conversational transcripts in order to find material for an experiment on a completely different topic. The transcripts in question come from a large collection of telephone conversations known as the “Fisher English” corpus, collected at the LDC in 2003 and published in 2004 and 2005. These two publications together comprise 11,699 two-person conversations, involving a diverse collection of speakers. While the sample is not demographically balanced in a strict sense, there is a good representation of speakers from all over the United States, across a wide range of ages, educational levels, occupations, and so forth.

I mention this because if usage varies by gender, doesn’t it also stand to reason that usage (read: identification of subjects) varies by position in an organization?

Anyone who has been in an IT position can attest that conversations inside the IT department use a completely different vocabulary than when addressing people outside the department. For one, the term “idiot” is probably not used with reference to the CEO outside of the IT department. 😉

Capturing the differences in vocabularies could be as useful as any result for an actual topic map, in terms of communication across levels of an organization.

Suggestions for text archives where that sort of difference could be investigated?
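If you wanted to test the IT-department conjecture on such an archive, a first cut is to rank terms by how much more often one group uses them than the other. A toy frequency-ratio sketch with invented “transcripts” (serious work would use something like log-odds with informative priors):

```python
from collections import Counter

def distinctive_terms(group_a, group_b, k=3):
    """Rank terms by how much more frequent they are in group A than in
    group B, with add-one smoothing. A toy stand-in for the weighted
    log-odds measures used in corpus comparisons."""
    ca, cb = Counter(group_a), Counter(group_b)
    vocab = set(ca) | set(cb)
    na, nb = sum(ca.values()), sum(cb.values())
    score = {w: ((ca[w] + 1) / (na + len(vocab)))
                / ((cb[w] + 1) / (nb + len(vocab))) for w in vocab}
    return sorted(vocab, key=score.get, reverse=True)[:k]

# Invented word lists standing in for tokenized conversations.
it_talk = "ticket build deploy server idiot server build".split()
exec_talk = "roadmap synergy quarter stakeholder roadmap".split()
print(distinctive_terms(it_talk, exec_talk))
```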

23 Useful Online HTML5 Tools

Friday, December 30th, 2011

23 Useful Online HTML5 Tools

Just in case you are working on delivery of topic maps using HTML5.

I am curious about the “Are you aware that HTML5 is captivating the web by leaps and bounds?” lead-off line.

Particularly when I read articles like: HTML5: Current progress and adoption rates.

Or the following quote from: HTML5 Adoption Might Hurt Apple’s Profit, Research Finds

The switch from native apps to HTML5 apps will not happen overnight. At the moment, HTML5 apps have some problems that native apps do not. HTML5 apps are typically slower than native apps, which is a particularly important issue for games. An estimated 20 percent of mobile games will most likely never be Web apps, Bernstein said.

Furthermore, there are currently differences in Web browsers across mobile platforms that can raise development costs for HTML5 apps. They can also pose a greater security risk. This can result in restricting access to underlying hardware by handset manufacturers to reduce the possible impact of these risks.

Taking all this into account, Bernstein Research reckoned that HTML5 will mature in the next few years, which will in turn have an impact on Apple’s revenue growth. Nevertheless, the research firm, which itself makes a market in Apple, still recommended investing in the company.

Apple executives are reported to be supporters of HTML5. Which makes sense if you think about it. By the time HTML5 matures enough to be a threat, Apple will have moved on, leaving the HTML5ers to fight over what is left in a diminishing market share. Supporting a technology that makes your competition’s apps slower and less secure makes sense as well.

How are you using HTML5 with topic maps?

Closing the Knowledge Gap:.. (Lessons for TMs?)

Friday, December 30th, 2011

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

From the post:

Cisco Systems set out to build a system that takes the search for knowledge beyond documents into the content of the social network inside the enterprise. The resulting Cisco Pulse platform was built to deliver corporate employees a better understanding of who’s communicating with whom, how, and about what. Working with Lucid Imagination, Cisco turned to open source — specifically, Solr/Lucene technology — as the foundation of the search architecture.

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

You need to read the entire article, but note two things:

  • No uniform vocabulary: Every “organization” created its own.
  • Automatic tagging: Content was automatically tagged (read: users did not tag)

The article doesn’t go into any real depth about the tagging but it is implied that who created the content and other information is getting “tagged” as well.

I read that to mean, in a topic map context, that given a declared vocabulary and automatic tagging, another process could create associations with roles and role players and other topic map constructs without bothering end users about those tasks.
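At its simplest, the vocabulary-based tagging the article describes reduces to intersecting a message’s words with each group’s term set. A sketch in that spirit — the vocabulary, groups, and message are all invented, and Cisco Pulse surely does much more:

```python
def tag_message(text, vocabularies):
    """Tag a message against per-group controlled vocabularies,
    in the spirit of (not fidelity to) the Cisco Pulse approach."""
    words = set(text.lower().split())
    return {group: sorted(words & terms)
            for group, terms in vocabularies.items()
            if words & terms}

# Invented per-team vocabularies.
vocab = {
    "networking": {"router", "switch", "bgp"},
    "storage": {"raid", "disk", "backup"},
}
msg = "The new router config broke the nightly backup"
print(tag_message(msg, vocab))
```

A downstream process could turn each hit into a topic map association (message, group, matched term) without any user involvement — the point made above.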

Not to mention that declaring equivalents between tags as part of the reading/discovery process might be limited to some but not all users.

An incremental or perhaps even evolving authoring of a topic map.

Rather than a dead-tree resource, delivered as a fait accompli, a topic map can change as new information or new views of existing/new information are added to the map. (A topic map doesn’t have to be so useful. It can be the equivalent of a dead-tree resource if you really want.)

Solr Reference Guide 3.4

Friday, December 30th, 2011

Solr Reference Guide 3.4

From the post:

The material as presented assumes that you’re familiar with some basic search concepts and that you can read XML; it does not assume that you are a Java programmer, although knowledge of Java is helpful when working directly with Lucene or when developing custom extensions to a Lucene/Solr installation.

Key topics covered in the Reference Guide include:

  • Getting Started: Installing Solr and getting it running for the first time.
  • Using the Solr Admin Web Interface: How to use the built-in UI.
  • Documents, Fields, and Schema Design: Designing the index for optimal retrieval.
  • Understanding Analyzers, Tokenizers, and Filters: Setting up Solr to handle your content.
  • Indexing and Basic Data Operations: Indexing your content.
  • Searching: Ways to improve the search experience for your users.
  • The Well Configured Solr Instance: Optimal settings to keep the system running smoothly.
  • Managing Solr: Web containers, logging and backups.
  • Scaling and Distribution: Best practices for increasing system capacity.
  • Client APIs: Clients that can be used to provide search interfaces for users.

The guide is available online or as a download.

BTW, have you seen any books on Solr that you like? The reviews I have seen don’t look promising.

Why Not AND, OR, And NOT?

Friday, December 30th, 2011

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Required reading if you want to understand how the “Boolean Operators” work in Lucene/Solr, and to see a superior alternative.
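For quick reference, here is the usual mapping between the keyword operators and the prefix operators, assuming the default OR operator — the post explains the parsing subtleties that make the keyword forms misleading:

```
# Illustrative Lucene/Solr query strings (not a runnable program).
# Keyword form              Prefix form           Meaning (default OR operator)
apache AND lucene     =>    +apache +lucene       both terms required
apache OR lucene      =>    apache lucene         either term suffices
apache NOT solr       =>    apache -solr          exclude documents with "solr"
```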

Apache Mahout user meeting – session slides and videos are now available!

Friday, December 30th, 2011

Apache Mahout user meeting – session slides and videos are now available!

From the post:

The first San Francisco Apache Mahout user meeting was held on November 29th, 2011 at Lucid Imagination headquarters in Redwood City. The 3-hour session hosted 2 talks followed by networking, food and drinks.

Session topics –

  • “Using Mahout to cluster, classify and recommend, plus a demonstration of using scripts packaged with Mahout” by Grant Ingersoll from Lucid Imagination.
  • “How using random projection in Machine learning can benefit performance without sacrificing quality” by Ted Dunning from MapR Technologies.

Sharpening your Mahout skills is never a bad idea!

LucidWorks Enterprise 2.0.1 Release

Friday, December 30th, 2011

LucidWorks Enterprise 2.0.1 Release

From the post:

LucidWorks Enterprise 2.0.1 is an interim bug-fix release. We have resolved a couple of critical bugs and LDAP integration issues. The list of issues resolved with this update is available here.

Explorations in Parallel Distributed Processing:..

Friday, December 30th, 2011

Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises by James L. McClelland.

From Chapter 1, Introduction:

Several years ago, Dave Rumelhart and I first developed a handbook to introduce others to the parallel distributed processing (PDP) framework for modeling human cognition. When it was first introduced, this framework represented a new way of thinking about perception, memory, learning, and thought, as well as a new way of characterizing the computational mechanisms for intelligent information processing in general. Since it was first introduced, the framework has continued to evolve, and it is still under active development and use in modeling many aspects of cognition and behavior.

Our own understanding of parallel distributed processing came about largely through hands-on experimentation with these models. And, in teaching PDP to others, we discovered that their understanding was enhanced through the same kind of hands-on simulation experience. The original edition of the handbook was intended to help a wider audience gain this kind of experience. It made many of the simulation models discussed in the two PDP volumes (Rumelhart et al., 1986; McClelland et al., 1986) available in a form that is intended to be easy to use. The handbook also provided what we hoped were accessible expositions of some of the main mathematical ideas that underlie the simulation models. And it provided a number of prepared exercises to help the reader begin exploring the simulation programs.

The current version of the handbook attempts to bring the older handbook up to date. Most of the original material has been kept, and a good deal of new material has been added. All of simulation programs have been implemented or re-implemented within the MATLAB programming environment. In keeping with other MATLAB projects, we call the suite of programs we have implemented the PDPTool software.

Latest revision (Sept. 2011) is online for your perusal. A good way to develop an understanding of parallel processing.
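The handbook’s simulators are MATLAB, but the core idea of a PDP pattern associator fits in a few lines of any language. Here is a toy Hebbian associator — my sketch, not the handbook’s code — mapping two input patterns to two output patterns:

```python
def train_hebbian(pairs, n_in, n_out, lr=1.0):
    """Hebbian learning: accumulate w[j][i] += lr * input[i] * target[j]."""
    w = [[0.0] * n_in for _ in range(n_out)]
    for x, t in pairs:
        for j in range(n_out):
            for i in range(n_in):
                w[j][i] += lr * x[i] * t[j]
    return w

def recall(w, x):
    """Threshold each output unit's weighted sum to +1/-1."""
    return [1 if sum(wj[i] * x[i] for i in range(len(x))) >= 0 else -1
            for wj in w]

# Two orthogonal +1/-1 input patterns mapped to distinct outputs.
a_in, a_out = [1, -1, 1, -1], [1, -1]
b_in, b_out = [1, 1, -1, -1], [-1, 1]
w = train_hebbian([(a_in, a_out), (b_in, b_out)], 4, 2)
print(recall(w, a_in), recall(w, b_in))
```

Orthogonal inputs are recalled perfectly; overlapping inputs interfere — exactly the kind of behavior the handbook’s exercises have you explore.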

Apologies for not seeing this before Christmas. Please consider it an early present for your birthday in 2012!

Spectral Graph Theory

Friday, December 30th, 2011

Spectral Graph Theory by Fan R. K. Chung.

A developing area of mathematics that may be important for high dimensional data mining. Relevant to the spectral feature selection post from yesterday.

You can see the first four revised chapters and the bibliography at: Spectral Graph Theory (revised and improved)
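To get a feel for why the spectrum matters: Chung works mostly with the normalized Laplacian, but even the simpler combinatorial Laplacian L = D − A exposes graph structure. A small sketch (numpy assumed) using the classic fact that the multiplicity of eigenvalue 0 counts connected components:

```python
import numpy as np

def laplacian(adj):
    """Combinatorial Laplacian L = D - A for an undirected graph."""
    a = np.asarray(adj, dtype=float)
    return np.diag(a.sum(axis=1)) - a

# Two triangles with no edges between them: a disconnected graph.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    adj[u, v] = adj[v, u] = 1

eigvals = np.linalg.eigvalsh(laplacian(adj))
# Multiplicity of the zero eigenvalue = number of connected components.
print(int(np.sum(np.isclose(eigvals, 0))))
```

Spectral clustering and spectral feature selection build on exactly this connection between eigenvalues/eigenvectors and graph structure.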

Top 50 JavaScript, jQuery Plugins and Tutorials From 2011

Thursday, December 29th, 2011

Top 50 JavaScript, jQuery Plugins and Tutorials From 2011

From the post:

jQuery is amazing as you can find plugins to accomplish almost anything you want. It makes your work easy and quick. And Javascript has always been a favorite of people. We have a list of top jQuery, Javascript plugins and tutorials that will help you a lot and will definitely make your life easier.

I hope you would find these apps useful. It includes apps that will let you create loading animation, which help make your website responsive to user interaction as they would know that the server is processing their request. Then we have apps that will help you optimize your website for mobile phones. We also have some awesome Javascript color picker plugins and JavaScript Games too. Not only that, we have some really cool Javascript Experiments and jQuery tutorials for you, which would help you learn tricks that you might not have known before. So what are you waiting for?! Check them out!

You will have to let me know about optimizing the web interface of a topic map for a cell phone. Some of the others? I will try to report back to you. 😉

International Journal on Natural Language Computing (IJNLC)

Thursday, December 29th, 2011

International Journal on Natural Language Computing (IJNLC)

Dates:

Submission deadline : 25 January, 2012

Acceptance notification: 25 February, 2012
Final manuscript due : 28 February, 2012
Publication date : determined by the Editor-in-Chief

Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate languages that humans use naturally to address computers.

Apparently this is the first issue of a new journal on natural language processing. Every journal has a first issue, so please consider contributing something strong to this one.

Linked Data Paradigm Can Fuel Linked Cities

Thursday, December 29th, 2011

Linked Data Paradigm Can Fuel Linked Cities

The small city of Cluj in Romania, of some half-million inhabitants, is responsible for a 2.5 million triple store, as part of a Recognos-led project to develop a “Linked City” community portal. The project was submitted for this year’s ICT Call – SME initiative on Digital Content and Languages, FP7-ICT-2011-SME-DCL. While it didn’t receive funding from that competition, Recognos semantic web researcher Dia Miron, is hopeful of securing help from alternate sources in the coming year to expand the project, including potentially bringing the concept of linked cities to other communities in Romania or elsewhere in Europe.

The idea was to publish information from sources such as local businesses about their services and products, as well as data related to the local government and city events, points of interest and projects, using the Linked Data paradigm, says Miron. Data would also be geolocated. “So we take all the information we can get about a city so that people can exploit it in a uniform manner,” she says.

The first step was to gather the data and publish it in a standard format using RDF and OWL; the next phase, which hasn’t taken place yet (it’s funding-dependent), is to build exportation channels for the data. “First we wanted a simple query engine that will exploit the data, and then we wanted to build a faceted search mechanism for those who don’t know the data structure to exploit and navigate through the data,” she says. “We wanted to make it easier for someone not very acquainted with the models. Then we wanted also to provide some kind of SMS querying because people may not always be at their desks. And also the final query service was an augmented reality application to be used to explore the city or to navigate through the city to points of interest or business locations.”

Local Cluj authorities don’t have the budgets to support the continuation of the project on their own, but Miron says the applications will be very generic and can easily be transferred to support other cities, if they’re interested in helping to support the effort. Other collaborators on the project include Ontotext and STI Innsbruck, as well as the local Cluj council.

I don’t doubt this information would be useful, but is this delivery model one that will actually work for users, assuming it gets funded, whether here or elsewhere?

How hard do users actually work at searching? See Keyword and Search Engines Statistics for a country-by-country idea.

Some users can be trained to perform fairly complex searches but I suspect that is a distinct minority. And the type of searches that need to be performed vary by domain.

For example, earlier today, I was searching for information on “spectral graph theory,” which I suspect has different requirements than searching for 24-hour sushi bars within a given geographic area.

I am not sure how to isolate those different requirements, much less test how close any approach is to satisfying them, but I do think both areas merit serious investigation.

Read-through of ‘Gödel, Escher, Bach’

Thursday, December 29th, 2011

Read-through of ‘Gödel, Escher, Bach’

A read through of ‘Gödel, Escher, Bach’ by Douglas R. Hofstadter, starting 17 January 2012.

Are there other off-line or electronic books you would suggest as “read through” candidates?

Web scraping with Python – the dark side of data

Thursday, December 29th, 2011

Web scraping with Python – the dark side of data

From the post:

In searching for some information on web-scrapers, I found a great presentation given at Pycon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites.

“..user-unfriendly websites.”? What about “user-hostile websites?” 😉

Looks like a good presentation up to “user-unfriendly.”

It will be useful for anyone who needs data from sites that are not configured to deliver it properly (that is, to users).
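For pages that bury their data in HTML tables, even the standard library is enough. A minimal sketch, using a stand-in HTML string where a real scraper would fetch a page with `urllib.request` (the city names and numbers are made up):

```python
# Minimal table scraper using only the standard library.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <table id="data">
    <tr><td>Amsterdam</td><td>820000</td></tr>
    <tr><td>Cluj</td><td>325000</td></tr>
  </table>
</body></html>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [['Amsterdam', '820000'], ['Cluj', '325000']]
```

For anything messier than this, the libraries covered in the presentation (lxml, BeautifulSoup, mechanize) are the better tools; the point here is just how little machinery the basic extraction step needs.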

I suppose “user-hostile” would fall under some prohibited activity.

Would make a great title for a book: “Penetration and Mapping of Hostile Hosts.” Could map vulnerable hosts, along with their exploits, as a network graph.

Spectral Feature Selection for Data Mining

Thursday, December 29th, 2011

Spectral Feature Selection for Data Mining by Zheng Alan Zhao and Huan Liu.

I did not find the publisher’s description all that helpful.

You may want to review:

The supplemental page maintained by the authors, Spectral Feature Selection for Data Mining. There you will also find per-chapter source code in MATLAB format and some other materials.

Earlier work by the authors, see:

Spectral feature selection for supervised and unsupervised learning (2007) by Zheng Zhao and Huan Liu.

Slow going but the early work appears to hold a great deal of promise.
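To make the idea concrete: one of the spectral criteria in this line of work is the Laplacian score, which ranks a feature by how smoothly it varies over a similarity graph of the samples. A minimal pure-Python sketch, with an invented 0/1 neighborhood graph as the similarity matrix:

```python
# Hedged sketch of the Laplacian score: lower score means the feature
# respects the local structure encoded in the similarity graph W.
def laplacian_scores(X, W):
    """X: list of samples (rows of features); W: symmetric similarity matrix."""
    n = len(X)
    d = [sum(W[i]) for i in range(n)]  # degree of each sample
    scores = []
    for r in range(len(X[0])):
        f = [X[i][r] for i in range(n)]
        # Remove the degree-weighted mean so the trivial constant
        # eigenvector of the Laplacian drops out.
        mu = sum(f[i] * d[i] for i in range(n)) / sum(d)
        ft = [v - mu for v in f]
        # f~^T L f~ = (1/2) * sum_ij W_ij (f~_i - f~_j)^2
        num = sum(W[i][j] * (ft[i] - ft[j]) ** 2
                  for i in range(n) for j in range(n)) / 2.0
        # f~^T D f~ = sum_i d_i f~_i^2
        den = sum(d[i] * ft[i] ** 2 for i in range(n))
        scores.append(num / den if den else float("inf"))
    return scores

# Two features over four points; points 0-1 and 2-3 are graph neighbors.
# Feature 0 separates the two pairs cleanly; feature 1 is noise.
X = [[1.0, 0.3], [1.1, 0.9], [5.0, 0.4], [5.1, 0.8]]
W = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
print(laplacian_scores(X, W))  # feature 0 scores much lower than feature 1
```

The book generalizes well beyond this one score, but the shape of the computation — graph, Laplacian, per-feature quotient — carries through.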

If you have or get a copy of the book, please forward or point to your comments.

Backbone of the flavor network

Thursday, December 29th, 2011

Backbone of the flavor network by Nathan Yau at FlowingData.

From the post:

Food flavors across cultures and geography vary a lot. Some cuisines use a lot of scallion and ginger, whereas another might use a lot of onion and butter. Then again, everyone seems to use garlic. Yong-Yeol Ahn, et al. took a closer look at what makes food taste different, breaking ingredients into flavor compounds and examining what the ingredients had in common. A flavor network was the result:

Each node denotes an ingredient, the node color indicates food category, and node size reflects the ingredient prevalence in recipes. Two ingredients are connected if they share a significant number of flavor compounds, link thickness representing the number of shared compounds between the two ingredients. Adjacent links are bundled to reduce the clutter.

Mushrooms and liver are on the edges, out on their lonesome.

You really need to see the graph/network with the post.

I am not ready to throw over my Julia Child cookbook in favor of using the network to create recipes, but it is an impressive piece of work.

It could certainly serve as data merged into a recipe topic map to explain, or even suggest, possible ingredient substitutions.
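The core construction is small enough to sketch. The compound lists below are illustrative stand-ins, not the paper's actual flavor-compound data, and the threshold of two shared compounds is arbitrary:

```python
# Sketch of the flavor-network construction: two ingredients get an
# edge when they share at least min_shared flavor compounds.
from itertools import combinations

COMPOUNDS = {  # hypothetical compound sets, for illustration only
    "garlic":   {"allicin", "diallyl disulfide", "sulfur dioxide"},
    "onion":    {"allicin", "sulfur dioxide", "propanethial"},
    "mushroom": {"octenol", "lenthionine"},
    "butter":   {"diacetyl", "butyric acid"},
    "cheese":   {"diacetyl", "butyric acid", "methyl ketone"},
}

def flavor_edges(compounds, min_shared=2):
    """Return (a, b, n_shared) for pairs sharing at least min_shared compounds."""
    edges = []
    for a, b in combinations(sorted(compounds), 2):
        shared = compounds[a] & compounds[b]
        if len(shared) >= min_shared:
            edges.append((a, b, len(shared)))
    return edges

print(flavor_edges(COMPOUNDS))
# [('butter', 'cheese', 2), ('garlic', 'onion', 2)]
```

Note that even in this toy version, mushroom ends up isolated on the edge of the network — the same behavior the full graph shows for mushrooms and liver.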

Any foodies in the house?

400 Free Online Courses from Top Universities

Wednesday, December 28th, 2011

400 Free Online Courses from Top Universities

Just in case hard core math/cs stuff isn’t your cup of tea or you want to write topic maps about some other area of study, this may be a resource for you.

Oddly enough (?), every listing of free courses seems to differ from every other such listing.

If you happen to run across seminar lectures (graduate school) on Ancient or Medieval philosophy, drop me a line. Or even better, on individual figures.

I first saw this linked on John Johnson’s Realizations in Biostatistics. John was pointing to the statistics/math courses but there is a wealth of other material as well.

Visualizing 4+ Dimensions

Wednesday, December 28th, 2011

Visualizing 4+ Dimensions

From the post:

When people realize that I study pure math, they often ask about how to visualize four or more dimensions. I guess it’s a natural question to ask, since mathematicians often have to deal with very high (and sometimes infinite) dimensional objects. Yet people in pure math never really have this problem.

Pure mathematicians might like you to think that they’re just that much smarter. But frankly, I’ve never had to visualize anything high-dimensional in my pure math classes. Working things out algebraically is much nicer, and using a lower-dimensional object as an example or source of intuition usually works out — at least at the undergrad level.

But that’s not a really satisfying answer, for two reasons. One is that it is possible to visualize high-dimensional objects, and people have developed many ways of doing so. Dimension Math has on its website a neat series of videos for visualizing high-dimensional geometric objects using stereographic projection. The other reason is that while pure mathematicians do not have a need for visualizing high-dimensions, statisticians do. Methods of visualizing high dimensional data can give useful insights when analyzing data.
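The stereographic projection mentioned above is easy to sketch for the 4D case: a point on the unit 3-sphere in R^4 (other than the pole at w = 1) maps down to R^3 by dividing the first three coordinates by 1 − w:

```python
# Stereographic projection from the 3-sphere in R^4 to R^3:
# (x, y, z, w) on the unit sphere, w != 1, maps to (x, y, z) / (1 - w).
def stereographic_4d(p):
    x, y, z, w = p
    if abs(1 - w) < 1e-12:
        raise ValueError("projection undefined at the pole w = 1")
    s = 1.0 / (1 - w)
    return (x * s, y * s, z * s)

# The "south pole" (0, 0, 0, -1) lands at the origin of R^3.
print(stereographic_4d((0.0, 0.0, 0.0, -1.0)))  # (0.0, 0.0, 0.0)
# A point on the equator (w = 0) maps to itself.
print(stereographic_4d((1.0, 0.0, 0.0, 0.0)))   # (1.0, 0.0, 0.0)
```

Points near the excluded pole are flung far from the origin, which is exactly the distortion the videos exploit to make 4D structure visible in 3D.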

This is an important area for study, but not only because identifications can consist of values in multiple dimensions.

It is important because the recognition of an identifier can also consist of values spread across multiple dimensions.

More on that second statement before year’s end (so you don’t have to wait very long, just until holiday company leaves).

I first saw this in Christophe Lalanne’s A bag of tweets / Dec 2011.

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations

Wednesday, December 28th, 2011

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations by Ryan K. Dale, Brent S. Pedersen and Aaron R. Quinlan.

Abstract:

Summary: pybedtools is a flexible Python software library for manipulating and exploring genomic datasets in many common formats. It provides an intuitive Python interface that extends upon the popular BEDTools genome arithmetic tools. The library is well documented and efficient, and allows researchers to quickly develop simple, yet powerful scripts that enable complex genomic analyses.

From the documentation:

Formats with different coordinate systems (e.g. BED vs GFF) are handled with uniform, well-defined semantics described in the documentation.

Starting to sound like HyTime, isn’t it? Transposition between coordinate systems.

If you venture into this area with a topic map, something to keep in mind.
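The BED-versus-GFF difference that pybedtools papers over is worth seeing once in the open: BED intervals are 0-based and half-open, GFF intervals are 1-based and fully closed. A sketch of the conversion (the coordinates are illustrative, not pybedtools API calls):

```python
# BED: 0-based, half-open [start, end).  GFF: 1-based, closed [start, end].
def bed_to_gff(start, end):
    """(0-based, half-open) -> (1-based, closed)."""
    return start + 1, end

def gff_to_bed(start, end):
    """(1-based, closed) -> (0-based, half-open)."""
    return start - 1, end

# The first 100 bases of a chromosome in each convention:
print(bed_to_gff(0, 100))   # (1, 100)
print(gff_to_bed(1, 100))   # (0, 100)
```

Uniform semantics across these conventions is precisely the "transposition between coordinate systems" problem, whether the addresses are genomic bases or HyTime quanta.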

I first saw this in Christophe Lalanne’s A bag of tweets / Dec 2011.