Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 27, 2012

Data and visualization blogs worth following

Filed under: Data,Graphics,Visualization — Patrick Durusau @ 6:11 pm

Data and visualization blogs worth following

Nathan Yau has posted a list of 38 blogs he follows, covering:

  • Design and Aesthetics
  • Statistical and Analytical Visualization
  • Journalism
  • General Visualization
  • Maps
  • Data and Statistics

Thought you would enjoy a weekend of updating your blog readers!

Scout, in Open Beta

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 6:11 pm

Scout, in Open Beta

Eric Mill writes:

Scout is an alert system for the things you care about in state and national government. It covers Congress, regulations across the whole executive branch, and legislation in all 50 states.

You can set up notifications for new things that match keyword searches. Or, if you find a particular bill you want to keep up with, we can notify you whenever anything interesting happens to it — or is about to.

Just to emphasize, this is a beta – it functions well and looks good, but we’re really hoping to hear from the community on how we can make it stronger. You can give us feedback by using the Feedback link at the top of the site, or by writing directly to scout@sunlightfoundation.com.

Legal terminology variation among the states, plus the feds, is going to make keyword searches iffy.

The variation will differ by area of law: greatest in family and criminal law, least in some parts of commercial law.

Anyone know if there is a cross-index of terminology between the legal systems of the states?
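To make the idea concrete, here is a minimal sketch (in Python) of what such a cross-index might look like as a keyword-expansion table for an alert service like Scout. Every term and jurisdiction below is illustrative, not a real mapping.

```python
# Hypothetical sketch: a tiny cross-index of legal terminology, used to
# expand a keyword alert so it matches equivalent terms across jurisdictions.
# The terms and jurisdictions below are illustrative, not a real mapping.

CROSS_INDEX = {
    "spousal support": {
        "california": ["spousal support"],
        "new york": ["maintenance"],
        "texas": ["spousal maintenance"],
        "federal": ["alimony"],   # e.g., in tax contexts
    },
}

def expand_query(term, jurisdictions=None):
    """Return the set of search keywords treated as equivalent to `term`."""
    variants = CROSS_INDEX.get(term.lower(), {})
    keywords = set()
    for jurisdiction, terms in variants.items():
        if jurisdictions is None or jurisdiction in jurisdictions:
            keywords.update(terms)
    return keywords or {term}

if __name__ == "__main__":
    print(expand_query("spousal support"))
    # {'spousal support', 'maintenance', 'spousal maintenance', 'alimony'}
```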

Leveson Inquiry: all the evidence relating to Rupert Murdoch

Filed under: News — Patrick Durusau @ 6:11 pm

Leveson Inquiry: all the evidence relating to Rupert Murdoch

John Burn-Murdoch writes:

Rupert Murdoch has been appearing this week at the Leveson Inquiry into the culture, practice and ethics of the press. Anyone who has been following the proceedings will know that references to various exhibits of evidence are made throughout, and without knowing what each document is it can be difficult to fully appreciate what is being said.

To make the process a little easier, we have catalogued these documents – to date there are over 100 – and you can browse them in pdf format by using the table of links below.

Just the sort of thing you or the INS needs to kick off a topic map on Rupert Murdoch.

BTW, be sure to give the Guardian a shout-out in your work. They do good work, if a tad conservative.

Harvard Library releases big data for its books

Filed under: Books,Library — Patrick Durusau @ 6:11 pm

Harvard Library releases big data for its books

Audrey Watters writes in part:

Harvard University announced this week that it would make more than 12 million catalog records from its 73 libraries publicly available. These records contain bibliographic information about books, manuscripts, maps, videos, and audio recordings. The Harvard Library is making these records available under a Creative Commons 0 license, in accordance with its Open Metadata Policy.

In MARC21 format, these records should lend themselves to a number of interesting uses.

I have always been curious about semantic drift across generations of librarians for subject headings.

Did we as users “learn” the cataloging of particular collections?

How can we recapture that “learning” in a topic map?
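As a first, very rough cut at that sort of question, here is a sketch using the pymarc library to tally topical subject headings (MARC field 650) across a file of MARC21 records. The file name is a placeholder, not the actual Harvard distribution.

```python
# Sketch: tally topical subject headings (MARC field 650, subfield $a)
# across a file of MARC21 records, for a first look at how headings are
# distributed. "records.mrc" is a placeholder file name.
from collections import Counter
from pymarc import MARCReader

headings = Counter()
with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:          # skip records pymarc could not parse
            continue
        for field in record.get_fields("650"):
            subject = field.get_subfields("a")
            if subject:
                headings[subject[0].strip(" .")] += 1

for heading, count in headings.most_common(20):
    print(f"{count:6d}  {heading}")
```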

Parallel Language Corpus Hunting?

Filed under: Corpora,EU,Language,Linguistics — Patrick Durusau @ 6:11 pm

Parallel language corpus hunters, particularly in legal informatics, can rejoice!

[A] parallel corpus of all European Union legislation, called the Acquis Communautaire, translated into all 22 languages of the EU nations — has been expanded to include EU legislation from 2004-2010…

If you think semantic impedance in one language is tough, step up and try that across twenty-two (22) languages.

Of course, these countries share something of a common historical context. Imagine the gulf when you move up to languages from other historical contexts.

See: DGT-TM-2011, Parallel Corpus of All EU Legislation in Translation, Expanded to Include Data from 2004-2010 for links and other details.
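If you do go corpus hunting: DGT translation memories are distributed as TMX files, so, assuming that format (and hedging on the exact language codes used in the files), a minimal extraction of aligned segment pairs can be done with the Python standard library along these lines.

```python
# Sketch: extract aligned segment pairs from a TMX translation memory file.
# "sample.tmx" and the language codes are assumptions; check the actual files.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def aligned_pairs(path, src="EN", tgt="FR"):
    tree = ET.parse(path)
    for tu in tree.iter("tu"):                 # one translation unit per segment
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang", "")).upper()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = seg.text
        src_hits = [v for k, v in segs.items() if k.startswith(src)]
        tgt_hits = [v for k, v in segs.items() if k.startswith(tgt)]
        if src_hits and tgt_hits:
            yield src_hits[0], tgt_hits[0]

if __name__ == "__main__":
    for en, fr in aligned_pairs("sample.tmx"):
        print(en, "|||", fr)
```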

Apollo and Gemini Computing Systems

Filed under: Space Data — Patrick Durusau @ 6:10 pm

Apollo and Gemini Computing Systems

Ronald Burkey writes:

Here you’ll find a collection of all the AGC, AGS, LVDC, and Gemini spacecraft computer documentation and software that I’ve managed to find whilst working on Virtual AGC. Every document on this page is archived here at Virtual AGC, regardless of whether it originated here or not. In the early days I used to include only material I uncovered by my own efforts, but there have increasingly been contributions by readers, including some of the original AGC developers. And there’s material here that has been duplicated from other Apollo-centric websites for your convenience; see the FAQ page for a list of the fine Apollo and Gemini websites I raided. Now, there is some value-added in this process, since I add searchable text to those PDFs which are image-only, as well as adding metadata and bookmark panes where they don’t exist. My intention is to eventually provide one-stop-shopping for all of your Apollo and Gemini computing-system documentation needs. Note however, that I choose to duplicate only scanned or photographic images of the original documents. In other words, I provide something as close to the “real thing” as I can. On some sites, notably the Apollo Flight Journal and Apollo Lunar Surface Journal, great pains have been taken to produce HTML forms of the documents. I do not duplicate those improved reformulations here, because that’s original work for which I think credit is due; so you will have to visit those sites to use those improved versions.

Awesome data set!

OK, I admit to a bit of nostalgia because I grew up watching these and earlier space flights.

Indexing and mapping the terminology of these documents would make an interesting project.

To say nothing of comparing the terminology here to later space efforts.

ArcSpread for analyzing web archives

Filed under: Archives — Patrick Durusau @ 6:10 pm

ArcSpread for analyzing web archives

Pete Warden writes:

Stanford runs a fantastic project for capturing important web pages as they change over time, and then presenting the results in a form that future historians will be able to use. This paper talks about some of the techniques they use for removing boilerplate navigation and ad content, so that researchers can work with the meat of the page.

I was relieved to read:

We did not excise any advertising images from the presented pages, but asked participants to disregard advertising related images.

Poorly done digital newspaper archives remove advertising content on a “meat of the page” theory.

Researchers then cannot see what was advertised, how, and at what prices. Ads may not interest us, but they may interest others.

At one time thousands, if not hundreds of thousands, of people knew how the Egyptian pyramids were built.

So commonly known it was not written down.

Perhaps there is a lesson there for us.

20 free R tutorials (and one reference card)

Filed under: R — Patrick Durusau @ 6:10 pm

20 free R tutorials (and one reference card)

David Smith gives some quick notes on a listing of 20 free (university based) R tutorials.

Nothing new, but the tutorials span subject areas from climate to bioinformatics to econometrics.

Shop around a bit for relevant (to you) examples.

Making Search Hard(er)

Filed under: Identity,Searching,Semantics — Patrick Durusau @ 6:10 pm

Rafael Maia posts:

first R, now Julia… are programmers trying on purpose to come up with names for their languages that make it hard to google for info? 😛

I don’t know that two cases prove that programmers are responsible for all the semantic confusion in the world.

A search for FORTRAN produces FORTRAN Formula Translation/Translator.

But compare COBOL:


COBOL Common Business-Oriented Language
COBOL Completely Obsolete Business-Oriented Language 🙂
COBOL Completely Over and Beyond Obvious Logic 🙂
COBOL Compiles Only By Odd Luck 🙂
COBOL Completely Obsolete Burdensome Old Language 🙂

There may be something to the idea of programmers peeing in the semantic pool.

On the other hand, there are examples prior to programming of semantic overloading of strings.

Here is an interesting question:

Is a string overloaded, semantically speaking, when used or read?

Does your answer impact how you would build a search engine? Why/why not?

And Now, For Really Big Data: 550 billion particles

Filed under: Astroinformatics,BigData — Patrick Durusau @ 6:09 pm

Want big data? Really big data? Consider the following description from 550 billion particles:

The amounts involved in this simulation are simply mindboggling: 92 000 CPUs, 150 PBytes of data, 2 (U.S.) quadrillion flops (2 PFlop/s), the equivalent of 30 million computing hours, each particle has the size of the Milky Way, and so on…

Data from the DEUS (Dark Energy Universe Simulation) project is freely available.

April 26, 2012

Simple tools for building a recommendation engine

Filed under: Dataset,R,Recommendation — Patrick Durusau @ 6:31 pm

Simple tools for building a recommendation engine by Joseph Rickert.

From the post:

Revolution’s resident economist, Saar Golde, is very fond of saying that “90% of what you might want from a recommendation engine can be achieved with simple techniques”. To illustrate this point (without doing a lot of work), we downloaded the million row movie dataset from www.grouplens.org with the idea of just taking the first obvious exploratory step: finding the good movies. Three zipped up .dat files comprise this data set. The first file, ratings.dat, contains 1,000,209 records of UserID, MovieID, Rating, and Timestamp for 6,040 users rating 3,952 movies. Ratings are whole numbers on a 1 to 5 scale. The second file, users.dat, contains the UserID, Gender, Age, Occupation and Zip-code for each user. The third file, movies.dat, contains the MovieID, Title and Genre associated with each movie.
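Rickert works the example in R; purely as an illustration of how little code the first step takes, here is roughly the same “find the good movies” pass in Python/pandas, assuming the “::”-delimited layout described above. This is not Rickert’s code.

```python
# Sketch: top-rated movies from the MovieLens 1M files, assuming the
# '::'-delimited layout described in the post. Not Rickert's R code.
import pandas as pd

ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["UserID", "MovieID", "Rating", "Timestamp"])
movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["MovieID", "Title", "Genre"],
                     encoding="latin-1")   # encoding is an assumption

# Mean rating per movie, ignoring rarely rated titles.
stats = ratings.groupby("MovieID")["Rating"].agg(["mean", "count"])
popular = stats[stats["count"] >= 1000].sort_values("mean", ascending=False)

top = popular.join(movies.set_index("MovieID")["Title"]).head(10)
print(top)
```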

Curious, if a topic map engine performed 90% of the possible merges in a topic map, would that be enough?

Would your answer differ if the topic map had less than 10,000 topics and associations versus a topic map with 100 million topics and associations?

Would your answer differ based on a timeline of the data? Say the older the data, the less reliable the merging: recent medical data (up to ten years old) at under a 1% error rate; ten to twenty years old at up to a 10% error rate; more than twenty years old, best efforts. Which of course raises the question of how you would test for conformance to such requirements?

The Shades of Time Project

Filed under: Data,Dataset,Diversity — Patrick Durusau @ 6:31 pm

The Shades of TIME project by Drew Conway.

Drew writes:

A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March, 1923 to March, 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set I was left wondering, “What can I ask of the data?”

After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?

I developed a data visualization tool, I’m calling the Shades of TIME, to explore the answer to that question.
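Conway’s tool is the thing to explore, but to make the measurement concrete, here is a crude sketch of the kind of computation involved: average the colour of the central region of each cover image and group by year. A real analysis would detect faces first; the file layout assumed here is invented.

```python
# Crude sketch of the measurement behind the question: average colour of the
# central region of each cover image, grouped by year. A real analysis would
# detect faces first; the covers/YYYY-MM-DD.jpg layout is an assumption.
import glob, os
import numpy as np
from PIL import Image

def center_mean(path, frac=0.4):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    h, w, _ = img.shape
    dh, dw = int(h * frac / 2), int(w * frac / 2)
    center = img[h // 2 - dh: h // 2 + dh, w // 2 - dw: w // 2 + dw]
    return center.reshape(-1, 3).mean(axis=0)

by_year = {}
for path in glob.glob("covers/*.jpg"):
    year = os.path.basename(path)[:4]          # assumes YYYY-MM-DD.jpg names
    by_year.setdefault(year, []).append(center_mean(path))

for year in sorted(by_year):
    r, g, b = np.mean(by_year[year], axis=0)
    print(f"{year}: mean RGB = ({r:.0f}, {g:.0f}, {b:.0f})")
```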

An interesting data set and an illustration of why topic map applications are more useful if they have dynamic merging (user selected).

Presented with the same evidence (the covers of TIME magazine), I most likely would have:

  • Mapped people on the covers to historical events
  • Mapped people on the covers to additional historical resources
  • Mapped covers into library collections
  • etc.

I would not have set out to explore the diversity in skin color on the covers. In part because I remember when it changed. That is part of my world knowledge. I don’t have to go looking for evidence of it.

My purpose isn’t to say authors, even topic map authors, should avoid having a point of view. That isn’t possible in any event. What I am suggesting is that, to the extent possible, users should be enabled to impose their own views on a topic map as well.

Git: the NoSQL Database

Filed under: Git,NoSQL — Patrick Durusau @ 6:31 pm

Git: the NoSQL Database

Brandon Keepers has a nice slide deck on using Git as a NoSQL database.

If you have one of his use cases, consider Git.
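Not from Keepers’ deck, but a minimal illustration of the underlying idea: Git’s object store is already a content-addressed key-value store that you can drive from the plumbing commands.

```python
# Minimal illustration (not from the slide deck): using Git's plumbing
# commands as a content-addressed key-value store. Requires git on PATH
# and must be run inside an existing repository.
import subprocess

def put(value: bytes) -> str:
    """Store a blob; Git returns its object id (SHA-1), which acts as the key."""
    return subprocess.run(
        ["git", "hash-object", "-w", "--stdin"],
        input=value, capture_output=True, check=True
    ).stdout.decode().strip()

def get(sha: str) -> bytes:
    """Retrieve the blob stored under a given object id."""
    return subprocess.run(
        ["git", "cat-file", "blob", sha],
        capture_output=True, check=True
    ).stdout

if __name__ == "__main__":
    key = put(b'{"name": "example", "kind": "document"}')
    print(key, get(key))
```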

I recommend the slidedeck more for his analysis of what is or is not possible with Git.

All too often the shortcomings of a database or of ten-year-old code are seen as fundamental rather than accidental.

Accidents, like mistakes, can be corrected.

Graphs in the Cloud: Neo4j and Heroku

Filed under: Cloud Computing,Heroku,Neo4j — Patrick Durusau @ 6:30 pm

Graphs in the Cloud: Neo4j and Heroku

From the registration page:

Thursday May 10 10:00 PDT / 17:00 GMT

With more and more applications in the cloud, developers are looking for a fast solution to deploy their applications. This webinar is intended for developers that are interested in the value of launching your application in the cloud, and the power of using a graph database.

In this session, you will learn:

  • how to build Java applications that connect to the Neo4j graph database.
  • how to instantly deploy and scale those applications on the cloud with Heroku.

Speaker: James Ward, Heroku Developer Evangelist

Sounds interesting. Not as much fun as being in Amsterdam, but not every day can be like that! Besides, this way you may remember some of the presentation. 😉

DATA Act passes House

Filed under: DATA Act,Mapping,Topic Maps — Patrick Durusau @ 6:30 pm

DATA Act passes House

Alice Lipowicz writes:

Open government watchdog groups are applauding the House passage of the Digital Accountability and Transparency Act (DATA Act) on April 25 that would require federal agencies to consistently report spending information on a new, searchable Web platform.

The legislation passed by a voice vote and will now go before the Senate. If it becomes law, it will establish standards for identifying and publishing electronic information about federal spending.

The federal government would need to spend $575 million over five years to create new structures and systems under the DATA Act, according to a Congressional Budget Office report issued last year.

If I have ever heard of an opportunity for topic maps, this is one.

Not law yet, but as soon as it is, there will be a variety of tooling-up exercises that will set the parameters for later development.

The Digital Accountability & Transparency Act (DATA), H.R. 2146 (as of this date)

BTW, they mention ISO:

Common data elements developed and maintained by an international voluntary consensus standards body, as defined by the Office of Management and Budget, such as the International Organization for Standardization. [Sec. 3611(a)(3)(A)]

Two thoughts:

First, the need of agencies for mapping solutions to report their current systems in the new target form.

Second, the creation of “common data elements” that have pre-defined hooks for mapping, using topic maps.

GOTO Amsterdam 2012

Filed under: Conferences — Patrick Durusau @ 6:30 pm

GOTO Amsterdam 2012

Finally, a use of goto that we can all agree on!

May 24th and 25th for the conference. May 26th for training.

If you need some motivation/justification other than it being in Amsterdam, see the schedule.

A multi-track conference where, no matter which track you pick, you will not be disappointed, though you will regret missing the other track.

CodeX: Stanford Center for Legal Informatics

Filed under: Law,Legal Informatics — Patrick Durusau @ 6:30 pm

CodeX: Stanford Center for Legal Informatics

Language and semantics are noticed more often with regard to legal systems than they are elsewhere. Failing to “get” a joke on a television show doesn’t have the same consequences, potentially, as breaking a law.

Within legal systems, topic maps are important for capturing and collating complex factual and legal semantics. As the world grows more international, legal systems bump up against each other, and topic maps provide a way to map across such systems.

From the website:

CodeX is a multidisciplinary laboratory operated by Stanford University in association with affiliated organizations from industry, government, and academia. The staff of the Center includes a core of full-time employees, together with faculty and students from Stanford and professionals from affiliated organizations.

CodeX’s primary mission is to explore ways in which information technology can be used to enhance the quality and efficiency of our legal system. Our goal is “legal technology” that empowers all parties in our legal system and not solely the legal profession. Such technology should help individuals find, understand, and comply with legal rules that govern their lives; it should help law-making bodies analyze proposed laws for cost, overlap, and inconsistency; and it should help enforcement authorities ensure compliance with the law.

Projects carried out under the CodeX umbrella typically fall into one or more of the following areas:

  • Legal Document Management: is concerned with the creation, storage, and retrieval of legal documents of all types, including statutes, case law, patents, regulations, etc. The $50B e-discovery market is heavily dependent on Information Retrieval (IR) technology. By automating information retrieval, cost can be dramatically reduced. Furthermore, it is generally the case that well-tuned automated procedures can outperform manual search in terms of accuracy. CodeX is investigating various innovative legal document management methodologies and helping to facilitate the use of such methods across the legal spectrum.
  • Legal Infrastructure: Some CodeX projects focus on building the systems that allow the stakeholders in the legal system to connect and collaborate more efficiently. Leveraging advances in the field of computer science and building upon national and international standardization efforts, these projects have the potential to provide economic and social benefits by streamlining the interactions of individuals, organizations, legal professionals and government as they acquire and deliver legal services. By combining the development of such platforms with multi-jurisdictional research on relevant regulations issued by governments and bar associations, the Center supports responsible, forward-looking innovation in the legal industry.
  • Computational Law: Computational law is an innovative approach to legal informatics based on the explicit representation of laws and regulations in computable form. Computational Law techniques can be used to “embed” the law in systems used by individuals and automate certain legal decision making processes or in the alternative bring the legal information as close to the human decision making as possible. The Center’s work in this area includes theoretical research on representations of legal information, the creation of technology for processing and utilizing information expressed within these representations, and the development of legal structures for ratifying and exploiting such technology. Initial applications include systems for helping individuals navigate contractual regimes and administrative procedures, within relatively discrete e-commerce and governmental domains.

LucidWorks 2.1

Filed under: Lucene,LucidWorks,Solr — Patrick Durusau @ 6:30 pm

LucidWorks 2.1

There are times, not very often, when picking only a few features to report would be unfair to a product.

This is one of those times.

I have reproduced the description of LucidWorks 2.1 as it appears on the Lucid Imagination site:

LucidWorks 2.1 new features list:

Enhancement areas and key benefits:

Includes the latest Lucene/Solr 4.0

  • Near Real Time
  • Fault Tolerance and High Availability
  • Data Durability
  • Centralized Configuration
  • Elasticity

Business Rules

  • Integrate your business processes and rules with the user search experience
  • Examples: Landing Pages, provide targeted search results per user, etc.
  • Framework to integrate with your BRMS (Business Rules Management System)
  • OOB integration with leading open source BRMS – Drools

Upgrade and Migrations

  • Lucid can help upgrade customers from Solr 3.x to 4.0 or older Solr versions to LucidWorks 2.1
  • Upgrades for existing LucidWorks customers on previous versions of LucidWorks to LucidWorks 2.1

Enhanced Connector Framework

  • Easily build integrations to index data from any application or data sources
  • Framework supports REST API driven integration, generates dynamic configuration UI, and allows admins to schedule the new connectors
  • Connectors available to crawl large amounts of HDFS data, integrate twitter updates into index, and CMIS connector to support CMS systems like Alfresco, etc.

Efficient Crawl of Large web content

  • OOB integration for Nutch  (open source)
  • Helps crawl Webscale data into your index

REST API and UI Enhancements

  • Supports memory and cache settings, schema less configuration using Dynamic fields from UI
  • Subject Matter Experts can create Best Bets for improved search experience

Key features and benefits of LucidWorks search platform

  • Streamlined search configuration, optimization and operations: Well-organized UI makes Solr innovation easier to consume, better adapting to constant change.
  • Enterprise-grade, business-critical manageability: Includes tools for infrastructure administration, monitoring and reporting so your search application can thrive within a well-defined, well-managed operational environment; includes upgradability across successive releases. We can help migrate Solr installations to LucidWorks 2.1.
  • Broad-based content acquisition: Access big data and enterprise content faster and more securely with built-in support for Hadoop and Amazon S3, along with Sharepoint and traditional online content types – plus a new open connector framework to customize access to other data sources.
  • Versatile access and data security: Flexible, resilient built-in security simplifies getting search connected right to the right data and content.
  • Advanced search experience enhancements: Powerful, innovative search capabilities deliver faster, better, more useful results for a richer user experience; easily integrates into your application and infrastructure; REST API automates and integrates search as a service with your application.
  • Open source power and innovation: Complete, supported release of Lucene/Solr 4.0, including latest innovations in Near Real Time search, distributed indexing and more versatile field faceting over and above Apache Lucene/Solr 3.x; all the flexibility of open source, packaged for business-critical development, maintenance and deployment.
  • Cost-effective commercial-grade expertise & global 24×7 support: A range of annual support subscriptions including bundled services, consulting, training and certification from the world’s leading experts in Lucene/Solr open source.

Math is not “out there”

Filed under: Graphs,Mathematics — Patrick Durusau @ 6:29 pm

Number Line Is Learned, Not Innate Human Intuition

From the post:

Tape measures. Rulers. Graphs. The gas gauge in your car, and the icon on your favorite digital device showing battery power. The number line and its cousins — notations that map numbers onto space and often represent magnitude — are everywhere. Most adults in industrialized societies are so fluent at using the concept, we hardly think about it. We don’t stop to wonder: Is it “natural”? Is it cultural?

Now, challenging a mainstream scholarly position that the number-line concept is innate, a study suggests it is learned.

The study, published in PLoS ONE April 25, is based on experiments with an indigenous group in Papua New Guinea. It was led by Rafael Nunez, director of the Embodied Cognition Lab and associate professor of cognitive science in the UC San Diego Division of Social Sciences.

“Influential scholars have advanced the thesis that many of the building blocks of mathematics are ‘hard-wired’ in the human mind through millions of years of evolution. And a number of different sources of evidence do suggest that humans naturally associate numbers with space,” said Nunez, coauthor of “Where Mathematics Comes From” and co-director of the newly established Fields Cognitive Science Network at the Fields Institute for Research in Mathematical Sciences.

“Our study shows, for the first time, that the number-line concept is not a ‘universal intuition’ but a particular cultural tool that requires training and education to master,” Nunez said. “Also, we document that precise number concepts can exist independently of linear or other metric-driven spatial representations.”

I am not sure how “universal intuition[s]” regained currency but I am glad someone is sorting this out, again.

Universal intuition is a perennial mistake that attempts to put some “facts” beyond dispute. They are “universal.”

I concede the possibility that “universal” intuitions exist.

But advocates always have some particular “universal” intuition they claim exists, which, oddly enough, supports some model or agenda of theirs.

Anecdotal evidence to be sure but I have never seen an advocate of a particular “universal” intuition pushing for one that was contrary to their model or agenda. Could just be coincidence but I leave that to your judgement.

I offer this study as evidence you can cite in the face of “universal” intuitions in databases, ontologies, logic, etc. They are all cultural artifacts that we can use or leave as suits our then-present purposes.

For more information see: Núñez R, Cooperrider K, Wassmann J. Number Concepts without Number Lines in an Indigenous Group of Papua New Guinea. PLoS ONE, 7(4): e35662 DOI: 10.1371/journal.pone.0035662

Berlin Buzzwords 2012 Program

Filed under: Conferences — Patrick Durusau @ 6:29 pm

Berlin Buzzwords 2012 Program

The program for Berlin Buzzwords 2012 is up and what follows is my take on some of the reasons to attend. I should have listed all the presentations but then it would be too long to read. Besides, it is at the website anyway.

WARNING: Partial and Arbitrary Listing of Presentations!

  • Analyzing Hadoop Source Code with Hadoop
  • Machine Learning in the cloud with Mahout and Whirr
  • Real Time Datamining and Aggregation at Scale
  • Large scale graph computation with Apache Giraph
  • Introducing Cascalog: Functional Data Processing for Hadoop
  • Automata Invasion
  • Hydra – an open source processing framework
  • Serious network analysis using Hadoop and Neo4j

Berlin Buzzwords will take place June 4th and 5th, 2012 at Urania Berlin (http://www.uraniaberlin.de). Tickets are available.

BTW, Berlin is a great place to be in the summer! As usual, your blog posts, tweets, etc. about the conference will be greatly appreciated!

April 25, 2012

Just what do you mean by “number”?

Filed under: Humor,Identity — Patrick Durusau @ 6:30 pm

Just what do you mean by “number”?

John D. Cook writes:

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

An eighteen (18) question subset of all the questions about what is meant by “number.”
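The point generalizes: before you can match a “number” you have to decide which sense of “number” you mean. A few of those senses as separate patterns (illustrative simplifications, not Christiansen’s full answer):

```python
# A few of the different things "number" can mean, as separate patterns.
# These are illustrative simplifications, not Christiansen's full answer.
import re

patterns = {
    "unsigned integer":      r"^\d+$",
    "signed integer":        r"^[+-]?\d+$",
    "decimal (fixed point)": r"^[+-]?(\d+\.\d*|\.?\d+)$",
    "scientific notation":   r"^[+-]?(\d+\.?\d*|\.\d+)[eE][+-]?\d+$",
    "hexadecimal (C-style)": r"^0[xX][0-9a-fA-F]+$",
}

tests = ["42", "-7", "3.14", ".5", "6.02e23", "0xDEADBEEF", "1,024"]
for value in tests:
    matches = [name for name, pat in patterns.items() if re.match(pat, value)]
    print(f"{value!r:14} -> {matches or 'no match (another sense of number?)'}")
```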

The simplicity of identity is a polite veneer over boundless complexity.

Most of the time the complexity can remain hidden. Most of the time.

Other times, we are hobbled if our information systems keep us from peeking.

Peeping Tom takes us on an abbreviated tour around the identity of “number.”

Introducing CDH4 Beta 2

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 6:27 pm

Introducing CDH4 Beta 2

Charles Zedlewski writes:

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Second (and final) beta?

Sounds like time to beat and beat hard on this one.

I suspect feedback will be appreciated!

Online tool can detect patterns in US election news coverage

Filed under: Machine Learning,News,Politics — Patrick Durusau @ 6:27 pm

Online tool can detect patterns in US election news coverage

From the website:

The US presidential election dominates the global media every four years, with news articles, which are carefully analysed by commentators and campaign strategists, playing a major role in shaping voter opinion.

Academics at the University of Bristol’s Intelligent Systems Laboratory have developed an online tool, Election Watch, which analyses the content of news about the US election by the international media.

A paper about the project will be presented at the Proceedings of the 13th conference of the European Chapter of the Association for Computational Linguistics held in Avignon, France.

Election Watch automatically monitors political discourse about the 2012 US presidential election from over 700 American and international news outlets. The information displayed is based, so far, on 91,456 articles.

The web tool allows users to explore news stories via an interactive interface and demonstrates the application of modern machine learning and language technologies. After analysing news articles about the 2012 US election the researchers have found patterns in the political narrative.

The online site is updated daily, by presenting narrative patterns as they were extracted from news. Narrative patterns include actors, actions, triplets representing political support between actors, and automatically inferred political allegiance of actors.

The site also presents the key named entities, timelines and heat maps. Network analysis allows the researchers to infer the role of each actor in the general political discourse, recognising adversaries and allied actors. Users can browse articles by political statements, rather than by keywords. For example, users can browse articles where Romney is described as criticising Obama. All the graphical briefing is automatically generated and interactive and each relation presented to the user can be used to retrieve supporting articles, from a set of hundreds of online news sources.

You really have to see this website. Quite amazing.

I would disagree with the placement of Obama to the far left in at least one of the graphics.

From where I sit he should be cheek by jowl with Romney, albeit on his left side.

I wonder if the data set is going to be released or if that is possible?

PBS should ask permission to carry this in a frame on their site.

OAG Launches Mapper, a New Network Analysis Mapping Tool

Filed under: Aviation,Books,Marketing,Travel — Patrick Durusau @ 6:27 pm

OAG Launches Mapper, a New Network Analysis Mapping Tool

From the post:

OAG, a UBM Aviation brand, today unveiled its new aviation analysis mapping tool, OAG Mapper. This latest innovation, from the global leader in aviation intelligence, combines a powerful global flight schedule query with advanced mapping software technology to quickly plot route network maps, based on data drawn from OAG’s market leading schedules database of 1,000 airlines and over 3,500 airports. It is ideal for those in commercial, marketing and strategic planning roles across the airlines, airports, tourism, consulting and route network related industry sectors.

A web-based tool that eliminates the need to hand-draw network routes onto maps, OAG Mapper allows users to either import IATA Airport codes, or to enter a carrier, airport, equipment type or a combination of these and generate custom network maps in seconds. The user can then highlight key routes by changing the thickness and colour of the lines and label them for easy reference, save the map to their profile and export to jpeg for use in network planning, forecasting, strategy and executive presentations.

This has aviation professional written all over it.

And what does aviation bring to mind? That’s right! Coin of the realm! Lots of coins from lots of realms.

Two thoughts:

First, and most obvious: use this service in tandem with other information for aviation professionals to create enhanced services for their use. Ask aviation professionals what they would like to see and how they would like to see it. (Novel software theory: give users what they want, how they want it. It’s an easier sell than educating them.)

Second, we have all seen the travel sites that plot schedules, fees, destinations, hotels and car rentals.

But when was the last time you flew to an airport, rented a car, and stayed in a hotel, and that was the sum total of your trip?

Every location in the world has more to offer than that. Well, not the South Pole, but it doesn’t have a car rental agency. Or any beach. So why go there?

Sorry, got distracted. Every location in the world (with one exception, see above) has more than airports, hotels and car rentals. Suggestion: use topic maps (non-obviously) to create information- and reservation-rich environments.

The Frankfurt Book Fair is an example of an event with literally thousands of connections to be made in addition to airport, hotel and car rental. Your application could be the one that crosses all the information systems (or lack thereof) to provide that unique experience.

You could hard-code it, but I assume you are brighter than that.

Faster Apache CouchDB

Filed under: CouchDB,NoSQL — Patrick Durusau @ 6:26 pm

Faster Apache CouchDB.

Kay Ewbak reports:

Apache has announced the release of CouchDB 1.2.0. It brings lots of improvements, some of which mean apps written for older versions of CouchDB will no longer work.

According to the blog post from its developers, the changes start with improved performance and security. The performance is better because the developers have added a native JSON parser where the performance critical portions are implemented in C, so latency and throughput for all database and view operations is improved. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. The CouchDB team is using the yajl library for its JSON parser.

The new version of CouchDB also has optional file compression for database and view index files, with all storage operations being passed through Google’s snappy compressor. This means less data has to be transferred, so access is faster.

Alongside these headline changes for performance, the team has also made other changes that take the Erlang runtime system into account to improve concurrency when writing data to databases and view index files.

Grab a copy here, or see Kay’s post for more details.

Replacing dtSearch

Filed under: dtSearch,Lucene,Query Language — Patrick Durusau @ 6:26 pm

An open source replacement for the dtSearch closed source search engine

From the webpage:

We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).

The preservation/reuse of stored queries is a testament to the configurable nature of Lucene software.

How far can the query preservation/reuse capabilities of Lucene be extended?
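I don’t know the details of that client work, but to make the idea concrete: dtSearch expresses proximity as term1 w/5 term2, while stock Lucene query syntax wants a quoted phrase with slop, "term1 term2"~5. A toy string-level translator for just that one construct might look like the sketch below; the real replacement, as the post describes, extends Lucene’s own query handling rather than rewriting strings.

```python
# Toy sketch only: rewrite the dtSearch proximity operator (term1 w/5 term2)
# into Lucene's phrase-slop syntax ("term1 term2"~5). A real replacement
# would extend Lucene's query parser rather than rewrite query strings.
import re

PROXIMITY = re.compile(r"(\w+)\s+w/(\d+)\s+(\w+)", re.IGNORECASE)

def dtsearch_to_lucene(query: str) -> str:
    return PROXIMITY.sub(
        lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}', query)

if __name__ == "__main__":
    print(dtsearch_to_lucene("oil w/5 spill AND liability"))
    # "oil spill"~5 AND liability
```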

A long and winding road (…introducing serendipity into music recommendation)

Filed under: Music,Recommendation,Serendipity — Patrick Durusau @ 6:26 pm

Auralist: introducing serendipity into music recommendation

Abstract:

Recommendation systems exist to help users discover content in a large body of items. An ideal recommendation system should mimic the actions of a trusted friend or expert, producing a personalised collection of recommendations that balance between the desired goals of accuracy, diversity, novelty and serendipity. We introduce the Auralist recommendation framework, a system that – in contrast to previous work – attempts to balance and improve all four factors simultaneously. Using a collection of novel algorithms inspired by principles of “serendipitous discovery”, we demonstrate a method of successfully injecting serendipity, novelty and diversity into recommendations whilst limiting the impact on accuracy. We evaluate Auralist quantitatively over a broad set of metrics and, with a user study on music recommendation, show that Auralist‘s emphasis on serendipity indeed improves user satisfaction.

A deeply interesting article for anyone interested in recommendation systems and the improvement thereof.

It is research that should go forward but among my concerns about the article:

1) I am not convinced of the definition of “serendipity:”

Serendipity represents the “unusualness” or “surprise” of recommendations. Unlike novelty, serendipity encompasses the semantic content of items, and can be imagined as the distance between recommended items and their expected contents. A recommendation of John Lennon to listeners of The Beatles may well be accurate and novel, but hardly constitutes an original or surprising recommendation. A serendipitous system will challenge users to expand their tastes and hopefully provide more interesting recommendations, qualities that can help improve recommendation satisfaction [23]

Or perhaps I am “hearing” it in the context of discovery. Such as searching for Smokestack Lighting and not finding the Yardbirds but Howling Wolf as the performer. Serendipity in that sense not having any sense of “challenge.”

2) A survey of 21 participants, mostly students, is better than experimenters asking each other for feedback but only just. The social sciences department should be able to advise on test protocols and procedures.

3) There was no showing that “user satisfaction,” the item to be measured, is the same thing as “serendipity.” I am not entirely sure that other than by example, “serendipity” can even be discussed, let alone measured.

Take my Howling Wolf example. How close or far away is the “serendipity” there versus an instance of “serendipity” as offered by Auralist? Unless and until we can establish a metric, at least a loose one, it is hard to say which one has more “serendipity.”

LAILAPS

LAILAPS

From the website:

LAILAPS combines a keyword driven search engine for an integrative access to life science databases, machine learning for a content driven relevance ranking, recommender systems for suggestion of related data records and query refinements with a user feedback tracking system for an self learning relevance training.

Features:

  • ultra fast keyword based search
  • non-static relevance ranking
  • user specific relevance profiles
  • suggestion of related entries
  • suggestion of related query terms
  • self learning by user tracking
  • deployable at standard desktop PC
  • 100% JAVA
  • installer for in-house deployment

I like the idea of a recommender system that “suggests” related data records and query refinements. It could be wrong.

I am as guilty as anyone of thinking in terms of “correct” recommendations that always lead to relevant data.

That is applying “crisp” set thinking to what is obviously a “rough” set situation. We as readers have to sort out the items in the “rough” set and construct for ourselves, a temporary and fleeting “crisp” set for some particular purpose.

If you are using LAILAPS, I would appreciate a note about your experiences and impressions.

NYCFacets

Filed under: Marketing,Open Data — Patrick Durusau @ 6:26 pm

NYCFacets: Smart Open Data Exchange

From the FAQ:

Smart Open Data Exchange?

A: We don’t just catalog the metadata for each datasource. We squeeze additional metadata – extrametadata as we call it, and correlate all the datasources to allow Open Data Users to see the “forest for the trees”. Or in the case of NYC – the “city for the streets”? (TODO: find urban equivalent of “See Forest for the Trees“)

The “Smart” comes from a process we call “Crowdknowing” – leveraging metadata + extrametadata to score each dataset from various perspectives, automatically correlate them, and in the near future, perform semi-automatic domain mapping.

Extrametadata?

A: Derived Metadata – Statistics (Quantitative and Qualitative), Ontologies, Semantic Mappings, Inferences, Federated Queries, Scores, Curations, Annotations plus various other Machine and Human-powered signals through a process we call “Crowdknowing“.

Crowdknowing?

A: Human-powered, machine-accelerated, collective knowledge systems cataloging metadata + derived extrametadata (derived using semantics, statistics, algorithm and the crowd). At this stage, the human-powered aspect is not emphasized because we found that the NYC Data Catalog community is still in its infancy – there were very few comments and ratings. But we hope to help improve that over time as we crawl secondary signals (e.g. votes and comments in NYCBigApps, Challengepost and Appstores; Facebook likes; Tweets, etc.).

OK, it was covered as the winner of the most recent NYCBigApps contest but I thought it needed a separate shout-out.

Take a close look at what this site has done with a minimum of software and some clever thinking.

NYC BigApps

Filed under: Contest,Mapping,Marketing — Patrick Durusau @ 6:25 pm

NYC BigApps

From the webpage:

New York City is challenging software developers to create apps that use city data to make NYC better.

There are three completed contests (one just ended) that resulted in very useful applications.

NYC BigApps 3.0 resulted in:

NYC Facets: Best Overall Application – Grand Prize – Explores and visualizes more than 1 million facts about New York City.

Work+: Best Overall Application – Second Prize – Working from home not working for you? Discover new places to get things done.

Funday Genie: Investor’s Choice Application – The Funday Genie is an application for planning a free day. Our unique scheduling and best route algorithm creates a smart personalized day-itinerary of things to do, including events, attractions, restaurants, shopping, and more, based on the user’s preferences. Everyday can be a Funday.

among others.

Quick question: How would you interchange information between any two of these apps? Or if you like, any other two apps in this or prior contests?

Second question: How would you integrate additional information into any of these apps, prepared for use by another application?

Topic maps can:

  • collate information for display.
  • power re-usable and extensible mappings of data into other formats.
  • augment data for applications that lack merging semantics.

Where is your data today and where would you like for it to be tomorrow?
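To make the collation point concrete, a toy sketch: two (hypothetical) apps describe the same venue under different field names, and merging on a shared subject identifier yields one record that can serve both. Every name and value below is invented for illustration.

```python
# Toy sketch: collate records from two (hypothetical) apps that describe the
# same venue under different field names, merging on a shared subject
# identifier. All field names and values here are invented for illustration.

app_a = {"subject": "nyc:venue/bryant-park",
         "name": "Bryant Park", "wifi": True}

app_b = {"subject": "nyc:venue/bryant-park",
         "title": "Bryant Park", "events_today": 3}

def merge(*records):
    """Merge records that share a 'subject' identifier into one view."""
    merged = {}
    for record in records:
        if merged and record["subject"] != merged["subject"]:
            raise ValueError("records describe different subjects")
        merged.update(record)
    return merged

print(merge(app_a, app_b))
# {'subject': 'nyc:venue/bryant-park', 'name': 'Bryant Park',
#  'wifi': True, 'title': 'Bryant Park', 'events_today': 3}
```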

