Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 9, 2011

Kasabi

Filed under: Data,Data as Service (DaaS),Data Source,RDF — Patrick Durusau @ 7:16 pm

Kasabi

A data-as-a-service site that offers access to data (no downloads) via APIs. It has help for authors preparing their data, APIs for the data, etc. Currently in beta.

I mention it because data as a service is one model for delivering topic map content, so the successes, problems and usage of Kasabi may be important milestones to watch.

True, Lexis/Nexis, WestLaw, and any number of other commercial vendors have sold access to data in the past, but it was mostly dumb data. That is, you had to contribute something to it to make it meaningful. We are in the early stages, but I think a market for data that works with my data is developing.

The options to download citations in formats that fit particular bibliographic programs are an impoverished example of delivered data working with local data.

Not quite the vision for the Semantic Web but it isn’t hard to imagine your calendaring program showing links to current news stories about your appointments. You have to supply the reasoning to cancel the appointment with the bank president just arrested for securities fraud and to increase your airline reservations to two (2).

Authoritative URIs for Geo locations? Multi-lingual labels?

Filed under: Geographic Data,Linked Data,RDF — Patrick Durusau @ 7:14 pm

Some Geo location and label links that came up on the pub-lod list:

Not a complete list, nor does it include historical references or designations used over the millennia. Still, you may find it useful.

Chess@home Building the Largest Chess AI ever

Filed under: Artificial Intelligence,Collaboration,Crowd Sourcing — Patrick Durusau @ 7:11 pm

Chess@home Building the Largest Chess AI ever

From the post:

Many people are familiar with the SETI@home project: a very large scale effort to search for patterns from alien civilizations in the ocean of data we receive from the sky, using the computing power of millions of computers around the globe (“the grid”).

SETI@home has been a success, obviously not in finding aliens, but in demonstrating the potential of large-scale distributed computing. Projects like BOINC have been expanding this effort to other fields like biology, medicine and physics.

Last weekend, a team at Joshfire (Thomas, Nathan, Mickael and myself) participated in a 48-hour coding contest called Node Knockout. Rules were simple: code the most amazing thing you can in the weekend, as long as it uses server-side JavaScript.

JavaScript is an old but exciting technology, currently undergoing a major revival through performance breakthroughs, server-side engines, and an array of new possibilities offered by HTML5. It is also the language that has the biggest scale. Any computer connected to the web can run your JavaScript code extremely easily, just by typing an URL in a browser or running a script in a command line.

We decided to exploit this scale, together with Node’s high I/O performance, to build a large-scale distributed computing project that people could join just by visiting a web page. Searching for aliens was quickly eliminated as we didn’t have our own antenna array available at the time. So we went with a somewhat easier problem: Chess.

Easier problem: Take the coming weekend and sketch out how you think JavaScript and/or HTML5 are going to impact the authoring/delivery of topic maps.

LATC – Linked Open Data Around-the-Clock

Filed under: Government Data,Linked Data,LOD — Patrick Durusau @ 7:10 pm

LATC – Linked Open Data Around-the-Clock

This appears to be an early release of the site because it has an “unfinished” feel to it. For example, you have to poke around a bit to find the tools link. And it isn’t clear how the project intends to promote the use of those tools, or to originate new ones, in support of linked data.

I suppose it is too late to avoid the grandiose “around-the-clock” project name? Web servers, barring some technical issue, are up 24 x 7. They keep going even as we sleep. Promise.

Objectives:

increase the number, the quality and the accuracy of data links between LOD datasets. LATC contributes to the evolution of the World Wide Web into a global data space that can be exploited by applications similar to a local database today. By increasing the number and quality of data links, LATC makes it easier for European Commission-funded projects to use the Linked Data Web for research purposes.

support institutions as well as individuals with Linked Data publication and consumption. Many of the practical problems that a European Commission-funded project may discover when interacting with the Web of Data are solved on the conceptual level and the solutions have been implemented into freely available data publication and consumption tools. What is still missing is the dissemination of knowledge about how to use these tools to interact with the Web of Linked Data. We aim at providing this knowledge.

create an in-depth test-bed for data intensive applications by publishing datasets produced by the European Commission, the European Parliament, and other European institutions as Linked Data on the Web and by interlinking them with other governmental data, such as found in the UK and elsewhere.

Why “Second Chance” Tweets Matter:…

Filed under: Interface Research/Design — Patrick Durusau @ 7:08 pm

Why “Second Chance” Tweets Matter: After 3 Hours, Few Care About Socially Shared Links

From the post:

There have been various studies suggesting that if someone doesn’t see a tweet or a Facebook post within a few hours, they’ll never see it at all. Now link shortening service Bit.ly is out with another. After three hours, Bit.ly has found, links have sent about all the traffic they’re going to send. So start thinking about doing “second chance” tweets, as I call them.

What I found interesting was the chart comparing the “half-life” of Facebook, Twitter, YouTube and direct. Or if you prefer the numbers:

  • Twitter: 2.8 hours
  • Facebook: 3.2 hours
  • YouTube: 7.4 hours

I suspect the true explanation is simply that volume pushes tweets and/or Facebook postings “below the fold,” as it were, and most people don’t look further than the current screen for content.

That may be an important lesson for topic map interfaces: if at all possible, keep content to a single screen. Just as in the “olden print days,” readers don’t look below the “fold.”

Another aspect that needs investigation is the “stickiness” of your presentation. The longer half-life on YouTube may reflect a slower rate of posting, but I suspect there is more to it. If the presentation captures your imagination, there is at least some likelihood it will capture the imagination of others.

I suspect that some data sets lend themselves to “sticky” explanations more than others but that is why you need graphic artists (not CS majors), focus groups (not CS majors), professional marketers (not CS majors) to design your interfaces and delivery mechanisms. What “works” for insiders is likely to be the worst candidate for the general public. (Don’t ask me either. I am interested in recursive information structures for annotation of biblical texts. I am not a “good” candidate for user studies.)

R and Hadoop

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 7:06 pm

From Revolution Analytics:

White paper: Advanced ‘Big Data’ Analytics with R and Hadoop

Webinar: Revolution Webinar: Leveraging R in Hadoop Environments, 21 September 2011, 10:00–10:30 AM Pacific Time

RHadoop: RHadoop

From GitHub:

RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested in Cloudera’s distribution of Hadoop (CDH3) with R 2.13.0. RHadoop consists of the following packages:

  • rmr – functions providing Hadoop MapReduce functionality in R
  • rhdfs – functions providing file management of the HDFS from within R
  • rhbase – functions providing database management for the HBase distributed database from within R
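
For readers who have not written a MapReduce job before, here is a toy illustration of the map/reduce pattern that rmr exposes to R users. It is written in Scala rather than R and has no Hadoop dependency, so it is only a sketch of the shape of the computation, not of the rmr API itself.

```scala
// A toy, in-memory illustration of the map/reduce pattern that rmr wraps for
// R users: map each record to key/value pairs, group by key, then reduce.
// Plain Scala, no Hadoop dependency; illustration only.
object WordCountSketch {
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word.toLowerCase, 1))

  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, counts) => (word, counts.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("hadoop makes mapreduce", "r makes statistics", "rmr makes both meet")
    reducePhase(mapPhase(lines)).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```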

What Can Apache Hadoop do for You?

Filed under: Hadoop — Patrick Durusau @ 7:21 am

What Can Apache Hadoop do for You?

This is extremely amusing! I liked the non-correct answers better than I did the correct ones. 😉

From the description:

What can Apache Hadoop do for you? Watch the video to find out what other people think Apache Hadoop is (or isn’t).

Share your own ideas about Apache Hadoop. Get out your video camera or phone, channel your inner filmmaker and submit a short clip or mini film of what you think Apache Hadoop can do for you. Let your creative energy flow: It can be sincere, funny, educational or witty.

By participating, you could be selected as a Cloudera Hero and win a four-day trip to San Francisco to spend a day hacking code with Doug Cutting, co-founder of the Apache Hadoop project. Find out how to get entered.

Don’t have a video to enter? Help us pick the winner and give your favorite contestant a thumbs up.

Go to www.facebook.com/cloudera and click the “Be a Cloudera Hero for Apache Hadoop” tab for full details.

September 8, 2011

Summing up Properties with subjectIdentifiers/URLs?

Filed under: Identification,Identifiers,Intelligence,Subject Identifiers,Subject Identity — Patrick Durusau @ 6:06 pm

I was picking tomatoes in the garden when I thought about telling Carol (my wife) that the plants are about to stop producing.

Those plants are at a particular address, in the backyard, in the middle garden bed of three, and are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”

It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.
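
As a rough sketch of what “summing up” might look like in code, here is one way to derive a subjectIdentifier from a set of properties. The property selection, the canonical ordering and the URI prefix are my own assumptions for illustration, not anything prescribed by the TMDM or offered by an existing service.

```scala
import java.security.MessageDigest

// Hypothetical sketch: derive a subjectIdentifier by "summing up" selected
// properties of a topic into a stable digest. The property names, ordering
// and URI prefix are assumptions made purely for illustration.
object SummedIdentifier {
  def subjectIdentifier(properties: Map[String, String]): String = {
    // Sort the properties so the same set always yields the same identifier.
    val canonical = properties.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(";")
    val digest = MessageDigest.getInstance("SHA-1").digest(canonical.getBytes("UTF-8"))
    val hex = digest.map("%02x".format(_)).mkString
    s"http://example.org/subject/$hex"  // merge on identifier equality, not on property tests
  }

  def main(args: Array[String]): Unit = {
    val tomatoes = Map(
      "location" -> "backyard, middle bed of three",
      "kind"     -> "tomato plants",
      "state"    -> "about to stop producing")
    println(subjectIdentifier(tomatoes))
  }
}
```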

Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.

It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, producing for consumption the subjectIdentifiers and merge-capable indexes for one or more information sets. The basis for merging would be the competitive edge offered by the service.

If a vendor is promoting merging with its own process or format, seeking to become the TCP/IP of some area, the basis for merging and tools to assist with it will be supplied.

Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.

*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.

When Should Identifications Be Immutable?

Filed under: Identification,Immutable,TMDM,Topic Maps — Patrick Durusau @ 6:03 pm

After watching a presentation on Clojure and its immutable data structures, I began to wonder: when should identifications be immutable?

Note that I said when should identifications… which means I am not advocating a universal position for all identifiers but rather a choice that may vary from situation to situation.

We may change our minds about an identification, but the fact remains that at some point (dare I say state?) a particular identification was made.

For example, you make an intimate gesture at a party only to discover your spouse wasn’t the recipient of the gesture. But at the time you made the gesture, at least I am willing to believe, you thought it was your spouse. New facts are now apparent. But it is also a new identification. As your spouse will remind you, you did make a prior, incorrect identification.

As I recall, topics (and other information items) are immutable for purposes of merging (TMDM, 6.2 and following). That is, merging results in a new topic or other new information item. On the other hand, merging also results in updating information items other than the one subject to merging. So those information items are not being treated as immutable.

But since the references are being updated, I don’t think it would be inconsistent with the TMDM to create new information items to carry the new identifiers, thus treating the information items as immutable.

This would be application/requirement specific, but for accounting, banking, securities and similar applications it may be important for identifications to be immutable, such that we can “unroll” a topic map, as it were, to any arbitrary prior identification or state.
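
A minimal sketch of the idea, using a deliberately simplified model rather than the TMDM’s information items: merging never updates a topic in place, it produces a new one, so every prior identification remains available for “unrolling.”

```scala
// Deliberately simplified, immutable model of topics: merging yields a new
// topic instead of updating either input, so earlier identifications
// ("states") remain available for audit or roll-back.
case class Topic(subjectIdentifiers: Set[String], names: Set[String])

object ImmutableMerge {
  def merge(a: Topic, b: Topic): Topic =
    Topic(a.subjectIdentifiers ++ b.subjectIdentifiers, a.names ++ b.names)

  def main(args: Array[String]): Unit = {
    val t1 = Topic(Set("http://example.org/id/1"), Set("Bank President"))
    val t2 = Topic(Set("http://example.org/id/2"), Set("Securities Fraud Defendant"))
    val history = List(t1, t2, merge(t1, t2))  // prior identifications are never lost
    history.foreach(println)
  }
}
```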

Cheatsheets

Filed under: Documentation — Patrick Durusau @ 6:01 pm

Cheatsheets

I was looking at some Hadoop output today and wishing I had a quick cheatsheet at hand.

I didn’t find one but I did find this site devoted to cheatsheets.

There were two (2) for Hadoop (DZone) but they were both concerned with what I type and not messages I may or may not see in return.

I am sure it is in the documentation but does anyone have a pointer to a message cheatsheet for Hadoop?

Press.net News Ontologies & rNews

Filed under: Linked Data,LOD,Ontology — Patrick Durusau @ 5:58 pm

Press.net News Ontologies

From the webpage:

The news ontology is comprised of several ontologies, which describe assets (text, images, video) and the events and entities (people, places, organisations, abstract concepts etc.) that appear in news content. The asset model is the representation of news content as digital assets created by a news provider (e.g. text, images, video and data such as csv files). The domain model is the representation of the ‘real world’ which is the subject of news. There are simple entities, which we have labelled with the amorphous term of ‘stuff‘ and complex entities. Currently, the only complex entity the ontology is concerned with is events. The term stuff has been used to include abstract and intangible concepts (e.g. Feminism, Love, Hate Crime etc.) as well as tangible things (e.g. Lord Ashdown, Fiat Punto, Queens Park Rangers).

Assets (news content) are about things in the world (the domain model). The connection between assets and the entities that appear in them is made using tags. Assets are further holistically categorised using classification schemes (e.g. IPTC Media Topic Codes, Schema.org Vocabulary or Press Association Categorisations).

No sooner had I seen that on the LOD list than Stephanie Corlosquet pointed out rNews, another ontology for news.

From the rNews webpage:

rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents. The rNews proposal has been developed by the IPTC, a consortium of the world’s major news agencies, news publishers and news industry vendors. rNews is currently in draft form and the IPTC welcomes feedback on how to improve the standard in the rNews Forum.

I am sure there are others.

Although I rather like stuff as an alternative to SUMO’s thing. Or was that Cyc?

The point being that mapping strategies, when the expense can be justified, are the “answer” to the semantic diversity and richness of human discourse.

InfiniteGraph and RDF tuples

Filed under: Graphs,InfiniteGraph,RDF — Patrick Durusau @ 5:56 pm

InfiniteGraph and RDF tuples

Short answer to the question: “[Does] InfiniteGraph support RDF (Resource Description Framework) tuples (triples), whether it works like a triplestore, and/or if we can easily work alongside a triple store[?]”

Yes.

It also raises the question: Why would you want to?

Solr’s Realtime Get: Increasing its NoSQL Capabilities

Filed under: NoSQL,Searching,Solr — Patrick Durusau @ 5:55 pm

Solr’s Realtime Get: Increasing its NoSQL Capabilities

From the post:

As readers probably know, Lucene/Solr search works off of point-in-time snapshots of the index. After changes have been made to the index, a commit (or a new Near Real Time softCommit) needs to be done before those changes are visible. Even with Solr’s new NRT (Near Real Time) capabilities, it’s probably not advisable to reopen the searcher more than once a second. However there are some use cases that require the absolute latest version of a document, as opposed to just a very recent version. This is where Solr’s new realtime get comes to the rescue, where the latest version of a document can be retrieved without reopening the searcher and risk disrupting other normal search traffic.
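
For concreteness, here is a minimal sketch of calling the realtime get handler from Scala over plain HTTP. The host, port, core name and document id are placeholders, and the /get handler assumes an update log is configured in solrconfig.xml.

```scala
import scala.io.Source

// Minimal sketch of Solr's realtime get over HTTP. Host, port, core name and
// document id are placeholders; the /get handler needs the update log
// (<updateLog/>) configured in solrconfig.xml.
object RealtimeGetSketch {
  def latestVersion(id: String): String = {
    val url = s"http://localhost:8983/solr/collection1/get?id=$id&wt=json"
    val src = Source.fromURL(url)
    try src.mkString finally src.close()  // latest version, even before a commit
  }

  def main(args: Array[String]): Unit =
    println(latestVersion("doc-42"))
}
```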

How Neo4j uses Scala’s Parser Combinator: Cypher’s internals – Part 1

Filed under: Cypher,Graphs,Neo4j,Query Language,Scala — Patrick Durusau @ 5:54 pm

How Neo4j uses Scala’s Parser Combinator: Cypher’s internals – Part 1

From the post:

I think that most of us, software developers, while kids, always wanted to know how things were made by inside. Since I was a child, I always wanted to understand how my toys worked. And then, what I used to do? Opened’em, sure. And of course, later, I wasn’t able to re-join its pieces properly, but this is not this post subject 😉 . Well, understanding how things works behind the scenes can teach us several things, and in software this is no different, and we can study how an specific piece of code was created and mixed together with other code.

In this series of posts I’ll share what I’ve found inside Neo4J implementation, specifically, at Cypher’s code (its query language).

In this first part, I’ll briefly introduce Neo4J and Cypher and then I’ll start to explain the internals of its parser and how it works. Since it is a long (very very long subject, in fact), part 2 and subsequents are coming very very soon.

If you want to understand the internals of a graph query language, this looks like a good place to start.


Update: Neo4j’s Cypher internals – Part 2: All clauses, more Scala’s Parser Combinators and query entry point
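
If you want a feel for the technique before reading Cypher’s actual code, here is a toy grammar built with Scala’s parser combinators. It parses a Cypher-flavoured one-liner but is in no way Cypher’s real grammar.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// A toy grammar in the spirit of "start n = node(1) return n". Not Cypher's
// real grammar, just a small demonstration of the parser-combinator technique
// the posts describe. (On Scala 2.11+ this needs the scala-parser-combinators
// module; earlier versions ship it in the standard library.)
case class Query(variable: String, nodeId: Long, returned: String)

object TinyQueryParser extends JavaTokenParsers {
  def query: Parser[Query] =
    ("start" ~> ident) ~ ("=" ~ "node" ~ "(" ~> wholeNumber <~ ")") ~ ("return" ~> ident) ^^ {
      case v ~ id ~ r => Query(v, id.toLong, r)
    }

  def main(args: Array[String]): Unit =
    println(parseAll(query, "start n = node(1) return n"))  // prints a successful parse of Query(n,1,n)
}
```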

Akka

Filed under: Actor-Based,Akka — Patrick Durusau @ 5:53 pm

Akka

From the webpage:

Akka is the platform for the next generation event-driven, scalable and fault-tolerant architectures on the JVM

We believe that writing correct concurrent, fault-tolerant and scalable applications is too hard. Most of the time it’s because we are using the wrong tools and the wrong level of abstraction.

Akka is here to change that.

Using the Actor Model together with Software Transactional Memory we raise the abstraction level and provide a better platform to build correct concurrent and scalable applications.

For fault-tolerance we adopt the “Let it crash” / “Embrace failure” model which have been used with great success in the telecom industry to build applications that self-heal, systems that never stop.

Actors also provides the abstraction for transparent distribution and the basis for truly scalable and fault-tolerant applications.

Akka is Open Source and available under the Apache 2 License.

I am increasingly convinced that “we are using the wrong tools and the wrong level of abstraction.”

Everyone seems to agree on that part. Where they differ is on the right tool and right level of abstraction. 😉

I suspect the disagreement isn’t going away. But I mention Akka in case it seems like the right tool and right level of abstraction to you.

I would be mindful that the marketplace for non-concurrent, not-so-scalable semantic applications is quite large. Think of it this way: someone has to be the “Office” of semantic applications. May as well be you. Leave the high-end, difficult stuff to others.
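
For a feel of the abstraction Akka is selling, here is a minimal actor sketch. The bootstrap API shown follows the later Akka 2.x line (ActorSystem/Props), which differs from the 1.x releases current as I write, but the actor-and-receive model is the same.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Minimal actor sketch: no shared mutable state, no locks, just messages.
// The ActorSystem/Props bootstrap is from the Akka 2.x API; earlier releases
// created and started actors differently.
case class Greet(name: String)

class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name")
  }
}

object AkkaSketch extends App {
  val system = ActorSystem("sketch")
  val greeter = system.actorOf(Props[Greeter](), "greeter")
  greeter ! Greet("topic maps")  // fire-and-forget message passing
  Thread.sleep(500)              // give the actor a moment, then shut down
  system.terminate()
}
```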

JavaZone 2011 Videos

Filed under: Conferences,Java — Patrick Durusau @ 5:51 pm

JavaZone 2011 Videos

These just appeared online today and you are the best judge of the ones that interest you.

If you think some need to be called out, give a shout!

Bioportal 3.2

Filed under: Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 5:50 pm

Bioportal 3.2

From the announcement:

The National Center for Biomedical Ontology is pleased to announce the release of BioPortal 3.2.

New features include updates to the Web interface and Web services:

  • Added Ontology Recommender feature, http://bioportal.bioontology.org/recommender
  • Added support for access control for viewing ontologies
  • Added link to subscribe to BioPortal Notes emails
  • Synchronized “Jump To” feature with ontology parsing and display
  • Added documentation on Ontology Groups
  • Annotator Web service – disabled use of “longest only” parameter when also selecting “ontologies to expand” parameter
  • Removed the metric “Number of classes without an author”
  • Handling of obsolete terms, part 1 – term name is grayed out and element is returned in Web service response for obsolete terms from OBO and RRF ontologies. This feature will be extended to cover OWL ontologies in a subsequent release.

Bug Fix

  • Fixed calculation of “Classes with no definition” metric
  • Added re-direct from old BioPortal URL format to new URL format to provide working links from archived search results

Firefox Extension for NCBO API Key:

To make it easier to test Web service calls from your browser, we have released the NCBO API Key Firefox Extension. This extension will automatically add your API Key to NCBO REST URLs any time you visit them in Firefox. The extension is available at Mozilla’s Add-On site. To use the extension, follow the installation directions, restart Firefox, and add your API Key into the “Options” dialog menu on the Add-Ons management screen. After that, the extension will automatically append your stored API Key any time you visit http://rest.bioontology.org.

Upcoming software license change:

The next release of NCBO software will be under the two-clause BSD license rather than under the currently used three-clause BSD license. This change should not affect anyone’s use of NCBO software and this change is to a less restrictive license. More information about these licenses is available at the site: http://www.opensource.org/licenses. Please contact support at bioontology.org with any questions concerning this change.

Even if you aren’t active in the bioontology area, you need to spend some time with this site.

September 7, 2011

Categorial Compositionality:….

Filed under: Category Theory,Human Cognition — Patrick Durusau @ 7:03 pm

Categorial Compositionality: A Category Theory Explanation for the Systematicity of Human Cognition

Abstract:

Classical and Connectionist theories of cognitive architecture seek to explain systematicity (i.e., the property of human cognition whereby cognitive capacity comes in groups of related behaviours) as a consequence of syntactically and functionally compositional representations, respectively. However, both theories depend on ad hoc assumptions to exclude specific instances of these forms of compositionality (e.g. grammars, networks) that do not account for systematicity. By analogy with the Ptolemaic (i.e. geocentric) theory of planetary motion, although either theory can be made to be consistent with the data, both nonetheless fail to fully explain it. Category theory, a branch of mathematics, provides an alternative explanation based on the formal concept of adjunction, which relates a pair of structure-preserving maps, called functors. A functor generalizes the notion of a map between representational states to include a map between state transformations (or processes). In a formal sense, systematicity is a necessary consequence of a higher-order theory of cognitive architecture, in contrast to the first-order theories derived from Classicism or Connectionism. Category theory offers a re-conceptualization for cognitive science, analogous to the one that Copernicus provided for astronomy, where representational states are no longer the center of the cognitive universe—replaced by the relationships between the maps that transform them.

Categorial Compositionality II: Universal Constructions and a General Theory of (Quasi-)Systematicity in Human Cognition

Abstract:

A complete theory of cognitive architecture (i.e., the basic processes and modes of composition that together constitute cognitive behaviour) must explain the systematicity property—why our cognitive capacities are organized into particular groups of capacities, rather than some other, arbitrary collection. The classical account supposes: (1) syntactically compositional representations; and (2) processes that are sensitive to—compatible with—their structure. Classical compositionality, however, does not explain why these two components must be compatible; they are only compatible by the ad hoc assumption (convention) of employing the same mode of (concatenative) compositionality (e.g., prefix/postfix, where a relation symbol is always prepended/appended to the symbols for the related entities). Architectures employing mixed modes do not support systematicity. Recently, we proposed an alternative explanation without ad hoc assumptions, using category theory. Here, we extend our explanation to domains that are quasi-systematic (e.g., aspects of most languages), where the domain includes some but not all possible combinations of constituents. The central category-theoretic construct is an adjunction involving pullbacks, where the primary focus is on the relationship between processes modelled as functors, rather than the representations. A functor is a structure-preserving map (or construction, for our purposes). An adjunction guarantees that the only pairings of functors are the systematic ones. Thus, (quasi-)systematicity is a necessary consequence of a categorial cognitive architecture whose basic processes are functors that participate in adjunctions.

“Copernican revolution” may be a bit strong but these are interesting articles.

I am more sympathetic to the discussion of the shortfalls of first-order theories than I am to their replacement with other theories, although theories can make for entertaining reading.
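
For readers who think better in code than in commutative diagrams, here is a minimal Scala rendering of the “functor as structure-preserving map” notion the abstracts lean on. It captures only the shape of the idea, nothing from the papers themselves.

```scala
import scala.language.higherKinds

// Minimal rendering of "functor = structure-preserving map": F maps types to
// types and functions to functions, preserving identity and composition.
// Only the shape of the idea, not anything from the papers above.
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

object FunctorSketch {
  val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  def main(args: Array[String]): Unit = {
    val states = List(1, 2, 3)
    // The transformation (+1) is carried through the structure; the structure survives.
    println(listFunctor.map(states)(_ + 1))  // List(2, 3, 4)
  }
}
```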

Web Pages Clustering: A New Approach

Filed under: Clustering,Dictionary — Patrick Durusau @ 7:02 pm

Web Pages Clustering: A New Approach by Jeevan H E, Prashanth P P, Punith Kumar S N, and Vinay Hegde.

Abstract:

The rapid growth of web has resulted in vast volume of information. Information availability at a rapid speed to the user is vital. English language (or any for that matter) has lot of ambiguity in the usage of words. So there is no guarantee that a keyword based search engine will provide the required results. This paper introduces the use of dictionary (standardised) to obtain the context with which a keyword is used and in turn cluster the results based on this context. These ideas can be merged with a metasearch engine to enhance the search efficiency.

The first part of this paper is concerned with the use of a dictionary to create separate queries for each “sense” of a term. I am not sure that is an innovation.

I don’t have the citation at hand but seem to recall that term rewriting for queries has used something very much like a dictionary. Perhaps not a “dictionary” in the conventional sense but I would not even bet on that. Anyone have a better memory than mine and/or working in query rewriting?
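
As I read it, the paper’s first step amounts to expanding an ambiguous keyword into one query per dictionary sense. A toy sketch of that expansion, with a two-entry “dictionary” invented purely for illustration:

```scala
// Sketch of per-sense query expansion: one query per dictionary sense of an
// ambiguous keyword, so the results can later be clustered by sense.
// The tiny "dictionary" below is invented for illustration.
object SenseExpansion {
  val senses: Map[String, List[String]] = Map(
    "jaguar" -> List("large wild cat", "british car maker"),
    "python" -> List("snake", "programming language"))

  def expand(keyword: String): List[String] =
    senses.getOrElse(keyword, List(keyword)).map(sense => s"$keyword $sense")

  def main(args: Array[String]): Unit =
    expand("jaguar").foreach(println)  // one query per sense, fed to a metasearch engine
}
```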

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis

Filed under: Data Mining,Factor Analysis,Linear Regression — Patrick Durusau @ 7:01 pm

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis by Abhishek Taneja and R.K.Chauhan.

Abstract:

The growing volume of data usually creates an interesting challenge for the need of data analysis tools that discover regularities in these data. Data mining has emerged as disciplines that contribute tools for data analysis, discovery of hidden knowledge, and autonomous decision making in many application domains. The purpose of this study is to compare the performance of two data mining techniques viz., factor analysis and multiple linear regression for different sample sizes on three unique sets of data. The performance of the two data mining techniques is compared on following parameters like mean square error (MSE), R-square, R-Square adjusted, condition number, root mean square error(RMSE), number of variables included in the prediction model, modified coefficient of efficiency, F-value, and test of normality. These parameters have been computed using various data mining tools like SPSS, XLstat, Stata, and MS-Excel. It is seen that for all the given dataset, factor analysis outperform multiple linear regression. But the absolute value of prediction accuracy varied between the three datasets indicating that the data distribution and data characteristics play a major role in choosing the correct prediction technique.

I had to do a double-take when I saw “factor analysis” in the title of this article. I remember factor analysis from Schubert’s The Judicial Mind Revisited: Psychometric Analysis of Supreme Court Ideology, where Schubert used factor analysis to model the relative positions of the Supreme Court Justices. Schubert taught himself factor analysis on a Frieden rotary calculator. (I had one of those too, but that’s a different story.)

The real lesson of this article comes at the end of the abstract: the data distribution and data characteristics play a major role in choosing the correct prediction technique.

Photometric Catalogue of Quasars and Other Point Sources in the Sloan Digital Sky Survey

Filed under: Astroinformatics,Machine Learning — Patrick Durusau @ 6:57 pm

Photometric Catalogue of Quasars and Other Point Sources in the Sloan Digital Sky Survey by Sheelu Abraham, Ninan Sajeeth Philip, Ajit Kembhavi, Yogesh G Wadadekar, and Rita Sinha. (Submitted on 9 Nov 2010 (v1), last revised 25 Aug 2011 (this version, v3))

Abstract:

We present a catalogue of about 6 million unresolved photometric detections in the Sloan Digital Sky Survey Seventh Data Release classifying them into stars, galaxies and quasars. We use a machine learning classifier trained on a subset of spectroscopically confirmed objects from 14th to 22nd magnitude in the SDSS i-band. Our catalogue consists of 2,430,625 quasars, 3,544,036 stars and 63,586 unresolved galaxies from 14th to 24th magnitude in the SDSS i-band. Our algorithm recovers 99.96% of spectroscopically confirmed quasars and 99.51% of stars to i ~ 21.3 in the colour window that we study. The level of contamination due to data artefacts for objects beyond i = 21.3 is highly uncertain and all mention of completeness and contamination in the paper are valid only for objects brighter than this magnitude. However, a comparison of the predicted number of quasars with the theoretical number counts shows reasonable agreement.

OK, admittedly this is of more interest to me than to probably anyone else who reads this blog.

Still, every machine learning technique and data requirement that you learn has potential application in other fields.

An Open Source Platform for Virtual Supercomputing

Filed under: Cloud Computing,Erlang,GPU,Supercomputing — Patrick Durusau @ 6:55 pm

An Open Source Platform for Virtual Supercomputing, Michael Feldman reports:

Erlang Solutions and Massive Solutions will soon launch a new cloud platform for high performance computing. Last month they announced their intent to bring a virtual supercomputer (VSC) product to market, the idea being to enable customers to share their HPC resources either externally or internally, in a cloud-like manner, all under the banner of open source software.

The platform will be based on Clustrx and Xpandrx, two HPC software operating systems that were the result of several years of work done by Erlang Solutions, based in the UK, and Massive Solutions, based in Gibraltar. Massive Solutions has been the driving force behind the development of these two OS’s, using Erlang language technology developed by its partner.

In a nutshell, Clustrx is an HPC operating system, or more accurately, middleware, which sits atop Linux, providing the management and monitoring functions for supercomputer clusters. It is run on its own small server farm of one or more nodes, which are connected to the compute servers that make up the HPC cluster. The separation between management and compute enables it to support all the major Linux distros as well as Windows HPC Server. There is a distinct Clustrx-based version of Linux for the compute side as well, called Compute Based Linux.

A couple of things to note from within the article:

The only limitation to this model is its dependency on the underlying capabilities of Linux. For example, although Xpandrx is GPU-aware, since GPU virtualization is not yet supported in any Linux distros, the VSC platform can’t support virtualization of those resources. More exotic HPC hardware technology would, likewise, be out of the virtual loop.

The common denominator for VSC is Erlang, not just the company, but the language http://www.erlang.org/, which is designed for programming massively scalable systems. The Erlang runtime has built-in support for things like concurrency, distribution and fault tolerance. As such, it is particularly suitable for HPC system software and large-scale interprocess communication, which is why both Clustrx and Xpandrx are implemented in the language.

As computing power and access to computing power increase, have you seen an increase in robust (in your view) topic map applications?

SolrCloud

Filed under: Cloud Computing,Solr — Patrick Durusau @ 6:52 pm

SolrCloud

From the webpage:

SolrCloud is the set of Solr features that take Solr’s distributed search to the next level, enabling and simplifying the creation and use of Solr clusters.

  • Central configuration for the entire cluster
  • Automatic load balancing and fail-over for queries

Zookeeper is integrated and used to coordinate and store the configuration of the cluster.

Under the Developer-TODO section I noticed:

optionally allow user to query by multiple collections (assume schemas are compatible)

I assume it would have to be declarative, but shouldn’t there be re-write functions that cause different schemas to be seen as “compatible”?
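
Here is a sketch of what such a declarative re-write might look like: a per-collection field mapping applied to the fields used in a query, so that differently named schemas can be treated as “compatible.” The collection and field names are invented for illustration.

```scala
// Hypothetical declarative re-write: map a canonical field name onto each
// collection's local field name before the query is issued. Collection and
// field names are invented for illustration.
object SchemaRewrite {
  // target collection -> (canonical field -> local field)
  val fieldMap: Map[String, Map[String, String]] = Map(
    "articles" -> Map("title" -> "headline", "body" -> "article_text"),
    "reports"  -> Map("title" -> "report_title", "body" -> "full_text"))

  def rewrite(collection: String, canonicalField: String, term: String): String = {
    val local = fieldMap.getOrElse(collection, Map.empty).getOrElse(canonicalField, canonicalField)
    s"$local:$term"
  }

  def main(args: Array[String]): Unit = {
    println(rewrite("articles", "title", "hadoop"))  // headline:hadoop
    println(rewrite("reports", "title", "hadoop"))   // report_title:hadoop
  }
}
```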

From a topic map perspective I would like to see the “why” of such a mapping but having the capacity for the mapping is a step in the right direction.

Oh, and the ability to use the mapping or not and perhaps to choose between mappings.

Mappings, even between ontologies, are made from someone’s point of view. Pick one; topic maps will be waiting.

NIEM EDemocracy Initiative

Filed under: Governance,National Information Exchange Model NIEM — Patrick Durusau @ 6:51 pm

NIEM EDemocracy Initiative

Bare-bones at the moment but apparently intended as an extension of the < NIEM > mechanisms to legislation and other matters related to the democratic process.

Not much to see at the moment but subject identity issues abound in any representation of governmental processes.

Will try to keep watch on it.

Accumulo Proposal

Filed under: Accumulo,HBase,NoSQL — Patrick Durusau @ 6:49 pm

Accumulo Proposal

From the Apache incubator:

Abstract

Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

Proposal

Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Background

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra. Accumulo began its development in 2008.

Rationale

There is a need for a flexible, high performance distributed key/value store that provides expressive, fine-grained access labels. The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern. We have made much progress in developing this project over the past 3 years and believe both the project and the interested communities would benefit from this work being openly available and having open development.

Further explanation of access labels and iterators:

Access Labels

Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user. The column visibilities are boolean AND and OR combinations of arbitrary strings (such as “(A&B)|C”) and authorizations are sets of strings (such as {C,D}).

Iterators

Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.

The use case for modifying data written to disk is unclear to me but I suppose the data “returned to the user” involves modification of data for security reasons.

Sponsored in part by the NSA, National Security Agency of the United States.

The access label line of thinking has implications for topic map merging. What if a similar mechanism were fashioned to permit or prevent “merging” based on the access of the user? (Where merging isn’t a file-based activity.)
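
To make the idea concrete, here is a toy evaluator for visibility expressions of the “(A&B)|C” form, checked against a user’s authorizations, with a final line showing how the same test could gate a merge. The parsing is my own simplification for illustration, not Accumulo’s implementation.

```scala
// Toy evaluator for Accumulo-style visibility expressions such as "(A&B)|C",
// checked against a user's authorizations. My own simplified parser, not
// Accumulo's code; also sketches gating a merge on the viewer's authorizations.
object VisibilitySketch {
  // expr := term ('|' term)* ; term := factor ('&' factor)* ; factor := token | '(' expr ')'
  def visible(expr: String, auths: Set[String]): Boolean = {
    var pos = 0
    def peek: Char = if (pos < expr.length) expr(pos) else '\u0000'
    def parseExpr(): Boolean = {
      var v = parseTerm()
      while (peek == '|') { pos += 1; v = parseTerm() || v }
      v
    }
    def parseTerm(): Boolean = {
      var v = parseFactor()
      while (peek == '&') { pos += 1; v = parseFactor() && v }
      v
    }
    def parseFactor(): Boolean =
      if (peek == '(') {
        pos += 1                 // consume '('
        val v = parseExpr()
        pos += 1                 // consume ')'
        v
      } else {
        val start = pos
        while (pos < expr.length && "&|()".indexOf(expr(pos)) < 0) pos += 1
        auths.contains(expr.substring(start, pos))
      }
    parseExpr()
  }

  def main(args: Array[String]): Unit = {
    val auths = Set("C", "D")
    println(visible("(A&B)|C", auths))  // true: C satisfies the OR
    println(visible("A&B", auths))      // false: neither A nor B is held
    val canMerge = visible("(A&B)|C", auths)  // gate a merge on the viewer's authorizations
    println(if (canMerge) "merge visible to this user" else "merge withheld")
  }
}
```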

Vowpal Wabbit 6.0

Filed under: Machine Learning,Vowpal Wabbit — Patrick Durusau @ 6:47 pm

Vowpal Wabbit 6.0

From the post:

I just released Vowpal Wabbit 6.0. Since the last version:

  1. VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
  2. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
  3. The new matrix factorization code from Jake adds a core algorithm.
  4. We finally have basic persistent daemon support, again with Jake’s help.
  5. Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
  6. The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.

All of the above was done without adding significant new dependencies, so the code should compile easily.

The VW mailing list has been slowly growing, and is a good place to ask questions.

Scala for the Intrigued & Scala Traits

Filed under: Scala — Patrick Durusau @ 6:46 pm

Scala for the Intrigued & Scala Traits

The current issue of PragPub (Sept-2011) has a pair of articles on Scala.

“Scala for the Intrigued” by Venkat Subramaniam starts a new series on Scala. Conciseness is the emphasis of the first post, with the implication that conciseness is a virtue. Perhaps, perhaps. I prefer to think of “conciseness” as the proper use of macros for substitution. Still, it looks like an interesting series to follow.

Brian Tarbox in “Scala Traits” counters an “antipattern” in Java with a “pattern” in Scala using “traits.” Scala made different choices than Java did, so it offers different capabilities. Capabilities that you may find useful. Or not. But worth your while to consider.
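
For readers who have not met traits, here is a tiny mixin example of my own (not taken from the article): classes compose small units of behaviour instead of inheriting them from a deep class chain.

```scala
// Tiny illustration of Scala traits as mixable units of behaviour.
// My own toy example, not taken from the PragPub article.
trait Logged {
  def log(msg: String): Unit = println(s"[log] $msg")
}

trait Timestamped {
  def now(): Long = System.currentTimeMillis()
}

// A class mixes in exactly the behaviour it needs, no deep inheritance chain.
class TopicStore extends Logged with Timestamped {
  def add(name: String): Unit = log(s"added '$name' at ${now()}")
}

object TraitsSketch extends App {
  new TopicStore().add("semantic diversity")
}
```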

September 6, 2011

Legal Ontology Engineering Methodologies….

Filed under: Legal Informatics — Patrick Durusau @ 7:18 pm

Legal Ontology Engineering Methodologies, Modelling Trends, and the Ontology of Professional Judicial Knowledge by Dr. Núria Casellas.

Publisher’s description (from Legal Informatics):

Enabling information interoperability, fostering legal knowledge usability and reuse, enhancing legal information search, in short, formalizing the complexity of legal knowledge to enhance legal knowledge management are challenging tasks, for which different solutions and lines of research have been proposed.

During the last decade, research and applications based on the use of legal ontologies as a technique to represent legal knowledge has raised a very interesting debate about their capacity and limitations to represent conceptual structures in the legal domain. Making conceptual legal knowledge explicit would support the development of a web of legal knowledge, improve communication, create trust and enable and support open data, e-government and e-democracy activities. Moreover, this explicit knowledge is also relevant to the formalization of software agents and the shaping of virtual institutions and multi-agent systems or environments.

This book explores the use of ontologies in legal knowledge representation for semantically-enhanced legal knowledge systems or web-based applications. In it, current methodologies, tools and languages used for ontology development are revised, and the book includes an exhaustive revision of existing ontologies in the legal domain. The development of the Ontology of Professional Judicial Knowledge (OPJK) is presented as a case study.

Well, it is the sort of thing that I would enjoy as leisure reading. 😉

I keep threatening to get one of the personal research accounts just to see how much or how little progress has been made by legal information vendors. Haven’t yet but maybe I can find a sponsor for a project to undertake such a comparison.

First Look – Oracle Data Mining Update

Filed under: Data Mining,Database,Information Retrieval,SQL — Patrick Durusau @ 7:18 pm

First Look – Oracle Data Mining Update by James Taylor.

From the post:

I got an update from Oracle on Oracle Data Mining (ODM) recently. ODM is an in-database data mining and predictive analytics engine that allows you to build and use advanced predictive analytic models on data that can be accessed through your Oracle data infrastructure. I blogged about ODM extensively last year in this First Look – Oracle Data Mining and since then they have released ODM 11.2.

The fundamental architecture has not changed, of course. ODM remains a “database-out” solution surfaced through SQL and PL-SQL APIs and executing in the database. It has the 12 algorithms and 50+ statistical functions I discussed before and model building and scoring are both done in-database. Oracle Text functions are integrated to allow text mining algorithms to take advantage of them. Additionally, because ODM mines star schema data it can handle an unlimited number of input attributes, transactional data and unstructured data such as CLOBs, tables or views.

This release takes the preview GUI I discussed last time and officially releases it. This new GUI is an extension to SQL Developer 3.0 (which is available for free and downloaded by millions of SQL/database people). The “Classic” interface (wizard-based access to the APIs) is still available but the new interface is much more in line with the state of the art as far as analytic tools go.

BTW, the correct link to: First Look – Oracle Data Mining. (Taylor’s post last year on Oracle Data Mining.)

For all the buzz about NoSQL, topic map mavens should be aware of the near universal footprint of SQL and prepare accordingly.

JT on EDM

Filed under: Data Mining,Decision Making — Patrick Durusau @ 7:17 pm

JT on EDM – James Taylor on Everything Decision Management

From the about page:

James Taylor is a leading expert in Decision Management and an independent consultant specializing in helping companies automate and improve critical decisions. Previously James was a Vice President at Fair Isaac Corporation where he developed and refined the concept of enterprise decision management or EDM. Widely credited with the invention of the term and the best known proponent of the approach, James helped create the Decision Management market and is its most passionate advocate.

James has 20 years experience in all aspects of the design, development, marketing and use of advanced technology including CASE tools, project planning and methodology tools as well as platform development in PeopleSoft’s R&D team and consulting with Ernst and Young. He has consistently worked to develop approaches, tools and platforms that others can use to build more effective information systems.

Another mainstream IT/data site that you would do well to read.
