Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 13, 2013

Automated compound classification using a chemical ontology

Filed under: Cheminformatics,Ontology — Patrick Durusau @ 8:12 pm

Automated compound classification using a chemical ontology by Claudia Bobach, Timo Böhme, Ulf Laube, Anett Püschel and Lutz Weber. (Journal of Cheminformatics 2012, 4:40 doi:10.1186/1758-2946-4-40)

Abstract:

Background

Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds.

Results

In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships.

Conclusions

A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemistry structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.

While creating an ontology to assist with compound classification, the authors concede the literature contains much semantic diversity:

Chemists use a variety of expressions to create compound class terms from a specific compound name – for example “backbone”, “scaffold”, “derivative”, “compound class” are often used suffixes or “substituted” is a common prefix that generates a class term. Unfortunately, the meaning of different chemical class terms is often not defined precisely and their usage may differ significantly due to historic reasons and depending on the compound class. For example, 2-ethyl-imidazole 1 belongs without doubt to the class of compounds having a imidazole scaffold, backbone or being an imidazole derivative or substituted imidazole. In contrast, pregnane 2 illustrates a more complicated case – as in case of 2-ethyl-imidazole this compound could be considered a 17-ethyl-derivative of the androstane scaffold 3. However, this would suggest a wrong compound classification as pregnanes are not considered to be androstane derivatives – although 2 contains androstane 3 as a substructure (Figure 1). This particular, structurally illogical naming convention goes back to the fundamentally different biological activities of specific compounds with a pregnane or androstane backbone, resulting in the perception that androstanes and pregnanes do not show a parent–child relation but are rather sibling concepts at the same hierarchical level. Thus, any expert chemical ontology will appreciate this knowledge and the androstane compound class structural definition needs to contain a definition that any androstane shall NOT contain a carbon substitution at the C-17 position. (emphasis added)

Not that present-day researchers would create a structurally illogical naming convention in the view of future researchers.
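To make the paper's structure-based class logic concrete, here is a minimal sketch in Python using RDKit's SMARTS matching. The patterns and the class rule are simplified illustrations of the OR/NOT idea, not the definitions from the paper.

```python
from rdkit import Chem

def matches(mol, smarts):
    """True if mol contains the SMARTS pattern as a substructure."""
    return mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))

# Hypothetical class rule in the spirit of the paper: the positive part of a
# class is an OR over one or more SMARTS patterns; NOT patterns exclude
# compounds that should fall into a sibling class instead. Here, a member of
# the "imidazole (parent)" class must contain the imidazole ring and must NOT
# carry a carbon substituent at the 2-position.
POSITIVE = ["c1cnc[nH]1"]          # imidazole ring (OR over this set)
EXCLUDE  = ["[nH]1c(-[#6])ncc1"]   # imidazole with a carbon at C-2 (NOT)

def in_parent_imidazole_class(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (any(matches(mol, p) for p in POSITIVE)
            and not any(matches(mol, p) for p in EXCLUDE))

print(in_parent_imidazole_class("c1cnc[nH]1"))    # imidazole itself   -> True
print(in_parent_imidazole_class("CCc1ncc[nH]1"))  # 2-ethyl-imidazole  -> False
```

The androstane rule quoted above works the same way: a positive substructure definition plus a NOT clause excluding any carbon substitution at the C-17 position.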

Rise And Fall Of Computer Languages In 2012

Filed under: Functional Programming,Programming — Patrick Durusau @ 8:12 pm

Rise And Fall Of Computer Languages In 2012 by Andrew Binstock.

From the post:

The most recent processor phenomenon — the transition from the multicore to many-core era — was expected to set the stage for the emergence of functional languages, which fit well with concurrent programming. But most surveys from 2012 still show no major breakthrough. If a functional language does separate from the pack, the leading candidates are Scala and Clojure, with Scala enjoying the greater adoption right now. This per Ohloh’s language figures, which cover all open source projects, and Google trends, which indicate search traffic. On the venerable Tiobe index, which tracks the number of Web pages that mention a given language, Haskell, Erlang, and Scala are effectively tied and ahead of Clojure.

The full story appears at Dr. Dobb’s.

Other reasons to prefer functional languages/approaches (sketched in code below the list):

  • Creation of auditable merging.
  • Re-use of subject representatives (not everyone has the same merging criteria).
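Here is a minimal sketch, in Python rather than Scala or Clojure, of what both points might look like in practice. The data shapes and the merge rule are assumptions for illustration, not any existing topic map API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Representative:
    """Immutable subject representative: identifiers plus property pairs."""
    identifiers: FrozenSet[str]
    properties: FrozenSet[Tuple[str, str]]

def merge(a: Representative, b: Representative):
    """Pure merge: no mutation, same inputs always give the same output,
    and an audit record is returned alongside the result."""
    merged = Representative(a.identifiers | b.identifiers,
                            a.properties | b.properties)
    audit = {
        "rule": "merge on shared identifier",
        "evidence": sorted(a.identifiers & b.identifiers),
        "inputs": (a, b),
        "output": merged,
    }
    return merged, audit

x = Representative(frozenset({"urn:x"}), frozenset({("name", "X")}))
y = Representative(frozenset({"urn:x", "urn:y"}), frozenset({("name", "X prime")}))
result, record = merge(x, y)
print(record["evidence"])  # ['urn:x'] -- why the merge happened
```

Because the representatives are immutable and the merge is pure, consumers with different merging criteria can apply their own merge functions to the same representatives, and every merge leaves an audit record behind.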

Visualizing the Transfer Market of Europe’s Top Football Leagues

Filed under: Graphics,Visualization — Patrick Durusau @ 8:10 pm

Visualizing the Transfer Market of Europe’s Top Football Leagues by Andrew Vande Moere.

From the post:

European Football Transfer Tool [signal-noise.co.uk], developed by information design studio Signal | Noise, is a clever online data visualization of all the transfers throughout Europe’s top football leagues, such as the “Premier League” of England, “Serie A” of Italy, “Eredivisie” of The Netherlands, “Primera Liga” of Spain, Germany’s “Bundesliga”, and several more.

A very interesting visualization technique. One that should be applicable to other “transfer” situations.

For example, the “transfer market” between CS graduate programs.

Or the “transfer market” between former agency officials and their employers (or vice-versa).

Apache Pig 0.10.1 Released

Filed under: Hadoop,Pig — Patrick Durusau @ 8:10 pm

Apache Pig 0.10.1 Released by Daniel Dai.

From the post:

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Time to update your Pig installation!

Taming Text is released!

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 8:09 pm

Taming Text is released! by Mike McCandless.

From the post:

There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
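If character n-grams are new to you, the core idea fits in a few lines of Python (my sketch, not code from the book):

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("taming", 3))
# ['tam', 'ami', 'min', 'ing']
```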

You can see:

  • Table of Contents
  • Sample chapter 1
  • Sample chapter 8
  • Source code (98 MB)

Or, you can do as I did: grab the source code and order the eBook (PDF) version of Taming Text.

More comments to follow!

January 12, 2013

Introduction to the Legislative Process in the U.S. Congress

Filed under: Government,Law — Patrick Durusau @ 7:07 pm

Introduction to the Legislative Process in the U.S. Congress from Full Text Reports….

The report: Introduction to the Legislative Process in the U.S. Congress (PDF)

From the post:

This report introduces the main steps through which a bill (or other item of business) may travel in the legislative process, from introduction to committee and floor consideration to possible presidential consideration. However, the process by which a bill can become law is rarely predictable and can vary significantly from bill to bill. In fact, for many bills, the process will not follow the sequence of congressional stages that are often understood to make up the legislative process. This report presents a look at each of the common stages through which a bill may move, but complications and variations abound in practice.

Throughout, the report provides references to a variety of other CRS reports that focus on specific elements of congressional procedure. CRS also has many other reports not cited herein that address some procedural issues in additional detail (including congressional budget and appropriations processes). These reports are organized by subject matter on the Congressional Operations portion of the CRS webpage, a link to which is on the main CRS homepage, but can also be found at http://crs.gov/analysis/Pages/CongressionalOperations.aspx.

Congressional action on bills is typically planned and coordinated by party leaders in each chamber, though as described in this report, majority party leaders in the House have more tools with which to set the floor agenda than do majority party leaders in the Senate. In both chambers, much of the policy expertise resides in the standing committees, panels of Members who typically take the lead in developing and assessing proposed legislation within specified policy jurisdictions.

The report is most accurate as a guide to the explicit steps in the legislative process in the U.S. Congress.

But those explicit steps are only pale reflections of the social dynamics and self-interest that drive the inputs into the legislative process.

Transparency of the fiscal cliff legislation would have to start with the relationships between senators, lobbyists and vested interests long before the agreements on tax benefits in the summer of 2012.

And trace those relationships and interactions up to and through the inclusion of those benefits in the fiscal cliff legislation.

Publishing the formal steps in that process is like a magician’s redirection of your attention.

You’re looking at the wrong time and for the wrong information.

Advanced Power Searching [January 23, 2013]

Filed under: Search Engines,Searching — Patrick Durusau @ 7:07 pm

Advanced Power Searching

From the post:

Advanced Power Searching with Google begins on January 23, 2013!

Register now to sharpen your research skills and strengthen your use of advanced Google search techniques to answer complex questions. Throughout this course you’ll also:

  • Take your search strategies to a new level with sophisticated, independent search challenges.
  • Join a community of Advanced Searchers working together to solve search challenges.
  • Pose questions to Google search experts live in Hangouts and through a course forum.
  • Receive an Advanced Power Searching certificate upon completion.

Not sure if you’re ready for Advanced Power Searching? Brush up on your search skills by visiting the Power Searching with Google course.

Topic maps help keep found information found but you have to find it first. 😉

Enjoy!

13 Things People Hate about Your Open Source Docs

Filed under: Documentation,Open Source,Software — Patrick Durusau @ 7:06 pm

13 Things People Hate about Your Open Source Docs by Andy Lester.

From the post:

1. Lacking a good README or introduction

2. Docs not available online

3. Docs only available online

4. Docs not installed with the package

5. Lack of screenshots

6. Lack of realistic examples

7. Inadequate links and references

8. Forgetting the new user

9. Not listening to the users

10. Not accepting user input

11. No way to see what the software does without installing it

12. Relying on technology to do your writing

13. Arrogance and hostility toward the user

See Andy’s post for the details and suggestions on ways to improve.

Definitely worth a close read!

Schemaless Data Structures

Filed under: Data Structures,Database,Schema — Patrick Durusau @ 7:05 pm

Schemaless Data Structures by Martin Fowler.

From the first slide:

In recent years, there’s been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin points out that “schemaless” does not mean the lack of a schema but rather the lack of an explicit schema.
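A small Python illustration of the distinction (mine, not Martin's): the "schemaless" record still has a schema, it just lives in the code that reads it rather than in a declared structure.

```python
# Implicit schema: nothing declares these field names or types; the
# expectations live only in the code that happens to read the record.
order = {"id": 42, "customer": "acme", "total": "19.99"}
print(float(order["total"]))  # the reader silently carries the schema

# Explicit schema: the same expectations written down where readers
# (and tools) can see them.
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    customer: str
    total: float

print(Order(id=42, customer="acme", total=19.99))
```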

Sounds a great deal like the implicit subjects that topic maps can make explicit.

Is there a continuum of explicitness for any given subject/schema?

It might start from an entirely implied schema, move to an explicit representation, then to further explication as in a data dictionary, and, at some distance from the start, arrive at a subject defined as a set of properties, themselves defined as sets of properties, in relationships with other sets of properties.

How far you go down that road depends on your requirements.

JUnit Rule for ElasticSearch

Filed under: ElasticSearch,Solr — Patrick Durusau @ 7:02 pm

JUnit Rule for ElasticSearch by Florian Hopf.

From the post:

While I am using Solr a lot in my current engagement I recently started a pet project with ElasticSearch to learn more about it. Some of its functionality is rather different from Solr so there is quite some experimentation involved. I like to start small and implement tests if I like to find out how things work (see this post on how to write tests for Solr).

ElasticSearch internally uses TestNG and the test classes are not available in the distributed jar files. Fortunately it is really easy to start an ElasticSearch instance from within a test so it’s no problem to do something similar in JUnit. Felix Müller posted some useful code snippets on how to do this, obviously targeted at a Maven build. The ElasticSearch instance is started in a setUp method and stopped in a tearDown method:

Information about testing Solr and ElasticSearch is too useful to pass up.

Besides, it reminded me of the need to have testable merging instances, both for TMDM merging as well as more complex merging scenarios.

Volt University

Filed under: VoltDB — Patrick Durusau @ 7:01 pm

Volt University

From the homepage:

Volt University is designed to inspire and enable the art of disruption. It gives enterprise and independent developers worldwide the insight, tools, and best practices they need to build applications never before imagined, applications that ingest, analyze, and act on incredibly large volumes of data with real-time speed. This is the power of VoltDB – the fully durable in-memory database that combines high-velocity data ingestion with real-time data analytics and decisioning to turn imagination into reality.

Led by VoltDB’s own engineering organization, Volt University provides customers, partners, and members of the entire VoltDB Community with a vast portfolio of instructional content, classes, tools, and other resources. The curriculum and supporting material range from beginner to advanced, giving developers at all levels the practical knowledge and support they need to build whatever application they can envision.

Formal classes and certification aren’t free but:

Volt University Online – VoltDB delivers a wealth of information and educational content to the VoltDB Community through its Volt University Online offering. From live monthly webcasts and on-demand “how to” videos to white papers, tutorials, demonstrations, and code samples, VoltDB users have a significant library of material to draw from for inspiration and instruction as they design and build high velocity applications on VoltDB. Content is available free of charge for all members of the VoltDB Community – simply click here to access (no form) and start building!

Re-post and say nice things about VoltDB. This sort of behavior should be encouraged.

I first saw this at: VoltDB Launches Volt University.

The Xenbase literature curation process

Filed under: Bioinformatics,Curation,Literature — Patrick Durusau @ 7:01 pm

The Xenbase literature curation process by Jeff B. Bowes, Kevin A. Snyder, Christina James-Zorn, Virgilio G. Ponferrada, Chris J. Jarabek, Kevin A. Burns, Bishnu Bhattacharyya, Aaron M. Zorn and Peter D. Vize.

Abstract:

Xenbase (www.xenbase.org) is the model organism database for Xenopus tropicalis and Xenopus laevis, two frog species used as model systems for developmental and cell biology. Xenbase curation processes centre on associating papers with genes and extracting gene expression patterns. Papers from PubMed with the keyword ‘Xenopus’ are imported into Xenbase and split into two curation tracks. In the first track, papers are automatically associated with genes and anatomy terms, images and captions are semi-automatically imported and gene expression patterns found in those images are manually annotated using controlled vocabularies. In the second track, full text of the same papers are downloaded and indexed by a number of controlled vocabularies and made available to users via the Textpresso search engine and text mining tool.

Which curation workflow will work best for your topic map activities will depend upon a number of factors.

What would you adopt, adapt or alter from the curation workflow in this article?

How would you evaluate the effectiveness of any of your changes?

Manual Alignment of Anatomy Ontologies

Filed under: Alignment,Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 7:00 pm

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)

Abstract:

Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.

Database URL: http://purl.obolibrary.org/obo/hao/2012-07-18/arthropod-mappings.obo.

A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.

SDDC And The Elephant In the Room

Filed under: Data Silos,SDDC,Virtualization — Patrick Durusau @ 6:59 pm

SDDC And The Elephant In the Room by Chuck Hollis.

From the post:

Like many companies, we at EMC start our new year with a leadership gathering. We gather to celebrate, connect, strategize and share. They are *always* great events.

I found this year’s gathering was particularly rewarding in terms of deep content. The majority of the meeting was spent unpacking the depth behind the core elements of EMC’s strategy: cloud, big data and trust.

We dove in from a product and technology perspective. We came at it from a services view. Another take from a services and skills viewpoint. And, finally, the organizational and business model implications.

For me, it was like a wonderful meal that just went on and on. Rich, detailed and exceptionally well-thought out — although your head started to hurt after a while.

Underlying much of the discussion was the central notion of a software-defined datacenter (SDDC for short), representing the next generation of infrastructure and operational models. All through the discussion, that was clearly the conceptual foundation for so much of what needed to happen in the industry.

And I started to realize we still have a lot of explaining to do: not only around the concepts themselves, but what they mean to IT groups and the organizations they support.

I’ve now had some time to think and digest, and I wanted to add a few different perspectives to the mix.

The potential of software-defined datacenters (SDDC) comes across loud and clear in Chuck’s post. Particularly for ad-hoc integration of data sources for new purposes.

But then I remembered: silos aren’t built by software. Silos are built by users, and software is just a means for building them.

Silos won’t become less frequent because of software-defined datacenters unless users stop building silos.

There will be the potential for fewer silos, and perhaps more pressure on users to build fewer of them, but that is no guarantee.

Even a subject-defined datacenter (SubDDC) cannot guarantee an absence of silos.

A SubDDC that defines subjects in its data, structures and software offers a chance to move across silo barriers.

How much of a chance depends on its creator and the return from crossing those barriers.

What’s New in Cassandra 1.2 (Notes)

Filed under: Cassandra,Clustering (servers),CQL - Cassandra Query Language — Patrick Durusau @ 6:59 pm

What’s New in Cassandra 1.2

From the description:

Apache Cassandra Project Chair, Jonathan Ellis, looks at all the great improvements in Cassandra 1.2, including Vnodes, Parallel Leveled Compaction, Collections, Atomic Batches and CQL3.

There is only so much you can cover in an hour but Jonathan did a good job of hitting the high points of virtual nodes (rebuild failed drives/nodes faster), atomic batches (fewer requirements on clients, new default btw), CQL improvements, and tracing.

Enough to make you interested in running (not watching) the examples plus your own.

The slides: http://www.slideshare.net/DataStax/college-credit-whats-new-in-apache-cassandra-12

Cassandra homepage.

CQL 3 Language Reference.

January 11, 2013

RavenDB 2.0…

Filed under: RavenDB — Patrick Durusau @ 7:38 pm

RavenDB 2.0 Is Out: Over 6 Months of Features, Improvements, and Bug Fixes by Alex Popescu.

Alex posted about the RavenDB 2.0 release and dug up older information that listed the interesting features for RavenDB 2.0.

Substantial number of improvements. See Alex’s post for the details.

Re-Introducing Page Description Diagrams

Filed under: Design,Interface Research/Design,Usability,Users — Patrick Durusau @ 7:37 pm

Re-Introducing Page Description Diagrams by Colin Butler and Andrew Wirtanen.

From the post:

There’s no such thing as a “standard” client or project in a typical agency setting, because every business has its own specific goals—not to mention the goals of its users. Because of this, we’re constantly seeking ways to improve our processes and better meet the needs of our clients, regardless of their unique characteristics.

Recently, we discovered the page description diagram (PDD), a method for documenting components without specifying layout. At first, it seemed limited, even simplistic, relative to our needs. But with some consideration, we began to understand the value. We started looking at whether or not PDDs could help us improve our process.

As it turns out, these things have been around for quite a while. Dan Brown devised them way back in 1999 as a way to communicate information architecture to a client in a way that addressed some of his primary issues with wireframes. Those issues were that, looking at wireframes, clients would form expectations prematurely and that designers would be limited in their innovation by a prescribed layout. Brown’s approach was to remove layout entirely, providing priority instead. Each component of a page would be described in terms of the needs it met and how it met those needs, arranged into three priority columns with wireframe-like examples when necessary. …

Because of its UI context, I originally read this post as a means of planning interfaces.

But on reflection, the same questions of “needs to meet” and “how to meet those needs” apply equally to topics, associations and occurrences.

Users should be encouraged to talk through their expectations for what information comes together, in what order and how they will use it.

As opposed to focusing too soon on questions of how a topic map architecture will support those capabilities.

Interesting technical questions, but not nearly as interesting, for users at any rate, as their information needs.

The post also cites a great primer on Page Description Diagrams.

EU – Law-Related Authority Files

Filed under: Authority Record,EU,Vocabularies — Patrick Durusau @ 7:37 pm

The EU Data Portal has a number of law-related authority files:

I first saw these at: New EU Data Portal links to several law-related authority files.

Legal Informatics Glossary of Terms

Filed under: Glossary,Legal Informatics — Patrick Durusau @ 7:36 pm

Legal Informatics Glossary of Terms by Grant Vergottini.

From the post:

I work with people from around the world on matters relating to legal informatics. One common issue we constantly face is the issue of terminology. We use many of the same terms, but the subtlety of their definitions ends up causing no end of confusion. To try and address this problem, I’ve proposed a number of times that we band together to define a common vocabulary, and when we can’t arrive at that, at least we can understand the differences that exist amongst us.

To get the ball rolling, I have started a wiki on GitHub and populated it with many of the terms I use in my various roles. Their definitions are a work-in-progress at this point. I am refining them as I find the time. However, rather than trying to build my own private vocabulary, I would like this to be a collaborative effort. To that end, I am inviting anyone with an interest in this to help build out the vocabulary by adding your own terms with definitions to the list and improving the ones I have started.

My legal informatics glossary of terms can be found in my public legal Informatics project at:

https://github.com/grantcv1/Legal-Informatics/wiki/Glossary

Now there is a project that sounds like a topic map.

I first saw this at: Vergottini: Legal Informatics Glossary of Terms.

Probability Theory — A Primer

Filed under: Mathematics,Probability — Patrick Durusau @ 7:36 pm

Probability Theory — A Primer by Jeremy Kun.

From the post:

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

Jeremy’s “primer” posts make good background reading. (A primers listing.)

Work through them carefully for best results.
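As a reminder of the finite setting the primer starts from, the basic objects can be written in a few lines (standard definitions, summarized here rather than quoted from Jeremy's post):

```latex
% A finite (measure-free) probability space, an event, and expectation
\Omega \;\text{finite}, \qquad
p : \Omega \to [0,1], \qquad \sum_{\omega \in \Omega} p(\omega) = 1

\Pr(E) = \sum_{\omega \in E} p(\omega)
  \quad\text{for an event } E \subseteq \Omega

X : \Omega \to \mathbb{R} \ \text{(a random variable)}, \qquad
\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega)\, p(\omega)
```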

Solr vs ElasticSearch: Part 5 – Management API Capabilities

Filed under: ElasticSearch,Search Engines,Searching,Solr — Patrick Durusau @ 7:35 pm

Solr vs ElasticSearch: Part 5 – Management API Capabilities by Rafał Kuć.

From the post:

In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregations possibilities. However, till now we have not discussed any of the administration and management options and things you can do on a live cluster without any restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer.

Rafał continues this excellent series on Solr and ElasticSearch and promises there is more to come!

This series sets a high standard for posts comparing search capabilities!
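Both engines expose their management functions over plain HTTP, so you can experiment from any client. A minimal Python sketch against a local node of each, assuming default ports and endpoints, and covering only a tiny slice of what Rafał discusses:

```python
import requests

# Elasticsearch: cluster health from the management API
es = requests.get("http://localhost:9200/_cluster/health").json()
print(es["status"], es["number_of_nodes"])

# Solr: core status from the CoreAdmin API
solr = requests.get("http://localhost:8983/solr/admin/cores",
                    params={"action": "STATUS", "wt": "json"}).json()
print(list(solr["status"].keys()))  # names of the cores on this node
```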

Getting Started with VM Depot

Filed under: Azure Marketplace,Cloud Computing,Linux OS,Microsoft,Virtual Machines — Patrick Durusau @ 7:35 pm

Getting Started with VM Depot by Doug Mahugh.

From the post:

Do you need to deploy a popular OSS package on a Windows Azure virtual machine, but don’t know where to start? Or do you have a favorite OSS configuration that you’d like to make available for others to deploy easily? If so, the new VM Depot community portal from Microsoft Open Technologies is just what you need. VM Depot is a community-driven catalog of preconfigured operating systems, applications, and development stacks that can easily be deployed on Windows Azure.

You can learn more about VM Depot in the announcement from Gianugo Rabellino over on Port 25 today. In this post, we’re going to cover the basics of how to use VM Depot, so that you can get started right away.

Doug outlines simple steps to get you rolling with the VM Depot.

Sounds a lot easier than trying to walk casual computer users through installation and configuration of software. I assume you could even load data onto the VMs.

Users just need to fire up the VM and they have the interface and data they want.

Sounds like a nice way to distribute topic map based information systems.

Critical Ruby On Rails Issue Threatens 240,000 Websites [Ruby TMs Beware]

Filed under: Ruby,Topic Map Software — Patrick Durusau @ 7:34 pm

Critical Ruby On Rails Issue Threatens 240,000 Websites by Mathew J. Schwartz.

From the post:

All versions of the open source Ruby on Rails Web application framework released in the past six years have a critical vulnerability that an attacker could exploit to execute arbitrary code, steal information from databases and crash servers. As a result, all Ruby users should immediately upgrade to a newly released, patched version of the software.

That warning was sounded Tuesday in a Google Groups post made by Aaron Patterson, a key Ruby programmer. “Due to the critical nature of this vulnerability, and the fact that portions of it have been disclosed publicly, all users running an affected release should either upgrade or use one of the work arounds immediately,” he wrote. The patched versions of Ruby on Rails (RoR) are 3.2.11, 3.1.10, 3.0.19 and 2.3.15.

As a result, more than 240,000 websites that use Ruby on Rails Web applications are at risk of being exploited by attackers. High-profile websites that employ the software include Basecamp, Github, Hulu, Pitchfork, Scribd and Twitter.

Ruby developers will already be aware of this issue but if you have Ruby-based topic map software, you may not have an in-house Ruby developer.

The major players in the Ruby community are concerned so it’s time to ask someone to look at any Ruby software, topic maps or not, that you are running.


If you are interested in the details, see: Analysis of Rails XML Parameter Parsing Vulnerability.

At its heart, a subject identity issue.

If symbol and yaml types had defined properties/values (or value ranges) as part of their “identity,” then other routines could reject instances that do not meet a “safe” identity test.

But because instances are treated as having primitive identities, what gets injected is what you get (WGIIWY).
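The same class of problem shows up anywhere serialized input is allowed to name arbitrary types. A Python analogue, not the Rails fix itself: yaml.load will construct whatever type the document names, while yaml.safe_load restricts the “identities” it accepts to plain data.

```python
import yaml

doc = """
reminder: !!python/object/apply:os.system ["echo pwned"]
"""

# Unsafe: the full loader resolves the python/object tag and executes code.
# yaml.load(doc, Loader=yaml.UnsafeLoader)   # never do this with untrusted input

# Safe: only plain scalars, lists and mappings are accepted; anything else
# fails the "identity test" and is rejected.
try:
    yaml.safe_load(doc)
except yaml.YAMLError as err:
    print("rejected:", err)
```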

Javascript Plugins To Handle Keyboard Events – 18 Items

Filed under: Interface Research/Design,Javascript,JQuery — Patrick Durusau @ 7:33 pm

Javascript Plugins To Handle Keyboard Events – 18 Items by Bogdan Sandu.

From the post:

Users want to see pages really quickly and avoid scrolling a site too much or using the mouse for various events that can be done easier in another way. In order to increase the functionality of a site many web designers use keyboard events so that the users’ experience on the site is better and more enjoyable by navigating easier and seeing more content faster. Of course, this is not the only reason why a web designer would add a jQuery plugin to handle keyboard events on a site, there are others.

Eighteen plugins that can help you with keyboard events in a web interface.

I think the case is fairly compelling for keyboard shortcuts but I type in my sleep. 😉

Your mileage, and that of your users, may vary.

Test with users, deploy and listen to user feedback.

(The opposite of insiders designing, then deploying, and discarding user feedback.)

Starting Data Analysis with Assumptions

Filed under: Data Analysis,Data Mining,Data Models — Patrick Durusau @ 7:33 pm

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore-even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the patterns in the data, which is being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews with a few taxi drivers would have dispelled the original assumption of high demand for taxis. They would also have led to the cause of the patterns Senn recognized.

That is, the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.

January 10, 2013

Lost: House Floor Record for 1 January 2013. If found please call…

Filed under: Government,Government Data — Patrick Durusau @ 1:49 pm

U.S. House of Representatives floor proceedings for the 109th Congress, 1st Session (2005) to 113th Congress, 1st Session-to-Date (2013) are now available for download in XML. (House Floor Activities Download)

One obvious test of the data: the House vote on the “fiscal cliff” legislation.

In fact, the Clerk of the House for January 01, 2013, has posted a web version of that day.

Question: If you download 112th Congress, 2nd Session (2012), will you find the vote on the “fiscal cliff” legislation?

Answer: No!

The entire legislative day in the House of Representatives is missing from the 112th Congress, 2nd Session (2012) file.

See for yourself: I have uploaded the 112th Congress, 2nd Session (2012) and the Clerk of the House of Representatives file for January 1, 2013, in the file: Missing1January2013.

Search for: “On motion that the House agree to the Senate amendments Agreed to by recorded vote: 257 – 167” in HDoc-112-2-FloorProceedings.xml. (112th Congress, 2nd Session).
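If you want to repeat the check yourself, a minimal script will do (assuming HDoc-112-2-FloorProceedings.xml is in your working directory):

```python
# Needle from the vote description; if punctuation differs in the raw XML,
# this shorter fragment is enough to locate the roll call.
needle = "On motion that the House agree to the Senate amendments"

with open("HDoc-112-2-FloorProceedings.xml", encoding="utf-8") as f:
    text = f.read()

print("found" if needle in text else "missing")
```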

Typos and errors happen all the time. To everyone. But missing an entire day is more than just a typo. It indicates a lack of concern for quality control.

App-lifying USGS Earth Science Data

Filed under: Challenges,Contest,Data,Geographic Data,Science — Patrick Durusau @ 1:49 pm

App-lifying USGS Earth Science Data

Challenge Dates:

Submissions: January 9, 2013 at 9:00am EST – Ends April 1, 2013 at 11:00pm EDT.

Public Voting: April 5, 2013 at 5:00pm EDT – Ends April 25, 2013 at 11:00pm EDT.

Judging: April 5, 2013 at 5:00pm EDT – Ends April 25, 2013 at 11:00pm EDT.

Winners Announced: April 26, 2013 at 5:00pm EDT.

From the webpage:

USGS scientists are looking for your help in addressing some of today’s most perplexing scientific challenges, such as climate change and biodiversity loss. To do so requires a partnership between the best and the brightest in Government and the public to guide research and identify solutions.

The USGS is seeking help via this platform from many of the Nation’s premier application developers and data visualization specialists in developing new visualizations and applications for datasets.

USGS datasets for the contest consist of a range of earth science data types, including:

  • several million biological occurrence records (terrestrial and marine);
  • thousands of metadata records related to research studies, ecosystems, and species;
  • vegetation and land cover data for the United States, including detailed vegetation maps for the National Parks; and
  • authoritative taxonomic nomenclature for plants and animals of North America and the world.

Collectively, these datasets are key to a better understanding of many scientific challenges we face globally. Identifying new, innovative ways to represent, apply, and make these data available is a high priority.

Submissions will be judged on their relevance to today’s scientific challenges, innovative use of the datasets, and overall ease of use of the application. Prizes will be awarded to the best overall app, the best student app, and the people’s choice.

Of particular interest for the topic maps crowd:

Data used – The app must utilize a minimum of 1 DOI USGS Core Science and Analytics (CSAS) data source, though they need not include all data fields available in a particular resource. A list of CSAS databases and resources is available at: http://www.usgs.gov/core_science_systems/csas/activities.html. The use of data from other sources in conjunction with CSAS data is encouraged.

CSAS has a number of very interesting data sources. Classifications, thesauri, data integration, metadata and more.

Winning the contest earns you recognition and bragging rights, not to mention visibility for your approach.

Machine Learning Throwdown: The Reckoning

Filed under: Machine Learning — Patrick Durusau @ 1:49 pm

Machine Learning Throwdown: The Reckoning by Charles Parker.

From the post:

As you, our faithful readers know, we compared some machine learning services several months ago in our machine learning throwdown. In another recent blog post, we talked about the power of ensembles, and how your BigML models can be made into an even more powerful classifier when many of them are learned over samples of the data. With this in mind, we decided to re-run the performance tests from the fourth throwdown post using BigML ensembles as well as single BigML models.

You can see the results in an updated version of the throwdown details file. As you’ll be able to see, the ensemble of classifiers (Bagged BigML Classification/Regression Trees) almost always outperform their solo counterparts. In addition, if we update our “medal count” table tracking the competition among our three machine learning services, we see that the BigML ensembles now lead in the number of “wins” over all datasets:

Charles continues his comparison of machine learning services.

Charles definitely has a position. 😉

On the other hand, the evidence suggests a close look at your requirements, data and capabilities before defaulting to one solution or another.
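If you want to see the bagging-versus-single-tree pattern on your own data rather than through any of the hosted services, here is a minimal scikit-learn sketch (my example, not Charles’s code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```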

Les Misérables [Visualized]

Filed under: Graphics,Literature,Visualization — Patrick Durusau @ 1:48 pm

Novel Views: 4 Static Data Visualizations of the Novel Les Misérables by Andrew Vande Moere.

From the post:

Novel Views [neoformix.com], developed by Jeff Clark, showcases 4 different visualizations of the text appearing in the novel Les Misérables, which itself spans about 48 books and 365 chapters.

The “Character Mentions” graphic shows where the names of the primary characters are mentioned within the text. The “Radial Word Connections” reveals the connections between the different terms used in the text. The words in the middle are connected using lines of the same color to the chapters where they are used. “Segment Word Clouds” is a small collection of small word clouds, where the size of a word reflects its frequency. Lastly, “Characteristic Verbs” provides an interpretation of the personalities and actions of each character, in that each character is listed with its most common terms and verbs.

Stunning graphics.

In this age of dynamic graphics, I wonder how the depictions would change on a chapter-by-chapter basis.

So a reader could see how their perception of a character is changing as the novel develops?
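The counting step behind a chapter-by-chapter “Character Mentions” view is easy to prototype. A rough Python sketch, where the file name, character list and chapter marker are all assumptions:

```python
import re
from collections import Counter

CHARACTERS = ["Valjean", "Javert", "Cosette", "Fantine", "Marius"]

with open("les_miserables.txt", encoding="utf-8") as f:
    text = f.read()

# Assume chapters are delimited by lines beginning with "CHAPTER".
chapters = re.split(r"\n(?=CHAPTER\b)", text)

for number, chapter in enumerate(chapters, 1):
    counts = Counter({name: chapter.count(name) for name in CHARACTERS})
    print(f"chapter {number}:", counts.most_common(3))
```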

Common Crawl URL Index

Filed under: Common Crawl,Data,WWW — Patrick Durusau @ 1:48 pm

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster , would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

What research project would you want to do first?
