Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 14, 2012

Accelerating Inference: towards a full Language, Compiler and Hardware stack

Filed under: Bayesian Models,Dimple,Factor Graphs,Graphical Models,Inference — Patrick Durusau @ 6:55 am

Accelerating Inference: towards a full Language, Compiler and Hardware stack by Shawn Hershey, Jeff Bernstein, Bill Bradley, Andrew Schweitzer, Noah Stein, Theo Weber, Ben Vigoda.

Abstract:

We introduce Dimple, a fully open-source API for probabilistic modeling. Dimple allows the user to specify probabilistic models in the form of graphical models, Bayesian networks, or factor graphs, and performs inference (by automatically deriving an inference engine from a variety of algorithms) on the model. Dimple also serves as a compiler for GP5, a hardware accelerator for inference.

From the introduction:

Graphical models alleviate the complexity inherent to large dimensional statistical models (the so-called curse of dimensionality) by dividing the problem into a series of logically (and statistically) independent components. By factoring the problem into subproblems with known and simple interdependencies, and by adopting a common language to describe each subproblem, one can considerably simplify the task of creating complex Bayesian models. Modularity can be taken advantage of further by leveraging this modeling hierarchy over several levels (e.g. a submodel can also be decomposed into a family of sub-submodels). Finally, by providing a framework which abstracts the key concepts underlying classes of models, graphical models allow the design of general algorithms which can be efficiently applied across completely different fields, and systematically derived from a model description.

Suggestive of sub-models of merging?
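To make the factorization point concrete, here is a toy Python sketch (not Dimple’s API, and the factor values are made up) that computes a marginal two ways: by brute force over the full joint, and by exploiting the factor structure, the way an inference engine derived from the model description would:

    # Joint over three binary variables factored as p(a, b, c) ∝ f1(a, b) * f2(b, c).
    from itertools import product

    f1 = {(a, b): [[0.9, 0.1], [0.2, 0.8]][a][b] for a in (0, 1) for b in (0, 1)}
    f2 = {(b, c): [[0.7, 0.3], [0.4, 0.6]][b][c] for b in (0, 1) for c in (0, 1)}

    # Brute force: enumerate the full joint, then marginalize b.
    joint = {(a, b, c): f1[a, b] * f2[b, c] for a, b, c in product((0, 1), repeat=3)}
    z = sum(joint.values())
    p_b_brute = [sum(v for (a, b, c), v in joint.items() if b == t) / z for t in (0, 1)]

    # Factored: sum out a and c inside their own factors, then combine (sum-product idea).
    msg_a = [sum(f1[a, b] for a in (0, 1)) for b in (0, 1)]
    msg_c = [sum(f2[b, c] for c in (0, 1)) for b in (0, 1)]
    unnorm = [msg_a[b] * msg_c[b] for b in (0, 1)]
    p_b_factored = [u / sum(unnorm) for u in unnorm]

    print(p_b_brute, p_b_factored)  # the two marginals agree

The saving is trivial for three variables, but it grows exponentially with model size, which is the point of the factor graph representation.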

I first saw this in a tweet from Stefano Bertolo.

Structure and Dynamics of Information Pathways in Online Media

Filed under: Information Flow,Information Theory,Networks,News,Social Networks — Patrick Durusau @ 6:16 am

Structure and Dynamics of Information Pathways in Online Media by Manuel Gomez Rodriguez, Jure Leskovec, Bernhard Schölkopf.

Abstract:

Diffusion of information, spread of rumors and infectious diseases are all instances of stochastic processes that occur over the edges of an underlying network. Many times networks over which contagions spread are unobserved, and such networks are often dynamic and change over time. In this paper, we investigate the problem of inferring dynamic networks based on information diffusion data. We assume there is an unobserved dynamic network that changes over time, while we observe the results of a dynamic process spreading over the edges of the network. The task then is to infer the edges and the dynamics of the underlying network.

We develop an on-line algorithm that relies on stochastic convex optimization to efficiently solve the dynamic network inference problem. We apply our algorithm to information diffusion among 3.3 million mainstream media and blog sites and experiment with more than 179 million different pieces of information spreading over the network in a one year period. We study the evolution of information pathways in the online media space and find interesting insights. Information pathways for general recurrent topics are more stable across time than for on-going news events. Clusters of news media sites and blogs often emerge and vanish in matter of days for on-going news events. Major social movements and events involving civil population, such as the Libyan’s civil war or Syria’s uprise, lead to an increased amount of information pathways among blogs as well as in the overall increase in the network centrality of blogs and social media sites.
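A crude toy, not the paper’s stochastic convex optimization, but it shows the shape of the inference problem: given cascades of activation times (invented here), score candidate edges by how often one node’s activation closely follows another’s:

    from collections import defaultdict

    def score_edges(cascades, window=1.0):
        """cascades: list of dicts mapping node -> activation time."""
        scores = defaultdict(int)
        for times in cascades:
            for i, t_i in times.items():
                for j, t_j in times.items():
                    if i != j and 0 < t_j - t_i <= window:
                        scores[(i, j)] += 1   # i plausibly passed the contagion to j
        return scores

    cascades = [
        {"a": 0.0, "b": 0.4, "c": 0.9},
        {"a": 0.0, "b": 0.3, "d": 1.6},
        {"b": 0.0, "c": 0.5},
    ]
    print(sorted(score_edges(cascades).items(), key=lambda kv: -kv[1]))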

A close reading of this paper will have to wait for the holidays but it will be very near the top of the stack!

Transient subjects anyone?

December 13, 2012

Linked Jazz

Filed under: Linked Data,Music — Patrick Durusau @ 6:55 pm

Linked Jazz

Network display of Jazz artists with a number of display options.

Using Linked Data.

Better network display than I am accustomed to and I know that Lars likes jazz. 😉

I first saw this in a tweet by Christophe Viau.

PS: You may also like the paper: Visualizing Linked Jazz: A web-based tool for social network analysis and exploration.

Optique

Filed under: BigData,Linked Data,Optique — Patrick Durusau @ 6:47 pm

Optique

From the homepage:

Scalable end-user access to Big Data is critical for effective data analysis and value creation. Optique will bring about a paradigm shift for data access:

  • by providing a semantic end-to-end connection between users and data sources;
  • enabling users to rapidly formulate intuitive queries using familiar vocabularies and conceptualisations;
  • seamlessly integrating data spread across multiple distributed data sources, including streaming sources;
  • exploiting massive parallelism for scalability far beyond traditional RDBMSs and thus reducing the turnaround time for information requests to minutes rather than days.

Another new EU data project.

The website reports that the first software will be available towards the end of 2013.

Not much in the way of specifics but it is very early in the project.

Can anyone point me to a public version of their funding application?

I have been given to understand that funding applications have more detail than may appear in public announcements.

PS: I had trouble downloading a presentation by Peter Haase that is cited on the website so when I obtained it, I uploaded a local copy: On Demand Access to Big Data Through Semantic Technologies. (PDF)

I have seen the Linked Data cloud illustration many times. Have you seen it in comparison with the overall data cloud?

Big Graph Data on Hortonworks Data Platform

Filed under: Aurelius Graph Cluster,Faunus,Gremlin,Hadoop,Hortonworks,Titan — Patrick Durusau @ 5:24 pm

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.
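A toy Python sketch of the “wide row as adjacency list” idea behind graph-on-HBase designs such as Titan’s (the data and key layout here are invented; Titan itself is queried through Gremlin, not like this):

    # Each vertex key maps to a "row" whose columns are its outgoing edges.
    rows = {
        "user:okram":   {"created->repo:gremlin": {"since": 2009},
                         "created->repo:titan":   {"since": 2012}},
        "user:stephen": {"created->repo:gremlin": {"since": 2010}},
    }

    def out_neighbors(vertex, label):
        """Scan one wide row and follow edges with the given label."""
        row = rows.get(vertex, {})
        return [col.split("->", 1)[1] for col in row if col.startswith(label + "->")]

    # A two-hop traversal: who co-created a repository with okram?
    repos = out_neighbors("user:okram", "created")
    co_creators = {v for v, row in rows.items()
                   if v != "user:okram" and any(f"created->{r}" in row for r in repos)}
    print(co_creators)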

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

European Data Forum Call for Contribution

Filed under: BigData,Conferences — Patrick Durusau @ 5:14 pm

European Data Forum Call for Contribution

Submissions by 22nd Feb 2013, 02.00pm CET.

Conference: Dublin, Ireland on April 9-10, 2013.

From the call:

The European Data Forum (EDF) is a regularly scheduled (yearly) meeting place for industry, research, policy makers, and community initiatives to discuss the challenges and opportunities of (Big) Data in Europe. These aspects have both a technical (in terms of technology and infrastructure needed to master the volumes, heterogeneity, and dynamicity of Big Data), and a socio-economic component (speaking about emerging types of products and services and their commercialization, innovation and business models, but also policies and regulations).

Our aim is to bring together all stakeholders involved in the data value chain to exchange ideas and develop actionable roadmaps addressing these challenges and opportunities in order to strengthen the European data economy and its positioning worldwide. The roadmaps will be offered as a contribution to the definition of research, development, and policy activities at the level of the European Union institutions and those of its member states.

An equally important goal of the European Data Forum is to create and foster a truly European Big Data community. This emerging community will enable promising ideas to move from the stage of research questions all the way to successful deployment and the acquisition of capital; in the same time, its stakeholders will mutually reinforce commercial strategies that require a forward-looking, dynamic, and well-integrated EU-wide industry and venture capital.

The next edition of the EDF will be held in Dublin, Ireland on April 9-10, 2013. The program will consist of a mixture of presentations and networking sessions by industry, academics, policy makers, and community initiatives on topics ranging from research and technology development, to training and knowledge transfer, and commercialization.

We are seeking inspiring presentations addressing fundamental technical, application, socio-economic and policy-related topics related to Big Data management and analytics. The presentations will vary in format and focus depending on the main expected audience and their contributors. They will be reviewed by the Organization Committee of the Forum according to their relevance to the scope and purpose of the event and the specific topics we foresee to target, which are listed in the following:

Sounds like a great opportunity to meet other “big data” types!

D3 Repulsive

Filed under: D3,Graphics,Humor — Patrick Durusau @ 4:50 pm

D3 Repulsive

An unlikely key sequence that triggers this behavior in a graph interface to a topic map could be quite amusing. 😉

I first saw this in a tweet by Christophe Viau.

D3 Tips and Tricks

Filed under: D3,Graphics — Patrick Durusau @ 4:45 pm

D3 Tips and Tricks (PDF)

Introduction to graphics with D3.

Covers graphs (the sort you see in the male vitality ads, not nodes/arcs) at this point with more promised to follow.

You are going to have to learn the basics and this will get you started.

BTW, follow the blog as well: http://www.d3noob.org/

I first saw this in a tweet by Christophe Viau.

HTML+RDFa 1.1

Filed under: HTML5,RDFa — Patrick Durusau @ 4:32 pm

HTML+RDFa 1.1: Support for RDFa in HTML4 and HTML5

Abstract:

This specification defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules.

Crowdsourcing campaign spending: …

Filed under: Crowd Sourcing,Government Data,Journalism — Patrick Durusau @ 3:43 pm

Crowdsourcing campaign spending: What ProPublica learned from Free the Files by Amanda Zamora.

From the post:

This fall, ProPublica set out to Free the Files, enlisting our readers to help us review political ad files logged with Federal Communications Commission. Our goal was to take thousands of hard-to-parse documents and make them useful, helping to reveal hidden spending in the election.

Nearly 1,000 people pored over the files, logging detailed ad spending data to create a public database that otherwise wouldn’t exist. We logged as much as $1 billion in political ad buys, and a month after the election, people are still reviewing documents. So what made Free the Files work?

A quick backstory: Free the Files actually began last spring as an effort to enlist volunteers to visit local TV stations and request access to the “public inspection file.” Stations had long been required to keep detailed records of political ad buys, but they were only available on paper and required actually traveling to the station.

In August, the FCC ordered stations in the top 50 markets to begin posting the documents online. Finally, we would be able to access a stream of political ad data based on the files. Right?

Wrong. It turns out the FCC didn’t require stations to submit the data in anything that approaches an organized, standardized format. The result was that stations sent in a jumble of difficult to search PDF files. So we decided if the FCC or stations wouldn’t organize the information, we would.

Enter Free the Files 2.0. Our intention was to build an app to help translate the mishmash of files into structured data about the ad buys, ultimately letting voters sort the files by market, contract amount and candidate or political group (which isn’t possible on the FCC’s web site), and to do it with the help of volunteers.

In the end, Free the Files succeeded in large part because it leveraged data and community tools toward a single goal. We’ve compiled a bit of what we’ve learned about crowdsourcing and a few ideas on how news organizations can adapt a Free the Files model for their own projects.

The team who worked on Free the Files included Amanda Zamora, engagement editor; Justin Elliott, reporter; Scott Klein, news applications editor; Al Shaw, news applications developer, and Jeremy Merrill, also a news applications developer. And thanks to Daniel Victor and Blair Hickman for helping create the building blocks of the Free the Files community.

The entire story is golden but a couple of parts shine brighter for me than the others.

Design consideration:

The success of Free the Files hinged in large part on the design of our app. The easier we made it for people to review and annotate documents, the higher the participation rate, the more data we could make available to everyone. Our maxim was to make the process of reviewing documents like eating a potato chip: “Once you start, you can’t stop.”

Let me restate that: The easier it is for users to author topic maps, the more topic maps they will author.

Yes?

Semantic Diversity:

But despite all of this, we still can’t get an accurate count of the money spent. The FCC’s data is just too dirty. For example, TV stations can file multiple versions of a single contract with contradictory spending amounts — and multiple ad buys with the same contract number means radically different things to different stations. But the problem goes deeper. Different stations use wildly different contract page designs, structure deals in idiosyncratic ways, and even refer to candidates and groups differently.

All true, but knowing ahead of time that the semantics vary from station to station, why not map the semantics in each market in advance?

Granted, I second their request to the FCC for standardized data, but having standardized blocks doesn’t mean the information has the same semantics.

The OMB can’t keep the same semantics for a handful of terms in one document.

What chance is there with dozens and dozens of players in multiple documents?

Taming Text [Coming real soon now!]

Filed under: Lucene,Mahout,Solr,Text Mining — Patrick Durusau @ 3:14 pm

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the distinction of having the second longest running MEAP project. (He didn’t say who was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. 😉

Reflective Intelligence and Unnatural Acts

Filed under: LucidWorks,MapR — Patrick Durusau @ 3:07 pm

I wasn’t in the best of shape today but did manage to attend the webinar: Crowd Sourcing Reflected Intelligence Using Search and Big Data.

Not a lot of detail but there were two topics that caught my attention.

The first was “reflective intelligence,” that is, a system that reflects the intelligence of its users back to other users.

Intelligence derived from tracking “clicks,” search terms, etc.

Question: How does your topic map solution “reflect” the intelligence of its users?

That is, how do responses “improve” (by some measure) as a result of user interaction?

It could be measuring user behavior, such as which links users select for particular query terms. (That is an example from the webinar.) Or it could be users adding information, perhaps even suggesting/voting on merges.
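A minimal Python sketch of that first idea, clicks reflected back into the ranking seen by later users (illustrative only, not the pipeline described in the webinar):

    from collections import Counter, defaultdict

    click_log = defaultdict(Counter)   # query -> counts of clicked document ids

    def record_click(query, doc_id):
        click_log[query][doc_id] += 1

    def rerank(query, results, weight=0.5):
        """results: list of (doc_id, base_score); clicks add a popularity boost."""
        clicks = click_log[query]
        total = sum(clicks.values()) or 1
        return sorted(results,
                      key=lambda r: r[1] + weight * clicks[r[0]] / total,
                      reverse=True)

    record_click("topic maps", "doc-42")
    record_click("topic maps", "doc-42")
    record_click("topic maps", "doc-7")
    print(rerank("topic maps", [("doc-7", 0.61), ("doc-42", 0.58), ("doc-3", 0.60)]))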

The second riff that got my attention was a description of the software under discussion as:

“I don’t have to do unnatural acts.”

Is that like the Papa John’s “better ingredients?” Taken to imply that other pizzas use sub-par ingredients?

Or in this case, other software solutions require “unnatural acts?”

Interesting selling point.

What unusual properties would you claim for topic maps or topic map software?

December 12, 2012

Visualising Zeus’s infidelities:…

Filed under: Graphics,Visualization — Patrick Durusau @ 8:54 pm

Visualising Zeus’s infidelities: the Greek god’s affairs as chronicled over the centuries, laid out as a graphic by John Burn-Murdoch.

From the post:

A team of data-visualisation designers have created a fascinating graphic representation of the genealogy of Greek god Zeus. Dozens of authors have chronicled his relationships and offspring, with some of his lovers and children cited consistently, and others mentioned only in one or two accounts. Zeus is represented by the thick black lines, with his siblings and ancestors in the inner circles and offspring in the outer rings. Zeus and his lovers are on the inside of each line, with their children on the outside. Each coloured line refers to one author’s account. Click a name for more details, or an author to view only the affairs they mentioned.

See the post for the graphic, its authors and further information.

Raises all sorts of interesting possibilities doesn’t it?

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

Filed under: Predictive Analytics,Text Analytics — Patrick Durusau @ 8:42 pm

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

From the post:

Predictive Analytics Certificate Program:

This program is designed for professionals who are using or wish to use Predictive Analytics to optimize business performance at a variety of levels. UC Irvine Extension is offering the following webinar and two courses during winter quarter:

Predictive Analytics Special Topic Webinar: Text Analytics & Text Mining (Jan. 15, 11:30 a.m. to 12:30 p.m., PST) – This free webinar will provide participants with the introductory concepts of text analytics and text mining that are used to recognize how stored, unstructured data represents an extremely valuable source of business information.

Course: Effective Data Preparation (Jan. 7 to Feb. 24) – This online course will address how to extract stored data elements, transform their formats, and derive new relationships among them, in order to produce a dataset suitable for analytical modeling. Course instructor Dr. Robert Nisbet, chief scientist at Smogfarm, which studies crowd psychology, will provide attendees with the skills to produce a fully processed data set compatible for building powerful predictive models.

Course: Text Analytics & Text Mining (Jan. 28 to March 24) – This new online course instructed by Dr. Gary Miner, author of Handbook of Statistical Analysis & Data Mining Applications and Practical Text Mining, will focus on basic concepts of textual information including tokenization and part-of-speech tagging. The course will expose participants to practical techniques for text extraction and text mining, document clustering and classification, information retrieval, and the enhancement of structured data.

Just so you know, the webinar is free but Effective Data Preparation and Text Analytics & Text Mining are $695.00 each.

I am always made more curious when the most obvious questions are omitted from an FAQ, or when the information is tucked away in very non-prominent places.

I suspect the courses are well worth the price, but why not be up front with the charges?

d3js/SVG Export demo

Filed under: D3,Graphics,SVG,Visualization — Patrick Durusau @ 8:20 pm

d3js/SVG Export demo

From the post:

d3js is a JavaScript library for manipulating documents based on data. The library enables stunning client-side visualization inside the web browser.

Commonly in science-related websites (and possibly many others), users need to save the generated visualization in vectorized format (e.g. PDF), to be able to incorporate the graphics in presentation or publications.

This website demonstrates one possible method of saving d3js graphics to PDF.

See below for more technical details.

I can’t imagine anyone wanting a static image of a topic map but you never know. 😉

I first saw this in a tweet by Christophe Viau.

ArangoDB

Filed under: ArangoDB,Database,NoSQL — Patrick Durusau @ 8:12 pm

ArangoDB

From the webpage:

A universal open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript/Ruby extensions.

Design considerations:

In a nutshell:

  • Schema-free schemas with shapes: Inherent structures at hand are automatically recognized and subsequently optimized.
  • Querying: ArangoDB is able to accomplish complex operations on the provided data (query-by-example and query-language).
  • Application Server: ArangoDB is able to act as application server on Javascript-devised routines.
  • Mostly memory/durability: ArangoDB is memory-based including frequent file system synchronizing.
  • AppendOnly/MVCC: Updates generate new versions of a document; automatic garbage collection.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written on hard disk.
  • ArangoDB supports single nodes and small, homogenous clusters with zero administration.
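A hedged Python sketch of querying ArangoDB over its HTTP API (this assumes the default port 8529 and the /_api/cursor endpoint; the “users” collection and the exact request fields are illustrative and may differ across versions):

    import json
    import urllib.request

    def run_aql(query, bind_vars=None, host="http://localhost:8529"):
        payload = json.dumps({"query": query, "bindVars": bind_vars or {}}).encode()
        req = urllib.request.Request(host + "/_api/cursor", data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp).get("result", [])

    # Query-by-attribute over a hypothetical "users" collection.
    for doc in run_aql("FOR u IN users FILTER u.age >= @minAge RETURN u", {"minAge": 21}):
        print(doc)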

I have mentioned this before but ran across it again at: An experiment with Vagrant and Neo4J by Patrick Mulder.

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus [December 18, 2012]

Filed under: EU,Thesaurus,Vocabularies — Patrick Durusau @ 7:13 pm

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus

From the post:

EuroVoc 4.4 will be released on December 18, 2012. During this day, the website might be temporarily unavailable.

6,883 thesaurus concepts

This new edition is the result of a thorough revision among other things according to the concepts introduced by the ‘Lisbon Treaty’. It includes 6,883 thesaurus concepts of which 85 concepts are new, 142 have been updated and 28 have been classified as obsolete concepts.

These new concepts are the results of the proposals sent by the librarians from the libraries of the national parliaments in Europe, the European Institutions namely the European Parliament and the users of EuroVoc. All the terms in Portuguese have been revised according to the Portuguese language spelling reform. The prior lexical value remains available as Non-Preferred Terms.

EuroVoc, the EU’s multilingual thesaurus

EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU, the European Parliament in particular. It contains terms in 22 EU languages. It is managed by the Publications Office, which moved forward to ontology-based thesaurus management and semantic web technologies conformant to W3C recommendations as well as latest trends in thesaurus standards.

There are documents prior to this version of the thesaurus and even documents prior to there being a EuroVoc thesaurus at all.

And there will be documents after EuroVoc has been superseded.

Not to mention in between there will be documents that use other vocabularies.

Good thing we have topic maps to use this resource to its best advantage.
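As a small example of the bridging involved, here is a sketch (assuming a SKOS export of EuroVoc and the rdflib library; the file name and the sample label are hypothetical) that maps a now non-preferred term to its current preferred label:

    from rdflib import Graph, Namespace

    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    g = Graph()
    g.parse("eurovoc.rdf")   # local copy of the thesaurus export (hypothetical name)

    def preferred_for(alt_label, lang="en"):
        """Find concepts carrying alt_label as an alternative (non-preferred) label."""
        hits = []
        for concept, _, label in g.triples((None, SKOS.altLabel, None)):
            if str(label) == alt_label and getattr(label, "language", None) == lang:
                for pref in g.objects(concept, SKOS.prefLabel):
                    if getattr(pref, "language", None) == lang:
                        hits.append((concept, str(pref)))
        return hits

    print(preferred_for("some superseded term"))   # any label you expect as an altLabel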

A way station in a sea of semantic currents and drifts.

Identifiers, 404s and Document Security

Filed under: HTML,Identifiers,Security — Patrick Durusau @ 5:28 pm

I was working on a draft about identifiers (using the standard <a> element) when it occurred to me that URLs could play an unexpected role in document security. (At least unexpected by me, your mileage may vary.)

What if I create a document that has URLs like:

<a href="http://server-exists.x/page-does-not.html>text content</a>

So that a user who attempts to follow the link gets a “404” message back.

Why is that important?

What if I am writing HTML pages at a nuclear weapon factory? I would be very interested in knowing if one of my pages had gotten off the reservation so to speak.

The server being asked for a page that deliberately does not exist could route the requester’s contact information on for an appropriate response.
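Here is a sketch of the receiving side, using Python’s standard http.server (the path, port and log format are made up for illustration):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    CANARY_PATH = "/page-does-not.html"

    class CanaryHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == CANARY_PATH:
                # The document containing the link has left the reservation:
                # record where the request came from before answering.
                with open("canary.log", "a") as log:
                    log.write(f"{self.client_address[0]} {self.headers.get('User-Agent', '-')}\n")
            self.send_error(404)   # look like an ordinary missing page either way

    if __name__ == "__main__":
        HTTPServer(("", 8080), CanaryHandler).serve_forever()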

Of course, I would use better names or have pages that load, while transmitting the same contact information.

Or have a very large uuencoded “password” file that burps, bumps and slowly downloads. (Always knew there was a reason to keep a 2400 baud modem around.)

I have suggestions on how to make a non-existent URL work, but will save those for another day.

“Entitlement” Topic Map?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 5:14 pm

One “hot” topic of discussion in the United States has been tax and “entitlement” reform.

Asking because I don’t know: Is anyone working on an “entitlement” topic map?

And because I wanted to make some suggestions on what falls within the definition of an “entitlement” from the government:

  • Payment of agricultural subsidies to non-family agribusinesses.
  • Copyright (limiting it to a reasonable term, say seven years, total, would encourage authors to be more productive and to limit the growth of parasitic families)
  • Patents (making patents apply to non-software inventions only would improve the bottom line at most software companies, particularly the larger ones. Reducing the footprint of their legal cost centers)
  • Criminal enforcement of copyright (let the people making money pay for their own protection racket)
  • Distribution of genetically modified seed to assist Monsanto and similar actors.

And that’s just off the top of my head. There are thousands of others.

I suppose you should include Social Security, Medicare and similar programs.

But if and only if all government “entitlements” are on the table.

Maybe someone already has a list of various Federal “entitlements.” Give a shout if you see it.

December 11, 2012

Developing CODE for a Research Database

Filed under: Entity Extraction,Entity Resolution,Linked Data — Patrick Durusau @ 8:19 pm

Developing CODE for a Research Database by Ian Armas Foster.

From the post:

The fact that there are a plethora of scientific papers readily available online would seem helpful to researchers. Unfortunately, the truth is that the volume of these articles has grown such that determining which information is relevant to a specific project is becoming increasingly difficult.

Austrian and German researchers are thus developing CODE, or Commercially Empowered Linked Open Data Ecosystems in Research, to properly aggregate research data from its various forms, such as PDFs of academic papers and data tables upon which those papers are based, into a single system. The project is in a prototype stage, with the goal being to integrate all forms into one platform by the project’s second year.

The researchers from the University of Passau in Germany and the Know-Center in Graz, Austria explored the challenges to CODE and how the team intends to deal with those challenges in this paper. The goal is to meliorate the research process by making it easier to not only search for both text and numerical data in the same query but also to use both varieties in concert. The basic architecture for the project is shown below.

Stop me if you have heard this one before: “There was this project that was going to disambiguate entities and create linked data….”

I would be the first one to cheer if such a project were successful. But a few paragraphs in a paper, given the long history of entity resolution and its difficulties, aren’t enough to raise my hopes.
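Part of the reason for modest hopes is easy to demonstrate. A toy normalization scheme (the names are invented) both misses variants and over-merges distinct people on a five-name sample:

    import re
    from collections import defaultdict

    def naive_key(name):
        """Lowercase, strip punctuation, reduce given names to initials."""
        parts = re.sub(r"[^\w\s]", "", name.lower()).split()
        if not parts:
            return ""
        return parts[-1] + " " + " ".join(p[0] for p in parts[:-1])

    authors = ["J. Smith", "John Smith", "Smith, John", "J. A. Smith", "Jane Smith"]
    clusters = defaultdict(list)
    for a in authors:
        clusters[naive_key(a)].append(a)
    print(dict(clusters))
    # "Smith, John" keys as "john s" instead of "smith j" (a miss), while
    # "Jane Smith" collides with "John Smith" (an over-merge).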

You?

Big data & data science

Filed under: BigData,Data Science — Patrick Durusau @ 7:58 pm

Big data & data science

A Google+ group moderated by Edd Dumbill that may be of interest to you.

Just skimming the posts (I saw the link today), it looks like very high quality content.

Videos from Coursera’s four week course in R

Filed under: Programming,R — Patrick Durusau @ 7:47 pm

Videos from Coursera’s four week course in R by David Smith.

From the post:

Coursera’s Computing for Data Analysis course on R is now over, with four weeks of free, in-depth training on the R language. While you’ll have to wait for the next installment of the course to participate in the full online learning experience, you can still view the lecture videos, courtesy of course presenter Roger Peng’s YouTube page. The course materials are helpfully organized into four video playlists by week; I’ve embedded each week’s content below with an index to the individual video chapters.

Just in case you missed the class or need a refresher!

The Theoretical Astrophysical Observatory

Filed under: Astroinformatics — Patrick Durusau @ 7:42 pm

The Theoretical Astrophysical Observatory by Darren Croton.

If you guessed from the title that the acronym would be “TAO,” take a point for your house.

This post is not going to be about CANDELS directly, but about work that, in the long run, could play an enormous part in helping CANDELS astronomers analyse and interpret their data.

At Swinburne University in Australia, myself and my group are developing a new tool, called the Theory Astrophysical Observatory (TAO), which will make access to cutting edge supercomputer simulations of galaxy formation almost trivial. TAO will put the latest theory data in to the “cloud” for use by the international astronomy community, plus add a number of science enhancing eResearch tools. It is part of a larger project funded by the Australian Government called the All Sky Virtual Observatory (ASVO).

TAO boasts a clean and intuitive web interface. It avoids the need to know a database query language (like SQL) by providing a custom point-and-click web-form to select virtual galaxies and their properties, which auto-generates the query code in the background. Query results can then be funneled through additional “modules” and sent to a local supercomputer for further processing and manipulation….
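The “auto-generates the query code in the background” step can be as simple as turning form selections into parameterized SQL. A sketch (the table and column names are invented, not TAO’s actual schema):

    def build_query(table, columns, filters):
        """filters: list of (column, operator, value) taken from the web form."""
        where = " AND ".join(f"{col} {op} %s" for col, op, _ in filters)
        sql = f"SELECT {', '.join(columns)} FROM {table}"
        params = [value for _, _, value in filters]
        return (sql + (f" WHERE {where}" if where else ""), params)

    selection = {
        "table": "galaxies",
        "columns": ["stellar_mass", "sfr", "redshift"],
        "filters": [("redshift", "<", 2.0), ("stellar_mass", ">", 1e9)],
    }
    print(build_query(selection["table"], selection["columns"], selection["filters"]))

Keeping the values as bind parameters rather than pasting them into the string is what lets a service hand the generated query safely to a back-end database or supercomputer job.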

You may not have a local supercomputer today, but in a year or two? Maybe as accessible as Facebook is today. Hopefully more useful. 😉

Are you still designing for desk/laptop processing capabilities?

Music Network Visualization

Filed under: Graphs,Music,Networks,Similarity,Subject Identity,Visualization — Patrick Durusau @ 7:23 pm

Music Network Visualization by Dimiter Toshkov.

From the post:

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

Interesting for the visualization but also the use of similarity measures.

The test for identity of a subject, particularly for collective subjects such as artists “similar” to X, is as unlimited as your imagination.
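For a flavor of what such a similarity measure can be, here is a toy Jaccard overlap between artists’ genre-tag sets (the tags are invented, and the post itself works in R with igraph rather than Python):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    tags = {
        "Artist A": {"idm", "ambient"},
        "Artist B": {"ambient", "darkwave"},
        "Artist C": {"bossa nova", "acid jazz"},
    }
    artists = list(tags)
    edges = [(x, y, jaccard(tags[x], tags[y]))
             for i, x in enumerate(artists) for y in artists[i + 1:]]
    print([(x, y, round(s, 2)) for x, y, s in edges if s > 0.3])   # draw these edges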

Solving real world analytics problems with Apache Hadoop [Webinar]

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:14 pm

Solving real world analytics problems with Apache Hadoop

Thursday December 13, 2012 at 8:30 a.m. PST/11:30 a.m. EST

From the registration page:

Agenda:

  • Defining big data
  • What are the most critical components of a big data solution?
  • The business and technical challenges of delivering a solution
  • How Cloudera accelerates big data value?
  • Why partner with HP?
  • The HP AppSystem powered by Cloudera

Doesn’t look heavy on the technical side but on the other hand, attending means you will be entered in a raffle for an HP Mini Notebook.

December 10, 2012

[P]urchase open source software products…based on its true technical merits [Alert: New Government Concept]

Filed under: Government,Open Source,Talend — Patrick Durusau @ 8:36 pm

Talend, an open source based company, took the lead in obtaining a favorable ruling on software conformance with the Trade Agreements Act (TAA).

Trade Agreements Act, quick summary: goods manufactured in non-designated countries cannot be purchased by federal agencies. Open source software can have significant contact with non-designated countries. Non-conformance with the TAA means open source software loses an important market.

Talend obtained a very favorable ruling for open source software. The impact of that ruling:

“The Talend Ruling is significant because government users now have useful guidance specifically addressing open source software that is developed and substantially transformed in a designated country, but also includes, or is based upon, source code from a non-designated country,” said Fern Lavallee, DLA Piper LLP (US), counsel to Talend. “Federal agencies can now purchase open source software products like Talend software based on its true technical merits, including ease of use, flexibility, robust documentation and data components and its substantial life-cycle cost advantages, while also having complete confidence in the product’s full compliance with threshold requirements like the TAA. The timing of this Ruling is right given the Department of Defense’s well publicized attention and commitment to Better Buying Power and DoD’s recent Open Systems Architecture initiative.” (Quote from Government Agency Gives Talend Green Light on Open Source)

An important ruling for all open source software projects, including topic maps.

I started to post about it when it first appeared but reports of rulings aren’t the same as the rulings themselves.

Talend graciously forwarded a copy of the ruling and gave permission for it to be posted for your review. Talend-Inc-US-Customs-and-Border-Protection-Response-Letter.pdf

Looking forward to news of your efforts to make it possible for governments to buy open source software “…based on its true technical merits.”

Fractal Tree Indexing Overview

Filed under: B-trees,Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 7:37 pm

Fractal Tree Indexing Overview by Martin Farach-Colton.

From the post:

We get a lot of questions about how Fractal Tree indexes work. It’s a write-optimized index with fast queries, but which write-optimized indexing structure is it?

In this ~15 minute video (which uses these slides), I give a quick overview of how they work and what they are good for.

Suggestion: Watch the video along with the slides. (Some of the slides are less than intuitive. Trust me on this one.)

Martin Gardner explaining fractals in SciAm it’s not, but it will give you a better appreciation for fractal trees.
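For a feel of the write-optimization idea, here is a toy two-level “buffered” index in Python: inserts land in an in-memory buffer and are flushed to a sorted run in batches. Real fractal tree indexes buffer messages at every internal node and use far more sophisticated on-disk layouts, so this is only the intuition:

    import bisect

    class ToyBufferedIndex:
        def __init__(self, buffer_size=4):
            self.buffer = []        # unsorted recent inserts (cheap writes)
            self.sorted_keys = []   # stands in for the expensive on-disk structure
            self.buffer_size = buffer_size

        def insert(self, key):
            self.buffer.append(key)            # O(1), no "disk" touched
            if len(self.buffer) >= self.buffer_size:
                self._flush()                  # amortized batch write

        def _flush(self):
            for key in sorted(self.buffer):
                bisect.insort(self.sorted_keys, key)
            self.buffer = []

        def contains(self, key):
            # queries must consult both the recent buffer and the sorted run
            if key in self.buffer:
                return True
            i = bisect.bisect_left(self.sorted_keys, key)
            return i < len(self.sorted_keys) and self.sorted_keys[i] == key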

BTW, did you know B-Trees are forty years old this year?

Tools for Data-Intensive Astronomy – a VO Community Day in Baltimore, MD (Archive)

Filed under: Astroinformatics,BigData — Patrick Durusau @ 7:19 pm

Tools for Data-Intensive Astronomy – a VO Community Day in Baltimore, MD (Archive)

In case you missed the original webcast, for your viewing pleasure:

2012 VAO (Virtual Astronomical Observatory) – Thursday, Nov 29, 2012

  • Welcome, Overview, Objectives – Bob Hanisch, Matt Mountain (Space Telescope Science Institute)
  • Science Capabilities of the VO – Joe Lazio (Jet Propulsion Laboratory)
  • Spectral Analysis and SEDs – Ivo Busko (Space Telescope Science Institute)
  • Data Discovery – Tom Donaldson (Space Telescope Science Institute)
  • CANDELS and the VO – Anton Koekemoer (Space Telescope Science Institute)
  • The VO and Python, VAO futures – Perry Greenfield (Space Telescope Science Institute)
  • Hands-on Session, Q&A – Tom Donaldson, Bob Hanisch, Ivo Busko, Anton Koekemoer (Space Telescope Science Institute)

The academics would call it being “inter-disciplinary.”

I call it being “innovative and successful.”

PlagSpotter [Ghost of Topic Map Past?]

Filed under: Duplicates,Plagiarism — Patrick Durusau @ 5:47 pm

I found a link to PlagSpotter in the morning mail.

I found it quite responsive, although I thought the “Share and Help Your Friends Protect Their Web Content” pitch rather limiting.

Here’s why:

To test the software, I chose an entry from another blog, one I quoted late yesterday, to check the timeliness of PlagSpotter.

And it worked!

While looking at the results, I saw people I expected to quote the same post, but then noticed there were people unknown to me on the list.

Rather than detecting plagiarism, the first off-label use of PlagSpotter is to identify communities quoting the same content.

With just a little more effort, the second off-label use of PlagSpotter is to track the spread of content across a community, by time. (With a little post processing, location, language as well.)

A third off-label use of PlagSpotter is to generate a list of sources that use the same content, a great seed list for a private search engine for a particular area/community.
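All of these uses rest on the same mechanics: detecting shared passages across pages. How PlagSpotter does it is not public, but a generic sketch looks like this, shingling each page into overlapping word n-grams and comparing the sets:

    def shingles(text, n=4):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    def overlap(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / min(len(sa), len(sb))

    page = "the quick brown fox jumps over the lazy dog near the river bank"
    quote = "a post noting that the quick brown fox jumps over the lazy dog today"
    print(round(overlap(page, quote), 2))   # a high score flags shared passages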

The earliest identifiable discussion of topic maps as topic maps involved detection of duplicated content (with duplicated charges for that content) for documentation in government contracts.

Perhaps why topic maps never gained much traction in government contracting. Cheats dislike being identified as cheats.

Ah, a fourth off-label use of PlagSpotter: detecting duplicated documentation submitted as part of weapon system or other contract documentation.

I find all four off-label uses of PlagSpotter more persuasive than protecting content.

Content only has value when other people use it, hopefully with attribution.

Apache Gora

Filed under: BigData,Gora,Hadoop,HBase,MapReduce — Patrick Durusau @ 5:26 pm

Apache Gora

From the webpage:

What is Apache Gora?

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Why Apache Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence for big data framework with data store specific mappings and built in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to Column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc; SQL databases, such as MySQL, HSQLDB, flat files in local file system of Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

When writing about the Nutch 2.X development path, I discovered my omission of Gora from this blog. Apologies for having overlooked it until now.
