Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

May 19, 2013

UNESCO Publications and Data (Open Access)

Filed under: Data,Government,Government Data — Patrick Durusau @ 8:49 am

UNESCO to make its publications available free of charge as part of a new Open Access policy

From the post:

The United Nations Educational, Scientific and Cultural Organisation (UNESCO) has announced that it is making its digital publications and data available to the public free of charge. This comes after UNESCO adopted an Open Access Policy, becoming the first agency within the United Nations to do so.

The new policy implies that anyone can freely download, translate, adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy also seeks to apply retroactively to works that have already been published.

There’s a treasure trove of information here for mapping, say, against the New York Times historical archives.

If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing presidential libraries for complete access would be another great treasure trove.

Got Balls?

Filed under: Intelligence,Military,Security — Patrick Durusau @ 8:16 am

IED Trends: Turning Tennis Balls Into Bombs

From the post:

Terrorists are relentlessly evolving tactics and techniques for IEDs (Improvised Explosive Devices), and analyzing reporting on IEDs can provide insight complementary to HUMINT on emerging militant methods. Preparing for an upcoming webcast with our friends at Terrogence, we found incidents using sports balls, particularly tennis balls and cricket balls, more frequently appearing as a delivery vehicle for explosives.

When we break these incidents from the last four months down by location, the city of Karachi in southern Pakistan stands out as a hotbed. There is also evidence that this tactic is being embraced around the globe as you can see sports balls fashioned into bombs found from Longview, Washington in the United States to Varanasi in India.

We can use Recorded Future’s Web Intelligence platform to plot out the locations where incidents have recently occurred as well as the frequency and timing.

Interesting, but the military, by its stated doctrines, should be providing this information in theater-specific IED briefings.

See for example: FMI 3-34.119/MCIP 3-17.01 IMPROVISED EXPLOSIVE DEVICE DEFEAT

On boobytraps (the old name) in general, see: FM 5-31 Boobytraps (1965), which includes pressure cookers (pp. 73-74) and rubber balls (p. 87).

Topic maps offer rapid dissemination of “new” forms and checklists for where they may be found (as opposed to static publications).

Interesting that FM 5-31 reports an electric iron as a boobytrap, but an electric iron is more likely to show up on Antiques Roadshow than as an IED.

At least in the United States.

May 18, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered

Filed under: Hadoop,Hive,SQL,STINGER — Patrick Durusau @ 3:47 pm

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community-led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which 28 were new features and 276 were bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised, we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine, and the community has delivered these key features.

[Image: Stinger]

Welcome news for the Hive and SQL communities alike!

Warp: Multi-Key Transactions for Key-Value Stores

Filed under: HyperDex,Key-Value Stores,NoSQL,Warp — Patrick Durusau @ 1:43 pm

Warp: Multi-Key Transactions for Key-Value Stores by Robert Escriva, Bernard Wong and Emin Gün Sirer.

Abstract:

Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.

Warp looks wicked cool!

Of particular interest is the non-ordering of transactions that have no impact on other transactions. That alone would be interesting for a topic map merging situation.

For more details, see the Warp page, or

Download Warp

Warp Tutorial

Warp Performance Benchmarks

I first saw this at High Scalability.

Open Data and Wishful Thinking

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 12:58 pm

BLM Fracking Rule Violates New Executive Order on Open Data by Sofia Plagakis.

From the post:

Today, the U.S. Department of the Interior’s Bureau of Land Management (BLM) released its revised proposed rule for natural gas drilling (commonly referred to as fracking) on federal and tribal lands. The much-anticipated rule violates President Obama’s recently issued executive order that requires new government information to be made available to the public in open, machine-readable formats.

Last week, President Obama signed an executive order requiring that all newly generated public data be pushed out in open, machine-readable formats. Concurrently, the Office of Management and Budget (OMB) and the Office of Science and Technology Policy (OSTP) released an Open Data Policy designed to make previously unavailable government data accessible to entrepreneurs, researchers, and the public.

The executive order and accompanying policy must have been in development for months, and agencies, including BLM, should have been fully aware of the new policy. But instead of establishing a modern example of government information collection and sharing, BLM’s proposed rule would allow drilling companies to report the chemicals used in fracking to a third-party, industry-funded website, called FracFocus.org, which does not provide data in machine-readable formats. FracFocus.org only allows users to download PDF files of reports on fracked wells. Because PDF files are not machine-readable, the site makes it very difficult for the public to use and analyze data on wells and chemicals that the government requires companies to collect and make available.

I wonder if Sofia simply overlooked:

When implementing the Open Data Policy, agencies shall incorporate a full analysis of privacy, confidentiality, and security risks into each stage of the information lifecycle to identify information that should not be released. These review processes should be overseen by the senior agency official for privacy. It is vital that agencies not release information if doing so would violate any law or policy, or jeopardize privacy, confidentiality, or national security. [From “We won’t get fooled again…”]

Or if her “…requires new government information to be made available to the public in open, machine-readable formats” is wishful thinking?

The Obama administration just released the Benghazi emails in PDF format. So we have an example of the White House violating its own “open data” policy.

We don’t need more “open data.”

What we need are more leakers. A lot more leakers.

Just be sure you leak or pass on leaks in “open, machine-readable formats.”

The foreign adventures, environmental pollution, failures in drug or food safety, etc., avoided by leaks may save your life, the lives of your children or grandchildren.

Leak today!

Graph Representation – Edge List

Filed under: Graphs,Networks,Programming — Patrick Durusau @ 12:44 pm

Graph Representation – Edge List

From the post:

An Edge List is a form of representation for a graph. It maintains a list of all the edges in the graph. For each edge, it keeps track of the 2 connecting vertices as well as the weight between them.

Followed by C++ code as an example.

A hypergraph would require tracking of 3 or more connected nodes.
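
The post’s example is in C++; here is a minimal sketch of the same idea in Python, with a hyperedge variant to show the 3-or-more case (the names are mine, not from the post):

```python
# Minimal edge list for a weighted, undirected graph: each entry records
# the two connecting vertices and the weight between them.
edges = [
    ("A", "B", 3),
    ("B", "C", 1),
    ("A", "C", 7),
]

def neighbors(vertex, edge_list):
    """Return (other_vertex, weight) pairs for every edge incident to vertex."""
    result = []
    for u, v, w in edge_list:
        if u == vertex:
            result.append((v, w))
        elif v == vertex:
            result.append((u, w))
    return result

print(neighbors("A", edges))  # [('B', 3), ('C', 7)]

# A hyperedge generalizes the pair of endpoints to a set of three or more
# vertices, so the edge list has to track variable-length vertex sets.
hyperedges = [
    (frozenset({"A", "B", "C"}), 2),  # one edge joining three vertices
    (frozenset({"B", "D"}), 5),
]
```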

Dumb Jocks?

Filed under: Graphics,Visualization — Patrick Durusau @ 10:46 am

Coaches are highest paid public employees by Nathan Yau.

[Image: Coach payments]

Nathan makes another amazing find!

The topic map lesson here is effective presentation of information.

For the same data, the Obama Administration would list all U.S. public employees ordered by internal department name, with positions and salaries listed separately, in a PDF file.

I know which strategy I prefer.

You?

A Trillion Triples in Perspective

Filed under: BigData,Music,RDF,Semantic Web — Patrick Durusau @ 10:11 am

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher than what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.

Another musical analogy?

Triples are the one-finger version of Jingle Bells*:

*The gap is greater than the video represents, but it is still amusing.

Does your analysis/data have one-finger subtlety?

May 17, 2013

Faunus: Graph Analytics Engine

Filed under: Faunus,Graphs — Patrick Durusau @ 6:22 pm

Faunus: Graph Analytics Engine by Marko Rodriguez.

From the description:

Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce-variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce-steps that are sequence optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats are supported which enable Faunus to load/store graphs in the distributed graph database Titan, various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.

I saw this slide deck after posting ConceptNet5 [Herein of Hypergraphs] and writing about the “id-less” nodes and edges of ConceptNet5.

So when I see nodes and edges with IDs, I have to wonder why?

What requirement is being met, or what advantage is obtained, by using IDs rather than addressing a node by its content?*

Remember that we are no longer concerned with shaving bits off identifiers for storage and/or processing reasons.


* I suspect that addressing by content presumes a level of granularity that may not be appropriate in all cases. Hard to say. But I do want to look at the issue more closely.

Common Locale Data Repository (CLDR) 23.1

Filed under: Unicode — Patrick Durusau @ 6:07 pm

Common Locale Data Repository (CLDR) 23.1

From the CLDR project homepage:

What is CLDR?

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes:

  • Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, timezones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, postal and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layouts

CLDR uses the XML format provided by UTS #35: Unicode Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.

For a set of slides on the technical contents of CLDR, see Overview.

A great set of widely used mappings of locale data.
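
To see this data in action, the Python Babel library ships CLDR data and exposes it through simple formatting calls. A small sketch (exact output depends on the CLDR/Babel versions installed):

```python
from datetime import date
from babel.dates import format_date
from babel.numbers import format_currency, format_decimal

d = date(2013, 5, 17)

# The same date rendered with locale-specific patterns drawn from CLDR.
print(format_date(d, format="long", locale="en_US"))  # e.g. May 17, 2013
print(format_date(d, format="long", locale="de_DE"))  # e.g. 17. Mai 2013

# Number and currency formatting also come from CLDR locale data.
print(format_decimal(1234567.89, locale="en_US"))      # e.g. 1,234,567.89
print(format_decimal(1234567.89, locale="de_DE"))      # e.g. 1.234.567,89
print(format_currency(1234.5, "EUR", locale="fr_FR"))  # e.g. 1 234,50 €
```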

Solr 4, the NoSQL Search Server [Webinar]

Filed under: NoSQL,Searching,Solr — Patrick Durusau @ 4:46 pm

Solr 4, the NoSQL Search Server by Yonik Seeley

Date: Thursday, May 30, 2013
Time: 10:00am Pacific Time

From the description:

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
Featured Presenter:

Yonik Seeley – Creator of Apache Solr and the Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.
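
In the meantime, here is a rough sketch of what an atomic update with optimistic concurrency looks like against a Solr 4 instance. The core name, field names and version value below are illustrative, and it assumes the requests library:

```python
import json
import requests  # assumes the requests library is installed

SOLR_UPDATE = "http://localhost:8983/solr/collection1/update?commit=true"

# Atomic update: "set" rewrites a single field without resending the whole
# document. The _version_ field drives optimistic concurrency: Solr rejects
# the update if the stored version no longer matches the one we read earlier
# (for example via a real-time get).
payload = [{
    "id": "doc42",                      # hypothetical document id
    "price": {"set": 19.99},            # atomic "set" on one field
    "_version_": 1436442117709037568,   # illustrative version value
}]

resp = requests.post(SOLR_UPDATE, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)
```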

A self-updating road map of The Cancer Genome Atlas

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics,RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:33 pm

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?
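
For readers who want to explore, querying the road map’s SPARQL endpoint from Python might look roughly like this. The endpoint URL and predicate names below are placeholders; use the ones published on the dashboard page:

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # assumes SPARQLWrapper is installed

# Placeholder endpoint; the real one is linked from the TCGA Roadmap dashboard.
sparql = SPARQLWrapper("http://example.org/tcga/sparql")

# Find files annotated with a given platform; the predicates are illustrative.
sparql.setQuery("""
    PREFIX tcga: <http://example.org/tcga/schema#>
    SELECT ?file ?diseaseStudy
    WHERE {
        ?file tcga:platform ?platform ;
              tcga:diseaseStudy ?diseaseStudy .
        FILTER regex(str(?platform), "illumina", "i")
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["file"]["value"], row["diseaseStudy"]["value"])
```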

Organizing Digital Information for Others

Filed under: Interface Research/Design,Usability,Users,UX — Patrick Durusau @ 4:14 pm

Organizing Digital Information for Others by Maish Nichani. (ebook, no registration required)

From the description:

When we interact with web and intranet teams, we find many struggling to move beyond conceptual-level discussions on information organization. Hours on end are spent on discussing the meaning of “metadata”, “controlled vocabulary” and “taxonomy” without any strategic understanding of how everything fits together. Being so bogged down at this level they fail to look beyond to the main reason for their pursuit—organizing information for others (the end users) so that they can find the information easily.

Web and intranet teams are not the only ones facing this challenge. Staff in companies are finding themselves tasked with organizing, say, hundreds of project documents on their collaboration space. And they usually end up organizing it in the only way they know—for themselves. Team members then often struggle to locate the information that they thought should be in “this folder”!

In this short book, we explore how lists, categories, trees and facets can be better used to organize information for others. We also learn how metadata and taxonomies can connect different collections and increase the findability of information across the website or intranet.

But more than that we hope that this book can start a conversation around this important part of our digital lives.

So let the conversation begin!

The theme of delivering information to others cannot be emphasized enough.

Your notes, interface choices, etc., are just that, your notes, interface choices, etc.

Unless you are independently wealthy, that isn’t a very good marketing model.

Nor are users going to be “trained” to work, search, or author the “right way” in your view.

An introduction to be sure, but this short (50-odd page) work is entertaining and has additional references.

Very much worth the time to read.

Properties of Morphisms

Filed under: Category Theory,Mathematical Reasoning,Mathematics — Patrick Durusau @ 3:53 pm

Properties of Morphisms by Jeremy Kun.

From the post:

This post is mainly mathematical. We left it out of our introduction to categories for brevity, but we should lay these definitions down and some examples before continuing on to universal properties and doing more computation. The reader should feel free to skip this post and return to it later when the words “isomorphism,” “monomorphism,” and “epimorphism” come up again. Perhaps the most important part of this post is the description of an isomorphism.

Isomorphisms, Monomorphisms, and Epimorphisms

Perhaps the most important paradigm shift in category theory is the focus on morphisms as the main object of study. In particular, category theory stipulates that the only knowledge one can gain about an object is in how it relates to other objects. Indeed, this is true in nearly all fields of mathematics: in groups we consider all isomorphic groups to be the same. In topology, homeomorphic spaces are not distinguished. The list goes on. The only way to determine if two objects are “the same” is by finding a morphism with special properties. Barry Mazur gives a more colorful explanation by considering the meaning of the number 5 in his essay, “When is one thing equal to some other thing?” The point is that categories, more than existing to be a “foundation” for all mathematics as a formal system (though people are working to make such a formal system), exist primarily to “capture the essence” of mathematical discourse, as Mazur puts it. A category defines objects and morphisms, but literally all of the structure of a category lies in its morphisms. And so we study them.
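
For quick reference, the three properties named in the excerpt reduce to cancellation and invertibility conditions:

```latex
% For a morphism f : A \to B in a category:
%   monomorphism = left-cancellable, epimorphism = right-cancellable,
%   isomorphism  = a two-sided inverse exists.
\begin{align*}
f \text{ is monic} &\iff \forall\, g,h\colon C \to A,\quad f \circ g = f \circ h \implies g = h \\
f \text{ is epic}  &\iff \forall\, g,h\colon B \to D,\quad g \circ f = h \circ f \implies g = h \\
f \text{ is iso}   &\iff \exists\, g\colon B \to A \text{ with } g \circ f = \mathrm{id}_A \text{ and } f \circ g = \mathrm{id}_B
\end{align*}
```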

If you are looking for something challenging to read over the weekend, Jeremy’s latest post on morphisms should fit the bill.

The question of “equality” is easy enough to answer as lexical equivalence in UTF-8, but what if you need something more?

Being mindful that “lexical equivalence in UTF-8” is a highly unreliable form of “equality.”

Jeremy is easing us down the road where such discussions can happen with a great deal of rigor.

Hadoop SDK and Tutorials for Microsoft .NET Developers

Filed under: .Net,Hadoop,MapReduce,Microsoft — Patrick Durusau @ 3:39 pm

Hadoop SDK and Tutorials for Microsoft .NET Developers by Marc Holmes.

From the post:

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some Powershell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

Hadoop is coming to an office/data center near you.

Will you be ready?

ConceptNet5 [Herein of Hypergraphs]

Filed under: Concept Maps,Graphs,Hypergraphs — Patrick Durusau @ 3:29 pm

ConceptNet5

From the website:

ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.

It is built from nodes representing concepts, in the form of words or short phrases of natural language, and labeled relationships between them. These are the kinds of things computers need to know to search for information better, answer questions, and understand people’s goals. If you wanted to build your own Watson, this should be a good place to start!

ConceptNet contains everyday basic knowledge:

(…)

ConceptNet 5 is a graph

To be precise, it’s a hypergraph, meaning it has edges about edges. Each statement in ConceptNet has justifications pointing to it, explaining where it comes from and how reliable the information seems to be.

Previous versions of ConceptNet have been distributed as idiosyncratic database structures plus some software to interact with them. ConceptNet 5 is not a piece of software or a database; it is a graph. It’s a set of nodes and edges, which we can represent in multiple formats including JSON. You probably know better than we do what software you want to use to interact with it!

(That said, you can have our idiosyncratic Solr index if you want, but that’s not ConceptNet, it’s just a system for quickly looking things up in ConceptNet.)

Some other interesting properties:

  • The ConceptNet graph is ID-less. Every node and assertion contains all the information necessary to identify it and no more in its URI, and does not rely on arbitrarily-assigned IDs. The advantage of this is that if multiple branches of ConceptNet are developed in multiple places, we can later merge them simply by taking the union of the nodes and edges. (And we hope for this to happen!)
  • ConceptNet supports linked data: you can download a list of links to the greater Semantic Web, via DBPedia and via RDF/OWL WordNet. For example, our concept cat is linked to the DBPedia node at http://dbpedia.org/resource/Cat.

In addition to being a data source, ConceptNet5 offers an interesting notion of “ID-less” nodes and edges.

Information on the software setup (Solr and Python) used to deliver ConceptNet5 as a hypergraph is also available.
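
To make the “ID-less” point concrete, here is a small sketch of why merging branches reduces to a set union when each edge carries its identity in its own content. The URIs are illustrative, loosely following ConceptNet’s conventions:

```python
# Each edge is identified purely by its content: relation plus endpoints.
# With no database-assigned IDs, two independently grown branches merge by
# plain set union, with no ID reconciliation step.
branch_a = {
    ("/r/IsA", "/c/en/cat", "/c/en/animal"),
    ("/r/PartOf", "/c/en/wheel", "/c/en/car"),
}
branch_b = {
    ("/r/IsA", "/c/en/cat", "/c/en/animal"),       # same content, same identity
    ("/r/UsedFor", "/c/en/knife", "/c/en/cutting"),
}

merged = branch_a | branch_b
print(len(merged))  # 3, not 4: the shared assertion collapses automatically
for relation, start, end in sorted(merged):
    print(relation, start, end)
```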

I first saw this in Max De Marzi’s Knowledge Bases in Neo4j. You will find that Max’s approach involves dumbing down the hypergraph.

Knowledge Bases in Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:05 pm

Knowledge Bases in Neo4j by Max De Marzi.

From the post:

From the second we are born we are collecting a wealth of knowledge about the world. This knowledge is accumulated and interrelated inside our brains and it represents what we know. If we could export this knowledge and give it to a computer, it would look like ConceptNet. ConceptNet is a semantic network that…

…is built from nodes representing concepts, in the form of words or short phrases of natural language, and labeled relationships between them. These are the kinds of things computers need to know to search for information better, answer questions, and understand people’s goals.

I wrote a little ruby script to import ConceptNet5 into Neo4j and it gives us a nice graph (243MB) to work with. ConceptNet5 as presented in csv files is actually a hypergraph, with a reason for the concept:

Max gives a script to remove the reasons for concepts (the hypergraph part of ConceptNet5) as duplicate content.

It does make the graph smaller, but only at the expense of information loss.

Think of it as the Benghazi emails with all the duplicate prose removed along with who said it.

If that fits your requirements, ok, but I doubt it would fit in any environment that requires information auditing.

Hadoop Toolbox: When to Use What

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:39 pm

Hadoop Toolbox: When to Use What by Mohammad Tariq.

From the post:

Eight years ago not even Doug Cutting would have thought that the tool which he’s naming after his kid’s soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonyms to each other. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a power-packed combo.

Having said that, one must be careful while choosing these tools for their specific use case as one size doesn’t fit all. What is working for someone might not be that productive for you. So, here I will show you which tool should be picked in which scenario. It’s not a big comparative study but a short intro to some very useful tools. And, this is based totally on my experience so there is always some scope of suggestions. Please feel free to comment or suggest if you have any. I would love to hear from you. Let’s get started :

Not shallow enough to be useful for the C-suite types, not deep enough for decision making.

Nice to use in a survey context, where users need an overview of the Hadoop ecosystem.

May 16, 2013

Automated Archival and Visual Analysis of Tweets…

Filed under: Searching,Tweets — Patrick Durusau @ 7:24 pm

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others by Stephen Turner.

From the post:

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t‘s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen promises in his post that the script updates automatically, and warns that you may find “unsavory” tweets.

I didn’t, but that may be a matter of happenstance or sensitivity. 😉
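
If you want to roll something similar yourself, here is a rough Python sketch along the same lines. It is not Stephen’s script; it assumes the t client is installed and authorized and uses its `search all` subcommand:

```python
import subprocess
from datetime import datetime

KEYWORDS = ["#bog13", "bioinformatics", "#rstats"]  # illustrative search terms

def archive(keyword, outfile="tweets_archive.txt"):
    # Shell out to the t command-line client; "search all" queries tweets
    # matching the keyword. Results are appended with a timestamped header.
    result = subprocess.run(["t", "search", "all", keyword],
                            capture_output=True, text=True, check=True)
    with open(outfile, "a") as fh:
        fh.write("# %s %s\n" % (datetime.utcnow().isoformat(), keyword))
        fh.write(result.stdout)

for kw in KEYWORDS:
    archive(kw)

# Scheduling is left to cron, e.g. run twice a day as the original post suggests.
```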

Linguists Circle the Wagons, or Disagreement != Danger

Filed under: Artificial Intelligence,Linguistics,Natural Language Processing — Patrick Durusau @ 2:47 pm

Pullum’s NLP Lament: More Sleight of Hand Than Fact by Christopher Phipps.

From the post:

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) have not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term “natural language processing” and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like “computer systems that mimic human language processing.” This is fundamentally unfair.

In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.

I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

If there is unfairness in this discussion, it is the insistence by Christopher Phipps (and others) that Pullum has invented “…a straw-man NLP world where NLP techniques are only admirable if they mimic human processing.”

On the contrary, it was 1949 when Warren Weaver first proposed computers as the solution to world-wide translation problems. Weaver’s was not the only optimistic projection of language processing by computers. Such projections have continued up to and including the Semantic Web.

Yes, NLP practitioners such as Christopher Phipps use NLP in a more precise sense than Pullum. And NLP as defined by Phipps has too many achievements to easily list.

Neither one of those statements takes anything away from Pullum’s point that Google found a “sweet spot” between machine processing and human intelligence for search purposes.

What other insights Pullum has to offer may be obscured by the “…circle the wagons…” attitude from linguists.

Disagreement != Danger.

How-to: Configure Eclipse for Hadoop Contributions

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

HAL: a hierarchical format for storing…

Filed under: Bioinformatics,Genomics,Graphs — Patrick Durusau @ 12:27 pm

HAL: a hierarchical format for storing and analyzing multiple genome alignments by Glenn Hickey, Benedict Paten, Dent Earl, Daniel Zerbino and David Haussler. (Bioinformatics (2013) 29 (10): 1341-1342. doi: 10.1093/bioinformatics/btt128)

Abstract:

Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance.

Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover).

Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal.

Important work for bioinformatics and genome alignment as well as specializing graphs for that work.

Graphs are a popular subject these days, but to be useful, projects will rely on graphs with particular properties and structures.

The more examples of graph-based projects, the more we learn about general principles of graphs for particular applications or requirements.

Open Government and Benghazi Emails

Filed under: Government,Government Data,Open Government — Patrick Durusau @ 9:13 am

The controversy over the “Benghazi emails” is a good measure of what the Obama Administration means by “open government.”

News of the release of the Benghazi emails broke yesterday at NPR and USA Today, among others.

I saw the news at Benghazi Emails Released, Wall Street Journal, along with a PDF of the emails.

If you go to WhiteHouse.gov and search for “Benghazi emails,” can you find the White House release of the emails?

I thought not.

The emails show congressional concern over the “talking points” on Benghazi to be a tempest in a teapot, as many of us already suspected.

Early release of the emails would have avoided some of the endless discussion rooted in congressional ignorance and bigotry.

But, the Obama administration has so little faith in “open government” that it conceals information that would be to its advantage if revealed.

Now imagine how the Obama administration must view information that puts it at a disadvantage.

Does that help to clarify the commitment of the Obama administration to open government?

It does for me.

May 15, 2013

GraphLab – Next Generation [Johnny Come Lately VCs]

Filed under: GraphLab,Graphs — Patrick Durusau @ 5:55 pm

Funding for the next generation of GraphLab by Danny Bickson.

From the post:

The GraphLab journey began with the desire:

  • to rethink the way we approach Machine Learning and Graph analytics,
  • to demonstrate that with the right abstractions and system design we can achieve unprecedented levels of performance, and
  • to build a community around large-scale graph computation.

We have been blown away by the excitement and growth of the GraphLab community and have been unable to keep up with the incredible interest from our amazing users.

Therefore, we are proud to announce GraphLab Inc, a company devoted to accelerating the development of the open-source GraphLab project.

(…)

[GraphLab will remain an open source project]

GraphLab 2.2 is just around the corner, see here for more details as to what is in it. Beyond that, we are exploring a new computation engine and further enhancements to the communication layer, as well as simpler integration with existing Cloud technologies, easier installation procedures, and an exciting new graph storage system. And of course, we look forward to working with you to develop the roadmap and build the next generation of the GraphLab system. [Missing hyperlink for details on GraphLab 2.2 in original]

Very cool!

For you Johnny Come Lately VCs:

GraphLab Raises $6.75M For Data Analysis Used In Consumer Recommendation Services by Alex Williams.

From the post:

GraphLab, the open-source distributed database, has received $6.75 million from Madrona Venture Group and NEA for its machine learning technology used to analyze data graphs for recommendation engines.

Developed five years ago at Carnegie Mellon University, the open-source data analysis platform takes semi-structured data that describe relationships between people, web traffic, product purchases and other data. It then analyzes that data for services to provide online recommendations.

There may be more room at the table. I don’t know, so you would have to ask the GraphLab folks.

Full Disclosure: I have no financial interest in GraphLab, although I am very interested in promoting work that is well done. GraphLab is an example of such work.

EDAM: an ontology of bioinformatics operations,…

Filed under: Bioinformatics,Ontology — Patrick Durusau @ 5:51 pm

EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats by Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer and Peter Rice. (Bioinformatics (2013) 29 (10): 1325-1332. doi: 10.1093/bioinformatics/btt113)

Abstract:

Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations.

Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl.

No matter how many times I read it, I just don’t get:

Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

I will be generous and assume the authors meant “machine-processable descriptions” when I read “machine-understandable descriptions.” It is well known that machines don’t “understand” data, they simply process it according to specified instructions.

But more to the point, machines are indifferent to the type or number of descriptions they have for any subject. It might confuse a human processor to have thirty (30) different descriptions for the same subject but there has been no showing of such a limit for machines.

Every effort to produce a “comprehensive” ontology/classification/taxonomy, pick your brand of poison, has been in the face of competing and different descriptions. That is, after all, the rationale for a comprehensive …, that there are too many choices already.

The outcome of all such efforts, assuming there are N diverse descriptions, is N + 1 diverse descriptions, the 1 being the current project added to the existing diverse descriptions.
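
Reservations aside, the ontology itself is easy to poke at. A minimal sketch using rdflib (assuming rdflib is installed) to load the OWL file linked in the abstract:

```python
import rdflib
from rdflib.namespace import RDFS

g = rdflib.Graph()
# EDAM 1.2 is published as OWL (RDF/XML); parsing the full file takes a moment.
g.parse("http://edamontology.org/EDAM_1.2.owl", format="xml")

print(len(g), "triples")

# Print a handful of concept labels to get a feel for the vocabulary.
for _, _, label in list(g.triples((None, RDFS.label, None)))[:10]:
    print(label)
```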

Geography of hate against gays, races, and the disabled

Filed under: Geography,Mapping,Maps — Patrick Durusau @ 3:51 pm

Geography of hate against gays, races, and the disabled by Nathan Yau.

[Image: Hate Map]

Nathan reports on the work of Floating Sheep, who relied on 150,000 tweets to create this map.

More details at Nathan’s site but as Nathan says, read the FAQ before you get too torqued about the map.

If nothing else, this should be a good lesson in the choices made collecting and mapping “objective” data (the tweets) and what questions you should ask about that process.

I found it interesting that the sea coast along the Gulf of Mexico seemed to have less hate.

How would you defend the choices you make when making a topic map?

Some information that is important to someone will have to be left out. Was that out of religious, political, social or ethnic bias?

You can’t avoid that sort of question, but you can be comfortable with your own answers should it arise.

My stock response is:

“The paying client is happy with the map. Become a paying client and you can be map happy too.”

Keyword Search, Plus a Little Magic

Filed under: Keywords,Search Behavior,Searching — Patrick Durusau @ 3:34 pm

Keyword Search, Plus a Little Magic by Geoffrey Pullum.

From the post:

I promised last week that I would discuss three developments that turned almost-useless language-connected technological capabilities into something seriously useful. The one I want to introduce first was introduced by Google toward the end of the 1990s, and it changed our whole lives, largely eliminating the need for having full sentences parsed and translated into database query language.

The hunch that the founders of Google bet on was that simple keyword search could be made vastly more useful by taking the entire set of pages containing all of the list of search words and not just returning it as the result but rather ranking its members by influentiality and showing the most influential first. What a page contains is not the only relevant thing about it: As with any academic publication, who values it and refers to it is also important. And that is (at least to some extent) revealed in the link structure of the Web.

In his first post, which wasn’t sympathetic to natural language processing, Geoffrey baited his critics into fits of frenzied refutation.

Fits of refutation that failed to note Geoffrey hadn’t completed his posts on natural language processing.

Take the keyword search posting for instance.

I won’t spoil the surprise for you but the fourth fact that Geoffrey says Google relies upon could have serious legs for topic map authoring and interface design.

And not a little insight into what we call natural language processing.

More posts are to follow in this series.

I suggest we savor each one as it appears and after reflection on the whole, sally forth onto the field of verbal combat.

PostgreSQL 9.3 Beta 1 Released

Filed under: PostgreSQL,SQL — Patrick Durusau @ 3:22 pm

PostgreSQL 9.3 Beta 1 Released

From the post:

The first beta release of PostgreSQL 9.3, the latest version of the world’s best open source database, is now available. This beta contains previews of all of the features which will be available in version 9.3, and is ready for testing by the worldwide PostgreSQL community. Please download, test, and report what you find.

Major Features

The major features available for testing in this beta include:

  • Writeable Foreign Tables, enabling pushing data to other databases
  • pgsql_fdw driver for federation of PostgreSQL databases
  • Automatically updatable VIEWs
  • MATERIALIZED VIEW declaration
  • LATERAL JOINs
  • Additional JSON constructor and extractor functions
  • Indexed regular expression search
  • Disk page checksums to detect filesystem failures

In 9.3, PostgreSQL has greatly reduced its requirement for SysV shared memory, changing to mmap(). This allows easier installation and configuration of PostgreSQL, but means that we need our users to rigorously test and ensure that no memory management issues have been introduced by the change. We also request that users spend extra time testing the improvements to Foreign Key locks.

If that isn’t enough features for you to test, see the full announcement! 😉

Causium Sales Model

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:16 pm

Atlassian’s Causium Sales Model Reaches $2.5 Million Charity Donations by Kit Eaton.

From the post:

Back in May 2010 Atlassian, a large innovative software company, revealed that its alternative business model causium had allowed it to donate $500,000 to international literacy improvement charity Room to Read. Now the firm says it has surpassed $2.5 million in donations, and is holding a special event with the charity on May 14th to celebrate.

Causium is an alternative to the freemium business model that many companies–from the Wall Street Journal to Babbel–follow. Under freemium thinking, Atlassian would give away some of its enterprise-grade code for free in order to attract business for its paid services. But instead, the company charges a nominal $10 fee, which it then donates to charity. The fee works in two ways–as a boost to charitable causes, and also to demonstrate to the software’s end-users that the code itself has value.

Atlassian’s President Jay Simons spoke to Fast Company, explaining that the plan has worked better than they expected: “We didn’t appreciate at the time that we were effectively building this annuity stream. Customers that buy the 10-user license will buy it again the following year.” The first year of the plan resulted in some $300,000 in charity donations, and the growth of the company’s reputation since means they donated the same amount in the first quarter of 2013. The donations are important to Room to Read, Simons says, because “they have a reliable funding source” on a regular basis.

Important to note that Atlassian had the market presence to make a causium sales model work.

On that score, see:

Why Atlassian is to Software as Apple is to Design by Mark Fidelman.

and, of course:

Atlassian.

Important lessons if you hope to make your software or service a success.
