Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

August 3, 2012

Introducing Apache Hadoop YARN

Filed under: Hadoop,Hadoop YARN,HDFS,MapReduce — Patrick Durusau @ 3:03 pm

Introducing Apache Hadoop YARN by Arun Murthy.

From the post:

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of Apache Hadoop, which itself is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was part of the Hadoop MapReduce project and is now poised to stand on its own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in the future you will see MPI, graph-processing, simple services etc., all co-existing with MapReduce applications in a Hadoop YARN cluster.

Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?

Column Statistics in Hive

Filed under: Cloudera,Hive,Merging,Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule-of-thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan.

As an example, consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, and table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B plus the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in an RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
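The arithmetic in that example is worth working through once by hand. A minimal sketch in Python, using the table sizes and intermediate cardinalities from the quoted example and the simple row-count-product cost model described above:

```python
# Cost model from the example: the cost of a two-way join is approximated
# by the product of the row counts of its two inputs.
def join_cost(left_rows, right_rows):
    return left_rows * right_rows

rows = {"A": 500, "B": 6000, "C": 1000}   # base table cardinalities

rows_a_join_b = 2000   # intermediate result sizes given in the example
rows_a_join_c = 50

# User-specified order: (A Join B) Join C
naive_cost = join_cost(rows["A"], rows["B"]) + join_cost(rows_a_join_b, rows["C"])

# CBO-chosen order: (A Join C) Join B
cbo_cost = join_cost(rows["A"], rows["C"]) + join_cost(rows_a_join_c, rows["B"])

print(naive_cost)  # 5000000 = (500 * 6000) + (2000 * 1000)
print(cbo_cost)    #  800000 = (500 * 1000) + (50 * 6000)
```

The catch, of course, is that the optimizer only knows the intermediate cardinalities (2000 and 50 here) if it has statistics on the join columns, which is exactly what this Hive feature adds.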

Some days I wonder where improvements to algorithms and data structures are going to lead?

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?

De novo assembly and genotyping of variants using colored de Bruijn graphs

Filed under: Bioinformatics,De Bruijn Graphs,Genome,Graphs,Networks — Patrick Durusau @ 2:06 pm

De novo assembly and genotyping of variants using colored de Bruijn graphs by Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek & Gil McVean. (Nature Genetics 44, 226–232 (2012))

Abstract:

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

You will need access to Nature Genetics, but this recent research article rounds out today’s posts on de Bruijn graphs.

Comments on the Cortex software appreciated.

Genome assembly and comparison using de Bruijn graphs

Filed under: Bioinformatics,De Bruijn Graphs,Genome,Graphs,Networks — Patrick Durusau @ 10:41 am

Genome assembly and comparison using de Bruijn graphs by Daniel Robert Zerbino. (thesis)

Abstract:

Recent advances in sequencing technology made it possible to generate vast amounts of sequence data. The fragments produced by these high-throughput methods are, however, far shorter than in traditional Sanger sequencing. Previously, micro-reads of less than 50 base pairs were considered useful only in the presence of an existing assembly. This thesis describes solutions for assembling short read sequencing data de novo, in the absence of a reference genome.

The algorithms developed here are based on the de Bruijn graph. This data structure is highly suitable for the assembly and comparison of genomes for the following reasons. It provides a flexible tool to handle the sequence variants commonly found in genome evolution such as duplications, inversions or transpositions. In addition, it can combine sequences of highly different lengths, from short reads to assembled genomes. Finally, it ensures an effective data compression of highly redundant datasets.

This thesis presents the development of a collection of methods, called Velvet, to convert a de Bruijn graph into a traditional assembly of contiguous sequences. The first step of the process, termed Tour Bus, removes sequencing errors and handles biological variations such as polymorphisms. In its second part, Velvet aims to resolve repeats based on the available information, from low coverage long reads (Rock Band) or paired shotgun reads (Pebble). These methods were tested on various simulations for precision and efficiency, then on control experimental datasets.

De Bruijn graphs can also be used to detect and analyse structural variants from unassembled data. The final chapter of this thesis presents the results of collaborative work on the analysis of several experimental unassembled datasets.

De Bruijn graphs are covered on pages 22-42 if you want to cut to the chase.

Obviously of interest to the bioinformatics community.

Where else would you use de Bruijn graph structures?
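If you have never built one, here is a minimal sketch of de Bruijn graph construction from short reads, in plain Python. It is illustrative only; assemblers like Velvet add error correction, paired reads and far more compact storage:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; a directed edge connects the prefix and
    suffix (k-1)-mers of every k-mer observed in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy example: two overlapping "reads" from the same underlying sequence.
reads = ["ACGTACG", "CGTACGT"]
for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", sorted(successors))
```

Contigs correspond to unambiguous paths through this graph, which is why repeats (nodes with more than one successor) are the hard part that Rock Band and Pebble address.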

Is “Massive Data” > “Big Data”?

Filed under: BigData,Bioinformatics,Genome,Graphs,Networks — Patrick Durusau @ 8:32 am

Science News announces: New Computational Technique Relieves Logjam from Massive Amounts of Data, which is a better title than: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs by Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, and C. Titus Brown.

But I have to wonder about “massive data,” versus “big data,” versus “really big data,” versus “massive data streams,” as informative phrases. True, I have a weakness for an eye-catching headline, but in prose shouldn’t we say what data is under consideration and let readers draw their own conclusions?

The paper abstract reads:

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.

If “de Bruijn graphs,” sounds familiar, see: Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?].
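The memory savings come from storing k-mers in a Bloom filter instead of an exact hash table: membership queries can return false positives but never false negatives, and the bit budget per k-mer is fixed. A minimal sketch of the idea (illustrative only, not the authors’ implementation):

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; 'in' may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

kmers = BloomFilter(num_bits=10000, num_hashes=2)
for kmer in ("ACGTACGT", "CGTACGTA"):
    kmers.add(kmer)

print("ACGTACGT" in kmers)  # True
print("TTTTTTTT" in kmers)  # False (with high probability)
```

The paper’s contribution is showing that assembly graph connectivity survives this inexactness well enough to partition metagenomes before assembly.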

Halide Wins Image Processing Gold Medal!

Filed under: Graphics,Halide,Image Processing,Programming — Patrick Durusau @ 5:01 am

OK, the real title is: Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines (Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand.)

And there is no image processing gold medal. (To avoid angry letters from the big O folks.)

Still, when an abstract says:

Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism.

We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code.

We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.

Some excitement is understandable.

Expect programmers to generalize this decoupling of algorithms from storage and order.

What did Grace Slick say? “It’s a new dawn, people.”

Paper (12 MB PDF)

Code: http://halide-lang.org/
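To get a feel for what “decoupling the algorithm from the schedule” means without reading the paper, here is a toy sketch in Python (not Halide syntax): the algorithm is a set of pure functions over pixel coordinates, and the schedule only decides the order and tiling of evaluation.

```python
# Algorithm: pure functions over an (unbounded) integer pixel domain.
def input_image(x, y):
    return (x * 31 + y * 17) % 256          # stand-in for real pixel data

def brightened(x, y):
    return min(255, input_image(x, y) + 20)

def blurred(x, y):                           # 3-tap horizontal blur
    return (brightened(x - 1, y) + brightened(x, y) + brightened(x + 1, y)) // 3

# Schedule: decides evaluation order and tiling, says nothing about the math.
def realize(func, width, height, tile=8):
    out = [[0] * width for _ in range(height)]
    for ty in range(0, height, tile):        # visit the image tile by tile...
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):   # ...then row by row
                for x in range(tx, min(tx + tile, width)):
                    out[y][x] = func(x, y)
    return out

image = realize(blurred, width=32, height=32, tile=8)
print(image[0][:8])
```

In Halide proper the schedule also controls vectorization, parallelism, and whether an intermediate stage like brightened is recomputed or stored, which is where the performance comes from.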

August 2, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III

Filed under: Bioinformatics,Biomedical,Hadoop,MapReduce — Patrick Durusau @ 9:23 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Hadoop and Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Biomedical researchers will be interested in the results but I am more interested in the observation that Hadoop makes it possible to retain results for ad hoc analysis.

US drone strikes listed and detailed in Pakistan, Somalia and Yemen

Filed under: Military,News — Patrick Durusau @ 3:33 pm

US drone strikes listed and detailed in Pakistan, Somalia and Yemen

From Simon Rogers of the Guardian.

Photos of women and children as casualties, linked to particular drone attacks, might make drone technology seem less acceptable.

Category Theory

Filed under: Category Theory — Patrick Durusau @ 3:15 pm

Category Theory by Steve Awodey.

While writing up my earlier post on category theory I encountered this Summer 2011 course page:

Description:

Category theory, a branch of abstract algebra, has found many applications in mathematics, logic, and computer science. Like such fields as elementary logic and set theory, category theory provides a basic conceptual apparatus and a collection of formal methods useful for addressing certain kinds of commonly occurring formal and informal problems, particularly those involving structural and functional considerations. This course is intended to acquaint students with these methods, and also to encourage them to reflect on the interrelations between category theory and the other basic formal disciplines.

There is an extensive set of course notes and homework problem sets.

Using machine learning to extract quotes from text

Filed under: Machine Learning,Text Mining — Patrick Durusau @ 2:44 pm

Using machine learning to extract quotes from text by Chase Davis.

From the post:

Since we launched our Politics Verbatim project a couple of years ago, I’ve been hung up on what should be a simple problem: How can we automate the extraction of quotes from news articles, so it doesn’t take a squad of bored-out-of-their-minds interns to keep track of what politicians say in the news?

You’d be surprised at how tricky this is. At first glance, it looks like something a couple of regular expressions could solve. Just find the text with quotes in it, then pull out the words in between! But what about “air quotes?” Or indirect quotes (“John said he hates cheeseburgers.”)? Suffice it to say, there are plenty of edge cases that make this problem harder than it looks.

When I took over management of the combined Center for Investigative Reporting/Bay Citizen technology team a couple of months ago, I encouraged everyone to have a personal project on the back burner – an itch they wanted to scratch either during slow work days or (in this case) on nights and weekends.

This is mine: the citizen-quotes project, an app that uses simple machine learning techniques to extract more than 40,000 quotes from every article that ran on The Bay Citizen since it launched in 2010. The goal was to build something that accounts for the limitations of the traditional method of solving quote extraction – regular expressions and pattern matching. And sure enough, it does a pretty good job.

Illustrates the application of machine learning to a non-trivial text analysis problem.
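The regex baseline Davis describes, and why it falls short, fits in a few lines. A minimal sketch (a hypothetical pattern, not the citizen-quotes code):

```python
import re

# Naive approach: grab text between double quotes followed by a
# '... said <Capitalized Name>' attribution.
QUOTE_RE = re.compile(r'"(.+?)",?\s+(?:said|says)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)')

text = (
    '"I will not raise taxes," said Jane Doe. '   # caught by the pattern
    'John said he hates cheeseburgers. '          # indirect quote: silently missed
    'She called the plan "reckless" yesterday.'   # quote without attribution: missed
)

for quote, speaker in QUOTE_RE.findall(text):
    print(speaker, "->", quote)
```

Every fix for another edge case makes the pattern hairier, which is why training a classifier over labeled sentences ends up being the more maintainable route.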

Universal properties of mythological networks

Filed under: Networks,Social Networks — Patrick Durusau @ 2:30 pm

Universal properties of mythological networks by Pádraig Mac Carron and Ralph Kenna (2012 EPL 99 28002 doi:10.1209/0295-5075/99/28002)

Abstract:

As in statistical physics, the concept of universality plays an important, albeit qualitative, role in the field of comparative mythology. Here we apply statistical mechanical tools to analyse the networks underlying three iconic mythological narratives with a view to identifying common and distinguishing quantitative features. Of the three narratives, an Anglo-Saxon and a Greek text are mostly believed by antiquarians to be partly historically based while the third, an Irish epic, is often considered to be fictional. Here we use network analysis in an attempt to discriminate real from imaginary social networks and place mythological narratives on the spectrum between them. This suggests that the perceived artificiality of the Irish narrative can be traced back to anomalous features associated with six characters. Speculating that these are amalgams of several entities or proxies, renders the plausibility of the Irish text comparable to the others from a network-theoretic point of view.

A study that suggests there is more to be learned about networks, social, mythological and otherwise. But three (3) examples out of the extant accounts, mythological and otherwise, aren’t enough for definitive conclusions.

BTW, if you are interested in the use of social networks with literature, see Extracting Social Networks from Literary Fiction by David K. Elson, Nicholas Dames, and Kathleen R. McKeown for one approach. (If you know of a recent survey on extraction of social networks, please forward it and I will cite you in a post.)
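If you want to experiment with the kind of discriminators the paper uses, here is a minimal sketch, assuming NetworkX and a character co-occurrence edge list you have extracted yourself (the edges below are hypothetical):

```python
import networkx as nx

# Hypothetical edge list: pairs of characters who interact in a narrative.
edges = [
    ("Beowulf", "Hrothgar"), ("Beowulf", "Wiglaf"), ("Beowulf", "Grendel"),
    ("Hrothgar", "Wealhtheow"), ("Hrothgar", "Unferth"), ("Unferth", "Beowulf"),
]
g = nx.Graph(edges)

# Properties the paper compares against known real social networks.
print("mean degree:          ", sum(d for _, d in g.degree()) / g.number_of_nodes())
print("average clustering:   ", nx.average_clustering(g))
print("degree assortativity: ", nx.degree_assortativity_coefficient(g))
```

Real social networks tend to be assortative (well-connected characters know each other); the paper’s argument about the Irish epic turns on exactly these kinds of measurements.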

Community Based Annotation (mapping?)

Filed under: Annotation,Bioinformatics,Biomedical,Interface Research/Design,Ontology — Patrick Durusau @ 1:51 pm

Enabling authors to annotate their articles is examined in: Assessment of community-submitted ontology annotations from a novel database-journal partnership by Tanya Z. Berardini, Donghui Li, Robert Muller, Raymond Chetty, Larry Ploetz, Shanker Singh, April Wensel and Eva Huala.

Abstract:

As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles’ contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed.

We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.

It is encouraging that this annotation effort started with the persons most likely to know the correct answers, authors of the papers in question.

The low initial participation rate (16%), and the improved rate after an email reminder (53%), were less encouraging.

I suspect that unless and until prior annotation practice by researchers becomes a line item on current funding requests (how many annotations were accepted by publishers of your prior research?), annotation will continue to be a low priority item.

Perhaps I should suggest that as a study area for the NIH?

Publishers, researchers who build annotation software, annotated data sources and their maintainers, are all likely to be interested.

Would you be interested as well?

Does category theory make you a better programmer?

Filed under: Category Theory,Modeling — Patrick Durusau @ 11:00 am

Does category theory make you a better programmer? by Debasish Ghosh.

From the post:

How much of category theory knowledge should a working programmer have? I guess this depends on what kind of language the programmer uses in his daily life. Given the proliferation of functional languages today, specifically typed functional languages (Haskell, Scala etc.) that embeds the typed lambda calculus in some form or the other, the question looks relevant to me. And apparently to a few others as well. In one of his courses on Category Theory, Graham Hutton mentioned the following points when talking about the usefulness of the theory:

  • Building bridges – exploring relationships between various mathematical objects, e.g., Products and Functions
  • Unifying ideas – abstracting from unnecessary details to give general definitions and results, e.g., Functors
  • High level language – focusing on how things behave rather than what their implementation details are, e.g., specification vs. implementation
  • Type safety – using types to ensure that things are combined only in sensible ways, e.g., (f: A -> B, g: B -> C) => (g o f: A -> C)
  • Equational proofs – performing proofs in a purely equational style of reasoning

Many of the above points can be related to the experience that we encounter while programming in a functional language today. We use Product and Sum types, we use Functors to abstract our computation, we marry types together to encode domain logic within the structures that we build and many of us use equational reasoning to optimize algorithms and data structures.

But how much do we need to care about how category theory models these structures and how that model maps to the ones that we use in our programming model?

Read the post for Debasish’s answer for programmers.

For topic map authors, remember category theory began as an effort to find commonalities between abstract mathematical structures.

Commonalities? That sounds a lot like subject sameness doesn’t it?

With category theory you can describe, model, and uncover commonalities in mathematical structures, and in other areas as well.

A two for one as it were. Sounds worthwhile to me.

I first saw this at DZone.
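For the programming side of Debasish’s question, here is a tiny sketch of two of Hutton’s bullets in Python: composition as the “type safety” example and a bare-bones Maybe functor. Purely illustrative, nothing more:

```python
from typing import Callable, Generic, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

# (f: A -> B, g: B -> C) => (g o f: A -> C)
def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    return lambda a: g(f(a))

class Maybe(Generic[A]):
    """A minimal functor: fmap applies a function inside the structure and
    preserves composition: m.fmap(compose(g, f)) == m.fmap(f).fmap(g)."""

    def __init__(self, value: Optional[A]):
        self.value = value

    def fmap(self, f: Callable[[A], B]) -> "Maybe[B]":
        return Maybe(None) if self.value is None else Maybe(f(self.value))

parse = int                            # f: str -> int
double = lambda n: n * 2               # g: int -> int

print(compose(double, parse)("21"))                   # 42
print(Maybe("21").fmap(parse).fmap(double).value)     # 42
print(Maybe(None).fmap(parse).fmap(double).value)     # None
```

The point is not the code itself but that the categorical laws (composition, functor laws) are what let you reason about this kind of plumbing without running it.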

How to Make an Interactive Network Visualization

Filed under: D3,Graphics,Graphs,Javascript,Music,Networks,Visualization — Patrick Durusau @ 10:10 am

How to Make an Interactive Network Visualization by Jim Vallandingham.

From the post:

Interactive network visualizations make it easy to rearrange, filter, and explore your connected data. Learn how to make one using D3 and JavaScript.

Networks! They are all around us. The universe is filled with systems and structures that can be organized as networks. Recently, we have seen them used to convict criminals, visualize friendships, and even to describe cereal ingredient combinations. We can understand their power to describe our complex world from Manuel Lima’s wonderful talk on organized complexity. Now let’s learn how to create our own.

In this tutorial, we will focus on creating an interactive network visualization that will allow us to get details about the nodes in the network, rearrange the network into different layouts, and sort, filter, and search through our data.

In this example, each node is a song. The nodes are sized based on popularity, and colored by artist. Links indicate two songs are similar to one another.

Try out the visualization on different songs to see how the different layouts and filters look with the different graphs.

You know this isn’t a post about politics because it would be visualizing friendships with convicted criminals. 😉

A degree of separation graph between elected public officials and convicted white collar criminals? A topic map for another day.

For today, enjoy learning how to use D3 and JavaScript for impressive network visualizations.

Imagine mapping the cereal visualization to the shelf locations at your local Kroger, where selecting one ingredient identifies the store locations of others.

August 1, 2012

Balisage 2012 – Proceedings & Symposium

Filed under: Conferences,CS Lectures — Patrick Durusau @ 8:07 pm

The Balisage Proceedings and Symposium materials are online! (before the conference/symposium):

Balisage 2012

cover: http://www.balisage.net/Proceedings/vol8/cover.html
table of contents: http://www.balisage.net/Proceedings/vol8/contents.html

Symposium

cover: http://www.balisage.net/Proceedings/vol9/cover.html
table of contents: http://www.balisage.net/Proceedings/vol9/contents.html

As of tomorrow, you have 4 days (starts August 6th) to make the Symposium and 5 days (starts August 7th) to make Balisage.

Same day ticket purchase/travel is still possible but why risk it? Besides, I’m sure Greece can’t afford Interpol fees anymore. 😉

Your choices are:

Attend or,

Spend the rest of the year making up lame excuses for not being at Balisage in Montreal.

Choice is yours!

Swoosh: a generic approach to entity resolution

Filed under: Deduplication,Entity Resolution — Patrick Durusau @ 7:53 pm

Swoosh: a generic approach to entity resolution by Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. The VLDB Journal (2008).

Do you remember Swoosh?

I saw it today in Five Short Links by Pete Warden.

Abstract:

We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” ( e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.

It sounds familiar.

Running some bibliographic searches, it looks like 100 references since 2011. That’s going to take a while! But it all looks like good stuff.
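If you don’t remember the details, the heart of the approach is easy to sketch: match and merge are black boxes, and records are merged until a fixed point is reached. A minimal sketch in the spirit of R-Swoosh (simplified; the paper is careful about which comparisons can be skipped when the match and merge properties hold):

```python
def resolve(records, match, merge):
    """Generic entity resolution: keep merging matching records until
    nothing left in the input matches anything already resolved."""
    pending = list(records)
    resolved = []
    while pending:
        r = pending.pop()
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            resolved.remove(partner)
            pending.append(merge(r, partner))   # the merged record may match others
    return resolved

# Toy black-box match/merge functions over dict records.
def match(r, s):
    return bool(r["emails"] & s["emails"]) or r["name"] == s["name"]

def merge(r, s):
    return {"name": max(r["name"], s["name"], key=len),
            "emails": r["emails"] | s["emails"]}

people = [
    {"name": "J. Smith", "emails": {"js@example.org"}},
    {"name": "John Smith", "emails": {"js@example.org", "john@example.com"}},
    {"name": "Jane Doe", "emails": {"jd@example.org"}},
]
print(resolve(people, match, merge))   # two entities: John Smith and Jane Doe
```

Swap in your own match and merge functions and the loop does not change, which is exactly the “generic” part of the title.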

Updated: Lists of Legal Metadata and Legal Knowledge Representation Resources

Filed under: Law,Legal Informatics — Patrick Durusau @ 7:33 pm

Updated: Lists of Legal Metadata and Legal Knowledge Representation Resources

Updated resource lists for anyone interested in legal informatics.

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 7:19 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

As mentioned in Part I, although Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop/Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.

Details of the setup for processing rat brain signals with Hadoop.

Looking back, I did not see any mention of data sets. Perhaps in Part III?

Known Unknowns

Filed under: Government,Military — Patrick Durusau @ 7:01 pm

I discovered a good example of a “known unknown” today: the GAO report entitled Multiple DOD Organizations are Developing Numerous Initiatives.

From the summary:

We identified 1,340 potential, separate initiatives that DOD funded from fiscal year 2008 through the first quarter of fiscal year 2012 that, in DOD officials’ opinion, met the above definition for C-IED initiatives. We relied on our survey, in part, to determine this number because DOD has not determined, and does not have a ready means for determining, the universe of C-IED initiatives. Of the 1,340 initiatives, we received detailed survey responses confirming that 711 initiatives met our C-IED definition. Of the remaining 629 initiatives for which we did not receive survey responses, 481 were JIEDDO initiatives. JIEDDO officials attribute their low survey returns for reasons including that C-IED initiatives are currently not fully identified, catalogued, and retrievable; however, they expect updates to their information technology system will correct this deficiency. Our survey also identified 45 different organizations that DOD is funding to undertake these 1,340 identified initiatives. Some of these organizations receive JIEDDO funding while others receive other DOD funding. We documented $4.8 billion of DOD funds expended in fiscal year 2011 in support of C-IED initiatives, but this amount is understated because we did not receive survey data confirming DOD funding for all initiatives. As an example, at least 94 of the 711 responses did not include funding amounts for associated C-IED initiatives. Further, the DOD agency with the greatest number of C-IED initiatives identified—JIEDDO—did not return surveys for 81 percent of its initiatives.

Our survey results showed that multiple C-IED initiatives were concentrated within some areas of development, resulting in overlap within DOD for these efforts—i.e., programs engaged in similar activities to achieve similar goals or target similar beneficiaries. For example, our survey data identified 19 organizations with 107 initiatives being developed to combat cell phone-triggered IEDs. While the concentration of initiatives in itself does not constitute duplication, this concentration taken together with the high number of different DOD organizations that are undertaking these initiatives and JIEDDO’s inability to identify and compare C-IED initiatives, demonstrates overlap and the potential for duplication of effort. According to JIEDDO officials, the organization has a robust coordinating process in place that precludes unintended overlap. However, through our survey and follow-up with relevant agency officials, we found examples of overlap in the following areas: (1) IED-related intelligence analysis: two organizations were producing and disseminating similar IED-related intelligence products to the warfighter, (2) C-IED hardware development: two organizations were developing similar robotics for detecting IEDs from a safe distance, and (3) IED detection: two organizations had developed C-IED initiatives using chemical sensors that were similar in their technologies and capabilities.

Our survey results showed that a majority of respondents said they communicated with JIEDDO regarding their C-IED initiatives; however, JIEDDO does not consistently record and track this data. Based on our prior work, JIEDDO does not have a mechanism for recording data communicated on C-IED efforts. Therefore, these data are not available for analysis by JIEDDO or others in DOD to reduce the risk of duplicating efforts and avoid repeating mistakes. (emphasis added)

As the summary points out, there is no reason to presume duplication with 1,340 initiatives to address the same problem. Why would anyone think that?

And for that matter, you have to have data from the 629 non-responding programs. BTW, 481 of those are from the Joint Improvised Explosive Device Defeat Organization, JIEDDO. I don’t guess there is any reason to call attention to the fact that the organization responsible for defeating IEDs is busy not tracking efforts to defeat them.

Any known unknowns in your organization?

Practical machine learning tricks…

Filed under: Machine Learning,MapReduce — Patrick Durusau @ 2:03 pm

Practical machine learning tricks from the KDD 2011 best industry paper by David Andrzejewski.

From the post:

A machine learning research paper tends to present a newly proposed method or algorithm in relative isolation. Problem context, data preparation, and feature engineering are hopefully discussed to the extent required for reader understanding and scientific reproducibility, but are usually not the primary focus. Given the goals and constraints of the format, this can be seen as a reasonable trade-off: the authors opt to spend scarce "ink" on only the most essential (often abstract) ideas.

As a consequence, implementation details relevant to the use of the proposed technique in an actual production system are often not mentioned whatsoever. This aspect of machine learning is often left as "folk wisdom" to be picked up from colleagues, blog posts, discussion boards, snarky tweets, open-source libraries, or more often than not, first-hand experience.

Papers from conference "industry tracks" often deviate from this template, yielding valuable insights about what it takes to make machine learning effective in practice. This paper from Google on detecting "malicious" (ie, scam/spam) advertisements won best industry paper at KDD 2011 and is a particularly interesting example.

Detecting Adversarial Advertisements in the Wild

D. Sculley, Matthew Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, and Yunkai Zhou

http://research.google.com/pubs/archive/37195.pdf

At first glance, this might appear to be a "Hello-World" machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes on a set of bad ads versus a set of good ones. However this is apparently far from being the case – while Google is understandably shy about hard numbers, the paper mentions several issues which make this especially challenging and notes that this is a business-critical problem for Google.

The paper describes an impressive and pragmatic blend of different techniques and tricks. I've briefly described some of the highlights, but I would certainly encourage the interested reader to check out the original paper and presentation slides.

In addition to the original paper and slides, I would suggest having David’s comments at hand while you read the paper. Not to mention having access to a machine and online library at the same time.

There is much here to repurpose to assist you and your users.
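For reference, the “Hello-World” baseline the post mentions, a Naive Bayes classifier over labeled ads, looks roughly like the following, assuming scikit-learn and a toy hand-labeled dataset. The Google paper is essentially about everything you have to add on top of this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical tiny training set: ad text labeled good (0) or adversarial (1).
ads = [
    "genuine discount shoes free shipping",         # 0
    "work from home earn $5000 a week guaranteed",  # 1
    "miracle weight loss pill doctors hate",        # 1
    "spring sale on laptops and tablets",           # 0
]
labels = [0, 1, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(ads, labels)

print(model.predict(["guaranteed miracle earnings from home"]))  # expect [1]
print(model.predict(["tablet sale free shipping"]))              # expect [0]
```

Class imbalance, adversaries who adapt, noisy labels and the sheer scale of the ad corpus are what turn this textbook exercise into the system described in the paper.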

Semantic Silver Bullets?

Filed under: Information Sharing,Marketing,Semantics — Patrick Durusau @ 1:46 pm

The danger of believing in silver bullets

Nick Wakeman writes in the Washington Technology Business Beat:

Whether it is losing weight, getting rich or managing government IT, it seems we can’t resist the lure of a silver bullet. The magic pill. The easy answer.

Ten or 12 years ago, I remember a lot of talk about leasing and reverse auctions, and how they were going to transform everything.

Since then, outsourcing and insourcing have risen and fallen from favor. Performance-based contracting was going to be the solution to everything. And what about the huge systems integration projects like Deepwater?

They start with a bang and end with a whimper, or in some cases, a moan and a whine. And of course, along the way, millions and even billions of dollars get wasted.

I think we are in the midst of another silver bullet phenomenon with all the talk around cloud computing and everything as a service.

I wish I could say that topic maps are a semantic silver bullet. Or better yet, a semantic hand grenade. One that blows other semantic approaches away.

Truthfully, topic maps are neither one.

Topic maps rely upon users, assisted by various technologies, to declare and identify subjects they want to talk about and, just as importantly, relationships between those subjects. Not to mention where information about those subjects can be found.

If you need evidence of the difficulty of those tasks, consider the near-idiotic results you get from search engines. Considering the task, they do pretty well, but pretty well still takes time and effort to sort out every time you search.

Topic maps aren’t easy, no silver bullet, but you can capture subjects of interest to you, define their relationships to other subjects and specify where more information can be found.

Once captured, that information can be shared, used and/or merged with information gathered by others.

Bottom line is that better semantic results, for sharing, for discovery, for navigation, all require hard work.

Are you ready?

Useful junk?:…

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 10:12 am

Useful junk?: the effects of visual embellishment on comprehension and memorability of charts by Scott Bateman, Regan L. Mandryk, Carl Gutwin, Aaron Genest, David McDine, and Christopher Brooks.

Abstract:

Guidelines for designing information charts (such as bar charts) often state that the presentation should reduce or remove ‘chart junk’ – visual embellishments that are not essential to understanding the data. In contrast, some popular chart designers wrap the presented data in detailed and elaborate imagery, raising the questions of whether this imagery is really as detrimental to understanding as has been proposed, and whether the visual embellishment may have other benefits. To investigate these issues, we conducted an experiment that compared embellished charts with plain ones, and measured both interpretation accuracy and long-term recall. We found that people’s accuracy in describing the embellished charts was no worse than for plain charts, and that their recall after a two-to-three-week gap was significantly better. Although we are cautious about recommending that all charts be produced in this style, our results question some of the premises of the minimalist approach to chart design.

No, I didn’t just happen across this work while reading the morning paper. 😉

I started at Nathan Yau’s post, Nigel Holmes on explanation graphics and how he got started, and followed a link to a Column Five Media interview with Holmes, Nigel Holmes on 50 Years of Designing Infographics, because of a quote from Holmes on Edward Tufte that Nathan highlights:

Recent academic studies have proved many of his theses wrong.

which finally brings us to the article I link to above.

It may be the case that Edward Tufte does better with charts designed with the minimalist approach, but this article shows that other people may do better with other chart design principles.

But that’s the trick isn’t it?

We start from what makes sense to us and then generalize that to be the principle that makes the most sense for everyone.

I fear that is also the case with the design of topic map (and other) interfaces. We start with what works for us and generalize that to “that should work for everyone.”

Hard to hear evidence to the contrary. “If you just try it you will see that it works better than X way.”

I fear the solution is to test interfaces with actual user populations. Perhaps even injecting “randomness” into the design so we can test things we would never think of. Or even give users (shudder) the capacity to draw in controls or arrangements of controls.

You may not like the resulting interface but do you want to market to an audience of < 5 or educate and market to a larger audience? (Ask one of your investors if you are unsure.)

Fuse: From Invention to Antimatter

Filed under: Typography — Patrick Durusau @ 6:51 am

Fuse: From Invention to Antimatter (Amazon Link)

James Cheshire has his brother review “Fuse: From Invention to Antimatter,” under the blog title: Book Review: Fuse.

From the post:

“In a world of generic mediocrity and corporate obeyance, new flowers of exuberance bloom in dark crevices. FUSE is a breach in the wall, a genetic mutation from which new lifeforms can spring […] Never before has FUSE been so relevant and so necessary.”

The words of Neville Brody open FUSE 1-20, From Invention to Antimatter: 20 years of FUSE with the air of positive aggression and idealism that continues throughout the book. Across twenty editions (since 1991) FUSE has sought to challenge and invigorate the language of typography. Always contained within a cardboard box, each themed issue featured written editorials from leading designers, posters and a disc with four or more fonts for personal use and exploration. This new book (within a FUSE box) from Taschen is essentially a retrospective of all the FUSE editions to date, along with additional essays, conference transcripts, and two new issues – FUSE19 and FUSE20.

Typography, like page layout, is one of those things most of us pass over without realizing its impact on communication or even understanding itself.

Definitely on my priority order list!

BTW, do note that Amazon lists it as “Fuse: Neville Brody,” while Cheshire’s review gives both “Fuse: From Invention to Antimatter” and “FUSE: 1-20, From Invention to Antimatter: 20 years of FUSE,” and no doubt other variations will abound.

Indexes in RAM?

Filed under: Indexing,Lucene,Zing JVM — Patrick Durusau @ 6:35 am

The Mike McCandless post Lucene index in RAM with Azul’s Zing JVM will help you make the case for putting your index in RAM!

From the post:

Google’s entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene’s RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven’t been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe’s excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven’t been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let’s test it! But first, a quick digression on the importance of measuring search response time of all requests.

There are obvious speed advantages to holding indexes in RAM.

Curious, is RAM just a quick disk? Or do we need to think about data structures/access differently with RAM? Pointers?
