Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 4, 2013

NLTK 1.1 – Computing with Language: …

Filed under: Lisp,Natural Language Processing,NLTK — Patrick Durusau @ 3:56 pm

NLTK 1.1 – Computing with Language: Texts and Words by Vsevolod Dyomkin.

From the post:

OK, let’s get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using it. Actually, the underlying algorithms that allow to achieve these results are mostly quite basic. We’ll discuss them in this post and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

A continuation of Natural Language Meta Processing with Lisp.

Who knows? You might decide that Lisp is a natural language. 😉
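If you want a taste of what that first chapter covers before diving into the Lisp, here is a rough Python sketch of the book's own opening moves (concordance and lexical diversity). The data comes from NLTK's bundled "book" module, not from Vsevolod's Lisp code:

```python
# A taste of NLTK book chapter 1 in its native Python (the Lisp post
# re-creates the same ideas). One-time setup: pip install nltk, then
# nltk.download("book") for the sample texts.
from nltk.book import text1  # Moby Dick

# Concordance: every occurrence of a word in its surrounding context.
text1.concordance("monstrous")

# Lexical diversity: distinct tokens over total tokens.
def lexical_diversity(text):
    return len(set(text)) / len(text)

print(lexical_diversity(text1))
```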

A.nnotate

Filed under: Annotation,Indexing — Patrick Durusau @ 3:29 pm

A.nnotate

From the homepage:

A.nnotate is an online annotation, collaboration and indexing system for documents and images, supporting PDF, Word and other document formats. Instead of emailing different versions of a document back and forth you can now all comment on a single read-only copy online. Documents are displayed in high quality with fonts and layout just like the printed version. It is easy to use and runs in all common web browsers, with no software or plugins to install.

Hosted solutions are available for individuals and workgroups. For enterprise users the full system is available for local installation. Special discounts apply for educational use. A.nnotate technology can also be used to enhance existing document and content management systems with high quality online document viewing, annotation and collaboration facilities.

I suppose that is one way to solve the “index merging” problem.

Everyone use a common document.

Doesn’t help if a group starts with different copies of the same document.

Or if other indexes from other documents need to be merged with the present document.

Not to mention merging indexes/annotations separate from any particular document instance.

Still, a step away from the notion of a document as a static object.

Which is a good thing.

I first saw this in a tweet by Stian Danenbarger.

FoundationDB

Filed under: FoundationDB,NoSQL — Patrick Durusau @ 3:07 pm

FoundationDB

FoundationDB Beta 1 is now available!

It will take a while to sort out all of its features, etc.

I should mention that it is refreshing that the documentation contains Known Limitations.

All software has limitations but few ever acknowledge them up front.

You have to encounter one before one of the technical folks says: “…yes, we have been meaning to work on that.”

I would rather know up front what the limitations are.

Whether FoundationDB meets your requirements or not, it is good to see that kind of transparency.

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™

Filed under: GraphBuilder,Graphs,Hadoop,MapReduce,Networks — Patrick Durusau @ 2:56 pm

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™ by Theodore L. Willke, Nilesh Jain and Haijie Gu. (whitepaper)

Abstract:

The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these frameworks is a huge challenge in itself. Since graph construction is a data-parallel problem, Hadoop is well-suited for this task but lacks some elements that would make things easier for data scientists that do not have domain expertise in distributed systems engineering. We developed GraphBuilder, a scalable graph construction software library for Apache Hadoop, to address this gap. GraphBuilder offloads many of the complexities of graph construction, including graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization. It is written in Java for ease of programming and scales using the MapReduce parallel programming model. We describe the motivation for GraphBuilder, its architecture, and present two case studies that provide a preliminary evaluation.

The “whitepaper” introduction to GraphBuilder.
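To see why graph construction is a natural data-parallel problem, here is a toy Python sketch (my own illustration, not GraphBuilder's Java API) of the map/reduce shape of the task: map raw records to candidate edges, then reduce to deduplicate and tabulate:

```python
# Toy illustration (not GraphBuilder's API): graph construction as a
# data-parallel map/reduce problem. "Map" emits candidate edges from raw
# records; "reduce" deduplicates and tabulates edge weights.
from collections import defaultdict
from itertools import combinations

documents = {
    "doc1": ["hadoop", "graphs", "mapreduce"],
    "doc2": ["graphs", "neo4j"],
    "doc3": ["hadoop", "graphs"],
}

# Map phase: emit an edge for every pair of terms co-occurring in a document.
def map_edges(doc_terms):
    for a, b in combinations(sorted(set(doc_terms)), 2):
        yield (a, b), 1

# Reduce phase: sum counts per edge (tabulation + deduplication).
edge_weights = defaultdict(int)
for terms in documents.values():
    for edge, count in map_edges(terms):
        edge_weights[edge] += count

for (src, dst), w in sorted(edge_weights.items()):
    print(f"{src} -- {dst} (weight {w})")
```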

$1.55 Trillion in Federal Spending Misreported in 2011

Filed under: Government,Government Data — Patrick Durusau @ 11:30 am

With $1.55 Trillion in Federal Spending Misreported in 2011, Data Transparency Coalition Renews Call for Congressional Action
Updating Senator Dirksen for inflation: “A trillion here, a trillion there, and pretty soon you’re talking real money.” (Attributed to Senator Dirksen but not documented.)

From the press release:

The Data Transparency Coalition, the only group unifying the technology industry in support of federal data reform, applauded the release today of the Sunlight Foundation’s Clearspending report and called for the U.S. Congress to reintroduce and pass the Digital Accountability and Transparency Act (DATA Act) in order to rectify the misreporting of trillions of dollars in federal spending each year.

The Clearspending report, which analyzes the federal government’s spending information as published on USASpending.gov, showed that federal grant information published during fiscal year 2011 was inconsistent with other data sources for nearly 70 percent of all grant spending and lacked required data fields for 26 percent of all grant spending. In all, $1.55 trillion, or 94.5 percent of all grant spending, was inconsistent, incomplete, or untimely. The DATA Act would help rectify these problems by requiring full publication and consistent data standards for all federal spending.

“USASpending.gov fails, year after year, to deliver accurate data for one reason: the federal government lacks data standards for spending,” said Hudson Hollister, Executive Director of the Data Transparency Coalition. “The DATA Act would bring transparency and order to USASpending.gov by requiring consistent data standards for all federal spending information. Right now, there are no electronic codes to identify government grants, contracts, or even the grantees and contractors themselves. Without award IDs and nonproprietary recipient IDs, there is no way to easily check USASpending.gov for accuracy or even verify that agencies are actually submitting the information the law requires them to submit – and as Clearspending shows, many are not.”

Note the limitation of the report to grant information.

That is, USASpending.gov does not include non-grant spending, such as defense contracts and similar 21st century follies.

I have questions about the feasibility of universal, even within the U.S. government, data standards for spending. But I will address those in a separate post.

A New Representation of WordNet® using Graph Databases

Filed under: Graph Databases,Graphs,Neo4j,Networks,WordNet — Patrick Durusau @ 10:46 am

A New Representation of WordNet® using Graph Databases by Khaled Nagi.

Abstract:

WordNet® is one of the most important resources in computation linguistics. The semantically related database of English terms is widely used in text analysis and retrieval domains, which constitute typical features, employed by social networks and other modern Web 2.0 applications. Under the hood, WordNet® can be seen as a sort of read-only social network relating its language terms. In our work, we implement a new storage technique for WordNet® based on graph databases. Graph databases are a major pillar of the NoSQL movement with lots of emerging products, such as Neo4j. In this paper, we present two Neo4j graph storage representations for the WordNet® dictionary. We analyze their performance and compare them to other traditional storage models. With this contribution, we also validate the applicability of modern graph databases in new areas beside the typical large-scale social networks with several hundreds of millions of nodes.

Finally, a paper that covers “moderate size databases!”

Think about the average graph database you see on this blog. Not really in the “moderate” range, even though a majority of users work in the moderate range.

Compare the number of Facebook size enterprises with the number of enterprises generally.

Not dissing super-sized graph databases or research on same. I enjoy both a lot.

But for your average customer, experience with “moderate size databases” may be more immediately relevant.

I first saw this in a tweet from Peter Neubauer.
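If you want to experiment along these lines, here is a hedged Python sketch (my own, not the paper's two Neo4j representations) that walks WordNet via NLTK and emits synset-to-hypernym triples, the kind of edge list you could bulk-load into a graph database:

```python
# A hedged sketch (not the paper's Neo4j schemas): walk WordNet via NLTK
# and emit synset nodes plus hypernym edges as simple triples, a form
# that could be bulk-loaded into a graph database.
# Requires: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def wordnet_triples(limit=20):
    count = 0
    for synset in wn.all_synsets():
        for hyper in synset.hypernyms():
            yield (synset.name(), "HYPERNYM", hyper.name())
            count += 1
            if count >= limit:
                return

for src, rel, dst in wordnet_triples():
    print(src, rel, dst)
```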

Duke 1.0 Release!

Filed under: Duke,Entity Resolution,Record Linkage — Patrick Durusau @ 9:52 am

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.

Excellent news on the data deduplication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!
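For a feel of the basic matching approach described in the linked blog post, here is a rough Python sketch of the naive O(n²) strategy: per-field comparators combined into a score and cut off at a threshold. The records and weights are made up for illustration; Duke's actual engine is Java on Lucene and combines comparator probabilities more carefully than this.

```python
# Hedged sketch of naive O(n^2) record matching in the spirit of the
# early prototype described in the blog post -- not Duke's actual
# Java/Lucene engine or its probability model.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "J. Smith", "city": "Oslo"},
    {"id": 2, "name": "John Smith", "city": "Oslo"},
    {"id": 3, "name": "Jane Doe", "city": "Bergen"},
]

def string_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, fields=("name", "city")):
    # Simple average of per-field similarities; Duke combines
    # per-comparator probabilities more carefully than this.
    return sum(string_similarity(r1[f], r2[f]) for f in fields) / len(fields)

THRESHOLD = 0.75
for r1, r2 in combinations(records, 2):
    score = match_score(r1, r2)
    if score >= THRESHOLD:
        print(f"possible duplicate: {r1['id']} <-> {r2['id']} ({score:.2f})")
```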

March 3, 2013

TAXMAP GOES HEAD TO HEAD WITH GOOGLE

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:20 pm

TAXMAP GOES HEAD TO HEAD WITH GOOGLE

I wrote it for Topicmaps.com.

You can guide yourself through a comparison of TaxMap (the oldest online topic map based application) to Google.

What do you conclude from the comparison?

A Fast Parallel Maximum Clique Algorithm…

Filed under: Dynamic Graphs,Graphs,Networks,Sparse Data — Patrick Durusau @ 4:01 pm

A Fast Parallel Maximum Clique Algorithm for Large Sparse Graphs and Temporal Strong Components by Ryan A. Rossi, David F. Gleich, Assefaw H. Gebremedhin, Md. Mostofa Ali Patwary.

Abstract:

We propose a fast, parallel, maximum clique algorithm for large, sparse graphs that is designed to exploit characteristics of social and information networks. We observe roughly linear runtime scaling over graphs between 1000 vertices and 100M vertices. In a test with a 1.8 billion-edge social network, the algorithm finds the largest clique in about 20 minutes. For social networks, in particular, we found that using the core number of a vertex in combination with a good heuristic clique finder efficiently removes the vast majority of the search space. In addition, we parallelize the exploration of the search tree. In the algorithm, processes immediately communicate changes to upper and lower bounds on the size of maximum clique, which occasionally results in a super-linear speedup because vertices with especially large search spaces can be pruned by other processes. We use this clique finder to investigate the size of the largest temporal strong components in dynamic networks, which requires finding the largest clique in a particular temporal reachability graph.

Thirty-two networks are reported in this paper, and a promised online appendix reports around eighty (80).

The online appendix is live but as of today (March 2, 2013), it has no content.

No matter, the paper should keep you busy for more than a little while. 😉

I am interested in parallel graph processing in general but the concept of communicating “…changes to upper and lower bounds on the size of maximum clique…” seems applicable to “merging” in topic maps.

That is, if some set of topics shares a common characteristic that excludes them from consideration for merging, why apply the merging test at all?

Will have to think about that.
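To make the pruning idea concrete, here is a hedged networkx sketch (my own, not the paper's parallel algorithm): a heuristic clique gives a lower bound, and any vertex whose core number falls below that bound minus one can be discarded before any exact search begins.

```python
# Hedged sketch of the pruning idea only (not the paper's parallel
# algorithm): a vertex whose core number is below k-1 cannot belong to a
# clique of size k, so a heuristic clique gives a lower bound that lets
# us discard most vertices before exact search.
# Requires: pip install networkx
import networkx as nx

G = nx.gnm_random_graph(200, 1500, seed=42)

def greedy_clique(G):
    # Greedy heuristic: start from a highest-degree vertex and keep adding
    # neighbours adjacent to everything chosen so far.
    start = max(G.degree, key=lambda kv: kv[1])[0]
    clique = {start}
    candidates = set(G[start])
    while candidates:
        v = max(candidates, key=G.degree)
        clique.add(v)
        candidates &= set(G[v])
    return clique

lower_bound = len(greedy_clique(G))
core = nx.core_number(G)
survivors = [v for v in G if core[v] >= lower_bound - 1]

print(f"heuristic clique size (lower bound): {lower_bound}")
print(f"vertices surviving core-number pruning: {len(survivors)} of {G.number_of_nodes()}")
```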

Spring for Hadoop …

Filed under: Hadoop,Spring Hadoop — Patrick Durusau @ 3:41 pm

Spring for Hadoop simplifies application development

From the post:

After almost exactly a year of development, SpringSource has released Spring for Hadoop 1.0 with the goal of making the development of Hadoop applications easier for users of the distributed application framework. VMware engineer Costin Leau said in the release announcement that the company has often seen developers use the out-of-the-box tools that come with Hadoop in ways that lead to a “poorly structured collection of command line utilities, scripts and pieces of code stitched together.” Spring for Hadoop aims to change this by applying the Template API design pattern from Spring to Hadoop.

This application gives helper classes such as HBaseTemplate, HiveTemplate and PigTemplate, which interface with the different parts of the Hadoop ecosystem; Java-centric APIs such as Cascading can also be used with or without additional configuration. The software enables Spring functionality such as thread-safe access to lower level resources and lightweight object mapping in Hadoop applications. Leau also says that Spring for Hadoop is designed to allow projects to grow organically. To do this, users can mix and match various runner classes for scripts and, as the complexity of the application increases, developers can migrate to Spring Batch and manage these processes through a REST-based API.

Spring for Hadoop 1.0 is available from the SpringSource web site under the Apache 2.0 License. The developers say they are testing the software daily against various Hadoop 1.x distributions such as Apache Hadoop and Greenplum HD, as well as Cloudera CDH3 and CDH4. Greenplum HD already includes Spring for Hadoop in its distribution. Support for Hadoop 2.x is expected “in the near future”.

I’m going to leave characterization of present methods of working with Hadoop for others. 😉

The Jigsaw secure distributed file system [TM Equivalents?]

Filed under: Cybersecurity,Jigsaw File System,Security — Patrick Durusau @ 3:25 pm

The Jigsaw secure distributed file system by Jiang Bian and Remzi Seker.

Abstract:

The Jigsaw Distributed File System (JigDFS) aims to securely store and retrieve files on large scale networks. The design of JigDFS is driven by the privacy needs of its users. Files in JigDFS are sliced into small segments using an Information Dispersal Algorithm (IDA) and distributed onto different nodes recursively. JigDFS provides fault-tolerance against node failures while assuring confidentiality, integrity, and availability of the stored data. Layered encryption is applied to each file segment with keys produced by a hashed-key chain algorithm. Recursive IDA and layered encryption enhance users’ anonymity and provide a degree of plausible deniability. JigDFS is envisioned to be an ideal long-term storage solution for developing secure data archiving systems.

Very interesting!

Reminds me that data could be split into topics, which only merge if you know the basis for meaningful merger. Otherwise it is a schema-free bag of tuples. 😉

In other words, you know someone in a population of 10,000 medical records is HIV positive, but without the proper merging key, it isn’t possible to say who.

I first saw this at Datanami.
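As a toy illustration of the dispersal idea (not JigDFS’s actual Information Dispersal Algorithm or layered encryption), here is a Python sketch that XOR-splits a segment into shares, so that no subset short of all of them reveals anything:

```python
# Toy XOR secret splitting -- not JigDFS's IDA, just the flavour of it:
# a segment is split into n shares (n >= 2), any subset short of all n
# looks like random noise, and only combining every share recovers the
# data. A real IDA also gives a k-of-n threshold, which this does not.
import os
from functools import reduce

def split(segment: bytes, n: int) -> list[bytes]:
    shares = [os.urandom(len(segment)) for _ in range(n - 1)]
    last = bytes(b ^ reduce(lambda x, y: x ^ y, pads)
                 for b, *pads in zip(segment, *shares))
    return shares + [last]

def combine(shares: list[bytes]) -> bytes:
    return bytes(reduce(lambda x, y: x ^ y, group) for group in zip(*shares))

segment = b"patient record: HIV status ..."
shares = split(segment, 4)
assert combine(shares) == segment
print("any single share alone:", shares[0][:8].hex(), "...")
```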

Data models for version management…

Filed under: Data Models,Government,Government Data,Legal Informatics — Patrick Durusau @ 2:27 pm

Data models for version management of legislative documents by María Hallo Carrasco, M. Mercedes Martínez-González, and Pablo de la Fuente Redondo.

Abstract:

This paper surveys the main data models used in projects including the management of changes in digital normative legislation. Models have been classified based on a set of criteria, which are also proposed in the paper. Some projects have been chosen as representative for each kind of model. The advantages and problems of each type are analysed, and future trends are identified.

I first saw this at Legal Informatics, which had already assembled the following resources:

The legislative metadata models discussed in the paper include:

Useful as models of change tracking should you want to express that in a topic map.

To say nothing of overcoming the semantic impedance between these models.

Liferay / Marketplace

Filed under: Enterprise Integration,Open Source,Software — Patrick Durusau @ 2:14 pm

Liferay. Enterprise. Open Source. For Life.

Enterprise.

Liferay, Inc. was founded in 2004 in response to growing demand for Liferay Portal, the market’s leading independent portal product that was garnering industry acclaim and adoption across the world. Today, Liferay, Inc. houses a professional services group that provides training, consulting and enterprise support services to our clientele in the Americas, EMEA, and Asia Pacific. It also houses a core development team that steers product development.

Open Source.

Liferay Portal was, in fact, created in 2000 and boasts a rich open source heritage that offers organizations a level of innovation and flexibility unrivaled in the industry. Thanks to a decade of ongoing collaboration with its active and mature open source community, Liferay’s product development is the result of direct input from users with representation from all industries and organizational roles. It is for this reason, that organizations turn to Liferay technology for exceptional user experience, UI, and both technological and business flexibility.

For Life.

Liferay, Inc. was founded for a purpose greater than revenue and profit growth. Each quarter we donate to a number of worthy causes decided upon by our own employees. In the past we have made financial contributions toward AIDS relief and the Sudan refugee crisis through well-respected organizations such as Samaritan’s Purse and World Vision. This desire to impact the world community is the heart of our company, and ultimately the reason why we exist.

The Liferay Marketplace may be of interest for open source topic map projects.

There are only a few mentions of topic maps in the mailing list archives and none of those are recent.

Could be time to rekindle that conversation.

I first saw this at: Beyond Search.

Graph Based Recommendations using “How-To” Guides Dataset

Filed under: Graphs,Networks,Recommendation — Patrick Durusau @ 1:58 pm

Graph Based Recommendations using “How-To” Guides Dataset by Marcel Caraciolo.

From the post:

In this post I’d like to introduce another approach for recommender engines using graph concepts to recommend novel and interesting items. I will build a graph-based how-to tutorials recommender engine using the data available on the website SnapGuide (By the way I am a huge fan and user of this tutorials website), the graph database Neo4J and the graph traversal language Gremlin.

What is SnapGuide ?

Snapguide is a web service for anyone who wants to create and share step-by-step “how to guides”. It is available on the web and IOS app. There you can find several tutorials with easy visual instructions for a wide array of topics including cooking, gardening, crafts, projects, fashion tips and more. It is free and anyone is invited to submit guides in order to share their passions and expertise with the community. I have extracted from their website for only research purposes the corpus of tutorials likes. Several users may like the tutorial and this signal can be quite useful to recommend similar tutorials based on what other users liked. Unfortunately I can’t provide the dataset for download but the code you can follow below for your own data set.

An excellent tutorial that walks you through the creation of graph based recommendations, from acquiring the data to posting queries to it.

The SnapGuide site looks like another opportunity for topic map related tutorial material.
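For the flavor of the approach without Neo4j or Gremlin, here is a small Python sketch of the same traversal idea: from a user to the guides they liked, to other users who liked them, to those users’ other guides, ranked by overlap. The likes data is invented for illustration.

```python
# Hedged Python sketch of the idea the tutorial implements with Neo4j +
# Gremlin: traverse user -> liked guide -> other users -> their other
# liked guides, and rank candidates by how strongly they are reached.
from collections import Counter

likes = {
    "alice": {"sushi-rolls", "bonsai-basics"},
    "bob":   {"sushi-rolls", "fix-a-flat"},
    "carol": {"bonsai-basics", "knit-a-scarf"},
    "dave":  {"fix-a-flat", "knit-a-scarf", "sushi-rolls"},
}

def recommend(user, likes):
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap:
            for guide in theirs - mine:
                scores[guide] += overlap
    return scores.most_common()

print(recommend("alice", likes))
```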

GraphBuilder

Filed under: Graphs,Networks — Patrick Durusau @ 1:45 pm

GraphBuilder: Large-Scale Graph Construction using Apache™ Hadoop™

From the project page:

GraphBuilder is a Java library for constructing graphs out of large datasets for data analytics and structured machine learning applications that exploit relationships in data. The library offloads many of the complexities of graph construction, such as graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization. It scales using the MapReduce parallel programming model. The major components of GraphBuilder library, and its relation to Hadoop MapReduce, are shown below.

You may remember my post about the original release of this library in: Building graphs with Hadoop.

The GraphBuilder mailing list archives don’t show a lot of traffic, yet, so it may be easy to get noticed.

Project Panthera…

Filed under: Hadoop,SQL — Patrick Durusau @ 1:38 pm

Project Panthera: Better Analytics with SQL and Hadoop

Another Hintel project focused on Hadoop.

From the project page:

We have worked closely with many enterprise users over the past few years to enhance their new data analytics platforms using the Hadoop stack. Increasingly, these platforms have evolved from a batch-style, custom-built system for unstructured data, to become an integral component of the enterprise application framework. While the Hadoop stack provides a solid foundation for these platforms, gaps remain; in particular, enterprises are looking for full SQL support to seamlessly integrate these new platforms into their existing enterprise data analytics infrastructure. Project Panthera is our open source efforts to provide efficient support of standard SQL features on Hadoop, so as to enable many important, advanced use cases not supported by Hadoop today, including:

  • Exploring data with complex and sophisticated SQL queries (such as nested subqueries with aggregation functions) – for instance, about half of the queries in TPC-H (a standard decision support benchmark) use subqueries
  • Efficient storage engine for high-update rate SQL query workloads – while HBase is often used to support such workloads, query processing (e.g., Hive) on HBase can incur significant overheads as the storage engine completely ignores the SQL relational model
  • Utilizations of new hardware platform technologies (e.g., new flash technologies and large RAM capacities available in modern servers) for efficient SQL query processing

The objective of Project Panthera is to collaborate with the larger Hadoop community in enhancing the SQL support of the platform for a broader set of use cases. We are building these new capabilities on top of the Hadoop stack, and contributing necessary improvements of the underlying stack back to the existing Apache Hadoop projects. Our initial goals are:

SQL is still alive! Who knew? 😉

A good example of new technologies not replacing old ones, but being grafted onto them.

With that grafting, semantic impedance between the systems remains.

You can remap over that impedance on an ad hoc and varying basis.

Or, you can create a mapping today that can be re-used tomorrow.

Which sounds like a better option to you?

Project Rhino

Filed under: Cybersecurity,Hadoop,MapReduce,Project Rhino,Security — Patrick Durusau @ 1:21 pm

Project Rhino

Is Wintel becoming Hintel? 😉

If history is a guide, that might not be a bad thing.

From the project page:

As Hadoop extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with all Hadoop projects and HBase must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

The core of the Apache Hadoop ecosystem as it is commonly understood is:

  • Core: A set of shared libraries
  • HDFS: The Hadoop filesystem
  • MapReduce: Parallel computation framework
  • ZooKeeper: Configuration management and coordination
  • HBase: Column-oriented database on HDFS
  • Hive: Data warehouse on HDFS with SQL-like access
  • Pig: Higher-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Mahout: A library of machine learning and data mining algorithms
  • Flume: Collection and import of log and event data
  • Sqoop: Imports data from relational databases

These components are all separate projects and therefore cross cutting concerns like authN, authZ, a consistent security policy framework, consistent authorization model and audit coverage are loosely coordinated. Some security features expected by our customers, such as encryption, are simply missing. Our aim is to take a full stack view and work with the individual projects toward consistent concepts and capabilities, filling gaps as we go.

Like I said, might not be a bad thing!

Different from recent government rantings. Focused on a particular stack with the intent to analyze that stack, not the world at large, and to make specific improvements (read measurable results).

March 2, 2013

Hellerstein: Humans are the Bottleneck [Not really]

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 5:06 pm

Hellerstein: Humans are the Bottleneck by Isaac Lopez.

From the post:

Humans are the bottleneck right now in the data space, commented database systems luminary, Joe Hellerstein during an interview this week at Strata 2013.

“As Moore’s law drives the cost of computing down, and as data becomes more prevalent as a result, what we see is that the remaining bottleneck in computing costs is the human factor,” says Hellerstein, one of the fathers of adaptive query processing and a half dozen other database technologies.

Hellerstein says that recent research studies conducted at Stanford and Berkeley have found that 50-80 percent of a data analyst’s time is being used for the data grunt work (with the rest left for custom coding, analysis, and other duties).

“Data prep, data wrangling, data munging are words you hear over and over,” says Hellerstein. “Even with very highly skilled professionals in the data analysis space, this is where they’re spending their time, and it really is a big bottleneck.”

Just because humans gather at a common location, in “data prep, data wrangling, data munging,” doesn’t mean they “are the bottleneck.”

The question to ask is: Why are people spending so much time at location X in data processing?

Answer: poor data quality, or rather the inability of machines to effectively process data from different origins. That’s the bottleneck.

A problem that management of subject identities for data and its containers is uniquely poised to solve.
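A toy Python example of what I mean: two sources that name the same subject differently, merged through an explicit, reusable identity mapping rather than another round of hand wrangling. The field names and normalization rule are invented for illustration.

```python
# Toy illustration: two data sources name the same subject differently;
# an explicit identity map (a plain dict standing in for richer subject
# identity machinery) lets a machine merge them without a human
# re-wrangling the files each time.
source_a = [{"cust_id": "C-1001", "spend": 250}]
source_b = [{"customer_number": "1001", "region": "EMEA"}]

# The reusable part: which key in which source identifies the subject,
# and how to normalize it to a shared form.
identity_map = {
    ("source_a", "cust_id"): lambda v: v.removeprefix("C-"),
    ("source_b", "customer_number"): lambda v: v,
}

merged = {}
for origin, key, rows in [("source_a", "cust_id", source_a),
                          ("source_b", "customer_number", source_b)]:
    normalize = identity_map[(origin, key)]
    for row in rows:
        subject = normalize(row[key])
        merged.setdefault(subject, {}).update(row)

print(merged)
# {'1001': {'cust_id': 'C-1001', 'spend': 250,
#           'customer_number': '1001', 'region': 'EMEA'}}
```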

Kepler Data Tutorial : What can you do?

Filed under: Astroinformatics,Data,Data Analysis — Patrick Durusau @ 4:55 pm

Kepler Data Tutorial : What can you do?

The Kepler mission was designed to hunt for planets orbiting foreign stars. When a planet passes between the Kepler satellite and its home star, the brightness of the light from the star dips.

That isn’t the only reason for changes in brightness but officially, Kepler has to ignore those other reasons. Unofficially, Kepler has encouraged professional and amateur astronomers to search the Kepler data for other reasons for light curves.

As I mentioned last year, Kepler Telescope Data Release: The Power of Sharing Data, a group of amateurs discovered the first system with four (4) suns and at least one (1) planet.

The Kepler Data Tutorial introduces you to analysis of this data set.
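To see the basic idea behind the analysis, here is a toy Python sketch (nothing like the real Kepler pipeline): simulate a light curve, dim it during two transits, and flag the samples that dip below a threshold.

```python
# Toy transit detection on a simulated light curve -- not the real
# Kepler pipeline, just the core idea: the flux dips while a planet
# crosses the star, and samples below a threshold flag candidates.
import random

random.seed(0)
flux = [1.0 + random.gauss(0, 0.002) for _ in range(500)]
for start in (120, 320):            # two simulated transits
    for i in range(start, start + 10):
        flux[i] -= 0.01             # roughly 1% dimming

baseline = sorted(flux)[len(flux) // 2]   # median flux
threshold = baseline - 0.005
in_transit = [i for i, f in enumerate(flux) if f < threshold]
print("samples flagged as in-transit:", in_transit[:5], "...", in_transit[-5:])
```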

Hadoop++ and HAIL [and LIAH]

Filed under: Hadoop,HAIL,MapReduce — Patrick Durusau @ 3:33 pm

Hadoop++ and HAIL

From the webpage:

Hadoop++

Hadoop++: Nowadays, working over very large data sets (Petabytes of information) is a common reality for several enterprises. In this context, query processing is a big challenge and becomes crucial. The Apache Hadoop project has been adopted by many famous companies to query their Petabytes of information. Some examples of such enterprises are Yahoo! and Facebook. Recently, some researchers from the database community indicated that Hadoop may suffer from performance issues when running analytical queries. We believe this is not an inherent problem of the MapReduce paradigm but rather some implementation choices done in Hadoop. Therefore, the overall goal of Hadoop++ project is to improve Hadoop’s performance for analytical queries. Already, our preliminary results show an improvement of Hadoop++ over Hadoop by up to a factor 20. In addition, we are currently investigating the impact of a number of other optimizations techniques.

HAIL

[image: the HAIL elephant logo]

HAIL (Hadoop Aggressive Indexing Library) is an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop and even outperforms Hadoop++.

Isn’t that a cool aggressive elephant?

But before you get too excited, consider:

Towards Zero-Overhead Adaptive Indexing in Hadoop by Stefan Richter, Jorge-Arnulfo Quiané-Ruiz, Stefan Schuh, Jens Dittrich.

Abstract:

Several research works have focused on supporting index access in MapReduce systems. These works have allowed users to significantly speed up selective MapReduce jobs by orders of magnitude. However, all these proposals require users to create indexes upfront, which might be a difficult task in certain applications (such as in scientific and social applications) where workloads are evolving or hard to predict. To overcome this problem, we propose LIAH (Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for indexing at minimal costs for MapReduce systems. The main idea of LIAH is to automatically and incrementally adapt to users’ workloads by creating clustered indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides distributing indexing efforts over multiple computing nodes, LIAH also parallelises indexing with both map tasks computation and disk I/O. All this without any additional data copy in main memory and with minimal synchronisation. The beauty of LIAH is that it piggybacks index creation on map tasks, which read relevant data from disk to main memory anyways. Hence, LIAH does not introduce any additional read I/O-costs and exploit free CPU cycles. As a result and in contrast to existing adaptive indexing works, LIAH has a very low (or invisible) indexing overhead, usually for the very first job. Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks are indexed. Especially, LIAH can trade early job runtime improvements with fast complete index convergence. We compare LIAH with HAIL, a state-of-the-art indexing technique, as well as with standard Hadoop with respect to indexing overhead and workload performance. In terms of indexing overhead, LIAH can completely index a dataset as a byproduct of only four MapReduce jobs while incurring a low overhead of 11% over HAIL for the very first MapReduce job only. In terms of workload performance, our results show that LIAH outperforms Hadoop by up to a factor of 52 and HAIL by up to a factor of 24.

The Information Systems Group at Saarland University, led by Prof. Dr. Jens Dittrich, is a place to watch.

An Overview of Scalding

Filed under: Hadoop,Scalding — Patrick Durusau @ 3:06 pm

An Overview of Scalding by Dean Wampler.

From the description:

Dean Wampler, Ph.D., is Principal Consultant at Think Big Analytics. In this video he will cover Scalding, whose benefits over the Java API include a dramatic reduction in the source code required (reflecting several Scala improvements over Java), full access to “functional programming” constructs that are ideal for data problems, and a Matrix library addition to support machine learning and other algorithms. He also demonstrates the benefits of Scalding using examples and explains just enough Scala syntax so you can follow along. Dean’s philosophy is that there is no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need. This presentation was given on February 12th at the Nokia offices in Chicago, IL.

Slides: slideshare.net/ChicagoHUG/scalding-for-hadoop

During this period of rapid innovation around “big data,” what interests me is the development of tools to fit problems.

As opposed to fitting problems to fixed data models and tools.

Both require a great deal of skill, but they are different skill sets.

Yes?

I first saw this at Alex Popescu’s myNoSQL.

March 1, 2013

Free Sequester Data Here!

Filed under: Government,Government Data — Patrick Durusau @ 5:33 pm

You may recall that I did a series of posts on the sequestration report from the Office of Management and Budget (OMB).

Machine readable data files and analysis:

OMB-Sequestration-Data-Appendix-A.zip

OMB-Sequestration-Data-Appendix-B.zip

Appendix-A-Exempt-Intragovernmental-With-Appendix-B-Pages

All of the data is keyed to pages of: OMB Report Pursuant to the Sequestration Transparency Act of 2012 (P. L. 112–155)

So you can satisfy yourself that the files are an accurate representation of the OMB report.

As far as I know, these are the only electronic versions of the report accessible by the public.

One assumes the OMB has these files in electronic format but they have not been posted.

BTW, my posts in order of appearance:

U.S. Sequestration Report – Out of the Shadows/Into the Light? First post, general questions about consequences of cuts. (September 17, 2012)

Topic Map Modeling of Sequestration Data (Help Pls!) Second post, question of how to model the “special” rules as written. (September 29, 2012)

Modeling Question: What Happens When Dots Don’t Connect? Third post, modeling associations question (October 13, 2012)

Fiscal Cliff + OMB or Fool Me Once/Twice Fourth post, Appendix A appears, with comments. (December 7, 2012)

Over The Fiscal Cliff – Blindfolded Fifth post, Appendix B appears. (December 29, 2012)

The 560+ $Billion Shell Game Sixth post, detailed analysis of appendixes A and B. (January 1, 2013)

No Joy in Vindication Seventh post, Confirmation by the GAO that the problem I describe in the 560+ $Billion Shell Game exists in the DoD. (January 21, 2013)


Update:

Refried Numbers from the OMB

In its current attempt at sequester obfuscation, the OMB combined the approaches used in Appendices A and B of its earlier report and reduced the percentage of sequestration. See: OMB REPORT TO THE CONGRESS ON THE JOINT COMMITTEE SEQUESTRATION FOR FISCAL YEAR 2013.

Pig Eye for the SQL Guy

Filed under: Hadoop,MapReduce,Pig,SQL — Patrick Durusau @ 5:33 pm

Pig Eye for the SQL Guy by Cat Miller.

From the post:

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Do you speak SQL?

Want to learn to speak Pig?

This is the right post for you!

Incremental association rule mining: a survey

Filed under: Association Rule Mining,Machine Learning — Patrick Durusau @ 5:33 pm

Incremental association rule mining: a survey by B. Nath, D. K. Bhattacharyya, A. Ghosh. (WIREs Data Mining Knowl Discov 2013. doi: 10.1002/widm.1086)

Abstract:

Association rule mining is a computationally expensive task. Despite the huge processing cost, it has gained tremendous popularity due to the usefulness of association rules. Several efficient algorithms can be found in the literature. This paper provides a comprehensive survey on the state-of-the-art algorithms for association rule mining, specially when the data sets used for rule mining are not static. Addition of new data to a data set may lead to additional rules or to the modification of existing rules. Finding the association rules from the whole data set may lead to significant waste of time if the process has started from the scratch. Several algorithms have been evolved to attend this important issue of the association rule mining problem. This paper analyzes some of them to tackle the incremental association rule mining problem.

Not suggesting that it is always a good idea to model association rules as “associations” in the topic map sense but it is an important area of data mining.

The paper provides:

  • a taxonomy on the existing frequent itemset generation techniques and an analysis of their pros and cons,
  • a comprehensive review on the existing static and incremental rule generation techniques and their pros and cons, and
  • identification of several important issues and research challenges.

Some thirteen (13) pages and sixty-six (66) citations to the literature make this a good starting point for research in this area.

If you need a more basic starting point, consider: Association rule learning (Wikipedia).
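As a baseline for what the incremental algorithms are trying to avoid redoing, here is a minimal static frequent-itemset pass in Python. It is brute-force counting (Apriori without the candidate pruning), and the transactions are invented for illustration.

```python
# Minimal static frequent-itemset pass (brute-force counting, without
# Apriori's candidate pruning). The surveyed incremental algorithms
# exist precisely so this computation is not redone from scratch every
# time new transactions arrive.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "bread"},
    {"milk", "eggs"},
]
min_support = 2

def frequent_itemsets(transactions, min_support, max_size=3):
    frequent = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
        level = {i: c for i, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
    return frequent

for itemset, count in frequent_itemsets(transactions, min_support).items():
    print(itemset, count)
```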

Bellman’s GAP…

Filed under: Bioinformatics,Programming — Patrick Durusau @ 5:32 pm

Bellman’s GAP—a language and compiler for dynamic programming in sequence analysis by Georg Sauthoff, Mathias Möhl, Stefan Janssen and Robert Giegerich. (Bioinformatics (2013) 29 (5): 551-560. doi: 10.1093/bioinformatics/btt022)

Abstract:

Motivation: Dynamic programming is ubiquitous in bioinformatics. Developing and implementing non-trivial dynamic programming algorithms is often error prone and tedious. Bellman’s GAP is a new programming system, designed to ease the development of bioinformatics tools based on the dynamic programming technique.

Results: In Bellman’s GAP, dynamic programming algorithms are described in a declarative style by tree grammars, evaluation algebras and products formed thereof. This bypasses the design of explicit dynamic programming recurrences and yields programs that are free of subscript errors, modular and easy to modify. The declarative modules are compiled into C++ code that is competitive to carefully hand-crafted implementations.

This article introduces the Bellman’s GAP system and its language, GAP-L. It then demonstrates the ease of development and the degree of re-use by creating variants of two common bioinformatics algorithms. Finally, it evaluates Bellman’s GAP as an implementation platform of ‘real-world’ bioinformatics tools.

Availability: Bellman’s GAP is available under GPL license from http://bibiserv.cebitec.uni-bielefeld.de/bellmansgap. This Web site includes a repository of re-usable modules for RNA folding based on thermodynamics.

Contact: robert@techfak.uni-bielefeld.de

Supplementary information: Supplementary data are available at Bioinformatics online

The paper is focused on bioinformatics, but dynamic programming is not limited to that field.

There is a very amusing story about how the field came to have the name “dynamic programming” in the Wikipedia article: Dynamic Programming.
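For contrast with the declarative style, here is the kind of hand-written dynamic programming recurrence the paper has in mind, a Needleman-Wunsch style global alignment score in Python, subscripts and all:

```python
# A classic sequence-analysis DP written out by hand: Needleman-Wunsch
# style global alignment score. This is the kind of explicit recurrence
# (with its easy-to-get-wrong subscripts) that Bellman's GAP lets you
# replace with a declarative grammar-plus-algebra description.
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

print(alignment_score("GATTACA", "GCATGCU"))
```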

MongoDB + Fractal Tree Indexes = High Compression

Filed under: Fractal Trees,Indexing,MongoDB,Requirements — Patrick Durusau @ 5:31 pm

MongoDB + Fractal Tree Indexes = High Compression by Tim Callaghan.

You may have heard that MapR Technologies broke the MinuteSort Record by sorting 15 billion 100-byte records in 60 seconds. They used 2,103 virtual instances in the Google Compute Engine, each with four virtual cores and one virtual disk, totaling 8,412 virtual cores and 2,103 virtual disks. See: Google Compute Engine, MapR Break MinuteSort Record.

So, the next time you have 8,412 virtual cores and 2,103 virtual disks, you know what is possible. 😉

But if you have less firepower than that, you will need to be clever:

One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has an open ticket from 2009 titled “Option to Store Data Compressed” with Fix Version/s planned but not scheduled. The ticket has a lot of comments, mostly from MongoDB users explaining their use-cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more in the ticket.

In prior blogs we’ve written about significant performance advantages when using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size creates another advantage as these algorithms tend to compress large blocks better than small ones.

Given the interest in compression for MongoDB and our capabilities to address this functionality, we decided to do a benchmark to measure the compression achieved by MongoDB + Fractal Tree Indexes using each available compression type. The benchmark loads 51 million documents into a collection and measures the size of all files in the file system (--dbpath).

More benchmarks to follow and you should remember that all benchmarks are just that, benchmarks.

Benchmarks do not represent experience with your data, under your operating load and network conditions, etc.

Investigate software based on the first, purchase software based on the second.
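If you want a quick, rough feel for why block compression of similar documents pays off (this is not the benchmark in the post, just an illustration with the standard library):

```python
# Not the benchmark in the post -- a quick, hedged illustration of why
# compressing large blocks of similar documents pays off: zlib and LZMA
# ratios on 10,000 repetitive JSON-ish records.
import json, lzma, zlib

docs = [{"user_id": i, "country": "US", "status": "active",
         "score": i % 100} for i in range(10_000)]
raw = "\n".join(json.dumps(d) for d in docs).encode()

for name, compress in (("zlib", zlib.compress), ("lzma", lzma.compress)):
    compressed = compress(raw)
    print(f"{name}: {len(raw)} -> {len(compressed)} bytes "
          f"({len(raw) / len(compressed):.1f}x)")
```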

WANdisco: Free Hadoop Training Webinars

Filed under: Hadoop,HBase,MapReduce — Patrick Durusau @ 5:31 pm

WANdisco: Free Hadoop Training Webinars

WANdisco has four Hadoop webinars to put on your calendar:

A Hadoop Overview

This webinar will include a review of major components including HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications. An overview of Hadoop’s ecosystem will also be provided. Other topics covered will include a review of public and private cloud deployment options, and common business use cases.

Register now Weds, March 13, 10:00 a.m. PT/1:00 p.m. ET

A Hadoop Deep Dive

This webinar will cover Hadoop misconceptions (not all clusters are thousands of machines), information about real world Hadoop deployments, a detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.), an in-depth look at HDFS, and an explanation of MapReduce in relation to latency and dependence on other Hadoop activities.

This webinar will introduce attendees to concepts they will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level.

Register now Weds, March 27, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: A MapReduce Tutorial

This webinar will cover MapReduce at a deep technical level.

This session will cover the history of MapReduce, how a MapReduce job works, its logical flow, the rules and types of MapReduce jobs, de-bugging and testing MapReduce jobs, writing foolproof MapReduce jobs, various workflow tools that are available, and more.

Register now Weds, April 10, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: HBase In-Depth

This webinar will provide a deep technical review of HBase, and cover flexibility, scalability, components (cells, rows, columns, qualifiers), schema samples, hardware requirements and more.

Register now Weds, April 24, 10:00 a.m. PT/1:00 p.m. ET

I first saw this at: WANdisco Announces Free Hadoop Training Webinars.

A post with no link to WANdisco or to registration for any of the webinars.

If you would prefer that I put in fewer hyperlinks to resources, please let me know.

Shedding Light on the Dark Data in the Long Tail of Science

Filed under: Curation,Librarian/Expert Searchers,Library — Patrick Durusau @ 5:30 pm

Shedding Light on the Dark Data in the Long Tail of Science by P. Bryan Heidorn. (P. Bryan Heidorn. “Shedding Light on the Dark Data in the Long Tail of Science.” Library Trends 57.2 (2008): 280-299. Project MUSE. Web. 28 Feb. 2013.)

Abstract:

One of the primary outputs of the scientific enterprise is data, but many institutions such as libraries that are charged with preserving and disseminating scholarly output have largely ignored this form of documentation of scholarly activity. This paper focuses on a particularly troublesome class of data, termed dark data. “Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost. The article discusses how the concepts from long-tail economics can be used to understand potential solutions for better curation of this data. The paper describes why this data is critical to scientific progress, some of the properties of this data, as well as some social and technical barriers to proper management of this class of data. Many potentially useful institutional, social, and technical solutions are under development and are introduced in the last sections of the paper, but these solutions are largely unproven and require additional research and development.

From the article:

In this paper we will use the term dark data to refer to any data that is not easily found by potential users. Dark data may be positive or negative research findings or from either “large” or “small” science. Like dark matter, this dark data on the basis of volume may be more important than that which can be easily seen. The challenge for science policy is to develop institutions and practices such as institutional repositories, which make this data useful for society.

Dark Data = Any data that is not easily found by potential users.

A number of causes are discussed, not the least of which is our old friend, the Tower of Babel.

A final barrier that cannot be overlooked is the Digital Tower of Babel that we have created with seemingly countless proprietary as well as open data formats. This can include versions of the same software products that are incompatible. Some of these formats are very efficient for the individual applications for which they were designed including word processing, databases, spreadsheets, and others, but they are ineffective to support interoperability and preservation.

As you know already, I don’t think the answer to data curation, long term, lies in uniform formats.

Uniform formats are very useful but are domain, project and time bound.

The questions always are:

“What do we do when we change data formats?”

“Do we dump data in old formats that we spent $$$ developing?”

“Do we migrate data in old formats, assuming anyone remembers the old format?”

“Do we document and map across old and new formats, preparing for the next ‘new’ format?”

None of the answers are automatic or free.

But it is better to make an informed choice than to default to letting potentially valuable data rot.

Looking out for the little guy: Small data curation

Filed under: Curation,Librarian/Expert Searchers,Library — Patrick Durusau @ 5:30 pm

Looking out for the little guy: Small data curation by Katherine Goold Akers. (Akers, K. G. (2013), Looking out for the little guy: Small data curation. Bul. Am. Soc. Info. Sci. Tech., 39: 58–59. doi: 10.1002/bult.2013.1720390317)

Abstract:

While big data and its management are in the spotlight, a vast number of important research projects generate relatively small amounts of data that are nonetheless valuable yet rarely preserved. Such studies are often focused precursors to follow-up work and generate less noisy data than grand scale projects. Yet smaller quantity does not equate to simpler management. Data from smaller studies may be captured in a variety of file formats with no standard approach to documentation, metadata or preparation for archiving or reuse, making its curation even more challenging than for big data. As the information managers most likely to encounter small datasets, academic librarians should cooperate to develop workable strategies to document, organize, preserve and disseminate local small datasets so that valuable scholarly information can be discovered and shared.

A reminder that for every “big data” project in need of curation, there are many more smaller, less well known projects that need the same services.

Since topic maps don’t require global or even regional agreement on ontology or methodological issues, it should be easier for academic librarians to create topic maps to curate small datasets.

When it is necessary or desired to merge small datasets that were curated with different topic map assumptions, new topics can be created that merge the data that existed in separate topic maps.

But only when necessary and at the point of merging.

To say it another way, topic maps need not anticipate or fear the future. Tomorrow will take care of itself.

Unlike “now I am awake” approaches, which must fear that the next moment of consciousness will bring change.

Methods of Proof — Contradiction

Filed under: Mathematical Reasoning,Mathematics — Patrick Durusau @ 5:30 pm

Methods of Proof — Contradiction by Jeremy Kun.

From the post:

In this post we’ll expand our toolbox of proof techniques by adding the proof by contradiction. We’ll also expand on our knowledge of functions on sets, and tackle our first nontrivial theorem: that there is more than one kind of infinity.

Impossibility and an Example Proof by Contradiction

Many of the most impressive results in all of mathematics are proofs of impossibility. We see these in lots of different fields. In number theory, plenty of numbers cannot be expressed as fractions. In geometry, certain geometric constructions are impossible with a straight-edge and compass. In computing theory, certain programs cannot be written. And in logic even certain mathematical statements can’t be proven or disproven.

In some sense proofs of impossibility are hardest proofs, because it’s unclear to the layman how anyone could prove it’s not possible to do something. Perhaps this is part of human nature, that nothing is too impossible to escape the realm of possibility. But perhaps it’s more surprising that the main line of attack to prove something is impossible is to assume it’s possible, and see what follows as a result. This is precisely the method of proof by contradiction:

Assume the claim you want to prove is false, and deduce that something obviously impossible must happen.

There is a simple and very elegant example that I use to explain this concept to high school students in my guest lectures.

I hope you are following this series of posts but if not, at least read the example Jeremy has for proof by contradiction.

It’s a real treat.
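If you want the textbook illustration in compact form (Jeremy’s classroom example may differ), here is the irrationality of the square root of 2 as a proof by contradiction:

```latex
% The standard example (assumes amsthm for the proof environment):
% claim -- $\sqrt{2}$ is irrational.
\begin{proof}
Suppose, for contradiction, that $\sqrt{2} = p/q$ for integers $p, q$
with no common factor. Squaring gives $p^2 = 2q^2$, so $p^2$ is even and
hence $p$ is even; write $p = 2k$. Then $4k^2 = 2q^2$, so $q^2 = 2k^2$
and $q$ is even too. Now $p$ and $q$ share the factor $2$, contradicting
the choice of $p/q$ in lowest terms. Hence $\sqrt{2}$ is irrational.
\end{proof}
```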

