Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 21, 2012

Counterpoint: Why We Should Not Use the Cloud

Filed under: Cloud Computing — Patrick Durusau @ 4:36 pm

Counterpoint: Why We Should Not Use the Cloud by Andrea Di Maio.

Andrea writes:

The IT world has embraced the concept of cloud computing. Vendors, users, consultants, analysts, we all try to figure out how to leverage the increasing commoditization of IT from both an enterprise and a personal perspective.

Discussions on COTS have turned into discussions on SaaS. People running their own data center claim they run (or are developing) a private cloud. Shared service providers rebrand their services as community cloud. IT professionals in user enterprises dream of moving up the value chain by leaving the boring I&O stuff to vendors and developing more vertical business analysis and demand management skills. What used to be called outsourcing is now named cloud sourcing, while selective sourcing morphs into hybrid clouds or cloud brokerage. Also personally, we look at our USB sticks or disk drives with disdain, waiting for endless, ultracheap personal clouds to host all of our emails, pictures and music.

It looks like none of us is truly reflecting on whether this is good or bad. Of course, many are moving cautiously; they understand they are not ready for prime time for all sorts of security, confidentiality and maturity reasons. However it always looks like they have to justify themselves. “Cloud first”, some say, and you’ll have to tell us why you are not planning to go cloud. So those who want to hold on to their own infrastructure (without painting it as a “private cloud”) or want to keep using traditional delivery models from their vendors (such as hosting or colocation) almost feel like children of a lesser God when compared to all those bright and lucky IT executives who can venture into the cloud (and – when moving early enough – still get an interview in a newspaper or a magazine).

Let me be clear. I am intimately convinced that the move to cloud computing is inevitable and necessary, even if it may happen more slowly than many believe or hope for. However I would like to voice some concerns that may give good reasons not to move. There are probably many others, but it is important to ask ourselves – both as users and providers – tougher questions to make sure we have convincing answers as we approach or dive into the cloud.

That’s like saying your firm doesn’t have “big data.” 😉

The biggest caution is one that Andrea misses.

That is the assumption that moving to the “cloud” is going to save on IT expenses.

It is a commonly repeated mantra in Washington by McNamara types. If you don’t remember the “cost saving” reforms in the military in the early 1960s, now would be a good time to brush up on your history. An elaborate scheme was created to determine equipment requirements based on usage.

So if you were in a warm climate most of the year, you did not need snowplows, for example. Except that if you are an airfield and it does snow, oops, you need a snowplow that day and little else will work.

At a sufficient distance, the plans seemed reasonable, particularly to people who did not understand the subject under discussion. Like the cost-saving consolidations in IT now under way in Washington.

SPAMS (SPArse Modeling Software)

Filed under: SPAMS — Patrick Durusau @ 4:36 pm

SPAMS (SPArse Modeling Software)

From the webpage:

SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems.

  • Dictionary learning and matrix factorization (NMF, sparse PCA, …)
  • Solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods
  • Solving structured sparse decomposition problems (l1/l2, l1/linf, sparse group lasso, tree-structured regularization, structured sparsity with overlapping groups,…).

SPAMS Documentation

I first saw this at the Nuit Blanche post SPAMS (SPArse Modeling Software) now with Python and R.
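The problems in that feature list share one shape: learn a dictionary D, then decompose signals sparsely against it. As a hedged, minimal sketch of that workflow, using scikit-learn as a stand-in rather than SPAMS’s own Python bindings and purely synthetic data (roughly what SPAMS’s trainDL and lasso routines do):

    import numpy as np
    from sklearn.decomposition import DictionaryLearning, sparse_encode

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 64))   # 200 synthetic signals with 64 features each

    # Dictionary learning / matrix factorization step
    dl = DictionaryLearning(n_components=128, alpha=1.0, max_iter=50, random_state=0)
    D = dl.fit(X).components_            # learned dictionary, shape (128, 64)

    # Sparse decomposition step, here with a LARS-based lasso solver
    codes = sparse_encode(X[:5], D, algorithm="lasso_lars", alpha=1.0)
    print(codes.shape, np.count_nonzero(codes, axis=1))   # few nonzero coefficients per signal

SPAMS covers the same ground with more solver choices (LARS, coordinate descent, OMP, proximal methods) and the structured-sparsity penalties listed above.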

Building Highly Available Systems in Erlang

Filed under: Erlang,Topic Map Software — Patrick Durusau @ 4:35 pm

Building Highly Available Systems in Erlang

From the description:

Summary

Joe Armstrong discusses highly available (HA) systems, introducing different types of HA systems and data, HA architecture and algorithms, 6 rules of HA, and how HA is done with Erlang.

Bio

Joe Armstrong is the principal inventor of Erlang and coined the term “Concurrency Oriented Programming”. At Ericsson he developed Erlang and was chief architect of the Erlang/OTP system. In 1998 he formed Bluetail, which developed all its products in Erlang. In 2003 he obtained his PhD from the Royal Institute of Technology, Stockholm. He is the author of the book “Software for a Concurrent World”.

Joe gives the six rules for highly available systems and shows how Erlang meets them. (A minimal Python sketch of two of the rules follows the list.)

  • Isolation: operations must be isolated
  • Concurrency: the world is concurrent
  • Failure detection: if you can’t detect a failure, you can’t fix it
  • Fault identification: failures must be reported in enough detail to do something about them
  • Live code upgrade: software must be upgradable while it is running
  • Stable storage: data must survive a universal power failure
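As promised above, here is a minimal, hedged Python sketch of the first and third rules only (isolation plus failure detection): run work in a separate OS process so a crash can be detected from its exit code and the job restarted, instead of taking the whole program down. Erlang supervisors do the same thing far more cheaply with lightweight processes and links; the names and the injected fault below are illustrative.

    import multiprocessing as mp
    import time

    def worker(n):
        # Isolated unit of work; a crash here cannot corrupt the supervisor.
        if n % 3 == 0:                      # simulated, injected fault
            raise RuntimeError("worker crashed")
        print(f"job {n} done")

    def supervise(max_restarts=5):
        restarts, n = 0, 0
        while restarts <= max_restarts:
            p = mp.Process(target=worker, args=(n,))
            p.start()
            p.join()
            if p.exitcode != 0:             # failure detection via exit code
                restarts += 1
                print(f"job {n} failed (exit {p.exitcode}); restarting")
            n += 1
            time.sleep(0.1)

    if __name__ == "__main__":
        supervise()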

He quotes, for example, Jim Gray’s “Why Do Computers Stop and What Can Be Done About It?”, Technical Report 85.7, Tandem Computers, 1985.

Highly entertaining and informative.

What do you think of the notion of an evolving software system?

How would you apply that to a topic map system?

On multi-form data

Filed under: MongoDB,NoSQL — Patrick Durusau @ 4:35 pm

On multi-form data

From the post:

I read an excellent debrief on a startup’s experience with MongoDB, called “A Year with MongoDB”.

It was excellent due to its level of detail. Some of its points are important — particularly global write lock and uncompressed field names, both issues that needlessly afflict large MongoDB clusters and will likely be fixed eventually.

However, it’s also pretty clear from this post that they were not using MongoDB in the best way.

An interesting take on when and just as importantly, when not to use MongoDB.

As NoSQL offerings mature, are we going to see more of this sort of treatment, or will more treatments like this drive the maturity of NoSQL offerings?

Pointers to “more like this”? (Not just on MongoDB but on other NoSQL offerings as well.)
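One of the quoted complaints, uncompressed field names, is easy to see for yourself: BSON repeats every field name in every document. A hedged illustration, assuming PyMongo’s bson package (bson.encode, PyMongo 3.9+) and made-up field names:

    import bson

    verbose = {"customer_full_name": "Ada Lovelace",
               "customer_email_address": "ada@example.com",
               "order_total_in_cents": 129900}
    terse = {"n": "Ada Lovelace", "e": "ada@example.com", "t": 129900}

    # Every document carries its own copy of the key strings.
    print(len(bson.encode(verbose)), "bytes vs", len(bson.encode(terse)), "bytes")

Multiply the difference by a few billion documents and you have the overhead the “A Year with MongoDB” post complains about.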

Saving the Old IR Literature: a new batch

Filed under: Information Retrieval — Patrick Durusau @ 4:35 pm

Saving the Old IR Literature: a new batch

Saw a retweet of a tweet from @djoerd on this new release.

Thanks ACM SIGIR! (Special Interest Group on Information Retrieval)

Just the titles should get you interested:

  • Natural Language in Information Retrieval – Donald E. Walker, Hans Karlgren, Martin Kay – Skriptor AB, Stockholm, 1977
  • Annual Report: Automatic Informative Abstracting and Extracting – L. L. Earl – Lockheed Missiles and Space Company, 1972
  • Free Text Retrieval Evaluation – Pauline Atherton, Kenneth H. Cook, Jeffrey Katzer – Syracuse University, 1972
  • Information Storage and Retrieval: Scientific Report No. ISR-7 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-8 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-9 – Gerard Salton – The National Science Foundation, 1965
  • Information Storage and Retrieval: Scientific Report No. ISR-14 – Gerard Salton – The National Science Foundation, 1968
  • Information Storage and Retrieval: Scientific Report No. ISR-16 – Gerard Salton – The National Science Foundation, 1969
  • Automatic Indexing: A State of the Art Review – Karen Sparck Jones – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5193, 1974
  • Final Report on International Research Forum in Information Science: The Theoretical Basis of Information Science – B.C. Vickery, S.E. Robertson, N.J. Belkin – British Library Research and Development Report No. 5262, 1975
  • Report on the Need for and Provision for an ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, C.J. Van Rijsbergen – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5266, 1975
  • Report on a Design Study for the ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5428, 1977
  • Research on Automatic Indexing 1974-1976, Volume 1: Text – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5464, 1977
  • Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection – H. Gilbert, K. Sparck Jones – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5481, 1979
  • Design Study for an Anomalous State of Knowledge Based Information Retrieval System – N.J. Belkin, R.N. Oddy – University of Aston, Computer Centre – British Library Research and Development Report No. 5547, 1979
  • Research on Relevance Weighting, 1976-1979 – K. Sparck Jones, C.A. Webster – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5553, 1980
  • New Models in Probabilistic Information Retrieval – C.J. van Rijsbergen, S.E. Robertson, M.F. Porter – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5587, 1980
  • Statistical problems in the application of probabilistic models to information retrieval – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University – British Library Research and Development Report No. 5739, 1982
  • A front-end for IR experiments – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University – British Library Research and Development Report No. 5807, 1983
  • An operational evaluation of weighting, ranking and relevance feedback via a front-end system – S.E. Robertson, C.L. Thompson – Centre for Information Science, City University – British Library Research and Development Report No. 5549, 1987
  • Okapi at City: An evaluation facility for interactive – Stephen Walker, Micheline Hancock-Beaulieu – Centre for Information Science, City University – British Library Research and Development Report No. 6056, 1991
  • Improving Subject Retrieval in Online Catalogues: Stemming, automatic spelling correction and cross-reference tables – Stephen Walker, Richard M Jones – The Polytechnic of Central London – British Library Research Paper No. 24, 1987
  • Designing an Online Public Access Catalogue: Okapi, a catalogue on a local area network – Nathalie Nadia Mitev, Gillian M Venner, Stephen Walker – The Polytechnic of Central London – Library and Information Research Report 39, 1985
  • Improving Subject Retrieval in Online Catalogues: Relevance feedback and query expansion – Stephen Walker, Rachel De Vere – The Polytechnic of Central London – British Library Research Paper No. 72, 1989
  • Evaluation of Online Catalogues: an assessment of methods – Micheline Hancock-Beaulieu, Stephen Robertson, Colin Neilson – Centre for Information Science, City University – British Library Research Paper No. 78, 1990

Neo4j 1.7 GA “Bastuträsk Bänk” released

Filed under: Neo4j,NoSQL — Patrick Durusau @ 4:35 pm

Neo4j 1.7 GA “Bastuträsk Bänk” released

From the announcement:

We’re very pleased to announce that Neo4j 1.7 GA, codenamed “Bastuträsk Bänk”, is now generally available. The many improvements ushered in through milestones have been properly QA’d and documented, making 1.7 the preferred stable version for all production deployments. Let’s review the highlights.

The release includes a number of features but I was surprised by:

With 1.7, Cypher now has a full range of common math functions for use in the RETURN and WHERE clause.

Because the “full range of common math functions” turned out to be ABS, ROUND, SQRT, and SIGN. That doesn’t look like a “full range of common math functions” to me. How about you?

Math operators are documented at: Operators
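For the curious, the four functions can be exercised from Python against Neo4j’s Cypher HTTP endpoint. Treat this as a hedged sketch: the endpoint path below is the one documented for the 1.x REST API (check your release; very old Cypher versions may also insist on a START clause), and nothing beyond abs(), round(), sqrt() and sign() is assumed.

    import requests

    query = "RETURN abs(-3.7), round(3.4), sqrt(16), sign(-5)"
    resp = requests.post(
        "http://localhost:7474/db/data/cypher",   # legacy Cypher endpoint; path varies by release
        json={"query": query, "params": {}},
    )
    print(resp.json())   # expect 3.7, 3.0 (or 3), 4.0, -1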

Complex Networks -> Local Networks

Filed under: Complex Networks,Graphs,Networks — Patrick Durusau @ 4:34 pm

The post The Game of Go: A Complex Network reminds us that complex networks, with care, can be decomposed into local networks.

From the post:

Using a database containing around 5,000 games played by professional and amateur go players in international tournaments, Bertrand Georgeot from the Theoretical Physics Laboratory (Université Toulouse III-Paul Sabatier/CNRS) and Olivier Giraud from the Laboratory of Theoretical Physics and Statistical Models (Université Paris-Sud/CNRS) applied network theory to this game of strategy. They constructed a network whose nodes are local patterns on the board, while the edges (links) reflect the sequence of moves. This enabled them to recapture part of the local game strategy. In this game, where players place their stones at the intersections of a grid consisting of 19 vertical and 19 horizontal lines (making 361 intersections), the researchers studied local patterns of 9 intersections. They showed that the statistical frequency distribution of these patterns follows Zipf’s law, similar to the frequency distribution of words in a language.

Although the go network’s features resemble those of other real networks (social networks or the Internet), it has its own specific properties. While the most recent simulation programs already include statistical data from real games, albeit at a still basic level, these new findings should allow better modeling of this kind of board game.

The researchers did not even attempt to solve the entire board but rather looked for “local” patterns on the board.

What “local patterns” are you missing in “big data?”

Article reference: The game of go as a complex network. B. Georgeot and O. Giraud 2012 EPL 97 68002.

Abstract:

We study the game of go from a complex network perspective. We construct a directed network using a suitable definition of tactical moves including local patterns, and study this network for different datasets of professional and amateur games. The move distribution follows Zipf’s law and the network is scale free, with statistical peculiarities different from other real directed networks, such as, e.g., the World Wide Web. These specificities reflect in the outcome of ranking algorithms applied to it. The fine study of the eigenvalues and eigenvectors of matrices used by the ranking algorithms singles out certain strategic situations. Our results should pave the way to a better modelization of board games and other types of human strategic scheming.
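A hedged, toy reconstruction of the method described above: treat each local pattern as a node, add a directed edge for each consecutive pair of patterns in a game, and then look at the rank-frequency curve. The pattern sequences below are random placeholders standing in for real game records, so they will not show Zipf behaviour; the point is only the shape of the computation (assumes networkx).

    import collections
    import random
    import networkx as nx

    random.seed(0)
    patterns = [f"p{i}" for i in range(500)]        # stand-ins for 9-intersection patterns
    games = [[random.choice(patterns) for _ in range(200)] for _ in range(100)]

    G = nx.DiGraph()
    freq = collections.Counter()
    for game in games:
        for a, b in zip(game, game[1:]):            # consecutive moves -> directed edge
            freq[a] += 1
            G.add_edge(a, b)

    ranked = [count for _, count in freq.most_common()]
    print(G.number_of_nodes(), G.number_of_edges())
    print(ranked[:10])   # on real game data this distribution is reported to follow Zipf's law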

Twister

Filed under: MapReduce,Twister — Patrick Durusau @ 4:34 pm

Twister

From the webpage:

The MapReduce programming model has simplified the implementations of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among parallel computing communities. From years of experience in applying the MapReduce programming model to various scientific applications we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. Twister is a lightweight MapReduce runtime we have developed by incorporating these enhancements.

Twister provides the following features to support MapReduce computations. (Twister is developed as part of Jaliya Ekanayake’s Ph.D. research and is supported by the SALSA Team @ IU)

Useful links:

Download

Samples

UserGuide

That’s the right order, yes? Download; if you can’t get it running, look at the samples; and if it still isn’t running, look at the documentation? 😉

Noticed at: Alex Popescu’s myNoSQL – Twister
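Since the description above leans entirely on the MapReduce programming model, here is a minimal in-memory sketch of that model (map, shuffle, reduce) in plain Python. It says nothing about Twister’s own API; it only shows the shape of computation that Twister and similar runtimes distribute.

    from collections import defaultdict

    def map_phase(doc_id, text):
        for word in text.split():
            yield word.lower(), 1

    def shuffle(mapped):
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return key, sum(values)

    docs = {1: "big data big compute", 2: "data parallel compute"}
    mapped = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
    print(counts)   # {'big': 2, 'data': 2, 'compute': 2, 'parallel': 1}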

Distributed Temporal Graph Database Using Datomic

Filed under: Datomic,Distributed Systems,Graphs,Temporal Graph Database — Patrick Durusau @ 4:34 pm

Distributed Temporal Graph Database Using Datomic

Post by Alex Popescu calling out construction of a “distributed temporal graph database.”

Temporal is used here in the sense of timestamping entries in the database.

Beyond such uses, beware, there be dragons.

Temporal modeling isn’t for the faint of heart.

Neo4j.rb – Update

Filed under: Neo4j,Neo4j.rb,Ruby — Patrick Durusau @ 2:32 pm

Neo4j.rb – Update

From the webpage:

Neo4j.rb is a graph database for JRuby.

You can think of Neo4j as a high-performance graph engine with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables — yet enjoys all the benefits of a fully transactional, enterprise-strength database.

It uses two powerful and mature Java libraries:

  • Neo4j – for persistence and traversal of the graph
  • Lucene – for querying and indexing.

New features include:

  • Rules, and
  • Cypher DSL queries.

April 20, 2012

Past, Present and Future – The Quest to be Understood

Filed under: Identification,Identifiers,Identity — Patrick Durusau @ 6:27 pm

Without restricting it to being machine readable, I think we would all agree there are three ages of data:

  1. Past data
  2. Present data
  3. Future data

And we have common goals for data (or parts of it):

  1. Past data – To understand past data.
  2. Present data – To be understood by others.
  3. Future data – For our present data to persist and be understood by the users of that time.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think that in an obviously multi-lingual world, support for multiple identifiers would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.
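A minimal sketch of what “managing” identifiers can mean in practice: a subject accumulates every identifier it has ever carried, and lookup works through any of them. The subject keys and identifiers below are purely illustrative.

    from collections import defaultdict

    subjects = defaultdict(set)     # subject key -> all identifiers seen for it
    index = {}                      # identifier -> subject key

    def record(subject_key, *identifiers):
        for ident in identifiers:
            subjects[subject_key].add(ident)
            index[ident] = subject_key

    record("paris-france", "Paris", "FR-75", "dbpedia:Paris")
    record("paris-france", "Lutetia")      # a past identifier joins; nothing is discarded

    print(index["Lutetia"])                # -> paris-france
    print(sorted(subjects["paris-france"]))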

What do you think?

Approximate Bregman near neighbors

Filed under: Approximate Nearest Neighbors (ANN),Bregman Divergences — Patrick Durusau @ 6:26 pm

Approximate Bregman near neighbors

From the post:

(tl;dr: Our upcoming paper in SoCG 2012 shows that with a nontrivial amount of work, you can do approximate Bregman near neighbor queries in low dimensional spaces in logarithmic query time)

Or you may prefer the full paper:

Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality

Abstract:

In this paper we present the first provable approximate nearest-neighbor (ANN) algorithms for Bregman divergences. Our first algorithm processes queries in O(log^d n) time using O(n log^d n) space and only uses general properties of the underlying distance function (which includes Bregman divergences as a special case). The second algorithm processes queries in O(log n) time using O(n) space and exploits structural constants associated specifically with Bregman divergences. An interesting feature of our algorithms is that they extend the ring-tree + quad-tree paradigm for ANN searching beyond Euclidean distances and metrics of bounded doubling dimension to distances that might not even be symmetric or satisfy a triangle inequality.

Tough sledding but interesting work on Bregman divergences. Leads to proposed improvements in data search structures.

The following reference may be helpful: Bregman divergence
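For orientation, the divergence family the paper works with has one generic definition, D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. The hedged sketch below instantiates it twice: phi(x) = ||x||^2 recovers squared Euclidean distance, and negative entropy recovers KL divergence between distributions. The asymmetry in the last two lines is exactly the “beyond the triangle inequality” difficulty in the title.

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    def sq(v): return np.dot(v, v)
    def sq_grad(v): return 2 * v

    def neg_entropy(v): return np.sum(v * np.log(v))
    def neg_entropy_grad(v): return np.log(v) + 1

    x = np.array([0.2, 0.3, 0.5])
    y = np.array([0.4, 0.4, 0.2])

    print(bregman(sq, sq_grad, x, y))                    # equals ||x - y||^2
    print(bregman(neg_entropy, neg_entropy_grad, x, y))  # equals KL(x || y) for distributions
    print(bregman(neg_entropy, neg_entropy_grad, y, x))  # generally different: not symmetric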

Functional thinking: Functional design patterns, Part 1 (and Part 2)

Filed under: Church,Functional Programming — Patrick Durusau @ 6:26 pm

Functional thinking: Functional design patterns, Part 1: How patterns manifest in the functional world

Neal Ford writes:

Some contingents in the functional world claim that the concept of the design pattern is flawed and isn’t needed in functional programming. A case can be made for that view under a narrow definition of pattern — but that’s an argument more about semantics than use. The concept of a design pattern — a named, cataloged solution to a common problem — is alive and well. However, patterns sometimes take different guises under different paradigms. Because the building blocks and approaches to problems are different in the functional world, some of the traditional Gang of Four patterns (see Resources) disappear, while others preserve the problem but solve it radically differently. This installment and the next investigate some traditional design patterns and rethink them in a functional way.

In the functional-programming world, traditional design patterns generally manifest in one of three ways:

  • The pattern is absorbed by the language.
  • The pattern solution still exists in the functional paradigm, but the implementation details differ.
  • The solution is implemented using capabilities other languages or paradigms lack. (For example, many solutions that use metaprogramming are clean and elegant — and they’re not possible in Java.)

I’ll investigate these three cases in turn, starting in this installment with some familiar patterns, most of which are wholly or partially subsumed by modern languages.
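A hedged illustration of the first case, a pattern absorbed by the language: with first-class functions, the Gang of Four Strategy pattern shrinks to passing a function. (Python is used here only for brevity; the names are made up.)

    def total_price(items, pricing_strategy):
        # The "strategy" is just a function; no interface or class hierarchy needed.
        return sum(pricing_strategy(p) for p in items)

    def full_price(p): return p
    def holiday_sale(p): return p * 0.8
    def clearance(p): return min(p, 5.0)

    cart = [12.0, 30.0, 4.5]
    for strategy in (full_price, holiday_sale, clearance):
        print(strategy.__name__, total_price(cart, strategy))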

Neal continues this series in: Functional thinking: Functional design patterns, Part 2

In the lead-in Neal says:

… patterns sometimes take different guises under different paradigms.

What? The expression of patterns varies from programming language to programming language?

Does that mean if I could trace patterns from one language to another that I might have greater insight into the use of patterns across languages?

Not having to relearn a pattern or its characteristics?

Does that sound like a net win?

If it does, consider topic maps as one path to that net win.

Using Sigma.js with Neo4j

Filed under: Neo4j,Sigma.js — Patrick Durusau @ 6:25 pm

Using Sigma.js with Neo4j by Max De Marzi.

From the post:

I’ve done a few posts recently using D3.js and now I want to show you how to use two other great Javascript libraries to visualize your graphs. We’ll start with Sigma.js and soon I’ll do another post with Three.js.

We’re going to create our graph and group our nodes into five clusters. You’ll notice later on that we’re going to give our clustered nodes colors using rgb values so we’ll be able to see them move around until they find their right place in our layout. We’ll be using two Sigma.js plugins, the GEXF (Graph Exchange XML Format) parser and the ForceAtlas2 layout.

Do notice the coloration that Max uses in his examples.

Graphs don’t have a “correct” visualization.

They can have visualizations that lead to or represent insights into data.
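A hedged companion sketch: since Max’s example feeds Sigma.js through its GEXF parser plugin, networkx can produce a GEXF file with clustered, colored nodes for the same pipeline. The cluster count, colors and file name are arbitrary; check the Sigma.js plugin documentation for exactly which viz attributes it reads.

    import random
    import networkx as nx

    random.seed(1)
    palette = [(255, 99, 71), (60, 179, 113), (65, 105, 225),
               (238, 130, 238), (255, 165, 0)]

    G = nx.Graph()
    for i in range(50):
        cluster = i % 5                       # five clusters, as in Max's example
        r, g, b = palette[cluster]
        G.add_node(i, label=f"node {i}", cluster=cluster,
                   viz={"color": {"r": r, "g": g, "b": b}})

    for _ in range(120):
        u, v = random.sample(range(50), 2)
        G.add_edge(u, v)

    nx.write_gexf(G, "clusters.gexf")         # load this with the Sigma.js GEXF parser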

Statistical Data and Metadata eXchange (SDMX)

Filed under: RDF Data Cube Vocabulary,SDMX — Patrick Durusau @ 6:25 pm

Statistical Data and Metadata eXchange (SDMX)

SDMX is the core information model that informs the RDF Data Cube Vocabulary.

It isn’t clear from the working draft of 05 April 2011 which version of the SDMX materials informs the RDF Data Cube Vocabulary work.

You may also be interested in the SDMX pages on domains where statistical work is ongoing, implementations and tools.
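To make the SDMX-to-RDF connection concrete, here is a hedged rdflib sketch of the kind of structure the RDF Data Cube Vocabulary describes: an observation attached to a dataset, a time dimension and a measure. The qb and sdmx namespace URIs below are the commonly published ones; verify the terms against the working draft you are targeting, and treat the example dataset as fictional.

    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF

    QB = Namespace("http://purl.org/linked-data/cube#")
    SDMX_DIM = Namespace("http://purl.org/linked-data/sdmx/2009/dimension#")
    SDMX_MEA = Namespace("http://purl.org/linked-data/sdmx/2009/measure#")
    EX = Namespace("http://example.org/stats/")

    g = Graph()
    dataset = EX["unemployment"]
    obs = EX["unemployment-obs1"]

    g.add((dataset, RDF.type, QB.DataSet))
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, dataset))                  # observation belongs to the dataset
    g.add((obs, SDMX_DIM.refPeriod, Literal("2011")))  # time dimension
    g.add((obs, SDMX_MEA.obsValue, Literal(8.3)))      # the measured value

    print(g.serialize(format="turtle"))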

On SDMX in general:

SDMX 2.1 Technical Specification

Section 1 – Framework. Introduces the documents and the content of the revised Version 2.1

Section 2 – Information Model. UML model and functional description, definition of classes, associations and attributes

Section 3A – SDMX-ML. Specifies and documents the XML formats for describing structure, data, reference metadata, and interfaces to the registry

Section 3B – SDMX-ML. XML schemas, samples, WADL and WSDL (update: 12 May 2011)

Section 4 – SDMX-EDI. Specifies and documents the UN/EDIFACT format for describing structure and data.

Section 5 – Registry Specification – Logical Interfaces. Provides the specification for the logical registry interfaces, including subscription/notification, registration of data and metadata, submission of structural metadata, and querying

Section 6 – Technical Notes. Provides some technical information which may be useful for the implementation (this was called “Implementor’s Guide” in the 2.0 release)

Section 7 – Web Services Guidelines. Provides guidelines for using SDMX standards to promote interoperability among SDMX web services

ZIP file of all the documents: SDMX 2.1 ALL SECTIONS

And SDMX concepts:

The SDMX Content-Oriented Guidelines recommend practices for creating interoperable data and metadata sets using the SDMX technical standards. They are intended to be applicable to all statistical subject-matter domains. The Guidelines focus on harmonising specific concepts and terminology that are common to a large number of statistical domains. Such harmonisation is useful for achieving an even more efficient exchange of comparable data and metadata, and builds on the experience gained in implementations to date.

Content-Oriented Guidelines

The Guidelines are supplemented by five annexes:

Annex 1 – Cross-Domain Concepts
Annex 2 – Cross-Domain Code Lists
Annex 3 – Statistical Subject-Matter Domains
Annex 4 – Metadata Common Vocabulary
Annex 5 – SDMX-ML for Content-Oriented Guidelines (zip file)

Additional information is provided in the following files:

  1. Mapping of SDMX Cross-Domain Concepts to metadata frameworks at international organisations (IMF-Data Quality Assessment Framework, Eurostat-SDMX Metadata Structure and OECD-Metastore)
  2. Use of Cross-Domain Concepts in Data and Metadata Structure Definitions
  3. A disposition log of comments and suggestions directly received by the SDMX Secretariat.

Tiny ToCS Vol. 1

Filed under: Humor — Patrick Durusau @ 6:24 pm

Tiny ToCS Vol. 1

Although many topic map authors will have difficulty meeting the submission requirements:

Tiny TOCS is a highly selective venue where full papers typically report novel results and ideas. Submissions can be up to 140 characters in length, with an abstract of no more than 250 words and a title of no more than 118 characters.

The primary content of submissions should fit into 140 characters. The abstract is not intended to expound upon your finding but instead to provide context for your work. What is the background for your research? Concisely, how does this work improve on related research? Submissions which violate this requirement will be rejected without consideration.

You may use three references in your submission. References will count as one word when used in the abstract section and three letters in the body section (“[1]”).

I would be remiss in not calling this publication opportunity to your attention.

Final submission: July 1st, 2012, Midnight PST
Notification: August 1st
Publication: August 14th

What Makes Good Data Visualization?

Filed under: Data,Graphics,Visualization — Patrick Durusau @ 6:24 pm

What Makes Good Data Visualization?

Panel discussion at the New York Public Library.

Panelists:

Kaiser Fung, Blogger, junkcharts.typepad.com/numbersruleyourworld
Andrew Gelman, Director, Applied Statistics Center, Columbia University
Mark Hansen, Artist; Professor of Statistics, UCLA
Tahir Hemphill, Creative Director; Founder, Hip Hop Word Count Project
Manuel Lima, Founder, VisualComplexity.com; Senior UX Design Lead, Microsoft Bing

Infovis and Statistical Graphics: Different Goals, Different Looks by Andrew Gelman and Antony Unwin is said in the announcement to be relevant. (It’s forty-four pages so don’t try to read it while watching the video. It is, however, worth your time.)

Unfortunately, the sound quality is very uneven. Ranges from very good to almost inaudible.

@16:04 the show is finally about to begin.

The screen is nearly impossible to see. I have requested that the slides be posted.

The parts of the discussion that are audible are very good, which makes it even more disappointing that so much of it is missing.

Of particular interest (at least to me) were the early comments and illustrations (not visible in the video) of how current graphic efforts are re-creating prior efforts to illustrate data.

Standardizing Federal Transparency

Filed under: Government Data,Identity,Transparency — Patrick Durusau @ 6:24 pm

Standardizing Federal Transparency

From the post:

A new federal data transparency coalition is pushing for standardization of government documents and support for legislation on public records disclosures, taxpayer spending and business identification codes.

The Data Transparency Coalition announced its official launch Monday, vowing nonpartisan work with Congress and the Executive Branch on ventures toward digital publishing of government documents in standardized and integrated formats. As part of that effort, the coalition expressed its support of legislative proposals such as: the Digital Accountability and Transparency Act, which would require public spending records to be published in a single digital format; the Public Information Online Act, which pushes for all records to be released digitally in a machine-readable format; and the Legal Entity Identifier proposal, creating a standard ID code for companies.

The 14 founding members include vendors Microsoft, Teradata, MarkLogic, Rivet Software, Level One Technologies and Synteractive, as well as the Maryland Association of CPAs, financial advisory BrightScope, and data mining and pattern discovery consultancy Elder Research. The coalition board of advisors includes former U.S. Deputy CTO Beth Noveck, data and information services investment firm partner Eric Gillespie and former Recovery Accountability and Transparency Board Chairman Earl E. Devaney.

Data Transparency Coalition Executive Director Hudson Hollister, a former counsel for the House of Representatives and U.S. Securities and Exchange Commission, noted that when the federal government does electronically publish public documents it “often fails to adopt consistent machine-readable identifiers or uniform markup languages.”

Sounds like an opportunity for both the markup and semantic identity communities, topic maps in particular.

The reasoning: not only will there need to be mappings between vocabularies and entities, but also between “uniform markup languages” as they evolve and develop.

On Schemas and Lucene

Filed under: Lucene,Schema,Solr — Patrick Durusau @ 6:24 pm

On Schemas and Lucene

Chris Male writes:

One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.

To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the most optimal strategies for faceting and grouping different fields can be chosen based on field metadata retrieved from its schema.

Consequently as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going ask you a lot of questions and I hope you can clarify this issue for me.

What follows is a deeply thoughtful examination of the pros and cons of schemas for Lucene and/or their role in Solr.

If you are using Lucene, take the time to review Chris’s questions and voice your own questions or concerns.

The Lucene you improve will be your own.

If you are interested in either Lucene or Solr, now would be a good time to speak up.
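A library-free, hedged sketch of the point Chris is weighing: even a tiny schema lets a query builder validate field names and choose a numeric-range query over a term query, which is exactly what a schemaless parser cannot do on its own. None of the names below are Lucene or Solr APIs; they are made up for illustration.

    SCHEMA = {
        "title": {"type": "text"},
        "year": {"type": "int"},
        "author": {"type": "text"},
    }

    def build_query(field, value):
        if field not in SCHEMA:
            raise ValueError(f"unknown field: {field!r}")   # schema validation
        if SCHEMA[field]["type"] == "int":
            lo, hi = value                                  # expects a (low, high) pair
            return {"numeric_range": {"field": field, "from": lo, "to": hi}}
        return {"term": {"field": field, "value": str(value)}}

    print(build_query("year", (1964, 1969)))
    print(build_query("title", "information retrieval"))
    # build_query("publisher", "ACM") would raise: the schema catches the unknown field.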

With Perfect Timing, UK Audit Office Review Warns Open Government Enthusiasts

Filed under: Government Data,Open Data — Patrick Durusau @ 6:24 pm

With Perfect Timing, UK Audit Office Review Warns Open Government Enthusiasts

Andrea Di Maio writes:

Right in the middle of the Open Government Partnership conference, which I mentioned in my post yesterday, the UK National Audit Office (NAO) published its cross-government review on Implementing Transparency.

The report, while recognizing the importance and the potential for open data initiatives, highlights a few areas of concern that should be taken quite seriously by the OGP conference attendees, most of which are making open data more a self-fulfilling prophecy than an actual tool for government transparency and transformation.

The areas of concern highlighted in the review are insufficient attention to assessing costs, risks and benefits of transparency, the variation in completeness of information and the mixed progress. While the latter two can improve with greater maturity, it is the first that requires the most attention.

Better late than never.

I have yet to hear a discouraging word in the U.S. about the rush to openness by the Obama administration.

Not that I object to “openness,” but I would like to see meaningful “openness.”

Take campaign finance for example. Treating all contributions over fifty dollars ($50) the same is hiding influence buying in the chaff of reporting.

What matters is any contribution over, say, $100,000 to a candidate. That would make the real supporters (purchasers, really) of a particular office stand out.

The Obama White House uses the same hiding-in-the-chaff tactic when it claims to be disclosing White House visitors, who are mixed into the weekly visitor log. Girl Scout and Boy Scout troop visits don’t count the same as personal audiences with the President.

Government contract reporting should be limited to contracts over $500,000 and include individual owner and corporate names, plus the names of their usual government contract officers. The $500,000 threshold might need to be bumped up, but it could be tried for a year.

If we bring up the house lights, we have to search everyone. Why not a flashlight on the masher in the back row instead?

April 19, 2012

NSA Money Trap

Filed under: Humor,Marketing — Patrick Durusau @ 7:23 pm

I am posting this under humor, in part due to the excellent writing of James Bamford.

Here is a sample of what you will find at: The NSA Is Building the Country’s Biggest Spy Center (Watch What You Say):

Today Bluffdale is home to one of the nation’s largest sects of polygamists, the Apostolic United Brethren, with upwards of 9,000 members. The brethren’s complex includes a chapel, a school, a sports field, and an archive. Membership has doubled since 1978—and the number of plural marriages has tripled—so the sect has recently been looking for ways to purchase more land and expand throughout the town.

But new pioneers have quietly begun moving into the area, secretive outsiders who say little and keep to themselves. Like the pious polygamists, they are focused on deciphering cryptic messages that only they have the power to understand. Just off Beef Hollow Road, less than a mile from brethren headquarters, thousands of hard-hatted construction workers in sweat-soaked T-shirts are laying the groundwork for the newcomers’ own temple and archive, a massive complex so large that it necessitated expanding the town’s boundaries. Once built, it will be more than five times the size of the US Capitol.

Rather than Bibles, prophets, and worshippers, this temple will be filled with servers, computer intelligence experts, and armed guards. And instead of listening for words flowing down from heaven, these newcomers will be secretly capturing, storing, and analyzing vast quantities of words and images hurtling through the world’s telecommunications networks. In the little town of Bluffdale, Big Love and Big Brother have become uneasy neighbors.

There is enough doom and gloom to keep the movie industry busy through Terminator XXX – The Commodore 128 Conspiracy.

Why am I not worried?

  1. 70% of all IT projects fail – Odds are better than 50% this is one of them.
  2. Location – Build a computer center in one of the hottest locations in the 48 states. Is that a comment on the planning of this center?
  3. Technology – In the time from planning to completion, two or three generations of computing architecture and design have occurred. Care to bet on the mixture of systems to be found at this “secret” location?
  4. 70% of all IT projects fail – Odds are better than 50% this is one of them.
  5. NSA advances in cryptography. Sure, just like Oak Ridge was breached by an “advanced persistent threat”:

    Oak Ridge National Labs blamed the incident on an “advanced persistent threat,” (APT) a term commonly used by organizations to imply that the threat was so advanced that they would never have been able to protect themselves, Gunter Ollmann, vice-president of research at Damballa,

    Would you expect anyone to claim being the victim of a high school level hack? Do you really think the NSA is going to say it’s a little ahead, maybe?

  6. Consider the NSA’s track record against terrorism, revolution, etc. You would get more timely information reading the Washington Post. Oh, but their real contribution is a secret.
  7. When they have a contribution, like listening to cell phones of terrorists, they leak it. No leaks, no real contributions.
  8. 70% of all IT projects fail – Odds are better than 50% this is one of them.
  9. Apparently there is no capacity (unless it is secret) to check signals intelligence against human intelligence. That’s like watching I Love Lucy episodes for current weather information. It has to be right part of the time.
  10. 70% of all IT projects fail – Odds are better than 50% this is one of them.

What is missing from most IT projects is an actor with technical expertise but no direct interest in the project. Someone who has no motive for CYA on the part of the client or contractor.

Someone who can ask of the decision makers: “What specific benefit is derived from ability X?” Such as the capacity to mine “big data.” To what end?

The oft cited benefit of “making better decisions” is not empowered by “big data.”

If you are incapable of making good business decisions now, that will be true after you have “big data.” (Sorry.)

Building an AWS CloudSearch domain for the Supreme Court

Filed under: Amazon CloudSearch,Law - Sources,Legal Informatics — Patrick Durusau @ 7:20 pm

Building an AWS CloudSearch domain for the Supreme Court by Michael J Bommarito II.

Michael writes:

It should be pretty clear by now that two things I’m very interested in are cloud computing and legal informatics. What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions as the context? The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.

A sure-to-be-tweeted-and-read (at least among legal informatics types) introduction to AWS CloudSearch.

The source file only covers U.S. Supreme Court decisions announced through March 2008. I am looking for later sources of information, and for documentation on the tagging/metadata of the files.
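For orientation only, a hedged sketch of the first steps of such a build using boto3’s cloudsearch client rather than the console. The domain and field names are placeholders, and the 2012-era API Michael used differs in detail from the current one, so follow his tutorial for the real thing.

    import boto3

    cs = boto3.client("cloudsearch", region_name="us-east-1")

    cs.create_domain(DomainName="scotus-decisions")

    for name, ftype in [("case_name", "text"),
                        ("decision_text", "text"),
                        ("decision_year", "int")]:
        cs.define_index_field(
            DomainName="scotus-decisions",
            IndexField={"IndexFieldName": name, "IndexFieldType": ftype},
        )

    cs.index_documents(DomainName="scotus-decisions")   # rebuild the index with the new fields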

7 top tools for taming big data

Filed under: BigData,Jaspersoft,Karmasphere,Pentaho,Skytree,Splunk,Tableau,Talend — Patrick Durusau @ 7:20 pm

7 top tools for taming big data by Peter Wayner.

Peter covers:

  • Jaspersoft BI Suite
  • Pentaho Business Analytics
  • Karmasphere Studio and Analyst
  • Talend Open Studio
  • Skytree Server
  • Tableau Desktop and Server
  • Splunk

Not as close to the metal as Lucene/Solr, Hadoop, HBase, Neo4j, and many other packages but not bad starting places.

Do be mindful of Peter’s closing paragraph:

At a recent O’Reilly Strata conference on big data, one of the best panels debated whether it was better to hire an expert on the subject being measured or an expert on using algorithms to find outliers. I’m not sure I can choose, but I think it’s important to hire a person with a mandate to think deeply about the data. It’s not enough to just buy some software and push a button.

Let us abolish page limits in scientific publications

Filed under: Writing — Patrick Durusau @ 7:20 pm

Let us abolish page limits in scientific publications

Daniel Lemire writes:

As scientists, we are often subjected to strict page limits. These limits made sense when articles were printed on expensive paper. They are now obsolete.

On the contrary, page limits (or their digital equivalents) are more important in a digital setting than when articles appeared in print. (Research data, source code and the like should not be subject to limits but that is a different issue.)

Why?

Perhaps you have heard of the DRY principle:


Don’t

Repeat

Yourself

The hazards of repeating yourself (inconsistency, changing one reference and not another, etc.) are multiplied in prose writing.

Why?

At least in computer programming, if you otherwise follow good programming practices, some of your tests or even the compiler will catch repetition as a bug, which can then be fixed. Repetition in prose lacks the advantage of a compiler to catch such errors.

Do you want to be known for “buggy” prose?

Moreover, good writing isn’t accidental. It is a matter of domain knowledge, hard work and practice. Write in a sloppy fashion and before too long, bad habits will creep into your “professional” writing.

Do you want to lose the ability to express yourself clearly?

Finally, your writing reflects your respect (or lack thereof) for readers. Your work is being read for possible use in publications or research. Why would you inflict poor writing on such readers?

To me personally, poor writing reflects a poor understanding of content. Is that how you want to be known?

Contra: Search Engine Land’s Mediocre Post on Local Search

Filed under: Searching,Statistics — Patrick Durusau @ 7:19 pm

Search Engine Land’s Mediocre Post on Local Search

Matthew Hurst writes:

A colleague brought to my attention a post on the influential search blog Search Engine Land which makes claims about the quality of local data found on search engines and local verticals: Yellow Pages Sites Beat Google In Local Data Accuracy Test. The author describes surprise at the outcome reported – that Yellow Pages sites are better at local search than Google. Rather, we should express surprise at how poorly this article is written and at the intentional misleading nature of the title.

What surprises me is how far Matthew had to go to find something “misleading.”

You may not agree with the definition of “local businesses” but it was clearly stated, so if the results are “misleading,” it is because readers did not appreciate the definition of “local businesses.” Since it was stated, whose fault is that?

As for “…swinging back to bad reporting…” (I didn’t see any bad reporting up to this point, but it is his post), the objection is to the last table, whose “coverage of an attribute” says nothing about quality.

If you can find where the Search Engine Land post ever said anything about the quality of “additional information” I would appreciate a pointer.

True, the “additional information” category is fairly vacuous, but that wasn’t hidden from the reader. Nor was it claimed to be something it wasn’t.

The original post did not follow Matthew’s preferences. That’s my takeaway from Matthew’s post.

Choices of variables and their definitions always, always favor a particular outcome.

What other reason is there to choose a variable and its definition?

Gapminder

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 7:19 pm

Gapminder by Hans Rosling.

If you don’t know the name Hans Rosling, you should.

A promoter of the use of statistics (and their illustration) to make sense of a complex and changing world.

Hans sees the world from the perspective of a public health expert.

Statistics are used to measure the effectiveness of public health programs.

The most impressive aspect of the site is its ability to create animated graphs on the fly from the data sets, for your viewing and manipulation.

Who’s accountable for IT failure? (Parts 1 & 2)

Filed under: IT,Project Management — Patrick Durusau @ 7:18 pm

Michael Krigsman has an excellent two-part series on IT failure:

Who’s accountable for IT failure? (Part One)

Who’s accountable for IT failure? (Part Two)

Michael goes through the horror stories and stats about IT failures (about 70%) in some detail.

But think about just the failure rate for a minute: 70%?

Would you drive a car with a 70% chance of failure?

Would you fly in a plane with a 70% chance of failure?

Would you trade securities with 70% chance your information is wrong?

Would you use a bank account where the balance has a 70% inaccuracy rate?

But, the government is about to embark on IT projects to make government more transparent and accountable.

Based on past experience, how many of those IT projects are going to fail?

If you said 70%, you’re right!

The senior management responsible for those IT projects needs a pointer to the posts by Michael Krigsman.

For that matter, I would like to see Michael post a PDF version that can be emailed to senior management and project participants at the start of each project.

SciDB Version 12.3

Filed under: NoSQL,SciDB — Patrick Durusau @ 7:18 pm

SciDB Version 12.3

From the email notice:

Highlights of this release include:

  • more compact storage
  • vectorized expression evaluation
  • improvements to grand, grouped and window aggregates
  • support for non-integer dimensions within most major operators, including joins
  • transactional storage engine with error detection and rollback

Internal benchmarks comparing this release with the prior releases show disk usage reduced by 25%-50% and queries that use vectorized expression evaluation sped up by 4-10X.

HyperDex: Documentation

Filed under: HyperDex,NoSQL — Patrick Durusau @ 7:18 pm

Posting on the HyperDex documentation separately from its latest release, since it may be of more lasting interest.

Current Documentation:

Hyperdex Documentation (web based)

Hyperdex Documentation (PDF)

Mailing lists:

hyperdex-announce

hyperdex-discuss

Other:

Hyperdex: A New Era in High Performance Data Stores for the Cloud (presentation, 13 April 2012)

Hyperdex Tutorial

Hyperdex (homepage)

HyperDex: A Searchable Distributed Key-Value Store (New Release)

Filed under: HyperDex — Patrick Durusau @ 7:17 pm

HyperDex: A Searchable Distributed Key-Value Store (New Release)

From the homepage:

2012-04-16: NEW RELEASE! HyperDex now supports lists, sets, and maps natively, with atomic operations on each of these structures. This enables HyperDex to be used in ever-more demanding applications that make use of these rich datastructures.
