November « 2012 « Another Word For It

November 30, 2012

Read this: Revising Prose

Filed under: Writing — Patrick Durusau @ 5:57 pm

Read this: Revising Prose by Jason Zimdars.

From the post:

There are plenty of books that will teach you to be a better writer, but I’ve never found one so immediately useful as Revising Prose by Richard A. Lanham. Following along as Lanham revises example upon example of real world writing is like exercise for your writing muscles.

My favorite takeaway is this tip for improving the rhythm and cadence of your writing. Many of us have learned to read text out loud as a method to reveal awkward transitions or generally dull passages, but you can also spot poor rhythm visually. A red flag for dull cadence is a run of sentences that are all of similar length. Try adding a carriage return after every sentence or phrase, the rhythm is evident:

This looks like an interesting and easy technique to use on your own prose as well as that of others.
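
The line-per-sentence trick is easy to try mechanically. What follows is only a minimal sketch in Python, not anything from Lanham or Zimdars: it assumes a sentence ends at ".", "!" or "?" followed by whitespace (a naive rule that will mis-split abbreviations), and the function names are my own.

```python
import re

def split_sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def rhythm_report(text):
    """Return (word_count, sentence) pairs, one per sentence."""
    return [(len(s.split()), s) for s in split_sentences(text)]

if __name__ == "__main__":
    sample = "Short one. This sentence runs a little longer than the first. Stop!"
    for count, sentence in rhythm_report(sample):
        # One sentence per line; the bar makes runs of similar length visible.
        print(f"{count:3d} {'#' * count} {sentence}")
```

A run of bars of nearly equal length is the visual red flag the post describes.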

The better a writer you become, the easier it will be for potential clients, colleagues and others to understand what you have written.

One caveat: if you go into “public service,” as they say in the United States, being understood may put you at a disadvantage.

As with most advice, it could go either way.

Campaign Finance Data in Splunk [Cui bono?]

Filed under: Government,Government Data,Splunk — Patrick Durusau @ 5:29 pm

Two posts you may find interesting:

SPLUNK’D: Federal Election Commission Campaign Finance Data

and,

Splunk4Good Announces public data project highlighting FEC Campaign Finance Data

Project link.

The project reveals answers to our burning questions:

What state gives the most?

Which state gives the most per capita? (Bet you won’t guess this one!)

What does aggregate giving look like visualized over the election cycle?

Is your city more Red or more Blue?

What does a map viz with drilldown reveal about giving by zip codes or cities?

What occupation gives the most?

Are geologists more Red or more Blue? (Hint: think about where geologists live and who they work for!)

Impressive performance but some of my burning questions would be:

Closing which tax loopholes would impact particular taxpayers who contributed to X political campaign?
Which legislative provisions benefit particular taxpayers or their investments?
Which regulations by federal agencies benefit particular taxpayers or their businesses?

The FEC data isn’t all you would need to answer those questions. But the answers are known.

Someone asked for the benefits in all three cases. Someone wrote the laws, regulations or loopholes with the intent to grant those benefits.

Not all of those are dishonest. Consider the charitable contributions that sustain fine art, music, libraries and research that benefits all of us.

There are other benefits that are less benign.

Identifying the givers, recipients, legislation/regulation and the benefit would require collocation of data from disparate domains and vocabularies.

Interested?

Linking Web Data for Education Project [Persisting Heterogeneity]

Filed under: Education,Linked Data,WWW — Patrick Durusau @ 3:48 pm

Linking Web Data for Education Project

From the about page:

LinkedUp aims to push forward the exploitation of the vast amounts of public, open data available on the Web, in particular by educational institutions and organizations.

This will be achieved by identifying and supporting highly innovative large-scale Web information management applications through an open competition (the LinkedUp Challenge) and dedicated evaluation framework. The vision of the LinkedUp Challenge is to realise personalised university degree-level education of global impact based on open Web data and information. Drawing on the diversity of Web information relevant to education, ranging from Open Educational Resources metadata to the vast body of knowledge offered by the Linked Data approach, this aim requires overcoming substantial challenges related to Web-scale data and information management involving Big Data, such as performance and scalability, interoperability, multilinguality and heterogeneity problems, to offer personalised and accessible education services. Therefore, the LinkedUp Challenge provides a focused scenario to derive challenging requirements, evaluation criteria, benchmarks and thresholds which are reflected in the LinkedUp evaluation framework. Information management solutions have to apply data and learning analytics methods to provide highly personalised and context-aware views on heterogeneous Web data.

Before linked data, we had: “…interoperability, multilinguality and heterogeneity problems….”

After linked data, we have: “…interoperability, multilinguality and heterogeneity problems….” + linked data (with heterogeneity problems).

Not unexpected but still need a means of resolution. Topic maps anyone?

Give me human editors and the New York Times

Filed under: Curation,Human Cognition — Patrick Durusau @ 6:38 am

Techmeme founder: Give me human editors and the New York Times by Jeff John Roberts.

From the post:

At the event in New York, which was hosted by media company Outbrain, Rivera explained to Business Insider’s Steve Kovach why algorithms will never be able to curate as effectively as humans.

“A lot of people who think they can go all the way with the automated approach fail to realize a news story has become obsolete,” said Rivera, explaining that an article can be quickly superseded even if it receives a million links or tweets.

This is why Rivera now relies on human editors to shepherd the headlines that bubble up and swat down the inappropriate ones. He argues any serious tech or political news provider will always have to do the same.

Rivera is also not enthused about social-based news platforms — sites like LinkedIn Today or Flipboard that assemble news stories based on what your friends are sharing on social media. Asked if Techmeme will offer a social-based news feed, Rivera said don’t count on it.

“People like to go to the New York Times and look at what’s on the front page because they have a lot of trust in what editors decide and they know other people read it. We want to do the same thing,” he said. “There’s value in being divorced from your friends … I’d rather see what’s on the front of the New York Times.”

Are you trapped in a social media echo chamber?

Escape with the New York Times.

I first saw this in a tweet by Peter Cooper.

HTML metadata for journal articles [Diversity Behind, Diversity Ahead]

Filed under: Bibliography,HTML,Metadata,Ontology — Patrick Durusau @ 6:18 am

HTML metadata for journal articles by Alf Eaton.

From the post:

You’d think it would be easy to pin down an ontology for journal articles. There are basically just these properties:

title

authors[]

datePublished

abstract

But… some of those are shared with more generic classes higher up the tree, so abstract becomes description, title becomes name, author becomes creator. Each author can be a string or an object. Each author has one or more affiliations, which have addresses. The authors are in a specific order, and some of them have certain roles. There are several different dates: creation, review, update, publication.
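
The renaming Eaton describes can be sketched as a simple key mapping. This is a minimal illustration in Python, not the schema of any particular vocabulary: the mapping table covers only the three renames named in the quote, `normalize_author` and `to_generic` are hypothetical helpers, and the sample article is made up.

```python
# Renames named in the quote: specific property -> more generic property.
GENERIC_NAMES = {"title": "name", "abstract": "description", "authors": "creator"}

def normalize_author(author):
    """Accept an author as a plain string or as a dict; always return a dict."""
    if isinstance(author, str):
        return {"name": author, "affiliations": []}
    return {"name": author["name"], "affiliations": list(author.get("affiliations", []))}

def to_generic(article):
    """Rename article-specific keys to their generic equivalents.

    Author order is preserved because authors stay in a list.
    """
    result = {}
    for key, value in article.items():
        if key == "authors":
            value = [normalize_author(a) for a in value]
        result[GENERIC_NAMES.get(key, key)] = value
    return result

article = {
    "title": "An Example Article",  # made-up sample data
    "authors": ["A. Author", {"name": "B. Author", "affiliations": ["Example University"]}],
    "datePublished": "2012-11-30",
    "abstract": "A short summary.",
}
generic = to_generic(article)
```

Even this toy has to decide what to do with the string-or-object author, which is the diversity the post is pointing at.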

A good listing of all the various options for bibliographic metadata.

A variety that arose in the last ten (10) years. If we pushed that back twenty (20) or thirty (30) years, even more diversity.

All of the systems mentioned are useful in their original contexts and will be supplanted by other systems over time.

But unlike humans, the components of the so-called “Semantic Web” don’t adapt to change. Or should I say they don’t adapt without the assistance of their human authors?

Any change to an ontology forces more work onto its maintainers. Perhaps that accounts for static ontologies that don’t account for prior diversity or the diversity that will surely follow.

I have always thought “X works with my software” a poor reason to adopt any particular approach. I much prefer approaches that meet my requirements, not those of a software vendor.

I first saw this in a tweet by Duncan Hull.

November 29, 2012

Best Practices for a Successful TokuDB Evaluation (Webinar)

Filed under: Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 7:20 pm

Best Practices for a Successful TokuDB Evaluation by Gerry Narvaja

Date: December 11th
Time: 2 PM EST / 11 AM PST

From the webpage:

In this webinar we will show step by step how to install, configure, and test TokuDB for a typical performance evaluation. We’ll also be flagging potential pitfalls that can ruin the eval results. It will describe the differences between installing from scratch and replacing an existing MySQL / MariaDB installation. It will also review the most common issues that may arise when running TokuDB binaries.

You have seen the TokuDB numbers on their data.

Now you can see what numbers you can get with your data.

The Web engineer’s online toolbox

Filed under: Programming,Web Applications,Web Browser — Patrick Durusau @ 6:34 pm

The Web engineer’s online toolbox by Ivan Zuzak.

From the post:

I wanted to compile a list of online, Web-based tools that Web engineers can use for their work in development, testing, debugging and documentation. The requirements for a tool to make the list are:

must be a live Web application (no extensions or apps you have to host yourself),

free to use (some kind of free plan available),

generic applicability (not usable only for a specific application/platform),

and must be useful to Web engineers (not just for Web site design).

If you are delivering content over the Web, you are either using or will be interested in using one or more of these tools.

International Aid Transparency Initiative (IATI) Standard

Filed under: Code Lists,Government,Government Data — Patrick Durusau @ 6:22 pm

International Aid Transparency Initiative (IATI) Standard

From the webpage:

The International Aid Transparency Initiative (IATI) is a global transparency standard that makes information about aid spending easier to access, use and understand.

More precisely it is a standard for normalizing financial data in order to provide transparency.

Transparency is desired by donors of international aid so they can judge and control the use of the aid they donate. Non-transparency is desired by the recipients of international aid because they resent the paternalism of and interference in local affairs by donors.

I sense a lack of the common interest that would be required to make this standard truly effective.

Its code lists, on the other hand, could be quite valuable in creating mapping solutions between disparate information systems.

I first saw this standard mentioned in Using Graphs to Analyse Public Spending on International Development by James Hughes.

Using Graphs to Analyse Public Spending on International Development

Filed under: Neo4j,Transparency — Patrick Durusau @ 6:07 pm

Using Graphs to Analyse Public Spending on International Development by James Hughes.

From the description:

James has been working on a really interesting project for the Department for International Development (DfID, http://www.dfid.gov.uk/), a UK government agency working on providing transparency around the ways that aid money gets spent on different development projects. He has been working on a web application that is providing a frontend + API access for people to interrogate a very detailed data format that details how Countries, Regions, Organisations, Activities, Budgets are related. During his talk, he will be explaining the history of the project, the reasons for moving from a MySQL backend to Neo4j, the benefits and problems that he faced in his experience along the way.

I would wait for the open source software to appear.

If you already know Neo4j, no extra information. If you don’t know Neo4j, not enough information to be useful.

FYI, “transparency” isn’t achieved using a normalized reporting system like IATI. Otherwise, self-reporting tax systems would have no tax evasion. Yes?

If you want useful transparency, it cannot rest on self-reporting: you need access to third parties who can verify reported transactions.

Slide deck here.

Streaming data into Apache HBase using Apache Flume

Filed under: Flume,HBase — Patrick Durusau @ 2:37 pm

Streaming data into Apache HBase using Apache Flume

From the post:

Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.

Streaming data is great, but being able to capture it when needed is even better!

Notation as a Tool of Thought

Filed under: Language,Language Design,Programming — Patrick Durusau @ 1:33 pm

Notation as a Tool of Thought by Kenneth E. Iverson.

From the introduction:

Nevertheless, mathematical notation has serious deficiencies. In particular, it lacks universality, and must be interpreted differently according to the topic, according to the author, and even according to the immediate context. Programming languages, because they were designed for the purpose of directing computers, offer important advantages as tools of thought. Not only are they universal (general-purpose), but they are also executable and unambiguous. Executability makes it possible to use computers to perform extensive experiments on ideas expressed in a programming language, and the lack of ambiguity makes possible precise thought experiments. In other respects, however, most programming languages are decidedly inferior to mathematical notation and are little used as tools of thought in ways that would be considered significant by, say, an applied mathematician.

The thesis of the present paper is that the advantages of executability and universality found in programming languages can be effectively combined, in a single coherent language, with the advantages offered by mathematical notation.

Will expose you to APL but that’s not a bad thing. The history of reasoning about data structures can be interesting and useful.

Iverson’s response to critics of the algorithms in this work was in part as follows:

…overemphasis of efficiency leads to an unfortunate circularity in design: for reasons of efficiency early programming languages reflected the characteristics of the early computers, and each generation of computers reflects the needs of the programming languages of the preceding generation. (5.4 Mode of Presentation)

A good reason to understand the nature of a problem before reaching for the keyboard.

Abusing Cloud-Based Browsers for Fun and Profit [Passing Messages, Not Data]

Filed under: Cloud Computing,Javascript,MapReduce,Messaging — Patrick Durusau @ 12:58 pm

Abusing Cloud-Based Browsers for Fun and Profit by Vasant Tendulkar, Joe Pletcher, Ashwin Shashidharan, Ryan Snyder, Kevin Butler and William Enck.

Abstract:

Cloud services have become a cheap and popular means of computing. They allow users to synchronize data between devices and relieve low-powered devices from heavy computations. In response to the surge of smartphones and mobile devices, several cloud-based Web browsers have become commercially available. These “cloud browsers” assemble and render Web pages within the cloud, executing JavaScript code for the mobile client. This paper explores how the computational abilities of cloud browsers may be exploited through a Browser MapReduce (BMR) architecture for executing large, parallel tasks. We explore the computation and memory limits of four cloud browsers, and demonstrate the viability of BMR by implementing a client based on a reverse engineering of the Puffin cloud browser. We implement and test three canonical MapReduce applications (word count, distributed grep, and distributed sort). While we perform experiments on relatively small amounts of data (100 MB) for ethical considerations, our results strongly suggest that current cloud browsers are a viable source of arbitrary free computing at large scale.
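
For readers who have not met it, word count, the first of the three applications named in the abstract, can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce phases in Python, not the paper's BMR implementation (which runs JavaScript inside cloud browsers); all function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(documents):
    pairs = []
    for doc in documents:  # each document could be mapped by a separate worker
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))

counts = word_count(["the cat sat", "the dog sat down"])
```

In a real deployment the map calls run on separate workers and only the grouped pairs travel between them.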

Excellent work on extending the use of cloud-based browsers. Whether you intend to use them for good or ill.

The use of messaging as opposed to passage of data is particularly interesting.

Shouldn’t that work for the process of merging as well?

Comments/suggestions?

Conway’s Game of Life for Curved Surfaces (Parts 1 and 2)

Filed under: Cellular Automata,Game of Life,Programming — Patrick Durusau @ 6:16 am

Conway’s Game of Life for Curved Surfaces (Part 1) and Conway’s Game of Life for Curved Surfaces (Part 2) by Mikola Lysenko.

A generalization of John Conway’s original Game of Life on curved surfaces.

Definitely not for the faint of heart and will likely have you consulting old text books.

A simple game that, even in its original version, unfolds into complexity. To say nothing of the extended version.
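
To show how little it takes to state the original game, here is a minimal sketch in Python of Conway's rules on an unbounded flat square grid: a live cell survives with two or three live neighbours, and a dead cell with exactly three live neighbours becomes live. It is not Lysenko's curved-surface generalization, and the `step` function name is my own.

```python
from collections import Counter

def step(live_cells):
    """Advance one generation; live_cells is a set of (x, y) tuples."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for x, y in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, count in neighbor_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

blinker = {(0, 1), (1, 1), (2, 1)}  # horizontal bar of three cells
after_one = step(blinker)           # becomes a vertical bar
after_two = step(after_one)         # and returns to the original
```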

See Cellular automaton (history and applications).

I first saw this in a tweet from Math Update.

November 28, 2012

Stratosphere

Filed under: BigData,MapReduce — Patrick Durusau @ 4:39 pm

Stratosphere

I saw a tweet from Stratosphere today saying: “20 secs per iteration for PageRank on a billion scale graph using #stratosphere’s iterative data flows.” Enough to get me to look further!

Tracking the source of the tweet, I found the homepage of Stratosphere and there read:

Stratosphere is a DFG-funded research project investigating “Information Management on the Cloud” and creating the Stratosphere System for Big Data Analytics. The current openly released version is 0.2 with many new features and enhancements for usability, robustness, and performance. See the Change Log for a complete list of new features.

What is the Stratosphere System?

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises a rich stack of components with different programming abstractions for complex analytics tasks:

An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases. Internally, Meteor scripts are translated into Sopremo algebra and optimized.

A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations. PACT is based on second-order functions and features an optimizer that chooses parallelization strategies.

An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.

Stratosphere is open source under the Apache License, Version 2.0. Feel free to download it, try it out and give feedback or ask for help on our mailing lists.

Meteor Language

Meteor is a textual higher-level language for rapid composition of queries. It uses a JSON-like data model and features in its core typical operation for analysis and transformation of (semi-) structured nested data.

The meteor language is highly extensible and supports the addition of custom operations that integrate fluently with the syntax, in order to create problem specific Domain Languages. Meteor queries are translated into Sopremo algebra, optimized, and transformed into PACT programs by the compiler.

PACT Programming Model

The PACT programming model is an extension of the well known MapReduce programming model. PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic schema-free tuple data model to ease composition of more complex programs.

PACT programs are parallelized by a cost-based compiler that picks data shipping and local processing strategies such that network- and disk I/O is minimized. The compiler incorporates user code properties (when possible) to find better plans; it thus alleviates the need for many manual optimizations (such as job merging) that one typically does to create efficient MapReduce programs. Compiled PACT programs are executed by the Nephele Data Flow Engine.
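
The quote names the second-order functions without saying what they do. The sketch below is my reading of the three that go beyond Map and Reduce, written as plain sequential Python: Match as an equi-join on keys, Cross as a Cartesian product, and CoGroup as one call per key with the values from both inputs. Those semantics are an assumption on my part, this is not the Stratosphere API, and the sample records are made up.

```python
from collections import defaultdict
from itertools import product

def _group(records):
    """Group (key, value) records by key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups

def match(left, right, fn):
    """Call fn(key, l, r) for every pair of values that share a key (an equi-join)."""
    lg, rg = _group(left), _group(right)
    return [fn(k, l, r) for k in lg if k in rg for l, r in product(lg[k], rg[k])]

def cross(left, right, fn):
    """Call fn on every combination of a left record and a right record."""
    return [fn(l, r) for l, r in product(left, right)]

def cogroup(left, right, fn):
    """Call fn(key, left_values, right_values) once per key found in either input."""
    lg, rg = _group(left), _group(right)
    keys = sorted(set(lg) | set(rg))
    return [fn(k, lg.get(k, []), rg.get(k, [])) for k in keys]

people = [("GA", "alice"), ("NY", "bob")]        # made-up sample records
amounts = [("GA", 100), ("GA", 250), ("TX", 75)]

joined = match(people, amounts, lambda k, name, amount: (k, name, amount))
grouped = cogroup(people, amounts, lambda k, names, vals: (k, len(names), sum(vals)))
pairs = cross(people, amounts, lambda p, a: (p[1], a[1]))
```

The point of having the system know these shapes is that its compiler, not the programmer, gets to choose how to ship and partition the data.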

Nephele Data Flow Engine

Nephele is a massively parallel data flow engine dealing with resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and govern the resources itself, or directly connect to an IaaS cloud service to allocate computing resources on demand.

Another big data contender!

Dereferencing Issues

Filed under: Humor,Semantic Web — Patrick Durusau @ 3:06 pm

Robert Cerny, a well known topic map maven, tweeted his favourite #GaryLarson cartoon, this one on dereferencing:

Dereferencing

Bash One-Liners Explained (series)

Filed under: Bash,Data Mining,String Matching,Text Mining — Patrick Durusau @ 10:26 am

Bash One-Liners Explained by Peteris Krumins.

The series page for posts by Peteris Krumins on Bash one-liners.

So far:

One real advantage to Bash scripts is the lack of a graphical interface to get in the way.

A real advantage with “data” files but many times “text” files as well.

Netflix open sources Hystrix resilience library [Component for Distributed TMs]

Filed under: Distributed Systems,Hystrix — Patrick Durusau @ 10:11 am

Netflix open sources Hystrix resilience library

From the post:

Netflix has moved on from just releasing the tools it uses to test the resilience of the cloud services that power the video streaming company, and has now open sourced a library that it uses to engineer in that resilience. Hystrix is an Apache 2 licensed library which Netflix engineers have been developing over the course of 2012 and which has been adopted by many teams within the company. It is designed to manage how distributed services interact and give more tolerance to latency within those connections and the inevitable failures that can occur.

The library isolates access points between services and then stops any failures from cascading between those access points. Hystrix uses a Command pattern to execute or queue Command objects and evaluate whether the circuit to the service for which the command is destined for is in operation. This may not be the case where what Hystrix calls a circuit breaker has triggered leaving the circuit “open”. Circuit breakers can be placed into a system to make it easier to trigger a coordinated failover. The library also checks for other issues which may prevent the execution of the command.
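
The circuit breaker idea in that description can be sketched without Hystrix. The Python below is a minimal illustration of the pattern only, not Hystrix's API (Hystrix is a Java library built around Command objects): the class name, the failure threshold and the reset window are all hypothetical choices of mine.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before allowing a trial call
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a service known to be down.
                raise CircuitOpenError("circuit open; call rejected")
            # Reset window elapsed: fall through and let one trial call run.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each remote call as `breaker.call(fetch, url)` means a failing dependency is rejected immediately rather than tying up callers, which is how the cascade gets stopped.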

Does your distributed TM have the resilience of Netflix?

Is that the new “normal” for resilience?

The post goes on to say that a dashboard is forthcoming to monitor Hystrix.

Mortar [Public Launch, Python and Hadoop]

Filed under: Hadoop,Mortar,Usability — Patrick Durusau @ 9:59 am

Announcing our public launch

From the post:

Last week, we announced our $1.8 million fundraising. For those of you who follow big data startups, our blog post probably felt…underwhelming. Startups typically come out and make a huge publicity splash, jam-packed with buzzwords and vision galore. While we feel very fortunate to have what we need to help us grow, we know that VC funding is merely a means, and not an end.

But now you get to see us get really excited, because Mortar’s Hadoop PaaS and open source framework for big data is now publicly available. This means if you want to try it, you can activate your trial right now on our site without having to talk to anyone (unless you want to!).

You can get started on Mortar using Web Projects (using Mortar entirely online through the browser) or Git Projects (using Mortar locally on your own machine with the Mortar development framework). You can see more info about both here.

All trial accounts come with our full Hadoop PaaS, unlimited use of the Mortar framework, our site, and dev tools, and 10 free Hadoop node-hours. (You can get another 15 free node-hours per month and additional support at no cost by simply adding your credit card to the account.)

Mortar accepts Pig scripts and “real Python.” So you can use your favourite Python libraries with Hadoop.

I don’t know if there is any truth to the rumor that Mortar supports Python because Lars Marius Garshol and Steve Newcomb use it. So don’t ask me.

I first saw this in a tweet by David Fauth.

xkcd: Calendar of meaningful dates

Filed under: Graphics,Mapping,Visualization — Patrick Durusau @ 6:37 am

xkcd: Calendar of meaningful dates by Nathan Yau.

From the post:

Using the Google ngrams corpus, xkcd sized the days of the year based on usage volume. Lots of firsts of the month and September 11th.

Interesting presentation of date usage in English language books since 2000.

Suggestive though of other applications.

Such as plotting the number of sick days taken by particular departments? Or on what day of the week?

Think of product releases scheduled for when staff aren’t out sick or catching up after being out.

Calendars are familiar objects and for some types of data, might make a useful mapping target/interface.

Semantic Web Explained

Filed under: Humor,Semantic Web — Patrick Durusau @ 6:11 am

Inge Hendriksen tweets: “The #SemanticWeb explained in a single cartoon frame…”

iFinder (Knowledge Maps)

Filed under: iFinder,Mapping,Maps — Patrick Durusau @ 6:06 am

iFinder (Knowledge Maps)

From the webpage:

Knowledge Maps

Search from a different angle

As we know from current studies and user data analyses of search engine providers, most users enter max. two terms to start their search. By using the Knowledge Map, it is no longer necessary to enter even a single search term – just by a few mouse clicks the user can targetedly and comprehensibly reach the desired search result.

Your corporate knowledge at a glance

The IntraFind solution “Knowledge Map” offers a user-friendly surface for doing research in company internal data sources.

All available data are clearly visualized in a “360 degree view” and can be quickly and easily narrowed to the desired hit document just by mouse click without the need to enter one single search term.

The product guide for IntraFind’s Knowledge Map enhancement for iFinder has several riffs adaptable to promoting topic maps.

Difficult to tell from the product literature, which was sparser than most, what lies under the hood. Appears to be a metadata harvesting/navigation solution.

Did not see any signs of the ability to share/combine mappings together.

If you take this as a baseline for the value of mapping, then topic maps are a value-add to traditional mapping.

Computational Finance with Map-Reduce in Scala [Since Quants Have Funding]

Filed under: Finance Services,MapReduce,Scala — Patrick Durusau @ 5:48 am

Computational Finance with Map-Reduce in Scala by Ron Coleman, Udaya Ghattamaneni, Mark Logan, and Alan Labouseur. (PDF)

Assuming the computations performed by quants are semantically homogeneous (a big assumption), the sources of their data and the applications of the outcomes are not.

The clients of quants aren’t interested in you humming “…it’s a big world after all…,” etc. They are interested in furtherance of their financial operations.

Using topic maps to make an already effective tool more effective is the most likely way to capture their interest. (Short of taking hostages.)

I first saw this in a tweet by Data Science London.

Pathfinding with Neo4j Unmanaged Extensions

Filed under: Graphs,Neo4j — Patrick Durusau @ 5:33 am

Pathfinding with Neo4j Unmanaged Extensions by Max De Marzi.

From the post:

In Extending Neo4j I showed you how to create an unmanaged extension to warm up the node and relationship caches. Let’s try doing something more interesting like exposing the A* (A Star) search algorithm through the REST API. The graph we created earlier looks like this:
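
For readers unfamiliar with A* itself, the algorithm can be sketched independently of Neo4j. The Python below is a minimal illustration, not the code from Max's post or Neo4j's built-in implementation: the graph is a plain dict of made-up nodes and costs, and the heuristic table is an assumption chosen so that it never overestimates the remaining cost.

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """Return (cost, path) for the cheapest path, or None if goal is unreachable.

    graph maps node -> {neighbor: edge_cost}; heuristic(node) must never
    overestimate the remaining cost to goal.
    """
    frontier = [(heuristic(start), 0, start, [start])]  # (estimate, cost so far, node, path)
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route to node was found later
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(neighbor), new_cost, neighbor, path + [neighbor]),
                )
    return None

# Made-up example graph; not the graph from the post.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
estimate = {"A": 3, "B": 3, "C": 1, "D": 0}
result = a_star(graph, "A", "D", estimate.get)
```

With a heuristic that always returns zero the same function degrades to Dijkstra's algorithm, which is a handy sanity check.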

What would you want to add to Neo4j?

November 27, 2012

For Attribution… [If One Identifier/URL isn’t enough]

Filed under: Citation Practices,Data,Data Attribution — Patrick Durusau @ 4:12 pm

For Attribution — Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop by Paul F. Uhlir.

From the preface:

The growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. The growth in online datasets presents related, yet more complex challenges. It depends upon the ability to reliably identify, locate, access, interpret and verify the version, integrity, and provenance of digital datasets.

Data citation standards and good practices can form the basis for increased incentives, recognition, and rewards for scientific data activities that in many cases are currently lacking in many fields of research. The rapidly-expanding universe of online digital data holds the promise of allowing peer-examination and review of conclusions or analysis based on experimental or observational data, the integration of data into new forms of scholarly publishing, and the ability for subsequent users to make new and unforeseen uses and analyses of the same data – either in isolation, or in combination with other datasets.

The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. As funding sources for scientific research have begun to require data management plans as part of their selection and approval processes, it is important that the necessary standards, incentives, and conventions to support data citation, preservation, and accessibility be put into place.

Of particular interest are the four questions that shaped this workshop:

1. What is the status of data attribution and citation practices in the natural and social (economic and political) sciences in United States and internationally?

2. Why is the attribution and citation of scientific data important and for what types of data? Is there substantial variation among disciplines?

3. What are the major scientific, technical, institutional, economic, legal, and socio-cultural issues that need to be considered in developing and implementing scientific data citation standards and practices? Which ones are universal for all types of research and which ones are field or context specific?

4. What are some of the options for the successful development and implementation of scientific data citation practices and standards, both across the natural and social sciences and in major contexts of research?

The workshop did not presume a solution (is that a URL in your pocket?) but explores the complex nature of attribution and citation.

Michael Sperberg-McQueen remarks:

Longevity: Finally, there is the question of longevity. It is well known that the half-life of citations is much higher in humanities than in the natural sciences. We have been cultivating a culture of citation of referencing for about 2,000 years in the West since the Alexandrian era. Our current citation practice may be 400 years old. The http scheme, by comparison, is about 19 years old. It is a long reach to assume, as some do, that http URLs are an adequate mechanism for all citations of digital (and non-digital!) objects. It is not unreasonable for scholars to be skeptical of the use of URLs to cite data of any long-term significance, even if they are interested in citing the data resources they use. [pp. 63-64]

What I find most attractive about topic maps is that you can have:

A single URL as a citation/identifier.
Multiple URLs as citations/identifiers (for the same data resource).
Multiple URLs and/or other forms of citations/identifiers as they develop(ed) over time for the same data resource.

Why the concept of multiple citations/identifiers (quite common in biblical studies) for a single resource is so difficult I cannot explain.
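
To make the idea concrete, here is a minimal sketch in Python of one resource answering to several citations/identifiers. It is not a topic map engine and follows no particular topic map API; the class name `CitationIndex`, its methods, and the example URLs and DOI are all hypothetical, chosen only for illustration.

```python
class CitationIndex:
    """Hypothetical sketch: many identifiers, one record per data resource."""

    def __init__(self):
        self._subject_of = {}   # identifier -> internal subject key
        self._identifiers = {}  # internal subject key -> set of identifiers
        self._next_key = 0

    def assert_same(self, *identifiers):
        """Declare that all given identifiers cite the same data resource."""
        # Find subjects already known under any of these identifiers.
        known = {self._subject_of[i] for i in identifiers if i in self._subject_of}
        if known:
            key = min(known)          # reuse an existing subject
        else:
            key = self._next_key      # or create a new one
            self._next_key += 1
        # Merge the identifier sets of every subject that turned out to be the same.
        merged = set(identifiers)
        for other in known:
            merged |= self._identifiers.pop(other)
        self._identifiers[key] = merged
        for i in merged:
            self._subject_of[i] = key
        return key

    def identifiers_for(self, identifier):
        """Return every identifier known for the resource cited by `identifier`."""
        return frozenset(self._identifiers[self._subject_of[identifier]])


# Example identifiers are made up; they do not point at real data.
index = CitationIndex()
index.assert_same("http://example.org/data/v1", "doi:10.9999/example")
# Years later the data moves, and a new URL is recorded against the old DOI.
index.assert_same("http://example.org/archive/data", "doi:10.9999/example")
print(sorted(index.identifiers_for("http://example.org/data/v1")))
```

Looking up the resource by its oldest URL returns all three identifiers, so a citation made with any one of them still resolves to the same record.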

Extending Neo4j

Filed under: Neo4j,Programming — Patrick Durusau @ 3:16 pm

Extending Neo4j by Max De Marzi.

From the post:

One of the great things about Neo4j is how easy it is to extend it. You can extend Neo4j with Plugins and Unmanaged Extensions. Two great examples of plugins are the Gremlin Plugin (which lets you use the Gremlin library with Neo4j) and the Spatial Plugin (which lets you perform spatial operations like searching for data within specified regions or within a specified distance of a point of interest).

Plugins are meant to extend the capabilities of the database, nodes, or relationships. Unmanaged extensions are meant to let you do anything you want. This great power comes with great responsibility, so be careful what you do here. David Montag cooked up an unmanaged extension template for us to use on github so lets give it a whirl. We are going to clone the project, compile it, download Neo4j, configure Neo4j to use the extension, test the extension and tweak it a bit.

Max walks you through extending Neo4j to build your favourite features.

Data Gift Guide

Filed under: Humor — Patrick Durusau @ 2:38 pm

Data Gift Guide by Nathan Yau.

From the post:

Now that we’re done giving thanks for all the intangibles like love, friends, family, and drunkenness, it’s time to turn our attention to the physical objects we don’t have yet. It’s the most wonderful time of year! Here are gift ideas for your data geek friends and family. A few of these take a while to make, so be sure to order them now so that you get them in time for Christmas.

Nathan has collected some interesting sources of gifts for the season.

From things I have never wondered about, pillows shaped like statistical distributions, to the more familiar books and electronics.

I mention this in lieu of any topic map specific gift sources that come to mind. Perhaps that will be different by next Christmas!

One possibility: Instead of a book of politicians and their dumb ideas, what if you had a book of dumb ideas with the politicians that hold them?

A reverse index of dumb ideas.
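
For anyone tempted to build one, the reverse index itself is the easy part. A minimal sketch in Python, assuming the forward listing is a mapping from politician to ideas; the function name `reverse_index` and the placeholder names are hypothetical, and no real politicians or ideas are implied.

```python
from collections import defaultdict


def reverse_index(ideas_by_politician):
    """Invert a politician -> ideas mapping into idea -> politicians."""
    politicians_by_idea = defaultdict(set)
    for politician, ideas in ideas_by_politician.items():
        for idea in ideas:
            politicians_by_idea[idea].add(politician)
    # Sort both keys and values so the printed index reads like a book index.
    return {idea: sorted(names) for idea, names in sorted(politicians_by_idea.items())}


# Placeholder data only.
forward = {
    "Politician A": ["Idea 1", "Idea 2"],
    "Politician B": ["Idea 2"],
}
by_idea = reverse_index(forward)
for idea, names in by_idea.items():
    print(idea, "->", ", ".join(names))
```

The hard part, as noted, is collecting the forward listing in the first place.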

Other suggestions? (Volunteers to watch the news to create such an item? I have avoided it for years. What it doesn’t get wrong is largely irrelevant.)

SINAInnovation: Innovation and Data

Filed under: Bioinformatics,Cloudera,Data — Patrick Durusau @ 2:26 pm

SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.

From the description:

Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.

Interesting presentation, particularly on creating structures for innovation.

One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try a new idea and, when it fails, try another.

I do wonder about his goal: “Lower the cost of data storage and processing to zero.”

It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.

I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.

Failing infrastructures don’t lead to innovation.

SINAInnovation description:

SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.
