Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 9, 2014

Simple Testing Can Prevent Most Critical Failures:…

Filed under: Distributed Computing,Programming,Systems Administration — Patrick Durusau @ 6:28 pm

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, et al.

Abstract:

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code–the last line of defense–even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

If you aren’t already convinced you need to read this paper, consider one more quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. (emphasis added)

How will catastrophic system failure reflect on your product or service? Hint: It doesn’t reflect well on topic maps or any other service or technology.

I say “read” this paper, but perhaps putting it on a 90-day reading rotation would be better.
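To see how mechanical the missing testing can be, here is a toy checker in the spirit of Aspirator. Aspirator itself targets Java and applies several rules; this Python sketch, rule and all, is mine, and only illustrates the idea of flagging error handlers that silently swallow explicitly signaled errors:

    # Toy Aspirator-flavored checker: report exception handlers that
    # do nothing (only `pass`), i.e. errors explicitly signaled by the
    # software but silently ignored. Illustration only.
    import ast
    import sys

    def check_handlers(source: str, filename: str = "<input>") -> None:
        tree = ast.parse(source, filename)
        for node in ast.walk(tree):
            if isinstance(node, ast.ExceptHandler):
                if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                    print(f"{filename}:{node.lineno}: exception swallowed silently")

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path) as f:
                check_handlers(f.read(), path)

Thirty lines of AST walking is obviously not the paper's tool, but it makes the paper's point: the bugs behind many catastrophic failures are shallow enough for checks of roughly this complexity to find.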

Intriguing properties of neural networks [Gaming Neural Networks]

Filed under: Data Analysis,Neural Networks — Patrick Durusau @ 4:48 pm

Intriguing properties of neural networks by Christian Szegedy, et al.

Abstract:

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties.

First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.

Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. Specifically, we find that we can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

Both findings are of interest, but the discovery of “adversarial examples” that can cause a trained network to misclassify images is the more intriguing of the two.

How do you validate a result from a neural network? Possessing the same network and data isn’t going to help if it contains “adversarial examples.” I suppose you could “spot” a misclassification but one assumes a neural network is being used because physical inspection by a person isn’t feasible.

What “adversarial examples” work best against particular neural networks? How to best generate such examples?

How do users of off-the-shelf neural networks guard against “adversarial examples?” (One of those cases where “shrink-wrap” data services may not be a good choice.)
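To make the mechanics concrete, here is a minimal numpy sketch of the underlying move: perturb an input in the direction that increases the model’s loss. The paper itself uses box-constrained L-BFGS on deep networks; this toy logistic-regression version, with invented data, only illustrates the principle:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def adversarial_nudge(x, y, w, b, eps=0.05):
        """Step x in the direction that increases cross-entropy loss."""
        p = sigmoid(np.dot(w, x) + b)
        grad_x = (p - y) * w               # dL/dx for logistic loss
        return x + eps * np.sign(grad_x)   # small sign-based step

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=20), 0.0
    x, y = rng.normal(size=20), 1.0        # x labeled as class 1
    x_adv = adversarial_nudge(x, y, w, b)
    print(sigmoid(np.dot(w, x) + b), sigmoid(np.dot(w, x_adv) + b))
    # The second probability is strictly lower: a small, structured
    # nudge pushes the model toward misclassifying the same input.

The unsettling part reported in the paper is that such perturbations transfer between networks trained on different data, which is what makes them an attack rather than a curiosity.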

I first saw this in a tweet by Xavier Amatriain.

Sir Tim Berners-Lee speaks out on data ownership

Filed under: Merging,Semantic Web — Patrick Durusau @ 4:12 pm

Sir Tim Berners-Lee speaks out on data ownership by Alex Hern.

From the post:

The data we create about ourselves should be owned by each of us, not by the large companies that harvest it, Tim Berners-Lee, the inventor of the world wide web, said today.

Berners-Lee told the IPExpo Europe in London’s Excel Centre that the potential of big data will be wasted as its current owners use it to serve ever more “queasy” targeted advertising.

Berners-Lee, who wrote the first memo detailing the idea of the world wide web 25 years ago this year, while working for physics lab Cern in Switzerland, told the conference that the value of “merging” data was under-appreciated in many areas.

Speaking to public data providers, he said: “I’m not interested in your data; I’m interested in merging your data with other data. Your data will never be as exciting as what I can merge it with.”

No disagreement with: …the value of “merging” data was under-appreciated in many areas. 😉

Considerable disagreement on how best to accomplish that merging, but that will become an empirical question when people wake up to the value of “merging” data.
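A toy sketch of what is at stake (data, keys, and the mapping all invented): two datasets describe the same subject under different identifiers, and the interesting part is the merging rule that says so.

    ours = {"Q42": {"name": "Douglas Adams", "born": 1952}}
    theirs = {"adams-d": {"books": 12}}
    same_subject = {"adams-d": "Q42"}   # the merging rule, however arrived at

    merged = {k: dict(v) for k, v in ours.items()}
    for their_key, props in theirs.items():
        merged.setdefault(same_subject.get(their_key, their_key), {}).update(props)

    print(merged)  # {'Q42': {'name': 'Douglas Adams', 'born': 1952, 'books': 12}}

Everything of value in `merged` depends on the `same_subject` mapping. Producing and justifying that mapping is where topic maps, or any competing approach, has to earn its keep.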

Berners-Lee may be right about who “should” own data about ourselves, but that isn’t in fact who owns it now. Changing property laws means taking rights away from those who hold them under the current regime and creating new rights for others in a new system. Property laws have changed before, but it requires more than slogans and wishful thinking to make it so.

Programming for Biologists

Filed under: Bioinformatics,Biology,Programming,Teaching — Patrick Durusau @ 3:26 pm

Programming for Biologists by Ethan White.

From the post:

This is the website for Ethan White’s programming and database management courses designed for biologists. At the moment there are four courses being taught during Fall 2014.

The goal of these courses is to teach biologists how to use computers more effectively to make their research easier. We avoid a lot of the theory that is taught in introductory computer science classes in favor of covering more of the practical side of programming that is necessary for conducting research. In other words, the purpose of these courses is to teach you how to drive the car, not prepare you to be a mechanic.

Hmmm, less theory of engine design and more driving lessons? 😉

Despite my qualms about turn-key machine learning solutions, more people want to learn to drive a car than want to design an engine.

Should we teach topic maps the “right way” or should we teach them to drive?

I first saw this in a tweet by Christophe Lalanne.

Twitter sues US federal agencies in attempt to remove the gag around surveillance

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:36 am

Twitter sues US federal agencies in attempt to remove the gag around surveillance by Lisa Vaas.

From the post:

Twitter doesn’t want its transparency report to be fuzzy to the point of meaninglessness, full of “broad, inexact ranges” about how many times the US government has shaken the company down in its surveillance operations, it says – for example, by counting them to the nearest thousand.

So on Tuesday, Twitter sued the Feds over the surveillance laws they’re using to gag it.

Twitter’s lawyer, Ben Lee, said in a post that First Amendment rights should allow the company to be crystal clear about the actual scope of surveillance of Twitter users by the US, as opposed to the current state of affairs, where companies such as Twitter are bound by laws that punish them for disclosing requests for information.

Lisa has links to the court documents and mentions that Twitter isn’t standing alone against government surveillance:

Both Apple and Google announced in September new mobile phone encryption policies meant to thwart government attempts to get at user data – a move that’s sparked hand-wringing on the part of multiple government officials.

Other US tech companies, including Microsoft, Facebook, Dropbox, and, again, Google, have been fighting government demands for user data in other ways, including attempting to convince the Senate to reform government surveillance.

The “hand-wringing” Lisa mentions is a measure of the technical illiteracy of government policy makers. New mobile phone policies will make secure voice marginally easier for the average user, but even the semi-literate have had access to secure voice for years, see: PRISM-proof your phone with these encrypted apps and services.

Support technology company opposition to government surveillance at every opportunity.

October 8, 2014

Unicode Version 7.0…

Filed under: Unicode — Patrick Durusau @ 4:32 pm

Unicode Version 7.0 – Complete Text of the Core Specification Published

From the post:

The Unicode® Consortium announces the publication of the core specification for Unicode 7.0. The Version 7.0 core specification contains significant changes:

  • Major reorganization of the chapters and overall layout
  • New page size tailored for easy viewing on e-readers and other mobile devices
  • Addition of twenty-two new scripts and a shorthand writing system
  • Alignment with updates to the Unicode Bidirectional Algorithm

In Version 7.0, the standard grew by 2,834 characters. This version continues the Unicode Consortium’s long-term commitment to support the full diversity of languages around the world with its newly encoded scripts and other additional characters. The text of the latest version documents two newly adopted currency symbols: the manat, used in Azerbaijan, and the ruble, used in Russia and other countries. It also includes information about newly added pictographic symbols, geometric symbols, arrows and ornaments.

This version of the Standard brings technical improvements to support implementers, including further clarification of the case pair stability policy, and a new stability policy for Numeric_Type.

All other components of Unicode 7.0 were released on June 16, 2014: the Unicode Standard Annexes, code charts, and the Unicode Character Database, to allow vendors to update their implementations of Unicode 7.0 as early as possible. The release of the core specification completes the definitive documentation of the Unicode Standard, Version 7.0.

For more information on all of The Unicode Standard, Version 7.0, see http://www.unicode.org/versions/Unicode7.0.0/.

For non-backtick + Unicode character applications, this is good news!

Following the Unicode standard should be the first test for consideration of an application. The time for ad hoc character hacks passed a long time ago.
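One small example of what “following the standard” buys over ad hoc hacks, using nothing but Python’s standard library: Unicode normalization makes canonically equivalent strings compare equal.

    import unicodedata

    a = "caf\u00e9"    # "é" as one precomposed code point (U+00E9)
    b = "cafe\u0301"   # "e" plus combining acute accent (U+0301)

    print(a == b)   # False: same text, different code point sequences
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))
    # True: normalizing per the standard removes the ad hoc problem

An application that rolls its own character handling instead will get this wrong in ways its authors never test for.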

A look at Cayley

Filed under: Cayley,Graphs,Neo4j — Patrick Durusau @ 4:15 pm

A look at Cayley by Tony.

From the post:

Recently I took the time to check out Cayley, a graph database written in Go that’s been getting some good attention.


https://github.com/google/cayley

A great introduction to Cayley. Tony has some comparisons to Neo4j, but for beginners with graph databases, those comparisons may not be very useful. Come back for those comparisons once you have moved beyond example graphs.

Incremental Classification, concept drift and Novelty detection (IClaNov)

Filed under: Classification,Concept Drift,Novelty — Patrick Durusau @ 10:51 am

Incremental Classification, concept drift and Novelty detection (IClaNov)

From the post:

The development of dynamic information analysis methods, like incremental clustering, concept drift management and novelty detection techniques, is becoming a central concern in a bunch of applications whose main goal is to deal with information which is varying over time. These applications relate themselves to very various and highly strategic domains, including web mining, social network analysis, adaptive information retrieval, anomaly or intrusion detection, process control and management recommender systems, technological and scientific survey, and even genomic information analysis, in bioinformatics. The term “incremental” is often associated to the terms dynamics, adaptive, interactive, on-line, or batch. The majority of the learning methods were initially defined in a non-incremental way. However, in each of these families, were initiated incremental methods making it possible to take into account the temporal component of a data stream. In a more general way incremental clustering algorithms and novelty detection approaches are subjected to the following constraints:

  • Possibility to be applied without knowing as a preliminary all the data to be analyzed;
  • Taking into account of a new data must be carried out without making intensive use of the already considered data;
  • Result must be available after insertion of all new data;
  • Potential changes in the data description space must be taken into consideration.

This workshop aims to offer a meeting opportunity for academics and industry-related researchers, belonging to the various communities of Computational Intelligence, Machine Learning, Experimental Design and Data Mining to discuss new areas of incremental clustering, concept drift management and novelty detection and on their application to analysis of time varying information of various natures. Another important aim of the workshop is to bridge the gap between data acquisition or experimentation and model building.

ICDM 2014 Conference: December 14, 2014

The agenda for this workshop has been posted.

Does your ontology support incremental classification, concept drift and novelty detection? All of those exist in the ongoing data stream of experience if not within some more limited data stream from a source.

You can work from a dated snapshot of the world as it was, but over time will that best serve your needs?

Remember that for less than $250,000 (est.) the attacks on 9/11 provoked the United States into spending $trillions based on a Cold War snapshot of the world. Probably the highest return on investment for an attack in history.

The world is constantly changing and your data view of it should be changing as well.
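In code terms, the contrast between a snapshot and a stream looks something like this minimal scikit-learn sketch (data and drift invented): `partial_fit` updates the model as chunks arrive, without revisiting old data, which matches the workshop’s first two constraints.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    clf = SGDClassifier()
    classes = np.array([0, 1])

    for t in range(10):                              # ten arriving chunks
        X = rng.normal(size=(100, 5))
        y = (X[:, 0] + 0.1 * t > 0).astype(int)      # concept drifts over time
        clf.partial_fit(X, y, classes=classes)       # update; never retrain on old data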

BadUSB Conference Swag?

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:01 am

Phison 2251-03 (2303) Custom Firmware & Existing Firmware Patches (BadUSB) by Adam Caudill and Brandon Wilson.

Not as catchy a title as the BBC: Attack code for ‘unpatchable’ USB flaw released.

The BBC quotes Karsten Nohl (one of the original discoverers of the USB flaw) as saying:

“In the case of BadUSB, however, the problem is structural,” he said. “The standard itself is what enables the attack and no single vendor is in a position to change that.”

The market figures for USB flash drives highlight the significance of this “flaw.”

The podcast that led to this post, SSCC 168 – Amaze your friends by ruining all their USB drives! [PODCAST], mentions PROMs (Programmable Read-Only Memory) as a defense for future USB products. PROMs are programmed by physically altering the chip (“burning”), which prevents some reprogramming of the chip. Connections that aren’t “burnt” could be altered on a PROM, but the effectiveness of that for reprogramming isn’t known.

Should USB flash drives with firmware protected by PROMs prove to be popular, it is always possible to build USB flash drives with rogue PROMs: various logos on the outside, custom malware installed within.

The security lesson here is that devices are insecure if you can’t verify their firmware.
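Where you can read a firmware image back, verification is nothing exotic: compare a digest against a vendor-published value, as in the sketch below (path and digest are placeholders). The sting of BadUSB is precisely that consumer USB controllers offer no trustworthy read-back, so there is nothing honest to hash.

    import hashlib

    KNOWN_GOOD_SHA256 = "replace-with-vendor-published-digest"

    def firmware_matches(path: str) -> bool:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest() == KNOWN_GOOD_SHA256

    print(firmware_matches("/tmp/dumped-firmware.bin"))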

October 7, 2014

Software Security (MOOC, Starts October 13, 2014!)

Filed under: Cybersecurity,Programming,Security,Software — Patrick Durusau @ 7:21 pm

Software Security

From the post:

Weekly work done at your own pace and schedule by listening to lectures and podcasts, completing quizzes and exercises and peer evaluations. Estimated time commitment is 4 hours/week. Course runs for 9 weeks (ends December 5)


This MOOC introduces students to the discipline of designing, developing, and testing secure and dependable software-based systems. Students will be exposed to the techniques needed for the practice of effective software security techniques. By the end of the course, you should be able to do the following things:

  • Security risk management. Students will be able to assess the security risk of a system under development. Risk management will include the development of formal and informal misuse case and threat models. Risk management will also involve the utilization of security metrics.
  • Security testing. Students will be able to perform all types of security testing, including fuzz testing at each of these levels: white box, grey box, and black box/penetration testing.
  • Secure coding techniques. Students will understand secure coding practices to prevent common vulnerabilities from being injected into software.
  • Security requirements, validation and verification. Students will be able to write security requirements (which include privacy requirements). They will be able to validate these requirements and to perform additional verification practices of static analysis and security inspection.

This course is run by the Computer Science department at North Carolina State University.

Register

One course won’t make you a feared White/Black Hat, but everyone has to start somewhere.

Looks like a great opportunity to learn about software security issues and to spot where subject identity techniques could help collate holes or fixes.
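The fuzz-testing bullet in the course outline is easy to make concrete. A mutation fuzzer in miniature (the `json.loads` target is just a stand-in for your code under test; real fuzzers add coverage feedback and far smarter mutation):

    import json
    import random

    def fuzz(target, seed: bytes, trials: int = 1000):
        findings = []
        for _ in range(trials):
            data = bytearray(seed)
            for _ in range(random.randint(1, 8)):        # flip a few random bytes
                data[random.randrange(len(data))] = random.randrange(256)
            try:
                target(bytes(data))
            except ValueError:
                pass                                     # expected parse error
            except Exception as exc:                     # unexpected: a finding
                findings.append((bytes(data), exc))
        return findings

    found = fuzz(lambda b: json.loads(b.decode("utf-8", "replace")), b'{"a": 1}')
    print(len(found), "inputs raised unexpected exceptions")

A robust parser should leave `findings` empty; anything that lands there is exactly the kind of unhandled error path the course is about.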

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:11 pm

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster by Justin Kestelyn.

From the post:

Most Hadoop tutorials take a piecemeal approach: they either focus on one or two components, or at best a segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise pragmatic.

This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle—from data ingestion through interactive data discovery—and does so while emphasizing the business questions concerned: What products do customers view on the Web, what do they like to buy, and is there a relationship between the two?

Getting those answers is a task that organizations with traditional infrastructure have been doing for years. However, the ones that bought into Hadoop do the same thing at greater scale, at lower cost, and on the same storage substrate (with no ETL, that is) upon which many other types of analysis can be done.

To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:

  • Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
  • Use Apache Avro to serialize/prepare that data for analysis
  • Create Apache Hive tables
  • Query those tables using Hive or Impala (via the Hue GUI)
  • Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts

I can’t imagine what “other” tutorials Justin has in mind. 😉

To be fair, I haven’t taken this particular tutorial. Which Hadoop tutorials would you suggest as comparisons to this one?

History of Apache Storm and lessons learned

Filed under: Marketing,Storm — Patrick Durusau @ 7:00 pm

History of Apache Storm and lessons learned by Nathan Marz.

From the post:

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It’s crazy to think that four years ago Storm was nothing more than an idea in my head, and now it’s a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.


The topics I will cover through Storm’s history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm’s history.

Every project/case is somewhat different, but this history of Storm is a relevant and great read!

I would highlight: It solves a useful problem.

I don’t read that to say:

  • It solves a problem I want to solve
  • It solves a problem you didn’t know you had
  • It solves a problem I care about
  • etc.

To be a “useful” problem, some significant segment of users must recognize it as a problem. If they don’t see it as a problem, then it doesn’t need a solution.

Boiling Sous-Vide Eggs using Clojure’s Transducers

Filed under: Clojure,Programming — Patrick Durusau @ 6:45 pm

Boiling Sous-Vide Eggs using Clojure’s Transducers by Stian Eikeland.

From the post:

I love cooking, especially geeky molecular gastronomy cooking, you know, the type of cooking involving scientific knowledge, equipment and ingredients like liquid nitrogen and similar. I already have a sous-vide setup, well, two actually (here is one of them: sousvide-o-mator), but I have none that run Clojure. So join me while I attempt to cook up some sous-vide eggs using the new transducers coming in Clojure 1.7. If you don’t know what transducers are about, take a look here before you continue.

To cook sous-vide we need to keep the temperature at a given point over time. For eggs, around 65C is pretty good. To do this we use a PID-controller.


I was hoping that Clojure wasn’t just of academic interest and would have some application in the “real world.” Now, proof arrives of real world relevance! 😉

For those of you who don’t easily recognize humor, I know that Clojure is used in many “real world” applications and situations. Comments to that effect will be silently deleted.

Whether the toast and trimmings were also prepared using Clojure the author does not say.
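For the curious, the control loop at the heart of the post, stripped of both eggs and Clojure, is just a PID update. A bare sketch (gains invented; the post’s real implementation uses Clojure transducers):

    def make_pid(setpoint, kp=2.0, ki=0.1, kd=0.5, dt=1.0):
        integral, prev_error = 0.0, 0.0
        def step(measured):
            nonlocal integral, prev_error
            error = setpoint - measured
            integral += error * dt
            derivative = (error - prev_error) / dt
            prev_error = error
            return kp * error + ki * integral + kd * derivative  # heater output
        return step

    controller = make_pid(65.0)                 # hold 65°C for the eggs
    for temp in (20.0, 40.0, 58.0, 64.0, 65.2):
        print(round(controller(temp), 1))       # output shrinks as temp approaches 65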

Magna Carta Ballot – Deadline 31 October 2014

Filed under: Contest — Patrick Durusau @ 4:46 pm

Win a chance to see all four original 1215 Magna Carta manuscripts together for the first time #MagnaCartaBallot

From the post:

Magna Carta is one of the world’s most influential documents. Created in 1215 by King John and his barons, it has become a potent symbol of liberty and the rule of law.

Eight hundred years later, all four surviving original manuscripts are being brought together for the first time on 3 February 2015. The British Library, Lincoln Cathedral and Salisbury Cathedral have come together to stage a one-off, one-day event sponsored by Linklaters.

This is your chance to be part of history as we give 1,215 people the unique opportunity to see all four Magna Carta documents at the British Library in London.

The unification ballot to win tickets is free to enter. The closing date is 31 October 2014.

According to the FAQ, you have to get yourself to London on the specified date and at the required time.

Good luck!

October 6, 2014

Bioinformatics tools extracted from a typical mammalian genome project

Filed under: Archives,Bioinformatics,Documentation,Preservation — Patrick Durusau @ 7:55 pm

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must read if you are interested in useful preservation of research and data. The paper focuses on needed improvements in bioinformatics but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

Vertex Meta- and Multi-Properties in TinkerPop3

Filed under: TinkerPop — Patrick Durusau @ 7:43 pm

Marko Rodriguez tweeted this link http://www.tinkerpop.com/docs/3.0.0.M3/#the-crew-toy-graph which takes you to a diagram and REPL work that demonstrates vertex meta- and multiproperties in TinkerPop3.

If you haven’t looked at the TinkerPop documentation in a while, make the time to do so.

TinkerPop 3.0.0.M3 Released (A Gremlin Rāga in 7/16 Time)

Filed under: Gremlin,TinkerPop — Patrick Durusau @ 7:30 pm

TinkerPop 3.0.0.M3 Released (A Gremlin Rāga in 7/16 Time) by Marko Rodriguez.

From the post:

TinkerPop 3.0.0.M3 has been released. This release has numerous core bug-fixes/optimizations/features. We were anxious to release M3 due to some changes in the Process API. These changes should not affect the user, only vendors providing a Gremlin language variant (e.g. Gremlin-Scala, Gremlin-JavaScript, etc.). From what I hear, it “just worked” for Gremlin-Scala so that is good. Here are links to the release:

CHANGELOG: https://github.com/tinkerpop/tinkerpop3/blob/master/CHANGELOG.asciidoc#tinkerpop-300m3-release-date-october-6-2014
AsciiDoc: http://www.tinkerpop.com/docs/3.0.0.M3/
JavaDoc: http://www.tinkerpop.com/javadocs/3.0.0.M3/
Downloads:
– Gremlin-Console: http://www.tinkerpop.com/downloads/3.0.0.M3/gremlin-console-3.0.0.M3.zip
– Gremlin-Server: http://www.tinkerpop.com/downloads/3.0.0.M3/gremlin-server-3.0.0.M3.zip

Are you going to accept Marko’s anecdotal assurance that it “just worked” for Gremlin-Scala, or will you put this release to the test? 😉

I am sure Marko and others would like to know!

Bossies 2014: The Best of Open Source Software Awards

Filed under: Open Source,Software — Patrick Durusau @ 4:30 pm

Bossies 2014: The Best of Open Source Software Awards by Doug Dineley.

From the post:

If you hadn’t noticed, we’re in the midst of an incredible boom in enterprise technology development — and open source is leading it. You’re unlikely to find better proof of that dynamism than this year’s Best of Open Source Software Awards, affectionately known as the Bossies.

Have a look for yourself. The result of months of exploration and evaluation, plus the recommendations of many expert contributors, the 2014 Bossies cover more than 130 award winners in six categories:

(emphasis added)

Hard to judge the count because winners are presented one page at a time in each category. Not to mention that at least one winner appears in two separate categories.

Put into lists and sorted for review we find:

Open source applications (16)

Open source application development tools (42)

Open source big data tools (20)

Open source desktop and mobile software (14)

Open source data center and cloud software (19)

Open source networking and security software (9)

Creating the list presentation allows us to discover that the actual count, allowing for entries with more than one software package mentioned, is 122 software packages.

BTW, Docker appears under application development tools and under data center and cloud software. Which should make the final count 121 different software packages. (You will have to check the entries at InfoWorld to verify that number.)

PS: The original presentation was in no discernible order. I put the lists into alphabetical order for ease of finding.

October 5, 2014

The Barrier of Meaning

Filed under: Artificial Intelligence,Computer Science,Meaning — Patrick Durusau @ 6:40 pm

The Barrier of Meaning by Gian-Carlo Rota.

The author discusses the “AI-problem” with Stanislaw Ulam. Ulam makes reference to the history of the “AI-problem” and then continues:

Well, said Stan Ulam, let us play a game. Imagine that we write a dictionary of common words. We shall try to write definitions that are unmistakeably explicit, as if ready to be programmed. Let us take, for instance, nouns like key, book, passenger, and verbs like waiting, listening, arriving. Let us start with the word “key.” I now take this object out of my pocket and ask you to look at it. No amount of staring at this object will ever tell you that this is a key, unless you already have some previous familiarity with the way keys are used.

Now look at that man passing by in a car. How do you tell that it is not just a man you are seeing, but a passenger?

When you write down precise definitions for these words, you discover that what you are describing is not an object, but a function, a role that is inextricably tied to some context. Take away that context, and the meaning also disappears.

When you perceive intelligently, as you sometimes do, you always perceive a function, never an object in the set-theoretic or physical sense.

Your Cartesian idea of a device in the brain that does the registering is based upon a misleading analogy between vision and photography. Cameras always register objects, but human perception is always the perception of functional roles. The two processes could not be more different.

Your friends in AI are now beginning to trumpet the role of contexts, but they are not practicing their lesson. They still want to build machines that see by imitating cameras, perhaps with some feedback thrown in. Such an approach is bound to fail since it starts out with a logical misunderstanding….

Should someone mention this to the EC Brain project?

BTW, you may be able to access this article at: Physica D: Nonlinear Phenomena, Volume 22, Issues 1–3, Pages 1-402 (October–November 1986), Proceedings of the Fifth Annual International Conference. For some unknown reason, the editorial board pages are $37.95, as are all the other articles, save for this one by Gian-Carlo Rota, which as of today is freely accessible.

The webpages say Physica D supports “open access.” I find that rather doubtful when only three (3) pages out of four hundred and two (402) require no payment. For material published in 1986.

You?

EcoData Retriever

Filed under: Data Repositories,Ecoinformatics — Patrick Durusau @ 6:01 pm

EcoData Retriever

From the webpage:

Most ecological datasets do not adhere to any agreed-upon standards in format, data structure or method of access. As a result acquiring and utilizing available datasets can be a time consuming and error prone process. The EcoData Retriever automates the tasks of finding, downloading, and cleaning up ecological data files, and then stores them in a local database. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days. Small datasets can be downloaded and installed in seconds and large datasets in minutes. The program also cleans up known issues with the datasets and automatically restructures them into standard formats before inserting the data into your choice of database management systems (Microsoft Access, MySQL, PostgreSQL, and SQLite, on Windows, Mac and Linux).

When faced with:

…datasets [that] do not adhere to any agreed-upon standards in format, data structure or method of access

you can:

  • Complain to fellow cube dwellers
  • Complain about data producers
  • Complain to the data producers
  • Create a solution to clean up and reformat the data as open source

Your choice?
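If you pick the fourth option, the core of such a tool is unglamorous: fetch, clean, load. A minimal sketch (the URL and column names are invented; the Retriever does this at scale, for known datasets, across several database systems):

    import csv
    import io
    import sqlite3
    import urllib.request

    URL = "https://example.org/some-ecological-dataset.csv"   # placeholder

    raw = urllib.request.urlopen(URL).read().decode("utf-8")
    rows = [
        (r["species"].strip().lower(), int(r["count"]))       # normalize as we go
        for r in csv.DictReader(io.StringIO(raw))
        if r.get("count", "").strip().isdigit()                # drop malformed rows
    ]

    db = sqlite3.connect("ecodata.sqlite")
    db.execute("CREATE TABLE IF NOT EXISTS obs (species TEXT, count INTEGER)")
    db.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    db.commit()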

I first saw this in a tweet by Dan McGlinn.

Gödel for Goldilocks…

Filed under: Mathematics,Philosophy — Patrick Durusau @ 3:27 pm

Gödel for Goldilocks: A Rigorous, Streamlined Proof of Gödel’s First Incompleteness Theorem, Requiring Minimal Background by Dan Gusfield.

Abstract:

Most discussions of Gödel’s theorems fall into one of two types: either they emphasize perceived philosophical “meanings” of the theorems, and maybe sketch some of the ideas of the proofs, usually relating Gödel’s proofs to riddles and paradoxes, but do not attempt to present rigorous, complete proofs; or they do present rigorous proofs, but in the traditional style of mathematical logic, with all of its heavy notation and difficult definitions, and technical issues which reflect Gödel’s original exposition and needed extensions by Gödel’s contemporaries. Many non-specialists are frustrated by these two extreme types of expositions and want a complete, rigorous proof that they can understand. Such an exposition is possible, because many people have realized that Gödel’s first incompleteness theorem can be rigorously proved by a simpler middle approach, avoiding philosophical discussions and hand-waiving at one extreme; and also avoiding the heavy machinery of traditional mathematical logic, and many of the harder details of Gödel’s original proof, at the other extreme. This is the just-right Goldilocks approach. In this exposition we give a short, self-contained Goldilocks exposition of Gödel’s first theorem, aimed at a broad audience.

Proof that even difficult subjects can be explained without “hand-waiving” or the “heavy machinery of traditional mathematical logic.”
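For reference, the theorem the exposition proves can be stated compactly (the phrasing below is mine, not Gusfield’s):

    % First incompleteness theorem, informally symbolized: for any
    % consistent, effectively axiomatized system $F$ strong enough to
    % express elementary arithmetic, there is a sentence $G_F$ with
    F \nvdash G_F \qquad\text{and}\qquad F \nvdash \lnot G_F .
    % (The right-hand half needs $\omega$-consistency in G\"odel's
    % original form; Rosser's refinement reduces that to plain consistency.)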

I first saw this in a tweet by Lars Marius Garshol.

Nobody Cares About Your “Billion Dollar Idea”

Filed under: Marketing — Patrick Durusau @ 10:19 am

Nobody Cares About Your “Billion Dollar Idea” by Gary Vaynerchuk.

From the post:

I have UNLIMITED ideas. If you have the idea that’s nice, but if you don’t have the dollars or the inventory, well then, you have nothing. So, the only way you can do something about that is to go ahead and get dollars from somebody.

An echo of Jack Park’s “Just f*cking do it!,” albeit in a larger forum.

What are you doing this week to turn your idea into a tangible reality?

October 4, 2014

JUNO

Filed under: Julia,Programming — Patrick Durusau @ 8:09 pm

JUNO: Juno is a powerful, free environment for the Julia language.

From the about page:

Juno began as an attempt to provide basic support for Julia in Light Table. I’ve been working on it over the summer as part of Google Summer of Code, and as the project has evolved it’s come closer to providing a full IDE for Julia, with a particular focus on providing a good experience for beginners.

The Juno plugin itself is essentially a thin wrapper which provides nice defaults; the core functionality is provided in a bunch of packages and plugins:

  • Julia-LT – which provides the basic language support for Julia in Light Table
  • Jewel.jl – A Julia source code analysis and manipulation library for Julia
  • June – Nicer themes and font defaults for LT
  • Reminisce – Sublime-style saving of files and content for LT

In case you have forgotten about Julia:

Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. The library, largely written in Julia itself, also integrates mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing. In addition, the Julia developer community is contributing a number of external packages through Julia’s built-in package manager at a rapid pace. IJulia, a collaboration between the IPython and Julia communities, provides a powerful browser-based graphical notebook interface to Julia.

Julia programs are organized around multiple dispatch; by defining functions and overloading them for different combinations of argument types, which can also be user-defined. For a more in-depth discussion of the rationale and advantages of Julia over other systems, see the following highlights or read the introduction in the online manual.
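The multiple-dispatch paragraph deserves a concrete gloss. Julia selects a method by the types of all arguments, natively and efficiently; the rough Python approximation below (a hand-rolled registry, all names invented) only shows the idea, since Python’s functools.singledispatch looks at the first argument alone.

    _methods = {}

    def defmethod(*types):
        """Register an implementation for a combination of argument types."""
        def register(fn):
            _methods[types] = fn
            return fn
        return register

    def collide(a, b):
        return _methods[(type(a), type(b))](a, b)   # dispatch on *both* types

    class Asteroid: pass
    class Ship: pass

    @defmethod(Asteroid, Ship)
    def _(a, s): return "ship destroyed"

    @defmethod(Asteroid, Asteroid)
    def _(a, b): return "asteroids merge"

    print(collide(Asteroid(), Ship()))       # ship destroyed
    print(collide(Asteroid(), Asteroid()))   # asteroids merge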

Curious to see if this project will follow Light Table onto the next IDE project, Eve.

Why Academics Stink at Writing [Programmers Too]

Filed under: Marketing,Writing — Patrick Durusau @ 7:45 pm

Why Academics Stink at Writing by Steven Pinker.

From the post:

Together with wearing earth tones, driving Priuses, and having a foreign policy, the most conspicuous trait of the American professoriate may be the prose style called academese. An editorial cartoon by Tom Toles shows a bearded academic at his desk offering the following explanation of why SAT verbal scores are at an all-time low: “Incomplete implementation of strategized programmatics designated to maximize acquisition of awareness and utilization of communications skills pursuant to standardized review and assessment of languaginal development.” In a similar vein, Bill Watterson has the 6-year-old Calvin titling his homework assignment “The Dynamics of Inter­being and Monological Imperatives in Dick and Jane: A Study in Psychic Transrelational Gender Modes,” and exclaiming to Hobbes, his tiger companion, “Academia, here I come!”

Steven’s analysis applies mostly to academic writing styles, although I have suffered through more than one tome in CS that apologizes for some topic X being in another chapter. Enough already, just get on with it. Such books need severe editing, which would leave them shorter and easier to read.

Worth the read if you try to identify issues in your own writing style. Identifying errors in the writing style of others won’t improve your writing.

I first saw this in a tweet by Steven Strogatz.

PS: Being able to communicate effectively with others is essential to marketing yourself or your products/services.

You Don’t Have to Be Google to Build an Artificial Brain

Filed under: Artificial Intelligence,Deep Learning — Patrick Durusau @ 7:24 pm

You Don’t Have to Be Google to Build an Artificial Brain by Cade Metz.

From the post:

When Google used 16,000 machines to build a simulated brain that could correctly identify cats in YouTube videos, it signaled a turning point in the art of artificial intelligence.

Applying its massive cluster of computers to an emerging breed of AI algorithm known as “deep learning,” the so-called Google brain was twice as accurate as any previous system in recognizing objects pictured in digital images, and it was hailed as another triumph for the mega data centers erected by the kings of the web.

But in the middle of this revolution, a researcher named Alex Krizhevsky showed that you don’t need a massive computer cluster to benefit from this technology’s unique ability to “train itself” as it analyzes digital data. As described in a paper published later that same year, he outperformed Google’s 16,000-machine cluster with a single computer—at least on one particular image recognition test.

This was a rather expensive computer, equipped with large amounts of memory and two top-of-the-line cards packed with myriad GPUs, a specialized breed of computer chip that allows the machine to behave like many. But it was a single machine nonetheless, and it showed that you didn’t need a Google-like computing cluster to exploit the power of deep learning.

Cade’s article should encourage you to do two things:

  • Learn GPUs cold
  • Ditto on Deep Learning

Google and others will always have more raw processing power than any system you are likely to afford. However, while a steam shovel can shovel a lot of clay, it takes a real expert to make a vase. Particularly a very good one.

Do you want to pine for a steam shovel or work towards creating a fine vase?

PS: Google isn’t building “an artificial brain,” not anywhere close. That’s why all their designers, programmers and engineers are wetware.

General Theory of Natural Equivalences [Category Theory – Back to the Source]

Filed under: Category Theory — Patrick Durusau @ 7:08 pm

General Theory of Natural Equivalences by Samuel Eilenberg and Saunders MacLane. (1945)

While reading the Stanford Encyclopedia of Philosophy entry on category theory, I was reminded that despite seeing the citation Eilenberg and MacLane, General Theory of Natural Equivalences, 1945 uncounted times, I have never attempted to read the original paper.

Considering I once took a graduate seminar on running biblical research back to original sources (as nearly as possible), that is a severe oversight on my part. An article comes to mind that proposed inserting several glyphs into a particular inscription. Plausible, until you look at the tablet in question and realize perhaps one glyph could be restored, but not two or three.

It has been my experience that was not a unique case nor is it limited to biblical studies.
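For anyone who does go back to the source, the paper’s central definition is compact enough to carry along. In modern notation (mine, not the 1945 paper’s):

    % A natural transformation $\eta : F \Rightarrow G$ between functors
    % $F, G : \mathcal{C} \to \mathcal{D}$ assigns to each object $A$ of
    % $\mathcal{C}$ a morphism $\eta_A : F(A) \to G(A)$ such that, for
    % every $f : A \to B$ in $\mathcal{C}$, the naturality square commutes:
    G(f) \circ \eta_A \;=\; \eta_B \circ F(f).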

Category Theory (Stanford Encyclopedia of Philosophy)

Filed under: Category Theory,Philosophy — Patrick Durusau @ 4:50 pm

Category Theory (Stanford Encyclopedia of Philosophy)

From the entry:

Category theory has come to occupy a central position in contemporary mathematics and theoretical computer science, and is also applied to mathematical physics. Roughly, it is a general mathematical theory of structures and of systems of structures. As category theory is still evolving, its functions are correspondingly developing, expanding and multiplying. At minimum, it is a powerful language, or conceptual framework, allowing us to see the universal components of a family of structures of a given kind, and how structures of different kinds are interrelated. Category theory is both an interesting object of philosophical study, and a potentially powerful formal tool for philosophical investigations of concepts such as space, system, and even truth. It can be applied to the study of logical systems in which case category theory is called “categorical doctrines” at the syntactic, proof-theoretic, and semantic levels. Category theory is an alternative to set theory as a foundation for mathematics. As such, it raises many issues about mathematical ontology and epistemology. Category theory thus affords philosophers and logicians much to use and reflect upon.

Several tweets contained “category theory” and links to this entry in the Stanford Encyclopedia of Philosophy. The entry was substantially revised as of October 3, 2014, but I don’t see a mechanism that allows discovery of changes to the prior text.

For a PDF version of this entry (or other entries), join the Friends of the SEP Society. The cost is quite modest and the SEP is an effort that merits your support.

As a reading/analysis exercise, treat the entries in SEP as updates to Copleston‘s History of Philosophy:

A History of Philosophy 1: Greece and Rome

A History of Philosophy 2: Medieval

A History of Philosophy 3: Late Medieval and Renaissance

A History of Philosophy 4: Modern: Descartes to Leibniz

A History of Philosophy 5: Modern British, Hobbes to Hume

A History of Philosophy 6: Modern: French Enlightenment to Kant

A History of Philosophy 7: Modern: Post-Kantian Idealists to Marx, Kierkegaard and Nietzsche

A History of Philosophy 8: Modern: Empiricism, Idealism, Pragmatism in Britain and America

A History of Philosophy 9: Modern: French Revolution to Sartre, Camus, Lévi-Strauss

Enjoy!

SIGGRAPH 2014 Open Access Conference Content

Filed under: Open Access — Patrick Durusau @ 10:23 am

SIGGRAPH 2014 Open Access Conference Content

From the webpage:

Starting with the SIGGRAPH 2014 conference, SIGGRAPH will not produce any printed or DVD-based documentation. Conference content (technical papers, course notes, etc.) will be available for free in the ACM Digital Library starting two weeks prior to the start of the conference, and will remain available for free until one week after the end of the conference. After the one-month “free access” period, and until the start of the next SIGGRAPH conference, this content will be available for free exclusively through the open access links below. ACM SIGGRAPH members always have free access to all SIGGRAPH-sponsored materials in the ACM Digital Library.

Bizarre wording but I can attest that SIGGRAPH Proceedings from 2013 are open access and SIGGRAPH 2014 commenced on 10 August 2014.

Ask the next slate of ACM candidates whether the leading CS organization in the world will become open access, without qualifiers and exceptions.

October 3, 2014

Open Challenges for Data Stream Mining Research

Filed under: BigData,Data Mining,Data Streams,Text Mining — Patrick Durusau @ 4:58 pm

Open Challenges for Data Stream Mining Research, SIGKDD Explorations, Volume 16, Number 1, June 2014.

Abstract:

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, over-looking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Under entity stream mining, the authors describe the challenge of aggregation:

The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of individual streams? How to learn over the streams efficiently? Answering those questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

Sounds remarkably like an issue for topic maps, doesn’t it? Well, not topic maps in the sense that every entity has an IRI subjectIdentifier, but in the sense that merging rules define the basis on which two or more entities are considered to represent the same subject.
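A toy version of that aggregation task (keys, fields, and the merging rule all invented): events about one entity arrive on several streams under different keys, and a merging rule folds them into a single evolving record.

    from collections import defaultdict

    same_entity = {"cust-17": "E1", "ip-10.0.0.9": "E1"}   # merging rule

    state = defaultdict(dict)

    def ingest(stream_key, t, attrs):
        entity = same_entity.get(stream_key, stream_key)   # resolve identity
        state[entity].update(attrs, last_seen=t)           # aggregate in place

    ingest("cust-17", 1, {"plan": "pro"})
    ingest("ip-10.0.0.9", 2, {"requests": 40})
    print(dict(state))  # {'E1': {'plan': 'pro', 'last_seen': 2, 'requests': 40}}

As with the merging example earlier on this page, everything turns on where `same_entity` comes from and how it is maintained as streams drift.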

The entire issue is on “big data” and if you are looking for research “gaps,” it is a great starting point. Table of Contents: SIGKDD explorations, Volume 16, Number 1, June 2014.

I included the TOC link because, for reasons known only to staff at the ACM, the articles in this issue don’t show up in the library index. One of the many “features” of the ACM Digital Library.

That is in addition to the committee which oversees the Digital Library being undisclosed to members and available for contact only through staff.
