Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 10, 2014

Apache Hadoop 2.4.0 Released!

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 6:20 pm

Apache Hadoop 2.4.0 Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.4.0! Thank you to every single one of the contributors, reviewers and testers!

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS (HDFS-4685)
  • Native support for Rolling Upgrades in HDFS (HDFS-5535)
  • Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
  • Full HTTPS support for HDFS (HDFS-5305)
  • Support for Automatic Failover of the YARN ResourceManager (YARN-149) (a.k.a Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
  • Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

And of course:

Links

See Arun’s post for more details or just jump to the download links.

Scalding 0.9: Get it while it’s hot!

Filed under: Hadoop,MapReduce,Scalding,Tweets — Patrick Durusau @ 6:11 pm

Scalding 0.9: Get it while it’s hot! by P. Oscar Boykin.

From the post:

It’s been just over two years since we open sourced Scalding and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal and external facing dashboards, to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.

Oscar covers:

  • Joins
  • Input/output
    • Parquet Format
    • Avro
    • TemplateTap
  • Hadoop counters
  • Typed API
  • Matrix API

Or if you want something a bit more visual and just as enthusiastic, see Oscar’s video presentation: basically the same content but with Oscar live!

Binary Search Trees (Clojure)

Filed under: Binary Search,Clojure,Search Trees,Trees — Patrick Durusau @ 4:38 pm

Data Structures in Clojure: Binary Search Trees by Max Countryman.

From the post:

Trees Everywhere

So far we have talked about two fundamental and pervasive data structures: linked lists and hash tables. Here again we discuss another important data structure and one that you will find is quite common: trees. Trees offer a powerful way of organizing data and approaching certain problems. In particular, searching and traversal. Whether you know it or not, you no doubt use trees in your programs today. For instance, Clojure’s vectors are backed by a special kind of tree!

Here we will construct our own tree, just like with our linked list and hash table implementations. Specifically, our tree will be a kind of tree known as a Binary Search Tree (BST). Often when someone says tree, they mean a BST.

We will look at the basic structure of our tree, how we insert things into it, and how we find them again. Then we will explore traversing, and finally, removing nodes. At the end of this tutorial you will have a basic, functioning Binary Search Tree, which will be the basis for further explorations later on in this series.

Another installment by Max on data structures in Clojure.
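If you want to see the shape of the algorithm before working through the Clojure, here is a minimal insert-and-find sketch (my own illustration in Python, not code from Max’s post; his version also covers traversal and removal):

    # Minimal binary search tree: insert and find only.
    # Keys smaller than a node go left, larger keys go right.

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def insert(node, key, value):
        """Return the root of the tree with (key, value) inserted."""
        if node is None:
            return Node(key, value)
        if key < node.key:
            node.left = insert(node.left, key, value)
        elif key > node.key:
            node.right = insert(node.right, key, value)
        else:
            node.value = value  # existing key: replace the value
        return node

    def find(node, key):
        """Return the value stored under key, or None if absent."""
        while node is not None:
            if key < node.key:
                node = node.left
            elif key > node.key:
                node = node.right
            else:
                return node.value
        return None

    root = None
    for k, v in [(5, "e"), (2, "b"), (8, "h"), (1, "a")]:
        root = insert(root, k, v)
    print(find(root, 8))   # "h"
    print(find(root, 42))  # None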

Enjoy!

Thinking for Programmers

Filed under: Programming — Patrick Durusau @ 2:35 pm

Thinking for Programmers by Leslie Lamport.

From the webpage:

Leslie Lamport, inventor of Paxos and developer of LaTeX, introduces techniques and tools that help programmers think above the code level to determine what applications and services should do and ensure that they do it. Depending on the task, the appropriate tools can range from simple prose to formal, tool-checked models written in TLA+ or PlusCal.

Deeply impressive but marred by the lack of video and the slides from the presentation.

My favorite quote from the presentation:

“Writing is nature’s way of letting you know how sloppy your thinking is.” Guindon.

Leslie talks about his use of TLA, which:

stands for the Temporal Logic of Actions, but it has become a shorthand for referring to the TLA+ specification language and the PlusCal algorithm language, together with their associated tools.

He advocates creating and debugging specifications as opposed to debugging code.

Would better design avoid future Heartbleed bugs?

April 9, 2014

IRS Data?

Filed under: Government,Government Data,Open Access,Open Data — Patrick Durusau @ 7:45 pm

New, Improved IRS Data Available on OpenSecrets.org by Robert Maguire.

From the post:

Among the more than 160,000 comments the IRS received recently on its proposed rule dealing with candidate-related political activity by 501(c)(4) organizations, the Center for Responsive Politics was the only organization to point to deficiencies in a critical data set the IRS makes available to the public.

This month, the IRS released the newest version of that data, known as 990 extracts, which have been improved considerably. Now, the data is searchable and browseable on OpenSecrets.org.

“Abysmal” IRS data

Back in February, CRP had some tough words for the IRS concerning the information. In the closing pages of our comment on the agency’s proposed guidelines for candidate-related political activity, we wrote that “the data the IRS provides to the public — and the manner in which it provides it — is abysmal.”

While I am glad to see better access to 501(c) 990 data, in a very real sense this isn’t “IRS data,” is it?

This is data that the government collected under penalty of law from tax entities in the United States.

Granted, it was sent in “voluntarily,” but there is a lot of data that entities and individuals send to local, state and federal government “voluntarily.” Not all of it is data that most of us would want handed out because other people are curious.

As I said, I like better access to 990 data but we need to distinguish between:

  1. Government sharing data it collected from citizens or other entities, and
  2. Government sharing data about government meetings, discussions, contacts with citizens/contractors, policy making, processes and the like.

If I’m not seriously mistaken, most of the open data from government involves a great deal of #1 and very little of #2.

Is that your impression as well?

One quick example. The United States Congress, with some reluctance, seems poised to deliver near real-time information on legislative proposals before Congress. Which is a good thing.

But there has been no discussion of tracking the final editing of bills to trace the insertion or deletion of language, by whom and with whose agreement. Which is a bad thing.

It makes no difference how public the process is up to final edits, if the final version is voted upon before changes can be found and charged to those responsible.

clortex

Filed under: Clojure,Neural Information Processing,Neural Networks,Neuroinformatics — Patrick Durusau @ 7:24 pm

clortex – Clojure Library for Jeff Hawkins’ Hierarchical Temporal Memory

From the webpage:

Hierarchical Temporal Memory (HTM) is a theory of the neocortex developed by Jeff Hawkins in the early-mid 2000’s. HTM explains the working of the neocortex as a hierarchy of regions, each of which performs a similar algorithm. The algorithm performed in each region is known in the theory as the Cortical Learning Algorithm (CLA).

Clortex is a reimagining and reimplementation of the Numenta Platform for Intelligent Computing (NuPIC), which is also an Open Source project released by Grok Solutions (formerly Numenta), the company founded by Jeff to make his theories a practical and commercial reality. NuPIC is a mature, excellent and useful software platform, with a vibrant community, so please join us at Numenta.org.

Warning: pre-alpha software. This project is only beginning, and everything you see here will eventually be thrown away as we develop better ways to do things. The design and the APIs are subject to drastic change without a moment’s notice.

Clortex is Open Source software, released under the GPL Version 3 (see the end of the README). You are free to use, copy, modify, and redistribute this software according to the terms of that license. For commercial use of the algorithms used in Clortex, please contact Grok Solutions, where they’ll be happy to discuss commercial licensing.

An interesting project, both in terms of learning theory and for the requirements placed on the software implementing the theory.

The first two requirements capture the main points:

2.1 Directly Analogous to HTM/CLA Theory

In order to be a platform for demonstration, exploration and experimentation of Jeff Hawkins’ theories, the system must at all levels of relevant detail match the theory directly (ie 1:1). Any optimisations introduced may only occur following an effectively mathematical proof that this correspondence is maintained under the change.

2.2 Transparently Understandable Implementation in Source Code

All source code must at all times be readable by a non-developer. This can only be achieved if a person familiar with the theory and the models (but not a trained programmer) can read any part of the source code and understand precisely what it is doing and how it is implementing the algorithms.

This requirement is again deliberately very stringent, and requires the utmost discipline on the part of the developers of the software. Again, there are several benefits to this requirement.

Firstly, the extreme constraint forces the programmer to work in the model of the domain rather than in the model of the software. This constraint, by being adhered to over the lifecycle of the project, will ensure that the only complexity introduced in the software comes solely from the domain. Any other complexity introduced by the design or programming is known as incidental complexity and is the cause of most problems in software.

Secondly, this constraint provides a mechanism for verifying the first requirement. Any expert in the theory must be able to inspect the code for an aspect of the system and verify that it is transparently analogous to the theory.

Despite my misgivings about choosing the domain in which you stand, I found it interesting that the project recognizes that the domain of its theory and the domain of the software implementing that theory are separate and distinct.

How closely two distinct domains can be mapped one to the other should be an interesting exercise.

BTW, some other resources you will find helpful:

NuPIC: Numenta Platform for Intelligent Computing

Cortical Learning Algorithm (CLA) white paper in eight languages.

Real Machine Intelligence with Clortex and NuPIC (book)

Scaling Graphs

Filed under: Graphics,Humor — Patrick Durusau @ 4:28 pm

Fox News (image omitted)

If you ever wonder why your data stream is “dirty,” I have an explanation.

I first saw this in a tweet by Scott Chamberlain.

BumbleBee, a tool for spreadsheet formula transformations

Filed under: Excel,Functional Programming,Programming,Spreadsheets — Patrick Durusau @ 4:13 pm

BumbleBee, a tool for spreadsheet formula transformations by Felienne Hermans.

From the webpage:

Some spreadsheets can be improved

While looking at spreadsheets and how they are used over the past years, I have noticed that many users don’t make their spreadsheets as easy as they could be. For instance, they use A1+A2+A3+A4+A5 instead of the simpler SUM(A1:A5). Sometimes this is because they are unaware of a simpler construct, or because the spreadsheet evolved over time: for instance, it used to be A1+A2, then A3 was added, and so forth. Such complex formulas were exactly the aim of our previous work on smell detection.

If you say smell, you say… refactorings!

So in order to improve spreadsheets, we and other researchers have developed a number of refactorings to improve spreadsheet formulas. Over the last few months, I have been working on BumbleBee, a tool to perform not only refactorings, but more general transformations on spreadsheet formulas.

An update on her work on refactoring spreadsheets, along with a BumbleBee paper preprint and an installer for Excel 2010.
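To make the A1+A2+A3+A4+A5 to SUM(A1:A5) example concrete, here is a toy version of that one rewrite (my own Python sketch of the kind of transformation involved, not how BumbleBee itself is implemented):

    import re

    CELL = re.compile(r"^([A-Z]+)(\d+)$")

    def collapse_to_sum(formula):
        """Rewrite 'A1+A2+...+An' as 'SUM(A1:An)' when the terms are a
        contiguous run of cells in a single column; otherwise return the
        formula unchanged."""
        terms = [t.strip() for t in formula.split("+")]
        matches = [CELL.match(t) for t in terms]
        if len(matches) < 3 or not all(matches):
            return formula
        cols = {m.group(1) for m in matches}
        rows = [int(m.group(2)) for m in matches]
        if len(cols) == 1 and rows == list(range(rows[0], rows[0] + len(rows))):
            col = cols.pop()
            return "SUM({0}{1}:{0}{2})".format(col, rows[0], rows[-1])
        return formula

    print(collapse_to_sum("A1+A2+A3+A4+A5"))  # SUM(A1:A5)
    print(collapse_to_sum("A1+B2+A3"))        # unchanged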

Imagine that: going to where users are already working with data.

This could prove to be explosively popular.

Anatomy of a data leakage bug…

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:49 pm

Anatomy of a data leakage bug – the OpenSSL “heartbleed” buffer overflow by Paul Ducklin.

From the post:

An information disclosure vulnerability has been found, and promptly patched, in OpenSSL.

OpenSSL is a very widely used encryption library, responsible for putting the S in HTTPS, and the padlock in the address bar, for many websites.

The bug only exists in the OpenSSL 1.0.1 source code (from version 1.0.1 to 1.0.1f inclusive), because the faulty code relates to a fairly new feature known as the TLS Heartbeat Extension.

The heartbeat extension was first documented in RFC 6520 in February 2012.

TLS heartbeats are used as “keep alive” packets so that the ends of an encrypted connection can agree to keep the session open even when they don’t have any official data to exchange.

Because the heartbeats consist of a request and a matching response, they allow either end to confirm not only that the session is open, but also that end-to-end connectivity is working properly.

Paul goes on to give you a detailed description of the bug.
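The shape of the bug class is easy to sketch. This is not OpenSSL’s code (that is C, and Paul’s article covers the real thing); it is just a minimal Python illustration of echoing back a sender-claimed length without checking it against the bytes actually received:

    import struct

    def handle_heartbeat(memory, record_offset, record_len, check_length):
        """`memory` stands in for the process heap; the heartbeat record
        occupies memory[record_offset:record_offset + record_len].
        Simplified layout: 2-byte big-endian claimed length, then payload."""
        (claimed,) = struct.unpack_from(">H", memory, record_offset)
        payload_off = record_offset + 2
        actual = record_len - 2
        if check_length and claimed > actual:
            return b""  # the fix: silently discard heartbeats that lie
        # The bug: copy `claimed` bytes, running past the real payload and
        # leaking whatever happens to sit next to it in memory.
        return memory[payload_off:payload_off + claimed]

    heap = b"\x00\x30" + b"hi" + b" ...session keys and other heap data..."
    print(handle_heartbeat(heap, 0, 4, check_length=False))  # leaks the rest
    print(handle_heartbeat(heap, 0, 4, check_length=True))   # returns b""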

If you are interested in experimenting with joern to find bugs in source code, checking unpatched source code of OpenSSL should be good practice.

Once you identify the pattern, where else can you find examples of it?

Revealing the Uncommonly Common…

Filed under: Algorithms,ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:34 pm

Revealing the Uncommonly Common with Elasticsearch by Mark Harwood.

From the summary:

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly tagged movies and the UK’s most unexpected hotspot for weapon possession.
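The Elasticsearch feature behind the talk is, as far as I can tell, the significant_terms aggregation added in the 1.x releases, which surfaces terms that are unusually frequent in a subset compared to the index as a whole. A rough sketch of such a query (the index name, field names and query value below are invented for illustration):

    import json
    import urllib.request

    # Find crime types that are unusually frequent for one police force
    # compared to the index as a whole.
    query = {
        "query": {"match": {"force": "avon-and-somerset"}},
        "aggregations": {
            "unusual_crime_types": {
                "significant_terms": {"field": "crime_type"}
            }
        },
        "size": 0,
    }

    req = urllib.request.Request(
        "http://localhost:9200/crimes/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.dumps(json.loads(resp.read()), indent=2))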

Makes me curious: is there a market for a “Mr./Ms. Normal” service?

A service that lets you enter your real viewing/buying/entertainment preferences and, for a fee, generates a paper trail that hides your real habits in digital dust.

If you order porn from Netflix, then the “Mr./Ms. Normal” service will order enough PBS and NatGeo material to even out your renting record.

Depending on how extreme your buying habits happen to be, you may need a “Mr./Ms. Abnormal” service that shields you from any paper trail at all.

As data surveillance grows, having a pre-defined Mr./Ms. Normal/Abnormal account may become a popular high school/college graduation or even a wedding present.

The usefulness of data surveillance depends on the cooperation of its victims. Have you ever considered not cooperating? But appearing to?

Glasgow Haskell Compiler — version 7.8.1

Filed under: Functional Programming,Haskell — Patrick Durusau @ 2:17 pm

The (Interactive) Glasgow Haskell Compiler — version 7.8.1

From the announcement:

The GHC Team is pleased to announce a new major release of GHC. There
have been a number of significant changes since the last major release,
including:

  • New type-system features
    • Closed type families
    • Role checking
    • An improved solver for type naturals
  • Better support for cross compilation
  • Full iOS support
  • Massive scalability improvements to the I/O manager
  • Dynamic linking for GHCi
  • Several language improvements
    • Pattern synonyms
    • Overloaded list syntax
    • Kind-polymorphic ‘Typeable’ class
  • A new parallel --make mode
  • Preliminary SIMD intrinsic support
  • A brand-new low level code generator
  • Many bugfixes and other performance improvements.

The full release notes are here:

http://haskell.org/ghc/docs/7.8.1/html/users_guide/release-7-8-1.html

Other links:

http://www.haskell.org/ghc/ Downloads.

http://www.haskell.org/ Haskell homepage

April has gotten off to a good start; now I’m wondering what else is coming later in April.

Erlang OTP 17.0 Released!

Filed under: Erlang,Functional Programming — Patrick Durusau @ 2:00 pm

Erlang OTP 17.0 has been released

From the news release:

Erlang/OTP 17.0 is a new major release with new features, characteristics improvements, as well as some minor incompatibilities. See the README file and the documentation for more details.

Among other things, the README file reports:

OTP-11719

The default encoding of Erlang files has been changed from ISO-8859-1 to UTF-8.

The encoding of XML files has also been changed to UTF-8.

A reminder that supporting UTF-8 as UTF-8 is greatly preferred.

Structure and Interpretation of Computer Programs (OCW)

Filed under: Lisp,Programming — Patrick Durusau @ 1:40 pm

Structure and Interpretation of Computer Programs (MIT OCW)

From the course description:

This course introduces students to the principles of computation. Upon completion of 6.001, students should be able to explain and apply the basic methods from programming languages to analyze computational systems, and to generate computational solutions to abstract problems. Substantial weekly programming assignments are an integral part of the course. This course is worth 4 Engineering Design Points.

Twenty lecture videos of Abelson and Sussman, twenty-four sets of lecture notes, a couple of tests, and while there I found:

Structure and Interpretation of Computer Programs Tutor

This is a public implementation of the online tutor for the MIT course Structure and Interpretation of Computer Programs (MIT course 6.001).

I’m not sure how helpful you will find it. Apparently it was part of an experiment that was abandoned some time ago. Still, it has slides, problem sets, links to other materials, etc., but I was unable to find the promised audio files.

The main course should be cited as:

Grimson, Eric, Peter Szolovits, and Trevor Darrell. 6.001 Structure and Interpretation of Computer Programs, Spring 2005. (MIT OpenCourseWare: Massachusetts Institute of Technology), http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-001-structure-and-interpretation-of-computer-programs-spring-2005 (Accessed 9 Apr, 2014). License: Creative Commons BY-NC-SA

Enjoy!

Learning Lisp With C

Filed under: C/C++,Functional Programming,Lisp,Programming — Patrick Durusau @ 12:53 pm

Build Your Own Lisp by Daniel Holden.

From the webpage:

If you’re looking to learn C, or you’ve ever wondered how to build your own programming language, this is the book for you.

In just a few lines of code, I’ll teach you how to effectively use C, and what it takes to start building your own language.

Along the way we’ll learn about the weird and wonderful nature of Lisps, and what really makes a programming language. By building a real world C program we’ll learn implicit things that conventional books cannot teach. How to develop a project, how to make life easy for your users, and how to write beautiful code.

This book is free to read online. Get started now!

Read Online!

This looks interesting and useful.

Enjoy!

April 8, 2014

Blog Odometer Reads: 10,000 (with this post)

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:11 pm

I haven’t been posting as heavily every day for the last week or so. Mostly because I wanted to have something special for post #10,000. That “something special” is still a couple of weeks away but I do have observations to mark post #10,000 on this blog.

First and foremost, I have been deeply impressed with the variety of projects seeking to make information easier to retrieve, use and archive. Those are just the ones I managed to find and post about. I have literally missed thousands of others. My apologies for missing any of your favorite projects and consider this an open invitation to send them to my attention: patrick@durusau.net.

Second, I have been equally saddened by the continued use of names as semantic primitives, that is without any basis for comparison to other names. A name for an element or attribute may be “transparent” to some observers today, but what about ten (10) years from now? Or one hundred (100) years from now? Many of our “classic” texts survive in only one copy or even multiple fragments. Do you really want to rely on chance documenting of data?

Thousands if not hundreds of thousands of people saw the pyramids being built. Such common knowledge they never bothered to write down how it was done. Are you trusting mission critical applications with the same level of documentation?

Third, the difference between semantic projects that flourish and less successful projects isn’t technology, syntax, or an array of vendors leading the band. Rather, the difference is one of ROI (return on investment). If your solution requires decades of investment by third parties who may or may not choose to participate, then however clever it is, it is DOA.

Despite my deep interest in complex and auditable identity based information systems, those aren’t going to be market leaders. Weapons manufacturers, research labs, biomedical, governments and/or wannabe governments are their natural markets.

The road to less complex, and perhaps in some ways unauditable, identity-based information systems has to start with the question: which subjects are you not going to identify? It’s a perfectly legitimate choice to make and one I would be asking about in the world of big data.

You need to know which subjects are documented and which are not, as a conscious decision. Unless you don’t mind paying IT to reconstruct what might have been meant by a former member of the IT staff.

Fourth, the world of subjects and the “semantic impedance” that Steve Newcomb identified so long ago, is increasing at an exponential rate.

Common terminologies or vocabularies emerge in some fields but even there the question of access to legacy data remains. Not to mention that “legacy” is a term that moves a frame behind our current stage of progress.

Windows XP, used by 95% of bank ATMs, becomes unsupported as of today. In twelve short years XP has gone from being “new” software, to standard software, to legacy software, and in not too many years, to dead software.

What are your estimates for the amount of data that will die with Windows XP? For maximum impact, give your estimate in terms of equivalents to the Library at Alexandria. (An unknown amount but it has as much validity as many government and RIAA estimates.)

Finally, as big data and data processing power grows, the need and opportunity for using data from diverse sources grows. Is that going to be your opportunity or the opportunity someone else has to sell you their view of semantics?

I am far more interested in learning and documenting the semantics of you and your staff than creating alien semantics to foist on a workforce (FBI Virtual Case Management project) or trying to boil some tide pool of the semantic ocean (RDF).

You can document your semantics where there is a business, scientific, or research ROI, or you can document someone else’s semantics about your data.

Your call.


If you have found the material in this blog helpful (I hope so), please consider making a donation or send me a book via Amazon or your favorite bookseller.

I have resisted carrying any advertising because I find it distracting at best and at worst it degrades the information content of the blog. Others have made different choices and that’s fine, for their blogs.

April 7, 2014

Conway’s Game of Life in Clojure

Filed under: Cellular Automata,Game of Life — Patrick Durusau @ 7:44 pm

Conway’s Game of Life in Clojure by Charles Ditzel.

Charles has collected links to three separate implementations of the “Game of Life” (aka cellular automata).
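For reference, the whole game fits in a few lines; the Clojure versions Charles links to are variations on this same neighbour-counting idea (sketched here in Python, with live cells kept in a set):

    from collections import Counter

    def neighbours(cell):
        x, y = cell
        return [(x + dx, y + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0)]

    def step(live):
        """One generation: a cell is live next turn if it has 3 live
        neighbours, or 2 and is already live."""
        counts = Counter(n for cell in live for n in neighbours(cell))
        return {cell for cell, c in counts.items()
                if c == 3 or (c == 2 and cell in live)}

    glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
    for _ in range(4):
        glider = step(glider)
    print(sorted(glider))  # the same glider, shifted one cell diagonally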

Before you dismiss cellular automata as “just graphics,” you might want to remember that Stephen Wolfram, the inventor of Mathematica, is a long-time CA enthusiast.

I’m not saying there is a strong connection between those facts but it seems foolish to presume there is none at all.

April 6, 2014

Eight (No, Nine!) Problems With Big Data

Filed under: BigData — Patrick Durusau @ 7:34 pm

Eight (No, Nine!) Problems With Big Data by Gary Marcus and Ernest Davis.

From the post:

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
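The correlation point is easy to reproduce with invented numbers: any two series that merely trend in the same direction will correlate strongly. A quick sketch (the values below are made up, not the actual murder or browser figures):

    # Two invented, steadily declining series: they correlate strongly,
    # yet neither has anything to do with the other.
    murder_rate = [5.8, 5.7, 5.7, 5.4, 5.0, 4.8]        # made-up "2006-2011" values
    ie_share    = [68.0, 62.0, 55.0, 46.0, 38.0, 32.0]  # made-up "2006-2011" values

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    print(round(pearson(murder_rate, ie_share), 2))  # about 0.96: strong, and meaningless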

If you or your manager is drinking the “big data” kool-aid you may want to skip this article. Or if you stand to profit from the sale of “big data” appliances and/or services.

No point in getting confused about issues your clients aren’t likely to raise.

On the other hand, if you are a government employee who is tired of seeing the public coffers robbed for less than useful technology, you probably need to print out this article by Marcus and Davis.

Don’t quote from it but ask questions about any proposed “big data” project from each of the nine problem areas.

“Big data” and its tools have a lot of potential.

But consumers are responsible for making sure that potential isn’t realized at the expense of their own pocketbooks.

Perhaps “caveat emptor” should now be written: “CAVEAT EMPTOR (Big Data).”

What do you think?

April 5, 2014

GeoCanvas

Filed under: Geographic Data,Geography,Maps,Visualization — Patrick Durusau @ 7:34 pm

Synthicity Releases 3D Spatial Data Visualization Tool, GeoCanvas by Dean Meyers.

From the post:

Synthicity has released a free public beta version of GeoCanvas, its 3D spatial data visualization tool. The software provides a streamlined toolset for exploring geographic data, lowering the barrier to learning and using geographic information systems.

GeoCanvas is not limited to visualizing parcels in cities. By supporting data formats such as the widely available shapefile for spatial geometry and text files for attribute data, it opens the possibility of rapid 3D spatial data visualization for a wide range of uses and users. The software is expected to be a great addition to the toolkits of students, researchers, and practitioners in fields as diverse as data science, geography, planning, real estate analysis, and market research. A set of video tutorials explaining the basic concepts and a range of examples have been made available to showcase the possibilities.

The public beta version of GeoCanvas is available as a free download from www.synthicity.com.

Well, rats! I haven’t installed a VM with Windows 7/8 or Mac OS X 10.8 or later.

Sounds great!

Comments from actual experience?

Algorithmic Number Theory, Vol. 1: Efficient Algorithms

Filed under: Algebra,Algorithms,Mathematics — Patrick Durusau @ 7:20 pm

Algorithmic Number Theory, Vol. 1: Efficient Algorithms by Eric Bach and Jeffrey Shallit.

From the preface:

This is the first volume of a projected two-volume set on algorithmic number theory, the design and analysis of algorithms for problems from the theory of numbers. This volume focuses primarily on those problems from number theory that admit relatively efficient solutions. The second volume will largely focus on problems for which efficient algorithms are not known, and applications thereof.

We hope that the material in this book will be useful for readers at many levels, from the beginning graduate student to experts in the area. The early chapters assume that the reader is familiar with the topics in an undergraduate algebra course: groups, rings, and fields. Later chapters assume some familiarity with Galois theory.

As stated above, this book discusses the current state of the art in algorithmic number theory. This book is not an elementary number theory textbook, and so we frequently do not give detailed proofs of results whose central focus is not computational. Choosing otherwise would have made this book twice as long.

The webpage offers the BibTeX files for the bibliography, which includes more than 1800 papers and books.

BTW, Amazon notes that Volume 2 was never published.

Now that high performance computing resources are easily available, perhaps you can start working on your own volume 2. Yes?

I first saw this in a tweet by Alvaro Videla.

Formatting Affects Perception?

Filed under: Graphics,Visualization — Patrick Durusau @ 7:08 pm

Before you jump to this link by Ed H. Chi, how would you answer the question:

Does table formatting affect your perception of a table?

The equivalent of “data is data” I suppose.

This is not a one-off example. The same answer is true for any other data set.

How’s your presentation of data?

April 4, 2014

Making Data Classification Work

Filed under: Authoring Topic Maps,Classification,Interface Research/Design — Patrick Durusau @ 7:06 pm

Making Data Classification Work by James H. Sawyer.

From the post:

The topic of data classification is one that can quickly polarize a crowd. The one side believes there is absolutely no way to make the classification of data and the requisite protection work — probably the same group that doesn’t believe in security awareness and training for employees. The other side believes in data classification as they are making it work within their environments, primarily because their businesses require it. The difficulty in choosing a side lies in the fact that both are correct.

Apologies, my quoting of James is misleading.

James is addressing the issue of “classification” of data in the sense of keeping information secret.

What is amazing is that the solution James proposes for “classification” in terms of what is kept secret, has a lot of resonance for “classification” in the sense of getting users to manage categories of data or documents.

One hint:

Remember how poorly even librarians use the Library of Congress subject listings? Contrast that with nearly everyone using aisle categories at the local grocery store.

You can design a topic map that experts use poorly, or one that nearly everyone is able to use.

Your call.

Jetson TK1:… [$192.00]

Filed under: GPU,HPC,NVIDIA — Patrick Durusau @ 6:53 pm

Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere by Mark Harris.

From the post:

Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.

Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, an NVIDIA 4-plus-1 quad-core ARM Cortex-A15 CPU, integrated video encoding and decoding support, image/signal processing, and many other system-level features. The Kepler GPU in Tegra K1 is built on the same high-performance, energy-efficient Kepler GPU architecture that is found in our high-end GeForce, Quadro, and Tesla GPUs for graphics and computing. That makes it the only mobile processor today that supports CUDA 6 for computing and full desktop OpenGL 4.4 and DirectX 11 for graphics.

Tegra K1 is a parallel processor capable of over 300 GFLOP/s of 32-bit floating point computation. Not only is that a huge achievement in a processor with such a low power footprint (Tegra K1 power consumption is in the range of 5 Watts for real workloads), but K1’s support for CUDA and desktop graphics APIs means that much of your existing compute and graphics software will compile and run largely as-is on this platform.

Are you old enough to remember looking at the mini-computers on the back of most computer zines?

And then sighing at the price tag?

Times have changed!

Order Jetson TK1 Now, just $192

Jetson TK1 is available to pre-order today for $192. In the United States, it is available from the NVIDIA website, as well as newegg.com and Micro Center. See the Jetson TK1 page for details on international orders.

Some people, General Clapper comes to mind, use supercomputers to mine dots that are already connected together (phone data).

Other people, create algorithms to assist users in connecting dots between diverse and disparate data sources.

You know who my money is riding on.

You?

April 3, 2014

Developing a 21st Century Global Library for Mathematics Research

Filed under: Identification,Identifiers,Identity,Mathematics,Subject Identity — Patrick Durusau @ 8:58 pm

Developing a 21st Century Global Library for Mathematics Research by Committee on Planning a Global Library of the Mathematical Sciences.

Care to guess what one of the major problems facing mathematical research might be?

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs, and systematically labeling them is challenging and, as of yet, unsolved. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but they are typically not machine-readable, reusable, or easily integrated with other tools and are often lacking editorial efforts. So, the issue is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. (p. 26)

Does that surprise you?

What do you think the odds are of mathematical research slowing down enough for committees to decide on universal identifiers for all the subjects in mathematical publications?

That’s about what I thought.

I have a different solution: Why not ask mathematicians who are submitting articles for publication to identify (specify properties for) what they consider to be the important subjects in their article?

The authors have the knowledge and skill, not to mention the motivation of wanting their research to be easily found by others.

Over time I suspect that particular fields will develop standard identifications (sets of properties per subject) that mathematicians can reuse to save themselves time when publishing.

Mappings across those sets of properties will be needed but that can be the task of journals, researchers and indexers who have an interest and skill in that sort of enterprise.

As opposed to having a “boil the ocean” approach that tries to do more than any one project is capable of doing competently.

Distributed subject identification is one way to think about it. We already do it, this would be a semi-formalization of that process and writing down what each author already knows.
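As a toy illustration of what “sets of properties per subject” might look like in practice (my own sketch, not anything proposed in the report):

    # Two authors describe what they believe is the same subject with their
    # own property sets; a mapping treats them as the same subject when the
    # discriminating properties they both supply agree.

    author_a = {
        "label": "Riemann zeta function",
        "notation": "zeta(s)",
        "msc": "11M06",     # Mathematics Subject Classification code
    }

    author_b = {
        "label": "zeta function of Riemann",
        "notation": "zeta(s)",
        "defined-by": "Dirichlet series, sum of 1/n^s",
    }

    def same_subject(p, q, keys=("notation", "msc")):
        """Same subject if at least one discriminating key is shared and
        every shared discriminating key has equal values."""
        shared = [k for k in keys if k in p and k in q]
        return bool(shared) and all(p[k] == q[k] for k in shared)

    print(same_subject(author_a, author_b))  # True, on the shared notation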

Thoughts?

PS: I suspect the condition recited above is true for almost any sufficiently large field of study. A set of 150 million entities sounds large only without context. In the context of science, it is a trivial number of entities.

April 2, 2014

DH Tools for Beginners

Filed under: Digital Research,Humanities — Patrick Durusau @ 4:17 pm

DH Tools for Beginners by Quinn Warnick.

A short collection of tutorials for “digital humanities novices.”

It is a good start, and if you know of other resources or want to author such tutorials, please do.

I don’t know that I was ever entirely comfortable with the phrase “digital humanities.”

In part because it creates an odd division between humanists and humanists who use digital tools.

We don’t call literature scholars who use concordances “concordance humanists.”

Any more than we call scholars who use bibliographic materials “bibliographic humanists.”

Mostly because concordances and bibliographic materials are tools by which one does humanities research and scholarship.

Shouldn’t that be the same for “digital” humanities?

That digital tools are simply more tools for doing humanities research and scholarship?

Given the recent and ongoing assaults on the humanities in general, standing closer together and not further apart as humanists sounds like a good idea.

Apache Tajo

Filed under: Apache Tajo,Hadoop — Patrick Durusau @ 3:57 pm

Apache Tajo SQL-on-Hadoop engine now a top-level project by Derrick Harris.

From the post:

Apache Tajo, a relational database warehouse system for Hadoop, has graduated to top-level status within the Apache Software Foundation. It might be easy to overlook Tajo because its creators, committers and users are largely based in Korea — and because there’s a whole lot of similar technologies, including one developed at Facebook — but the project could be a dark horse in the race for mass adoption. Among Tajo’s lead contributors are an engineer from LinkedIn and members of the Hortonworks technical team, which suggests those companies see some value in it even among the myriad other options.

It is far too early to be choosing winners in the Hadoop ecosystem.

There are so many contenders, with their individual boosters, that if you don’t like the solutions offered today, wait a week or so, another one will pop up on the horizon.

Which isn’t a bad thing. There isn’t any reason to think IT has uncovered the best data structures or algorithms for your data, any more than you would have thought so twenty years ago.

The caution I would offer is to hold tightly to your requirements and not those of some solution. Compromise may be necessary on your part, but fully understand what you are giving up and why.

The only utility that software can have, for any given user, is that it performs some task they require to be performed. For vendors, adopters, promoters, software has other utilities, which are unlikely to interest you.

Open Access Maps at NYPL

Filed under: Maps,Open Access — Patrick Durusau @ 3:47 pm

Open Access Maps at NYPL by Matt Knutzen, Stephen A. Schwarzman Building, Map Division.

From the post:

The Lionel Pincus & Princess Firyal Map Division is very proud to announce the release of more than 20,000 cartographic works as high resolution downloads. We believe these maps have no known US copyright restrictions.* To the extent that some jurisdictions grant NYPL an additional copyright in the digital reproductions of these maps, NYPL is distributing these images under a Creative Commons CC0 1.0 Universal Public Domain Dedication. The maps can be viewed through the New York Public Library’s Digital Collections page, and downloaded (!), through the Map Warper. First, create an account, then click a map title and go. Here’s a primer and more extended blog post on the warper.

…image omitted…

What’s this all mean?

It means you can have the maps, all of them if you want, for free, in high resolution. We’ve scanned them to enable their use in the broadest possible ways by the largest number of people.

Though not required, if you’d like to credit the New York Public Library, please use the following text “From The Lionel Pincus & Princess Firyal Map Division, The New York Public Library.” Doing so helps us track what happens when we release collections like this to the public for free under really relaxed and open terms. We believe our collections inspire all kinds of creativity, innovation and discovery, things the NYPL holds very dear.

In case you were unaware of it, librarians as a class have a very subversive agenda.

They want to provide as many people as much access to information as is possible.

People + information is a revolutionary mixture.

Apache Lucene/Solr 4.7.1

Filed under: Lucene,Solr — Patrick Durusau @ 3:16 pm

Apache Lucene 4.7.1

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Apache Solr 4.7.1

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

Fixes include a bad memory leak (SOLR-5875), so upgrading is advised.

Hortonworks Data Platform 2.1

Filed under: Apache Ambari,Falcon,Hadoop,Hadoop YARN,Hive,Hortonworks,Knox Gateway,Solr,Storm,Tez — Patrick Durusau @ 2:49 pm

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM available now, full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

April 1, 2014

April Fools’ Day: The 7 Funniest Data Cartoons

Filed under: Humor — Patrick Durusau @ 7:17 pm

April Fools’ Day: The 7 Funniest Data Cartoons

R-Bloggers had the best April Fools’ Day post I encountered today.

I think Scott Adams must have known one of my former managers.

Enjoy!

Want to make a great puzzle game?…

Filed under: Combinatorics,Merging — Patrick Durusau @ 7:06 pm

Want to make a great puzzle game? Get inspired by theoretical computer science by Jeremy Kun.

From the post:

Two years ago, Erik Demaine and three other researchers published a fun paper to the arXiv proving that most incarnations of classic nintendo games are NP-hard. This includes almost every Super Mario Brothers, Donkey Kong, and Pokemon title. Back then I wrote a blog post summarizing the technical aspects of their work, and even gave a talk on it to a room full of curious undergraduate math majors.

But while bad tech-writers tend to interpret NP-hard as “really really hard,” the truth is more complicated. It’s really a statement about computational complexity, which has a precise mathematical formulation. Sparing the reader any technical details, here’s what NP-hard implies for practical purposes:

You should abandon hope of designing an algorithm that can solve any instance of your NP-hard problem, but many NP-hard problems have efficient practical “good-enough” solutions.

The very definition of NP-hard means that NP-hard problems need only be hard in the worst case. For illustration, the fact that Pokemon is NP-hard boils down to whether you can navigate a vastly complicated maze of trainers, some of whom are guaranteed to defeat you. It has little to do with the difficulty of the game Pokemon itself, and everything to do with whether you can stretch some subset of the game’s rules to create a really bad worst-case scenario.

So NP-hardness has very little to do with human playability, and it turns out that in practice there are plenty of good algorithms for winning at Super Mario Brothers. They work really well at beating levels designed for humans to play, but we are highly confident that they would fail to win in the worst-case levels we can cook up. Why don’t we know it for a fact? Well that’s the P ≠ NP conjecture.

Can we say the same about combinatorial explosion problems? That under some set of assumptions such problems are intractable but that practical solutions do exist?

I mention this because one of the more shallow arguments in favor of a master ontology is that mapping between multiple ontologies (drum roll please) leads to a combinatorial explosion.

True, if you are dumb enough to insist on mapping from every known ontology to every other known ontology in an area, that’s very likely true.

However, if I only map from my ontology to the next ontology known to me, leaving other one-to-one mappings to those who want them, I am no closer to a combinatorial explosion than I would be creating a mapping to a master ontology.
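The arithmetic behind that claim, for n ontologies:

    \begin{align*}
    \text{every ontology mapped to every other:} &\quad \binom{n}{2} = \frac{n(n-1)}{2}\\
    \text{every ontology mapped to one master ontology:} &\quad n\\
    \text{each ontology mapped only to the one it needs next:} &\quad \le n - 1
    \end{align*}

The quadratic blow-up only appears if someone insists on the complete pairwise mapping up front; pay-as-you-go mappings stay linear.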

I must admit my bias in this case because I have no master ontology to sell you.

I prefer that you use a classification, taxonomy, ontology you already know. If it’s all the same to you.

Mapping your classification/taxonomy/ontology to another is something I would be happy to assist with.

PS: Do read Jeremy’s post in full. You will learn a lot and possibly pick up a new hobby.

