Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 14, 2015

Hacking Academia: Data Science and the University

Filed under: Data Science,Education — Patrick Durusau @ 6:44 pm

Hacking Academia: Data Science and the University by Jake Vanderplas

From the post:

In the words of Alex Szalay, these sorts of researchers must be “Pi-shaped” as opposed to the more traditional “T-shaped” researcher. In Szalay’s view, a classic PhD program generates T-shaped researchers: scientists with wide-but-shallow general knowledge, but deep skill and expertise in one particular area. The new breed of scientific researchers, the data scientists, must be Pi-shaped: that is, they maintain the same wide breadth, but push deeper both in their own subject area and in the statistical or computational methods that help drive modern research:

pi_shaped

Perhaps neither of these labels or descriptions is quite right. Another school of thought on data science is Jim Gray’s idea of the “Fourth Paradigm” of scientific discovery: First came the observational insights of empirical science; second were the mathematically-driven insights of theoretical science; third were the simulation-driven insights of computational science. The fourth paradigm involves primarily data-driven insights of modern scientific research. Perhaps just as the scientific method morphed and grew through each of the previous paradigmatic transitions, so should the scientific method across all disciplines be modified again for this new data-driven realm of knowledge.

Neither label in the graphic is correct, in part because this is a classic light versus dark dualism, along the lines of Middle Age scholars making reference to the dark ages. You could not have asked anyone living between the 6th and 13th centuries what it felt like to live in the “dark ages.” That was a name invented later to distinguish the “dark ages,” an invention that came about in the “Middle Ages.” The “Middle Ages” being coined, of course, during the Renaissance.

Every age thinks it is superior to those that came before and the same is true for changes in the humanities and sciences. Fear not, someday your descendants will wonder how we fed ourselves, being hobbled with such vastly inferior software and hardware.

I mention this because the “Pi-shaped” graphic is making the rounds on Twitter. It is only one of any number of new “distinctions” that are springing up in academia and elsewhere. None of which will be of interest or perhaps even intelligible in another twenty years.

Rather than focusing on creating ephemeral labels for ourselves and others, how about we focus on research and results, whatever label has been attached to someone? Yes?

Scam and Phishing Tactics Vary; Which Ones Are You Likely to Fall for?

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:04 pm

Kelly Morgan has written a three part series on phishing titled:

Scam and Phishing Tactics Vary; Which Ones Are You Likely to Fall for? Part 1.

Scam and Phishing Tactics Vary; Which Ones Are You Likely to Fall for? Part 2.

Scam and Phishing Tactics Vary; Which Ones Are You Likely to Fall for? Part 3.

I know my average reader won’t fall for any of those traps but Brian Prince reports in Hacking Critical Infrastructure Companies — A Pen Tester’s View that a full 18% of users do.

Nearly one out of every five people who walk by your desk/cube today is susceptible to phishing email scams.

How secure is your network?

I first saw this in a tweet by Frontier Secure.

Mapping Your Music Collection [Seeing What You Expect To See]

Filed under: Audio,Machine Learning,Music,Python,Visualization — Patrick Durusau @ 4:11 pm

Mapping Your Music Collection by Christian Peccei.

From the post:

In this article we’ll explore a neat way of visualizing your MP3 music collection. The end result will be a hexagonal map of all your songs, with similar sounding tracks located next to each other. The color of different regions corresponds to different genres of music (e.g. classical, hip hop, hard rock). As an example, here’s a map of three albums from my music collection: Paganini’s Violin Caprices, Eminem’s The Eminem Show, and Coldplay’s X&Y.

smallmap

To make things more interesting (and in some cases simpler), I imposed some constraints. First, the solution should not rely on any pre-existing ID3 tags (e.g. Artist, Genre) in the MP3 files—only the statistical properties of the sound should be used to calculate the similarity of songs. A lot of my MP3 files are poorly tagged anyways, and I wanted to keep the solution applicable to any music collection no matter how bad its metadata. Second, no other external information should be used to create the visualization—the only required inputs are the user’s set of MP3 files. It is possible to improve the quality of the solution by leveraging a large database of songs which have already been tagged with a specific genre, but for simplicity I wanted to keep this solution completely standalone. And lastly, although digital music comes in many formats (MP3, WMA, M4A, OGG, etc.) to keep things simple I just focused on MP3 files. The algorithm developed here should work fine for any other format as long as it can be extracted into a WAV file.

Creating the music map is an interesting exercise. It involves audio processing, machine learning, and visualization techniques.
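To give you a feel for the moving parts, here is a rough sketch of the front half of such a pipeline in Python. It is not Christian's code: the librosa and scikit-learn packages and the music/ directory are my assumptions, and PCA stands in for his self-organizing map.

```python
# A rough sketch, not the post's code. Assumes the librosa and scikit-learn
# packages; the music/ directory is a placeholder, and PCA stands in for the
# self-organizing map used in the original post.
import glob
import numpy as np
import librosa
from sklearn.decomposition import PCA

def features(path):
    """Summarize a track as the mean/std of its MFCCs (a crude timbre fingerprint)."""
    y, sr = librosa.load(path, sr=22050, mono=True, duration=60.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

paths = sorted(glob.glob("music/**/*.mp3", recursive=True))
X = np.array([features(p) for p in paths])

# Project to 2-D so similar-sounding tracks land near each other on the map.
coords = PCA(n_components=2).fit_transform(X)
for p, (x, y) in zip(paths, coords):
    print(f"{x:8.2f} {y:8.2f}  {p}")
```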

It would take longer than a weekend to complete this project with a sizable music collection but it would be a great deal of fun!

Great way to become familiar with several Python libraries.

BTW, when I saw Coldplay, I thought of Coal Chamber by mistake. Not exactly the same subject. 😉

I first saw this in a tweet by Kirk Borne.

Announcing Spark 1.3!

Filed under: Machine Learning,Spark — Patrick Durusau @ 3:27 pm

Announcing Spark 1.3! by Patrick Wendell.

From the post:

Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.

We’ll be publishing in depth overview posts covering Spark’s new features over the coming weeks. Some of the salient features of this release are:

A new DataFrame API

Spark SQL Graduates from Alpha

Built-in Support for Spark Packages

Lower Level Kafka Support in Spark Streaming

New Algorithms in MLlib

See Patrick’s post and/or the release notes for full details!
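If you want a taste of the DataFrame API before the in-depth posts arrive, here is a minimal PySpark sketch. It assumes a local Spark 1.3 installation; the data and column names are invented.

```python
# A minimal sketch of the Spark 1.3 DataFrame API in PySpark; the data and
# column names are invented. Assumes a local Spark 1.3 installation.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "dataframe-demo")
sqlContext = SQLContext(sc)

people = sc.parallelize([
    Row(name="alice", age=34),
    Row(name="bob", age=29),
    Row(name="carol", age=41),
])
df = sqlContext.createDataFrame(people)

# DataFrame operations read like SQL but are planned and optimized by Spark.
df.filter(df.age > 30).select("name", "age").show()

sc.stop()
```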

BTW, Patrick promises more posts to follow covering Spark 1.3 in detail.

I first saw this in a tweet by Vidya.

Incentives and the Insecure Internet of Things (IIoT)

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:03 pm

Study Says Internet of Things Is As Insecure As Ever by Lily Hay Newman

From the post:

smart_home

We already know that Internet of Things devices tend to have vulnerabilities, and there have been some efforts to establish guidelines for creating a more secure environment. But a new study from Symantec, the company that makes Norton AntiVirus, says that your smart-home devices could be “giving away the keys to your kingdom.”

Great graphic right!?

Except that Symantec forgets to tell you that all software vendors, including Symantec, are shipping buggy code.

Remember our discussion about financial incentives in The reason companies don’t fix cybersecurity [Same reason software is insecure]?

No incentives to make the Internet of Things secure = Insecure Internet of Things (IIoT).

What seems unclear about that?

I first saw this in a tweet by Tarun Tejpal.

KDE and The Semantic Desktop

Filed under: Linux OS,Merging,RDF,Semantics,Topic Maps — Patrick Durusau @ 2:30 pm

KDE and The Semantic Desktop by Vishesh Handa.

From the post:

During the KDE4 years the Semantic Desktop was one of the main pillars of KDE. Nepomuk was massive, all-encompassing, and integrated with many different parts of KDE. However, few people know what The Semantic Desktop was all about, and where KDE is heading.

History

The Semantic Desktop as it was originally envisioned comprised both the technology and the philosophy behind The Semantic Web.

The Semantic Web is built on top of RDF and Graphs. This is a special way of storing data which focuses more on understanding what the data represents. This was primarily done by carefully annotating what everything means, starting with the definition of a resource, a property, a class, a thing, etc.

This process of all data being stored as RDF, having a central store, with applications respecting the store and following the ontologies was central to the idea of the Semantic Desktop.

The Semantic Desktop cannot exist without RDF. It is, for all intents and purposes, what the term “semantic” implies.

A brief post-mortem on the KDE Semantic Desktop which relied upon NEPOMUK (Networked Environment for Personal, Ontology-based Management of Unified Knowledge) for RDF-based features. (NEPOMUK was an EU project.)

The post mentions complexity more than once. A friend recently observed that RDF was all about supporting AI and not capturing arbitrary statements by a user.

Such as providing alternative identifiers for subjects. With enough alternative identifications (including context, which “scope” partially captures in topic maps), I suspect a deep learning application could do pretty well at subject recognition, including appropriate relationships (associations).

But that would not be by trying to guess or formulate formal rules (a la RDF/OWL) but by capturing the activities of users as they provide alternative identifications of and relationships for subjects.

Hmmm, merging then would be a learned behavior by our applications. Will have to give that some serious thought!
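Purely as a toy, merging on accumulated alternative identifiers might look something like the sketch below. This is not topic map merging as specified anywhere, just grouping records that share any user-supplied identifier; the data and identifiers are invented.

```python
# A toy sketch only: merge "subjects" whenever they share any identifier that
# users have supplied. Data and identifiers are invented for illustration.
from collections import defaultdict

records = [
    {"id": "r1", "identifiers": {"http://dbpedia.org/resource/Paris", "Paris, France"}},
    {"id": "r2", "identifiers": {"Paris, France", "City of Light"}},
    {"id": "r3", "identifiers": {"Paris, Texas"}},
]

parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Records sharing an identifier get merged into one group.
seen = {}
for r in records:
    for ident in r["identifiers"]:
        if ident in seen:
            union(r["id"], seen[ident])
        else:
            seen[ident] = r["id"]

groups = defaultdict(list)
for r in records:
    groups[find(r["id"])].append(r["id"])
print(list(groups.values()))   # [['r1', 'r2'], ['r3']]
```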

I first saw this in a tweet by Stefano Bertolo.

Speculation + Repetition != Evidence (ISIS)

Filed under: Politics,Security — Patrick Durusau @ 1:35 pm

A new EU counter-terrorism unit will tackle extremists online by Pierluigi Paganini.

From the post:

Terrorists are exploiting the web for propaganda purposes and to menace Western infidels. For this reason, intelligence agencies and law enforcement need to increase their efforts to tackle any kind of extremist content.

Groups of terrorists and sympathizers are sharing an impressive amount of extremist content online; social media, mobile platforms, and the Deep Web offer a privileged environment for sharing this dangerous content.

Recent research published by the Brookings Institution reported that between September and December 2014, ISIS supporters were in control of at least 46,000 accounts across the social network. ISIS has released a manual for its militants, titled “How to Tweet Safely Without Giving out Your Location to NSA”, that explains how to avoid surveillance.
…(I corrected a broken link at “propaganda.”)

The authority for “propaganda” is a link to an earlier post by the same author quoting speculation by law enforcement that ISIS members might be using Bitcoin. No evidence of either Bitcoin use or exploitation of the web for propaganda. (I suspect the exploitation is no more and possibly a good deal less than propaganda by Western governments and their sympathizers on the web.)

As far as the “…impressive amount of extremist content online, social media, mobile platforms….” no evidence is cited or referenced, the claim is simply repeated. You would think with the number of times that has been claimed, someone would have collected an archive of attributable propaganda for inspection.

Not exactly on point but Pierluigi does cite: ISIS Social Media Analysis of twitter news outlets following @Th3j35t3r Tango-Downs by Matteo G.P. Flora.

Matteo, contrary to other commentators, has at least gathered historical twitter data for more traditional analysis. I suggest that you read Matteo’s post in full. The questions it does not answer are:

  • How does ISIS use of social media compare to similarly sized groups? (One measure of success)
  • How effective is ISIS “propaganda” in securing support or recruits for its cause?

Both of those are traditional social science questions that are subject to empirical verification.

At present, at least to most Western observers, the information flow from ISIS is only a trickle, which prevents many observers from deciding whether ISIS accounts or those of Western governments are the more trustworthy. That has been due in no small part to social media providers censoring content at the behest of, or to curry favor with, Western governments.

I can recognize propaganda from ISIS or any other group, and I extend the same assumption to my readers. Governments are the ones who think their citizens are too stupid to avoid being taken in by propaganda. That is, members of government think they are smarter than their citizens. Says more about members of government than it does about their citizens.

Before I forget, here’s the image that ran with this story:

terrorism-isis

Does that look posed to you? I ask because it is after Labor Day and a number of people in the photo are wearing very clean white shoes. Are they really terrorists or intelligence operatives with no real fashion sense?

PS: If you want to practice your Arabic, the ISIS Manual on Removing Metadata.

PPS: Before I get caught up in a Brookings Institution “ISIS supporter” sweep, let me say that I’m not an ISIS supporter. But then I am not an ISIS opponent either. What happens in Iraq and Syria should be left up to the people of Iraq and Syria. If a Caliphate as literate and humane as we remember from history were established, it would be far preferable to the collage of toady governments presently found in that part of the world. That is just an observation, not an endorsement of anyone or any group or any strategy for bringing that about. I do oppose censorship by all sides so I have no allies in Middle East discussions.

Google Chrome (Version 41.0.2272.89 (64-bit)) WARNING!

Filed under: Browsers,Visualization,WWW — Patrick Durusau @ 11:22 am

An update of Google Chrome on Ubuntu this morning took my normal bookmark manager list of small icons and text to:

google-bookmarks

What do the kids say these days?

That sucks!

Some of you may prefer the new display. Good for you.

As far as I can tell, Chrome does not offer an option to revert to the previous display.

I keep quite a few bookmarks with an active blog so the graphic images are a waste of screen space and force me to scroll far more often than otherwise. I often work with the bookmark manager open in a separate screen.

For people who like this style, great. My objection is to it being forced on users who may prefer the prior style of bookmarks.

Here’s your design tip for the day: Don’t help users without giving them the ability to decline the help. Especially with display features.

March 13, 2015

HawkEye G Selected as Part of an Active Cyber Defense System…

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:53 pm

HawkEye G Selected as Part of an Active Cyber Defense System to Protect Federal Networks from Advanced Cyber Attacks

From the post:

Hexis Cyber Solutions (Hexis), a wholly-owned subsidiary of The KEYW Holding Corporation (KEYW) and a provider of advanced cybersecurity solutions for commercial companies and government agencies, today announced that HawkEye G has been selected by key members of the United States Intelligence Community as part of an integrated Active Cyber Defense (ACD) solution, protecting federal agencies’ networks against nation-state adversaries. As a core component, HawkEye G provides the only automated advanced threat removal capability available today. The ACD solution, referred to by the name SHORTSTOP, is provided as a turn-key system or as a reference design to federal agencies seeking best in class cyber defense. SHORTSTOP facilitates a convergence of commercial security technologies including HawkEye G and products from Palo Alto Networks, FireEye, and Splunk.

“The Intelligence customers that built this system understand the capabilities of today’s best cyber security products, and how to combine them to find previously undetectable attacks and remove them at machine speed. They are taking advantage of HawkEye G to sense at the endpoints, provide threat detection, pinpoint attacks, reduce false positives, and use automation to remove the threats,” said Chris Fedde, President of Hexis Cyber Solutions. “The SHORTSTOP architecture is consistent with the capabilities developed over the last three years by our engineers. As a result, government and commercial organizations can execute policy-driven threat mitigation in real-time to combat against advanced cyberattacks.”

HawkEye G is a next-generation cyber security platform that provides advanced threat detection, investigation and automated response capabilities. Security teams can continuously detect, investigate and remove advanced threats from within the network before adversaries can steal sensitive data, compromise intellectual property or cause critical process disruption. HawkEye G provides endpoint and network sensing, threat detection analytics, automated countermeasures that remove network threats, and a flexible policy engine that enables users to govern actions using both micro and macro policy controls.

According to research published by leading industry analysts, current forms of advanced persistent threat (APT) malware can live on a network host undetected for months. During this time, organizations are losing billions of dollars and in the case of many government entities, exposing highly sensitive intellectual property and data. With it becoming increasingly clear that perimeter and traditional endpoint solutions are failing to keep up with threats and that manual responses allow threats to compromise networks, government and commercial organizations are recognizing the need to automate decision-making and response.

I don’t know how many Raspberry Pis that could have gone to secondary students were sacrificed for this purchase.

Two things jump out at me from the Network World review of HawkEye G (PDF), where HawkEye G was rated 4.875 out of 5.

First, HawkEye G, as an appliance:

HawkEye G is installed as an appliance, which makes the physical deployment rather simple. You do need to open up a hole in your firewall to allow the device to communicate with the Hexis Security Operations Center, where information about new threats is collected and pushed out.

Poke a hole in your own firewall?

Sure, what could possibly go wrong?

Of course HawkEye G watches for internal breaches:

The first thing that was checked was if a human had typed in the restricted URL, or if it were done by a program. If a human did it, there are several steps that could be taken based on the cybercon level. A warning could be issued at one end of the spectrum all the way up to the revoking of user privileges at the other. But since this was being done by a program, that step was skipped.

I suppose that works so long as Edward Snowden stays in Russia and none of your staff share passwords.

Do you know if anyone gives odds on breaching specific software packages? Just curious.

Quick start guide to R for Azure Machine Learning

Filed under: Azure Marketplace,Machine Learning,R — Patrick Durusau @ 7:22 pm

Quick start guide to R for Azure Machine Learning by Larry Franks.

From the post:

Microsoft Azure Machine Learning contains many powerful machine learning and data manipulation modules. The powerful R language has been described as the lingua franca of analytics. Happily, analytics and data manipulation in Azure Machine Learning can be extended by using R. This combination provides the scalability and ease of deployment of Azure Machine Learning with the flexibility and deep analytics of R.

This document will help you quickly start extending Azure Machine Learning by using the R language. This guide contains the information you will need to create, test and execute R code within Azure Machine Learning. As you work though this quick start guide, you will create a complete forecasting solution by using the R language in Azure Machine Learning.

BTW, I deleted an ad in the middle of the pasted text that said you can try Azure learning free. No credit card required. Check the site for details because terms can and do change.

I don’t know who suggested “quick” be in the title but it wasn’t anyone who read the post. 😉

Seriously, despite being long it is a great onboarding to using RStudio with Azure Machine Learning that ends with lots of good R resources.

Combining the strength of cloud based machine learning with a language that is standard in data science is a winning combination.

People will differ in their preferences for cloud based machine learning environments but this guide sets a high mark for guides concerning the same.

Enjoy!

I first saw this in a tweet by Ashish Bhatia.

Building A Digital Future

Filed under: Computer Science,Data Science,Marketing — Patrick Durusau @ 7:06 pm

You may have missed BBC gives children mini-computers in Make it Digital scheme by Jane Wakefield.

From the post:

One million Micro Bits – a stripped-down computer similar to a Raspberry Pi – will be given to all pupils starting secondary school in the autumn term.

The BBC is also launching a season of coding-based programmes and activities.

It will include a new drama based on Grand Theft Auto and a documentary on Bletchley Park.

Digital visionaries

The initiative is part of a wider push to increase digital skills among young people and help to fill the digital skills gap.

The UK is facing a significant skills shortage, with 1.4 million “digital professionals” estimated to be needed over the next five years.

The BBC is joining a range of organisations including Microsoft, BT, Google, Code Club, TeenTech and Young Rewired State to address the shortfall.

At the launch of the Make it Digital initiative in London, director-general Tony Hall explained why the BBC was getting involved.

Isn’t that clever?

Odd that I haven’t heard about a similar effort in the United States.

There are only 15 million (14.6 million actually) secondary students this year in the United States and at $35 per Raspberry Pi, that’s only $525,000,000. That may sound like a lot but remember that the 2015 budget request for the Department of Homeland Security is $38.2 billion (yes, with a B). We are spending roughly 73 times the amount needed to buy every secondary student in the United States a Raspberry Pi on DHS. A department that has yet to catch a single terrorist.
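Back-of-the-envelope, if you want to check the arithmetic (a quick sketch, numbers rounded):

```python
# Back-of-the-envelope check of the figures above (rounded).
students = 15_000_000            # ~14.6 million U.S. secondary students, rounded up
pi_cost = 35                     # dollars per Raspberry Pi
dhs_budget = 38_200_000_000      # FY2015 DHS budget request, in dollars

total = students * pi_cost
print(f"Cost to equip every student: ${total:,}")      # $525,000,000
print(f"DHS budget is {dhs_budget / total:.0f}x that")  # roughly 73x
```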

There would be consequences to buying every secondary student in the United States a Raspberry Pi:

  • Manufacturers of Raspberry Pi would have a revenue stream for more improvements
  • A vast secondary market for add-ons for Raspberry Pi computers would be born
  • An even larger market for tutors and classes on Raspberry Pi would spring up
  • Millions of secondary students would be taking positive steps towards digital literacy

The only real drawback that I foresee is that the usual suspects would not be at the budget trough.

Maybe, just this once, the importance of digital literacy and inspiring a new generation of CS researchers is worth taking that hit.

Any school districts distributing Raspberry Pis on their own to set an example for the feds?

PS: I would avoid getting drawn into “accountability” debates. Some students will profit from them, some won’t. The important aspect is development of an ongoing principle of digital literacy and supporting it. Not every child reads books from the library but every community is poorer for the lack of a well supported library.

I first saw this in a tweet by Bart Hannsens.

W3C Invites Implementations of XQuery and XPath Full Text 3.0;…

Filed under: W3C,XPath,XQuery — Patrick Durusau @ 4:26 pm

W3C Invites Implementations of XQuery and XPath Full Text 3.0; Supporting Requirements and Use Cases Draft Updated

From the post:

The XML Query Working Group and the XSLT Working Group invite implementation of the Candidate Recommendation of XQuery and XPath Full Text 3.0. The Full Text specification extends the XPath and XQuery languages to support fast and efficient full text searches over arbitrarily large collections of documents. This release brings the Full Text specification up to date with XQuery 3.0 and XPath 3.0; the language itself is unchanged.

Both groups also published an updated Working Draft of XQuery and XPath Full Text 3.0 Requirements and Use Cases. This document specifies requirements and use cases for Full-Text Search for use in XQuery 3.0 and XPath 3.0. The goal of XQuery and XPath Full Text 3.0 is to extend XQuery and XPath Full Text 1.0 with additional functionality in response to requests from users and implementors.

If you have comments that arise out of implementation experience, be advised that XQuery and XPath Full Text 3.0 will be a Candidate Recommendation until at least 26 March 2015.

Enjoy!

The reason companies don’t fix cybersecurity [Same reason software is insecure]

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:11 pm

The reason companies don’t fix cybersecurity by Erick Sherman.

From the post:

U.S. air traffic control systems are vulnerable to hackers, says the General Accounting Office. Cybercriminals target retail loyalty cards. Obsolete encryption leaves phones vulnerable.

When it comes to giant data breaches suffered by Sony (SNE), Home Depot (HD), Target (TGT), Anthem (ANTM) and many others, the vulnerability of online information is by now a fact of life. So why don’t corporations plug the gaps, improve their practices and safeguard sensitive consumer data? After all, these measures would prevent potential financial loss and identity theft.

The answer: The losses involved are so small compared to the revenue that it’s easier to take a chance and write off any losses should they occur. In other words, worrying about data breaches isn’t worth it to them.

To understand the attitude, you need to follow the money. Benjamin Dean, a fellow for Internet governance and cybersecurity at Columbia University’s School of International and Public Affairs, compared some high-profile data breach costs to the revenue of the companies. It turned out, some major breaches cost the companies that had lost the data relatively little.

Remember Target’s loss of 40 million debit and credit card numbers and 70 million other records, which included addresses and phone numbers? The company recently said the total bill was $252 million between 2013 and 2014. After $90 million insurance coverage, $162 million was left. Tax deductions brought that amount down to $105 million. The sum was about 0.1 percent of Target’s 2014 revenue.

This isn’t unusual. For the loss of 56 million credit and debit card numbers and 53 million email addresses to hackers in 2014, Home Depot was out only a net $28 million, after a $15 million insurance payment. That’s less than 0.01 percent of the company’s 2014 revenue.

In 2014, the Ponemon Institute surveyed 314 companies around the world. The smallest by annual revenue was on the order of $100 million. Most were multibillion-dollar corporations. Ponemon’s estimate of the average data breach cost to these companies for the year was $3.5 million. The organization ran some numbers for CBS MoneyWatch. The average revenue size was $1.967 billion. That means the average data breach represented only 0.18 percent of revenue — a rounding error. (emphasis added)

Erick’s article is a must read on the lack of motivation for corporations to improve cybersecurity. The cost of breaches to major corporations now? In his words “…a rounding error.”

But major corporations didn’t write the software that led to universal cyberinsecurity. Software companies did. What would a cost analysis show for their cost of data breaches?

If you think 0.18 percent of revenue is low, software breaches cost software vendors 0.00 percent of revenue.

Tell me, what incentives do software vendors have to aggressively pursue the production of secure software?

Can you say: NONE AT ALL?

Some vendors do try harder than others but remember they are competing against other vendors who have no cost for ignoring insecure software.

Do the math and ask yourself: Where are the incentives for secure handling of data and secure software?

Without security incentives everyone and their data will be left naked and exposed to the entire world.

If you are eligible to vote in the United States, contact your Representatives and Senators saying if the 2016 elections arrive with no realistic incentives to avoid data breaches and to produce secure software, you will vote against every incumbent on the ballot.

The Future of Algorithm Development

Filed under: Algorithms — Patrick Durusau @ 2:03 pm

The Future of Algorithm Development

From the post:

We’re building a community around state-of-the-art algorithm development. Users can create, share, and build on other algorithms and then instantly make them available as a web service.

A year ago, we shared with you our belief that Algorithm Development is Broken, where we laid out problems in the world of algorithm development and our vision of how to make things better.

Since then, we’ve been working hard to create our vision of the world’s first open marketplace for algorithms – and you, the developers of the world, have enthusiastically joined us to help activate the algorithm economy.

Today, we are proud to open our doors to the world, and deliver on our promise of a global, always-available, open marketplace for algorithms.

We believe that the future of algorithm development is:

  • Live: easy access to any algorithm, callable at any time
  • No dependencies: no downloads, no installs, no stress
  • Interoperable: call any algorithm regardless of programming language
  • Composable: algorithms are building blocks, build something bigger!
  • Invisible but powerful infrastructure: don’t worry about containers, networks, or scaling; we make your algorithm scale to hundreds of machines
  • No operations: we scale, meter, optimize and bill for you

For the first time, algorithm developers can publish their algorithms as a live API and applications can be made smarter by taking advantage of these intelligent algorithms.

With our public launch we are enabling payments and go from being a repository of algorithms, to a true marketplace:

  • For algorithm developers, Algorithmia provides a unique opportunity to increase the impact of their work and collect the rewards associated with it.  
  • For application developers, Algorithmia provides the largest repository of live algorithms ever assembled, supported by a vibrant community of developers.

So far, users have contributed over 800 algorithms, all under one API that grows every day. The most exciting thing from our point of view is watching the Algorithmia Marketplace grow and expand in its functionality.

In just the last few months, Algorithmia has gained a number of new abilities.

Thanks to the thousands of developers who joined our cause, contributed algorithms, and gave us countless hours of feedback so far. We are very excited to be opening our doors, and we look forward to helping you create something amazing!

Come check it out

doppenhe & platypii

Algorithmia has emerged from private beta!
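If you want to kick the tires, the marketplace is callable from Python. A hedged sketch only: the client usage, algorithm path and API key below are placeholders and assumptions on my part, so check Algorithmia's own docs for the current syntax.

```python
# A sketch only: the client usage, algorithm path, and API key below are
# placeholders/assumptions -- check Algorithmia's own docs for current syntax.
import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")      # placeholder API key
algo = client.algo("demo/Hello/0.1.1")           # placeholder algorithm path
response = algo.pipe("World")                    # send input to the hosted algorithm
print(response)   # depending on client version this is the result or a response object
```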

The future of Snow Crash, at least in terms of contracting, gets closer every day!

Enjoy!

NYC 311 with Turf

Filed under: Graphics,Visualization — Patrick Durusau @ 9:35 am

NYC 311 with Turf by Morgan Herlocker.

From the post:

In this example, Turf is processing and visualizing the largest historical point dataset on Socrata and data.gov. The interface below visualizes every 311 call in NYC since 2010 from a dataset that weighs in at nearly 8 GB.

An excellent visualization from Mapbox to start a Friday morning!

The Turf homepage reports: “Advanced geospatial analysis for browsers and node.”

Wikipedia on 311 calls.

The animation is interactive and can lead to some interesting questions. For example, when I slow down the animation (icons on top), I can see a pattern that develops in Brooklyn with a large number of calls from November to January of each year (roughly and there are other patterns). Zooming in, I located one hot spot at the intersection of Woodruff Ave. and Ocean Ave.

The “hotspot” appears to be an artifact of summarizing the 311 data because the 311 records contain individual addresses for the majority of reports. I say “majority,” though I didn’t download the data set to verify that statement; I just scanned the first 6,000 or so records.

Deeper drilling into the data could narrow the 311 hotspots to block or smaller locations.
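If you want to poke at the same question without leaving Python, a hex-bin plot over a CSV export of the 311 data gets you a rough equivalent. This is not Turf (which is JavaScript); the file name and the Latitude/Longitude column names are assumptions about that export.

```python
# Not Turf (which is JavaScript) -- just the same hex-binning idea in Python.
# Assumes a CSV export of the NYC 311 dataset; the file name and the
# Latitude/Longitude column names are assumptions about that export.
import pandas as pd
import matplotlib.pyplot as plt

calls = pd.read_csv("311_requests.csv", usecols=["Latitude", "Longitude"]).dropna()

plt.hexbin(calls["Longitude"], calls["Latitude"],
           gridsize=80, bins="log", cmap="viridis")
plt.colorbar(label="log10(calls per hex)")
plt.title("NYC 311 calls, hex-binned")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```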

As you have come to expect, Mapbox has a tutorial on using Turf analysis.

If this hasn’t captured your interest yet, perhaps the keywords composable, scale, and scaling will:

Unlike a traditional GIS database, Turf’s flexibility allows for composable algorithms that scale well past what fits into memory or even on a single machine.

Morgan discusses a similar project and the use of streamgraphs. Great way to start a Friday!

March 12, 2015

Speaking of Numbers and Big Data Disruption

Filed under: BigData,Statistics,Survey — Patrick Durusau @ 6:49 pm

Survey: Big Data is Disrupting Business as Usual by George Leopold.

From the post:

Sixty-four percent of the enterprises surveyed said big data is beginning to change the traditional boundaries of their businesses, allowing more agile providers to grab market share. More than half of those surveyed said they are facing greater competition from “data-enabled startups” while 27 percent reported competition from new players from other industries.

Hence, enterprises slow to embrace data analytics are now fretting over their very survival, EMC and the consulting firm argued.

Those fears are expected to drive investment in big data over the next three years, with 54 percent of respondents saying they plan to increase investment in big data tools. Among those who have already made big data investments, 61 percent said data analytics are already driving company revenues. The fruits of these big data efforts are proving as valuable as existing products and services, the survey found.

That sounds important, except they never say how business is being disrupted. Seems like that would be an important point to make. Yes?

And note the 61% who “…said data analytics are already driving company revenues…” are “…among those who have already made big data investments….” Was that ten people? Twenty? And who after making a major investment is going to say that it sucks?

The survey itself sounds suspect if you read the end of the post:

Capgemini said its big data report is based on an online survey conducted in August 2014 of more than 1,000 senior executives across nine industries in ten global markets. Survey author FreshMinds also conducted follow-up interviews with some respondents.

I think there is a reason that Gallup and those sort of folks don’t do online surveys. It has something to do with accuracy if I recall correctly. 😉

Detecting Text Reuse in Nineteenth-Century Legal Documents:…

Filed under: History,Law - Sources,Text Analytics,Text Mining,Texts — Patrick Durusau @ 6:32 pm

Detecting Text Reuse in Nineteenth-Century Legal Documents: Methods and Preliminary Results by Lincoln Mullen.

From the post:

How can you track changes in the law of nearly every state in the United States over the course of half a century? How can you figure out which states borrowed laws from one another, and how can you visualize the connections among the legal system as a whole?

Kellen Funk, a historian of American law, is writing a dissertation on how codes of civil procedure spread across the United States in the second half of the nineteenth century. He and I have been collaborating on the digital part of this project, which involves identifying and visualizing the borrowings between these codes. The problem of text reuse is a common one in digital history/humanities projects. In this post I want to describe our methods and lay out some of our preliminary results. To get a fuller picture of this project, you should read the four posts that Kellen has written about his project:

Quite a remarkable project with many aspects that will be relevant to other projects.

Lincoln doesn’t use the term but this would be called textual criticism, if it were being applied to the New Testament. Of course here, Lincoln and Kellen have the original source document and the date of its adoption. New Testament scholars have copies of copies in no particular order and no undisputed evidence of the original text.
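For a flavor of how text reuse detection works in general, here is a minimal sketch in Python using word n-gram shingles and Jaccard similarity. It is not Lincoln and Kellen's actual pipeline, and the two code snippets are paraphrased for illustration.

```python
# Not the authors' pipeline -- a minimal sketch of one common approach to text
# reuse detection: word n-gram "shingles" compared by Jaccard similarity.
# The two code snippets are paraphrased for illustration.
import re

def shingles(text, n=5):
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

ny = "Every action must be prosecuted in the name of the real party in interest."
ca = "An action must be prosecuted in the name of the real party in interest, except as otherwise provided."

score = jaccard(shingles(ny), shingles(ca))
print(f"similarity: {score:.2f}")   # high scores flag likely borrowing between codes
```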

Did I mention that all the source code for this project is on Github?

Figures Don’t Lie, But Liars Can Figure

Filed under: Database,NoSQL,Oracle — Patrick Durusau @ 5:59 pm

A pair of posts that you may find amusing on the question of “free” and “cheaper.”

HBase is Free but Oracle NoSQL Database is cheaper

When does “free” challenge that old adage, “You get what you pay for”?

Two brief quotes from the first post set the stage:

How can Oracle NoSQL Database be cheaper than “free”? There’s got to be a catch. And of course there is, but it’s not where you are expecting. The problem in that statement isn’t with “cheaper” it’s with “free”.

An HBase solution isn’t really free because you do need hardware to run your software. And when you need to scale out, you have to look at the how well the software scales. Oracle NoSQL Database scales much better than HBase which translated in this case to needing much less hardware. So, yes, it was cheaper than free. Just be careful when somebody says software is free.

The second post tries to remove the vendor (Oracle) from the equation:

Read-em and weep …. NOT according to Oracle, HBase does not take advantage of SSD’s anywhere near the extent with which Oracle NoSQL does … couldn’t even use the same scale on the vertical bar.

SanDisk on HBase with SSD

SanDisk on Oracle NoSQL with SSD

And so the question remains, when does “free” challenge the old adage “you get what you pay for”, because in this case, the adage continues to hold up.

And as the second post notes, Oracle has committed code back to the HBase product so it isn’t unfamiliar to them.

First things first, the difficulty that leads to these spats is using “cheap,” “free,” “scalable,” “NoSQL,” etc. as the basis for IT marketing or decision making. That may work with poorer IT decision makers but, however happy it makes the marketing department, it is just noise. Noise that is a disservice to IT consumers.

Take “cheaper” and “free” as used in these posts. Is hardware really the only cost associated with HBase or Oracle installations? If it is, I have been severely misled.

On the HBase expense side I would expect to find HBase DBAs, maintenance of those personnel, hardware (+maintenance), programmers, along with use case requirements that must be met.

On the Oracle expense side I would expect to find Oracle DBAs, maintenance of those personnel, Oracle software licensing, hardware (+maintenance), programmers, along with use case requirements that must be met.

Before you jump to my listing “Oracle software licensing,” consider how that will impact the availability of appropriate personnel, the amount of training needed to introduce new IT staff to HBase, etc.

Not to come down too hard for Oracle, Oracle DBAs and their maintenance aren’t cheap, nor are some of the “features” of Oracle software.

Truth be told there is a role for project requirements, experience of current IT personnel, influence IT has over the decision makers, and personal friendships of decision makers in any IT decision making.

To be very blunt, IT decision making is just as political as any other enterprise decision.

Numbers are a justification for a course chosen for other reasons. As a user I am always more concerned with my use cases being met than numbers. Aren’t you?

Why CISOs Need a Security Manifesto

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:25 pm

Why CISOs Need a Security Manifesto by Marc Solomon.

From the post:

Manifestos have been around for centuries but seem to have become trendy lately. Originally manifestos were used by political parties or candidates to publicly declare policies, goals, or opinions before an election. More recently, manifestos have gone mainstream and are used by companies, individuals, and groups to promote better work and life habits. There are even articles and blogs devoted to collecting inspirational manifestos or teaching us how to write a manifesto.

But when I started thinking about the idea of a “Security Manifesto” it was with the original intent in mind. As I wrote in my previous column, security needs to become a boardroom discussion, and having members with technology and cybersecurity expertise at the table is the only way for this to happen effectively. Today’s CISOs are candidates in the midst of a campaign, striving to ascend even higher in the organization: to the boardroom. Every candidate needs a platform upon which to run, and that’s where the manifesto comes in.

The high points of Marc’s principles (see his post for details) to underlie a security manifesto:

  1. Security must be considered a growth engine for the business.
  2. Security must work with existing architecture, and be usable.
  3. Security must be transparent and informative.
  4. Security must enable visibility and appropriate action.
  5. Security must be viewed as a “people problem.”

Marc’s principles are a great basis for a security manifesto but I would re-order them to make #5 “people problem” #1.

In part to counter management’s tendency to see people problems as amenable to technical solutions. If users cannot be motivated to use good security practices, buying additional technical solutions for security issues is a waste of resources. Such users need to become users at some other enterprise.

Apache Kafka 0.8.2.1 (and reasons to upgrade from 0.8.2)

Filed under: Kafka — Patrick Durusau @ 2:46 pm

Apache Kafka 0.8.2.1 has been released!

I don’t normally note point releases but a tweet by Michael G. Noll pointing to the 0.8.2.1 release notes caused this post.

The release notes read:

Bug

  • [KAFKA-1919] – Metadata request issued with no backoff in new producer if there are no topics
  • [KAFKA-1952] – High CPU Usage in 0.8.2 release
  • [KAFKA-1971] – starting a broker with a conflicting id will delete the previous broker registration
  • [KAFKA-1984] – java producer may miss an available partition

More than you can get into a tweet but still important information.

North Korea vs. TED Talk

Filed under: Humor,Language — Patrick Durusau @ 2:21 pm

Quiz: North Korean Slogan or TED Talk Sound Bite? by Dave Gilson.

From the post:

North Korea recently released a list of 310 slogans, trying to rouse patriotic fervor for everything from obeying bureaucracy (“Carry out the tasks given by the Party within the time it has set”) to mushroom cultivation (“Let us turn ours into a country of mushrooms”) and aggressive athleticism (“Play sports games in an offensive way, the way the anti-Japanese guerrillas did!”). The slogans also urge North Koreans to embrace science and technology and adopt a spirit of can-do optimism—messages that might not be too out of place in a TED talk.

Can you tell which of the following exhortations are propaganda from Pyongyang and which are sound bites from TED speakers? (Exclamation points have been added to all TED quotes to match North Korean house style.)

When you discover the source of the quote, do you change your interpretation of its reasonableness, etc.?

All I will say about my score is that either I need to watch far more TED talks and/or pay closer attention to North Korean Radio. 😉

Enjoy!

PS: I think a weekly quiz with White House, “terrorist” and Congressional quotes would be more popular than the New York Times Crossword puzzle.

Complexity Updates!

Filed under: Complexity,Fractals — Patrick Durusau @ 1:53 pm

The Complexity Explorer (Santa Fe Institute) has posted several updates to its homepage.

No new courses for Spring 2015. The break will be spent working on new mathematics modules, Vector and Matrix Algebra and Maximum Entropy Methods, due out later this year. Previous Santa Fe Complexity Courses are online.

If you need a complexity “fix” pushed at you, try Twitter or Facebook.

If you are more than a passive consumer of news, volunteers are needed for:

Subtitling videos (something was said about a T-shirt, check the site for details), and

Other volunteer opportunities.

Enjoy!

March 11, 2015

Selling Big Data to Big Oil

Filed under: BigData,Marketing — Patrick Durusau @ 7:42 pm

Oil firms are swimming in data they don’t use by Tom DiChristopher.

From the post:

McKinsey & Company wanted to know how much of the data gathered by sensors on offshore oil rigs is used in decision-making by the energy industry. The answer, it turns out, is not much at all.

After studying sensors on rigs around the world, the management consulting firm found that less than 1 percent of the information gathered from about 30,000 separate data points was being made available to the people in the industry who make decisions.
Technology that can deliver data on virtually every aspect of drilling, production and rig maintenance has spread throughout the industry. But the capability—or, in some cases, the desire—to process that data has spread nowhere near as quickly. As a result, drillers are almost certainly operating below peak performance—leaving money on the table, experts said.

Drilling more efficiently could also help companies achieve the holy grail—reducing the break-even cost of producing a barrel of oil, said Kirk Coburn, founder and managing director at Surge Ventures, a Houston-based energy technology investment firm.

Separately, a report by global business consulting firm Bain & Co. estimated that better data analysis could help oil and gas companies boost production by 6 to 8 percent. The use of so-called analytics has become commonplace in other industries from banking and airlines to telecommunications and manufacturing, but energy firms continue to lag.

Great article, although Tom does seem to assume that better data analysis will automatically lead to better results. It can, but I would rather under-promise and over-deliver, particularly in an industry without a lot of confidence in the services being offered.

Convolutional Neural Nets in Net#

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:19 pm

Convolutional Neural Nets in Net# by Alexey Kamenev.

From the post:

After introducing Net# in the previous post, we continue with our overview of the language and examples of convolutional neural nets or convnets.

Convnets have become very popular in recent years as they consistently produce great results on hard problems in computer vision, automatic speech recognition and various natural language processing tasks. In most such problems, the features have some geometric relationship, like pixels in an image or samples in audio stream. An excellent introduction to convnets can be found here:

https://www.coursera.org/course/neuralnets (Lecture 5)
http://deeplearning.net/tutorial/lenet.html

Before we start discussing convnets, let’s introduce one definition that is important to understand when working with Net#. In a neural net structure, each trainable layer (a hidden or an output layer) has one or more connection bundles. A connection bundle consists of a source layer and a specification of the connections from that source layer. All the connections in a given bundle share the same source layer and the same destination layer. In Net#, a connection bundle is considered as belonging to the bundle’s destination layer. Net# supports various kinds of bundles like fully connected, convolutional, pooling and so on. A layer might have multiple bundles which connect it to different source layers.
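If the bundle terminology is new to you, here is what the convolutional part boils down to. This is not Net#, just a minimal NumPy sketch of the 2-D convolution (strictly, cross-correlation) that a convolutional bundle computes for a single kernel; the toy image and kernel are invented.

```python
# Not Net# -- a minimal NumPy sketch of the 2-D convolution (strictly,
# cross-correlation) that a convolutional bundle computes for one kernel.
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge = np.array([[1.0, 0.0, -1.0]] * 3)            # simple vertical-edge kernel
print(conv2d(image, edge))                          # 3x3 feature map
```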

BTW, the previous post was: Neural Nets in Azure ML – Introduction to Net#. Not exactly what I was expecting by the Net# reference.

Machine Learning Blog needs to be added to your RSS feed.

If you need more information: Guide to Net# neural network specification language.

Enjoy!

I first saw this in a tweet by Michael Cavaretta.

3 Reasons Every Business Should Think About Location Intelligence

Filed under: GIS — Patrick Durusau @ 6:48 pm

3 Reasons Every Business Should Think About Location Intelligence

From the post:

The ease-of-use of mobile apps like Google Maps and Strava (which is now being used for urban planning) has inspired a lot of companies to start thinking differently about location. Consequently, a lot of IT professionals are getting asked to create location-based applications.

“IT is now being required to build spatially-aware or enabled applications, without a history of working with these technologies,” says Clarence Hempfield, Director of Product Management at Pitney Bowes. “That means organizations like Pitney Bowes have had to deliver these capabilities in such a way that a non-GIS expert can build and deliver spatial applications, without that foundation of years of working with the technology.”

This rapid consumerization of GIS technology has allowed anyone with a smartphone to use aspects of GIS technology with a few taps of their fingers, revealing valuable location intelligence data that they use to find new stores, directions and more… and consumers are expecting companies to follow suit.

I was struck by the contrast between the claim “…and consumers are expecting companies to follow suit,” and the three reasons given for location intelligence:

  1. Local Can Amplify Social.
  2. Data and Maps Can Help You Plan for the Future.
  3. Location Can Super-Charge BI

None of those reasons confer benefits upon consumers. Demographics are used to project a consumer’s expected choices, limiting your range of selection to the expectations of the business. What if you or I are outliers? Does that simply not count?

Consumer research on consumers wanting companies to track them and combine data with other data sources? It may well exist and if it does, please put a pointer in the comments.

If you are not one of those consumers who wants to be tracked, invest in a cellphone pouch that blocks tracking of your phone.

What’s all the fuss about Dark Data?…

Filed under: Dark Data,Transparency — Patrick Durusau @ 6:29 pm

What’s all the fuss about Dark Data? Big Data’s New Best Friend by Martyn Jones.

From the post:

Dark data, what is it and why all the fuss?

First, I’ll give you the short answer. The right dark data, just like its brother right Big Data, can be monetised – honest, guv! There’s loadsa money to be made from dark data by ‘them that want to’, and as value propositions go, seriously, what could be more attractive?

Let’s take a look at the market.

Gartner defines dark data as "the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes" (IT Glossary – Gartner)

Techopedia describes dark data as being data that is "found in log files and data archives stored within large enterprise class data storage locations. It includes all data objects and types that have yet to be analyzed for any business or competitive intelligence or aid in business decision making." (Techopedia – Cory Jannsen)

Cory also wrote that "IDC, a research firm, stated that up to 90 percent of big data is dark data."

In an interesting whitepaper from C2C Systems it was noted that "PST files and ZIP files account for nearly 90% of dark data by IDC Estimates." and that dark data is "Very simply, all those bits and pieces of data floating around in your environment that aren’t fully accounted for:" (Dark Data, Dark Email – C2C Systems)

Elsewhere, Charles Fiori defined dark data as "data whose existence is either unknown to a firm, known but inaccessible, too costly to access or inaccessible because of compliance concerns." (Shedding Light on Dark Data – Michael Shashoua)

Not quite the last insight, but in a piece published by Datameer, John Nicholson wrote that "Research firm IDC estimates that 90 percent of digital data is dark." And went on to state that "This dark data may come in the form of machine or sensor logs" (Shine Light on Dark Data – Joe Nicholson via Datameer)

Finally, Lug Bergman of NGDATA wrote this in a sponsored piece in Wired: "It" – dark data – "is different for each organization, but it is essentially data that is not being used to get a 360 degree view of a customer.

Well, I would say that 90% of 2.7 zettabytes (as of last October) of data being dark is a reason to be concerned.

But like the Wizard of Oz, Martyn knows what you are lacking, a data inventory:


You don’t need a Chief Data Officer in order to be able to catalogue all your data assets. However, it is still good idea to have a reliable inventory of all your business data, including the euphemistically termed Big Data and dark data.

If you have such an inventory, you will know:

What you have, where it is, where it came from, what it is used in, what qualitative or quantitative value it may have, and how it relates to other data (including metadata) and the business.

Really? A data inventory? What a relief to know the MDM (master data management) folks have been struggling for the past two decades for no reason. All they needed was a data inventory!

You might want to recall AnHai Doan’s observation for a single enterprise mapping project:

…the manual creation of semantic mappings has long been known to be extremely laborious and error-prone. For example, a recent project at the GTE telecommunications company sought to integrate 40 databases that have a total of 27,000 elements (i.e., attributes of relational tables) [LC00]. The project planners estimated that, without the database creators, just finding and documenting the semantic mappings among the elements would take more than 12 person years.

That’s right. One enterprise, 40 databases, 12 person years.

How that works out: person-years x 2.7 zettabytes = ???, no one knows.

Oh, why did I set aside the 90% “dark data” figure? Simple enough: the data AnHai was mapping wasn’t entirely “dark.” At least it had headers that were meaningful to someone. Unstructured data has no headers at all.

What Martyn is missing?

What is known about data is the measure of its darkness, not usage.

But supplying opaque terms (all terms are opaque to someone) for data only puts you into the AnHai situation. Either you enlist people who know the meanings of the terms and/or you create new meanings for them from scratch. Hopefully in the latter case you approximate the original meanings assigned to the terms.

If you want to improve on opaque terms, you need to provide alternative opaque terms that may be recognized by some future user instead of the primary opaque term you would use otherwise.

Make no mistake, it isn’t possible to escape opacity, but you can increase your odds that your data will be useful at some future point in time. How many alternatives yield some degree of future usefulness isn’t known.

So far as I know, the question hasn’t been researched. Every new set of opaque terms (read ontology, classification, controlled vocabulary) presents itself as possessing semantics for the ages. Given the number of such efforts, I find their confidence misplaced.

Getting Started with Apache Spark and Neo4j Using Docker Compose

Filed under: Graphs,Neo4j,Spark — Patrick Durusau @ 4:08 pm

Getting Started with Apache Spark and Neo4j Using Docker Compose by Kenny Bastani.

From the post:

I’ve received a lot of interest since announcing Neo4j Mazerunner. People from around the world have reached out to me and are excited about the possibilities of using Apache Spark and Neo4j together. From authors who are writing new books about big data to PhD researchers who need it to solve the world’s most challenging problems.

I’m glad to see such a wide range of needs for a simple integration like this. Spark and Neo4j are two great open source projects that are focusing on doing one thing very well. Integrating both products together makes for an awesome result.

Less is always more, simpler is always better.

Both Apache Spark and Neo4j are two tremendously useful tools. I’ve seen how both of these two tools give their users a way to transform problems that start out both large and complex into problems that become simpler and easier to solve. That’s what the companies behind these platforms are getting at. They are two sides of the same coin.

One tool solves for scaling the size, complexity, and retrieval of data, while the other is solving for the complexity of processing the enormity of data by distributed computation at scale. Both of these products are achieving this without sacrificing ease of use.

Inspired by this, I’ve been working to make the integration in Neo4j Mazerunner easier to install and deploy. I believe I’ve taken a step forward in this and I’m excited to announce it in this blog post.

….

Now for something a bit more esoteric than CSV. 😉

This guide will get you into Docker land as well.

Please share and forward.

Enjoy!

In Praise of CSV

Filed under: CSV,Dark Data — Patrick Durusau @ 3:23 pm

In Praise of CSV by Waldo Jaquith

From the post:

Comma Separated Values is the file format that open data advocates love to hate. Compared to JSON, CSV is clumsy; compared to XML, CSV is simplistic. Its reputation is as a tired, limited, it’s-better-than-nothing format. Not only is that reputation undeserved, but CSV should often be your first choice when publishing data.

It’s true—CSV is tired and limited, though superior to not having data, but there’s another side to those coins. One man’s tired is another man’s established. One man’s limited is another man’s focused. And “better than nothing” is, in fact, better than nothing, which is frequently the alternative to producing CSV.

A bit further on:


The lack of typing makes schemas generally impractical, and as a result validation of field contents is also generally impractical.

There is ongoing work to improve that situation at the CSV on the Web Working Group (W3C). As of today, see: Metadata Vocabulary for Tabular Data, W3C Editor’s Draft 11 March 2015.

The W3C work is definitely a step in the right direction but even if you “know” a field heading or its data type, do you really “know” the semantics of that field? Assume you have a floating point number: is it “pound-seconds” or “newton-seconds”? Mars orbiters really need to know.
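Even a dirt-simple sidecar description helps. Here is a sketch in Python, a homegrown convention in the spirit of the W3C metadata vocabulary rather than the vocabulary itself; the column names, values, and units are invented.

```python
# A sketch only: a homegrown "sidecar" description for a CSV, in the spirit of
# the W3C tabular-data metadata work but not its actual vocabulary. The column
# names, values, and units are invented for illustration.
import csv
import io

data = io.StringIO("burn_id,impulse\nA1,212.4\nA2,198.7\n")

columns = {
    "burn_id": {"datatype": str, "description": "engine burn identifier"},
    "impulse": {"datatype": float, "unit": "newton-seconds"},  # not pound-seconds!
}

for row in csv.DictReader(data):
    typed = {name: columns[name]["datatype"](value) for name, value in row.items()}
    print(typed, "| impulse unit:", columns["impulse"]["unit"])
```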

Perhaps CSV files are nearly the darkest dark data with a structure. Even with field names and data types, the semantics of any field and/or its relationship to other fields, remains a mystery.

It may be the case that within a week, month or year, someone may remember the field semantics but what of ten (10) years or even one hundred (100) years from now?

termsql

Filed under: Spark,SQL,SQLite,Text Analytics — Patrick Durusau @ 2:06 pm

termsql

From the webpage:

Convert text from a file or from stdin into SQL table and query it instantly. Uses sqlite as backend. The idea is to make SQL into a tool on the command line or in scripts.

Online manual: http://tobimensch.github.io/termsql

So what can it do?

  • convert text/CSV files into sqlite database/table
  • work on stdin data on-the-fly
  • it can be used as swiss army knife kind of tool for extracting information from other processes that send their information to termsql via a pipe on the command line or in scripts
  • termsql can also pipe into another termsql of course
  • you can quickly sort and extract data
  • creates string/integer/float column types automatically
  • gives you the syntax and power of SQL on the command line

Sometimes you need the esoteric and sometimes not!
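If you would rather see the moving parts, the same idea fits in a few lines of plain Python and sqlite3. This is the equivalent idea, not termsql's own command-line syntax.

```python
# The same idea in plain Python + sqlite3, not termsql's own syntax: load
# whitespace-delimited text from stdin into a table and query it with SQL.
import sqlite3
import sys

rows = [line.split() for line in sys.stdin if line.strip()]
ncols = max(len(r) for r in rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tbl (%s)" % ", ".join("col%d TEXT" % i for i in range(ncols)))
db.executemany(
    "INSERT INTO tbl VALUES (%s)" % ", ".join("?" * ncols),
    [r + [None] * (ncols - len(r)) for r in rows],
)

query = "SELECT col0, COUNT(*) FROM tbl GROUP BY col0 ORDER BY 2 DESC LIMIT 5"
for row in db.execute(query):
    print(row)
```

Saved as, say, textsql.py (a name I made up), you could run ps aux | python3 textsql.py to see which users own the most processes.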

Enjoy!

I first saw this in a tweet by Christophe Lalanne.
