Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 12, 2015

Estimating “known unknowns”

Filed under: Data Science,Mathematics,Probability,Proofing,Statistics — Patrick Durusau @ 4:36 pm

Estimating “known unknowns” by Nick Berry.

From the post:

There’s a famous quote from former Secretary of Defense Donald Rumsfeld:

“ … there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

I write this blog. I’m an engineer. Whilst I do my best and try to proof read, often mistakes creep in. I know there are probably mistakes in just about everything I write! How would I go about estimating the number of errors?

The idea for this article came from a book I recently read by Paul J. Nahin, entitled Duelling Idiots and Other Probability Puzzlers (In turn, referencing earlier work by the eminent mathematician George Pólya).

Proof Reading

Imagine I write a (non-trivially short) document and give it to two proof readers to check. These two readers (independently) proof read the manuscript looking for errors, highlighting each one they find.

Just like me, these proof readers are not perfect. They, also, are not going to find all the errors in the document.

Because they work independently, there is a chance that reader #1 will find some errors that reader #2 does not (and vice versa), and there could be errors that are found by both readers. What we are trying to do is get an estimate for the number of unseen errors (errors detected by neither of the proof readers).*

*An alternate way of thinking of this is to get an estimate for the total number of errors in the document (from which we can subtract the distinct number of errors found to give an estimate of the number of unseen errors).

A highly entertaining post on estimating “known unknowns,” such as the number of errors in a paper that has been proofed by two independent proof readers.
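For reference, the estimate Nick works through is the classic two-reader capture-recapture calculation (the one Pólya applied to the proofreading problem). Here is a minimal sketch in Python, assuming the only inputs are each reader’s error count and the size of the overlap; the variable names are mine, not Nick’s:

```python
def estimate_total_errors(found_by_a, found_by_b, found_by_both):
    """Lincoln-Petersen style estimate for the two-proofreader problem.

    found_by_a    -- number of errors reader A flagged
    found_by_b    -- number of errors reader B flagged
    found_by_both -- number of errors flagged by both readers
    """
    if found_by_both == 0:
        raise ValueError("No overlap between readers; the estimate is undefined.")
    total = (found_by_a * found_by_b) / found_by_both
    distinct_found = found_by_a + found_by_b - found_by_both
    unseen = total - distinct_found
    return total, unseen

# Example: A finds 20 errors, B finds 15, and 10 errors are on both lists.
total, unseen = estimate_total_errors(20, 15, 10)
print(f"Estimated total errors: {total:.0f}, estimated unseen: {unseen:.0f}")
# -> Estimated total errors: 30, estimated unseen: 5
```

With those numbers the readers found 25 distinct errors between them, so roughly 5 errors remain unseen.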

Of more than passing interest to me because I am involved in a New Testament Greek Lexicon project that is an XML encoding of a 500+ page Greek lexicon.

The working text is in XML, but not every feature of the original lexicon was captured in markup and even if that were true, we would still want to improve upon features offered by the lexicon. All of which depend upon the correctness of the original markup.

You will find Nick’s analysis interesting and more than that, memorable. Just in case you are asked about “estimating ‘known unknowns'” in a data science interview.

Only Rumsfeld could tell you how to estimate the “unknown unknowns.” I think it goes: “Watch me pull a number out of my ….”

😉

I found this post by following another post at this site, which was cited by Data Science Renee.

For Linguists on Your Holiday List

Filed under: Humanities,Humor,Linguistics,Semiotics — Patrick Durusau @ 3:55 pm

Hey Linguists!—Get Them to Get You a Copy of The Speculative Grammarian Essential Guide to Linguistics.

From the website:

Hey Linguists! Do you know why it is better to give than to receive? Because giving requires a lot more work! You have to know what someone likes, what someone wants, who someone is, to get them a proper, thoughtful gift. That sounds like a lot of work.

No, wait. That’s not right. It’s actually more work to be the recipient—if you are going to do it right. You can’t just trust people to know what you like, what you want, who you are.

You could try to help your loved ones understand a linguist’s needs and wants and desires—but you’d have to give them a mini course on historical, computational, and forensic linguistics first. Instead, you can assure them that SpecGram has the right gift for you—a gift you, their favorite linguist, will treasure for years to come: The Speculative Grammarian Essential Guide to Linguistics.

So drop some subtle or not-so-subtle hints and help your loved ones do the right thing this holiday season: gift you with this hilarious compendium of linguistic sense and nonsense.

If you need to convince your friends and family that they can’t find you a proper gift on their own, send them one of the images below, and try to explain to them why it amuses you. That’ll show ’em! (More will be added through the rest of 2015, just in case your friends and family are a little thick.)

• If guilt is more your style, check out 2013’s Sad Holiday Linguists.

• If semi-positive reinforcement is your thing, check out 2014’s Because You Can’t Do Everything You Want for Your Favorite Linguist.

Disclaimer: I haven’t proofed the diagrams against the sources cited. Rely on them at your own risk. 😉

There are others, but the Hey Semioticians! image reminded me of John Sowa (sorry John):

[Image: Hey Semioticians!]

The greatest mistake across all disciplines is taking ourselves (and our positions) far too seriously.

Enjoy!

Prismatic Closes December 20th, 2015

Filed under: Graphs,News — Patrick Durusau @ 3:33 pm

Writing the next chapter for Prismatic

From the post:

Four years ago, we set out to build a personalized news reader that would change the way people consume content. For many of you, we did just that. But we also learned content distribution is a tough business and we’ve failed to grow at a rate that justifies continuing to support our Prismatic News products.

Beginning December 20th, 2015, our iOS, Android and web news reader products will no longer be available and access to our Interest Graph APIs will be discontinued.

Once the product is shut down, you will no longer have access to the stories you’ve read or saved within Prismatic. We recommend saving anything you want to remember by adding the story to Pocket, emailing it to yourself or copying and pasting the links into another document.

Thanks to all of you who supported us during this journey; we hope you’ve learned lots of interesting things along the way.

The Prismatic Team

December 20, 2015 will be here sooner than you think.

Much could be said about viable Internet business models, economies, etc., but for the moment, concentrate on preserving content that will go dark by December 20, 2015.

Repost and tweet this to your friends and followers.

Yes, I think this is sad. In the short term, however, preservation of content should be the overriding concern.

Best of luck to the Prismatic Team.

Data scientists: Question the integrity of your data [Relevance/Fitness – Not “Integrity”]

Filed under: Data Mining,Modeling — Patrick Durusau @ 3:19 pm

Data scientists: Question the integrity of your data by Rebecca Merrett.

From the post:

If there’s one lesson website traffic data can teach you, it’s that information is not always genuine. Yet, companies still base major decisions on this type of data without questioning its integrity.

At ADMA’s Advancing Analytics in Sydney this week, Claudia Perlich, chief scientist of Dstillery, a marketing technology company, spoke about the importance of filtering out noisy or artificial data that can skew an analysis.

“Big data is killing your metrics,” she said, pointing to the large portion of bot traffic on websites.

“If the metrics are not really well aligned with what you are truly interested in, they can find you a lot of clicking and a lot of homepage visits, but these are not the people who will buy the product afterwards because they saw the ad.”

Predictive models that look at which users go to some brands’ home pages, for example, are open to being completely flawed if data integrity is not called into question, she said.

“It turns out it is much easier to predict bots than real people. People write apps that skim advertising, so a model can very quickly pick up what that traffic pattern of bots was; it can predict very, very well who would go to these brands’ homepages as long as there was bot traffic there.”

The predictive model in this case will deliver accurate results when testing its predictions. However, that doesn’t bring marketers or the business closer to reaching its objective of real human ad conversions, Perlich said.

The on-line Merriam-Webster defines “integrity” as:

  1. firm adherence to a code of especially moral or artistic values : incorruptibility
  2. an unimpaired condition : soundness
  3. the quality or state of being complete or undivided : completeness

None of those definitions of “integrity” apply to the data Perlich describes.

What Perlich criticizes is measuring data with no relationship to the goal of the analysis, “…human ad conversions.”

That’s not “integrity” of data. Perhaps appropriate/fitness for use or relevance but not “integrity.”

Avoid vague and moralizing terminology when discussing data and data science.

Discussions of ethics are difficult enough without introducing confusion with unrelated issues.

I first saw this in a tweet by Data Science Renee.

December 11, 2015

Introducing OpenAI [Name Surprise: Not SkyNet II or Terminator]

Filed under: Artificial Intelligence,EU,Machine Learning — Patrick Durusau @ 11:55 pm

Introducing OpenAI by Greg Brockman, Ilya Sutskever, and the OpenAI team.

From the webpage:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

Since our research is free from financial obligations, we can better focus on a positive human impact. We believe AI should be an extension of individual human wills and, in the spirit of liberty, as broadly and evenly distributed as is possible safely.

The outcome of this venture is uncertain and the work is difficult, but we believe the goal and the structure are right. We hope this is what matters most to the best in the field.

Background

Artificial intelligence has always been a surprising field. In the early days, people thought that solving certain tasks (such as chess) would lead us to discover human-level intelligence algorithms. However, the solution to each task turned out to be much less general than people were hoping (such as doing a search over a huge number of moves).

The past few years have held another flavor of surprise. An AI technique explored for decades, deep learning, started achieving state-of-the-art results in a wide variety of problem domains. In deep learning, rather than hand-code a new algorithm for each problem, you design architectures that can twist themselves into a wide range of algorithms based on the data you feed them.

This approach has yielded outstanding results on pattern recognition problems, such as recognizing objects in images, machine translation, and speech recognition. But we’ve also started to see what it might be like for computers to be creative, to dream, and to experience the world.

Looking forward

AI systems today have impressive but narrow capabilities. It seems that we’ll keep whittling away at their constraints, and in the extreme case they will reach human performance on virtually every intellectual task. It’s hard to fathom how much human-level AI could benefit society, and it’s equally hard to imagine how much it could damage society if built or used incorrectly.

OpenAI

Because of AI’s surprising history, it’s hard to predict when human-level AI might come within reach. When it does, it’ll be important to have a leading research institution which can prioritize a good outcome for all over its own self-interest.

We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.

OpenAI’s research director is Ilya Sutskever, one of the world experts in machine learning. Our CTO is Greg Brockman, formerly the CTO of Stripe. The group’s other founding members are world-class research engineers and scientists: Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, John Schulman, Pamela Vagata, and Wojciech Zaremba. Pieter Abbeel, Yoshua Bengio, Alan Kay, Sergey Levine, and Vishal Sikka are advisors to the group. OpenAI’s co-chairs are Sam Altman and Elon Musk.

Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.

You can follow us on Twitter at @open_ai or email us at info@openai.com.

Seeing that Elon Musk is the co-chair of this project I was surprised the name wasn’t SkyNet II or Terminator. But OpenAI is a more neutral name and, given the planned transparency of the project, a good one.

I also appreciate the project not being engineered for the purpose of spending money over a ten year term. Doing research first and then formulating plans for the next step in research sounds like a more sensible plan.

Whether any project ever achieves “artificial intelligence” equivalent to human intelligence or not, this project may be a template for how to usefully explore complex scientific questions.

d3.compose [Charts as Devices of Persuasion]

Filed under: Charts,D3,Graphics,Visualization — Patrick Durusau @ 10:17 pm

d3.compose

Another essential but low-level data science skill, data-driven visualizations!

From the webpage:

Composable

Create small and sharp charts/components that do one thing well (e.g. Bars, Lines, Legend, Axis, etc.) and compose them to create complex visualizations.

d3.compose works great with your existing charts (even those from other libraries) and it is simple to extend/customize the built-in charts and components.

Automatic Layout

When creating complex charts with D3.js and d3.chart, laying out and sizing parts of the chart are often manual processes.
With d3.compose, this process is automatic:

  • Automatically size and position components
  • Layer components and charts by z-index
  • Responsive by default, with automatic scaling

Why d3.compose?

  • Customizable: d3.compose makes it easy to extend, layout, and refine charts/components
  • Reusable: By breaking down visualizations into focused charts and components, you can quickly reconfigure and reuse your code
  • Integrated: It’s straightforward to use your existing charts or charts from other libraries with d3.compose to create just the chart you’re looking for

Don’t ask me why but users/executives are impressed by even simple charts.

(shrugs) I have always assumed that people use charts to avoid revealing the underlying data and what they did to it before making the chart.

That’s not very charitable but I have never been disappointed in assuming incompetence and/or malice in chart preparation.

People prepare charts because they are selling you a point of view. It may be a “truthful” point of view, at least in their minds but it is still an instrument of persuasion.

Use well-constructed charts to persuade others to your point of view and be on guard for the use of charts to persuade you. Both of those principles will serve you well as a data scientist.

Cleaning CSV Data… [Interview Questions?]

Filed under: CSV,Python — Patrick Durusau @ 10:03 pm

Cleaning CSV Data Using the Command Line and csvkit, Part 1 by Srini Kadamati.

From the post:

The Museum of Modern Art is one of the most influential museums in the world and they have released a dataset on the artworks in their collection. The dataset has some data quality issues, however, and requires cleanup.

In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we’ll learn about how to use the csvkit library to acquire and explore tabular data.

Why the command line?

Great question! When working in cloud data science environments, you sometimes only have access to a server’s shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job. Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy.

Some experience working in the command line is expected for this post. If you’re new to the command line, I recommend checking out our interactive command line course.

csvkit

csvkit is a library optimized for working with CSV files. It’s written in Python but the primary interface is the command line. You can install csvkit using pip:

pip install csvkit

You’ll need this library to follow along with this post.
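To give a flavor of what follows in the series, here is a rough sketch of a first pass over a CSV file with csvkit. The file and column names are placeholders rather than the actual MoMA dataset fields, and I am driving the command-line tools from Python only to keep all the examples here in one language; at the shell you would simply run csvcut, csvstat and friends directly, as Srini does:

```python
import subprocess

def run(cmd):
    """Run a csvkit command-line tool and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# List the column names in the file (hypothetical file name).
print(run(["csvcut", "-n", "artworks.csv"]))

# Keep only a few columns of interest (hypothetical column names).
subset = run(["csvcut", "-c", "Title,Date,Medium", "artworks.csv"])
with open("artworks_subset.csv", "w", encoding="utf-8") as f:
    f.write(subset)

# Summary statistics: row counts, nulls, most common values, and so on.
print(run(["csvstat", "artworks_subset.csv"]))
```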

If you want to be a successful data scientist, may I suggest you follow this series and similar posts on data cleaning techniques?

Reports vary, but the general figure is that 50% to 90% of a data scientist’s time is spent cleaning data. Report: Data scientists spend bulk of time cleaning up

Being able to clean data, the 50% to 90% of your future duties, may not get you past the data scientist interview.

There are several 100+ data scientist interview question sets that don’t have any questions about data cleaning.

Seriously, not a single question.

I won’t name names in order to protect the silly, but I can say that SAS does have one data cleaning question out of twenty. Err, that’s 5% for those of you comparing to the duties of a data scientist at 50% to 90%. Of course, the others I reviewed had 0% out of 50% to 90%, so they were even worse.

Oh, the SAS question on data cleaning:

Give examples of data cleaning techniques you have used in the past.

You have to wonder about a data science employer who asks so many questions unrelated to the day to day duties of data scientists.

Maybe when asked some arcane question you can ask back:

And when in the last six (6) months has your average data scientist hire used that concept/technique?

It might not land you a job but do you really want to work at a firm that can’t apply data science to its own hiring process?

Data science employers, heal yourselves!

PS: I rather doubt most data science interviewers understand the epistemological assumptions behind most algorithms so you can memorize a bit of that for your interview.

It will convince them that customers will believe your success is just short of divine intervention in their problem.

It’s an old but reliable technique.

Flashback: Breaking Coffee DRM in 2014

Filed under: DRM — Patrick Durusau @ 8:14 pm

Cory Doctorow tweeted a post from 2014: Defeat Keurig’s K-Cup DRM with a single piece of tape.

It’s difficult to imagine a more environmentally unfriendly coffee maker than those by Keurig.

For every cup of coffee it brews, it adds to landfill waste. Yeah, for every cup, the environment is incrementally diminished. Not by much for any one cup but imagine the thousands of cups per day that pour (sorry) from Keurig machines.

Normally I enjoy stories of breaking DRM efforts but in this particular case, it only encourages more environmentally unfriendly companies to spring up manufacturing the same wasteful products as Keurig.

The best way to deal with a Keurig machine is to superglue or weld the damned thing shut. That will decrease the demand for more outlets selling environmentally unfriendly forms of coffee. Well, not just one machine, there needs to be an epidemic of people sealing off their own machines.

Working from home I do quite well with a late 1950’s/mid-1960’s drip pot that requires only hot water and coffee. Nothing disposable except for coffee grounds and they go in the compost heap. Well, and the coffee bag that goes into recycling.

Make 2016 the year when the conspicuous consumption and waste of Keurig coffee machines ends.

PS: A common pot of coffee also saves time by narrowing the range of choices: the coffee is hot and black or the pot is empty. Fewer choices, quicker turn around at the coffee machine. 😉

12 Habits = 12 Steps = 12 Days of Christmas?

Filed under: Humor — Patrick Durusau @ 7:57 pm

If your phone isn’t lit up by every imaginable charity, from the worthy to the “I didn’t know that was a problem” kind, please post a reply with your secret.

For those of you without an answer to that question, see Matt Bors and These 12 Habits Are Absolutely Slaughtering Your Productivity if you need discussion topics for the office party or elsewhere.

For some habits, you should take into account how much sharing is too much sharing, depending upon your audience and whether you have the power to hire or fire them.

Enjoy!

December 10, 2015

Kidnapping Caitlynn (47 AKAs – Is There a Topic Map in the House?)

Filed under: Semantic Diversity,Semantic Inconsistency,Topic Maps — Patrick Durusau @ 10:05 pm

Kidnapping Caitlynn is 10 minutes long, but has accumulated forty-seven (47) AKAs.

Imagine the search difficulty in finding reviews under all forty-eight (48) titles.
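A toy sketch of the fix a topic map suggests: collapse every title variant onto a single subject identifier and index reviews against that identifier rather than the title string. The alternate titles below are placeholders of my own, not the actual forty-seven AKAs:

```python
# Every known title variant points at one subject identifier.
SUBJECT = "film:kidnapping-caitlynn"
AKAS = ["Kidnapping Caitlynn", "Caitlynn Taken", "La petite Caitlynn"]  # placeholders

title_to_subject = {title.casefold(): SUBJECT for title in AKAS}

# Reviews are filed under the subject, not under whichever title a reviewer used.
reviews_by_subject = {SUBJECT: ["Tense ten minutes.", "A great short."]}

def find_reviews(query_title):
    """Return reviews for a film no matter which AKA the searcher typed."""
    subject = title_to_subject.get(query_title.casefold())
    return reviews_by_subject.get(subject, [])

print(find_reviews("la petite caitlynn"))  # finds both reviews
```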

Even better, imagine your search request was for something that really mattered.

Like known terrorists crossing national borders using their real names and passports.

Intelligence services aren’t doing all that hot even with string to string matches.

Perhaps that explains their inability to consider more sophisticated doctrines of identity.

If you can’t do string to string, more complex notions will grind your system to a halt.

Maybe intelligence agencies need new contractors. You think?

IoT: The New Tower of Babel

[Image: Tower of Babel]

Luke Anderson‘s post at Clickhole, titled: Humanity Could Totally Pull Off The Tower Of Babel At This Point, was a strong reminder of the Internet of Things (IoT).

See what you think:

If you went to Sunday school, you know the story: After the Biblical flood, the people of earth came together to build the mighty Tower of Babel. Speaking with one language and working tirelessly, they built a tower so tall that God Himself felt threatened by it. So, He fractured their language so that they couldn’t understand each other, construction ceased, and mankind spread out across the ancient world.

We’ve come a long way in the few millennia since then, and at this point, humanity could totally pull off the Tower of Babel.

Just look at the feats of human engineering we’ve accomplished since then: the Great Wall; the Golden Gate Bridge; the Burj Khalifa. And don’t even get me started on the International Space Station. Building a single tall building? It’d be a piece of cake.

Think about it. Right off the bat, we’d be able to communicate with each other, no problem. Besides most of the world speaking either English, Spanish, and/or Chinese by now, we’ve got translators, Rosetta Stone, Duolingo, the whole nine yards. Hell, IKEA instructions don’t even have words and we have no problem putting their stuff together. I can see how a guy working next to you suddenly speaking Arabic would throw you for a loop a few centuries ago. But now, I bet we could be topping off the tower and storming heaven in the time it took people of the past to say “Hey, how ya doing?”

Compare this Internet of Things statement from the Masters of Contracts that Yield No Useful Result:


IoT implementation, at its core, is the integration of dozens and up to tens of thousands of devices seamlessly communicating with each other, exchanging information and commands, and revealing insights. However, when devices have different usage scenarios and operating requirements that aren’t compatible with other devices, the system can break down. The ability to integrate different elements or nodes within broader systems, or bringing data together to drive insights and improve operations, becomes more complicated and costly. When this occurs, IoT can’t reach its potential, and rather than an Internet of everything, you see siloed Internets of some things.

The first, in case you can’t tell from it being posted at Clickhole, was meant as sarcasm or humor.

The second was deadly serious from folks who would put a permanent siphon on your bank account. Whether their services are cost effective or not is up to you to judge.

The Tower of Babel is a statement about semantics and the human condition. It should come as no surprise that we all prefer our language over that of others, whether those are natural or programming languages. Moreover, judging from code reuse, to say nothing of the publishing market, we prefer our restatements of the material, despite equally useful statements by others.

How else would you explain the proliferation of MS Excel books? 😉 One really good one is more than enough. Ditto for Bible translations.

Creating new languages to “fix” semantic diversity just adds another partially adopted language to the welter of languages that need to be integrated.

The better option, at least from my point of view, is to create mappings between languages, mappings that are based on key/value pairs to enable others to build upon, contract or expand those mappings.

It simply isn’t possible to foresee every use case or language that needs semantic integration but if we perform such semantic integration as returns ROI for us, then we can leave the next extension or contraction of that mapping to the next person with a different ROI.
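To make the key/value idea concrete, here is a toy sketch of my own (not any particular IoT vocabulary or standard): two device vocabularies report the same reading under different keys, a mapping of key/value pairs normalizes them, and anyone with a different ROI can extend the mapping later without touching the originals:

```python
# Two hypothetical device vocabularies reporting the same reading differently.
vendor_a = {"tempC": 21.5}
vendor_b = {"temperature_f": 70.7}

# A mapping is just key/value pairs: source key -> (canonical key, converter).
mapping = {
    "tempC": ("temperature_celsius", lambda v: v),
    "temperature_f": ("temperature_celsius", lambda v: (v - 32) * 5 / 9),
}

def normalize(reading, mapping):
    """Translate a device reading into the canonical vocabulary."""
    out = {}
    for key, value in reading.items():
        canonical, convert = mapping.get(key, (key, lambda v: v))
        out[canonical] = convert(value)
    return out

print(normalize(vendor_a, mapping))  # {'temperature_celsius': 21.5}
print(normalize(vendor_b, mapping))  # {'temperature_celsius': 21.5}

# Someone with a different ROI can extend the mapping later:
mapping["temp_k"] = ("temperature_celsius", lambda v: v - 273.15)
```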

It’s heady stuff to think we can cure the problem represented by the legendary Tower of Babel, but there is a name for that. It’s called hubris and it never leads to a good end.

Cursive 1.0, Gingerbread Cuneiform Tablets, XQuery

Filed under: Clojure,ClojureScript,Editor,XQuery — Patrick Durusau @ 4:32 pm

Cursive 1.0

From the webpage:

Cursive 1.0 is finally here! Here’s everything you need to know about the release.

One important thing first, we’ve had to move some things around. Our website is now at cursive-ide.com, we’re on Twitter at @CursiveIDE, and our Github handle is now cursive-ide. Hopefully everything should be redirected as much as possible, but let me know if you can’t find anything or if you find anything where it shouldn’t be. The main Cursive email address is now cursive@cursive-ide.com but the previous one will continue to work. The mailing lists will continue to use the previous domain for the moment.

Cursive 1.0 is considered a stable build, and is now in the JetBrains plugin repo (here). Going forward, we’ll have a stable release channel which will have new builds every 2-3 months, and an EAP channel which will have builds more or less as they have been during the EAP program. Stable builds will be published to the JetBrains repo, and EAP builds will continue to be published to our private repo. You’ll be prompted at startup which channel you’d like to use, and Cursive will set up the new URLs for the EAP channel automatically if required. Stable builds will be numbered x.y.0, and EAP builds will use x.y.z numbering.

From https://cursive-ide.com/:

Cursive is available as an IntelliJ plugin for use with the Community or Ultimate editions, and will be available in the future as a standalone Clojure-focused IDE. It is a commercial product, with a free non-commercial licence for open-source work, personal hacking, and student work.

I’m not brave enough to install software on someone’s computer as a “surprise” holiday present. Unless you are a sysadmin, I suggest you don’t either.

Of course, you could take the UPenn instructions for making gingerbread cuneiform tablets and make a “card” for the lucky recipient of a license for the full version of Cursive 1.0.

For that matter, you could inscribe smaller gingerbread pieces with bits of XQuery syntax and play holiday games with arbitrary XML documents. Timed trials for the most imaginative XQuery, shortest, etc.

PS: Suggestions on an XQuery corpus to determine the frequency of XQuery syntax representations? “For,” “(,” and “)” are likely the most common ones, but that’s just guessing on my part.
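I don’t have a corpus to suggest, but once one is in hand the counting itself is trivial. A rough Python sketch, with a deliberately crude tokenizer and a hypothetical corpus directory:

```python
import collections
import pathlib
import re

counts = collections.Counter()
for path in pathlib.Path("xquery-corpus").glob("**/*.xq"):  # hypothetical directory
    text = path.read_text(encoding="utf-8", errors="ignore")
    # Crude tokenization: names (for, let, where, return, ...) plus parentheses and braces.
    counts.update(re.findall(r"[A-Za-z][\w-]*|[(){}]", text))

for token, n in counts.most_common(10):
    print(f"{token}\t{n}")
```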

Nutch 1.11 Release

Filed under: Nutch,Search Engines — Patrick Durusau @ 3:56 pm

Nutch 1.11 Release

From the homepage:

The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release.

This release is the result of many months of work and around 100 issues addressed. For a complete overview of these issues please see the release report.

As usual in the 1.X series, release artifacts are made available as both source and binary and also available within Maven Central as a Maven dependency. The release is available from our DOWNLOADS PAGE.

I have played with Nutch but never really taken advantage of it as a day-to-day discovery tool.

I don’t need to boil the Internet ocean to cover well over 90 to 95% of all the content that interests me on a day to day basis.

Moreover, searching a limited part of the Internet would enable things like granular date sorting, not just “within a week, month, last year” buckets.

Not to mention I could abandon the never sufficiently damned page-rank sorting of search results. Maybe you need to look “busy” as you sort through search result cruft, time and again, but I have other tasks to fill my time.

Come to think of it, as I winnow through search results, I could annotate, tag, mark, choose your terminology, such that a subsequent search turns up my evaluation, ranking, preference among those items.

Try that with Google, Bing or other general search appliance.

This won’t be an end of 2015 project, mostly because I am trying to learn a print dictionary layout from the 19th century for representation in granular markup and other tasks are at hand.

However, in early 2016 I will grab the Nutch 1.11 release and see if I can put it into daily use. More on that in 2016.

BTW, what projects are you going to be pursuing in 2016?

Paradise Lost (John MILTON, 1608 – 1674) Audio Version

Filed under: Audio,Books,Interface Research/Design,Literature — Patrick Durusau @ 2:57 pm

Paradise Lost (John MILTON, 1608 – 1674) Audio Version.

As you know, John Milton was blind when he wrote Paradise Lost. His only “interface” for writing, editing and correcting was aural.

Shoppers and worshipers need to attend very closely to the rhetoric of the season. Listening to Paradise Lost even as Milton did, may sharpen your ear for rhetorical devices and words that would otherwise pass unnoticed.

For example, what are the “good tidings” of Christmas hymns? Are they about the “…new born king…” or are they anticipating the sacrifice of that “…new born king…” instead of ourselves?

The first seems traditional and fairly benign, the second, seems more self-centered and selfish than the usual Christmas holiday theme.

If you think that is an aberrant view of the holiday, consider that in A Christmas Carol by Charles Dickens, Scrooge, spoiler alert, ends the tale by keeping Christmas in his heart all year round.

One of the morals being that we should treat others kindly and with consideration every day of the year. Not as some modern Christians do, half-listening at an hour long service once a week and spending the waking portion of the other 167 hours not being Christians.

Paradise Lost is a complex and nuanced text. Learning to spot its rhetorical moves and devices will make you a more discerning observer of modern discourse.

Enjoy!

The Preservation of Favoured Traces [Multiple Editions of Darwin]

Filed under: Books,Graphics,Text Encoding Initiative (TEI),Visualization,XQuery — Patrick Durusau @ 1:19 pm

The Preservation of Favoured Traces

From the webpage:

Charles Darwin first published On the Origin of Species in 1859, and continued revising it for several years. As a result, his final work reads as a composite, containing more than a decade’s worth of shifting approaches to his theory of evolution. In fact, it wasn’t until his fifth edition that he introduced the concept of “survival of the fittest,” a phrase that actually came from philosopher Herbert Spencer. By color-coding each word of Darwin’s final text by the edition in which it first appeared, our latest book and poster of his work trace his thoughts and revisions, demonstrating how scientific theories undergo adaptation before their widespread acceptance.

The original interactive version was built in tandem with exploratory and teaching tools, enabling users to see changes at both the macro level, and word-by-word. The printed poster allows you to see the patterns where edits and additions were made and—for those with good vision—you can read all 190,000 words on one page. For those interested in curling up and reading at a more reasonable type size, we’ve also created a book.

The poster and book are available for purchase below. All proceeds are donated to charity.

For textual history fans this is an impressive visualization of the various editions of On the Origin of Species.

To help students get away from the notion of texts as static creations, plus to gain some experience with markup, consider choosing a well known work that has multiple editions and is available in TEI.

Then have the students write XQuery expressions to transform a chapter of such a work into a later (or earlier) edition.

Depending on the quality of the work, that could be a means of contributing to the number of TEI encoded texts and your students would gain experience with both TEI and XQuery.

The Quartz guide to bad data

Filed under: Journalism,News,Reporting — Patrick Durusau @ 11:50 am

The Quartz guide to bad data

From the webpage:

An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

As a reporter your world is full of data. And those data are full of problems. This guide presents thorough descriptions and possible solutions to many of the kinds of problems that you will encounter when working with data.

Most of these problems can be solved. Some of them can’t be solved and that means you should not use the data. Others can’t be solved, but with precautions you can continue using the data. In order to allow for these ambiguities, this guide is organized by who is best equipped to solve the problem: you, your source, an expert, etc. In the description of each problem you may also find suggestions for what to do if that person can’t help you.

You can not possibly review every dataset you encounter for all of these problems. If you try to do that you will never get anything published. However, by familiarizing yourself with the kinds of issues you are likely to encounter you will have a better chance of identifying an issue before it causes you to make a mistake.

If you have questions about this guide please email Chris. Good luck!

I hesitate at the word exhaustive for all but the most trivial of collections.

Saying “this is a guide to the most common bad data problems encountered by journalists” comes closer to the mark.

It makes a great checklist to develop your own habit of routine data checks that you apply to all data, even data from trusted sources. (Perhaps even more so from trusted sources.)

Enjoy!

FBI Official Acknowledges Using Top Secret Hacking Weapons

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 11:11 am

FBI Official Acknowledges Using Top Secret Hacking Weapons by Robert Hackett.

From the post:

The Federal Bureau of Investigation recently made an unprecedented admission: It uses undisclosed software vulnerabilities when hacking suspects’ computers.

Amy Hess, head of the FBI’s science and technology arm, recently went on the record about the practice with the Washington Post. “Hess acknowledged that the bureau uses zero-days,” the Post reported on Tuesday, using industry-speak for generally unknown computer bugs. The name derives from the way such flaws blind side security pros. By the time attackers have begun taking advantage of these coding flubs, software engineers are left with zero days to fix them.

Never before has an FBI official conceded the point, the Post notes. That’s noteworthy. Although the news itself is not exactly a shocker. It is well known among cybersecurity and privacy circles that the agency has had a zero day policy in place since 2010, thanks to documents obtained by the American Civil Liberties Union and published earlier this year on Wired. And working groups had been assembled at least two years earlier to begin mapping out that policy, as a document obtained by the Electronic Frontier Foundation privacy organization and also published on Wired shows. Now though, Hess, an executive assistant director with the FBI, seems to have confirmed the activity.

(People surmised as much after the FBI was outed as a customer of the Italian spyware firm Hacking Team after hackers stole some of its internal documents and published them online this year, too.)

The agency’s “network investigative techniques,” as these hacking operations are known, originate inside the FBI’s Operational Technology Division in an enclave known as its Remote Operations Unit, according to the Post. They’re rarely discussed publicly, and many privacy advocates have a number of concerns about the system, which they say could potentially be abused or have unsavory consequences.

Robert does a great job in covering this latest admission by the FBI and pointing to other resources to fill in its background.

It’s hard to think of a better precedent for this use of hacking weapons than Silverthorne Lumber Co., Inc. v. United States, 251 U.S. 385 (1920).

The opinion for the majority of the Supreme Court was delivered by Justice Holmes at the height of his career. It isn’t long so I quote the opinion in full:

This is a writ of error brought to reverse a judgment of the District Court fining the Silverthorne Lumber Company two hundred and fifty dollars for contempt of court and ordering Frederick W. Silverthorne to be imprisoned until he should purge himself of a similar contempt. The contempt in question was a refusal to obey subpoenas and an order of Court to produce books and documents of the company before the grand jury to be used in regard to alleged violation of the statutes of the United States by the said Silverthorne and his father. One ground of the refusal was that the order of the Court infringed the rights of the parties under the Fourth Amendment of the Constitution of the United States.

The facts are simple. An indictment upon a single specific charge having been brought against the two Silverthornes mentioned, they both were arrested at their homes early in the morning of February 25, and were detained in custody a number of hours. While they were thus detained, representatives of the Department of Justice and the United States marshal, without a shadow of authority, went to the office of their company and made a clean sweep of all the books, papers and documents found there. All the employes were taken or directed to go to the office of the District Attorney of the United States, to which also the books, &c., were taken at once. An application was made as soon as might be to the District Court for a return of what thus had been taken unlawfully. It was opposed by the District Attorney so far as he had found evidence against the plaintiffs in error, and it was stated that the evidence so obtained was before the grand jury. Color had been given by the District Attorney to the approach of those concerned in the act by an invalid subpoena for certain documents relating to the charge in the indictment then on file. Thus, the case is not that of knowledge acquired through the wrongful act of a stranger, but it must be assumed that the Government planned or at all events ratified, the whole performance. Photographs and copies of material papers were made, and a new indictment was framed based upon the knowledge thus obtained. The District Court ordered a return of the originals, but impounded the photographs and copies. Subpoenas to produce the originals then were served, and, on the refusal of the plaintiffs in error to produce them, the Court made an order that the subpoenas should be complied with, although it had found that all the papers had been seized in violation of the parties’ constitutional rights. The refusal to obey this order is the contempt alleged. The Government now, while in form repudiating and condemning the illegal seizure, seeks to maintain its right to avail itself of the knowledge obtained by that means which otherwise it would not have had.

The proposition could not be presented more nakedly. It is that, although, of course, its seizure was an outrage which the Government now regrets, it may study the papers before it returns them, copy them, and then may use the knowledge that it has gained to call upon the owners in a more regular form to produce them; that the protection of the Constitution covers the physical possession, but not any advantages that the Government can gain over the object of its pursuit by doing the forbidden act. Weeks v. United States, 232 U. S. 383, to be sure, had established that laying the papers directly before the grand jury was unwarranted, but it is taken to mean only that two steps are required instead of one. In our opinion, such is not the law. It reduces the Fourth Amendment to a form of words. 232 U. S. 393. The essence of a provision forbidding the acquisition of evidence in a certain way is that not merely evidence so acquired shall not be used before the Court, but that it shall not be used at all. Of course, this does not mean that the facts thus obtained become sacred and inaccessible. If knowledge of them is gained from an independent source they may be proved like any others, but the knowledge gained by the Government’s own wrong cannot be used by it in the way proposed. The numerous decisions, like Adams v. New York, 192 U. S. 585, holding that a collateral inquiry into the mode in which evidence has been got will not be allowed when the question is raised for the first time at the trial, are no authority in the present proceeding, as is explained in Weeks v. United States, 232 U. S. 383, 232 U. S. 394, 232 U. S. 395. Whether some of those decisions have gone too far or have given wrong reasons it is unnecessary to inquire; the principle applicable to the present case seems to us plain. It is stated satisfactorily in Flagg v. United States, 233 Fed.Rep. 481, 483. In Linn v. United States, 251 Fed.Rep. 476, 480, it was thought that a different rule applied to a corporation, on the ground that it was not privileged from producing its books and papers. But the rights of a corporation against unlawful search and seizure are to be protected even if the same result might have been achieved in a lawful way.

In classic Holmes style, the crux of the case mirrors the use of illegal means to gain information, which then shapes the use of more lawful means of investigation:

It is that, although, of course, its seizure was an outrage which the Government now regrets, it may study the papers before it returns them, copy them, and then may use the knowledge that it has gained to call upon the owners in a more regular form to produce them; that the protection of the Constitution covers the physical possession, but not any advantages that the Government can gain over the object of its pursuit by doing the forbidden act.

Concealment of the use of “top secret hacking weapons,” like flight, is more than ample evidence of a corporate “guilty mind” when it comes to illegal gathering of evidence. If it were pursuing lawful means of investigation, the FBI would not go to such lengths to conceal its activities. Interviews with witnesses, physical evidence, records, wiretaps, pen registers, etc. are all lawful and disclosed means of investigation in general and in individual cases.

The FBI as an organization has created a general exception to all criminal laws and the protections offered United States citizens, when and where its agents, not courts, decide such exceptions are necessary.

Privacy of individual citizens is at risk but the greater danger is the FBI being a lawless enterprise where its goals and priorities take precedence over both the laws of the United States and its Constitution.

The United States suffers from murders, rapes and bank robberies every week of the year, yet none of those grim statistics has forced the wholesale abandonment of the rule of law by law enforcement agencies. Prefacing attacks with the word “terrorist” should have no different result.

December 9, 2015

why I try to teach writing when I am supposed to be teaching art history

Filed under: Language,Semantic Diversity,Writing — Patrick Durusau @ 11:58 am

why I try to teach writing when I am supposed to be teaching art history

From the post:

My daughter asked me yesterday what I had taught for so long the day before, and I told her, “the history of photography” and “writing.” She found this rather funny, since she, as a second-grader, has lately perfected the art of handwriting, so why would I be teaching it — still — to grown ups? I told her it wasn’t really how to write so much as how to put the ideas together — how to take a lot of information and say something with it to somebody else. How to express an idea in an organised way that lets somebody know what and why you think something. So, it turns out, what we call writing is never really just writing at all. It is expressing something in the hopes of becoming less alone. Of finding a voice, yes, but also in finding an ear to hear that voice, and an ear with a mouth that can speak back. It is about learning to enter into a conversation that becomes frozen in letters, yes, but also flexible in the form of its call and response: a magic trick that has the potential power of magnifying each voice, at times in conflict, but also in collusion, and of building those voices into the choir that can be called community. I realise that there was a time before I could write, and also a time when, like my daughter, writing consisted simply of the magic of transforming a line from my pen into words that could lift off the page no different than how I had set them down. But it feels like the me that is me has always been writing, as long as I can remember. It is this voice, however far it reaches or does not reach, that has been me and will continue to be me as long as I live and, in the strange way of words, enter into history. Someday, somebody will write historiographies in which they will talk about me, and I will consist not of this body that I inhabit, but the words that I string onto a page.

This is not to say that I write for the sake of immortality, so much as its opposite: the potential for a tiny bit of immortality is the by product of my attempt to be alive, in its fullest sense. To make a mark, to piss in the corners of life as it were, although hopefully in a slightly more sophisticated and usually less smelly way. Writing is, to me, the greatest output for the least investment: by writing, I gain a voice in the world which, like the flap of a butterfly’s wing, has the potential to grow on its own, outside of me, beyond me. My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Writing is the greatest power that I can ever have. It is also an intimate passion, an orgy, between the many who write and the many who read, excitedly communicating with each other. For this reason it is not a power that I wish only for myself, for that would be no more interesting than the echo chamber of my own head. I love the power that is in others to write, the liberty they grant me to enter into their heads and hear their voices. I love our power to chime together, across time and space. I love the strange ability to enter into conversations with ghosts, as well as argue with, and just as often befriend, people I may never meet and people I hope to have a chance to meet. Even when we disagree, reading what people have written and taking it seriously feels like a deep form of respect to other human beings, to their right to think freely. It is this power of voices, of the many being able of their own accord to formulate a chorus, that appeals to the idealist deep within my superficially cynical self. To my mind, democracy can only emerge through this chorus: a cacophanous chorus that has the power to express as well as respect the diversity within itself.

A deep essay on writing that I recommend you read in full.

There is a line that hints at a reason for semantic diversity in data science and the lack of code reuse in programming.

My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Beyond the question of authority, whose writing do you understand better or more intuitively, yours or the writing or code of someone else? (At least assuming not too much time has gone by since you wrote it.)

The vast majority of us are more comfortable with our own prose or code, even though it required the effort to transpose prose or code written by others into our re-telling.

Being more aware of the nearly universal casting of prose/code to be our own, should help us acknowledge the moral debts to others and to point back to the sources of our prose/code.

I first saw this in a tweet by Atabey Kaygun.

If You Can’t See ROI, Don’t Invest

Filed under: BigData — Patrick Durusau @ 11:34 am

Simple enough: If you can’t identify and quantify an ROI from an investment, don’t invest.

That applies to buying raw materials, physical machinery and plant, advertising and… big data processing.

Larisa Bedgood writes in Why 96% of Companies Fail With Marketing Data Insights:

At a time in our history when there is more data than ever before, the overwhelming majority of companies have yet to see the full potential of better data insights. PwC and Iron Mountain recently released a survey on how well companies are gaining value from information. The results showed a huge disconnect in the information that is available to companies and the actual insights being derived from it.

According to survey findings:

  • Only 4% of businesses can extract full value from the information they hold
  • 43% obtain very little benefit from their data
  • 23% derive no benefit whatsoever
  • 22% don’t apply any type of data analytics to the information they have

The potential of utilizing data can equate into very big wins and even greater revenue. Take a look at this statistic based on research by McKinsey:

Unlike most big data vendor literature, Larisa captures the #1 thing you should do before investing in big or small data management:

1. Establish an ROI


Establishing a strong return on investment (ROI) will help get new data projects off the ground. Begin by documenting any problems caused by incorrect data, including missed opportunities, and wasted marketing spend. This doesn’t have to be a time intensive project, but gather as much supporting documentation as possible to justify the investment. (emphasis added)

An added advantage of establishing an ROI prior to investment is you will have the basis for judging the success of a data management project. Did the additional capabilities of data analysis/management in fact lead to the expected ROI?

To put it another way, a big data project may be “successful” in the sense that it was completed on time, on budget and it performs exactly as specified, but if it isn’t meeting your ROI projections, the project overall is a failure.
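The arithmetic behind that judgment is as plain as it gets; the numbers below are invented purely for illustration:

```python
def roi(gain, cost):
    """Return on investment as a fraction of cost."""
    return (gain - cost) / cost

projected = roi(gain=750_000, cost=500_000)   # the ROI used to justify the project
actual = roi(gain=520_000, cost=560_000)      # what the project actually delivered

print(f"Projected ROI: {projected:.0%}")  # 50%
print(f"Actual ROI:    {actual:.0%}")     # about -7%
# "On time, on budget" or not, an ROI below the projection is the measure that matters.
```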

From a profit making business perspective, there is no other measure of success or failure than meeting or failing to meet an expected ROI goal.

Everyone else may be using X or Y technology, but if there is no ROI for you, why bother?

You can see my take on the PwC and Iron Mountain survey at: Avoiding Big Data: More Business Intelligence Than You Would Think.

How Effective Is Phone Data Mining?

Filed under: Government,Privacy — Patrick Durusau @ 11:04 am

If you missed Drug Agents Use Vast Phone Trove, Eclipsing N.S.A.’s by Scott Shane and Colin Moynihan when it first appeared in 2013, take a look at it now.

From the post:

For at least six years, law enforcement officials working on a counternarcotics program have had routine access, using subpoenas, to an enormous AT&T database that contains the records of decades of Americans’ phone calls — parallel to but covering a far longer time than the National Security Agency’s hotly disputed collection of phone call logs.

The Hemisphere Project, a partnership between federal and local drug officials and AT&T that has not previously been reported, involves an extremely close association between the government and the telecommunications giant.

The government pays AT&T to place its employees in drug-fighting units around the country. Those employees sit alongside Drug Enforcement Administration agents and local detectives and supply them with the phone data from as far back as 1987.

The project comes to light at a time of vigorous public debate over the proper limits on government surveillance and on the relationship between government agencies and communications companies. It offers the most significant look to date at the use of such large-scale data for law enforcement, rather than for national security.

The leaked presentation slides that inform this article claim some success stories but don’t offer an accounting for the effort expended and its successes.

Beyond the privacy implications, the potential for governmental overreaching, etc., there remains the question of how much benefit is being gained for the cost of the program.

Rather than an airy policy debate, numbers on expenditures and results could empower a far more pragmatic debate on this program.

I don’t doubt the success stories but random chance dictates that some drug dealers will be captured every year, whatever law enforcement methods are in place.

More data on phone data mining by the Drug Enforcement Administration could illustrate how effective or ineffective such mining is in the enforcement of drug laws. Given the widespread availability of drugs, I am anticipating a low score on that test.

Should that prove to be the case, it will be additional empirical evidence to challenge the same methods being used, ineffectively, in the prosecution of the “war” on terrorism.

Proving that such methods are ineffectual in addition to being violations of privacy rights could be what tips the balance in favor of ending all such surveillance techniques.

December 8, 2015

Congress: More XQuery Fodder

Filed under: Government,Government Data,Law - Sources,XML,XQuery — Patrick Durusau @ 8:07 pm

Congress Poised for Leap to Open Up Legislative Data by Daniel Schuman.

From the post:

Following bills in Congress requires three major pieces of information: the text of the bill, a summary of what the bill is about, and the status information associated with the bill. For the last few years, Congress has been publishing the text and summaries for all legislation moving in Congress, but has not published bill status information. This key information is necessary to identify the bill author, where the bill is in the legislative process, who introduced the legislation, and so on.

While it has been in the works for a while, this week Congress confirmed it will make “Bill Statuses in XML format available through the GPO’s Federal Digital System (FDsys) Bulk Data repository starting with the 113th Congress,” (i.e. January 2013). In “early 2016,” bill status information will be published online in bulk– here. This should mean that people who wish to use the legislative information published on Congress.gov and THOMAS will no longer need to scrape those websites for current legislative information, but instead should be able to access it automatically.

Congress isn’t just going to pull the plug without notice, however. Through the good offices of the Bulk Data Task Force, Congress will hold a public meeting with power users of legislative information to review how this will work. Eight sample bill status XML files and draft XML User Guides were published on GPO’s GitHub page this past Monday. Based on past positive experiences with the Task Force, the meeting is a tremendous opportunity for public feedback to make sure the XML files serve their intended purposes. It will take place next Tuesday, Dec. 15, from 1-2:30. RSVP details below.

If all goes as planned, this milestone has great significance.

  • It marks the publication of essential legislative information in a format that supports unlimited public reuse, analysis, and republication. It will be possible to see much of a bill’s life cycle.
  • It illustrates the positive relationship that has grown between Congress and the public on access to legislative information, where there is growing open dialog and conversation about how to best meet our collective needs.
  • It is an example of how different components within the legislative branch are engaging with one another on a range of data-related issues, sometimes for the first time ever, under the aegis of the Bulk Data Task Force.
  • It means the Library of Congress and GPO will no longer be tied to the antiquated THOMAS website and can focus on more rapid technological advancement.
  • It shows how a diverse community of outside organizations and interests came together and built a community to work with Congress for the common good.

To be sure, this is not the end of the story. There is much that Congress needs to do to address its antiquated technological infrastructure. But considering where things were a decade ago, the bulk publication of information about legislation is a real achievement, the culmination of a process that overcame high political barriers and significant inertia to support better public engagement with democracy and smarter congressional processes.

Much credit is due in particular to leadership in both parties in the House who have partnered together to push for public access to legislative information, as well as the staff who worked tirelessly to make it happen.

If you look at the sample XML files, pay close attention to the <bioguideID> element and its contents. It is the same value as you will find for roll-call votes, but there the value appears in the name-id attribute of the <legislator> element. See: http://clerk.house.gov/evs/2015/roll643.xml and view the source.

Oddly, the <bioguideID> element does not appear in the documentation on GitHub; you just have to know that it corresponds to the name-id attribute of the <legislator> element.

As I said in the title, this is going to be XQuery fodder.
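
Until the XQuery version is written, here is a minimal Python sketch of that correspondence. It assumes hypothetical local copies of one GPO sample bill status file and one House roll call file; only the <bioguideID> element and the name-id attribute come from the sources above, the file names and everything else are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical local copies of a GPO sample bill status file and a House
# roll call file (e.g. http://clerk.house.gov/evs/2015/roll643.xml saved locally).
bill_status = ET.parse("BILLSTATUS-sample.xml")
roll_call = ET.parse("roll643.xml")

# The sample bill status files carry member identifiers in <bioguideID> elements;
# search the whole tree rather than guessing their exact position.
bioguide_ids = {e.text for e in bill_status.iter("bioguideID") if e.text}

# In the roll call file the same identifier appears as the name-id attribute
# of <legislator>, whose text content is the member's display name.
for legislator in roll_call.iter("legislator"):
    if legislator.get("name-id") in bioguide_ids:
        print(legislator.get("name-id"), legislator.text)
```

The same join should be even shorter in XQuery once the bulk bill status files go live.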

Apache Cassandra 3.1!

Filed under: Cassandra — Patrick Durusau @ 7:32 pm

Apache Cassandra 3.1 hit the streets today!

If you don’t know Apache Cassandra, from the home page:

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

The full set of changes for release 3.1.
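
If you want to kick the tires from Python, here is a minimal sketch using the DataStax Python driver, assuming a single node listening on localhost; the keyspace and table names are invented for the example.

```python
from cassandra.cluster import Cluster

# Assumes a Cassandra node is reachable on the default port on localhost.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Invented keyspace and table names, purely for illustration.
session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS demo.greetings (id int PRIMARY KEY, msg text)")

session.execute("INSERT INTO demo.greetings (id, msg) VALUES (%s, %s)",
                (1, "hello, Cassandra 3.1"))
for row in session.execute("SELECT id, msg FROM demo.greetings"):
    print(row.id, row.msg)

cluster.shutdown()
```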

Enjoy!

Order of Requirements Matter

Filed under: Computer Science,Cybersecurity,Visualization — Patrick Durusau @ 7:03 pm

Sam Lightstone posted to Twitter a great illustration of why the order of requirements can matter:

[Image: asimov-robotics]

Visualizations rarely get much clearer.

You could argue that Minard’s map of Napoleon’s invasion of Russia is equally clear:

[Image: 600px-Minard]

But Minard drew with the benefit of hindsight, not foresight.

The Laws of Robotics, on the other hand, have predictive value for the different orders of requirements.

I don’t know how many requirements Honeywell had for the Midas and Midas Black Gas Detectors but you can bet IP security was near the end of the list, if explicit at all.

IP security should be #1 with a bullet, especially for devices that detect Ammonia (caustic, hazardous), Arsine (highly toxic, flammable), Chlorine (extremely dangerous, poisonous for all living organisms), Hydrogen cyanide, and Hydrogen fluoride (“Hydrogen fluoride is a highly dangerous gas, forming corrosive and penetrating hydrofluoric acid upon contact with living tissue. The gas can also cause blindness by rapid destruction of the corneas.”)

When IP security is not the first requirement, it’s not hard to foresee the outcome, an Insecure Internet of Things.

Is that what we want?

Is It End-To-End Encrypted?

Filed under: Cryptography,Cybersecurity,Encryption — Patrick Durusau @ 5:44 pm

ZeroDB has kicked off the new question for all networked software:

Is It End-To-End Encrypted? It answers with a resounding YES!

From ZeroDB, an end-to-end encrypted database, is open source!:

We’re excited to release ZeroDB, an end-to-end encrypted database, to the world. ZeroDB makes it easy to develop applications with strong security and privacy guarantees by enabling applications to query encrypted data.

zerodb repo: https://github.com/zero-db/zerodb/
zerodb-server repo: https://github.com/zero-db/zerodb-server/
Documentation: http://docs.zerodb.io/

Now that it’s open source, we want your help to make it better. Try it, build awesome things with it, break it. Then tell us about it.

Today, we’re releasing a Python implementation. A JavaScript client will be following soon.

Questions? Ask us on Slack or Google Groups.

The post was authored by MacLane & Michael and you can find more information at http://zerodb.io.

PS: The question Is It End-To-End Encrypted? is a yes or no question. If anyone gives you an answer other than an unqualified yes, it’s time to move along to the next vendor. “Sometimes,” “under some circumstances,” “maybe,” “as an added feature,” “it can be,” etc., are all unacceptable answers.

Just like the question: Does it have any backdoors at all? What purpose the backdoor serves isn’t relevant. The existence of a backdoor is also your cue to move to another vendor.

The answers to both of those questions should be captured in contractual language with stipulated liability in the event of breach and minimal stipulated damages.

I first saw this in Four short links: 8 December 2015 by Nat Torkington.

XQuery, 2nd Edition, Updated! (A Drawback to XQuery)

Filed under: XML,XPath,XQuery — Patrick Durusau @ 3:57 pm

XQuery, 2nd Edition, Updated! by Priscilla Walmsley.

The updated version of XQuery, 2nd Edition has hit the streets!

As a plug for the early release program at O’Reilly, yours truly appears in the acknowledgments (page xxii) for having submitted comments on the early release version of XQuery. You can too. Early release participation is yet another way to contribute back to the community.

There is one drawback to XQuery which I discuss below.

For anyone not fortunate enough to already have a copy of XQuery, 2nd Edition, here is the full description from the O’Reilly site:

The W3C XQuery 3.1 standard provides a tool to search, extract, and manipulate content, whether it’s in XML, JSON or plain text. With this fully updated, in-depth tutorial, you’ll learn to program with this highly practical query language.

Designed for query writers who have some knowledge of XML basics, but not necessarily advanced knowledge of XML-related technologies, this book is ideal as both a tutorial and a reference. You’ll find background information for namespaces, schemas, built-in types, and regular expressions that are relevant to writing XML queries.

This second edition provides:

  • A high-level overview and quick tour of XQuery
  • New chapters on higher-order functions, maps, arrays, and JSON
  • A carefully paced tutorial that teaches XQuery without being bogged down by the details
  • Advanced concepts for taking advantage of modularity, namespaces, typing, and schemas
  • Guidelines for working with specific types of data, such as numbers, strings, dates, URIs, maps and arrays
  • XQuery’s implementation-specific features and its relationship to other standards including SQL and XSLT
  • A complete alphabetical reference to the built-in functions, types, and error messages

Drawback to XQuery:

You know I hate to complain, but the brevity of XQuery is a real drawback to billing.

For example, I have a post pending on taking 604 lines of XSLT down to 35 lines of XQuery.

Granted, the XQuery is easier to maintain, modify, and extend, but all a client will see is the 35 lines of XQuery. At least 604 lines of XSLT look like you really worked to produce something.

I know about XQueryX but I haven’t seen any automatic way to convert XQuery into XQueryX. Am I missing something obvious? If that’s possible, I could just bulk up the deliverable with an XQueryX expression of the work and keep the XQuery version for production use.

As excellent as I think XQuery and Walmsley’s book both are, I did want to warn you about the brevity of your XQuery deliverables.

I look forward to finishing XQuery, 2nd Edition. I started doing so many things based on the first twelve or so chapters that I just read selectively from that point on. It merits a complete read. You won’t be sorry you did.

December 7, 2015

Toxic Gas Detector Alert!

Filed under: Cybersecurity,IoT - Internet of Things,Security — Patrick Durusau @ 9:55 pm

For years the Chicken Littles of infrastructure security have been warning of nearly impossible cyber-attacks on utilities and other critical infrastructure.

Despite nearly universal scorn from security experts, once those warnings are heard, they are dutifully repeated by a non-critical press and echoed by elected public officials.

Infrastructure was not insecure originally, but the Internet of Things is catching up to it and making what was once secure, insecure.

Consider Mark Stockley‘s report: Industrial gas detectors vulnerable to a remote ‘attacker with low skill’.

From the post:

Users of Honeywell’s Midas and Midas Black gas detectors are being urged to patch their firmware to protect against a pair of critical, remotely exploitable vulnerabilities.

These extremely serious vulnerabilities, found by researcher Maxim Rupp and reported by ICS-CERT (the Industrial Control Systems Cyber Emergency Response Team) in advisory ICSA-15-309-02, are simple enough to be exploited by an “attacker with low skill”:

Successful exploitation of these vulnerabilities could allow a remote attacker to gain unauthenticated access to the device, potentially allowing configuration changes, as well as the initiation of calibration or test processes.

…These vulnerabilities could be exploited remotely.

…An attacker with low skill would be able to exploit these vulnerabilities.

So, how bad is the problem?

You judge:

Midas and Midas Black gas detectors are used worldwide in numerous industrial sectors including chemical, manufacturing, energy, food, agriculture and water to:

…detect many key toxic, ambient and flammable gases in a plant. The device monitors points up to 100 feet (30 meters) away while using patented technology to regulate flow rates and ensure error-free gas detection.

The vulnerabilities could allow the devices’ authentication to be bypassed completely by path traversal (CVE-2015-7907) or to be compromised by attackers grabbing an administrator’s password as it’s transmitted in clear text (CVE-2015-7908).

That’s still not a full picture of the danger posed by these vulnerabilities. Take a look at the sales brochure on the Midas Gas Detector and you will find this chart of the “over 35 gases” the Midas Gas Detector can detect:

[Image: 35-gases]

Several nasty gases are on the list: Ammonia (caustic, hazardous), Arsine (highly toxic, flammable), Chlorine (extremely dangerous, poisonous for all living organisms), Hydrogen cyanide, and Hydrogen fluoride (“Hydrogen fluoride is a highly dangerous gas, forming corrosive and penetrating hydrofluoric acid upon contact with living tissue. The gas can also cause blindness by rapid destruction of the corneas.”)

Bear in mind that patch application doesn’t have an encouraging history: Potent, in-the-wild exploits imperil customers of 100,000 e-commerce sites

Honeywell has put the detection of extremely dangerous gases at the mercy of script kiddies.

Suggestion: If you work on-site where Midas Gas Detectors may be in use, inquire before setting foot on the site whether the relevant models are installed and whether they have been patched.

Bear in mind that the risk of “…corrosive and penetrating hydrofluoric acid upon contact with living tissue…” is yours in some situations. I would ask first.

Untraceable communication — guaranteed

Filed under: Cybersecurity,Privacy,Security — Patrick Durusau @ 8:50 pm

Untraceable communication — guaranteed by Larry Hardesty.

From the post:

Anonymity networks, which sit on top of the public Internet, are designed to conceal people’s Web-browsing habits from prying eyes. The most popular of these, Tor, has been around for more than a decade and is used by millions of people every day.

Recent research, however, has shown that adversaries can infer a great deal about the sources of supposedly anonymous communications by monitoring data traffic through just a few well-chosen nodes in an anonymity network. At the Association for Computing Machinery Symposium on Operating Systems Principles in October, a team of MIT researchers presented a new, untraceable text-messaging system designed to thwart even the most powerful of adversaries.

The system provides a strong mathematical guarantee of user anonymity, while, according to experimental results, permitting the exchange of text messages once a minute or so.

“Tor operates under the assumption that there’s not a global adversary that’s paying attention to every single link in the world,” says Nickolai Zeldovich, an associate professor of computer science and engineering, whose group developed the new system. “Maybe these days this is not as good of an assumption. Tor also assumes that no single bad guy controls a large number of nodes in their system. We’re also now thinking, maybe there are people who can compromise half of your servers.”

Because the system confuses adversaries by drowning telltale traffic patterns in spurious information, or “noise,” its creators have dubbed it “Vuvuzela,” after the noisemakers favored by soccer fans at the 2010 World Cup in South Africa.

Pay particular attention to the generation of dummy messages as “noise.”
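
To make the idea concrete, here is a rough Python sketch of cover traffic. It is not Vuvuzela’s actual protocol, just the bare notion: every client submits exactly one fixed-size message per round, sending random bytes as a dummy when it has nothing to say, so an observer counting messages learns nothing from the sending rate. All names and sizes are assumptions.

```python
import os

# Fixed size; in a real system both real and dummy messages would also be
# encrypted, so they would be indistinguishable on the wire.
MESSAGE_SIZE = 256

def round_submission(outbox):
    """Return exactly one fixed-size message for this round.

    If the client has a real message queued, pad it to MESSAGE_SIZE;
    otherwise emit random bytes as a dummy ("noise") message.
    """
    if outbox:
        payload = outbox.pop(0).encode("utf-8")[:MESSAGE_SIZE]
        return payload.ljust(MESSAGE_SIZE, b"\x00")
    return os.urandom(MESSAGE_SIZE)

# Every client sends once per round, whether or not it has anything to say.
clients = {"alice": ["meet at noon"], "bob": [], "carol": []}
for name, outbox in clients.items():
    message = round_submission(outbox)
    print(name, len(message))  # the observable: one 256-byte message per client
```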

In topic map terms, I would say that the identity of the association between the sender and a particular message, or between the receiver and a particular message, has been obscured.

That is the reverse of the usual application of topic map principles, which is a strong indication that the means used to identify those associations are themselves establishing associations and their identities. Perhaps not in traditional TMDM terms, but they are associations with identities nonetheless.

For some unknown reason, the original post did not have a link to the article, Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis by Jelle van den Hooff, David Lazar, Matei Zaharia, and Nickolai Zeldovich.

The non-technical post concludes:

“The mechanism that [the MIT researchers] use for hiding communication patterns is a very insightful and interesting application of differential privacy,” says Michael Walfish, an associate professor of computer science at New York University. “Differential privacy is a very deep and sophisticated theory. The observation that you could use differential privacy to solve their problem, and the way they use it, is the coolest thing about the work. The result is a system that is not ready for deployment tomorrow but still, within this category of Tor-inspired academic systems, has the best results so far. It has major limitations, but it’s exciting, and it opens the door to something potentially derived from it in the not-too-distant future.”

It isn’t clear how such a system would defeat an adversary that has access to all the relevant nodes. Where “relevant nodes” is a manageable subset of all the possible nodes in the world. It’s unlikely that any adversary, aside from the NSA, CIA and other known money pits, would attempt to monitor all network traffic.

But monitoring all network traffic is both counter-productive and unnecessary. In general, one does not set out from the Washington Monument in search of spies based in the United States. Or at least people who hope to catch spies don’t. I can’t speak for the NSA or CIA.

While you could search for messages between people unknown to you, that sounds like a very low-grade ore mining project. You could find a diamond in the rough, but it’s unlikely.

The robustness of this proposal should assume that both the sender and receiver have been identified and their network traffic is being monitored.

I think what I am groping towards is the notion that “noise” comes too late in this proposal. If either party is known, or suspected, it may be time-consuming to complete the loop on the messages, but adding noise at the servers is more of an annoyance than serious security.

At least when the adversary can effectively monitor the relevant nodes. Assuming that the adversary can’t perform such monitoring seems like a risky proposition.

Thoughts?

IoT: Move Past the Rhetoric and Focus on Success

Filed under: Cybersecurity,IoT - Internet of Things,Security — Patrick Durusau @ 8:03 pm

Move Past the Rhetoric and Focus on Success

A recent missive from Booz Allen on the Internet of Things.

Two critical points that I want to extract for your consideration:


New Models for Security

The proliferation of IoT devices drastically increases the attack surface and creates attractive, and sometimes easy, targets for attackers. Traditional means of securing networks will no longer suffice as attack risks increase exponentially. We will help you learn how to think about security in an IoT world and new security models.

[page 4]

You have to credit Booz Allen with being up front about “…attack risks increas[ing] exponentially,” considering that “Hello Barbie” has an STD on her first Christmas.

Do you have a grip on your current exposure to cyber-risk? What is that going to look like when it increases exponentially?

I’m not a mid-level manager but I would be wary of increasing cyber-risk exponentially, especially without a concrete demonstration of value add from the Internet-of-Things.

The second item:

Interoperability is Key to Everything

IoT implementations typically contain hundreds of sensors embedded in different “things”, connected to gateways and the Cloud, with data flowing back and forth via a communication protocol. If each node within the system “speaks” the same language, then the implementation functions seamlessly. When these nodes don’t talk with each other, however, you’re left with an Internet of one or some things, rather than an Internet of everything. [page 4]

IoT implementation, at its core, is the integration of dozens and up to tens of thousands of devices seamlessly communicating with each other, exchanging information and commands, and revealing insights. However, when devices have different usage scenarios and operating requirements that aren’t compatible with other devices, the system can break down. The ability to integrate different elements or nodes within broader systems, or bringing data together to drive insights and improve operations, becomes more complicated and costly. When this occurs, IoT can’t reach its potential, and rather than an Internet of everything, you see siloed Internets of some things.

Haven’t we seen this play before? Wasn’t it called the Semantic Web? I suppose now called the Failed Semantic Web (FSW)?

Booz Allen would be more forthright to say, “…the system is broken down…” rather than “…the system can break down.”

I can’t think of a better way to build a failing IoT project than to presume that interoperability exists now (or that it is likely to exist, outside highly constrained circumstances).

Let’s take a simpler problem than everything and limit it to interchange of pricing data in the energy market. As you might expect, there is a standard, a rather open one, on that topic: Energy Market Information Exchange (EMIX) Version 1.0.

That specification is dated 11 January 2012, which means that on 11 January 2016 it will be four years old.

As of today, a search on “Energy Market Information Exchange (EMIX) Version 1.0” produces 695 “hits,” but 327 of them are at Oasis-open.org, the organization whose TC produced this specification.

Even more interesting, only three pages of results are returned with the notation that beyond 30 results, the rest have been suppressed as duplicates.

So, at three years and three hundred and thirty days, Energy Market Information Exchange (EMIX) Version 1.0 has thirty (30) non-duplicate “hits?”

I can’t say that inspires me with a lot of hope for impact on interoperability in the U.S. Energy Market. Like the work Booz Allen cites, this too was sponsored by NIST and the DOE (EISA).

One piece of advice from Booz Allen is worth following:

Start Small

You may actually have IoT implementations within your organization that you aren’t aware of. And if you have any type of health wearable, you are actually already participating in IoT. You don’t have to instrument every car, road, and sign to have an Internet of some things. [page 10]

Building the Internet of Things for everybody should not be on your requirements list.

An Internet of Some Things will be built from your things, and with proper planning it will improve your bottom line. (Contrary to the experience with the Semantic Web.)

Jupyter on Apache Spark [Holiday Game]

Filed under: Python,Reddit,Spark — Patrick Durusau @ 4:46 pm

Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3.

The following tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and also walks through some transformations and queries on the reddit comment data on Amazon S3. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.

In Need a Bigoted, Racist Uncle for Holiday Meal? I mentioned the 1.6 billion Reddit comments that are the subject of this tutorial.

If you can’t find comments offensive to your guests in the Reddit comment collection, they are comatose and/or inanimate objects.

Big Data Holiday Game:

Divide into teams with at least one Jupyter/Apache Spark user on each team.

Play three timed rounds (time for each round dependent on your local schedule) where each team attempts to discover a Reddit comment that is the most offensive for the largest number of guests.

The winner gets bragging rights until next year, you get to show off your data mining skills, not to mention, you get a free pass on saying offensive things to your guests.
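
To get a team started, here is a minimal PySpark sketch. It assumes the comments sit at a hypothetical S3 path as one JSON object per line with a “body” field, and it uses an invented word list as the scoring rule.

```python
from __future__ import print_function
import json
from pyspark import SparkContext

sc = SparkContext(appName="HolidayGame")

# Hypothetical S3 path; substitute the bucket/prefix from the tutorial.
comments = sc.textFile("s3a://your-bucket/reddit/comments/*.json")

def parse(line):
    """Each line is assumed to be one JSON object with a "body" field."""
    try:
        return json.loads(line)
    except ValueError:
        return None

# Invented scoring rule: count occurrences of words from a team-chosen list.
TRIGGER_WORDS = {"politics", "religion", "diet"}

def score(comment):
    words = comment.get("body", "").lower().split()
    return sum(1 for w in words if w in TRIGGER_WORDS)

scored = (comments.map(parse)
                  .filter(lambda c: c is not None)
                  .map(lambda c: (score(c), c.get("body", "")))
                  .filter(lambda pair: pair[0] > 0))

# Ten highest-scoring candidates for the round.
for s, body in scored.top(10, key=lambda pair: pair[0]):
    print(s, body[:120])
```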

Watch for more formalized big data games of this nature by the 2016 holiday season!

Enjoy!

I first saw this in a tweet by Data Science Renee.

Assault Weapons: Christmas Shopping News

Filed under: Government,Verification — Patrick Durusau @ 3:31 pm

Just in time to boost the sales of assault weapons for Christmas 2015, David G. Savage reports in Supreme Court lets local ban on assault weapons stand:

In a victory for gun-control advocates, the Supreme Court on Monday rejected a 2nd Amendment challenge to a local law that forbids the sale or possession of semiautomatic weapons that carry more than 10 rounds of ammunition.

The justices by a 7-2 vote refused to review rulings by judges in Chicago who upheld a ban on assault weapons in the city of Highland Park as a reasonable gun-control regulation. Justices Clarence Thomas and Antonin Scalia dissented.

The court’s decision, while not a formal ruling, strongly suggests the justices do not see the 2nd Amendment as protecting a right to own or carry powerful weapons in public.

If you are a member of the NRA (I am), you will be getting frantic communications from Wayne LaPierre decrying this latest government infringement on your Second Amendment rights.

Except he will probably phrase it as jack-booted thugs who are crouched next to your front door, ready to seize guns you mean to use only for hunting and self-defense.

Anyone who needs an assault rifle for hunting isn’t a hunter in the sense of Field and Stream. They are more like the description in Death Wish (Charles Bronson) where one character says:

…thinks we shoot our guns because it’s an extension of our penises.

To which Charles Bronson’s character replies:

I never thought about it that way. It could be true.

I mention that because the fabled accounts about this Supreme Court refusal of review are going to paint it as doom and gloom for anyone wanting to buy an assault rifle with a reasonably sized magazine.

NOT SO!

The City of Highland Park is all of 12 square miles in size.

Within those 12 square miles, you cannot possess or sell a semiautomatic weapon that has more than 10 rounds of ammunition.

I haven’t adjusted for federal courthouses and similar areas where you cannot possess any firearm, much less a semiautomatic weapon with or without more than 10 rounds of ammunition, but the total area of the United States is 3,794,083 square miles, including water (or 9,826,630 square km).

Impact of the Supreme Court ruling: you cannot possess or sell a semiautomatic weapon with more than 10 rounds of ammunition for 12 square miles, but you can both possess and sell such a weapon for 3,794,071 square miles, including water (or about 9,826,599 square km).

I don’t feel all the threatened by a 12 square mile area ban on assault weapons with more than 10 rounds of ammunition.

Don’t be stampeded into spending your Christmas money on yet another assault rifle or more ammunition.

Buy yourself something useful in case we have “technicals” as in Mogadishu.

WikiTravel describes Mogadishu this way:

WARNING: There is a high threat from terrorism, including kidnapping, throughout Somalia, excluding Somaliland. Terrorist groups have made threats against Westerners and those working for Western organizations. It is known that there is a constant threat of terrorist attacks in Mogadishu. The city also remains in great danger of suicide bombings and other terrorist attacks carried out by extremists who manage to get past security checkpoints around the city. Walking the streets of Mogadishu remains very dangerous, even with armed guards. Tourists are emphatically discouraged from visiting Mogadishu for the time being, while business travelers should take extreme caution and make thorough plans for any trips. Travel outside Mogadishu remains extremely dangerous and should be avoided at all costs. Those working for aid agencies should consult the security plans or advice of your organization.

May I suggest you consider acquiring a dozen or more Rocket-Propelled Grenades (RPG)? This includes the U.S. M72 LAW rocket launcher (disposable tube).

The legality of both purchasing and possessing an RPG of any make or model vary considerably from place to place, as will your appetite for risk. However, they are definitely a step up from assault rifles even with more than 10 rounds of ammunition.

You may remember seeing the Lenco Bearcat scooting around San Bernardino at a reported cost of $375,000.00.

Keep in mind that an RPG:

[Image: 640px-RPG-7_detached]

is to a Lenco Bearcat:

[Image: 640px-Nash_Bearcat]

as an electric can opener:

[Image: 640px-Krups_electric_can_opener]

is to a tin can:

[Image: 640px-No_name_sans_nom_tomato_juice]

Don’t believe the hype from the NRA or desk jockey “war fighters.” Neither side really stands up to minimal verification.

