Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

November 15, 2015

Mutable Data Structures

Filed under: Data Structures,Functional Programming — Patrick Durusau @ 11:41 am

The best illustration of the dangers of mutable data structures I have found to date.

Mutable data structures and algorithms based upon them were necessary when computer memory was far more limited than it is today.*

Why are you still using mutable data structures and algorithms designed for hardware constraints that no longer exist?

I first saw this in a tweet by the Software Exorcist.

* Granting there are cases, CERN comes to mind, where the memory requirements for some applications exceed available memory. You aren’t working at CERN, are you?
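To make the danger concrete, here is a minimal Python sketch of my own (not the illustration above): a list shared through in-place mutation leaks state between callers, while the immutable version cannot.

```python
def add_tag(tags, tag):
    tags.append(tag)          # mutates the caller's list in place
    return tags

def add_tag_pure(tags, tag):
    return tags + (tag,)      # builds a new tuple; the original is untouched

defaults = ["draft"]
post_a = add_tag(defaults, "python")
post_b = add_tag(defaults, "clojure")
print(defaults)               # ['draft', 'python', 'clojure'] -- every "copy" changed

frozen_defaults = ("draft",)
post_c = add_tag_pure(frozen_defaults, "python")
print(frozen_defaults)        # ('draft',) -- still just the default
```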

Magnificent Maps of New York

Filed under: Mapping,Maps — Patrick Durusau @ 11:01 am

Magnificent Maps of New York by Kate Marshall.

From the post:

The British Library’s ongoing project to catalogue and digitise the King’s Topographical Collection, some 40,000 maps, prints and drawings collected by George III, has highlighted some extraordinary treasures. The improved and up-dated catalogue records are now accessible to all, anywhere in the world, via the Library’s catalogue, Explore, and offer a springboard for enhanced study.

Your donations to this and other projects enable us to digitise more of our collections, the results of which are invaluable. One such example of further research using material digitised with help from donors is the recently published book by Richard H. Brown and Paul E. Cohen, Revolution. Mapping the Road to American Independence, 1755-1783, which features a number of maps from the K.Top.

The Explore link takes you to the main interface for the British Library, but Maps is a more direct route to the map collection materials.

Practically everyone has made school presentations about their country’s history. With resources such as the British Library’s map collection becoming available online, it isn’t too much to expect students to supplement their reports with historical maps.

Enjoy!

November 14, 2015

4 Tips to Learn More About ACS Data [$400 Billion Market, 3X Big Data]

Filed under: BigData,Census Data,R — Patrick Durusau @ 9:59 pm

4 Tips to Learn More About ACS Data by Ari Lamstein.

From the post:

One of the highlights of my recent east coast trip was meeting Ezra Haber Glenn, the author of the acs package in R. The acs package is my primary tool for accessing census data in R, and I was grateful to spend time with its author. My goal was to learn how to “take the next step” in working with the census bureau’s American Community Survey (ACS) dataset. I learned quite a bit during our meeting, and I hope to share what I learned over the coming weeks on my blog.

Today I’ll share 4 tips to help you get started in learning more. Before doing that, though, here is some interesting trivia: did you know that the ACS impacts how over $400 billion is allocated each year?

If the $400 billion got your attention, follow the tips in Ari’s post first, look for more posts in that series second, then visit the American Community Survey (ACS) website.

For comparison purposes, keep in mind that Forbes projects the Big Data Analytics market in 2015 to be a paltry $125 billion.

The ACS data market is over three times larger: $400 billion (ACS) versus $125 billion (Big Data) for 2015.

Suddenly, ACS data and R look quite attractive.
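If you want a quick look at the raw data before installing the acs package, the Census Bureau also exposes ACS estimates over a plain HTTP API. Here is a hedged Python sketch; the endpoint, dataset vintage and variable code are my assumptions, so check api.census.gov for current values.

```python
import requests

# ACS 5-year estimates, 2014 vintage (assumed); B01003_001E is total population.
url = "https://api.census.gov/data/2014/acs5"
params = {
    "get": "NAME,B01003_001E",
    "for": "county:*",
    "in": "state:36",                 # 36 = New York FIPS code
    # "key": "YOUR_CENSUS_API_KEY",   # needed beyond a modest request volume
}

rows = requests.get(url, params=params).json()
header, data = rows[0], rows[1:]
print(header)                         # requested variables plus geography columns
print(data[:3])                       # first three counties
```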

Querying Biblical Texts: Part 1 [Humanists Take Note!]

Filed under: Bible,Text Mining,XML,XQuery — Patrick Durusau @ 5:13 pm

Querying Biblical Texts: Part 1 by Jonathan Robie.

From the post:

This is the first in a series on querying Greek texts with XQuery. We will also look at the differences among various representations of the same text, starting with the base text, morphology, and three different treebank formats. As we will see, the representation of a text indicates what the producer of the text was most interested in, and it determines the structure and power of queries done on that particular representation. The principles discussed here also apply to other languages.

This is written as a tutorial, and it can be read in two ways. The first time through, you may want to simply read the text. If you want to really learn how to do this yourself, you should download an XQuery processor and some data (in your favorite biblical language) and try these queries and variations on them.

Humanists need to follow this series and pass it along to others.

Texts of interest to you will vary, but the steps Jonathan covers are applicable to all texts (well, depending upon your encoding).

In exchange for learning a little XQuery, you can gain a good degree of mastery over XML-encoded texts.

Enjoy!

The 100 Most Used Clojure Expressions

Filed under: Clojure,Education,Programming — Patrick Durusau @ 5:03 pm

The 100 Most Used Clojure Expressions by Eric Normand.

From the post:

Summary: Would you like to optimize your learning of Clojure? Would you like to focus on learning only the most useful parts of the language first? Take this lesson from second language learning: learn the expressions in order of frequency of use.

When I was learning Spanish, I liked to use Anki to drill new vocabulary. It’s a flashcard program. I found that someone had made a set of cards from an analysis of thousands of newspapers. They read in all of the words from the newspapers, counted them up, and figured out what the most common words were. The top 1000 made it into the deck.

It turns out that this is a very good strategy for learning words. Word frequency follows a hockey stick distribution. The most common words are used so much more than the less common words. For instance, the 100 most common English words make up more than 50% of text. If you’ve got limited time, you should learn those most common words first.

People who are trying to learn Clojure have been asking me “how do I learn all of this stuff? There’s so much!” It’s a valid question and I haven’t had a good answer. I remembered the Spanish newspaper analysis and I thought I’d try to do a similar analysis of Clojure expressions.

Is Eric seriously suggesting using lessons learned in another field? 😉

Of course, at a CS conference, using the top 100 most common Clojure expressions would appear under a title similar to:

Use of High Frequency Terminology Repetition: A Small Group Study (maybe 12 participants)

You could, of course, skip waiting for a conference presentation with a title like that one, followed by peer-reviewed paper(s), more conference presentations, and its final appearance in a collection of potential ways to improve CS instruction.

Let me know if Eric’s suggestion works for you.
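If you want to try a rough version of the count on your own codebase, here is a quick Python sketch, a regex approximation of the analysis Eric describes (not his code):

```python
import re
from collections import Counter
from pathlib import Path

# Count the symbol in "head" position of each s-expression across .clj files.
# A real analysis would use a Clojure reader; this regex only shows the idea.
head = re.compile(r"\((\S+)")
counts = Counter()

for path in Path("src").rglob("*.clj"):
    counts.update(head.findall(path.read_text(encoding="utf-8")))

for sym, n in counts.most_common(20):
    print(f"{n:6d}  {sym}")
```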

Enjoy!

PS: Thanks Eric!

November 13, 2015

LIQUi|> – A Quantum Computing Simulator

Filed under: Computer Science,Physics,Quantum — Patrick Durusau @ 8:23 pm

With quantum computing simulator, Microsoft offers a sneak peek into future of computing by Allison Linn.

From the post:


Next week, at the SuperComputing 2015 conference in Austin, Texas, Dave Wecker, a lead architect on the QuArC team, will discuss the recent public release on GitHub of a suite of tools that allows computer scientists to simulate a quantum computer’s capabilities. That’s a crucial step in building the tools needed to run actual quantum computers.

“This is the closest we can get to running a quantum computer without having one,” said Wecker, who has helped develop the software.

The software is called Language-Integrated Quantum Operations, or LIQUi|>. The funky characters at the end refer to how a quantum operation is written in mathematical terms.

The researchers are hoping that, using LIQUi|>, computer scientists at Microsoft and other academic and research institutions will be able to perfect the algorithms they need to efficiently use a quantum computer even as the computers themselves are simultaneously being developed.

“We can actually debut algorithms in advance of running them on the computer,” Svore said.

As of today, November 13, 2015, LIQUi|> has only one (1) hit on GitHub. I will check back next week to see what the numbers look like then.

You won’t have a quantum computer by the holidays but you may have created your first quantum algorithm by then.
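If you want to see what a simulator does under the hood before downloading LIQUi|>, here is a toy numpy sketch (plain Python, not LIQUi|> syntax) of a single-qubit state vector passing through a Hadamard gate:

```python
import numpy as np

zero = np.array([1.0, 0.0])                           # |0>
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # Hadamard gate

state = H @ zero                                      # (|0> + |1>) / sqrt(2)
probs = np.abs(state) ** 2                            # measurement probabilities
print(probs)                                          # [0.5 0.5]
```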

Enjoy!

Bruce Schneier on (Not!) Secure Email

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:01 pm

Bruce Schneier writes:

I have recently come to the conclusion that e-mail is fundamentally unsecurable. The things we want out of e-mail, and an e-mail system, are not readily compatible with encryption. I advise people who want communications security to not use e-mail, but instead use an encrypted message client like OTR or Signal.

From: Testing the Usability of PGP Encryption Tools.

If you need robust security, take Schneier at his word.

The Pentagon’s plan to outsource lethal cyber-weapons

Filed under: Cybersecurity,Law,Security — Patrick Durusau @ 5:51 pm

The Pentagon’s plan to outsource lethal cyber-weapons by Violet Blue.

From the post:

The Pentagon has quietly put out a call for vendors to bid on a contract to develop, execute and manage its new cyber weaponry and defense program. The scope of this nearly half-billion-dollar “help wanted” work order includes counterhacking, as well as developing and deploying lethal cyberattacks — sanctioned hacking expected to cause real-life destruction and loss of human life.

In June 2016, work begins under the Cyberspace Operations Support Services contract (pdf) under CYBERCOM (United States Cyber Command). The $460 million project recently came to light and details the Pentagon’s plan to hand over its IT defense and the planning, development, execution, management, integration with the NSA, and various support functions of the U.S. military’s cyberattacks to one vendor.

Violet’s post will bring you up to date on discussions of cyber-weapons and on the large number of questions that remain, such as which law governs cyber-weapons.

It isn’t clear how worried anyone should be at this point, because the Pentagon is following its traditional acquisition process for cyber-weapons. Had the Pentagon started hiring top-name exploit merchants and hackers, the danger of cyber-weapons would be imminent.

Traditional contracting process? We may have quantum computing long before cyber-weapons from the traditional process pose a threat to then-outdated software.

But in all events, do read and pass Violet’s post along.

VIS’15 Recap with Robert Kosara and Johanna Fulda (DS #63)

Filed under: Conferences,Visualization — Patrick Durusau @ 5:32 pm

VIS’15 Recap with Robert Kosara and Johanna Fulda (DS #63)

(Data Stories podcast episode graphic.)

And that’s not the entire agenda for the podcast!

To say nothing of the fourteen links to papers, videos and previews that follow the podcast agenda.

A recap of the 2015 IEEE Visualization Conference (VIS) (25 Oct – 30 Oct 2015).

If you missed the conference or just want a great weekend activity, consider the podcast and related resources.

Reverse Engineering Challenges

Filed under: Programming,Reverse Engineering,Software Engineering — Patrick Durusau @ 4:42 pm

Reverse Engineering Challenges by Dennis Yurichev.

After the challenge/exercise listing:

About the website

Well, “challenges” is a loud word, these are rather just exercises.

Some exercises were in my book for beginners, some were in my blog, and I eventually decided to keep them all in one single place like this website, so be it.

The source code of this website is also available at GitHub: https://github.com/dennis714/challenges.re. I would love to get any suggestions and notices about misspellings and typos.

Exercise numbers

There is no correlation between exercise number and hardness. Sorry: I add new exercises occasionally and I can’t use some fixed numbering system, so numbers are chaotic and has no meaning at all.

On the other hand, I can assure, exercise numbers will never change, so my readers can refer to them, and they are also referred from my book for beginners.

Duplicates

There are some pieces of code which are really does the same thing, but in different ways. Or maybe it is implemented for different architectures (x86 and Java VM/.NET). That’s OK.

A major resource for anyone interested in learning reverse engineering!

If you are in the job market, Dennis concludes with this advice:

How can I measure my performance?

  • As far as I can realize, If reverse engineer can solve most of these exercises, he is a hot target for head hunters (programming jobs in general).
  • Those who can solve from ¼ to ½ of all levels, perhaps, can freely apply for reverse engineering/malware analysts/vulnerability research job positions.
  • If you feel even first level is too hard for you, you may probably drop the idea to learn RE.

You have a target, the book and the exercises. The rest is up to you.

Wrangler Conference 2015

Filed under: Conferences,Data Management — Patrick Durusau @ 3:38 pm

Wrangler Conference 2015

Videos!

Given the panel nature of some of the presentations, ordering these videos by speaker would not be terribly useful.

However, I have exposed the names of the participants in a single list of all the videos.

Enjoy!

Bytes that Rock! Software Awards 2015 (Nominations Open Now – Close 16th November 2015)

Filed under: Blogs,Contest,Games,Software — Patrick Durusau @ 2:38 pm

Bytes that Rock! Software Awards 2015 (Nominations Open Now – Close 16th November 2015)

An awards program for excellence in software and blogs!

The only limitation I could find is:

Bytes that Rock recognizes the best software and blogs for their excellence in the past 12 months.

Your game/software/blog may have been excellent three (3) years ago but that doesn’t count. 😉

Subject to that mild limitation, step up and:

Submit a blog, software or game clicking on the categories below!

Software blogs
VideoGame blogs
Security blogs

PC Software
Software UI
Innovative Software
Protection Software
Open Source Software

PC Games
Indie Games
Mods for games

This is not a next week, or after I ask X, or when I get home task.

This is a hit a submit link now task!

You will feel better after having made a nomination. Promise. 😉


Microsoft open sources Distributed Machine Learning Toolkit…

Filed under: Distributed Computing,Machine Learning,Microsoft,Open Source — Patrick Durusau @ 2:12 pm

Microsoft open sources Distributed Machine Learning Toolkit for more efficient big data research by George Thomas Jr.

From the post:

Researchers at the Microsoft Asia research lab this week made the Microsoft Distributed Machine Learning Toolkit openly available to the developer community.

The toolkit, available now on GitHub, is designed for distributed machine learning — using multiple computers in parallel to solve a complex problem. It contains a parameter server-based programing framework, which makes machine learning tasks on big data highly scalable, efficient and flexible. It also contains two distributed machine learning algorithms, which can be used to train the fastest and largest topic model and the largest word-embedding model in the world.

The toolkit offers rich and easy-to-use APIs to reduce the barrier of distributed machine learning, so researchers and developers can focus on core machine learning tasks like data, model and training.

The toolkit is unique because its features transcend system innovations by also offering machine learning advances, the researchers said. With the toolkit, the researchers said developers can tackle big-data, big-model machine learning problems much faster and with smaller clusters of computers than previously required.

For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.

This has been a banner week for machine learning!

On November 9th, Google open sourced TensorFlow.

On November 12th, Single Artificial Neuron Taught to Recognize Hundreds of Patterns (why neurons have thousands of synapses) was published.

On November 12th, Microsoft open sources its Distributed Machine Learning Toolkit.

Not every week is like that for machine learning but it is impressive when that many major stories drop in a week!

I do like the line from the Microsoft announcement:

For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. (emphasis added)

Prices are falling all the time, and a 24-machine cluster should be within the reach of most startups, if not most individuals, now. Next year? Possibly within the reach of a large number of individuals.

What are your machine learning plans for 2016?

More DMTK information.

Wandora – 2015-11-13 Release

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 1:44 pm

Wandora (download page)

The change log is rather brief:

Wandora 2015-11-13 fixes a lot of OS X related bugs. Release introduces enhanced subject locator previews for WWW resources, including videos, images, audio files and interactive fiction (z-machine). The release has been compiled and tested in Java 8.

Judging from tweets between this release and the prior one, new features include:

  • Subject locator preview for web pages
  • Subject locator preview for a #mp3 #ogg #mod #sidtune #wav

If you are new to Wandora be sure to check out the Wandora YouTube Channel.

I need to do an update on the Wandora YouTube Channel; lots of good stuff there!

You do not want to be an edge case [The True Skynet: Your Homogenized Future]

Filed under: Design,Humanities,Identification,Programming — Patrick Durusau @ 1:15 pm

You do not want to be an edge case.

John D. Cook writes:

Hilary Mason made an important observation on Twitter a few days ago:

You do not want to be an edge case in this future we are building.

Systems run by algorithms can be more efficient on average, but make life harder on the edge cases, people who are exceptions to the system developers’ expectations.

Algorithms, whether encoded in software or in rigid bureaucratic processes, can unwittingly discriminate against minorities. The problem isn’t recognized minorities, such as racial minorities or the disabled, but unrecognized minorities, people who were overlooked.

For example, two twins were recently prevented from getting their drivers licenses because DMV software couldn’t tell their photos apart. Surely the people who wrote the software harbored no malice toward twins. They just didn’t anticipate that two drivers licence applicants could have indistinguishable photos.

I imagine most people reading this have had difficulty with software (or bureaucratic procedures) that didn’t anticipate something about them; everyone is an edge case in some context. Maybe you don’t have a middle name, but a form insists you cannot leave the middle name field blank. Maybe there are more letters in your name or more children in your family than a programmer anticipated. Maybe you choose not to use some technology that “everybody” uses. Maybe you happen to have a social security number that hashes to a value that causes a program to crash.

When software routinely fails, there obviously has to be a human override. But as software improves for most people, there’s less apparent need to make provision for the exceptional cases. So things could get harder for edge cases as they get better for more people.

Recent advances in machine learning have led reputable thinkers (Stephen Hawking, for example) to envision a future where an artificial intelligence will arise to dispense with humanity.

If you think you have heard that theme before, you have, most recently as Skynet, an entirely fictional creation in the Terminator science fiction series.

Given that no one knows how the human brain works, much less how intelligence arises, such alarmist claims make good press, but the risk is less than that of a rogue black hole or a gamma-ray burst. I don’t lose sleep over either one of those, do you?

The greater “Skynet” threat to people and their cultures is the enforced homogenization of language and culture.

John mentions lacking a middle name, but consider the complexities of Japanese names. Due to the creeping infection of Western culture and computer-based standardization, many Japanese list their names in Western order (given name, family name) instead of the Japanese order (family name, given name).

Even languages can start the slide to being “edge cases,” as you will see from the erosion of Hangul (Korean alphabet) from public signs in Seoul.

Computers could be preserving languages and cultural traditions, they have the capacity and infinite patience.

But they are not being used for that purpose.

Cellphones, for example, are linking humanity into a seething mass of impoverished social interaction. Impoverished social interaction that is creating more homogenized languages, not preserving diverse ones.

Not only should you be an edge case but you should push back against the homogenizing impact of computers. The diversity we lose could well be your own.

November 12, 2015

The Architecture of Open Source Applications

Filed under: Books,Computer Science,Programming,Software,Software Engineering — Patrick Durusau @ 9:08 pm

The Architecture of Open Source Applications

From the webpage:

Architects look at thousands of buildings during their training, and study critiques of those buildings written by masters. In contrast, most software developers only ever get to know a handful of large programs well—usually programs they wrote themselves—and never study the great programs of history. As a result, they repeat one another’s mistakes rather than building on one another’s successes.

Our goal is to change that. In these two books, the authors of four dozen open source applications explain how their software is structured, and why. What are each program’s major components? How do they interact? And what did their builders learn during their development? In answering these questions, the contributors to these books provide unique insights into how they think.

If you are a junior developer, and want to learn how your more experienced colleagues think, these books are the place to start. If you are an intermediate or senior developer, and want to see how your peers have solved hard design problems, these books can help you too.

Follow us on our blog at http://aosabook.org/blog/, or on Twitter at @aosabook and using the #aosa hashtag.

I happened upon these four books because of a tweet that mentioned: Early Access Release of Allison Kaptur’s “A Python Interpreter Written in Python” Chapter, which I found to be the tenth chapter of “500 Lines.”

OK, but what the hell is “500 Lines?” Poking around a bit I found The Architecture of Open Source Applications.

Which is the source for the material I quote above.

Do you learn from example?

Let me give you the flavor of three of the completed volumes and the “500 Lines” that is in progress:

The Architecture of Open Source Applications: Elegance, Evolution, and a Few Fearless Hacks (vol. 1), from the introduction:

Carpentry is an exacting craft, and people can spend their entire lives learning how to do it well. But carpentry is not architecture: if we step back from pitch boards and miter joints, buildings as a whole must be designed, and doing that is as much an art as it is a craft or science.

Programming is also an exacting craft, and people can spend their entire lives learning how to do it well. But programming is not software architecture. Many programmers spend years thinking about (or wrestling with) larger design issues: Should this application be extensible? If so, should that be done by providing a scripting interface, through some sort of plugin mechanism, or in some other way entirely? What should be done by the client, what should be left to the server, and is “client-server” even a useful way to think about this application? These are not programming questions, any more than where to put the stairs is a question of carpentry.

Building architecture and software architecture have a lot in common, but there is one crucial difference. While architects study thousands of buildings in their training and during their careers, most software developers only ever get to know a handful of large programs well. And more often than not, those are programs they wrote themselves. They never get to see the great programs of history, or read critiques of those programs’ designs written by experienced practitioners. As a result, they repeat one another’s mistakes rather than building on one another’s successes.

This book is our attempt to change that. Each chapter describes the architecture of an open source application: how it is structured, how its parts interact, why it’s built that way, and what lessons have been learned that can be applied to other big design problems. The descriptions are written by the people who know the software best, people with years or decades of experience designing and re-designing complex applications. The applications themselves range in scale from simple drawing programs and web-based spreadsheets to compiler toolkits and multi-million line visualization packages. Some are only a few years old, while others are approaching their thirtieth anniversary. What they have in common is that their creators have thought long and hard about their design, and are willing to share those thoughts with you. We hope you enjoy what they have written.

The Architecture of Open Source Applications: Structure, Scale, and a Few More Fearless Hacks (vol. 2), from the introduction:

In the introduction to Volume 1 of this series, we wrote:

Building architecture and software architecture have a lot in common, but there is one crucial difference. While architects study thousands of buildings in their training and during their careers, most software developers only ever get to know a handful of large programs well… As a result, they repeat one another’s mistakes rather than building on one another’s successes… This book is our attempt to change that.

In the year since that book appeared, over two dozen people have worked hard to create the sequel you have in your hands. They have done so because they believe, as we do, that software design can and should be taught by example—that the best way to learn how think like an expert is to study how experts think. From web servers and compilers through health record management systems to the infrastructure that Mozilla uses to get Firefox out the door, there are lessons all around us. We hope that by collecting some of them together in this book, we can help you become a better developer.

The Performance of Open Source Applications, from the introduction:

It’s commonplace to say that computer hardware is now so fast that most developers don’t have to worry about performance. In fact, Douglas Crockford declined to write a chapter for this book for that reason:

If I were to write a chapter, it would be about anti-performance: most effort spent in pursuit of performance is wasted. I don’t think that is what you are looking for.

Donald Knuth made the same point thirty years ago:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

but between mobile devices with limited power and memory, and data analysis projects that need to process terabytes, a growing number of developers do need to make their code faster, their data structures smaller, and their response times shorter. However, while hundreds of textbooks explain the basics of operating systems, networks, computer graphics, and databases, few (if any) explain how to find and fix things in real applications that are simply too damn slow.

This collection of case studies is our attempt to fill that gap. Each chapter is written by real developers who have had to make an existing system faster or who had to design something to be fast in the first place. They cover many different kinds of software and performance goals; what they have in common is a detailed understanding of what actually happens when, and how the different parts of large applications fit together. Our hope is that this book will—like its predecessor The Architecture of Open Source Applications—help you become a better developer by letting you look over these experts’ shoulders.

500 Lines or Less From the GitHub page:

Every architect studies family homes, apartments, schools, and other common types of buildings during her training. Equally, every programmer ought to know how a compiler turns text into instructions, how a spreadsheet updates cells, and how a database efficiently persists data.

Previous books in the AOSA series have done this by describing the high-level architecture of several mature open-source projects. While the lessons learned from those stories are valuable, they are sometimes difficult to absorb for programmers who have not yet had to build anything at that scale.

“500 Lines or Less” focuses on the design decisions and tradeoffs that experienced programmers make when they are writing code:

  • Why divide the application into these particular modules with these particular interfaces?
  • Why use inheritance here and composition there?
  • How do we predict where our program might need to be extended, and how can we make that easy for other programmers

Each chapter consists of a walkthrough of a program that solves a canonical problem in software engineering in at most 500 source lines of code. We hope that the material in this book will help readers understand the varied approaches that engineers take when solving problems in different domains, and will serve as a basis for projects that extend or modify the contributions here.

If you answered yes to the question about learning from example, add these works to your read and re-read list.

BTW, for markup folks, check out Parsing XML at the Speed of Light by Arseny Kapoulkine.

Many hours of reading and keyboard pleasure await anyone using these volumes.

Visualizing What Your Computer (and Science) Ignore (mostly)

Filed under: Computer Science,Geometry,Image Processing,Image Understanding,Physics — Patrick Durusau @ 8:01 pm

Deviation Magnification: Revealing Departures from Ideal Geometries by Neal Wadhwa, Tali Dekel, Donglai Wei, Frédo Durand, William T. Freeman.

Abstract:

Structures and objects are often supposed to have idealized geometries such as straight lines or circles. Although not always visible to the naked eye, in reality, these objects deviate from their idealized models. Our goal is to reveal and visualize such subtle geometric deviations, which can contain useful, surprising information about our world. Our framework, termed Deviation Magnification, takes a still image as input, fits parametric models to objects of interest, computes the geometric deviations, and renders an output image in which the departures from ideal geometries are exaggerated. We demonstrate the correctness and usefulness of our method through quantitative evaluation on a synthetic dataset and by application to challenging natural images.

The video for the paper is quite compelling:

Read the full paper here: http://people.csail.mit.edu/nwadhwa/deviation-magnification/DeviationMagnification.pdf

From the introduction to the paper:

Many phenomena are characterized by an idealized geometry. For example, in ideal conditions, a soap bubble will appear to be a perfect circle due to surface tension, buildings will be straight and planetary rings will form perfect elliptical orbits. In reality, however, such flawless behavior hardly exists, and even when invisible to the naked eye, objects depart from their idealized models. In the presence of gravity, the bubble may be slightly oval, the building may start to sag or tilt, and the rings may have slight perturbations due to interactions with nearby moons. We present Deviation Magnification, a tool to estimate and visualize such subtle geometric deviations, given only a single image as input. The output of our algorithm is a new image in which the deviations from ideal are magnified. Our algorithm can be used to reveal interesting and important information about the objects in the scene and their interaction with the environment. Figure 1 shows two independently processed images of the same house, in which our method automatically reveals the sagging of the house’s roof, by estimating its departure from a straight line.
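The pipeline is easy to mimic in miniature: fit an idealized model, compute the residuals, exaggerate them. A toy numpy sketch of that idea (my simplification, not the authors’ method, which operates on real images):

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = 0.3 * x + 1.0 + 0.05 * np.sin(2 * x)        # almost straight, subtly bowed

slope, intercept = np.polyfit(x, y, 1)           # the idealized geometry: a line
ideal = slope * x + intercept

k = 20                                           # magnification factor
y_magnified = ideal + k * (y - ideal)            # exaggerate the departures

print(np.abs(y - ideal).max(), np.abs(y_magnified - ideal).max())
```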

Departures from “idealized geometry” make for captivating videos but there is a more subtle point that Deviation Magnification will help bring to the fore.

“Idealized geometry,” just like discrete metrics for attitude measurement or metrics of meaning, is a myth. A useful myth, though: houses don’t (usually) fall down, marketing campaigns have a high degree of success, and engineering successfully relies on approximations that depart from the “real world.”

Science and computers have a degree of precision that has no counterpart in the “real world.”

Watch the video again if you doubt that last statement.

Whether you are using science and/or a computer, always remember that your results are approximations based upon approximations.

I first saw this in Four Short Links: 12 November 2015 by Nat Torkington.

R for cats

Filed under: Programming,R — Patrick Durusau @ 5:36 pm

An intro to R for new programmers by Scott Chamberlain.

From the webpage:

This is an introduction to R. I promise this will be fun. Since you have never used a programming language before, or any language for that matter, you won’t be tainted by other programming languages with different ways of doing things. This is good – we can teach you the R way of doing things.

Scott says this site is a rip-off of JSforcats.com and I suggest we take his word for it.

If being “for cats” interests people who would not otherwise study either language, great.

Enjoy!

Why Use Make

Filed under: Replication,Workflow — Patrick Durusau @ 5:08 pm

Why Use Make by Mike Bostock.

From the post:

I love Make. You may think of Make as merely a tool for building large binaries or libraries (and it is, almost to a fault), but it’s much more than that. Makefiles are machine-readable documentation that make your workflow reproducible.

To illustrate with a recent example: yesterday Kevin and I needed to update a six-month old graphic on drought to accompany a new article on thin snowpack in the West. The article was already on the homepage, so the clock was ticking to republish with new data as soon as possible.

Shamefully, I hadn’t documented the data-transformation process, and it’s painfully easy to forget details over six months: I had a mess of CSV and GeoJSON data files, but not the exact source URL from the NCDC; I was temporarily confused as to the right Palmer drought metric (Drought Severity Index or Z Index?) and the corresponding categorical thresholds; finally, I had to resurrect the code to calculate drought coverage area.

Despite these challenges, we republished the updated graphic without too much delay. But I was left thinking how much easier it could have been had I simply recorded the process the first time as a makefile. I could have simply typed make in the terminal and be done!

Remember how science has been losing the ability to replicate experiments due to computers? How Computers Broke Science… [Soon To Break Businesses …]

So you are trying to remember and explain to an opposing attorney the process you went through in processing data. After about three hours of sharp questioning, how clear do you think you will be? Will you really remember every step? The source of every file?

Had you documented your workflow, you could read from your makefile and say exactly what happened, in what order, and with what sources. You do need to use the makefile every time if you want anyone to believe it represents what actually happened.

You will be on more solid ground than trying to remember which files, the dates on those files, their content, etc.

Mike concludes his post with:

So do your future self and coworkers a favor, and use Make!

Let’s modify that to read:

So do your future self, coworkers, and lawyer a favor, and use Make!

I first saw this in a tweet by Christophe Lalanne.

Why Neurons Have Thousands of Synapses! (Quick! Someone Call the EU Brain Project!)

Single Artificial Neuron Taught to Recognize Hundreds of Patterns.

From the post:

Artificial intelligence is a field in the midst of rapid, exciting change. That’s largely because of an improved understanding of how neural networks work and the creation of vast databases to help train them. The result is machines that have suddenly become better at things like face and object recognition, tasks that humans have always held the upper hand in (see “Teaching Machines to Understand Us”).

But there’s a puzzle at the heart of these breakthroughs. Although neural networks are ostensibly modeled on the way the human brain works, the artificial neurons they contain are nothing like the ones at work in our own wetware. Artificial neurons, for example, generally have just a handful of synapses and entirely lack the short, branched nerve extensions known as dendrites and the thousands of synapses that form along them. Indeed, nobody really knows why real neurons have so many synapses.

Today, that changes thanks to the work of Jeff Hawkins and Subutai Ahmad at Numenta, a Silicon Valley startup focused on understanding and exploiting the principles behind biological information processing. The breakthrough these guys have made is to come up with a new theory that finally explains the role of the vast number of synapses in real neurons and to create a model based on this theory that reproduces many of the intelligent behaviors of real neurons.

A very enjoyable and accessible summary of a paper on the cutting edge of neuroscience!

Relevant to another concern, one that I will be covering in the near future, the post concludes with:


One final point is that this new thinking does not come from an academic environment but from a Silicon Valley startup. This company is the brain child of Jeff Hawkins, an entrepreneur, inventor and neuroscientist. Hawkins invented the Palm Pilot in the 1990s and has since turned his attention to neuroscience full-time.

That’s an unusual combination of expertise but one that makes it highly likely that we will see these new artificial neurons at work on real world problems in the not too distant future. Incidentally, Hawkins and Ahmad call their new toys Hierarchical Temporal Memory neurons or HTM neurons. Expect to hear a lot more about them.

If you want all the details, see:

Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in Neocortex by Jeff Hawkins, Subutai Ahmad.

Abstract:

Neocortical neurons have thousands of excitatory synapses. It is a mystery how neurons integrate the input from so many synapses and what kind of large-scale network behavior this enables. It has been previously proposed that non-linear properties of dendrites enable neurons to recognize multiple patterns. In this paper we extend this idea by showing that a neuron with several thousand synapses arranged along active dendrites can learn to accurately and robustly recognize hundreds of unique patterns of cellular activity, even in the presence of large amounts of noise and pattern variation. We then propose a neuron model where some of the patterns recognized by a neuron lead to action potentials and define the classic receptive field of the neuron, whereas the majority of the patterns recognized by a neuron act as predictions by slightly depolarizing the neuron without immediately generating an action potential. We then present a network model based on neurons with these properties and show that the network learns a robust model of time-based sequences. Given the similarity of excitatory neurons throughout the neocortex and the importance of sequence memory in inference and behavior, we propose that this form of sequence memory is a universal property of neocortical tissue. We further propose that cellular layers in the neocortex implement variations of the same sequence memory algorithm to achieve different aspects of inference and behavior. The neuron and network models we introduce are robust over a wide range of parameters as long as the network uses a sparse distributed code of cellular activations. The sequence capacity of the network scales linearly with the number of synapses on each neuron. Thus neurons need thousands of synapses to learn the many temporal patterns in sensory stimuli and motor sequences.
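Reading the abstract, the core detection step sounds like a thresholded overlap between a sparse set of active cells and a dendritic segment’s synapses. Here is a toy numpy sketch of that reading (my simplification with made-up parameters, not Numenta’s NuPIC code):

```python
import numpy as np

n_cells = 2048                                         # sparse cell population
rng = np.random.default_rng(42)
segment = rng.choice(n_cells, size=40, replace=False)  # synapses on one segment
threshold = 15                                         # coincidences needed to fire

def segment_matches(active_cells, segment, threshold):
    """True when enough of the segment's synapses see active cells at once."""
    return np.intersect1d(active_cells, segment).size >= threshold

learned_pattern = segment                               # perfect overlap
random_pattern = rng.choice(n_cells, size=40, replace=False)

print(segment_matches(learned_pattern, segment, threshold))  # True
print(segment_matches(random_pattern, segment, threshold))   # almost surely False
```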

BTW, did I mention the full source code is available at: https://github.com/numenta/nupic?

Coming from a startup, this discovery doesn’t have a decade of support for travel, meals, lodging, support staff, publications, administrative overhead, etc., for a cast of hundreds across the EU. But then, that decade would not have resulted in such a fundamental discovery in any event.

Is that a hint about the appropriate vehicle for advancing fundamental discoveries in science?

Quartz to open source two mapping tools

Filed under: Journalism,Mapping,Maps,Open Source,Reporting — Patrick Durusau @ 3:53 pm

Quartz to open source two mapping tools by Caroline Scott.

From the post:

News outlet Quartz is developing a searchable database of compiled map data from all over the world, and a tool to help journalists visualise this data.

The database, called Mapquery, received $35,000 (£22,900) from the Knight Foundation Prototype Fund on 3 November.

Keith Collins, project lead, said Mapquery will aim to make the research stage in the creation of maps easier and more accessible, by creating a system for finding, merging and refining geographic data.

Mapquery will not be able to produce visual maps itself, as it simply provides a database of information from which maps can be created – so Quartz will also open source Mapbuilder as the “front end” that will enable journalists to visualise the data.

Quartz aims to have a prototype of Mapquery by April, and will continue to develop Mapbuilder afterwards.

That’s news to look forward to in 2016!

I’m really curious where Quartz is going to draw the boundary around “map data.” The post mentions Mapquery including “historical boundary data,” which would be very useful for some stories but is traditional “map data.”

What if Mapquery could integrate people who have posted images with geographic locations? So a reporter could quickly access a list of potential witnesses for events the Western media doesn’t cover?

Live feeds of the results of US bombing raids against ISIS, for example. (Whether the lack of coverage is out of deference to the US military propaganda machine or for other reasons, I can’t say.)

Looking forward to more news on Mapquery and Mapbuilder!

I first saw this in a tweet by Journalism Tools.

November 11, 2015

TensorFlow – A Collection of Resources

Filed under: Machine Learning,TensorFlow — Patrick Durusau @ 8:36 pm

Your Twitter account is groaning from tweets and retweets about Google open sourcing TensorFlow, its machine learning system.

To help you (and myself) cut through the clutter, I have gleaned the following resources from tweets about TensorFlow:

1. Ground zero for TensorFlow: TensorFlow: smarter machine learning, for everyone by Sundar Pichai, CEO, Google.

Just a couple of years ago, you couldn’t talk to the Google app through the noise of a city sidewalk, or read a sign in Russian using Google Translate, or instantly find pictures of your Labradoodle in Google Photos. Our apps just weren’t smart enough. But in a short amount of time they’ve gotten much, much smarter. Now, thanks to machine learning, you can do all those things pretty easily, and a lot more. But even with all the progress we’ve made with machine learning, it could still work much better.

So we’ve built an entirely new machine learning system, which we call “TensorFlow.” TensorFlow is faster, smarter, and more flexible than our old system, so it can be adapted much more easily to new products and research. It’s a highly scalable machine learning system—it can run on a single smartphone or across thousands of computers in datacenters. We use TensorFlow for everything from speech recognition in the Google app, to Smart Reply in Inbox, to search in Google Photos. It allows us to build and train neural nets up to five times faster than our first-generation system, so we can use it to improve our products much more quickly.


We’ve seen firsthand what TensorFlow can do, and we think it could make an even bigger impact outside Google. So today we’re also open-sourcing TensorFlow. We hope this will let the machine learning community—everyone from academic researchers, to engineers, to hobbyists—exchange ideas much more quickly, through working code rather than just research papers. And that, in turn, will accelerate research on machine learning, in the end making technology work better for everyone. Bonus: TensorFlow is for more than just machine learning. It may be useful wherever researchers are trying to make sense of very complex data—everything from protein folding to crunching astronomy data.

Machine learning is still in its infancy—computers today still can’t do what a 4-year-old can do effortlessly, like knowing the name of a dinosaur after seeing only a couple examples, or understanding that “I saw the Grand Canyon flying to Chicago” doesn’t mean the canyon is hurtling over the city. We have a lot of work ahead of us. But with TensorFlow we’ve got a good start, and we can all be in it together.

2. Homepage: TensorFlow is an Open Source Software Library for Machine Intelligence, which reads in part:

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

3. The TensorFlow whitepaper: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, by Martín Abadi, et al. (forty authors).

Abstract:

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

4. Tutorials: Tutorials and Machine Learning Examples. I have omitted the descriptions to provide a quick pick-list of the current tutorial materials.

5. Non-Google Tutorials/Examples of TensorFlow:

TensorFlow Tutorials by Nathan Lintz.

Introduction to deep learning based on Google’s TensorFlow framework. These tutorials are direct ports of Newmu’s Theano Tutorials.

TensorFlow-Examples by Aymeric Damien.

Basic code examples for some machine learning algorithms, using TensorFlow library.

BTW, as of Wednesday, November 11, 2015, 22:10 UTC, GitHub shows 70 repository results for TensorFlow. The two I list above were in my Twitter stream.

Given the time lapse between being open-sourced and non-Google examples on GitHub, it looks like TensorFlow is going to be popular. What do you think?

6. Benchmark TensorFlow by Soumith Chintala.

This will show up in the GitHub search link but I wanted to call it out for the high-quality use of benchmarks and the discussion that follows. (I am sure there are other high-quality discussions, but I haven’t seen them and therefore have not captured them.)

7. TensorFlow: Second Generation Deep Learning System by Jeff Dean. (approximately 45 minutes)

8. Rob Story opines that TensorFlow is built on the same ideas as @mrocklin‘s Dask (dask.pydata.org/en/latest).

9. If you need a Popular Science article to pass onto management, Dave Gershgorn has you covered with: How Google Aims To Dominate AI: The Search Giant Is Making Its AI Open Source So Anyone Can Use It.

In November 2007, Google laid the groundwork to dominate the mobile market by releasing Android, an open source operating system for phones. Eight years later to the month, Android has an 80 percent market share, and Google is using the same trick—this time with artificial intelligence.

Today Google is announcing TensorFlow, its open source platform for machine learning, giving anyone a computer and internet connection (and casual background in deep learning algorithms) access to one of the most powerful machine learning platforms ever created. More than 50 Google products have adopted TensorFlow to harness deep learning (machine learning using deep neural networks) as a tool, from identifying you and your friends in the Photos app to refining its core search engine. Google has become a machine learning company. Now they’re taking what makes their services special, and giving it to the world. (emphasis in original)

10. One short commentary: Google TensorFlow: Updates & Lessons by Delip Rao.

TensorFlow came out today, and like the rest of the ML world, I buried myself with it. I have never been more excited about a new open source code. There are actionable tutorials etc. on the home page that’s worth checking out, but I wanted to know if this was yet another computational graph framework — we already have Theano and CGT (CGT is fast; Theano is most popular).

Links for Delip’s post:

Computational Graph Toolkit (CGT)

Theano


In the time it took to collect these resources on TensorFlow, I am certain more have appeared, but hopefully these will remain fundamental starting points for everyone interested in TensorFlow.
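If you install TensorFlow and want a quick smoke test, here is a minimal graph using the session-based Python API as released in 2015 (later releases changed this API, so treat it as a period sketch):

```python
import tensorflow as tf

# Build a tiny data flow graph: two constant tensors and a matrix multiply.
a = tf.constant([[1.0, 2.0]])          # shape (1, 2)
b = tf.constant([[3.0], [4.0]])        # shape (2, 1)
product = tf.matmul(a, b)              # shape (1, 1)

# Nothing runs until the graph is executed inside a session.
with tf.Session() as sess:
    print(sess.run(product))           # [[ 11.]]
```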

Enjoy!

November 10, 2015

The Gene Hackers [Chaos Remains King]

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics — Patrick Durusau @ 8:46 pm

The Gene Hackers by Michael Specter.

From the post:

It didn’t take Zhang or other scientists long to realize that, if nature could turn these molecules into the genetic equivalent of a global positioning system, so could we. Researchers soon learned how to create synthetic versions of the RNA guides and program them to deliver their cargo to virtually any cell. Once the enzyme locks onto the matching DNA sequence, it can cut and paste nucleotides with the precision we have come to expect from the search-and-replace function of a word processor. “This was a finding of mind-boggling importance,” Zhang told me. “And it set off a cascade of experiments that have transformed genetic research.”

With CRISPR, scientists can change, delete, and replace genes in any animal, including us. Working mostly with mice, researchers have already deployed the tool to correct the genetic errors responsible for sickle-cell anemia, muscular dystrophy, and the fundamental defect associated with cystic fibrosis. One group has replaced a mutation that causes cataracts; another has destroyed receptors that H.I.V. uses to infiltrate our immune system.

The potential impact of CRISPR on the biosphere is equally profound. Last year, by deleting all three copies of a single wheat gene, a team led by the Chinese geneticist Gao Caixia created a strain that is fully resistant to powdery mildew, one of the world’s most pervasive blights. In September, Japanese scientists used the technique to prolong the life of tomatoes by turning off genes that control how quickly they ripen. Agricultural researchers hope that such an approach to enhancing crops will prove far less controversial than using genetically modified organisms, a process that requires technicians to introduce foreign DNA into the genes of many of the foods we eat.

The technology has also made it possible to study complicated illnesses in an entirely new way. A few well-known disorders, such as Huntington’s disease and sickle-cell anemia, are caused by defects in a single gene. But most devastating illnesses, among them diabetes, autism, Alzheimer’s, and cancer, are almost always the result of a constantly shifting dynamic that can include hundreds of genes. The best way to understand those connections has been to test them in animal models, a process of trial and error that can take years. CRISPR promises to make that process easier, more accurate, and exponentially faster.

Deeply compelling read on the stellar career of Feng Zhang and his use of “clustered regularly interspaced short palindromic repeats” (CRISPR) for genetic engineering.

If you are up for the technical side, try PubMed on CRISPR at 2,306 “hits” as of today.

If not, continue with Michael’s article. You will get enough background to realize this is a very profound moment in the development of genetic engineering.

A profound moment that can be made all the more valuable by linking its results to the results (not articles or summaries of articles) of prior research.

Proposals for repackaging data in some yet-to-be-invented format are a non-starter from my perspective. That is more akin to the EU science/WPA projects than a realistic prospect for value-add.

Let’s start with the assumption that when held in electronic format, data has its native format as a given. Nothing we can change about that part of the problem of access.

Whether labbooks, databases, triple stores, etc.

That one assumption reduces worries about corrupting the original data and introduces a sense of “tinkering” with existing data interfaces. (Watch for a post tomorrow on the importance of “tinkering.”)

Hmmm, nodes anyone?

PS: I am not overly concerned about genetic “engineering.” My money is riding on chaos in genetics and environmental factors.

Editors’ Choice: An Introduction to the Textreuse Package [+ A Counter Example]

Filed under: R,Similarity,Similarity Retrieval,Text Mining — Patrick Durusau @ 5:58 pm

Editors’ Choice: An Introduction to the Textreuse Package by Lincoln Mullen.

From the post:

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. (emphasis added)

Kudos to Lincoln on this important contribution to the digital humanities! Not to mention that the package will also be useful for researchers who want to compare the “similarity” of texts as “subjects” in order to eliminate duplication (called merging in some circles) before presenting them to a reader.
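To make “full text goes in and measures of similarity come out” concrete, here is a minimal sketch in R of the idea underneath the package: shingle two texts into n-grams and compute their Jaccard similarity. The toy texts are invented for illustration, and the textreuse function names mentioned in the comments are pointers to its documentation rather than a tested recipe.

# Minimal sketch: n-gram shingling plus Jaccard similarity in base R.
# The textreuse package wraps this workflow (e.g. TextReuseCorpus, jaccard_similarity,
# pairwise_compare) and scales it with minhashing; see its vignettes for the real API.

shingle <- function(text, n = 5) {
  words <- unlist(strsplit(tolower(text), "\\W+"))
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  # overlapping n-word shingles
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

jaccard <- function(a, b) {
  A <- unique(shingle(a)); B <- unique(shingle(b))
  length(intersect(A, B)) / length(union(A, B))
}

# A lightly reworded passage scores high; an unrelated passage scores near zero.
code_a <- "every person who has a cause of action may file a complaint with the clerk"
code_b <- "every person who has a cause of action shall file a complaint with the clerk"
other  <- "the weather in new orleans was unusually warm for the middle of november"

jaccard(code_a, code_b)  # roughly 0.4
jaccard(code_a, other)   # 0

As the scholarly dispute recounted below shows, a high score tells you that two texts share wording, not which one is the “source” of the other.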

I highlighted

Put most simply, full text goes in and measures of similarity come out.

to offer a cautionary tale about the assumption that a high measure of similarity is an indication of the “source” of a text.

Louisiana, my home state, is the only civil law (or “civilian”) jurisdiction in the United States. Louisiana law, more at one time than now, is based upon Roman law.

Roman law and laws based upon it have a very deep and rich history that I won’t even attempt to summarize.

It is sufficient for present purposes to say the Digest of the Civil Laws now in Force in the Territory of Orleans (online version, English/French) was enacted in 1808.

A scholarly dispute arose (1971-1972) between Professor Batiza (Tulane), who considered the Digest to reflect the French civil code, and Professor Pascal (LSU), who argued that, despite quoting the French civil code quite liberally, the redactors intended to codify the Spanish civil law in force at the time of the Louisiana Purchase.

The Batiza vs. Pascal debate was carried out at length and in public:

Batiza, The Louisiana Civil Code of 1808: Its Actual Sources and Present Relevance, 46 TUL. L. REV. 4 (1971); Pascal, Sources of the Digest of 1808: A Reply to Professor Batiza, 46 TUL.L.REV. 603 (1972); Sweeney, Tournament of Scholars over the Sources of the Civil Code of 1808, 46 TUL. L. REV. 585 (1972); Batiza, Sources of the Civil Code of 1808, Facts and Speculation: A Rejoinder, 46 TUL. L. REV. 628 (1972).

I could not find any freely available copies of those articles online. (Don’t encourage paywalls by paying for access to such material. Find the articles at your local law library.)

There are a couple of secondary articles that discuss the dispute: A.N. Yiannopoulos, The Civil Codes of Louisiana, 1 CIV. L. COMMENT. 1, 1 (2008) at http://www.civil-law.org/v01i01-Yiannopoulos.pdf, and John W. Cairns, The de la Vergne Volume and the Digest of 1808, 24 Tulane European & Civil Law Forum 31 (2009), which are freely available online.

You won’t get the full details from the secondary articles but they do capture some of the flavor of the original dispute. I can report (happily) that over time, Pascal’s position has prevailed. Textual history is more complex than rote counting techniques can capture.

A far more complex case of “text similarity” than Lincoln addresses in the Textreuse package, but once you move beyond freshman/doctoral plagiarism, the “interesting cases” are all complicated.

Let’s Make Clojure.org Better! [Lurk No Longer!]

Filed under: Clojure — Patrick Durusau @ 4:11 pm

Let’s Make Clojure.org Better! by Alex Miller.

From the post:

The Clojure community is full of talented writers and valuable experience, and together we can create great documentation for the language and the ecosystem. With that in mind, we are happy to announce a new initiative to replace the existing http://clojure.org site. The new site will contain most of the existing reference and information content but will also provide an opportunity for additional guides or tutorials about Clojure and its ecosystem.

The new site content is hosted in a GitHub repository and is open for contributions. All contributions require a signed Clojure Contributor Agreement. This repository will accept contributions via pull request and issues with GitHub issues. The contribution and review process is described in more detail on the site contribution page.

We are currently working on the design elements for the site but if you would like to suggest a new guide, tutorial, or other content to be included on the site, please file an issue for discussion or create a thread on the Clojure mailing list with [DOCS] in the subject. There will be an unsession at the Clojure/conj conference next week for additional discussion. This is the beginning of a process, and things will likely evolve in the future. In the meantime, we look forward to seeing your contributions!

You have been waiting to showcase your writing and/or Clojure talents for some time now. Here’s your chance to lurk no longer!

Don’t keep all that talent hidden under a bushel basket! Step up and make the Clojure community, not to mention Clojure.org, the beneficiary of it.

You never know what connections you will make or who will become aware of your talents.

You never will know if you don’t participate. Your call.

How Computers Broke Science… [Soon To Break Businesses …]

Filed under: Business Intelligence,Replication,Scientific Computing,Transparency — Patrick Durusau @ 3:04 pm

How Computers Broke Science — and What We can do to Fix It by Ben Marwick.

From the post:

Reproducibility is one of the cornerstones of science. Made popular by British scientist Robert Boyle in the 1660s, the idea is that a discovery should be reproducible before being accepted as scientific knowledge.

In essence, you should be able to produce the same results I did if you follow the method I describe when announcing my discovery in a scholarly publication. For example, if researchers can reproduce the effectiveness of a new drug at treating a disease, that’s a good sign it could work for all sufferers of the disease. If not, we’re left wondering what accident or mistake produced the original favorable result, and would doubt the drug’s usefulness.

For most of the history of science, researchers have reported their methods in a way that enabled independent reproduction of their results. But, since the introduction of the personal computer — and the point-and-click software programs that have evolved to make it more user-friendly — reproducibility of much research has become questionable, if not impossible. Too much of the research process is now shrouded by the opaque use of computers that many researchers have come to depend on. This makes it almost impossible for an outsider to recreate their results.

Recently, several groups have proposed similar solutions to this problem. Together they would break scientific data out of the black box of unrecorded computer manipulations so independent readers can again critically assess and reproduce results. Researchers, the public, and science itself would benefit.

Whether you are looking for specific proposals to make computed results capable of replication or quotes to support that idea, this is a good first stop.

FYI for business analysts: how are you going to replicate the results of computer runs to establish your “due diligence” before making critical business decisions?

What looked like a science or academic issue has liability implications!

Changing a few variables in a spreadsheet, or in a more complex machine learning model, can make you look criminally negligent, if not criminal.

The computer illiteracy/incompetence of prosecutors and litigants is only going to last so long. Prepare defensive audit trails to enable the replication of your actual* computer-based business analysis.

*I offer advice on techniques for such audit trails. The audit trails you choose to build are up to you.
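As one illustration of what such an audit trail might look like, here is a minimal R sketch, with placeholder file names and seed, that records the script hash, input data fingerprints, random seed, and session details alongside each run. Treat it as one possible approach, not a compliance recipe.

# Minimal sketch of a defensive audit trail for an R analysis.
# File names and the seed are placeholders; adapt to your own pipeline.

run_with_audit <- function(script, inputs, audit_dir = "audit") {
  dir.create(audit_dir, showWarnings = FALSE)
  stamp    <- format(Sys.time(), "%Y%m%d-%H%M%S")
  log_file <- file.path(audit_dir, paste0("run-", stamp, ".log"))

  seed <- 20151107
  set.seed(seed)  # fix randomness so the run can be repeated exactly

  # Record what went in: script hash, input hashes, seed, and environment.
  record <- c(
    paste("run at:", Sys.time()),
    paste("seed:  ", seed),
    paste("script:", script, "md5:", tools::md5sum(script)),
    paste("input: ", inputs, "md5:", tools::md5sum(inputs)),
    capture.output(sessionInfo())
  )
  writeLines(record, log_file)

  source(script)  # the actual analysis
  log_file        # return the location of the audit record
}

# Usage, with placeholder names:
# run_with_audit("analysis.R", c("sales.csv", "rates.csv"))

The particular functions matter less than the habit: every number that reaches a decision maker should be traceable to a logged script, logged inputs, and a logged environment.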

November 9, 2015

Vintage Infodesign [138] Old Maps, Charts and Graphics

Filed under: Graphics,Maps,Visualization — Patrick Durusau @ 11:50 am

Vintage Infodesign [138] Old Maps, Charts and Graphics by Tiago Veloso.

From the post:

Those who follow these weekly updates with vintage examples of information design know how maps fill a good portion of our posts. Cartography has been having a crucial role in our lives for centuries and two recent books help understand this influence throughout the ages: The Art of Illustrated Maps by John Roman, and Map: Exploring The World, featuring some of the most influential mapmakers and institutions in history, like Gerardus Mercator, Abraham Ortelius, Phyllis Pearson, Heinrich Berann, Bill Rankin, Ordnance Survey and Google Earth.

Gretchen Peterson reviewed the first one in this article, with a few questions answered by the author. As for the second book recommendation, you can learn more about it in this interview conducted by Mark Byrnes with John Hessler, a cartography expert at the Library of Congress and one of the people behind the book, published in CityLab. Both publications seem quite a treat for map lovers and additions to …

All delightful and instructive, but I think my favorite is How Many Will Die Flying the Atlantic This Season? (Aug, 1931).

The cover is a must-see graphic/map.

It reminds me of the over-the-top government reports on terrorism, which are dutifully parroted by both traditional and online media.

Any sane person who looks at the statistics for causes of death in Canada, the United States, and Europe will conclude that “terrorism” is a government-fueled and media-driven non-event. Terrorist events should qualify as Trivial Pursuit questions.

The infrequent victims of terrorism and their families deserve all the support and care we can provide. But the same is true of traffic accident victims and they are far more common than victims of terrorism.

November 8, 2015

600 websites about R [How to Avoid Duplicate Content?]

Filed under: Indexing,R,Searching — Patrick Durusau @ 9:44 pm

600 websites about R by Laetitia Van Cauwenberge.

From the post:

Anyone interested in categorizing them? It could be an interesting data science project, scraping these websites, extracting keywords, and categorizing them with a simple indexation or tagging algorithm. For instance, some of these blogs cater about stats, or Bayesian stats, or R libraries, or R training, or visualization, or anything else. This indexation technique was used here to classify 2,500 data science websites. For web crawling tutorials, click here or here.

BTW, Laetitia lists, with links, all 600 R sites.
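If you want to try the experiment Laetitia describes, here is a minimal sketch in R, assuming the rvest package, with a placeholder URL and stop-word list, that pulls the text of one site from the list and extracts candidate keywords by simple term frequency.

# Minimal sketch: fetch one R-related page and surface its most frequent terms.
# Assumes the rvest package; the URL and stop words are placeholders.
library(rvest)

top_terms <- function(url, k = 20) {
  page  <- read_html(url)
  text  <- html_text(page)                        # all visible text on the page
  words <- unlist(strsplit(tolower(text), "\\W+"))
  words <- words[nchar(words) > 2]                # drop very short tokens
  stop_words <- c("the", "and", "for", "with", "that", "this", "you", "are")
  words <- words[!words %in% stop_words]
  head(sort(table(words), decreasing = TRUE), k)  # crude keyword candidates
}

# Usage, with a placeholder URL from the list:
# top_terms("https://www.r-bloggers.com/")

Run over all 600 sites, term tables like these could feed a simple indexation or tagging step, but as the questions that follow suggest, frequency counts alone will not tell you which content is unique.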

How many of those R sites will you visit?

Or will you scan the list for your site or your favorite R site?

For that matter, how much duplicated content are you going to find at those R sites?

All have some unique content, but neither an index nor a classification will help you find it.

Thinking of this as a potential data science experiment, we have a list of 600 sites with content related to R.

What would be your next step towards avoiding duplicated content?

By what criteria would you judge “success” in avoiding duplicate content?

November 7, 2015

Fed Security Sprint – Ans: Multi-Year Egg Roll

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 8:26 pm

You may recall my post: Cybersecurity Sprint or Multi-Year Egg Roll?.

Back in June 2015, the White House, via Chief Information Officer Tony Scott, ordered all agencies to undertake a 30-day security sprint.

I must report that the FBI didn’t get the memo.

If you want to help the FBI with its security efforts, email or call them with a link to my earlier posting.

I say that because today it was confirmed that the 30-day security sprint is turning into a multi-year egg roll, which answers the concluding question in that post.

I read today about Crackas With Attitude (CWA) hacking into the Joint Automated Booking System (JABS) (think FBI and law enforcement access only).

Swati Khandelwal reports in Hackers have Hacked into US Arrest Records Database:

The hacking group, Crackas With Attitude (CWA), claims it has gained access to a Law Enforcement Portal through which one can access:

  • Arrest records
  • Tools for sharing information about terrorist events and active shooters

The system in question is reportedly known as the Joint Automated Booking System (JABS), which is only available to the Federal Bureau of Investigation (FBI) and law enforcement.

Today is November the 7th and as I track time, we are way past Tony Scott’s 30-day security sprint.

I did check: Tony Scott is still the Chief Information Officer for the United States, and he recently blogged about federal agencies using strong authentication over 80% of the time.

I guess that information resources like Joint Automated Booking System (JABS) must not be high enough priority to qualify for strong authentication.

Or perhaps Crackas With Attitude (CWA) have broken what the FBI considers to be strong authentication.

Maybe Crackas With Attitude (CWA) will dump raw data to the Dark Web from their hack. Give everyone a chance to see what the FBI considers to be low-value data.

November 6, 2015

Learn R From Scratch

Filed under: Programming,R — Patrick Durusau @ 11:52 am

Learn R From Scratch

From the description:

A Channel dedicated to R Programming – The language of Data Science. We notice people learning the language in parts, so the initial lectures are dedicated to teach the language to aspiring Data Science Professionals, in a structured fashion so that you learn the language completely and be able to contribute back to the community. Upon taking the course, you will appreciate the inherent brilliance of R.

If I haven’t missed anything, thirty-seven (37) R videos await your viewing pleasure!

None of the videos are long; the vast majority are shorter than four (4) minutes, but a skilled instructor can put a lot into a four-minute video.

The short length means you can catch a key concept and go on to practice it before it fades from memory. Plus you can find time for a short video when finding time for an hour lecture is almost impossible.

Enjoy!
