Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 18, 2014

Twitter Now Lets You Search For Any Tweet Ever Sent

Filed under: Searching,Tweets — Patrick Durusau @ 1:46 pm

Twitter Now Lets You Search For Any Tweet Ever Sent by Cade Metz.

From the post:


This morning, Twitter began rolling out a search service that lets you search for any tweet in its archive.

Though the new Twitter search engine is limited to rather rudimentary keyword searches today, the company plans to expand into more complex queries in the months and years to come. And the foundational search infrastructure laid down by the company will help drive other Twitter tools as well. “It lets us power a lot more things down the road—not just search,” says Gilad Mishne, the Twitter engineering director who helped oversee the project.

Well, that’s both good news and better news!

Good news because we can now search and link to the full corpus of tweets.

Better news because of the search market gap that Cade reports, a gap quite similar to Google’s.

You can search for anything you want, but the results, semantically speaking, are going to be a crap shoot.

Do users really have time for hit or miss search results? Some do, some don’t.

If yours don’t, let’s talk.

CouchDB 2.0 Developer Preview

Filed under: CouchDB — Patrick Durusau @ 1:32 pm

CouchDB 2.0 Developer Preview

From the post:

This is an early, still in-development version of CouchDB. It is a significant departure from the 1.x series and will be the foundation of the 2.0 version and beyond.

The target audience of this release are people who use CouchDB today and want to see what the future brings.

The CouchDB community is requesting feedback on the following areas:

  • New Features
  • Compatibility with existing software
  • Bug reports

Please report your findings to the Developer Mailing List or the Issue Tracker.

There is a dockerized version of CouchDB 2.0 at: https://github.com/klaemo/docker-couchdb/tree/2.0-dev
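If you want to kick the tires before sending feedback, a minimal smoke test of the HTTP API might look like the sketch below. The host, port (5984 is the traditional default) and database name are my assumptions, not anything from the announcement; adjust them for the docker image or a clustered setup.

```python
import requests

BASE = "http://localhost:5984"  # assumed; the 2.0 preview/cluster port may differ

# Server banner: confirms the preview is up and reports its version.
print(requests.get(BASE).json())

# Create a throwaway database and write one document to it.
requests.put(BASE + "/preview_test")
resp = requests.post(BASE + "/preview_test",
                     json={"type": "note", "text": "hello CouchDB 2.0"})
print(resp.json())  # expect something like {'ok': True, 'id': ..., 'rev': ...}
```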

Enjoy!

Topic Maps By Another Name

Filed under: Merging,Topic Maps — Patrick Durusau @ 11:32 am

Data Integration up to 90% Faster and Cheaper by Marty Loughlin.

From the post:

[Image: data glue]

A powerful new approach to addressing this challenge involves using semantic web technology as the “data glue” to guide integration and dramatically simplify the process. There are several key components to this approach:

  • Using semantic models to describe data in standard business terms (e.g., FIBO, CDISC, existing enterprise model etc.)
  • Mapping source and target data to the semantic model instead of directly from source to target
  • Combining these maps as needed to create end-to-end semantic descriptions of ETL jobs
  • Automatically generating ETL code from the semantic descriptions for leading ETL tools (e.g., Informatica and Pentaho)

There are significant benefits to this approach:

  • Data integration can be done by business analysts with minimal IT involvement
  • Adding a new source or target only requires an expert in that system to map to the common model as all maps are reusable
  • The time and cost of an integration project can be reduced by up to 90%
  • Projects can be repurposed to a new ETL tool with the click of a mouse
  • The semantic model that describes that data, sources, maps and transformation is always up-to-date and can be queried for data meaning and lineage

The mapping of the source and target data to a semantic model is one use for a topic map. The topic map itself is then a data store to be queried using the source or target data models.
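To make the “map everything to a common model” point concrete, here is a toy sketch (mine, not Cambridge Semantics’) of how two per-system maps compose into a direct source-to-target map. The field and concept names are invented for illustration.

```python
# Map each system to the shared semantic model once...
crm_to_model     = {"cust_nm": "customerName", "cust_no": "customerId"}
billing_to_model = {"name": "customerName", "account": "customerId"}

def invert(mapping):
    return {concept: field for field, concept in mapping.items()}

def compose(source_to_model, target_to_model):
    """Build a direct source-field -> target-field map via the shared model."""
    model_to_target = invert(target_to_model)
    return {src: model_to_target[concept]
            for src, concept in source_to_model.items()
            if concept in model_to_target}

# ...then any pairwise mapping falls out of composition.
print(compose(crm_to_model, billing_to_model))
# {'cust_nm': 'name', 'cust_no': 'account'}
```

Adding a third system means writing one more map to the model; every other pairing is then derived, which is where the claimed savings come from.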

The primary differences (there are others) between topic maps and “data glue” are that topic maps don’t necessarily use MS Excel spreadsheets and aren’t called “data glue.”

I do appreciate Cambridge Semantics determining that a topic map-like mapping approach can save 90% on data integration projects.

That sounds a bit optimistic but marketing literature is always optimistic.

6 links that will show you what Google knows about you

Filed under: Cybersecurity,Privacy,Security — Patrick Durusau @ 9:27 am

6 links that will show you what Google knows about you by Cloud Fender.

After reviewing these links, ask yourself: “How do I keep Google, etc. from knowing more about me?”

November 17, 2014

My Way Into Clojure: Building a Card Game with OM – Part 1

Filed under: Clojure,Functional Programming,Programming — Patrick Durusau @ 6:03 pm

My Way Into Clojure: Building a Card Game with OM – Part 1

From the introduction:

This two-part blog post tells the story of my venturing into Clojure. To get a better grasp of the language, I wanted to move beyond solving programming puzzles and build something tangible in the browser. Omingard is a Solitaire-like HTML5 card game built with Om, a ClojureScript interface to Facebook’s React.

In this first part, “My Way into Clojure”, I’ll provide some background on why I built Omingard and introduce the concepts behind Clojure. What’s so fascinating about functional programming in general, and Clojure in particular, and why was the appearance of Om a game changer for me?

In the upcoming second part, “Building a Card Game with Om”, we’ll look at how I built Omingard. What are the rules of the game, and what role do React, ClojureScript, Om, Leiningen, Garden, and Google Closure Tools play in its implementation? We’ll also take a detailed look at the concepts behind Om, and how it achieves even faster performance than React.

This is a very cool exercise in learning Clojure.

Do try the game. The version I know has slightly different rules than the ones I observed here.

Apache Lucene™ 5.0.0 is coming!

Filed under: Lucene,Search Engines — Patrick Durusau @ 4:16 pm

Apache Lucene™ 5.0.0 is coming! by Michael McCandless.

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it’s done when it’s done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Michael has a great list and explanation of changes you will be seeing in Lucene 5.0.0. Pick your favorite(s) to follow and/or contribute to the next release.

Programming in the Life Sciences

Filed under: Bioinformatics,Life Sciences,Medical Informatics,Programming,Science — Patrick Durusau @ 4:06 pm

Programming in the Life Sciences by Egon Willighagen.

From the first post in this series, Programming in the Life Sciences #1: a six day course (October, 2013):

Our department will soon start the course Programming in the Life Sciences for a group of some 10 students from the Maastricht Science Programme. This is the first time we give this course, and over the next weeks I will be blogging about this course. First, some information. These are the goals, to use programming to:

  • have the ability to recognize various classes of chemical entities in pharmacology and to understand the basic physical and chemical interactions.
  • be familiar with technologies for web services in the life sciences.
  • obtain experience in using such web services with a programming language.
  • be able to select web services for a particular pharmacological question.
  • have sufficient background for further, more advanced, bioinformatics data analyses.

So, this course will be a mix of things. I will likely start with a lecture or two about scientific programming, such as the importance of reproducibility, licensing, documentation, and (unit) testing. To achieve these learning goals we have set a problem. The description is:


    In the life sciences the interactions between chemical entities are of key interest. Not only do these play an important role in the regulation of gene expression, and therefore all cellular processes, they are also one of the primary approaches in drug discovery. Pharmacology is the science that studies the action of drugs, and for many common drugs, this is studying the interaction of small organic molecules and protein targets.
    And with the increasing information in the life sciences, automation becomes increasingly important. Big data and small data alike, provide challenges to integrate data from different experiments. The Open PHACTS platform provides web services to support pharmacological research and in this course you will learn how to use such web services from programming languages, allowing you to link data from such knowledge bases to other platforms, such as those for data analysis.

So, it becomes pretty clear what the students will be doing. They only have six days, so it won’t be much. It’s just to teach them the basic skills. The students are in their 3rd year at the university, and because of the nature of the programme they follow, they have a mixed background in biology, mathematics, chemistry, and physics. So, I have good hope they will surprise me with what they get done.

Pharmacology is the basic topic: drug-protein interaction, but the students are free to select a research question. In fact, I will not care that much what they like to study, as long as they do it properly. They will start with Open PHACTS’ Linked Data API, but here too, they are free to complement data from the OPS cache with additional information. I hope they do.

Now, regarding the technology they will use. The default will be JavaScript, and in the next week I will hack up demo code showing the integration of ops.js and d3.js. Let’s see how hard it will be; it’s new to me too. But, if the students already are familiar with another programming language and prefer to use that, I won’t stop them.

(For the Dutch readers, would #mscpils be a good tag?)

Egon’s blogging has carried on for quite a few “next weeks,” and the life sciences, to say nothing of his readers, are all the better for it! His most recent post is titled: Programming in the Life Sciences #20: extracting data from JSON.

Definitely a series to catch or to pass along for anyone involved in life sciences.

Enjoy!

Apache Spark RefCardz

Filed under: Spark — Patrick Durusau @ 3:43 pm

Apache Spark RefCardz by Ashwini Kuntamukkala.

Counting the cover, four (4) of the eight (8) pages don’t qualify for inclusion in a cheat sheet or refcard. Depending on how hard you want to push that, the count could easily be six (6) of the eight (8) pages that should not be in a cheat sheet or refcard.

My reasoning: cheat sheets or refcards are meant for practitioners or experts who have forgotten a switch on date or bc.

The “extra” information present on this RefCard is useful but you will rapidly outgrow it. Unless you routinely need help installing Apache Spark or working a basic word count problem.

A two (2) page (front/back) cheatsheet for Spark would be more useful.

This is your Brain on Big Data: A Review of “The Organized Mind”

This is your Brain on Big Data: A Review of “The Organized Mind” by Stephen Few.

From the post:

In the past few years, several fine books have been written by neuroscientists. In this blog I’ve reviewed those that are most useful and placed Daniel Kahneman’s Thinking, Fast & Slow at the top of the heap. I’ve now found its worthy companion: The Organized Mind: Thinking Straight in the Age of Information Overload.

[Image: The Organized Mind book cover]

This new book by Daniel J. Levitin explains how our brains have evolved to process information and he applies this knowledge to several of the most important realms of life: our homes, our social connections, our time, our businesses, our decisions, and the education of our children. Knowing how our minds manage attention and memory, especially their limitations and the ways that we can offload and organize information to work around these limitations, is essential for anyone who works with data.

See Stephen’s review for an excerpt from the introduction and summary comments on the work as a whole.

I am particularly looking forward to reading Levitin’s take on the transfer of information tasks to us and the resulting cognitive overload.

I don’t have the volume, yet, but it occurs to me that the shift from indexes (Readers Guide to Periodical Literature and the like) and librarians to full text search engines, is yet another example of the transfer of information tasks to us.

Indexers and librarians do a better job of finding information than we do because discovery of information is a difficult intellectual task. Well, perhaps, discovering relevant and useful information is a difficult task. Almost without exception, every search produces a result on major search engines. Perhaps not a useful result but a result none the less.

Using indexers and librarians will produce a line item in someone’s budget. What is needed is research on the differential between the results from indexer/librarians versus us and what that translates to as a line item in enterprise budgets.

That type of research could influence university, government and corporate budgets as the information age moves into high gear.

The Organized Mind by Daniel J. Levitin is a must have for the holiday wish list!

Functional and Reactive Domain Modeling

Functional and Reactive Domain Modeling by Debasish Ghosh.

From the post:

Manning has launched the MEAP of my upcoming book on Domain Modeling.

[Image: Functional and Reactive Domain Modeling book cover]

The first time I was formally introduced to the topic was way back when I played around with Eric Evans’ awesome text on the subject of Domain Driven Design. In the book he discusses various object lifecycle patterns like the Factory, Aggregate or Repository that help separation of concerns when you are implementing the various interactions between the elements of the domain model. Entities are artifacts with identities, value objects are pure values while services model the coarse level use cases of the model components.

In Functional and Reactive Domain Modeling I look at the problem with a different lens. The primary focus of the book is to encourage building domain models using the principles of functional programming. It’s a completely orthogonal approach to OO and focuses on verbs first (as opposed to nouns first in OO), algebra first (as opposed to objects in OO), function composition first (as opposed to object composition in OO), and lightweight objects as ADTs (instead of rich class models).

The book starts with the basics of functional programming principles and discusses the virtues of purity and the advantages of keeping side-effects decoupled from the core business logic. The book uses Scala as the programming language and includes an extensive discussion of why the OO and functional features of Scala are a perfect fit for modelling complex domains. Chapter 3 starts the core subject of functional domain modeling with real world examples illustrating how we can make good use of patterns like smart constructors, monads and monoids in implementing your domain model. The main virtue that these patterns bring to your model is genericity – they help you extract generic algebra from domain specific logic into parametric functions which are far more reusable and less error prone. Chapter 4 focuses on advanced usages like typeclass based design and patterns like monad transformers, kleislis and other forms of compositional idioms of functional programming. One of the primary focuses of the book is an emphasis on algebraic API design and on developing an appreciation for the ability to reason about your model.
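Not from the book (which uses Scala), but to give a feel for the “smart constructor” idea in a language more readers here may have at hand, here is a rough Python analogue: construction either yields a valid value or an error, so invalid domain objects never escape into the model. The Account fields and validation rules are invented.

```python
from collections import namedtuple

Account = namedtuple("Account", ["number", "balance"])

def make_account(number, balance):
    """Smart constructor: validate inputs before a value can exist."""
    if not number or len(number) != 10:
        return None, "account number must be 10 characters"
    if balance < 0:
        return None, "balance cannot be negative"
    return Account(number, balance), None

good, err = make_account("0123456789", 100)  # (Account(...), None)
bad, err  = make_account("123", -5)          # (None, "account number must be 10 characters")
```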

An easy choice for your holiday wish list! Being a MEAP, it will continue to be “new” for quite some time.

Enjoy!

November 16, 2014

New SNOMED CT Data Files Available

Filed under: Medical Informatics,SNOMED — Patrick Durusau @ 8:01 pm

New SNOMED CT Data Files Available

From the post:

NLM is pleased to announce the following releases available for download:

  1. A new subset from Convergent Medical Terminology (CMT) is now available for download from the UMLS Terminology Services (UTS) by UMLS licensees. This problem list subset includes concepts that KP uses within the ED Problem List. There are 2189 concepts in this file. SNOMED Concepts are based on the 1/31/2012 version of the International Release.

    For more information about CMT, please see the NLM CMT Frequently Asked Questions page.

  2. The Spanish Edition of the SNOMED CT International Release is now available for download.
  3. On behalf of the International Health Terminology Standards Development Organisation (IHTSDO), NLM is pleased to announce the release of the SNOMED CT International General/Family Practice subset (GP/FP Subset) and map from the GP/FP Subset to the International Classification of Primary Care (ICPC-2). This is the baseline work release resulting from the harmonization agreement between the IHTSDO and WONCA.

    The purpose of this subset is to provide the frequently used SNOMED CT concepts for use in general/family practice electronic health records within the following data fields: reason for encounter, and health issue. The purpose of the map from the SNOMED CT GP/FP subset to ICPC-2 is to allow for the granular concepts to be recorded by GPs/FPs at the point of care using SNOMED CT, with subsequent analysis and reporting using the internationally recognized ICPC-2 classification. However please note that use within clinical systems cannot be supported at this time. This Candidate Baseline is distributed for evaluation purposes only and should not be used in production clinical systems or in clinical settings.

    The subsets are aligned to the July 2014 SNOMED CT International Release. The SNOMED CT to ICPC-2 map is a Candidate Baseline, which IHTSDO expects to confirm as the Baseline release following the January 2015 SNOMED CT International Release.

If your work in any way touches upon medical terminology, Convergent Medical Terminology (CMT) and SNOMED CT (Systematized Nomenclature of Medicine–Clinical Terms), among other collections of medical terminology, will be of interest to you.

Medical terminology is a small part of the world at large and you can see what it takes for the NLM to maintain a semblance of chaotic order. Great benefits flow even from a semblance of order but those benefits are not free.

Greek Digitisation Project Update: 40 Manuscripts Newly Uploaded

Filed under: British Library,Manuscripts — Patrick Durusau @ 7:43 pm

Greek Digitisation Project Update: 40 Manuscripts Newly Uploaded by Sarah J Biggs.

From the post:

We have now passed the half-way point of this phase of the Greek Manuscripts Digitisation Project, generously funded by the Stavros Niarchos Foundation and many others, including the A. G. Leventis Foundation, Sam Fogg, the Sylvia Ioannou Foundation, the Thriplow Charitable Trust, and the Friends of the British Library. What treasures are in store for you this month? To begin with, there are quite a few interesting 17th- and 18th-century items to look at, including two very fine 18th-century charters, with seals intact, an iconographic sketch-book (Add MS 43868), and a fascinating Greek translation of an account of the siege of Vienna in 1683 (Add MS 38890). We continue to upload some really exciting Greek bindings – of particular note here are Add MS 24372 and Add MS 36823. A number of scrolls have also been uploaded, mostly containing the Liturgy of Basil of Caesarea. A number of Biblical manuscripts are included, too, but this month two manuscripts of classical authors take pride of place: Harley MS 5600, a stunning manuscript of the Iliad from 15th-century Florence, and Burney MS 111, a lavishly decorated copy of Ptolemy’s Geographia.

Additional riches from the British Library!

Enjoy!

Spark: Parse CSV file and group by column value

Filed under: Linux OS,Spark — Patrick Durusau @ 7:24 pm

Spark: Parse CSV file and group by column value by Mark Needham.

Mark parses a 1GB file that details 4 million crimes from the City of Chicago.

And he does it two ways: Using Unix and Spark.

Results? One way took more than 2 minutes, the other way, less than 10 seconds.

Place your bets with office staff and then visit Mark’s post for the results.
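Not Mark’s code (his post uses Scala and handles quoted fields properly), but a rough PySpark sketch of the Spark side for comparison. The file name and the assumption that the crime type is the sixth field are mine.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "crimes-by-type")

counts = (sc.textFile("crimes.csv")
            .map(lambda line: line.split(","))      # naive split; breaks on quoted commas
            .filter(lambda fields: len(fields) > 5)
            .map(lambda fields: (fields[5], 1))      # assumes crime type is field 6
            .reduceByKey(lambda a, b: a + b)
            .sortBy(lambda pair: pair[1], ascending=False))

for crime_type, count in counts.take(10):
    print(crime_type, count)
```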

Defence: a quick guide to key internet links

Filed under: Defense,Intelligence,Military — Patrick Durusau @ 6:55 pm

Defence: a quick guide to key internet links by David Watt and Nicole Brangwin.

While browsing at Full Text Reports, I saw this title with the following listing of contents:

  • Australian Parliament
  • Australian Government
  • Military history
  • Strategic studies
  • Australian think tanks and non-government organisations
  • International think tanks and organisations
  • Foreign defence

The document is a five (5) page PDF file that has a significant number of links, particularly to Australian military resources. Under “Foreign defence” I did find the Chinese People’s Liberation Army but no link for ISIL.

This may save you some time if you are spidering Australian military sites but appears to be incomplete for other areas.

Encyclopedia of Ethical Failure — Updated October 2014

Filed under: Encyclopedia,Ethics,Government — Patrick Durusau @ 3:32 pm

Encyclopedia of Ethical Failure — Updated October 2014 by the Department of Defense, Office of General Counsel, Standards of Conduct Office. (Word Document)

From the introduction:

The Standards of Conduct Office of the Department of Defense General Counsel’s Office has assembled the following selection of cases of ethical failure for use as a training tool. Our goal is to provide DoD personnel with real examples of Federal employees who have intentionally or unwittingly violated the standards of conduct. Some cases are humorous, some sad, and all are real. Some will anger you as a Federal employee and some will anger you as an American taxpayer.

Please pay particular attention to the multiple jail and probation sentences, fines, employment terminations and other sanctions that were taken as a result of these ethical failures. Violations of many ethical standards involve criminal statutes. Protect yourself and your employees by learning what you need to know and accessing your Agency ethics counselor if you become unsure of the proper course of conduct. Be sure to access them before you take action regarding the issue in question. Many of the cases displayed in this collection could have been avoided completely if the offender had taken this simple precaution.

The cases have been arranged according to offense for ease of access. Feel free to reproduce and use them as you like in your ethics training program. For example – you may be conducting a training session regarding political activities. Feel free to copy and paste a case or two into your slideshow or handout – or use them as examples or discussion problems. If you have a case you would like to make available for inclusion in a future update of this collection, please email it to OSD.SOCO@MAIL.MIL or you may fax it to (703) 695-4970.

One of the things I like about the United States military is they have no illusions about being better or worse than any other large organization and they prepare accordingly. Instead of pretending they are “…shocked, shocked to find gambling…,” they are prepared for rule breaking and try to keep it in check.

If you are interested in exploring or mapping this area, you will find the U.S. Office of Government Ethics useful. Unfortunately, the “Office of Inspector General” is distinct for each agency so collating information across executive departments will be challenging. To say nothing of obtaining similar information for other branches of the United States government.

Not from a technical standpoint for a topic map but from a data mining and analysis perspective.

I first saw this at Full Text Reports as Encyclopedia of Ethical Failure — Updated October 2014.

8 Easy Steps to Becoming a Data Scientist

Filed under: Data Science — Patrick Durusau @ 2:57 pm

[Infographic: How to become a data scientist]

Not a bad graphic to have printed poster size for your wall. Write in what you have done on each step.

I first saw this at Ryan Swanstrom’s 8 Easy Steps to Becoming a Data Scientist and Ryan obtained the graphic from DataCamp, an instructional vendor that can assist you in becoming a data scientist.

81% of Tor users can be de-anonymised by analysing router information, research indicates

Filed under: Security,Tor — Patrick Durusau @ 8:45 am

81% of Tor users can be de-anonymised by analysing router information, research indicates by Martin Anderson.

From the post:

Research undertaken between 2008 and 2014 suggests that more than 81% of Tor clients can be ‘de-anonymised’ – their originating IP addresses revealed – by exploiting the ‘Netflow’ technology that Cisco has built into its router protocols, and similar traffic analysis software running by default in the hardware of other manufacturers.

Professor Sambuddho Chakravarty, a former researcher at Columbia University’s Network Security Lab and now researching Network Anonymity and Privacy at the Indraprastha Institute of Information Technology in Delhi, has co-published a series of papers over the last six years outlining the attack vector, and claims a 100% ‘decloaking’ success rate under laboratory conditions, and 81.4% in the actual wilds of the Tor network.

Chakravarty’s technique [PDF] involves introducing disturbances in the highly-regulated environs of Onion Router protocols using a modified public Tor server running on Linux – hosted at the time at Columbia University. His work on large-scale traffic analysis attacks in the Tor environment has convinced him that a well-resourced organisation could achieve an extremely high capacity to de-anonymise Tor traffic on an ad hoc basis – but also that one would not necessarily need the resources of a nation state to do so, stating that a single AS (Autonomous System) could monitor more than 39% of randomly-generated Tor circuits.

Before you panic, read the rest of Martin’s article. Tor wasn’t designed for highly interactive web connections, which creates conditions where traffic in and out of routers can leave patterns that can be used to trace connections.

For years we got along with email-based search systems for mailing list archives and other materials. For security reasons, perhaps your next Dark Web app should offer email-based transactions.

I first saw this in a tweet by Nik Cubrilovic.

November 15, 2014

Black Friday Dreaming with Bob DuCharme

Filed under: Microdata,Schema.org,SPARQL — Patrick Durusau @ 8:23 pm

Querying aggregated Walmart and BestBuy data with SPARQL by Bob DuCharme.

From the post:

The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I’ve been learning more about microdata recently, but even before I did, I found that the W3C’s Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product’s web page, this makes some interesting queries possible to compare prices and other information from the two vendors.

Bob’s use of SPARQL won’t be ready for this coming Black Friday, but some Black Friday in the future?

One can imagine “blue light specials” being input by shoppers on location and driving traffic patterns at the larger malls.

Well worth your time to see where Bob was able to get using public tools.
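For a flavor of what the queries look like, here is a minimal sketch, assuming you have already run the product pages through the W3C microdata-to-RDF distiller and saved the triples locally (products.ttl is my file name, not Bob’s):

```python
from rdflib import Graph

g = Graph()
g.parse("products.ttl", format="turtle")  # output of the microdata-to-RDF step

query = """
PREFIX schema: <http://schema.org/>
SELECT ?name ?price WHERE {
  ?product a schema:Product ;
           schema:name   ?name ;
           schema:offers ?offer .
  ?offer schema:price ?price .
}
ORDER BY ?price
"""

for name, price in g.query(query):
    print(name, price)
```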

I first saw this in a tweet by Ivan Herman.

Py2neo 2.0

Filed under: Graphs,Neo4j,py2neo,Python — Patrick Durusau @ 7:30 pm

Py2neo 2.0 by Nigel Small.

From the webpage:

Py2neo is a client library and comprehensive toolkit for working with Neo4j from within Python applications and from the command line. The core library has no external dependencies and has been carefully designed to be easy and intuitive to use.

If you are using Neo4j or Python or both, you need to be aware of Py2neo 2.0.

Impressive documentation!

I haven’t gone through all of it but contributed examples would be helpful.

For example:

API: Cypher

exception py2neo.cypher.ClientError(message, **kwargs)

The Client sent a bad request – changing the request might yield a successful outcome.

exception py2neo.cypher.error.request.Invalid(message, **kwargs)[source]

The client provided an invalid request.

Without an example the difference between a “bad” versus an “invalid” request isn’t clear.

Writing examples would not be a bad way to work through the Py2neo 2.0 documentation.
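Here is a rough guess at the kind of example that would help, based only on the class names quoted above; I have not run it against the 2.0 release, so treat the import paths and the graph URI as assumptions:

```python
from py2neo import Graph
from py2neo.cypher import ClientError  # as named in the API docs quoted above

graph = Graph("http://localhost:7474/db/data/")  # assumed default URI

try:
    # Deliberately malformed Cypher, to provoke a client-side error.
    graph.cypher.execute("MATCH (n) RETRUN n LIMIT 1")
except ClientError as error:
    # Printing the concrete class shows whether you got the generic
    # ClientError or a narrower subclass such as
    # py2neo.cypher.error.request.Invalid.
    print(type(error).__module__, type(error).__name__)
    print(error)
```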

Enjoy!

I first saw this in a tweet by Nigel Small.

UK to stop its citizens seeing extremist material online

Filed under: Censorship,Government — Patrick Durusau @ 7:05 pm

UK to stop its citizens seeing extremist material online by David Meyer.

From the post:

The U.K.’s big internet service providers, including BT, Talk Talk, Virgin Media and Sky, have agreed to filter out terrorist and extremist material at the government’s behest, in order to stop people seeing things that may make them sympathetic towards terrorists.

The move will also see providers host a public reporting button for terrorist material. This is likely to be similar to what is already done with websites that may host child pornography – people can report content to the Internet Watch Foundation (IWF), an organization that maintains a blacklist, to which that site could then be added.

In the case of extremist material, though, it appears that the reports would go through to the Counter Terrorism Internet Referral Unit (CTIRU), which is based in London’s Metropolitan Police and has already been very active in identifying extremist material and having it taken down. CTIRU told me in a statement: “The unit works with UK based companies that are hosting such material. However the unit has also established good working relationships with companies overseas in order to make the internet a more hostile place for terrorists.”

Government sources also told me that Facebook, Google, Yahoo and Twitter have agreed to “raise their standards and improve their capacity to deal with this material.”

Please read David’s post in full, he has the right of it.

I am truly sorry to see the UK deciding to ape China and Russia by censoring what its citizens can see.

Once the censorship is in full swing, I expect to see sites using the blacklist to offer indexing and content delivery services for censored sites. The delivery address not matching the blacklist will defeat this particularly lame attempt at censorship.

Or perhaps the Internet Watch Foundation will have an unusually high number of censorship requests for things you think should be censored. 😉

I trust the imagination of UK residents to come up with any number of options to avoid censorship. (I try to never confuse citizens with their governments. That is so unfair to the citizenry.)

PS: You know, the censoring of online content ties into my difficulties in deciding if Dabiq is a legitimate publication of ISIL. (See Suppressing Authentic Information.) How do people become informed if all that is available is government-vetted propaganda?

Perhaps that is the answer isn’t it? The government prefers uninformed citizens, i.e., those who only have the range of information the government deems wise for them to have.

News about subverting such efforts is always welcome.

Would You Protect Nazi Torturers And Their Superiors?

Filed under: Government,Government Data,Security — Patrick Durusau @ 5:10 pm

If you answered “Yes,” this post won’t interest you.

If you answered “No,” read on:

Senator Mark Udall faces the question: “Would You Protect Nazi Torturers And Their Superiors?” as reported by Mike Masnick in:

Mark Udall’s Open To Releasing CIA Torture Report Himself If Agreement Isn’t Reached Over Redactions.

Mike writes in part:

As we were worried might happen, Senator Mark Udall lost his re-election campaign in Colorado, meaning that one of the few Senators who vocally pushed back against the surveillance state is about to leave the Senate. However, Trevor Timm pointed out that, now that there was effectively “nothing to lose,” Udall could go out with a bang and release the Senate Intelligence Committee’s CIA torture report. The release of some of that report (a redacted version of the 400+ page “executive summary” — the full report is well over 6,000 pages) has been in limbo for months since the Senate Intelligence Committee agreed to declassify it months ago. The CIA and the White House have been dragging out the process hoping to redact some of the most relevant info — perhaps hoping that a new, Republican-controlled Senate would just bury the report.

Mike details why Senator Udall’s recent reelection defeat makes release of the report, either in full or in summary, a distinct possibility.

In addition to Mike’s report, here is some additional information you may find useful:

Contact Information for Senator Udall

http://www.markudall.senate.gov/

Senator Mark Udall
Hart Office Building Suite SH-730
Washington, D.C. 20510

P: 202-224-5941
F: 202-224-6471

An informed electorate is essential to the existence of self-governance.

No less a figure than Thomas Jefferson spoke about the star chamber proceedings we now take for granted saying:

An enlightened citizenry is indispensable for the proper functioning of a republic. Self-government is not possible unless the citizens are educated sufficiently to enable them to exercise oversight. It is therefore imperative that the nation see to it that a suitable education be provided for all its citizens. It should be noted, that when Jefferson speaks of “science,” he is often referring to knowledge or learning in general. “I know no safe depositary of the ultimate powers of the society but the people themselves; and if we think them not enlightened enough to exercise their control with a wholesome discretion, the remedy is not to take it from them, but to inform their discretion by education. This is the true corrective of abuses of constitutional power.” –Thomas Jefferson to William C. Jarvis, 1820. ME 15:278

“Every government degenerates when trusted to the rulers of the people alone. The people themselves, therefore, are its only safe depositories. And to render even them safe, their minds must be improved to a certain degree.” –Thomas Jefferson: Notes on Virginia Q.XIV, 1782. ME 2:207

“The most effectual means of preventing [the perversion of power into tyranny are] to illuminate, as far as practicable, the minds of the people at large, and more especially to give them knowledge of those facts which history exhibits, that possessed thereby of the experience of other ages and countries, they may be enabled to know ambition under all its shapes, and prompt to exert their natural powers to defeat its purposes.” –Thomas Jefferson: Diffusion of Knowledge Bill, 1779. FE 2:221, Papers 2:526

Jefferson didn’t have to contend with Middle East terrorists, only the English terrorizing the countryside. Since more Americans died in British prison camps than in the Revolution proper, I would say they were as bad as terrorists. (See Prisoners of war in the American Revolutionary War.)

Noise about the CIA torture program post 9/11 is plentiful. But the electorate, that would be voters in the United States, lack facts about the CIA torture program, its oversight (or lack thereof) and those responsible for torture, from top to bottom. There isn’t enough information to “connect the dots,” a common phrase in the intelligence community.

Connecting those dots is what could bring the accountability and transparency necessary to prevent torture from returning as an instrument of US policy.

Thirty retired generals are urging President Obama to declassify the Senate Intelligence Committee’s report on CIA torture, arguing that without accountability and transparency the practice could be resumed. (Even Generals in US Military Oppose CIA Torture)

Hiding the guilty will produce an expectation of potential future torturers that they too will get a free pass on torture.

Voters are responsible for turning out those who authorized the use of torture and for ensuring their subordinates are held accountable for their crimes. To do so voters must have the information contained in the full CIA torture report.

Release of the Full CIA Torture Report: No Doom and Gloom

Senator Udall should ignore speculation that release of the full CIA torture report will “doom the nation.”

Poppycock!

There have been similar claims in the past and none of them, not one, has ever proven to be true. Here are some of the ones that I remember personally:

Documents Released | Date | Nation Doomed?
Pentagon Papers | 1971 | No
Nixon White House Tapes | 1974 | No
The Office of Special Investigations: Striving for Accountability in the Aftermath of the Holocaust | 2010 | No
United States diplomatic cables leak | 2010 | No
Snowden (Global Surveillance Disclosures (2013–present)) | 2013 | No

Others that I should add to this list?

Is Saying “Nazi” Inflammatory?

Is using the term “Nazi” inflammatory in this context? The only difference between CIA and Nazi torture is the government that ordered or tolerated the torture. Unless you know of some other classification of torture. The United States military apparently doesn’t and I am willing to take their word for it.

Some will say the torturers were “serving the American people.” The same could be and was said by many a death camp guard for the Nazis. Wrapping yourself in a flag, any flag, does not put criminal activity beyond the reach of the law. It didn’t at Nuremberg and it should not work here.

Conclusion

A functioning democracy requires an informed electorate. Not elected officials, not a star chamber group but an informed electorate. To date the American people lack details about illegal torture carried out by a government agency, the CIA. To exercise their rights and responsibilities as an informed electorate, American voters must have full access to the full CIA torture report.

Release of anything less than the full CIA torture report protects torture participants and their superiors. I have no interest in protecting those who engage in illegal activities nor their superiors. As an American citizen, do you?

Experience with prior “sensitive” reports indicates that despite the wailing and gnashing of teeth, the United States will not fall when the guilty are exposed, prosecuted and lead off to jail. This case is no different.

As many retired US generals point out, transparency and accountability are the only ways to keep illegal torture from returning as an instrument of United States policy.

Is there any reason to wait until American torturers are in their nineties, suffering from dementia and living in New Jersey to hold them accountable for their crimes?

I don’t think so either.

PS: When Senator Udall releases the full CIA torture report in the Congressional Record (not to the New York Times or Wikileaks, both of which censor information for reasons best known to themselves), I hereby volunteer to assist in the extraction of names, dates, places and the association of those items with other, public data, both in topic map form as well as in other formats.

How about you?

PPS: On the relationship between Nazis and the CIA, see: Nazis Were Given ‘Safe Haven’ in U.S., Report Says. The special report that informed that article: The Office of Special Investigations: Striving for Accountability in the Aftermath of the Holocaust. (A leaked document)

When you compare Aryanism to American Exceptionalism the similarities between the CIA and the Nazi regime are quite pronounced. How could any act that protects the fatherland/homeland be a crime?

November 14, 2014

Seaborn: statistical data visualization (Python)

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 8:21 pm

Seaborn: statistical data visualization

From the introduction:

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

Seaborn aims to make visualization a central part of exploring and understanding data. The plotting functions operate on dataframes and arrays containing a whole dataset and internally perform the necessary aggregation and statistical model-fitting to produce informative plots. Seaborn’s goals are similar to those of R’s ggplot, but it takes a different approach with an imperative and object-oriented style that tries to make it straightforward to construct sophisticated plots. If matplotlib “tries to make easy things easy and hard things possible”, seaborn aims to make a well-defined set of hard things easy too.

From the “What’s New” page:

v0.5.0 (November 2014)

This is a major release from 0.4. Highlights include new functions for plotting heatmaps, possibly while applying clustering algorithms to discover structured relationships. These functions are complemented by new custom colormap functions and a full set of IPython widgets that allow interactive selection of colormap parameters. The palette tutorial has been rewritten to cover these new tools and more generally provide guidance on how to use color in visualizations. There are also a number of smaller changes and bugfixes.

The What’s New page has a more detailed listing of the improvements over 0.4.
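As a taste of the new heatmap support, here is a minimal sketch, assuming the 0.5.0 API described in the release notes; the data is random, just to exercise the function:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.rand(10, 12)   # any 2-D array or DataFrame works

sns.heatmap(data)               # new in 0.5.0
plt.title("Random data, just to exercise sns.heatmap")
plt.show()
```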

If you haven’t seen Seaborn before, let me suggest that you view the tutorial on Visual Dataset Exploration.

You will be impressed. But if you aren’t, check yourself for a pulse. 😉

I first saw this in a tweet by Michael Waskom.

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS

Filed under: Amazon Aurora,Amazon Web Services AWS,MySQL — Patrick Durusau @ 7:42 pm

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS by Jeff Barr.

From the post:

We launched the Amazon Relational Database Service (RDS) service way back in 2009 to help you to set up, operate, and scale a MySQL database in the cloud. Since that time, we have added a multitude of options to RDS including extensive console support, three additional database engines (Oracle, SQL Server, and PostgreSQL), high availability (multiple Availability Zones) and dozens of other features.

We have come a long way in five years, but there’s always room to do better! The database engines that I listed above were designed to function in a constrained and somewhat simplistic hardware environment — a constrained network, a handful of processors, a spinning disk or two, and limited opportunities for parallel processing or a large number of concurrent I/O operations.

The RDS team decided to take a fresh look at the problem and to create a relational database designed for the cloud. Starting from a freshly scrubbed white board, they set as their goal a material improvement in the price-performance ratio and the overall scalability and reliability of existing open source and commercial database engines. They quickly realized that they had a unique opportunity to create an efficient, integrated design that encompassed the storage, network, compute, system software, and database software, purpose-built to handle demanding database workloads. This new design gave them the ability to take advantage of modern, commodity hardware and to eliminate bottlenecks caused by I/O waits and by lock contention between database processes. It turned out that they were able to increase availability while also driving far more throughput than before.

In preview now but you can sign up at the end of Jeff’s post.

Don’t confuse Apache Aurora (“a service scheduler that runs on top of Mesos”) with Amazon Aurora, the MySQL-compatible database from Amazon. (I guess all the good names have been taken for years.)

What am I missing?

Oh, following open source announcements from Microsoft, Intel, and Mapillary (to name the ones I noticed this week), I can’t find any reference to the source code for Amazon Aurora.

Do you think Amazon Aurora is closed source? One of those hiding places for government surveillance/malware? Hopefully not.

Perhaps Jeff just forgot to mention the GitHub repository with the Amazon Aurora source code.

It’s Friday (my location) so let’s see what develops by next Monday, 17 November 2014. If there is no announcement that Amazon Aurora is open source, …, well, at least everyone can factor that into their database choices.

PS: Open source does not mean bug or malware free. Open source means that you have a sporting chance at finding (and correcting) bugs and malware. Non-open source software may have bugs and malware which you will experience but not be able to discover/fix/correct.

Open Sourcing 3D City Reconstruction

Filed under: Graphics,Mapillary,Maps,Visualization — Patrick Durusau @ 3:05 pm

Open Sourcing 3D City Reconstruction by Jan Erik Solem.

From the post:

One of the downsides of using simple devices for mapping the world is that the GPS accuracy is not always great, especially in cities with tall buildings. Since the start we have always wanted to correct this using image matching and we are now making progress in that area.

The technique is called ‘Structure from Motion‘ (SfM) and means that you compute the relative camera positions and a 3D reconstruction of the environment using only the images.

We are now open sourcing our tools under the name OpenSfM and developing it in the open under a permissive BSD license. The project is intended to be a complete end-to-end easy-to-use SfM pipeline on top of OpenCV. We welcome all contributors, from industry and academia, to join the project. Driving this work inside Mapillary is Pau and Yubin.

Moving forward we are initially going to use this for improving the positioning and connection between Mapillary photos. Later, we are going to have an ever improving 3D reconstruction of every place on the planet too ;).

Are you ready to enhance your maps with 3D?

BTW, evidence that small vendors also support open source.

I first saw this in a tweet by Peter Neubauer.

Exemplar Public Health Datasets

Filed under: Bioinformatics,Dataset,Medical Informatics,Public Data — Patrick Durusau @ 2:45 pm

Exemplar Public Health Datasets, Editor: Jonathan Tedds.

From the post:

This special collection contains papers describing exemplar public health datasets published as part of the Enhancing Discoverability of Public Health and Epidemiology Research Data project commissioned by the Wellcome Trust and the Public Health Research Data Forum.

The publication of the datasets included in this collection is intended to promote faster progress in improving health, better value for money and higher quality science, in accordance with the joint statement made by the forum members in January 2011.

Submission to this collection is by invitation only, and papers have been peer reviewed. The article template and instructions for submission are available here.

Data for analysis as well as examples of best practices for public health datasets.

Enjoy!

I first saw this in a tweet by Christophe Lallanne.

Feds Put Fake Cell Towers On Planes, Spied On Tons Of Innocent Americans

Filed under: Cybersecurity,Security — Patrick Durusau @ 2:17 pm

Feds Put Fake Cell Towers On Planes, Spied On Tons Of Innocent Americans by Mike Masnick. (TechDirt.)

From the post:

The Wall Street Journal broke the news that the DOJ has been spying on tons of innocent Americans by putting fake mobile phone towers on airplanes and scooping up all sorts of data from people who thought they were connecting to regular mobile phone towers.

You can search Techdirt for more stories on Stingray (fake mobile phone towers). As of today, 4,920 results.

See the Wikipedia entry for IMSI catcher for general information on how such devices work.

For a recent take on defensive measures as well as pointers to DYI IMSI catchers, see: IMSI-Catch Me If You Can: IMSI-Catcher-Catchers by Adrian Dabrowski, et al. (preprint, paper to appear at ACSAC 2014, December 8-12, New Orleans.)

Not that anyone would consider sitting outside federal or law enforcement facilities to spy on not-so-innocent Americans and posting those results online but outliers do exist.

Federal surveillance will stop when the cost of a surveillance society becomes too high. Anyone suggesting otherwise is either uninformed, a liar, or both.

PS: Techdirt doesn’t mention them but even if government agents don’t use a private plane, other danger zones for you include:

  • Black Friday shopping malls (and other high traffic shopping days)
  • Golf courses
  • Graduation (Google Search)
  • Megachurch services (Google Search)
  • New Year’s Eve Celebrations
  • Parades (Google Search)
  • Sports events
  • Resorts
  • Any large gathering of people is a potential target for gathering cellphone data. (not to mention them sitting outside your home)

PPS: Warning: Applicable law on interception of electronic communications varies from jurisdiction to jurisdiction. This post covers technical aspects of these techniques and their current usage, with pointers to further information. No representation is made concerning the legality of any of the described activities or equipment. Consult legal counsel concerning your rights and liabilities under local law.

November 13, 2014

Spark for Data Science: A Case Study

Filed under: Linux OS,Spark — Patrick Durusau @ 7:28 pm

Spark for Data Science: A Case Study by Casey Stella.

From the post:

I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command line oriented utilities. With composability comes power and with specialization comes simplicity. Although, sometimes if two utilities are used all the time, sometimes it makes sense for either:

  • A utility that specializes in a very common use-case
  • One utility to provide basic functionality from another utility

For example, one thing that I find myself doing a lot of is searching a directory recursively for files that contain an expression:

find /path/to/root -exec grep -l "search phrase" {} \;

Despite the fact that you can do this, specialized utilities, such as ack have come up to simplify this style of querying. Turns out, there’s also power in not having to consult the man pages all the time. Another example, is the interaction between uniq and sort. uniq presumes sorted data. Of course, you need not sort your data using the Unix utility sort, but often you find yourself with a flow such as this:

sort filename.dat | uniq > uniq.dat

This is so common that a -u flag was added to sort to support this flow, like so:

sort -u filename.dat > uniq.dat

Now, obviously, uniq has utilities beyond simply providing distinct output from a stream, such as providing counts for each distinct occurrence. Even so, it’s nice for the situation where you don’t need the full power of uniq for the minimal functionality of uniq to be a part of sort. These simple motivating examples got me thinking:

  • Are there opportunities for folding another command’s basic functionality into another command as a feature (or flag) as in sort and uniq?
  • Can we answer the above question in a principled, data-driven way?

This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So, I’m going to take you through a little journey doing some simple analysis and illustrate the general steps. We’re going to cover

  1. Data Gathering
  2. Data Engineering
  3. Data Analysis
  4. Presentation of Results and Conclusions

We’ll close with my impressions of using Spark as an analytics platform. Hope you enjoy!

All of that is just the setup for a very cool walk through a data analysis example with Spark.
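To give a sense of the flavor (this is my own toy sketch, not Casey’s pipeline), here is a PySpark fragment that asks a cruder version of his question: which shell commands tend to follow one another in a history file? The file name and the adjacent-line heuristic are mine.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "command-pairs")

def first_word(line):
    parts = line.strip().split()
    return parts[0] if parts else None

# Data gathering/engineering: reduce each history line to its leading command.
commands = sc.textFile("bash_history.txt").map(first_word).filter(bool).collect()

# Data analysis: count adjacent command pairs, e.g. ("sort", "uniq").
pairs = sc.parallelize(list(zip(commands, commands[1:])))
top = (pairs.map(lambda pair: (pair, 1))
            .reduceByKey(lambda a, b: a + b)
            .sortBy(lambda kv: kv[1], ascending=False)
            .take(10))

for (first, second), count in top:
    print(first, "->", second, count)
```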

Enjoy!

Computational Culture

Filed under: Computation,Computational Semantics,Cultural Anthropology,Social Sciences — Patrick Durusau @ 7:16 pm

Computational Culture: a journal of software studies

From the about page:

Computational Culture is an online open-access peer-reviewed journal of inter-disciplinary enquiry into the nature of the culture of computational objects, practices, processes and structures.

The journal’s primary aim is to examine the ways in which software undergirds and formulates contemporary life. Computational processes and systems not only enable contemporary forms of work and play and the management of emotional life but also drive the unfolding of new events that constitute political, social and ontological domains. In order to understand digital objects such as corporate software, search engines, medical databases or to enquire into the use of mobile phones, social networks, dating, games, financial systems or political crises, a detailed analysis of software cannot be avoided.

A developing form of literacy is required that matches an understanding of computational processes with those traditionally bound within the arts, humanities, and social sciences but also in more informal or practical modes of knowledge such as hacking and art.

The journal welcomes contributions that address such topics and many others that may derive and mix methodologies from cultural studies, science and technology studies, philosophy of computing, metamathematics, computer science, critical theory, media art, human computer interaction, media theory, design, philosophy.

Computational Culture publishes peer-reviewed articles, special projects, interviews, and reviews of books, projects, events and software. The journal is also involved in developing a series of events and projects to generate special issues.

A sampling of the current articles is available on the journal’s site.

Not everyone’s cup of tea but for those who appreciate it, this promises to be a real treasure.

Using Clojure To Generate Java To Reimplement Clojure

Filed under: Clojure,Data Structures,Immutable,Java — Patrick Durusau @ 7:01 pm

Using Clojure To Generate Java To Reimplement Clojure by Zach Tellman.

From the post:

Most data structures are designed to hold arbitrary amounts of data. When we talk about their complexity in time and space, we use big O notation, which is only concerned with performance characteristics as n grows arbitrarily large. Understanding how to cast an O(n) problem as O(log n) or even O(1) is certainly valuable, and necessary for much of the work we do at Factual. And yet, most instances of data structures used in non-numerical software are very small. Most lists are tuples of a few entries, and most maps are a few keys representing different facets of related data. These may be elements in a much larger collection, but this still means that the majority of operations we perform are on small instances.

But except in special cases, like 2 or 3-vectors that represent coordinates, it’s rarely practical to specify that a particular tuple or map will always have a certain number of entries. And so our data structures have to straddle both cases, behaving efficiently at all possible sizes. Clojure, however, uses immutable data structures, which means it can do an end run on this problem. Each operation returns a new collection, which means that if we add an element to a small collection, it can return something more suited to hold a large collection.

Tellman describes this problem and his solution in Predictably Fast Clojure. (The URL is to a time mark but I think the entire video is worth your time.)

If that weren’t cool enough, Tellman details the creation of 1000 lines of Clojure that generate 5500 lines of Java so his proposal can be rolled into Clojure.

What other data structures can be different when immutability is a feature?
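To make the idea concrete outside Clojure, here is a toy Python rendering (nothing like Tellman’s actual implementation): an immutable map that keeps a handful of entries in a flat tuple and “promotes” itself to a dict-backed version once an update pushes it past a size threshold.

```python
SMALL_LIMIT = 8  # arbitrary; the real structures pick thresholds carefully

class SmallMap:
    """Immutable map for a few entries, stored as a flat tuple of pairs."""
    def __init__(self, pairs=()):
        self._pairs = tuple(pairs)

    def get(self, key, default=None):
        for k, v in self._pairs:   # linear scan is cheap at this size
            if k == key:
                return v
        return default

    def assoc(self, key, value):
        pairs = tuple((k, v) for k, v in self._pairs if k != key) + ((key, value),)
        if len(pairs) > SMALL_LIMIT:
            return LargeMap(dict(pairs))   # switch representations as we grow
        return SmallMap(pairs)

class LargeMap:
    """Immutable map for many entries, backed by a copied dict."""
    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def get(self, key, default=None):
        return self._mapping.get(key, default)

    def assoc(self, key, value):
        updated = dict(self._mapping)
        updated[key] = value
        return LargeMap(updated)
```

Because every assoc returns a new value, the structure is free to hand back whichever representation suits the new size, which is exactly the end run the post describes.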

Wintel and Open Source

Filed under: C/C++,Julia,Open Source — Patrick Durusau @ 6:44 pm

The software world is reverberating with the news that Microsoft is in the process of making .NET completely open source.

On the same day, Intel announced that it had released “Julia2C, a source-to-source translator from Julia to C.”

Hmmm, is this evidence that open source is a viable path for commercial vendors? 😉

Next Question: How long before non-open source code becomes a liability? As in a nesting place for government surveillance/malware.

Speculation: Not as long as it took Wintel to move towards open source.

Consumers should demand open source code as a condition for purchase. All software, all the time.
