Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 18, 2015

40000 vulnerable MongoDB databases

Filed under: Cybersecurity,MongoDB,Security — Patrick Durusau @ 5:17 pm

Discovered 40000 vulnerable MongoDB databases on the Internet by Pierluigi Paganini.

From the post:

Today MongoDB is used by many organizations, the bad news is that nearly 40,000 entities running MongoDB are exposed and vulnerable to risks of hacking attacks.

[Image: MongoDB-vulnerable]

Three students from University of Saarland in Germany, Kai Greshake, Eric Petryka and Jens Heyens, discovered that MongoDB databases running at TCP port 27017 as a service of several thousand of commercial web servers are exposed on the Internet without proper defense measures.

In MongoDB databases at risk – Several thousand MongoDBs without access control on the Internet, Jens Heyens, Kai Greshake, and Eric Petryka report the cause as:

The reason for this problem is twofold:

  • The defaults of MongoDB are tailored for running it on the same physical machine or virtual machine instances.
  • The documentations and guidelines for setting up MongoDB servers with Internet access may not be sufficiently explicit when it comes to the necessity to activate access control, authentication, and transfer encryption mechanisms.

Err, “…may not be sufficiently explicit…?”

You think?

Looking at Install MongoDB on Ubuntu, do you see a word about securing access to MongoDB? Nope.

How about Security Introduction? A likely place for new users to check. Nope.

Authentication has your first clue about the localhost exception but doesn’t mention network access at all.

You finally have to reach Network Exposure and Security before you start learning how to restrict access to your MongoDB instance.

Or if you have grabbed the latest MongoDB documentation as a PDF file (2.6), the security information you need starts at page 286.

I set up a MongoDB instance a couple of weeks ago and remember being amazed that there wasn’t even a default admin password. As a former sysadmin I knew that was trouble, so I hunted through the documentation until finally hitting upon the necessary information.

Limiting access to a MongoDB instance should be included in the installation document, in bold, perhaps even red, letters saying the security steps are necessary before starting your MongoDB instance.
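For what it is worth, the minimal "lock it down first" step is short. Here is a sketch using pymongo (assumed installed), run while mongod is still listening only on localhost and before starting it with --auth and a restrictive bind address; the user name, password and roles are placeholders:

```python
# A minimal sketch, not official MongoDB guidance: create an admin user on a
# fresh instance that is still listening only on localhost, before enabling
# authentication (mongod --auth) and exposing the port. The user name,
# password and role list are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017/")

client.admin.command(
    "createUser",
    "admin",                                  # placeholder user name
    pwd="choose-a-strong-password",           # placeholder password
    roles=[{"role": "userAdminAnyDatabase", "db": "admin"}],
)
```

That is all it takes, which makes its absence from the install walkthrough all the more puzzling.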

Failure to provide security instructions has resulted in 39,890 vulnerable MongoDBs on the Internet.

Failed to be explicit? More like failed documentation. (full stop)

Users do a bad enough job with security without providing them with bad documentation.

Call me if you need $paid documentation assistance.

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Filed under: GPU,Language,Translation — Patrick Durusau @ 4:40 pm

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars by Hua He, Jimmy Lin, Adam Lopez. (Transactions of the Association for Computational Linguistics, vol. 3, pp. 87–100, 2015.)

Abstract:

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

If you are interested in cross-language search, DNA sequence alignment or other pattern matching problems, you need to watch the progress of this work.

This article and other important research is freely accessible at: Transactions of the Association for Computational Linguistics

Efficient Estimation of Word Representations in Vector Space

Filed under: Machine Learning — Patrick Durusau @ 3:59 pm

Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.

Abstract:

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

This is the technical side of Learning the meaning behind words, where we reported the open sourcing of Google’s word2vec toolkit.

A must read.
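If you want to experiment without building Google’s original C toolkit, gensim’s reimplementation will do. A minimal sketch (gensim assumed installed; the two-sentence corpus is a toy stand-in, since the paper’s results depend on corpora of a billion words and more):

```python
# A minimal sketch using gensim's word2vec reimplementation, not Google's
# original C toolkit. The two-sentence corpus is a toy placeholder; useful
# vectors need very large corpora, as the paper makes clear.
from gensim.models import Word2Vec

sentences = [
    ["words", "are", "mapped", "to", "dense", "vectors"],
    ["similar", "words", "get", "similar", "vectors"],
]

model = Word2Vec(sentences, min_count=1)       # keep every word in the toy corpus
# In very old gensim releases most_similar() lives on the model itself.
print(model.wv.most_similar("words", topn=3))  # nearest neighbours in vector space
```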

I first saw this in a tweet by onepaperperday.

Dato Updates Machine Learning Platform…

Filed under: Dato,GraphLab,Machine Learning — Patrick Durusau @ 2:51 pm

Dato Updates Machine Learning Platform, Puts Spotlight on Data Engineering Automation, Spark and Hadoop Integrations

From the post:

Today at Strata + HadoopWorld San Jose, Dato (formerly known as GraphLab) announced new updates to its machine learning platform, GraphLab Create, that allow data science teams to wrangle terabytes of data on their laptops at interactive speeds so that they can build intelligent applications faster. With Dato, users leverage machine learning to build prototypes, tune them, deploy in production and even offer them as a predictive service, all in minutes. These are the intelligent applications that provide predictions for a myriad of use cases including recommenders, sentiment analysis, fraud detection, churn prediction and ad targeting.

Continuing with its commitment to the Open Source community, Dato is also announcing the Open Source release of its core engine, including the out of core machine learning(ML)-optimized SFrame and SGraph data structures which make ML tasks blazing fast. Commercial and non-commercial versions of the full GraphLab Create platform are available for download at www.dato.com/download.

New features available in the GraphLab Create platform include:

  • Predictive Service Deployment Enhancements:
    enables easy integrations of Dato predictive services with applications regardless of development environment and allows administrators to view information about deployed models and statistics on requests and latency on a per predictive object basis.
  • Data Science Task Automation:
    a new Data Matching Toolkit allows for automatic tagging of data from a reference dataset and deduplication of lists automatically. In addition, the new Feature Engineering pipeline makes it easy to chain together multiple feature transformations–a vast simplification for the data engineering stage.
  • Open Source Version of GraphLab Create:
    Dato is offering an open-source release of GraphLab Create’s core code. Included in this version is the source for the SFrame and SGraph, along with many machine learning models, such as triangle counting, pagerank and more. Using this code, it is easy to build a new machine learning toolkit or a connector from the Dato SFrame to a data store. The source code can be found on
    Dato’s GitHub page.
  • New Pricing and Packaging Options:
    updated pricing and packaging include a non-commercial, free offering with the same features as the GraphLab Create commercial version. The free version allows data science enthusiasts to interact with and prototype on a leading machine learning platform. Also available is a new 30-day, no obligation evaluation license of the full-feature, commercial version of Dato’s product line.

Excellent news!

Now if we just had secure hardware to run it on.

On the other hand, it is open source so you can verify there are no backdoors in the software. That is a step in the right direction for security.
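For a feel of the SFrame data structure being open sourced, here is a minimal sketch. It assumes the GraphLab Create package is installed and importable as graphlab; the standalone open source release may ship under a different package name, and the data is made up:

```python
# A minimal sketch of the SFrame data structure mentioned above. Assumes the
# GraphLab Create package is installed as "graphlab"; the standalone open
# source release may ship under a different package name. The data is made up.
import graphlab as gl

sf = gl.SFrame({"user": ["a", "b", "c", "a"], "clicks": [3, 7, 1, 5]})

# SFrames are disk-backed, so filters and group-bys can run out of core.
active = sf[sf["clicks"] > 2]
print(active)
print(sf.groupby("user", {"total_clicks": gl.aggregate.SUM("clicks")}))
```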

Clojure (introduction)

Filed under: Clojure,Functional Programming — Patrick Durusau @ 2:35 pm

Clojure (introduction) by Marko Bonaci.

Another “Clojure? That’s a Lisp, for god’s sake!” Clojure introduction by a rehabilitated Lisp avoider.

From some of the Lisp horror stories, I have to wonder if the C or C++ communities have agents they send out to teach Lisp. 😉

Your experience here will be nothing like any prior unpleasant Lisp memories!

The GitHub view of the tutorial exposes other interesting resources on Scala, Spark, and more.

I first saw this in a tweet by Anna Pawlicka.

The Revolution in Astronomy Education: Data Science for the Masses

Filed under: Astroinformatics,Data Science,Education — Patrick Durusau @ 12:42 pm

The Revolution in Astronomy Education: Data Science for the Masses by Kirk D. Borne, et al.

Abstract:

As our capacity to study ever-expanding domains of our science has increased (including the time domain, non-electromagnetic phenomena, magnetized plasmas, and numerous sky surveys in multiple wavebands with broad spatial coverage and unprecedented depths), so have the horizons of our understanding of the Universe been similarly expanding. This expansion is coupled to the exponential data deluge from multiple sky surveys, which have grown from gigabytes into terabytes during the past decade, and will grow from terabytes into Petabytes (even hundreds of Petabytes) in the next decade. With this increased vastness of information, there is a growing gap between our awareness of that information and our understanding of it. Training the next generation in the fine art of deriving intelligent understanding from data is needed for the success of sciences, communities, projects, agencies, businesses, and economies. This is true for both specialists (scientists) and non-specialists (everyone else: the public, educators and students, workforce). Specialists must learn and apply new data science research techniques in order to advance our understanding of the Universe. Non-specialists require information literacy skills as productive members of the 21st century workforce, integrating foundational skills for lifelong learning in a world increasingly dominated by data. We address the impact of the emerging discipline of data science on astronomy education within two contexts: formal education and lifelong learners.

Kirk Borne posted a tweet today about this paper with the following graphic:

[Image: turning-people]

I deeply admire the work that Kirk has done, is doing and hopefully will continue to do, but is the answer really that simple? That is, do we just need to provide people with “…great tools written by data scientists”?

As an example of what drives my uncertainty, I saw a presentation a number of years ago in biblical studies that involved statistical analysis, and when the speaker was asked why a particular result was significant, the response was that the manual said it was. Ouch!

On the other hand, it may be that like automobiles, we have to accept a certain level of accidents/injuries/deaths as a cost of making such tools widely available.

Should we acknowledge up front that a certain level of mis-use, poor use, inappropriate use of “great tools written by data scientists” is a cost of making data and data tools available?

PS: I am leaving to one side cases where tools have been deliberately fashioned to reach false or incorrect results. Detecting those cases might challenge seasoned data scientists.

Controlled Vocabularies and the Semantic Web

Filed under: Semantic Web,Vocabularies — Patrick Durusau @ 11:45 am

Controlled Vocabularies and the Semantic Web Journal of Library Metadata – Special Issue Call for Papers

From the webpage:

Ranging from large national libraries to small and medium-sized institutions, many cultural heritage organizations, including libraries, archives, and museums, have been working with controlled vocabularies in linked data and semantic web contexts.  Such work has included transforming existing vocabularies, thesauri, subject heading schemes, authority files, term and code lists into SKOS and other machine-consumable linked data formats. 

This special issue of the Journal of Library Metadata welcomes articles from a wide variety of types and sizes of organizations on a wide range of topics related to controlled vocabularies, ontologies, and models for linked data and semantic web deployment, whether theoretical, experimental, or actual. 

Topics include, but are not restricted to the following:

  • Converting existing vocabularies into SKOS and/or other linked data formats.
  • Publishing local vocabularies as linked data in online repositories such as the Open Metadata Registry.
  • Development or use of special tools, platforms and interfaces that facilitate the creation and deployment of vocabularies as linked data.
  • Working with Linked Data / Semantic Web W3C standards such as RDF, RDFS, SKOS, and OWL.
  • Work with the BIBFRAME, Europeana, DPLA, CIDOC-CRM, or other linked data / semantic web models, frameworks, and ontologies.
  • Challenges in transforming existing vocabularies and models into linked data and semantic web vocabularies and models.

Click here for a complete list of possible topics.

Researchers and practitioners are invited to submit a proposal (approximately 500 words) including a problem statement, problem significance, objectives, methodology, and conclusions (or tentative conclusions for work in progress). Proposals must be received by March 1, 2015. Full manuscripts (4000-7000 words) are expected to be submitted by June 1, 2015. All submitted manuscripts will be reviewed on a double-blind review basis.

Please forward inquiries and proposal submissions electronically to the guest editors at: perkintj@miamioh.edu

Proposal Deadline: March 1, 2015.

The Journal of Library Metadata is online. Unfortunately it is one of those journals where authors have to pay for their work to be accessible to others. The interface makes it look like you are going to have access until you attempt to view a particular article. I didn’t stumble across any that were accessible but I only tried four (4) or five (5) of them.

Interesting journal if you have access to it or if you are willing to pay $40.00 per article for viewing. I worked for an academic publisher for a number of years and have an acute sense of the value-add publishers bring to the table. Volunteer authors, volunteer editors, etc.
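If the first topic in the call (converting an existing vocabulary into SKOS) sounds abstract, the target format is easy to prototype. A minimal sketch with rdflib (assumed installed), using a made-up namespace and terms:

```python
# A minimal sketch of the kind of output "converting a vocabulary into SKOS"
# produces, using rdflib. The namespace and terms are made-up placeholders,
# not any real thesaurus.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

concept = EX["topicMaps"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Topic Maps", lang="en")))
g.add((concept, SKOS.altLabel, Literal("ISO/IEC 13250", lang="en")))
g.add((concept, SKOS.broader, EX["informationManagement"]))

print(g.serialize(format="turtle"))
```

The serialization is the easy part; deciding which of your existing headings map to which concepts is where the real work, and the interesting papers, will be.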

Former CIA Employee Barry Eisler Explains Why You Shouldn’t Trust The CIA

Filed under: Cybersecurity,Security — Patrick Durusau @ 11:21 am

Former CIA Employee Barry Eisler Explains Why You Shouldn’t Trust The CIA (Techdirt Podcast Episode 12)

From the webpage:

If you checked out last week’s episode, you know that Barry Eisler is a bestselling author with a lot to say about the publishing industry. What you might not know is that he also used to work for the CIA, and he’s got a lot to say about that world as well. This week, Barry is back to talk about the culture and inner workings of the intelligence community.

Roughly paraphrasing Barry Eisler:

[the lack of discussion of cost/benefit analysis is a] tell that fear is turned on and brain is turned off

The cost/benefit analysis discussion alone is worth listening to the podcast.

Question: When was the last time you heard cost/benefit analysis in the context of a cybersecurity or terrorism discussion?

Ask about cost/benefit analysis early and often.

PS: If you go searching for resources on “cost-benefit analysis,” it is also known as “benefit-cost analysis.”

CyberDefense: Appeal to Fear – Chinese Stole Anthem Data For HUMINT

Filed under: Cybersecurity,Defense,Military,Security — Patrick Durusau @ 10:27 am

Chinese Stole Anthem Data For HUMINT; Should Raise US ‘Hackles’ by John Quigg.

From the post:

[Image: 140514-D-VO565-015.JPG]

(Gen. Fang Fenghui, chief of PLA General Staff, and Gen. Martin Dempsey, chairman of the Joint Chiefs of Staff. [Two peas in a pod?])

The Chinese just walked out of Anthem’s enormous data warehouse (though without encrypting their data it might as well have been a troop of Girl Scouts) with personal data on a quarter of America’s population. Assuming that the pro forma outrage and denial is a confirmation of culpability, the People’s Liberation Army and its various subsidiaries will comb over this and other data they hoover up in the maw of their cyber apparatus for defense and economic intelligence purposes for years, further enabling their surveillance and exploitation of Americans they find interesting.

Which leads the article to conclude, among other things:

Our toothless response as a nation is doing little to deter attacks.

To his credit, John does point out in bolded text:

This is one of the largest corporate breaches ever and has significant fiscal, legal, and intelligence implications. The latest reports indicate that the breach occurred because the data was not encrypted and the attacker used the credentials of an authorized user.

But there is a radical disconnect between national cyberdefense and unencrypted data being stolen using credentials of an authorized user.

Fear will drive the construction of a national cyberdefense equivalent to the TSA and phone record vacuuming, neither of which has succeeded at identifying a single terrorist in the fourteen (14) years since 9/11. (Not my opinion, conclusions of U.S. government agencies, see the links.)

No cyberdefense system, private, governmental or otherwise, can protect data that is not encrypted and for which an attacker has authenticated access. What part of that is unclear?

Let’s identify and correct known computer security weaknesses and then and only then, identify gaps that remain to be addressed by a national cybersecurity program. Otherwise a cybersecurity program will address fictional security gaps, take ineffectual action against others and be as useless and wasteful as similar unfocused efforts.

February 17, 2015

Clustering by Descending to the Nearest Neighbor in the Delaunay Graph Space

Filed under: Clustering,Graphs — Patrick Durusau @ 5:37 pm

Clustering by Descending to the Nearest Neighbor in the Delaunay Graph Space by Teng Qiu and Yongjie Li.

Abstract:

In our previous works, we proposed a physically-inspired rule to organize the data points into an in-tree (IT) structure, in which some undesired edges are allowed to occur. By removing those undesired or redundant edges, this IT structure is divided into several separate parts, each representing one cluster. In this work, we seek to prevent the undesired edges from arising at the source. Before using the physically-inspired rule, data points are at first organized into a proximity graph which restricts each point to select the optimal directed neighbor just among its neighbors. Consequently, separated in-trees or clusters automatically arise, without redundant edges requiring to be removed.

The latest in a series of papers exploring clustering issues. The authors concede the method demonstrated here isn’t important but represents another step in their exploration.

It isn’t often that I see anything other than final and “defend to the death” results. Preliminary and non-successful results being published will increase the bulk of scientific material to be searched but it will also leave a more accurate record of the scientific process.

Enjoy!

Making Maps in R

Filed under: Mapping,Maps,R — Patrick Durusau @ 5:14 pm

Making Maps in R by Kevin Johnson.

From the post:

I make a lot of maps in my line of work. R is not the easiest way to create maps, but it is convenient and it allows for full control of what the map looks like. There are tons of different ways to create maps, even just within R. In this post I’ll talk about the method I use most of the time. I will assume you are proficient in R and have some level of familiarity with the ggplot2 package.

The American Community Survey provides data on almost any topic imaginable for various geographic levels in the US. For this example I will look at the 2012 5-year estimates of the percent of people without health insurance by census tract in the state of Georgia (obtained from the US Census FactFinder). Shapefiles were obtained from the US Census TIGER database. I generally use the cartographic boundary files since they are simplified representations of the boundaries, which saves a lot of space and processing time.

It occurs to me that having students make maps of their home states from a short list of data options (for a class) could be an exercise in testing whether results are “likely” or not, reasoning that students have (or should have) some sense of the demographic distributions of their home states.
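If your students are more comfortable in Python than R, the same workflow (join tabular estimates to boundary files, then plot) carries over. A minimal sketch with geopandas and pandas (both assumed installed; the file and column names are placeholders for your own ACS extract and cartographic boundary shapefile):

```python
# A minimal sketch of the same workflow in Python: join tabular estimates to
# boundary geometry, then draw a choropleth. Assumes geopandas and pandas;
# the file and column names are placeholders for your own ACS extract and
# cartographic boundary shapefile.
import geopandas as gpd
import pandas as pd

tracts = gpd.read_file("cb_2013_13_tract_500k.shp")        # placeholder boundary file
acs = pd.read_csv("acs_uninsured_by_tract.csv",            # placeholder ACS extract
                  dtype={"GEOID": str})

joined = tracts.merge(acs, on="GEOID")                     # attach estimates to geometry
ax = joined.plot(column="pct_uninsured", legend=True)      # choropleth of the estimate
ax.set_axis_off()
ax.figure.savefig("uninsured_by_tract.png", dpi=150)
```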

I first saw this in a tweet by Neil Saunders.

The Future of AI: Reflections From A Dark Mirror

Filed under: Artificial Intelligence — Patrick Durusau @ 4:29 pm

You have seen Artificial Intelligence could make us extinct, warn Oxford University researchers or similar pieces in the news of late.

With the usual sound bites (shortened even more here):

  • Oxford researchers: “intelligent AIs a unique risk, in that extinction is more likely than lesser impacts.”
  • Elon Musk, the man behind PayPal, Tesla Motors and SpaceX,… ‘our biggest existential threat’
  • Bill Gates backed up Musk’s concerns…”I agree with Elon Musk and some others on this and don’t understand why some people are not concerned.”
  • The Greatest Living Physicist? Stephen Hawking…”The development of full artificial intelligence could spell the end of the human race. Humans, who are limited by slow biological evolution, couldn’t compete, and would be superseded.

This is what is known as the “argument from authority” (a fallacy).

As the Wikipedia article on argument from authority notes:

…authorities can come to the wrong judgments through error, bias, dishonesty, or falling prey to groupthink. Thus, the appeal to authority is not a generally reliable argument for establishing facts.[7]

This article and others like it must use the “argument from authority” fallacy because they have no facts with which to persuade you of the danger of future AI. It isn’t often that you find others, outside of science fiction, who admit their alleged dangers are invented out of whole cloth.

The Oxford Researchers attempt to dress their alarmist assertions up to sound better than “appeal to authority:”

Such extreme intelligences could not easily be controlled (either by the groups creating them, or by some international regulatory regime), 485 and would probably act in a way to boost their own intelligence and acquire maximal resources for almost all initial AI motivations. 486 And if these motivations do not detail 487 the survival and value of humanity in exhaustive detail, the intelligence will be driven to construct a world without humans or without meaningful features of human existence.

This makes extremely intelligent AIs a unique risk, 488 in that extinction is more likely than lesser impacts. An AI would only turn on humans if it foresaw a likely chance of winning; otherwise it would remain fully integrated into society. And if an AI had been able to successfully engineer a civilisation collapse, for instance, then it could certainly drive the remaining humans to extinction.

Let’s briefly compare the statements made about some future AI with the sources cited by the authors.

486 See Omohundro, Stephen M.: The basic AI drives. Frontiers in Artificial Intelligence and applications 171 (2008): 483

The Basic AI Drives offers the following abstract:

One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted. We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves. We then show that self-improving systems will be driven to clarify their goals and represent them as economic utility functions. They will also strive for their actions to approximate rational economic behavior. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. We also discuss some exceptional systems which will want to modify their utility functions. We next discuss the drive toward self-protection which causes systems try to prevent themselves from being harmed. Finally we examine drives toward the acquisition of resources and toward their efficient utilization. We end with a discussion of how to incorporate these insights in designing intelligent technology which will lead to a positive future for humanity.

Omohundro reminds me of Alan Greenspan, who had to admit to Congress that his long-held faith in the “…rational economic behavior…” of investors was mistaken.

From Wikipedia:

In Congressional testimony on October 23, 2008, Greenspan finally conceded error on regulation. The New York Times wrote, “a humbled Mr. Greenspan admitted that he had put too much faith in the self-correcting power of free markets and had failed to anticipate the self-destructive power of wanton mortgage lending. … Mr. Greenspan refused to accept blame for the crisis but acknowledged that his belief in deregulation had been shaken.” Although many Republican lawmakers tried to blame the housing bubble on Fannie Mae and Freddie Mac, Greenspan placed far more blame on Wall Street for bundling subprime mortgages into securities.[80]

Like Greenspan, Omohundro has created a hedge around intelligence that he calls “rational economic behavior,” which has its roots in Boolean logic. The problem is that Omohundro, like so many others, appears to know Boole’s An Investigation of the Laws of Thought by reputation and/or repetition by others.

Boole was very careful to point out that his rules were only one aspect of what it means to “reason,” saying at pp. 327-328:

But the very same class of considerations shows with equal force the error of those who regard the study of Mathematics, and of their applications, as a sufficient basis either of knowledge or of discipline. If the constitution of the material frame is mathematical, it is not merely so. If the mind, in its capacity of formal reasoning, obeys, whether consciously or unconsciously, mathematical laws, it claims through its other capacities of sentiment and action, through its perceptions of beauty and of moral fitness, through its deep springs of emotion and affection, to hold relation to a different order of things. There is, moreover, a breadth of intellectual vision, a power of sympathy with truth in all its forms and manifestations, which is not measured by the force and subtlety of the dialectic faculty. Even the revelation of the material universe in its boundless magnitude, and pervading order, and constancy of law, is not necessarily the most fully apprehended by him who has traced with minutest accuracy the steps of the great demonstration. And if we embrace in our survey the interests and duties of life, how little do any processes of mere ratiocination enable us to comprehend the weightier questions which they present! As truly, therefore, as the cultivation of the mathematical or deductive faculty is a part of intellectual discipline, so truly is it only a part. The prejudice which would either banish or make supreme any one department of knowledge or faculty of mind, betrays not only error of judgment, but a defect of that intellectual modesty which is inseparable from a pure devotion to truth. It assumes the office of criticising a constitution of things which no human appointment has established, or can annul. It sets aside the ancient and just conception of truth as one though manifold. Much of this error, as actually existent among us, seems due to the special and isolated character of scientific teaching—which character it, in its turn, tends to foster. The study of philosophy, notwithstanding a few marked instances of exception, has failed to keep pace with the advance of the several departments of knowledge, whose mutual relations it is its province to determine. It is impossible, however, not to contemplate the particular evil in question as part of a larger system, and connect it with the too prevalent view of knowledge as a merely secular thing, and with the undue predominance, already adverted to, of those motives, legitimate within their proper limits, which are founded upon a regard to its secular advantages. In the extreme case it is not difficult to see that the continued operation of such motives, uncontrolled by any higher principles of action, uncorrected by the personal influence of superior minds, must tend to lower the standard of thought in reference to the objects of knowledge, and to render void and ineffectual whatsoever elements of a noble faith may still survive.

As far as the “drives” of an AI go, we have only one speculation on such drives and no factual evidence. Restricting the future model of AI to current misunderstandings of what it means to reason doesn’t seem like a useful approach.

487 See Muehlhauser, Luke, and Louie Helm.: Intelligence Explosion and Machine Ethics. In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Amnon Eden, Johnny Søraker, James H. Moor, and Eric Steinhart. Berlin: Springer (2012)

Muehlhauser and Helm are cited for the proposition:

And if these motivations do not detail 487 the survival and value of humanity in exhaustive detail, the intelligence will be driven to construct a world without humans or without meaningful features of human existence.

The abstract for Intelligence Explosion and Machine Ethics reads:

Many researchers have argued that a self-improving artificial intelligence (AI) could become so vastly more powerful than humans that we would not be able to stop it from achieving its goals. If so, and if the AI’s goals differ from ours, then this could be disastrous for humans. One proposed solution is to program the AI’s goal system to want what we want before the AI self-improves beyond our capacity to control it. Unfortunately, it is difficult to specify what we want. After clarifying what we mean by “intelligence,” we offer a series of “intuition pumps” from the field of moral philosophy for our conclusion that human values are complex and difficult to specify. We then survey the evidence from the psychology of motivation, moral psychology, and neuroeconomics that supports our position. We conclude by recommending ideal preference theories of value as a promising approach for developing a machine ethics suitable for navigating an intelligence explosion or “technological singularity.”

What follows is a delightful discussion of the difficulties of constructing moral rules of universal application and how moral guidance for AIs could lead to unintended consequences. I take the essay as evidence of our imprecision in moral reasoning and the need to do better for ourselves and any future AI. Its relationship to “…driven to construct a world without humans or without meaningful features of human existence” is tenuous at best.

For their most extreme claim:

This makes extremely intelligent AIs a unique risk, 488 in that extinction is more likely than lesser impacts.

the authors rely upon the most reliable source, themselves:

488 Dealing with most risks comes under the category of decision theory: finding the right approaches to maximise the probability of the most preferred options. But an intelligent agent can react to decisions in a way the environment cannot, meaning that interactions with AIs are better modelled by the more complicated discipline of game theory.

For the claim that extinction by a future AI is more likely, the authors have only self-citation as authority.

To summarize, the claims about future AI are based on arguments from authority and the evidence cited by the “Oxford researchers” consists of one defective notion of AI, one exploration of specifying moral rules and a self-citation.

As a contrary example, consider all the non-human inhabitants of the Earth, none of which have exhibited that unique human trait, the need to drive other species into extinction. Perhaps those who fear a future AI are seeing a reflection from a dark mirror.

PS: You can see the full version of the Oxford report: 12 Risks that threaten human civilisation.

The authors and/or their typesetter are very skilled at page layout and the use of color. It is unfortunate they did not have professional editing for the AI section of the report.

Russian researchers expose breakthrough U.S. spying program

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 2:57 pm

Russian researchers expose breakthrough U.S. spying program by Joseph Menn.

From the post:

The U.S. National Security Agency has figured out how to hide spying software deep within hard drives made by Western Digital, Seagate, Toshiba and other top manufacturers, giving the agency the means to eavesdrop on the majority of the world’s computers, according to cyber researchers and former operatives.

That long-sought and closely guarded ability was part of a cluster of spying programs discovered by Kaspersky Lab, the Moscow-based security software maker that has exposed a series of Western cyberespionage operations.

Kaspersky said it found personal computers in 30 countries infected with one or more of the spying programs, with the most infections seen in Iran, followed by Russia, Pakistan, Afghanistan, China, Mali, Syria, Yemen and Algeria. The targets included government and military institutions, telecommunication companies, banks, energy companies, nuclear researchers, media, and Islamic activists, Kaspersky said. (reut.rs/1L5knm0)

Don’t have a sense for all thirty countries? Reuters has a visual to help with that:

[Image: Reuters-Equation-Infection]

The Reuters report is great but if you want more technical details, see: Equation Group: The Crown Creator of Cyber-Espionage (the original Kaspersky report), and Equation: The Death Star of Malware Galaxy by GReAT (Kaspersky Labs’ Global Research & Analysis Team), which is an in-depth review of the exploit.

There is a comment to the GReAT blog post that reads:

Ok, reading through NSA files that Der Spiegel released i found this:

http://www.spiegel.de/media/media-35661.pdf

This is a file that shows the job postings for NSA interns, you can find a NSA wiki link in the last page. And this is very interesting:

(TS//SI//REL) Create a covert storage product that is enabled from a hard drive firmware modification. The ideia would be to modify the firmware of a particular hard drive so that it normally only recognizes half of its available space. It would report this size back to the operating system and not provide any way to access the additional space.

This is a 2006 document, it took 8 years to finish this product, which is what kaspersky found.

So maybe you guys would easily find the malware if you revert the firmware to a state prior of this date.

Has anyone been collecting hard drive firmware? Another example of where “secret” code exposes users to dangers difficult to guard against.

Public open source code (whether “free” or not) should be a legal requirement for the distribution of software and/or devices with firmware. Just for security reasons alone.

BTW, anyone still in favor of “trusting” the intelligence community if they say your privacy is being respected?

I found the Reuters story because of a tweet by Violet Blue. I then tracked down the source documents for your convenience (I haven’t seen them in other accounts).

Data Mining: Spring 2013 (CMU)

Filed under: Data Mining,R — Patrick Durusau @ 2:33 pm

Data Mining: Spring 2013 (CMU) by Ryan Tibshirani.

Overview and Objectives [from syllabus]

Data mining is the science of discovering structure and making predictions in data sets (typically, large ones). Applications of data mining are happening all around you, and if they are done well, they may sometimes even go unnoticed. How does Google web search work? How does Shazam recognize a song playing in the background? How does Netflix recommend movies to each of its users? How could we predict whether or not a person will develop breast cancer based on genetic information? How could we search for possible subgroups among breast cancer patients, suggesting different variants of the disease? An expert’s answer to any one of these questions may very well contain enough material to fill its own course, but basic answers stem from the principles of data mining.

Data mining spans the fields of statistics and computer science. Since this is a course in statistics, we will adopt a statistical perspective for the majority of the course. Data mining also involves a good deal of both applied work (programming, problem solving, data analysis) and theoretical work (learning, understanding, and evaluating methodologies). We will try to maintain a balance between the two.

Upon completing this course, you should be able to tackle new data mining problems, by: (1) selecting the appropriate methods and justifying your choices; (2) implementing these methods programmatically (using, say, the R programming language) and evaluating your results; (3) explaining your results to a researcher outside of statistics or computer science.

Lecture notes, R files, what more could you want? 😉

Enjoy!

February 16, 2015

Big Data, or Not Big Data: What is <your> question?

Filed under: BigData,Complexity,Data Mining — Patrick Durusau @ 7:55 pm

Big Data, or Not Big Data: What is <your> question? by Pradyumna S. Upadrashta.

From the post:

Before jumping on the Big Data bandwagon, I think it is important to ask the question of whether the problem you have requires much data. That is, I think its important to determine when Big Data is relevant to the problem at hand.

The question of relevancy is important, for two reasons: (i) if the data are irrelevant, you can’t draw appropriate conclusions (collecting more of the wrong data leads absolutely nowhere), (ii) the mismatch between the problem statement, the underlying process of interest, and the data in question is critical to understand if you are going to distill any great truths from your data.

Big Data is relevant when you see some evidence of a non-linear or non-stationary generative process that varies with time (or at least, collection time), on the spectrum of random drift to full blown chaotic behavior. Non-stationary behaviors can arise from complex (often ‘hidden’) interactions within the underlying process generating your observable data. If you observe non-linear relationships, with underlying stationarity, it reduces to a sampling problem. Big Data implicitly becomes relevant when we are dealing with processes embedded in a high dimensional context (i.e., after dimension reduction). For high embedding dimensions, we need more and more well distributed samples to understand the underlying process. For problems where the underlying process is both linear and stationary, we don’t necessarily need much data

[Image: bigdata-complexity]

Great post and a graphic that is worthy of being turned into a poster! (Pradyumna asks for suggestions on the graphic so you may want to wait a few days to see if it improves. Plus send suggestions if you have them.)

What is <your> question? wasn’t the starting point for: Dell: Big opportunities missed as Big Data remains big business.

The barriers to big data:

While big data has proven marketing benefits, infrastructure costs (35 per cent) and security (35 per cent) tend to be the primary obstacles for implementing big data initiatives.

Delving deeper, respondents believe analytics/operational costs (34 per cent), lack of management support (22 per cent) and lack of technical skills (21 per cent) are additional barriers in big data strategies.

“So where do the troubles with big data stem from?” asks Jones, citing cost (e.g. price of talent, storage, etc.), security concerns, uncertainty in how to leverage data and a lack of in-house expertise.

“In fact, only 36 percent of organisations globally have in-house big data expertise. Yet, the proven benefits of big data analytics should justify the investment – businesses just have to get started.

Do you see What is <your> question? being answered anywhere?

I didn’t, yet the drum beat for big data continues.

I fully agree that big data techniques and big data are important advances and they should be widely adopted and used, but only when they are appropriate to the question at hand.

Otherwise you will be like a non-profit I know that spent upwards of $500,000 on a CMS system that was fundamentally incompatible with their data. It wasn’t designed for document management. A fine system, but not appropriate for the task at hand. It was like a sleeping dog in the middle of the office. No matter what you wanted to do, it was hard to avoid the dog.

They certainly could not admit that the purchasing decision was a mistake because those in charge would lose face.

Don’t find yourself in a similar situation with big data.

Unless and until someone produces an intelligible business plan that identifies the data, the proposed analysis of the data and the benefits of the results, along with cost estimates, etc., keep a big distance from big data. Make business ROI based decisions, not cult ones.

I first saw this in a tweet by Kirk Borne.

7 Traps to Avoid Being Fooled by Statistical Randomness

Filed under: Random Walks,Randomness,Statistics — Patrick Durusau @ 5:47 pm

7 Traps to Avoid Being Fooled by Statistical Randomness by Kirk Borne.

From the post:

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere — if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.

Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such “ordered” patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: “Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory.” The message was clear — beware of apparent order in a random process, and don’t be tricked into developing a theory to explain random data.

Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin). Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?

(a) HTHTHTHTHTHH

(b) TTTTTTTTTTTT

(c) HHHHHHHHHHHT

(d) None of the above.

In each case, a coin toss of head is listed as “H”, and a coin toss of tail is listed as “T”.

The answer is “(d) None of the Above.”

None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here — it corresponds to the fact that when only 12 coin tosses are considered, the occurrence of any “improbable result” may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens of more coin tosses (nothing but Tails, all the way down), then that would be truly significant.

Great post on randomness where Kirk references a fun example using Nobel Prize winners with various statistical “facts” for your amusement.
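If you want to convince yourself, or a skeptical colleague, of the small-numbers point, a quick simulation does it. A minimal sketch (counts will vary with the seed):

```python
# A quick check of the small-numbers point: each "suspicious" 12-toss pattern
# turns up on its own in a long run of fair coin tosses. Any given 12-toss
# window matches a specific pattern with probability 1/4096, so a couple of
# hundred hits per million tosses is unremarkable.
import random

random.seed(42)
tosses = "".join(random.choice("HT") for _ in range(1_000_000))

for pattern in ("HTHTHTHTHTHH", "TTTTTTTTTTTT", "HHHHHHHHHHHT"):
    print(pattern, "appears", tosses.count(pattern), "times")  # non-overlapping count
```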

Kirk suggests a reading pack for partial avoidance of this issue in your work:

  1. “Fooled By Randomness”, by Nassim Nicholas Taleb.
  2. “The Flaw of Averages”, by Sam L. Savage.
  3. “The Drunkard’s Walk – How Randomness Rules Our Lives”, by Leonard Mlodinow.

I wonder if you could get Amazon to create a also-bought-with package of those three books? Something you could buy for your friends in big data and intelligence work. 😉

Interesting that I saw this just after posting Structuredness coefficient to find patterns and associations. The call on “likely” or “unlikely” comes down to human agency. Yes?

Structuredness coefficient to find patterns and associations

Filed under: Data Mining,Visualization — Patrick Durusau @ 5:27 pm

Structuredness coefficient to find patterns and associations by Livan Alonso.

From the post:

The structuredness coefficient, let’s denote it as w, is not yet fully defined – we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working under the following framework:

  • We have a data set with n points. For simplicity, let’s consider for now that these n points are n vectors (x, y) where x, y are real numbers.
  • For each pair of points {(x,y), (x’,y’)} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
  • We order all the distances d and compute the distance distribution, based on these n points
  • Leaving-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points
  • We compare the distribution computed on n points, with the n ones computed on n-1 points
  • We repeat this iteration, but this time with n-2, then n-3, n-4 points etc.
  • You would assume that if there is no pattern, these distance distributions (for successive values of n) would have some kind of behavior uniquely characterizing the absence of structure, behavior that can be identified via simulations. Any deviation from this behavior would indicate the presence of a structure. And the pattern-free behavior would be independent of the underlying point distribution or domain – a very important point. All of this would have to be established or tested, of course.
  • It would be interesting to test whether this metric can identify patterns such as fractal distribution / fractal dimension. Would it be able to detect patterns in time series?

Note that this type of structuredness coefficient makes no assumption on the shape of the underlying domains, where the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points etc. They might even be non numeric domain at all (e.g. if the data consists of keywords).

fractal

Deeply interesting work and I appreciate the acknowledgement that “structuredness coefficient” isn’t fully defined.

I will be trying to develop more links to resources on this topic. Please chime in if you have some already.
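In the meantime, the leave-one-out step of the framework is easy to prototype. A minimal sketch with numpy and scipy (the Kolmogorov-Smirnov statistic is only a stand-in for whatever comparison the final coefficient settles on, and the test data is made up):

```python
# A minimal sketch of the leave-one-out step in the framework quoted above.
# The Kolmogorov-Smirnov statistic is only a stand-in for whatever comparison
# the final "structuredness coefficient" settles on; that choice is still
# open, as the post says. The test data is made up.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import ks_2samp

def leave_one_out_shift(points):
    """KS statistic between the full pairwise-distance distribution and
    each distribution obtained by removing one point."""
    full = pdist(points)
    return np.array([
        ks_2samp(full, pdist(np.delete(points, i, axis=0))).statistic
        for i in range(len(points))
    ])

rng = np.random.default_rng(0)
uniform = rng.uniform(size=(200, 2))                      # pattern-free baseline
clusters = np.vstack([rng.normal(0.0, 0.05, (100, 2)),    # two tight clusters
                      rng.normal(1.0, 0.05, (100, 2))])

print("uniform  :", leave_one_out_shift(uniform).mean())
print("clustered:", leave_one_out_shift(clusters).mean())
```

Whether a comparison like this actually separates structure from noise independently of the underlying domain is exactly the open question raised above.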

Jeb EMails – Poor Pickings for SSNs

Filed under: Privacy,Security — Patrick Durusau @ 4:57 pm

I have been meaning to mention Jeb Bush’s release of his emails as Florida governor (JebEmails) as training data. A reported 300,000+ emails were available in six files (original Outlook (.pst) format). The raw files aren’t available now due to SSNs being included in the original data release.

Anyone with a copy of the original data have a pointer?

That may seem callous, but one of the rantings about the privacy violation does mention:

Most of the exposed numbers (roughly 12,500) came from a spreadsheet attached to an email, meaning most of the people screwed over weren’t just randomly messaging their personal information to the then-governor. The bulk of the social security numbers were from a PowerPoint email attachment about people on a family services waiting list.

How many people on a family services waiting list do you think have accounts at stock trading houses or even a credit card with an unlimited overdraft privilege?

What are the odds that some of the 80 million SSNs hacked from Anthem Health Insurance might fall into one or both of those categories?

To say “privacy” and “breach” in the same sentence isn’t a signal to go to DEFCON 1.

Some breaches of privacy are more serious than others. Unless and until priorities are debated and adopted for a sliding scale of types of privacy, public discussion will continue to flail about ineffectually every time privacy is mentioned.

When Jeb’s emails become available, again, I will return to the topic of using them as demonstration data.

PS: I saw that Jeb’s emails ended in 2007. Did Jeb stop using email after he left the governor’s office? Or is there a seven year blank spot in his email record?

I first saw this in a tweet by Charles Ditzel.

Visualizing Interstellar’s Wormhole

Filed under: Astroinformatics,Physics — Patrick Durusau @ 4:19 pm

Visualizing Interstellar’s Wormhole by Oliver James, Eugenie von Tunzelmann, Paul Franklin, Kip S. Thorne.

Abstract:

Christopher Nolan’s science fiction movie Interstellar offers a variety of opportunities for students in elementary courses on general relativity theory. This paper describes such opportunities, including: (i) At the motivational level, the manner in which elementary relativity concepts underlie the wormhole visualizations seen in the movie. (ii) At the briefest computational level, instructive calculations with simple but intriguing wormhole metrics, including, e.g., constructing embedding diagrams for the three-parameter wormhole that was used by our visual effects team and Christopher Nolan in scoping out possible wormhole geometries for the movie. (iii) Combining the proper reference frame of a camera with solutions of the geodesic equation, to construct a light-ray-tracing map backward in time from a camera’s local sky to a wormhole’s two celestial spheres. (iv) Implementing this map, for example in Mathematica, Maple or Matlab, and using that implementation to construct images of what a camera sees when near or inside a wormhole. (v) With the student’s implementation, exploring how the wormhole’s three parameters influence what the camera sees—which is precisely how Christopher Nolan, using our implementation, chose the parameters for Interstellar’s wormhole. (vi) Using the student’s implementation, exploring the wormhole’s Einstein ring, and particularly the peculiar motions of star images near the ring; and exploring what it looks like to travel through a wormhole.

Finally! A use for all the GFLOPS at your fingertips! You can vet images shown in movies that purport to represent wormholes. Seriously, the appendix to this article has instructions.

Moreover, you can visit: Visualizing Interstellar’s Wormhole (I know, same name as the paper but this is a website with further details and high-resolution images for use by students.)

A poor cropped version of one of those images:

[Image: interstellar]

A great demonstration of what awaits anyone with an interest to explore and sufficient computing power.
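If you want to start with the briefest computation mentioned in the abstract, item (ii), the embedding diagram for the simplest wormhole takes only a few lines. A sketch assuming numpy and matplotlib, and a single throat-radius parameter b rather than the movie’s three-parameter metric:

```python
# A sketch of the warm-up version of item (ii) in the abstract: the embedding
# diagram of an Ellis-style wormhole with spatial metric dl^2 + (b^2 + l^2) dphi^2,
# i.e. a single throat-radius parameter b. The movie's wormhole has three
# parameters; this is only the simplest case suggested for students.
import numpy as np
import matplotlib.pyplot as plt

b = 1.0                        # throat radius
l = np.linspace(-5, 5, 400)    # proper radial distance through the throat
r = np.sqrt(b**2 + l**2)       # circumferential radius
z = b * np.arcsinh(l / b)      # embedding height, from (dz/dl)^2 = 1 - (dr/dl)^2

plt.plot(r, z)
plt.xlabel("r (circumferential radius)")
plt.ylabel("z (embedding height)")
plt.title("Embedding diagram, single-parameter wormhole, b = 1")
plt.show()
```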

I first saw this in a tweet by Computer Science.

Intelligence Sharing, Crowd Sourcing and Good News for the NSA

Filed under: Crowd Sourcing,Intelligence,NSA — Patrick Durusau @ 3:11 pm

Lisa Vaas posted an entertaining piece today with the title: Are Miami cops really flooding Waze with fake police sightings? Apparently an NBC affiliate (not FOX, amazing) tried its hand at FUD, alleging that Miami police officers were gaming Waze.

There is a problem with that theory, which Lisa points out quoting Julie Mossler, a spokesperson for Waze:

Waze algorithms rely on crowdsourcing to confirm or negate what has been reported on the road. Thousands of users in Florida do this, both passively and actively, every day. In addition, we place greater trust in reports from heavy users and terminate accounts of those whose behavior demonstrate a pattern of contributing false information. As a result the Waze map will remain reliable and updated to the minute, reflecting real-time conditions.

Oops!

See Lisa’s post for the blow-by-blow account of this FUD attempt by the NBC affiliate.

However foolish an attempt to game Waze would be, it is a good example to promote the sharing of intelligence.

Think about it. Rather than the consensus poop that emerges as the collaboration of the senior management in intelligence agencies, why not share all intelligence across agencies, among working analysts addressing the same areas or issues? Make the “crowd” people who have similar security clearances and common subject areas. And while contributions are trackable within an agency, to the “crowd” everyone has a handle and their contributions on shared intelligence are voted up or down. Just like with Waze, people will develop reputations within the system.

I assume for turf reasons you could put handles on the intelligence so the participants would not know its origins as well, just until people started building up trust in the system.

Changing the cultures at the intelligence agencies, which hasn’t succeeded since 9/11, would require a more dramatic approach than has been tried to date. My suggestion is to give the Inspector Generals the ability to block promotions and/or fire people in the intelligence agencies who don’t actively promote the sharing of intelligence. Where “actively promotes” is measured by intelligence shared and not activities to plan to share intelligence, etc.

Unless and until there are consequences for the failure of members of the intelligence community to put the interests of their employers (in this case, citizens of the United States) above their own or that of their agency, the failure to share intelligence since 9/11 will continue.

PS: People will object that the staff in question have been productive, loyal, etc., etc. in the past. The relevant question is whether they have the skills and commitment that is required now? The answer to that last question is either yes or no. Employment is an opportunity to perform, not an entitlement.

February 15, 2015

SPARQLES: Monitoring Public SPARQL Endpoints

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 7:48 pm

SPARQLES: Monitoring Public SPARQL Endpoints by Pierre-Yves Vandenbussche, Jürgen Umbrich, Aidan Hogan, and Carlos Buil-Aranda.

Abstract:

We describe SPARQLES: an online system that monitors the health of public SPARQL endpoints on the Web by probing them with custom-designed queries at regular intervals. We present the architecture of SPARQLES and the variety of analytics that it runs over public SPARQL endpoints, categorised by availability, discoverability, performance and interoperability. To motivate the system, we give examples of some key questions about the health and maturation of public SPARQL endpoints that can be answered by the data it has collected in the past year(s). We also detail the interfaces that the system provides for human and software agents to learn more about the recent history and current state of an individual SPARQL endpoint or about overall trends concerning the maturity of all endpoints monitored by the system.

I started to pass on this article since it does date from 2009 but am now glad that I didn’t. The service is still active and can be found at: http://sparqles.okfn.org/.
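The basic idea, probing an endpoint with a query and recording what comes back, is easy to reproduce at small scale. A minimal sketch with SPARQLWrapper (assumed installed); SPARQLES’s real probes are more varied and run on a schedule:

```python
# A minimal availability probe in the spirit of SPARQLES; its real probes are
# more elaborate and run at regular intervals. Assumes the SPARQLWrapper
# package; the endpoint URL is just an example.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "http://dbpedia.org/sparql"

sparql = SPARQLWrapper(endpoint)
sparql.setQuery("ASK { ?s ?p ?o }")   # cheapest possible "are you alive?" query
sparql.setReturnFormat(JSON)
sparql.setTimeout(10)

try:
    result = sparql.query().convert()
    print(endpoint, "answered:", result.get("boolean"))
except Exception as exc:              # HTTP errors, timeouts, malformed responses
    print(endpoint, "unreachable or broken:", exc)
```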

The discoverability of SPARQL endpoints is reported to be:

[Image: sparql-discovery]

From the article:

[VoID Description:] The Vocabulary of Interlinked Data-sets (VoID) [2] has become the de facto standard for describing RDF datasets (in RDF). The vocabulary allows for specifying, e.g., an OpenSearch description, the number of triples a dataset contains, the number of unique subjects, a list of properties and classes used, number of triples associated with each property (used as predicate), number of instances of a given class, number of triples used to describe all instances of a given class, predicates used to describe class instances, and so forth. Likewise, the description of the dataset is often enriched using external vocabulary, such as for licensing information.

[SD Description:] Endpoint capabilities – such as supported SPARQL version, query and update features, I/O formats, custom functions, and/or entailment regimes – can be described in RDF using the SPARQL 1.1 Service Description (SD) vocabulary, which became a W3C Recommendation in March 2013 [21]. Such descriptions, if made widely available, could help a client find public endpoints that support the features it needs (e.g., find SPARQL 1.1 endpoints)
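
As a rough illustration of what such a client-side check looks like (this is not SPARQLES’s code and the endpoint URL is a placeholder), a probe for availability and for a service description might run:

# Minimal sketch of a discoverability probe. ENDPOINT is hypothetical;
# substitute any public SPARQL endpoint you want to test.
import requests

ENDPOINT = "http://example.org/sparql"

# 1. Availability: does the endpoint answer a trivial ASK query?
available = False
try:
    resp = requests.get(
        ENDPOINT,
        params={"query": "ASK {}"},
        headers={"Accept": "application/sparql-results+json"},
        timeout=10,
    )
    available = resp.ok and resp.json().get("boolean") is True
except Exception:
    pass
print("available:", available)

# 2. Discoverability: per SPARQL 1.1 SD, dereferencing the endpoint URL with
#    an RDF Accept header should return a service description, if one exists.
sd = requests.get(ENDPOINT, headers={"Accept": "text/turtle"}, timeout=10)
print("service description served:",
      sd.ok and "sparql-service-description" in sd.text)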

No, I’m not calling your attention to this to pick on SPARQL especially, but the lack of discoverability raises a serious issue for any information retrieval system that hopes to do better than dumb-luck searching.

Clearly SPARQL has mechanisms to increase discoverability; whether those mechanisms would be effective cannot be answered, due to lack of use. So my first question is: why aren’t the mechanisms of SPARQL being used to increase discoverability?

Or perhaps better, having gone to the trouble to construct a SPARQL endpoint, why aren’t people taking the next step to make them more discoverable?

Is it because discoverability benefits some remote and faceless user instead of those being called upon to make the endpoint more discoverable? In that sense, is it a lack of positive feedback for the person tasked with increasing discoverability?

I ask because if we can’t find the key to motivating people to increase the discoverability of information (SPARQL or not), then we are in serious trouble as big data continues to grow. The amount of data will keep increasing while discoverability keeps going down. That can’t be a happy circumstance for anyone interested in discovering information.

Suggestions?

I first saw this in a tweet by Ruben Verborgh.

Debunking the Myth of Academic Meritocracy

Filed under: Visualization — Patrick Durusau @ 7:09 pm

Preface: I don’t think the results reported by the authors will surprise anyone. Heretofore the evidence has been whispered at conferences, anecdotal, and piecemeal. All of which made it easier to sustain the myth of an academic meritocracy. In the face of nearly 19,000 faculty positions across three distinct disciplines and careful analysis, sustaining the meritocracy myth will be much harder. It has been my honor to know truly meritorious scholars, but I have also known the socialite type as well.

Systematic inequality and hierarchy in faculty hiring networks by Aaron Clauset, Samuel Arbesman, Daniel B. Larremore. (Science Advances 01 Feb 2015: Vol. 1 no. 1 e1400005 DOI: 10.1126/sciadv.1400005)

Abstract:

The faculty job market plays a fundamental role in shaping research priorities, educational outcomes, and career trajectories among scientists and institutions. However, a quantitative understanding of faculty hiring as a system is lacking. Using a simple technique to extract the institutional prestige ranking that best explains an observed faculty hiring network—who hires whose graduates as faculty—we present and analyze comprehensive placement data on nearly 19,000 regular faculty in three disparate disciplines. Across disciplines, we find that faculty hiring follows a common and steeply hierarchical structure that reflects profound social inequality. Furthermore, doctoral prestige alone better predicts ultimate placement than a U.S. News & World Report rank, women generally place worse than men, and increased institutional prestige leads to increased faculty production, better faculty placement, and a more influential position within the discipline. These results advance our ability to quantify the influence of prestige in academia and shed new light on the academic system.

A must read for its techniques and methodology, and for the broader implications for our research/educational institutions and society at large.

The authors’ conclusion is quite chilling:

More broadly, the strong social inequality found in faculty placement across disciplines raises several questions. How many meritorious research careers are derailed by the faculty job market’s preference for prestigious doctorates? Would academia be better off, in terms of collective scholarship, with a narrower gap in placement rates? In addition, if collective scholarship would improve with less inequality, what changes would do more good than harm in practice? These are complicated questions about the structure and efficacy of the academic system, and further study is required to answer them. We note, however, that economics and the study of income and wealth inequality may offer some insights about the practical consequences of strong inequality (13).

In closing, there is nothing specific to faculty hiring in our network analysis, and the same methods for extracting prestige hierarchies from interaction data could be applied to study other forms of academic activities, for example, scientific citation patterns among institutions (32). These methods could also be used to characterize the movements of employees among firms within or across commercial sectors, which may shed light on mechanisms for economic and social mobility (33). Finally, because graduate programs admit as students the graduates of other institutions, a similar approach could be used to assess the educational outcomes of undergraduate programs.
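
The core move, extracting a ranking from who-hires-whom data, can be sketched in a few lines. The snippet below is not the paper’s minimum violation ranking; it runs networkx PageRank on a reversed toy hiring network as a crude stand-in, just to show the shape of the computation.

# Toy hiring network: edge (u, v) means a doctorate from u was hired at v.
import networkx as nx

hires = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
G = nx.DiGraph(hires)

# Reversing the edges lets "places graduates widely" translate into a high score.
scores = nx.pagerank(G.reverse(), alpha=0.85)
for inst, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(inst, round(score, 3))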

I think there are three options at this point:

  • Punish the data
  • Ignore the data
  • See where the data takes us

Which one are you and your academic institution (if any) going to choose?

If you are outside academia, you might want to make a similar study of your organization or industry to help plot your career.

If you are outside academia and the private sector, consider a similar study of government.

I discovered this paper by seeing the Faculty Hiring Networks data page in a tweet by Aaron Clauset.

50 Shades Sex Scene detector

Filed under: Natural Language Processing,Python — Patrick Durusau @ 4:55 pm

NLP-in-Python by Lynn Cherny.

No, the title is not “click-bait” because section 4 of Lynn’s tutorial is titled:

4. Naive Bayes Classification – the infamous 50 Shades Sex Scene Detection because spam is boring

Titles can be accurate and NLP can be interesting.
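
For readers who want the flavor of that section before opening the notebook, here is a generic sketch of the approach, using scikit-learn rather than Lynn’s exact code, with made-up training snippets standing in for labeled scenes.

# Naive Bayes text classification sketch (toy data, not Lynn's corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "he kissed her slowly",
    "she unbuttoned his shirt",
    "they signed the merger papers",
    "the board meeting ran long",
]
train_labels = [1, 1, 0, 0]  # 1 = scene of interest, 0 = everything else

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["the contract was notarized",
                   "her breath caught as he leaned in"]))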

Imagine an ebook reader that accepts 3rd party navigation for ebooks. Running NLP on novels could provide navigation that isolates the sex or other scenes for rapid access.

An electronic abridging of the original. Not unlike CliffsNotes.

I suspect that could be a marketable information product separate from the original ebook.

As would the ability to overlay 3rd party content on original ebook publications.

Are any of the open source ebook readers working on such a feature? Easier to develop demand for that feature on open source ebook readers and then tackle the DRM/proprietary format stuff.

The software behind this clickbait data visualization will blow your mind

Filed under: R,Visualization — Patrick Durusau @ 4:26 pm

The software behind this clickbait data visualization will blow your mind by David Smith.

From the post:

New media sites like Buzzfeed and Upworthy have mastered the art of "clickbait": headlines and content designed to drive as much traffic as possible to their sites. One technique is to use coy headlines like "If you take a puppy video break today, make sure this is the dog video you watch." (Gawker apparently spends longer writing a headline than the actual article.) But the big stock-in-trade is "listicles": articles that are, well, just lists of things. (Exactly half of Buzzfeed's top 20 posts of this week are listicles, including "32 Paintings Paired With Quotes From 'Mean Girls'".)

If your goal is to maximize virality, how long should a listicle be? Max Woolf, an R user and Bay Area Software QA Engineer, set out to answer that question with data. Buzzfeed reports the number of Facebook shares for each of its articles, so he scraped BuzzFeed’s website and counted the number of items in 15,656 listicles. He then used R's ggplot2 package to plot number of Facebook shares versus number of listicle items, and added a smooth line to show the relationship:
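
If you prefer Python to R, the same plot can be roughed out as below, assuming you have already scraped a CSV of listicle lengths and share counts (the file and column names here are hypothetical).

# Scatter of shares vs. listicle length, with a crude trend line.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("listicles.csv")  # columns: num_items, shares

plt.scatter(df["num_items"], df["shares"], s=5, alpha=0.3)
trend = df.groupby("num_items")["shares"].median()  # stand-in for a smoother
plt.plot(trend.index, trend.values, color="red")
plt.xlabel("Number of listicle items")
plt.ylabel("Facebook shares")
plt.yscale("log")  # share counts are heavy-tailed
plt.show()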

Not that I read Buzzfeed very often, but at least its lists are true lists: you aren’t forced to load each item separately, with ads each time. Not great curation, but one-item-at-a-time displays, or articles broken into multiple parts for ad reasons, are far more objectionable.

That said, if you are looking for shares on Facebook, take this as your guide to creating listicles. 😉

The US Patent and Trademark Office should switch from documents to data

Filed under: Government Data,Patents — Patrick Durusau @ 2:00 pm

The US Patent and Trademark Office should switch from documents to data by Justin Duncan.

From the post:

The debate over patent reform — one of Silicon Valley’s top legislative priorities — is once again in focus with last week’s introduction of the Innovation Act (H.R. 9) by House Judiciary Committee Chairman Bob Goodlatte (R-Va.), Rep. Peter DeFazio (D-Ore.), Subcommittee on Courts, Intellectual Property, and the Internet Chairman Darrell Issa (R-Calif.) and Ranking Member Jerrold Nadler (D-N.Y.), and 15 other original cosponsors.

The Innovation Act largely takes aim at patent trolls (formally “non-practicing entities”), who use patent litigation as a business strategy and make money by threatening lawsuits against other companies. While cracking down on litigious patent trolls is important, that challenge is only one facet of what should be a larger context for patent reform.

The need to transform patent information into open data deserves some attention, too.

The United States Patent and Trademark Office (PTO), the agency within the Department of Commerce that grants patents and registers trademarks, plays a crucial role in empowering American innovators and entrepreneurs to create new technologies. Ironically, many of the PTO’s own systems and technologies are out of date.

Last summer, Data Transparency Coalition advisor Joel Gurin and his colleagues organized an Open Data Roundtable with the Department of Commerce, co-hosted by the Governance Lab at New York University (GovLab) and the White House Office of Science and Technology Policy (OSTP). The roundtable focused on ways to improve data management, dissemination, and use at the Department of Commerce. It shed some light on problems faced by the PTO.

According to GovLab’s report of the day’s findings and recommendations, the PTO is currently working to improve the use and availability of some patent data by putting it in a more centralized, easily searchable form.

To make patent applications easier to navigate – for inventors, investors, the public, and the agency itself – the PTO should more fully embrace the use of structured data formats, like XML, to express the information currently collected as PDFs or text documents.

Justin’s post is a brief history of efforts to improve access to patent and trademark information, mostly focusing on the need for the USPTO (US Patent and Trademark Office) to stop relying on PDF as its default format.
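
To see why the format matters, compare what a few lines of code can do with structured data. The snippet below parses a made-up patent XML record, not a real USPTO schema, but the point is that no PDF yields its fields this easily.

# Pulling fields out of a hypothetical patent XML record.
import xml.etree.ElementTree as ET

record = """<patent>
  <number>US1234567</number>
  <title>Widget frobnicator</title>
  <expiration>2031-06-01</expiration>
  <claims><claim id="1">A frobnicator comprising...</claim></claims>
</patent>"""  # invented schema, for illustration only

root = ET.fromstring(record)
print(root.findtext("number"), root.findtext("expiration"))
print(len(root.findall("./claims/claim")), "claim(s)")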

Other potential improvements:

Additional GovLab recommendations included:

  • PTO [should] make more information available about the scope of patent rights, including expiration dates, or decisions by the agency and/or courts about patent claims.
  • PTO should add more context to its data to make it usable by non-experts – e.g. trademark transaction data and trademark assignment.
  • Provide Application Programming Interfaces (APIs) to enable third parties to build better interfaces for the existing legacy systems. Access to Patent Application Information Retrieval (PAIR) and Patent Trial and Appeal Board (PTAB) data are most important here.
  • Improve access to Cooperative Patent Classification (CPC)/U.S. Patent Classification (USPC) harmonization data; tie this data more closely to economic data to facilitate analysis.

The first and last recommendations on the GovLab list, which tie in related information, are another step in the right direction.

But only a step.

If you have ever searched the USPTO patent database, you know making the data “searchable” is only a nod and a wink toward accessibility. Making the data searchable is nothing to sneeze at, but USPTO reform should have a higher target than simply being “searchable.”

Outside of patent search specialists (and not all of them), what ordinary citizen is going to be able to navigate the terms of art across domains when searching patents?

The USPTO should go beyond making patents literally “searchable” and instead make patents “reliably” searchable. By “reliable” searching I mean searching that returns all the relevant patents. A safe harbor, if you will, that protects inventors, investors, and implementers from costly suits arising out of the murky woods of traps, intellectual quicksand, and formulaic chants that is the USPTO patent database.
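
As a toy illustration of the gap between “searchable” and “reliably” searchable, consider expanding a lay query with domain terms of art before it hits the patent index. The synonym sets below are invented for the example.

# Query expansion sketch: map lay terms to (invented) terms of art.
SYNONYMS = {
    "drone": ["unmanned aerial vehicle", "UAV", "remotely piloted aircraft"],
    "touchscreen": ["touch-sensitive display", "capacitive sensing panel"],
}

def expand(query: str) -> str:
    terms = [query] + SYNONYMS.get(query.lower(), [])
    return " OR ".join(f'"{t}"' for t in terms)

print(expand("drone"))
# "drone" OR "unmanned aerial vehicle" OR "UAV" OR "remotely piloted aircraft"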

I first saw this in a tweet by Joel Gurin.

Federal Spending Data Elements

Filed under: Government Data,Transparency — Patrick Durusau @ 10:43 am

Federal Spending Data Elements

From the webpage:

The data elements in the below list represent the existing Federal Funding Accountability and Transparency Act (FFATA) data elements currently displayed on USAspending.gov and the additional data elements that will be posted pursuant to the DATA Act. These elements are currently being deliberated on and discussed by the Federal community as a part of DATA Act implementation. At this point, this list is exhaustive. However, additional data elements may be standardized for transparency reporting in the future based on agency or community needs.

Join the Conversation

At this time, we are asking for comments in response to the following questions:

  1. Which data elements are most crucial to your current reporting and/or analysis?
  2. In setting standards, what are industry standards the Treasury and OMB should be considering?
  3. What are some of the considerations that Treasury and OMB should take into account when establishing data standards?

Just reading the responses to the questions on GitHub will give you a sense of what other community members are thinking about.

What responses are you going to contribute?

I first saw this in a tweet by Hudson Hollister.

Frequently updated Machine Learning blogs

Filed under: Machine Learning — Patrick Durusau @ 10:29 am

Frequently updated Machine Learning blogs

From the webpage:

Are you looking for some of the frequently updated Machine Learning blogs to learn what’s happening in the world of Machine Learning and related areas that explore the construction and study of algorithms that can learn from data and make predictions or decisions?

Check out our list.

In the process of searching top frequently updated Machine Learning blogs, we’ve found plenty of Machine Learning blogs on the internet, but shortlisted only those which are active since 2014. If we’ve missed a blog which you think should be included in this list, please let us know.

Here we go…

I count forty-seven (47) blogs listed.

A great starting point if you want to try your hand at crawling blogs on machine learning.
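
As a starting point, a few lines of Python with feedparser will pull recent post titles from any of the blogs that expose RSS or Atom feeds (the feed URLs below are placeholders).

# Fetch recent entries from a list of blog feeds.
import feedparser

FEEDS = [
    "http://example.org/ml-blog-1/feed",
    "http://example.org/ml-blog-2/rss.xml",
]

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries[:5]:
        print(entry.get("title"), "-", entry.get("link"))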

I first saw this in a tweet by Gregory Piatetsky.

LambdaConf 2015 – May 22-25 Boulder CO

Filed under: Conferences,Functional Programming,Functional Reactive Programming (FRP) — Patrick Durusau @ 10:17 am

LambdaConf 2015 – May 22-25 Boulder CO

Early Bird Registration (self payment) ends Feb. 28, 2015

From the webpage:

Ignite your functional programming skills at the second annual LambdaConf — the largest interdisciplinary functional programming conference in the Mountain West.

With more than 40 speakers and two and a half days worth of content, LambdaConf brings together attendees with a diverse set of skills and backgrounds, united by a common passion for the power of functional programming.

Students, researchers, and programming professionals of all persuasions will find relevant topics, including introductory material on functional programming, PL and type theory, industry case studies, language workshops, and library tutorials.

In addition to two and a half days of content, the conference has ample opportunity for networking, including five group meals, one drink social, one group activity (hiking or tea), a half day unconference, and unscheduled time Saturday evening.

A non-final list of presentations:

  • How to Learn Haskell in Less Than 5 Years by Chris Allen
  • The Abstract Method, In General by Gershom Bazerman
  • Make Up Your Own: “Hello World!” by Justin Campbell
  • Why I Like Functional Programming by Adelbert Chang
  • Scalaz-Streams: A Functional Approach to Compositional, Streaming I/O by Derek Chen-Becker
  • HTTP through Functional Programming by Andrew Cherry
  • Reactive Programming with Algebra by André van Delft and Anatoliy Kmetyuk
  • Shipping a Production Web App in Elm by Richard Feldman
  • ooErlang: A Programmer-Friendly Approach to OOP in Erlang by Emiliano Firmino
  • Scalaz 102 – Taking Your Scalaz Usage Up a Notch! by Colt Fredrickson
  • Loom and Functional Graphs in Clojure by Aysylu Greenberg
  • Dynamic vs. Static: Having a Discussion without Sounding Like a Lunatic by David Greenberg
  • The Meaning of LFE by Zeeshan Lakhani
  • What’s New in Scala by Marconi Lanna
  • Idiomatic Scala: Your Options Do Not Match by Marconi Lanna
  • Introducing Emily: Simplifying Functional Programming by Andi McClure
  • Pattern Functors: Wandering Around Fix-points, Free Monads and Generics by Alejandro Serrano Mena
  • Accelerating Haskell: GPGPU Programming with Haskell by Joe Nash
  • Programs as Values: Pure Composable Database Access in Scala by Rob Norris
  • Type Theory and its Meaning Explanations by Jon Sterling
  • A Bird’s Eye View of ClojureScript by Chandu Tennety
  • Building Concurrent, Fault-Tolerant, Scalable Applications in F# using Akka.Net by Riccardo Terrell
  • Fault-Tolerance on the Cheap: Making Systems That (Probably) Won’t Fall Over by Brian L. Troutwine

With more content in the form of lightning talks, workshops, etc.

You have seen languages modifying themselves to become more functional.

Now see languages that are functional!

February 14, 2015

Flow: Actor-based Concurrency with C++ [FoundationDB]

Filed under: C/C++,Concurrent Programming,FoundationDB — Patrick Durusau @ 8:37 pm

Flow: Actor-based Concurrency with C++

From the post:

FoundationDB began with ambitious goals for both high performance per node and scalability. We knew that to achieve these goals we would face serious engineering challenges while developing the FoundationDB core. We’d need to implement efficient asynchronous communicating processes of the sort supported by Erlang or the Async library in .NET, but we’d also need the raw speed and I/O efficiency of C++. Finally, we’d need to perform extensive simulation to engineer for reliability and fault tolerance on large clusters.

To meet these challenges, we developed several new tools, the first of which is Flow, a new programming language that brings actor-based concurrency to C++11. To add this capability, Flow introduces a number of new keywords and control-flow primitives for managing concurrency. Flow is implemented as a compiler which analyzes an asynchronous function (actor) and rewrites it as an object with many different sub-functions that use callbacks to avoid blocking (see streamlinejs for a similar concept using JavaScript). The Flow compiler’s output is normal C++11 code, which is then compiled to a binary using traditional tools. Flow also provides input to our simulation tool, Lithium, which conducts deterministic simulations of the entire system, including its physical interfaces and failure modes. In short, Flow allows efficient concurrency within C++ in a maintainable and extensible manner, achieving all three major engineering goals:

  • high performance (by compiling to native code),
  • actor-based concurrency (for high productivity development),
  • simulation support (for testing).

Flow Availability

Flow is not currently available outside of FoundationDB, but we’d like to open-source it in the future. If you’d like to stay in the loop with our progress subscribe below.

Are you going to be ready when Flow is released separately from FoundationDB?
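
In the meantime, here is a rough analogy, not Flow and not FoundationDB code, of the “straight-line code that never blocks” style Flow compiles down to for C++, written with Python’s asyncio.

# Actor-style concurrency analogy: each actor reads from its mailbox
# without blocking other actors running on the same thread.
import asyncio

async def actor(inbox: asyncio.Queue, name: str) -> None:
    while True:
        msg = await inbox.get()   # suspends here instead of blocking
        if msg is None:
            return
        print(f"{name} got {msg}")

async def main() -> None:
    inbox = asyncio.Queue()
    task = asyncio.create_task(actor(inbox, "worker"))
    for i in range(3):
        await inbox.put(i)
    await inbox.put(None)         # shutdown message
    await task

asyncio.run(main())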

Streets of Paris Colored by Orientation

Filed under: Mapping,Maps,R — Patrick Durusau @ 8:12 pm

Streets of Paris Colored by Orientation by Mathieu Rajerison.

From the post:

Recently, I read an article by datapointed which presented maps of streets of different cities colored by orientation.

The author gave some details about the method, which I tried to reproduce. In this post, I present the different steps from the calculation in my favorite spatial R ToolBox to the rendering in QGIS using a specific blending mode.

An opportunity to practice R and work with maps. More enjoyable than sifting data to find less corrupt politicians.
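
If you want the gist of the method without the R toolchain, the orientation-to-color step looks roughly like this in Python (toy coordinates, not the Paris street data).

# Compute each segment's bearing, fold it into [0, 180), map it to a hue.
import math
import matplotlib.cm as cm

segments = [((0, 0), (1, 0)), ((0, 0), (0, 1)), ((0, 0), (1, 1))]

for (x1, y1), (x2, y2) in segments:
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180  # direction-agnostic
    r, g, b, _ = cm.hsv(angle / 180)  # orientation -> hue
    print(f"{angle:6.1f} deg -> rgb({r:.2f}, {g:.2f}, {b:.2f})")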

I first saw this in a tweet by Caroline Moussy.
