Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 16, 2015

clojure-datascience (Immutability for Auditing)

Filed under: Clojure,Data Science — Patrick Durusau @ 5:56 pm

clojure-datascience

From the webpage:

Resources for the budding Clojure Data Scientist.

Lots of opportunities for contributions!

It occurs to me that immutability is a prerequisite for auditing.

Yes?

If I were the SEC, as in the U.S. Securities and Exchange Commission, and NOT the SEC, as in the Southeastern Conference (sports), I would make immutability a requirement for data systems in the finance industry.

Any mutable change would be presumptive evidence of fraud.
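As a back-of-the-envelope illustration (a minimal Python sketch, nothing from the clojure-datascience repo): with an append-only, hash-chained log, any after-the-fact mutation breaks the chain and is detectable.

```python
import hashlib, json, time

def append_entry(log, record):
    """Append-only, hash-chained audit log: each entry commits to the
    previous entry's hash, so silently editing history is detectable."""
    prev = log[-1]["hash"] if log else ""
    body = {"ts": time.time(), "record": record, "prev": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(dict(body, hash=digest))
    return log

def verify(log):
    """Recompute the chain; any mutated or reordered entry fails the check."""
    prev = ""
    for entry in log:
        body = {"ts": entry["ts"], "record": entry["record"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"account": "X", "amount": 100})
append_entry(log, {"account": "X", "amount": -40})
assert verify(log)
log[0]["record"]["amount"] = 1000000   # a "mutable change"
assert not verify(log)                 # ...and it shows
```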

That would certainly create a lot of jobs in the financial sector for functional programmers. And jailers as well considering the history of the finance industry.

Eye Candy: Spiral Triangle

Filed under: D3,SVG,Visualization — Patrick Durusau @ 5:41 pm

spiral-triangle

Mike Bostock unpacked this impossible gif.

See Spiral Triangle for all its moving glory and code.

Being Thankful IBM is IBM

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:16 pm

There are days when you have to be thankful IBM is IBM. Seriously.

Conversations about threat sharing to improve cybersecurity abound, including a start-from-scratch public threat portal by DHS (if it makes it out of Congress). Seriously? A build-from-scratch, government-bid public security portal?

Then in the midst of all that consternation, uncertainty and doubt, IBM speaks:

IBM Opens Threat Intelligence to Combat Cyber Attacks

IBM (NYSE: IBM) today announced it is making its vast library of security intelligence data available via the IBM X-Force Exchange, a new cyber threat intelligence sharing platform powered by IBM Cloud. This collaborative platform provides access to volumes of actionable IBM and third-party threat data from across the globe, including real-time indicators of live attacks, which can be used to defend against cybercrimes.

The need for trusted threat intelligence is greater than ever, as 80 percent of cyber attacks are driven by highly organized crime rings in which data, tools and expertise are widely shared1. Though hackers have mobilized, their targets have not. A majority (65 percent) of in-house cybersecurity teams use multiple sources of trusted and untrusted external intelligence to fight attackers2.

The X-Force Exchange builds on IBM’s tremendous scale in security intelligence, integrating its powerful portfolio of deep threat research data and technologies like QRadar, thousands of global clients, and acumen of a worldwide network of security analysts and experts from IBM Managed Security Services. Leveraging the open and powerful infrastructure of the cloud, users can collaborate and tap into multiple data sources, including:   

  •  One of the largest and most complete catalogs of vulnerabilities in the world;
  •  Threat information based on monitoring of more than 15 billion monitored security events per day;
  •  Malware threat intelligence from a network of 270 million endpoints;
  •  Threat information based on over 25 billion web pages and images;
  •  Deep intelligence on more than 8 million spam and phishing attacks;
  •  Reputation data on nearly 1 million malicious IP addresses.

Today, the X-Force Exchange features over 700 terabytes of raw aggregated data supplied by IBM. This will continue to grow, be updated and shared as the platform can add up to a thousand malicious indicators every hour. This data includes real-time information which is critical to the battle against cybercrime.

“The IBM X-Force Exchange platform will foster collaboration on a scale necessary to counter the rapidly rising and sophisticated threats that companies are facing from cybercriminals,” said Brendan Hannigan, General Manager, IBM Security. “We’re taking the lead by opening up our own deep and global network of cyberthreat research, customers, technologies and experts. By inviting the industry to join our efforts and share their own intelligence, we’re aiming to accelerate the formation of the networks and relationships we need to fight hackers."

Open, Automated and Social Threat Sharing

Built by IBM Security, the IBM X-Force Exchange is a new, cloud-based platform that allows organizations to easily collaborate on security incidents, as well as benefit from the ongoing contributions of IBM experts and community members. Since the beta launch of the X-Force Exchange, numerous early adopters have joined the community.

By freely consuming, sharing and acting on real-time threat intelligence from their networks and IBM’s own repository of known threat intelligence, users can identify and help stop threats via:

  • A collaborative, social interface to easily interact with and validate information from industry peers, analysts and researchers;
  • Volumes of intelligence from multiple third parties, the depth and breadth of which will continue to grow as the platform’s user base grows;
  • A collections tool to easily organize and annotate findings, bringing priority information to the forefront;
  • Open, web-based access built for security analysts and researchers;
  • A library of APIs to facilitate programmatic queries between the platform, machines and applications; allowing businesses to operationalize threat intelligence and take action.

The link? http://xforce.ibmcloud.com. Bookmark it. I suspect we are all going to be spending time there.

Thanks IBM!

I have just logged in and started to explore XForce. Very cool!
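The press release's mention of an API library is what I will be poking at next. As a purely hypothetical sketch (the endpoint path and token-based authentication below are assumptions for illustration, not the documented X-Force Exchange API), a programmatic query might look something like:

```python
import requests

# Hypothetical sketch: the base URL, endpoint name and bearer-token auth
# are assumptions -- check the X-Force Exchange documentation for the real API.
BASE = "https://xforce.ibmcloud.com/api"

def ip_report(ip_address, token):
    """Fetch a threat report for an IP address (illustrative only)."""
    resp = requests.get("{}/ipr/{}".format(BASE, ip_address),
                        headers={"Authorization": "Bearer " + token})
    resp.raise_for_status()
    return resp.json()

# print(ip_report("198.51.100.7", "YOUR-TOKEN"))
```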

PS: I first saw this in a post without the decency to include a link to the press release or the XForce site. Screw that.

New CSV on the Web Drafts (CSV From RDF?)

Filed under: CSV,W3C — Patrick Durusau @ 3:49 pm

Four Drafts Published by the CSV on the Web Working Group

From the post:

The CSV on the Web Working Group has published a new set of Working Drafts, which the group considers feature complete and implementable.

The group is keen to receive comments on these specifications, either as issues on the Group’s GitHub repository or by posting to public-csv-wg-comments@w3.org.

The CSV on the Web Working Group would also like to invite people to start implementing these specifications and to donate their test cases into the group’s test suite. Building this test suite, as well as responding to comments, will be the group’s focus over the next couple of months.

Learn more about the CSV on the Web Working Group.

If nothing else, Model for Tabular Data and Metadata on the Web represents a start on documenting a largely undocumented data format. Perhaps the most common undocumented data format of all. I say that, but there may be a dissertation or even a published book that has collected all the CSV variants at a point in time. Sing out if you know of such a volume.

Will be interested to see if the group issues a work product entitled: Generating CSV from RDF on the Web. Our return to data opacity will be complete.

Not that I have any objections to CSV: it is compact, easy to parse (well, perhaps not correctly, but I didn’t say that), and widespread. But it isn’t the best format for blind interchange, which by definition includes legacy data. Odd that in a time of practically limitless storage and orders-of-magnitude faster processing, data appears to be seeking less documented semantics.

Or is it that the users and producers of data prefer data with less documented semantics? I knew an office manager once upon a time who produced reports from unshared cheatsheets using R Writer and was loath to share field information with anyone. It seemed like a very sad way to remain employed.

Documenting data semantics isn’t going to obviate the need for the staffs who previously concealed data semantics. Just as data is dynamic, so are its semantics, which will require the same staffs to continually update and document the changing semantics of data. The same staffs for the most part, just with more creative duties.
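For a sense of what documented CSV semantics can look like, here is a minimal sketch loosely modeled on the working group's metadata approach; the key names are illustrative, not the drafts' exact vocabulary.

```python
import csv, io

# Illustrative metadata object describing the columns of a CSV file.
metadata = {
    "url": "sensors.csv",
    "tableSchema": {
        "columns": [
            {"name": "sensor_id", "datatype": "string"},
            {"name": "reading",   "datatype": "decimal", "unit": "celsius"},
            {"name": "taken_at",  "datatype": "datetime"},
        ]
    },
}

data = io.StringIO("sensor_id,reading,taken_at\nA1,21.5,2015-04-16T17:00:00Z\n")
reader = csv.DictReader(data)
expected = [c["name"] for c in metadata["tableSchema"]["columns"]]
assert reader.fieldnames == expected   # the file matches its documented schema
for row in reader:
    print(row["sensor_id"], row["reading"])
```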

Understanding Data Visualisations

Filed under: Graphics,Visualization — Patrick Durusau @ 3:21 pm

Understanding Data Visualisations by Andy Kirk.

From the webpage:

Regular readers will be somewhat aware of my involvement in a research project called ‘Seeing Data’, a 15 month study funded by the UK Arts and Humanities Research Council and led by Professor Helen Kennedy from the University of Sheffield.

The aim of ‘Seeing Data’ was to further our understanding about how people make sense of data visualisations. Through learning about the ways in which people engage with data visualisations our aim was to provide some key resources for the general public, to help them develop the skills they need to interact with visualisations, and also for visualisation designers/producers, to help them understand what matters to the people who view and engage with their visualisations.

We are now concluding our findings and beginning our dissemination of a range of outputs to fulfil our aims.

This looks very promising! Each section leads to a fuller presentation with an opportunity to test yourself at the end of each section.

Will results on visualization in the UK hold true for subjects in other locations? If there are differences, what are they and how are those variances to be understood?

Looking forward to more details on the project!

I first saw this in a tweet by Amanda Hobbs.

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training

Filed under: Bioinformatics,Python — Patrick Durusau @ 1:21 pm

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training by Teresa K. Atwood, et al. (PLOS Published: April 9, 2015 DOI: 10.1371/journal.pcbi.1004143)

Abstract:

In recent years, high-throughput technologies have brought big data to the life sciences. The march of progress has been rapid, leaving in its wake a demand for courses in data analysis, data stewardship, computing fundamentals, etc., a need that universities have not yet been able to satisfy—paradoxically, many are actually closing “niche” bioinformatics courses at a time of critical need. The impact of this is being felt across continents, as many students and early-stage researchers are being left without appropriate skills to manage, analyse, and interpret their data with confidence. This situation has galvanised a group of scientists to address the problems on an international scale. For the first time, bioinformatics educators and trainers across the globe have come together to address common needs, rising above institutional and international boundaries to cooperate in sharing bioinformatics training expertise, experience, and resources, aiming to put ad hoc training practices on a more professional footing for the benefit of all.

Great background on GOBLET, www.mygoblet.org.

One of the functions of GOBLET is to share training materials in bioinformatics and that is well underway. The Training Portal has eighty-nine (89) sets of training materials as of today, ranging from Pathway and Network Analysis 2014 Module 1 – Introduction to Gene Lists to Parsing data records using Python programming and points in between!

If your training materials aren’t represented, perhaps it is time for you to correct that oversight.

Enjoy!

I first saw this in a tweet by Mick Watson.

Wandora – Heads Up! New release 2015-04-20

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 12:42 pm

No details, just saw a tweet about the upcoming release, set for next Monday.

That is also the latest date for the new web search application from DARPA to drop.

Could be the start of a busy week!

April 15, 2015

Google Antitrust Charges: Guilty Until Proven Innocent

Filed under: EU,Law — Patrick Durusau @ 4:56 pm

The EU antitrust charges against Google will be news for some time, so start with the primary sources.

Competition Commissioner Margrethe Vestager

First, the official press release from the European Commission: Antitrust: Commission sends Statement of Objections to Google on comparison shopping service; opens separate formal investigation on Android, which reads in part:

The European Commission has sent a Statement of Objections to Google alleging the company has abused its dominant position in the markets for general internet search services in the European Economic Area (EEA) by systematically favouring its own comparison shopping product in its general search results pages. The Commission’s preliminary view is that such conduct infringes EU antitrust rules because it stifles competition and harms consumers. Sending a Statement of Objections does not prejudge the outcome of the investigation.

EU Commissioner in charge of competition policy Margrethe Vestager said: “The Commission’s objective is to apply EU antitrust rules to ensure that companies operating in Europe, wherever they may be based, do not artificially deny European consumers as wide a choice as possible or stifle innovation”.

“In the case of Google I am concerned that the company has given an unfair advantage to its own comparison shopping service, in breach of EU antitrust rules. Google now has the opportunity to convince the Commission to the contrary.

In the first paragraph, “Sending a Statement of Objections does not prejudge the outcome….” and by the fourth paragraph, “…Google now has the opportunity to convince the Commission to the contrary.”???

That sounds remarkably like “guilty until proven innocent” to me. You?

Can you imagine a judge in a US antitrust trial telling the defendant:

“We are going to have a fair trial and you will have the opportunity to convince me you’re not guilty.”

It’s unfortunate that vendors continue to use the EU as a pawn in efforts to compete with other vendors. It just encourages the EU, with its admittedly Euro-centric view of the world, to attempt to manage activities best left un-managed. Yes, Google is the world leader in search, if you think indexing 5% of the web constitutes leadership. A “leader” that is still wedded to its lemming (page-rank) based ranking algorithm.

Apparently the EU hasn’t noticed that raw search data is now easily available for potential competitors to Google. (You know it as Common Crawl. The link is to a series of my posts on Common Crawl.) The EU is unaware of the ongoing revolution in deep learning, which will make lemming-based ranking passé. (Yes, Google has contributed heavily to that research but research isn’t criminal, at least not yet.) And the very technology for performing Internet searches may be about to change (Darpa/Memex).

Does Google dominate the ad-supported, users-as-end-product, search market? Sure, if you don’t like that, why not create a search service that returns one (1) result, the one that I am looking for? No ads, no selling my information, just returning one useful result. Given the time wasted in a day scrolling through some search engine results, do you see a market for that among professionals?

If I search for pizza, given my IP address and order history, there is only one result that needs to show up. With the number highlighted for calling. Think about all the one result searches you need in a day, week, month. I suppose that doesn’t work for dating services but no one search solution will fit all use cases. Entirely different market from Google, paid for by vendors.

Source documents for your topic map:

Antitrust: Commission probes allegations of antitrust violations by Google (2010)

Antitrust: Commission sends Statement of Objections to Google on comparison shopping service (April 15, 2015)

Antitrust: Commission opens formal investigation against Google in relation to Android mobile operating system

Council Regulation (EC) No 1/2003 of 16 December 2002 on the implementation of the rules on competition laid down in Articles 81 and 82 of the Treaty (Text with EEA relevance) (in English, as of today) The canonical link: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32003R0001.

Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

Filed under: DARPA,Search Analytics,Search Engines,Search Requirements,Uncategorized — Patrick Durusau @ 2:20 pm

Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

Is DARPA’s Memex search engine a Google-killer? by Mark Stockley.

A couple of “while you wait” pieces to read while you expect part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

Fox-Brewster has video of a part of the system:

It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.

If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

Entirely manual indexing is already impossible and you have experienced the shortcomings of page ranking.
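Not Memex itself, but for a concrete sense of the “dynamic content” barrier in the quote above, the standard workaround is to drive a real browser so scripts can run before the page is harvested. A minimal Selenium sketch (the URL is a placeholder):

```python
from selenium import webdriver

# Drive a real browser so JavaScript runs, then harvest the rendered page.
driver = webdriver.Firefox()     # a headless browser works the same way
driver.get("http://example.com/dynamic-page")    # placeholder URL
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
html = driver.page_source        # includes content added after the page loaded
driver.quit()
print(len(html))
```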

Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

Imagine an electronic The Art of Computer Programming with several human experts per chapter, assisted by deep searching, updating references and the text on an ongoing basis, so readers don’t have to weed through all the re-inventions of particular algorithms across numerous computer and math journals.

Or perhaps a more automated search of news reports so the earliest/most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value, more often just content.

BTW, the focus on the capabilities of the search engine, as opposed to the analysis of those results, is most welcome.

See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

Looking forward to Friday or Monday!

How’s Your Belkin Router?

Filed under: Cybersecurity,Security — Patrick Durusau @ 1:09 pm

We TOLD you not to use WPS on your Wi-Fi router! We TOLD you not to knit your own crypto! by Paul Ducklin.

Paul writes a very amusing account of router insecurity that involves crypto tricks that I suspect you can find elsewhere as well.

Like reading critically, security is your responsibility.

If you have the issues Paul describes and you follow his advice you will be less insecure than before you read his post.

Not secure, just less insecure.

PS: Do you know if government offices use Belkin routers? 😉

Multiple Locks On A Front Door? (Adm. Rogers)

Filed under: Cybersecurity,Security — Patrick Durusau @ 12:55 pm

As encryption spreads, U.S. grapples with clash between privacy, security by Ellen Nakashima and Barton Gellman.

From the post:


Recently, the head of the National Security Agency provided a rare hint of what some U.S. officials think might be a technical solution. Why not, suggested Adm. Michael S. Rogers, require technology companies to create a digital key that could open any smartphone or other locked device to obtain text messages or photos, but divide the key into pieces so that no one person or agency alone could decide to use it?

“I don’t want a back door,” Rogers, the director of the nation’s top electronic spy agency, said during a speech at Princeton University, using a tech industry term for covert measures to bypass device security. “I want a front door. And I want the front door to have multiple locks. Big locks.”
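The “divide the key into pieces” idea is essentially secret splitting. A minimal sketch of the simplest (XOR, all-shares-required) variant; real escrow proposals would presumably use threshold schemes and far more machinery:

```python
import os

def xor_bytes(chunks):
    out = bytes(len(chunks[0]))
    for c in chunks:
        out = bytes(a ^ b for a, b in zip(out, c))
    return out

def split_key(key, parts=3):
    """Split key into `parts` shares; every share is needed to rebuild it."""
    shares = [os.urandom(len(key)) for _ in range(parts - 1)]
    shares.append(xor_bytes(shares + [key]))
    return shares

def combine(shares):
    return xor_bytes(shares)

key = os.urandom(32)
shares = split_key(key, parts=3)
assert combine(shares) == key        # all three parties together recover it
assert combine(shares[:2]) != key    # any two alone learn nothing useful
```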

I wanted to point you to the full report by Nakashima and Gellman as opposed to some of the tech news summaries because they provide good background on the history of the encryption issue. Worth a very close read.

What truly puzzles me is why Adm. Rogers would think anyone would trust any key, multi-part or not, to which the government has access?

That’s really the relevant question in a nutshell isn’t it? Setting aside the obvious technical issue of making such a key, trusting all the non-government parties with parts of the key, etc., why would anyone trust the government?

There wasn’t any reason to trust government prior to Edward Snowden but post-Snowden, no sane person should urge trust of the United States government.

Any “front door,” “back door,” whatever access through encryption should be rejected on the basis that there is no reason to trust any government use of such access. It really is that simple.

The imagined cases where access through encryption might be useful are just that, imagined cases. Whereas the cases where law enforcement/intelligence have proven untrustworthy are legion.

Secretive Twitter Censorship Fairy Strikes Again!

Filed under: Censorship,Twitter — Patrick Durusau @ 10:34 am

Twitter shuts down 10,000 ISIS-linked accounts in one day by Lisa Vaas.

From the post:


A Twitter representative on Thursday confirmed to news outlets that its violations department had in fact suspended some 10,000 accounts on one day – 2 April – “for tweeting violent threats”.

The Twitter representative, who spoke on the condition of anonymity, attributed the wave of shutdowns to ISIS opponents who’ve been vigilant in reporting accounts for policy violation:

We received a large amount of reports.

In early March, Twitter acknowledged shutting down at least 2000 ISIS-linked accounts per week in recent months.

Fact 1: Twitter is a private service and can adopt and apply any “terms of service” it chooses in any manner it chooses.

Fact 2: The “abuse” reporting system of Twitter and its lack of transparency, not to mention missing any opportunity for a public hearing and appeal, create the opportunity for and appearance of, arbitrary and capricious application.

Fact 3: The organization sometimes known as ISIS and its supporters have been targeted for suppression of all their communications, whether they violate the “terms of service” of Twitter or not, without notice and a hearing, thereby depriving other Twitter users of the opportunity to hear their views on current subjects of world importance.

Twitter is under no legal obligation to avoid censorship but Twitter should take steps to reduce its role as censor:

Step 1: Twitter should alter its “abuse” policy to provide alleged abusers with notice of the alleged abuse and a reasonable amount of time to respond to the allegation. Both the notice of alleged abuse and the response to the notice shall be and remain public documents hosted by Twitter and indexed under the account alleged to be used for abuse, along with the Twitter resolution described in Step 2.

Step 2: Twitter staff should issue a written statement as to what was found to transgress its “terms of service” so that other users can avoid repeating the alleged “abuse” accidentally.

Step 3: Twitter should adopt a formal “hands-off” policy when it comes to comments by, for or against political entities or issues, including ISIS in particular. What is a “threat” in some countries is not a “threat” in others. Twitter should act as a global citizen and not a parochial organization based in rural Alabama.

I would not visit areas under the control of ISIS even if you offered me a free ticket. Support or non-support of ISIS isn’t the issue.

The issue is whether we will allow private and unregulated entities to control a common marketplace for the interchange of ideas. If Twitter likes an unregulated common marketplace then it had best make sure it maintains a transparent and fair common marketplace. Not one where some people or ideas are second-class citizens who can be arbitrarily silenced, in secret, by unknown Twitter staff.

April 14, 2015

Tips on Digging into Scholarly Research Journals

Filed under: Journalism,News,Research Methods — Patrick Durusau @ 4:42 pm

Tips on Digging into Scholarly Research Journals by Gary Price.

Gary gives a great guide to using JournalTOCs, a free service that provides tables of content and abstracts where available for thousands of academic journals.

Subject to the usual warning about reading critically, academic journals can be a rich source of analysis and data.

Enjoy!

Apache Spark, Now GA on Hortonworks Data Platform

Filed under: Hortonworks,Spark — Patrick Durusau @ 4:29 pm

Apache Spark, Now GA on Hortonworks Data Platform by Vinay Shukla.

From the post:

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP)— now available on our downloads page. With HDP 2.2.4 Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1.

HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Now Spark is one of the many data access engines that works with YARN and that is supported in an HDP enterprise data lake. Spark provides HDP subscribers yet another way to derive value from any data, any application, anywhere.

What more need I say?

Get thee to the downloads page!
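Once you have HDP with Spark installed, a quick smoke test from an edge node might look something like this (a minimal PySpark sketch; the HDFS path, app name and master setting are assumptions for your cluster):

```python
from pyspark import SparkConf, SparkContext

# Assumes a YARN-enabled HDP cluster with Spark 1.2.x and HADOOP_CONF_DIR set.
conf = SparkConf().setAppName("hdp-spark-smoke-test").setMaster("yarn-client")
sc = SparkContext(conf=conf)

counts = (sc.textFile("hdfs:///tmp/sample.txt")     # any small text file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()
```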

₳ustral Blog

Filed under: Sentiment Analysis,Social Graphs,Social Networks,Topic Models (LDA) — Patrick Durusau @ 4:14 pm

₳ustral Blog

From the post:

We’re software developers and entrepreneurs who wondered what Reddit might be able to tell us about our society.

Social network data have revolutionized advertising, brand management, political campaigns, and more. They have also enabled and inspired vast new areas of research in the social and natural sciences.

Traditional social networks like Facebook focus on mostly-private interactions between personal acquaintances, family members, and friends. Broadcast-style social networks like Twitter enable users at “hubs” in the social graph (those with many followers) to disseminate their ideas widely and interact directly with their “followers”. Both traditional and broadcast networks result in explicit social networks as users choose to associate themselves with other users.

Reddit and similar services such as Hacker News are a bit different. On Reddit, users vote for, and comment on, content. The social network that evolves as a result is implied based on interactions rather than explicit.

Another important difference is that, on Reddit, communication between users largely revolves around external topics or issues such as world news, sports teams, or local events. Instead of discussing their own lives, or topics randomly selected by the community, Redditors discuss specific topics (as determined by community voting) in a structured manner.

This is what we’re trying to harness with Project Austral. By combining Reddit stories, comments, and users with technologies like sentiment analysis and topic identification (more to come soon!) we’re hoping to reveal interesting trends and patterns that would otherwise remain hidden.

Please, check it out and let us know what you think!

Bad assumption on my part! Since ₳ustral uses Neo4j to store the Reddit graph, I was expecting a graph-type visualization. If that was intended, that isn’t what I found. 😉

Most of my searching is content oriented and not so much concerned with trends or patterns. An upsurge in hypergraph queries could happen in Reddit, but aside from references to publications and projects, the upsurge itself would be a curiosity to me.

Nothing against trending, patterns, etc. but just not my use case. May be yours.

Attribute-Based Access Control with a graph database [Topic Maps at NIST?]

Filed under: Cybersecurity,Graphs,Neo4j,NIST,Security,Subject Identity,Topic Maps — Patrick Durusau @ 3:25 pm

Attribute-Based Access Control with a graph database by Robin Bramley.

From the post:

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day, or network location come into play. These additional factors, or attributes, require a different approach, the US National Institute of Standards and Technology (NIST) have published a draft special paper (NIST 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.

Before we dive into the detail, it’s probably worth mentioning that I saw the recent GraphGist on Entitlements and Access Control Management and that reminded me to publish my Attribute-Based Access Control GraphGist that I’d written some time ago, originally in a local instance having followed Stefan Armbruster’s post about using Docker for that very purpose.

Using a Property Graph, we can model attributes using relationships and/or properties. Fine-grained relationships without qualifier properties make patterns easier to spot in visualisations and are more performant. For the example provided in the gist, the attributes are defined using solely fine-grained relationships.

Graph visualization (and querying) of attribute-based access control.

I found this portion of the NIST draft particularly interesting:


There are characteristics or attributes of a subject such as name, date of birth, home address, training record, and job function that may, either individually or when combined, comprise a unique identity that distinguishes that person from all others. These characteristics are often called subject attributes. The term subject attributes is used consistently throughout this document.

In the course of a person’s life, he or she may work for different organizations, may act in different roles, and may inherit different privileges tied to those roles. The person may establish different personas for each organization or role and amass different attributes related to each persona. For example, an individual may work for Company A as a gate guard during the week and may work for Company B as a shift manager on the weekend. The subject attributes are different for each persona. Although trained and qualified as a Gate Guard for Company A, while operating in her Company B persona as a shift manager on the weekend she does not have the authority to perform as a Gate Guard for Company B.
…(emphasis in the original)

Clearly NIST recognizes that subjects, at least in the sense of people, are identified by a set of “subject attributes” that uniquely identify that subject. It doesn’t seem like much of a leap to recognize that the same holds for other subjects, including the attributes used to identify subjects.
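To make the persona example concrete, a toy sketch (attribute names invented for illustration): the same person carries different subject attributes under each persona, and the access decision keys off the persona's attributes, not the person's.

```python
# Toy ABAC check based on the NIST gate-guard example; attribute names
# are invented for illustration, not taken from NIST 800-162.
personas = {
    ("Company A", "weekday"): {"person": "J. Doe", "role": "gate guard",
                               "trained_gate_guard": True},
    ("Company B", "weekend"): {"person": "J. Doe", "role": "shift manager",
                               "trained_gate_guard": True},
}

def may_guard_gate(attrs):
    # Authorization depends on the persona's attributes, not the person.
    return attrs.get("role") == "gate guard" and attrs.get("trained_gate_guard")

print(may_guard_gate(personas[("Company A", "weekday")]))   # True
print(may_guard_gate(personas[("Company B", "weekend")]))   # False
```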

I don’t know what other US government agencies have similar language but it sounds like a starting point for a robust discussion of topic maps and their advantages.

Yes?

Most misinformation inserted into Wikipedia may persist [Read Responsibly]

Filed under: Skepticism,Wikipedia — Patrick Durusau @ 2:57 pm

Experiment concludes: Most misinformation inserted into Wikipedia may persist by Gregory Kohs.

A months-long experiment to deliberately insert misinformation into thirty different Wikipedia articles has been brought to an end, and the results may surprise you. In 63% of cases, the phony information persisted not for minutes or hours, but for weeks and months. Have you ever heard of Ecuadorian students dressed in formal three-piece suits, leading hiking tours of the Galapagos Islands? Did you know that during the testing of one of the first machines to make paper bags, two thumbs and a toe were lost to the cutting blade? And would it surprise you to learn that pain from inflammation is caused by the human body’s release of rhyolite, an igneous, volcanic rock?

None of these are true, but Wikipedia has been presenting these “facts” as truth now for more than six weeks. And the misinformation isn’t buried on seldom-viewed pages, either. Those three howlers alone have been viewed by over 125,000 Wikipedia readers thus far.

The second craziest thing of all may be that when I sought to roll back the damage I had caused Wikipedia, after fixing eight of the thirty articles, my User account was blocked by a site administrator. The most bizarre thing is what happened next: another editor set himself to work restoring the falsehoods, following the theory that a blocked editor’s edits must be reverted on sight.

Alex Brown tweeted this story along with the comment:

Wikipedia’s purported “self-correcting” prowess is more myth than reality

True, but not to pick on Wikipedia, the same is true for the benefits of peer review in general. A cursory survey of the posts at Retraction Watch will leave you wondering what peer reviewers are doing because it certainly isn’t reading assigned papers. At least not closely.

For historical references on peer review, see: Three myths about scientific peer review by Michael Nielsen.

Peer review is also used in grant processes, prompting the Wall Street Journal to call for lotteries to award NIH grants.

There are literally hundreds of other sources and accounts that demonstrate whatever functions peer review may have, quality assurance isn’t one of them. I suspect “gate keeping,” by academics who are only “gate keepers,” is its primary function.

The common thread running through all of these accounts is that you and only you can choose to read responsibly.

As a reader: Read critically! Do the statements in an article, post, etc., fit with what you know about the subject? Or with general experience? What sources did the author cite? Simply citing Pompous Work I does not mean Pompous Work I said anything about the subject. Check the citations by reading the citations. (You will be very surprised in some cases.) After doing your homework, if you still have doubts, such as with reported experiments, contact the author and explain what you have done thus far and your questions (nicely).

Even agreement between Pompous Work I and the author doesn’t mean you don’t have a good question. Pompous works are corrected year in and year out.

As an author: Do not cite papers you have not read. Do not cite papers because another author said a paper said. Verify your citations do exist and that they in fact support your claims. Post all of your data publicly. (No caveats, claims without supporting evidence are simply noise.)

Hash Table Performance in R: Part I + Part 2

Filed under: Hashing,R — Patrick Durusau @ 10:53 am

Hash Table Performance in R: Part I + Part 2 by Jeffrey Horner.

From part 1:

A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see with all of these options their performance is compromised in some way. In the average case a hash table lookup for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time, n being the number of elements in the hash table.

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table.

This rocks! Talk about performance increases!

My current Twitter client doesn’t dedupe my home feed and certainly doesn’t dedupe it against search-based feeds. I’m not so concerned with retweets as with authors that repeat the same tweet several times in a row. What I don’t know is what period of uniqueness would be best. I will have to experiment with that.
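Sketching the dedupe idea (in Python rather than R, purely for illustration): keep a hash table of recently seen tweet texts and make the "period of uniqueness" a tunable window.

```python
import time

class RecentlySeen:
    """Suppress repeats of the same tweet text within `window` seconds."""
    def __init__(self, window=3600):
        self.window = window
        self.seen = {}                     # text -> time last shown

    def is_fresh(self, text, now=None):
        now = time.time() if now is None else now
        # drop expired entries so the table does not grow without bound
        self.seen = {t: ts for t, ts in self.seen.items()
                     if now - ts < self.window}
        if text in self.seen:
            return False
        self.seen[text] = now
        return True

feed = RecentlySeen(window=1800)          # experiment with the window here
print(feed.is_fresh("same tweet text"))   # True
print(feed.is_fresh("same tweet text"))   # False
```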

I originally saw this series at Hash Table Performance in R: Part II (“In Part I of this series, I explained how R hashed…”) on R-Bloggers, the source of so much excellent R-related content.

Phishing catches victims ‘in minutes’ [Verification and the BBC]

Filed under: Journalism,News,Reporting — Patrick Durusau @ 10:26 am

Phishing catches victims ‘in minutes’

From the post:

It takes 82 seconds for cyber-thieves to ensnare the first victim of a phishing campaign, a report suggests.

Compiled by Verizon, the report looks at analyses of almost 80,000 security incidents that hit thousands of companies in 2014.

It found that, in many companies, about 25% of those who received a phishing email were likely to open it.

“Training your employees is a critical element of combating this threat,” said Bob Rudis, lead author on the report.

Threat spotting

Tricking people into opening a booby-trapped message let attackers grab login credentials that could be used to trespass on a network and steal data, the report said.

“They do not have to use complex software exploits, because often they can get hold of legitimate credentials,” Mr Rudis said.
…(emphasis in original)

You might be tempted to quote this story on phishing but I wouldn’t. Not without looking further.

When I read “…a report suggests…,” without a link to the report, all sorts of alarms start ringing. If there is such a report, why no link? Is the author fearful the report isn’t as lurid as their retelling? Or fearful that readers might reach their own conclusions? And for that matter, despite being “lead author” of this alleged report, who the hell is Bob Rudis? Not quite in the same class as Prince or the Queen of England.

None of which is hard to fix:

Verizon 2015 PCI Compliance Report

Bob Rudis took a little more effort but not much: Bob Rudis (Twitter), not to mention co-author of Data-Driven Security: Analysis, Visualization and Dashboards (review). The effort is repaid by finding an R blogger and author of a recent security analysis text.

When you read the report, to which the BBC provides no link, you discover things like:

Incentives (none) to prevent payment fraud:

Page 4: The annual cost of payment fraud in 2014 was $14 Billion.

Then Page 5 gives the lack of incentive to combat the $14 Billion in fraud: total card payments are expected to reach $20 Trillion.

In other words:

20,000,000,000,000 – 14,000,000,000 = 19,986,000,000,000

Hardly even a rounding error.

BTW, the quote that caught my eye:

More than 99% of the vulnerabilities exploited in data breaches had been known about for more than a year, Mr Rudis said. And some had been around for a decade.

Doesn’t occur in the Verizon report, so one assumes an interview with Mr. Rudis.

Moreover, it is a good illustration for why a history of exploits may be as valuable if not more so than the latest exploit.

None of that was particularly difficult but it enriches the original content with links that may be useful to readers. What’s the point of hypertext without hyperlinks?

Importing the Hacker News Interest Graph

Filed under: Graphs,Neo4j — Patrick Durusau @ 6:58 am

Importing the Hacker News Interest Graph by Max De Marzi.

From the post:

Graphs are everywhere. Think about the computer networks that allow you to read this sentence, the road or train networks that get you to work, the social network that surrounds you and the interest graph that holds your attention. Everywhere you look, graphs. If you manage to look somewhere and you don’t see a graph, then you may be looking at an opportunity to build one. Today we are going to do just that. We are going to make use of the new Neo4j Import tool to build a graph of the things that interest Hacker News.

The basic premise of Hacker News is that people post a link to a Story, people read it, and comment on what they read and the comments of other people. We could try to extract a Social Graph of people who interact with each other, but that wouldn’t be super helpful. Want to know the latest comment Patio11 made? You’ll have to find their profile and the threads they participated in. Unlike Facebook, Twitter or other social networks, Hacker News is an open forum.

A good starting point if you want to build a graph of Hacker News.
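As a rough sketch of the shape of the input Max builds (field names and headers here are illustrative; follow the post for the real layout), the import tool consumes node and relationship CSV files along these lines:

```python
import csv

# Illustrative only: a couple of Hacker News items shaped into node and
# relationship CSVs of the kind the Neo4j import tool consumes.
items = [
    {"id": 1, "kind": "Story",   "by": "patio11", "title": "A Story", "parent": None},
    {"id": 2, "kind": "Comment", "by": "someone", "title": "",        "parent": 1},
]

with open("nodes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id:ID", "title", "by", ":LABEL"])
    for it in items:
        w.writerow([it["id"], it["title"], it["by"], it["kind"]])

with open("rels.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow([":START_ID", ":END_ID", ":TYPE"])
    for it in items:
        if it["parent"] is not None:
            w.writerow([it["id"], it["parent"], "REPLY_TO"])
```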

Enjoy!

The Forgotten V of Data – Verification

Filed under: Journalism,News,Reporting — Patrick Durusau @ 6:40 am

Tools for verifying and assessing the validity of social media and user-generated content by Josh Stearns.

From the post:

“Interesting if true” is the old line about some tidbit of unverified news. Recast as “Whoa, if true” for the Twitter age, it allows people to pass on rumors without having to perform even the most basic fact-checking — the equivalent of a whisper over a quick lunch. Working journalists don’t have such luxuries, however, even with the continuous deadlines of a much larger and more competitive media landscape. A cautionary tale was the February 2015 report of the death of billionaire Martin Bouygues, head of a French media conglomerate. The news was instantly echoed across the Web, only to be swiftly retracted: The mayor of the village next to Bouygues’s hometown said that “Martin” had died. Alas, it was the wrong one.

The issue has become even knottier in the era of collaborative journalism, when nonprofessional reporting and images can be included in mainstream coverage. The information can be crucial — but it also can be wrong, and even intentionally faked. For example, two European publications, Bild and Paris Match, said they had seen a video purportedly shot within the Germanwings flight that crashed in March 2015, but doubts about such a video’s authenticity have grown. (Of course, there is a long history of image tampering, and news organizations have been culpable year after year of running — and even producing — manipulated images.)

The speed of social media and the sheer volume of user-generated content make fact-checking by reporters even more important now. Thankfully, a wide variety of digital tools have been developed to help journalists check facts quickly. This post was adapted from VerificationJunkie, a directory of tools for assessing the validity of social-media and user-generated content. The author is Josh Stearns, director of the journalism sustainability project at the Geraldine R. Dodge Foundation.

The big data crowd added veracity as a fourth V some time ago but veracity isn’t the same thing as verification. Veracity is a question of how much credence you give the data. Verification is the process of determining the veracity of the data. Different activities with different tools.

Josh also maintains Verification Junkie, of which this post is a quick summary.

Don’t limit verification to social media only. Whatever the source, check the “facts” that it claims. You may be surprised.

New Non-Meaningful NoSQL Benchmark

Filed under: Benchmarks,NoSQL — Patrick Durusau @ 5:57 am

New NoSQL benchmark: Cassandra, MongoDB, HBase, Couchbase by Jon Jensen.

From the post:

Today we are pleased to announce the results of a new NoSQL benchmark we did to compare scale-out performance of Apache Cassandra, MongoDB, Apache HBase, and Couchbase. This represents work done over 8 months by Josh Williams, and was commissioned by DataStax as an update to a similar 3-way NoSQL benchmark we did two years ago.

If you can guess the NoSQL database used by DataStax, then you already know the results of the benchmark test.

Amazing how that works, isn’t it? I can’t think of a single benchmark test sponsored by a vendor that shows a technology option, other than their own, would be the better choice.

Technology vendors aren’t like Progressive where you can get competing quotes for automobile insurance.

Technology vendors are convinced that with just enough effort, your problem can be tamed to be met by their solution.

I won’t bother to list the one hundred and forty odd (140+) NoSQL databases that did not appear in this benchmark or use cases that would challenge the strengths and weaknesses of each one. Unless benchmarking is one of your use cases, ask vendors for performance characteristics based on your use cases. You will be less likely to be disappointed.

April 13, 2015

Selection bias and bombers

Filed under: Modeling,Statistics — Patrick Durusau @ 2:50 pm

Selection bias and bombers

John D. Cook didn’t just recently start having interesting opinions! This is a post from 2008 that starts:

During WWII, statistician Abraham Wald was asked to help the British decide where to add armor to their bombers. After analyzing the records, he recommended adding more armor to the places where there was no damage!

A great story of how the best evidence may not be right in front of us.

Enjoy!

27 Free Data Mining Books (with comments and a question)

Filed under: Data Mining — Patrick Durusau @ 2:41 pm

27 Free Data Mining Books

From the post:

As you know, here at DataOnFocus we love to share information, specially about data sciences and related subjects. And what is one of the best ways to learn about a specific topic? Reading a book about it, and then practice with the fresh knowledge you acquired.

And what is better than increase your knowledge by studying a high quality book about a subject you like? It’s reading it for free! So we did some work and created an epic list of absolutelly free books on data related subjects, from which you can learn a lot and become an expert. Be aware that these are complex subjects and some require some previous knowledge.

Some comments on the books:

Caution:

Machine Learning – Wikipedia Guide

A great resource provided by Wikipedia assembling a lot of machine learning in a simple, yet very useful and complete guide.

is failing to compile with this message:

Generation of the document file has failed.

Status: Rendering process died with non zero code: 1

One possible source of the error is that the collection is greater than 500 articles (577 to be exact), which no doubt pushes it beyond 800 pages (another rumored limitation).

If I create sub-sections that successfully render I will post a note about it.

Warning:

Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More (link omitted)

The exploration of social web data is explained on this book. Data capture from the social media apps, it’s manipulation and the final visualization tools are the focus of this resource.

This site gives fake virus warnings along with choices. Bail as soon as you see them. Or better yet, miss this site altogether. The materials offered are under copyright.

That’s the thing that government and corporation officials don’t realize about lying. If they are willing to lie for “our” side, then they are most certainly willing to lie to you and me. The same holds true for thieves.

Broken Link:

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (This is the broken link, I don’t have a replacement.)

A data mining book oriented specifically to marketing and business management. With great case studies in order to understand how to apply these techniques on the real world.


Assuming you limit yourself to the legally available materials, there are several thousand pages of materials, all of which are relevant to some aspect of data mining or another.

Each of these works covers material where new techniques have emerged since their publication.

This isn’t big data, there being only twenty-three (23) volumes if you exclude the three (one noted in the listing) with broken links and the illegal O’Reilly material.

Where would you start with organizing this collection of “small data?”

An experimental bird migration visualization

Filed under: CartoDB,Cartography,Mapping,Science — Patrick Durusau @ 9:36 am

Time Integrated Multi-Altitude Migration Patterns by Wouter Van den Broeck, Jan Klaas Van Den Meersche, Kyle Horton, and Sérgio Branco.

From the webpage:

The Problem

Every year hundreds of millions of birds migrate to and from their wintering and breeding grounds, often traveling hundreds, if not thousands of kilometers twice a year. Many of these individuals make concentrated movements under the cover of darkness, and often at high altitudes, making it exceedingly difficult to precisely monitor the passage of these animals.

However one tool, radar, has the ability to measure the mass flow of migrants both day and night at a temporal and spatial resolution that cannot be matched by any other monitoring tool. Weather surveillance radars such as those of the EUMETNET/OPERA and NEXRAD networks continually monitor and collect data in real-time, monitoring meteorological phenomena, but also biological scatters (birds, bats, and insects). For this reason radar offers a unique tool for collecting large-scale data on biological movements. However, visualizing these data in a comprehensive manner that facilitates insight acquisition, remains a challenge.

Our contribution

To help tackle this challenge, the European Network for the Radar Surveillance of Animal Movement (ENRAM) organized the Bird Migration Visualization Challenge & Hackathon in March 2015 with the support of the European Cooperation in Science and Technology (COST) programme. We participated and explored a particular approach.

Using radar measures of bioscatter (birds, bats, and insects), algorithms can estimate the density, speed, and direction of migration movement at different altitudes around a radar. By interpolating these data both spatially and temporally, and mapping these geographically in the form of flow lines, a visualization might be obtained that offers insights in the migration patterns when applied to a large-scale dataset. The result is an experimental interactive web-based visualization that dynamically loads data from the given case study served by the CartoDB system.

Impressive work with both static and interactive visualizations!

Enjoy!

Modern Methods for Sentiment Analysis

Filed under: Natural Language Processing,Sentiment Analysis — Patrick Durusau @ 9:15 am

Modern Methods for Sentiment Analysis by Michael Czerny.

From the post:

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”….

Great discussion of Word2Vec and Doc2Vec, along with worked examples of both as well as analyzing sentiment in Emoji tweets.

Another limitation of the +1 / -1 approach is that human sentiments are rarely that sharply defined. Moreover, however strong or weak the “likes” or “dislikes” of a group of users, they are all collapsed into one score.
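The dictionary approach from the post fits in a few lines, and the “not good” blind spot shows up immediately (the word list here is just a toy):

```python
# Toy word-score dictionary; context and negation are ignored entirely.
lexicon = {"good": 1, "great": 1, "bad": -1, "terrible": -1, "not": -1}

def naive_score(text):
    return sum(lexicon.get(tok.strip(".,!?").lower(), 0) for tok in text.split())

print(naive_score("The food was good"))        #  1
print(naive_score("The food was not good"))    #  0, though a human reads it as negative
print(naive_score("Terrible, just terrible"))  # -2
```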

Be mindful that modeling is a lossy process.

April 12, 2015

Do You Have An Obligation To Rat Out Your Friends?

Filed under: Law,Politics,Security — Patrick Durusau @ 7:14 pm

Along with arresting a mentally ill person that the FBI assisted at every step of the way in creating a fake car bomb for Fort Riley, the FBI has conjured out of thin air an obligation to rat out anyone you know or suspect may be about to commit a federal crime.

The FBI wants a nation of informers to pad its files with reports. We all know how well that worked in East Germany. Why not in the United States? A toxic brew of suspicion and distrust, of everyone. Your family, in-laws, children, acquaintances at work, etc.

I don’t have an explanation for why the FBI wants such a social policy, but I can point to a case where they argue for it. Take a look at the complaint filed on April 10, 2015, against Alexander E. Blair.

Blair was charged with violating Title 18, United States Code, Section 4, misprision of felony.

Whoever, having knowledge of the actual commission of a felony cognizable by a court of the United States, conceals and does not as soon as possible make known the same to some judge or other person in civil or military authority under the United States, shall be fined under this title or imprisoned not more than three years, or both.

Joseph Broadbent explains misprision of felony (the following is not legal advice; for legal advice contact an attorney) as follows:

“Misprision of felony” is a crime that occurs when someone knows a felony has been committed, but fails to inform the authorities about it. The crime originated in English common law and required that citizens report crimes or face criminal prosecution. (Common law is law originating from custom and court decisions rather than statutes.)

Due to the harshness of imprisoning people merely for failing to report a crime, most states chose not to include misprision of felony in their criminal laws. Instead, conduct that would fit the misprision definition is covered by other laws, such as those dealing with accomplice liability.

Federal Law

First enacted into U.S. law in 1789, misprision of a felony in the federal system is a felony punishable by a fine and up to three years in prison. The common-law rule criminalized simply knowing about a felony and not notifying the authorities. But contemporary federal law also requires that the defendant take some affirmative act to conceal the felony. The crime has four elements:

  • a completed felony
  • the defendant knowing about the felony’s commission
  • the defendant failing to notify a proper law enforcement authority, and
  • the defendant taking some affirmative step to conceal the felony.

(18 U.S.C. §4.)

Typical acts of concealment include making false statements, hiding evidence, and harboring the felon. Whether someone’s actions amount to concealment is for the jury to decide.

Suppose Marty knows his neighbor, Biff, is growing marijuana. Marty wouldn’t be guilty of federal misprision simply for remaining silent. But if he lies to the police about Biff’s growing, he’s committed the crime.

Although the crime has a broad definition, misprision prosecutions are uncommon. Prosecutors usually reserve misprision charges for people with special duties to report crimes, such as prison guards and elected officials. That said, nothing in the statute’s language limits it to such cases. The authorities might invoke it for certain types of crimes where the government wants to encourage reporting, like treason and terrorism.

The FBI had these facts about Alexander E. Blair:

…agents contacted and interviewed Blair immediately after Booker’s arrest on April 10, 2015. During the interview, Blair admitted that he knew about Booker’s plan to detonate the VBIED. He further stated that he knew Booker believed he (Booker) was acting on behalf of ISIL; he knew Booker was gathering materials for constructing the VBIED; he knew Booker intended to deliver the device onto Fort Riley; and, he knew that Booker planned to kill as many soldiers as possible. Blair admitted to agents that he loaned money to Booker for rental of the storage unit, knowing that the unit would be used to store and construct the VBIED. Blair also advised agents that he urged Booker to cease talking openly about his intentions to conduct an attack for fear of attracting public attention and being reported to law enforcement. Blair told agents that he believed he had in fact been recently put under law enforcement surveillance. Finally, Blair told agents that he believed Booker would carry out the attack but chose not to alert authorities and report Booker’s actions.

Knowing the elements of Title 18, Section 4, let’s see how those compare to the complaint:

  • a completed felony
  • the defendant knowing about the felony’s commission
  • the defendant failing to notify a proper law enforcement authority, and
  • the defendant taking some affirmative step to conceal the felony. Oops!

No lying to the FBI, no attempt to conceal the felony, no misprision of felony.

A first year law student could have worked that out without any prompting.

It is fair to note that loaning money to someone in furtherance of the commission of a criminal act generates other questions of criminal liability, but it isn’t misprision of felony.

I suspect the real reason the FBI keeps assisting mentally ill people and big talkers with terrorist activities is that it can’t find enough real terrorists in the United States. Rather than simply admit that terrorism as a domestic crime is as rare as crimes get, the FBI manufactures terrorist plots so it can ask for more anti-terrorist funding.

There was Oklahoma City, 9/11, the Boston Marathon, Olympic Park in Atlanta, the New York guy who set his car on fire. So what, five (5) in twenty years? Can you imagine if there were only five murders, five rapes, or five armed robberies in twenty years?

On one hand, don’t tell me you are about to commit a felony but on the other, let’s not become East Germany. OK?

US, Chile to ‘officially’ kick off LSST construction

Filed under: Astroinformatics,BigData — Patrick Durusau @ 5:01 pm

US, Chile to ‘officially’ kick off LSST construction

From the post:

From distant exploding supernovae and nearby asteroids to the mysteries of dark matter, the Large Synoptic Survey Telescope (LSST) promises to survey the night skies and provide data to solve the universe’s biggest mysteries. On April 14, news media are invited to join the U.S. National Science Foundation (NSF), the U.S. Department of Energy (DoE) and other public-private partners as they gather outside La Serena, Chile, to “officially” launch LSST’s construction in a traditional Chilean stone-laying ceremony.

LSST is an 8.4-meter, wide-field survey telescope that will image the entire visible sky a few times a week for 10 years. It is located in Cerro Pachón, a mountain peak in northern Chile, chosen for its clear air, low levels of light pollution and dry climate. Using a 3-billion pixel camera–the largest digital camera in the world–and a unique three-mirror construction, it will allow scientists to see a vast swath of sky, previously impervious to study.

The compact construction of LSST will enable rapid movement, allowing the camera to observe fleeting, rare astronomical events. It will detect and catalogue billions of objects in the universe, monitoring them over time and will provide this data–more than 30 terabytes each night–to astronomers, astrophysicists and the interested public around the world. Additionally, the digital camera will shed light on dark energy, which scientists have determined is accelerating the universe’s expansion. It will probe further into the mystery of dark energy, creating a unique dataset of billions of galaxies.

It’s not coming online tomorrow (first light in 2019 and full operation in 2022), but it’s not too early to start thinking about how to process such a flood of data. Astronomers have been working on those issues for some time, so if you are looking for new ways to think about processing data, don’t forget to check with the astronomy department.

Even by today’s standards, thirty (30) terabytes of data a night is a lot of data.

Enjoy!

The State of Probabilistic Programming

Filed under: Probabilistic Programming,Probalistic Models,Programming — Patrick Durusau @ 4:47 pm

The State of Probabilistic Programming by Mohammed AlQuraishi.

From the post:

For two weeks last July, I cocooned myself in a hotel in Portland, OR, living and breathing probabilistic programming as a “student” in the probabilistic programming summer school run by DARPA. The school is part of the broader DARPA program on Probabilistic Programming for Advanced Machine Learning (PPAML), which has resulted in a great infusion of energy (and funding) into the probabilistic programming space. Last year was the inaugural one for the summer school, one that is meant to introduce and disseminate the languages and tools being developed to the broader scientific and technology communities. The school was graciously hosted by Galois Inc., which did a terrific job of organizing the event. Thankfully, they’re hosting the summer school again this year (there’s still time to apply!), which made me think that now is a good time to reflect on last year’s program and provide a snapshot of the state of the field. I will also take some liberty in prognosticating on the future of this space. Note that I am by no means a probabilistic programming expert, merely a curious outsider with a problem or two to solve.

A very engaging introduction to probabilistic programming and current work in the field.

It has the added advantage of addressing a subject of interest to a large investor with lots of cash (DARPA). You may have heard of another of their projects, ARPANET, predecessor to the Internet.

Research Reports by U.S. Congress and UK House of Commons

Filed under: Government,Government Data,Research Methods — Patrick Durusau @ 4:27 pm

Research Reports by U.S. Congress and UK House of Commons by Gary Price.

Gary’s post covers the Congressional Research Service (CRS) (US) and the House of Commons Library Research Service (UK).

Truly amazing, I know, for an open and transparent government like the United States Government, but CRS reports are not routinely made available to the public, so we have to rely on the kindness of strangers to make them available. Gary reports:

The good news is that Steven Aftergood, director of the Government Secrecy Project at the Federation of American Scientists (FAS), gets ahold of many of these reports and shares them on the FAS website.

The House of Commons Library Research Service appears to not mind officially sharing its research with anyone with web access.

Unlike some government agencies and publications, the CRS and LRS enjoy reputations for high quality scholarship and accuracy. You still need to evaluate their conclusions and the evidence cited or not, but outright deception and falsehood aren’t part of their traditions.
