Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 4, 2015

“The ultimate goal is evidence-based data analysis”

Filed under: Science,Statistics — Patrick Durusau @ 1:30 pm

Statistics: P values are just the tip of the iceberg by Jeffrey T. Leek & Roger D. Peng.

From the summary:

Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one, say Jeffrey T. Leek and Roger D. Peng.

From the post:


Leek and Peng are right but I would shy away from ever claiming “…evidence-based data analysis.”

You can disclose the choices you make at every stage of the data pipeline but the result isn’t “…evidence-based data analysis.”

I say that because “…evidence-based data analysis” implies that whatever the result, human agency wasn’t a factor in it. On the contrary, an ineffable part of human judgement is a part of every data analysis.

The purpose of documenting the details of each step is to enable discussion and debate about the choices made in the process.

Just as I object to politicians wrapping themselves in national flags, I equally object to anyone wrapping themselves in “evidence/facts” as though they and only they possess them.

Montage Mosaics The Pillars Of Creation!

Filed under: Astroinformatics — Patrick Durusau @ 10:56 am

Montage Mosaics The Pillars Of Creation!

From the post:

The Pillars of Creation in the Eagle Nebula (M16) remain one of the iconic images of the Hubble Space Telescope. Three pillars rise from a molecular cloud into an enormous HII region, powered by the massive young cluster NGC 6611. Such pillars are common in regions of massive star formation, where they form as a result of ionization and stellar winds.

In a paper that will shortly be published in MNRAS, entitled “The Pillars of Creation revisited with MUSE: gas kinematics and high-mass stellar feedback traced by optical spectroscopy,” McLeod et al. (2015) analyze new data acquired with the Multi Unit Spectroscopic Explorer (MUSE) instrument on the VLT. They used Montage to create integrated line maps of the single pointings obtained at the telescope. The figure below shows an example of these maps:


The images were too spectacular to pass without reposting.

Also a reminder that “national security” posturing has all the significance of a peacock spreading its feathers. Of interest to other peacocks, possibly female ones, not of much interest to anyone else.


It’s too bad that Hieronymus Bosch isn’t still around. I can easily imagine him painting “The Garden of Paranoid Delights,” for the security establishment.

Association Discovery Beyond Ten Variables – Prioritized Chordalysis

Filed under: Chordalysis — Patrick Durusau @ 10:29 am

Scaling log-linear analysis to datasets with thousands of variables by François Petitjean and Geoffrey I. Webb.


Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

The authors reduce the number of edges required to be examined in one example from 10,000,000 to 10,000, with corresponding savings in computation time. That was not an artifact of the data set but has been generalized by the authors and released as open source code: Chordalysis (GitHub).

If you prefer numbers, the analysis of a data set with 10,000,000 edges went from 39 hours to 27 seconds, a speedup of more than 5200X.
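The chordality requirement in the abstract is easy to illustrate on a toy graph. What follows is a minimal sketch, not the authors' Chordalysis code: it tests chordality naively by removing simplicial vertices one at a time, then asks whether adding a candidate edge would keep the graph chordal. The function names `is_chordal` and `keeps_chordal` and the four-node path are my own assumptions for illustration, and nothing here reflects the edge-pruning results the paper proves.

```python
from itertools import combinations


def is_chordal(adj):
    """Return True if the undirected graph is chordal.

    `adj` maps each node to the set of its neighbours (no self-loops).
    Naive test: a graph is chordal iff its vertices can be removed one at a
    time, each being simplicial (its remaining neighbours form a clique).
    """
    # Work on a copy so the caller's graph is left untouched.
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        for v, ns in g.items():
            if all(b in g[a] for a, b in combinations(ns, 2)):
                break  # v is simplicial
        else:
            return False  # no simplicial vertex: a chordless cycle remains
        for n in g[v]:
            g[n].discard(v)
        del g[v]
    return True


def keeps_chordal(adj, a, b):
    """Would adding the edge (a, b) leave the graph chordal?"""
    trial = {v: set(ns) for v, ns in adj.items()}
    trial[a].add(b)
    trial[b].add(a)
    return is_chordal(trial)


# Toy graph (hypothetical): the path A - B - C - D, chordal because it has no cycles.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

print(keeps_chordal(path, "A", "C"))  # adds a triangle A-B-C
print(keeps_chordal(path, "A", "D"))  # closes a 4-cycle with no chord
```

Re-testing the whole graph for every candidate edge is exactly the kind of brute force that does not scale, which is why the paper's result about examining only a small subset of edges at each step matters.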

Definitely an addition to your data mining toolkit!

Distributed Machine Learning with Apache Mahout

Filed under: Machine Learning,Mahout,Spark — Patrick Durusau @ 9:51 am

Distributed Machine Learning with Apache Mahout by Ian Pointer and Dr. Ir. Linda Terlouw.

The Refcard for Mahout takes a different approach from many other DZone Refcards.

Instead of a plethora of switches and commands, it covers two basic tasks:

  • Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR
  • Running a recommendation engine on a standalone Spark cluster

Different style from the usual Refcard but a welcome addition to the documentation available for Apache Mahout!


May 3, 2015

How fear and self-preservation are driving a cyber arms race disaster

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:53 pm

How fear and self-preservation are driving a cyber arms race by Max Taves.

From the post:

When a man was fired from his job in Minneapolis, Minn., last May, he inadvertently touched off a boom in Silicon Valley.

Gregg Steinhafel, then a 35-year veteran of Target and its CEO, was shown the door after hackers infiltrated the retailer’s computer systems, stealing 70 million shoppers’ information and 40 million credit and debit card numbers. It turned out the hack might have been prevented, had the company not ignored warnings from its own security systems.

It happened again in December, when Amy Pascal, one of the most powerful women in Hollywood, was fired from her job heading up Sony Pictures after hackers exposed thousands of financial documents and emails revealing the film studio’s inner secrets. The hack captured the world’s attention and elicited criticism from customers, industry leaders and even the president of the United States.

Pascal’s and Steinhafel’s exits sent shockwaves through corporate America. The message was clear: Top executives will be held responsible for their companies’ cybersecurity failings.

The result, venture capitalists say, has been a boom for cybersecurity startups. In ways that previous attacks on consumers never did, the firings have sparked a scramble for new security technology by companies desperate to head off the next costly, embarrassing cyberattack. And venture capitalists are responding, pouring unprecedented billions into a dizzying array of young companies and their, largely, untested products.

Last year, these companies received an aggregate $2.39 billion in funding, a 35 percent increase over 2013, according to venture capital data firm CB Insights. That’s the most money that’s been funneled into cybersecurity companies ever. Silicon Valley is betting companies have woken up to the real dangers of living in the Internet age.

(emphasis added)

Wait! Do you remember the graphic for point-of-sale systems?


The security faults of these systems are in software.

So, $2.39 billion is being invested in software (which will have vulnerabilities) to sit on top of already vulnerable systems.

Somehow, that fails to fill me with warm fuzzies.

Funding research on better software engineering techniques, research on and adoption of standard software practices, funding dissemination of security research and information, etc., would all be positive contributions to improving computer security.

Using techniques known to produce vulnerable software and expecting an improvement in security is, by definition, insanity.

Advisers to venture capitalists need to check their E&O policies before advising clients to invest in security software.

Debating Public Policy, On The Basis of Fictions

Filed under: Cybersecurity,Government,NSA,Security,Uncategorized — Patrick Durusau @ 4:12 pm

Striking a Balance—Whistleblowing, Leaks, and Security Secrets by Cody Poplin.

From the post:

Last weekend, the New York Times published an article outlining the strength of congressional support for the CIA targeted killing program. In the story, the Times also purported to reveal the identities of three covert CIA operatives who now hold senior leadership roles within the Agency.

As you might expect, the decision generated a great deal of controversy, which Lawfare covered here and here. Later in the week, Jack Goldsmith interviewed Executive Editor of the New York Times Dean Baquet to discuss the decision. That conversation also prompted responses from Ben, Mark Mazzetti (one of the authors of the piece), and an anonymous intelligence community reader.

Following the Times’ story, the Johns Hopkins University Center for Advanced Governmental Studies, along with the James Madison Project and our friends at Just Security, hosted a timely conference on Secrecy, Openness and National Security: Lessons and Issues for the Next Administration. In a panel entitled Whistleblowing and America’s Secrets: Ensuring a Viable Balance, Bob Litt, General Counsel for the Office of the Director of National Intelligence, blasted the Times, saying that the paper had “disgraced itself.”

However, the panel—which with permission from the Center for Advanced Governmental Studies, we now present in full—covered much more than the latest leak published in the Times. In a conversation moderated by Mark Zaid, the Executive Director of the James Madison Project, Litt, along with Ken Dilanian, Dr. Gabriel Schoenfeld, and Steve Vladeck, tackled a vast array of important legal and policy questions surrounding classified leak prosecutions, the responsibilities of the press, whistleblower protections, and the future of the Espionage Act.

It’s a jam-packed discussion full of candid exchanges—some testy, most cordial—that greatly raises the dialogue on the recent history of leaks, prosecutions, and future lessons for the next Administration.

Spirited debate but on the basis of known fictions.

For example, Bob Litt, General Counsel for the Office of the Director of National Intelligence, poses a hypothetical question that compares an alleged suppression of information about the Bay of Pigs invasion to whether a news organization would be justified in leaking the details of plans to assassinate Osama bin Laden.

The premise of the hypothetical is flawed. It is based on an alleged statement by President Kennedy wishing the New York Times had published the details in their possession. One assumes so that public reaction would have prevented the ensuing disaster.

The story of President Kennedy suppressing a story in the New York Times about the Bay of Pigs is a myth.

Busting the NYTimes suppression myth, 50 years on reports:

Indeed, the Times’ purported spiking has been called the “symbolic journalistic event of the 1960s.”

Only the Times didn’t censor itself.

It didn’t kill, spike, or otherwise emasculate the news report published 50 years ago tomorrow that lies at the heart of this media myth.

That article was written by a veteran Times correspondent named Tad Szulc, who reported that 5,000 to 6,000 Cuban exiles had received military training for a mission to topple Fidel Castro’s regime; the actual number of invaders was about 1,400.

The story, “Anti-Castro Units Trained At Florida Bases,” ran on April 7, 1961, above the fold on the front page of the New York Times.

The invasion of the Bay of Pigs happened ten days later, April 17, 1961.

Hardly sounds like suppression of the story does it?

That is just one fiction that formed the basis for part of the discussion in this podcast.

Another fiction is that leaked national security information, take some of Edward Snowden’s materials for example, was damaging to national security. Except that those who claim to know can’t say what information or how it was damaging.

Without answers to what information and how it was damaging to national security, their claims of “damage to national security” should go straight into the myth bin. The unbroken record of leaks shows illegal activity, incompetence, waste and avoidance of responsibility. None of those are in the national interest.

If the media does want to act in the “public interest,” then it should stop repeating unsubstantiated claims of damage to the “national interest” by the security community. Repeating falsehoods does not make them useful for debates of public policy. When advanced, such claims should be challenged and then, absent sufficient details for the public to reach its own conclusion about the claim, excluded from further discussion.

Another myth in this discussion is the assumption that the media has an in loco parentis role vis-a-vis the public. That media representatives should act on the public’s behalf in determining what is or is not in the “public interest.” Complete surprise to me and I have read the Constitution more than once or twice.

I don’t remember seeing the media called out in the Constitution as guardians for a public too stupid to decide matters of public policy for itself.

That is the central flaw with national security laws and the rights of leakers and leakees. The government of the United States, for those unfamiliar with the Constitution, is answerable under the Constitution to the citizens of the United States. Not any branch of government or its agencies but to the citizens.

There are no exceptions to the United States government being accountable to its citizens. Not one. To hold government accountable, its citizens need to know what government has been doing, to whom and why. The government has labored long and hard, especially its security services, to avoid accountability to its citizens. Starting shortly after its inception.

There should be no penalties for leakers or leakees. Leaks will cause hardships, such as careers ending due to dishonesty, incompetence, waste and covering for others engaged in the same. If you don’t like that, move to a country where the government isn’t answerable to its citizens. May I suggest Qatar?

You Can Help Keep Others Secure (Use Tor)

Filed under: Privacy,Security,Tor — Patrick Durusau @ 1:25 pm

Tor Browser 4.5 released by Mike Perry.

From the post:

The Tor Browser Team is proud to announce the first stable release in the 4.5 series. This release is available from the Tor Browser Project page and also from our distribution directory.

The 4.5 series provides significant usability, security, and privacy enhancements over the 4.0 series. Because these changes are significant, we will be delaying the automatic update of 4.0 users to the 4.5 series for one week.

Time to upgrade!

Why use Tor?

The Tor network is a group of volunteer-operated servers that allows people to improve their privacy and security on the Internet. Tor’s users employ this network by connecting through a series of virtual tunnels rather than making a direct connection, thus allowing both organizations and individuals to share information over public networks without compromising their privacy. Along the same line, Tor is an effective censorship circumvention tool, allowing its users to reach otherwise blocked destinations or content. Tor can also be used as a building block for software developers to create new communication tools with built-in privacy features.

Individuals use Tor to keep websites from tracking them and their family members, or to connect to news sites, instant messaging services, or the like when these are blocked by their local Internet providers. Tor’s hidden services let users publish web sites and other services without needing to reveal the location of the site. Individuals also use Tor for socially sensitive communication: chat rooms and web forums for rape and abuse survivors, or people with illnesses.

Journalists use Tor to communicate more safely with whistleblowers and dissidents. Non-governmental organizations (NGOs) use Tor to allow their workers to connect to their home website while they’re in a foreign country, without notifying everybody nearby that they’re working with that organization.

Groups such as Indymedia recommend Tor for safeguarding their members’ online privacy and security. Activist groups like the Electronic Frontier Foundation (EFF) recommend Tor as a mechanism for maintaining civil liberties online. Corporations use Tor as a safe way to conduct competitive analysis, and to protect sensitive procurement patterns from eavesdroppers. They also use it to replace traditional VPNs, which reveal the exact amount and timing of communication. Which locations have employees working late? Which locations have employees consulting job-hunting websites? Which research divisions are communicating with the company’s patent lawyers?

A branch of the U.S. Navy uses Tor for open source intelligence gathering, and one of its teams used Tor while deployed in the Middle East recently. Law enforcement uses Tor for visiting or surveilling web sites without leaving government IP addresses in their web logs, and for security during sting operations.

The variety of people who use Tor is actually part of what makes it so secure. Tor hides you among the other users on the network, so the more populous and diverse the user base for Tor is, the more your anonymity will be protected. (From

If you are concerned about privacy, yours and that of others, use a Tor browser by default.

May 2, 2015

Sony Emails and Dilbert Cartoons

Filed under: Cybersecurity,Humor,Wikileaks — Patrick Durusau @ 9:10 pm

WikiLeaks Adds More Hacked Emails From Sony Pictures Entertainment by Sohini Auddy.

From the post:

WikiLeaks has added thousands more of Sony Pictures Entertainment’s hacked emails in its database, as mentioned in a Twitter post on Thursday.

Sony has yet to develop a sense of humor over the hack attack late last year.

Suggestion: Search the Sony emails at Wikileaks and then the Dilbert archives for a matching Dilbert cartoon.

Tweet the link for the Sony email and your matching Dilbert cartoon, #sonydilbert.

Let’s try that for a week, ending May 9, 2015.

Tweet with the most retweets will be declared the winner by acclamation. (Contest not open to Sony managers.)


Homonyms on EOL

Filed under: Homonymous — Patrick Durusau @ 8:50 pm

Homonyms on EOL [Encyclopedia of Life]

From the webpage:

Please join the Homonym Hunters community and help us find all the homonyms on EOL!

This collection is for all kinds of homonyms:

Cross-code homonyms

Homonyms across nomenclatural codes (ICBN, ICZN, ICNB, ICTV) are allowed, so there are plenty of them. Example: Satyrium, the orchid genus and Satyrium, the butterfly genus.

Cross-rank homonyms

At least in zoological nomenclature, homonyms are allowed if they refer to groups at different ranks. Example: Polyphaga, the roach genus and Polyphaga, the beetle suborder.

Invalid homonyms

Within codes and ranks, homonyms are not allowed, so only one of the homonymous names can be valid/accepted. If EOL gets these invalid names from a provider, we will have a page for it. Example: Acanthurus, the surgeon fish genus and Acanthurus, the weevil genus.

Comprehensive lists of homonyms have also been compiled elsewhere:

Systema Naturae 2000: Homonyms

Wikispecies: List of valid homonyms

In topic map parlance, the identification of homonyms across nomenclatural codes and across different ranks translates into setting the scope on a homonym.

That helps both people and machines in distinguishing homonyms.
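A minimal sketch of the idea, using the EOL examples quoted above. This is not an implementation of the Topic Maps Data Model, where scope is a set of topics; here scope is flattened into a (code, rank) pair attached to the name. The `ScopedName` type, the `merge_by_scope` function, and the final duplicate record (standing in for a second data provider) are all hypothetical names and data of my own.

```python
from collections import defaultdict
from typing import NamedTuple


class ScopedName(NamedTuple):
    name: str  # the homonymous string, e.g. "Satyrium"
    code: str  # nomenclatural code acting as scope, e.g. "ICZN"
    rank: str  # taxonomic rank acting as scope, e.g. "genus"


def merge_by_scope(records):
    """Group (ScopedName, description) pairs.

    Records merge only when the name AND its scope are identical, so
    homonyms in different scopes stay separate subjects.
    """
    merged = defaultdict(list)
    for scoped, description in records:
        merged[scoped].append(description)
    return dict(merged)


records = [
    (ScopedName("Satyrium", "ICBN", "genus"), "orchid genus"),
    (ScopedName("Satyrium", "ICZN", "genus"), "butterfly genus"),
    (ScopedName("Polyphaga", "ICZN", "genus"), "roach genus"),
    (ScopedName("Polyphaga", "ICZN", "suborder"), "beetle suborder"),
    # Hypothetical second provider describing the same scoped name:
    (ScopedName("Polyphaga", "ICZN", "suborder"), "suborder of Coleoptera"),
]

topics = merge_by_scope(records)
for scoped, descriptions in topics.items():
    print(scoped, descriptions)
```

Merging on the bare string would collapse five records into two subjects; merging on name plus scope keeps four, and only the two beetle-suborder records merge.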

For merging purposes, that also helps merge homonyms correctly. For example, Aaron Black tweeted:


As seen in the Washington Post.

Close to being a homonym anyway. 😉 I could distinguish Kirstie Alley from any possible Christie ally, even on a bad day. Our machines, not so much.

HT: Sam Hunting for the tweet.

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

Filed under: Cassandra,Data Frames,Python — Patrick Durusau @ 8:17 pm

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and were covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time looking for libraries that would normally take me less than a second to recall the proper Python library (JSON being an example) since I don’t know Scala very well.

If you need help deciding whether to read this post, take a look at Spark SQL and DataFrame Guide to see what you stand to gain.


Intelligent Life in Congress!

Filed under: Cybersecurity,Government,Politics — Patrick Durusau @ 4:47 pm

Congressman with computer science degree: Encryption back-doors are ‘technologically stupid’ by Andrea Peterson.

Two tidbits from a must-read story:

“It is clear to me that creating a pathway for decryption only for good guys is technologically stupid,” said Rep. Ted Lieu (D-Calif.), who has a bachelor’s in computer science from Stanford University. “You just can’t do that.”

Subcommittee Chair Will Hurd (R-Tex.), who also has a computer science degree and worked in information security after nearly a decade at the CIA, shared Lieu’s skepticism of the security of such back doors. As did Rep. Blake Farenthold (R-Tex.), who asked the panel of witnesses to raise their hands if they thought it was possible to build a technically secure back-door — often mockingly called a “golden key” — into modern encryption systems.

None of them did — including Amy Hess, executive assistant director of the FBI’s Science and Technology Branch, and Daniel F. Conley, the district attorney for Suffolk County in Massachusetts. Conley at one point argued that companies like Apple are protecting “those who rape, defraud, assault, or even kill” with their encryption policies. (Lieu later said he took “great offense” at this comment, which he called a “fundamental misunderstanding of the problem.”)

Since I am so quick to point out dumb things that members of Congress do or say, it’s only appropriate that I highlight when one or more of them does something right.

No promises you will be able to contact either representative, given the provincialism of members of Congress who communicate only with members of their own districts, but it’s worth a shot.

Rep. Ted Lieu (D-Calif)

Washington, DC Office
415 Cannon House Office Building
Washington, DC 20515
Phone: (202) 225-3976

Rep. Will Hurd (R-Tex.)

Washington, DC Office
317 Cannon House Office Building
Washington, DC 20515
Phone: (202) 225-4511

Cheer them on and send money if you can.

New Natural Language Processing and NLTK Videos

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 3:59 pm

Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences and Stop Words – Natural Language Processing With Python and NLTK p.2 by Harrison Kinsley.

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Playlist link:…

sample code:

Use the Playlist link:… link as I am sure more videos will be appearing in the near future.
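For a feel of what the first two videos cover before installing anything, here is a minimal standard-library sketch of sentence splitting, word tokenizing, and stop-word removal. It deliberately does not use NLTK: the regular expressions, the function names, and the tiny `STOP_WORDS` set are crude stand-ins of my own, and NLTK's tokenizers and stop-word lists are far more careful.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize_words(sentence):
    """Lower-case a sentence and extract runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


text = "NLTK is a toolkit. It helps computers to read the text!"
sentences = split_sentences(text)
filtered = [remove_stop_words(tokenize_words(s)) for s in sentences]

for sentence, tokens in zip(sentences, filtered):
    print(sentence, "->", tokens)
```

The naive splitter breaks on abbreviations such as "Dr. Smith", which is one reason to reach for a trained tokenizer instead.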


The Power of Symmetry

Filed under: Programming — Patrick Durusau @ 1:24 pm

The Power of Symmetry by Felienne Hermans.

From the description:

This presentation by @Felienne presents programming problems, and how they can be solved efficiently and elegantly using symmetry.

The description is true but fails to capture the elegance of Felienne’s presentation as she uses symmetry to dramatically reduce the number of states in classic programming problems.

Highly recommended if you need to “wow” a student or class with what is possible by looking just a bit deeper at a problem.

May 1, 2015

Replication in Psychology?

Filed under: Peer Review,Psychology,Researchers,Science — Patrick Durusau @ 8:28 pm

First results from psychology’s largest reproducibility test by Monya Baker.

From the post:

An ambitious effort to replicate 100 research findings in psychology ended last week — and the data look worrying. Results posted online on 24 April, which have not yet been peer-reviewed, suggest that key findings from only 39 of the published studies could be reproduced.

But the situation is more nuanced than the top-line numbers suggest (See graphic, ‘Reliability test’). Of the 61 non-replicated studies, scientists classed 24 as producing findings at least “moderately similar” to those of the original experiments, even though they did not meet pre-established criteria, such as statistical significance, that would count as a successful replication.

The project, known as the “Reproducibility Project: Psychology”, is the largest of a wave of collaborative attempts to replicate previously published work, following reports of fraud and faulty statistical analysis as well as heated arguments about whether classic psychology studies were robust. One such effort, the ‘Many Labs’ project, successfully reproduced the findings of 10 of 13 well-known studies.

Replication is a “hot” issue and likely to get hotter if peer review shifts to be “open.”

Do you really want to be listed as a peer reviewer for a study that cannot be replicated?

Perhaps open peer review will lead to more accountability of peer reviewers.
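The counts quoted above are easier to compare as rates. A small sketch of the arithmetic follows; treating “moderately similar” results as successes in the `lenient_rate` line is my own reading for illustration, not a criterion the project adopted, and all variable names are mine.

```python
total_studies = 100
replicated = 39          # met the pre-established replication criteria
moderately_similar = 24  # subset of the non-replicated studies

not_replicated = total_studies - replicated  # 61, matching the article
strict_rate = replicated / total_studies
# Assumption: count "moderately similar" findings as successes too.
lenient_rate = (replicated + moderately_similar) / total_studies
# The 'Many Labs' project: 10 of 13 studies reproduced.
many_labs_rate = 10 / 13

print(f"strict: {strict_rate:.0%}, lenient: {lenient_rate:.0%}, "
      f"Many Labs: {many_labs_rate:.0%}")
```

Even on the lenient reading, more than a third of the studies did not produce findings similar to the originals.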


Security Incentives With Bite?

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:03 pm

SEC Releases Cybersecurity Guidance, Highlights Compliance Role

From the post:

The SEC’s Division of Investment Management recently released cybersecurity guidance highlighting best practices and warning that cybersecurity breaches and deficiencies in cybersecurity programs could cause funds and advisers to run afoul of securities laws. Importantly, the guidance places significant obligations on compliance officers to ensure that funds have adopted adequate cybersecurity policies and procedures.

The guidance recommends that funds and advisers conduct periodic cybersecurity assessments; create a strategy to prevent, identify, and respond to cyber threats; and implement the strategy through policies, procedures, and training that help to guide officers and employees and monitor compliance. According to the guidance, periodic assessments should include attention to internal and external vulnerabilities as well as the likely effects of a breach so that funds and advisers can better assess and mitigate risk. With respect to cybersecurity strategies, funds and advisers should consider exerting tighter control over data access, ramping up encryption, limiting the use of removable storage media to prevent data theft, monitoring system access, backing up data, developing an incident response plan, and implementing routine testing.

First step, make cybersecurity breaches into violations of something important, like securities laws.

Second step, prosecute violations of securities laws rooted in cybersecurity breaches.

Third step, defendants in securities actions take an interest in spreading the joy of securities liabilities.

Fourth step, software liability doctrines develop in the context of securities litigation.

Liability for software defects is coming.

The question is whether it will develop piecemeal and unexpectedly, or in a comprehensive and moderated fashion.

How’s your appetite for risk?

I first saw this in a tweet by Milo Camacho.

Large-Scale Social Phenomena – Data Mining Demo

Filed under: Data Mining,Python — Patrick Durusau @ 7:48 pm

Large-Scale Social Phenomena – Data Mining Demo by Artemy Kolchinsky.

From the webpage:

For your mid-term hack-a-thons, you will be expected to quickly acquire, analyze and draw conclusions from some real-world datasets. The goal of this tutorial is to provide you with some tools that will hopefully enable you to spend less time debugging and more time generating and testing interesting ideas.

Here, I chose to focus on Python. It is a beautiful language that is quickly developing an ecosystem of powerful and free scientific computing and data mining tools (e.g. the Homogenization of scientific computing, or why Python is steadily eating other languages’ lunch). For this reason, as well as my own familiarity with it, I encourage (though certainly do not require) you to use it for your mid-term hack-a-thons. From my own experience, getting comfortable with these tools will pay off in terms of making many future data analysis projects (including perhaps your final projects) easier & more enjoyable.

Just in time for the weekend! I first saw this in a tweet by Lynn Cherny.

Suggestions of odd data sources for mining?


Filed under: Archives,Library,Manuscripts — Patrick Durusau @ 7:29 pm

OPenn: Primary Digital Resources Available to All through Penn Libraries’ New Online Platform by Jessie Dummer.

From the post:

The Penn Libraries and the Schoenberg Institute for Manuscript Studies are thrilled to announce the launch of OPenn: Primary Resources Available to Everyone, a new website that makes digitized cultural heritage material freely available and accessible to the public. OPenn is a major step in the Libraries’ strategic initiative to embrace open data, with all images and metadata on this site available as free cultural works to be freely studied, applied, copied, or modified by anyone, for any purpose. It is crucial to the mission of SIMS and the Penn Libraries to make these materials of great interest and research value easy to access and reuse. The OPenn team at SIMS has been working towards launching the website for the past year. Director Will Noel’s original idea to make our Medieval and Renaissance manuscripts open to all has grown into a space where the Libraries can collaborate with other institutions who want to open their data to the world.

Images of the manuscripts are currently available on OPenn at full resolution, with derivatives also provided for easy reuse on the web. Downloading, whether several select images or the entire dataset, is easily accomplished by following instructions or recipes posted in the Technical Read Me on OPenn. The website is designed to be machine-readable, but easy for individuals to use, too.

Oh, the manuscripts themselves?

Licensing is a real treat:

All images and their contents from the Lawrence J. Schoenberg Collection are free of known copyright restrictions and in the public domain. See the Creative Commons Public Domain Mark page for more information on terms of use:

Unless otherwise stated, all manuscript descriptions and other cataloging metadata are ©2015 The University of Pennsylvania Libraries. They are licensed for use under a Creative Commons Attribution License version 4.0 (CC-BY-4.0):

For a description of the terms of use, see the Creative Commons Deed:

In substance and licensing such a departure from academic societies that still consider comping travel and hotel rooms as “fostering scholarship.” “Ye shall know them by their fruits.” (Matthew 7:16)

Practical Text Analysis using Deep Learning

Filed under: Deep Learning,Natural Language Processing,Text Mining — Patrick Durusau @ 4:34 pm

Practical Text Analysis using Deep Learning by Michael Fire.

From the post:

Deep Learning has become a household buzzword these days, and I have not stopped hearing about it. In the beginning, I thought it was another rebranding of Neural Network algorithms or a fad that will fade away in a year. But then I read Piotr Teterwak’s blog post on how Deep Learning can be easily utilized for various image analysis tasks. A powerful algorithm that is easy to use? Sounds intriguing. So I decided to give it a closer look. Maybe it will be a new hammer in my toolbox that can later assist me to tackle new sets of interesting problems.

After getting up to speed on Deep Learning (see my recommended reading list at the end of this post), I decided to try Deep Learning on NLP problems. Several years ago, Professor Moshe Koppel gave a talk about how he and his colleagues succeeded in determining an author’s gender by analyzing his or her written texts. They also released a dataset containing 681,288 blog posts. I found it remarkable that one can infer various attributes about an author by analyzing the text, and I’ve been wanting to try it myself. Deep Learning sounded very versatile. So I decided to use it to infer a blogger’s personal attributes, such as age and gender, based on the blog posts.

If you haven’t gotten into deep learning, here’s another opportunity focused on natural language processing. You can follow Michael’s general directions to learn on your own or follow more detailed instructions in his IPython notebook.


Point-of-Sale (PoS) RAM Scrapers (And Security Incentives)

Filed under: Cybersecurity — Patrick Durusau @ 4:18 pm

This graphic speaks volumes about Point-of-Sale (PoS) systems:


From Defending Against PoS RAM Scrapers: Current Strategies and Next-Gen Technologies by Trend Micro.

All is not entirely lost. PoS RAM Scraper Malware: Past, Present, and Future by Numaan Huq, also of Trend Micro.

If you want to understand PoS RAM scrapers at a deeper level than “malware, bad,” this report should meet your needs. It runs ninety-three (93) pages with seventy (70) references.

In terms of security policy to encourage better cybersecurity, losses from bugs in no-longer-supported software should be neither eligible for insurance coverage as business losses nor tax deductible.
