Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 12, 2015

NIST Big Data Interoperability Framework (Comments Wanted!)

Filed under: BigData,NIST — Patrick Durusau @ 3:49 pm

NIST Big Data Interoperability Framework

The NIST Big Data Public Working Group (NBD-PWG) is seeking your comments on drafts of its first seven (7) deliverables. Comments are due by May 21, 2015.

NIST Big Data Definitions & Taxonomies Subgroup
1. M0392: Draft SP 1500-1 — Volume 1: Definitions
2. M0393: Draft SP 1500-2 — Volume 2: Taxonomies

NIST Big Data Use Case & Requirements Subgroup
3. M0394: Draft SP 1500-3 — Volume 3: Use Case & Requirements (See Use Cases Listing)

NIST Big Data Security & Privacy Subgroup
4. M0395: Draft SP 1500-4 — Volume 4: Security and Privacy

NIST Big Data Reference Architecture Subgroup
5. M0396: Draft SP 1500-5 — Volume 5: Architectures White Paper Survey
6. M0397: Draft SP 1500-6 — Volume 6: Reference Architecture

NIST Big Data Technology Roadmap Subgroup
7. M0398: Draft SP 1500-7 — Volume 7: Standards Roadmap

You can participate too:

Big Data professionals continue to be welcome to join the NBD-PWG to help craft the work contained in the volumes of the NIST Big Data Interoperability Framework. Please register to join our effort.

See the webpage for details on submitting comments. Or contact me if you want assistance in preparing and submitting comments.

April 11, 2015

Another Walter Mitty Terrorist Arrested

Filed under: Government,Politics,Security — Patrick Durusau @ 7:08 pm

The FBI cleverly arrested another Walter Mitty terrorist in Kansas on Friday, April 10, 2015.

For attempting a suicide car bombing of Fort Riley.

Sounds impressive, doesn’t it?

But if you call up a map of Fort Riley:

[Image: map of Fort Riley]

saying you are going to car bomb “Fort Riley” seems a bit vague, doesn’t it? I mean you wouldn’t construct a car bomb and start driving around a base that size looking for targets of opportunity. But that was what the defendant, John T. Booker, Jr., a/k/a “Mohammed Abdullah Hassan,” was going to do.

If you read the complaint against John T. Booker, it quickly becomes apparent that Booker needed a mental health referral, not assistance from “FBI Confidential Human Sources” to “plan” and construct a car bomb.

For example, Booker enlists in the United States Army in February of 2014. A few weeks later he posts on his public Facebook page:

“I will soon be leaving you forever so goodbye! I’m going to wage jihad and hopes that i die.”

Someone who hasn’t read Ann Landers’ MYOB (mind your own business) advice reports the post, which eventually results in the FBI interviewing Booker before the car bombing plot.

From the complaint:

On March 15, 2014, Booker publicly posted on Facebook: “I will soon be leaving you forever so goodbye! I’m going to wage jihad and hopes that i die.” On March 19, 2014, Booker publically posted on Facebook: “Getting ready to be killed in jihad is a HUGE adrenaline rush!! I am so nervous. NOT because I’m scared to die but I am eager to meet my lord.” (Complaint, paragraph 4)

Then of course he is accosted by the FBI:

That same day, the FBI became aware of Booker’s postings based on a citizen’s complaint. The FBI was able to identify Booker based on the publically available content on his Facebook account. On March 20, 2014, Booker was interviewed by FBI agents related to his Facebook postings. After being advised of and waiving his Miranda rights, Booker admitted that he enlisted in the United States Army with the intent to commit an insider attack against American soldiers like Major Nidal Hassan had done at Fort Hood, Texas. Booker stated that if he went overseas and was told to kill a fellow Muslim, he would rather turn around and shoot the person giving orders. Booker stated that he formulated several plans for committing jihad once enlisted, including firing at other soldiers while at basic training at the firing range or while at his pre-deployment military base after completing his initial military training. Booker clarified that he did not intend to kill “privates,” but that he instead wanted to target someone with power. Booker also said that he did not intend to use large guns, but instead a small gun or a sword. Booker was subsequently denied entry into the military. (Complaint, paragraph 4))

All of this is prior to any discussion of a car bombing of Fort Riley.

I don’t know that anyone has seen the recruitment records for Booker but it looks like the Army didn’t take his attempt to enlist to kill fellow soldiers very seriously. Maybe they realized he is just a sad case but then didn’t follow through with a mental health referral.

But the FBI didn’t forget about Booker.

It isn’t clear from the complaint how Booker came into contact with “FBI Confidential Human Source (CHS 1)” but aside from videos and a number of outrageous statements, the basic facts are as follows:

  • CHS 1 goes with Booker and assists in the purchasing of material to make a bomb
  • CHS 1 and CHS 2 build the Vehicle Borne Improvised Explosive Device (“VBIED”).
  • “CHS 1 and CHS 2 then provided Booker with a map of the area of Fort Riley at Booker’s request.”
  • Booker circles three buildings (it isn’t indicated which ones)
  • CHS 1 and CHS 2 supply the required vehicle.
  • CHS 1 goes with Booker (I assume to ensure he can find Fort Riley)
  • Booker is arrested while trying to “arm” the inert device

Without the assistance of the FBI, Booker doesn’t have:

  1. A vehicle as the basis for the Vehicle Borne Improvised Explosive Device (“VBIED”).
  2. The materials necessary to build such a bomb, aside from the vehicle.
  3. The knowledge or training to build such a device.
  4. A map of Fort Riley to plan possible detonation locations.

Without FBI assistance, Booker was just a troubled young man who tried to enlist in the Army in order to kill fellow soldiers. And bragged about it on Facebook.

Of course, a mental health referral would have denied US Attorney Barry Grissom the opportunity to wrap himself in the flag and intone very solemnly about the grave perils the FBI and US Attorney’s office battle every day.

[Image: US Attorney Barry Grissom]

That’s the real problem isn’t it? The FBI and US Attorney’s office do combat serious crimes that would diminish the quality of life in the United States for everyone. But, they also create street theater for publicity purposes that victimize the disturbed, the incompetent and the inane.

PS: Cleric: Man charged in Kansas bomb plot is mentally ill by Nicholas Clayton and John Hanna (Associated Press).

If you think I was speculating about Booker’s mental health:

Imam Omar Hazim of the Islamic Center of Topeka told The Associated Press that two FBI agents brought Booker to him last year for counseling, hoping to turn the young man away from radical beliefs. Hazim said the agents told him that Booker suffered from bipolar disorder, characterized by unusual mood swings that can affect functioning.

Hazim said he expressed concerns to the FBI about allowing Booker to move freely in the community after their first encounter.

Hazim said he later heard that two others were involved in a bombing plot with Booker. He said the FBI told him they were undercover FBI agents and that the sting was arranged to get Booker “off the streets.”

“I think the two FBI agents set him up, because they felt at that point someone else might have done the same thing and put a real bomb in his hands,” Hazim said.

He said he has come to the conclusion that the sting was the right thing to do. He said Booker admitted to him on Tuesday that he had stopped taking his medication because he didn’t like the way it made him feel and it was expensive.

Saying it doesn’t make it so but Hazim’s account fits with all the facts in the complaint.

Victimizing the mentally ill, for offenses that carry life sentences, doesn’t sound much like “Fidelity – Bravery – Integrity.” (The FBI motto.) Notice the complaint fails to say anything about Booker being mentally ill. I know, mental disease is an affirmative defense to be raised by the defendant, but prosecutors, even ambitious US prosecutors, have an affirmative obligation to not seek injustice.

April 10, 2015

UNESCO Transparency Portal

Filed under: Government,Transparency — Patrick Durusau @ 7:09 pm

UNESCO Transparency Portal

From the about page:

Public access to information is a key component of UNESCO’s commitment to transparency and its accountability vis-à-vis stakeholders. UNESCO recognizes that there is a positive correlation between a high level of transparency through information sharing and public participation in UNESCO-supported activities.

The UNESCO transparency portal has been designed to enable public access to information about the Organization’s activities across sectors, countries, and regions, accompanied by some detail on budgetary and donor information. We see this as a work in progress. Our objective is to enable access to as much quality data about our activities as possible. The portal will be regularly updated and improved.

The data is a bit stale (2014) and, by the site’s admission, data on the “10 Category I Institutes operating as separate economic entities” and the “UNESCO Brasilia Office” will be included only “in a later phase….”

The map navigation on the default homepage works quite well. I tested it by focusing on Zimbabwe, led by everyone’s favorite, Robert Mugabe. If you zoom in and select Zimbabwe on the map, the world map displays with a single icon over Zimbabwe. Hovering over that icon displays the number of projects and the budget; I selected projects and the screen scrolled down to show the five projects.

I then selected UBRAF: Supporting Comprehensive Education Sector Responses to HIV, Sexual and Reproductive Health in Botswana, Malawi, Zambia and Zimbabwe and was presented with the same summary information shown before.

Not a great showing of transparency. The United States Congress can do that well, most of the time. Transparency isn’t well served by totals and bulk amounts. Those are more suited to concealment than transparency.

At a minimum, transparency requires disclosure of whom the funds were disbursed to (one assumes some entity in Zimbabwe) and to whom that entity transferred funds, and so on, until we reach consumables or direct services, along with the identities of every actor along that trail.

I first saw this in The Research Desk: UNESCO, SIPRI, and Searching iTunes by Gary Price.

Incentives to be Cybersecure?

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:14 pm

A year after its exposure, Heartbleed bug remains a serious threat (+video) by Joe Uchill.

From the post:


Venafi scanned publicly accessible servers and discovered that only 416 of the 2,000 companies listed on the Forbes Global 2000 – a ranking of the largest public companies in the world – have fully completed Heartbleed remediation. That’s a marginal improvement over the 387 companies that Venafi identified in a July survey as taking action to fix the bug.

Venafi report, Hearts Continue to Bleed: Heartbleed One Year Later.

You may also find this helpful, The Forbes Global 2000 list.

I don’t know which companies you want to help out by pointing out the IP addresses of vulnerable servers so I’ll leave generating the IP lists for individual companies as an exercise for you.

This is a good illustration of the lack of skin in the game by a majority of companies listed in the Forbes Global 2000 group. As we discussed in The reason companies don’t fix cybersecurity [Same reason software is insecure], companies have no incentives to practice cybersecurity. The Heartbleed vulnerability is an example of that principle at work.

Government moves to increase the level of cybersecurity?

“Blocking the Property of Certain Persons Engaging in Significant Malicious Cyber-Enabled Activities”, (drum roll), an executive order from President Obama.

Don’t you have that warm feeling from a real hug now? Feeling like your data is getting safer by the day!

If you have either of those feelings after reading that executive order, it may be time to call 911 because you have been off your meds for too long. 😉

Serious cyberhackers are hacking because they are being paid for their services or the content they obtain. Given the difficulty in identifying, locating and capturing serious cyberhackers, I am sure they are willing to take the risk, such as it is, generated by Obama’s executive order. As a matter of fact, I suspect cyberhackers recognize the purely media side effects of such an order. Something, an ineffectual something, but something, is being done about cybersecurity.

I understand that the President cannot be an expert in every field but he/she should have advisers who have access to experts in every field.

Assuming cyberhackers haven’t changed since the discovery of the Heartbleed vulnerability, guess who else hasn’t changed? Some one thousand five hundred eighty-four (1,584) of the Forbes Global 2000 group. If more than 3/4 of the Forbes Global 2000 group wants to go naked with regard to Heartbleed, that’s their choice. However, that choice means that your data, not just theirs, is at risk.

To put it another way, 3/4 of the Forbes Global 2000 group has made a decision that your data isn’t worth the cost of fixing the Heartbleed vulnerability.

Since influencing the behavior of cyberhackers seems a bridge too far, doesn’t it make sense to incentivize U.S. based members of the Forbes Global 2000 group and others to protect your data more effectively?

The current regime of fines for some data breaches doesn’t appear to be a sufficient incentive, in light of the evidence, to increase cybersecurity. If there were a real incentive to secure your data, far more than the 416 companies Venafi identified would have fixed the Heartbleed vulnerability.

My suggestion: Increase the penalties for theft of consumer data and make the liability personal to everyone defined to be part of “management” by the National Labor Relations Board (NLRB) and proportionate to their compensation. The only out is for a member of management to have reported the cybersecurity issue to US-CERT before the breach for which a fine is being imposed.

Year to year, the GAO would hire out a survey of cyberinsecurities and, based on the percentage of mitigation of known flaws, propose increases in the fines designed to encourage cybersecurity. The increases in fines take effect unless Congress votes by a 2/3 majority in both houses to change the proposed rate.

Cybersecurity is too complex for a fixed legislative or even a regulatory solution. American business has always prided itself on solving problems so let’s rely on their ingenuity once they have the proper incentives.

PS: I realize this would also increase business donations to congressional campaigns, but 2/3 majorities in Congress appear to be few and far between.

PPS: You will notice that by making the liabilities “personal” to management, I have avoided diminishing the value of the company for its shareholders. Probably need to ban reimbursement agreements, insurance coverage, etc. for those fines as well. You can imagine that Jamie Dimon facing the prospect of spending his retirement years in a Bronx homeless shelter would have a great incentive to improve cybersecurity at JPMorgan Chase.

April 9, 2015

Nonlinear Dynamics and Chaos:

Filed under: Chaos,Nonlinear Models — Patrick Durusau @ 6:41 pm

Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering by Steven H. Strogatz, reviewed by Daniel Lathrop.

For someone with Steven Strogatz’s track record of lucid explanations, it comes as no surprise that Lathrop says:


In presenting the subject, the author draws from the past 30 years of developments that have advanced our understanding of dynamics beyond the linear examples—for instance, harmonic oscillators—that permeate current physics curricula. The advances came from theoretical and computational scholars, and the book does a great job of acknowledging them. The methods and techniques that form the bulk of the book’s content apply useful concepts—bifurcations, phase-space analysis, and fractals, to name a few—that have been widely adopted in physics, biology, chemistry, and engineering. One of the book’s biggest strengths is that it explains core concepts through practical examples drawn from various fields and from real-world systems; the examples include pendula, Josephson junctions, chemical oscillators, and convecting atmospheres. The illustrations, in particular, have been enhanced in the new edition.

The techniques needed to understand the behavior of nonlinear systems are inherently mathematical. Fortunately, the author’s excellent use of geometric and graphical techniques greatly clarifies what can be amazingly complex behavior. For example, in carefully working through the development and behavior of the Lorenz equations, Strogatz introduces a simple waterwheel machine as a model to help define terms and tie together such key concepts as fixed points, bifurcations, chaos, and fractals. The reader gets a feel for the science behind the differential equations. Moreover, for each concept, the mathematics is accompanied by clear figures and nicely posed student exercises.
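
For a sense of what the waterwheel model stands in for, here is a minimal sketch of the Lorenz equations integrated numerically. It is my own illustration (not from the book or the review), using the classic chaotic parameter values sigma=10, beta=8/3, rho=28.

```python
# A minimal sketch (mine, not the book's) of the Lorenz system with the
# classic chaotic parameters; requires numpy and scipy.
import numpy as np
from scipy.integrate import odeint

def lorenz(state, t, sigma=10.0, beta=8.0 / 3.0, rho=28.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.linspace(0, 40, 10000)
trajectory = odeint(lorenz, [1.0, 1.0, 1.0], t)  # shape (10000, 3)
print(trajectory[-1])  # the final point, somewhere on the strange attractor
```

Plotting x against z for the trajectory traces out the familiar butterfly-shaped attractor, which is where the book’s discussion of fixed points, bifurcations and fractals comes together.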

Rate this one as must buy!

Ahem, I guess you noticed that political science, sociology, psychology, sentiment, etc. aren’t included in the title?

Blake LeBaron of Brandeis University penned an answer to “Has chaos theory found any useful application in the social sciences?,” which reads in part:


One of several key ideas in chaos is that simple models can generate very rich (and random-looking) dynamics. Implicit in some early work in the social sciences was a hope that simple chaotic models of social phenomena could be matched up with many of the near-random and difficult-to-explain empirical patterns that are observed. This early goal has proved elusive.

One problem is that determining if a time series was generated by deterministic chaos is not easy. (A time series is a data set showing the state of a system over a period of time–a sequence of voting results, for instance, or the fluctuating price of gold.) There is no single statistic capable of being estimated that indicates what is going on in a social system. Also, many common time-series problems (such as seasonality and trends) can confuse most of the diagnostic tools that people use. These complications have led to many conflicting results. Building an easy-to-use test that can handle the intricacies of a real-world time series is a tough problem, one which will probably not be solved anytime soon.

A second difficulty is that most of the theoretical structure in chaos is based on purely deterministic models that have no noise, or at most just a very small amount of noise, affecting the dynamics of the system. This approach works well in many physical situations, but it does not offer a very good picture of most social situations. It is hard to look at social systems isolated from the environment in the way that one can analyze fluid in a laboratory beaker. Once noise plays a major role in the dynamics, the problems involved in analyzing nonlinear systems become much more difficult.

A great illustration of why “low noise” techniques have difficulty providing meaningful results when applied to social systems. Social systems are noisy, very noisy. That isn’t to say you can’t ignore the noise and make decisions on the results, but you could save the consulting fees and consult a Ouija board instead. Social engineering programs, both liberal and conservative, in the United States suffer from a failure to appreciate the complexity of human interaction.

PS: I did see in the index that Steven cites Romeo and Juliet but I will have to await the arrival of a copy to discover what was said.

Building upon the Current Capabilities of WWT

Filed under: Astroinformatics,Science — Patrick Durusau @ 6:09 pm

Building upon the Current Capabilities of WWT

From the post:

WWT to GitHub

WorldWide Telescope is a complex system that supports a wide variety of research, education and outreach activities.  By late 2015, the Windows and HTML5/JavaScript code needed to run WWT will be available in a public (Open Source) GitHub repository. As code moves through the Open Sourcing process during 2015, the OpenWWT web site (www.openwwt.org) will offer updated details appropriate for a technical audience, and contact links for additional information.

Leveraging and Extending WorldWide Telescope

The open WorldWide Telescope codebase will provide new ways of leveraging and extending WWT functionality in the future.  WWT is already friendly to data and reuse thanks to its extant software development kits, and its ability to import data through both the user interface and “WTML” (WWT’s XML based description language to add data into WWT).  The short listing below gives some examples of how data can be accessed, displayed, and explained using WWT as it presently is. Most of these capabilities are demonstrated quickly in the “What Can WorldWide Telescope Do for Me?” video at tinyurl.com/wwt-for-me. The www.worldwidetelescope.org/Developers/ site offers resources useful to developers, and details beyond those offered below.

Creating Tours

What you can do: You can create a variety of tours with WWT. The tour-authoring interface allows tour creators to guide tour viewers through the Universe by positioning a virtual camera in various slides, and WWT animates the between-slide transitions automatically. Tour creators can also add their own images, data, text, music, voice over and other media to enhance the message. Buttons, images and other elements can link to other Tours, ultimately allowing tour viewers to control their own paths. Tour functionality can be used to create Kiosks, menu-driven multimedia content, presentations, training and quizzing interactives and self-service data exploration. In addition to their educational value, tours can be particularly useful in collaborative research projects, where researchers can narrate and/or annotate various views of data.  Tour files are typically small enough to be exchanged easily by email or cloud services. Tours that follow a linear storyline can also be output to high quality video frames for professional quality video production at any resolution desired. Tours can also be hosted in a website to create interactive web content.

Skills Required: WWT tours are one of the most powerful aspects of WWT, and creating them doesn’t require any programing skills. You should know what story you want to tell and understand presentation and layout skills. If you can make a PowerPoint presentation then you should be able to make a WWT tour.  The WorldWide Telescope Ambassadors (outreach-focused) website provides a good sample of Tours, at wwtambassadors.org/tours, and a good tour to experience to see the largest number of tour features in use all at once is “John Huchra’s Universe,” at wwtambassadors.org/john-huchras-universe.  A sample tour-based kiosk is online at edukiosks.harvard.edu.  A video showing a sample research tour (meant for communication with collaborators) is at tinyurl.com/morenessies.

That is just a sample of the news from the WorldWide Telescope!

The popular press keeps bleating about “big data,” some of which will be useful, some of which will not. But imagine a future when data from all scientific experiments supported by the government are streamed online at the time of acquisition. It won’t be just “big data” but rather “data that makes a difference.” As the decades of data accumulate, synthetic analysis can be performed on all the available data, not just the snippet that you were able to collect.

Hopefully even private experiments will be required to contribute their data as well. Facts are facts and not subject to ownership. Private entities could produce products subject to patents but knowledge itself should be patent free.

Almost a Topic Map? Or Just a Mashup?

Filed under: Digital Library,Library,Mashups,Topic Maps — Patrick Durusau @ 4:34 pm

WikipeDPLA by Eric Phetteplace.

From the webpage:

See relevant results from the Digital Public Library of America on any Wikipedia article. This extension queries the DPLA each time you visit a Wikipedia article, using the article’s title, redirects, and categories to find relevant items. If you click a link at the top of the article, it loads in a series of links to the items. The original code behind WikiDPLA was written at LibHack, a hackathon at the American Library Association’s 2014 Midwinter Meeting in Philadelphia: http://www.libhack.org/.

Google Chrome App Home Page

GitHub page

Wikipedia:The Wikipedia Library/WikipeDPLA

How you resolve the topic map versus mashup question depends on how much precision you expect from a topic map. While knowing additional places to search is useful, I never have a problem with assembling more materials than can be read in the time allowed. On the other hand, some people may need more prompting than others, so I can’t say that general references are out of bounds.

Assuming you were maintaining data sets with locally unique identifiers, using a modification of this script to query an index of all local scripts (say Pig scripts) to discover other scripts using the same data could be quite useful.
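
If you want to experiment with the same kind of lookup outside the browser extension, a minimal sketch against the DPLA API might look like this. The endpoint, parameter names and response fields below come from my reading of the DPLA v2 API documentation, not from the WikipeDPLA code, so treat them as assumptions to verify; you will also need your own (free) DPLA API key.

```python
# Hypothetical DPLA lookup by title; verify the endpoint, parameters and
# response fields against the DPLA API documentation before relying on this.
import requests

API_KEY = "your-dpla-api-key"  # placeholder; keys are free from the DPLA

def dpla_titles(query, limit=5):
    resp = requests.get(
        "https://api.dp.la/v2/items",
        params={"q": query, "page_size": limit, "api_key": API_KEY},
    )
    resp.raise_for_status()
    for doc in resp.json().get("docs", []):
        yield doc.get("sourceResource", {}).get("title")

for title in dpla_titles("topic maps"):
    print(title)
```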

BTW, you need to have a Wikipedia account and be logged in for the extension to work. Or at least that was my experience.

Enjoy!

Big Data To Identify Rogue Employees (Who To Throw Under The Bus)

Filed under: BigData,Prediction,Predictive Analytics — Patrick Durusau @ 3:23 pm

Big Data Algorithm Identifies Rogue Employees by Hugh Son.

From the post:

Wall Street traders are already threatened by computers that can do their jobs faster and cheaper. Now the humans of finance have something else to worry about: Algorithms that make sure they behave.

JPMorgan Chase & Co., which has racked up more than $36 billion in legal bills since the financial crisis, is rolling out a program to identify rogue employees before they go astray, according to Sally Dewar, head of regulatory affairs for Europe, who’s overseeing the effort. Dozens of inputs, including whether workers skip compliance classes, violate personal trading rules or breach market-risk limits, will be fed into the software.

“It’s very difficult for a business head to take what could be hundreds of data points and start to draw any themes about a particular desk or trader,” Dewar, 46, said last month in an interview. “The idea is to refine those data points to help predict patterns of behavior.”

Sounds worthwhile until you realize that $36 billion in legal bills “since the financial crisis” covers a period of seven (7) years, which works out to about $5 billion per year. Considering that net income for 2014 was $21.8 billion, after deducting legal bills, they aren’t doing too badly. (2014 Annual Report)

Hugh raises the specter of The Minority Report in terms of predicting future human behavior. True enough, but the program is much more likely to discover cues that resulted in prior regulatory notice and to caution employees to avoid those “tells.” If the trainer reviews three (3) real JPMorgan Chase cases and all of them involve note taking and cell phone records (later traced), how bright do you have to be to get clued in?

People who don’t get clued in will either be thrown under the bus during the next legal crisis or won’t be employed at JPMorgan Chase.

If this were really a question of predicting human behavior the usual concerns about fairness, etc. would obtain. I suspect it is simply churn so that JPMorgan Chase appears to be taking corrective action. Some low level players will be outed, like the Walter Mitty terrorists the FBI keeps capturing in its web of informants. (I am mining some data now to collect those cases for a future post.)

It will be interesting to see if Jamie Dimon’s electronic trail is included as part of the big data monitoring of employees. Bets anyone?

Web Gallery of Art

Filed under: Art,Humanities — Patrick Durusau @ 11:11 am

Web Gallery of Art

From the homepage:

The Web Gallery of Art is a virtual museum and searchable database of European fine arts from 11th to 19th centuries. It was started in 1996 as a topical site of the Renaissance art, originated in the Italian city-states of the 14th century and spread to other countries in the 15th and 16th centuries. Intending to present Renaissance art as comprehensively as possible, the scope of the collection was later extended to show its Medieval roots as well as its evolution to Baroque and Rococo via Mannerism. Encouraged by the feedback from the visitors, recently 19th-century art was also included. However, we do not intend to present 20th-century and contemporary art.

The collection has some of the characteristics of a virtual museum. The experience of the visitors is enhanced by guided tours helping to understand the artistic and historical relationship between different works and artists, by period music of choice in the background and a free postcard service. At the same time the collection serves the visitors’ need for a site where various information on art, artists and history can be found together with corresponding pictorial illustrations. Although not a conventional one, the collection is a searchable database supplemented by a glossary containing articles on art terms, relevant historical events, personages, cities, museums and churches.

The Web Gallery of Art is intended to be a free resource of art history primarily for students and teachers. It is a private initiative not related to any museums or art institutions, and not supported financially by any state or corporate sponsors. However, we do our utmost, using authentic literature and advice from professionals, to ensure the quality and authenticity of the content.

We are convinced that such a collection of digital reproductions, containing a balanced mixture of interlinked visual and textual information, can serve multiple purposes. On one hand it can simply be a source of artistic enjoyment; a convenient alternative to visiting a distant museum, or an incentive to do just that. On the other hand, it can serve as a tool for public education both in schools and at home.

The Gallery doesn’t own the works in question and so resolves the copyright issue thus:

The Web Gallery of Art is copyrighted as a database. Images and documents downloaded from this database can only be used for educational and personal purposes. Distribution of the images in any form is prohibited without the authorization of their legal owner.

The Gallery suggests contacting the Scala Group (or Art Resource, Scala’s U.S. representative) if you need rights beyond educational and personal purposes.

To see how images are presented, view 10 random images from the database. (Warning: The 10 random images link will work only once. If you try it again, images briefly display and then an invalid CGI environment message pops up. I suspect that if you clear the browser cache it will work a second time.)

BTW, you can listen to classical music in the background while you browse/search. That is a very nice touch.

The site offers other features and options so take time to explore.

Having seen some of Michelangelo‘s works in person, I can attest no computer screen can duplicate that experience. However, if given the choice between viewing a pale imitation on a computer screen and not seeing his work at all, the computer version is a no brainer.

Rijksmuseum Online Collection Doubles!

Filed under: Art — Patrick Durusau @ 10:21 am

Rijksmuseum Digitizes & Makes Free Online 210,000 Works of Art, Masterpieces Included! by Colin Marshall.

From the post:

We all found it impressive when Amsterdam’s Rijksmuseum put up 125,000 Dutch works of art online. “Users can explore the entire collection, which is handily sorted by artist, subject, style and even by events in Dutch history,” explained Kate Rix in our first post announcing it. “Not only can users create their own online galleries from selected works in the museum’s collection, they can download Rijksmuseum artwork for free to decorate new products.”

I first posted about the Rijksmuseum in 2011, High-Quality Images from the Amsterdam Rijksmuseum, when the collection was “only” 103,000 items.

Significant not only for the quantity of high quality materials but also because you are free to remix the images found here to create your own! Unusual to say the least in these IP maddened times.

Remember there is an API to access this collection, see: https://www.rijksmuseum.nl/nl/api. API keys are free for the asking.

You have to register with the museum (free) and, after creating your login, choose “Advanced settings” while on your profile page. Describe your intended use and request a key.
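
Once you have a key, a first query is short. The sketch below is based on my recollection of the museum’s API documentation; the endpoint, parameters and response fields are assumptions to verify against the API page linked above, not code from the museum.

```python
# Hypothetical first call against the Rijksmuseum collection API; check
# https://www.rijksmuseum.nl/nl/api for the current endpoint and parameters.
import requests

API_KEY = "your-api-key"  # placeholder, obtained via "Advanced settings"

resp = requests.get(
    "https://www.rijksmuseum.nl/api/en/collection",
    params={"key": API_KEY, "q": "Vermeer", "format": "json"},
)
resp.raise_for_status()
for art in resp.json().get("artObjects", []):
    print(art.get("title"), "|", art.get("principalOrFirstMaker"))
```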

BTW, the one improvement I would make to the museum pages would be to make registering in order to obtain an API key a bit more obvious. Say on the “About” page? I didn’t find an obvious page to register. Fortunately the GitHub site mentions it, so start at: https://www.rijksmuseum.nl/nl/mijn/gegevens, create a new account and then you will see “Advanced settings” at the bottom of the full registration page.

Enjoy!

April 8, 2015

Drawing Causal Inference from Big Data

Filed under: BigData,Inference — Patrick Durusau @ 7:03 pm

Drawing Causal Inference from Big Data.

Overview:

This colloquium was motivated by the exponentially growing amount of information collected about complex systems, colloquially referred to as “Big Data”. It was aimed at methods to draw causal inference from these large data sets, most of which are not derived from carefully controlled experiments. Although correlations among observations are vast in number and often easy to obtain, causality is much harder to assess and establish, partly because causality is a vague and poorly specified construct for complex systems. Speakers discussed both the conceptual framework required to establish causal inference and designs and computational methods that can allow causality to be inferred. The program illustrates state-of-the-art methods with approaches derived from such fields as statistics, graph theory, machine learning, philosophy, and computer science, and the talks will cover such domains as social networks, medicine, health, economics, business, internet data and usage, search engines, and genetics. The presentations also addressed the possibility of testing causality in large data settings, and will raise certain basic questions: Will access to massive data be a key to understanding the fundamental questions of basic and applied science? Or does the vast increase in data confound analysis, produce computational bottlenecks, and decrease the ability to draw valid causal inferences?

Videos of the talks are available on the Sackler YouTube Channel. More videos will be added as they are approved by the speakers.

Great material, but I’m in the David Hume camp when it comes to causality, or more properly the sceptical realist interpretation of David Hume. The contemporary claim that ISIS is a social media Svengali is a good case in point. The only two “facts” not in dispute are that ISIS has used social media and that some Westerners have in fact joined up with ISIS.

Both of those facts are true, but to assert a causal link between them borders on the bizarre. Joshua Berlinger reports in The names: Who has been recruited to ISIS from the West that some twenty thousand (20,000) foreign fighters have joined ISIS. That group of foreign fighters hails from ninety (90) countries and thirty-four hundred are from Western states.

Even without Hume’s skepticism on causation, there is no evidence for the proposition that current foreign fighters read about ISIS on social media and therefore decided to join up. None, nada, the empty set. The causal link between social media and ISIS is wholly fictional and made to further other policy goals, like censoring ISIS content.

Be careful how you throw “causality” about when talking about big data or data in general.

The listing of the current videos at YouTube has the author names only and does not include the titles or abstracts. To make these slightly more accessible, I have created the following listing with the author, title (linked to YouTube if available), and Abstract/Slides as appropriate, in alphabetical order by last name. Author names are hyperlinks that identify the authors.

Edo Airoldi, Harvard University, Optimal Design of Causal Experiments in the Presence of Social Interference. Abstract

Susan Athey, Stanford University, Estimating Heterogeneous Treatment Effects Using Machine Learning in Observational Studies. Slides.

Leon Bottou, Facebook AI Research, Causal Reasoning and Learning Systems Abstract

Peter Buhlmann, ETH Zurich, Causal Inference Based on Invariance: Exploiting the Power of Heterogeneous Data Slides

Dean Eckles, Facebook, Identifying Peer Effects in Social Networks Abstract

James Fowler, University of California, San Diego, An 85 Million Person Follow-up to a 61 Million Person Experiment in Social Influence and Political Mobilization. Abstract

Michael Hawrylycz, Allen Institute, Project MindScope:  From Big Data to Behavior in the Functioning Cortex Abstract

David Heckerman, Microsoft Corporation, Causal Inference in the Presence of Hidden Confounders in Genomics Slides.

Michael Jordan, University of California, Berkeley, On Computational Thinking, Inferential Thinking and Big Data . Abstract.

Steven Levitt, The University of Chicago, Thinking Differently About Big Data Abstract

David Madigan, Columbia University, Honest Inference From Observational Database Studies Abstract

Judea Pearl, University of California, Los Angeles, Taming the Challenge of Extrapolation: From Multiple Experiments and Observations to Valid Causal Conclusions Slides

Thomas Richardson, University of Washington, Non-parametric Causal Inference Abstract

James Robins, Harvard University, Personalized Medicine, Optimal Treatment Strategies, and First Do No Harm: Time Varying Treatments and Big Data Abstract

Bernhard Schölkopf, Max Planck Institute, Toward Causal Machine Learning Abstract.

Jasjeet Sekhon, University of California, Berkeley, Combining Experiments with Big Data to Estimate Treatment Effects Abstract.

Richard Shiffrin, Indiana University, The Big Data Sea Change Abstract.

I call your attention to this part of Shiffrin’s abstract:

Second, having found a pattern, how can we explain its causes?

This is the focus of the present Sackler Colloquium. If in a terabyte data base we notice factor A is correlated with factor B, there might be a direct causal connection between the two, but there might be something like 2**300 other potential causal loops to be considered. Things could be even more daunting: To infer probabilities of causes could require consideration all distributions of probabilities assigned to the 2**300 possibilities. Such numbers are both fanciful and absurd, but are sufficient to show that inferring causality in Big Data requires new techniques. These are under development, and we will hear some of the promising approaches in the next two days.

John Stamatoyannopoulos, University of Washington, Decoding the Human Genome:  From Sequence to Knowledge.

Hal Varian, Google, Inc., Causal Inference, Econometrics, and Big Data Abstract.

Bin Yu, University of California, Berkeley, Lasso Adjustments of Treatment Effect Estimates in Randomized Experiments  Abstract.

If you are interested in starting an argument, watch the Steven Levitt video starting at timemark 46:20. 😉

Enjoy!

Four Mistakes To Avoid If You’re Analyzing Data

Filed under: Data Analysis,Plotly — Patrick Durusau @ 9:54 am

Four Mistakes To Avoid If You’re Analyzing Data

The post highlights four (4) common mistakes in analyzing data, with visualizations.

Four (4) seems like a low number, at least in my personal experience. 😉

Still, I am encouraged that the post concludes with:

Analyzing data is not easy. We hope this post helps. Has your team made or avoided any of these mistakes? Do you have suggestions for a future post? Let us know; we’re @plotlygraphs, or email us at feedback at plot dot ly.

I just thought of a common data analysis mistake: reliance on source or authority.

As we saw in Photoshopping Science? Where Was Peer Review?, apparently peer reviewers were too impressed by the author’s status to take a close look at photos submitted with his articles. On later and closer examination, those same photos, as published, revealed problems that should have been caught by the peer reviewers.

Do you spot check all your data sources?

PyCon 2015 Scikit-learn Tutorial

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 8:45 am

PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas.

Abstract:

Machine learning is the branch of computer science concerned with the development of algorithms which can be trained by previously-seen data in order to make predictions about future data. It has become an important aspect of work in a variety of applications: from optimization of web searches, to financial forecasts, to studies of the nature of the Universe.

This tutorial will explore machine learning with a hands-on introduction to the scikit-learn package. Beginning from the broad categories of supervised and unsupervised learning problems, we will dive into the fundamental areas of classification, regression, clustering, and dimensionality reduction. In each section, we will introduce aspects of the Scikit-learn API and explore practical examples of some of the most popular and useful methods from the machine learning literature.

The strengths of scikit-learn lie in its uniform and well-document interface, and its efficient implementations of a large number of the most important machine learning algorithms. Those present at this tutorial will gain a basic practical background in machine learning and the use of scikit-learn, and will be well poised to begin applying these tools in many areas, whether for work, for research, for Kaggle-style competitions, or for their own pet projects.

You can view the tutorial at: PyCon 2015 Scikit-Learn Tutorial Index.
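
If you haven’t used scikit-learn before, the heart of the API the tutorial exercises is the estimator fit/predict pattern. Here is a minimal sketch of my own; the dataset and estimator choices are not taken from Jake’s notebooks.

```python
# A minimal illustration of the scikit-learn fit/predict pattern; the
# dataset and estimator choices are mine, not taken from the tutorial.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# shuffle, then hold out the last 30 samples as a crude test set
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_train, X_test = X[:-30], X[-30:]
y_train, y_test = y[:-30], y[-30:]

clf = KNeighborsClassifier(n_neighbors=3)  # any estimator exposes fit/predict
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out samples
```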

Jake is presenting today (April 8, 2015), so this is very current news!

Enjoy!

April 7, 2015

q – Text as Data

Filed under: CSV,SQL,Text Mining — Patrick Durusau @ 5:03 pm

q – Text as Data by Harel Ben-Attia.

From the webpage:

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

q’s web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.

I’m not looking for an alternative to awk or sed for CSV/TSV files but you may be.

From the examples I suspect it would be “easier” in some sense of the word to teach than either awk or sed.
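
If you want to see the underlying idea without installing anything, the Python standard library can stand in for a rough demonstration. This is not q itself, just the “treat a CSV as a database table” concept that q packages up on the command line; the file name and columns are invented for the example.

```python
# Conceptual stand-in for what q does: load a CSV into an in-memory SQLite
# table and run SQL against it. "sales.csv" and its columns are made up.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
with open("sales.csv", newline="") as f:  # hypothetical file with name,amount columns
    rows = list(csv.DictReader(f))

conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["name"], float(r["amount"])) for r in rows],
)

for name, total in conn.execute(
    "SELECT name, SUM(amount) FROM sales GROUP BY name ORDER BY 2 DESC"
):
    print(name, total)
```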

Give it a try and let me know what you think.

I first saw this in a tweet by Scott Chamberlain.

Federal Data Integration: Dengue Fever

The White House issued a press release today (April 7, 2015) titled: FACT SHEET: Administration Announces Actions To Protect Communities From The Impacts Of Climate Change.

That press release reads in part:


Unleashing Data: As part of the Administration’s Predict the Next Pandemic Initiative, in May 2015, an interagency working group co-chaired by OSTP, the CDC, and the Department of Defense will launch a pilot project to simulate efforts to forecast epidemics of dengue – a mosquito-transmitted viral disease affecting millions of people every year, including U.S. travelers and residents of the tropical regions of the U.S. such as Puerto Rico. The pilot project will consolidate data sets from across the federal government and academia on the environment, disease incidence, and weather, and challenge the research and modeling community to develop predictive models for dengue and other infectious diseases based on those datasets. In August 2015, OSTP plans to convene a meeting to evaluate resulting models and showcase this effort as a “proof-of-concept” for similar forecasting efforts for other infectious diseases.

I tried finding more details on earlier workshops in this effort, but limiting the search to “Predict the Next Pandemic Initiative” and the domain to “.gov,” I got two “hits,” one of which was the press release I cite above.

I sent a message (webform) to the White House Office of Science and Technology Policy (OSTP) and will update you with any additional information that arrives.

Of course my curiosity is about the means used to integrate the data sets. Once integrated, such data sets can be re-used, at least until it is time to integrate additional data sets. Bearing in mind that dirty data can lead to poor decision making, I would rather not duplicate the cleaning of data time after time.

33% of Poor Business Decisions Track Back to Data Quality Issues

Filed under: BigData,Data,Data Quality — Patrick Durusau @ 3:46 pm

Stupid errors in spreadsheets could lead to Britain’s next corporate disaster by Rebecca Burn-Callander.

From the post:

Errors in company spreadsheets could be putting billions of pounds at risk, research has found. This is despite high-profile spreadsheet catastrophes, such as the collapse of US energy giant Enron, ringing alarm bells more than a decade ago.

Almost one in five large businesses have suffered financial losses as a result of errors in spreadsheets, according to F1F9, which provides financial modelling and business forecasting to blue chips firms. It warns of looming financial disasters as 71pc of large British business always use spreadsheets for key financial decisions.

The company’s new whitepaper entitiled Capitalism’s Dirty Secret showed that the abuse of humble spreadsheet could have far-reaching consequences. Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion and the UK manufacturing sector uses spreadsheets to make pricing decisions for up to £170bn worth of business.

Felienne Hermans, of Delft University of Technology, analysed 15,770 spreadsheets obtained from over 600,000 emails from 158 former employees. He found 755 files with more than a hundred errors, with the maximum number of errors in one file being 83,273.

Dr Hermans said: “The Enron case has given us a unique opportunity to look inside the workings of a major corporate organisation and see first hand how widespread poor spreadsheet practice really is.

First, a gender correction, Dr. Hermans is not a he. The post should read: “She found 755 files with more than….

Second, how bad is poor spreadsheet quality? The download page has this summary:

  • 33% of large businesses report poor decision making due to spreadsheet problems.
  • Nearly 1 in 5 large businesses have suffered direct financial loss due to poor spreadsheets.
  • Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion.

You read that correctly: not that 33% of spreadsheets have quality issues but that 33% of poor business decisions can be traced to spreadsheet problems.

A comment to the blog post supplied a link for the report: A Research Report into the Uses and Abuses of Spreadsheets.

Spreadsheets are small to medium sized data.

Care to comment on the odds of big data and its processes pushing the percentage of poor business decisions past 33%?

How would you discover you are being misled by big data and/or its processing?

How do you validate the results of big data? Run another big data process?

When you hear sales pitches about big data, be sure to ask about the impact of dirty data. If assured that your domain doesn’t have a dirty data issue, grab your wallet and run!

PS: A Research Report into the Uses and Abuses of Spreadsheets is a must have publication.

The report itself is useful, but Appendix A 20 Principles For Good Spreadsheet Practice is a keeper. With a little imagination all of those principles could be applied to big data and its processing.

Just picking one at random:

3. Ensure that everyone involved in the creation or use of spreadsheet has an appropriate level of knowledge and understanding.

For big data, reword that to:

Ensure that everyone involved in the creation or use of big data has an appropriate level of knowledge and understanding.

Your IT staff are trained, but do the managers who will use the results understand the limitations of the data and/or its processing? Or do they follow the results because “the data says so?”

Download 422 Free Art Books from The Metropolitan Museum of Art

Filed under: Art,Museums — Patrick Durusau @ 2:54 pm

Download 422 Free Art Books from The Metropolitan Museum of Art by Colin Marshall.

From the post:


You could pay $118 on Amazon for the Metropolitan Museum of Art’s catalog The Art of Illumination: The Limbourg Brothers and the Belles Heures of Jean de France, Duc de Berry. Or you could pay $0 to download it at MetPublications, the site offering “five decades of Met Museum publications on art history available to read, download, and/or search for free.” If that strikes you as an obvious choice, prepare to spend some serious time browsing MetPublications’ collection of free art books and catalogs.

Judging from the speed of my download today, this is a really popular announcement!

Stash this with your other links for art, artwork, etc. as resources for a topic map.

Exploring the Unknown Frontier of the Brain

Filed under: Neural Information Processing,Neural Networks,Neuroinformatics,Science — Patrick Durusau @ 1:33 pm

Exploring the Unknown Frontier of the Brain by James L. Olds.

From the post:

To a large degree, your brain is what makes you… you. It controls your thinking, problem solving and voluntary behaviors. At the same time, your brain helps regulate critical aspects of your physiology, such as your heart rate and breathing.

And yet your brain — a nonstop multitasking marvel — runs on only about 20 watts of energy, the same wattage as an energy-saving light bulb.

Still, for the most part, the brain remains an unknown frontier. Neuroscientists don’t yet fully understand how information is processed by the brain of a worm that has several hundred neurons, let alone by the brain of a human that has 80 billion to 100 billion neurons. The chain of events in the brain that generates a thought, behavior or physiological response remains mysterious.

Building on these and other recent innovations, President Barack Obama launched the Brain Research through Advancing Innovative Neurotechnologies Initiative (BRAIN Initiative) in April 2013. Federally funded in 2015 at $200 million, the initiative is a public-private research effort to revolutionize researchers’ understanding of the brain.

James reviews currently funded efforts under the BRAIN Initiative, each of which is pursuing possible ways to explore, model and understand brain activity. Exploration in its purest sense. The researchers don’t know what they will find.

I don’t expect the leap from not understanding the 302 neurons in a worm to understanding the 80 to 100 billion neurons in each person to happen anytime soon. Just as well, think of all the papers, conferences and publications along the way!

April 6, 2015

Evolving Parquet as self-describing data format –

Filed under: Drill,MapR,Parquet — Patrick Durusau @ 7:08 pm

Evolving Parquet as self-describing data format – New paradigms for consumerization of Hadoop data by Neeraja Rentachintala.

From the post:

With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is what should be the storage format to persist data in Hadoop. The data format selection is a critical decision especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations and the work we have been doing as part of the Parquet community to evolve Parquet as fully self-describing format.

About Parquet

Apache Parquet is a columnar storage format for the Hadoop ecosystem. Since its inception about 2 years ago, Parquet has gotten very good adoption due to the highly efficient compression and encoding schemes used that demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of any data processing framework, data model, and programming language used in Hadoop ecosystem. A variety of tools and frameworks including MapReduce, Hive, Impala, and Pig provided the ability to work with Parquet data and a number of data models such as AVRO, Protobuf, and Thrift have been expanded to be used with Parquet as storage format. Parquet is widely adopted by a number of major companies including tech giants such as Twitter and Netflix.

Self-describing data formats and their growing role in analytics on Hadoop/NoSQL

Self-describing data is where schema or structure is embedded in the data itself. The schema is comprised of metadata such as element names, data types, compression/encoding scheme used (if any), statistics, and a lot more. There are a variety of data formats including Parquet, XML, JSON, and NoSQL databases such as HBase that belong to the spectrum of self-describing data and typically vary in the level of metadata they expose about themselves.

While the self-describing data has been in rise with NoSQL databases (e.g., the Mongo BSON model) for a while now empowering developers to be agile and iterative in application development cycle, the prominence of these has been growing in analytics as well when it comes to Hadoop. So what is driving this? The answer is simple – it’s the same reason – i.e., the requirement to be agile and iterative in BI/analytics.

More and more organizations are now using Hadoop as a data hub to store all their data assets. These data assets often contain existing datasets offloaded from the traditional DBMS/DWH systems, but also new types of data from new data sources (such as IOT sensors, logs, clickstream) including external data (such as social data, 3rd party domain specific datasets). The Hadoop clusters in these organizations are often multi-tenant and shared by multiple groups in the organizations. The traditional data management paradigms of creating centralized models/metadata definitions upfront before the data can be used for analytics are quickly becoming bottlenecks in Hadoop environments. The new complex and schema-less data models are hard to map to relational models and modeling data upfront for unknown ad hoc business questions and data discovery needs is challenging and keeping up with the schema changes as the data models evolve is practically impossible.

By pushing metadata to data and then using tools that can understand this metadata available in self-describing formats to expose it directly for end user consumption, the analysis life cycles can become drastically more agile and iterative. For example, using Apache Drill, the world’s first schema-free SQL query engine, you can query self-describing data (in files or NoSQL databases such as HBase/MongoDB) immediately without having to define and manage schema overlay definitions in centralize metastores. Another benefit of this is business self-service where the users don’t need to rely on IT departments/DBAs constantly for adding/changing attributes to centralized models, but rather focus on getting answers to the business questions by performing queries directly on raw data.

Think of it this way, Hadoop scaled processing by pushing processing to the nodes that have data. Analytics on Hadoop/NoSQL systems can be scaled to the entire organization by pushing more and more metadata to the data and using tools that leverage that metadata automatically to expose it for analytics. The more self-describing the data formats are (i.e., the more metadata they contain about data), the smarter the tools that leverage the metadata can be.

The post walks through example cases and points to additional resources.

To become self-describing, Parquet will need to move beyond assigning data types to tokens. In the example given, “amount” has the datatype “double,” but that doesn’t tell me if we are discussing grams, Troy ounces (for precious metals), carats or pounds.
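
To make the point concrete, here is a sketch using pyarrow (my choice of library, not anything mentioned in the post) of the kind of field-level metadata that would have to ride along with the type before “amount” says anything about units; the “unit” key is my own invention, not a standard.

```python
# Sketch only: the Parquet/Arrow type system records that "amount" is a
# double; any notion of units has to ride along as key/value metadata that
# we define ourselves (the "unit" key is not a standard).
import pyarrow as pa

amount = pa.field("amount", pa.float64(), metadata={"unit": "troy_ounce"})
schema = pa.schema([pa.field("item", pa.string()), amount])  # ready for a Parquet writer

print(amount.type)      # double: all the format itself guarantees
print(amount.metadata)  # {b'unit': b'troy_ounce'}: the semantics we bolted on
```

Readers of the file would still have to agree on what that key means, which is exactly the semantic gap the post is pointing at.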

We all need to start following the work on self-describing data formats more closely.

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th

Filed under: Government,Indexing,Law,Law - Sources,Research Methods,Search Requirements — Patrick Durusau @ 6:43 pm

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th by Steven M Barkan; Barbara Bintliff; Mary Whisner. (ISBN-13: 9781609300562)

Description:

This classic textbook has been updated to include the latest methods and resources. Fundamentals of Legal Research provides an authoritative introduction and guide to all aspects of legal research, integrating electronic and print sources. The Tenth Edition includes chapters on the true basics (case reporting, statutes, and so on) as well as more specialized chapters on legislative history, tax law, international law, and the law of the United Kingdom. A new chapter addresses Native American tribal law. Chapters on the research process, legal writing, and citation format help integrate legal research into the larger process of solving legal problems and communicating the solutions. This edition includes an updated glossary of research terms and revised tables and appendixes. Because of its depth and breadth, this text is well suited for advanced legal research classes; it is a book that students will want to retain for future use. Moreover, it has a place on librarians’ and attorneys’ ready reference shelves. Barkan, Bintliff and Whisner’s Assignments to Fundamentals of Legal Research complements the text.

I haven’t seen this volume in hard copy but if you are interested in learning what connections researchers are looking for with search tools, law is a great place to start.

The purpose of legal research isn't to find the most popular "fact" (Google), or to find every term for a "fact" ever tweeted (Twitter), but rather to find facts and their relationships to other facts, which together flesh out a legal view of a situation in context.

If you think about it, putting legislation, legislative history, court records and decisions, along with non-primary sources, online is barely a start towards making that information "accessible." It is a necessary first step, but not sufficient for meaningful access.

Combining the power of R and D3.js

Filed under: D3,R,Visualization — Patrick Durusau @ 6:10 pm

Combining the power of R and D3.js by Andries Van Humbeeck.

From the post:

According to wikipedia, the amount of unstructured data might account for more than 70%-80% of all data in organisations. Because everyone wants to find hidden treasures in these mountains of information, new tools for processing, analyzing and visualizing data are being developed continually. This post focuses on data processing with R and visualization with the D3 JavaScript library.

Great post with fully worked examples of using R with D3.js to create interactive graphics.
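The post's worked examples are in R. Purely for comparison, the data-processing half of the same workflow, ending with a JSON file for D3 to load, might look like this on the Python side (a sketch; pandas is assumed and the file and column names are invented):

    # Sketch of the processing step before handing data to D3.
    # Assumes pandas; "sales.csv" and its columns are invented for illustration.
    import pandas as pd

    df = pd.read_csv("sales.csv")
    summary = (df.groupby("region", as_index=False)["amount"]
                 .sum()
                 .sort_values("amount", ascending=False))

    # D3 reads plain JSON, so write the processed result out for d3.json() to fetch.
    summary.to_json("sales_by_region.json", orient="records")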

It is unfortunate that the post uses the phrase "immutable images." A more useful dichotomy is static versus interactive, which would also lower the number of false positives for anyone searching on "immutable."

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

Scikit-Learn 0.16 release

Filed under: Machine Learning,Scikit-Learn — Patrick Durusau @ 2:24 pm

Scikit-Learn 0.16 is out!

Highlights:

BTW, improvements are already being listed for Scikit-Learn 0.17.
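As a quick post-upgrade smoke test, something along these lines should work (a sketch; CalibratedClassifierCV is, as I recall, one of the 0.16 additions, so treat that attribution as my recollection rather than the release notes):

    # Sketch: confirm the installed version and exercise a recent addition.
    import sklearn
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    print(sklearn.__version__)

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    clf = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3)
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]))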

Fast Lane to Python…

Filed under: Programming,Python — Patrick Durusau @ 12:43 pm

Fast Lane to Python – A quick, sensible route to the joys of Python coding by Norm Matloff.

From the preface:

My approach here is different from that of most Python books, or even most Python Web tutorials. The usual approach is to painfully go over all details from the beginning, with little or no context. For example, the usual approach would be to first state all possible forms that a Python integer can take on, all possible forms a Python variable name can have, and for that matter how many different ways one can launch Python with.

I avoid this here. Again, the aim is to enable the reader to quickly acquire a Python foundation. He/she should then be able to delve directly into some special topic if and when the need arises. So, if you want to know, say, whether Python variable names can include underscores, you’ve come to the wrong place. If you want to quickly get into Python programming, this is hopefully the right place. (emphasis in the original)

You may know Norm Matloff as the author of Algorithms to Z-Scores:… or Programming on Parallel Machines, both open source textbooks.

What do you think about Norm's approach to teaching Python? Note that we don't teach children language by sitting them down with a grammar, but through corrected usage, and lots of it. At some point they learn, or can look up, the edge cases. Are there parallels to Norm's approach?

I first saw this in a tweet by Christophe Lalanne.

Down the Clojure Rabbit Hole

Filed under: Clojure,Functional Programming — Patrick Durusau @ 7:08 am

Down the Clojure Rabbit Hole by Christophe Grand.

From the description:

Christophe Grand tells Clojure stories full of immutability, data over behavior, relational programming, declarativity, incrementalism, parallelism, collapsing abstractions, harmful local state and more.

A personal journey down the Clojure rabbit hole.

Not an A to B type talk but with that understanding, it is entertaining and useful.

Slides and MP3 are available for download.

Enjoy!

April 5, 2015

How to format Python code without really trying

Filed under: Programming,Python — Patrick Durusau @ 7:08 pm

How to format Python code without really trying by Bill Wendling.

From the post:

Years of writing and maintaining Python code have taught us the value of automated tools for code formatting, but the existing ones didn’t quite do what we wanted. In the best traditions of the open source community, it was time to write yet another Python formatter.

YAPF takes a different approach to formatting Python code: it reformats the entire program, not just individual lines or constructs that violate a style guide rule. The ultimate goal is to let engineers focus on the bigger picture and not worry about the formatting. The end result should look the same as if an engineer had worried about the formatting.

You can run YAPF on the entire program or just a part of the program. It’s also possible to flag certain parts of a program which YAPF shouldn’t alter, which is useful for generated files or sections with large literals.
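For a sense of how the selective formatting works, here is a small sketch (the constants are invented; yapf is installed separately, e.g. via pip). Run yapf --in-place example.py to reformat the whole file, pass --lines to restrict it to a range, or bracket regions that must be left alone with the disable/enable comments:

    # example.py -- sketch of YAPF's region markers (constants invented).
    # Reformat in place with:   yapf --in-place example.py
    # or only a line range:     yapf --in-place --lines 1-10 example.py

    LOOKUP = {'a': 1, 'b': 2,
              'c': 3}  # yapf is free to reflow this literal

    # yapf: disable
    IDENTITY = [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
    ]  # hand-aligned; the marker above tells yapf to leave this block untouched
    # yapf: enable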

One step towards readable code!

Enjoy!

Photoshopping Science? Where Was Peer Review?

Filed under: Bioinformatics,Peer Review,Science — Patrick Durusau @ 6:46 pm

Too Much to be Nothing? by Leonid Schneider.

From the post:

(March 24th, 2015) Already at an early age, Olivier Voinnet had achieved star status among plant biologists – until suspicions arose last year that more than 30 of his publications contained dubious images. Voinnet’s colleagues are shocked – and demand an explanation.

Several months ago, a small group of international plant scientists set themselves the task of combing through the relevant literature for evidence of potential data manipulation. They posted their discoveries on the post-publication peer review platform PubPeer. As one of these anonymous scientists (whose real name is known to Laborjournal/Lab Times) explained, all this detective work was accomplished simply by taking a good look at the published figures. Soon, the scientists stumbled on something unexpected: putative image manipulations in the papers of one of the most eminent scientists in the field, Sir David Baulcombe. Even more strikingly, all these suspicious publications (currently seven, including papers in Cell, PNAS and EMBO J) featured his former PhD student, Olivier Voinnet, as first or co-author.

Baulcombe’s research group at The Sainsbury Laboratory (TSL) in Norwich, England, has discovered nothing less than RNA interference (RNAi) in plants, the famous viral defence mechanism, which went on to revolutionise biomedical research as a whole and the technology of controlled gene silencing in particular. Olivier Voinnet himself also prominently contributed to this discovery, which certainly helped him, then only 33 years old, to land a research group leader position at the CNRS Institute for Plant Molecular Biology in Strasbourg, in his native country, France. During his time in Strasbourg, Voinnet won many prestigious prizes and awards, such as the ERC Starting Grant and the EMBO Young Investigator Award, plus the EMBO Gold Medal. Finally, at the end of 2010, the Swiss Federal Institute of Technology (ETH) in Zürich appointed the 38-year-old EMBO Member as Professor of RNA biology. Shortly afterwards, Voinnet was awarded the well-endowed Max Rössler Prize of the ETH.

Disturbing news from the plant sciences: evidence of photo manipulation in published articles.

The post examines the charges at length and indicates what is and is not known at this juncture. Investigations are underway and reports from those investigations will appear in the future.

A step that could be taken now, since the articles in question (about 20) have been published, would be for the journals to disclose the peer reviewers who failed to catch the photo manipulation.

The premise of peer review is holding an author responsible for the content of their article, so it is only fair to hold peer reviewers responsible for articles approved by their reviews.

Peer review isn’t much of a gate keeper if it is unable to discover false information or even patterns of false information prior to publication.

I haven’t been reading Lab Times on a regular basis but it looks like I need to correct that oversight.

Key Documents on the Iran Deal

Filed under: Government,Politics — Patrick Durusau @ 3:58 pm

Key Documents on the Iran Deal by R. Taj Moore.

Just in case you are looking for primary materials on the proposed “Iran Deal,” concerning the development of nuclear power by Iran, this is the spot!

As primary materials go, there isn't much. Probably less than ten (10) pages. You will be amazed at how many column inches will be spun from such a thin source.

As an exercise in detecting puffery, compare the U.S. fact sheet to the resulting commentary.

Building a complete Tweet index

Filed under: Indexing,Searching,Twitter — Patrick Durusau @ 10:46 am

Building a complete Tweet index by Yi Zhuang.

Since it is Easter Sunday in many religious traditions, what could be more inspirational than “…a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.”?

From the post:

Today [11/8/2014], we are pleased to announce that Twitter now indexes every public Tweet since 2006.

Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.

This new infrastructure enables many use cases, providing comprehensive results for entire TV and sports seasons, conferences (#TEDGlobal), industry discussions (#MobilePayments), places, businesses and long-lived hashtag conversations across topics, such as #JapanEarthquake, #Election2012, #ScotlandDecides, #HongKong, #Ferguson and many more. This change will be rolling out to users over the next few days.

In this post, we describe how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.

The most important factors in our design were:

  • Modularity: Twitter already had a real-time index (an inverted index containing about a week’s worth of recent Tweets). We shared source code and tests between the two indices where possible, which created a cleaner system in less time.
  • Scalability: The full index is more than 100 times larger than our real-time index and grows by several billion Tweets a week. Our fixed-size real-time index clusters are non-trivial to expand; adding capacity requires re-partitioning and significant operational overhead. We needed a system that expands in place gracefully.
  • Cost effectiveness: Our real-time index is fully stored in RAM for low latency and fast updates. However, using the same RAM technology for the full index would have been prohibitively expensive.
  • Simple interface: Partitioning is unavoidable at this scale. But we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.
  • Incremental development: The goal of “indexing every Tweet” was not achieved in one quarter. The full index builds on previous foundational projects. In 2012, we built a small historical index of approximately two billion top Tweets, developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.

If you are interested in scaling search issues, this is a must read post!
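To make the partitioning idea concrete, here is a toy sketch (mine, not Twitter's; the class and the per-year bucketing are invented for illustration) of hiding time-based partitions behind a single search interface. The real system replaces the linear scan with inverted indexes and merges results from many partitions, but the single-endpoint shape is the same:

    # Toy illustration (not Twitter's code) of hiding time-based partitions
    # behind one query interface, as the post describes.
    from collections import defaultdict
    from datetime import datetime


    class PartitionedIndex:
        """Routes documents and queries to per-year partitions."""

        def __init__(self):
            self.partitions = defaultdict(list)  # year -> list of (timestamp, text)

        def add(self, timestamp, text):
            self.partitions[timestamp.year].append((timestamp, text))

        def search(self, term, limit=10):
            # Fan out to partitions newest-first and merge results.
            hits = []
            for year in sorted(self.partitions, reverse=True):
                hits.extend((ts, text) for ts, text in self.partitions[year]
                            if term.lower() in text.lower())
                if len(hits) >= limit:
                    break
            return sorted(hits, reverse=True)[:limit]


    index = PartitionedIndex()
    index.add(datetime(2006, 3, 21), "just setting up my twttr")
    index.add(datetime(2014, 11, 18), "announcing the complete Tweet index")
    print(index.search("index"))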

Kudos to Twitter Engineering!

PS: Of course all we need now is a complete index to Hillary Clinton’s emails. The NSA probably has a copy.

You know, the NSA could keep the same name, National Security Agency, and take over providing backups and verification for all email and web traffic, including the cloud. Would have to work on who could request copies but that would resolve the issue of backups of the Internet rather neatly. No more deleted emails, tweets, etc.

That would be a useful function, as opposed to harvesting phone data on the premise that at some point in the future it might prove to be useful, despite having not proved useful in the past.

April 4, 2015

Handbook of Applied Cryptography

Filed under: Cryptography,Cybersecurity — Patrick Durusau @ 7:32 pm

Handbook of Applied Cryptography by Alfred J. Menezes, Paul C. van Oorschot and Scott A. Vanstone.

Use as a historical reference only. This is the fifth reprinting (2005) of the 1996 edition, so some of the information is nearly twenty (20) years out of date.

Still, it should make for a useful read.

CRC Press has generously given us permission to make all chapters available for free download.

Please read this copyright notice before downloading any of the chapters.

  • Table of Contents (ps, pdf)
  • Chapter 1 – Overview of Cryptography (ps, pdf)
  • Chapter 2 – Mathematics Background (ps, pdf)
  • Chapter 3 – Number-Theoretic Reference Problems (ps, pdf)
  • Chapter 4 – Public-Key Parameters (ps, pdf)
  • Chapter 5 – Pseudorandom Bits and Sequences (ps, pdf)
  • Chapter 6 – Stream Ciphers (ps, pdf)
  • Chapter 7 – Block Ciphers (ps, pdf)
  • Chapter 8 – Public-Key Encryption (ps, pdf)
  • Chapter 9 – Hash Functions and Data Integrity (ps, pdf)
  • Chapter 10 – Identification and Entity Authentication (ps, pdf)
  • Chapter 11 – Digital Signatures (ps, pdf)
  • Chapter 12 – Key Establishment Protocols (ps, pdf)
  • Chapter 13 – Key Management Techniques (ps, pdf)
  • Chapter 14 – Efficient Implementation (ps, pdf)
  • Chapter 15 – Patents and Standards (ps, pdf)
  • Appendix – Bibliography of Papers from Selected Cryptographic Forums (ps, pdf)
  • References (ps, pdf)
  • Index (ps, pdf)

April 3, 2015

e451 and Senator Dianne Feinstein

Filed under: Government,Politics,Security — Patrick Durusau @ 3:49 pm


Senator Dianne Feinstein is attempting to whip up support for an e451 (a real life version of Fahrenheit 451). Speaking of recent arrests in New York she commented:

I am particularly struck that the alleged bombers made use of online bombmaking guides like the Anarchist Cookbook and Inspire Magazine. These documents are not, in my view, protected by the First Amendment and should be removed from the Internet.

Can you guess where the alleged bombers got their copy of the Anarchist Cookbook?

Christopher Ingraham (Washington Post), in Dianne Feinstein says the Anarchist’s Cookbook should be “removed from the Internet”, points to the complaint in this case, which says the undercover (UC) agent was the source of the Anarchist Cookbook.

But the details are even better.

Page 9, paragraph 23:

On or about August 17, 2014, SIDDIQUI and the UC went to a public library. SIDDIQUI states to the UC that she remembered the conversation about “science” and that they should consult chemistry books for beginners books at the library. At the library, SIDDIQUI looked up the on-line catalog for chemistry books for beginners, but stated there were a limited number of books about chemistry.

Page 12, paragraph 35 reads in part:

The UC and VELENTZAS then discussed the fact that the UC had downloaded The Anarchist Cookbook.9 VELENTZAS suggested that the UC print out the parts of the book they would need. During the conversation, the UC stated, “We read chemistry books with breakfast. Like, who does that?” VELENTZAS responded, “People who want to make history.”

Page 14, paragraph 44 reads in part:

…VELENTZAS asked whether the UC still had the PDF file—referring to The Anarchist Cookbook. The UC responded affirmatively and indicated he/she had printed the relevant portions for the group.

All told, the complaint runs twenty-nine (29) pages, double-spaced, so you really should read it for yourself. Even with assistance from the UC, practicing soldering (I’m not joking), reading chemistry books, and viewing YouTube videos, the defendants posed more of a danger to themselves than to anyone else.

The author of the Anarchist Cookbook has called for it to go out of print because its underlying premise was flawed (I wrote the Anarchist Cookbook in 1969. Now I see its premise as flawed, by William Powell):


Over the years, I have come to understand that the basic premise behind the Cookbook is profoundly flawed. The anger that motivated the writing of the Cookbook blinded me to the illogical notion that violence can be used to prevent violence. I had fallen for the same irrational pattern of thought that led to US military involvement in both Vietnam and Iraq. The irony is not lost on me.

You do know that Dianne voted to support the invasion of Iraq. Yes? Now that the war is unpopular, she was “mis-led.”

Was Dianne mis-led in August of 2014 when she said the following?

Feinstein Statement on Airstrikes Against ISIL

“I strongly support the president’s authorization for airstrikes against ISIL. This is not a typical terrorist organization–it is a terrorist army, operating with military expertise, advancing across Iraq and rapidly consolidating its position.

“ISIL is capturing new Iraqi towns every day, is reported to be in control of Mosul Dam and is engaging in a campaign of ethnic cleansing that appears to be attempted genocide. I believe that once this group solidifies its hold on what it calls the Islamic State, its next target may be Baghdad.

“It has become clear that ISIL is recruiting fighters in Western countries, training them to fight its battles in the Middle East and possibly returning them to European and American cities to attack us in our backyard. We simply cannot allow this to happen.

“It takes an army to defeat an army, and I believe that we either confront ISIL now or we will be forced to deal with an even stronger enemy in the future. Inaction is no longer an option. I support actions by the administration to coordinate efforts with Iraq and other allies to use our military strength and targeting expertise to the fullest extent possible.”

The “moral calculus” of Dianne Feinstein seems to run something like this:

Use of violence by the United States: Great!

Use of violence by others against the United States: No!, No!, No!

Powell gets it right when he says that violence won’t stop violence, and that refutes Dianne’s first premise.

Violence against ordinary citizens and local police forces is unjustifiable. They have no more of a voice in acts of foreign violence by the United States than the victims of that violence. Let’s just leave it at that for her second premise.

PS: With regard to the Anarchist Cookbook, yes, you can find it online, or you can order a hard copy, along with similar titles from Delta Press.

A word of warning: DIY explosives are more dangerous to you than to any intended target. It is possible to make homemade explosives, but why bother? You can buy/steal explosives and professional detonators far more safely than you can follow the DIY route. The smudged diagrams, bad photocopies, etc. give the appearance of forbidden or secret knowledge. Don’t fall for it.

No, I’m not going to tell you where/how to buy such things but will say that anyone who offers bomb making advice, components, vehicles or entire bombs, is likely an undercover agent.

Update: Apologies, I forgot to insert the link to the complaint.
