Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 30, 2015

Baltimore Burning and Verification

Filed under: News,Reporting — Patrick Durusau @ 7:28 pm

Baltimore ‘looting’ tweets show importance of quick and easy image checks by Eoghan mac Suibhne.

From the post:

Anyone who has ever asked me for tips on content verification and debunking of fakes knows one of the first things I always mention is reverse image search. It’s one of the simplest and most powerful tools at your disposal. This week provided another good example of how overlooked it is.

Unrest in Baltimore, like any other dramatic event these days, created a surge of activity on social media. In the age of the selfie and ubiquitous cameras, many people have become compulsive chroniclers of all their activities — sometimes unwisely so.

Reactions ranged from shock and disgust to disbelief and amusement when a series of images started to circulate showing looters proudly displaying their ill-gotten gains. Not all, however, was as it seemed.

(emphasis in original)

I often get asked about the fundamentals of verification, and one of the first things I always mention is the ability — and indeed the reflex — to always perform a reverse image search. I also mention, only half-jokingly, that this should possibly even be added to the school curriculum. It’s not as if it would take up much of the school year; it can be taught in approximately 30 seconds.

In the case of the trashed KFC above, a quick check via Google reverse image search or Tineye showed that the photo was taken in Karachi, Pakistan, in 2012.

[Screenshot: the Google Images search box, with camera and microphone icons]

Don’t be confused by the “reverse image search” terminology. What you see on Google Images is the standard search box, which includes camera and microphone icons. Choose the camera icon and you will be given the opportunity to search using an image. Paste in an image URL and search. Simple as that.
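If you want to build the habit into a script rather than a browser bookmark, here is a minimal Python 3 sketch. The query-parameter formats for Google and TinEye below are assumptions based on how the services have behaved, not documented APIs, and the image URL is a placeholder, so verify both before relying on it.

# Open reverse image searches for a suspect photo in the default browser.
# The query-parameter formats are assumptions, not documented APIs.
import urllib.parse
import webbrowser

def reverse_image_search(image_url):
    quoted = urllib.parse.quote(image_url, safe="")
    google = "https://www.google.com/searchbyimage?image_url=" + quoted
    tineye = "https://tineye.com/search?url=" + quoted
    for search_url in (google, tineye):
        webbrowser.open(search_url)

reverse_image_search("http://example.com/suspect-looting-photo.jpg")  # placeholder URL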

Imagine describing a standard Google search as a “Google reverse word search.” Confusion and hilarity would ensue pretty quickly.

Develop a habit of verification.

You will have fewer occasions to say, “That’s my opinion and I am entitled to it,” in the face of contrary evidence.

The NYT and Your Security Guardians At Work

Filed under: News,Reporting — Patrick Durusau @ 4:38 pm

Mark Liberman, in R.I.P. Jack Ely, quotes rather extensively from Sam Roberts, “Jack Ely, Who Sang the Kingsmen’s ‘Louie Louie’, Dies at 71“, NYT 4/29/2015, which includes this snippet:


High school and college students who thought they understood what Mr. Ely was singing traded transcripts of their meticulously researched translations of the lyrics. The F.B.I. began investigating after an Indiana parent wrote to Attorney General Robert F. Kennedy in 1964: “My daughter brought home a record of ‘LOUIE LOUIE’ and I, after reading that the record had been banned on the air because it was obscene, proceeded to try to decipher the jumble of words. The lyrics are so filthy that I cannot enclose them in this letter.”

The F.B.I. Laboratory’s efforts at decryption were less fruitful. After more than two years and a 455-page report, the bureau concluded that “three governmental agencies dropped their investigations because they were unable to determine what the lyrics of the song were, even after listening to the records at speeds ranging from 16 r.p.m. to 78 r.p.m.”

It is true that Louie Louie was recorded by the Kingsmen, with Jack Ely as lead singer. It is also true that the FBI, which currently protects you from domestic terrorists and emotionally disturbed teenagers, did an obscenity investigation of the song but concluded the lyrics were incomprehensible.

Where the NYT drops the ball is in attributing a 455-page report to the FBI. You can view the FBI report at: FBI Records: The Vault, under SUBJECT: LOUIE, LOUIE (THE 60’s SONG).

Like the Internet of Things, PDF viewers don’t lie, and the page count for the FBI report comes to one hundred and nineteen (119) pages. Of course, the NYT did not have a link to the FBI report, or else one of its proofreaders could have verified that claim.

The inaccuracy doesn’t affect the story, except that the NYT doesn’t share where it saw a 455-page report from the FBI. Anything is possible and there may be such a report. But without a hyperlink, you know, those things that point to locations on the web, we won’t ever know.

What does the NYT gain by not gracing its readers with links to original materials? There are numerous NYT articles that do, so you have to wonder why it doesn’t happen in all cases.

Suggested rule for New York Times reporters: If you cite a publicly available document or written statement, include a link to the original at the first mention of the document or statement in your story. (Some of us want to know more than will fit into your story.)

What Should Remain After Retraction?

Filed under: Archives,Citation Practices,Preservation — Patrick Durusau @ 3:39 pm

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this? http://pubs.acs.org/doi/abs/10.1021/ja910615z

Here is what you get instead of the front page:

[Screenshot: what JACS displays in place of the retracted article’s front page]

A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then-current scientific context. Prior to retraction such an article is part of the overall scientific context in a field. Editing that context post-publication is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places authors who cited the retracted publication in an untenable position. Their citations of a retracted work are suspect, with no opportunity to defend their citations.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

New Survey Technique! Ask Village Idiots

Filed under: Artificial Intelligence,News,Survey — Patrick Durusau @ 1:38 pm

I was deeply disappointed to see Scientific Computing with the headline: ‘Avengers’ Stars Wary of Artificial Intelligence by Ryan Pearson.

The respondents are all talented movie stars but acting talent and even celebrity doesn’t give them insight into issues such as artificial intelligence. You might as well ask football coaches about the radiation hazards of a possible mission to Mars. Football coaches, the winning ones anyway, are bright and intelligent folks, but as a class, aren’t the usual suspects to ask about inter-planetary radiation hazards.

President Reagan was known to confuse movies with reality but that was under extenuating circumstances. Confusing people acting in movies with people who are actually informed on a subject doesn’t make for useful news reporting.

Asking Chris Hemsworth, who plays Thor in Avengers: Age of Ultron, what the residents of Asgard think about relief efforts for victims of the recent earthquake in Nepal would be just as meaningful.

They still publish the National Enquirer. A much better venue for “surveys” of the uninformed.

Pwning a thin client in less than two minutes

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:54 am

Pwning a thin client in less than two minutes by Roberto Suggi Liverani

From the post:

Have you ever encountered a zero client or a thin client? It looks something like this…

[Image: HP T520 thin client]

If yes, keep reading below; if not, then if you encounter one, you know what you can do if you read below…

The model above is a T520, produced by HP – this model and other similar models are typically employed to support a medium/large VDI (Virtual Desktop Infrastructure) enterprise.

These clients run a Linux-based HP ThinPro OS by default and I had a chance to play with image version T6X44017 in particular, which is fun to play with it, since you can get a root shell in a very short time without knowing any password…

Normally, HP ThinPro OS interface is configured in a kiosk mode, as the concept of a thin/zero client is based on using a thick client to connect to another resource. For this purpose, a standard user does not need to authenticate to the thin client per se and would just need to perform a connection – e.g. VMware Horizon View. The user will eventually authenticate through the connection.

The point of this blog post is to demonstrate that a malicious actor can compromise such thin clients in a trivial and quick way provided physical access, a standard prerequisite in an attack against a kiosk.

During my testing, I have tried to harden as much as possible the thin client, with the following options:

Physical security is a commonly overlooked aspect of network security. That was true almost twenty (20) years ago when I was a Novell CNE and that hasn’t changed since. (Physical & Network Security: Better Together In 2014)

You don’t have to take my word for it. Take a walk around your office and see what network cables or equipment could be physically accessed for five minutes or less by any casual visitor. (Don’t forget unattended workstations.)

Don’t spend time and resources on popular “threats” such as China and North Korea when the pizza delivery guy can plug a wireless hub into an open Ethernet port inside your firewall. Yes?

For PR purposes the FBI would describe such a scheme as evidence of advanced networking and computer protocol knowledge. It may be from their perspective. 😉 It shouldn’t be from yours.

April 29, 2015

Some notes on why crypto backdoors are unreasonable

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:41 pm

Some notes on why crypto backdoors are unreasonable by Robert Graham.

Robert gives a good summary of the usual arguments against crypto backdoors and then makes a new-to-me case against the FBI lobbying for such backdoors.

From the post:


Today’s testimony by the FBI and the DoJ discussed the tradeoffs between privacy and protection. Victims of crimes, those who get raped and murdered, deserve to have their killers brought to justice. That criminals get caught dissuades crime. Crypto makes prosecuting criminals harder.

That’s all true, and that’s certainly the argument victim rights groups should make when lobbying government. But here’s the thing: it’s not the FBI’s job to care. We, the people, make the decision about these tradeoffs. It’s solely we, the people, who are the constituents lobbying congress. The FBI’s job is to do what we tell them. They aren’t an interested party. Sure, it’s their job to stop crime, but it’s also their job to uphold rights. They don’t have an opinion, by definition, which one takes precedence over the other — congress makes that decision.

Yet, in this case, they do have an opinion. The only reason the subcommittee held hearings today is in response to the FBI lobbying for backdoors. Even if this issue were reasonable, it’s not reasonable that the FBI should lobby for it.

Where I depart from Robert are his concessions that there is a tradeoff between privacy and protection, that getting caught dissuades crime and that crypto makes prosecuting criminals more difficult.

Amy Hess, the executive assistant director of the FBI’s science and technology branch, testified:

It’s critical for police to “have the ability to accept or to receive the information that we might need in order to hold those accountable who conduct heinous crimes or conduct terrorist attacks,” (Government ‘backdoors’ to bypass encryption will make them vulnerable to attacks – industry experts)

The victims-of-crimes and prosecution arguments are entirely speculative. If there were a single concrete case where crypto allowed the guilty to escape, it would be the first thing out of Hess’ mouth. Law enforcement types would trot it out every day. Not even a single case has come to light. There isn’t any balancing to do with the needs of law enforcement. They should come back when they can show real harm.

The other thing that prompted me to write was Robert saying that getting caught “dissuades crime.” Hardly. That’s the old canard about the death penalty being a deterrent to crime. It has never been the case that punishment deters crime, even when hands were removed for theft.

The FBI has an interest to advance, for the same reason that it sets up emotionally disturbed young men to be busted for terrorist offenses. It has a budget and staff to maintain and you can’t do that without keeping yourself in the public eye. It also captures real criminals from time to time, but that is more of a sideline than its main purpose. Like all agencies and businesses, its main objective is its own preservation.

My disagreement with the FBI is over its use of fictional threats to deceive the public and its representatives for purposes that have nothing to do with the public good.

800,000 NPR Audio Files!

Filed under: Audio,Data — Patrick Durusau @ 6:54 pm

There Are Now 800,000 Reasons To Share NPR Audio On Your Site by Patrick Cooper.

From the post:

From NPR stories to shows to songs, today we’re making more than 800,000 pieces of our audio available for you to share around the Web. We’re throwing open the doors to embedding, putting our audio on your site.

Complete with simple instructions for embedding!

I often think of topic maps when listening to NPR so don’t be surprised if you start seeing embedded NPR audio in the very near future!

Enjoy!

[U.S.] House Member Data in XML

Filed under: Government,XML — Patrick Durusau @ 6:31 pm

User Guide and Data Dictionary. (In PDF)

From the Introduction:

The Office of the Clerk makes available membership lists for the U.S. House of Representatives. These lists are available in PDF format, and starting with the 114th Congress, the data is available in XML format. The document and data are available at http://clerk.house.gov.

For unknown reasons, the link does not appear as a hyperlink in the guide. http://clerk.house.gov.

Just as well because the link to the XML isn’t on that page anyway. Try: http://clerk.house.gov/xml/lists/MemberData.xml instead.
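If you want to poke at the file from a script, here is a Python 3 sketch. The element names used below (“member”, “namelist”, “statedistrict”) are guesses for illustration only; check MemberData.xml and the data dictionary for the actual schema.

# Fetch the House member list and print a few fields per member.
# NOTE: the element names ("member", "namelist", "statedistrict") are
# assumptions for illustration; inspect the real file for the actual schema.
import urllib.request
import xml.etree.ElementTree as ET

url = "http://clerk.house.gov/xml/lists/MemberData.xml"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

for member in tree.getroot().iter("member"):
    name = member.findtext(".//namelist")           # assumed element
    district = member.findtext(".//statedistrict")  # assumed element
    print(district, name)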

Looking forward to the day when all information generated by Congress is available in daily XML dumps.

MapR on Open Data Platform: Why we declined

Filed under: Hadoop,Hortonworks,MapR,Standards — Patrick Durusau @ 4:31 pm

MapR on Open Data Platform: Why we declined by John Schroeder.

From the post:


Open Data Platform is “solving” problems that don’t need solving

Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues. Gartner analysts Merv Adrian and Nick Heudecker disclosed in a recent blog that less than 1% of companies surveyed thought that vendor lock-in or interoperability was an issue—dead last on the list of customer concerns. Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions. Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.

Open Data Platform participation lacks participation by the Hadoop leaders

~75% of Hadoop implementations run on MapR and Cloudera. MapR and Cloudera have both chosen not to participate. The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two.

I mention this post because it touches on two issues that should concern all users of Hadoop applications.

On “vendor lock-in” you will find the question that was asked was “…how many attendees considered vendor lock-in a barrier to investment in Hadoop. It came in dead last. With around 1% selecting it.” Who Asked for an Open Data Platform?. Considering that it was in the context of a Gartner webinar, it could have been that only one person selected it. Not what I would call a representative sample.

Still, I think John is right in saying that vendor lock-in isn’t a real issue with Hadoop. Hadoop applications aren’t off-the-shelf items; they are custom constructs for your needs and data. Not much opportunity for vendor lock-in. You’re in greater danger of IT lock-in due to poor or non-existent documentation for your Hadoop application. If anyone tells you a Hadoop application doesn’t need documentation because you can “…read the code…,” they are building up job security, quite possibly at your future expense.

John is spot on about the Open Data Platform not including all of the Hadoop market leaders. As John says, Open Data Platform does not include those responsible for 75% of the existing Hadoop implementations.

I have seen that situation before in standards work and it never leads to a happy conclusion for the participants, the non-participants, and especially the consumers, who are supposed to benefit from the creation of standards. A would-be standard backed by only a minority of the market serves mainly to confuse less-than-expert consumers. To say nothing of the popular IT press.

The Open Data Platform also raises questions about how one goes about creating a standard. One approach is to create a standard based on your projection of market needs and to campaign for its adoption. Another is to create a definition of an “ODP Core” and see if it is used by customers in development contracts and purchase orders. If consumers find it useful, they will no doubt adopt it as a de facto standard. Formalization can follow in due course.

So long as we are talking about possible future standards, a documentation practice more advanced than C-style comments for Hadoop ecosystems would be a useful standard to pursue.

No Incentives = No Improvement in Cybersecurity

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:14 pm

The State of Cybersecurity: Implications for 2015 (An ISACA and RSA Conference Survey) is now available.

It won’t take you long to conclude that the state of cybersecurity for 2015, and any year thereafter, is going to be about the same.

I say that because out of twenty-five (25) questions, only two (2) dealt with motivations and those were questions about motives for attacks (questions 9 and 10).

Changing the cybersecurity landscape in favor of becoming more, not less, secure will require:

  1. Discussion of positive incentives for greater security, more secure code, etc.
  2. Creation of positive incentives by government and industry for greater security, etc.
  3. Increases in security driven by sufficient incentives to produce greater security.

Think of security as a requirement. If you aren’t willing to pay for a requirement, why should anyone write software that meets that requirement?

Or to put it differently, you don’t have a right to be secure, but you should have the opportunity.

April 28, 2015

Present and Future of Big Data

Filed under: BigData — Patrick Durusau @ 6:57 pm

I thought you might find this amusing as a poster for your office.

Someday your grandchildren will find it similar to “The World of Tomorrow” at the 1939 World’s Fair.

Infographic: Big Data, present and future

Crowdsourcing Courses

Filed under: Crowd Sourcing — Patrick Durusau @ 6:38 pm

Kurt Luther is teaching a crowdsourcing course this Fall and has a partial list of crowdsourcing courses.

Any more to suggest?

Kurt tweets about crowdsourcing and history so you may want to follow him on Twitter.

Markov Composer

Filed under: Music — Patrick Durusau @ 6:27 pm

Markov Composer – Using machine learning and a Markov chain to compose music by Andrej Budinčević.

From the post:

In the following article, I’ll present some of the research I’ve been working on lately. Algorithms, or algorithmic composition, have been used to compose music for centuries. For example, Western punctus contra punctum can be sometimes reduced to algorithmic determinacy. Then, why not use fast-learning computers capable of billions of calculations per second to do what they do best, to follow algorithms? In this article, I’m going to do just that, using machine learning and a second order Markov chain.

If you like exploring music with a computer, Andrej’s post will be a real treat!
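To make the idea concrete, here is a toy sketch of a second-order Markov chain over note names; it is not Andrej’s code, and the melody is made up purely for illustration.

# Toy second-order Markov chain: the next note is drawn from the notes
# observed after each pair of preceding notes in the training melody.
import random
from collections import defaultdict

def train(notes):
    chain = defaultdict(list)
    for a, b, c in zip(notes, notes[1:], notes[2:]):
        chain[(a, b)].append(c)
    return chain

def compose(chain, seed, length=16):
    a, b = seed
    out = [a, b]
    for _ in range(length):
        choices = chain.get((a, b))
        if not choices:
            break
        a, b = b, random.choice(choices)
        out.append(b)
    return out

melody = ["C", "E", "G", "E", "C", "E", "G", "C", "D", "F", "A", "F", "D", "C"]
print(compose(train(melody), ("C", "E")))

A real composer would work with MIDI events and durations rather than bare note names, but the learning and sampling steps have the same shape.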

Enjoy!

MarkovComposer (GitHub)

I first saw this in a tweet by Debasish Ghosh.

One Word Twitter Search Advice

Filed under: Search Behavior,Searching,Twitter — Patrick Durusau @ 6:16 pm

The one word journalists should add to Twitter searches that you probably haven’t considered by Daniel Victor.

Daniel takes you through five results without revealing how he obtained them. A bit long but you will be impressed when he reveals the answer.

He also has some great tips for other Twitter searching. Tips that you won’t see from any SEO.

Definitely something to file with your Twitter search tips.

April 27, 2015

Hacking telesurgery robots, a concrete risk

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:09 pm

Hacking telesurgery robots, a concrete risk by Pierluigi Paganini.

From the post:

Technology will help humans to overcome any obstacle; one of them is the concept of space, which for some activities could represent a serious problem. Think, for example, of a life-saving surgery that could be performed by surgeons who are physically located anywhere in the world.

Telesurgery is a reality that could allow experts in one place to control a robot in another that physically performs the surgical operation. The advantages are enormous in terms of cost savings and timely intervention by medical staff, but what are the possible risks?

Telesurgery relies on sophisticated technology for computing, robotics and communications, and it’s easy to imagine the problem that could be caused by a threat actor.

The expert Tamara Bonaci and other colleagues at the University of Washington in Seattle have analyzed possible threats to the telesurgery, being focused on the possible cyber attacks that modify the behavior of a telerobot during surgery.

One more cyberinsecurity to add to the list!

Professional hand wringers can keep hand wringing, conference speakers can intone about the absolute necessity of better security, governments can keep buying surveillance as though it were security (yes, they both start with “s” but are not the same thing), corporations can keep evaluating cost versus the benefit of security and absent any effective incentives for cyber security, we will remain insecure.

Let me put it more bluntly: So long as cyber insecurity pays better than cyber security, cyber insecurity will continue to have the lead. Cyber security, for all of the talk and noise, is a boutique business compared to the business of cyber insecurity. How else would you explain the reported ten (10) year gap between defenders and hackers?

Government and corporate buyers could start us down the road to cyber security by refusing to purchase software that isn’t warranted to be free from buffer overflow conditions from outside input. (Not the only buffer overflow situation but an obvious one.) With warranties that have teeth in the event that such buffer overflow bugs are found.

The alternative is to have more pronouncements on the need for security, lots of papers on security, etc., and in 2016 and every year thereafter, there will be more vulnerabilities and less security than the year before. Your call.

DIY Security Fix For Android [43 year old vulnerability]

Filed under: Cybersecurity,Security — Patrick Durusau @ 9:12 am

Wi-Fi security software chokes on network names, opens potential hole for hackers by Paul Ducklin.

Paul details a bug that has been found in wpa_supplicant. The bug arises only when using Wi-Fi Direct, which is supported by Android. 🙁

The bug? Failure to check for a buffer overflow. This must be what Dave Merkel, chief technology officer at IT security vendor FireEye, means by:

testing [software] for all things it shouldn’t do is an infinite, impossible challenge.

According to the Wikipedia article Buffer Overflow, buffer overflows were understood as early as 1972 and the first hostile use was in 1988. Those dates translate into forty-three (43) and twenty-seven (27) years ago.

Is it unreasonable to expect a vulnerability class understood forty-three (43) years ago, and first exploited twenty-seven (27) years ago, to be avoided in current programming practice?

This is the sort of issue where programming standards, along with legal liability as an incentive, could make a real difference.

If you are interested in knowing more about buffer overflows, see: Writing buffer overflow exploits – a tutorial for beginners.

April 26, 2015

Hijacking a Plane with Excel

Filed under: Cybersecurity,Security — Patrick Durusau @ 9:35 pm

Wait! That’s not the right title! Hacking Airplanes by Bruce Schneier.

I was thinking about the Dilbert cartoon where the pointed haired boss tries to land a plane using Excel. 😉

There are two points where I disagree with Bruce’s post, at least a little.

From the post:


Governments only have a fleeting advantage over everyone else, though. Today’s top-secret National Security Agency programs become tomorrow’s Ph.D. theses and the next day’s hacker’s tools. So while remotely hacking the 787 Dreamliner’s avionics might be well beyond the capabilities of anyone except Boeing engineers today, that’s not going to be true forever.

What this all means is that we have to start thinking about the security of the Internet of Things–whether the issue in question is today’s airplanes or tomorrow’s smart clothing. We can’t repeat the mistakes of the early days of the PC and then the Internet, where we initially ignored security and then spent years playing catch-up. We have to build security into everything that is going to be connected to the Internet.

First, I’m not so sure that only current Boeing engineers would be capable of hacking a 787 Dreamliner’s avionics. I don’t have a copy of the avionics code, but I assume there are plenty of ex-Boeing engineers who do, and other people who could obtain a copy. A lack of interest, more than a lack of access to the avionics code, probably explains why it hasn’t been hacked so far. If you want to crash an airliner, there are many easier methods than hacking its avionics code.

Second, I am far from convinced by Bruce’s argument:

We can’t repeat the mistakes of the early days of the PC and then the Internet, where we initially ignored security and then spent years playing catch-up.

Unless a rule against human stupidity was passed quite recently I don’t know of any reason why we won’t duplicate the mistakes of the early days of the PC and then of the Internet. Credit cards have been around far longer than both the PC and the Internet, yet fraud abounds in the credit card industry.

Do you remember: The reason companies don’t fix cybersecurity?

The reason why credit card companies don’t stop credit card fraud is that stopping it would cost more than the fraud. It isn’t a moral issue for them; it is a question of profit and loss. There is a point at which fraud becomes costly enough that paying more for security is worth it.

For example, did you know at some banks that no check under $5,000.00 is ever inspected by anyone? Not even for signatures. It isn’t worth the cost of checking every item.

Security in the Internet of Things, at least for vendors, will work the same way: security if and only if the cost of not having it is justified against their bottom lines.

That plus human stupidity makes me think that cyber insecurity is here to stay.

PS: You should not attempt to hijack a plane with Excel. I don’t think your chances are all that good, and the FBI and TSA (never having caught a hijacker yet) are warning airlines to look out for you. The FBI and TSA should be focusing on more likely threats, like hijacking a plane using telepathy.

New York Times Gets Stellarwind IG Report Under FOIA

Filed under: Government,NSA,Privacy — Patrick Durusau @ 4:57 pm

New York Times Gets Stellarwind IG Report Under FOIA by Benjamin Wittes.

A big thank you! to Benjamin Wittes and the New York Times.

Theirs are the only two (2) stories on the Stellarwind IG report, released Friday evening, that give a link to the document!

The NYT story with the document: Government Releases Once-Secret Report on Post-9/11 Surveillance by Charlie Savage.

The document does not appear at:

Office of the Director of National Intelligence (as of Sunday, 25 April 2015, 17:45 EST).

US unveils 6-year-old report on NSA surveillance by Nedra Pickler (Associated Press or any news feed that parrots the Associated Press).

Suggestion: Don’t patronize news feeds that refer to documents but don’t include links to them.

NOAA weather data – Valuing Open Data – Guessing – History Repeats

Filed under: Cloud Computing,Government Data,NOAA — Patrick Durusau @ 4:02 pm

Tech titans ready their clouds for NOAA weather data by Greg Otto.

From the post:

It’s fitting that the 20 terabytes of data the National Oceanic and Atmospheric Administration produces every day will now live in the cloud.

The Commerce Department took a step Tuesday to make NOAA data more accessible as Commerce Secretary Penny Pritzker announced a collaboration among some of the country’s top tech companies to give the public a range of environmental, weather and climate data to access and explore.

Amazon Web Services, Google, IBM, Microsoft and the Open Cloud Consortium have entered into a cooperative research and development agreement with the Commerce Department that will push NOAA data into the companies’ respective cloud platforms to increase the quantity of and speed at which the data becomes publicly available.

“The Commerce Department’s data collection literally reaches from the depths of the ocean to the surface of the sun,” Pritzker said during a Monday keynote address at the American Meteorological Society’s Washington Forum. “This announcement is another example of our ongoing commitment to providing a broad foundation for economic growth and opportunity to America’s businesses by transforming the department’s data capabilities and supporting a data-enabled economy.”

According to Commerce, the data used could come from a variety of sources: Doppler radar, weather satellites, buoy networks, tide gauges, and ships and aircraft. Commerce expects this data to launch new products and services that could benefit consumer goods, transportation, health care and energy utilities.

The original press release has this cheery note on the likely economic impact of this data:

So what does this mean to the economy? According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide. If more of this data could be efficiently released, organizations will be able to develop new and innovative products and services to help us better understand our planet and keep communities resilient from extreme events.

Ah, yes, that would be Open data: Unlocking innovation and performance with liquid information, whose summary page says:

Open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors.

But you need to read the full report (PDF) in order to find footnote 3 on “economic value:”

3. Throughout this report we express value in terms of annual economic surplus in 2013 US dollars, not the discounted value of future cash flows; this valuation represents estimates based on initiatives where open data are necessary but not sufficient for realizing value. Often, value is achieved by combining analysis of open and proprietary information to identify ways to improve business or government practices. Given the interdependence of these factors, we did not attempt to estimate open data’s relative contribution; rather, our estimates represent the total value created.

That is a disclosure that the estimate of $3 to $5 trillion is a guess and/or speculation.

Odd how the guess/speculation disclosure drops out of the Commerce Department press release, and by the time it gets to Greg’s story it reads:

open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide.

From guess/speculation to no mention to fact, all in the short space of three publications.

Does the valuing of open data remind you of:

[Image: 1609 Virginia colony advertisement promising “Excellent Fruites by Planting”]

(Image from: http://civics.sites.unc.edu/files/2012/06/EarlyAmericanSettlements1.pdf)

The date of 1609 is important. Wikipedia has an article on Virginia, 1609-1610, titled, Starving Time. That year, only sixty (60) out of five hundred (500) colonists survived.

Does “Excellent Fruites by Planting” sound a lot like “new and innovative products and services?”

It does to me.

I first saw this in a tweet by Kirk Borne.

Getting Started with Spark (in Python)

Filed under: Hadoop,MapReduce,Python,Spark — Patrick Durusau @ 2:21 pm

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. 😉

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.
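As a taste of how small a Spark program can be, here is the classic word count in PySpark, submitted with spark-submit. A local Spark installation is assumed, and the input path is a placeholder.

# wordcount.py -- run with: spark-submit wordcount.py
# (assumes a local Spark installation; the input path is a placeholder)
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("data/sample.txt")              # placeholder path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
        print(word, count)
    sc.stop()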

Enjoy!

I first saw this in a tweet by DataMining.

April 25, 2015

pandas: powerful Python data analysis toolkit Release 0.16

Filed under: Data Analysis,Programming,Python — Patrick Durusau @ 7:42 pm

pandas: powerful Python data analysis toolkit Release 0.16 by Wes McKinney and PyData Development Team.

I mentioned Wes’ 2011 paper on pandas in 2011 and a lot has changed since then.

From the homepage:

pandas: powerful Python data analysis toolkit

PDF Version

Zipped HTML

Date: March 24, 2015 Version: 0.16.0

Binary Installers: http://pypi.python.org/pypi/pandas

Source Repository: http://github.com/pydata/pandas

Issues & Ideas: https://github.com/pydata/pandas/issues

Q&A Support: http://stackoverflow.com/questions/tagged/pandas

Developer Mailing List: http://groups.google.com/group/pydata

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

  • pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
  • pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
  • pandas has been used extensively in production in financial applications.

Note

This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.
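To make the feature list concrete, here is a tiny, self-contained example touching missing data and split-apply-combine; the data is made up and only long-stable parts of the API are used.

# A tiny tour: missing data handling and groupby (split-apply-combine).
# All data below is made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":  ["Atlanta", "Atlanta", "Boston", "Boston"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [100.0, np.nan, 80.0, 95.0],   # note the missing value (NaN)
})

df["sales"] = df["sales"].fillna(df["sales"].mean())  # easy missing-data handling
print(df.groupby("city")["sales"].sum())              # split-apply-combine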

Not that I’m one to make editorial suggestions, ;-), but with almost 200 pages of What’s New entries going back to September of 2011, in a document topping out at over 1600 pages, I would move all but the latest What’s New to the end. Yes?

BTW, at 1600 pages, you may already be behind in your reading. Are you sure you want to get further behind?

Not only will the reading be entertaining, it will have the side benefit of improving your data analysis skills as well.

Enjoy!

I first saw this mentioned in a tweet by Kirk Borne.

April 24, 2015

Mathematicians Reduce Big Data Using Ideas from Quantum Theory

Filed under: Data Reduction,Mathematics,Physics,Quantum — Patrick Durusau @ 8:20 pm

Mathematicians Reduce Big Data Using Ideas from Quantum Theory by M. De Domenico, V. Nicosia, A. Arenas, V. Latora.

From the post:

A new technique of visualizing the complicated relationships between anything from Facebook users to proteins in a cell provides a simpler and cheaper method of making sense of large volumes of data.

Analyzing the large volumes of data gathered by modern businesses and public services is problematic. Traditionally, relationships between the different parts of a network have been represented as simple links, regardless of how many ways they can actually interact, potentially losing precious information. Only recently a more general framework has been proposed to represent social, technological and biological systems as multilayer networks, piles of ‘layers’ with each one representing a different type of interaction. This approach allows a more comprehensive description of different real-world systems, from transportation networks to societies, but has the drawback of requiring more complex techniques for data analysis and representation.

A new method, developed by mathematicians at Queen Mary University of London (QMUL), and researchers at Universitat Rovira i Virgili in Tarragona (Spain), borrows from quantum mechanics’ well-tested techniques for understanding the difference between two quantum states, and applies them to understanding which relationships in a system are similar enough to be considered redundant. This can drastically reduce the amount of information that has to be displayed and analyzed separately and make it easier to understand.

The new method also reduces computing power needed to process large amounts of multidimensional relational data by providing a simple technique of cutting down redundant layers of information, reducing the amount of data to be processed.

The researchers applied their method to several large publicly available data sets about the genetic interactions in a variety of animals, a terrorist network, scientific collaboration systems, worldwide food import-export networks, continental airline networks and the London Underground. It could also be used by businesses trying to more readily understand the interactions between their different locations or departments, by policymakers understanding how citizens use services or anywhere that there are large numbers of different interactions between things.

You can hop over to Nature, Structural reducibility of multilayer networks, where if you don’t have an institutional subscription:

ReadCube: $4.99 Rent, $9.99 to buy, or Purchase a PDF for $32.00.

Let me save you some money and suggest you look at:

Layer aggregation and reducibility of multilayer interconnected networks

Abstract:

Many complex systems can be represented as networks composed by distinct layers, interacting and depending on each others. For example, in biology, a good description of the full protein-protein interactome requires, for some organisms, up to seven distinct network layers, with thousands of protein-protein interactions each. A fundamental open question is then how much information is really necessary to accurately represent the structure of a multilayer complex system, and if and when some of the layers can indeed be aggregated. Here we introduce a method, based on information theory, to reduce the number of layers in multilayer networks, while minimizing information loss. We validate our approach on a set of synthetic benchmarks, and prove its applicability to an extended data set of protein-genetic interactions, showing cases where a strong reduction is possible and cases where it is not. Using this method we can describe complex systems with an optimal trade–off between accuracy and complexity.

Both articles have four (4) illustrations. Same four (4) authors. The difference is that the second one is at http://arxiv.org. Oh, and it is free to download.

I remain concerned by the focus on reducing the complexity of data to fit current algorithms and processing models. That said, there is no denying that such reduction methods have proven to be useful.

The authors neatly summarize my concerns with this outline of their procedure:

The whole procedure proposed here is sketched in Fig. 1 and can be summarised as follows: i) compute the quantum Jensen-Shannon distance matrix between all pairs of layers; ii) perform hierarchical clustering of layers using such a distance matrix and use the relative change of Von Neumann entropy as the quality function for the resulting partition; iii) finally, choose the partition which maximises the relative information gain.

With my corresponding concerns:

i) The quantum Jensen-Shannon distance matrix presumes a metric distance for its operations, which may or may not reflect the semantics of the layers (other than by simplifying assumption).

ii) The relative change of Von Neumann entropy is a difference measurement based upon an assumed metric, which may or may not represent the underlying semantics of the relationships between layers.

iii) The process concludes by maximizing a difference measurement based upon an assigned metric, which has been assigned to the different layers.

Maximizing a difference, based on an entropy calculation, which is itself based on an assigned metric doesn’t fill me with confidence.

I don’t doubt that the technique “works,” but doesn’t that depend upon what you think is being measured?

A question for the weekend: Do you think this is similar to the questions about dividing continuous variables into discrete quantities?
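If you want to experiment over the weekend, here is a minimal numerical sketch of steps (i) and (ii), using the common convention of a layer’s Laplacian normalized by its trace as its density matrix. This is my reading of the procedure, not the authors’ code, and the two toy layers are made up.

# Steps (i)-(ii) in miniature: density matrix rho = L / trace(L) per layer,
# Von Neumann entropy of rho, and quantum Jensen-Shannon distance between layers.
import numpy as np

def density_matrix(adjacency):
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return laplacian / np.trace(laplacian)

def von_neumann_entropy(rho):
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]          # drop (near-)zero eigenvalues
    return float(-(eigvals * np.log2(eigvals)).sum())

def qjs_distance(rho1, rho2):
    mix = 0.5 * (rho1 + rho2)
    divergence = von_neumann_entropy(mix) - 0.5 * (
        von_neumann_entropy(rho1) + von_neumann_entropy(rho2))
    return np.sqrt(max(divergence, 0.0))

# Two toy 4-node layers (symmetric adjacency matrices, made up).
layer1 = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
layer2 = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1], [0, 1, 1, 0]], float)
print(qjs_distance(density_matrix(layer1), density_matrix(layer2)))

Every distance this prints is relative to the chosen density matrix convention, which is exactly the point of the concerns above.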

How to secure your baby monitor [Keep Out Creeps, FBI, NSA, etc.]

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:21 pm

How to secure your baby monitor by Lisa Vaas.

From the post:

Two more nurseries have been invaded, with strangers apparently spying on parents and their babies via their baby monitors.

This is nuts. We’re hearing more and more about these kinds of crimes, but there’s nothing commonplace about the level of fear they’re causing as families’ privacy is invaded. It’s time we put some tools into parents’ hands to help.

First, the latest creep-out cyber nursery tales. Read on to the bottom for ways to help keep strangers out of your family’s business.

I don’t know for a fact that the FBI or NSA have tapped into baby monitors. But when it comes to anyone who engages in an orchestrated campaign of false testimony in court spanning decades, lies to Congress (and the public), and kidnaps, tortures and executes people, well, my expectations aren’t all that high.

You really are entitled to privacy in your own home, especially on such a joyous occasion as the birth of a child. But that isn’t going to happen by default. Nor is the government going to guarantee that privacy. Sorry.

You would not bathe or dress your child in the front yard, so don’t allow their room to become the front yard.

Teach your children good security habits along with looking both ways and holding hands to cross the street.

Almost all digitally recorded data is or can be compromised. That won’t change in the short run but we can create islands of privacy for our day to day lives. Starting with every child’s bedroom.

>30 Days From Patch – No Hacker Liability – Civil or Criminal

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:54 pm

Potent, in-the-wild exploits imperil customers of 100,000 e-commerce sites by Dan Goodin.

From the post:

Criminals are exploiting an extremely critical vulnerability found on almost 100,000 e-commerce websites in a wave of attacks that puts the personal information for millions of people at risk of theft.

The remote code-execution hole resides in the community and enterprise editions of Magento, the Internet’s No. 1 content management system for e-commerce sites. Engineers from eBay, which owns the e-commerce platform, released a patch in February that closes the vulnerability, but as of earlier this week, more than 98,000 online merchants still hadn’t installed it, according to researchers with Byte, a Netherlands-based company that hosts Magento-using websites. Now, the consequences of that inaction are beginning to be felt, as attackers from Russia and China launch exploits that allow them to gain complete control over vulnerable sites.

“The vulnerability is actually comprised of a chain of several vulnerabilities that ultimately allow an unauthenticated attacker to execute PHP code on the Web server,” Netanel Rubin, a malware and vulnerability researcher with security firm Checkpoint, wrote in a recent blog post. “The attacker bypasses all security mechanisms and gains control of the store and its complete database, allowing credit card theft or any other administrative access into the system.”

This flaw has been fixed but:

Engineers from eBay, which owns the e-commerce platform, released a patch in February that closes the vulnerability, but as of earlier this week, more than 98,000 online merchants still hadn’t installed it,…

The House of Representatives (U.S.) recently passed a cybersecurity bill to give companies liability protection when sharing threat data, as a step towards more sharing of cyberthreat information.

OK, but so far, have you heard of any incentives to encourage better security practices? Better security practices such as installing patches for known vulnerabilities.

Here’s an incentive idea for patch installation:

Exempt hackers from criminal and civil liability for vulnerabilities with patches more than thirty (30) days old.

Why not?

It will create a small army of hackers who pounce on every announced patch in hopes of catching someone over the thirty day deadline. It neatly solves the problem of how to monitor the installation of patches. (I am assuming the threat of being looted provides some incentive for patch maintenance.)

The second part should be a provision that insurance cannot be sold to cover losses due to hacks more than thirty days after patch release. As we have seen before, users rely on insurance to avoid spending money on cybersecurity. For hacks more than thirty days after a patch’s release, users would have to eat the losses.

Let me know if you are interested in the >30-Day-From-Patch idea. I am willing to help draft the legislation.


For further information on this vulnerability:

Wikipedia on Magento, which has about 30% of the ecommerce market.

Magento homepage, etc.

Analyzing the Magento Vulnerability (Updated) by Netanel Rubin.

From Rubin’s post:

Check Point researchers recently discovered a critical RCE (remote code execution) vulnerability in the Magento web e-commerce platform that can lead to the complete compromise of any Magento-based store, including credit card information as well as other financial and personal data, affecting nearly two hundred thousand online shops.

Check Point privately disclosed the vulnerabilities together with a list of suggested fixes to eBay prior to public disclosure. A patch to address the flaws was released on February 9, 2015 (SUPEE-5344 available here). Store owners and administrators are urged to apply the patch immediately if they haven’t done so already.
For a visual demonstration of one way the vulnerability can be exploited, please see our video here.

What kind of attack is it?

The vulnerability is actually comprised of a chain of several vulnerabilities that ultimately allow an unauthenticated attacker to execute PHP code on the web server. The attacker bypasses all security mechanisms and gains control of the store and its complete database, allowing credit card theft or any other administrative access into the system.

This attack is not limited to any particular plugin or theme. All the vulnerabilities are present in the Magento core, and affects any default installation of both Community and Enterprise Editions. Check Point customers are already protected from exploitation attempts of this vulnerability through the IPS software blade.

Rubin’s post has lots of very nice PHP code.

I first saw this in a tweet by Ciuffy.

Ordinary Least Squares Regression: Explained Visually

Filed under: Mathematics,Visualization — Patrick Durusau @ 2:55 pm

Ordinary Least Squares Regression: Explained Visually by Victor Powell and Lewis Lehe.

From the post:

Statistical regression is basically a way to predict unknown quantities from a batch of existing data. For example, suppose we start out knowing the height and hand size of a bunch of individuals in a “sample population,” and that we want to figure out a way to predict hand size from height for individuals not in the sample. By applying OLS, we’ll get an equation that takes hand size—the ‘independent’ variable—as an input, and gives height—the ‘dependent’ variable—as an output.

Below, OLS is done behind-the-scenes to produce the regression equation. The constants in the regression—called ‘betas’—are what OLS spits out. Here, beta_1 is an intercept; it tells what height would be even for a hand size of zero. And beta_2 is the coefficient on hand size; it tells how much taller we should expect someone to be for a given increment in their hand size. Drag the sample data to see the betas change.

[interactive graphic omitted]

At some point, you probably asked your parents, “Where do betas come from?” Let’s raise the curtain on how OLS finds its betas.

Error is the difference between prediction and reality: the vertical distance between a real data point and the regression line. OLS is concerned with the squares of the errors. It tries to find the line going through the sample data that minimizes the sum of the squared errors. Below, the squared errors are represented as squares, and your job is to choose betas (the slope and intercept of the regression line) so that the total area of all the squares (the sum of the squared errors) is as small as possible. That’s OLS!

The post includes a visual explanation of ordinary least squares regression up to 2 independent variables (3-D).
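For readers who want to see the betas fall out numerically, here is a minimal sketch with made-up sample data, solving the normal equations (X'X)β = X'y directly.

# OLS for height ~ hand size via the normal equations. Sample data is made up.
import numpy as np

hand_size = np.array([17.0, 18.5, 19.0, 20.5, 21.0, 22.5])     # cm
height = np.array([160.0, 165.0, 168.0, 175.0, 178.0, 185.0])  # cm

X = np.column_stack([np.ones_like(hand_size), hand_size])  # intercept column + hand size
beta = np.linalg.solve(X.T.dot(X), X.T.dot(height))        # solve (X'X) beta = X'y
intercept, slope = beta
print("height ~ %.1f + %.1f * hand_size" % (intercept, slope))

Minimizing the sum of squared errors by hand, as in the interactive graphic, converges to the same betas.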

Height wasn’t the correlation I heard with hand size but Visually Explained is a family friendly blog. And to be honest, I got my information from another teenager (at the time), so my information source is suspect.

jQAssistant 1.0.0 released

Filed under: Neo4j,Programming,Software,Software Engineering — Patrick Durusau @ 2:25 pm

jQAssistant 1.0.0 released by Dirk Mahler.

From the webpage:

We’re proud to announce the availability of jQAssistant 1.0.0 – lots of thanks go to all the people who made this possible with their ideas, criticism and code contributions!

Feature Overview

  • Static code analysis tool using the graph database Neo4j
  • Scanning of software related structures, e.g. Java artifacts (JAR, WAR, EAR files), Maven descriptors, XML files, relational database schemas, etc.
  • Allows definition of rules and automated verification during a build process
  • Rules are expressed as Cypher queries or scripts (e.g. JavaScript, Groovy or JRuby)
  • Available as Maven plugin or CLI (command line interface)
  • Highly extensible by plugins for scanners, rules and reports
  • Integration with SonarQube
  • It’s free and Open Source

Example Use Cases

  • Analysis of existing code structures and matching with proposed architecture and design concepts
  • Impact analysis, e.g. which test is affected by potential code changes
  • Visualization of architectural concepts, e.g. modules, layers and their dependencies
  • Continuous verification and reporting of constraint violations to provide fast feedback to developers
  • Individual gathering and filtering of metrics, e.g. complexity per component
  • Post-Processing of reports of other QA tools to enable refactorings in brown field projects
  • and much more…

Get it!

jQAssistant is available as a command line client from the downloadable distribution

jqassistant.sh scan -f my-application.war
jqassistant.sh analyze
jqassistant.sh server

or as Maven plugin:

<dependency>
    <groupId>com.buschmais.jqassistant.scm</groupId>
    <artifactId>jqassistant-maven-plugin</artifactId>
    <version>1.0.0</version>
</dependency>

For a list of the latest changes, refer to the release notes; the documentation provides usage information.

Those who are impatient should go to the Get Started page, which provides information about the first steps of scanning applications and running analyses.

Your Feedback Matters

Every kind of feedback helps to improve jQAssistant: feature requests, bug reports and even questions about how to solve specific problems. You can choose between several channels – just pick your preferred one: the discussion group, stackoverflow, a Gitter channel, the issue tracker, e-mail or just leave a comment below.

Workshops

Do you want to get started quickly with an inventory of an existing Java application architecture? Or are you interested in setting up a continuous QA process that verifies your architectural concepts and provides graphical reports?
The team of buschmais GbR offers individual workshops for you! For more information or to set up an agenda, refer to http://jqassistant.de (German) or just contact us via e-mail!

Short of widespread censorship, security breaches will only fade from the news spotlight if software quality and security improve.

jQAssistant 1.0.0 is one example of the type of tool required for software quality/security to improve.

Of particular interest is its use of Neo4j, which enables named relationships between supporting materials and your code.

I don’t mean to foster “…everything is a graph…” any more than I would foster “…everything is a set of relational tables…” or “…everything is a key/value pair…,” etc. The question is: “Given my requirements and constraints, what is the best way to achieve objective X?” Whether relationships are explicit (and if so, what I can say about them) or implicit depends on my requirements, not those of a vendor.

If you want to record who wrote the most buffer overflows (and where), plus other flaws, then tracking named relationships and similar information should be part of your requirements, and graphs are a good way to meet that requirement.
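
As a sketch of what querying those named relationships can look like, the snippet below (mine, not from the announcement) posts a Cypher query to the Neo4j transactional HTTP endpoint exposed by “jqassistant.sh server”. The node labels, relationship types (:Type, :Method, DECLARES, INVOKES), the fqn property and the URL follow my reading of the jQAssistant Java plugin schema and defaults — treat them all as assumptions and check the documentation for your version.

# Query a jQAssistant-scanned codebase via Neo4j's transactional HTTP endpoint.
import json
import requests

# Assumed schema: which types declare methods that make the most outgoing calls?
CYPHER = """
MATCH (t:Type)-[:DECLARES]->(m:Method)-[:INVOKES]->(callee:Method)
RETURN t.fqn AS type, count(callee) AS outgoingCalls
ORDER BY outgoingCalls DESC
LIMIT 10
"""

response = requests.post(
    "http://localhost:7474/db/data/transaction/commit",  # assumed default server URL
    headers={"Content-Type": "application/json"},
    data=json.dumps({"statements": [{"statement": CYPHER}]}),
)
response.raise_for_status()

for row in response.json()["results"][0]["data"]:
    type_name, calls = row["row"]
    print(type_name, calls)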

Animation of Gerrymandering?

Filed under: Geographic Data,Geospatial Data,Government,Mapping,Maps — Patrick Durusau @ 1:45 pm

United States Congressional District Shapefiles by Jeffrey B. Lewis, Brandon DeVine, and Lincoln Pitcher with Kenneth C. Martis.

From the description:

This site provides digital boundary definitions for every U.S. Congressional District in use between 1789 and 2012. These were produced as part of NSF grant SBE-SES-0241647 between 2009 and 2013.

The current release of these data is experimental. We have done a good deal of work to validate all of the shapes. However, it is quite likely that some irregularities remain. Please email jblewis@ucla.edu with questions or suggestions for improvement. We hope to have a ticketing system for bugs and a versioning system up soon. The district definitions currently available should be considered an initial-release version.

Many districts were formed by aggregating complete county shapes obtained from the National Historical Geographic Information System (NHGIS) project and the Newberry Library’s Atlas of Historical County Boundaries. Where Congressional district boundaries did not coincide with county boundaries, district shapes were constructed district-by-district using a wide variety of legal and cartographic resources. Detailed descriptions of how particular districts were constructed and the authorities upon which we relied are available (at the moment) by request and described below.

Every state districting plan can be viewed quickly at https://github.com/JeffreyBLewis/congressional-district-boundaries (clicking on any of the listed file names will create a map window that can be panned and zoomed). GeoJSON definitions of the districts can also be downloaded from the same URL. Congress-by-Congress district maps in ESRI Shapefile format can be downloaded below. Though providing somewhat lower resolution than the shapefiles, the GeoJSON files contain additional information about the members who served in each district that the shapefiles do not (Congress member information may be useful for creating web applications with, for example, Google Maps or Leaflet).

Project Team

The Principal Investigator on the project was Jeffrey B. Lewis. Brandon DeVine and Lincoln Pitcher researched district definitions and produced thousands of digital district boundaries. The project relied heavily on Kenneth C. Martis’ The Historical Atlas of United States Congressional Districts: 1789-1983. (New York: The Free Press, 1982). Martis also provided guidance, advice, and source materials used in the project.

How to cite

Jeffrey B. Lewis, Brandon DeVine, Lincoln Pitcher, and Kenneth C. Martis. (2013) Digital Boundary Definitions of United States Congressional Districts, 1789-2012. [Data file and code book]. Retrieved from http://cdmaps.polisci.ucla.edu on [date of download].

An impressive resource for anyone interested in the history of United States Congressional Districts and their development. An animation of gerrymandering of congressional districts was the first use case that jumped to mind. 😉
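
Here is a rough sketch (mine, not part of the project) of that animation idea in Python: one frame per Congress, drawn from the project’s GeoJSON files. The file names below are placeholders — substitute files downloaded from the GitHub repository above.

# Animate district boundaries across Congresses from GeoJSON files.
import json
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

FRAMES = [
    "districts_congress_110.geojson",  # placeholder file names, one per Congress
    "districts_congress_111.geojson",
    "districts_congress_112.geojson",
]

def exterior_rings(geometry):
    # Yield the exterior ring(s) of a GeoJSON Polygon or MultiPolygon.
    if geometry["type"] == "Polygon":
        yield geometry["coordinates"][0]
    elif geometry["type"] == "MultiPolygon":
        for polygon in geometry["coordinates"]:
            yield polygon[0]

fig, ax = plt.subplots()

def draw(frame_index):
    ax.clear()
    ax.set_title(FRAMES[frame_index])
    ax.set_aspect("equal")
    with open(FRAMES[frame_index]) as f:
        collection = json.load(f)
    for feature in collection["features"]:
        for ring in exterior_rings(feature["geometry"]):
            xs = [point[0] for point in ring]
            ys = [point[1] for point in ring]
            ax.plot(xs, ys, linewidth=0.5)

anim = FuncAnimation(fig, draw, frames=len(FRAMES), interval=1000)
anim.save("gerrymandering.gif", writer="pillow")  # requires Pillow installed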

Enjoy!

I first saw this in a tweet by Larry Mullen.

April 23, 2015

Are Government Agencies Trustworthy? FBI? No!

Filed under: Authoring Topic Maps,Government — Patrick Durusau @ 8:14 pm

Pseudoscience in the Witness Box: The FBI faked an entire field of forensic science by Dahlia Lithwick.

From the post:

The Washington Post published a story so horrifying this weekend that it would stop your breath: “The Justice Department and FBI have formally acknowledged that nearly every examiner in an elite FBI forensic unit gave flawed testimony in almost all trials in which they offered evidence against criminal defendants over more than a two-decade period before 2000.”

What went wrong? The Post continues: “Of 28 examiners with the FBI Laboratory’s microscopic hair comparison unit, 26 overstated forensic matches in ways that favored prosecutors in more than 95 percent of the 268 trials reviewed so far.” The shameful, horrifying errors were uncovered in a massive, three-year review by the National Association of Criminal Defense Lawyers and the Innocence Project. Following revelations published in recent years, the two groups are helping the government with the country’s largest ever post-conviction review of questioned forensic evidence.

Chillingly, as the Post continues, “the cases include those of 32 defendants sentenced to death.” Of these defendants, 14 have already been executed or died in prison.

You should read Dahlia’s post carefully and then write “untrustworthy” next to any reference to or material from the FBI.

This particular issue involved identifying hair samples to be the same, which went beyond any known science.

But if 26 out of 28 experts were willing to go there, how far do you think the average agent on the street goes towards favoring the prosecution?

True, the FBI is working to find all the cases where this happened, but questions about this type of evidence were raised long before now. Questioning the prosecution’s evidence, however, doesn’t work in the FBI’s favor.

Defense teams need to start requesting judicial notice of the propensity of executive branch employees to give false testimony, along with a cautionary instruction to jurors in cases where such employees testify.

Unker Non-Linear Writing System

Filed under: Language,Linguistics,Writing — Patrick Durusau @ 7:46 pm

Unker Non-Linear Writing System by Alex Fink & Sai.

From the webpage:

non-linear

“I understood from my parents, as they did from their parents, etc., that they became happier as they more fully grokked and were grokked by their cat.”[3]

Here is another snippet from the text:

Binding points, lines and relations

Every glyph includes a number of binding points, one for each of its arguments, the semantic roles involved in its meaning. For instance, the glyph glossed as eat has two binding points—one for the thing consumed and one for the consumer. The glyph glossed as (be) fish has only one, the fish. Often we give glosses more like “X eat Y”, so as to give names for the binding points (X is eater, Y is eaten).

A basic utterance in UNLWS is put together by writing out a number of glyphs (without overlaps) and joining up their binding points with lines. When two binding points are connected, this means the entities filling those semantic roles of the glyphs involved coincide. Thus when the ‘consumed’ binding point of eat is connected to the only binding point of fish, the connection refers to an eaten fish.

This is the main mechanism by which UNLWS clauses are assembled. To take a worked example, here are four glyphs:

non-linear2

If you work with graphical representations for design or presentation, this may be of interest.
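
For readers who think in data structures, here is a toy model (entirely mine, not from the UNLWS authors) of the binding-point mechanism: each glyph carries named semantic roles, and connecting two binding points asserts that the entities filling those roles coincide, as in the eaten-fish example.

# Toy model of glyphs, binding points and connecting lines.
class Glyph:
    def __init__(self, gloss, roles):
        self.gloss = gloss
        # Each binding point starts out unbound (maps to None).
        self.bindings = {role: None for role in roles}

def connect(glyph_a, role_a, glyph_b, role_b, entity):
    # Joining two binding points: both roles now refer to the same entity.
    glyph_a.bindings[role_a] = entity
    glyph_b.bindings[role_b] = entity

# The worked example from the text: eat(eater, eaten) and (be) fish(fish).
eat = Glyph("eat", ["eater", "eaten"])
fish = Glyph("(be) fish", ["fish"])

connect(eat, "eaten", fish, "fish", entity="e1")  # "e1" stands for the eaten fish

print(eat.bindings)   # {'eater': None, 'eaten': 'e1'}
print(fish.bindings)  # {'fish': 'e1'}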

Sam Hunting forwarded this while we were exploring TeX graphics.

PS: The “cat” people on Twitter may appreciate the first graphic. 😉

Protecting Your Privacy From The NSA?

Filed under: Government,Privacy,Security — Patrick Durusau @ 4:26 pm

House passes cybersecurity bill by Cory Bennett and Cristina Marcos.

From the post:

The House on Wednesday passed the first major cybersecurity bill since the calamitous hacks on Sony Entertainment, Home Depot and JPMorgan Chase.

Passed 307-116, the Protecting Cyber Networks Act (PCNA), backed by House Intelligence Committee leaders, would give companies liability protections when sharing cyber threat data with government civilian agencies, such as the Treasury or Commerce Departments.

“This bill will strengthen our digital defenses so that American consumers and businesses will not be put at the mercy of cyber criminals,” said House Intelligence Committee Chairman Devin Nunes (R-Calif.).

Lawmakers, government officials and most industry groups argue more data will help both sides better understand their attackers and bolster network defenses that have been repeatedly compromised over the last year.

Privacy advocates and a group of mostly Democratic lawmakers worry the bill will simply shuttle more sensitive information to the National Security Agency (NSA), further empowering its surveillance authority. Many security experts agree, adding that they already have the data needed to study hackers’ tactics.

The connection between sharing threat data and loss of privacy to the NSA escapes me.

At present, the NSA:

  • Monitors all Web traffic
  • Monitors all Email traffic
  • Collects all Phone metadata
  • Collects all Credit Card information
  • Collects all Social Media data
  • Collects all Travel data
  • Collects all Banking data
  • Has spied on Congress and other agencies
  • Can demand production of other information and records from anyone
  • Probably has a copy of your income tax and social security info

You are concerned private information about you might be leaked to the NSA in the form of threat data?

Seriously?

Anything is possible, so something the NSA doesn’t already know could come to light, but I would not waste my energy opposing a bill that poses virtually no additional threat to privacy.

The NSA is the issue that needs to be addressed. Its very existence is incompatible with any notion of privacy.
