Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 20, 2014

Lap Dancing With Big Data

Filed under: BigData,Data,Data Analysis — Patrick Durusau @ 4:27 pm

Real scientists make their own data by Sean J. Taylor.

From the first list in the post:

4. If you are the creator of your data set, then you are likely to have a great understanding [of] the data generating process. Blindly downloading someone’s CSV file means you are much more likely to make assumptions which do not hold in the data.

A good point among many good points.

Sean provides guidance on how you can collect data, not just have it dumped on you.

Or as Kaiser Fung says in the post that led me to Sean’s:

In theory, the availability of data should improve our ability to measure performance. In reality, the measurement revolution has not taken place. It turns out that measuring performance requires careful design and deliberate collection of the right types of data — while Big Data is the processing and analysis of whatever data drops onto our laps. Ergo, we are far from fulfilling the promise.

So, do you make your own data?

Or do you lap dance with data?

I know which one I aspire to.

You?

Microsoft Research adopts Open Access… [Write to MS]

Filed under: Microsoft,Open Access — Patrick Durusau @ 3:43 pm

Microsoft Research adopts Open Access policy for publications

From the post:

In a recent interview with Scientific American, Peter Lee, head of Microsoft Research, discussed three main motivations for basic research at Microsoft. The first relates to an aspiration to advance human knowledge, the second derives from a culture that relies deeply on the ambitions of individual researchers, and the last concerns “promoting open publication of all research results and encouraging deep collaborations with academic researchers.”

It is in keeping with this third motivation that Microsoft Research recently committed to an Open Access policy for our researchers’ publications.

As evidenced by a long-running series of blog posts by Tony Hey, vice president of Microsoft Research Connections, Microsoft Research has carefully deliberated our role in the growing movement toward open publications and open data.

This is great news. When Microsoft takes a step, it’s a big step, heard near and far.

Take the time to write to anyone you know at Microsoft just to say you appreciate the decision.

We all write to them to complain about MS products, so why not write a nice note about open access?

It won’t take five (5) minutes if you open up your email client right now. (I wrote one before I posted this entry.)

Digital Humanities?

Filed under: Humanities,Humor — Patrick Durusau @ 3:32 pm

xkcd on digital humanities

I saw this mentioned by Ted Underwood in a tweet saying:

An xkcd that could just as well be titled “Digital Humanities.”

Not to be too harsh on the digital humanists: they have bad role models in programming projects, where maintenance is called “job security.”

Data sharing, OpenTree and GoLife

Filed under: Biodiversity,Bioinformatics,Biology,Data Integration — Patrick Durusau @ 3:14 pm

Data sharing, OpenTree and GoLife

From the post:

NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL. From the GoLife text:

The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:

Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (www.opentreeoflife.org), ARBOR (www.arborworkflows.com), and Next Generation Phenomics (www.avatol.org/ngp) is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc). (I corrected the URLs for ARBOR and Next Generation Phenomics)

What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, a publication coming soon (we will update this post when it comes out) and we will be releasing our new curation / validation tool for phylogenetic data in the next few weeks.

A great resource on the NSF GoLife proposal that I just posted about.

Some other references:

AToL – Assembling the Tree of Life

AVATOL – Assembling, Visualizing and Analyzing the Tree of Life

Be sure to contact the Open Tree of Life group if you are interested in the GoLife project.

Genealogy of Life (GoLife)

Filed under: Bioinformatics,Biology,Data Integration — Patrick Durusau @ 2:43 pm

Genealogy of Life (GoLife) NSF.


Full Proposal Deadline Date: March 26, 2014
Fourth Wednesday in March, Annually Thereafter

Synopsis:

All of comparative biology depends on knowledge of the evolutionary relationships (phylogeny) of living and extinct organisms. In addition, understanding biodiversity and how it changes over time is only possible when Earth’s diversity is organized into a phylogenetic framework. The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

The ultimate vision of this program is an open access, universal Genealogy of Life that will provide the comparative framework necessary for testing questions in systematics, evolutionary biology, ecology, and other fields. A further strategic integration of this genealogy of life with data layers from genomic, phenotypic, spatial, ecological and temporal data will produce a grand synthesis of biodiversity and evolutionary sciences. The resulting knowledge infrastructure will enable synthetic research on biological dynamics throughout the history of life on Earth, within current ecosystems, and for predictive modeling of the future evolution of life.

Projects submitted to this program should emphasize increased efficiency in contributing to a complete Genealogy of Life and integration of various types of organismal data with phylogenies.

This program also seeks to broadly train next generation, integrative phylogenetic biologists, creating the human resource infrastructure and workforce needed to tackle emerging research questions in comparative biology. Projects should train students for diverse careers by exposing them to the multidisciplinary areas of research within the proposal.

You may have noticed the emphasis on data integration:

to integrate this genealogical architecture with underlying organismal data.

comparative framework necessary for testing questions in systematics, evolutionary biology, ecology, and other fields

strategic integration of this genealogy of life with data layers from genomic, phenotypic, spatial, ecological and temporal data

synthetic research on biological dynamics

integration of various types of organismal data with phylogenies

next generation, integrative phylogenetic biologists

That sounds like a tall order! Particularly if your solution does not enable researchers to ask on what basis the data was integrated and by whom.

If you can’t ask and answer those two questions, the more data and integration you mix together, the more fragile the integration structure will become.

I’m not trying to presume that such a project will use dynamic merging because it may well not. “Merging” in topic map terms may well be an operation ordered by a member of a group of curators. It is the capturing of the basis for that operation that makes it maintainable over a series of curators through time.
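
As an illustration of capturing that basis, here is a minimal sketch in Python. The field names, identifiers and record shape are my own invention for illustration, not TMCL or any particular topic map engine:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class MergeRecord:
        """Record not just that two topics were merged, but by whom and on what basis."""
        surviving_topic: str                            # identifier of the topic kept
        merged_topic: str                               # identifier of the topic folded into it
        basis: str                                      # human-readable identity judgment
        evidence: list = field(default_factory=list)    # shared names, identifiers, etc.
        curator: str = "unknown"
        decided_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    # A later curator can audit or reverse the decision because the basis survives.
    record = MergeRecord(
        surviving_topic="taxon:Quercus-alba",
        merged_topic="taxon:white-oak-local-db",
        basis="Same species: shared accepted binomial and identical type specimen.",
        evidence=["binomial=Quercus alba"],
        curator="curator-17")

The point is only that the record of who merged what, and on what evidence, travels with the merge itself, so a later curator can audit or undo it.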

I first saw this at: Data sharing, OpenTree and GoLife, which I am about to post on but thought the NSF call merited a separate post as well.

OpenAIRE Legal Study has been published

Filed under: Law,Licensing,Open Access,Open Data,Open Source — Patrick Durusau @ 2:14 pm

OpenAIRE Legal Study has been published

From the post:

Guibault, Lucie; Wiebe, Andreas (Eds) (2013) Safe to be Open: Study on the protection of research data and recommendation for access and usage. The full-text of the book is available (PDF, ca. 2 MB ) under the CC BY 4.0 license. Published by University of Göttingen Press (Copies can be ordered from the publisher’s website)

Any e-infrastructure which primarily relies on harvesting external data sources (e.g. repositories) needs to be fully aware of any legal implications for re-use of this knowledge, and further application by 3rd parties. OpenAIRE’s legal study will put forward recommendations as to applicable licenses that appropriately address scientific data in the context of OpenAIRE.

CAUTION: Safe to be Open is an EU-centric publication and, while very useful in copyright discussions elsewhere, should not be relied upon as legal advice. (That’s not an opinion about relying on it in the EU. Ask local counsel for that advice.)

I say that having witnessed too many licensing discussions that were uninformed by legal counsel. Entertaining to be sure but if I have a copyright question, I will be posing it to counsel who is being paid to be correct.

At least until ignorance of the law becomes an affirmative shield against liability for copyright infringement. 😉

To be sure, I recommend reading Safe to be Open as a means of becoming informed about the contours of access and usage of research data in the EU. And possibly as a model for solutions in legal systems that lag behind the EU in that regard.

Personally I favor Attribution CC BY because the other CC licenses presume the licensed material was created without unacknowledged/uncompensated contributions from others.

Think of all the people who taught you to read, write, program and all the people whose work you have read, been influenced by, etc. Hopefully you can add to the sum of communal knowledge but it is unfair to claim ownership of the whole of communal knowledge simply because you contributed a small part. (That’s not legal advice either, just my personal opinion.)

Without all the instrument makers, composers, singers, organists, etc. that came before him, Mozart would not be the same Mozart that we remember. Just as gifted, but without a context in which to display his gifts.

Patent and copyright need to be recognized as “thumbs on the scale” against development of services and knowledge. That’s where I would start a discussion of copyright and patents.

January 19, 2014

10 Awesome Google Chrome Experiments

Filed under: Graphics,Visualization — Patrick Durusau @ 9:16 pm

10 Awesome Google Chrome Experiments

From the post:

With the coming of 2014, it is quite evident in the market that there is a huge need for innovation in the field of technology. The field of technology as in this case includes a lot of things like the process of doing the things and the way the output is extracted. Most of the economic giants are of the idea that now is the time when we should actually be looking to invest and develop new things such that there is no problem in the coming years. When asked about the problems that one might face, the most common answer was that if the technology is not advanced there will be no increase in the revenue. Well, this might just bring in a bit of thoughts in the minds of bloggers.

Although there is no requirement to be worried about because of the fact that in case of blogging the utmost creativity lies in the field of the articles that you write and the design that you maintain. Thus it becomes important a topic enough to think about making some improvement in the designs that you put in and the quality that you maintain. Now if the main area of your concern is in the field of designing then there are many online tutorials and tools that can help you to sort out your problems. To add to this the extravagant coding language like HTML5 and CSS3 will surely help you out in your area of concern. Considering the possible extent of the two languages, it is highly advisable to start developing the knowledge related to them with some real care!

But not all the people are too much interested in giving the hell lot of efforts in the region. Well, frankly speaking there is actually no shortcut to success and thus considering the statement you will have to do it the correct way. If you are looking for help that the Google Chrome Experiments is one of the best places to give a visit in the times of need! Google Chrome Experiments are the places that harbors the best designers of the world and believe me when I say that is one of the perfect places to be a part of if you are really in some moods to learn new things about designing. The place remains constantly updated with excellent things to know about and work with. You can also share your thoughts and look for the answers of your queries.

Some eye candy to start the work week!

Importing data to Neo4j…

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:48 pm

Importing data to Neo4j the spreadsheet way in Neo4j 2.0! by Pernilla.

From the post:

And happy new year! I hope you had an excellent start, let’s keep this year rocking with a spirit of graph-love! Our Rik Van Bruggen did a lovely blog post on how to import data into Neo4j using spreadsheets in March last year. Simple and easy to understand but only for Neo4j version 1.9.3. Now it’s a new year and in December we launched a shiny new version of Neo4j, the 2.0.0 release! Baadadadaam! So, I thought better provide an update to his blogpost, with the spirit of his work. (Thank you Rik!)

If you don’t think spreadsheets are all that weird in data processing, ;-), you should feel right at home.

Medicare Spending Data…

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 2:11 pm

Medicare Spending Data May Be Publicly Available Under New Policy by Gavin Baker.

From the post:

On Jan. 14, the Centers for Medicare & Medicaid Services (CMS) announced a new policy that could bring greater transparency to Medicare, one of the largest programs in the federal government. CMS revoked its long-standing policy not to release publicly any information about Medicare’s payments to doctors. Under the new policy, the agency will evaluate requests for such information on a case-by-case basis. Although the impact of the change is not yet clear, it creates an opportunity for a welcome step forward for data transparency and open government.

Medicare’s tremendous size and impact – expending an estimated $551 billion and covering roughly 50 million beneficiaries in 2012 – mean that increased transparency in the program could have big effects. Better access to Medicare spending data could permit consumers to evaluate doctor quality, allow journalists to identify waste or fraud, and encourage providers to improve health care delivery.

Until now, the public hasn’t been able to learn how much Medicare pays to particular medical businesses. In 1979, a court blocked Medicare from releasing such information after doctors fought to keep it secret. However, the court lifted the injunction in May 2013, freeing CMS to consider whether to release the data.

In turn, CMS asked for public comments about what it should do and received more than 130 responses. The Center for Effective Government was among the organizations that filed comments, calling for more transparency in Medicare spending and urging CMS to revoke its previous policy implementing the injunction. After considering those comments, CMS adopted its new policy.

The change may allow the public to examine the reimbursement amounts paid to medical providers under Medicare. Under the new approach, CMS will not release those records wholesale. Instead, the agency will wait for specific requests for the data and then evaluate each to consider if disclosure would invade personal privacy. While information about patients is clearly off-limits, it’s not clear what kind of information about doctors CMS will consider private, so it remains to be seen how much information is ultimately disclosed under the new policy. It should be noted, however, that the U.S. Supreme Court has held that businesses don’t have “personal privacy” under the Freedom of Information Act (FOIA), and the government already discloses the amounts it pays to other government contractors.

The announcement from CMS: Modified Policy on Freedom of Information Act Disclosure of Amounts Paid to Individual Physicians under the Medicare Program

The case-by-case determination of a physician’s privacy rights is an attempt to discourage requests for public information.

If all physician payment data, say by procedure, were available in state-by-state data sets, local residents in a town of 500 would know that 2,000 x-rays a year is on the high side. Without ever knowing any patient’s identity.
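
A rough sketch of the sort of check a resident could run, assuming a hypothetical state-level CSV with physician_id, procedure and count columns (the column names, file name and cutoff are illustrative, not CMS’s actual release format):

    import csv
    from statistics import mean, stdev

    def flag_high_volume(path, procedure="x-ray", z_cutoff=3.0):
        """Flag physicians whose annual count for a procedure sits far above the state mean."""
        counts = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["procedure"] == procedure:
                    counts[row["physician_id"]] = int(row["count"])
        if len(counts) < 2:
            return []
        mu, sigma = mean(counts.values()), stdev(counts.values())
        return [(pid, n) for pid, n in counts.items()
                if sigma and (n - mu) / sigma > z_cutoff]

    # flag_high_volume("state_payments_2012.csv") would surface a 2,000-x-ray outlier
    # in a small town without touching any patient-level data.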

If you are a U.S. resident, take this opportunity to push for greater transparency in Medicare spending. Be polite and courteous but also be persistent. You need no more reason than an interest in how Medicare is being spent.

Let’s have an FOIA (Freedom of Information Act) request pending for every physician in the United States within 90 days of the CMS rule becoming final.

It’s not final yet, but when it is, let slip the leash on the dogs of FOIA.

Mapping Security Flaws?

Filed under: Cybersecurity — Patrick Durusau @ 10:45 am

SCADA Researcher Drops Zero-Day, ICS-CERT Issues Advisory by Kelly Jackson Higgins. (A patch has been released: Advisory (ICSA-14-016-01))

From the post:

A well-known and prolific ICS/SCADA vulnerability researcher here today revealed a zero-day flaw in a Web server-based system used for monitoring, controlling, and viewing devices and systems in process control environments.

Luigi Auriemma, CEO of Malta-based zero-day vulnerability provider and penetration testing firm ReVuln, showed a proof-of-concept for executing a buffer overflow attack on Ecava’s IntegraXor software, which is used in human machine interfaces (HMIs) for SCADA systems.

The ICS-CERT responded later in the day with a security alert on the zero-day vulnerability, and requested that Ecava confirm the bug and provide mitigation. Ecava as of this posting had not responded publicly, nor had it responded to an email inquiry by Dark Reading.

The IntegraXor line is used in process control environments in 38 countries, mainly in the U.K., U.S., Australia, Poland, Canada, and Estonia, according to ICS-CERT.

How bad is this?

First, “zero-day” is a media hype term. It essentially means the vendor finds out about the flaw at the same time it is made public.

Second, what are SCADA systems?

The summary in Wikipedia reads:

SCADA (supervisory control and data acquisition) is a type of industrial control system (ICS). Industrial control systems are computer-controlled systems that monitor and control industrial processes that exist in the physical world. SCADA systems historically distinguish themselves from other ICS systems by being large-scale processes that can include multiple sites, and large distances. [footnote [1] omitted] These processes include industrial, infrastructure, and facility-based processes, as described below:

The geographic range of vulnerable systems was specified in the original CERT alert:

IntegraXor is a suite of tools used to create and run a Web-based human-machine interface for a SCADA system. IntegraXor is currently used in several areas of process control in 38 countries with the largest installation based in the United Kingdom, United States, Australia, Poland, Canada, and Estonia. (emphasis added)

From a follow up CERT alert, VULNERABILITY DETAILS:

EXPLOITABILITY

This vulnerability could be exploited remotely.

EXISTENCE OF EXPLOIT

Exploits that target this vulnerability are publicly available.

DIFFICULTY

An attacker with a low skill would be able to exploit this vulnerability.

The top six geographic locations (the IntegraXor installation map in the original post shows the distribution) have vulnerabilities, due to buffer overflows, in: industrial processes (manufacturing, production, power generation, fabrication, and refining); infrastructure processes (water treatment and distribution, wastewater collection and treatment, oil and gas pipelines, electrical power transmission and distribution, wind farms, civil defense siren systems, and large communication systems); and facility processes (buildings, airports, ships, and space stations).

According to Wikipedia, the earliest documentation on buffer overflows dates from 1972 and the first hostile exploit of a buffer overflow was in 1988 (the Morris worm).

A quick search on integraxor “buffer overflow” returned 1,850 “hits,” most of them duplicates of the original news or opinions about the infrastructures at risk.

There was one “buffer overflow” POC with IntegraXor in 2010, but you had to sift through a number of “hits” to find it.

Experts already know where to find buffer overflow opportunities but it doesn’t appear that level of expertise is widely shared.

It appears to be time consuming, but not difficult, to identify customers of particular vendors to approach with security services in the event of exploits.

For a security flaw story, what would you want to know that can’t be learned at Exploit Database?

What facts, other data sources, organization of that information, etc.?

Thinking of topic maps as “in addition to” rather than “an alternative to” some existing information store will make them easier to sell.

January 18, 2014

Licensing Your Code:…

Filed under: Licensing,Software — Patrick Durusau @ 9:10 pm

Licensing Your Code: GPL, BSD and Edvard Munch’s “The Scream” by Bruce Berriman.

From the post:

I have for some time considered changing to a more permissive license (with Caltech’s approval) for the Montage image mosaic engine, as the current license forbids modification and redistribution of the code. My first attempt at navigating the many licenses available led me to believe that the subject of Edvard Munch’s “The Scream” was not oppressed by society but simply trying to find the best license for his software.

The license, of course, specifies the terms and conditions under which the software may be used, distributed and modified, and distinctions between licenses are important. Trouble is, there are so many of them. The Wikipedia page on Comparison of Free and Open Source Licenses lists over 40 such licenses, each with varying degrees of approval from the free software community.

Not that I have any code to release but I assume the same issues apply to releasing data sets.

Do not leave licensing of code or data as “understood” or until “later” in a project. Interests and levels of cooperation may vary over time.

Best to get data and code licensing details in writing when everyone is in a good humor.

A better way to explore and learn on GitHub (Google Cloud)

Filed under: Cloud Computing,Google Analytics — Patrick Durusau @ 8:43 pm

A better way to explore and learn on GitHub

From the post:

Almost one year ago, Google Cloud Platform launched our GitHub organization, with repositories ranging from tutorials to samples to utilities. This is where developers could find all resources relating to the platform, and get started developing quickly. We started with 36 repositories, with lofty plans to add more over time in response to requests from you, our developers. Many product releases, feature launches, and one logo redesign later, we are now up to 123 repositories illustrating how to use all parts of our platform!

Despite some clever naming schemes, it was becoming difficult to find exactly the code that you wanted amongst all of our repositories. Idly browsing through over 100 options wasn’t productive. The repository names gave you an idea of what stacks they used, but not what problems they solved.

Today, we are making it easier to browse our repositories and search for sample code with our landing page at googlecloudplatform.github.io. Whether you want to find all Compute Engine resources, locate all samples that are available in your particular stack, or find examples that fit your particular area of interest, you can find it with the new GitHub page. We’ll be rotating the repositories in the featured section, so make sure to wander that way from time to time.

Less than a year old, and their standard metadata (read: navigation) details are already changing.

Judging from the comments, their users deeply appreciate the new approach.

Change is something that funders calling for standard metadata just don’t get. Which is why new standard metadata projects are so common. It is the same mistake, repeated over and over again.

To be sure, domains need to take their best shot at today’s standard metadata, but with an eye toward it pointing to tomorrow’s standard metadata. To be truly useful in STEM fields, it needs to point back to yesterday’s standard metadata as well.

Sorry, got distracted.

Check out the new resources and get thee to the cloud!

How to Query the StackExchange Databases

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 8:29 pm

How to Query the StackExchange Databases by Brent Ozar.

From the post:

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.
….

I’m sure you have never found duplicate questions or answers on StackExchange.

But just in case such a thing existed, detecting and merging the duplicates from StackExchange would be a good exercise in data analysis, subject identification, etc.

😉
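
If you want to try it without a SQL Server install, here is a hedged sketch of the duplicate-question side of the exercise against the Posts.xml file in the public data dump. I am assuming the dump’s row elements carry Id, PostTypeId and Title attributes; the 0.9 cutoff is arbitrary:

    import difflib
    import xml.etree.ElementTree as ET

    def candidate_duplicates(posts_xml, cutoff=0.9):
        """Pair up questions whose titles are nearly identical (O(n^2), fine for small sites)."""
        questions = []
        for _, row in ET.iterparse(posts_xml):
            if row.tag == "row" and row.get("PostTypeId") == "1":  # 1 = question
                questions.append((row.get("Id"), row.get("Title") or ""))
            row.clear()
        pairs = []
        for i, (id_a, title_a) in enumerate(questions):
            for id_b, title_b in questions[i + 1:]:
                ratio = difflib.SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
                if ratio >= cutoff:
                    pairs.append((id_a, id_b, title_a, title_b))
        return pairs

Turning the candidate pairs into merges, and recording why each pair was judged to be the same question, is where the subject identification starts.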

BTW, Brent’s webinar is 21 January 2014, or next Tuesday (as of this post).

Enjoy!

A course in sample surveys for political science

Filed under: Politics,Statistics,Survey — Patrick Durusau @ 8:11 pm

A course in sample surveys for political science by Andrew Gelman.

From the post:

A colleague asked if I had any material for a course in sample surveys. And indeed I do. See here.

It’s all the slides for a 14-week course, also the syllabus (“surveyscourse.pdf”), the final exam (“final2012.pdf”) and various misc files. Also more discussion of final exam questions here (keep scrolling thru the “previous entries” until you get to Question 1).

Enjoy! This is in no way a self-contained teach-it-yourself course, but I do think it could be helpful for anyone who is trying to teach a class on this material.

An impressive bundle of survey material!

I mention it because you may be collecting survey data or at least asked to process survey data.

Hopefully it won’t originate from Survey Monkey.

If I had $1 for every survey composed by a statistical or survey illiterate on Survey Monkey, I could make a substantial down payment on the national debt.

That’s not the fault of Survey Monkey but there is more to survey work than asking questions.

If you don’t know how to write a survey, do us all a favor, make up the numbers and say that in a footnote. You will be in good company with the piracy estimators.

Introduction to Statistical Computing

Filed under: Computation,Computer Science,R,Statistics — Patrick Durusau @ 7:54 pm

Introduction to Statistical Computing by Cosma Shalizi.

Description:

Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core of ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Slides and R code for three years (as of the time of this post).

DataCamp (R & Data Analysis)

Filed under: Data Analysis,R — Patrick Durusau @ 7:43 pm

DataCamp Learn R & Become a Data Analyst.

From the overview:

Like english is the language spoken by the inhabitants of the United States, R is the language spoken by millions of statisticians and data analysts around the globe.

In this interactive R tutorial for beginners you will learn the basics of R. By the end of the summer you will be able to analyze data with R and create some very good looking graphs. This course is targeted at real beginners who are just getting started with R.

Opportunities for experts as well! Teach a course!

The homepage has this statistic:

“By 2018 there will be a shortage of 200,000 Data Analysts and 1.5 million data savvy managers, in the US alone.” (McKinsey & Company)

It doesn’t say how many of the 200,000 jobs will be with the NSA. 😉

Seriously, a site and delivery methodology that may well take off in 2014.

I first saw this at Learn Data Science Online with DataCamp by Ryan Swanstrom.

Black Hat Asia 2014

Filed under: Conferences,Cybersecurity — Patrick Durusau @ 2:28 pm

Black Hat Asia 2014 March 25-28, 2014 Marina Bay Sands, Singapore.

Early registration pricing ends: January 24, 2014.

From the homepage:

Black Hat is returning to Asia for the first time since 2008, and we have quite an event in store. Here the brightest professionals and researchers in the industry will come together for a total of four days–two days of deeply technical hands-on Trainings, followed by two days of the latest research and vulnerability disclosures at our Briefings.

Black Hat Asia 2014: First Three Briefings

From the briefings page:

Welcome to 2014! Today we’re focusing on the first trio of Briefings selected for Black Hat Asia 2014. From hacking cars to the ins and outs of surveying the entire Internet, we’ve got an incredible amount of fascinating insider knowledge to share.

You might have caught Alberto Garcia Illera and Javier Vazquez Vidal’s Black Hat USA 2013 Arsenal presentation, “Dude, WTF in My Car!,” where they thoroughly dissected automobile ECUs (engine control units) and released a powerful tool to exploit them. Join the duo again for Dude, WTF in My CAN!, where their focus shifts to the CAN (controller area network) bus at the heart of many modern vehicles. They’ll show you how to build a device for only $20 that can pwn the CAN bus and allow an attacker to control it remotely. Also on the agenda: the current state of car forensics and how such data can be extracted and used in legal cases.

When flaws and exploits emerge in Microsoft products and the security hits the fan, the company has a history of issuing so-called “Fix It” patches that attempt to take care of the immediate threat. The In-Memory Fix It is one recently documented variation on the concept. In Persist It: Using and Abusing Microsoft’s Fix It Patches Jon Erickson will share his research on these in-memory patches. Through reverse engineering, he’s gained the ability to create new patches, which can maintain persistence on a host system. Microsoft’s Fix Its may need a fix themselves.

Between the Critical.IO and Internet Census 2012 scanning projects, there have been great strides made over the last year or two in Internet survey cost and practicality. While some of the results have been dismaying — i.e. misconfigured hardware across the Internet leaves it vulnerable to attack — the datasets generated by this massive-scale research provide rare evidence on risks and vulnerability exposure, and show where further security research is needed most. Come to Scan All the Things – Project Sonar with Mark Schloesser to learn how these surveys were conducted, as well as the eye-opening results they’ve generated so far.

If you are wavering about attending after reading about those briefings, see the full briefing page or the Training page. That should have you registering and making travel arrangements rather quickly.

The NSA will be there. Will you?

Pay the Man!

Filed under: Publishing,Transparency — Patrick Durusau @ 11:07 am

Books go online for free in Norway by Martin Chilton.

From the post:

More than 135,000 books still in copyright are going online for free in Norway after an innovative scheme by the National Library ensured that publishers and authors are paid for the project.

The copyright-protected books (including translations of foreign books) have to be published before 2000 and the digitising has to be done with the consent of the copyright holders.

National Library of Norway chief Vigdis Moe Skarstein said the project is the first of its kind to offer free online access to books still under copyright, which in Norway expires 70 years after the author’s death. Books by Stephen King, Ken Follett, John Steinbeck, Jo Nesbø, Karin Fossum and Nobel Laureate Knut Hamsun are among those in the scheme.

The National Library has signed an agreement with Kopinor, an umbrella group representing major authors and publishers through 22 member organisations, and for every digitised page that goes online, the library pays a predetermined sum to Kopinor, which will be responsible for distributing the royalties among its members. The per-page amount was 0.36 Norwegian kroner (four pence), which will decrease to three pence when the online collection reaches its estimated target of 250,000 books.

Norway has discovered a way out of the copyright conundrum, pay the man!

Can you imagine the impact if the United States were to bulk license all of the Springer publications in digital format?

Some immediate consequences:

  1. All citizen-innovators would have access to a vast library of high quality content, without restriction by place of employment or academic status.
  2. Taking over the cost of Springer materials would act as additional funding for libraries with existing subscriptions.
  3. It would even out access to Springer materials across the educational system in the U.S.
  4. It would reduce the administrative burden on both libraries and Springer by consolidating all existing accounts into one account.
  5. Springer could offer “advanced” services in addition to basic search and content for additional fees, leveraged on top of the standard content.
  6. Other vendors could offer “advanced” services for fees leveraged on top of standard content.

I have nothing against the many “open access” journals but bear in mind the vast legacy of science and technology that remains the property of Springer and others.

The principal advantage I would pitch to Springer is that the availability of its content under bulk licensing would result in other vendors building services on top of that content.

What advantage is there for Springer? Imagine that you can be either a road (content) or a convenience store (an app built on content) next to the road. Which one gets maintained longer?

Everybody has an interest in maintaining and even expanding the road. By becoming part of the intellectual infrastructure of education, industry and government, even more than it is now, Springer would secure a very stable and lucrative future.

Put that way, I would much rather be the road than the convenience store.

You?

Coding books should be stories first…

Filed under: Marketing,Topic Maps — Patrick Durusau @ 10:31 am

I saw this in a tweet by Jordan Leigh:

Coding books should be stories first, and just happen to be about code. Instead we have coding docs that just happen to be in a book form.

Looking back over topic map presentations, papers, books, etc., did we also fall into that trap?

It rubs the wrong way to have spent so much time on obscure issues only to have to ignore them to talk about what interests users. 😉

What user stories do you think are the most interesting?

The only way to test starting from user stories is to start with user stories. Quite possibly fleshing out the user story and its issues before even mentioning topic maps.

Sort of like the fiction writing advice to “start with chapter two.” The book I have in mind recommended that you start with some sort of crisis, emergency, etc. Get readers interested in the character before getting into background details.

Perhaps that approach would work for topic maps.

For all the “solutions” for Big Data, I have yet to see one that addresses the semantic needs of “Big Data.”

You?

January 17, 2014

Rule-based deduplication…

Filed under: Deduplication,Information Retrieval,Topic Maps,Uncategorized — Patrick Durusau @ 8:24 pm

Rule-based deduplication of article records from bibliographic databases by Yu Jiang, et al.

Abstract:

We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.

I found this report encouraging, particularly when read alongside Rule-based Information Extraction is Dead!…, with regard to merging rules authored by human editors.

Both reports indicate a pressing need for more complex rules than matching a URI for purposes of deduplication (merging in topic maps terminology).

I assume such rules would need to be easier for the average user to declare than TMCL.
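
To make that concrete, the cascade described in the abstract (exact identifiers first, a fuzzier text comparison as a fallback, always erring against merging) might be declared as ordered rules along these lines. A minimal sketch with invented field names and thresholds, not Metta’s actual module:

    from difflib import SequenceMatcher

    def same_article(a, b):
        """Apply deduplication rules in order of reliability; err on the side of NOT merging."""
        if a.get("year") != b.get("year"):       # records are only compared within a year
            return False
        for key in ("pmid", "doi"):              # exact, high-confidence identifiers first
            if a.get(key) and b.get(key):
                return a[key] == b[key]
        title_ratio = SequenceMatcher(None, a.get("title", "").lower(),
                                      b.get("title", "").lower()).ratio()
        same_journal = a.get("journal", "").lower() == b.get("journal", "").lower()
        return title_ratio > 0.95 and same_journal   # strict fuzzy fallback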

Algorithms are not enough:…

Filed under: News,Reporting — Patrick Durusau @ 8:06 pm

Algorithms are not enough: lessons bringing computer science to journalism by Jonathan Stray.

From the post:

There are some amazing algorithms coming out of the computer science community which promise to revolutionize how journalists deal with large quantities of information. But building a tool that journalists can use to get stories done takes a lot more than algorithms. Closing this gap has been one of the most challenging and rewarding aspects of building Overview, and I really think we’ve learned something.

I want to get into the process of going from algorithm to application here, because — somewhat to my surprise — I don’t think this process is widely understood. The computer science research community is going full speed ahead developing exciting new algorithms, but seems a bit disconnected from what it takes to get their work used. This is doubly disappointing, because understanding the needs of users often shows that you need a different algorithm.

The development of Overview is a story about text analysis algorithms applied to journalism, but the principles might apply to any sort of data analysis system. One definition says data science is the intersection of computer science, statistics, and subject matter expertise. This post is about connecting computer science with subject matter expertise.

I rather like the line:

This post is about connecting computer science with subject matter expertise.

If you have ever wondered about how an idea goes from one-off code to software that is easy to use for others, this is a post you need to read.

Jonathan being a reporter by trade makes the story all the more compelling.

It also makes me wonder if topic map interfaces should focus more on how users see the world and not so much on how topic map mavens see the world.

For example the precision of identification users expect may be very different from that of specialists.

Thoughts?

Three Linked Data Vocabularies

Filed under: Linked Data,Vocabularies — Patrick Durusau @ 7:27 pm

Three Linked Data Vocabularies are W3C Recommendations

From the post:

Three Recommendations were published today to enhance data interoperability, especially in government data. Each one specifies an RDF vocabulary (a set of properties and classes) for conveying a particular kind of information:

  • The Data Catalog (DCAT) Vocabulary is used to provide information about available data sources. When data sources are described using DCAT, it becomes much easier to create high-quality integrated and customized catalogs including entries from many different providers. Many national data portals are already using DCAT.
  • The Data Cube Vocabulary brings the cube model underlying SDMX (Statistical Data and Metadata eXchange, a popular ISO standard) to Linked Data. This vocabulary enables statistical and other regular data, such as measurements, to be published and then integrated and analyzed with RDF-based tools.
  • The Organization Ontology provides a powerful and flexible vocabulary for expressing the official relationships and roles within an organization. This allows for interoperation of personnel tools and will support emerging socially-aware software.

More vocabularies for mapping into their respective areas: backward to pre-existing vocabularies and forward to the vocabularies that succeed them.
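
As a small taste of the first of these, a dataset entry in a catalog can be described with DCAT in a handful of triples. A sketch using rdflib; the dataset URI and literal values are invented:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    DCT = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)

    dataset = URIRef("http://example.org/dataset/city-budget-2013")  # invented URI
    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCT.title, Literal("City budget, fiscal year 2013")))
    g.add((dataset, DCAT.keyword, Literal("budget")))
    g.add((dataset, DCAT.landingPage, URIRef("http://example.org/data/city-budget")))

    print(g.serialize(format="turtle"))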

Petuum

Filed under: Hadoop,Machine Learning,Petuum — Patrick Durusau @ 7:14 pm

Petuum

From the homepage:

Petuum is a distributed machine learning framework. It takes care of the difficult system “plumbing work”, allowing you to focus on the ML. Petuum runs efficiently at scale on research clusters and cloud compute like Amazon EC2 and Google GCE.

A Bit More Details

Petuum provides essential distributed programming tools that minimize programmer effort. It has a distributed parameter server (key-value storage), a distributed task scheduler, and out-of-core (disk) storage for extremely large problems. Unlike general-purpose distributed programming platforms, Petuum is designed specifically for ML algorithms. This means that Petuum takes advantage of data correlation, staleness, and other statistical properties to maximize the performance for ML algorithms.

Plug and Play

Petuum comes with a fast and scalable parallel LASSO regression solver, as well as an implementation of topic model (Latent Dirichlet Allocation) and L2-norm Matrix Factorization – with more to be added on a regular basis. Petuum is fully self-contained, making installation a breeze – if you know how to use a Linux package manager and type “make”, you’re ready to use Petuum. No mucking around trying to find that Hadoop cluster, or (worse still) trying to install Hadoop yourself. Whether you have a single machine or an entire cluster, Petuum just works.

What’s Petuum anyway?

Petuum comes from “perpetuum mobile,” which is a musical style characterized by a continuous steady stream of notes. Paganini’s Moto Perpetuo is an excellent example. It is our goal to build a system that runs efficiently and reliably — in perpetual motion.

Musically inclined programmers? 😉

The bar for using Hadoop and machine learning gets lower by the day. At least in terms of details that can be mastered by code.

Which is how it should be. The creative work, choosing data, appropriate algorithms, etc., being left to human operators.

I first saw this at Danny Bickson’s Petuum – a new distributed machine learning framework from CMU (Eric Xing).

PS: Remember to register for the 3rd GraphLab Conference!

Cybersecurity – Know Your Network

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 7:02 pm

10 Free Or Low-Cost Network Discovery And Mapping Tools by Ericka Chickowski.

To see the ten (10) tools you have to page through ten (10) screen refreshes of advertising.

I thought you might have better ways to spend your time:

  1. Open-AudIT
  2. NetSurveyor
  3. Advanced IP Scanner
  4. Fing
  5. Network Mapping Software
  6. Cheops-ng
  7. Open NMS
  8. NetworkView
  9. Nmap
  10. Angry IP Scanner

Despite the uber-hacker tales about the NSA, the NSA succeeds for the same reason some spammers make $7,000 a day: people are careless.

Using one or more of these tools you can start hardening your network against government intrusion.

Government intrusion isn’t a question of if but of when and for how long.

After you start working on your network, enlist your friends as well. A neighborhood network watch program as it were.

You will run into issues when sharing local network maps with your friends. Most of you will have one or more conflicting local IP addresses inside your routers.

One easy solution is to use topic maps to create unique topics to represent all of the machines individually, even if they share the same local IP address.

That will enable you to query across all the local networks in the data set for similar probes, etc.
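
A minimal sketch of that idea in Python (an invented identifier scheme, no particular topic map engine): key each machine by the network it was observed on plus its local address, so two hosts that both think they are 192.168.1.10 stay distinct when the neighborhood data is pooled.

    from collections import defaultdict

    def machine_topic_id(network_owner, router_mac, local_ip):
        """Build a unique subject identifier for a host behind NAT.

        Local IPs repeat across home networks, so the identifier also records
        which network (owner label plus router MAC) the address was seen on.
        """
        return f"urn:x-neighborhood-watch:{network_owner}:{router_mac}:{local_ip}"

    observations = defaultdict(list)   # topic id -> observed probe events
    scans = [
        ("alice", "00:11:22:33:44:55", "192.168.1.10", "inbound probe, port 23"),
        ("bob",   "66:77:88:99:aa:bb", "192.168.1.10", "inbound probe, port 23"),
    ]
    for owner, mac, ip, event in scans:
        observations[machine_topic_id(owner, mac, ip)].append(event)

    # Two distinct machines, even though they share a local IP address.
    for topic, events in observations.items():
        print(topic, events)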

The larger your network of friends, the more data you will be gathering on the activities of the shadow government in the U.S.

Post your data publicly so it can be combined with data from other neighborhood network watch groups.

Let’s take back the Internet, one local data pipe at a time.

NSA News!

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 5:16 pm

Four Questionable Claims Obama Has Made on NSA Surveillance by Kara Brandeisky of ProPublica.

Kara does a great job pointing out four specific claims made by President Obama that just don’t add up.

Read Kara’s article and then share it. (Results in a small donation for ProPublica.)

My concern isn’t that the President made misleading statements. You can turn on CNN any day of the week and hear misleading statements from inside the “beltway,” as they call it.

I am more concerned that the “merits” of the President’s “reforms” will become serious topics of discussion. That would be an unfortunate distraction from the only remedy that might deter other federal agencies from going completely rogue, closure of the NSA.

When I say closure I mean exactly that. No transfers of files, personnel, physical assets, etc. Lock the doors and just seal it up.

Why? Well, consider that the Director of National Intelligence, James Clapper Jr., lied to Congress about surveillance of U.S. citizens and has not been held accountable for those lies.

How would you know if any of the reforms are performed? Ask Clapper?

Reform of the NSA is a farce wrapped in a lie and concealed inside secret budget allocations.

Closing the NSA should be the first step to making the government as transparent as the average U.S. (and non-U.S.) citizen is today.

Data-Driven Discovery Initiative

Filed under: BigData,Data Science — Patrick Durusau @ 4:12 pm

Data-Driven Discovery Initiative

Pre-applications due: Monday, February 24, 2014 by 5 pm Pacific Time

From the webpage:

Our Data-Driven Discovery Initiative seeks to advance the people and practices of data-intensive science, to take advantage of the increasing volume, velocity, and variety of scientific data to make new discoveries. Within this initiative, we’re supporting data-driven discovery investigators – individuals who exemplify multidisciplinary, data-driven science, coalescing natural sciences with methods from statistics and computer science.

These innovators are striking out in new directions and are willing to take risks with the potential of huge payoffs in some aspect of data-intensive science. Successful applicants must make a strong case for developments in the natural sciences (biology, physics, astronomy, etc.) or science enabling methodologies (statistics, machine learning, scalable algorithms, etc.), and applicants that credibly combine the two are especially encouraged. Note that the Science Program does not fund disease targeted research.

It is anticipated that the DDD initiative will make about 15 awards at ~$1,500,000 each, at $200K-$300K/year for five years.

Be aware, you must be an employee of a PhD-granting institution or a private research institute in the United States to apply.

Open Educational Resources for Biomedical Big Data

Filed under: BigData,Bioinformatics,Biomedical,Funding — Patrick Durusau @ 3:38 pm

Open Educational Resources for Biomedical Big Data (R25)

Deadline for submission: April 1, 2014

Additional information: bd2k_training@mail.nih.gov

As part of the NIH Big Data to Knowledge (BD2K) project, BD2K R25 FOA will support:

Curriculum or Methods Development of innovative open educational resources that enhance the ability of the workforce to use and analyze biomedical Big Data.

The challenges:

The major challenges to using biomedical Big Data include the following:

Locating data and software tools: Investigators need straightforward means of knowing what datasets and software tools are available and where to obtain them, along with descriptions of each dataset or tool. Ideally, investigators should be able to easily locate all published and resource datasets and software tools, both basic and clinical, and, to the extent possible, unpublished or proprietary data and software.

Gaining access to data and software tools: Investigators need straightforward means of 1) releasing datasets and metadata in standard formats; 2) obtaining access to specific datasets or portions of datasets; 3) studying datasets with the appropriate software tools in suitable environments; and 4) obtaining analyzed datasets.

Standardizing data and metadata: Investigators need data to be in standard formats to facilitate interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Sharing data and software: While significant progress has been made in broad and rapid sharing of data and software, it is not yet the norm in all areas of biomedical research. More effective data- and software-sharing would be facilitated by changes in the research culture, recognition of the contributions made by data and software generators, and technical innovations. Validation of software to ensure quality, reproducibility, provenance, and interoperability is a notable goal.

Organizing, managing, and processing biomedical Big Data: Investigators need biomedical data to be organized and managed in a robust way that allows them to be fully used; currently, most data are not sufficiently well organized. Barriers exist to releasing, transferring, storing, and retrieving large amounts of data. Research is needed to design innovative approaches and effective software tools for organizing biomedical Big Data for data integration and sharing while protecting human subject privacy.

Developing new methods for analyzing biomedical Big Data: The size, complexity, and multidimensional nature of many datasets make data analysis extremely challenging. Substantial research is needed to develop new methods and software tools for analyzing such large, complex, and multidimensional datasets. User-friendly data workflow platforms and visualization tools are also needed to facilitate the analysis of Big Data.

Training researchers for analyzing biomedical Big Data: Advances in biomedical sciences using Big Data will require more scientists with the appropriate data science expertise and skills to develop methods and design tools, including those in many quantitative science areas such as computational biology, biomedical informatics, biostatistics, and related areas. In addition, users of Big Data software tools and resources must be trained to utilize them well.

Another big data biomedical data integration funding opportunity!

I do wonder about the suggestion:

The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Do they mean:

“Standard” metadata for a particular academic lab?

“Standard” metadata for a particular industry lab?

“Standard” metadata for either one five (5) years ago?

“Standard” metadata for either one five (5) years from now?

The problem is the familiar one: knowledge that isn’t moving forward is outdated.

It’s hard to do good research with outdated information.

Making metadata dynamic, so that it reflects yesterday’s terminology, today’s and someday tomorrow’s, would be far more useful.

The metadata displayed to any user would be their choice of metadata and not the complexities that make the metadata dynamic.
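
One way to picture that: keep the mappings between vocabulary versions explicit and re-key a record into whichever vocabulary the user prefers at display time. A toy sketch with invented terms, not any NIH or community standard:

    # Mappings between metadata vocabulary versions, maintained by curators.
    TERM_MAPS = {
        ("lab-2009", "lab-2014"): {"expt_id": "experiment_id", "subj": "participant"},
        ("lab-2014", "lab-2009"): {"experiment_id": "expt_id", "participant": "subj"},
    }

    def translate(record, source, target):
        """Re-key a metadata record from one vocabulary version to another."""
        if source == target:
            return dict(record)
        mapping = TERM_MAPS[(source, target)]
        return {mapping.get(key, key): value for key, value in record.items()}

    old_record = {"expt_id": "E-17", "subj": "S-042"}
    # A user who prefers today's terms sees today's keys; the stored record is unchanged.
    print(translate(old_record, "lab-2009", "lab-2014"))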

Interested?

RICON West 2013 Videos Posted!

Filed under: Erlang,Programming,Riak — Patrick Durusau @ 2:52 pm

RICON West 2013 Videos Posted!

Rather than streaming the entire two (2) days, you can now view individual videos from RICON West 2013!

By author:

By title:

  • Bad As I Wanna Be: Coordination and Consistency in Distributed Databases (Bailis) – RICON West 2013
  • Bringing Consistency to Riak (Part 2) (Joseph Blomstedt) – RICON West 2013
  • Building Next Generation Weather Data Distribution and On-demand Forecast Systems Using Riak (Raja Selvaraj)
  • Controlled Epidemics: Riak's New Gossip Protocol and Metadata Store (Jordan West) – RICON West 2013
  • CRDTs: An Update (or maybe just a PUT) (Sam Elliott) – RICON West 2013
  • CRDTs in Production (Jeremy Ong) – RICON West 2013
  • Denormalize This! Riak at State Farm (Richard Simon and Richard Berglund) – RICON West 2013
  • Distributed Systems Archeology (Michael Bernstein) – RICON West 2013
  • Distributing Work Across Clusters: Adventures With Riak Pipe (Susan Potter) – RICON West 2013
  • Dynamic Dynamos: Comparing Riak and Cassandra (Jason Brown) – RICON West 2013
  • LVars: lattice-based data structures for deterministic parallelism (Lindsey Kuper) – RICON West 2013
  • Maximum Viable Product (Justin Sheehy) – RICON West 2013
  • More Than Just Data: Using Riak Core to Manage Distributed Services (O'Connell) – RICON West 2013
  • Practicalities of Productionizing Distributed Systems (Jeff Hodges) – RICON West 2013
  • The Raft Consensus Algorithm (Diego Ongaro) – RICON West 2013
  • Riak Search 2.0 (Eric Redmond) – RICON West 2013
  • Riak Security; Locking the Distributed Chicken Coop (Andrew Thompson) – RICON West 2013
  • RICON West 2013 Lightning Talks
  • Seagate Kinetic Open Storage: Innovation to Enable Scale Out Storage (Hughes) – RICON West 2013
  • The Tail at Scale: Achieving Rapid Response Times in Large Online Services (Dean) – RICON West 2013
  • Timely Dataflow in Naiad (Derek Murray) – RICON West 2013
  • Troubleshooting a Distributed Database in Production (Shoffstall and Voiselle) – RICON West 2013
  • Yuki: Functional Data Structures for Riak (Ryland Degnan) – RICON West 2013
Enjoy!

January 16, 2014

Leiningen Install, The Missing Bits

Filed under: Clojure — Patrick Durusau @ 7:43 pm

Quite recently I installed Leiningen following the instructions from the homepage:

  1. Download the lein script (or on Windows lein.bat)
  2. Place it on your $PATH where your shell can find it (eg. ~/bin)
  3. Set it to be executable (chmod a+x ~/bin/lein)

Well, actually not.

Installing using sudo:

  1. Download the lein script (or on Windows lein.bat)
  2. sudo mv lein.txt /bin/lein (renames the file and moves it to /bin) or sudo mv lein /bin *
  3. Set it to be executable: sudo chmod a+x /bin/lein
  4. Execute lein: $ lein **

* If you save the lein script without a file extension, use the second command. Chrome on Ubuntu would only save the file with an extension.

** When you execute lein, leiningen-(version)-standalone.jar will be installed in your home directory under .lein.

You can always type lein -h to see your options, or check this lein 2.3.4-cheatsheet.pdf.

Courses for Skills Development in Biomedical Big Data Science

Filed under: BigData,Bioinformatics,Biomedical,Funding — Patrick Durusau @ 6:45 pm

Courses for Skills Development in Biomedical Big Data Science

Deadline for submission: April 1, 2014

Additional information: bd2k_training@mail.nih.gov

As part of the NIH Big Data to Knowledge (BD2K) project, this BD2K R25 FOA will support:

Courses for Skills Development in topics necessary for the utilization of Big Data, including the computational and statistical sciences in a biomedical context. Courses will equip individuals with additional skills and knowledge to utilize biomedical Big Data.

Challenges in biomedical Big Data?

The major challenges to using biomedical Big Data include the following:

Locating data and software tools: Investigators need straightforward means of knowing what datasets and software tools are available and where to obtain them, along with descriptions of each dataset or tool. Ideally, investigators should be able to easily locate all published and resource datasets and software tools, both basic and clinical, and, to the extent possible, unpublished or proprietary data and software.

Gaining access to data and software tools: Investigators need straightforward means of 1) releasing datasets and metadata in standard formats; 2) obtaining access to specific datasets or portions of datasets; 3) studying datasets with the appropriate software tools in suitable environments; and 4) obtaining analyzed datasets.

Standardizing data and metadata: Investigators need data to be in standard formats to facilitate interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Sharing data and software: While significant progress has been made in broad and rapid sharing of data and software, it is not yet the norm in all areas of biomedical research. More effective data- and software-sharing would be facilitated by changes in the research culture, recognition of the contributions made by data and software generators, and technical innovations. Validation of software to ensure quality, reproducibility, provenance, and interoperability is a notable goal.

Organizing, managing, and processing biomedical Big Data: Investigators need biomedical data to be organized and managed in a robust way that allows them to be fully used; currently, most data are not sufficiently well organized. Barriers exist to releasing, transferring, storing, and retrieving large amounts of data. Research is needed to design innovative approaches and effective software tools for organizing biomedical Big Data for data integration and sharing while protecting human subject privacy.

Developing new methods for analyzing biomedical Big Data: The size, complexity, and multidimensional nature of many datasets make data analysis extremely challenging. Substantial research is needed to develop new methods and software tools for analyzing such large, complex, and multidimensional datasets. User-friendly data workflow platforms and visualization tools are also needed to facilitate the analysis of Big Data.

Training researchers for analyzing biomedical Big Data: Advances in biomedical sciences using Big Data will require more scientists with the appropriate data science expertise and skills to develop methods and design tools, including those in many quantitative science areas such as computational biology, biomedical informatics, biostatistics, and related areas. In addition, users of Big Data software tools and resources must be trained to utilize them well.

It’s hard for me to read that list and not see subject identity as playing some role in meeting all of those challenges. Not a complete solution, because there are a variety of problems in each challenge. But to preserve access to data sets, issues, and approaches over time, subject identity is a necessary component of any solution.

Applicants have to be institutions of higher education, but I assume they can hire expertise as required.
