Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 13, 2014

The Battleship Moves

Filed under: Microsoft,Open Source — Patrick Durusau @ 2:45 pm

A milestone moment for Microsoft: .NET is now an open-source project by Jonathan Vanian.

During the acrimonious debate about OOXML, a friend said that Microsoft was like a very large battleship: it could turn, but movement wasn’t ever sudden.

From what I read in Jonathan’s post, MS is in the process of making yet another turn, this time to make .NET an open source project.

A move that gives credence to the proposition that being open source isn’t inconsistent with being a commercial enterprise and a profitable one.

Just as important, commercial open source software is a bulwark against government surveillance. Consumers will have a choice: buy binary, and possibly surveillance-infected, software, or use open source and the services of traditional vendors such as MS, IBM, HP, etc. to compile specific software packages for their use.

Opening up such a large package isn’t an overnight lark so I encourage everyone to be patient as MS eases .NET into the waters of open source. Continued good experiences with an open source .NET will further the open source agenda at Microsoft.

The more open source software in use, the fewer dark places for government surveillance to hide.

“Fewer dark places for government surveillance to hide.” Yet another benefit from open source software!

November 12, 2014

Mapnik

Filed under: Mapping,Maps — Patrick Durusau @ 8:31 pm

Mapnik

From the FAQ:

What is mapnik?

Mapnik is a Free Toolkit for developing mapping applications. It’s written in C++ and there are Python bindings to facilitate fast-paced agile development. It can comfortably be used for both desktop and web development, which was something I wanted from the beginning.

Mapnik is about making beautiful maps. It uses the AGG library and offers world class anti-aliasing rendering with subpixel accuracy for geographic data. It is written from scratch in modern C++ and doesn’t suffer from design decisions made a decade ago. When it comes to handling common software tasks such as memory management, filesystem access, regular expressions, parsing and so on, Mapnik doesn’t re-invent the wheel, but utilizes best of breed industry standard libraries from boost.org

Which platforms does it run on?

Mapnik is a cross platform toolkit that runs on Windows, Mac, and Linux (Since release 0.4). Users commonly run Mapnik on Mac >=10.4.x (both intel and PPC), as well as Debian/Ubuntu, Fedora, Centos, OpenSuse, and FreeBSD. If you run Mapnik on another Linux platform please add to the list on the Trac Wiki

What data formats are supported?

Mapnik uses a plugin architecture to read different datasources. Current plugins can read ESRI shapefiles, PostGIS, TIFF raster, OSM xml, Kismet, as well as all OGR/GDAL formats. More data access plug-ins will be available in the future. If you cannot wait and/or like coding in C++, why not write your own data access plug-in?

What are the plans for the future?

As always, there are lots of things in the pipeline. Sign up for the mapnik-users list or mapnik-devel list to join the community discussion.
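
To give a sense of the Python bindings the FAQ mentions, here is a minimal render sketch (my own, not from the Mapnik docs; the stylesheet path and output file are placeholders you would supply):

    import mapnik  # Mapnik's Python bindings

    # A 600 x 400 pixel map canvas
    m = mapnik.Map(600, 400)

    # Load a Mapnik XML stylesheet that defines the layers and styles
    # ("style.xml" is a placeholder for your own stylesheet)
    mapnik.load_map(m, "style.xml")

    # Zoom to the combined extent of all layers and write a PNG
    m.zoom_all()
    mapnik.render_to_file(m, "map.png", "png")
    print("wrote map.png")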

Governments as well as NGOs need mapping applications.

What mapping application will you create? What data will it merge on the fly for your users?

Completely open Collections on Europeana

Filed under: Europeana,Open Data — Patrick Durusau @ 8:14 pm

Completely open Collections on Europeana (spreadsheet)

A Google spreadsheet listing collections from Europeana.

The title isn’t completely accurate since it also lists collections that are not completely open.

I count ninety-eight (98) collections that are completely open, another two hundred and thirty-three (233) that use a Creative Commons license and four hundred and seven (407) that aren’t completely open or use a Creative Commons license.

You will need to check the individual entries to be sure of the licensing rights. I tried MusicMasters, which is listed as closed, to find that one (1) image could be used with attribution and two hundred and forty-seven (247) only with permission.

Europeana is a remarkable site that is marred by a pop-up that takes you to Facebook or to exhibits. For whatever reason, it is a “feature” of this pop-up that it cannot be closed, at least on Firefox and Chrome.

The spreadsheet should be useful as a quick reference for potentially open materials at Europeana.

I first saw this in a tweet by Amanda French.

Solr/Lucene 5.0 (December, 2014)

Filed under: Lucene,Solr — Patrick Durusau @ 7:48 pm

Just so you know, email traffic suggests a release candidate for Solr/Lucene 5.0 will appear in December, 2014.

If you are curious, see the unreleased Solr Reference Guide (for Solr 5.0).

If you are even more curious, see the issues targeted for Solr 5.0.

OK, I have to admit that not everyone uses Solr so see also the issues targeted for Lucene 5.0.

Nothing like a pre-holiday software drop to provide leisure activities for the holidays!

Solr’s New Website [with comments]

Filed under: Solr — Patrick Durusau @ 5:55 pm

Solr

Solr has a snazzy new website!

A couple of comments though:

Features

The Features page starts with impressive SVG icons, but they aren’t hyperlinked to more information on the page or elsewhere. That seems like a wasted opportunity to navigate to deeper information about each particular feature.

Further down on Features there are large bands that headline “detailed features,” which don’t correspond to the features named in the SVG icons, although in addition to brief text, they offer hyperlinks to the Solr Ref Guide.

Inserting Solr Ref Guide links for the more detailed SVG icons would accord with my expectations for such a page. You?

Would you still need the “detailed features,” which appear in no particular order?

Resources

The Resources page cites very high quality materials, but it seems a bit sparse considering the wide usage of Solr.

Moreover, I’m not sure the search links to Slideshare, Lucene/Solr Revolution, YouTube and Vimeo are as useful as possible.

The Search *** for Solr links with comments:

Search Slideshare for Solr:

Varying results. The URL http://www.slideshare.net/search/slideshow?&q=solr returns 5769 “hits” consistently. However, if you substitute the entity reference &amp; for the bare & in the string between the “?” and “q”, the results are 4243 “hits” consistently.

I discovered the difference because I used the resolved entity reference in the URL for this post and checking the link gave a different answer than the URL at the Solr page.

The general search results are in no particular date order. Add a date to your “Solr” search string to narrow the results down. Adding 2012 to the search string gives one thousand one hundred and seventy-five (1,175) “hits.” Not that I would want to search that many presentations for one relevant to a particular issue. Curated indexing would make a vast difference in the usefulness of Slideshare.

Lucene/Solr Revolution Videos from Past Events

Prime content for Lucene/Solr, and the yearly organization helps you guess which Lucene/Solr version is likely to be covered. Eight conferences and there is no index across the years by concept, issue, etc. Happy hunting!

Search YouTube for Solr

Ten thousand and six hundred (10,600) “hits” where the first “hit” is four years old. Yeah. Searching YouTube is like flushing a toilet and hoping something interesting comes into view.

At a minimum, use the Vimeo Solr search sorted by date link, which gives you the videos sorted by upload date. YouTube does have a “within N” time filter, but it only goes to one year and offers no ranges.

Search Vimeo for Solr

Two hundred and seventy-two (272) “hits” with the top ones being two (2) and five (5) years ago. Certainly not in date order.

At a minimum, use: Vimeo Solr search sorted by date link instead.

Be aware that slide and video resources tend to overlap, so you are likely to have to dedupe your results with every use.

A deduped and curated index of Lucene/Solr resources would be a real boon to developers/users.
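
The curation is a human job, but the dedupe step is easy to prototype. A minimal sketch (mine, with a made-up record format) that collapses slide and video entries sharing a normalized title:

    import re
    from collections import defaultdict

    # Hypothetical records gathered from Slideshare, YouTube and Vimeo searches
    records = [
        {"title": "Scaling Solr 4.x at Scale!", "source": "slideshare"},
        {"title": "Scaling SOLR 4.x at scale", "source": "youtube"},
        {"title": "Intro to SolrCloud", "source": "vimeo"},
    ]

    def normalize(title):
        # Lowercase, drop punctuation and trim whitespace to build a dedupe key
        return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["title"])].append(rec)

    for key, recs in groups.items():
        sources = sorted({r["source"] for r in recs})
        print(f"{recs[0]['title']!r} appears on: {', '.join(sources)}")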


Update: November 16, 2014. Apparently other people shared my concerns over the homepage and it is now substantially better than I reported above. Alas, the search links I mention remain as reported.

Open Sourcing Cubert: A High Performance Computation Engine for Complex Big Data Analytics

Filed under: Analytics,BigData,Cubert — Patrick Durusau @ 4:01 pm

Open Sourcing Cubert: A High Performance Computation Engine for Complex Big Data Analytics by Maneesh Varshney and Srinivas Vemuri.

From the post:

Cubert was built with the primary focus on better algorithms that can maximize map-side aggregations, minimize intermediate data, partition work in balanced chunks based on cost-functions, and ensure that the operators scan data that is resident in memory. Cubert has introduced a new paradigm of computation that:

  • organizes data in a format that is ideally suited for scalable execution of subsequent query processing operators
  • provides a suite of specialized operators (such as MeshJoin, Cube, Pivot) using algorithms that exploit the organization to provide significantly improved CPU and resource utilization

Cubert was shown to outperform other engines by a factor of 5-60X even when the data set sizes extend into 10s of TB and cannot fit into main memory.

The Cubert operators and algorithms were developed to specifically address real-life big data analytics needs:

  • Complex Joins and aggregations frequently arise in the context of analytics on various user level metrics which are gathered on a daily basis from a user facing website. Cubert provides the unique MeshJoin algorithm that can process data sets running into terabytes over large time windows.
  • Reporting workflows are distinct from ad-hoc queries by virtue of the fact that the computation pattern is regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing, a feature exploited by the Cubert runtime for significantly improved efficiency and resource footprint.
  • Cubert provides the new power-horse CUBE operator that can efficiently (CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics; can roll up inner dimensions on-the-fly and compute multiple measures within a single job.
  • Cubert provides novel algorithms for graph traversal and aggregations for large-scale graph analytics.

Finally, Cubert Script is a developer-friendly language that takes out the hints, guesswork and surprises when running the script. The script provides the developers complete control over the execution plan (without resorting to low-level programming!), and is extremely extensible by adding new functions, aggregators and even operators.

and the source/documentation:

Cubert source code and documentation

The source code is open sourced under Apache v2 License and is available at https://github.com/linkedin/Cubert

The documentation, user guide and javadoc are available at http://linkedin.github.io/Cubert

The abstractions for data organization and calculations were presented in the following paper:
“Execution Primitives for Scalable Joins and Aggregations in Map Reduce”, Srinivas Vemuri, Maneesh Varshney, Krishna Puttaswamy, Rui Liu. 40th International Conference on Very Large Data Bases (VLDB), Hangzhou, China, Sept 2014. (PDF)

Another advance in the processing of big data!

Now if we could just see a similar advance in the identification of entities/subjects/concepts/relationships in big data.

Nothing wrong with faster processing but a PB of poorly understood data is a PB of poorly understood data no matter how fast you process it.
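
PS: To make the CUBE operator concrete, here is a toy sketch of what a CUBE over two dimensions with a count-distinct measure produces. This is pandas and brute force over every subset of the dimensions, not Cubert, but it shows the shape of the output and why count distinct is the hard case (partial distinct counts cannot simply be added up):

    import itertools
    import pandas as pd

    df = pd.DataFrame({
        "country": ["us", "us", "de", "de", "de"],
        "device":  ["web", "app", "web", "web", "app"],
        "user_id": [1, 2, 1, 3, 3],
    })

    dims = ["country", "device"]
    rows = []
    # CUBE: aggregate over every subset of the dimensions, including the empty set
    for r in range(len(dims) + 1):
        for subset in itertools.combinations(dims, r):
            if subset:
                g = df.groupby(list(subset))["user_id"].nunique().reset_index()
            else:
                g = pd.DataFrame({"user_id": [df["user_id"].nunique()]})
            # mark rolled-up dimensions with ALL
            for d in dims:
                if d not in subset:
                    g[d] = "ALL"
            rows.append(g[dims + ["user_id"]])

    cube = pd.concat(rows, ignore_index=True).rename(columns={"user_id": "distinct_users"})
    print(cube)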

Preventing Future Rosetta “Tensions”

Filed under: Astroinformatics,Open Access,Open Data — Patrick Durusau @ 2:34 pm

Tensions surround release of new Rosetta comet data by Eric Hand.

From the post:


For the Rosetta mission, there is an explicit tension between satisfying the public with new discoveries and allowing scientists first crack at publishing papers based on their own hard-won data. “There is a tightrope there,” says Taylor, who’s based at ESA’s European Space Research and Technology Centre (ESTEC) in Noordwijk, the Netherlands. But some ESA officials are worried that the principal investigators for the spacecraft’s 11 instruments are not releasing enough information. In particular, the camera team, led by principal investigator Holger Sierks, has come under special criticism for what some say is a stingy release policy. “It’s a family that’s fighting, and Holger is in the middle of it, because he holds the crown jewels,” says Mark McCaughrean, an ESA senior science adviser at ESTEC.

Allowing scientists to withhold data for some period is not uncommon in planetary science. At NASA, a 6-month period is typical for principal investigator–led spacecraft, such as the MESSENGER mission to Mercury, says James Green, the director of NASA’s planetary science division in Washington, D.C. However, Green says, NASA headquarters can insist that the principal investigator release data for key media events. For larger strategic, or “flagship,” missions, NASA has tried to release data even faster. The Mars rovers, such as Curiosity, have put out images almost as immediately as they are gathered.

Sierks, of the Max Planck Institute for Solar System Research in Göttingen, Germany, feels that the OSIRIS team has already been providing a fair amount of data to the public—about one image every week. Each image his team puts out is better than anything that has ever been seen before in comet research, he says. Furthermore, he says other researchers, unaffiliated with the Rosetta team, have submitted papers based on these released images, while his team has been consumed with the daily task of planning the mission. After working on OSIRIS since 1997, Sierks feels that his team should get the first shot at using the data.

“Let’s give us a chance of a half a year or so,” he says. He also feels that his team has been pressured to release more data than other instruments. “Of course there is more of a focus on our instrument,” which he calls “the eyes of the mission.”

What if there were another solution to the Rosetta “tensions” besides 1) privileging researchers with six (6) months of exclusive access to data, or 2) releasing data as soon as it is gathered?

I am sure everyone can gather arguments for one or the other of those sides but either gathering or repeating them isn’t going to move the discussion forward.

What if there were an agreed-upon registry for data sets (not a repository but a registry) where researchers could register anticipated data and, once it is acquired, the date the data was deposited to a public repository along with a list of researchers entitled to publish using that data?

The set of publications in most subject areas is rather small, and if they agreed not to accept or review papers based upon registered data for six (6) months, or some other agreed-upon period, researchers could release data as acquired and yet protect their opportunity for first use of the data for publication purposes.

This simple sketch leaves a host of details to explore and answer but registering data for publication delay could answer the concerns that surround publicly funded data in general.

Thoughts?

Online Master of Science In Computer Science [Georgia Tech]

Filed under: Computer Science,Education — Patrick Durusau @ 11:24 am

Online Master of Science In Computer Science

From the homepage:

The Georgia Institute of Technology, Udacity and AT&T have teamed up to offer the first accredited Master of Science in Computer Science that students can earn exclusively through the Massive Open Online Course (MOOC) delivery format and for a fraction of the cost of traditional, on-campus programs.

This collaboration—informally dubbed “OMS CS” to account for the new delivery method—brings together leaders in education, MOOCs and industry to apply the disruptive power of massively open online teaching to widen the pipeline of high-quality, educated talent needed in computer science fields.

Whether you are a current or prospective computing student, a working professional or simply someone who wants to learn more about the revolutionary program, we encourage you to explore the Georgia Tech OMS CS: the best computing education in the world, now available to the world.

A little more than a year old, the Georgia Tech OMS CS program continues to grow. In One Down, Many to Go, Carl Straumsheim reports high marks for the program from students and administrators who are feeling their way along in this experiment in the delivery of education.

At an estimated cost of less than $7,000 for a Master of Science in Computer Science, this program has the potential to change the complexion of higher education in computer science at least.

How many years (decades?) it will take for this delivery model to trickle down to the humanities is uncertain. J.J. O’Donnell made waves in 2004 by teaching Augustine: the Seminar to a global audience, but there has been no rush of humanities scholars to follow his example.

Potentially catastrophic bug bites all versions of Windows. Patch now

Filed under: Cybersecurity,Microsoft — Patrick Durusau @ 10:41 am

Potentially catastrophic bug bites all versions of Windows. Patch now by Dan Goodin.

From the post:

Microsoft has disclosed a potentially catastrophic vulnerability in virtually all versions of Windows. People operating Windows systems, particularly those who run websites, should immediately install a patch Microsoft released Tuesday morning.

The vulnerability resides in the Microsoft secure channel (schannel) security component that implements the secure sockets layer and transport layer security (TLS) protocols, according to a Microsoft advisory. A failure to properly filter specially formed packets makes it possible for attackers to execute attack code of their choosing by sending malicious traffic to a Windows-based server.

While the advisory makes reference to vulnerabilities targeting Windows servers, the vulnerability is rated critical for client and server versions of Windows alike, an indication the remote-code bug may threaten Windows desktops and laptop users as well. Amol Sarwate, director of engineering at Qualys, told Ars the flaw leaves client machines open if users run software that monitors Internet ports and accepts encrypted connections.

This sort of security announcement makes you nostalgic for the Black Screen and Blue Screen of Death, doesn’t it? While looking up the reference on the Blue Screen of Death, I discovered that Windows still has that feature. I was thinking of the Blue Screen of Death from Windows NT days; I hadn’t seen a blue screen on Windows XP, so I assumed those issues had been fixed. My bad.

Danger, Danger!

This security update is rated Critical for all supported releases of Microsoft Windows. (emphasis added)

The earliest versions of Windows listed are Vista and Windows Server 2003.

Which excludes Windows XP, whose security support ended on April 8, 2014.

I mention that because of Jose Pagliery’s report, 95% of bank ATMs face end of security support.

Yes, 95% of bank ATMs were running Windows XP (est.). Some banks were reported to have made arrangements with MS for continued support but who and for how long isn’t known.

The support bulletin doesn’t say whether the vulnerability exists in Windows XP, but you could start looking with Vulnerability in the Windows Schannel Security Package Could Allow Remote Code Execution (935840), published June 12, 2007, a different security issue with Schannel.

If you confirm that the issue in MS14-066 affects Windows XP, please post a comment. Thanks!

PS: Better organization of the Windows documentation would help security researchers. Being able to navigate from releases to the specific files for a particular problem, and then backward to other versions and their files, would be quite helpful, even if packages are needed for updates due to dependencies between files.


Update: November 16, 2014.

On November 14, 2014, Sara Peters posted: Microsoft Fixes Critical SChannel & OLE Bugs, But No Patches For XP and writes in part:

Joe Barrett, senior security consultant of Foreground Security says that Winshock “will most likely be the first true ‘forever-day’ vulnerability for Windows NT, Windows 2000, and Windows XP. As Microsoft has ceased all support and publicly stated they will no longer release security patches, enterprises who still have Windows 2000 and Windows XP machines will find themselves in the uncomfortable situation of having an exploitable-but-unpatchable system on their network,” he says.

“Security researchers and blackhats alike are most likely racing to get the first workable exploit against this vulnerability, and the bad guys will begin immediately using it to compromise as much as they can,” he says. “As a result, enterprises need to immediately deploy the patch to every system they can and also begin isolating and removing the unpatchable systems to prevent serious compromise of their networks.”

I guess that removes all doubt about XP based ATMs being vulnerable.

November 11, 2014

Discovering Patterns for Cyber Defense Using Linked Data Analysis [12th Nov., 10am PDT]

Filed under: Cybersecurity,Hadoop,Hortonworks,Linked Data — Patrick Durusau @ 5:22 pm

Discovering Patterns for Cyber Defense Using Linked Data Analysis

Wednesday, Nov. 12th | 10am PDT

I am always suspicious of one-day announcements of webinars. This post appeared on November 11th for a webinar on November 12th.

Only one way to find out so I registered. Join me to find out: substantive presentation or click-bait.

If enough people attend and then comment here, one way or the other, who knows? It might make a difference.

From the post:

Almost every week, news of a proprietary or customer data breach hits the news wave. While attackers have increased the level of sophistication in their tactics, so too have organizations advanced in their ability to build a robust, data-driven defense.

Apache Hadoop has emerged as the de facto big data platform, which makes it the perfect fit to accumulate cybersecurity data and diagnose the latest attacks.  As Enterprises roll out and grow their Hadoop implementations, they require effective ways for pinpointing and reasoning about correlated events within their data, and assessing their network security posture.

Join Hortonworks and Sqrrl to learn:

  • How Linked Data Analysis enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data
  • Effective ways to correlated events within your data, and assessing your network security posture
  • New techniques for discovering hidden patterns and detecting anomalies within your data
  • How Hadoop fits into your current data structure forming a secure, Modern Data Architecture

Register now to learn how combining the power of Hadoop and the Hortonworks Data Platform with massive, secure, entity-centric data models in Sqrrl Enterprise allows you to create a data-driven defense.

Bring your red pen. November 12, 2014 at 10am PDT. (That should be 1pm East Coast time.) See you then!

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle

Filed under: Clojure,DSL,RDF,Semantic Web — Patrick Durusau @ 5:04 pm

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle by Frédérick Giasson.

From the post:

Some of my recent work leaded me to heavily use Clojure to develop all kind of new capabilities for Structured Dynamics. The ones that knows us, knows that every we do is related to RDF and OWL ontologies. All this work with Clojure is no exception.

Recently, while developing a Domain Specific Language (DSL) for using the Open Semantic Framework (OSF) web service endpoints, I did some research to try to find some kind of simple Clojure DSL that I could use to generate RDF data (in any well-known serialization). After some time, I figured out that no such a thing was currently existing in the Clojure ecosystem, so I choose to create my simple DSL for creating RDF data.

The primary goal of this new project was to have a DSL that users could use to created RDF data that could be feed to the OSF web services endpoints such as the CRUD: Create or CRUD: Update endpoints.

What I choose to do is to create a new project called clj-turtle that generates RDF/Turtle code from Clojure code. The Turtle code that is produced by this DSL is currently quite verbose. This means that all the URIs are extended, that the triple quotes are used and that the triples are fully described.

This new DSL is mean to be a really simple and easy way to create RDF data. It could even be used by non-Clojure coder to create RDF/Turtle compatible data using the DSL. New services could easily be created that takes the DSL code as input and output the RDF/Turtle code. That way, no Clojure environment would be required to use the DSL for generating RDF data.

I mention Frédérick’s DSL for RDF despite my doubts about RDF. Good or not, RDF has achieved the status of legacy data.
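
If you don’t run Clojure, the same round trip, building triples in code and serializing them as Turtle, looks like this in Python with rdflib (my sketch, unrelated to clj-turtle; the example URIs are made up):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")            # hypothetical vocabulary
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    g.bind("ex", EX)
    g.bind("foaf", FOAF)

    fred = URIRef("http://example.org/people/fred")
    g.add((fred, RDF.type, FOAF.Person))
    g.add((fred, FOAF.name, Literal("Fred")))
    g.add((fred, EX.worksOn, EX.CljTurtle))

    # Serialize the graph as RDF/Turtle (rdflib 6+ returns a str, older versions bytes)
    print(g.serialize(format="turtle"))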

Linked Data Integration with Conflicts

Filed under: Data Integration,Integration,Linked Data — Patrick Durusau @ 4:40 pm

Linked Data Integration with Conflicts by Jan Michelfeit, Tomáš Knap, Martin Nečaský.

Abstract:

Linked Data have emerged as a successful publication format and one of its main strengths is its fitness for integration of data from multiple sources. This gives them a great potential both for semantic applications and the enterprise environment where data integration is crucial. Linked Data integration poses new challenges, however, and new algorithms and tools covering all steps of the integration process need to be developed. This paper explores Linked Data integration and its specifics. We focus on data fusion and conflict resolution: two novel algorithms for Linked Data fusion with provenance tracking and quality assessment of fused data are proposed. The algorithms are implemented as part of the ODCleanStore framework and evaluated on real Linked Open Data.

Conflicts in Linked Data? The authors explain:

From the paper:

The contribution of this paper covers the data fusion phase with conflict resolution and a conflict-aware quality assessment of fused data. We present new algorithms that are implemented in ODCleanStore and are also available as a standalone tool ODCS-FusionTool.

Data fusion is the step where actual data merging happens – multiple records representing the same real-world object are combined into a single, consistent, and clean representation [3]. In order to fulfill this definition, we need to establish a representation of a record, purge uncertain or low-quality values, and resolve identity and other conflicts. Therefore we regard conflict resolution as a subtask of data fusion.

Conflicts in data emerge during the data fusion phase and can be classified as schema, identity, and data conflicts. Schema conflicts are caused by different source data schemata – different attribute names, data representations (e.g., one or two attributes for name and surname), or semantics (e.g., units). Identity conflicts are a result of different identifiers used for the same real-world objects. Finally, data conflicts occur when different conflicting values exist for an attribute of one object.

Conflict can be resolved on entity or attribute level by a resolution function. Resolution functions can be classified as deciding functions, which can only choose values from the input such as the maximum value, or mediating functions, which may produce new values such as average or sum [3].

Oh, so the semantic diversity of data simply flowed into Linked Data representation.

Hmmm, watch for a basis in the data for resolving schema, identity and data conflicts.

The related work section is particularly rich with references to non-Linked Data conflict resolution projects. Definitely worth a close read and chasing the references.

To examine the data fusion and conflict resolution algorithm the authors start by restating the problem:

  1. Different identifying URIs are used to represent the same real-world entities.
  2. Different schemata are used to describe data.
  3. Data conflicts emerge when RDF triples sharing the same subject and predicate have inconsistent values in place of the object.

I am skipping all the notation manipulation for the quads, etc., mostly because of the inputs into the algorithm:

(image: ld-input-resolution, the example inputs to the resolution algorithm)

As a result of human intervention, the different identifying URIs have been mapped together. Not to mention the weighting of the metadata and the desired resolution for data conflicts (location data).

With that intervention, the complex RDF notation and manipulation becomes irrelevant.

Moreover, as I am sure you are aware, there is more than one “Berlin” listed in DBpedia. Several dozen as I recall.

I mention that because the process as described does not say where the authors of the rules/mappings obtained the information necessary to distinguish one Berlin from another.

That is critical for another author to evaluate the correctness of their mappings.

At the end of the day, after the “resolution” proposed by the authors, we are in no better position to map their result to another than we were at the outset. We have bald statements with no additional data on which to evaluate those statements.

Give Appendix A, List of Conflict Resolution Functions, a close read. The authors have extracted conflict resolution functions from the literature. It should be a time saver as well as suggestive of other needed resolution functions.
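
To make the deciding/mediating distinction concrete, here is a minimal sketch in Python (my illustration, not ODCS-FusionTool code). The conflicting values stand in for the objects of RDF triples that share a subject and predicate:

    # Conflicting values for one attribute of one real-world entity,
    # e.g. the population of "Berlin" as reported by three Linked Data sources
    values = [3400000, 3562166, 3500000]

    # Deciding functions choose one of the input values ...
    def resolve_max(vals):
        return max(vals)

    def resolve_first(vals):
        return vals[0]

    # ... mediating functions may produce a new value not present in the input
    def resolve_avg(vals):
        return sum(vals) / len(vals)

    print("MAX   ->", resolve_max(values))    # deciding
    print("FIRST ->", resolve_first(values))  # deciding
    print("AVG   ->", resolve_avg(values))    # mediating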

PS: If you look for ODCS-FusionTool you will find LD-Fusion Tool (GitHub), which was renamed to ODCS-FusionTool a year ago. See also the official LD-FusionTool webpage.

Computational drug repositioning through heterogeneous network clustering

Filed under: Bioinformatics,Biomedical,Drug Discovery — Patrick Durusau @ 3:49 pm

Computational drug repositioning through heterogeneous network clustering by Wu C, Gudivada RC, Aronow BJ, Jegga AG. (BMC Syst Biol. 2013;7 Suppl 5:S6. doi: 10.1186/1752-0509-7-S5-S6. Epub 2013 Dec 9.)

Abstract:

BACKGROUND:

Given the costly and time consuming process and high attrition rates in drug discovery and development, drug repositioning or drug repurposing is considered as a viable strategy both to replenish the drying out drug pipelines and to surmount the innovation gap. Although there is a growing recognition that mechanistic relationships from molecular to systems level should be integrated into drug discovery paradigms, relatively few studies have integrated information about heterogeneous networks into computational drug-repositioning candidate discovery platforms.

RESULTS:

Using known disease-gene and drug-target relationships from the KEGG database, we built a weighted disease and drug heterogeneous network. The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features. We clustered this weighted network to identify modules and then assembled all possible drug-disease pairs (putative drug repositioning candidates) from these modules. We validated our predictions by testing their robustness and evaluated them by their overlap with drug indications that were either reported in published literature or investigated in clinical trials.

CONCLUSIONS:

Previous computational approaches for drug repositioning focused either on drug-drug and disease-disease similarity approaches whereas we have taken a more holistic approach by considering drug-disease relationships also. Further, we considered not only gene but also other features to build the disease drug networks. Despite the relative simplicity of our approach, based on the robustness analyses and the overlap of some of our predictions with drug indications that are under investigation, we believe our approach could complement the current computational approaches for drug repositioning candidate discovery.

A reminder that data clustering isn’t just of academic interest but is useful in highly remunerative fields as well. 😉

There is a vast amount of literature on data clustering, but I don’t know if there is a collection of data clustering patterns.

That is, a work that summarizes where data clustering has been used, by domain, and the similarities on which the clustering was performed.

In this article, the clustering was described as:

The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features.

Has that been used elsewhere in medical research?

Not that clustering should be limited to prior patterns but prior patterns could stimulate new patterns to be applied.
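
As a toy illustration of the pattern (not the authors’ method or data), here is a small weighted drug–disease graph clustered into modules with networkx; candidate repositioning pairs are the drug–disease pairs that land in the same module:

    from itertools import product

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Hypothetical toy network: edge weight = number of shared genes/pathways
    G = nx.Graph()
    edges = [
        ("drug:metformin", "disease:type2_diabetes", 5),
        ("drug:metformin", "disease:pcos", 2),
        ("drug:aspirin", "disease:cad", 4),
        ("drug:aspirin", "disease:stroke", 3),
        ("disease:cad", "disease:stroke", 6),
    ]
    for u, v, w in edges:
        G.add_edge(u, v, weight=w)

    # Cluster the weighted network into modules
    modules = greedy_modularity_communities(G, weight="weight")

    # Assemble candidate drug-disease pairs from each module
    for i, module in enumerate(modules):
        drugs = [n for n in module if n.startswith("drug:")]
        diseases = [n for n in module if n.startswith("disease:")]
        for drug, disease in product(drugs, diseases):
            print(f"module {i}: candidate pair {drug} <-> {disease}")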

Thoughts?

Massively Parallel Clustering: Overview

Filed under: Clustering,Hadoop,MapReduce — Patrick Durusau @ 3:35 pm

Massively Parallel Clustering: Overview by Grigory Yaroslavtsev.

From the post:

Clustering is one of the main vechicles of machine learning and data analysis.
In this post I will describe how to make three very popular sequential clustering algorithms (k-means, single-linkage clustering and correlation clustering) work for big data. The first two algorithms can be used for clustering a collection of feature vectors in \(d\)-dimensional Euclidean space (like the two-dimensional set of points on the picture below, while they also work for high-dimensional data). The last one can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

(image: mapreduce clustering)

Besides optimizing different objective functions these algorithms also give qualitatively different types of clusterings.
K-means produces a set of exactly k clusters. Single-linkage clustering gives a hierarchical partitioning of the data, which one can zoom into at different levels and get any desired number of clusters.
Finally, in correlation clustering the number of clusters is not known in advance and is chosen by the algorithm itself in order to optimize a certain objective function.

All algorithms described in this post use the model for massively parallel computation that I described before.

I thought you might be interested in parallel clustering algorithms after the post on OSM-France. Don’t skip the model for massively parallel computation link. It and the discussion that follows are rich in resources on parallel clustering. Lots of links.

I take heart from the line:

The last one [Correlation Clustering] can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

The words “some measure of similarity” should be taken as a warning that any particular “measure of similarity” should be examined closely and tested against the data so processed. It could be that the “measure of similarity” produces a desired result on a particular data set. You won’t know until you look.
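
A tiny demonstration of that warning (scipy, my own toy data): the same four points, single-linkage clustered into two groups, come out differently under euclidean and cosine distance.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Four 2-D points: two short vectors near the origin, two long ones
    X = np.array([[1.0, 0.0],    # p1
                  [10.0, 0.5],   # p2 (same direction as p1, far away)
                  [0.0, 1.0],    # p3
                  [0.5, 9.0]])   # p4 (same direction as p3, far away)

    for metric in ("euclidean", "cosine"):
        Z = linkage(pdist(X, metric=metric), method="single")
        labels = fcluster(Z, t=2, criterion="maxclust")
        print(metric, "->", labels)

    # Euclidean single linkage chains p1, p3 and p4 together and isolates p2;
    # cosine distance groups points by direction: {p1, p2} versus {p3, p4}.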

I/O Problem @ OpenStreetMap France

Filed under: Clustering,Mapping,Maps,Topic Maps — Patrick Durusau @ 2:54 pm

Benefit of data clustering for osm2pgsql/mapnik rending by Christian Quest.

The main server for OpenStreetMap France had an I/O problem:

(image: OSM-FR)

See Christian’s post for the details, but the essence of the solution was to cluster geographic data on the basis of its location to reduce the amount of I/O. Not unlike randomly seeking topics with similar characteristics.
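
One common way to get that kind of locality is to sort rows by a space-filling-curve key, such as a geohash or Morton (Z-order) code, so features that are close on the map end up close on disk. A toy Morton-key sketch (my illustration of the idea, not the actual osm2pgsql change):

    def morton_key(lon, lat, bits=16):
        """Interleave the bits of quantized lon/lat into a single sort key."""
        # Quantize each coordinate to an integer in [0, 2**bits)
        x = int((lon + 180.0) / 360.0 * (2**bits - 1))
        y = int((lat + 90.0) / 180.0 * (2**bits - 1))
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)
            key |= ((y >> i) & 1) << (2 * i + 1)
        return key

    # Toy "rows": a few French cities
    points = {
        "Paris": (2.35, 48.85),
        "Lyon": (4.83, 45.76),
        "Marseille": (5.37, 43.30),
        "Brest": (-4.49, 48.39),
    }

    # Writing rows in Morton-key order keeps geographically close features
    # in nearby disk blocks, which is what cuts the random I/O
    for name in sorted(points, key=lambda n: morton_key(*points[name])):
        print(name, morton_key(*points[name]))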

How much did clustering reduce the I/O?

(image: OSM-FR stats)

Nearly 100% I/O was reduced to 15% I/O. 85% improvement.

An 85% improvement in I/O doesn’t look bad on a weekly/monthly activity report!

Now imagine clustering topics for dynamic merging and presentation to a user. Among other things, you can have an “auditing” view that shows all the topics that will merge to form a single topic in a presentation view.

Or a “pay-per-view” view that uses a different cluster to reveal more information for paying customers.

All while retaining the capacity to produce a serialized static file as an information product.

Open Source Aerospike NoSQL Database Scales To 1M TPS For $1.68 Per Hour…

Filed under: Aerospike,Cloud Computing — Patrick Durusau @ 11:39 am

Open Source Aerospike NoSQL Database Scales To 1M TPS For $1.68 Per Hour On A Single Amazon Web Services Instance at AWS re:Invent 2014

From the post:

Aerospike – the first flash-optimized open source database and the world’s fastest in-memory NoSQL database – will be at Amazon Web Services (AWS) re:Invent 2014 conference in Las Vegas, Nev.

An ultra low latency Key-Value Store, Aerospike can operate in pure RAM backed by Amazon Elastic Block Store (EBS) for persistence as well as in a hybrid mode using RAM and SSDs. Aerospike engineers have documented the performance of different AWS EC2 instances and described the best techniques to achieve 1 Million transactions per second on one instance with sub-millisecond latency.

The Aerospike AMI in the Amazon Marketplace comes with cloud formation scripts for simple, single click deployments. The open source Aerospike Community Edition is free and the Aerospike Enterprise Edition with certified binaries and Cross Data Center Replication (XDR) is also free for startups in the startup special program. Aerospike is priced simply based on the volume of unique data managed, with no charge for replicated data, for Transactions Per Second (TPS) or number of servers in a cluster.

Aerospike is popularly used as a session store, cookie store, user profile store, id-mapping store, for fraud detection, dynamic pricing, real-time product recommendations and personalization of cross channel user experiences on websites, mobile apps, e-commerce portals, travel portals, financial services portals and real-time bidding platforms. To ensure 24x7x365 operations; data in Aerospike is replicated synchronously with immediate consistency within a cluster and asynchronously across clusters in different availability zones using Aerospike Cross Data Center Replication (XDR).

This is not a plug for or against Aerospike. I am posting this mostly as a reminder, to me as much as to you, that cloud data prices can be remarkably sane. Even $1.68 per hour can add up over a week, but if you develop locally and test in the cloud, you should be able to meet your budget targets.
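
For scale, $1.68 an hour works out to $40.32 a day, about $282 for a full week, and roughly $1,226 for a 730-hour month: fine for short test runs, real money if an instance is left running.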

For any paying client, you can pass the cloud hosting fees (with an upfront deposit and one month in advance) to them.

Other examples of reasonable cloud pricing?

Riak 2.0

Filed under: Riak — Patrick Durusau @ 11:29 am

Discovering Riak 2.0 Webinar Series

From the webpage:

Webinar Registration

Join Basho Product experts and customers as we take a deep dive into the Riak 2.0 features and capabilities. A brief overview of Riak 2.0 is covered in a recent blog post here http://basho.com/distributed-data-types-riak-2-0

Each webinar will be held twice on the indicated days to accommodate different time zones. Please register for the Webinars that interest you by clicking on the links below.

Thurs. 11/13 – “Deep Dive on Riak 2.0 Data Types”

Thu, Nov 13, 2014 8:00 AM – 9:00 AM PST
Thu, Nov 13, 2014 12:00 PM – 1:00 PM PST

Riak is an eventually consistent system. When handling conflicts, due to concurrent writes, in a distributed database the client application must have a way to resolve conflicts.

Riak Data Types give the developer the power of application modeling, while relieving them of the burden of designing and testing merge functions.

In this webinar we will provide an overview of Riak Data Types, the approach to adding them to Riak, and their usage in a practical application.

Wed. 11/19 – “Using Solr to Find Your Keys”

Wed, Nov 19, 2014 8:00 AM – 9:00 AM PST
Wed, Nov 19, 2014 12:00 PM – 1:00 PM PST

Riak 2.0 contains the next iteration of Riak Search, it pairs the strength of Riak as a horizontally scalable, distributed database with the powerful full-text search functionality of Apache Solr.

Reading the blog post at: http://basho.com/distributed-data-types-riak-2-0 will be good preparation for the first seminar.

From the post:

CRDT stands for (variously) Conflict-free Replicated Data Type, Convergent Replicated Data Type, Commutative Replicated Data Type, and others. The key, repeated, phrase is “Replicated Data Types”.

One strategy for avoiding data conflicts is normalization, as we know from the relational world, where normalization results in only one copy of any data. But that presumes human curation of a data structure that eliminates duplication of data.

Normalization isn’t a concern for distributed systems, which by definition can have multiple copies of the same data. But what happens when inconsistent duplicated data is combined together is the issue addressed by CRDTs (whatever your expansion).

If you think about it, topics that represent the same subject may well hold “inconsistent” data about that subject: data that is present on one topic and absent on the other, or that appears on both topics with inconsistent values. CRDTs offer a way to define automated handling of some forms of “inconsistency.”
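
For a feel of how a CRDT sidesteps merge conflicts, here is a minimal grow-only counter (G-Counter) sketch in Python, my illustration rather than Riak’s implementation. Each replica increments only its own slot and merge takes the element-wise maximum, so replicas can be combined in any order without losing increments:

    class GCounter:
        """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""

        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}          # replica_id -> count

        def increment(self, n=1):
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            for rid, count in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), count)

    # Two replicas accept writes independently ...
    a, b = GCounter("a"), GCounter("b")
    a.increment(3)
    b.increment(2)

    # ... and merging in either order converges to the same value
    a.merge(b)
    b.merge(a)
    print(a.value(), b.value())   # 5 5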

Suggestion: Install a copy of Riak 2.0 before the webinar.

More Public Input @ W3C

Filed under: Standards,W3C — Patrick Durusau @ 10:15 am

In an effort to get more public input on W3C drafts, a new mailing list has been created:

public-review-announce@w3.org list.

One outcome of this list could be little or no increase in public input on W3C drafts, in which case the forces that favor a closed club at the W3C will be saying “I told you so,” privately of course.

Another outcome could be an increase in public input on W3C drafts, from a broader range of stakeholders than has been the case in the past. In that case, W3C drafts will benefit from the greater input and the case can be made for a greater public voice at the W3C.

But the fate of a greater public voice at the W3C rests with you and others like you. If you don’t speak up when you have the opportunity, people will assume you don’t want to speak at all. Perhaps wrong but that is the way it works.

My recommendation is that you subscribe to this new list and as appropriate, spread the news of W3C drafts of interest to stakeholders in your community. More than that, you should actively encourage people to review and submit comments. And review and submit comments yourself.

The voice at risk is yours.

Your call.

subscribe to public-review-announce

Crimebot

Filed under: Mapping,Open Data — Patrick Durusau @ 8:41 am

Open Data On the Ground: Jamaica’s Crimebot by Samuel Lee.

From the post:

Some areas of Jamaica, particularly major cities such as Kingston and Montego Bay, experience high levels of crime and violence. If you were to type, “What is Jamaica’s biggest problem” in a Google search, you’ll see that the first five results are about crime.

(image omitted)

Using data to pinpoint high crime areas

CrimeBot (www.crimebot.net) fights crime by providing crime hotspot views and sending out alerts based on locations through mobile devices. By allowing citizens to submit information about suspicious activity in real-time, CrimeBot also serves as a tool to fight back against crime and criminals. As its base of users grow and information expands, CrimeBot can more accurately pinpoint areas of higher crime frequency for informed and improved public safety. Developed by a team in Jamaica, CrimeBot improves the “neighborhood watch” concept by applying mobile technology to information dissemination and real-time data collection. A Google Hangout discussion with CrimeBot team member Dave Oakley can be viewed through this link.

Data collection technology that helps reduce violence and crime

The CrimeBot team – Kashif Hewitt, Dave Oakley, Aldrean Smith, Garth Thompson – came together in the lead up to a Caribbean apps competition in Jamaica called Digital Jam 3, in which CrimeBot was awarded the top prize. Prior to entering the contest, the group researched the most pressing issues in the Caribbean and Jamaica, which turned out to be violence and crime.

(image omitted)

The team decided to help Jamaicans fight and reduce crime by taking a deeper look at international statistics and conducting interviews with potential users of the app among friends and other contacts. In just 19 days of development, the team took CrimeBot from concept to working prototype.

The team discovered that 58% of crimes around the world go unreported. Interviews with potential users of the app revealed that many would-be tipsters feared for their safety, lacked confidence in local authorities, or preferred to take matters into their own hands. To counter some of these barriers, CrimeBot offers an anonymous way to report crime. While this doesn’t directly solve crimes, CrimeBot provides law enforcement officials with better data, intelligence, and affords citizens and tourists greater protection through preventative measures.

Crimebot is of particular interest because it includes unreported crimes, which don’t show up in maps constructed on the basis of arrests.

One can imagine real time crime maps at a concierge desk with offers from local escort (in the traditional sense) services.

Or when merged with other records, the areas with the lowest conviction rates and/or prison sentences.

The project also has a compelling introduction video.

November 10, 2014

SVM – Understanding the math

Filed under: Machine Learning,Mathematics,Support Vector Machines — Patrick Durusau @ 4:13 pm

SVM – Understanding the math – Part 1 by Alexandre Kowalczyk. (Part 2)

The first two tutorials of a series on Support Vector Machines (SVM) and their use in data analysis.

If you shudder when you read:

The objective of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.

you won’t after reading these tutorials. Well written and illustrated.
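
If you want to poke at the idea in code while you read, here is a small scikit-learn sketch (mine, not from the tutorials): fit a linear SVM on a few separable points and compute the quantity being maximized, the margin, which for a linear SVM is 2/||w||.

    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable blobs of points
    X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
                  [5, 5], [6, 5], [5, 6]])     # class 1
    y = np.array([0, 0, 0, 1, 1, 1])

    # A linear SVM with a large C approximates the hard-margin case
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w = clf.coef_[0]
    print("w =", w, "b =", clf.intercept_[0])
    print("margin width =", 2.0 / np.linalg.norm(w))
    print("support vectors:", clf.support_vectors_)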

If you think about it, math symbolism is like programming. It is a very precise language written with a great deal of economy, which makes it hard for the uninitiated to understand. The underlying ideas, however, can be extracted and explained. That is what you find here.

Want to improve your understanding of what appears on the drop down menu as SVM? This is a great place to start!

PS: A third tutorial is due out soon.

Is prostitution really worth £5.7 billion a year? [Data Skepticism]

Filed under: Data Analysis,Skepticism — Patrick Durusau @ 3:40 pm

Is prostitution really worth £5.7 billion a year? by David Spiegelhalter.

From the post:

The EU has demanded rapid payment of £1.7 billion from the UK because our economy has done better than predicted, and some of this is due to the prostitution market now being considered as part of our National Accounts and contributing an extra £5.3 billion to GDP at 2009 prices, which is 0.35% of GDP, half that of agriculture. But is this a reasonable estimate?

This £5.3 billion figure was assessed by the Office of National Statistics in May 2014 based on the following assumptions, derived from this analysis. To quote the ONS:

  • Number of prostitutes in UK: 61,000
  • Average cost per visit: £67
  • Clients per prostitute per week: 25
  • Number of weeks worked per year: 52

Multiply these up and you get £5.3 billion at 2009 prices, around £5.7 billion now.
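
Checking the ONS multiplication myself: 61,000 prostitutes × 25 clients a week × 52 weeks ≈ 79.3 million visits a year; at £67 a visit that is about £5.31 billion, so the quoted assumptions do multiply out to the £5.3 billion figure.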

An excellent example of data skepticism. Taking commonly available data, David demonstrates that the “£5.7 billion a year” claim depends on 400,000 Englishmen visiting prostitutes every three (3) days. Existing data on the use of prostitutes suggests that figure is far too high.

There are other problems with the data. See David’s post for the details.

BTW, there was some quibbling about the price for prostitutes, as in being too low. Perhaps the authors of the original estimate were accustomed to government subsidized prostitutes. 😉

Should prostitution pricing come up in your data analysis, one source (not necessarily a reliable one) is Havocscope Prostitution Prices. The price for a UK street prostitute is listed in U.S. dollars at $20.00, even lower than the original estimate, which would dramatically increase the number of required visits, by roughly a factor of five (5).

On the trail of rootkits and other malware

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:55 am

Notes from SophosLabs: On the trail of rootkits and other malware by Paul Ducklin.

From the post:

When an interesting new piece of malware makes the news, the first questions people ask are usually, “How does it work? What does it do?”

In the old days, back when there were no more than a few hundred new viruses each year, almost all of them written in assembly language, we’d often start with a static, analytical approach by disassembling or decompiling the machine code itself.

Once we knew what sequence of operations the malware performed – for example, that it scanned through the directories on the C: drive and appended itself to every .COM file – we would then run the malware on a freshly-prepared computer and confirm our analysis using a dynamic, deductive approach.

But these days there are hundreds of thousands of new malware samples every day, written in a variety of programming languages, and delivered in a variety of ways.

The vast majority of the samples we get aren’t truly new, of course.

They’re unique only in the strictly technical sense that they consist of a sequence of bytes that we haven’t encountered before, in the same way that Good morning and GOOD MORNING are not literally the same.

Indeed, most of the new samples that show up each day are merely minor variants that we already detect, or known malware that has been encrypted or packaged differently.

Nevertheless, that still leaves plenty of samples worth looking at.

So, these days we usually start dynamically and deductively, using automated systems that run the malware in a controlled environment, instead of first trying to deconstruct each new sample by hand, like we did in the 1980s.

And that leaves us with the questions behind the questions that we asked at the start, namely, “How do you tell how it works? How do you keep track of what it does?”
….

As you can tell from my posts, Naked Security is on my regular reading list.

Malware is an area where collation of information on malware, weaknesses, and solutions would be more than helpful. When you are reportedly ten (10) years behind the opposition, merging information from a variety of sources could be a significant step towards catching up.

Paul’s post is a high level view of a process to answer the questions:

“How do you tell how it works? How do you keep track of what it does?”

Information that could be used to identify a particular bit of malware.

Not overly technical but deep enough to give you a sense of the technique.

Enjoy!

November 9, 2014

Clojure

Filed under: Clojure,Programming — Patrick Durusau @ 8:14 pm

Clojure by Harris Brakmic.

A very nice curated set of Clojure links.

I count thirty-three (33) links but they all appear to be high quality links.

Hosted at ZEEF.com, which has 1,170 pages of links by 730 curators.

But you can apparently start your own page, which results in another Clojure page, this one by Vlad Bokov.

Curated links aren’t a bad idea, but there doesn’t appear to be any annotation capability or way to avoid duplication of links.

Use this for the links but not as an example of an interface for curating links.

Almost Everything in “Dr. Strangelove” Was True

Filed under: Cybersecurity,Government,News,Reporting,Security — Patrick Durusau @ 8:00 pm

Almost Everything in “Dr. Strangelove” Was True by Eric Schlosser. (New Yorker, January 17, 2014)

From the post:

This month marks the fiftieth anniversary of Stanley Kubrick’s black comedy about nuclear weapons, “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” Released on January 29, 1964, the film caused a good deal of controversy. Its plot suggested that a mentally deranged American general could order a nuclear attack on the Soviet Union, without consulting the President. One reviewer described the film as “dangerous … an evil thing about an evil thing.” Another compared it to Soviet propaganda. Although “Strangelove” was clearly a farce, with the comedian Peter Sellers playing three roles, it was criticized for being implausible. An expert at the Institute for Strategic Studies called the events in the film “impossible on a dozen counts.” A former Deputy Secretary of Defense dismissed the idea that someone could authorize the use of a nuclear weapon without the President’s approval: “Nothing, in fact, could be further from the truth.” (See a compendium of clips from the film.) When “Fail-Safe”—a Hollywood thriller with a similar plot, directed by Sidney Lumet—opened, later that year, it was criticized in much the same way. “The incidents in ‘Fail-Safe’ are deliberate lies!” General Curtis LeMay, the Air Force chief of staff, said. “Nothing like that could happen.” The first casualty of every war is the truth—and the Cold War was no exception to that dictum. Half a century after Kubrick’s mad general, Jack D. Ripper, launched a nuclear strike on the Soviets to defend the purity of “our precious bodily fluids” from Communist subversion, we now know that American officers did indeed have the ability to start a Third World War on their own. And despite the introduction of rigorous safeguards in the years since then, the risk of an accidental or unauthorized nuclear detonation hasn’t been completely eliminated.

Grim reading for good password advocates when they learn that all the Minuteman launch sites shared a common launch code: 00000000.

The types of command and control issues discussed for nuclear weapons are the same issues that should be debated now for government surveillance. They aren’t being debated, I should say, because of the curtain of secrecy that surrounds government surveillance operations.

A curtain of secrecy that has the same justifications, “we are defending the public interest,” “there is an implacable foe at the ramparts,” etc.

The question you have to ask of many government offices and agencies isn’t whether they are lying but why.

The government of my youth lied, the government for every year thereafter has lied.

On what basis should I trust the government to not be lying today and in the future?

PS: How do you draft privacy controls with a known liar as the enforcing party?

Another Big Brother? [Dark Car Services]

Filed under: Cybersecurity,Privacy,Security — Patrick Durusau @ 7:30 pm

Lenders Can Now Disable Your Car When You’re Driving on the Freeway by Cliff Weathers.

From the post:


The New York Times recently reported that about 2 million cars are now outfitted with such kill switches in the U.S., which is about one-quarter of subprime car loans, and creditors are not shy when it comes to remotely disabling cars whose owners are behind on their payments:

“Some borrowers say their cars were disabled when they were only a few days behind on their payments, leaving them stranded in dangerous neighborhoods. Others said their cars were shut down while idling at stoplights. Some described how they could not take their children to school or to doctor’s appointments. One woman in Nevada said her car was shut down while she was driving on the freeway.

“Beyond the ability to disable a vehicle, the devices have tracking capabilities that allow lenders and others to know the movements of borrowers, a major concern for privacy advocates. And the warnings the devices emit — beeps that become more persistent as the due date for the loan payment approaches — are seen by some borrowers as more degrading than helpful.”

Subprime automotive-loan borrowers, those with FICO credit scores below 660, debt-to-income ratios of more than 50% or a bankruptcy in the past 60 months, are a growing segment of automotive borrowers. This phenomenon has been buoyed by auto dealers trying to continue a strong sales rebound after years of weak sales and by securities investors who buy bonds backed by those loans and see them as a way to get ample returns when other interest rates remain low.

Hacking automobiles isn’t a new idea. (Rootkit for an Automobile Near You) But building automobiles for remote control by others? Of course we all trust our well-meaning government with such powers (NOT!) but what do you do when the disabling device becomes as common as seat belts?

Not that I think you will be able to stop this trend, but you may want to start or invest in “dark car” services, that is, services that replace, remove, and/or disable systems that make your car hackable or subject to control by others.

Automobile privacy will become a luxury of the well-to-do, and selling privacy may be your ticket to joining that class.

PS: Here’s an idea for a Dark Hat conference contest: a car-hacking offense and defense competition on a car with all the usual features plus a kill switch.

November 8, 2014

Mazerunner – Update – Neo4J – GraphX

Filed under: Graphs,GraphX,Neo4j — Patrick Durusau @ 7:36 pm

Three new algorithms have been added to Mazerunner:

  • Triangle Count
  • Connected Components
  • Strongly Connected Components

From: Using Apache Spark and Neo4j for Big Data Graph Analytics

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

That’s good news!
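
For a sense of what those three algorithms involve, here is a minimal Scala sketch calling them directly through the GraphX API. This is not Mazerunner’s own code; the Spark setup, the HDFS path, and the iteration count are assumptions for illustration only.

// Minimal GraphX sketch: triangle count, connected components, and
// strongly connected components over an edge list stored in HDFS.
// The path "hdfs:///mazerunner/subgraph.edgelist" is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object SubgraphAnalytics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("subgraph-analytics"))

    // Load the exported subgraph; canonical orientation and repartitioning
    // are needed for triangle counting.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///mazerunner/subgraph.edgelist",
        canonicalOrientation = true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Triangle Count: number of triangles passing through each vertex.
    val triangles = graph.triangleCount().vertices

    // Connected Components: lowest vertex id in each weakly connected component.
    val components = graph.connectedComponents().vertices

    // Strongly Connected Components: labels computed over a bounded number of iterations.
    val strongComponents = graph.stronglyConnectedComponents(numIter = 10).vertices

    triangles.take(5).foreach(println)
    components.take(5).foreach(println)
    strongComponents.take(5).foreach(println)

    sc.stop()
  }
}

In Mazerunner itself the results presumably flow back into Neo4j as node properties rather than being printed, but the underlying GraphX calls are the same three shown here.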

I first saw this in a tweet by Kenny Bastani.

Big Data Driving Data Integration at the NIH

Filed under: BigData,Data Integration,Funding,NIH — Patrick Durusau @ 5:17 pm

Big Data Driving Data Integration at the NIH by David Linthicum.

From the post:

The National Institutes of Health announced new grants to develop big data technologies and strategies.

“The NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative and will support development of new software, tools and training to improve access to these data and the ability to make new discoveries using them, NIH said in its announcement of the funding.”

The grants will address issues around Big Data adoption, including:

  • Locating data and the appropriate software tools to access and analyze the information.
  • Lack of data standards, or low adoption of standards across the research community.
  • Insufficient polices to facilitate data sharing while protecting privacy.
  • Unwillingness to collaborate that limits the data’s usefulness in the research community.

Among the tasks funded is the creation of a “Perturbation Data Coordination and Integration Center.” The center will provide support for data science research that focuses on interpreting and integrating data from different data types and databases. In other words, it will make sure the data moves to where it should move, in order to provide access to information that’s needed by the research scientist. Fundamentally, it’s data integration practices and technologies.

This is very interesting from the standpoint that the movement into big data systems often drives a reevaluation of, or even new interest in, data integration. As the data becomes strategically important, the need to provide core integration services becomes even more important.

The NIH announcement, NIH invests almost $32 million to increase utility of biomedical research data, reads in part:

Wide-ranging National Institutes of Health grants announced today will develop new strategies to analyze and leverage the explosion of increasingly complex biomedical data sets, often referred to as Big Data. These NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

With the advent of transformative technologies for biomedical research, such as DNA sequencing and imaging, biomedical data generation is exceeding researchers’ ability to capitalize on the data. The BD2K awards will support the development of new approaches, software, tools, and training programs to improve access to these data and the ability to make new discoveries using them. Investigators hope to explore novel analytics to mine large amounts of data, while protecting privacy, for eventual application to improving human health. Examples include an improved ability to predict who is at increased risk for breast cancer, heart attack and other diseases and conditions, and better ways to treat and prevent them.

And of particular interest:

BD2K Data Discovery Index Coordination Consortium (DDICC). This program will create a consortium to begin a community-based development of a biomedical data discovery index that will enable discovery, access and citation of biomedical research data sets.

Big data driving data integration. Who knew? 😉

The more big data, the greater the pressure for robust data integration.

Sounds like they are playing the topic maps tune.

The Concert Programmer

Filed under: Lisp,Music,Scheme — Patrick Durusau @ 4:50 pm

From the description:

From OSCON 2014: Is it possible to imagine a future where “concert programmers” are as common a fixture in the world’s auditoriums as concert pianists? In this presentation Andrew will be live-coding the generative algorithms that will be producing the music that the audience will be listening to. As Andrew is typing he will also attempt to narrate the journey, discussing the various computational and musical choices made along the way. A must-see for anyone interested in creative computing.

This impressive demonstration is performed using Extempore.

From the GitHub page:

Extempore is a systems programming language designed to support the programming of real-time systems in real-time. Extempore promotes human orchestration as a meta model of real-time man-machine interaction in an increasingly distributed and environmentally aware computing context.

Extempore is designed to support a style of programming dubbed ‘cyberphysical’ programming. Cyberphysical programming supports the notion of a human programmer operating as an active agent in a real-time distributed network of environmentally aware systems. The programmer interacts with the distributed real-time system procedurally by modifying code on-the-fly. In order to achieve this level of on-the-fly interaction Extempore is designed from the ground up to support code hot-swapping across a distributed heterogeneous network, compiler as service, real-time task scheduling and a first class semantics for time.

Extempore is designed to mix the high-level expressiveness of Lisp with the low-level expressiveness of C. Extempore is a statically typed, type-inferencing language with strong temporal semantics and a flexible concurrency architecture in a completely hot-swappable runtime environment. Extempore makes extensive use of the LLVM project to provide back-end code generation across a variety of architectures.

For more detail on what the Extempore project is all about, see the Extempore philosophy.

For programmers only at this stage, but can you imagine the impact of “live searching,” where data structures and indexes arise from interaction with searchers? Definitely worth a long look!

I first saw this in a tweet by Alan Zucconi.

An Open Platform (MapBox)

Filed under: MapBox,Mapping,Maps — Patrick Durusau @ 2:04 pm

An Open Platform (Mapbox)

From the post:

When you hear the term web map, what comes to mind first? You might have thought of a road map – maps created to help you get from one place to another. However, there are many other types of maps that use the same mapping conventions.

Mapbox is built from open specifications to serve all types of maps, not just road maps. Open specifications solve specific problems so the solution is simple and direct.

This guide runs through all the open specifications Mapbox uses.

If you aren’t familiar with Mapbox, you need to correct that oversight.

There are Starter (free to start) and Basic ($5/month) plans, so it isn’t a burden to learn the basics.

Maps offer a familiar way to present information to users.

Terms of Service

Filed under: BigData,Cybersecurity,Privacy,Security,WWW — Patrick Durusau @ 11:53 am

Terms of Service: understanding our role in the world of Big Data by Michael Keller and Josh Neufeld.

Caution: Readers of Terms of Service will discover they are products and only incidentally consumers of digital services. Surprise, dismay, depression, and despair are common symptoms post-reading. You have been warned.

Al Jazeera uses a comic book format to effectively communicate privacy issues raised by Big Data, the Internet of Things, the Internet, and “free” services.

The story begins with privacy concerns over scanning of Gmail content (remember that?) and takes the reader up to present and likely future privacy concerns.

I quibble with the example of someone being denied a loan because they failed to exercise regularly. The authors innocently assume that banks make loans with the intention of being repaid. That’s the story in high school economics, but a long way from how lending works in practice.

The recent mortgage crisis in the United States was caused by banks inducing borrowers to overstate their incomes, financing a home loan and its down payment, etc. Banks don’t keep such loans but package them as securities, which they then foist off onto others. Construction companies make money building the houses, local governments gain tax revenue, etc. Basically, it’s a form of churn.

But the authors are right that in some theoretical economy loans could be denied because of failure to exercise. Except that would exclude such a large market segment in the United States. Did you know they are about to change the words “…the land of the free…” to “…the land of the obese…?”

That is a minor quibble about what is overall a great piece of work. In only forty-six (46) pages it brings privacy issues into a sharper focus than many longer and more turgid works.

Do you know of any comparable exposition on privacy and Big Data/Internet?

Suggest it for conference swag/holiday present. Write to Terms-of-Service.

I first saw this in a tweet by Gregory Piatetsky.
