Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 21, 2012

Graph Drawing Sept 19-21, 2012 | Redmond, Washington

Filed under: Conferences,Graphs — Patrick Durusau @ 6:52 pm

Graph Drawing Sept 19-21, 2012 | Redmond, Washington

Too late for a paper but posters are being accepted until: August 20, 2012.

From the webpage:

Graph Drawing is concerned with the visualization of graphs and networks and is motivated by those application domains where it is crucial to visually analyze and interact with relational datasets. Examples of such application domains include social sciences, Internet and Web computing, information systems, computational biology, networking, VLSI circuit design, and software engineering. Bridging the gap between theoretical advances and system implementations is therefore a key factor of Graph Drawing.

The International Symposium on Graph Drawing is the main annual event in this area. This year the conference celebrates its 20th anniversary. It will take place on September 19-21, 2012, and will be hosted by Microsoft Research in Redmond, Washington, USA. Researchers and practitioners working on any aspect of graph drawing are invited to contribute papers and posters, and participate in the graph drawing contest.

The range of topics that are within the scope of the International Symposium on Graph Drawing includes (but is not limited to):

  • Design and experimentation of graph drawing algorithms
  • Visualization of graphs and networks in application areas
  • Graph visualization and data mining
  • Geometric and topological graph theory
  • Optimization on graphs
  • Software systems for graph visualization
  • Interfaces for interacting with graphs
  • Cognitive studies on graph drawing readability and user interaction

Accepted papers and abstract of accepted posters will appear in the conference proceedings, published by Springer in the series Lecture Notes in Computer Science. Selected papers will be invited for submission to a special issue of the Journal of Graph Algorithms and Applications. Best paper awards for each of the two tracks will be given.

Apologies for not seeing this earlier. Will have it on my list for next year.

Just so you don’t miss it, a listing of all prior Graph Drawing conferences appears under Tradition.

The proceedings for 1992 and 1993 are available as PDF files. Proceedings from 1994 forward appear as Springer titles.

Announcing TokuDB v6.1

Filed under: Database,TokuDB — Patrick Durusau @ 4:56 pm

Announcing TokuDB v6.1

From the post:

TokuDB v6.1 is now generally available and can be downloaded here.

New features include:

  • Added support for MariaDB 5.5 (5.5.25)
    • The TokuDB storage engine is now available with all the additional functionality of MariaDB 5.5.
  • Added HCAD support to our MySQL 5.5 version (5.5.24)
    • Hot column addition/deletion was present in TokuDB v6.0 for MySQL 5.1 and MariaDB 5.2, but not in MySQL 5.5. This feature is now present in all MySQL and MariaDB versions of TokuDB.
  • Improved in-memory point query performance via lock/latch refinement
    • TokuDB has always been a great performer on range scans and workloads where the size of the working data set is significantly larger than RAM. TokuDB v6.0 improved the performance of in-memory point queries at low levels of concurrency. TokuDB v6.1 further increased the performance at all concurrency levels.
    • The following graph shows our sysbench.oltp.uniform performance on an in-memory data set (16 x 5 million row tables, server is 2 x Xeon 5520, 72GB RAM, Centos 5.8)

Go to the post to see impressive performance numbers.

I do wonder, when do performance numbers cease to be meaningful for the average business application?

Like a car that can go from 0 to 60 in under 3 seconds. (Yes, there is such a car, 2011 Bugatti.)

Nice to have, but where are you going to drive it?

As you can tell from this blog, I am all for the latest algorithms, software, hardware, but at the same time, the latest may not be the best for your application.

It may be that simpler, lower-performance solutions will not only be more appropriate but also more robust.

July 20, 2012

…10 billion lines of code…

Filed under: Open Data,Programming,Search Data,Search Engines — Patrick Durusau @ 5:46 pm

Also known as (aka):

Black Duck’s Ohloh lets data from nearly 500,000 open source projects into the wild by Chris Mayer.

From the post:

In a bumper announcement, Black Duck Software have embraced the FOSS mantra by revealing their equivalent of a repository Yellow Pages, through the Ohloh Open Data Initiative.

The website tracks 488,823 projects, allowing users to compare data from a vast amount of repositories and forges. But now, Ohloh’s huge dataset has been licensed under the Creative Commons Attribution 3.0 Unported license, encouraging further transparency across the companies who have already bought into Ohloh’s aggregation mission directive.

“Licensing Ohloh data under Creative Commons offers both enterprises and the open source community a new level of access to FOSS data, allowing trending, tracking, and insight for the open source community,” said Tim Yeaton, President and CEO of Black Duck Software.

He added: “We are constantly looking for ways to help the open source developer community and enterprise consumers of open source. We’re proud to freely license Ohloh data under this respected license, and believe that making this resource more accessible will allow contributors and consumers of open source gain unique insight, leading to more rapid development and adoption.”

What sort of insight would you expect to gain from “…10 billion lines of code…?”

How would you capture it? Pass it on to others in your project?

Mix or match semantics with other lines of code? Perhaps your own?

PyKnot: a PyMOL tool for the discovery and analysis of knots in proteins

Filed under: Bioinformatics,Graphics,Visualization — Patrick Durusau @ 4:25 pm

PyKnot: a PyMOL tool for the discovery and analysis of knots in proteins (Rhonald C. Lua, “PyKnot: a PyMOL tool for the discovery and analysis of knots in proteins,” Bioinformatics 2012, 28: 2069-2071.)

Abstract:

Summary: Understanding the differences between knotted and unknotted protein structures may offer insights into how proteins fold. To characterize the type of knot in a protein, we have developed PyKnot, a plugin that works seamlessly within the PyMOL molecular viewer and gives quick results including the knot’s invariants, crossing numbers and simplified knot projections and backbones. PyKnot may be useful to researchers interested in classifying knots in macromolecules and provides tools for students of biology and chemistry with which to learn topology and macromolecular visualization.

Availability: PyMOL is available at http://www.pymol.org. The PyKnot module and tutorial videos are available at http://youtu.be/p95aif6xqcM.

Contact: rhonald.lua@gmail.com

Apologies but this article is not open access.

You can still reach the PyMOL and PyKnot software and supporting documentation at the links above.

Learning how others use visualization can’t be a bad thing!
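
Just to give a feel for one ingredient of the analysis, here is a small Python sketch (mine, not PyKnot's code) that counts the crossings of a backbone polyline when projected onto the xy-plane. The sample "backbone" coordinates are made up. Actually classifying the knot takes much more (chain closure, simplification, invariants), which is exactly what the plugin packages up inside PyMOL.

    def segments_cross(p1, p2, p3, p4):
        """Do 2D segments (p1,p2) and (p3,p4) properly intersect?"""
        def orient(a, b, c):
            return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
        d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
        return d1 * d2 < 0 and d3 * d4 < 0

    def projected_crossings(backbone):
        """Count crossings of a 3D polyline projected onto the xy-plane."""
        pts = [(x, y) for x, y, z in backbone]
        n = len(pts) - 1                     # number of segments
        crossings = 0
        for i in range(n):
            for j in range(i + 2, n):        # non-adjacent segments only
                if segments_cross(pts[i], pts[i + 1], pts[j], pts[j + 1]):
                    crossings += 1
        return crossings

    # Made-up CA-like coordinates producing one crossing in projection.
    backbone = [(0, 0, 0), (4, 4, 1), (4, 0, 2), (0, 4, 3)]
    print(projected_crossings(backbone))     # -> 1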

Optimal simultaneous superpositioning of multiple structures with missing data

Filed under: Alignment,Bioinformatics,Multidimensional,Subject Identity,Superpositioning — Patrick Durusau @ 3:55 pm

Optimal simultaneous superpositioning of multiple structures with missing data (Douglas L. Theobald and Phillip A. Steindel, “Optimal simultaneous superpositioning of multiple structures with missing data,” Bioinformatics 2012, 28: 1972-1979.)

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]

I have deep reservations about representing semantics using Cartesian metrics, but in fact that happens quite frequently. And allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.

If you experiment with this technique, a report of your experiences would be appreciated.
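
If you want a concrete point of reference before reading the paper, here is a minimal Python/NumPy sketch (mine, not the authors' code) of ordinary least-squares superposition of two complete point sets via the Kabsch algorithm. The paper's contribution is handling missing points with expectation-maximization, which this sketch does not attempt; THESEUS 2.0 is where to go for that.

    import numpy as np

    def kabsch_superpose(P, Q):
        """Least-squares superposition of point set P onto Q (both N x 3, no missing points)."""
        # Center both point sets on their centroids.
        Pc = P - P.mean(axis=0)
        Qc = Q - Q.mean(axis=0)
        # Covariance matrix and its SVD.
        U, S, Vt = np.linalg.svd(Pc.T @ Qc)
        # Correct for a possible reflection so we return a proper rotation.
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        # Apply the rotation and report the root-mean-square deviation.
        P_aligned = Pc @ R.T
        rmsd = np.sqrt(((P_aligned - Qc) ** 2).sum() / len(P))
        return P_aligned + Q.mean(axis=0), rmsd

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(50, 3))                      # "target" structure
    theta = np.radians(30)
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
    P = Q @ Rz.T + np.array([5.0, -2.0, 1.0])         # rotated and translated copy
    _, rmsd = kabsch_superpose(P, Q)
    print(f"RMSD after superposition: {rmsd:.6f}")    # ~0 for a noise-free copy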

Software support for SBGN maps: SBGN-ML and LibSBGN

Filed under: Bioinformatics,Biomedical,Graphs,Hypergraphs — Patrick Durusau @ 3:30 pm

Software support for SBGN maps: SBGN-ML and LibSBGN (Martijn P. van Iersel, Alice C. Villéger, Tobias Czauderna, Sarah E. Boyd, Frank T. Bergmann, Augustin Luna, Emek Demir, Anatoly Sorokin, Ugur Dogrusoz, Yukiko Matsuoka, Akira Funahashi, Mirit I. Aladjem, Huaiyu Mi, Stuart L. Moodie, Hiroaki Kitano, Nicolas Le Novère, and Falk Schreiber, “Software support for SBGN maps: SBGN-ML and LibSBGN,” Bioinformatics 2012, 28: 2016-2021.)

Warning: Unless you really like mapping and markup languages this is likely to be a boring story. If you do (and I do), it is the sort of thing you will print out and enjoy reading. Just so you know.

Abstract:

Motivation: LibSBGN is a software library for reading, writing and manipulating Systems Biology Graphical Notation (SBGN) maps stored using the recently developed SBGN-ML file format. The library (available in C++ and Java) makes it easy for developers to add SBGN support to their tools, whereas the file format facilitates the exchange of maps between compatible software applications. The library also supports validation of maps, which simplifies the task of ensuring compliance with the detailed SBGN specifications. With this effort we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

Availability and implementation: Milestone 2 was released in December 2011. Source code, example files and binaries are freely available under the terms of either the LGPL v2.1+ or Apache v2.0 open source licenses from http://libsbgn.sourceforge.net.

Contact: sbgn-libsbgn@lists.sourceforge.net

I included the hyperlinks to standards and software for the introduction but not the article references. Those are of interest too but for the moment I only want to entice you to read the article in full. There is a lot of graph work going on in bioinformatics and we would all do well to be more aware of it.

The Systems Biology Graphical Notation (SBGN, Le Novère et al., 2009) facilitates the representation and exchange of complex biological knowledge in a concise and unambiguous manner: as standardized pathway maps. It has been developed and supported by a vibrant community of biologists, biochemists, software developers, bioinformaticians and pathway databases experts.

SBGN is described in detail in the online specifications (see http://sbgn.org/Documents/Specifications). Here we summarize its concepts only briefly. SBGN defines three orthogonal visual languages: Process Description (PD), Entity Relationship (ER) and Activity Flow (AF). SBGN maps must follow the visual vocabulary, syntax and layout rules of one of these languages. The choice of language depends on the type of pathway or process being depicted and the amount of available information. The PD language, which originates from Kitano’s Process Diagrams (Kitano et al., 2005) and the related CellDesigner tool (Funahashi et al., 2008), is equivalent to a bipartite graph (with a few exceptions) with one type of nodes representing pools of biological entities, and a second type of nodes representing biological processes such as biochemical reactions, transport, binding and degradation. Arcs represent consumption, production or control, and can only connect nodes of differing types. The PD language is very suitable for metabolic pathways, but struggles to concisely depict the combinatorial complexity of certain proteins with many phosphorylation states. The ER language, on the other hand, is inspired by Kohn’s Molecular Interaction Maps (Kohn et al., 2006), and describes relations between biomolecules. In ER, two entities can be linked with an interaction arc. The outcome of an interaction (for example, a protein complex), is considered an entity in itself, represented by a black dot, which can engage in further interactions. Thus ER represents dependencies between interactions, or putting it differently, it can represent which interaction is necessary for another one to take place. Interactions are possible between two or more entities, which make ER maps roughly equivalent to a hypergraph in which an arc can connect more than two nodes. ER is more concise than PD when it comes to representing protein modifications and protein interactions, although it is less capable when it comes to presenting biochemical reactions. Finally, the third language in the SBGN family is AF, which represents the activities of biomolecules at a higher conceptual level. AF is suitable to represent the flow of causality between biomolecules even when detailed knowledge on biological processes is missing.

Efficient integration of the SBGN standard into the research cycle requires adoption by visualization and modeling software. Encouragingly, a growing number of pathway tools (see http://sbgn.org/SBGN_Software) offer some form of SBGN compatibility. However, current software implementations of SBGN are often incomplete and sometimes incorrect. This is not surprising: as SBGN covers a broad spectrum of biological phenomena, complete and accurate implementation of the full SBGN specifications represents a complex, error-prone and time-consuming task for individual tool developers. This development step could be simplified, and redundant implementation efforts avoided, by accurately translating the full SBGN specifications into a single software library, available freely for any tool developer to reuse in their own project. Moreover, the maps produced by any given tool usually cannot be reused in another tool, because SBGN only defines how biological information should be visualized, but not how the maps should be stored electronically. Related community standards for exchanging pathway knowledge, namely BioPAX (Demir et al., 2010) and SBML (Hucka et al., 2003), have proved insufficient for this role (more on this topic in Section 4). Therefore, we observed a second need, for a dedicated, standardized SBGN file format.

Following these observations, we started a community effort with two goals: to encourage the adoption of SBGN by facilitating its implementation in pathway tools, and to increase interoperability between SBGN-compatible software. This has resulted in a file format called SBGN-ML and a software library called LibSBGN. Each of these two components will be explained separately in the next sections.

Of course, there is always the data prior to this markup and the data that comes afterwards, so you could say I see a role for topic maps. 😉
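
As a rough illustration of why a standard file format matters, here is a short Python sketch that walks an SBGN-ML file with the standard library XML parser and tallies glyphs and arcs. I am going only by the element and attribute names described above (glyph, arc, class, source, target), so treat it as a sketch; LibSBGN's own C++ and Java bindings, with their validation support, are what you would actually use.

    import sys
    import xml.etree.ElementTree as ET
    from collections import Counter

    def local_name(tag):
        # Strip the XML namespace, e.g. '{http://...}glyph' -> 'glyph'.
        return tag.rsplit('}', 1)[-1]

    def summarize_sbgn_ml(path):
        glyph_classes = Counter()
        arcs = []
        for elem in ET.parse(path).getroot().iter():
            name = local_name(elem.tag)
            if name == "glyph":
                glyph_classes[elem.get("class", "unknown")] += 1
            elif name == "arc":
                # Arcs reference glyph (or port) ids via source/target attributes.
                arcs.append((elem.get("class"), elem.get("source"), elem.get("target")))
        return glyph_classes, arcs

    if __name__ == "__main__":
        classes, arcs = summarize_sbgn_ml(sys.argv[1])   # e.g. an example map from libsbgn.sourceforge.net
        for cls, count in classes.most_common():
            print(f"{count:4d} glyph(s) of class {cls}")
        print(f"{len(arcs)} arc(s) connecting them")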

Technology-Assisted Review Boosted in TREC 2011 Results

Filed under: Document Classification,Legal Informatics,Searching — Patrick Durusau @ 2:48 pm

Technology-Assisted Review Boosted in TREC 2011 Results by Evan Koblentz.

From the post:

TREC Legal Track, an annual government-sponsored project for evaluating document review methods, on Friday released its 2011 results containing a virtual vote of confidence for technology-assisted review.

“[T]he results show that the technology-assisted review efforts of several participants achieve recall scores that are about as high as might reasonably be measured using current evaluation methodologies. These efforts require human review of only a fraction of the entire collection, with the consequence that they are far more cost-effective than manual review,” the report states.

The term “technology-assisted review” refers to “any semi-automated process in which a human codes documents as relevant or not, and the system uses that information to code or prioritize further documents,” said TREC co-leader Gordon Cormack, of the University of Waterloo. Its meaning is far wider than just the software method known as predictive coding, he noted.

As such, “There is still plenty of room for improvement in the efficiency and effectiveness of technology-assisted review efforts, and, in particular, the accuracy of intra-review recall estimation tools, so as to support a reasonable decision that ‘enough is enough’ and to declare the review complete. Commensurate with improvements in review efficiency and effectiveness is the need for improved external evaluation methodologies,” the report states.

Good snapshot of current results, plus fertile data sets for testing alternative methodologies.

The report mentions that the 100 GB data set size was a problem for some participants. (Overview of the TREC 2011 Legal Track, page 2)

Suggestion: Post the 2013 data set as a public data set to AWS. It would be available to everyone, and participants without local clusters could fire up capacity on demand. That is a more realistic scenario than local data processing.

Perhaps an informal survey of the amortized cost of processing by different methods (cloud, local cluster) would be of interest to the legal community.

I can hear the claims of “security, security” from here. The question to ask is: What disclosed premium is your client willing to pay for security on data you are going to give to the other side if it is responsive and non-privileged? 25%? 50%? 125% or more?

BTW, looking forward to the 2013 competition. Particularly if it gets posted to the AWS or similar cloud.

Let me know if you are interested in forming an ad hoc team or investigating the potential for an ad hoc team.

[It] knows if you’ve been bad or good so be good for [your own sake]

Filed under: Marketing,Microsoft — Patrick Durusau @ 1:49 pm

I had to re-write a line from “Santa Claus Is Coming to Town” just a bit to fit the story about SkyDrive I read today: Watch what you store on SkyDrive–you may lose your Microsoft life.

I don’t find the terms of service surprising. Everybody has to say that sort of thing to avoid liability in case you store, transfer, etc., something illegal using their service.

The rules used to require notice and a refusal to remove content before you had any liability.

Has that changed?

Curious for a number of reasons, not the least of which is providing topic map data products and topic map appliances online.

Data Jujitsu: The art of turning data into product

Filed under: Data,Marketing,Topic Maps — Patrick Durusau @ 11:00 am

Data Jujitsu: The art of turning data into product: Smart data scientists can make big problems small by DJ Patil.

From the post:

Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day.

How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution, and instead, concentrates building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.

We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”

How do we apply this idea to data? What is a data problem’s “weight,” and how do we use that weight against itself? These are the questions that we’ll work through in the subsequent sections.

To start, for me, a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem. That’s what we’ve been trained to do; it’s why we got into this game in the first place. But in my experience, meeting the problem head-on is a recipe for disaster. Building a great data product is extremely challenging, and the problem will always become more complex, perhaps intractable, as you try to solve it.

Before investing in a big effort, you need to answer one simple question: Does anyone want or need your product? If no one wants the product, all the analytical work you throw at it will be wasted. So, start with something simple that lets you determine whether there are any customers. To do that, you’ll have to take some clever shortcuts to get your product off the ground. Sometimes, these shortcuts will survive into the finished version because they represent some fundamentally good ideas that you might not have seen otherwise; sometimes, they’ll be replaced by more complex analytic techniques. In any case, the fundamental idea is that you shouldn’t solve the whole problem at once. Solve a simple piece that shows you whether there’s an interest. It doesn’t have to be a great solution; it just has to be good enough to let you know whether it’s worth going further (e.g., a minimum viable product).

Here’s the question to ask for an open source topic map project:

Does anyone want or need your product?

Ouch!

A few of us, not enough to make even a small market, like to have topic maps as interesting computational artifacts.

For a more viable (read larger) market, we need to sell data products topic maps can deliver.

How we create or deliver that product (hypergraphs, elves chained to desks, quantum computers or even magic) doesn’t matter to any sane end user.

What matters is the utility of the data product for some particular need or task.

No, I don’t know what data product to suggest. If I did, it would have been the first thing I would have said.

Suggestions?

PS: Read DJ’s post in full. Every other day or so until you have a successful, topic map based, data product.

Open Source at Netflix [Open Source Topic Maps Are….?]

Filed under: Open Source,Topic Maps,Wikipedia — Patrick Durusau @ 10:34 am

Open Source at Netflix by Ruslan Meshenberg.

A great plug for open source (among others):

Improved code and documentation quality – we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.

A question as much to myself as anyone: Where are the open source topic maps?

There have been public dump sites for topic maps but have you seen an active community maintaining a public topic map?

Is it a technology/interface issue?

A control/authorship issue?

Something else?

Wikipedia works, although unevenly. And there are a number of other similar efforts that are more or less successful.

Suggestions on what sets them apart?

Or suggestions you think should be tried? It isn’t possible to anticipate success. If the opposite were true, we would all be very successful. (Or at least that’s what I would wish for, your mileage may vary.)

Take it as given that any effort at a public topic map tool, a public topic map community or even a particular public topic map, or some combination thereof, is likely to fail.

But we simply have to dust ourselves off and try another subject, or another combination of those things, or something else entirely.

The Art of Social Media Analysis with Twitter and Python

Filed under: Python,Social Graphs,Social Media,Tweets — Patrick Durusau @ 4:59 am

The Art of Social Media Analysis with Twitter and Python by Krishna Sankar.

All that social media data in your topic map has to come from somewhere. 😉

Covers both the basics of the Twitter API and social graph analysis. With code of course.

I first saw this at KDNuggets.
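
For flavor, here is a minimal sketch of the graph-analysis half (mine, not Krishna's code): given follower edges you have already pulled from the Twitter API with whatever library you prefer, NetworkX handles the social graph measures. The edge list below is made up for illustration.

    import networkx as nx

    # Hypothetical "who follows whom" edges, as you might collect them
    # from the Twitter API (follower -> followed).
    edges = [
        ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
        ("dave", "carol"), ("carol", "alice"), ("eve", "alice"),
    ]
    G = nx.DiGraph(edges)

    # In-degree: a crude measure of influence (how many follow you).
    for user, score in sorted(G.in_degree(), key=lambda kv: kv[1], reverse=True):
        print(f"{user:6s} followers in sample: {score}")

    # PageRank over the follow graph: a slightly less crude measure.
    for user, score in sorted(nx.pagerank(G).items(), key=lambda kv: kv[1], reverse=True):
        print(f"{user:6s} pagerank: {score:.3f}")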

Cloudera Manager 4.0.3 Released!

Filed under: Cloud Computing,Cloudera — Patrick Durusau @ 4:39 am

Cloudera Manager 4.0.3 Released! by Bala Venkatrao.

From the post:

We are pleased to announce the availability of Cloudera Manager 4.0.3. This is an enhancement release, with several improvements to configurability and usability. Some key enhancements include:

  • Configurable user/group settings for Oozie, HBase, YARN, MapReduce, and HDFS processes.
  • Support new configuration parameters for MapReduce services.
  • Auto configuration of reserved space for non-DFS use parameter for HDFS service.
  • Improved cluster upgrade process.
  • Support for LDAP users/groups that belong to more than one Organization Unit (OU).
  • Flexibility with distribution of key tabs when using existing Kerberos infrastructure (e.g. Active Directory).

Detailed release notes available at:

https://ccp.cloudera.com/display/ENT4DOC/New+Features+in+Cloudera+Manager+4.0

Cloudera Manager 4.0.3 is available to download from:

https://ccp.cloudera.com/display/SUPPORT/Downloads

Something for the weekend!

July 19, 2012

Following Even More of the Money

Filed under: Data,Politics — Patrick Durusau @ 3:27 pm

Following Even More of the Money By Derek Willis.

From the post:

Since we last rolled out new features in the Campaign Finance API, news organizations such as ProPublica and Mother Jones have used them to build interactive features about presidential campaigns, Super PACs and their funders. As the November election approaches, we’re announcing some additions and improvements to the API. We hope these enhancements will help others create web applications and graphics that help explain the connections between money and elections. This round of updates does not include any deprecations or backwards-incompatible changes, which is why we’re not changing the version number.

Welcome news from the NY Times on campaign finance data.

I can’t say that I follow their logic on version numbering but they are a news organization, not a software development house. 😉

More of Microsoft’s App Development Tools Goes Open Source

Filed under: Microsoft,Open Source — Patrick Durusau @ 2:38 pm

More of Microsoft’s App Development Tools Goes Open Source by Gianugo Rabellino.

From the post:

Today marks a milestone since we launched Microsoft Open Technologies, Inc. (MS Open Tech) as we undertake some important open source projects. We’re excited to share the news that MS Open Tech will be open sourcing the Entity Framework (EF), a database mapping tool useful for application development in the .NET Framework. EF will join the other open source components of Microsoft’s dev tools – MVC, Web API, and Web Pages with Razor Syntax – on CodePlex to help increase the development transparency of this project.

MS Open Tech will serve as an accelerator for these projects by working with the open source communities through our new MS Open Tech CodePlex landing page. Together, we will help build out its source code until shipment of the next product version.

This will enable everyone in the community to monitor and provide feedback on code check-ins, bug-fixes, new feature development, and build and test the products on a daily basis using the most up-to-date version of the source code.

The newly opened EF will, for the first time, allow developers outside Microsoft to submit patches and code contributions that the MS Open Tech development team will review for potential inclusion in the products.

We need more MS “native” topic map engines and applications.

Or topic map capabilities in the core of MS Office™.

Lots of people could start writing topic maps.

Which would be a good thing. A lot of people write documents using MS Word™, but they still reach for professional typesetters for publication.

The same will be true for topic maps.

Advanced Data Visualization Makes BI Stand Out

Filed under: Graphics,Visualization — Patrick Durusau @ 2:05 pm

Advanced Data Visualization Makes BI Stand Out

From the post:

As one of the industry-renowned data visualization experts Edward Tufte once said, “The world is complex, dynamic, multidimensional; the paper is static, flat. How are we to represent the rich visual world of experience and measurement on mere flatland?”

Indeed, there’s just too much information out there for all categories of knowledge workers to visualize it effectively. More often than not, traditional reports using tabs, rows, and columns do not paint the whole picture or, even worse, lead an analyst to a wrong conclusion. Firms need to use data visualization because information workers:

  • Cannot see a pattern without data visualization. …
  • Cannot fit all of the necessary data points onto a single screen. …
  • Cannot effectively show deep and broad data sets on a single screen. …

I’m not a Forrester client so if you are, share the details of the report among yourselves. (Only 52 downloads so far. Maybe Forrester analysts are emailing it to each other.)

It is interesting that the last technical capability mentioned in the blog post was:

What are the ADV platform’s integration capabilities

Data integration keeps coming up.

It would almost make you think all those data governance, one-platform, one-ontology-to-replace-them-all, master data model efforts had not succeeded.

Almost. 😉

World’s Most Accurate Pie Chart

Filed under: Graphics,Humor,Visualization — Patrick Durusau @ 1:17 pm

World’s Most Accurate Pie Chart

🙂

OK, I had to pass that one along but it has an important message:

Even with a picture, say what you have to say, then stop.

GraphLab 2.1 [New Release]

Filed under: GraphLab,Graphs,Machine Learning,Networks — Patrick Durusau @ 10:43 am

GraphLab 2.1

A new release (July 10, 2012) of GraphLab!

From the webpage:

Overview

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

The new GraphLab 2.1 features:

  • a new GraphLab 2 abstraction
  • Fully Distributed with HDFS integration
  • New toolkits
    • Collaborative Filtering
    • Clustering
    • Text Modeling
    • Computer Vision
  • Improved Build system and documentation

Go to http://graphlab.org for details.

If you want to get started with GraphLab today, download the source or clone us on Google Code. We recommend cloning to get the latest features and bug fixes.

hg clone https://code.google.com/p/graphlabapi/

If you don't have mercurial (hg) you can get it from http://mercurial.selenic.com/.

I almost didn’t find the download link. It is just larger than anything else on the page, in white letters on a black background, at: http://www.graphlab.org. I kept looking for a drop-down menu item, etc.

Shows even the clearest presentation can be missed by a user. 😉

Now to get this puppy running on my local box.

GraphLab Workshop Presentations

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 10:16 am

GraphLab Workshop Presentations

Just in case you missed the GraphLab workshop, most of the presentations are now available online, including an introduction to GraphLab 2.1!

Very much worth your time to review!

OK, just a sample:

GraphLab Version 2 Overview (60 mins) (GraphLab Keynote Slides) by Carlos Guestrin

Large scale ML challenges by Ted Willke, Intel Labs

See the agenda for more of same.

Web Scale with a Laptop? [GraphChi]

Filed under: GraphChi,Graphs,Networks — Patrick Durusau @ 9:54 am

GraphChi promises in part:

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop.

Well, that certainly draws a line in the sand, doesn’t it?

A bit more from the introduction:

GraphChi is a spin-off of the GraphLab ( http://www.graphlab.org ) -project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.

GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.

In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.

Even if you do require the processing power of high-performance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For high-performance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.

GraphChi is implemented in plain C++, and available as open-source under the flexible Apache License 2.0.

Java version

Java-version of GraphChi: http://code.google.com/p/graphchi-java

The performance numbers are impressive.

Not sure I would want to run production code on a laptop in any case but performance should be enough for on-the-road experiments.

Good documentation and examples that should ease you into experimenting with GraphChi.

I first saw this at High Scalability.
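
If the vertex-centric model is new to you, here is a toy Python sketch of the idea: PageRank written as a per-vertex update that reads whatever its in-neighbors last wrote, which is also the asynchronous flavor GraphChi describes. It only illustrates the programming style; GraphChi itself is C++ (plus the Java port above) and earns its performance from how it schedules these updates off disk.

    # Toy vertex-centric PageRank on a tiny made-up graph.
    edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
    num_vertices = 4

    out_degree = [0] * num_vertices
    in_edges = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        out_degree[src] += 1
        in_edges[dst].append(src)

    rank = [1.0 / num_vertices] * num_vertices
    damping = 0.85

    def update(v):
        # The per-vertex program: gather from in-neighbors, apply, write back.
        incoming = sum(rank[u] / out_degree[u] for u in in_edges[v] if out_degree[u])
        rank[v] = (1.0 - damping) / num_vertices + damping * incoming

    for _ in range(30):                 # a real engine schedules this and decides termination
        for v in range(num_vertices):
            update(v)                   # updates are visible to later vertices immediately

    print([round(r, 3) for r in rank])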

Analyzing 20,000 Comments

Filed under: Analytics,Data Mining — Patrick Durusau @ 7:34 am

Analyzing 20,000 Comments

First, congratulations on Chandoo.org reaching its 20,000th comment!

Second, the post does not release the data (email addresses, etc.) so it also doesn’t include the code.

Thinking of this as an exercise in analytics, which of the measures applied should lead to changes in behavior?

After all, we don’t mine data simply because we can.

What goals would you suggest and how would we measure meeting them based on the analysis described here?

Biological Dark Matter [Intellectual Dark Matter?]

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 6:05 am

Biological Dark Matter

Nathan Wolfe answers a child’s question of “what is left to explore?” with an exposition on how little we know about the most abundant life form of all, the virus.

Opportunities abound for data mining and mapping the results of data mining on viruses.

Protection against the next pandemic is vitally important but I would have answered differently.

In addition to viruses, advances have been made in data structures, graph algorithms, materials science, digital chip design, programming languages, astronomy, just to name a few areas where substantial progress has been made and more is anticipated.

Those just happen to be areas of interest to me. I am sure you could create even longer lists of areas of interest to you where substantial progress has been made.

We need to convey a sense of excitement and discovery in all areas of the sciences and humanities.

Perhaps we should call it: Intellectual Dark Matter? (another name for the unknown?)

World Leaders Comment on Attack in Bulgaria

Filed under: Data Mining,Intelligence,Social Media — Patrick Durusau @ 4:53 am

World Leaders Comment on Attack in Bulgaria

From the post:

Following the terror attack in Bulgaria killing a number of Israeli tourists on an airport bus, we can see the statements from world leaders around the globe including Israel Prime Minister Benjamin Netanyahu openly pinning the blame on Iran and threatening retaliation

If you haven’t seen one of the visualizations by Recorded Future you will be impressed by this one. Mousing over people and locations invokes what we would call scoping in a topic map context and limits the number of connections you see. And each node can lead to additional information.

While this works like a topic map, I can’t say it is a topic map application because how it works isn’t disclosed. You can read How Recorded Future Works, but you won’t be any better informed than before you read it.

Impressive work, but it isn’t clear how I would integrate their matching of sources with, say, an internal mapping of sources. Or how I would augment their mapping with additional mappings by internal subject experts?

Or how I would map this incident to prior incidents which led to disproportionate responses?

Or map “terrorist” attacks by the world leaders now decrying other “terrorist” attacks?

That last mapping could be an interesting one for the application of the term “terrorist.” My anecdotal experience is that it depends on the sponsor.

Would be interesting to know if systematic analysis supports that observation.

Perhaps the news media could then evenly identify the probable sponsors of “terrorists” attacks.

July 18, 2012

Three.js: render real world terrain from heightmap using open data

Filed under: Mapping,Maps,Three.js,Visualization — Patrick Durusau @ 7:11 pm

Three.js: render real world terrain from heightmap using open data by Jos Dirksen.

From the post:

Three.js is a great library for creating 3D objects and animations. In a couple of previous articles I explored this library a bit and in one of those examples I showed you how you can take GIS information (in geoJSON) format and use D3.js and three.js to convert it to a 3D mesh you can render in the browser using javascript. This is great for infographic, but it doesn’t really show a real map, a real terrain. Three.js, luckily also has helper classes to render a terrain as you can see in this demo: http://mrdoob.github.com/three.js/examples/webgl_terrain_dynamic.html

This demo uses a noise generator to generate a random terrain, and adds a whole lot of extra functionality, but we can use this concept to also render maps of real terrain. In this article I’ll show you how you can use freely available open geo data containing elevation info to render a simple 3D terrain using three.js. In this example we’ll use elevation data that visualizes the data for the island of Corsica.

Rendering real world terrain, supplemented by a topic map for annotation, sounds quite interesting.

Assuming you could render any real world terrain, what would it be? For what purpose? What annotations would you supply?
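
The post's code is JavaScript/three.js; as a language-neutral look at the core step, here is a short Python sketch that turns a heightmap grid into mesh vertices and triangle faces, which is essentially what the terrain helper does before handing geometry to the renderer. The tiny heightmap is made up; real elevation values (like the open data for Corsica in the post) would be loaded from file.

    def heightmap_to_mesh(heights, cell_size=1.0, z_scale=1.0):
        """Convert a rows x cols grid of elevations into (vertices, faces)."""
        rows, cols = len(heights), len(heights[0])
        vertices = [(c * cell_size, r * cell_size, heights[r][c] * z_scale)
                    for r in range(rows) for c in range(cols)]
        faces = []
        for r in range(rows - 1):
            for c in range(cols - 1):
                i = r * cols + c                          # top-left corner of this grid cell
                faces.append((i, i + 1, i + cols))        # two triangles per cell
                faces.append((i + 1, i + cols + 1, i + cols))
        return vertices, faces

    # A made-up 4 x 4 heightmap; real data would come from an elevation file.
    heights = [
        [0, 1, 2, 1],
        [1, 3, 4, 2],
        [1, 2, 3, 1],
        [0, 1, 1, 0],
    ]
    vertices, faces = heightmap_to_mesh(heights, cell_size=10.0, z_scale=5.0)
    print(len(vertices), "vertices,", len(faces), "triangles")   # 16 vertices, 18 triangles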

Data Mining Projects (Ping Chen)

Filed under: Data,Data Mining — Patrick Durusau @ 6:59 pm

Data Mining Projects

From the webpage:

This is the website for the Data Mining CS 4319 class projects. Here you will find all of the information and data files you will need to begin working on the project you have selected for this semester. Please click on the link on the left hand side corresponding to your project to begin. Development of the projects hosted in this website is funded by NSF Award DUE 0737408.

Projects with resources and files are:

  • Netflix
  • Word Relevance Measures
  • Identify Time
  • Orbital Debris Analysis
  • Oil Exploration
  • Environmental Data Analysis
  • Association Rule Pre-Processing
  • Neural Network-Based Financial Market Forecasting
  • Identify Locations From a Webpage
  • Co-reference Resolution
  • Email Visualization

Now there is a broad selection of data mining projects!

BTW, be careful of the general Netflix file. It is 665 MB so don’t attempt it on airport WiFi.

I first saw this at KDNuggets.

PS: I can’t swear to the dates of the class but the grant ran from 2008 to 2010.

Building a Simple BI Solution in Excel 2013 (Part 1 & 2)

Filed under: Business Intelligence,Excel — Patrick Durusau @ 6:39 pm

Chris Webb writes up a quick BI solution in Excel 2013:

Building a Simple BI Solution in Excel 2013, Part 1

and

Building a Simple BI Solution in Excel 2013, Part 2

In the process Chris uncovers some bugs and disappointments, but on the whole the application works.

I mention it for a couple of reasons.

If you recall, something like 75% of the BI market is held by Excel. I don’t expect that to change any time soon.

What do you think happens when “self-service” BI applications are created by users? Other than becoming the default applications for offices and groups in organizations?

Will different users make different choices with their Excel BI applications?

Will users with different Excel BI applications resort to knives, if not guns, to avoid changing their Excel BI applications?

Excel in its many versions leads to varying and inconsistent “self-service” applications in 75% of the BI marketplace.

Is it just me or does that sound like an opportunity for topic maps to you?

Data mining for network security and intrusion detection

Filed under: Intrusion Detection,Network Security,Security — Patrick Durusau @ 5:06 pm

Data mining for network security and intrusion detection by Dzidorius Martinaitis.

One of my favourite stories about network security/intrusion was in a Netware class. The instructor related that in a security “audit” of a not-small firm, it was discovered that the Novell servers were sitting in a room to which everyone, including the cleaning crew, had access.

Guess they never heard of physical security or Linux boot disks.

Assuming you have taken care of the obvious security risks, topic maps might be useful in managing the results of data mining.

From the post:

In preparation for “Haxogreen” hackers summer camp which takes place in Luxembourg, I was exploring network security world. My motivation was to find out how data mining is applicable to network security and intrusion detection.

Flame virus, Stuxnet, Duqu proved that static, signature based security systems are not able to detect very advanced, government sponsored threats. Nevertheless, signature based defense systems are mainstream today – think of antivirus, intrusion detection systems. What do you do when unknown is unknown? Data mining comes to mind as the answer.

There are following areas where data mining is or can be employed: misuse/signature detection, anomaly detection, scan detection, etc.

Misuse/signature detection systems are based on supervised learning. During learning phase, labeled examples of network packets or systems calls are provided, from which algorithm can learn about the threats. This is very efficient and fast way to find know threats. Nevertheless there are some important drawbacks, namely false positives, novel attacks and complication of obtaining initial data for training of the system.

The false positives happens, when normal network flow or system calls are marked as a threat. For example, an user can fail to provide the correct password for three times in a row or start using the service which is deviation from the standard profile. Novel attack can be define as an attack not seen by the system, meaning that signature or the pattern of such attack is not learned and the system will be penetrated without the knowledge of the administrator. The latter obstacle (training dataset) can be overcome by collecting the data over time or relaying on public data, such as DARPA Intrusion Detection Data Set.

Although misuse detection can be built on your own data mining techniques, I would suggest well known product like Snort which relays on crowd-sourcing.
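
To make the anomaly-detection idea from the quoted post concrete, here is about the simplest possible sketch: model “normal” traffic with per-feature means and variances, then flag connections whose standardized distance is large. Real systems use far richer features and models; the feature values below are invented for illustration.

    import numpy as np

    # Rows: connections; columns: hypothetical features, e.g.
    # [bytes sent, bytes received, duration (s), distinct ports touched]
    normal = np.array([
        [500, 1200, 1.2, 1],
        [450, 1100, 0.9, 1],
        [520, 1300, 1.5, 2],
        [480, 1250, 1.1, 1],
        [510, 1150, 1.3, 1],
    ], dtype=float)

    mu = normal.mean(axis=0)
    sigma = normal.std(axis=0) + 1e-9          # avoid division by zero

    def anomaly_score(x):
        # Mean absolute z-score across features; higher = more unusual.
        return np.abs((x - mu) / sigma).mean()

    new_connections = np.array([
        [490, 1180, 1.0, 1],                   # looks like the baseline
        [50000, 200, 30.0, 120],               # looks like a scan or exfiltration
    ], dtype=float)

    for row in new_connections:
        flag = "ANOMALY" if anomaly_score(row) > 3.0 else "ok"
        print(f"score={anomaly_score(row):10.2f}  {flag}")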

Taking Snort as an example, what other system data would you want to merge with data from Snort?

Or for that matter, how would you share such information (Snort+) with others?

PS: Be aware that cyber-attack/security/warfare are hot topics and therefore marketing opportunities.

2013 FOSE Call for Presentations

Filed under: Conferences,Government,Software — Patrick Durusau @ 3:55 pm

2013 FOSE Call for Presentations

From the webpage:

The FOSE Team welcomes presentation proposals that provide meaningful, actionable insights about technology development for government IT decision makers. We are looking for presentations that detail use-case studies, lessons learned, or emerging trends that improve operational efficiency and ignite innovation within and across government agencies. We are also specifically seeking Local, Federal and State Government Employees with stories to tell about their IT experiences and successes.

It’s a vendor show so prepare accordingly.

Lots of swag, hire booth help at the local modeling agency, etc.

You can’t make a sale if you don’t get their attention.

Deadline for submissions: September 14, 2012.

Topic map based solutions should make a good showing against traditional ETL (Extra Tax and Labor) solutions.

No charge for using that expansion of ETL (it probably isn’t even original, but if not, I don’t remember the source).

Computing for Data Analysis (Coursera course – R)

Filed under: Data Analysis,R — Patrick Durusau @ 11:21 am

Computing for Data Analysis by Roger D. Peng.

Description:

In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.

Readings:

The volume by Chambers, at 500 or so pages, looks comprehensive enough to be sufficient for the course.

Next Session: 24 September 2012 (4 weeks)
Workload: 3-5 hours per week

U.S. Senate vs. Apache Accumulo: Whose side are you on?

Filed under: Accumulo,Cassandra,HBase — Patrick Durusau @ 10:47 am

Jack Park sent a link to NSA Mimics Google, Pisses Off Senate this morning. If you are unfamiliar with the software, see: Apache Accumulo.

Long story made short:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

At issue is:

The bill indicates that Accumulo may violate OMB Circular A-130, a government policy that bars agencies from building software if it’s less expensive to use commercial software that’s already available. And according to one congressional staffer who worked on the bill, this is indeed the case. He asked that his name not be used in this story, as he’s not authorized to speak with the press.

On its face, OMB Circular A-130 sounds like a good idea. Don’t build your own if it is cheaper to buy commercial.

But here the Senate is trying to play favorites.

I have a suggestion: Let’s disappoint them.

Let’s contribute to all three projects:

Apache Accumulo

Apache Cassandra

Apache HBase

Would you look at that! All three of these projects are based at Apache!

Let’s push all three projects forward in terms of working on releases, documentation, testing, etc.

But more than that, let’s build applications based on all three projects that analyze political contributions, patterns of voting, land transfers, stock purchases, virtually every fact that can be known about members of the Senate and the Senate Armed Services Committee in particular.

They are accustomed to living in a gold fish bowl.

Let’s move them into a frying pan.

PS: +1 if the NSA is ordered to contribute to open source projects, if the projects are interested. Direction from the U.S. Senate is not part of the Apache governance model.

July 17, 2012

Making maps, part 1: Less interactivity

Filed under: Mapping,Maps — Patrick Durusau @ 6:37 pm

Making maps, part 1: Less interactivity

A six part series on making maps from the Chicago Tribune that has this gem in the first post:

Back to the beer-fueled map talk… so, how can we do this better? The answer quickly became obvious: borrow from paper. What’s great about paper maps?

  • Paper maps are BIG
  • Paper maps are high resolution (measured by DPI *and* information-density)
  • Paper maps are general at a distance and specific up close

What if most things on your page design didn’t jump, spin or flop on mouse-over?

Could you still deliver your content effectively?

Or have you mistaken interactivity for being effective?

On the other hand, are paper maps non-interactive?

I ask because I saw a book this past weekend that had no moving parts, popups, etc., but reading it you would swear it was interactive.

More on that in a future post.

I first saw this at PeteSearch.
