Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 20, 2013

Python for Data Analysis: The Landscape of Tutorials

Filed under: Data Analysis,Python — Patrick Durusau @ 12:59 pm

Python for Data Analysis: The Landscape of Tutorials by Abhijit Dasgupta.

From the post:

Python has been one of the premier general scripting languages, and a major web development language. Numerical and data analysis and scientific programming developed through the packages Numpy and Scipy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to Matlab. Numpy provided array objects, cross-language integration, linear algebra and other functionalities. Scipy adds to this and provides optimization, linear algebra, statistics and basic image analysis capabilities. Matplotlib provides sophisticated 2-D and basic 3-D graphics capabilities with Matlab-like syntax.

Python

Further recent development has resulted in a rather complete stack for data manipulation and analysis, that includes Sympy for symbolic mathematics, pandas for data structures and analysis, and IPython as an enhanced console and HTML notebook that also facilitates parallel computation.

An even richer data analysis ecosystem is quickly evolving in Python, led by Enthought and Continuum Analytics and several other independent and associated efforts. We have described this ecosystem here. [“ecosystem” and “here” are two distinct links.]

(…)

A very impressive listing of tutorials on Python packages for data analysis.
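
As a minimal, hedged illustration of how the pieces described above fit together (NumPy arrays, SciPy statistics, pandas data frames, Matplotlib plots), something like the following is the usual entry point. The data and column names are invented for the example, not taken from the tutorials:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    # Invented example data: 100 noisy observations of a linear relationship.
    np.random.seed(42)
    df = pd.DataFrame({"x": np.random.normal(size=100)})
    df["y"] = 2.0 * df["x"] + np.random.normal(scale=0.5, size=100)

    # SciPy: simple linear regression over the two columns.
    fit = stats.linregress(df["x"], df["y"])
    print("slope=%.2f r=%.2f" % (fit.slope, fit.rvalue))

    # Matplotlib: scatter plot plus the fitted line.
    plt.scatter(df["x"], df["y"], s=10)
    plt.plot(df["x"], fit.intercept + fit.slope * df["x"], color="red")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()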

July 19, 2013

Communicating Sequential Processes (CSP)

Filed under: Computer Science,Programming — Patrick Durusau @ 3:42 pm

Communicating Sequential Processes (CSP) by Tony Hoare.

From the webpage:

Communicating Sequential Processes, or CSP, is a language for describing patterns of interaction. It is supported by an elegant, mathematical theory, a set of proof tools, and an extensive literature. The book Communicating Sequential Processes was first published in 1985 by Prentice Hall International (who have kindly released the copyright); it is an excellent introduction to the language, and also to the mathematical theory.

An electronic version of the book has been produced, and may be copied, printed, and distributed free of charge. However, such copying, printing, or distribution may not: be carried out for commercial gain; or – for copyright reasons – take place within India, Pakistan, Bangladesh, Sri Lanka, or the Maldives; or involve any modification to the document itself.

Electronic version

Mailing list: csp-announce@comlab.ox.ac.uk

Enjoy!

Neo4j 1.9.2 now available!

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:10 pm

Neo4j 1.9.2 now available! by Jim Webber.

Jim announces that Neo4j 1.9.2 is available for download.

From the post:

Neo4j 1.9.2 is available immediately and is an easy upgrade from any other 1.9.x versions as there are no store upgrades required and so everyone on Neo4j 1.9 and Neo4j 1.9.1 is strongly encouraged to upgrade to 1.9.2.

You need to look at the release notes for 1.9.2:

The 1.9.2 release of Neo4j is a maintenance release that corrects a serious concurrency issue introduced in Neo4j 1.9.1, and resolves some other critical issues when reading from the underlying store. All Neo4j users are highly encouraged to upgrade to this version.

Any release that fixes “critical” errors is a must upgrade.

Designing Topic Map Languages

Filed under: Crowd Sourcing,Graphics,Visualization — Patrick Durusau @ 2:00 pm

A graphical language for explaining, discussing, and planning topic maps has come up before. But no proposal has ever caught on.

I encountered a paper today that describes how to author a notation language that increases semantic transparency for novices by almost 300% and reduces interpretation errors by a factor of 5.

Interested?

Visual Notation Design 2.0: Designing User-Comprehensible Diagramming Notations by Daniel L. Moody, Nicolas Genon, Patrick Heymans, Patrice Caire.

Designing notations that business stakeholders can understand is one of the most difficult practical problems and greatest research challenges in the IS field. The success of IS development depends critically on effective communication between developers and end users, yet empirical studies show that business stakeholders understand IS models very poorly. This paper proposes a radical new approach to designing diagramming notations that actively involves end users in the process. We use i*, one of the leading requirements engineering notations, to demonstrate the approach, but the same approach could be applied to any notation intended for communicating with non-experts. We present the results of 6 related empirical studies (4 experiments and 2 nonreactive studies) that conclusively show that novices consistently outperform experts in designing symbols that are comprehensible to novices. The differences are both statistically significant and practically meaningful, so have implications for IS theory and practice. Symbols designed by novices increased semantic transparency (their ability to be spontaneously interpreted by other novices) by almost 300% compared to the existing i* diagramming notation and reduced interpretation errors by a factor of 5. The results challenge the conventional wisdom about visual notation design, which has been accepted since the beginning of the IS field and is followed unquestioningly today by groups such as OMG: that it should be conducted by a small team of technical experts. Our research suggests that instead it should be conducted by large numbers of novices (members of the target audience). This approach is consistent with principles of Web 2.0, in that it harnesses the collective intelligence of end users and actively involves them as codevelopers (“prosumers”) in the notation design process rather than as passive consumers of the end product. The theoretical contribution of this paper is that it provides a way of empirically measuring the user comprehensibility of IS notations, which is quantitative and practical to apply. The practical contribution is that it describes (and empirically tests) a novel approach to developing user comprehensible IS notations, which is generalised and repeatable. We believe this approach has the potential to revolutionise the practice of IS diagramming notation design and change the way that groups like OMG operate in the future. It also has potential interdisciplinary implications, as diagramming notations are used in almost all disciplines.

This is a very exciting paper!

I thought the sliding scale from semantic transparency (mnemonic) to semantic opacity (conventional) to semantic perversity (false mnemonic) was particularly good.

Not to mention that their process is described in enough detail for others to use the same process.

For designing a Topic Map Graphical Language?

What about designing the next Topic Map Syntax?

We are going to be asking “novices” to author topic maps. Why not ask them to author the language?

And not just one language. A language for each major domain.

Talk about stealing the march on competing technologies!

Mapping Wikipedia – The Schema

Filed under: Topic Maps,Wikipedia — Patrick Durusau @ 12:58 pm

I haven’t finished documenting the issues I encountered with SQLFairy in parsing the MediaWiki schema but I was able to create a png diagram of the schema.

MediaWiki-1.21.1-diagram.png

Should be easier than reading the schema but otherwise I’m not all that impressed.

Some modeling issues to note up front.

SQL Identifiers:

The INT datatype in MySQL is defined as:

Type   Storage (bytes)   Minimum Value   Maximum Value
INT    4                 -2147483648     2147483647

Whereas, the XML syntax for topic maps defines the item identifiers datatype as xsd:ID.

XSD:ID is defined as:

The type xsd:ID is used for an attribute that uniquely identifies an element in an XML document. An xsd:ID value must be an xsd:NCName. This means that it must start with a letter or underscore, and can only contain letters, digits, underscores, hyphens, and periods.

Oops! “[M]ust start with a letter or underscore….”

That leaves out all the INT-type IDs you find in SQL databases and, more generally, any identifier that doesn’t start with a letter or underscore.

One good reason to have an alternative (to XML) syntax for topic maps. The name limitation arose more than twenty years ago and should not trouble us now.
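
If you are stuck with the XML syntax, the usual workaround is to prefix numeric keys so they become legal NCNames. A minimal sketch; the prefixing scheme is my own invention, purely for illustration:

    import re

    NCNAME_START = re.compile(r"^[A-Za-z_]")

    def to_item_identifier(table, row_id):
        """Map a numeric SQL primary key to an xsd:ID-safe identifier.

        INT keys like 42 are illegal as xsd:ID values, so we prefix
        them with the table name, e.g. 42 -> 'user-42'.
        """
        candidate = f"{table}-{row_id}"
        if not NCNAME_START.match(candidate):
            candidate = "_" + candidate  # force a legal first character
        return candidate

    print(to_item_identifier("user", 42))       # user-42
    print(to_item_identifier("404_table", 7))   # _404_table-7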

SQL Tables/Rows

Wikipedia summarizes relational database tables (http://en.wikipedia.org/wiki/Relation_(database)) in part as:

A relation is defined as a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints….

As the article notes: “A tuple usually represents an object and information about that object.” (Read subject for object.)

Converting a database to a topic map begins with deciding what subject every row of each table represents. And recording what information has been captured for each subject.

As you work through the MediaWiki tables, ask yourself what information about a subject must be matched for it to be the same subject?
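
A hedged sketch of that first step: treat each row as statements about one subject and record, per table, which columns you take to be identifying. The column choices and the local SQLite copy below are assumptions for illustration, not the actual MediaWiki modeling decisions:

    import sqlite3

    # Which columns identify the subject represented by a row?
    # This is a modeling decision, recorded per table.
    IDENTIFYING_COLUMNS = {
        "user": ["user_name"],                     # assumption for illustration
        "page": ["page_namespace", "page_title"],  # assumption for illustration
    }

    def row_to_subject(table, row):
        """Split a row into identity-bearing values and other information."""
        identity = {c: row[c] for c in IDENTIFYING_COLUMNS[table]}
        info = {c: v for c, v in row.items() if c not in identity}
        return {"table": table, "identity": identity, "info": info}

    conn = sqlite3.connect("mediawiki.db")  # hypothetical local copy
    conn.row_factory = sqlite3.Row
    for raw in conn.execute("SELECT * FROM user LIMIT 5"):
        print(row_to_subject("user", dict(raw)))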

Normalization

From Wikipedia:

Normalization was first proposed by Codd as an integral part of the relational model. It encompasses a set of procedures designed to eliminate nonsimple domains (non-atomic values) and the redundancy (duplication) of data, which in turn prevents data manipulation anomalies and loss of data integrity. The most common forms of normalization applied to databases are called the normal forms.

True enough, but the article glosses over the inability of then-current databases to handle “non-atomic values” and their lack of the performance needed to tolerate duplication of data.

I say “…then-current databases…” but the limitations around “non-atomic values” and duplication of data persist to this day. Normalization, an activity performed by the user, is meant to compensate for poor hardware/software performance.

From a topic map perspective, normalization means you will find data about a subject in more than one table.
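
Which means the topic map conversion has to reassemble each subject from joins. Another minimal sketch, with the same caveat that the exact tables and columns need to be checked against the real schema:

    import sqlite3

    # Collect what we know about one user across normalized tables.
    QUERY = """
    SELECT u.user_id, u.user_name, g.ug_group
    FROM user AS u
    LEFT JOIN user_groups AS g ON g.ug_user = u.user_id
    WHERE u.user_name = ?
    """

    def collect_subject(conn, user_name):
        rows = conn.execute(QUERY, (user_name,)).fetchall()
        if not rows:
            return None
        return {
            "user_id": rows[0][0],
            "user_name": rows[0][1],
            # the same subject's data, spread over a second table:
            "groups": [r[2] for r in rows if r[2] is not None],
        }

    conn = sqlite3.connect("mediawiki.db")  # hypothetical local copy
    print(collect_subject(conn, "ExampleUser"))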

Next Week

I will start with the “user” table in MediaWiki-1.21.1-tables.sql next Monday.

Question: Which other tables, if any, should we look at while modeling the subject from rows in the user table?

July 18, 2013

Mapping Wikipedia – Update

Filed under: MySQL,SQL,Topic Maps,Wikipedia — Patrick Durusau @ 8:45 pm

I have spent a good portion of today trying to create an image of the MediaWiki table structure.

While I think the SQLFairy (aka SQL Translator) is going to work quite well, it has rather cryptic error messages.

For instance, if the SQL syntax isn’t supported by its internal parser, the error message references the start of the table.

Which means, of course, that you have to compare statements in the table to the subset of SQL that is supported.

I am rapidly losing my SQL parsing skills as the night wears on so I am stopping with a little over 50% of the MediaWiki schema parsed.

Hopefully I will finish correcting the SQL file tomorrow and post the image of the MediaWiki schema.

Plus notes on what SQLFairy did not recognize, to ease your use of it on other SQL schemas.

July 17, 2013

Mapping Wikipedia

Filed under: Topic Maps,Wikipedia — Patrick Durusau @ 4:09 pm

Carl Lemp commented in the XTM group at LinkedIn, in the “potential redesign of topic maps” discussion:

2. There are only a few tools to help build a Topic Map.
3. There is almost nothing to help translate familiar information structures to Topic Map structures.
(…)
Getting through 2 and 3 is a bitch.

I can’t help with #2 but I may be able to help with #3.

I suggest mapping the MediaWiki structure that is used for Wikipedia into a topic map.

As a demonstration it has the following advantages:

  1. Conversion from SQL dump to topic map scripts.
  2. Large enough to test alternative semantics.
  3. Sub-sets of Wikipedia good for starter maps.
  4. Useful to merge with other data sets.
  5. Well known data set.
  6. Widespread data format (SQL).

The MediaWiki schema: MediaWiki-1.21.1-tables.sql.

The base output format will be CTM.
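
For readers who have not seen CTM, here is a sketch of what the output for a single user row might look like, generated by a few lines of Python. The CTM details are simplified from memory and should be checked against the CTM specification before anyone relies on them:

    def row_to_ctm(row):
        """Emit a CTM-flavored fragment for one user row.

        The CTM details (isa for type, '-' for names, 'type: value' for
        occurrences) are approximations; verify against the CTM spec.
        """
        return "\n".join([
            "wiki:user-%d" % row["user_id"],
            "    isa wiki:user;",
            '    - "%s";' % row["user_name"],
            '    wiki:registration: "%s".' % row["user_registration"],
        ])

    row = {"user_id": 42, "user_name": "ExampleUser",
           "user_registration": "20130717000000"}
    print(row_to_ctm(row))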

When we want to test alternative semantics, I suggest that we use “.” followed by “0tm” (zero followed by “tm”) as the file extension. Comments at the head of the file should reference or document the semantics to be applied in processing the file.

In terms of sigla for annotating the SQL, are there any strong feelings against the following (drawn from the TMDM vocabulary section)?

  • A (association): representation of a relationship between one or more subjects
  • Ar (association role): representation of the involvement of a subject in a relationship represented by an association
  • Art (association role type): subject describing the nature of the participation of an association role player in an association
  • At (association type): subject describing the nature of the relationship represented by associations of that type
  • Ir (information resource): a representation of a resource as a sequence of bytes; it could thus potentially be retrieved over a network
  • Ii (item identifier): locator assigned to an information item in order to allow it to be referred to
  • O (occurrence): representation of a relationship between a subject and an information resource
  • Ot (occurrence type): subject describing the nature of the relationship between the subjects and information resources linked by the occurrences of that type
  • S (scope): context within which a statement is valid
  • Si (subject identifier): locator that refers to a subject indicator
  • Sl (subject locator): locator that refers to the information resource that is the subject of a topic
  • T (topic): symbol used within a topic map to represent one, and only one, subject, in order to allow statements to be made about the subject
  • Tn (topic name): name for a topic, consisting of the base form, known as the base name, and variants of that base form, known as variant names
  • Tnt (topic name type): subject describing the nature of the topic names of that type
  • Tf (topic type): subject that captures some commonality in a set of subjects
  • Vn (variant name): alternative form of a topic name that may be more suitable in a certain context than the corresponding base name

The first step I would suggest is creating a visualization of the MediaWiki schema.

We will still have to iterate over the tables but getting an overall view of the schema will be helpful.

Suggestions on your favorite schema visualization tool?

Poderopedia Plug & Play Platform

Filed under: Data Management,News,Reporting — Patrick Durusau @ 4:08 pm

Poderopedia Plug & Play Platform

From the post:

Poderopedia Plug & Play Platform is a Data Intelligence Management System that allows you to create and manage large semantic datasets of information about entities, map and visualize entity connections, include entity related documents, add and show sources of information and news mentions of entities, displaying all the information in a public or private website, that can work as a standalone product or as a public searchable database that can interoperate with a Newsroom website, for example, providing rich contextual information for news content using it`s archive.

Poderopedia Plug & Play Platform is a free open source software developed by the Poderomedia Foundation, thanks to the generous support of a Knight News Challenge 2011 grant by the Knight Foundation, a Startup Chile 2012 grant and a 2013 Knight fellowship grant by the International Center for Journalists (ICFJ).

WHAT CAN I USE IT FOR?

For anything that involves mapping entities and connections.

A few real examples:

  • NewsStack, an Africa News Challenge Winner, will use it for a pan-African investigation by 10 media organizations into the continent’s extractive industries.
  • Newsrooms from Europe and Latin America want to use it to make their own public searchable databases of entities, reuse their archive to develop new information products, provide context to new stories and make data visualizations—something like making their own Crunchbase.

Other ideas:

  • Use existing data to make searchable databases and visualizations of congresspeople, bills passed, what they own, who funds them, etc.
  • Map lobbyists and who they lobby and for whom
  • Create a NBApedia, Baseballpedia or Soccerpedia. Show data and connections about team owners, team managers, players, all their stats, salaries and related business
  • Map links between NSA, Prism and Silicon Valley
  • Keep track of foundation grants, projects that received funding, etc.
  • Anything related to data intelligence

CORE FEATURES

Plug & Play allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity.

Among several features (please see full list here) it includes:

  • Entity pages
  • Connections data sheet
  • Data visualizations without coding
  • Annotated documents repository
  • Add sources of information
  • News river
  • Faceted Search (using Solr)
  • Semantic ontology to express connections
  • Republish options and metrics record
  • View entity history
  • Report errors and inappropriate content
  • Suggest connections and new entities to add
  • Needs updating alerts
  • Send anonymous tips

Hmmm, when they say:

For anything that involves mapping entities and connections.

Topic maps would say:

For anything that involves mapping subjects and associations.

What Poderopedia does lack is a notion of subject identity that would support “merging.”

I am going to install Poderopedia locally and see what the UI is like.

Appreciate your comments and reports if you do the same.

Plus suggestions about adding topic map capabilities to Poderopedia.

I first saw this in Nat Torkington’s Four Short Links: 5 July 2013.

July 16, 2013

Congressional Network Analysis

Filed under: D3,Government,Graphics,Visualization — Patrick Durusau @ 5:07 pm

Congressional Network Analysis by Christopher Roach.

From the post:

This page started out as a bit of code that I wrote for my network science talk at PyData 2013 (Silicon Valley). It was meant to serve as a simple example of how to apply some social network analysis techniques to a real world dataset. After the talk, I decided to get the code cleaned up a bit so that I could release it for anyone who had seen the talk, or just for anyone who happens to have a general interest in the topic. As I worked at cleaning the code up, I started adding a few little features here and there and started to think about how I could make the visualization easier to execute since Matplotlib can sometimes be a bit burdensome to install. The solution was to display the visualization in the browser. This way it could be viewed without needing to install a bunch of third-party Python libraries.

Quick Overview

The script that I created for the talk shows a social network of one of the houses for a specific session of Congress. The network is created by linking each member of Congress to other members with which they have worked on at least one bill. The more bills the two members have worked on, the more intense the link is between the two in the visualization. In the browser-based visualization, you can change the size of the nodes relative to some network measure, by selecting the desired measure from the dropdown in the upper right corner of the visualization. Finally, unlike the script, the graph above only shows one network for the Senate of the 112th Congress. I chose this session specifically simply because it can be considered the most dysfunctional session of congress in our nation’s history and so I thought it might be an interesting session for us to study.

I think the title for “most dysfunctional session” of congress is up for grabs, again. 😉

But this is a great introduction to visualization with D3.js, along with appropriate warnings to not take the data at face value. Yes, the graph may seem to indicate a number of things but it is just a view of a snippet of data.

Christopher should get high marks for advocating skepticism in data analysis.
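
The construction Christopher describes is easy to reproduce for your own data. A hedged sketch with NetworkX, using invented bill data rather than his dataset:

    from itertools import combinations
    import networkx as nx

    # Invented data: bill id -> members who worked on it.
    bills = {
        "S.1": ["Smith", "Jones", "Lee"],
        "S.2": ["Smith", "Lee"],
        "S.3": ["Jones", "Garcia"],
    }

    G = nx.Graph()
    for members in bills.values():
        for a, b in combinations(sorted(members), 2):
            # Edge weight = number of bills the pair worked on together.
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)

    # Size nodes by a network measure, as the dropdown in the demo does.
    centrality = nx.degree_centrality(G)
    for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
        print(node, round(score, 2))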

Haskell Tutorial

Filed under: Functional Programming,Haskell — Patrick Durusau @ 4:52 pm

Haskell Tutorial by Conrad Barski, M.D.

From the post:

There’s other tutorials out there, but you’ll like this one the best for sure: You can just cut and paste the code from this tutorial bit by bit, and in the process, your new program will magically create more and more cool graphics along the way… The final program will have less than 100 lines of Haskell[1] and will organize a mass picnic in an arbitrarily-shaped public park map and will print pretty pictures showing where everyone should sit! (Here’s what the final product will look like, if you’re curious…)

The code in this tutorial is a simplified version of the code I’m using to organize flash mob picnics for my art project, picnicmob… Be sure to check out the site and sign up if you live in one of the cities we’re starting off with 🙂

Could be a model for a topic maps tutorial, at least for the technical parts.

Sam Hunting mentioned quite recently that topic maps lacks the equivalent of nsgmls.

A command line app that takes input and gives you back output.

Doesn’t have to be a command line app but certainly should support cut-n-paste with predictable results.

Which is how most people learn HTML.

Something to keep in mind.
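
To make the nsgmls comparison concrete, here is the shape such a tool might take, sketched as a stdin-to-stdout filter. The behavior is invented placeholder logic, not a real topic map processor:

    #!/usr/bin/env python
    """A toy nsgmls-style filter: read a topic map syntax on stdin,
    write one line per statement on stdout. The 'parsing' below is a
    placeholder, not a real CTM/XTM parser."""
    import sys

    def main():
        for lineno, line in enumerate(sys.stdin, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Placeholder: echo each statement with its line number,
            # the way nsgmls reports ESIS events one per line.
            print(f"{lineno}\t{line}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())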

ndtv: network dynamic temporal visualization

Filed under: Graphs,R,Statistics,Visualization — Patrick Durusau @ 3:55 pm

ndtv: network dynamic temporal visualization by Skye Bender-deMoll.

From the post:

The ndtv package is finally up on CRAN! Here is the output of a toy “tergm” simulation of edge dynamics, rendered as an animated GIF:

[animated GIF of the simulation in the original post]

[link to movie version of a basic tergm simulation]

For the past year or so I’ve been doing increasing amounts of work building R packages as part of the statnet team. The statnet project is focused on releasing tools for doing statistical analysis on networks (Exponential Random Graph Models “ERGMs”) but also includes some lower-level packages for efficiently working with network data in R, including dynamic network data (the networkDynamic package). One of my main responsibilities is to implement some network animation techniques in an R package to make it easy to generate movies of various types of simulation output. That package is named “ndtv“, and we finally got it released on CRAN (the main archive of R packages) a few days ago.

Dynamic network data?

What? Networks aren’t static? 😉

Truth be told, “static” is a simplification that depends on your frame of reference.

Something to remember when offered a “static” data solution.

I first saw this at Pete Warden’s Five Short Links, July 15, 2013.

What is wrong with these charts?

Filed under: Charts,Graphics,Humor — Patrick Durusau @ 3:28 pm

What is wrong with these charts? by Nathan Yau.

If you pride yourself on spotting mistakes, Nathan Yau has a chart for you!

Pair up and visit Nathan’s post. See if you and a friend spot the same errors.

Suggest you repeat the exercise with your next presentation but tell the contestants the slides are from someone else. 😉

InfiniteGraph 3.1

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 3:21 pm

InfiniteGraph 3.1 Features & Capabilities

From the webpage:

  • Faster Path Finding Provides dramatically improved search results when finding paths between two known objects by leveraging a two-way path finding algorithm.
  • Ingest Enhancements InfiniteGraph 3.1 offers up to 25% improved ingest performance over previous versions.
  • Visualizer Improvements The InfiniteGraph Visualizer provides users with additional ease-of-use and quick start functionality for visualizing and navigating the graph.
  • Storing Graph Views and Navigation Policy Chains Navigation policies, which customize the behavior of a navigation query, and graph views, which define a subset of a graph database for a navigation query, can now be saved in your graph database for later reuse.
  • Unlimited nodes and edges free for 60 days.

When your graph database scales into trillions of nodes, all you need are the facts to get attention.

Predicting Terrorism with Graphs

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:07 pm

A Little Graph Theory for the Busy Developer by Jim Webber.

From the description:

In this talk we’ll explore powerful analytic techniques for graph data. Firstly we’ll discover some of the innate properties of (social) graphs from fields like anthropology and sociology. By understanding the forces and tensions within the graph structure and applying some graph theory, we’ll be able to predict how the graph will evolve over time. To test just how powerful and accurate graph theory is, we’ll also be able to (retrospectively) predict World War 1 based on a social graph and a few simple mechanical rules.

A presentation for NoSQL Now!, August 20-22, 2013, San Jose, California.

I would appreciate your asking Jim to predict the next major act of terrorism using Neo4j.

If he can predict WWI with “a few mechanical rules,” the “power and accuracy of graphs” should support prediction of terrorism.

Yes?

If you read the 9/11 Commission Report (pdf), you too can predict 9/11, in retrospect.

Without any database at all.

Don’t get me wrong, I really like graph databases. And they have a number of useful features.

Why not sell graph databases based on technical merit?

As opposed to carny sideshow claims?

Plan for Light Table 0.5

Filed under: Documentation,Programming — Patrick Durusau @ 2:47 pm

The plan for 0.5 by Chris Granger.

From the post:

You guys have been waiting very patiently for a while now, so I wanted to give you an idea of what’s coming in 0.5. A fair amount of the work is in simplifying both the UI/workflow as well as refactoring everything to get ready for the beta (plugins!). I’ve been talking with a fair number of people to understand how they use LT or why they don’t and one of the most common pieces of feedback I’ve gotten is that while it is very simple it still seems heavier than something like Sublime. We managed to attribute this to the fact that it does some unfamiliar things, one of the biggest of which is a lack of standard menus. We don’t really gain anything by not doing menus and while there were some technical reasons I didn’t, I’ve modified node-webkit to fix that. So I’m happy to say 0.5 will use standard menus and the ever-present bar on the left will be disappearing. This makes LT about as light as it possibly can be and should alleviate the feeling that you can’t just use it as a text editor.

Looking forward to the first week or two of August 2013, Chris’s goal for the 0.5 release!

Knowledge, Graphs & 3D CAD Systems

Filed under: Graphs,Knowledge — Patrick Durusau @ 1:53 pm

Knowledge, Graphs & 3D CAD Systems by David Bigelow.

I like this riff: Phase 1: collect data; Phase 2: something happens; Phase 3: Profit.

David says we need more detail on phase 2. 😉

Covers making data autonomous, capturing design frameworks, people “mobility,” and data flows.

Demo starts at time mark 28:00 or so.

The most interesting part is the emphasis on not storing all the data in the graph database.

The graph database is used to store relationship information, such as who can access particular data and other relationships between data.

Remarkably similar to some comments I have made recently on using topic maps to supplement other information systems as opposed to replacing them.
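
A minimal sketch of that division of labour, with invented labels: the graph holds only relationships and pointers, while the payload stays in the system of record.

    import networkx as nx

    # The graph stores relationships and access control, not the data.
    G = nx.DiGraph()
    G.add_node("doc-123", store="s3://cad-models/doc-123.step")  # pointer only
    G.add_node("alice")
    G.add_edge("alice", "doc-123", rel="CAN_READ")
    G.add_edge("doc-123", "doc-122", rel="DERIVED_FROM")

    def readable_by(graph, user):
        """Documents a user may read, resolved to their external locations."""
        return [
            (doc, graph.nodes[doc].get("store"))
            for _, doc, data in graph.out_edges(user, data=True)
            if data.get("rel") == "CAN_READ"
        ]

    print(readable_by(G, "alice"))
    # [('doc-123', 's3://cad-models/doc-123.step')]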

A Guide to Documentary Editing

Filed under: Editor,Texts — Patrick Durusau @ 9:59 am

A Guide to Documentary Editing by Mary-Jo Kline and Susan Holbrook Perdue.

From the introduction:

Don’t be embarrassed if you aren’t quite sure what we mean by “documentary editing.” When the first edition of this Guide appeared in 1987, the author found that her local bookstore on the Upper West Side of Manhattan had shelved a copy in the “Movies and Film” section. When she pointed out the error and explained what the book was about, the store manager asked perplexedly, “Where the heck should we shelve it?”

Thus we offer no apologies for providing a brief introduction that explains what documentary editing is and how it came to be.

If this scholarly specialty had appeared overnight in the last decade, we could spare our readers the “history” as well as the definition of documentary editing. Unfortunately, this lively and productive area of scholarly endeavor evolved over more than a half century, and it would be difficult for a newcomer to understand many of the books and articles to which we’ll refer without some understanding of the intellectual debates and technological innovations that generated these discussions. We hope that our readers will find a brief account of these developments entertaining as well as instructive.

We also owe our readers a warning about a peculiar trait of documentary editors that creates a special challenge for students of the craft: practitioners have typically neglected to furnish the public with careful expositions of the principles and practices by which they pursue their goals. Indeed, it was editors’ failure to write about editing that made the first edition of this Guide necessary in the 1980s. It’s hard to overemphasize the impact of modern American scholarly editing in the third quarter of the twentieth century: volumes of novels, letters, diaries, statesmen’s papers, political pamphlets, and philosophical and scientific treatises were published in editions that claimed to be scholarly, with texts established and verified according to the standards of the academic community. Yet the field of scholarly editing grew so quickly that many of its principles were left implicit in the texts or annotation of the volumes themselves.

(…)

Even for materials under revision control, explicit principles of documentary editing will someday play a role in future editions of those texts. In part because texts do not stand alone, apart from social context.

Abbie Hoffman‘s introduction to Steal This Book:

We cannot survive without learning to fight and that is the lesson in the second section. FIGHT! separates revolutionaries from outlaws. The purpose of part two is not to fuck the system, but destroy it. The weapons are carefully chosen. They are “home-made,” in that they are designed for use in our unique electronic jungle. Here the uptown reviewer will find ample proof of our “violent” nature. But again, the dictionary of law fails us. Murder in a uniform is heroic, in a costume it is a crime. False advertisements win awards, forgers end up in jail. Inflated prices guarantee large profits while shoplifters are punished. Politicians conspire to create police riots and the victims are convicted in the courts. Students are gunned down and then indicted by suburban grand juries as the trouble-makers. A modern, highly mechanized army travels 9,000 miles to commit genocide against a small nation of great vision and then accuses its people of aggression. Slumlords allow rats to maim children and then complain of violence in the streets. Everything is topsy-turvy. If we internalize the language and imagery of the pigs, we will forever be fucked. Let me illustrate the point. Amerika was built on the slaughter of a people. That is its history. For years we watched movie after movie that demonstrated the white man’s benevolence. Jimmy Stewart, the epitome of fairness, puts his arm around Cochise and tells how the Indians and the whites can live in peace if only both sides will be reasonable, responsible and rational (the three R’s imperialists always teach the “natives”). “You will find good grazing land on the other side of the mountain,” drawls the public relations man. “Take your people and go in peace.” Cochise as well as millions of youngsters in the balcony of learning, were being dealt off the bottom of the deck. The Indians should have offed Jimmy Stewart in every picture and we should have cheered ourselves hoarse. Until we understand the nature of institutional violence and how it manipulates values and mores to maintain the power of the few, we will forever be imprisoned in the caves of ignorance. When we conclude that bank robbers rather than bankers should be the trustees of the universities, then we begin to think clearly. When we see the Army Mathematics Research and Development Center and the Bank of Amerika as cesspools of violence, filling the minds of our young with hatred, turning one against another, then we begin to think revolutionary.

Be clever using section two; clever as a snake. Dig the spirit of the struggle. Don’t get hung up on a sacrifice trip. Revolution is not about suicide, it is about life. With your fingers probe the holiness of your body and see that it was meant to live. Your body is just one in a mass of cuddly humanity. Become an internationalist and learn to respect all life. Make war on machines, and in particular the sterile machines of corporate death and the robots that guard them. The duty of a revolutionary is to make love and that means staying alive and free. That doesn’t allow for cop-outs. Smoking dope and hanging up Che’s picture is no more a commitment than drinking milk and collecting postage stamps. A revolution in consciousness is an empty high without a revolution in the distribution of power. We are not interested in the greening of Amerika except for the grass that will cover its grave.

would require a lot of annotation to explain to an audience that meekly submits to public gropings in airport security lines, widespread government surveillance and wars that benefit only contractors.

Both the Guide to Documentary Editing and Steal This Book are highly recommended.

Abbot MorphAdorner collaboration

Filed under: Data,Language — Patrick Durusau @ 9:12 am

Abbot MorphAdorner collaboration

From the webpage:

The Center for Digital Research in the Humanities at the University of Nebraska and Northwestern University’s Academic and Research Technologies are pleased to announce the first fruits of a collaboration between the Abbot and EEBO-MorphAdorner projects: the release of some 2,000 18th century texts from the TCP-ECCO collections in a TEI-P5 format and with linguistic annotation. More texts will follow shortly, subject to the access restrictions that will govern the use of TCP texts for the remainder of this decade.

The Text Creation Partnership (TCP) collection currently consists of about 50,000 fully transcribed SGML texts from the first three centuries of English print culture. The collection will grow to approximately 75,000 volumes and will contain at least one copy of every book published before 1700 as well as substantial samples of 18th century texts published in the British Isles or North America. The ECCO-TCP texts are already in the public domain. The other texts will follow them between 2014 and 2015. The Evans texts will be released in June 2014, followed by a release of some 25,000 EEBO texts in 2015.

It is a major goal of the Abbot and EEBO MorphAdorner collaboration to turn the TCP texts into the foundation for a “Book of English,” defined as

  • a large, growing, collaboratively curated, and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation

Texts in the annotated TCP corpus will exist in more than one format so as to facilitate different uses to which they are likely to be put. In a first step, Abbot transforms the SGML source text into a TEI P5 XML format. Abbot, a software program designed by Brian Pytlik Zillig and Stephen Ramsay, can read arbitrary XML files and convert them into other XML formats or a shared format. Abbot generates its own set of conversion routines at runtime by reading an XML schema file and programmatically effecting the desired transformations. It is an excellent tool for creating an environment in which texts originating in separate projects can acquire a higher degree of interoperability. A prototype of Abbot was used in the MONK project to harmonize texts from several collections, including the TCP, Chadwyck-Healey’s Nineteenth-Century Fiction, the Wright Archive of American novels 1851-1875, and Documenting the American South.

This first transformation maintains all the typographical data recorded in the original SGML transcription, including long ‘s’, printer’s abbreviations, superscripts etc. In a second step MorphAdorner tokenizes this file. MorphAdorner was developed by Philip R. Burns. It is a multi-purpose suite of NLP tools with special features for the tokenization, analysis, and annotation of historical corpora. The tokenization uses algorithms and heuristics specific to the practices of Early Modern print culture, wraps every word token in a <w> element with a unique ID, and explicitly marks sentence boundaries.

In the next step (conceptually different but merged in practice with the previous), some typographical features are removed from the tokenized text, but all such changes are recorded in a change log and may therefore be reversed. The changes aim at making it easier to manipulate the corpus with software tools that presuppose modern printing practices. They involve such things as replacing long ‘s’ with plain ‘s’, or resolving unambiguous printer’s abbreviations and superscripts.

Talk about researching across language as it changes!

This is way cool!

Lots of opportunities for topic map-based applications.
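
To make the <w>-wrapping concrete, here is a deliberately naive sketch of the output shape. The real MorphAdorner uses tokenization rules tuned to Early Modern print, not a regex, and its ids and sentence markers differ from the invented ones below:

    import re
    from xml.sax.saxutils import escape

    def adorn(text, prefix="A00001"):
        """Wrap word tokens in <w> elements with unique ids and mark a
        sentence boundary, loosely imitating MorphAdorner's output shape."""
        out = []
        for n, token in enumerate(re.findall(r"\w+|[^\w\s]", text), start=1):
            out.append(f'<w xml:id="{prefix}-{n:05d}">{escape(token)}</w>')
        out.append("<milestone unit='sentence'/>")  # invented boundary marker
        return " ".join(out)

    print(adorn("Loue is blinde."))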

For more information:

Abbot Text Interoperability Tool

Download texts here

July 15, 2013

Natural Language Processing (NLP) Demos

Filed under: Natural Language Processing — Patrick Durusau @ 3:53 pm

Natural Language Processing (NLP) Demos (Cognitive Computation Group, University of Illinois at Urbana-Champaign)

From the webpage:

Most of the information available today is in free form text. Current technologies (google, yahoo) allow us to access text only via key-word search.

We would like to facilitate content-based access to information. Examples include:

  • Topical and Functional categorization of documents: Find documents that deal with stem cell research, but only Call for Proposals.
  • Semantic categorization: Find documents about Columbus (the City, not the Person).
  • Retrieval of concepts and entities rather than strings in text: Find documents about JFK, the president; include those documents that mention him as “John F. Kennedy, John Kennedy, Congressman Kennedy or any other possible writing; but not those that mention the baseball player John Kennedy, nor any of JFK’s relatives.
  • Extraction of information based on semantic categorization: Find a list of all companies that participated in merges in the last year. List all professors in Illinois that do research in Machine Learning.

I count twenty (20) separate demos.

Gives you a good sense of the current state of NLP.

I first saw this at: Demos of NLP by Ryan Swanstrom.

Purely Functional Data Structures in Clojure: Red-Black Trees

Filed under: Clojure,Functional Programming,Haskell — Patrick Durusau @ 3:34 pm

Purely Functional Data Structures in Clojure: Red-Black Trees by Leonardo Borges.

From the post:

Recently I had some free time to come back to Purely Functional Data Structures and implement a new data structure: Red-black trees.

Leonardo continues his work on Chris Okasaki’s Purely Functional Data Structures.

Is a functional approach required for topic maps to move beyond being static digital artifacts?
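
Okasaki's insertion algorithm is compact enough to sketch while we wait for Leonardo's Clojure write-up. Here it is in Python, purely as an illustration of the functional style (not his code); it needs Python 3.10+ for match/case.

    # A persistent red-black tree in the style of Okasaki's book. A node is
    # a tuple (color, left, key, right); the empty tree is None. Nothing is
    # ever mutated, so old versions of the tree remain valid after inserts.

    def balance(color, left, key, right):
        """Okasaki's four rebalancing cases; otherwise rebuild unchanged."""
        match (color, left, key, right):
            case (
                ("B", ("R", ("R", a, x, b), y, c), z, d)
                | ("B", ("R", a, x, ("R", b, y, c)), z, d)
                | ("B", a, x, ("R", ("R", b, y, c), z, d))
                | ("B", a, x, ("R", b, y, ("R", c, z, d)))
            ):
                return ("R", ("B", a, x, b), y, ("B", c, z, d))
            case _:
                return (color, left, key, right)

    def insert(tree, key):
        def ins(node):
            if node is None:
                return ("R", None, key, None)
            color, left, k, right = node
            if key < k:
                return balance(color, ins(left), k, right)
            if key > k:
                return balance(color, left, k, ins(right))
            return node  # key already present; reuse the old subtree
        _, left, k, right = ins(tree)
        return ("B", left, k, right)  # the root is always black

    t = None
    for k in [5, 1, 9, 3, 7, 2]:
        t = insert(t, k)
    print(t)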

Corporate Culture Clash:…

Filed under: Communication,Diversity,Heterogeneous Data,Language,Marketing,Semantics — Patrick Durusau @ 3:05 pm

Corporate Culture Clash: Getting Data Analysts and Executives to Speak the Same Language by Drew Rockwell

From the post:

A colleague recently told me a story about the frustration of putting in long hours and hard work, only to be left feeling like nothing had been accomplished. Architecture students at the university he attended had scrawled their frustrations on the wall of a campus bathroom…“I wanted to be an architect, but all I do is create stupid models,” wrote students who yearned to see their ideas and visions realized as staples of metropolitan skylines. I’ve heard similar frustrations expressed by business analysts who constantly face the same uphill battle. In fact, in a recent survey we did of 600 analytic professionals, some of the biggest challenges they cited were “getting MBAs to accept advanced methods”, getting executives to buy into the potential of analytics, and communicating with “pointy-haired” bosses.

So clearly, building the model isn’t enough when it comes to analytics. You have to create an analytics-driven culture that actually gets everyone paying attention, participating and realizing what analytics has to offer. But how do you pull that off? Well, there are three things that are absolutely critical to building a successful, analytics-driven culture. Each one links to the next and bridges the gap that has long divided analytics professionals and business executives.

Some snippets to attract you to this “must read:”

(…)
In the culinary world, they say you eat with your eyes before your mouth. A good visual presentation can make your mouth water, while a bad one can kill your appetite. The same principle applies when presenting data analytics to corporate executives. You have to show them something that stands out, that they can understand and that lets them see with their own eyes where the value really lies.
(…)
One option for agile integration and analytics is data discovery – a type of analytic approach that allows business people to explore data freely so they can see things from different perspectives, asking new questions and exploring new hypotheses that could lead to untold benefits for the entire organization.
(…)
If executives are ever going to get on board with analytics, the cost of their buy-in has to be significantly lowered, and the ROI has to be clear and substantial.
(…)

I did pick the most topic map “relevant” quotes but they are as valid for topic maps as for any other approach.

Seeing from different perspectives sounds like on-the-fly merging to me.

How about you?

Better Corporate Data!

Filed under: Open Data,Open Government — Patrick Durusau @ 2:43 pm

Announcing open corporate network data: not just good, but better

OpenCorporates announces three projects:

1. An open data corporate network platform

The most important part is a new platform for collecting, collating and allowing access to different types of corporate relationship data – subsidiary data, parent company data, and shareholding data. This means that governments around the world (and companies too) can publish corporate network data and they can be combined in a single open-data repository, for a more complete picture. We think this is a game-changer, as it not only allows seamless, lightweight co-operation, but will identify errors and contradictions. We’ll be blogging about the platform in more details over the coming weeks, but it’s been a genuinely hard computer-science problem that has resulted in some really innovative work.

2. Three key initial datasets

(…)

The shareholder data from the New Zealand company register, for example, is granular and up to date, and if you have API access is available as data. It talks about parental control, often to very granular data, and importing this data allows you to see not just shareholders (which you can also see on the NZ Companies House pages) but also what companies are owned by another company (which you can’t). And it’s throwing up some interesting examples, of which more in a later blog post.

The data from the Federal Reserve’s National Information Center is also fairly up to date, but is (for the biggest banks) locked away in horrendous PDFs and talks about companies controlled by other companies.

The data from the 10-K and 20-F filings from the US Securities and Exchange Commission is the most problematic of all, being published once a year, as arbitrary text (pretty shocking in the 21st century for this still to be the case), and talks about ‘significant subsidiaries’.

(…)

3. An example of the power of this dataset.

We think just pulling the data together as open data is pretty cool, and that many of the best uses will come from other users (we’re going to include the data in the next version of our API in a couple of weeks). But we’ve built in some network visualisations to allow the information to be explored. Check out Barclays Bank PLC, Pearson PLC, The Gap or Starbucks.

OpenCorporates is engineering the critical move from “open data,” ho-hum, to “corporate visibility using open data.”

Not quite to the point of “accountability” but you have to identify evildoers before they can be held accountable.

A project that merits your interest, donations and support. Please pass this on. Thanks!

Sqrrl Enterprise…

Filed under: Accumulo,Graphs — Patrick Durusau @ 2:25 pm

Sqrrl Enterprise = 3 Databases in 1 (Column + Document + Graph)

From the post:

When looking across the NoSQL landscape, most folks partition NoSQL databases into 4 categories:

  • Key Value Stores (e.g., Riak, Redis)
  • Column Stores (e.g., HBase, Cassandra, Accumulo)
  • Document Stores (e.g., MongoDB, CouchDB)
  • Graph Stores (e.g., Neo4j, TitanDB)

In addition to being creators of the Accumulo database, the team here at Sqrrl can also appreciate the benefits of other databases in the NoSQL landscape. For this reason, when we began architecting Sqrrl Enterprise, we decided to not limit ourselves to just Accumulo’s column store data structure. Sqrrl Enterprise features Document and Graph Store functionality in addition to being a Column Store at its core.

Sqrrl Enterprise is built using open source Apache Accumulo, giving it its column store core. However, we love the ease of use of document stores, so when we ingest data, we convert that data from Accumulo’s native key/value format into hierarchical JSON documents (giving Sqrrl Enterprise document store functionality).

At ingest we also extract all of the graph relationships in the datasets and store them as sets of nodes and edges, giving Sqrrl Enterprise a variety of graph capabilities.

Interesting.

If you have experience with this “enhanced” version of Accumulo, will you share your experience with “a variety of graph capabilities?”
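
I have not used Sqrrl myself, but the ingest pattern described above, one record becoming both a JSON document and a set of nodes and edges, is easy to sketch in outline (field names are invented):

    import json

    def ingest(record):
        """Turn one flat key/value record into (document, nodes, edges).

        A rough sketch of the ingest pattern described in the post,
        not Sqrrl's actual implementation.
        """
        # Document store view: the record as a hierarchical JSON document.
        document = json.dumps(record, indent=2)

        # Graph view: entities become nodes, references become edges.
        nodes = {record["id"], record["manager_id"]}
        edges = [(record["id"], "REPORTS_TO", record["manager_id"])]
        return document, nodes, edges

    doc, nodes, edges = ingest(
        {"id": "emp-7", "name": "Ada", "manager_id": "emp-1"}
    )
    print(doc)
    print(nodes, edges)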

[Updated November 7, 2013. Changed the link to the post. To one that works.]

Why Unique Indexes are Bad [Caveat on Fractal Tree(R) Indexes]

Filed under: Fractal Trees,Indexing,TokuDB — Patrick Durusau @ 2:12 pm

Why Unique Indexes are Bad by Zardosht Kasheff.

From the post:

Before creating a unique index in TokuMX or TokuDB, ask yourself, “does my application really depend on the database enforcing uniqueness of this key?” If the answer is ANYTHING other than yes, do not declare the index to be unique. Why? Because unique indexes may kill your write performance. In this post, I’ll explain why.

Unique indexes are a strange beast: they have no impact on standard databases that use B-Trees, such as MongoDB and MySQL, but may be horribly painful for databases that use write optimized data structures, like TokuMX’s Fractal Tree(R) indexes. How? They essentially drag the Fractal Tree index down to the B-Tree’s level of performance.

When a user declares a unique index, the user tells the database, “please help me and enforce uniqueness on this index.” So, before doing any insertion into a unique index, the database must first verify that the key being inserted does not already exist. If the possible location of the key is not in memory, which may happen if the working set does not fit in memory, then the database MUST perform an I/O to bring into memory the contents of the potential location (be it a leaf node in a tree, or an offset into a memory mapped file), in order to check whether the key exists in that location.

(…)

Zardosht closes by recommending if your application does require unique indexes that you consider re-writing it so it doesn’t.

Ouch!

Not a mark against Fractal Tree(R) indexes but certainly a consideration in deciding to adopt technology using them.
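
The penalty is easy to see in outline: a unique index forces a point read, and possibly an I/O, before every insert, while a non-unique write-optimized index can simply buffer the message. A toy counting model of the difference, not a benchmark:

    import random

    class ToyIndex:
        """Toy model: counts how many 'reads' each insert policy needs."""
        def __init__(self, unique):
            self.unique = unique
            self.keys = set()
            self.reads = 0

        def insert(self, key):
            if self.unique:
                # Must check for an existing key first, a read that may
                # miss cache and hit disk in a real engine.
                self.reads += 1
                if key in self.keys:
                    raise ValueError("duplicate key")
            # A write-optimized tree can just buffer this message.
            self.keys.add(key)

    for unique in (True, False):
        idx = ToyIndex(unique)
        for _ in range(10_000):
            idx.insert(random.getrandbits(64))
        print("unique" if unique else "non-unique", "reads:", idx.reads)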

Would be nice if this type of information could be passed along as more than sysadmin lore.

Like a plugin for your browser that at your request highlights products or technologies of interest and on mouse-over displays known limitations or bugs.

The sort of things that vendors loath to disclose.

The Book on Apache Sqoop is Here!

Filed under: Hadoop,Sqoop — Patrick Durusau @ 12:45 pm

The Book on Apache Sqoop is Here! by Justin Kestelyn.

From the post:

Continuing the fine tradition of Clouderans contributing books to the Apache Hadoop ecosystem, Apache Sqoop Committers/PMC Members Kathleen Ting and Jarek Jarcec Cecho have officially joined the book author community: their Apache Sqoop Cookbook is now available from O’Reilly Media (with a pelican as the assigned cover beast).

The book arrives at an ideal time. Hadoop has quickly become the standard for processing and analyzing Big Data, and in order to integrate a new Hadoop deployment into your existing environment, you will very likely need to transfer data stored in legacy relational databases into your new cluster.

Sqoop is just the ticket; it optimizes data transfers between Hadoop and RDBMSs via a command-line interface listing 60 parameters. This new cookbook focuses on applying these parameters to common use cases — one recipe at a time, Kate and Jarek guide you from basic commands that don’t require prior Sqoop knowledge all the way to very advanced use cases. These recipes are sufficiently detailed not only to enable you to deploy Sqoop in your environment, but also to understand its inner workings.

Good to see a command with a decent number of options, sixty (60).

A little light when compared to ps at one hundred and eighty-six (186) options and formatting flags.

I didn’t find a quick answer to the question: Which *nix command has the most options and formatting flags?

If you have a candidate, sing out!

Combining Neo4J and Hadoop (part II)

Filed under: Graphs,Hadoop,Neo4j — Patrick Durusau @ 12:20 pm

Combining Neo4J and Hadoop (part II) by Kris Geusebroek.

From the post:

In the previous post Combining Neo4J and Hadoop (part I) we described the way we combine Hadoop and Neo4J and how we are getting the data into Neo4J.

In this second part we will take you through the journey we took to implement a distributed way to create a Neo4J database. The idea is to use our Hadoop cluster for creating the underlying file structure of a Neo4J database.

To do this we must first understand this file-structure. Luckily Chris Gioran has done a great job describing this structure in his blog Neo4J internal file storage.

The description was done for version 1.6 but largely still matches the 1.8 file-structure.

First I’ll start with a small recap of the file-structure.

The Chris Gioran post has been updated at: Rooting out redundancy – The new Neo4j Property Store.

Internal structures influence what you can or can’t easily say. Best to know about those structures in advance.

RICON East 2013 [videos, slides, resources]

Filed under: Erlang,Riak — Patrick Durusau @ 12:05 pm

RICON East 2013 [videos, slides, resources]

I have sorted (by author) and included the abstracts for the RICON East presentations. The RICON East webpage has links to blog entries about the conference.

Enjoy!


Brian Akins, Large Scale Data Service as a Service
Slides | Video

Turner Broadcasting hosts several large sites that need to serve “data” to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.


Neil Conway, Bloom: Big Systems from Small Programs
Slides | Video

Distributed systems are ubiquitous, but distributed programs remain stubbornly hard to write. While many distributed algorithms can be concisely described, implementing them requires large amounts of code–often, the essence of the algorithm is obscured by low-level concerns like exception handling, task scheduling, and message serialization. This results in programs that are hard to write and even harder to maintain. Can we do better?

Bloom is a new programming language we’ve developed at UC Berkeley that takes two important steps towards improving distributed systems development. First, Bloom programs are designed to be declarative and concise, aided by a new philosophy for reasoning about state and time. Second, Bloom can analyze distributed programs for their consistency requirements and either certify that eventual consistency is sufficient, or identify program locations where stronger consistency guarantees are needed. In this talk, I’ll introduce the language, and also suggest how lessons from Bloom can be adopted in other distributed programming stacks.


Sean Cribbs, Just Open a Socket – Connecting Applications to Distributed Systems
Slides | Video

Client-server programming is a discipline as old as computer networks and well-known. Just connect socket to the server and send some bytes back and forth, right?

Au contraire! Building reliable, robust client libraries and applications is actually quite difficult, and exposes a lot of classic distributed and concurrent programming problems. From understanding and manipulating the TCP/IP network stack, to multiplexing connections across worker threads, to handling partial failures, to juggling protocols and encodings, there are many different angles one must cover.

In this talk, we’ll discuss how Basho has addressed these problems and others in our client libraries and server-side interfaces for Riak, and how being a good client means being a participant in the distributed system, rather than just a spectator.


Reid Draper, Advancing Riak CS
Slides | Video

Riak CS has come a long way since it was first released in 2012, and then open sourced in March 2013. We’ll take a look at some of the features and improvements in the recently released Riak CS 1.3.0, and planned for the future, like better integration with CloudStack and OpenStack. Next, we’ll go over some of the Riak CS guts that deployers should understand in order to successfully deploy, monitor and scale Riak CS.


Camille Fournier, ZooKeeper for the Skeptical Architect
Slides | Video

ZooKeeper is everywhere these days. It’s a core component of the Hadoop ecosystem. It provides the glue that enables high availability for systems like Redis and Solr. Your favorite startup probably uses it internally. But as every good skeptic knows, just because something is popular doesn’t mean you should use it. In this talk I will go over the core uses of ZooKeeper in the wild and why it is suited to these use cases. I will also talk about systems that don’t use ZooKeeper and why that can be the right decision. Finally I will discuss the common challenges of running ZooKeeper as a service and things to look out for when architecting a deployment.


Sathish Gaddipati, Building a Weather Data Services Platform on Riak
Slides | Video

In this talk Sathish will discuss the size, complexity and use cases surrounding weather data services and analytics, which will entail an overview of the architecture of such systems and the role of Riak in these patterns.


Sunny Gleason, Riak Techniques for Advanced Web & Mobile Application Development
Slides | Video

In recent years, there have been tremendous advances in high-performance, high-availability data storage for scalable web and mobile application development. Often times, these NoSQL solutions are portrayed as sacrificing the crispness and rapid application development features of relational database alternatives. In this presentation, we show the amazing things that are possible using a variety of techniques to apply Riak’s advanced features such as map-reduce, search, and secondary indexes. We review each feature in the context of a demanding real-world Ruby & Javascript “Pinterest clone” application with advanced features such as real-time updates via Websocket, comment feeds, content quarantining, permissions, search and social graph modeling. We pay specific attention to explaining the ‘why’ of these Riak techniques for high-performance, high availability applications, not just the ‘how’.


Andy Gross, Lessons Learned and Questions Raised from Building Distributed Systems
Slides | Video


Shawn Gravelle and Sam Townsend, High Availability with Riak and PostgreSQL
Slides | Video

This talk will cover work to build out an internal cloud offering using Riak and PostgreSQL as a data layer, architectural decisions made to achieve high availability, and lessons learned along the way.


Rich Hickey, Using Datomic with Riak
Video

Rich Hickey, the author of Clojure and designer of Datomic, is a software developer with over 20 years of experience in various domains. Rich has worked on scheduling systems, broadcast automation, audio analysis and fingerprinting, database design, yield management, exit poll systems, and machine listening, in a variety of languages.


James Hughes, Revolution in Storage
Slides | Video

The trends of technology are rocking the storage industry. Fundamental changes in basic technology, combined with massive scale, new paradigms, and fundamental economics, lead to predictions of a new storage programming paradigm. The growth of low-cost/GB disk is continuing with technologies such as Shingled Magnetic Recording. Flash and RAM are continuing to scale, with roadmaps that some argue reach down to atom scale. These technologies do not come without a cost. It is time to reevaluate the interfaces we use to access all kinds of storage: RAM, flash and disk. The discussion starts with the unique economics of storage (as compared to processing and networking), discusses technology changes, posits a set of open questions and ends with predictions of fundamental shifts across the entire storage hierarchy.


Kyle Kingsbury, Call Me Maybe: Carly Rae Jepsen and the Perils of Network Partitions
Slides | Code | Video

Network partitions are real, but their practical consequences on complex applications are poorly understood. I want to talk about some of the neat ways I’ve found to lose important data, the challenge of building systems which are reliable under partitions, and what it means for you, an application developer.


Hilary Mason, Realtime Systems for Social Data Analysis
Slides | Video

It’s one thing to have a lot of data, and another to make it useful. This talk explores the interplay between infrastructure, algorithms, and data necessary to design robust systems that produce useful and measurable insights for realtime data products. We’ll walk through several examples and discuss the design metaphors that bitly uses to rapidly develop these kinds of systems.


Michajlo Matijkiw, Firefighting Riak at Scale
Slides | Video

Managing a business-critical Riak instance in an enterprise environment takes careful planning, coordination, and the willingness to accept that no matter how much you plan, Murphy’s law will always win. At CIM we’ve been running Riak in production for nearly 3 years, and over those years we’ve seen our fair share of failures, both expected and unexpected. From disk meltdowns to solar flares, we’ve managed to recover and maintain 100% uptime with no customer impact. I’ll talk about some of these failures, how we dealt with them, and how we managed to keep our clients completely unaware.


Neha Narula, Why Is My Cache So Dumb? Smarter Caching with Pequod
Slides | Video

Pequod is a key/value cache we’re developing at MIT and Harvard that automatically updates cached data to keep it fresh. Pequod exploits a common pattern in the computations that produce cached data: different kinds of cached data are often related to each other by transformations equivalent to simple joins, filters, and aggregations. Pequod allows applications to pre-declare these transformations with a new abstraction, the cache join. Pequod then automatically applies the transformations and tracks relationships to materialize data and keep the cache up to date, and in many cases improves performance by reducing client/cache-server communication. Sound like a database? We use abstractions from databases, like joins and materialized views, while still maintaining the performance of an in-memory key/value cache.

In this talk, I’ll describe the challenges caching solves, the problems that still exist, and how tools like Pequod can make the space better.
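
The abstract doesn’t show Pequod’s actual interface, so the following toy Python sketch — entirely my own illustration, not Pequod’s API — only conveys the general idea of a declared cache join: derived entries are kept fresh on write instead of being invalidated and recomputed.

```python
class CacheJoin:
    """Toy cache where a declared join keeps derived entries materialized."""

    def __init__(self):
        self.base = {}      # e.g. "follows|alice" -> set of followees
        self.derived = {}   # e.g. "timeline|alice" -> list of posts

    def follow(self, follower, followee):
        self.base.setdefault("follows|" + follower, set()).add(followee)

    def put_post(self, user, post):
        self.base.setdefault("posts|" + user, []).append(post)
        # Instead of invalidating followers' timelines, apply the declared
        # transformation and keep the materialized entries up to date.
        for key, value in self.base.items():
            if key.startswith("follows|") and user in value:
                follower = key.split("|", 1)[1]
                self.derived.setdefault("timeline|" + follower, []).append(post)


cache = CacheJoin()
cache.follow("alice", "bob")
cache.put_post("bob", "hello from bob")
assert cache.derived["timeline|alice"] == ["hello from bob"]
```

The real system does this declaratively and at scale; the sketch just shows why pushing the join into the cache saves a round of client/cache-server traffic.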


Alex Payne, Nobody ever got fired for picking Java: evaluating emerging programming languages for business-critical systems
Slides | Video

When setting out to build greenfield systems, engineers today have a broader choice of programming language than ever before. Over the past decade, language development has accelerated dramatically thanks to mature runtimes like the JVM and CLR, not to mention the prevalence of near-universal targets for cross-compilation like JavaScript. With strong technological foundations to build on and an active open source community, modern languages can evolve from rough hobbyist projects into capable tools in a stunningly short period of time. With so many strong contenders emerging every day, how do you decide what language to bet your business on? We’ll explore the landscape of new languages and provide a decision-making framework you can use to narrow down your choices.


Theo Schlossnagle and Robert Treat, How Do You Eat An Elephant?
Slides | Video

When OmniTI first set out to build a next-generation monitoring system, we turned to one of our most trusted tools for data management: Postgres. While this worked well for developing the initial open source application, as we continued to grow the Circonus public monitoring service, we eventually ran into scaling issues. This talk will cover some of the changes we made to make the original Postgres system work better, talk about some of the other systems we evaluated, and discuss the eventual solution to our problem: building our own time-series database. Of course, that’s only half the story. We’ll also go into how we swapped out these backend data storage pieces in our production environment, all the while capturing and reporting on millions of metrics, without downtime or customer interruption.


Dr. Margo Seltzer, Automatically Scalable Computation
Slides | Video

As our computational infrastructure races gracefully forward into increasingly parallel multi-core and blade-based systems, our ability to easily produce software that can successfully exploit such systems continues to stumble. For years, we’ve fantasized about the world in which we’d write simple, sequential programs, add magic sauce, and suddenly have scalable, parallel executions. We’re not there. We’re not even close. I’ll present trajectory-based execution, a radical, potentially crazy, approach for achieving automatic scalability. To date, we’ve achieved surprisingly good speedup in limited domains, but the potential is tantalizingly enormous.


Chris Tilt, Riak Enterprise Revisited
Slides | Video

Riak Enterprise has undergone an overhaul since its 1.2 days, mostly around Multi-Datacenter Replication. We’ll talk about the “Brave New World” of replication in depth: how it manages concurrent TCP/IP connections, Realtime Sync, and the technology preview of Active Anti-Entropy Fullsync. Finally, we’ll peek over the horizon at new features such as chaining of Realtime Sync messages across multiple clusters.


Sam Townsend, High Availability with Riak and PostgreSQL
Slides | Video


Mark Wunsch, Scaling Happiness Horizontally
Slides | Video

This talk will discuss how Gilt has grown its technology organization to optimize for engineer autonomy and happiness and how that optimization has affected its software. Conway’s Law states that an organization that designs systems will inevitably produce systems that are copies of the communication structures of the organization. This talk will work its way between both the (gnarly) technical details of Gilt’s application architecture (something we internally call “LOSA”) and the Gilt Tech organization structure. I’ll discuss the technical challenges we came up against, and how these often pointed out areas of contention in the organization. I’ll discuss quorums, failover, and latency in the context of building a distributed, decentralized, peer-to-peer technical organization.


Matthew Von-Maszewski, Optimizing LevelDB for Performance and Scale
Slides | Video

LevelDB is a flexible key-value store written by Google and open sourced in August 2011. LevelDB provides an ordered mapping of binary keys to binary values. Various companies and individuals utilize LevelDB on cell phones and servers alike. The problem, however, is that it does not run optimally on either as shipped.

This presentation outlines the basic internal mechanisms of LevelDB and then proceeds to discuss the tuning opportunities in the source code for each mechanism. This talk will draw heavily from our experience optimizing LevelDB for use in Riak, which comes in handy when running sufficiently large clusters.
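
If you want to poke at that “ordered mapping of binary keys to binary values” yourself, the plyvel binding (my choice of Python binding, not something from the talk; Riak itself uses an Erlang binding) makes the point in a few lines:

```python
import plyvel

# LevelDB keeps keys in sorted order, so range scans are just iterators.
db = plyvel.DB("/tmp/ricon-demo-db", create_if_missing=True)

db.put(b"user:0001", b"alice")
db.put(b"user:0002", b"bob")
db.put(b"user:0010", b"carol")

# Ordered iteration over a key range -- the core primitive that backends
# built on LevelDB rely on.
for key, value in db.iterator(start=b"user:", stop=b"user:~"):
    print(key, value)

db.close()
```

The tuning story in the talk is about what happens underneath calls like these: memtables, compactions, block caches and write throttling.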


Ryan Zezeski, Yokozuna: Distributed Search You Don’t Think About
Slides | Video

Allowing users to run arbitrary and complex searches against your data is a feature required by most consumer-facing applications. For example, the ability to get ranked results based on free-text search and subsequently drill down on that data based on secondary attributes is at the heart of any good online retail shop. Not only must your application support complex queries such as “doggy treats in a 2-mile radius, broken down by popularity,” but it must also return results in hundreds of milliseconds or less to keep users happy. This is what systems like Solr are built for. But what happens when the index is too big to fit on a single node? What happens when replication is needed for availability? How do you give correct answers when the index is partitioned across several nodes? These are the problems of distributed search. These are some of the problems Yokozuna solves for you without making you think about it.

In this talk Ryan will explain what search is, why it matters, what problems distributed search brings to the table, and how Yokozuna solves them. Yokozuna provides distributed and available search while appearing to be a single-node Solr instance. This is very powerful for developers and ops professionals.
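
To see why “appearing to be a single-node Solr instance” matters, here is roughly what a query looks like from the Python Riak client. Method names follow the client of that era and may differ in your version, and the index setup (creating a search index named “pins” and associating it with the bucket) is assumed to have been done already.

```python
import riak

client = riak.RiakClient(protocol="pbc", pb_port=8087)

# Store a document in a bucket associated with the "pins" search index.
bucket = client.bucket("pins")
bucket.new("pin-1", data={"title_s": "doggy treats", "popularity_i": 42}).store()

# Plain Solr query syntax; the cluster-wide fan-out, merging and ranking
# happen behind the scenes.
results = client.fulltext_search("pins", "title_s:doggy AND popularity_i:[10 TO *]")
for doc in results["docs"]:
    print(doc)
```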


I first saw this in a tweet by Alex Popescu.

PS: If more videos go up and I miss them, please ping me. Thanks!

July 14, 2013

Solr vs ElasticSearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 7:14 pm

Solr vs ElasticSearch by Ryan Tabora.

Ryan evaluates Solr and ElasticSearch (both based on Lucene) in these categories:

  1. Foundations
  2. Coordination
  3. Shard Splitting
  4. Automatic Shard Rebalancing
  5. Schema
  6. Schema Creation
  7. Nested Typing
  8. Queries
  9. Distributed Group By
  10. Percolation Queries
  11. Community
  12. Vendor Support

As Ryan points out, making a choice between Solr and ElasticSearch requires detailed knowledge of your requirements.

If you are a developer, I would suggest following Lucene, as well as Solr and ElasticSearch.

No one tool is going to be the right tool for every job.

Unlocking the Big Data Silos Through Integration

Filed under: BigData,Data Integration,Data Silos,ETL,Silos — Patrick Durusau @ 7:01 pm

Unlocking the Big Data Silos Through Integration by Theo Priestly.

From the post:

Big Data, real-time and predictive analytics present companies with the unparalleled ability to understand consumer behavior and ever-shifting market trends at a relentless pace in order to take advantage of opportunity.

However, organizations are entrenched and governed by silos; data resides across the enterprise in the same way, waiting to be unlocked. Information sits in different applications, on different platforms, fed by internal and external sources. It’s a CIO’s headache when the CEO asks why the organization can’t take advantage of it. According to a recent survey, 54% of organizations state that managing data from various sources is their biggest challenge when attempting to make use of the information for customer analytics.

(…)

Data integration. Again?

A problem that just keeps on giving. The result of every ETL operation is a data set that needs another ETL operation sooner or later.

If topic maps were presented not as a competing model but as a way to model your information for re-integration, time after time, that would be a competitive advantage.

Both for topic maps and your enterprise.

CrowdSource:… [BlackHat USA July 27 – August 1, 2013]

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:48 pm

CrowdSource: Crowd Trained Machine Learning Model for Malware Capability Detection by Joshua Saxe.

Abstract:

Due to the exploding number of unique malware binaries on the Internet and the slow process required for manually analyzing these binaries, security practitioners today have only limited visibility into the functionality implemented by the global population of malware. To date, little work has focused explicitly on quickly and automatically detecting the broad range of high-level malware functionality, such as the ability of malware to take screenshots, communicate via IRC, or surreptitiously operate users’ webcams.

To address this gap, we debut CrowdSource, an open source, machine-learning-based reverse engineering tool. CrowdSource approaches the problem of malware capability identification in a novel way, by training a malware capability detection engine on millions of technical documents from the web. Our intuition for this approach is that malware reverse engineers already rely heavily on the web “crowd” (performing web searches to discover the purpose of obscure function calls and byte strings, for example), so automated approaches, using the tools of machine learning, should also take advantage of this rich and as yet untapped data source.

As a novel malware capability detection approach, CrowdSource does the following:

  1. Generates a list of detected software capabilities for novel malware samples (such as the ability of malware to communicate via a particular protocol, perform a given data exfiltration activity, or load a device driver);
  2. Provides traceable output for capability detections by including “citations” to the web technical documents that detections are based on;
  3. Provides probabilistic malware capability detections when appropriate: e.g., system output may read, “given the following web documents as evidence, it is 80% likely the sample uses IRC as a C2 channel, and 70% likely that it also encrypts this traffic.”

CrowdSource is funded under the DARPA Cyber Fast Track initiative, is being developed by the machine learning and malware analysis group at Invincea Labs and is scheduled for beta, open source release to the security community this October. In this presentation we will give complete details on our algorithm for CrowdSource as it stands, including compelling results that demonstrate that CrowdSource can already rapidly reverse engineer a variety of currently active malware variants.

If you attend the conference, be sure to blog about this presentation in detail.

Of particular interest are the identification techniques.

The underlying reasoning is that the nature of one bug in software is likely repeated elsewhere in the same software.

And you can use those techniques against more traditional “malware” as well.
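
The abstract doesn’t give the algorithm, but the general idea — scoring capabilities from textual evidence — can be sketched with a toy scikit-learn classifier. This is entirely my own illustration with made-up training data, not CrowdSource’s approach or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, fabricated training set: web-document snippets labelled with the
# capability they describe.
docs = [
    "PRIVMSG and JOIN commands are part of the IRC protocol",
    "IRC bots connect to a channel and wait for commands",
    "BitBlt and GetDC are commonly used to capture the screen",
    "taking a screenshot of the desktop with the GDI API",
]
labels = ["irc_c2", "irc_c2", "screenshot", "screenshot"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# Strings pulled from an unknown binary become the "query document".
strings_from_sample = "JOIN #cmd PRIVMSG NICK"
probs = model.predict_proba([strings_from_sample])[0]
for capability, p in zip(model.classes_, probs):
    print(f"{capability}: {p:.2f}")
```

A toy, obviously, but it makes the probabilistic output described in item 3 of the abstract easy to picture.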
