Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 12, 2012

Paxos Made Moderately Complex

Filed under: Paxos,Scalability — Patrick Durusau @ 4:16 pm

Paxos Made Moderately Complex

From the post:

If you are a normal human being and find the Paxos protocol confusing, then this paper, Paxos Made Moderately Complex, is a great find. Robbert van Renesse from Cornell University has written a clear and well-written paper with excellent explanations.

The Abstract:

For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details. The initial description avoids optimizations that complicate comprehension. Next we discuss liveness, and list various optimizations that make the protocol practical.

If you need safety (“freedom from inconsistency”) and fault-tolerant topic map results, you may want to spend some quality time with this paper.

As with most things, user requirements are going to drive the choices you have to make.

It is hard for me to see a “loosely consistent” merging system as useful, but for TV entertainment data it may be enough. Who is sleeping with whom probably has a reporting lag anyway.

For more serious data, Paxos may be your protocol of choice.
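
If you want a taste of the moving parts before reading the paper, here is a toy sketch (my own, not the paper’s pseudo-code) of a single acceptor’s promise/accept logic in Python:

```python
# A toy sketch (my own, not the paper's pseudo-code) of one Paxos acceptor.
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot number promised so far
        self.accepted = []      # (ballot, slot, command) triples accepted so far

    def on_prepare(self, ballot):
        """Phase 1: promise to ignore anything proposed under a lower ballot."""
        if ballot > self.promised:
            self.promised = ballot
        # Reply with the promise plus everything already accepted, so a new
        # leader can adopt values that may already have been chosen.
        return ("promise", self.promised, list(self.accepted))

    def on_accept(self, ballot, slot, command):
        """Phase 2: accept the proposal unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted.append((ballot, slot, command))
        return ("accepted", self.promised)

a = Acceptor()
print(a.on_prepare(ballot=1))
print(a.on_accept(ballot=1, slot=0, command="set x = 1"))
```

Safety rests on a value being chosen only once a majority of acceptors has accepted it for the same ballot and slot; the paper adds replicas, leaders, liveness, and the optimizations that make the protocol practical.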

Cell Architectures (adding dashes of heterogeneity)

Cell Architectures

From the post:

A consequence of Service Oriented Architectures is the burning need to provide services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
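
To make the shard-to-cell idea concrete, here is a minimal sketch (my own illustration, not from the post) of routing users to cells by user-id range:

```python
# A minimal sketch of routing users to cells: each cell owns a shard,
# i.e. a contiguous range of user ids.  Boundaries and names are invented.
import bisect

CELL_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]   # exclusive upper bounds
CELLS = ["cell-a", "cell-b", "cell-c"]

def cell_for_user(user_id: int) -> str:
    """Return the self-contained cell that serves this user's shard."""
    i = bisect.bisect_right(CELL_BOUNDARIES, user_id)
    if i >= len(CELLS):
        raise ValueError("no cell provisioned for this user id; add a cell")
    return CELLS[i]

# Adding capacity is incremental: provision a new cell and extend the map.
CELL_BOUNDARIES.append(4_000_000)
CELLS.append("cell-d")

print(cell_for_user(42))          # cell-a
print(cell_for_user(3_500_000))   # cell-d
```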

The intersection of semantic heterogeneity and scaling remains largely unexplored.

I suggest scaling in a homogeneous environment and then adding dashes of heterogeneity to see what breaks.

Adjust and try again.

Outlier detection in two review articles (Part 1)

Filed under: Data Mining,Outlier Detection — Patrick Durusau @ 3:38 pm

Outlier detection in two review articles (Part 1) by Sandro Saitta.

Sandro writes:

The first one, Outlier Detection: A Survey, is written by Chandola, Banerjee and Kumar. They define outlier detection as the problem of “[…] finding patterns in data that do not conform to expected normal behavior“. After an introduction to what outliers are, authors present current challenges in this field. In my experience, non-availability of labeled data is a major one.

One of their main conclusions is that “[…] outlier detection is not a well-formulated problem“. It is your job, as a data miner, to formulate it correctly.

The final quote seems particularly well suited to subject identity issues. While any one subject identity may be well defined, the question is how to find and manage other subject identifications that may not be well defined.
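
To make the “formulate it correctly” point concrete, here is a toy sketch (mine, not from the survey) of the simplest possible formulation: flag values that stray too far from “expected normal behavior.”

```python
# A toy z-score formulation of outlier detection.
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 47.0]   # one obvious anomaly
print(zscore_outliers(data))                      # -> [47.0]
```

Even here the threshold and the notion of “normal” are formulation choices, which is exactly the job the survey leaves to the data miner.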

As Sandro points out, it has nineteen (19) pages of references. However, only nine of those are as recent as 2007; all the rest are older. I am sure it remains an excellent reference source, but I suspect more recent review articles on outlier detection exist.

Suggestions?

CDH3 update 4 is now available

Filed under: Flume,Hadoop,HBase,MapReduce — Patrick Durusau @ 3:24 pm

CDH3 update 4 is now available by David Wang.

From the post:

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

Meta-tools for exploring explanations

Filed under: Graphics,Visualization — Patrick Durusau @ 3:16 pm

Meta-tools for exploring explanations

Jon Udell writes:

At the Canadian University Software Engineering Conference in January, Bret Victor gave a brilliant presentation that continues to resonate in the technical community. No programmer could fail to be inspired by Bret’s vision, which he compellingly demonstrated, of a system that makes software abstractions visual, concrete, and directly manipulable. Among the inspired were Eric Maupin and Chris Granger, both of whom quickly came up with their own implementations — in C# and ClojureScript — of the ideas Bret Victor had fleshed out in JavaScript.

Here is an example of the sort of problem Jon thinks we can address:

We need robust explorable explanations that state assumptions, link to supporting data, and assemble context that enables us to cross-check assumptions and evaluate consequences.

And we need them everywhere, for everything. Consider, for example, the current debate about fracking. We’re having this conversation because, as Daniel Yergin explains in The Quest, a natural gas revolution has gotten underway pretty recently. There’s a lot more of it available than was thought, particularly in North America, and we can recover it and burn it a lot more cleanly than the coal that generates so much of our electric power. Are there tradeoffs? Of course. There are always tradeoffs. What cripples us is our inability to evaluate them. We isolate every issue, and then polarize it. Economist Ed Dolan writes:

These anti-frackers have a simple solution: ban it.

The pro-frackers, too, have a simple solution: get the government out of the way and drill baby, drill.

The environmental impacts of fracking are a real problem, but one to which neither prohibition nor laissez faire seems a sensible solution. Instead, we should look toward mitigation of impacts using economic tools that have been applied successfully in the case of other environmental harms.

In order to do that, we’ve got to be able to put people in both camps in front of an explorable explanation with a slider that varies how much natural gas we choose to produce, linked to other sliders that vary what we pay, in dollars, lives, and environmental impact, not only for fracking but also for coal production and use, for Middle East wars, and so on.

Whatever your position on mapping discussions and dialogues, you will find this an interesting essay.

Jon points to other resources by Bret Victor:

Explorable Explanations (essay)

Ten Brighter Ideas (demo for Explorable Explanations)

Magic Ink (book-length essay, 2006)

May 11, 2012

Data journalism handbook: Tips for Working with Numbers in the News

Filed under: Data,News — Patrick Durusau @ 6:40 pm

In Data journalism handbook: Tips for Working with Numbers in the News, Michael Blastland offers some short tips that will ease you towards becoming a data journalist.

You might want to print out Michael’s tips and keep them close at hand.

After a while you may want to add your own tips about particular data sources.

Or better yet, share them with others!

Oh, btw, the Data Journalism Handbook.

Clustering by hypergraphs and dimensionality of cluster systems

Filed under: Clustering,Graphs,Hyperedges,Hypergraphs — Patrick Durusau @ 6:39 pm

Clustering by hypergraphs and dimensionality of cluster systems by S. Albeverio and S.V. Kozyrev.

Abstract:

In the present paper we discuss the clustering procedure in the case where instead of a single metric we have a family of metrics. In this case we can obtain a partially ordered graph of clusters which is not necessarily a tree. We discuss a structure of a hypergraph above this graph. We propose two definitions of dimension for hyperedges of this hypergraph and show that for the multidimensional p-adic case both dimensions are reduced to the number of p-adic parameters.

We discuss the application of the hypergraph clustering procedure to the construction of phylogenetic graphs in biology. In this case the dimension of a hyperedge will describe the number of sources of genetic diversity.
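
As a rough illustration of the basic idea (my construction, not the authors’), here is what clustering the same points under a family of metrics looks like, with the resulting clusters partially ordered by inclusion:

```python
# Cluster the same points under several metrics and collect all clusters into
# one partially ordered system, which in general is not a tree.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))

def clusters_for(metric, k=3):
    """Cluster under one metric; return clusters as frozensets of point indices."""
    Z = linkage(pdist(points, metric=metric), method="average")
    labels = fcluster(Z, t=k, criterion="maxclust")
    return {frozenset(np.flatnonzero(labels == c)) for c in set(labels)}

cluster_system = set()
for metric in ("euclidean", "cityblock", "cosine"):   # the family of metrics
    cluster_system |= clusters_for(metric)

# Partial order by set inclusion: with several metrics, clusters can overlap
# or nest in ways no single tree captures.
inclusions = [(a, b) for a in cluster_system for b in cluster_system if a < b]
print(len(cluster_system), "distinct clusters,", len(inclusions), "inclusions")
```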

A pleasant reminder that hypergraphs and hyperedges are simplifications of the complexity we find in nature.

If hypergraphs/hyperedges are simplifications, what would you call a graph/edges?

A simplification of a simplification?

Graphs are useful sometimes.

Useful sometimes doesn’t mean useful at all times.

Picard and Dathon at El-Adrel

Filed under: bigdata®,Graphs,SQL — Patrick Durusau @ 4:50 pm

Orri Erling’s account of meeting Bryan Thompson reminded me of Picard and Dathon at El-Adrel, albeit with happier results.

See what you think:

I gave an invited talk (“Virtuoso 7 – Column Store and Adaptive Techniques for Graph” (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.

Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.

So, we could do it like Bigdata®, or we could add a “quad ID” column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.

Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.
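
The string-packing trick is easy to illustrate outside any particular engine. A toy sketch in Python (my own, with invented prefixes and values):

```python
# Pack the fields of a triple into a single string and use that string as the
# subject of further statements (provenance, confidence, etc.).
def reify(s, p, o):
    """Pack a triple into one opaque key that can stand in an S or O position."""
    return f"<<{s}|{p}|{o}>>"

triples = set()
triples.add(("ex:Picard", "ex:metAt", "ex:El-Adrel"))

# Statements *about* the statement, keyed by the packed form:
stmt = reify("ex:Picard", "ex:metAt", "ex:El-Adrel")
triples.add((stmt, "prov:assertedBy", "ex:Dathon"))
triples.add((stmt, "prov:confidence", "0.9"))

# Selecting on parts of the reified statement means unpacking the key:
subject, predicate, obj = stmt.strip("<>").split("|")
print(subject, predicate, obj)
```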

Pushing past the metaphors it sounds like both Orri and Bryan are working on interesting projects. 😉

Nuts and Bolts of Data Mining: Correlation & Scatter Plots

Filed under: Correlation,Statistics — Patrick Durusau @ 4:19 pm

Nuts and Bolts of Data Mining: Correlation & Scatter Plots by Tim Graettinger.

From the post:

In this article, I continue the “Nuts and Bolts of Data Mining” series. We will tackle two, intertwined tools/topics this time: correlation and scatter plots. These tools are fundamental for gauging the relationship (if any) between pairs of data elements. For instance, you might want to view the relationship between the age and income of your customers as a scatter plot. Or, you might compute a number that is the correlation between these two customer demographics. As we’ll soon see, there are good, bad, and ugly things that can happen when you apply a purely computational method like correlation. My goal is to help you avoid the usual pitfalls, so that you can use correlation and scatter plots effectively in your own work.

You will smile at the examples but if the popular press is any indication, correlation is no laughing matter!

Tim’s post won’t turn the tide, but it is short enough to forward to the local broadside folks.
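
If you want to play along at home, here is a minimal sketch (my own toy numbers, not Tim’s) of the Pearson correlation that sits behind a scatter plot:

```python
# Pearson correlation from scratch on a toy age/income sample.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

age    = [23, 31, 38, 45, 52, 60, 67]
income = [28_000, 41_000, 52_000, 61_000, 72_000, 70_000, 43_000]  # note the drop at the end

print(round(pearson(age, income), 2))   # a moderate positive r, despite the obvious bend
```

A single r value can hide the bend at the right-hand end of the data that a scatter plot shows immediately, which is Tim’s point about the good, bad, and ugly of purely computational methods.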

Who Do You Say You Are?

Filed under: Data Governance,Identification,Identity — Patrick Durusau @ 3:55 pm

In Data Governance in Context, Jim Ericson outlines several paths of data governance (or, as I put it, Who Do You Say You Are?):

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

Some topic maps, say on drug trials, need a high degree of reliability and auditability, as well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)
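
A hedged sketch of that check (invented entries, arbitrary threshold):

```python
# Group a day's entries by operator and flag anyone whose values repeat far
# more often than real data tends to.
from collections import Counter, defaultdict

entries = [
    ("op1", "120"), ("op1", "134"), ("op1", "118"), ("op1", "127"),
    ("op2", "100"), ("op2", "100"), ("op2", "100"), ("op2", "100"),
]

by_operator = defaultdict(list)
for operator, value in entries:
    by_operator[operator].append(value)

for operator, values in by_operator.items():
    most_common_share = Counter(values).most_common(1)[0][1] / len(values)
    if most_common_share > 0.75:      # arbitrary threshold for the sketch
        print(f"{operator}: {most_common_share:.0%} identical values, review protocol")
```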

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)

Evaluating the Design of the R Language

Filed under: Language,Language Design,R — Patrick Durusau @ 3:33 pm

Evaluating the Design of the R Language

Sean McDirmid writes:

From our recent discussion on R, I thought this paper deserved its own post (ECOOP final version) by Floreal Morandat, Brandon Hill, Leo Osvald, and Jan Vitek; abstract:

R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.

Excerpts from the paper:

R comes equipped with a rather unlikely mix of features. In a nutshell, R is a dynamic language in the spirit of Scheme or JavaScript, but where the basic data type is the vector. It is functional in that functions are first-class values and arguments are passed by deep copy. Moreover, R uses lazy evaluation by default for all arguments, thus it has a pure functional core. Yet R does not optimize recursion, and instead encourages vectorized operations. Functions are lexically scoped and their local variables can be updated, allowing for an imperative programming style. R targets statistical computing, thus missing value support permeates all operations.

One of our discoveries while working out the semantics was how eager evaluation of promises turns out to be. The semantics captures this with C[]; the only cases where promises are not evaluated are in the arguments of a function call and when promises occur in a nested function body; all other references to promises are evaluated. In particular, it was surprising and unnecessary to force assignments as this hampers building infinite structures. Many basic functions that are lazy in Haskell, for example, are strict in R, including data type constructors. As for sharing, the semantics clearly demonstrates that R prevents sharing by performing copies at assignments.

The R implementation uses copy-on-write to reduce the number of copies. With superassignment, environments can be used as shared mutable data structures. The way assignment into vectors preserves the pass-by-value semantics is rather unusual and, from personal experience, it is unclear if programmers understand the feature. … It is noteworthy that objects are mutable within a function (since fields are attributes), but are copied when passed as an argument.

Perhaps not immediately applicable to a topic map task today but I would argue very relevant for topic maps in general.

In part because it is a reminder that when we write topic maps, topic map interfaces, or languages to be used with topic maps, we are fashioning languages. Languages that will, or perhaps will not, fit how our users view the world and how they tend to formulate queries or statements.

The test for an artificial language should be whether users have to stop to consider the correctness of their writing. Every pause is a sign that an error may be about to occur. Will they remember that this is an SVO language? Or is the terminology a familiar one?

Correcting the errors of others may “validate” your self-worth but is that what you want as the purpose of your language?

(I saw this at Christophe Lalanne’s blog.)

Crowdsourcing – A Solution to your “Bad Data” Problems

Filed under: Crowd Sourcing,Data Quality — Patrick Durusau @ 3:11 pm

Crowdsourcing – A Solution to your “Bad Data” Problems by Hollis Tibbetts.

Hollis writes:

Data problems – whether they be inaccurate data, incomplete data, data categorization issues, duplicate data, data in need of enrichment – are age-old.

IT executives consistently agree that data quality/data consistency is one of the biggest roadblocks to them getting full value from their data. Especially in today’s information-driven businesses, this issue is more critical than ever.

Technology, however, has not done much to help us solve the problem – in fact, technology has resulted in the increasingly fast creation of mountains of “bad data”, while doing very little to help organizations deal with the problem.

One “technology” holds much promise in helping organizations mitigate this issue – crowdsourcing. I put the word technology in quotation marks – as it’s really people that solve the problem, but it’s an underlying technology layer that makes it accurate, scalable, distributed, connectable, elastic and fast. In an article earlier this week, I referred to it as “Crowd Computing”.

Crowd Computing – for Data Problems

The Human “Crowd Computing” model is an ideal approach for newly entered data that needs to either be validated or enriched in near-realtime, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:

  • Verification of correctness
  • Data conflict and resolution between different data sources
  • Judgment calls (such as determining relevance, format or general “moderation”)
  • “Fuzzy” referential integrity judgment
  • Data error corrections
  • Data enrichment or enhancement
  • Classification of data based on attributes into categories
  • De-duplication of data items
  • Sentiment analysis
  • Data merging
  • Image data – correctness, appropriateness, appeal, quality
  • Transcription (e.g. hand-written comments, scanned content)
  • Translation

In areas such as the Data Warehouse, Master Data Management or Customer Data Management, Marketing databases, catalogs, sales force automation data, inventory data – this approach is ideal – or any time that business data needs to be enriched as part of a business process.

Hollis has a number of good points. But the choice doesn’t have to be “big data/iron” versus “crowd computing.”

More likely to get useful results out of some combination of the two.

Make “big data/iron” responsible for raw access, processing, visualization in an interactive environment with semantics supplied by the “crowd computers.”

And vet participants on both sides in real time. It would be a novel thing to have firms competing to supply the interactive environment, paid on the basis of which one the “crowd computers” preferred or got better results with.

That is a ways past where Hollis is going but I think it leads naturally in that direction.
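
As a toy illustration of the aggregation layer such a setup would need (my construction, nothing from Hollis’s article), here is a weighted vote over crowd de-duplication judgments, with low-margin items flagged for another pass:

```python
# Several workers judge whether two records are duplicates; votes are weighted
# by each worker's (hypothetical) historical accuracy, and close calls get
# routed back to the crowd.
from collections import defaultdict

worker_accuracy = {"w1": 0.95, "w2": 0.80, "w3": 0.60}

# (record_pair, worker, judgment) -- True means "these are duplicates".
judgments = [
    ("A~B", "w1", True), ("A~B", "w2", True), ("A~B", "w3", False),
    ("C~D", "w1", False), ("C~D", "w2", True), ("C~D", "w3", True),
]

scores = defaultdict(float)
for pair, worker, is_dup in judgments:
    scores[pair] += worker_accuracy[worker] if is_dup else -worker_accuracy[worker]

for pair, score in scores.items():
    if abs(score) < 0.5:                      # arbitrary margin for the sketch
        print(f"{pair}: too close ({score:+.2f}), ask more workers")
    else:
        print(f"{pair}: duplicates = {score > 0} ({score:+.2f})")
```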

Debategraph

Filed under: Debate,Graphs — Patrick Durusau @ 2:52 pm

Debategraph

I am not real sure what to make of this so I thought I would ask you!

😉

The “details” report:

The objective with Debategraph is not so much an absolutism of rationality as a transparency of rationality; creating a means for people to collaboratively capture and display all of the arguments pertinent to a debate clearly and fairly so that all of the participants in the debate have the chance to see the debate as a whole and to understand how the positions they hold exist within that debate.

I wonder about this being an interface for authoring topic maps, perhaps used with a news reader? With links and nodes being auto-populated from pre-cooked sub-graphs?

Suggestions?

Google Chart Tools

Filed under: Graphics,Visualization — Patrick Durusau @ 2:19 pm

Google Chart Tools

From the introduction:

Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of well-designed chart types. Populating your data is easy using the provided client- and server-side tools.

A chart depends on the following building blocks:

  • Chart Library
  • Data Styles
  • Data Sources

More tools for exploring data.

Not to mention making that analysis available to others.

Read’em and Weep

Filed under: Government,Government Data,Intelligence — Patrick Durusau @ 2:14 pm

I read Progress Made and Challenges Remaining in Sharing Terrorism-Related Information today.

My summary: We are less than five years away from some unknown level of functioning for an Information Sharing Environment (ISE) that facilitates the sharing of terrorism-related information.

Less than 20 years after 9/11, we will have some capacity to share information that may enable the potential disruption of terrorist plots.

The patience of terrorists and their organizations is appreciated. (I added that part. The report doesn’t say that.)

The official summary.

A breakdown in information sharing was a major factor contributing to the failure to prevent the September 11, 2001, terrorist attacks. Since then, federal, state, and local governments have taken steps to improve sharing. This statement focuses on government efforts to (1) establish the Information Sharing Environment (ISE), a government-wide approach that facilitates the sharing of terrorism-related information; (2) support fusion centers, where states collaborate with federal agencies to improve sharing; (3) provide other support to state and local agencies to enhance sharing; and (4) strengthen use of the terrorist watchlist. GAO’s comments are based on products issued from September 2010 through July 2011 and selected updates in September 2011. For the updates, GAO reviewed reports on the status of Department of Homeland Security (DHS) efforts to support fusion centers, and interviewed DHS officials regarding these efforts. This statement also includes preliminary observations based on GAO’s ongoing watchlist work. For this work, GAO is analyzing the guidance used by agencies to nominate individuals to the watchlist and agency procedures for screening individuals against the list, and is interviewing relevant officials from law enforcement and intelligence agencies, among other things.

The government continues to make progress in sharing terrorism-related information among its many security partners, but does not yet have a fully-functioning ISE in place. In prior reports, GAO recommended that agencies take steps to develop an overall plan or roadmap to guide ISE implementation and establish measures to help gauge progress. These measures would help determine what information sharing capabilities have been accomplished and are left to develop, as well as what difference these capabilities have made to improve sharing and homeland security. Accomplishing these steps, as well as ensuring agencies have the necessary resources and leadership commitment, should help strengthen sharing and address issues GAO has identified that make information sharing a high-risk area.

Federal agencies are helping fusion centers build analytical and operational capabilities, but have more work to complete to help these centers sustain their operations and measure their homeland security value. For example, DHS has provided resources, including personnel and grant funding, to develop a national network of centers. However, centers are concerned about their ability to sustain and expand their operations over the long term, negatively impacting their ability to function as part of the network. Federal agencies have provided guidance to centers and plan to conduct annual assessments of centers’ capabilities and develop performance metrics by the end of 2011 to determine centers’ value to the ISE. DHS and the Department of Justice are providing technical assistance and training to help centers develop privacy and civil liberties policies and protections, but continuous assessment and monitoring policy implementation will be important to help ensure the policies provide effective protections.

In response to its mission to share information with state and local partners, DHS’s Office of Intelligence and Analysis (I&A) has taken steps to identify these partners’ information needs, develop related intelligence products, and obtain more feedback on its products. I&A also provides a number of services to its state and local partners that were generally well received by the state and local officials we contacted. However, I&A has not yet defined how it plans to meet its state and local mission by identifying and documenting the specific programs and activities that are most important for executing this mission. The office also has not developed performance measures that would allow I&A to demonstrate the expected outcomes and effectiveness of state and local programs and activities. In December 2010, GAO recommended that I&A address these issues, which could help it make resource decisions and provide accountability over its efforts.

GAO’s preliminary observations indicate that federal agencies have made progress in implementing corrective actions to address problems in watchlist-related processes that were exposed by the December 25, 2009, attempted airline bombing. These actions are intended to address problems in the way agencies share and use information to nominate individuals to the watchlist, and use the list to prevent persons of concern from boarding planes to the United States or entering the country, among other things. These actions can also have impacts on agency resources and the public, such as traveler delays and other inconvenience. GAO plans to report the results of this work later this year.
GAO is not making new recommendations, but has made recommendations in prior reports to federal agencies to enhance information sharing. The agencies generally agreed and are making progress, but full implementation of these recommendations is needed.

Full Report: Progress Made and Challenges Remaining in Sharing Terrorism-Related Information

Let me share with you the other GAO reports cited in this report:

Do you see semantic mapping opportunities in all those reports?

May 10, 2012

Learn Hadoop and get a paper published

Filed under: Common Crawl,Hadoop,MapReduce — Patrick Durusau @ 6:47 pm

Learn Hadoop and get a paper published by Allison Domicone.

From the post:

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable paper for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.

How very cool!

Hurry, there are nineteen (19) comments already!
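
If you want a feel for the programming model before you pitch a paper topic, here is a bare-bones word-count sketch for Hadoop Streaming (my own; the Common Crawl documentation has its own examples and input formats):

```python
# wordcount.py -- Hadoop pipes input through the mapper, sorts by key,
# then pipes the sorted stream through the reducer.
import sys

def mapper():
    # Emit "word <tab> 1" for every token on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Equal keys arrive adjacent, so a running total per word is enough.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Roughly speaking, you submit the same file through the Hadoop Streaming jar twice, once as the mapper (“python wordcount.py map”) and once as the reducer (“python wordcount.py reduce”).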

Many Eyes

Filed under: Graphics,Visualization — Patrick Durusau @ 6:06 pm

Many Eyes

I haven’t reproduced all the hyperlinks but if you go to Tour you will find:

The heart of the site is a collection of data visualizations. You may want to begin by browsing through these collections—if you’d rather explore than read directions, take a look!

On Many Eyes you can:

1. View and discuss visualizations
2. View and discuss data sets
3. Create visualizations from existing data sets

If you register, you can also:

4. Rate data sets and visualizations
5. Upload your own data
6. Create and participate in topic centers
7. Select items to watch
8. Track your contributions, watchlist, and topic centers
9. See comments that others have written to you

From the website:

An experiment brought to you by IBM Research and the IBM Cognos software group.

Another step closer to data analysis being limited only by your imagination and not access to data or tools.

Well worth an extended visit.

Visual Complexity

Filed under: Complex Networks,Graphics,Social Networks,Visualization — Patrick Durusau @ 5:54 pm

Visual Complexity

Described by Manuel Lima (its creator) as:

VisualComplexity.com intends to be a unified resource space for anyone interested in the visualization of complex networks. The project’s main goal is to leverage a critical understanding of different visualization methods, across a series of disciplines, as diverse as Biology, Social Networks or the World Wide Web. I truly hope this space can inspire, motivate and enlighten any person doing research on this field.

Not all projects shown here are genuine complex networks, in the sense that they aren’t necessarily at the edge of chaos, or show an irregular and systematic degree of connectivity. However, the projects that apparently skip this class were chosen for two important reasons. They either provide advancement in terms of visual depiction techniques/methods or show conceptual uniqueness and originality in the choice of a subject. Nevertheless, all projects have one trait in common: the whole is always more than the sum of its parts.

The homepage is simply stunning.

BTW, Manuel is also the author of: Visual Complexity: Mapping Patterns of Information.

EveryBlock

Filed under: Social Media,Social Networks — Patrick Durusau @ 5:43 pm

EveryBlock

I remember my childhood neighborhood just before the advent of air conditioning and the omnipresence of TV. A walk down the block gave you a good idea of what your neighbors were up to. Or not. 😉

Comparing then to now, the neighborhood where I now live, is strangely silent. Walk down my block and you hear no TVs, conversations, radios, loud discussions or the like.

We have become increasingly isolated from others by our means of transportation, entertainment and climate control.

EveryBlock offers the promise of restoring some of the random contact with our neighbors to our lives.

EveryBlock says it solves two problems:

First, there’s no good place to keep track of everything happening in your neighborhood, from news coverage to events to photography. We try to collect all of the news and civic goings-on that have happened recently in your city, and make it simple for you to keep track of news in particular areas.

Second, there’s no good way to post messages to your neighbors online. Facebook lets you post messages to your friends, Twitter lets you post messages to your followers, but no well-used service lets you post a message to people in a given neighborhood.

EveryBlock addresses the problem of geographic blocks, but how do you get information on your professional block?

Do you hear anything unexpected or different? Or do you hear the customary and expected?

Maybe your professional block has gotten too silent.

Suggestions for how to change that?

Simple federated queries with RDF [Part 1]

Filed under: Federation,RDF,SPARQL — Patrick Durusau @ 4:12 pm

Simple federated queries with RDF [Part 1]

Bob DuCharme writes:

A few more triples to identify some relationships, and you’re all set.

[side note] Easy aggregation without conversion is where semantic web technology shines the brightest.

Once, at an XML Summer School session, I was giving a talk about semantic web technology to a group that included several presenters from other sessions. This included Henry Thompson, who I’ve known since the SGML days. He was still a bit skeptical about RDF, and said that RDF was in the same situation as XML—that if he and I stored similar information using different vocabularies, we’d still have to convert his to use the same vocabulary as mine or vice versa before we could use our data together. I told him he was wrong—that easy aggregation without conversion is where semantic web technology shines the brightest.

I’ve finally put together an example. Let’s say that I want to query across his address book and my address book together for the first name, last name, and email address of anyone whose email address ends with “.org”. Imagine that his address book uses the vCard vocabulary and the Turtle syntax and looks like this,

Bob is an expert in more areas (markup, SGML/XML, SPARQL, and others) than I can easily count. Not to mention being a good friend.

Take a look at Bob’s post and decide for yourself how “simple” the federation is following Bob’s technique.

I am just going to let it speak for itself today.

I will outline obvious and some not so obvious steps in Bob’s “simple” federated queries in Part II.
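
For readers who want to poke at the general pattern before Part II appears, here is a simplified sketch with rdflib (invented mini-vocabularies and data, not Bob’s actual files or endpoints): two small graphs that use different predicates are loaded together and queried without converting either one.

```python
# Query two address books that use different vocabularies in one go.
from rdflib import Graph

his = """
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
<urn:person:1> vcard:given-name "Ada" ; vcard:family-name "Lovelace" ;
               vcard:email "ada@example.org" .
"""
mine = """
@prefix ab: <http://example.com/addressbook#> .
<urn:person:2> ab:firstName "Grace" ; ab:lastName "Hopper" ;
               ab:email "grace@example.com" .
"""

g = Graph()
g.parse(data=his, format="turtle")
g.parse(data=mine, format="turtle")

query = """
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX ab:    <http://example.com/addressbook#>
SELECT ?first ?last ?email WHERE {
  { ?p vcard:given-name ?first ; vcard:family-name ?last ; vcard:email ?email . }
  UNION
  { ?p ab:firstName ?first ; ab:lastName ?last ; ab:email ?email . }
  FILTER(STRENDS(STR(?email), ".org"))
}
"""
for first, last, email in g.query(query):
    print(first, last, email)        # -> Ada Lovelace ada@example.org
```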

Workshops Semantic knowledge solutions

Filed under: Kamala,Ontopia,Ontopoly,Topic Map Software — Patrick Durusau @ 3:28 pm

Workshops Semantic knowledge solutions by Fiemke Griffioen.

From the post:

Morpheus is organizing a number of one-day “Semantic knowledge solutions” workshops about how knowledge applications can be developed within your organization. We show the advantages of gaining insight into your knowledge and of sharing knowledge.

In the workshops our Kamala webapplication is used to model knowledge. Kamala is a web application for efficiently developing and sharing semantic knowledge and is based on the open source Topic Maps-engine Ontopia. Kamala is similar to the editor of Ontopia, Ontopoly, but more interactive and flexible because users require less knowledge of the Topic Maps data model in advance.

Since I haven’t covered Kamala before:

Kamala includes the following features:

  • Availability of the complete data model of Topic Maps standard
  • Navigation based on ontological structures
  • Search topics based on naming
  • Sharing topic maps with other users (optionally read-only)
  • Importing and exporting topic maps to the standard formats XTM, TMXML, LTM, CXTM, etc.
  • Querying topic maps with the TOLOG or TMQL query languages
  • Storing queries for simple repetition of the query
  • Validation of topic maps, so that ‘gaps’ in the knowledge model can be traced
  • Generating statistics

The following modules are available to expand Kamala’s core functionality:

  • Geo-module, so topics with a geotag can be placed on a map
  • Facet indexation for effective navigation based on classification

The workshops are held at Landgoed Maarsbergen. (That’s what I said, so I included the contact link, which has a map.)

CIA/NSA Diff Utility?

Filed under: Intelligence,Marketing,Topic Maps — Patrick Durusau @ 2:40 pm

How much of the data sold to the CIA/NSA is from public resources?

Of the sort you find at Knoema?

Some of it isn’t easy to find, but it is public data.

A topic map of public data resources would be a good CIA/NSA Diff Utility so they could avoid paying for data that is freely available on the WWW.

I suppose the fall back position of suppliers would be their “value add.”

With public data sets, the CIA/NSA could put that “value add” to the test. Along the lines of the Netflix competition.

Even if the results weren’t the goal, it would be a good way to discover new techniques and/or analysts.

How would you “diff” public data from that being supplied by a contractor?
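
A back-of-the-envelope sketch of one answer (invented records and field names): normalize both feeds to comparable keys and keep whatever the contractor adds beyond the public set.

```python
# Diff a contractor feed against a public source by normalized record keys.
def normalize(record):
    """Reduce a record to a comparable key (lower-cased, trimmed, ordered)."""
    return tuple(sorted((k, str(v).strip().lower()) for k, v in record.items()))

public = [{"country": "Ruritania", "gdp_2011": "12.3"}]
contractor = [
    {"country": "Ruritania", "gdp_2011": "12.3"},          # already public
    {"country": "Ruritania", "arms_imports_2011": "0.7"},  # the "value add"
]

public_keys = {normalize(r) for r in public}
value_add = [r for r in contractor if normalize(r) not in public_keys]
print(value_add)
```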

Trawling the web for socioeconomic data? Look no further than Knoema

Filed under: Data Source,News,Socioeconomic Data — Patrick Durusau @ 2:21 pm

Trawling the web for socioeconomic data? Look no further than Knoema

From the Guardian Datablog, John Burn-Murdoch writes:

A joint venture by Russian and Indian technology professionals aims to be the YouTube of data. Knoema, which launched last month and is marketed by its creators as “your personal knowledge highway”, combines data-gathering with presentation to create an online bank of socioeconomic and environmental data-sets.

The website’s homepage shows a selection of the topics on which Knoema has collected data. Among the categories are broad fields such as commodities and energy, but also more specialised collections including sexual exploitation and biofuels.

[graphics omitted]

Within each subject-area you can find one or more ‘dashboards’ – simple yet comprehensive presentations of data for a given topic, with all source-material documented. Knoema also provides choropleth maps for many of the datasets where figures are given for geographical areas.

‘Commodity passports’ are another format in which Knoema offers some of its data. These give a detailed breakdown of production, consumption, imports, exports and market prices for a diverse range of products and materials including apples, cotton and natural gas.

Resource listings follow the site review, including the Guardian’s world government data gateway and other resources.

CNN Transcript Collection (2000-2012)

Filed under: Data Source,News — Patrick Durusau @ 2:03 pm

CNN Transcript Collection (2000-2012)

From the webpage:

For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

Suggested transcript sources for other broadcast media?

Seen at Nathan Yau’s Flowing Data.

Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

Filed under: Data Mining,R — Patrick Durusau @ 2:03 pm

Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

Yanchang Zhao, RDataMining.com, writes:

My book in draft titled “R and Data Mining: Examples and Case Studies” is now available on CRAN at http://cran.r-project.org/other-docs.html. It is scheduled to be published by Elsevier in late 2012. Its latest version can be downloaded at http://www.rdatamining.com/docs.

The book presents many examples on data mining with R, including data exploration, decision trees, random forest, regression, clustering, time series analysis & mining, and text mining. Some other chapters in progress are social network analysis, outlier detection, association rules and sequential patterns.

Not to complain too much (what is present is good, and an author can choose his/her distribution model), but the above entry should be modified to add:

Chapters 7, 9, 12 – 15 are blank and reserved for the book version to be published by Elsevier Inc.

Just so you know.

Cambridge and other presses have chosen to follow other access models. Perhaps someday Elsevier will as well.

May 9, 2012

Ask a baboon

Filed under: Humor,Language — Patrick Durusau @ 4:04 pm

Ask a baboon

A post by Mark Liberman that begins with this quote:

Sindya N. Bhanoo, “Real Words or Gibberish? Just Ask a Baboon“, NYT 4/16/2012:

While baboons can’t read, they can tell the difference between real English words and nonsensical ones, a new study reports.

“They are using information about letters and the relation between letters to perform the task without any kind of linguistic training,” said Jonathan Grainger, a psychologist at the French Center for National Research and at Aix-Marseille University in France who was the study’s first author.

Mark finds a number of sad facts, some in the coverage of the story and the others in the story itself.

His analysis of the coverage and the story proper are quite delightful.

Enjoy.

Big Data, R and SAP HANA: Analyze 200 Million Data Points…

Filed under: BigData,R,SAP HANA — Patrick Durusau @ 3:20 pm

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 – Part III by Jitender Aswani.

I have been collecting the pointers to this series of posts for some time now.

It is a good series on analysis/visualization, with lessons that you can transfer to other data sets.

Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related Delays

For this exercise, I combined the following four separate blogs that I did on BigData, R and SAP HANA. Historical airlines and weather data were used for the underlying analysis. The aggregated output of this analysis was output in JSON, which was visualized in HTML5, D3 and Google Maps. The previous blogs in this series are:

  1. Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 – Part II
  2. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps
  3. Getting Historical Weather Data in R and SAP HANA 
  4. Tracking SFO Airport’s Performance Using R, HANA and D3

In this blog, I wanted to mash up disparate data sources in R and HANA by combining airlines data with weather data to understand the reasons behind the airport/airline delays. Why weather – because weather is one of the reasons commonly cited in the airline industry for flight delays. Fortunately, the airlines data breaks up the delay by weather, security, late aircraft etc., so weather-related delays can be isolated and then the actual weather data can be mashed up to validate the airlines’ claims. However, I will not be doing this here; I will just be displaying the mashed-up data.

I have intentionally focused on the three bay-area airports and have used the last 4 years of historical data to visualize the airports’ performance using an HTML5 calendar built from scratch using D3.js. One can use all 20 years of data and all the airports to extend this example. I had downloaded historical weather data for the same 2005-2008 period for SFO and SJC airports as shown in my previous blog (for some strange reason, there is no weather data for OAK, huh?). Here is how the final result will look in HTML5:
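
You will have to visit Jitender’s post for the calendar itself. As a rough sketch of the mash-up step (my own toy tables, not his R/HANA code), the join on date and airport looks like this in pandas:

```python
# Join hypothetical airline-delay and weather tables on date and airport,
# then look at the delays the airlines attribute to weather.
import pandas as pd

delays = pd.DataFrame({
    "date":    ["2008-01-01", "2008-01-01", "2008-01-02"],
    "airport": ["SFO", "SJC", "SFO"],
    "weather_delay_min": [55, 0, 10],
})
weather = pd.DataFrame({
    "date":    ["2008-01-01", "2008-01-01", "2008-01-02"],
    "airport": ["SFO", "SJC", "SFO"],
    "precip_in": [1.2, 0.0, 0.1],
})

mashup = delays.merge(weather, on=["date", "airport"], how="left")
print(mashup[mashup["weather_delay_min"] > 0])
```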

Converged Cloud Growth…[Ally or Fan Fears on Interoperability]

Filed under: Cloud Computing,Interoperability — Patrick Durusau @ 2:58 pm

Demand For Standards—Interoperability To Fuel Converged Cloud Growth

Terminology is often a confused mess in stable CS areas, to say nothing of rapidly developing ones such as cloud computing.

Add to that all the marketing hype that creates even more confusion.

Thinking there should be opportunities for standardizing terminology and mappings to vendor terminology in the process.

Topic maps would be a natural for the task.

Interested?

Data.gov launches developer community

Filed under: Dataset,Government Data — Patrick Durusau @ 2:15 pm

Data.gov launches developer community

Federal Computer Week reports:

Data.gov has launched a new community for software developers to share ideas, collaborate or compete on projects and request new datasets.

Developer.data.gov joins a growing list of communities and portals tapping into Data.gov’s datasets, including those for health, energy, education, law, oceans and the Semantic Web.

The developer site is set up to offer access to federal agency datasets, source code, applications and ongoing developer challenges, along with blogs and forums where developers can discuss projects and share ideas.

Source: FCW (http://s.tt/1azwt)

Depending upon your developer skills, this could be a good place to hone them.

Not to mention having a wealth of free data sets at hand.

GATE Teamware: Collaborative Annotation Factories (HOT!)

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but… I have been a sysadmin before. Maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but it certainly is worth a “look-see.”

Let me know if you take the plunge!

